High-quality data is the fuel that powers AI algorithms. Without a continual flow of labeled data, bottlenecks can occur, and the algorithm will slowly degrade, adding risk to the system.
It’s why labeled data is so critical for companies like Zoox, Cruise and Waymo, which use it to train machine learning models to develop and deploy autonomous vehicles. That need is what led to the creation of Scale AI, a startup that uses software and people to process and label image, lidar and map data for companies building machine learning algorithms. Companies working on autonomous vehicle technology make up a large swath of Scale’s customer base, although its platform is also used by Airbnb, Pinterest and OpenAI, among others.
The COVID-19 pandemic has slowed, or even halted, that flow of data as AV companies suspended testing on public roads — the means of collecting billions of images. Scale is hoping to turn the tap back on, and for free.
The company, in collaboration with lidar manufacturer Hesai, this week launched an open-source data set called PandaSet that can be used for training machine learning models for autonomous driving. The data set, which is free and licensed for academic and commercial use, includes data collected using Hesai’s forward-facing PandarGT lidar with image-like resolution, as well as its mechanical spinning lidar known as Pandar64. The data was collected while driving in urban areas in San Francisco and Silicon Valley before officials issued stay-at-home orders in the area, according to the company.
“AI and machine learning are incredible technologies with an incredible potential for impact, but also a huge pain in the ass,” Scale CEO and co-founder Alexandr Wang told TechCrunch in a recent interview. “Machine learning is definitely a garbage in, garbage out kind of framework — you really need high-quality data to be able to power these algorithms. It’s why we built Scale and it’s also why we’re using this data set today to help drive forward the industry with an open-source perspective.”
The goal with this lidar data set was to give free access to a dense and content-rich data set, which Wang said was achieved by using two kinds of lidars in complex urban environments filled with cars, bikes, traffic lights and pedestrians.
“The Zoox and the Cruises of the world will often talk about how battle-tested their systems are in these dense urban environments,” Wang said. “We wanted to really expose that to the whole community.”
The data set includes more than 48,000 camera images and 16,000 lidar sweeps across more than 100 scenes of 8 seconds each, according to the company. It also includes 28 annotation classes for each scene and 37 semantic segmentation labels for most scenes. Traditional cuboid labeling, those little boxes placed around a bike or car, for instance, can’t adequately identify all of the lidar data. So, Scale uses a point cloud segmentation tool to precisely annotate complex objects like rain.
Open sourcing AV data isn’t entirely new. Last year, Aptiv and Scale released nuScenes, a large-scale data set from an autonomous vehicle sensor suite. Argo AI, Cruise and Waymo are among a number of AV companies that have also released data to researchers. Argo AI released curated data along with high-definition maps, while Cruise shared a data visualization tool it created called Webviz that takes raw data collected from all the sensors on a robot and turns that binary code into visuals.
Scale’s efforts are a bit different; for instance, Wang said the license to use this data set doesn’t have any restrictions.
“There’s a big need right now and a continual need for high-quality labeled data,” Wang said. “That’s one of the biggest hurdles to overcome when building self-driving systems. We want to democratize access to this data, especially at a time when a lot of the self-driving companies can’t collect it.”
That doesn’t mean Scale is going to suddenly give away all of its data. It is, after all, a for-profit enterprise. But it’s already considering collecting and open sourcing fresher data later this year.