Why AI needs a steady diet of synthetic data

November 22, 2022 12:00 PM

A sample of Parallel Domain’s synthetic data showing a map view of its virtual world capabilities.

Image Credit: Parallel Domain

Artificial intelligence (AI) may be eating the world as we know it, but experts say AI itself is also starving — and needs to change its diet. One company says synthetic data is the answer.

“Data is food for AI, but AI today is underfed and malnourished,” said Kevin McNamara, CEO and cofounder of synthetic data platform provider Parallel Domain, which just raised $30 million in a Series B round led by March Capital. “That’s why things are growing slowly. But if we can feed that AI better, models will grow faster and in a healthier way. Synthetic data is like nourishment for training AI.”

Research has shown that about 90% of AI and machine learning (ML) deployments fail. A Datagen report from earlier this year pointed out that much of that failure is due to a lack of training data. It found that 99% of computer vision professionals say they have had an ML project axed specifically because of the lack of data to see it through. Even projects that aren’t fully canceled for lack of data experience significant delays that knock them off track, according to 100% of respondents.

In that vein, Gartner predicts synthetic data will increasingly be used as a supplement for AI and ML training purposes. The research giant projects that by 2024 synthetic data will be used to accelerate 60% of AI projects.

Synthetic data is generated by machine learning algorithms that ingest real data, learn its behavioral patterns and create simulated data that retains the statistical properties of the original dataset. The resulting data replicates real-world circumstances but, unlike standard anonymized datasets, is not vulnerable to the same flaws as real data.
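
To make the idea concrete, here is a deliberately tiny sketch of the "learn the statistics, then sample" pattern described above. It is not any vendor's method: real generators use far richer models, while this one simply assumes the data follows a normal distribution. The walking-speed values and the function names are made up for illustration.

```python
import random
import statistics


def fit_gaussian(real_values):
    """Estimate the mean and sample standard deviation of the real data."""
    return statistics.mean(real_values), statistics.stdev(real_values)


def sample_synthetic(mean, stdev, n, seed=0):
    """Draw n synthetic values from the fitted normal distribution."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [rng.gauss(mean, stdev) for _ in range(n)]


# Hypothetical "real" measurements: pedestrian walking speeds in m/s.
real_speeds = [1.2, 1.4, 1.3, 1.5, 1.1, 1.6, 1.4, 1.3, 1.2, 1.5]

mean, stdev = fit_gaussian(real_speeds)
synthetic_speeds = sample_synthetic(mean, stdev, n=10_000)

# The synthetic sample is much larger than the original, yet its summary
# statistics stay close to those of the real data.
print(f"real:      mean={mean:.3f} stdev={stdev:.3f}")
print(f"synthetic: mean={statistics.mean(synthetic_speeds):.3f} "
      f"stdev={statistics.stdev(synthetic_speeds):.3f}")
```

Ten real measurements become 10,000 synthetic ones with roughly the same mean and spread, and none of the synthetic values is a copy of an original record.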

Pulling AI out of the ‘Stone Age’

It may sound unusual to hear that a technology as advanced as AI is stuck in a “Stone Age” of sorts, but that’s what McNamara sees — and without adoption of synthetic data, it will stay that way, he says.

“Right now AI development is kind of the way computer programming was in the ‘60s or ‘70s when people used punch card programming — a manual, labor-intensive process,” he said. “Well, the world eventually moved away from this and to digital programming. We want to do that for AI development.”

The three biggest bottlenecks keeping AI in the Stone Age are the following, according to McNamara:

Collecting real-world data, which is not always feasible. Even for something like jaywalking, which happens fairly often in cities around the world, the millions of examples needed to train an algorithm quickly become more than a company can go out and collect from the real world.
Labeling, which often requires thousands of hours of human time and can be inaccurate because, well, humans make errors.
Iterating on the data once it is labeled, which requires adjusting sensor configurations and other settings, then applying those changes before you can actually begin to train your AI.

“That whole process is so slow,” McNamara said. “If you can change those things really fast, you can actually discover better setups and better ways to develop your AI in the first place.”

Enter stage right: Synthetic data

Parallel Domain works by generating virtual worlds based on maps. It dubs these worlds “digital cousins” of real-world scenarios and geographies. They can be altered and manipulated to, for instance, have more jaywalking or rain, to aid with training autonomous vehicles.

Because the worlds are digital cousins and not digital twins, they can be customized to simulate data that is essential for training but sometimes hard to obtain, data that companies would normally have to go out and collect themselves. The platform allows users to tailor it to their needs via an API, so they can move or manipulate factors precisely the way they want. This accelerates the AI training process and removes roadblocks of time and labor.

The company claims that in a matter of hours it can provide training datasets that are ready for its customers to use — customers that include the Toyota Research Institute, Google, Continental and Woven Planet.

“Customers can go into the simulated world and make things happen or pull data from that world,” McNamara said. “We have knobs for different kinds of categories of assets and scenarios that could happen, as well as ways for customers to plug in their own logic for what they see, where they see it and how those things behave.”

Then, customers need a way to pull data from that world into the configuration that matches their setup, he explained.

“Our sensor configuration tools and label configuration tools allow us to replicate the exact camera setup or the exact lidar and radar and labeling setup that a customer would see,” he said.
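
What might such a request look like in practice? The sketch below is purely hypothetical: it is not Parallel Domain's API, and every class, field and value in it (`ScenarioConfig`, `SensorConfig`, `rain_intensity`, `jaywalker_rate`, `downtown_example`) is invented here to illustrate the general idea of describing scenario "knobs" and a sensor rig as data.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SensorConfig:
    """One sensor on the (hypothetical) vehicle rig."""
    kind: str                 # e.g. "camera", "lidar" or "radar"
    mount_height_m: float     # height above the ground, in meters
    field_of_view_deg: float  # horizontal field of view, in degrees


@dataclass
class ScenarioConfig:
    """Hypothetical knobs for one simulated world."""
    map_name: str
    rain_intensity: float = 0.0  # 0.0 (dry) to 1.0 (heavy rain)
    jaywalker_rate: float = 0.0  # fraction of pedestrians who jaywalk
    sensors: list = field(default_factory=list)
    labels: list = field(default_factory=list)

    def validate(self):
        """Reject knob values outside the 0-to-1 range, then return self."""
        for name in ("rain_intensity", "jaywalker_rate"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1, got {value}")
        return self


scenario = ScenarioConfig(
    map_name="downtown_example",  # made-up map name
    rain_intensity=0.7,
    jaywalker_rate=0.3,
    sensors=[
        SensorConfig("camera", 1.5, 90.0),
        SensorConfig("lidar", 1.9, 360.0),
    ],
    labels=["pedestrian", "vehicle"],
).validate()

# Serialize the configuration, as one might before sending it to a service.
request_body = json.dumps(asdict(scenario), indent=2)
print(request_body)
```

Turning up rain or jaywalking then means changing a number and regenerating the dataset, which is the kind of fast iteration McNamara describes.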

Synthetic data, generative AI

Not only is synthetic data useful for AI and ML model training, it can be applied to make generative AI — an already rapidly growing use of the technology — develop even faster.

Parallel Domain is eyeing the field as the company enters 2023 with fresh capital. It hopes to multiply the data that generative AI needs to train, so it can become an even more powerful tool for content creation. Its R&D team is focusing on the variety and detail in the synthetic data simulations it can provide.

“I’m excited about generative AI in our space,” McNamara said. “We’re not here to create an artistic interpretation of the world. We’re here to actually create a digital cousin of the world. I think generative AI is really powerful in looking at examples of images from around the world, then pulling those in and creating interesting examples and novel information inside of synthetic data. Because of that, generative AI will be a large part of the technology advancements that we’re investing in in the coming year.”

The value of synthetic data isn’t limited to AI. Given the vast amount of data needed to create realistic virtual environments, synthetic data is also the only practical approach to moving the metaverse forward.

Parallel Domain is part of the fast-growing synthetic data startup sector, which Crunchbase previously reported is seeing a wave of funding. Datagen, Gretel AI and Mostly AI are among the competitors that have also raised millions in the last year.
