Is ‘fake data’ the real deal when training algorithms?


One of Synthesis AI’s digital avatars at the wheel. Photograph: Synthesis AI

The use of synthetic data is a cost‑effective way to teach AI about human responses. But can it help eliminate bias and make self‑driving cars safer?

Sat 18 Jun 2022 14.00 BST

You’re at the wheel of your car but you’re exhausted. Your shoulders start to sag, your neck begins to droop, your eyelids slide down. As your head pitches forward, you swerve off the road and speed through a field, crashing into a tree.

But what if your car’s monitoring system recognised the tell-tale signs of drowsiness and prompted you to pull off the road and park instead? The European Commission has legislated that from this year, new vehicles be fitted with systems to catch distracted and sleepy drivers to help avert accidents. Now a number of startups are training artificial intelligence systems to recognise the giveaways in our facial expressions and body language.

These companies are taking a novel approach for the field of AI. Instead of filming thousands of real-life drivers falling asleep and feeding that information into a deep-learning model to “learn” the signs of drowsiness, they’re creating millions of fake human avatars to re-enact the sleepy signals.

“Big data” defines the field of AI for a reason. To train deep learning algorithms accurately, the models need to have a multitude of data points. That creates problems for a task such as recognising a person falling asleep at the wheel, which would be difficult and time-consuming to film happening in thousands of cars. Instead, companies have begun building virtual datasets.

Synthesis AI and Datagen are two companies using full-body 3D scans, including detailed face scans, and motion data captured by sensors placed all over the body, to gather raw data from real people. This data is fed through algorithms that tweak various dimensions many times over to create millions of 3D representations of humans, resembling characters in a video game, engaging in different behaviours across a variety of simulations.

In the case of someone falling asleep at the wheel, they might film a human performer falling asleep and combine it with motion capture, 3D animations and other techniques used to create video games and animated movies, to build the desired simulation. “You can map [the target behaviour] across thousands of different body types, different angles, different lighting, and add variability into the movement as well,” says Yashar Behzadi, CEO of Synthesis AI.
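As a rough illustration of the idea Behzadi describes, the sketch below fans one captured behaviour out across combinations of body type, camera angle and lighting, with a little random variation in movement speed. It is not Synthesis AI's pipeline, which is proprietary; the parameter names, their values and the `generate_variants` helper are all hypothetical, and a real system would render 3D scenes rather than emit dictionaries.

```python
import itertools
import random

# Hypothetical parameter values, chosen purely for illustration.
BODY_TYPES = ["slim", "average", "broad"]
CAMERA_ANGLES_DEG = [-30, 0, 30]
LIGHTING = ["day", "dusk", "night"]


def generate_variants(behaviour, jitter=0.1, seed=0):
    """Describe one synthetic clip per combination of the parameters above.

    Each description carries its own label, because the behaviour being
    simulated is chosen up front rather than annotated afterwards.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    variants = []
    for body, angle, light in itertools.product(
        BODY_TYPES, CAMERA_ANGLES_DEG, LIGHTING
    ):
        variants.append(
            {
                "label": behaviour,
                "body_type": body,
                "camera_angle_deg": angle,
                "lighting": light,
                # Small random change in movement speed to add variability.
                "speed_scale": 1.0 + rng.uniform(-jitter, jitter),
            }
        )
    return variants


variants = generate_variants("drowsy_head_nod")
print(len(variants))  # 27: three body types x three angles x three lighting setups
```

Adding a fourth parameter, or more values per parameter, multiplies the number of variants, which is how a small amount of captured performance could plausibly grow into a very large dataset.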

Using synthetic data cuts out a lot of the messiness of the more traditional way to train deep learning algorithms. Typically, companies would have to amass a vast collection of real-life footage and low-paid workers would painstakingly label each of the clips. These would be fed into the model, which would learn how to recognise the behaviours.

The big sell for the synthetic data approach is that it’s quicker and cheaper by a wide margin. But these companies also claim it can help tackle the bias that creates a huge headache for AI developers. It’s well documented that some AI facial recognition software is poor at recognising and correctly identifying particular demographic groups. This tends to be because these groups are underrepresented in the training data, meaning the software is more likely to misidentify these people.

Niharika Jain, a software engineer and expert in gender and racial bias in generative machine learning, highlights the notorious example of Nikon Coolpix’s “blink detection” feature, which, because the training data included a majority of white faces, disproportionately judged Asian faces to be blinking. “A good driver-monitoring system must avoid misidentifying members of a certain demographic as asleep more often than others,” she says.

The typical response to this problem is to gather more data from the underrepresented groups in real-life settings. But companies such as Datagen say this is no longer necessary. The company can simply create more faces from the underrepresented groups, meaning they’ll make up a bigger proportion of the final dataset. Real 3D face scan data from thousands of people is whipped up into millions of AI composites. “There’s no bias baked into the data; you have full control of the age, gender and ethnicity of the people that you’re generating,” says Gil Elbaz, co-founder of Datagen. The creepy faces that emerge don’t look like real people, but the company claims that they’re similar enough to teach AI systems how to respond to real people in similar scenarios.
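The rebalancing that Datagen describes comes down to simple arithmetic, sketched here under stated assumptions. The group names and counts are invented for illustration, `synthetic_needed` is a hypothetical helper rather than any company's tool, and equal group sizes are only one possible target.

```python
def synthetic_needed(real_counts):
    """Return how many synthetic samples each group needs to match the largest group."""
    if not real_counts:
        return {}
    target = max(real_counts.values())
    return {group: target - count for group, count in real_counts.items()}


# Invented counts of real samples per demographic group (illustration only).
real_counts = {"group_a": 8000, "group_b": 1500, "group_c": 500}

extra = synthetic_needed(real_counts)
print(extra)  # {'group_a': 0, 'group_b': 6500, 'group_c': 7500}

# After topping up, every group makes up an equal share of the dataset.
balanced = {group: real_counts[group] + extra[group] for group in real_counts}
print(balanced)  # {'group_a': 8000, 'group_b': 8000, 'group_c': 8000}
```

Equal counts only fix representation in the dataset. Whether the generated faces are realistic enough to improve performance on real people is a separate question.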

There is, however, some debate over whether synthetic data can really eliminate bias. Bernease Herman, a data scientist at the University of Washington eScience Institute, says that although synthetic data can improve the robustness of facial recognition models on underrepresented groups, she does not believe that synthetic data alone can close the gap between the performance on those groups and others. Although the companies sometimes publish academic papers showcasing how their algorithms work, the algorithms themselves are proprietary, so researchers cannot independently evaluate them.

In areas such as virtual reality, as well as robotics, where 3D mapping is important, synthetic data companies argue it could actually be preferable to train AI on simulations, especially as 3D modelling, visual effects and gaming technologies improve. “It’s only a matter of time until… you can create these virtual worlds and train your systems completely in a simulation,” says Behzadi.

This kind of thinking is gaining ground in the autonomous vehicle industry, where synthetic data is becoming instrumental in teaching self-driving vehicles’ AI how to navigate the road. The traditional approach – filming hours of driving footage and feeding this into a deep learning model – was enough to get cars relatively good at navigating roads. But the issue vexing the industry is how to get cars to reliably handle what are known as “edge cases” – events that are rare enough that they don’t appear much in millions of hours of training data. Examples include a child or dog running into the road, complicated roadworks, or even traffic cones placed in an unexpected position, which was enough to stump a driverless Waymo vehicle in Arizona in 2021.

Synthetic faces made by Datagen.

With synthetic data, companies can create endless variations of scenarios in virtual worlds that rarely happen in the real world. “Instead of waiting millions more miles to accumulate more examples, they can artificially generate as many examples as they need of the edge case for training and testing,” says Phil Koopman, associate professor in electrical and computer engineering at Carnegie Mellon University.

Autonomous vehicle companies such as Waymo, Cruise and Wayve are increasingly relying on real-life data combined with simulated driving in virtual worlds. Waymo has created a simulated world using AI and sensor data collected from its self-driving vehicles, complete with artificial raindrops and solar glare. It uses this to train vehicles on normal driving situations, as well as the trickier edge cases. In 2021, Waymo told the Verge that it had simulated 15bn miles of driving, versus a mere 20m miles of real driving, or roughly 750 simulated miles for every real one.

An added benefit to testing autonomous vehicles out in virtual worlds first is minimising the chance of very real accidents. “A large reason self-driving is at the forefront of a lot of the synthetic data stuff is fault tolerance,” says Herman. “A self-driving car making a mistake 1% of the time, or even 0.01% of the time, is probably too much.”

In 2017, Volvo’s self-driving technology, which had been taught how to respond to large North American animals such as deer, was baffled when encountering kangaroos for the first time in Australia. “If a simulator doesn’t know about kangaroos, no amount of simulation will create one until it is seen in testing and designers figure out how to add it,” says Koopman.

For Aaron Roth, professor of computer and cognitive science at the University of Pennsylvania, the challenge will be to create synthetic data that is indistinguishable from real data. He thinks it is plausible that we’re at that point for face data, as computers can now generate photorealistic images of faces. “But for a lot of other things,” – which may or may not include kangaroos – “I don’t think that we’re there yet.”