Artificial Intelligence Is Now Shockingly Good At Sounding Human

Synthetic voices have become ubiquitous. They feed us directions in the morning, shepherd us through phone calls by day and broadcast the news on smart speakers at night. And as the technology used to make them improves, these voices are becoming more and more human-sounding. This is the final frontier in synthetic speech: replicating not just what we say but how we say it.

Rupal Patel heads a research group at Northeastern University that studies speech prosody—the changes in pitch, loudness and duration that we use to convey intent and emotion through voice. “Sometimes people think of it as the icing on the cake,” she explains. “You have the message, and now it’s how you modulate that message, but I really think it’s the scaffolding that gives meaning to the message itself.”

Patel says she grew interested in prosody after finding it was the only element of vocal communication that seemed to be available to people with some kinds of severe speech disorders. These patients were able to make expressive sounds even if they could not speak clearly. In 2014 Patel founded VocaliD, a company that builds custom synthetic voices for nonspeaking individuals. It has since expanded to serve commercial brands and influencers.

Synthetic speech has come a long way over the years. At age nine, Siri is the oldest virtual assistant—but in the world of speaking machines, she’s a baby. People have been trying to synthesize speech since at least the 18th century, when an Austro-Hungarian inventor built a crude replica of the human vocal tract that could articulate entire phrases (albeit in a monotone).

Current machine-learning techniques can model human speech, complete with awkward pauses and lip smacks. Still, training on raw audio, which contains thousands of samples per second, is prohibitively expensive for most real-world systems. Researchers, including those at VocaliD, are continually implementing newer and more efficient methods.

But even as the remaining gaps between human and synthetic speech steadily close, truly lifelike prosody continues to elude the most sophisticated systems. Maybe what’s still missing requires machines not only to mimic humans but also to feel like us.