Microsoft’s VALL-E AI Can Simulate Your Voice in 3 Seconds

An AI text-to-speech model called VALL-E can copy a speaker’s tone and acoustic environment and have them say anything, starting with only a 3-second sample.

Microsoft discusses VALL-E on its GitHub demo site.

Its creators at Microsoft suggest that VALL-E could be used for high-quality text-to-speech (TTS) applications, speech editing where a recording of a person could be edited and changed from a text transcript (making them say something they originally didn’t), and audio content creation when combined with other generative AI models like GPT-3.

Microsoft calls VALL-E a “neural codec language model,” and it builds on a technology called EnCodec, which Meta announced in October 2022. Most text-to-speech methods synthesize speech by manipulating waveforms. VALL-E instead generates discrete audio codec codes from text and acoustic prompts. It basically analyzes how a person sounds, breaks that information into discrete components, then compares that to its “training” samples. Then it uses the training samples to generate new speech.
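To make the idea of “discrete components” concrete, here is a toy sketch in Python. It is not VALL-E or EnCodec code. It assumes a tiny hand-written codebook (the `CODEBOOK` values, the two-sample frame size, and all function names are hypothetical) and simply maps each short frame of audio samples to the index of its nearest codebook entry. A real neural codec learns its codebooks and works on far richer representations, but the principle of turning sound into a sequence of tokens is similar.

```python
def frame_signal(samples, frame_size):
    """Split samples into consecutive, non-overlapping frames."""
    return [samples[i:i + frame_size]
            for i in range(0, len(samples) - frame_size + 1, frame_size)]


def nearest_code(frame, codebook):
    """Index of the codebook vector closest to the frame (squared distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: dist(frame, codebook[k]))


def encode(samples, codebook):
    """Turn raw samples into a list of discrete token indices."""
    frame_size = len(codebook[0])
    return [nearest_code(f, codebook) for f in frame_signal(samples, frame_size)]


def decode(tokens, codebook):
    """Rebuild an approximate signal by concatenating codebook vectors."""
    out = []
    for t in tokens:
        out.extend(codebook[t])
    return out


# Hypothetical codebook: four entries, each describing a 2-sample frame.
CODEBOOK = [[0.0, 0.0], [0.5, 0.5], [-0.5, -0.5], [1.0, -1.0]]

samples = [0.1, -0.1, 0.4, 0.6, -0.6, -0.4, 0.9, -0.8]
tokens = encode(samples, CODEBOOK)
approx = decode(tokens, CODEBOOK)
print(tokens)
print(approx)
```

Once speech is a sequence of tokens like this, a language model can be trained to predict the next token, much as text models predict the next word.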

VALL-E offers the same kind of in-context learning capabilities as OpenAI’s ChatGPT platform.

Microsoft trained VALL-E’s speech-synthesis capabilities on a Meta audio library that contains 60,000 hours of English-language speech from more than 7,000 speakers, mostly pulled from LibriVox public domain audiobooks. The results are mixed: some samples sound machine-like and others are surprisingly realistic, sometimes within the same output.

Fortunately, for VALL-E to generate a good result, the voice in the three-second sample must closely match a voice in the training data, so there are voices that the system can’t match yet. Someday soon, though, it may.

Perhaps owing to VALL-E’s ability to potentially fuel mischief or outright deception, Microsoft has not provided VALL-E code for others to experiment with, so we couldn’t test VALL-E’s capabilities. The researchers do seem aware of the potential widespread harm that this technology could bring. In the paper’s conclusion, they write:

“Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker. To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E.”

One possible use for VALL-E could be to narrate audiobooks. Just last week, Apple published a series of audiobooks narrated by an AI voice via its Books app.

For now, Microsoft hasn’t said if, or when, it plans to release a public version of VALL-E. The technology in general is already pretty far along. For example, voice copying software is already being used to create the official voice of Darth Vader. The rise of “creative” AIs like DALL-E and ChatGPT, combined with the availability of various deepfake algorithms, suggests we are at an inflection point in AI penetrating the day-to-day real world.

There are so many possibilities. When anything and anyone, including you and me, can be copied and faked in seconds, it’s time to regulate AI while we can still do it.

-30-

“A robot stealing a human’s voice, in the style of Picasso.”
Generated by DALL-E

David Raiklen

David Raiklen wrote, directed and scored his first film at age 9. He began studying keyboard and composing at age 5. He attended, then taught at, UCLA, USC and CalArts. Among his teachers are John Williams and Mel Powell.
He has worked for Fox, Disney and Sprint. David has received numerous awards for his work, including the 2004 American Music Center Award. Dr. Raiklen has composed music and sound design for theater (Death and the Maiden), dance (Russian Ballet), television (Sing Me a Story), cell phone (Spacey Movie), museums (Museum of Tolerance), concert (Violin Sonata), and film (Appalachian Trail).
His compositions have been performed at the Hollywood Bowl and the first Disney Hall. David Raiklen is also host of a successful radio program, Classical Fan Club.