Vall-E

Creator: Seonglae Cho
Created: 2021 May 10 14:43
Edited: 2023 Sep 5 4:13

VALL-E
VALL-E X can synthesize personalized speech in another language for a monolingual speaker. It takes as prompts the phoneme sequences derived from the source and target text, together with the source acoustic tokens produced by an audio codec model. From these, VALL-E X produces the acoustic tokens in the target language, which can then be decompressed to the target speech waveform. Thanks to its powerful in-context learning capabilities, VALL-E X does not require cross-lingual speech data of the same speakers for training and can perform various zero-shot cross-lingual speech generation tasks, such as cross-lingual text-to-speech synthesis and speech-to-speech translation.
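The prompt structure described above can be sketched roughly as follows. This is a minimal illustration, not the VALL-E X implementation: the class name `CrossLingualPrompt`, the separator tags, the ordering of the three parts, and the `toy_next_token` stand-in for a trained model are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class CrossLingualPrompt:
    """The three prompt parts named in the description (hypothetical container)."""
    source_phonemes: List[str]         # phonemes of the source-language text
    target_phonemes: List[str]         # phonemes of the target-language text
    source_acoustic_tokens: List[int]  # codec tokens of the source speech

    def as_sequence(self) -> List[str]:
        # Flatten into one conditioning sequence. The tags and the ordering
        # are illustrative assumptions, not taken from the paper.
        return (
            ["<src_ph>"] + self.source_phonemes
            + ["<tgt_ph>"] + self.target_phonemes
            + ["<src_ac>"] + [f"a{t}" for t in self.source_acoustic_tokens]
        )


def generate_target_tokens(
    prompt: CrossLingualPrompt,
    next_token: Callable[[Sequence[str]], int],
    max_len: int,
    stop_token: int = -1,
) -> List[int]:
    """Autoregressively extend the prompt with target acoustic tokens."""
    context = prompt.as_sequence()
    generated: List[int] = []
    for _ in range(max_len):
        token = next_token(context)
        if token == stop_token:
            break
        generated.append(token)
        context.append(f"a{token}")  # feed the new token back as context
    return generated


def toy_next_token(context: Sequence[str]) -> int:
    # Placeholder for a trained language model: returns a deterministic
    # token id derived from the context length, only so the loop can run.
    return len(context) % 1024


prompt = CrossLingualPrompt(
    source_phonemes=["n", "i"],
    target_phonemes=["h", "e", "l"],
    source_acoustic_tokens=[17, 42],
)
target_tokens = generate_target_tokens(prompt, toy_next_token, max_len=3)
print(target_tokens)
```

In a real system the generated acoustic tokens would be passed to the audio codec's decoder to obtain the waveform; that step is omitted here because it depends on the specific codec.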
VALL-E: Microsoft's new zero-shot text-to-speech model can duplicate everyone's voice in three seconds
Since the release of the first text-to-speech (TTS) model, researchers have been looking for ways to improve the way these systems generate speech. The latest model from Microsoft, VALL-E, is a significant step forward in this regard. VALL-E is a transformer-based TTS model that can generate speech in any voice after only hearing a three-second sample of that voice.
VALL-E
Chengyi Wang*, Sanyuan Chen*, Yu Wu*, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei (Microsoft). Abstract: We introduce a language modeling approach for text to speech synthesis (TTS).
 
 
