VALL-E
VALL-E X can synthesize personalized speech in another language for a monolingual speaker.
Taking as prompts the phoneme sequences derived from the source and target text, together with the
source acoustic tokens derived from an audio codec model, VALL-E X is able to produce the acoustic tokens in the
target language, which can be then decompressed to the target speech waveform. Thanks to its powerful
in-context learning capabilities, VALL-E X does not require cross-lingual speech data of the same
speakers for training and can perform various zero-shot cross-lingual speech generation tasks, such as
cross-lingual text-to-speech synthesis and speech-to-speech translation.
https://plachtaa.github.io/
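The prompting scheme above can be sketched as a toy pipeline. All function names and token representations here are illustrative stubs (not the actual Microsoft implementation): a placeholder G2P, a placeholder codec, and a dummy language model standing in for the autoregressive codec LM.

```python
# Hypothetical sketch of the VALL-E X inference flow described above.
# Every component below is a toy stub chosen only to keep the sketch runnable.

def text_to_phonemes(text: str) -> list[str]:
    # Placeholder G2P: treat each character as a "phoneme".
    return list(text.lower().replace(" ", ""))

def encode_audio(waveform: list[float]) -> list[int]:
    # Placeholder neural codec encoder: quantize samples to integer tokens.
    return [int(round(s * 10)) for s in waveform]

def decode_tokens(tokens: list[int]) -> list[float]:
    # Placeholder codec decoder: invert the toy quantization above.
    return [t / 10 for t in tokens]

def cross_lingual_lm(prompt_phonemes, target_phonemes, prompt_acoustic):
    # Stand-in for the conditional codec language model: VALL-E X
    # autoregressively generates target-language acoustic tokens conditioned
    # on the concatenated phoneme and acoustic prompts. Here we just cycle
    # the prompt tokens, one per target phoneme.
    return [prompt_acoustic[i % len(prompt_acoustic)]
            for i in range(len(target_phonemes))]

def vall_e_x_infer(source_text, target_text, source_waveform):
    src_ph = text_to_phonemes(source_text)   # source phoneme prompt
    tgt_ph = text_to_phonemes(target_text)   # target phoneme sequence
    src_ac = encode_audio(source_waveform)   # source acoustic prompt
    tgt_ac = cross_lingual_lm(src_ph, tgt_ph, src_ac)
    return decode_tokens(tgt_ac)             # target-language waveform

wave = vall_e_x_infer("hello", "bonjour", [0.1, -0.2, 0.3])
print(len(wave))  # one toy sample per target phoneme
```

The point of the sketch is the data flow, not the modeling: text on both sides enters as phonemes, the speaker identity enters only through the source acoustic tokens, and the output is acoustic tokens that a codec decoder turns back into audio.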
VALL-E: Microsoft's new zero-shot text-to-speech model can duplicate everyone's voice in three seconds
Since the release of the first text-to-speech (TTS) model, researchers have been looking for ways to improve the way these systems generate speech. The latest model from Microsoft, VALL-E, is a significant step forward in this regard. VALL-E is a transformer-based TTS model that can generate speech in any voice after only hearing a three-second sample of that voice.
https://mpost.io/vall-e-microsofts-new-zero-shot-text-to-speech-model-can-duplicate-everyones-voice-in-three-seconds/

VALL-E
Chengyi Wang*, Sanyuan Chen*, Yu Wu*, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei (Microsoft)
Abstract. We introduce a language modeling approach for text to speech synthesis (TTS).
https://valle-demo.github.io/

Seonglae Cho