Universal Speech Model
A family of large speech models built on a 2B-parameter Conformer encoder, pre-trained on 12M hours of unlabeled multilingual speech and 28B sentences of text. It achieves SOTA results on multilingual ASR and AST benchmarks while using only about 1/7 of the labeled data used to train Whisper (large-v2).
- Unsupervised Pre-training: BEST-RQ (BERT-based Speech pre-Training with Random-projection Quantizer)
- MOST (Multi-Objective Supervised pre-Training): Learning joint speech and text representations by injecting text (BEST-RQ + text-injection)
- Supervised ASR Training
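The core of the BEST-RQ stage is that the quantizer is *not learned*: speech features are projected through a frozen random matrix and matched against a frozen random codebook, and the resulting indices serve as BERT-style masked-prediction targets for the encoder. A minimal numpy sketch of that target computation (all shapes and the cosine-similarity lookup here are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def best_rq_targets(features, projection, codebook):
    """Quantize speech frames with a frozen random projection + codebook.

    features:   (T, D) log-mel frames
    projection: (D, H) fixed random matrix (never trained)
    codebook:   (V, H) fixed random codebook (never trained)
    Returns (T,) codebook indices used as masked-prediction targets.
    """
    proj = features @ projection                              # (T, H)
    # l2-normalize both sides, then nearest-neighbor lookup by cosine similarity
    proj = proj / np.linalg.norm(proj, axis=-1, keepdims=True)
    cb = codebook / np.linalg.norm(codebook, axis=-1, keepdims=True)
    return np.argmax(proj @ cb.T, axis=-1)                    # (T,)

rng = np.random.default_rng(0)
T, D, H, V = 100, 80, 16, 8192          # illustrative sizes
feats = rng.standard_normal((T, D))
projection = rng.standard_normal((D, H))
codebook = rng.standard_normal((V, H))
targets = best_rq_targets(feats, projection, codebook)
print(targets.shape)  # (100,)
```

Because the quantizer is random and frozen, there is no codebook-collapse problem to manage; the encoder simply learns to predict the indices of masked frames.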
Chunk-wise attention for robust long-form speech recognition
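Chunk-wise attention can be pictured as a block-diagonal attention mask: each frame attends only to frames within its own fixed-size chunk, so quality does not degrade as the utterance grows. A small sketch (the helper name and sizes are illustrative, not from the paper):

```python
import numpy as np

def chunk_attention_mask(seq_len, chunk_size):
    """Boolean (seq_len, seq_len) mask: True where frame i may attend to frame j.

    Frames attend only within their own chunk, yielding a block-diagonal mask.
    """
    chunk_id = np.arange(seq_len) // chunk_size
    return chunk_id[:, None] == chunk_id[None, :]

mask = chunk_attention_mask(seq_len=8, chunk_size=4)
```

In a real long-form decoder the mask would be applied to the attention logits (masked positions set to -inf) chunk by chunk, keeping memory and latency bounded for arbitrarily long audio.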
The model also scales to new settings cheaply: residual adapters adding only ~2% extra parameters enable quick fine-tuning for specific domains and languages. Performance on ultra-low-resource languages is further improved with Noisy Student training.
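The adapter idea can be sketched as a bottleneck layer inserted with a residual connection: down-project, nonlinearity, up-project, add back the input. With the up-projection zero-initialized, the adapter starts as an identity map, so fine-tuning begins from the pre-trained behavior. The class below is a hypothetical numpy illustration, not the paper's implementation:

```python
import numpy as np

class Adapter:
    """Bottleneck adapter: x + up(relu(down(x))), a residual identity at init."""

    def __init__(self, d_model, bottleneck, rng):
        self.down = rng.standard_normal((d_model, bottleneck)) * 0.01
        self.up = np.zeros((bottleneck, d_model))  # zero-init -> identity at start

    def __call__(self, x):
        return x + np.maximum(x @ self.down, 0.0) @ self.up

    def n_params(self):
        return self.down.size + self.up.size

rng = np.random.default_rng(0)
adapter = Adapter(d_model=512, bottleneck=64, rng=rng)
x = rng.standard_normal((10, 512))
out = adapter(x)  # identical to x until the adapter is trained
```

Only the adapter weights are updated during domain/language fine-tuning; the frozen backbone is shared, which is what keeps the per-language overhead at a few percent of total parameters.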