Universal Speech Model
A family of large speech models built on a 2B-parameter Conformer encoder, pre-trained on 12M hours of unlabeled multilingual speech and 28B sentences of text. It achieves SOTA results on multilingual ASR and AST benchmarks while using only about 1/7 of the labeled data used to train Whisper (large-v2).
- Unsupervised Pre-training: BEST-RQ (BERT-based Speech pre-Training with Random-projection Quantizer)
- MOST (Multi-Objective Supervised pre-Training): Learning joint speech and text representations by injecting text (BEST-RQ + text-injection)
- Supervised ASR Training
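The core of the BEST-RQ stage is that the quantizer is *not learned*: speech features are projected through a frozen random matrix and matched against a frozen random codebook, and the resulting indices serve as BERT-style masked-prediction targets for the encoder. A minimal numpy sketch of that target computation (all shapes and the cosine-similarity lookup here are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def best_rq_targets(features, projection, codebook):
    """Quantize speech frames with a frozen random projection + codebook.

    features:   (T, D) log-mel frames
    projection: (D, H) fixed random matrix (never trained)
    codebook:   (V, H) fixed random codebook (never trained)
    Returns (T,) codebook indices used as masked-prediction targets.
    """
    proj = features @ projection                              # (T, H)
    # l2-normalize both sides, then nearest-neighbor lookup by cosine similarity
    proj = proj / np.linalg.norm(proj, axis=-1, keepdims=True)
    cb = codebook / np.linalg.norm(codebook, axis=-1, keepdims=True)
    return np.argmax(proj @ cb.T, axis=-1)                    # (T,)

rng = np.random.default_rng(0)
T, D, H, V = 100, 80, 16, 8192          # illustrative sizes
feats = rng.standard_normal((T, D))
projection = rng.standard_normal((D, H))
codebook = rng.standard_normal((V, H))
targets = best_rq_targets(feats, projection, codebook)
print(targets.shape)  # (100,)
```

Because the quantizer is random and frozen, there is no codebook-collapse problem to manage; the encoder simply learns to predict the indices of masked frames.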
Chunk-wise attention for robust long-form speech recognition
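Chunk-wise attention can be pictured as a block-diagonal attention mask: each frame attends only to frames within its own fixed-size chunk, so quality does not degrade as the utterance grows. A small sketch (the helper name and sizes are illustrative, not from the paper):

```python
import numpy as np

def chunk_attention_mask(seq_len, chunk_size):
    """Boolean (seq_len, seq_len) mask: True where frame i may attend to frame j.

    Frames attend only within their own chunk, yielding a block-diagonal mask.
    """
    chunk_id = np.arange(seq_len) // chunk_size
    return chunk_id[:, None] == chunk_id[None, :]

mask = chunk_attention_mask(seq_len=8, chunk_size=4)
```

In a real long-form decoder the mask would be applied to the attention logits (masked positions set to -inf) chunk by chunk, keeping memory and latency bounded for arbitrarily long audio.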
The model also scales to new settings cheaply: residual adapters adding only ~2% extra parameters enable quick fine-tuning for specific domains and languages. Performance on ultra-low-resource languages is further improved with Noisy Student training.
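The adapter idea can be sketched as a bottleneck layer inserted with a residual connection: down-project, nonlinearity, up-project, add back the input. With the up-projection zero-initialized, the adapter starts as an identity map, so fine-tuning begins from the pre-trained behavior. The class below is a hypothetical numpy illustration, not the paper's implementation:

```python
import numpy as np

class Adapter:
    """Bottleneck adapter: x + up(relu(down(x))), a residual identity at init."""

    def __init__(self, d_model, bottleneck, rng):
        self.down = rng.standard_normal((d_model, bottleneck)) * 0.01
        self.up = np.zeros((bottleneck, d_model))  # zero-init -> identity at start

    def __call__(self, x):
        return x + np.maximum(x @ self.down, 0.0) @ self.up

    def n_params(self):
        return self.down.size + self.up.size

rng = np.random.default_rng(0)
adapter = Adapter(d_model=512, bottleneck=64, rng=rng)
x = rng.standard_normal((10, 512))
out = adapter(x)  # identical to x until the adapter is trained
```

Only the adapter weights are updated during domain/language fine-tuning; the frozen backbone is shared, which is what keeps the per-language overhead at a few percent of total parameters.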