Google USM

Creator
Seonglae Cho
Created
2025 Jul 4 22:40
Edited
2025 Jul 4 22:45
Refs

Universal Speech Model

A collection of large speech models pre-trained with a 2B-parameter Conformer encoder on 12M hours of unlabeled multilingual speech and 28B sentences of text. It achieves SOTA on multilingual ASR and AST benchmarks while using only about 1/7 of the labeled data of Whisper (large-v2).
  1. Unsupervised Pre-training: BEST-RQ (BERT-based Speech pre-Training with Random-projection Quantizer); see the sketch after this list
  2. MOST (Multi-Objective Supervised pre-Training): learning joint speech and text representations by injecting text (BEST-RQ + text injection)
  3. Supervised ASR Training
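A minimal sketch of the BEST-RQ idea, assuming nothing beyond the published description: a frozen random projection and a frozen random codebook turn speech features into discrete labels, and the Conformer encoder is trained to predict those labels at masked positions. All dimensions and names below are illustrative, not Google's.

```python
# BEST-RQ target generation sketch: frozen random projection + frozen random
# codebook map speech features to discrete labels for masked prediction.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

feat_dim, proj_dim, codebook_size = 80, 16, 8192  # illustrative sizes

# Frozen (never trained) random projection matrix and codebook.
projection = torch.randn(feat_dim, proj_dim)
codebook = F.normalize(torch.randn(codebook_size, proj_dim), dim=-1)

def bestrq_targets(features: torch.Tensor) -> torch.Tensor:
    """Map (time, feat_dim) speech features to discrete codebook indices."""
    projected = F.normalize(features @ projection, dim=-1)  # (time, proj_dim)
    similarity = projected @ codebook.T                     # cosine similarity
    return similarity.argmax(dim=-1)                        # (time,) labels

# The encoder sees features with random spans masked out and is trained with
# cross-entropy to predict these labels for the masked frames.
labels = bestrq_targets(torch.randn(100, feat_dim))
```

Because both the projection and the codebook are random and frozen, the quantizer itself requires no training, which keeps the pre-training pipeline simple.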
The encoder uses chunk-wise attention for robust long-form speech recognition.
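A minimal sketch of the chunk-wise attention constraint, assuming the common block-diagonal formulation (USM's exact masking details may differ): each frame attends only within its fixed-size chunk, so attention behaves the same on short training clips and on hour-long audio.

```python
# Chunk-wise attention mask sketch: frames may only attend within their chunk.
import torch

def chunk_attention_mask(seq_len: int, chunk_size: int) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask; True = attention allowed."""
    chunk_ids = torch.arange(seq_len) // chunk_size
    return chunk_ids[:, None] == chunk_ids[None, :]   # block-diagonal mask

mask = chunk_attention_mask(seq_len=12, chunk_size=4)
scores = torch.randn(12, 12)                          # raw attention logits
scores = scores.masked_fill(~mask, float("-inf"))     # block cross-chunk attention
weights = scores.softmax(dim=-1)
```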
The model also scales well to new use cases: it can be quickly fine-tuned for specific domains and languages with residual adapters that add only about 2% more parameters. Performance on ultra-low-resource languages is further improved with the Noisy Student approach.
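A minimal sketch of a residual adapter of the kind described above, with hypothetical dimensions: a small bottleneck MLP added to each (frozen) encoder block, so per-language or per-domain fine-tuning only touches these few extra weights.

```python
# Residual adapter sketch (illustrative; USM's adapter details may differ).
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    def __init__(self, d_model: int, bottleneck: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.down = nn.Linear(d_model, bottleneck)   # project down
        self.up = nn.Linear(bottleneck, d_model)     # project back up

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual path: the frozen encoder output passes through unchanged,
        # plus a small learned correction.
        return x + self.up(torch.relu(self.down(self.norm(x))))

# e.g. a 1536-dim encoder block with a 64-dim bottleneck adds roughly
# 2 * 1536 * 64 ≈ 0.2M parameters per layer.
adapter = ResidualAdapter(d_model=1536, bottleneck=64)
out = adapter(torch.randn(2, 100, 1536))
```

The bottleneck width sets the extra-parameter budget; USM reports roughly 2% additional parameters overall for this kind of adaptation.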