Unified Speech Recognition

Creator: Seonglae Cho
Created: 2025 Dec 18 16:07
Edited: 2025 Dec 18 16:15
In ASR, a unified design combines the CTC approach and the attention (encoder-decoder) approach within a single model, so both branches are trained and used jointly.
Here, a branch simply means an output head/path attached on top of a shared encoder.
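A common formulation of such joint training is an interpolated loss over the two heads; a minimal PyTorch-style sketch, where the function name, tensor shapes, and the weight `lam` are all illustrative assumptions, not the paper's exact setup:

```python
import torch
import torch.nn.functional as F

def joint_loss(ctc_log_probs, input_lens, dec_logits, ctc_targets, target_lens,
               dec_targets, lam=0.3):
    """Interpolated hybrid CTC/attention objective over a shared encoder.

    ctc_log_probs: (T, B, V) log-softmax output of the CTC head
    dec_logits:    (B, L, V) logits from the attention decoder head
    ctc_targets:   (B, S) padded label ids for the CTC branch
    dec_targets:   (B, L) decoder label ids, padded with -100
    lam:           branch interpolation weight (illustrative value)
    """
    l_ctc = F.ctc_loss(ctc_log_probs, ctc_targets, input_lens, target_lens,
                       blank=0, zero_infinity=True)
    l_att = F.cross_entropy(dec_logits.transpose(1, 2), dec_targets,
                            ignore_index=-100)
    return lam * l_ctc + (1.0 - lam) * l_att
```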
CTC is used as a teacher to generate attention-side pseudo-labels in parallel (without autoregressive decoding).
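A minimal sketch of the greedy CTC decode that produces such teacher output (argmax per frame, merge repeats, drop blanks); the shapes and the blank index are assumptions:

```python
import torch

def ctc_greedy_decode(log_probs: torch.Tensor, blank: int = 0) -> list[list[int]]:
    """Greedy CTC decode: argmax per frame, merge repeats, then drop blanks.

    log_probs: (T, B, V) frame-level log-probabilities from the CTC head.
    Returns one token sequence per batch element, usable as pseudo-labels.
    """
    best = log_probs.argmax(dim=-1)  # (T, B): best token id per frame
    seqs = []
    for b in range(best.shape[1]):
        prev, out = None, []
        for tok in best[:, b].tolist():
            if tok != prev and tok != blank:  # merge & collapse rule
                out.append(tok)
            prev = tok
        seqs.append(out)
    return seqs
```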

USR 2.0

The existing Unified Speech Recognition (USR) approach is slow to train because the attention branch relies on autoregressive pseudo-labeling (AR PL), and it trains CTC and attention in a decoupled manner, so AR errors tend to self-reinforce on long sentences, noise, and new domains (OOD).
USR 2.0 introduces CTC-driven teacher forcing: the teacher's greedy CTC output (merge & collapse) is fed as the decoder input, so attention pseudo-labels are generated in parallel with a single forward pass (removing AR). The CTC and attention pseudo-labels are aligned to the same length, so the student predicts both simultaneously, letting the decoder absorb both CTC's robustness and attention's expressiveness.
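A hedged sketch of this single-pass pseudo-labeling step, reusing `ctc_greedy_decode` from above; the `teacher.encode`/`teacher.decode` interfaces are hypothetical stand-ins, not the paper's actual API:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

@torch.no_grad()
def ctc_driven_pseudo_labels(teacher, feats, feat_lens, blank=0):
    """Attention-branch pseudo-labels in ONE forward pass (no AR loop).

    The teacher's greedy CTC output is fed as the decoder input, so the
    decoder labels all positions in parallel. Hypothetical interfaces:
      teacher.encode(feats, lens) -> (enc_states, (T, B, V) CTC log-probs)
      teacher.decode(tokens, enc_states) -> (B, L, V) decoder logits
    """
    enc, ctc_log_probs = teacher.encode(feats, feat_lens)
    ctc_pl = ctc_greedy_decode(ctc_log_probs, blank=blank)  # CTC pseudo-labels
    tokens = pad_sequence(
        [torch.tensor(s, dtype=torch.long) for s in ctc_pl],
        batch_first=True, padding_value=blank,
    )
    # One parallel decoder pass, teacher-forced with CTC-derived input
    # instead of its own step-by-step (autoregressive) predictions.
    att_logits = teacher.decode(tokens, enc)
    att_pl = att_logits.argmax(dim=-1)  # attention pseudo-labels, same length
    return ctc_pl, att_pl  # aligned targets: student predicts both at once
```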
Mixed sampling (default 0.5) alternates between the CTC-driven mode and the AR mode at each training step to reduce train-test mismatch (training only with CTC-derived input would mismatch AR inference). This yields ~2× faster training (the AR PL bottleneck is removed) and better OOD robustness (significant WER improvements on long utterances, noise, and other datasets), while maintaining SOTA-level performance on in-domain data.
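A minimal sketch of the mixed-sampling switch, with both pseudo-labeling modes passed in as callables (the function and parameter names are hypothetical):

```python
import random

def pseudo_label_step(ctc_mode_fn, ar_mode_fn, batch, p_ctc=0.5):
    """Mixed sampling between the two pseudo-labeling modes.

    ctc_mode_fn: fast parallel CTC-driven mode (single forward pass)
    ar_mode_fn:  original autoregressive mode (matches AR inference)
    p_ctc:       probability of the CTC-driven mode (paper default 0.5)
    """
    if random.random() < p_ctc:
        return ctc_mode_fn(batch)
    return ar_mode_fn(batch)  # keep some AR steps to avoid train-test mismatch
```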
