Unified Speech Recognition

Creator: Seonglae Cho
Created: 2025 Dec 18 16:07
Edited: 2025 Dec 18 16:15
In ASR, a unified design combines the CTC approach and the attention (encoder-decoder) approach within a single model, so both branches are trained and used jointly.
Here, a branch simply means an output head/path attached on top of a shared encoder.
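A common formulation of such joint training is an interpolated loss over the two heads; a minimal PyTorch-style sketch, where the function name, tensor shapes, and the weight `lam` are all illustrative assumptions, not the paper's exact setup:

```python
import torch
import torch.nn.functional as F

def joint_loss(ctc_log_probs, input_lens, dec_logits, ctc_targets, target_lens,
               dec_targets, lam=0.3):
    """Interpolated hybrid CTC/attention objective over a shared encoder.

    ctc_log_probs: (T, B, V) log-softmax output of the CTC head
    dec_logits:    (B, L, V) logits from the attention decoder head
    ctc_targets:   (B, S) padded label ids for the CTC branch
    dec_targets:   (B, L) decoder label ids, padded with -100
    lam:           branch interpolation weight (illustrative value)
    """
    l_ctc = F.ctc_loss(ctc_log_probs, ctc_targets, input_lens, target_lens,
                       blank=0, zero_infinity=True)
    l_att = F.cross_entropy(dec_logits.transpose(1, 2), dec_targets,
                            ignore_index=-100)
    return lam * l_ctc + (1.0 - lam) * l_att
```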
CTC is used as a teacher to generate attention-side pseudo-labels in parallel (without autoregressive decoding).
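A minimal sketch of the greedy CTC decode that produces such teacher output (argmax per frame, merge repeats, drop blanks); the shapes and the blank index are assumptions:

```python
import torch

def ctc_greedy_decode(log_probs: torch.Tensor, blank: int = 0) -> list[list[int]]:
    """Greedy CTC decode: argmax per frame, merge repeats, then drop blanks.

    log_probs: (T, B, V) frame-level log-probabilities from the CTC head.
    Returns one token sequence per batch element, usable as pseudo-labels.
    """
    best = log_probs.argmax(dim=-1)  # (T, B): best token id per frame
    seqs = []
    for b in range(best.shape[1]):
        prev, out = None, []
        for tok in best[:, b].tolist():
            if tok != prev and tok != blank:  # merge & collapse rule
                out.append(tok)
            prev = tok
        seqs.append(out)
    return seqs
```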

USR 2.0

The existing Unified Speech Recognition (USR) approach is slow to train because the attention branch relies on autoregressive pseudo-labeling (AR PL), and it trains CTC and attention in a decoupled manner, so AR errors tend to self-reinforce on long sentences, noise, and new domains (OOD).
USR 2.0 introduces CTC-driven teacher forcing: the teacher's greedy CTC output (merge & collapse) is fed as the decoder input, so attention pseudo-labels are generated in parallel with a single forward pass (removing AR). The CTC and attention pseudo-labels are aligned to the same length, so the student predicts both simultaneously, letting the decoder absorb both CTC's robustness and attention's expressiveness.
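A hedged sketch of this single-pass pseudo-labeling step, reusing `ctc_greedy_decode` from above; the `teacher.encode`/`teacher.decode` interfaces are hypothetical stand-ins, not the paper's actual API:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

@torch.no_grad()
def ctc_driven_pseudo_labels(teacher, feats, feat_lens, blank=0):
    """Attention-branch pseudo-labels in ONE forward pass (no AR loop).

    The teacher's greedy CTC output is fed as the decoder input, so the
    decoder labels all positions in parallel. Hypothetical interfaces:
      teacher.encode(feats, lens) -> (enc_states, (T, B, V) CTC log-probs)
      teacher.decode(tokens, enc_states) -> (B, L, V) decoder logits
    """
    enc, ctc_log_probs = teacher.encode(feats, feat_lens)
    ctc_pl = ctc_greedy_decode(ctc_log_probs, blank=blank)  # CTC pseudo-labels
    tokens = pad_sequence(
        [torch.tensor(s, dtype=torch.long) for s in ctc_pl],
        batch_first=True, padding_value=blank,
    )
    # One parallel decoder pass, teacher-forced with CTC-derived input
    # instead of its own step-by-step (autoregressive) predictions.
    att_logits = teacher.decode(tokens, enc)
    att_pl = att_logits.argmax(dim=-1)  # attention pseudo-labels, same length
    return ctc_pl, att_pl  # aligned targets: student predicts both at once
```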
Mixed sampling (default 0.5) alternates between the CTC-driven mode and the AR mode at each training step to reduce train-test mismatch (training only with CTC-derived input would mismatch AR inference). This yields ~2× faster training (the AR PL bottleneck is removed) and better OOD robustness (significant WER improvements on long utterances, noise, and other datasets), while maintaining SOTA-level performance on in-domain data.
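A minimal sketch of the mixed-sampling switch, with both pseudo-labeling modes passed in as callables (the function and parameter names are hypothetical):

```python
import random

def pseudo_label_step(ctc_mode_fn, ar_mode_fn, batch, p_ctc=0.5):
    """Mixed sampling between the two pseudo-labeling modes.

    ctc_mode_fn: fast parallel CTC-driven mode (single forward pass)
    ar_mode_fn:  original autoregressive mode (matches AR inference)
    p_ctc:       probability of the CTC-driven mode (paper default 0.5)
    """
    if random.random() < p_ctc:
        return ctc_mode_fn(batch)
    return ar_mode_fn(batch)  # keep some AR steps to avoid train-test mismatch
```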
