Connectionist Temporal Classification
A training and modeling approach designed to learn sequences of different lengths without alignment labels, such as "speech frame sequences (long) → text (short)" where two sequences have different lengths
- Strong alignment capability: Learns "which frame corresponds to which character" without explicit alignment labels by summing over all possible alignments.
- Stable and parallelizable training: Easy to compute frame-by-frame (linear + softmax on top of encoder output).
- However, there are drawbacks: Since each frame prediction is treated almost independently (though the encoder does see context), linguistic context utilization is weak, making it less capable than encoder-decoder models at correcting awkward spelling/word errors.

Seonglae Cho