To mitigate the limitations of the Delay Pattern, CSM introduces compute amortization. The backbone predicts the zeroth codebook (coarse semantic information) for every frame, while the decoder is trained to predict the remaining N-1 codebooks on only a random 1/16 subset of frames. Because the decoder's loss is computed on a fraction of the frames, training is faster and needs substantially less memory and compute, with no noticeable loss in voice quality. The idea is comparable to how the limitations of RNNs were addressed by recasting sequence modeling as autoregressive next-token prediction.
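The frame subsampling described above can be sketched in a few lines. This is a minimal illustration, not CSM's actual training code: the function names `sample_decoder_frames` and `amortized_loss`, the per-frame loss list, and the `decoder_loss_fn` callable are all hypothetical stand-ins, and the equal weighting of the two loss terms is an assumption.

```python
import random


def sample_decoder_frames(num_frames, fraction=1 / 16, rng=None):
    """Pick the random subset of frame indices the decoder is trained on.

    Hypothetical helper: always keeps at least one frame.
    """
    if num_frames < 1:
        raise ValueError("need at least one frame")
    rng = rng or random.Random()
    k = max(1, int(num_frames * fraction))
    return sorted(rng.sample(range(num_frames), k))


def amortized_loss(backbone_losses, decoder_loss_fn, fraction=1 / 16, rng=None):
    """Combine a backbone loss over all frames with a decoder loss over a subset.

    backbone_losses: per-frame loss for the zeroth codebook (every frame).
    decoder_loss_fn: hypothetical callable returning the loss for codebooks
        1..N-1 at frame t; it is invoked only for the sampled frames, which
        is where the memory and compute saving comes from.
    """
    num_frames = len(backbone_losses)
    frames = sample_decoder_frames(num_frames, fraction, rng)
    backbone_term = sum(backbone_losses) / num_frames
    decoder_term = sum(decoder_loss_fn(t) for t in frames) / len(frames)
    # Assumption: the two terms are simply added with equal weight.
    return backbone_term + decoder_term, frames
```

With 64 frames and the default fraction, the decoder loss is evaluated on 4 frames while the backbone loss still covers all 64.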
Compute amortization
Created: 2025 Jun 15 22:53
Edited: 2025 Jun 15 22:54