The efficiency bottleneck of LLMs comes from the autoregressive paradigm of predicting short discrete tokens one at a time: even with a 32k–256k vocabulary, each token carries only about 15–18 bits of information, and pushing this higher runs into an exploding softmax vocabulary.
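That 15–18 bit figure is just the information content of one softmax draw: a token from a vocabulary of size $V$ carries at most $\log_2 V$ bits.

$$\log_2 32{,}768 = 15 \text{ bits}, \qquad \log_2 262{,}144 = 18 \text{ bits}$$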
CALM instead compresses each chunk of K tokens into a single continuous vector z with a high-fidelity autoencoder, then trains a continuous autoregressive LM over these vectors.
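A minimal sketch of the chunk-autoencoder idea, assuming simple linear encoder/decoder layers (all names and sizes are illustrative, not from the CALM codebase): K token embeddings go in, one vector z comes out, and the decoder reconstructs K token logits.

```python
import torch
import torch.nn as nn

class ChunkAutoencoder(nn.Module):
    """Compress a chunk of K tokens into one continuous vector z (illustrative sketch)."""
    def __init__(self, vocab_size=32768, K=4, d_model=256, d_latent=128):
        super().__init__()
        self.K, self.d_model = K, d_model
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.Linear(K * d_model, d_latent)   # K embeddings -> one z
        self.decoder = nn.Linear(d_latent, K * d_model)   # z -> K hidden states
        self.lm_head = nn.Linear(d_model, vocab_size)     # hidden -> token logits

    def forward(self, tokens):                            # tokens: (B, K) int64
        h = self.embed(tokens).flatten(1)                 # (B, K * d_model)
        z = self.encoder(h)                               # (B, d_latent)
        h_rec = self.decoder(z).view(-1, self.K, self.d_model)
        return self.lm_head(h_rec), z                     # logits: (B, K, vocab)
```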
CALM (shaochenze/calm, updated 2025 Nov 12)
Techniques to make the latent space smooth and robust: a plain autoencoder reconstructs well, but its latent space is extremely brittle, so even slight errors in a predicted vector can decode into completely wrong tokens. To prevent this, a VAE is used.
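A hedged sketch of the VAE-style regularization (the weighting and any KL clipping in the actual paper may differ): the encoder outputs a mean and log-variance, a reparameterized sample is decoded, and the KL penalty keeps nearby latents decoding to the same tokens.

```python
import torch

def vae_latent(mu, logvar, beta=1e-3):
    """Reparameterized sample plus a weighted KL penalty (illustrative weighting)."""
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # z ~ N(mu, sigma^2)
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1).mean()
    return z, beta * kl
```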
Since the model is an implicit continuous generative model, log-likelihood and perplexity are unavailable; instead a likelihood-free evaluation metric, BrierLM, is proposed:
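The core trick, as I read the paper, is that the reversed Brier score $2p(y) - \sum_x p(x)^2$ (i.e. 1 minus the classic Brier score, so higher is better) admits an unbiased estimate from just two independent model samples, so no density is ever needed; BrierLM then aggregates such estimates over n-grams (aggregation details omitted here).

```python
def brier_estimate(x1, x2, y):
    """Unbiased two-sample estimate of 2*p(y) - sum_x p(x)^2 (higher is better).

    x1, x2: two independent samples from the model; y: the reference item.
    E[1{x1=y}] = E[1{x2=y}] = p(y) and E[1{x1=x2}] = sum_x p(x)^2.
    """
    return float(x1 == y) + float(x2 == y) - float(x1 == x2)
```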
When predicting the next latent, the model is fed a small random noise vector together with its hidden state and maps them to a continuous sample, in the spirit of a diffusion model.
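A minimal sketch of such a noise-conditioned generative head, assuming a plain MLP over the concatenated hidden state and noise (the paper's actual head is more involved; this only shows the noise-in, latent-out interface).

```python
import torch
import torch.nn as nn

class NoiseConditionedHead(nn.Module):
    """Map (hidden state, fresh noise) to the next continuous latent (illustrative)."""
    def __init__(self, d_model=256, d_latent=128, d_noise=64):
        super().__init__()
        self.d_noise = d_noise
        self.net = nn.Sequential(
            nn.Linear(d_model + d_noise, 512),
            nn.SiLU(),
            nn.Linear(512, d_latent),
        )

    def forward(self, h):                                    # h: (B, d_model)
        eps = torch.randn(h.size(0), self.d_noise, device=h.device)
        return self.net(torch.cat([h, eps], dim=-1))         # (B, d_latent)
```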
The LM loss changes from token cross-entropy to regression in latent space (plus the KL term on the VAE).
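Putting the pieces together, the training objective could be composed as below; note this is a simplification under my assumptions, with plain MSE standing in for the paper's proper scoring rule on the generative head.

```python
import torch.nn.functional as F

def calm_style_loss(z_pred, z_next, kl_term):
    """Latent-space regression replaces token cross-entropy (illustrative simplification)."""
    return F.mse_loss(z_pred, z_next) + kl_term
```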
Generation is autoregressive over the latent sequence, one latent per step; at the end, all latents are decoded together, each back into its K tokens, to produce the final text.
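A sketch of the generation loop under the assumptions above (all callables are illustrative placeholders, not the CALM API): predict latents step by step, then decode every latent back into its chunk of K tokens.

```python
import torch

@torch.no_grad()
def generate(lm, head, decode_chunk, z0, steps=16):
    """Autoregress over latent vectors, then decode each latent back to K tokens.

    lm:           latent sequence (B, T, d_latent) -> hidden states (B, T, d_model)
    head:         noise-conditioned head, hidden (B, d_model) -> latent (B, d_latent)
    decode_chunk: AE decoder, latents (B, T, d_latent) -> logits (B, T, K, vocab)
    """
    latents = [z0]                                    # z0: (B, d_latent) prompt latent
    for _ in range(steps):
        h = lm(torch.stack(latents, dim=1))[:, -1]    # hidden state at the last position
        latents.append(head(h))                       # sample the next latent via noise
    z = torch.stack(latents, dim=1)                   # (B, T+1, d_latent)
    return decode_chunk(z).argmax(-1)                 # (B, T+1, K) token ids
```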

Seonglae Cho