Continuous Autoregressive Model

Creator: Seonglae Cho
Created: 2025 Nov 12 23:03
Edited: 2025 Nov 12 23:13
The efficiency ceiling of LLMs comes from the AR paradigm of predicting short discrete tokens one at a time: even with a 32k–256k vocabulary, each token carries only about 15–18 bits of information (roughly log2 of the vocabulary size), and raising this further makes the softmax vocabulary explode.
Instead, a chunk of K tokens is compressed into a single continuous vector z with a high-fidelity autoencoder, and a continuous autoregressive LM is trained over these vectors.
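A minimal sketch of that setup, assuming hypothetical module names and sizes (K, D_LATENT, the GRU backbone, and the linear encoder/decoder are all illustrative, not the paper's architecture): an autoencoder maps K tokens to one vector z, and an autoregressive model predicts the next z from the previous ones.

```python
import torch
import torch.nn as nn

K, VOCAB, D_TOK, D_LATENT = 4, 32768, 256, 128   # illustrative sizes

class ChunkAutoencoder(nn.Module):
    """Compresses a chunk of K tokens into one continuous vector z and back."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_TOK)
        self.enc = nn.Linear(K * D_TOK, D_LATENT)       # K tokens -> one vector z
        self.dec = nn.Linear(D_LATENT, K * VOCAB)       # z -> logits for K tokens

    def encode(self, tokens):                           # tokens: (B, K)
        h = self.embed(tokens).flatten(1)                # (B, K * D_TOK)
        return self.enc(h)                               # (B, D_LATENT)

    def decode(self, z):                                 # z: (B, D_LATENT)
        return self.dec(z).view(-1, K, VOCAB)            # per-token logits

class ContinuousARModel(nn.Module):
    """Predicts the next latent vector from the sequence of previous latents."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.GRU(D_LATENT, D_LATENT, batch_first=True)
        self.head = nn.Linear(D_LATENT, D_LATENT)

    def forward(self, z_seq):                            # z_seq: (B, T, D_LATENT)
        h, _ = self.backbone(z_seq)
        return self.head(h[:, -1])                       # prediction for the next z
```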
 
 
 
 

CALM repo: shaochenze/calm (updated 2025 Nov 12 21:48)

Techniques to make the latent space "smooth and robust": a plain AE reconstructs well, but its latent space is extremely brittle; even slight errors in the predicted vector can decode into completely wrong tokens. To prevent this, a VAE is used.
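A standard VAE-style encoder illustrates the idea (a minimal sketch with illustrative dimensions, not the paper's implementation): the KL term pulls each latent toward a smooth region, so small perturbations of z still decode to the same tokens.

```python
import torch
import torch.nn as nn

class VariationalEncoder(nn.Module):
    def __init__(self, d_in=1024, d_latent=128):
        super().__init__()
        self.mu = nn.Linear(d_in, d_latent)
        self.logvar = nn.Linear(d_in, d_latent)

    def forward(self, h):                                 # h: pooled chunk features
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        # KL(q(z|x) || N(0, I)) keeps the latent space smooth around each code
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, kl

# total autoencoder loss (schematic): reconstruction cross-entropy + beta * kl
```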
Since it is a continuous implicit model, log-likelihood / perplexity cannot be computed. Without a likelihood, a new evaluation metric, BrierLM, is proposed, based on the Brier score estimated purely from model samples.
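A hedged sketch of the sample-only Brier estimate behind this idea (the paper's exact n-gram weighting and scaling may differ): the quantity 2·p(y) − Σ_x p(x)² can be estimated without any likelihood by drawing two independent samples per position.

```python
import random

def brier_estimate(sample_a, sample_b, target):
    """Unbiased single-position estimate of 2*p(target) - sum_x p(x)^2."""
    return (float(sample_a == target) + float(sample_b == target)
            - float(sample_a == sample_b))

def corpus_brier(model_sample, targets):
    """Average the estimate over positions; model_sample(i) draws one token for position i."""
    scores = []
    for i, y in enumerate(targets):
        a, b = model_sample(i), model_sample(i)   # two independent draws from the model
        scores.append(brier_estimate(a, b, y))
    return sum(scores) / len(scores)

# toy usage: a "model" that samples uniformly from a 3-token vocabulary
targets = [0, 1, 2, 1]
print(corpus_brier(lambda i: random.randrange(3), targets))
```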
When predicting the next latent, the model also receives a small random noise input, so it can generate a sample from a continuous distribution rather than softmax probabilities.
Diffusion Model
The LM loss shifts from cross-entropy over discrete tokens to a regression-style objective on the continuous latent (with a KL term on the VAE side).
Generation proceeds autoregressively over the latent sequence, one vector per step; at the end, all latents are decoded together to produce the final text.
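A sketch of that generation loop, reusing the hypothetical modules from the first sketch; noise_head is an assumed small network (e.g. an MLP) mapping the context vector plus fresh noise to a sampled next latent, not the paper's specific head.

```python
import torch

@torch.no_grad()
def generate(ar_model, noise_head, autoencoder, z_prompt, steps=8):
    z_seq = z_prompt                                     # (1, T0, D_LATENT)
    for _ in range(steps):
        h = ar_model(z_seq)                              # context summary, (1, D_LATENT)
        eps = torch.randn_like(h)                        # small random noise input
        z_next = noise_head(torch.cat([h, eps], -1))     # one sample of the next latent
        z_seq = torch.cat([z_seq, z_next.unsqueeze(1)], dim=1)
    logits = autoencoder.decode(z_seq.flatten(0, 1))     # decode every latent to K tokens
    return logits.argmax(-1).view(-1)                    # flat sequence of token ids
```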
 
