The efficiency bottleneck of LLMs comes from the autoregressive paradigm of predicting short discrete tokens one at a time: even with a 32k–256k vocabulary, each token carries only about 15–18 bits of information, and pushing this higher runs into an exploding softmax vocabulary.
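That 15–18 bit figure is just the information content of one softmax draw: a token from a vocabulary of size $V$ carries at most $\log_2 V$ bits.

$$\log_2 32{,}768 = 15 \text{ bits}, \qquad \log_2 262{,}144 = 18 \text{ bits}$$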
CALM instead compresses each chunk of K tokens into a single continuous vector z with a high-fidelity autoencoder, then trains a continuous autoregressive LM over these vectors.
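A minimal sketch of the chunk-autoencoder idea, assuming simple linear encoder/decoder layers (all names and sizes are illustrative, not from the CALM codebase): K token embeddings go in, one vector z comes out, and the decoder reconstructs K token logits.

```python
import torch
import torch.nn as nn

class ChunkAutoencoder(nn.Module):
    """Compress a chunk of K tokens into one continuous vector z (illustrative sketch)."""
    def __init__(self, vocab_size=32768, K=4, d_model=256, d_latent=128):
        super().__init__()
        self.K, self.d_model = K, d_model
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.Linear(K * d_model, d_latent)   # K embeddings -> one z
        self.decoder = nn.Linear(d_latent, K * d_model)   # z -> K hidden states
        self.lm_head = nn.Linear(d_model, vocab_size)     # hidden -> token logits

    def forward(self, tokens):                            # tokens: (B, K) int64
        h = self.embed(tokens).flatten(1)                 # (B, K * d_model)
        z = self.encoder(h)                               # (B, d_latent)
        h_rec = self.decoder(z).view(-1, self.K, self.d_model)
        return self.lm_head(h_rec), z                     # logits: (B, K, vocab)
```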
CALM (shaochenze/calm, updated 2025 Nov 12)
Techniques to make the latent space smooth and robust: a plain autoencoder reconstructs well, but its latent space is extremely brittle, so even slight errors in a predicted vector can decode into completely wrong tokens. To prevent this, a VAE is used.
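A hedged sketch of the VAE-style regularization (the weighting and any KL clipping in the actual paper may differ): the encoder outputs a mean and log-variance, a reparameterized sample is decoded, and the KL penalty keeps nearby latents decoding to the same tokens.

```python
import torch

def vae_latent(mu, logvar, beta=1e-3):
    """Reparameterized sample plus a weighted KL penalty (illustrative weighting)."""
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # z ~ N(mu, sigma^2)
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1).mean()
    return z, beta * kl
```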
Since the model is an implicit continuous generative model, log-likelihood and perplexity are unavailable; instead a likelihood-free evaluation metric, BrierLM, is proposed:
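The core trick, as I read the paper, is that the reversed Brier score $2p(y) - \sum_x p(x)^2$ (i.e. 1 minus the classic Brier score, so higher is better) admits an unbiased estimate from just two independent model samples, so no density is ever needed; BrierLM then aggregates such estimates over n-grams (aggregation details omitted here).

```python
def brier_estimate(x1, x2, y):
    """Unbiased two-sample estimate of 2*p(y) - sum_x p(x)^2 (higher is better).

    x1, x2: two independent samples from the model; y: the reference item.
    E[1{x1=y}] = E[1{x2=y}] = p(y) and E[1{x1=x2}] = sum_x p(x)^2.
    """
    return float(x1 == y) + float(x2 == y) - float(x1 == x2)
```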
When predicting the next latent, the model is fed a small random noise vector together with its hidden state and maps them to a continuous sample, in the spirit of a diffusion model.
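A minimal sketch of such a noise-conditioned generative head, assuming a plain MLP over the concatenated hidden state and noise (the paper's actual head is more involved; this only shows the noise-in, latent-out interface).

```python
import torch
import torch.nn as nn

class NoiseConditionedHead(nn.Module):
    """Map (hidden state, fresh noise) to the next continuous latent (illustrative)."""
    def __init__(self, d_model=256, d_latent=128, d_noise=64):
        super().__init__()
        self.d_noise = d_noise
        self.net = nn.Sequential(
            nn.Linear(d_model + d_noise, 512),
            nn.SiLU(),
            nn.Linear(512, d_latent),
        )

    def forward(self, h):                                    # h: (B, d_model)
        eps = torch.randn(h.size(0), self.d_noise, device=h.device)
        return self.net(torch.cat([h, eps], dim=-1))         # (B, d_latent)
```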
The LM loss changes from token cross-entropy to regression in latent space (plus the KL term on the VAE).
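Putting the pieces together, the training objective could be composed as below; note this is a simplification under my assumptions, with plain MSE standing in for the paper's proper scoring rule on the generative head.

```python
import torch.nn.functional as F

def calm_style_loss(z_pred, z_next, kl_term):
    """Latent-space regression replaces token cross-entropy (illustrative simplification)."""
    return F.mse_loss(z_pred, z_next) + kl_term
```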
Generation is autoregressive over the latent sequence, one latent per step; at the end, all latents are decoded together, each back into its K tokens, to produce the final text.
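A sketch of the generation loop under the assumptions above (all callables are illustrative placeholders, not the CALM API): predict latents step by step, then decode every latent back into its chunk of K tokens.

```python
import torch

@torch.no_grad()
def generate(lm, head, decode_chunk, z0, steps=16):
    """Autoregress over latent vectors, then decode each latent back to K tokens.

    lm:           latent sequence (B, T, d_latent) -> hidden states (B, T, d_model)
    head:         noise-conditioned head, hidden (B, d_model) -> latent (B, d_latent)
    decode_chunk: AE decoder, latents (B, T, d_latent) -> logits (B, T, K, vocab)
    """
    latents = [z0]                                    # z0: (B, d_latent) prompt latent
    for _ in range(steps):
        h = lm(torch.stack(latents, dim=1))[:, -1]    # hidden state at the last position
        latents.append(head(h))                       # sample the next latent via noise
    z = torch.stack(latents, dim=1)                   # (B, T+1, d_latent)
    return decode_chunk(z).argmax(-1)                 # (B, T+1, K) token ids
```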

Seonglae Cho