General Language Model
2021
4.7
Image generation
A discrete autoregressive image generation model that combines an AR generator with a diffusion decoder in a hybrid architecture. The AR part (9B) is initialized from GLM-4-9B-0414 and jointly trained on text-to-image and image-to-image tasks, using semantic-VQ tokens (from the X-Omni tokenizer family) to strengthen controllability and semantic correlation. Quality at high resolutions is improved through progressive generation (a low-resolution 256-token layout followed by high-resolution tokens) and weight adjustments.
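A minimal sketch of the two-stage progressive AR generation, assuming a generic decoder-only transformer over semantic-VQ token ids; the `ARGenerator` class, codebook size, and token counts are illustrative stand-ins, not GLM's actual implementation:

```python
# Sketch: stage 1 samples a low-resolution 256-token layout, stage 2
# continues autoregressively to high-resolution tokens. Toy model only.
import torch
import torch.nn as nn

class ARGenerator(nn.Module):
    """Toy decoder-only transformer over semantic-VQ token ids."""
    def __init__(self, vocab_size=16384, d_model=256, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):
        x = self.embed(ids)
        # causal mask: each position attends only to earlier tokens
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        return self.head(self.blocks(x, mask=mask))

@torch.no_grad()
def sample(model, prefix, n_new, temperature=1.0):
    ids = prefix
    for _ in range(n_new):
        logits = model(ids)[:, -1] / temperature
        next_id = torch.multinomial(logits.softmax(-1), 1)
        ids = torch.cat([ids, next_id], dim=1)
    return ids

model = ARGenerator()
text_prefix = torch.randint(0, 16384, (1, 8))   # stand-in for prompt tokens
layout = sample(model, text_prefix, n_new=256)  # stage 1: low-res layout
hires = sample(model, layout, n_new=1024)       # stage 2: high-res tokens
print(hires.shape)                              # (1, 8 + 256 + 1024)
```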
The diffusion decoder (7B) uses a CogView4-style single-stream DiT with flow matching, taking the semantic-VQ tokens generated by the AR model as conditions to restore and refine high-frequency details; Glyph-ByT5 is additionally used to improve text rendering. For editing, VAE latents from the reference image are also provided as conditions, with block-causal attention reducing computational cost. Post-training optimizes the AR model and the decoder separately with GRPO and flow-GRPO respectively: the AR model focuses on semantic alignment and aesthetics (OCR, VLM, HPSv3, etc.), while the decoder enhances details (LPIPS, OCR, hand scoring, etc.).
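A minimal sketch of a flow-matching objective for a decoder conditioned on semantic-VQ tokens, assuming a rectified-flow (linear interpolation) formulation; `TinyDiTDecoder`, the single-stream concatenation of condition and latent tokens, and all sizes are illustrative assumptions rather than the CogView4/GLM code:

```python
# Sketch: the decoder predicts the velocity v = x1 - x0 for noisy image
# latents x_t, conditioned on semantic-VQ token ids from the AR stage.
import torch
import torch.nn as nn

class TinyDiTDecoder(nn.Module):
    """Toy single-stream transformer predicting flow-matching velocity."""
    def __init__(self, latent_dim=16, cond_vocab=16384, d_model=256):
        super().__init__()
        self.cond_embed = nn.Embedding(cond_vocab, d_model)
        self.in_proj = nn.Linear(latent_dim + 1, d_model)  # +1 for timestep
        layer = nn.TransformerEncoderLayer(d_model, 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, 2)
        self.out_proj = nn.Linear(d_model, latent_dim)

    def forward(self, x_t, t, vq_ids):
        # single stream: condition tokens and noisy latents in one sequence
        t_feat = t[:, None, None].expand(-1, x_t.size(1), 1)
        x = self.in_proj(torch.cat([x_t, t_feat], dim=-1))
        cond = self.cond_embed(vq_ids)
        h = self.blocks(torch.cat([cond, x], dim=1))
        return self.out_proj(h[:, cond.size(1):])           # velocity prediction

def flow_matching_loss(model, x1, vq_ids):
    """x1: clean image latents; x0: Gaussian noise; regress v = x1 - x0."""
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.size(0))
    x_t = (1 - t)[:, None, None] * x0 + t[:, None, None] * x1
    v_pred = model(x_t, t, vq_ids)
    return ((v_pred - (x1 - x0)) ** 2).mean()

model = TinyDiTDecoder()
latents = torch.randn(2, 64, 16)             # (batch, patches, latent_dim)
vq_ids = torch.randint(0, 16384, (2, 256))   # semantic-VQ condition tokens
print(flow_matching_loss(model, latents, vq_ids).item())
```

For editing, the same conditioning path would additionally carry reference VAE latents; the block-causal attention mentioned above lets those reference tokens be cached rather than recomputed, which is where the compute saving comes from.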

Seonglae Cho