GLM

Creator: Seonglae Cho
Created: 2026 Jan 10 16:24
Edited: 2026 Jan 14 18:37

General Language Model

2021
4.7

Image generation

A hybrid image generation model that pairs a discrete autoregressive (AR) generator with a diffusion decoder. The AR part (9B) is initialized from GLM-4-9B-0414 and jointly trained on text-to-image and image-to-image tasks, using semantic-VQ tokens (from the X-Omni tokenizer family) to improve controllability and semantic alignment. Quality at high resolutions is improved through progressive generation (a low-resolution 256-token layout, then high-resolution tokens) and weight adjustments.
The diffusion decoder (7B) uses a CogView4-style single-stream DiT with flow matching, taking the semantic-VQ tokens generated by the AR model as conditions to restore and refine high-frequency details; Glyph-ByT5 is additionally used to strengthen text rendering. For editing, VAE latents of the reference image are also provided as conditions, with block-causal attention reducing computational cost. Post-training optimizes the AR model and the decoder separately, using GRPO and flow-GRPO respectively: the AR stage targets semantic alignment and aesthetics (OCR, VLM, HPSv3, etc.), while the decoder targets detail quality (LPIPS, OCR, hand scoring, etc.).
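The two-stage flow above (AR layout pass → AR high-resolution pass → diffusion decoding) can be sketched as a minimal pipeline. This is a hypothetical illustration with stubbed models: the function names, token counts beyond the stated 256-token layout, and codebook size are assumptions, not GLM's actual API.

```python
# Hypothetical sketch of the hybrid AR -> diffusion-decoder pipeline.
# Model internals are stubbed; only the data flow mirrors the description.
import random

LOW_RES_TOKENS = 256       # coarse layout pass (stated in the text)
HIGH_RES_TOKENS = 1024     # assumed size of the high-resolution pass
VOCAB_SIZE = 16384         # assumed semantic-VQ codebook size

def ar_generate(prompt: str, n_tokens: int, condition=None) -> list[int]:
    """Stub for the 9B AR generator: emits semantic-VQ token ids.

    `condition` stands in for the low-resolution layout tokens that the
    high-resolution pass attends to during progressive generation.
    """
    seed = hash((prompt, n_tokens, tuple(condition or ()))) & 0xFFFF
    rng = random.Random(seed)
    return [rng.randrange(VOCAB_SIZE) for _ in range(n_tokens)]

def diffusion_decode(vq_tokens: list[int]) -> str:
    """Stub for the 7B flow-matching DiT decoder: tokens -> pixels.

    In the real model this conditions on the semantic-VQ tokens to
    restore high-frequency detail; here it returns a placeholder.
    """
    return f"image<{len(vq_tokens)} tokens>"

def generate_image(prompt: str) -> str:
    # Stage 1: progressive AR generation -- low-resolution layout first,
    # then high-resolution tokens conditioned on that layout.
    layout = ar_generate(prompt, LOW_RES_TOKENS)
    tokens = ar_generate(prompt, HIGH_RES_TOKENS, condition=layout)
    # Stage 2: the diffusion decoder maps semantic-VQ tokens to pixels.
    return diffusion_decode(tokens)
```

For editing tasks, the sketch would additionally pass reference-image VAE latents into `diffusion_decode` as a second condition; that path is omitted here for brevity.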
 
 

Backlinks

MoE Model
