Byte Latent Transformer

Creator
Seonglae Cho
Created
2024 Dec 18 15:54
Edited
2025 Jan 8 22:11
Refs

FLOP-controlled scaling on raw bytes without a fixed vocabulary

At fixed inference cost, BLT scales significantly better than tokenization-based models.
BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation (the role tokens play in standard transformers). Patches are segmented based on the entropy of the next byte, allocating more compute and model capacity where increased data complexity demands it.
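As a minimal sketch of this idea (not the paper's exact implementation): a small byte-level language model scores each position's next-byte entropy, and a new patch starts wherever entropy crosses a global threshold. Here `entropy_model` and `threshold` are assumptions standing in for BLT's entropy model and its tuned cutoff.

```python
import torch
import torch.nn.functional as F

def entropy_patch_boundaries(byte_seq: bytes, entropy_model, threshold: float = 0.6):
    """Split a byte sequence into patches wherever next-byte entropy is high.

    `entropy_model` stands in for BLT's small byte-level LM: given byte ids of
    shape (1, seq_len), it is assumed to return next-byte logits of shape
    (1, seq_len, 256). `threshold` is a hypothetical global cutoff, tuned in
    practice to hit a target average patch size.
    """
    ids = torch.tensor(list(byte_seq)).unsqueeze(0)        # (1, seq_len)
    with torch.no_grad():
        logits = entropy_model(ids)                        # (1, seq_len, 256)
    probs = F.softmax(logits[0], dim=-1)                   # (seq_len, 256)
    # Shannon entropy H(x_t) of the predicted next-byte distribution
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)

    boundaries = [0]                                       # patch start offsets
    for t in range(1, len(byte_seq)):
        if entropy[t].item() > threshold:                  # model is uncertain here
            boundaries.append(t)                           # -> start a new patch
    return boundaries
```

Low-entropy stretches (predictable bytes) are absorbed into long patches, so the large latent transformer runs fewer steps over them, while hard-to-predict regions get finer patches and thus more compute.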

Scaling

[Image: scaling of BLT vs. tokenization-based models at matched inference FLOPs]

Architecture

[Image: BLT architecture]

Monotonicity constraint

Empirically, they find that entropy patching yields progressively larger patches in structured content, which is often highly repetitive. These variations arise because the entropy model assigns lower entropy to content that is repeated within its context. They therefore reset the entropy context at new lines and use an approximate monotonicity constraint, which suffers less from "entropy drift" as the context length changes.
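As a rough illustration (again a sketch, not the paper's exact code), the monotonicity variant can reuse the per-byte entropies from the sketch above but trigger a boundary on the increase in entropy relative to the previous byte, rather than on an absolute level; `rel_threshold` is a hypothetical value.

```python
import torch

def monotonic_patch_boundaries(entropy: torch.Tensor, rel_threshold: float = 0.2):
    """Approximate monotonicity constraint: start a new patch when entropy
    rises by more than `rel_threshold` over the previous byte, i.e. when
    H(x_t) - H(x_{t-1}) > rel_threshold.

    Because the rule compares neighbouring bytes instead of checking an
    absolute cutoff, it is less sensitive to the gradual drift of entropy
    as the entropy model's context grows (e.g. fills with repeated content).
    """
    boundaries = [0]
    for t in range(1, entropy.shape[0]):
        if (entropy[t] - entropy[t - 1]).item() > rel_threshold:
            boundaries.append(t)
    return boundaries
```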