Byte Latent Transformer

Creator
Seonglae Cho
Created
2024 Dec 18 15:54
Edited
2025 Aug 9 22:31

FLOP-controlled scaling on raw bytes without a fixed vocabulary

For fixed inference costs, BLT shows significantly better scaling than tokenization-based models.
BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation (the role tokens play in token-based models). Patches are segmented based on the entropy of the next byte, allocating more compute and model capacity where increased data complexity demands it.
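Below is a minimal sketch of that segmentation step, assuming the per-position next-byte entropies have already been produced by a small byte-level entropy model; the function name, toy data, and threshold are illustrative and not taken from the paper's code.

```python
from typing import List, Sequence

def segment_into_patches(data: bytes,
                         entropies: Sequence[float],
                         theta: float) -> List[bytes]:
    """Split raw bytes into variable-size patches.

    A new patch starts wherever the next-byte entropy exceeds the
    global threshold theta, so unpredictable regions get short patches
    (more latent-transformer steps) and predictable regions get long ones.
    """
    assert len(entropies) == len(data)
    starts = [0]
    for t in range(1, len(data)):
        if entropies[t] > theta:
            starts.append(t)
    return [data[i:j] for i, j in zip(starts, starts[1:] + [len(data)])]

# Toy example: low entropy over repeated bytes -> one long patch,
# an entropy spike at the format change -> a new patch starts there.
data = b"aaaaaa{json}"
ent = [0.1] * 6 + [2.5] + [0.3] * 5
print(segment_into_patches(data, ent, theta=1.0))  # [b'aaaaaa', b'{json}']
```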

Scaling

(figure: FLOP-controlled scaling curves)

Architecture

(figure: BLT architecture)

Monotonicity constraint

Empirically, they find that entropy patching yields progressively larger patches in structured content, which is often very repetitive. These variations are caused by lower entropy on the repeated content found in the entropy model's context. They reset the entropy context at new lines and use an approximate monotonicity constraint, since it suffers less from "entropy drift" caused by changes in context length.
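A hedged sketch of the two boundary rules: here H is assumed to be a list of per-byte next-byte entropies from the small entropy model (computed with its context reset at each newline, per the note above); the names and thresholds are illustrative.

```python
def is_patch_start_global(H, t, theta_g):
    # Global constraint: boundary when the next-byte entropy exceeds a
    # fixed threshold theta_g.
    return H[t] > theta_g

def is_patch_start_monotonic(H, t, theta_r):
    # Approximate monotonicity constraint: boundary when entropy rises by
    # more than theta_r over the previous byte. Because it only looks at
    # the local change, it is less affected by "entropy drift" as the
    # entropy model's context grows over repetitive content.
    return t > 0 and H[t] - H[t - 1] > theta_r
```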
Byte-level encoding requires approximately 20 times more training for equivalent scaling performance. The Latent Transformer operates on patches rather than traditional tokens, adding hierarchy through fixed strides and average pooling with a projection (see the sketch below). For segmentation, the entropy of the next byte determines patch boundaries. Comparing space patching versus entropy patching, entropy patching was better and came close to the Llama 3 architecture. BLT improves over previous byte-level LMs, scales better, and better models the long tail of data.
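One way to picture the fixed-stride pooling-with-projection mentioned above: byte-level hidden states are average-pooled in fixed windows and projected up to the latent transformer's width. This is only a sketch; the dimensions, stride, and module name are placeholders, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class StridePooler(nn.Module):
    """Downsample byte-level hidden states into patch representations by
    average-pooling fixed-stride windows, then projecting to the latent
    transformer's (typically wider) hidden size."""
    def __init__(self, d_byte: int, d_latent: int, stride: int):
        super().__init__()
        self.stride = stride
        self.proj = nn.Linear(d_byte, d_latent)

    def forward(self, h_bytes: torch.Tensor) -> torch.Tensor:
        # h_bytes: (batch, n_bytes, d_byte); assume n_bytes % stride == 0.
        b, n, d = h_bytes.shape
        pooled = h_bytes.view(b, n // self.stride, self.stride, d).mean(dim=2)
        return self.proj(pooled)  # (batch, n_patches, d_latent)

h = torch.randn(2, 64, 256)                        # local-encoder byte states
print(StridePooler(256, 1024, stride=8)(h).shape)  # torch.Size([2, 8, 1024])
```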
Representation space

We can't go much beyond roughly 4 bytes per BPE token due to Zipf's law. This model uses bytes for the data format and the loss, while the model's representation space is patch-based (the patches, produced by either patching method, are what the Latent Transformer operates on). Disentangling the representation space on which the model learns and operates from the raw bytes leads to better scaling.
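To make the roughly-4-bytes-per-token point concrete, here is a quick check with an off-the-shelf BPE vocabulary (tiktoken's cl100k_base, chosen only as an example; the exact ratio depends on the tokenizer and the text):

```python
import tiktoken  # any BPE tokenizer works for this illustration

text = "Byte Latent Transformer segments raw bytes by next-byte entropy."
enc = tiktoken.get_encoding("cl100k_base")

n_bytes = len(text.encode("utf-8"))
n_tokens = len(enc.encode(text))
print(f"{n_bytes / n_tokens:.2f} bytes per BPE token")  # typically ~3-5 for English
```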
