Flop controlled scaling by raw bytes without a fixed vocabulary
For fixed inference costs, BLT shows significantly better scaling than tokenization-based models.
BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation (analogous to tokens). Patches are segmented based on the entropy of the next byte, allocating more compute and model capacity where increased data complexity demands it.
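A minimal sketch of global-threshold entropy patching, assuming we already have a per-byte next-byte entropy sequence from a small byte-level language model (the entropy values and threshold here are illustrative, not from the paper):

```python
def entropy_patch_boundaries(entropies, threshold):
    """Segment a byte stream into patches: a new patch starts at every
    position whose next-byte entropy exceeds a global threshold.

    entropies: per-byte entropy H(x_i) from a small byte LM (assumed given).
    Returns the list of patch start indices; index 0 always starts a patch.
    """
    boundaries = [0]
    for i in range(1, len(entropies)):
        if entropies[i] > threshold:
            boundaries.append(i)
    return boundaries


# Illustrative values: high entropy at positions 0 and 3 (hard-to-predict
# bytes), low entropy elsewhere (predictable continuation bytes).
ents = [2.0, 0.1, 0.2, 3.0, 0.1]
print(entropy_patch_boundaries(ents, threshold=1.0))  # [0, 3]
```

Low-entropy (predictable) bytes are absorbed into the current patch, so predictable regions form long patches and the compute-heavy latent transformer runs fewer steps over them.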
Scaling
Architecture
Monotonicity constraint
Empirically, they find that entropy patching yields progressively larger patches on structured content, which is often highly repetitive. This happens because repeated content in the entropy model's context lowers the next-byte entropy. They reset the entropy context at newlines and use an approximate monotonicity constraint, which suffers less from "entropy drift" caused by changes in context length.
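The approximate monotonicity constraint can be sketched as follows: instead of comparing each byte's entropy to a global threshold, a patch boundary is placed where entropy *rises* relative to the previous byte by more than a relative threshold. Entropy values and the threshold below are illustrative assumptions:

```python
def monotonic_patch_boundaries(entropies, rel_threshold):
    """Approximate-monotonicity patching: start a new patch whenever the
    next-byte entropy increases over the previous byte by more than
    rel_threshold, i.e. H(x_i) - H(x_{i-1}) > rel_threshold.

    Because only local entropy *changes* matter, a slow drift in the
    absolute entropy level as the context grows does not move boundaries,
    unlike a fixed global threshold.
    """
    boundaries = [0]
    for i in range(1, len(entropies)):
        if entropies[i] - entropies[i - 1] > rel_threshold:
            boundaries.append(i)
    return boundaries


# Illustrative values: entropy jumps at positions 2 and 5, where the byte
# stream becomes hard to predict again after a repetitive run.
ents = [1.0, 1.1, 2.5, 2.4, 0.5, 2.0]
print(monotonic_patch_boundaries(ents, rel_threshold=1.0))  # [0, 2, 5]
```

In a full implementation the entropy model's context would also be reset at newline bytes, so repeated lines do not accumulate artificially low entropies; that reset is omitted here since this sketch operates on precomputed entropy values.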