Byte Latent Transformer

Creator
Seonglae Cho
Created
2024 Dec 18 15:54
Edited
2025 Aug 9 22:31

FLOP-controlled scaling on raw bytes without a fixed vocabulary

For fixed inference costs, BLT shows significantly better scaling than tokenization-based models.
BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation (the role tokens play in token-based models). Patches are segmented based on the entropy of the next byte, allocating more compute and model capacity where increased data complexity demands it.
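Below is a minimal sketch of that segmentation step, assuming the per-position next-byte entropies have already been produced by a small byte-level entropy model; the function name, toy data, and threshold are illustrative and not taken from the paper's code.

```python
from typing import List, Sequence

def segment_into_patches(data: bytes,
                         entropies: Sequence[float],
                         theta: float) -> List[bytes]:
    """Split raw bytes into variable-size patches.

    A new patch starts wherever the next-byte entropy exceeds the
    global threshold theta, so unpredictable regions get short patches
    (more latent-transformer steps) and predictable regions get long ones.
    """
    assert len(entropies) == len(data)
    starts = [0]
    for t in range(1, len(data)):
        if entropies[t] > theta:
            starts.append(t)
    return [data[i:j] for i, j in zip(starts, starts[1:] + [len(data)])]

# Toy example: low entropy over repeated bytes -> one long patch,
# an entropy spike at the format change -> a new patch starts there.
data = b"aaaaaa{json}"
ent = [0.1] * 6 + [2.5] + [0.3] * 5
print(segment_into_patches(data, ent, theta=1.0))  # [b'aaaaaa', b'{json}']
```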

Scaling

(figure: FLOP-controlled scaling curves)

Architecture

(figure: BLT architecture)

Monotonicity constraint

Empirically, they find that entropy patching yields progressively larger patches in structured content, which is often very repetitive. These variations are caused by lower entropy on the repeated content found in the entropy model's context. They reset the entropy context at new lines and use an approximate monotonicity constraint, since it suffers less from "entropy drift" caused by changes in context length.
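A hedged sketch of the two boundary rules: here H is assumed to be a list of per-byte next-byte entropies from the small entropy model (computed with its context reset at each newline, per the note above); the names and thresholds are illustrative.

```python
def is_patch_start_global(H, t, theta_g):
    # Global constraint: boundary when the next-byte entropy exceeds a
    # fixed threshold theta_g.
    return H[t] > theta_g

def is_patch_start_monotonic(H, t, theta_r):
    # Approximate monotonicity constraint: boundary when entropy rises by
    # more than theta_r over the previous byte. Because it only looks at
    # the local change, it is less affected by "entropy drift" as the
    # entropy model's context grows over repetitive content.
    return t > 0 and H[t] - H[t - 1] > theta_r
```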
Byte-level encoding requires approximately 20 times more training for equivalent scaling performance. The Latent Transformer operates on patches rather than traditional tokens, adding hierarchy through fixed strides and average pooling with a projection (see the sketch below). For segmentation, the entropy of the next byte determines patch boundaries. Comparing space patching versus entropy patching, entropy patching was better and came close to the Llama 3 architecture. BLT improves over previous byte-level LMs, scales better, and better models the long tail of data.
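One way to picture the fixed-stride pooling-with-projection mentioned above: byte-level hidden states are average-pooled in fixed windows and projected up to the latent transformer's width. This is only a sketch; the dimensions, stride, and module name are placeholders, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class StridePooler(nn.Module):
    """Downsample byte-level hidden states into patch representations by
    average-pooling fixed-stride windows, then projecting to the latent
    transformer's (typically wider) hidden size."""
    def __init__(self, d_byte: int, d_latent: int, stride: int):
        super().__init__()
        self.stride = stride
        self.proj = nn.Linear(d_byte, d_latent)

    def forward(self, h_bytes: torch.Tensor) -> torch.Tensor:
        # h_bytes: (batch, n_bytes, d_byte); assume n_bytes % stride == 0.
        b, n, d = h_bytes.shape
        pooled = h_bytes.view(b, n // self.stride, self.stride, d).mean(dim=2)
        return self.proj(pooled)  # (batch, n_patches, d_latent)

h = torch.randn(2, 64, 256)                        # local-encoder byte states
print(StridePooler(256, 1024, stride=8)(h).shape)  # torch.Size([2, 8, 1024])
```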
Representation space

We can't go much beyond roughly 4 bytes per BPE token due to Zipf's law. This model uses bytes for the data format and the loss, while the model's representation space is patch-based (the patches, produced by either patching method, are what the Latent Transformer operates on). Disentangling the representation space on which the model learns and operates from the raw bytes leads to better scaling.
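To make the roughly-4-bytes-per-token point concrete, here is a quick check with an off-the-shelf BPE vocabulary (tiktoken's cl100k_base, chosen only as an example; the exact ratio depends on the tokenizer and the text):

```python
import tiktoken  # any BPE tokenizer works for this illustration

text = "Byte Latent Transformer segments raw bytes by next-byte entropy."
enc = tiktoken.get_encoding("cl100k_base")

n_bytes = len(text.encode("utf-8"))
n_tokens = len(enc.encode(text))
print(f"{n_bytes / n_tokens:.2f} bytes per BPE token")  # typically ~3-5 for English
```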
