Overview
Bidirectional language models are called "bi-directional" because their representations effectively combine two unidirectional representations, one reading the sequence forward and one backward. However, this approach may impose excessive restrictions on Emergent ability: the computation includes predicting previous words by connecting forward and backward contexts, yet humans do not naturally read text backwards.
Encoder Models (Auto-Encoding Architecture)
Encoder models use the Transformer Encoder block. As Andrej Karpathy notes, "All it means that it is an encoder block is that you will delete this diagonal line of code" — that is, an encoder block is a decoder block with the causal (triangular) attention mask removed, so every token can attend to every other token (see the sketch after the list below). These models are characterized by bi-directional attention, which allows them to:
- Transform text or images into condensed numerical representations called embeddings
- Encode input sentences into vectors while preserving their semantic meaning in a form that's easier for the model to process
- Map input sequences to Latent space through auto-encoding
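To make the quote above concrete, here is a minimal PyTorch sketch of a single self-attention head. It is an illustrative simplification (no learned projections, no multi-head split), not any particular library's implementation; the only difference between the decoder-style (causal) and encoder-style (bi-directional) variants is whether the triangular mask line is applied.

```python
import torch
import torch.nn.functional as F

def self_attention(x, causal=False):
    """Single-head self-attention over x of shape (batch, seq_len, dim).

    causal=True  -> decoder-style: each position attends only to itself and earlier positions.
    causal=False -> encoder-style (bi-directional): every position attends to every other
                    position; the triangular mask is simply not applied, which is the
                    "line of code" an encoder block deletes.
    """
    B, T, C = x.shape
    # For brevity, queries/keys/values are the input itself; a real block
    # would use learned linear projections.
    q, k, v = x, x, x
    scores = q @ k.transpose(-2, -1) / C**0.5            # (B, T, T) similarity scores
    if causal:
        mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        scores = scores.masked_fill(~mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)                  # higher similarity -> larger weight
    return weights @ v                                   # weighted sum of value vectors

x = torch.randn(1, 5, 8)
encoder_out = self_attention(x, causal=False)  # bi-directional: full context
decoder_out = self_attention(x, causal=True)   # autoregressive: left context only
```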
Key Characteristics
- Bi-directional Attention: The model encodes every token in the input sequence, using Self-Attention to compute a similarity score between each position and all other positions in the sequence, both before and after it. Positions with higher similarity receive greater weight when forming each token's embedding vector, enabling comprehensive context understanding (see the sketch after this list).
- Pretraining Objective: These models are typically pretrained by corrupting input sentences and tasking the model with reconstructing the original text.
- Scaling Limitations: The bi-directional approach imposes stricter constraints on scaling than the Causal language model approach.
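As a rough illustration of bi-directional attention in practice, the sketch below loads a BERT encoder with the Hugging Face Transformers library (bert-base-uncased is used only as a representative checkpoint) and inspects both the contextual token embeddings and the attention weights; each attention row spans the full sequence rather than being truncated by a causal mask.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative encoder checkpoint; any BERT-style model behaves the same way here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The bank raised interest rates.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual embedding per token, each computed by attending over the
# whole sentence -- left and right context alike.
token_embeddings = outputs.last_hidden_state       # (1, seq_len, hidden_size)

# Attention weights from the last layer: (1, num_heads, seq_len, seq_len).
# Every row sums to 1 over *all* positions -- no causal (lower-triangular) mask.
last_layer_attn = outputs.attentions[-1]
print(last_layer_attn[0, 0].sum(dim=-1))           # each entry ~1.0: full-context attention
```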
Technical Components
Token Type Embedding
In BERT models, when two sentences are provided as input, tokens from the first sentence are assigned a value of 0, while tokens from the second sentence are assigned a value of 1 to distinguish between the two segments.
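For example, with a BERT-style tokenizer from Hugging Face Transformers (bert-base-uncased is just an illustrative checkpoint), passing two sentences yields token_type_ids of 0 for the first segment and 1 for the second; a minimal sketch:

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; any BERT-style tokenizer exposes token_type_ids.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer("How old are you?", "I am six years old.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'how', 'old', 'are', 'you', '?', '[SEP]', 'i', 'am', 'six', 'years', 'old', '.', '[SEP]']
print(encoded["token_type_ids"])
# [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]   <- first segment = 0, second segment = 1
```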
Masked Language Modeling (MLM)
This approach is called "Masked Language Modeling" because the model predicts masked tokens within the sequence (distinct from the causal masking used in autoregressive models).
MLM Strategy:
- Uses the 80-10-10 corruption strategy: of the tokens selected for prediction, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged (see the sketch below)
- MLM takes slightly longer to converge than causal language modeling because the loss is computed on only about 15% of the tokens in each sequence
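Below is a rough sketch of this corruption scheme, not BERT's exact implementation: the function name mlm_corrupt and the way special tokens are excluded are illustrative choices. About 15% of positions are selected as prediction targets, and the 80-10-10 rule then decides whether each selected token is replaced with [MASK], swapped for a random token, or left unchanged.

```python
import torch
from transformers import AutoTokenizer

# Illustrative checkpoint; only the tokenizer's vocabulary and [MASK] id are needed.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def mlm_corrupt(input_ids, mlm_prob=0.15):
    """Apply BERT-style MLM corruption to a 1-D tensor of token ids (sketch)."""
    labels = input_ids.clone()
    special = torch.tensor(
        tokenizer.get_special_tokens_mask(input_ids.tolist(), already_has_special_tokens=True),
        dtype=torch.bool,
    )
    # Select ~15% of non-special positions as prediction targets.
    selected = (torch.rand(input_ids.shape) < mlm_prob) & ~special
    labels[~selected] = -100                              # conventional ignore index: only selected positions contribute to the loss

    corrupted = input_ids.clone()
    roll = torch.rand(input_ids.shape)
    mask_pos = selected & (roll < 0.8)                    # 80%: replace with [MASK]
    rand_pos = selected & (roll >= 0.8) & (roll < 0.9)    # 10%: replace with a random token
    # Remaining 10%: keep the original token unchanged.
    corrupted[mask_pos] = tokenizer.mask_token_id
    corrupted[rand_pos] = torch.randint(len(tokenizer), (int(rand_pos.sum()),))
    return corrupted, labels

ids = torch.tensor(tokenizer("Encoder models reconstruct corrupted input.")["input_ids"])
corrupted, labels = mlm_corrupt(ids)
print(tokenizer.decode(corrupted))
```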

