Bidirectional LM

Creator: Seonglae Cho
Created: 2023 Mar 6 0:13
Edited: 2025 Oct 27 12:39

Overview

Bidirectional language models are called "bidirectional" because they are mathematically equivalent to the sum of two unidirectional representations, one reading forward and one reading backward. However, this approach may impose excessive restrictions on
Emergent ability
, since the objective involves predicting earlier tokens by connecting forward and backward context, even though humans do not naturally read text backwards.

Encoder Models (Auto-Encoding Architecture)

Encoder models use the Transformer Encoder block. As
Andrej Karpathy
notes, "All it means that it is an encoder block is that you will delete this diagonal line of code." These models are characterized by bi-directional attention, which allows them to:
  • Transform text or images into condensed numerical representations called embeddings
  • Encode input sentences into vectors while preserving their semantic meaning in a form that's easier for the model to process
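For example, here is a minimal sketch of producing sentence embeddings with an encoder model. It assumes the Hugging Face transformers library and the public bert-base-uncased checkpoint (both illustrative choices), and uses mean pooling over the last hidden state, which is one simple pooling strategy rather than the only one:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; any BERT-style encoder works the same way.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "Encoder models map text to dense vectors."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state: (batch, seq_len, hidden) contextual token embeddings.
# Mean pooling over the token dimension gives one sentence-level vector.
token_embeddings = outputs.last_hidden_state
sentence_embedding = token_embeddings.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768]) for bert-base-uncased
```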

Key Characteristics

  1. Bi-directional Attention: The model processes the input sequence token by token, but uses Self-Attention to calculate the similarity between each position and every other position in the sequence. Positions with higher similarity receive greater weight in the embedding vectors, enabling comprehensive context understanding (see the attention sketch after this list).
  2. Pretraining Objective: These models are typically pretrained by corrupting input sentences and tasking the model with reconstructing the original text.
  3. Scaling Limitations: The bi-directional approach imposes stricter constraints on scaling compared to a
    Causal language model
    .
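The "diagonal line of code" in the Karpathy quote above refers to the causal mask. Below is a minimal, illustrative sketch in plain PyTorch (toy shapes, no learned projections), showing that the only difference between bidirectional (encoder) and causal (decoder) self-attention is whether that mask is applied:

```python
import math
import torch

def self_attention(x, causal=False):
    """Toy single-head self-attention; x has shape (seq_len, d).

    With causal=False every position attends to all others (encoder,
    bidirectional). With causal=True the upper-triangular mask blocks
    attention to future positions (decoder, causal LM).
    """
    d = x.size(-1)
    q, k, v = x, x, x  # real models use learned projections; omitted here
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)
    if causal:
        # This is the "diagonal line of code": deleting it turns the
        # block into a bidirectional encoder block.
        mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # higher similarity -> larger weight
    return weights @ v

x = torch.randn(5, 8)
bidirectional_out = self_attention(x, causal=False)
causal_out = self_attention(x, causal=True)
```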

Technical Components

Token Type Embedding

In BERT models, when two sentences are provided as input, tokens from the first sentence are assigned a value of 0, while tokens from the second sentence are assigned a value of 1 to distinguish between the two segments.
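A sketch of what this looks like in practice, assuming the Hugging Face transformers BERT tokenizer (illustrative checkpoint name):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Passing two sentences produces token_type_ids (segment IDs):
# 0 for the first segment (including [CLS] and its trailing [SEP]),
# 1 for the second segment (including the final [SEP]).
encoded = tokenizer("The cat sat.", "It was tired.")
print(encoded["token_type_ids"])
# e.g. [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```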

Masked Language Modeling (MLM)

This approach is called "Masked Language Modeling" because the model predicts masked tokens within the sequence (as opposed to the causal masking used in autoregressive models).
MLM Strategy:
  • Uses the 80-10-10 corruption strategy: 15% of tokens are selected for prediction; of those, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged.
  • MLM takes somewhat longer to converge than causal language modeling because the loss is computed on only about 15% of the tokens in each sequence.
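Below is a minimal sketch of the 80-10-10 rule in plain PyTorch. It mirrors typical MLM data-collator logic but is written from scratch for illustration; the mask token ID (103) and vocabulary size (30522) are illustrative bert-base-uncased values:

```python
import torch

def mlm_mask(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Apply BERT-style 80-10-10 corruption; returns (corrupted_ids, labels)."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Select ~15% of positions as prediction targets.
    selected = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~selected] = -100  # ignore non-selected positions in the loss

    # 80% of selected positions -> [MASK]
    masked = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & selected
    input_ids[masked] = mask_token_id

    # 10% of selected positions -> random token (half of the remaining 20%)
    random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & selected & ~masked
    input_ids[random] = torch.randint(vocab_size, labels.shape)[random]

    # The remaining 10% are kept unchanged but still predicted.
    return input_ids, labels

ids = torch.tensor([[101, 2023, 2003, 1037, 3231, 102]])  # toy BERT-style IDs
corrupted, labels = mlm_mask(ids, mask_token_id=103, vocab_size=30522)
```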