Core Principle: Pull Positive Pairs Together, Push Others Apart
Step 1: Embed Both Text Batches
emb1 = model(anchor_texts)    # [B, D] - anchor embeddings
emb2 = model(positive_texts)  # [B, D] - positive embeddings
Step 2: Compute All Pairwise Similarities
logits = emb1 @ emb2.T / temperature # [B, B] similarity matrix
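As a rough illustration of this step, the sketch below builds the [B, B] matrix in plain Python. It assumes the embeddings are L2-normalized first so that dot products become cosine similarities; the note itself does not state this, and the helper names, toy vectors, and temperature of 0.5 are all hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a non-zero vector to unit length (assumption: cosine similarity is wanted)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def similarity_logits(emb1, emb2, temperature):
    """Return the [B, B] matrix of temperature-scaled cosine similarities."""
    anchors = [l2_normalize(v) for v in emb1]
    positives = [l2_normalize(v) for v in emb2]
    return [
        [sum(x * y for x, y in zip(a, p)) / temperature for p in positives]
        for a in anchors
    ]

# Toy vectors standing in for model outputs (B=2, D=2); not real embeddings.
anchor_embs = [[1.0, 0.0], [0.0, 2.0]]
positive_embs = [[3.0, 0.0], [0.0, 1.0]]

logits = similarity_logits(anchor_embs, positive_embs, temperature=0.5)
print(logits)
```

Dividing by a temperature below 1 stretches the gap between similar and dissimilar pairs before the softmax is applied.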
Step 3: Define Positive and Negative Pairs
Diagonal entries represent positive pairs (emb1[i] matches emb2[i]), while off-diagonal entries serve as in-batch negatives.
labels = [0, 1, 2, ..., B-1] # diagonal indices
Step 4: Optimize with Cross Entropy Loss
Cross entropy treats each row as a B-way classification whose correct class is the diagonal entry, so minimizing the loss raises positive-pair similarities relative to the in-batch negatives in the same row.
loss = CrossEntropy(logits, labels)
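To make the CrossEntropy call concrete, here is a minimal stdlib-only sketch of the anchor→positive direction, where the target for row i is column i. The function name and the two toy logit matrices are hypothetical; a real training loop would use a framework's built-in cross-entropy instead.

```python
import math

def in_batch_cross_entropy(logits):
    """Mean cross entropy over rows, with the diagonal entry as each row's target."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]  # equals -log softmax(row)[i]
    return total / len(logits)

# Hypothetical logits: one batch with a strong diagonal, one with no signal.
aligned = [[5.0, 0.0], [0.0, 5.0]]
uninformative = [[0.0, 0.0], [0.0, 0.0]]

loss_aligned = in_batch_cross_entropy(aligned)
loss_uninformative = in_batch_cross_entropy(uninformative)
print(loss_aligned, loss_uninformative)
```

With no signal the loss equals log(B), the cost of guessing uniformly among B candidates, and it falls toward zero as the diagonal dominates each row.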
Key Characteristics
- In-batch negatives: Other samples in the batch automatically serve as negatives
- Symmetric loss: Computed bidirectionally (anchor→positive over rows and positive→anchor over columns) and averaged; Step 4 shows only the anchor→positive direction
- Learnable threshold: No fixed similarity cutoff is set by hand; the model learns to rank each positive pair above its in-batch negatives
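The symmetric variant listed above can be sketched by applying the same row-wise cross entropy to the logits and to their transpose, then averaging. This is one common way to do it, not necessarily what any particular library does; the function names and toy logits are hypothetical.

```python
import math

def row_cross_entropy(logits):
    """Mean cross entropy over rows, with the diagonal entry as each row's target."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row max for numerical stability
        total += m + math.log(sum(math.exp(x - m) for x in row)) - row[i]
    return total / len(logits)

def transpose(matrix):
    """Swap rows and columns so positives index the rows."""
    return [list(col) for col in zip(*matrix)]

def symmetric_loss(logits):
    """Average of the anchor→positive and positive→anchor losses."""
    return 0.5 * (row_cross_entropy(logits) + row_cross_entropy(transpose(logits)))

# Hypothetical asymmetric logits, so the two directions give different values.
logits = [[4.0, 1.0], [3.0, 2.0]]
anchor_to_positive = row_cross_entropy(logits)
positive_to_anchor = row_cross_entropy(transpose(logits))
loss = symmetric_loss(logits)
print(anchor_to_positive, positive_to_anchor, loss)
```

The two directions differ whenever the logit matrix is not symmetric, which is why averaging them uses each batch more fully than a single direction does.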
Example Similarity Matrix
For batch [(a1,p1), (a2,p2), (a3,p3)]:
      p1     p2     p3
a1  [0.9]   0.2    0.1   ← a1-p1 is positive (maximize)
a2   0.1   [0.8]   0.2   ← a2-p2 is positive (maximize)
a3   0.2    0.1   [0.9]  ← a3-p3 is positive (maximize)
Goal: Maximize diagonal, minimize off-diagonal
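To see what the softmax does with this example matrix, the sketch below computes the probability each anchor assigns to its own positive at two temperatures. The temperatures 0.1 and 1.0 are assumptions chosen for illustration, and the helper names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def positive_probabilities(similarities, temperature):
    """Probability that each row's softmax assigns to its diagonal (positive) entry."""
    return [
        softmax([s / temperature for s in row])[i]
        for i, row in enumerate(similarities)
    ]

# The example similarity matrix from above: rows a1..a3, columns p1..p3.
similarities = [
    [0.9, 0.2, 0.1],
    [0.1, 0.8, 0.2],
    [0.2, 0.1, 0.9],
]

sharp = positive_probabilities(similarities, temperature=0.1)
soft = positive_probabilities(similarities, temperature=1.0)
for name, probs in (("T=0.1", sharp), ("T=1.0", soft)):
    print(name, [round(p, 3) for p in probs])
```

At the lower temperature nearly all probability mass sits on the diagonal, while at temperature 1.0 the same similarities leave each positive only slightly ahead of the negatives.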

Seonglae Cho