In LLM attention blocks, Q/K/V have no bias, and only the output projection (O) has bias.
The reason is that the input to Q/K/V comes right after LayerNorm and therefore has zero mean, so a bias there has almost no effect, whereas O's output is added to the residual connection, so its bias is trained effectively.
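A minimal PyTorch sketch of this layout (not from the note; the module and parameter names are illustrative): `bias=False` on the Q/K/V projections, `bias=True` only on the output projection that feeds the residual stream.

```python
# Illustrative sketch, assuming a pre-LayerNorm attention block.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Attention(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Pre-LN: the Q/K/V inputs are centered, so their biases are dropped.
        self.norm = nn.LayerNorm(d_model)
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        # Output projection is added to the residual stream, so its bias is kept.
        self.o_proj = nn.Linear(d_model, d_model, bias=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        h = self.norm(x)  # zero-mean input right after LayerNorm
        q = self.q_proj(h).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(h).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(h).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(b, t, d)
        return x + self.o_proj(out)  # residual connection picks up O's bias


x = torch.randn(2, 16, 512)
print(Attention()(x).shape)  # torch.Size([2, 16, 512])
```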
Transformer Attentions

Seonglae Cho