Binomial attention heads are effective for scaling models

Creator

Seonglae Cho

Created

2024 Nov 22 11:34

Editor

Seonglae Cho

Edited

2025 Feb 27 21:32

Specific

Refs

Computable

Suppose each attention head acts as a separate module, and each virtual attention head (a composition of heads across layers) acts like another attention head, as Anthropic et al noted. We might then be able to demonstrate in-context learning ability by comparing head combinations, looking for the most effective attention head combination per number of parameters.
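One possible way to make "combinations per parameter" concrete is sketched below. It assumes an attention-only transformer in which every path through the layers either uses one head in a layer or skips that layer via the residual stream, so the number of paths through exactly k heads is C(L, k) * H^k (a binomial expansion of (1 + H)^L). Paths with two or more heads are counted as virtual heads. The function names, the 4 * d_model * d_head parameter count per head, and this reading of "binomial" are assumptions made for illustration, not something the note itself specifies.

```python
from math import comb


def count_head_paths(n_layers: int, n_heads: int) -> dict[int, int]:
    """Map k -> number of paths that pass through exactly k heads.

    Assumption: each layer contributes either one of its n_heads heads
    or the identity (residual skip), giving comb(n_layers, k) * n_heads**k.
    """
    return {k: comb(n_layers, k) * n_heads ** k for k in range(n_layers + 1)}


def virtual_head_count(n_layers: int, n_heads: int) -> int:
    """Count paths composed of two or more heads (treated as virtual heads)."""
    paths = count_head_paths(n_layers, n_heads)
    return sum(count for k, count in paths.items() if k >= 2)


def attention_params(n_layers: int, n_heads: int, d_model: int, d_head: int) -> int:
    """Attention parameters, assuming W_Q, W_K, W_V, W_O of size d_model x d_head per head."""
    return n_layers * n_heads * 4 * d_model * d_head


def virtual_heads_per_param(n_layers: int, n_heads: int, d_model: int, d_head: int) -> float:
    """Ratio of virtual heads to attention parameters (a crude comparison metric)."""
    return virtual_head_count(n_layers, n_heads) / attention_params(
        n_layers, n_heads, d_model, d_head
    )


if __name__ == "__main__":
    # Two hypothetical configurations with the same attention parameter budget.
    for n_layers, n_heads in [(2, 12), (4, 6)]:
        print(
            n_layers,
            n_heads,
            virtual_head_count(n_layers, n_heads),
            virtual_heads_per_param(n_layers, n_heads, d_model=768, d_head=64),
        )
```

Under these assumptions, configurations with the same number of real heads can differ widely in how many composed paths they admit. The count alone says nothing about which paths a trained model actually uses, so it is at most a starting point for the comparison described above.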
