Binomial attention heads are effective for scaling models

Creator: Seonglae Cho
Created: 2024 Nov 22 11:34
Edited: 2025 Feb 27 21:32
If each attention head acts as a separate module, and virtual attention heads (compositions of heads across layers) act like additional attention heads, as Anthropic et al. describe, then we might be able to demonstrate in-context learning ability by comparing head combinations. The goal is to find the most effective attention head combination per number of parameters.
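One way to make "combinations per number of parameters" concrete is to count the paths in the expanded attention-only transformer. This is only a sketch under assumptions that are not in the note: every layer has the same number of heads, only the OV (value) path expansion is counted, and each head is assumed to hold four d_model x d_head matrices. Under those assumptions, a path picks either the identity or one head in each layer, so the number of paths through exactly k heads is C(L, k) * h^k, and the total is the binomial expansion of (1 + h)^L. The function names below are hypothetical.

```python
from math import comb


def count_paths(num_layers: int, heads_per_layer: int) -> dict[int, int]:
    """Paths through exactly k heads, for k = 0..num_layers.

    k = 0 is the direct (residual) path, k = 1 are the ordinary heads,
    and k >= 2 are virtual heads formed by composing heads across layers.
    """
    return {
        k: comb(num_layers, k) * heads_per_layer ** k
        for k in range(num_layers + 1)
    }


def attention_params(num_layers: int, heads_per_layer: int,
                     d_model: int, d_head: int) -> int:
    """Assumed parameter count: W_Q, W_K, W_V, W_O per head, each d_model x d_head."""
    return num_layers * heads_per_layer * 4 * d_model * d_head


def virtual_heads_per_param(num_layers: int, heads_per_layer: int,
                            d_model: int, d_head: int) -> float:
    """Virtual heads (k >= 2) divided by the assumed attention parameter count."""
    counts = count_paths(num_layers, heads_per_layer)
    virtual = sum(n for k, n in counts.items() if k >= 2)
    return virtual / attention_params(num_layers, heads_per_layer, d_model, d_head)


# Two hypothetical configurations with the same attention parameter budget.
shallow = virtual_heads_per_param(2, 12, d_model=768, d_head=64)
deep = virtual_heads_per_param(4, 6, d_model=768, d_head=64)
print(count_paths(2, 12))
print(shallow < deep)
```

With the same budget of 24 heads, the deeper configuration yields far more virtual heads, which is the sense in which the combination count, rather than the raw head count, could be compared per parameter. Whether those extra paths carry useful signal is an empirical question this count does not answer.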
