Scaled Attention

Creator
Alan Jo
Created
2024 Feb 28 3:55
Editor
Alan Jo
Edited
2024 Feb 28 4:0

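Scaling controls the variance of the attention scores so that the softmax output does not collapse into something close to a one-hot vector.

This is an important normalization to have.
The variance of the individual q and k components is much smaller than the variance of their dot product: with independent, zero-mean, unit-variance components, q·k has variance d_k.
The fix is to divide the scores by the square root of the key dimension, sqrt(d_k), before the softmax.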
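
The variance argument can be checked numerically. The sketch below is only an illustration under stated assumptions: q and k are drawn as independent standard Gaussian vectors, d_k = 64 is an arbitrary choice, and every helper name (softmax, dot, random_vector, score_variance) is hypothetical rather than taken from any library. It estimates the variance of q·k with and without the sqrt(d_k) scaling, then compares how peaked the softmax is in both cases.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def random_vector(rng, dim):
    # Assumption: components are i.i.d. with zero mean and unit variance.
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def score_variance(rng, dim, trials=2000, scaled=False):
    """Empirical variance of q.k, optionally divided by sqrt(dim)."""
    scale = math.sqrt(dim) if scaled else 1.0
    scores = [
        dot(random_vector(rng, dim), random_vector(rng, dim)) / scale
        for _ in range(trials)
    ]
    mean = sum(scores) / trials
    return sum((s - mean) ** 2 for s in scores) / trials


rng = random.Random(0)
d_k = 64  # illustrative key dimension

raw_var = score_variance(rng, d_k)                  # expected to be near d_k
scaled_var = score_variance(rng, d_k, scaled=True)  # expected to be near 1

# One query attending over 8 keys: compare how peaked the softmax is.
q = random_vector(rng, d_k)
keys = [random_vector(rng, d_k) for _ in range(8)]
raw_scores = [dot(q, k) for k in keys]
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]

raw_peak = max(softmax(raw_scores))
scaled_peak = max(softmax(scaled_scores))

print(f"variance of q.k: raw={raw_var:.2f}, scaled={scaled_var:.2f}")
print(f"largest softmax weight: raw={raw_peak:.3f}, scaled={scaled_peak:.3f}")
```

Dividing the scores by sqrt(d_k) acts like raising the softmax temperature, so the largest attention weight can only shrink or stay the same, which keeps the distribution away from the one-hot regime.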

Difference

[image]
