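Control the variance of the attention scores so the softmax output does not become like a one-hot vector.
This is an important normalization to have.
Because the variance of the individual q and k entries is much smaller than the variance of their dot product, which grows with the dimension size, the scores are scaled by dividing by the square root of the dimension size.
Difference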
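The scaling note can be checked numerically. The sketch below is only an illustration under stated assumptions: q and k entries are drawn as independent standard normal values, and the names d_k, n_keys, and n_queries are made up for this example rather than taken from the notes. It compares the variance of the raw dot products with the variance after dividing by the square root of d_k, and measures how peaked the softmax is in each case.

```python
import numpy as np


def softmax(x):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not from the notes).
d_k = 512         # dimension size of each query/key vector
n_keys = 16       # number of keys each query attends over
n_queries = 2000  # number of queries used to estimate statistics

# Assumption: entries are independent with zero mean and unit variance.
q = rng.standard_normal((n_queries, d_k))
k = rng.standard_normal((n_keys, d_k))

raw = q @ k.T                 # unscaled scores, shape (n_queries, n_keys)
scaled = raw / np.sqrt(d_k)   # divide by the square root of the dimension size

raw_var = raw.var()
scaled_var = scaled.var()

# Average of the largest softmax weight per query:
# values near 1 mean the output is close to a one-hot vector.
raw_peak = softmax(raw).max(axis=-1).mean()
scaled_peak = softmax(scaled).max(axis=-1).mean()

print(f"variance of raw scores:    {raw_var:.1f}")
print(f"variance of scaled scores: {scaled_var:.3f}")
print(f"mean max softmax weight (raw):    {raw_peak:.3f}")
print(f"mean max softmax weight (scaled): {scaled_peak:.3f}")
```

Under these assumptions the raw score variance should come out close to d_k and the scaled variance close to 1, with the unscaled softmax far more concentrated on a single key. Real trained projections will not match the unit-variance assumption exactly, so treat the numbers as a qualitative check.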