FLOPS

Creator
Seonglae Cho
Created
2024 Mar 8 14:07
Edited
2024 Mar 8 14:14
Refs

Floating Point Operations Per Second. (The lowercase form "FLOPs" instead counts the total number of floating-point operations, which is the sense used below.)

Each weight w generates exactly 6 FLOPs combined in the forward and backward pass:
  1. The unit i multiplies its output h(i) by w to send it to the unit j.
  2. The unit j adds the unit i's contribution to its total input a(j).
  3. The unit j multiplies the incoming loss gradient dL/da(j) by w to send it back to the unit i.
  4. The unit i adds the unit j's contribution to its total loss gradient dL/dh(i).
  5. The unit j multiplies its loss gradient dL/da(j) by the unit i's output h(i) to compute the loss gradient dL/dw for the given example.
  6. (The sneakiest FLOP, IMHO) The weight w adds the contribution from step 5 to its loss gradient accumulator dL/dw that aggregates gradients over all examples.
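The six steps above can be sketched as plain Python, with one counter increment per floating-point operation (an illustrative sketch, not from the note; the function and variable names are made up):

```python
def flops_for_one_weight(h_i, dL_da_j, w, dL_dw_accum):
    """Count the FLOPs one weight w incurs for one example (forward + backward)."""
    flops = 0
    # Forward pass
    contrib = h_i * w            # 1. unit i multiplies its output h(i) by w
    flops += 1
    a_j = 0.0 + contrib          # 2. unit j adds the contribution to its input a(j)
    flops += 1
    # Backward pass
    grad_back = dL_da_j * w      # 3. unit j multiplies dL/da(j) by w to send it back
    flops += 1
    dL_dh_i = 0.0 + grad_back    # 4. unit i adds it to its gradient total dL/dh(i)
    flops += 1
    dL_dw = dL_da_j * h_i        # 5. gradient w.r.t. the weight for this example
    flops += 1
    dL_dw_accum += dL_dw         # 6. accumulate dL/dw across examples
    flops += 1
    return flops, dL_dw_accum

flops, _ = flops_for_one_weight(0.5, 2.0, 0.3, 0.0)
print(flops)  # 6
```

This is where the common 6N estimate (FLOPs per example per parameter for training) comes from: 2 in the forward pass, 4 in the backward pass.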
2QT means the Self-Attention Key matrix and Query matrix.
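Assuming this refers to the query-key score computation in self-attention, its cost can be sketched as follows (a hedged estimate: the factor 2 counts one multiply plus one add per element of each dot product; `qkt_flops` and its parameters are illustrative names, not from the note):

```python
def qkt_flops(seq_len: int, d_head: int) -> int:
    """FLOPs to form the attention score matrix Q @ K^T for one head.

    Q is (seq_len, d_head) and K^T is (d_head, seq_len), so the product has
    seq_len**2 entries, each a length-d_head dot product costing 2 * d_head
    FLOPs (multiply + add).
    """
    return 2 * seq_len * seq_len * d_head

print(qkt_flops(1024, 64))  # 134217728
```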
 

Appendix B
