Floating Point Operations Per Second
Each weight w generates exactly 6 FLOPs combined in the forward and backward pass:
1. The unit i multiplies its output h(i) by w to send it to the unit j.
2. The unit j adds the unit i’s contribution to its total input a(j).
3. The unit j multiplies the incoming loss gradient dL/da(j) by w to send it back to the unit i.
4. The unit i adds the unit j’s contribution to its total loss gradient dL/dh(i).
5. The unit j multiplies its loss gradient dL/da(j) by the unit i’s output h(i) to compute the loss gradient dL/dw for the given example.
6. (The sneakiest FLOP, IMHO) The weight w adds the contribution from step 5 to its loss gradient accumulator dL/dw, which aggregates gradients across all examples.
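A minimal sketch that walks through the six steps above for a single weight; all variable names and values here are illustrative, not from any library:

```python
# Count the 6 FLOPs that one weight w (connecting unit i to unit j)
# contributes per training example. Every arithmetic op bumps the counter.

flops = 0

w = 0.5        # the weight connecting unit i to unit j
h_i = 2.0      # unit i's output (forward activation)
dL_da_j = 3.0  # loss gradient arriving at unit j's total input a(j)

# --- Forward pass ---
msg_fwd = h_i * w        # 1. unit i multiplies h(i) by w
flops += 1
a_j = 0.0 + msg_fwd      # 2. unit j adds the contribution to a(j)
flops += 1

# --- Backward pass ---
msg_bwd = dL_da_j * w    # 3. unit j multiplies dL/da(j) by w
flops += 1
dL_dh_i = 0.0 + msg_bwd  # 4. unit i adds the contribution to dL/dh(i)
flops += 1
g = dL_da_j * h_i        # 5. per-example gradient dL/dw
flops += 1
grad_accum = 0.0
grad_accum += g          # 6. the sneaky one: accumulate into dL/dw
flops += 1

print(flops)  # → 6
```

Summed over every weight and every training example, this per-weight count is what yields the familiar ≈ 6 · (parameters) · (tokens) estimate for total training compute.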


Appendix B
https://arxiv.org/pdf/2204.02311.pdf
The FLOPs Calculus of Language Model Training
Extremely large language models like the famous GPT-3 by OpenAI are all the rage. Many of us are now trying to get a sense of scale of the…
https://medium.com/@dzmitrybahdanau/the-flops-calculus-of-language-model-training-3b19c1f025e4


Seonglae Cho