Floating Point Operations Per Second
Each weight w generates exactly 6 FLOPs combined in the forward and backward pass:
1. The unit i multiplies its output h(i) by w to send it to the unit j.
2. The unit j adds the unit i’s contribution to its total input a(j).
3. The unit j multiplies the incoming loss gradient dL/da(j) by w to send it back to the unit i.
4. The unit i adds the unit j’s contribution to its total loss gradient dL/dh(i).
5. The unit j multiplies its loss gradient dL/da(j) by the unit i’s output h(i) to compute the loss gradient dL/dw for the given example.
6. (The sneakiest FLOP, IMHO) The weight w adds the contribution from step 5 to its loss gradient accumulator dL/dw, which aggregates gradients across all examples.
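A minimal sketch that walks through the six steps above for a single weight; all variable names and values here are illustrative, not from any library:

```python
# Count the 6 FLOPs that one weight w (connecting unit i to unit j)
# contributes per training example. Every arithmetic op bumps the counter.

flops = 0

w = 0.5        # the weight connecting unit i to unit j
h_i = 2.0      # unit i's output (forward activation)
dL_da_j = 3.0  # loss gradient arriving at unit j's total input a(j)

# --- Forward pass ---
msg_fwd = h_i * w        # 1. unit i multiplies h(i) by w
flops += 1
a_j = 0.0 + msg_fwd      # 2. unit j adds the contribution to a(j)
flops += 1

# --- Backward pass ---
msg_bwd = dL_da_j * w    # 3. unit j multiplies dL/da(j) by w
flops += 1
dL_dh_i = 0.0 + msg_bwd  # 4. unit i adds the contribution to dL/dh(i)
flops += 1
g = dL_da_j * h_i        # 5. per-example gradient dL/dw
flops += 1
grad_accum = 0.0
grad_accum += g          # 6. the sneaky one: accumulate into dL/dw
flops += 1

print(flops)  # → 6
```

Summed over every weight and every training example, this per-weight count is what yields the familiar ≈ 6 · (parameters) · (tokens) estimate for total training compute.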


Appendix B
https://arxiv.org/pdf/2204.02311.pdf
The FLOPs Calculus of Language Model Training
Extremely large language models like the famous GPT-3 by OpenAI are all the rage. Many of us are now trying to get a sense of scale of the…
https://medium.com/@dzmitrybahdanau/the-flops-calculus-of-language-model-training-3b19c1f025e4


Seonglae Cho