Floating Point Operations Per Second
For each training example, a weight w connecting the unit i to the unit j generates exactly 6 FLOPs across the forward and backward passes combined:
1. The unit i multiplies its output h(i) by w to send it to the unit j.
2. The unit j adds the unit i’s contribution to its total input a(j).
3. The unit j multiplies the incoming loss gradient dL/da(j) by w to send it back to the unit i.
4. The unit i adds the unit j’s contribution to its total loss gradient dL/dh(i).
5. The unit j multiplies its loss gradient dL/da(j) by the unit i’s output h(i) to compute the loss gradient dL/dw for the given example.
6. (The sneakiest FLOP, IMHO) The weight w adds the contribution from step 5 to its loss gradient accumulator dL/dw that aggregates gradients for all examples.
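The six operations above can be traced for a single weight and a single example. This is only a rough sketch: the function name `single_weight_flops` and its argument names are made up for illustration, and it assumes dL/da(j) is already known when the backward pass reaches this weight (in a real network it is only available after the forward pass has finished).

```python
def single_weight_flops(w, h_i, a_j, dL_da_j, dL_dh_i, dL_dw_acc):
    """Trace the work done by one weight w (unit i -> unit j) for one example.

    All names are hypothetical. a_j, dL_dh_i and dL_dw_acc are running totals
    that other weights and other examples also contribute to.
    """
    flops = 0

    # Forward pass
    contribution = h_i * w              # multiply: h(i) * w
    flops += 1
    a_j = a_j + contribution            # add into the total input a(j)
    flops += 1

    # Backward pass
    back_signal = dL_da_j * w           # multiply: dL/da(j) * w
    flops += 1
    dL_dh_i = dL_dh_i + back_signal     # add into dL/dh(i)
    flops += 1
    grad_w = dL_da_j * h_i              # multiply: per-example dL/dw
    flops += 1
    dL_dw_acc = dL_dw_acc + grad_w      # add into the dL/dw accumulator
    flops += 1

    return a_j, dL_dh_i, dL_dw_acc, flops


# Example call with arbitrary values.
a_j, dL_dh_i, dL_dw_acc, flops = single_weight_flops(
    w=0.5, h_i=2.0, a_j=1.0, dL_da_j=3.0, dL_dh_i=0.0, dL_dw_acc=0.0
)
print(flops)
```

The split is 2 FLOPs in the forward pass and 4 in the backward pass, with each multiply paired with an add into a running total.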