When maximizing expected reward with the policy gradient, each action is weighted only by the rewards from its own timestep onward (the "reward to go"), not by the full trajectory return.
This means removing the earlier rewards from the sum, which is valid because their contribution to the gradient has expectation 0: an action taken at time t cannot affect rewards that were already received.
However, those dropped terms do have nonzero variance, so removing them reduces the variance of the gradient estimator.
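As a reference, a sketch of the resulting estimator in standard policy-gradient notation (N sampled trajectories of length T; the exact symbols are an assumption, not from these notes):

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t}) \left(\sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})\right)$$

The dropped terms are the products $\nabla_\theta \log \pi_\theta(a_t\mid s_t)\, r(s_{t'}, a_{t'})$ with $t' < t$. Since $r(s_{t'}, a_{t'})$ is already determined before $a_t$ is sampled, and $\mathbb{E}_{a_t \sim \pi_\theta(\cdot\mid s_t)}\big[\nabla_\theta \log \pi_\theta(a_t\mid s_t)\big] = 0$, each such product has expectation 0.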
Two notations for the reward to go:
Reward to go (estimated expected reward to-go): the sum of rewards actually received from time t onward in a single sampled trajectory, a one-sample estimate.
True expected reward to-go: Q(s_t, a_t), the expected sum of rewards from time t onward given the current state and action, under the current policy.
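In symbols (standard notation; writing $\hat{Q}_{i,t}$ for the single-sample estimate is my labeling):

$$\hat{Q}_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}) \qquad\qquad Q(s_t, a_t) = \sum_{t'=t}^{T} \mathbb{E}_{\pi_\theta}\big[\, r(s_{t'}, a_{t'}) \mid s_t, a_t \,\big]$$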
Better estimate of Q → better (lower-variance) gradient → ideally we would use the true expected reward to-go Q(s_t, a_t).
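A minimal NumPy sketch of the single-sample reward-to-go computation (the function name, the discount parameter gamma, and the example rewards are illustrative assumptions; gamma = 1 recovers the undiscounted sum used above):

```python
import numpy as np

def reward_to_go(rewards, gamma=1.0):
    """Single-sample estimate of the expected reward to go:
    rtg[t] = sum over t' >= t of gamma^(t'-t) * rewards[t']."""
    rtg = np.zeros_like(rewards, dtype=float)
    running = 0.0
    # Accumulate from the last timestep backward so each entry
    # contains only rewards from its own timestep onward.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# Example: compare the full return (same value reused at every timestep)
# with the reward to go (only future rewards count for each timestep).
rewards = np.array([1.0, 0.0, 2.0, 3.0])
print(reward_to_go(rewards))                  # [6. 5. 5. 3.]
print(np.full(len(rewards), rewards.sum()))   # [6. 6. 6. 6.]
```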