In reward maximization, we only add future rewards, working from the diagonal forward in the double sum over $(t, t')$: each $\nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})$ term is weighted only by rewards with $t' \geq t$.
This means removing earlier rewards ($t' < t$) from the equation, which is possible because those terms have expectation 0, so the gradient estimator stays unbiased. However, since their variance is not 0, removing them has the effect of reducing the variance of the estimate.
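To spell out the missing step (a standard score-function argument, added here for completeness): for $t' < t$ the reward $r(s_{t'}, a_{t'})$ is already determined before $a_t$ is sampled, so

$$E_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r(s_{t'}, a_{t'})\right] = E\!\left[ r(s_{t'}, a_{t'})\, E_{a_t \sim \pi_\theta(\cdot \mid s_t)}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] \right] = 0,$$

since $E_{a \sim \pi_\theta(\cdot \mid s)}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\right] = \int \nabla_\theta \pi_\theta(a \mid s)\, da = \nabla_\theta \int \pi_\theta(a \mid s)\, da = \nabla_\theta 1 = 0$.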
Two notations:
Reward to go:
$$R_t(\tau_i) = \hat{Q}_{i,t}$$

True expected reward to-go:
$$Q^{\pi_\theta}(s_t, a_t) = \sum_{t'=t}^{T} E_{\pi_\theta}\!\left[ r(s_{t'}, a_{t'}) \mid s_t, a_t \right]$$

Estimated expected reward to-go (single-sample estimate from trajectory $i$):
$$\hat{Q}_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}) \approx Q^{\pi_\theta}(s_{i,t}, a_{i,t})$$

A better estimate of $Q$ gives a better gradient, so ideally we would use the true $Q$.
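As a concrete illustration of how $\hat{Q}_{i,t}$ is computed before it is plugged into the gradient below (a minimal sketch of my own, not from the original notes; the `reward_to_go` helper and the use of NumPy are assumptions):

```python
import numpy as np

def reward_to_go(rewards: np.ndarray) -> np.ndarray:
    """Compute Q_hat[t] = sum_{t'=t}^{T} r[t'] for one sampled trajectory.

    rewards: array of shape (T,) holding r(s_{i,t}, a_{i,t}) for a single rollout i.
    Returns an array of the same shape with the reward-to-go at each timestep.
    """
    # Reverse, take the running sum, and reverse back:
    # Q_hat[T-1] = r[T-1], and Q_hat[t] = r[t] + Q_hat[t+1].
    return np.cumsum(rewards[::-1])[::-1]

# Example: rewards [1, 0, 2] -> reward-to-go [3, 2, 2].
print(reward_to_go(np.array([1.0, 0.0, 2.0])))
```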
Reward-to-go policy gradient:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( \sum_{t'=t}^{T-1} r(s_{i,t'}, a_{i,t'}) \right)$$
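Putting the pieces together, here is a minimal PyTorch-style sketch of the estimator above (my own illustration, not the notes' code; the `policy(states)` interface returning a `torch.distributions` object such as a `Categorical`, and the trajectory dictionary layout, are assumptions; the usual surrogate-loss trick is used so that autograd produces the gradient in the formula):

```python
import torch

def reward_to_go_pg_loss(policy, trajectories):
    """Surrogate loss whose gradient is the reward-to-go policy gradient estimate.

    trajectories: list of N dicts with keys
      "states":  float tensor of shape (T, state_dim)
      "actions": tensor of shape (T,) matching the policy's action space
      "rewards": float tensor of shape (T,)
    policy(states) is assumed to return a torch.distributions.Distribution
    over actions (e.g. Categorical); this interface is an assumption.
    """
    losses = []
    for traj in trajectories:
        # hat{Q}_{i,t}: reverse cumulative sum of the sampled rewards.
        q_hat = torch.flip(torch.cumsum(torch.flip(traj["rewards"], dims=[0]), dim=0), dims=[0])
        # log pi_theta(a_{i,t} | s_{i,t}) for every timestep of this trajectory.
        log_probs = policy(traj["states"]).log_prob(traj["actions"])
        # Weight each log-prob by its reward-to-go; detach so gradients only
        # flow through log pi, matching the formula above. The negative sign
        # turns gradient descent on this loss into gradient ascent on J(theta).
        losses.append(-(log_probs * q_hat.detach()).sum())
    # Average over the N sampled trajectories (the 1/N factor).
    return torch.stack(losses).mean()

# Usage sketch: loss = reward_to_go_pg_loss(policy, trajectories); loss.backward()
```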