In reward maximization, we only add future rewards, working from the diagonal forward in the double sum over $(t, t')$: each $\nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})$ term is weighted only by rewards with $t' \geq t$.
This means removing earlier rewards ($t' < t$) from the equation, which is possible because those terms have expectation 0, so the gradient estimator stays unbiased. However, since their variance is not 0, removing them has the effect of reducing the variance of the estimate.
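To spell out the missing step (a standard score-function argument, added here for completeness): for $t' < t$ the reward $r(s_{t'}, a_{t'})$ is already determined before $a_t$ is sampled, so

$$E_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r(s_{t'}, a_{t'})\right] = E\!\left[ r(s_{t'}, a_{t'})\, E_{a_t \sim \pi_\theta(\cdot \mid s_t)}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] \right] = 0,$$

since $E_{a \sim \pi_\theta(\cdot \mid s)}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\right] = \int \nabla_\theta \pi_\theta(a \mid s)\, da = \nabla_\theta \int \pi_\theta(a \mid s)\, da = \nabla_\theta 1 = 0$.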
Two notations:
Reward to go:
$$R_t(\tau_i) = \hat{Q}_{i,t}$$

True expected reward to-go:
$$Q^{\pi_\theta}(s_t, a_t) = \sum_{t'=t}^{T} E_{\pi_\theta}\!\left[ r(s_{t'}, a_{t'}) \mid s_t, a_t \right]$$

Estimated expected reward to-go (single-sample estimate from trajectory $i$):
$$\hat{Q}_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}) \approx Q^{\pi_\theta}(s_{i,t}, a_{i,t})$$

A better estimate of $Q$ gives a better gradient, so ideally we would use the true $Q$.
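As a concrete illustration of how $\hat{Q}_{i,t}$ is computed before it is plugged into the gradient below (a minimal sketch of my own, not from the original notes; the `reward_to_go` helper and the use of NumPy are assumptions):

```python
import numpy as np

def reward_to_go(rewards: np.ndarray) -> np.ndarray:
    """Compute Q_hat[t] = sum_{t'=t}^{T} r[t'] for one sampled trajectory.

    rewards: array of shape (T,) holding r(s_{i,t}, a_{i,t}) for a single rollout i.
    Returns an array of the same shape with the reward-to-go at each timestep.
    """
    # Reverse, take the running sum, and reverse back:
    # Q_hat[T-1] = r[T-1], and Q_hat[t] = r[t] + Q_hat[t+1].
    return np.cumsum(rewards[::-1])[::-1]

# Example: rewards [1, 0, 2] -> reward-to-go [3, 2, 2].
print(reward_to_go(np.array([1.0, 0.0, 2.0])))
```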
Reward-to-go policy gradient:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( \sum_{t'=t}^{T-1} r(s_{i,t'}, a_{i,t'}) \right)$$
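Putting the pieces together, here is a minimal PyTorch-style sketch of the estimator above (my own illustration, not the notes' code; the `policy(states)` interface returning a `torch.distributions` object such as a `Categorical`, and the trajectory dictionary layout, are assumptions; the usual surrogate-loss trick is used so that autograd produces the gradient in the formula):

```python
import torch

def reward_to_go_pg_loss(policy, trajectories):
    """Surrogate loss whose gradient is the reward-to-go policy gradient estimate.

    trajectories: list of N dicts with keys
      "states":  float tensor of shape (T, state_dim)
      "actions": tensor of shape (T,) matching the policy's action space
      "rewards": float tensor of shape (T,)
    policy(states) is assumed to return a torch.distributions.Distribution
    over actions (e.g. Categorical); this interface is an assumption.
    """
    losses = []
    for traj in trajectories:
        # hat{Q}_{i,t}: reverse cumulative sum of the sampled rewards.
        q_hat = torch.flip(torch.cumsum(torch.flip(traj["rewards"], dims=[0]), dim=0), dims=[0])
        # log pi_theta(a_{i,t} | s_{i,t}) for every timestep of this trajectory.
        log_probs = policy(traj["states"]).log_prob(traj["actions"])
        # Weight each log-prob by its reward-to-go; detach so gradients only
        # flow through log pi, matching the formula above. The negative sign
        # turns gradient descent on this loss into gradient ascent on J(theta).
        losses.append(-(log_probs * q_hat.detach()).sum())
    # Average over the N sampled trajectories (the 1/N factor).
    return torch.stack(losses).mean()

# Usage sketch: loss = reward_to_go_pg_loss(policy, trajectories); loss.backward()
```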