Reward to Go

In Reward Maximization, we only sum rewards from the current timestep onward, since an action taken at time t cannot influence rewards that were already received (causality).
This means removing past rewards from the equation, which is possible because those removed terms have zero expectation.
However, since their variance is not zero, dropping them reduces the variance of the gradient estimator.
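A minimal sketch of why the dropped terms vanish in expectation (the standard log-derivative identity, conditioning on the history up to s_t, which already determines r(s_{t'}, a_{t'}) for t' < t):

E_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r(s_{t'}, a_{t'})\right] = E\left[r(s_{t'}, a_{t'})\, E_{a_t \sim \pi_\theta(\cdot \mid s_t)}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]\right] = 0

since E_{a_t \sim \pi_\theta(\cdot \mid s_t)}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] = \nabla_\theta \sum_{a_t} \pi_\theta(a_t \mid s_t) = \nabla_\theta 1 = 0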

Two notations for the same quantity

R_t(\tau^i) = \hat Q_t^i
 

Reward to go

true expected reward to-go
Q(s_t, a_t) = \sum_{t'=t}^{T} E_{\pi_\theta}[r(s_{t'}, a_{t'}) \mid s_t, a_t]
estimated expected reward to-go (single-sample Monte Carlo estimate)
Q^{\pi_\theta}(s_t^i, a_t^i) \approx \sum_{t'=t}^{T} r(s_{t'}^i, a_{t'}^i) = \hat Q_t^i
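A minimal NumPy sketch of this single-sample estimate: \hat Q_t^i is just a reverse cumulative sum of the rewards collected along one trajectory (the function name and the toy reward sequence are illustrative):

```python
import numpy as np

def reward_to_go(rewards):
    """Reverse cumulative sum: rtg[t] = sum_{t'=t}^{T} rewards[t']."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg

# Rewards from one sampled trajectory -> per-timestep reward-to-go
print(reward_to_go([1.0, 0.0, 2.0, 1.0]))  # [4. 3. 3. 1.]
```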
Better estimate of Q → better gradient → use true Q
\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \nabla_{\theta} \log \pi_{\theta}(a_{i,t} \mid s_{i,t}) \left( \sum_{t'=t}^{T-1} r(s_{i,t'}, a_{i,t'}) \right)
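A short PyTorch sketch of this estimator written as a surrogate loss, assuming the per-timestep log-probabilities and rewards for N trajectories of length T have already been stacked into [N, T] tensors (the function name and the fixed-length-batch setup are assumptions for illustration):

```python
import torch

def reward_to_go_pg_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """log_probs[i, t] = log pi_theta(a_{i,t} | s_{i,t}), rewards[i, t] = r(s_{i,t}, a_{i,t})."""
    # Reward-to-go: reverse cumulative sum over the time dimension.
    rtg = torch.flip(torch.cumsum(torch.flip(rewards, dims=[1]), dim=1), dims=[1])
    # Negative sign so that minimizing the loss ascends the policy gradient above;
    # sum over timesteps, average over the N trajectories.
    return -(log_probs * rtg).sum(dim=1).mean()
```

Differentiating this loss with autograd reproduces the double sum over i and t above, with the reward-to-go weights acting as constant per-timestep coefficients.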
