GAE

Created: 2024 Mar 20 02:08
Creator: Seonglae Cho
Edited: 2025 Feb 4 10:21

Generalized advantage estimator, N-step return

$$A_{GAE}^{\pi}(s_t, a_t)=\sum_{t'=t}^{T-1}(\gamma \lambda)^{t'-t}\, \delta_{t'}, \qquad A_{GAE}^{\pi}(s_t, a_t)=\delta_{t}+\gamma \lambda\, A_{GAE}^{\pi}(s_{t+1}, a_{t+1})$$

where $\delta_t = r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)$ is the one-step TD residual.
  • Normalizing the magnitude of the advantage
  • Recently, reward/return normalization is preferred over advantage normalization
    • The sign of the advantage matters because it determines the direction of the policy update, so advantage normalization (mean subtraction) can flip the sign of $A$ (see the sketch below).
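A minimal sketch of the backward recursion above, followed by advantage normalization. The function and variable names (`compute_gae`, `last_value`, `dones`) are illustrative, not from the note; it assumes a rollout of per-step rewards, value estimates, and done flags.

```python
import numpy as np

def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Backward recursion: A_t = delta_t + gamma * lam * (1 - done_t) * A_{t+1}."""
    T = len(rewards)
    advantages = np.zeros(T)
    next_value = last_value      # bootstrap value V(s_T) after the rollout
    next_advantage = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        next_advantage = delta + gamma * lam * nonterminal * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    return advantages

# Advantage normalization rescales the magnitude, but mean subtraction can flip signs:
adv = compute_gae(rewards=[1.0, 0.0, 1.0], values=[0.5, 0.4, 0.3],
                  last_value=0.2, dones=[0, 0, 0])
adv_norm = (adv - adv.mean()) / (adv.std() + 1e-8)
print(adv, adv_norm)
```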
 
 

N-step returns are used to find a sweet spot between bias and variance.

$$\hat A_n^\pi(s_t,a_t) = \sum_{t'=t}^{t+n-1} r(s_{t'}, a_{t'}) + V^\pi (s_{t+n}) - V^\pi(s_t), \qquad \hat Q_n^\pi(s_t, a_t) = \sum_{t'=t}^{t+n-1} r(s_{t'}, a_{t'}) + V^\pi(s_{t+n})$$
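A short illustration of the n-step estimator above (the helper name `n_step_advantage` and the sample numbers are mine). It is written without discounting to match the formula as stated, i.e. $\gamma = 1$.

```python
def n_step_advantage(rewards, values, t, n):
    """A_hat_n(s_t, a_t) = sum_{t'=t}^{t+n-1} r_{t'} + V(s_{t+n}) - V(s_t).

    No discounting, to match the formula above (gamma = 1);
    `values` must contain an entry for V(s_{t+n}).
    """
    return sum(rewards[t:t + n]) + values[t + n] - values[t]

# Larger n relies less on the (biased) value estimate but sums more noisy rewards.
rewards = [1.0, 0.0, 1.0, 1.0]
values = [2.0, 1.8, 1.5, 1.0, 0.5]   # V(s_0) ... V(s_4)
print([n_step_advantage(rewards, values, t=0, n=n) for n in (1, 2, 4)])
```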
  • Problem: it’s hard to know which n is best for advantage estimation
  • Solution: use an exponentially weighted average of the n-step returns!
$$\hat A_{GAE} (s_t, a_t) = \sum_{n=1}^{\infty} w_n \hat A_n^\pi(s_t,a_t), \qquad w_n \propto \lambda^{n-1}$$
New hyperparameter: the discounting factor $\lambda$ ($\lambda = 0.95$ typically works well)
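A rough sketch of this weighted combination (the name `gae_from_n_step` and the truncation choice are mine). The infinite sum is cut off at the end of the rollout and the weights $w_n \propto \lambda^{n-1}$ are renormalized; $\gamma = 1$ to match the n-step formula above.

```python
def gae_from_n_step(rewards, values, t, lam=0.95):
    """Exponentially weighted average of n-step advantages with w_n ∝ lam**(n-1)."""
    max_n = len(rewards) - t                          # largest n available in this rollout
    weights = [lam ** (n - 1) for n in range(1, max_n + 1)]
    total = sum(weights)                              # renormalize the truncated weights
    advantage = 0.0
    for w, n in zip(weights, range(1, max_n + 1)):
        a_n = sum(rewards[t:t + n]) + values[t + n] - values[t]   # n-step advantage
        advantage += (w / total) * a_n
    return advantage

print(gae_from_n_step(rewards=[1.0, 0.0, 1.0, 1.0],
                      values=[2.0, 1.8, 1.5, 1.0, 0.5], t=0))
```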
 
 
 

Discount factor
$\gamma$ and GAE’s $\lambda$

The $\lambda$ parameter determines a trade-off between more bias (low $\lambda$) and more variance (high $\lambda$).
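As a sanity check on the two extremes (standard GAE identities, stated here for reference rather than taken from the note):

$$\lambda = 0:\quad \hat A_{GAE}(s_t,a_t) = \delta_t = r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t) \quad \text{(one-step TD: high bias, low variance)}$$

$$\lambda = 1:\quad \hat A_{GAE}(s_t,a_t) = \sum_{l=0}^{\infty} \gamma^{l}\, \delta_{t+l} = \sum_{l=0}^{\infty} \gamma^{l} r_{t+l} - V^\pi(s_t) \quad \text{(Monte Carlo: low bias, high variance)}$$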
 
 
 

Recommendations