Policy Gradient Baseline

Baseline should depend only on $s_t$.

Usually the critic is a separate network trained on its own (observation → value).
You cannot simply use a state-action-dependent baseline and still get unbiased policy gradient estimates.
  • We use the value function $V(s_t)$, not the state-action value $Q(s_t, a_t)$
  • If the baseline is independent of the action, the expectation of the product can be separated, so the baseline term vanishes (see the sketch after this list)
 
 
 

Derivation

$$E_{x \sim p_\theta(x)}\left[\nabla_\theta \log p_\theta(x)\right] = \int p_\theta(x)\, \nabla_\theta \log p_\theta(x)\, dx = \int \nabla_\theta p_\theta(x)\, dx = \nabla_\theta \int p_\theta(x)\, dx = \nabla_\theta 1 = 0$$
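This identity is the key step behind the bullet above about separating the expectation: a baseline $b(s_t)$ that does not depend on $a_t$ factors out of the expectation over actions, so the term it adds to the gradient is exactly zero and the estimate stays unbiased:

$$E_{a_t \sim \pi_\theta(\cdot \mid s_t)}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\right] = b(s_t)\, E_{a_t \sim \pi_\theta(\cdot \mid s_t)}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] = b(s_t) \cdot 0 = 0$$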

Why the gradient variance decreases

In the variance decomposition (written out below), the second term does not depend on the baseline, since by the unbiasedness above it equals the squared true gradient for any $b$; the first term does depend on $b$ and shrinks for a well-chosen baseline, so the overall variance decreases.
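Written out (a standard derivation, with $g(\tau) = \nabla_\theta \log \pi_\theta(\tau)$ and reward $r(\tau)$; this reconstruction is not from the original images): the variance of the single-sample estimator $g(\tau)(r(\tau) - b)$ decomposes as

$$\mathrm{Var} = E\left[g(\tau)^2\, (r(\tau) - b)^2\right] - \left(E\left[g(\tau)\,(r(\tau) - b)\right]\right)^2 = E\left[g(\tau)^2\, (r(\tau) - b)^2\right] - \left(E\left[g(\tau)\, r(\tau)\right]\right)^2$$

and setting the derivative of the first term with respect to $b$ to zero gives the variance-minimizing baseline

$$b^* = \frac{E\left[g(\tau)^2\, r(\tau)\right]}{E\left[g(\tau)^2\right]}$$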
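A quick Monte Carlo sanity check of both claims, unbiasedness and variance reduction, on a toy one-parameter Gaussian "policy" (the reward $a^2$, the parameter value, and the sample count are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 1.0  # mean of a unit-variance Gaussian "policy" over actions

# Score of N(mu, 1) with respect to mu is (a - mu)
a = rng.normal(mu, 1.0, size=100_000)
score = a - mu
reward = a**2  # toy reward correlated with the action

for b in (0.0, reward.mean()):  # no baseline vs. mean-reward baseline
    grad_samples = score * (reward - b)
    print(f"baseline={b:5.2f}  mean={grad_samples.mean():.3f}  var={grad_samples.var():.3f}")
# Both baselines estimate the same gradient (2*mu = 2 here),
# but the mean-reward baseline gives a noticeably smaller variance.
```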
 
 
 

Recommendations