Policy Gradient Baseline

Creator
Seonglae Cho
Created
2024 Mar 20 1:55
Edited
2024 Apr 30 8:1

The baseline should depend only on the state, not on the action.

In practice the baseline is usually a critic: a separate network trained on its own to map an observation to a value (observation → value).
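The critic idea can be sketched without a deep-learning framework. The example below is a minimal illustration only: it swaps the network for a linear least-squares model, and the data (`obs`, `returns`, `true_w`) and helper names (`fit_linear_critic`, `critic_value`) are hypothetical, not from the original note.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 observations with 4 features and noisy Monte Carlo returns.
obs = rng.normal(size=(200, 4))
true_w = np.array([1.0, -2.0, 0.5, 0.0])
returns = obs @ true_w + 3.0 + rng.normal(scale=0.1, size=200)


def add_bias(obs):
    """Append a constant column so the critic can learn an offset."""
    return np.hstack([obs, np.ones((len(obs), 1))])


def fit_linear_critic(obs, returns):
    """Least-squares fit of V(obs) = obs @ w + c to the observed returns."""
    weights, *_ = np.linalg.lstsq(add_bias(obs), returns, rcond=None)
    return weights


def critic_value(weights, obs):
    """Predicted value for each observation (observation -> value)."""
    return add_bias(obs) @ weights


weights = fit_linear_critic(obs, returns)

# The baseline-subtracted return is what multiplies grad log pi in the policy update.
advantages = returns - critic_value(weights, obs)
```

The critic is fitted only from observations, so its output depends on the state and not on the action taken, which is the property the baseline needs.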
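You cannot simply use a state-action-dependent baseline and still obtain unbiased policy gradient estimates.
  • We therefore use the state value function V(s), not the state-action value Q(s, a), as the baseline.
  • If the baseline is independent of the action, the expectation of the product separates into a product of expectations, and the expected score ∇ log π(a|s) is zero, so the baseline term vanishes.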
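Both points can be checked numerically on a toy problem. This is a minimal sketch, assuming a single state, a softmax policy over three actions, and made-up rewards (every name and number here is hypothetical). It computes the expectations exactly by summing over actions rather than sampling.

```python
import numpy as np


def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()


def gradient_stats(theta, rewards, baseline):
    """Exact mean and total variance of the estimator
    g(a) = grad_theta log pi(a) * (r(a) - b(a)) for a one-state softmax policy.
    `baseline` may be a scalar (state-only) or a per-action array."""
    pi = softmax(theta)
    k = len(theta)
    scores = np.eye(k) - pi                      # row a is grad_theta log pi(a)
    b = np.broadcast_to(baseline, (k,))
    samples = scores * (rewards - b)[:, None]    # row a is g(a)
    mean = pi @ samples                          # sum_a pi(a) g(a)
    second_moment = pi @ (samples ** 2).sum(axis=1)
    variance = second_moment - (mean ** 2).sum() # trace of the covariance
    return mean, variance


theta = np.array([0.5, -0.2, 0.1])
rewards = np.array([10.0, 11.0, 12.0])
pi = softmax(theta)

# No baseline, a state-only baseline (expected reward), and an action-dependent one.
mean_none, var_none = gradient_stats(theta, rewards, 0.0)
mean_state, var_state = gradient_stats(theta, rewards, pi @ rewards)
mean_action, var_action = gradient_stats(theta, rewards, rewards)

print("no baseline     :", mean_none, var_none)
print("state baseline  :", mean_state, var_state)
print("action baseline :", mean_action, var_action)
```

With these numbers the state-only baseline leaves the mean unchanged and shrinks the variance. The action-dependent baseline, set here to the reward of each action, zeroes the estimator entirely: the variance is low, but the gradient is wrong.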
 
 
 

Derivation

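A standard way to see that a state-only baseline keeps the estimate unbiased is sketched here. The notation (policy π_θ, baseline b(s), discrete actions) is an assumption chosen for illustration.

```latex
\begin{aligned}
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\big]
&= b(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) \\
&= b(s) \sum_a \nabla_\theta \pi_\theta(a \mid s) \\
&= b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) \\
&= b(s)\, \nabla_\theta 1 = 0
\end{aligned}
```

The first step pulls b(s) out of the sum, which is possible only because it does not depend on a. With b(s, a) the factor stays inside the sum, and the sum no longer collapses to zero.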

Why the gradient variance goes down:

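The second term (the squared mean) is constant with respect to the baseline. The first term (the second moment) is quadratic in the baseline: it shrinks as the baseline moves toward the typical return and grows again past the optimum, so a well-chosen baseline lowers the variance.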
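The variance argument can be written out as follows. This is a sketch for a single gradient component in a fixed state, with g = ∂ log π_θ(a|s) / ∂θ_i, return R, and constant baseline b as assumed notation.

```latex
\begin{aligned}
\operatorname{Var}\big[g\,(R - b)\big]
&= \mathbb{E}\big[g^2 (R - b)^2\big] - \big(\mathbb{E}[g\,(R - b)]\big)^2 \\
&= \mathbb{E}\big[g^2 (R - b)^2\big] - \big(\mathbb{E}[g\,R]\big)^2 \\
\frac{d}{db}\,\mathbb{E}\big[g^2 (R - b)^2\big]
&= -2\,\mathbb{E}\big[g^2 (R - b)\big] = 0
\;\Longrightarrow\;
b^{*} = \frac{\mathbb{E}[g^2 R]}{\mathbb{E}[g^2]}
\end{aligned}
```

The second line uses E[g] = 0, so the squared-mean term does not depend on b. Only the second moment changes with the baseline, and it is minimized at b*, a weighted average of the return.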
 
 
 
