Policy Gradient Baseline

Baseline should depend only on $s_t$.

Usually the critic is a separate network trained on its own (observation → value).
You cannot simply use a state-action-dependent baseline and still get unbiased policy gradient estimates.
  • We use the value function $V(s_t)$, not the state-action value $Q(s_t, a_t)$
  • If the baseline is independent of the action, the expectation of the product can be separated, so the baseline term vanishes (see the sketch after this list)
 
 
 

Derivation

$$E_{x \sim p_\theta(x)}\left[\nabla_\theta \log p_\theta(x)\right] = \int p_\theta(x)\, \nabla_\theta \log p_\theta(x)\, dx = \int \nabla_\theta p_\theta(x)\, dx = \nabla_\theta \int p_\theta(x)\, dx = \nabla_\theta 1 = 0$$
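This identity is the key step behind the bullet above about separating the expectation: a baseline $b(s_t)$ that does not depend on $a_t$ factors out of the expectation over actions, so the term it adds to the gradient is exactly zero and the estimate stays unbiased:

$$E_{a_t \sim \pi_\theta(\cdot \mid s_t)}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\right] = b(s_t)\, E_{a_t \sim \pi_\theta(\cdot \mid s_t)}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] = b(s_t) \cdot 0 = 0$$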

Why the gradient variance decreases

In the variance decomposition (written out below), the second term does not depend on the baseline, since by the unbiasedness above it equals the squared true gradient for any $b$; the first term does depend on $b$ and shrinks for a well-chosen baseline, so the overall variance decreases.
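Written out (a standard derivation, with $g(\tau) = \nabla_\theta \log \pi_\theta(\tau)$ and reward $r(\tau)$; this reconstruction is not from the original images): the variance of the single-sample estimator $g(\tau)(r(\tau) - b)$ decomposes as

$$\mathrm{Var} = E\left[g(\tau)^2\, (r(\tau) - b)^2\right] - \left(E\left[g(\tau)\,(r(\tau) - b)\right]\right)^2 = E\left[g(\tau)^2\, (r(\tau) - b)^2\right] - \left(E\left[g(\tau)\, r(\tau)\right]\right)^2$$

and setting the derivative of the first term with respect to $b$ to zero gives the variance-minimizing baseline

$$b^* = \frac{E\left[g(\tau)^2\, r(\tau)\right]}{E\left[g(\tau)^2\right]}$$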
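A quick Monte Carlo sanity check of both claims, unbiasedness and variance reduction, on a toy one-parameter Gaussian "policy" (the reward $a^2$, the parameter value, and the sample count are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 1.0  # mean of a unit-variance Gaussian "policy" over actions

# Score of N(mu, 1) with respect to mu is (a - mu)
a = rng.normal(mu, 1.0, size=100_000)
score = a - mu
reward = a**2  # toy reward correlated with the action

for b in (0.0, reward.mean()):  # no baseline vs. mean-reward baseline
    grad_samples = score * (reward - b)
    print(f"baseline={b:5.2f}  mean={grad_samples.mean():.3f}  var={grad_samples.var():.3f}")
# Both baselines estimate the same gradient (2*mu = 2 here),
# but the mean-reward baseline gives a noticeably smaller variance.
```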
 
 
 

Recommendations