The expected return of a policy is the expectation of the trajectory return, taken over all possible trajectories sampled from that policy.
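A minimal sketch of this definition in standard notation (finite horizon, no discounting, for simplicity):

```latex
J(\theta)
= \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big]
= \mathbb{E}_{\tau \sim \pi_\theta}\Big[\textstyle\sum_{t=0}^{T} r(s_t, a_t)\Big]
```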
For training stability, these methods develop in the direction of reducing variance (batch gradients, a diagonal/reward-to-go sum of rewards).
Some variants estimate the reward by averaging over the whole sequence, while others predict it at a single point rather than over the sequence; some reduce the total reward sum, others reduce the rewards one by one.
With the diagonal (reward-to-go) form, each action is credited only with the rewards that come after it, so earlier actions get larger coefficients, which seems to be what incentivizes them. Subtracting a baseline b is allowed because the gradient stays the same in expectation, though this is not a normalization.
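A minimal sketch of these per-step weights (reward-to-go with a constant baseline) in plain NumPy; the names `rewards`, `logp`, and `baseline` are illustrative, not from a specific library:

```python
import numpy as np

def reward_to_go(rewards):
    """rtg[t] = sum of rewards from step t onward: each action is credited
    only with rewards that come after it, so earlier actions get larger sums."""
    return np.cumsum(np.asarray(rewards, dtype=float)[::-1])[::-1]

def pg_surrogate_loss(logp, rewards, baseline=0.0):
    """Surrogate whose gradient matches the policy gradient estimate.
    Subtracting a constant baseline b leaves the expected gradient unchanged
    while (potentially) reducing its variance."""
    weights = reward_to_go(rewards) - baseline
    return -np.sum(weights * np.asarray(logp))
```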
The Markov property is applied when moving to the third line of the derivation.
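The derivation itself is not reproduced here; presumably this refers to the step where the trajectory probability is factorized using the Markov property, so the dynamics terms drop out of the gradient with respect to θ (a sketch):

```latex
p_\theta(\tau) = p(s_0)\prod_{t=0}^{T}\pi_\theta(a_t \mid s_t)\,p(s_{t+1} \mid s_t, a_t)
\;\Rightarrow\;
\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T}\nabla_\theta \log \pi_\theta(a_t \mid s_t)
```

so that

```latex
\nabla_\theta J(\theta)
= \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{T}\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(R(\tau) - b\big)\Big]
```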


- Produces a high-variance gradient (variance of the reward across actions)
- The reward can change drastically with a minor change in actions
- Hard to find the optimum, hard to optimize
- Requires on-policy data
- The derivation of the policy gradient assumes the data come from rollouts of the current policy (see the sketch below)
Online Learning
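A minimal on-policy REINFORCE-style sketch (PyTorch; the tiny network sizes and the trajectory format are illustrative assumptions, not from a specific source) showing why each update needs fresh rollouts from the current policy:

```python
import torch
from torch import nn
from torch.distributions import Categorical

# Illustrative tiny policy; observation/action sizes depend on the environment.
policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def update(trajectory):
    """One gradient step from a single rollout of (obs, action, reward) tuples.

    The rollout must come from the *current* policy: the estimator is an
    expectation under pi_theta, so reusing stale data biases the gradient.
    """
    obs = torch.stack([torch.as_tensor(o, dtype=torch.float32) for o, _, _ in trajectory])
    acts = torch.tensor([a for _, a, _ in trajectory])
    rews = [r for _, _, r in trajectory]

    # Reward-to-go weights; their large spread across rollouts is the main
    # source of the estimator's high variance.
    rtg = torch.tensor([sum(rews[t:]) for t in range(len(rews))], dtype=torch.float32)

    logp = Categorical(logits=policy(obs)).log_prob(acts)
    loss = -(logp * rtg).mean()  # surrogate whose gradient is the PG estimate

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```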



Baseline

Why variance matters

Policy Gradient Theorem Notion
Turing 2024 (Richard Sutton)
- Andrew Barto
Andrew Barto and Richard Sutton are the recipients of the 2024 ACM A.M. Turing Award for developing the conceptual and algorithmic foundations of reinforcement learning.
https://awards.acm.org/about/2024-turing

https://proceedings.neurips.cc/paper_files/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf
Going Deeper Into Reinforcement Learning: Fundamentals of Policy Gradients
https://danieltakeshi.github.io/2017/03/28/going-deeper-into-reinforcement-learning-fundamentals-of-policy-gradients/
04. Policy Gradient methods - Part1
https://julien-vitay.net/deeprl/PolicyGradient.html#sec:policy-gradient-methods
https://lilianweng…
https://wikidocs.net/164397


Seonglae Cho