Learn the policy and the state-value function simultaneously
The policy (actor) acts and the critic (value function) evaluates, so the policy is improved against a concrete criterion (how advantageous is an action compared to the policy's average behavior?)
- Sample trajectory points (transitions) by acting with the current policy
- Update the policy and the value function using an advantage estimator like GAE (see the sketch below)
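A minimal sketch of this loop in PyTorch, assuming an advantage-actor-critic setup with a GAE advantage estimator. Names like `policy_net`, `value_net`, and `rollout` are illustrative assumptions, not from the notes; the final-step bootstrap is simplified to zero.

```python
import torch

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one sampled rollout.

    For simplicity the value after the last step is taken as 0
    (i.e. the rollout is assumed to end in a terminal state).
    """
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * (1 - dones[t]) - values[t]
        gae = delta + gamma * lam * (1 - dones[t]) * gae
        advantages[t] = gae
    return advantages

def actor_critic_update(policy_net, value_net, optimizer, rollout):
    # rollout: sampled trajectory points (states, actions, rewards, done flags)
    states, actions, rewards, dones = rollout
    values = value_net(states).squeeze(-1)                # critic estimates V(s_t)
    advantages = compute_gae(rewards, values.detach(), dones)
    returns = advantages + values.detach()                # regression targets for the critic

    dist = policy_net(states)                             # e.g. Categorical or Normal
    log_probs = dist.log_prob(actions)                    # for multi-dim continuous actions, sum over the action dim
    actor_loss = -(log_probs * advantages).mean()         # policy gradient weighted by advantage
    critic_loss = (values - returns).pow(2).mean()        # fit V toward the bootstrapped returns

    # single optimizer over both networks, just to keep the sketch short
    optimizer.zero_grad()
    (actor_loss + 0.5 * critic_loss).backward()
    optimizer.step()
```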
Properties
- No need to collect full trajectories per update: the critic's bootstrapped value estimate lets you update from sampled partial trajectories or single transitions
- Actor-critic is more sample-efficient than REINFORCE
Value-based actor critic
- Unlike semi on-policy methods such as PPO, these methods run value iteration and can reuse off-policy data
- If a Q model is also learned instead of using GAE, you get value-based actor-critic methods such as DDPG or SAC (see the sketch below)
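A sketch of the actor update in a DDPG-style value-based actor-critic, assuming a learned Q network and an off-policy replay buffer: the actor is improved by ascending the learned Q function rather than a GAE advantage. `actor`, `q_net`, and `replay_buffer` are illustrative names.

```python
import torch

def ddpg_style_actor_update(actor, q_net, actor_optimizer, replay_buffer, batch_size=256):
    # Off-policy data: states can come from an old behavior policy
    states, _, _, _, _ = replay_buffer.sample(batch_size)
    actions = actor(states)                        # deterministic action a = mu_theta(s)
    # Improve the actor by maximizing Q(s, mu_theta(s)), i.e. minimizing its negative
    actor_loss = -q_net(states, actions).mean()
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
```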
How to update Actor
- Train the parameters of the action distribution directly, e.g. the mean and a shared standard deviation of a Normal (continuous actions) or the logits of a Categorical (discrete actions); see the sketch below
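A sketch of the two common actor heads in PyTorch: for continuous actions the trainable parameters are the state-dependent mean plus a shared (state-independent) log standard deviation, and for discrete actions they are the logits of a categorical distribution. Class and attribute names are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal, Categorical

class GaussianActor(nn.Module):
    """Continuous actions: state-dependent mean, shared trainable std."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, act_dim)
        )
        # Shared log standard deviation, trained directly as a free parameter
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        return Normal(self.mean_net(obs), self.log_std.exp())

class CategoricalActor(nn.Module):
    """Discrete actions: the network outputs logits of a categorical distribution."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.logits_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, n_actions)
        )

    def forward(self, obs):
        return Categorical(logits=self.logits_net(obs))
```

Either head returns a `torch.distributions` object, so the actor update can call `dist.log_prob(actions)` the same way regardless of whether the action space is continuous or discrete.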
Actor Critic Algorithms