Simultaneous learning of Policy and State-value function
The policy (actor) acts and the critic (value function) evaluates those actions, so the policy is improved against a clear criterion: how advantageous is an action compared to what the policy would do on average?
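In symbols, this criterion is the advantage; the critic's value estimate makes it computable from a single transition via the one-step TD form on the right:

A^π(s, a) = Q^π(s, a) − V^π(s) ≈ r + γ V^π(s′) − V^π(s)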
- Sample a transition (or a short trajectory segment)
- Update the actor and critic parameters using an advantage estimate, e.g. GAE (minimal sketch below)
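A minimal sketch of one such update, assuming a small PyTorch actor and critic and a one-step TD advantage (GAE would simply generalize this over a short rollout); network sizes, learning rate, and the loss weighting are illustrative choices, not prescribed values.

```python
# Minimal one-step advantage actor-critic update (illustrative sketch).
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99

actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=3e-4)

def update(s, a, r, s_next, done):
    """One actor-critic step from a batch of sampled transitions (s, a, r, s')."""
    v = critic(s).squeeze(-1)
    with torch.no_grad():
        v_next = critic(s_next).squeeze(-1)
        td_target = r + gamma * (1.0 - done) * v_next
    advantage = (td_target - v).detach()          # how much better a was than expected
    log_prob = torch.distributions.Categorical(logits=actor(s)).log_prob(a)
    actor_loss = -(log_prob * advantage).mean()   # policy gradient weighted by advantage
    critic_loss = (td_target - v).pow(2).mean()   # TD error for the value function
    opt.zero_grad()
    (actor_loss + 0.5 * critic_loss).backward()
    opt.step()

# Example call with a dummy batch of transitions:
s = torch.randn(8, obs_dim); a = torch.randint(0, n_actions, (8,))
r = torch.randn(8); s_next = torch.randn(8, obs_dim); done = torch.zeros(8)
update(s, a, r, s_next, done)
```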
Properties
- No need to collect full trajectories for an update; bootstrapping from the critic lets you learn from single transitions or short rollouts
- Actor-critic is more sample-efficient than REINFORCE
Value-based actor-critic
- Unlike semi-on-policy methods like PPO, this approach uses off-policy data (e.g. from a replay buffer) while performing value iteration
- Instead of estimating advantages with GAE, it learns a Q model, similar to algorithms like DDPG or SAC (see the sketch below)
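A rough sketch of this value-based flavor, assuming a DDPG-style setup: transitions come from a replay buffer (off-policy), the Q network is regressed toward a bootstrapped TD target, and the actor is updated to maximize Q. Target networks and exploration noise are omitted for brevity; all names and sizes are illustrative.

```python
# DDPG-style off-policy update sketch: Q critic + deterministic actor.
import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 4, 2, 0.99

actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
q_opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def update(batch):
    """One off-policy update from replay-buffer transitions."""
    s, a, r, s_next, done = batch
    # Critic: regress Q(s, a) toward the bootstrapped TD target.
    with torch.no_grad():
        a_next = actor(s_next)
        q_target = r + gamma * (1.0 - done) * q_net(torch.cat([s_next, a_next], dim=-1)).squeeze(-1)
    q = q_net(torch.cat([s, a], dim=-1)).squeeze(-1)
    q_loss = (q - q_target).pow(2).mean()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()
    # Actor: push its actions toward higher Q (deterministic policy gradient).
    actor_loss = -q_net(torch.cat([s, actor(s)], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# Example call with a dummy replay-buffer batch:
batch = (torch.randn(8, obs_dim), torch.rand(8, act_dim) * 2 - 1,
         torch.randn(8), torch.randn(8, obs_dim), torch.zeros(8))
update(batch)
```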
How to update Actor
- Train the distribution parameters, e.g. the mean and a shared standard deviation of a Normal distribution (continuous actions) or the logits of a Categorical distribution (discrete actions); sketch below
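A sketch of the two common parameterizations, assuming PyTorch distributions: a state-independent, learnable log-std shared across states for a Normal policy, and per-state logits for a Categorical policy; class and layer names are illustrative.

```python
# Policy heads: Normal with a shared learnable std (continuous) vs. Categorical (discrete).
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
        # One log-std per action dimension, shared across states and trained by gradient.
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        return torch.distributions.Normal(self.mean_net(obs), self.log_std.exp())

class CategoricalPolicy(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.logits_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.logits_net(obs))

# Either way, the actor update is the same: maximize log_prob(action) * advantage.
pi = GaussianPolicy(obs_dim=4, act_dim=2)
dist = pi(torch.randn(8, 4))
action = dist.sample()
log_prob = dist.log_prob(action).sum(-1)   # sum over action dims for the Gaussian
```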
Actor Critic Algorithms