Actor Critic


Simultaneous learning of the policy and the state-value function

The policy (Actor) acts and the Critic (value function) evaluates the result, so the policy is improved against that criterion (how advantageous is an action compared to the policy's average behavior?).
  1. Sample trajectory points $\{s_i, a_i, r_i\}$
  2. Update $V_\phi^{\pi_\theta}(s)$ or $Q_\phi^{\pi_\theta}(s, a)$, e.g. with GAE
  3. $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_i \nabla_\theta \log \pi_\theta (a_i|s_i) A_\phi^{\pi_\theta} (s_i, a_i)$
  4. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$ (see the sketch below)

Properties

  • No need to collect full trajectories before each update; bootstrapped one-step (or n-step) samples are enough
  • Actor-critic is more sample-efficient than REINFORCE

Value-based actor critic

  • Unlike semi on-policy methods like PPO, this approach reuses off-policy data while performing value iteration
  • Instead of estimating advantages with GAE, it learns a Q model, as in algorithms like DDPG or SAC (see the sketch below)
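
A rough sketch of the DDPG-style actor step under this value-based view: the actor is improved by ascending the learned Q model, so no advantage estimator or on-policy rollout is needed. `actor`, `critic` (taking state and action), and `actor_optimizer` are assumed placeholders.

```python
import torch

def q_actor_update(actor, critic, actor_optimizer, states):
    """DDPG-style actor step: maximize Q(s, pi(s)) under the learned Q model."""
    actions = actor(states)                       # deterministic policy pi(s)
    actor_loss = -critic(states, actions).mean()  # ascend the critic's Q estimate

    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
    return actor_loss.item()
```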

How to update Actor

  • Train the parameters of the action distribution, e.g. the mean and a shared standard deviation of a normal distribution (continuous actions) or the logits of a categorical distribution (discrete actions); see the sketch below
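
A small sketch of such an actor head, assuming a Gaussian policy with a state-independent learnable log-std; the class name and layer sizes are illustrative only.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Continuous-action actor: the network outputs the mean, and a shared
    learnable log-std parameter provides the standard deviation."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, act_dim)
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # trained by the policy gradient

    def forward(self, obs):
        mean = self.mean_net(obs)
        std = self.log_std.exp().expand_as(mean)
        return torch.distributions.Normal(mean, std)

# For discrete actions the same idea applies with the logits of a Categorical:
# dist = torch.distributions.Categorical(logits=logit_net(obs))
```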
Actor Critic Algorithms