Learn the policy and the state-value function simultaneously
The policy (actor) acts and the critic (value function) evaluates, so the policy is improved against a concrete criterion (how advantageous is an action compared to the policy's average behavior?)
- Sample trajectory points (transitions) by acting with the current policy
- Update the policy and the value function using an advantage estimator like GAE (see the sketch below)
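A minimal sketch of this loop in PyTorch, assuming an advantage-actor-critic setup with a GAE advantage estimator. Names like `policy_net`, `value_net`, and `rollout` are illustrative assumptions, not from the notes; the final-step bootstrap is simplified to zero.

```python
import torch

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one sampled rollout.

    For simplicity the value after the last step is taken as 0
    (i.e. the rollout is assumed to end in a terminal state).
    """
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * (1 - dones[t]) - values[t]
        gae = delta + gamma * lam * (1 - dones[t]) * gae
        advantages[t] = gae
    return advantages

def actor_critic_update(policy_net, value_net, optimizer, rollout):
    # rollout: sampled trajectory points (states, actions, rewards, done flags)
    states, actions, rewards, dones = rollout
    values = value_net(states).squeeze(-1)                # critic estimates V(s_t)
    advantages = compute_gae(rewards, values.detach(), dones)
    returns = advantages + values.detach()                # regression targets for the critic

    dist = policy_net(states)                             # e.g. Categorical or Normal
    log_probs = dist.log_prob(actions)                    # for multi-dim continuous actions, sum over the action dim
    actor_loss = -(log_probs * advantages).mean()         # policy gradient weighted by advantage
    critic_loss = (values - returns).pow(2).mean()        # fit V toward the bootstrapped returns

    # single optimizer over both networks, just to keep the sketch short
    optimizer.zero_grad()
    (actor_loss + 0.5 * critic_loss).backward()
    optimizer.step()
```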
Properties
- No need to collect full trajectories per update: the critic's bootstrapped value estimate lets you update from sampled partial trajectories or single transitions
- Actor-critic is more sample-efficient than REINFORCE
Value-based actor critic
- Unlike semi on-policy methods such as PPO, these methods run value iteration and can reuse off-policy data
- If a Q model is also learned instead of using GAE, you get value-based actor-critic methods such as DDPG or SAC (see the sketch below)
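A sketch of the actor update in a DDPG-style value-based actor-critic, assuming a learned Q network and an off-policy replay buffer: the actor is improved by ascending the learned Q function rather than a GAE advantage. `actor`, `q_net`, and `replay_buffer` are illustrative names.

```python
import torch

def ddpg_style_actor_update(actor, q_net, actor_optimizer, replay_buffer, batch_size=256):
    # Off-policy data: states can come from an old behavior policy
    states, _, _, _, _ = replay_buffer.sample(batch_size)
    actions = actor(states)                        # deterministic action a = mu_theta(s)
    # Improve the actor by maximizing Q(s, mu_theta(s)), i.e. minimizing its negative
    actor_loss = -q_net(states, actions).mean()
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
```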
How to update Actor
- Train the parameters of the action distribution directly, e.g. the mean and a shared standard deviation of a Normal (continuous actions) or the logits of a Categorical (discrete actions); see the sketch below
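A sketch of the two common actor heads in PyTorch: for continuous actions the trainable parameters are the state-dependent mean plus a shared (state-independent) log standard deviation, and for discrete actions they are the logits of a categorical distribution. Class and attribute names are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal, Categorical

class GaussianActor(nn.Module):
    """Continuous actions: state-dependent mean, shared trainable std."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, act_dim)
        )
        # Shared log standard deviation, trained directly as a free parameter
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        return Normal(self.mean_net(obs), self.log_std.exp())

class CategoricalActor(nn.Module):
    """Discrete actions: the network outputs logits of a categorical distribution."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.logits_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, n_actions)
        )

    def forward(self, obs):
        return Categorical(logits=self.logits_net(obs))
```

Either head returns a `torch.distributions` object, so the actor update can call `dist.log_prob(actions)` the same way regardless of whether the action space is continuous or discrete.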
Actor Critic Algorithms