Deep Deterministic Policy Gradient (DDPG)
Value-based, actor-critic, continuous-action version of DQN (the differentiable actor network makes it possible to maximize Q-values over a continuous action space)
- Rarely used anymore because it is sensitive to hyperparameters
- Soft target update using Polyak averaging
- DDPG uses target networks for both policy and value function
- Continuous actions → Q is differentiable w.r.t. the action → find the greedy policy via gradient ascent: max_a Q(s, a) ≈ Q(s, μ(s))
Off-policy actor-critic
Finding the greedy policy with continuous actions
Approximate argmax_a Q(s, a) with a deterministic policy μ(s) (minimal sketch below)
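A minimal PyTorch sketch of this actor update, assuming small illustrative networks and placeholder dimensions (obs_dim, act_dim): the actor is trained by gradient ascent on Q(s, μ(s)), which stands in for the intractable argmax over continuous actions.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2  # illustrative dimensions

# Deterministic actor mu(s) and critic Q(s, a); architectures are assumptions
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                      nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

states = torch.randn(32, obs_dim)  # stand-in for a replay-buffer batch

# max_a Q(s, a) ~= Q(s, mu(s)): ascend Q by minimizing its negation
actor_loss = -critic(torch.cat([states, actor(states)], dim=-1)).mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
```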
Soft target update
Target policy smoothing makes exploiting errors in the Q-function harder
Polyak averaging: θ_target ← τ·θ + (1 − τ)·θ_target, with τ ≪ 1
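A short sketch of the soft target update via Polyak averaging; tau = 0.005 and the tiny network are assumed placeholders.

```python
import copy
import torch
import torch.nn as nn

def polyak_update(net, target_net, tau=0.005):
    # theta_target <- tau * theta + (1 - tau) * theta_target
    with torch.no_grad():
        for p, p_targ in zip(net.parameters(), target_net.parameters()):
            p_targ.mul_(1.0 - tau)
            p_targ.add_(tau * p)

critic = nn.Linear(10, 1)              # stand-in for the online critic
critic_target = copy.deepcopy(critic)  # target starts as an exact copy
polyak_update(critic, critic_target)   # called after every gradient step
```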
How to overcome DDPG's overestimation weakness caused by the deterministic policy
A deterministic policy can quickly overfit to a noisy target Q-function because it exploits overestimated target values
- Learn two Q-functions and use the minimum as the target (two is enough)
- Smooth the target policy (see the sketch after this list)
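A sketch of how both fixes show up in the target computation (TD3-style): clipped noise on the target policy's action smooths the target, and the minimum of two target critics caps overestimation. Noise scales, clip bounds, and network shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 8, 2, 0.99  # illustrative values

actor_target = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                             nn.Linear(64, act_dim), nn.Tanh())
q1_target = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
q2_target = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

next_states = torch.randn(32, obs_dim)  # stand-ins for a replay-buffer batch
rewards = torch.randn(32, 1)
dones = torch.zeros(32, 1)

with torch.no_grad():
    # Target policy smoothing: clipped Gaussian noise on the target action
    noise = (0.2 * torch.randn(32, act_dim)).clamp(-0.5, 0.5)
    next_actions = (actor_target(next_states) + noise).clamp(-1.0, 1.0)

    # Clipped double Q: the smaller of the two target critics limits overestimation
    sa = torch.cat([next_states, next_actions], dim=-1)
    target_q = torch.min(q1_target(sa), q2_target(sa))
    td_target = rewards + gamma * (1.0 - dones) * target_q
```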
TD3 (Twin Delayed DDPG)
Improves training stability using clipped double Q-learning, delayed policy updates, and target policy smoothing with clipped noise (delayed updates sketched below)
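A sketch of the delayed-update schedule only; policy_delay = 2 and the update_* helpers are assumed placeholders for the critic, actor, and Polyak updates sketched above.

```python
policy_delay = 2  # assumed value: actor and targets update every 2 critic steps

def update_critics(batch): ...  # regress both critics toward td_target
def update_actor(batch): ...    # gradient ascent on Q1(s, mu(s))
def update_targets(): ...       # Polyak-average actor and critic targets

for step in range(10_000):      # each step: sample a batch from the replay buffer
    batch = None                # placeholder batch
    update_critics(batch)
    if step % policy_delay == 0:
        update_actor(batch)
        update_targets()
```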