DDPG

Creator: Seonglae Cho
Created: 2023 Sep 10 8:50
Edited: 2024 May 3 14:56

Deep Deterministic Policy Gradient

A value-based, Actor Critic, continuous-action version of DQN (the differentiable actor network makes it possible to evaluate Q-values over a continuous action space).
  • Rarely used any more because it is sensitive to hyperparameters
  • DDPG uses target networks for both the policy and the value function
  • Continuous actions → Q(s, a) is differentiable with respect to a → find the greedy policy via gradient ascent on Q (see the sketch below)
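
A minimal sketch of the resulting actor update, assuming PyTorch modules `actor` and `critic` and an optimizer `actor_opt` over the actor's parameters only (all names illustrative):

```python
import torch

# Because the action space is continuous, Q(s, a) is differentiable
# with respect to a, so the "greedy" policy is obtained by gradient
# ascent on Q(s, mu(s)) instead of an argmax over discrete actions.
def ddpg_actor_update(actor, critic, actor_opt, states):
    actions = actor(states)                       # a = mu_theta(s)
    actor_loss = -critic(states, actions).mean()  # ascend Q => descend -Q
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()                              # steps actor params only
    return actor_loss.item()
```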
 

Off policy actor-critic

Finding the greedy policy with continuous actions
Approximate the greedy action with a deterministic policy: max_a Q(s, a) ≈ Q(s, μ(s))
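
In symbols, a hedged sketch of the standard formulation (φ denotes critic parameters, D the replay buffer):

```latex
% Greedy action selection approximated by the deterministic actor:
\max_a Q(s, a) \approx Q\bigl(s, \mu_\theta(s)\bigr)

% Deterministic policy gradient used to improve the actor:
\nabla_\theta J \approx \mathbb{E}_{s \sim \mathcal{D}}
\Bigl[\, \nabla_a Q_\phi(s, a)\big|_{a = \mu_\theta(s)} \,\nabla_\theta \mu_\theta(s) \Bigr]
```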
 
 

Soft target update

Target networks for both the actor and the critic are updated slowly via Polyak averaging rather than copied wholesale; this, together with target policy smoothing, makes it harder for the policy to exploit errors in the Q-function (a minimal sketch follows).
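
A minimal sketch of the soft (Polyak) update, assuming two `torch.nn.Module` instances with matching parameter shapes (names illustrative):

```python
import torch

# The target network trails the online network with a small tau
# (e.g. 0.005), which stabilizes the bootstrapped TD target.
@torch.no_grad()
def soft_update(net, target_net, tau=0.005):
    for p, p_targ in zip(net.parameters(), target_net.parameters()):
        p_targ.mul_(1.0 - tau).add_(tau * p)
```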
 
 

How to overcome DDPG's overestimation weakness caused by its deterministic policy

A deterministic policy can quickly overfit to a noisy Q-function, because the bootstrapped target value gets overestimated. Two fixes (combined in the sketch after this list):
  • Learn two Q-functions and use the minimum of the two as the target (two is enough)
  • Smooth the target policy by adding clipped noise to the target action
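
A minimal sketch of how the two fixes combine in the target computation, assuming target networks `actor_targ`, `q1_targ`, `q2_targ` and an action range [-act_limit, act_limit]; all names and defaults are illustrative:

```python
import torch

@torch.no_grad()
def td3_target(actor_targ, q1_targ, q2_targ, r, s2, d,
               gamma=0.99, sigma=0.2, noise_clip=0.5, act_limit=1.0):
    a2 = actor_targ(s2)
    # Target policy smoothing: clipped Gaussian noise on the target action
    noise = (torch.randn_like(a2) * sigma).clamp(-noise_clip, noise_clip)
    a2 = (a2 + noise).clamp(-act_limit, act_limit)
    # Clipped double Q-learning: minimum of the two target critics
    q_targ = torch.min(q1_targ(s2, a2), q2_targ(s2, a2))
    return r + gamma * (1.0 - d) * q_targ
```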
 
 
 

TD3 (Twin Delayed DDPG)

Improves training stability with clipped double Q-learning, delayed policy updates, and target policy smoothing via clipped noise.
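
The one trick not sketched above is the delayed policy update; a hedged skeleton of a training step, with `agent` standing in for update routines defined elsewhere (all names illustrative):

```python
# Critics update every gradient step; the actor and all target
# networks update only every `policy_delay` steps (typically 2).
def td3_train_step(step, batch, agent, policy_delay=2):
    agent.update_critics(batch)        # clipped double-Q critic update
    if step % policy_delay == 0:
        agent.update_actor(batch)      # deterministic policy gradient
        agent.soft_update_targets()    # Polyak-average all target nets
```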
 
 
 
 
Twin Delayed DDPG — Spinning Up documentation
While DDPG can achieve great performance sometimes, it is frequently brittle with respect to hyperparameters and other kinds of tuning. A common failure mode for DDPG is that the learned Q-function begins to dramatically overestimate Q-values, which then leads to the policy breaking, because it exploits the errors in the Q-function. Twin Delayed DDPG (TD3) is an algorithm that addresses this issue by introducing three critical tricks:
 
 
