DDPG

Creator
Seonglae Cho
Created
2023 Sep 10 8:50
Edited
2024 May 3 14:56

Deep Deterministic Policy Gradient

Value-based, Actor-Critic, continuous-action version of DQN (a differentiable actor network makes it possible to evaluate and maximize Q values over a continuous action space)
  • No longer widely used because it is sensitive to hyperparameters
  • DDPG uses target networks for both policy and value function
  • continuous → differentiable → find the greedy policy $\pi_\theta$ via $\max_\theta E_{s \sim B}[Q_\phi(s, \pi_\theta(s))]$
 

Off policy actor-critic

Finding greedy policy with continuous actions
$\pi(s) = \arg\max_a Q_\phi(s, a)$
Approximate $\pi(s)$ with a deterministic policy $\pi_\theta(s)$
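A minimal PyTorch sketch of this deterministic actor update, assuming `actor` and `critic` modules (with `critic(states, actions)` returning Q values) and an `actor_optimizer` already exist; all names are illustrative:

```python
import torch

def ddpg_actor_update(actor, critic, actor_optimizer, states):
    """One gradient-ascent step on E_s[Q_phi(s, pi_theta(s))]."""
    actions = actor(states)                       # pi_theta(s), differentiable in theta
    actor_loss = -critic(states, actions).mean()  # maximize Q  <=>  minimize -Q
    actor_optimizer.zero_grad()
    actor_loss.backward()                         # gradient flows through the critic into the actor
    actor_optimizer.step()
    return actor_loss.item()
```

Because the actor is deterministic and differentiable, the critic's gradient with respect to the action can be backpropagated directly into the policy parameters, which is what replaces the $\arg\max$ of DQN.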
 
 

Soft target update

Target policy smoothing makes it harder for the policy to exploit errors in the Q-function
$\phi^- \leftarrow \rho\phi^- + (1-\rho)\phi$
Polyak averaging with $\rho \approx 0.99$
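A minimal PyTorch sketch of the soft (Polyak) target update, assuming `target_net` and `net` are modules with matching parameters; names and the default `rho` are illustrative:

```python
import torch

@torch.no_grad()
def soft_update(target_net, net, rho=0.99):
    """phi^- <- rho * phi^- + (1 - rho) * phi, applied parameter-wise."""
    for target_param, param in zip(target_net.parameters(), net.parameters()):
        target_param.mul_(rho).add_((1.0 - rho) * param)
```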
 
 

How to overcome the overestimation weakness of DDPG caused by its deterministic policy

A deterministic policy can quickly overfit to a noisy target Q-function because it overestimates the target value (both fixes below are sketched in code after the list)
  • Learning two Q-functions and using the minimum as the target (two are enough)
  • Smoothing the target policy: $a' = \pi_{\theta^-}(s') + \epsilon,\ \epsilon \sim N(0, \sigma)$
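Both fixes meet in the target computation, as in TD3. A minimal PyTorch sketch, assuming target networks `actor_target`, `critic1_target`, `critic2_target` and illustrative hyperparameters (`sigma`, `noise_clip`, `act_limit`):

```python
import torch

@torch.no_grad()
def td3_target(critic1_target, critic2_target, actor_target,
               rewards, next_states, dones,
               gamma=0.99, sigma=0.2, noise_clip=0.5, act_limit=1.0):
    """y = r + gamma * (1 - done) * min_i Q_i^-(s', a'), with a smoothed target action a'."""
    # Target policy smoothing: a' = pi_theta^-(s') + clipped Gaussian noise
    mu = actor_target(next_states)
    noise = (torch.randn_like(mu) * sigma).clamp(-noise_clip, noise_clip)
    next_actions = (mu + noise).clamp(-act_limit, act_limit)

    # Clipped double Q-learning: take the element-wise minimum of the two target critics
    q1 = critic1_target(next_states, next_actions)
    q2 = critic2_target(next_states, next_actions)
    target_q = torch.min(q1, q2)

    # `dones` is assumed to be a float tensor of 0/1 terminal flags
    return rewards + gamma * (1.0 - dones) * target_q
```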
 
 
 

TD3 (Twin Delayed DDPG)

Improves training stability using clipped double Q-learning, delayed policy updates, and target policy smoothing with clipped noise
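A rough sketch of how the delayed policy update fits into one TD3 iteration, reusing the hypothetical helpers from the sketches above (`td3_target`, `ddpg_actor_update`, `soft_update`); `batch` is assumed to hold state/action/reward/next-state/done tensors, and all names and hyperparameters are illustrative:

```python
import torch.nn.functional as F

def td3_update(step, batch, actor, actor_target, actor_opt,
               critic1, critic2, critic1_target, critic2_target, critic_opt,
               policy_delay=2):
    """One TD3 iteration: critics update every step, actor and targets every `policy_delay` steps."""
    # 1. Critic update against the clipped double-Q, smoothed target
    y = td3_target(critic1_target, critic2_target, actor_target,
                   batch.rewards, batch.next_states, batch.dones)
    critic_loss = (F.mse_loss(critic1(batch.states, batch.actions), y) +
                   F.mse_loss(critic2(batch.states, batch.actions), y))
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # 2. Delayed actor and target-network updates
    if step % policy_delay == 0:
        ddpg_actor_update(actor, critic1, actor_opt, batch.states)
        for target, net in [(actor_target, actor),
                            (critic1_target, critic1),
                            (critic2_target, critic2)]:
            soft_update(target, net, rho=0.995)
```

Updating the actor less frequently than the critics gives the Q estimates time to settle before the policy exploits them, which is the "Delayed" part of Twin Delayed DDPG.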
 
 
 
 
 
 

Recommendations