Deep Deterministic Policy Gradient (DDPG)
Value-based, actor-critic, continuous-action version of DQN (a differentiable actor network makes it possible to evaluate Q-values over a continuous action space)
- Not used much anymore because it is sensitive to hyperparameters
- Soft target updates using Polyak averaging
- DDPG uses target networks for both policy and value function
- Continuous actions → Q is differentiable with respect to the action → find the greedy policy via gradient ascent on Q
Off-policy actor-critic
Finding the greedy policy with continuous actions

Approximate the greedy argmax over actions with a deterministic policy
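A minimal PyTorch sketch of this idea, with made-up network sizes and a random batch standing in for replay data: because the action is a differentiable function of the actor's parameters, the greedy policy can be improved by gradient ascent on Q(s, μ(s)), i.e. by minimising −Q.

```python
import torch
import torch.nn as nn

# Assumed toy dimensions: obs_dim = 3, act_dim = 1.
actor = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())
critic = nn.Sequential(nn.Linear(3 + 1, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

states = torch.randn(32, 3)  # stand-in for a replay-buffer batch

# Greedy policy improvement: maximise Q(s, mu(s)) by gradient ascent on the
# actor parameters (minimise -Q). The gradient flows through the critic into
# the actor because the action is produced by a differentiable network.
actor_loss = -critic(torch.cat([states, actor(states)], dim=-1)).mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
```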
Soft target update
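A small sketch of the Polyak (soft) target update, assuming a generic online network and its target copy; τ = 0.005 is an illustrative value.

```python
import copy
import torch
import torch.nn as nn

net = nn.Linear(3, 1)            # stands in for the online actor or critic
target_net = copy.deepcopy(net)  # target network starts as an exact copy
tau = 0.005                      # small mixing coefficient (assumed value)

@torch.no_grad()
def polyak_update(online, target, tau):
    # Soft target update: target <- tau * online + (1 - tau) * target
    for p, p_targ in zip(online.parameters(), target.parameters()):
        p_targ.mul_(1 - tau)
        p_targ.add_(tau * p)

polyak_update(net, target_net, tau)
```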
Target policy smoothing makes it harder for the policy to exploit errors in the Q-function
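A sketch of target policy smoothing as used in TD3: clipped Gaussian noise is added to the target policy's action before it is fed to the target critics, so a narrow spurious peak in the Q-function cannot be exploited. The noise scale, clip range, and action limit below are assumed values in the spirit of common TD3 defaults.

```python
import torch

noise_std, noise_clip, act_limit = 0.2, 0.5, 1.0  # assumed hyperparameters

def smoothed_target_action(target_actor, next_states):
    # Add clipped Gaussian noise to the target action so the critic target is
    # effectively averaged over a small neighbourhood of actions.
    with torch.no_grad():
        a = target_actor(next_states)
        noise = (torch.randn_like(a) * noise_std).clamp(-noise_clip, noise_clip)
        return (a + noise).clamp(-act_limit, act_limit)
```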
How to overcome DDPG's overestimation weakness caused by its deterministic policy
A deterministic policy can quickly overfit to a noisy target Q-function because it overestimates the target value
- Learn two Q-functions and use the minimum of the two as the target (two is enough); see the sketch after this list
- Smooth the target policy (add clipped noise to the target action)
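A sketch of the clipped double-Q target built from the two points above, assuming two target critics, batch tensors shaped (batch, dim), and a smoothed next action such as the one produced in the previous sketch.

```python
import torch

gamma = 0.99  # discount factor (assumed value)

def clipped_double_q_target(target_q1, target_q2, rewards, dones,
                            next_states, next_actions):
    # Take the minimum of the two target critics so an overestimate in either
    # one cannot inflate the bootstrapped target value.
    with torch.no_grad():
        sa = torch.cat([next_states, next_actions], dim=-1)
        q_min = torch.min(target_q1(sa), target_q2(sa))
        return rewards + gamma * (1.0 - dones) * q_min
```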
TD3 (Twin Delayed DDPG)
Improves training stability using clipped double Q-learning, delayed policy updates, and target policy smoothing with clipped noise
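A toy end-to-end sketch of one TD3 training iteration combining all three tricks, including the delayed actor and target updates. Sizes, hyperparameters, and the random tensors standing in for a replay batch are all illustrative; real implementations differ in details such as independent critic initialisation and freezing critic parameters during the actor step.

```python
import copy
import torch
import torch.nn as nn

obs_dim, act_dim = 3, 1                      # illustrative sizes
tau, gamma, policy_delay = 0.005, 0.99, 2    # illustrative hyperparameters

actor = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, act_dim), nn.Tanh())
q1 = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.ReLU(), nn.Linear(32, 1))
q2 = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.ReLU(), nn.Linear(32, 1))
targ_actor, targ_q1, targ_q2 = (copy.deepcopy(m) for m in (actor, q1, q2))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
q_opt = torch.optim.Adam(list(q1.parameters()) + list(q2.parameters()), lr=1e-3)

def polyak(online, target, tau):
    # Soft target update: target <- tau * online + (1 - tau) * target
    with torch.no_grad():
        for p, p_targ in zip(online.parameters(), target.parameters()):
            p_targ.mul_(1 - tau).add_(tau * p)

for step in range(10):  # stand-in for iterating over replay-buffer batches
    s = torch.randn(32, obs_dim)
    a = torch.rand(32, act_dim) * 2 - 1
    r = torch.randn(32, 1)
    s2 = torch.randn(32, obs_dim)
    d = torch.zeros(32, 1)

    # Critic update every step: clipped double-Q target with a smoothed target action.
    with torch.no_grad():
        noise = (torch.randn(32, act_dim) * 0.2).clamp(-0.5, 0.5)
        a2 = (targ_actor(s2) + noise).clamp(-1.0, 1.0)
        sa2 = torch.cat([s2, a2], dim=-1)
        q_target = r + gamma * (1 - d) * torch.min(targ_q1(sa2), targ_q2(sa2))
    sa = torch.cat([s, a], dim=-1)
    q_loss = ((q1(sa) - q_target) ** 2).mean() + ((q2(sa) - q_target) ** 2).mean()
    q_opt.zero_grad()
    q_loss.backward()
    q_opt.step()

    # Delayed updates: the actor and all target networks change less often than
    # the critics, which keeps the target the actor chases more stable.
    # (Real code typically freezes the critic parameters for this actor step.)
    if step % policy_delay == 0:
        actor_loss = -q1(torch.cat([s, actor(s)], dim=-1)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()
        for online, target in ((actor, targ_actor), (q1, targ_q1), (q2, targ_q2)):
            polyak(online, target, tau)
```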

Twin Delayed DDPG — Spinning Up documentation
While DDPG can achieve great performance sometimes, it is frequently brittle with respect to hyperparameters and other kinds of tuning. A common failure mode for DDPG is that the learned Q-function begins to dramatically overestimate Q-values, which then leads to the policy breaking, because it exploits the errors in the Q-function. Twin Delayed DDPG (TD3) is an algorithm that addresses this issue by introducing three critical tricks:
https://spinningup.openai.com/en/latest/algorithms/td3.html

Seonglae Cho