Deep Deterministic Policy Gradient (DDPG)
Value-based, actor-critic, continuous-action version of DQN (the differentiable actor network makes it possible to maximize Q-values over a continuous action space)
- Rarely used anymore because it is sensitive to hyperparameters
- Soft target update using Polyak averaging
- DDPG uses target networks for both policy and value function
- Continuous actions → Q is differentiable w.r.t. the action → find the greedy policy via gradient ascent: max_a Q(s, a) ≈ Q(s, μ(s))
Off-policy actor-critic
Finding the greedy policy with continuous actions
Approximate argmax_a Q(s, a) with a deterministic policy μ(s) (minimal sketch below)
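A minimal PyTorch sketch of this actor update, assuming small illustrative networks and placeholder dimensions (obs_dim, act_dim): the actor is trained by gradient ascent on Q(s, μ(s)), which stands in for the intractable argmax over continuous actions.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2  # illustrative dimensions

# Deterministic actor mu(s) and critic Q(s, a); architectures are assumptions
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                      nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

states = torch.randn(32, obs_dim)  # stand-in for a replay-buffer batch

# max_a Q(s, a) ~= Q(s, mu(s)): ascend Q by minimizing its negation
actor_loss = -critic(torch.cat([states, actor(states)], dim=-1)).mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
```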
Soft target update
Target policy smoothing makes exploiting errors in the Q-function harder
Polyak averaging: θ_target ← τ·θ + (1 − τ)·θ_target, with τ ≪ 1
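A short sketch of the soft target update via Polyak averaging; tau = 0.005 and the tiny network are assumed placeholders.

```python
import copy
import torch
import torch.nn as nn

def polyak_update(net, target_net, tau=0.005):
    # theta_target <- tau * theta + (1 - tau) * theta_target
    with torch.no_grad():
        for p, p_targ in zip(net.parameters(), target_net.parameters()):
            p_targ.mul_(1.0 - tau)
            p_targ.add_(tau * p)

critic = nn.Linear(10, 1)              # stand-in for the online critic
critic_target = copy.deepcopy(critic)  # target starts as an exact copy
polyak_update(critic, critic_target)   # called after every gradient step
```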
How to overcome DDPG's overestimation weakness caused by the deterministic policy
A deterministic policy can quickly overfit to a noisy target Q-function because it exploits overestimated target values
- Learn two Q-functions and use the minimum as the target (two is enough)
- Smooth the target policy (see the sketch after this list)
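A sketch of how both fixes show up in the target computation (TD3-style): clipped noise on the target policy's action smooths the target, and the minimum of two target critics caps overestimation. Noise scales, clip bounds, and network shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 8, 2, 0.99  # illustrative values

actor_target = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                             nn.Linear(64, act_dim), nn.Tanh())
q1_target = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
q2_target = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

next_states = torch.randn(32, obs_dim)  # stand-ins for a replay-buffer batch
rewards = torch.randn(32, 1)
dones = torch.zeros(32, 1)

with torch.no_grad():
    # Target policy smoothing: clipped Gaussian noise on the target action
    noise = (0.2 * torch.randn(32, act_dim)).clamp(-0.5, 0.5)
    next_actions = (actor_target(next_states) + noise).clamp(-1.0, 1.0)

    # Clipped double Q: the smaller of the two target critics limits overestimation
    sa = torch.cat([next_states, next_actions], dim=-1)
    target_q = torch.min(q1_target(sa), q2_target(sa))
    td_target = rewards + gamma * (1.0 - dones) * target_q
```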
TD3 (Twin Delayed DDPG)
Improves training stability using clipped double Q-learning, delayed policy updates, and target policy smoothing with clipped noise (delayed updates sketched below)
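A sketch of the delayed-update schedule only; policy_delay = 2 and the update_* helpers are assumed placeholders for the critic, actor, and Polyak updates sketched above.

```python
policy_delay = 2  # assumed value: actor and targets update every 2 critic steps

def update_critics(batch): ...  # regress both critics toward td_target
def update_actor(batch): ...    # gradient ascent on Q1(s, mu(s))
def update_targets(): ...       # Polyak-average actor and critic targets

for step in range(10_000):      # each step: sample a batch from the replay buffer
    batch = None                # placeholder batch
    update_critics(batch)
    if step % policy_delay == 0:
        update_actor(batch)
        update_targets()
```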