DQN

Creator: Seonglae Cho
Created: 2023 Jun 17 17:18
Edited: 2024 Nov 21 11:36
  • Mainly for discrete action spaces
  • Decorrelates samples in online Q-learning through a replay buffer
  • Stabilizes moving targets through a target network
  • Exploration through epsilon-greedy
 

Epsilon Greedy
when sampling

With probability $\epsilon$, select a random action $a_t$ to explore new actions. Otherwise choose the greedy action $a_t = \arg\max_a Q(s_t, a)$. We anneal epsilon across training, similar to a learning rate schedule.
It is simply there to sample a diverse set of actions.
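A minimal sketch of epsilon-greedy action selection with a linearly annealed epsilon (the decay range and step count here are illustrative assumptions, not values from the source):

```python
import numpy as np

def epsilon_by_step(step, eps_start=1.0, eps_end=0.05, decay_steps=100_000):
    """Linearly anneal epsilon from eps_start to eps_end (assumed schedule)."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def select_action(q_values, step, rng=np.random.default_rng()):
    """Epsilon-greedy: random action w.p. epsilon, otherwise argmax_a Q(s_t, a)."""
    eps = epsilon_by_step(step)
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))  # explore: uniform random action
    return int(np.argmax(q_values))              # exploit: greedy action
```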
 

Replay Buffer
for managing stored transitions

To decorrelate consecutive samples in online Q-learning.
Consecutive samples are strongly correlated with each other, which increases variance during training and can destabilize optimization, so transitions are stored in a buffer and then drawn at random for updates.
to prevent
  • Correlated samples / consecutive samples
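A minimal replay buffer sketch: transitions are stored as they arrive and mini-batches are drawn uniformly at random, which breaks the correlation between consecutive samples (capacity and batch size are illustrative assumptions):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling decorrelates consecutive transitions
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```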
 

Target network $\phi'$

Delayed updates to prevent the moving-target problem.
Compute targets with a target network whose parameters do not change in the inner loop.
$y = r + \lambda\, Q_{\phi'}\big(s', \arg\max_{a'} Q_{\phi'}(s', a')\big)$
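A minimal sketch of this target computation, assuming the Q-networks are represented here by precomputed NumPy arrays of Q-values and the target network is hard-synced every fixed number of steps (the sync interval and discount are illustrative):

```python
import numpy as np

def dqn_target(rewards, dones, q_next_target, discount=0.99):
    """y = r + λ * max_{a'} Q_{φ'}(s', a'), with no bootstrapping past terminal states.
    `q_next_target` holds Q_{φ'}(s', ·) for each transition in the batch."""
    max_q_next = q_next_target.max(axis=1)  # selection and evaluation both use φ'
    return rewards + discount * (1.0 - dones) * max_q_next

def maybe_sync_target(step, online_params, target_params, sync_every=10_000):
    """Hard-copy online parameters φ into the target network φ' every `sync_every`
    steps, so the regression target stays fixed inside the inner loop."""
    if step % sync_every == 0:
        target_params = {k: v.copy() for k, v in online_params.items()}
    return target_params
```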
 
 
However, DQN systematically overestimates Q-values relative to the true values because of the max operator above (maximization bias).
 
 

Double DQN (DDQN)

Use the current network for action selection and the target network for action evaluation (simple but effective)
$y = r + \lambda\, Q_{\phi'}\big(s', \arg\max_{a'} Q_{\phi}(s', a')\big)$
Decorrelate errors in action selection and evaluation
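A minimal sketch of the Double DQN target under the same array-based assumptions as above: the online network φ selects the action, the target network φ' evaluates it:

```python
import numpy as np

def double_dqn_target(rewards, dones, q_next_online, q_next_target, discount=0.99):
    """y = r + λ * Q_{φ'}(s', argmax_{a'} Q_{φ}(s', a'))."""
    best_actions = q_next_online.argmax(axis=1)  # action selection with online net φ
    chosen_q = q_next_target[np.arange(len(best_actions)), best_actions]  # evaluation with target net φ'
    return rewards + discount * (1.0 - dones) * chosen_q
```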
 
 

N-step returns

Like GAE, we use an n-step target for Q-learning.
Less biased when Q-values are inaccurate, but only correct when learning on-policy (the n-step rewards must come from the current policy).
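A minimal sketch of an n-step target, assuming the n rewards after time t and the bootstrap value from the (target) Q-function have already been collected:

```python
def n_step_target(rewards, bootstrap_value, discount=0.99):
    """y_t = r_t + λ r_{t+1} + ... + λ^{n-1} r_{t+n-1} + λ^n * max_a Q(s_{t+n}, a).
    `rewards` = [r_t, ..., r_{t+n-1}]; `bootstrap_value` = max_a Q(s_{t+n}, a),
    or 0 if the episode terminated within the n steps."""
    target = bootstrap_value
    for r in reversed(rewards):  # fold backwards: r + λ * (rest of the return)
        target = r + discount * target
    return target
```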
 
 

Prioritized experience replay (PER)

Transitions are uniformly sampled → new experience is sampled less and less often as the replay buffer grows, so prioritize sampling transitions with high TD-error.
Transitions are normally sampled uniformly because experience replay with a replay buffer is meant to make the training data i.i.d. PER instead samples transitions with a high TD-error $\|Q(s_i, a_i) - y_i\|^2$ more often.
  • The loss shrinks very quickly
  • Only a few high-error transitions get sampled repeatedly, so PER is prone to overfitting
  • Evaluating TD-errors is an expensive overhead, so priorities are only updated when the loss is computed for sampled transitions, not on every Q update
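A minimal proportional-prioritization sketch: sampling probability is proportional to (|TD-error| + ε)^α, importance-sampling weights correct the resulting bias, and priorities are refreshed only for the transitions whose loss was just computed (the α, β, ε values are standard PER hyperparameters used here as assumptions):

```python
import numpy as np

class PrioritizedReplay:
    def __init__(self, capacity=100_000, alpha=0.6, eps=1e-3):
        self.alpha, self.eps = alpha, eps
        self.capacity = capacity
        self.data, self.priorities = [], []

    def push(self, transition):
        # New transitions get the current max priority so they are sampled at least once
        p = max(self.priorities, default=1.0)
        if len(self.data) >= self.capacity:
            self.data.pop(0); self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(p)

    def sample(self, batch_size=32, beta=0.4):
        probs = np.asarray(self.priorities) ** self.alpha
        probs /= probs.sum()
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        # Importance-sampling weights correct the bias from non-uniform sampling
        weights = (len(self.data) * probs[idx]) ** (-beta)
        weights /= weights.max()
        return idx, [self.data[i] for i in idx], weights

    def update_priorities(self, idx, td_errors):
        # Priorities are refreshed only for sampled transitions, when their loss is computed
        for i, err in zip(idx, td_errors):
            self.priorities[i] = abs(err) + self.eps
```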
 
 
 
 

 
