DQN

Creator: Seonglae Cho
Created: 2023 Jun 17 17:18
Edited: 2024 Nov 21 11:36
  • Mainly for discrete action spaces
  • Decorrelates samples in online Q-learning through a replay buffer
  • Stabilizes moving targets through a target network
  • Exploration through epsilon-greedy
 

Epsilon Greedy
when sampling

With probability $\epsilon$, select a random action $a_t$ to explore new actions. Otherwise choose the greedy action $a_t = \arg\max_a Q(s_t, a)$. We anneal epsilon across training, similar to a learning rate schedule.
It is simply there to sample a diverse set of actions.
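A minimal sketch of epsilon-greedy action selection with a linearly annealed epsilon (the decay range and step count here are illustrative assumptions, not values from the source):

```python
import numpy as np

def epsilon_by_step(step, eps_start=1.0, eps_end=0.05, decay_steps=100_000):
    """Linearly anneal epsilon from eps_start to eps_end (assumed schedule)."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def select_action(q_values, step, rng=np.random.default_rng()):
    """Epsilon-greedy: random action w.p. epsilon, otherwise argmax_a Q(s_t, a)."""
    eps = epsilon_by_step(step)
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))  # explore: uniform random action
    return int(np.argmax(q_values))              # exploit: greedy action
```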
 

Replay Buffer
for managing stored transitions

To decorrelate consecutive samples in online Q-learning.
Consecutive samples are strongly correlated with each other, which increases variance during training and can destabilize optimization, so transitions are stored in a buffer and then drawn at random for updates.
to prevent
  • Correlated samples / consecutive samples
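A minimal replay buffer sketch: transitions are stored as they arrive and mini-batches are drawn uniformly at random, which breaks the correlation between consecutive samples (capacity and batch size are illustrative assumptions):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling decorrelates consecutive transitions
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```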
 

Target network $\phi'$

Delayed updates to prevent the moving-target problem.
Compute targets with a target network whose parameters do not change in the inner loop.
$y = r + \lambda\, Q_{\phi'}\big(s', \arg\max_{a'} Q_{\phi'}(s', a')\big)$
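A minimal sketch of this target computation, assuming the Q-networks are represented here by precomputed NumPy arrays of Q-values and the target network is hard-synced every fixed number of steps (the sync interval and discount are illustrative):

```python
import numpy as np

def dqn_target(rewards, dones, q_next_target, discount=0.99):
    """y = r + λ * max_{a'} Q_{φ'}(s', a'), with no bootstrapping past terminal states.
    `q_next_target` holds Q_{φ'}(s', ·) for each transition in the batch."""
    max_q_next = q_next_target.max(axis=1)  # selection and evaluation both use φ'
    return rewards + discount * (1.0 - dones) * max_q_next

def maybe_sync_target(step, online_params, target_params, sync_every=10_000):
    """Hard-copy online parameters φ into the target network φ' every `sync_every`
    steps, so the regression target stays fixed inside the inner loop."""
    if step % sync_every == 0:
        target_params = {k: v.copy() for k, v in online_params.items()}
    return target_params
```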
 
 
However, DQN systematically overestimates Q-values relative to the true values because of the max operator above (maximization bias).
 
 

Double DQN (DDQN)

Use the current network for action selection and the target network for action evaluation (simple but effective)
$y = r + \lambda\, Q_{\phi'}\big(s', \arg\max_{a'} Q_{\phi}(s', a')\big)$
Decorrelate errors in action selection and evaluation
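A minimal sketch of the Double DQN target under the same array-based assumptions as above: the online network φ selects the action, the target network φ' evaluates it:

```python
import numpy as np

def double_dqn_target(rewards, dones, q_next_online, q_next_target, discount=0.99):
    """y = r + λ * Q_{φ'}(s', argmax_{a'} Q_{φ}(s', a'))."""
    best_actions = q_next_online.argmax(axis=1)  # action selection with online net φ
    chosen_q = q_next_target[np.arange(len(best_actions)), best_actions]  # evaluation with target net φ'
    return rewards + discount * (1.0 - dones) * chosen_q
```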
 
 

N-step returns

Like GAE, we use an n-step target for Q-learning.
Less biased when Q-values are inaccurate, but only correct when learning on-policy (the n-step rewards must come from the current policy).
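A minimal sketch of an n-step target, assuming the n rewards after time t and the bootstrap value from the (target) Q-function have already been collected:

```python
def n_step_target(rewards, bootstrap_value, discount=0.99):
    """y_t = r_t + λ r_{t+1} + ... + λ^{n-1} r_{t+n-1} + λ^n * max_a Q(s_{t+n}, a).
    `rewards` = [r_t, ..., r_{t+n-1}]; `bootstrap_value` = max_a Q(s_{t+n}, a),
    or 0 if the episode terminated within the n steps."""
    target = bootstrap_value
    for r in reversed(rewards):  # fold backwards: r + λ * (rest of the return)
        target = r + discount * target
    return target
```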
 
 

Prioritized experience replay (PER)

Transitions are uniformly sampled → new experience is sampled less and less often as the replay buffer grows, so prioritize sampling transitions with high TD-error.
Transitions are normally sampled uniformly because experience replay with a replay buffer is meant to make the training data i.i.d. PER instead samples transitions with a high TD-error $\|Q(s_i, a_i) - y_i\|^2$ more often.
  • The loss shrinks very quickly
  • Only a few high-error transitions get sampled repeatedly, so PER is prone to overfitting
  • Evaluating TD-errors is an expensive overhead, so priorities are only updated when the loss is computed for sampled transitions, not on every Q update
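A minimal proportional-prioritization sketch: sampling probability is proportional to (|TD-error| + ε)^α, importance-sampling weights correct the resulting bias, and priorities are refreshed only for the transitions whose loss was just computed (the α, β, ε values are standard PER hyperparameters used here as assumptions):

```python
import numpy as np

class PrioritizedReplay:
    def __init__(self, capacity=100_000, alpha=0.6, eps=1e-3):
        self.alpha, self.eps = alpha, eps
        self.capacity = capacity
        self.data, self.priorities = [], []

    def push(self, transition):
        # New transitions get the current max priority so they are sampled at least once
        p = max(self.priorities, default=1.0)
        if len(self.data) >= self.capacity:
            self.data.pop(0); self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(p)

    def sample(self, batch_size=32, beta=0.4):
        probs = np.asarray(self.priorities) ** self.alpha
        probs /= probs.sum()
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        # Importance-sampling weights correct the bias from non-uniform sampling
        weights = (len(self.data) * probs[idx]) ** (-beta)
        weights /= weights.max()
        return idx, [self.data[i] for i in idx], weights

    def update_priorities(self, idx, td_errors):
        # Priorities are refreshed only for sampled transitions, when their loss is computed
        for i, err in zip(idx, td_errors):
            self.priorities[i] = abs(err) + self.eps
```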
 
 
 
 

 
