DQN

Creator: Seonglae Cho
Created: 2023 Jun 17 17:18
Edited: 2024 May 31 3:06
  • Mainly for discrete action spaces
  • Decorrelates samples of online Q-learning through a replay buffer
  • Stabilizes moving targets through a target network
  • Exploration through epsilon-greedy
 

Epsilon Greedy
used when sampling actions

With probability $\epsilon$, select a random action to explore new actions; otherwise choose the greedy action $a = \arg\max_a Q(s, a)$. Epsilon is decayed across training, similar to a learning rate schedule.
It is simply a way to sample diverse actions.
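A minimal sketch of epsilon-greedy selection with a linear decay schedule (the schedule, hyperparameters, and function names here are assumptions for illustration):

```python
import random

def epsilon_by_step(step, eps_start=1.0, eps_end=0.05, decay_steps=100_000):
    # Linear decay from eps_start to eps_end (assumed schedule)
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def select_action(q_values, step, num_actions):
    # q_values: Q(s, a) for each action of the current state
    eps = epsilon_by_step(step)
    if random.random() < eps:
        return random.randrange(num_actions)                        # explore
    return max(range(num_actions), key=lambda a: q_values[a])       # exploit (greedy)
```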
 

Replay Buffer
for managing sampled transitions

Addresses correlated samples in online Q-learning.
Consecutive samples are strongly correlated with each other, which increases the variance of the learning process and can destabilize optimization, so transitions are stored in a buffer and then sampled uniformly at random.
to prevent
  • Correlated samples / consecutive samples
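A minimal replay buffer sketch with uniform random sampling (capacity and batch size are arbitrary choices):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # old transitions are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the correlation between consecutive transitions
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```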
 

Target network

Delayed updates to prevent a moving target.
Compute targets with a target network whose parameters are held fixed in the inner loop.
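For reference, the standard DQN target and squared-error loss written with target-network parameters $\theta^-$ (the notation is an assumption, not from the original note):

$$
y = r + \gamma \max_{a'} Q_{\theta^-}(s', a'), \qquad
\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\!\left[\left(y - Q_{\theta}(s, a)\right)^2\right]
$$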
 
 
However, DQN systematically overestimates Q-values because of the max operator in the target above: for noisy estimates, $\mathbb{E}[\max_{a'} Q(s', a')] \ge \max_{a'} \mathbb{E}[Q(s', a')]$, so the target is biased upward.
 
 

Double DQN (DDQN)

Use the current network for action selection and the target network for action evaluation (simple but effective)
Decorrelate errors in action selection and evaluation
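The Double DQN target, with the online network $\theta$ selecting the action and the target network $\theta^-$ evaluating it (notation is an assumption):

$$
y = r + \gamma\, Q_{\theta^-}\!\left(s',\ \arg\max_{a'} Q_{\theta}(s', a')\right)
$$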
 
 

N-step returns

Like GAE, we use an n-step target for Q-learning.
Less biased when Q-values are inaccurate, but only strictly correct when learning on-policy.
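The n-step Q-learning target (notation is an assumption, again using a target network $\theta^-$):

$$
y_t = \sum_{k=0}^{n-1} \gamma^{k}\, r_{t+k} + \gamma^{n} \max_{a'} Q_{\theta^-}(s_{t+n}, a')
$$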
 
 

Prioritized experience replay (PER)

With uniform sampling, new experience is sampled less and less frequently as the replay buffer grows, so PER prioritizes sampling of high TD-error transitions.
Vanilla experience replay samples transitions uniformly so that the training data stays approximately iid; PER instead samples high TD-error transitions more often (see the sketch after this list).
  • Loss shrinks very quickly
  • Only a few high-error transitions are sampled repeatedly; thus, prone to overfitting
  • Evaluating TD-errors is an expensive overhead, so priorities are only updated when the loss is computed, not on every Q update
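A minimal sketch of proportional prioritization without a sum-tree (the α/β hyperparameters, capacity, and list-based storage are assumptions for illustration):

```python
import random

class PrioritizedReplayBuffer:
    def __init__(self, capacity=100_000, alpha=0.6, eps=1e-5):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.priorities = [], []

    def push(self, transition):
        # New transitions get the current max priority so they are seen at least once
        max_p = max(self.priorities, default=1.0)
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(max_p)

    def sample(self, batch_size=32, beta=0.4):
        # Sampling probability proportional to priority^alpha
        scaled = [p ** self.alpha for p in self.priorities]
        total = sum(scaled)
        probs = [p / total for p in scaled]
        idxs = random.choices(range(len(self.data)), weights=probs, k=batch_size)
        # Importance-sampling weights correct the bias from non-uniform sampling
        n = len(self.data)
        weights = [(n * probs[i]) ** (-beta) for i in idxs]
        max_w = max(weights)
        weights = [w / max_w for w in weights]
        return [self.data[i] for i in idxs], idxs, weights

    def update_priorities(self, idxs, td_errors):
        # Called only when the loss (TD-error) is computed for the sampled batch
        for i, err in zip(idxs, td_errors):
            self.priorities[i] = abs(err) + self.eps
```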
 
 
 
 

 

Recommendations