- Mainly for discrete action spaces
- Handles correlated samples in online Q-learning through a replay buffer
- Handles moving targets through a target network
- Explores through epsilon-greedy action selection
Epsilon Greedy when sampling
With probability ε, select a random action to explore new actions; otherwise choose the greedy action a = argmax_a Q(s, a). Epsilon is decayed across training, much like a learning rate schedule.
This is simply for sampling diverse actions.
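A minimal sketch of epsilon-greedy action selection, assuming a PyTorch `q_network` and a linear decay schedule (neither is specified in these notes):

```python
import random
import torch

def epsilon_by_step(step, eps_start=1.0, eps_end=0.05, decay_steps=50_000):
    """Linearly anneal epsilon from eps_start to eps_end (assumed schedule)."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def select_action(q_network, state, step, num_actions):
    eps = epsilon_by_step(step)
    if random.random() < eps:
        return random.randrange(num_actions)       # explore: uniform random action
    with torch.no_grad():
        q_values = q_network(state.unsqueeze(0))   # shape [1, num_actions]
        return int(q_values.argmax(dim=1).item())  # exploit: greedy action
```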
Replay buffer for managing sampled transitions
to handle correlated samples in online Q-learning
Consecutive samples are strongly correlated, which can increase variance in the learning process and destabilize optimization. To address this, transitions are stored in a buffer and minibatches are drawn from it uniformly at random.

This prevents training on correlated, consecutive samples, as in the sketch below.
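A minimal uniform replay buffer sketch (the capacity and transition tuple layout are illustrative assumptions):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between consecutive transitions
        idxs = random.sample(range(len(self.buffer)), batch_size)
        states, actions, rewards, next_states, dones = zip(*(self.buffer[i] for i in idxs))
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```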
Target network
Delayed updates to prevent a moving target.
Targets are computed with a target network whose parameters do not change in the inner loop:
y = r + γ max_a' Q_target(s', a')
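A sketch of this target computation with a periodic hard sync of the target network (the sync interval, function names, and tensor shapes are assumptions):

```python
import torch

def compute_dqn_target(target_net, rewards, next_states, dones, gamma=0.99):
    """y = r + γ · max_a' Q_target(s', a'); no gradient flows through the target."""
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values  # max over actions
        return rewards + gamma * (1.0 - dones) * next_q

def maybe_sync_target(online_net, target_net, step, sync_every=10_000):
    # Delayed update: copy online weights into the frozen target network
    if step % sync_every == 0:
        target_net.load_state_dict(online_net.state_dict())
```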
However, DQN tends to overestimate Q-values relative to the true values because of the max operator above.
Double DQN (DDQN)
Use the current network for action selection and the target network for action evaluation (simple but effective)
This decorrelates the errors in action selection and action evaluation.
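A minimal sketch of the Double DQN target (batch tensor shapes and function names are assumptions):

```python
import torch

def compute_ddqn_target(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        # Action selection with the current (online) network
        best_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # Action evaluation with the target network
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
        return rewards + gamma * (1.0 - dones) * next_q
```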

N-step returns
As with GAE, we can use an n-step target for Q-learning.
This is less biased when Q-values are inaccurate, but it is only correct when learning on-policy.
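A sketch of the n-step target computation (assuming no episode termination inside the n-step window):

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """n-step target: sum of discounted rewards plus a bootstrapped Q-value.

    rewards: the n rewards [r_t, ..., r_{t+n-1}]
    bootstrap_value: max_a' Q_target(s_{t+n}, a')
    """
    target = bootstrap_value
    for r in reversed(rewards):
        target = r + gamma * target  # fold rewards back from the bootstrap state
    return target
```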
Prioritized experience replay (PER)
With uniform sampling, experience replay keeps the training data roughly i.i.d., but new experience is sampled less and less often as the replay buffer grows.
PER instead prioritizes sampling transitions with high TD-error. In practice (a minimal sketch follows this list):
- The loss shrinks very quickly
- Only a few high-error transitions are sampled repeatedly, so the model is prone to overfitting
- Evaluating the TD-error for every transition is expensive, so priorities are only refreshed when the loss is computed, not after every Q update
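A minimal proportional-prioritization sketch (priority ∝ |TD-error|^α); the sum-tree structure and importance-sampling correction from the PER paper are omitted, and the α value is an assumption:

```python
import numpy as np

class PrioritizedReplayBuffer:
    def __init__(self, capacity=100_000, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.data, self.priorities = [], []

    def push(self, transition):
        # New transitions get the current max priority so they are sampled at least once
        max_p = max(self.priorities, default=1.0)
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(max_p)

    def sample(self, batch_size):
        probs = np.asarray(self.priorities) ** self.alpha
        probs /= probs.sum()
        idxs = np.random.choice(len(self.data), batch_size, p=probs)
        return idxs, [self.data[i] for i in idxs]

    def update_priorities(self, idxs, td_errors, eps=1e-5):
        # Refresh priorities only for the sampled transitions, when their loss is computed
        for i, err in zip(idxs, td_errors):
            self.priorities[i] = abs(float(err)) + eps
```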

Seonglae Cho