Replay buffer for storing and sampling transitions
Used to deal with correlated samples in online Q-learning
Consecutive samples are strongly correlated with each other, which increases the variance of training and can destabilize optimization, so transitions are stored in a buffer and then drawn from it uniformly at random (a minimal sketch follows below)
To prevent:
- Correlated / consecutive samples
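A minimal sketch of such a uniform replay buffer, assuming a simple list-backed circular buffer; the names ReplayBuffer, push, and sample are illustrative, not from a specific library.

```python
import random

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self.buffer = []
        self.pos = 0

    def push(self, state, action, reward, next_state, done):
        # Store one transition from online interaction,
        # overwriting the oldest one once the buffer is full.
        transition = (state, action, reward, next_state, done)
        if len(self.buffer) < self.capacity:
            self.buffer.append(transition)
        else:
            self.buffer[self.pos] = transition
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between
        # consecutive transitions before they are used in a Q update.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```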
Prioritized experience replay (PER)
With a plain replay buffer, transitions are sampled uniformly, which is what makes the training data roughly i.i.d.; however, as the buffer grows, new experience gets sampled less and less often
PER instead samples transitions with high TD error more frequently (a sampling sketch follows; practical issues are listed after it)
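A minimal sketch of priority-proportional sampling; the priority form p_i = (|TD error_i| + eps)^alpha and the alpha/eps values are illustrative hyperparameters, and the importance-sampling correction from the PER paper is omitted for brevity.

```python
import numpy as np

def sample_prioritized(td_errors, batch_size, alpha=0.6, eps=1e-6):
    # Priority p_i = (|TD error_i| + eps)^alpha: eps keeps every transition
    # sampleable, alpha interpolates between uniform (alpha=0) and greedy.
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    # Indices sampled with replacement in proportion to priority,
    # so high-TD-error transitions are replayed more often.
    return np.random.choice(len(td_errors), size=batch_size, p=probs)

# Usage: pick a minibatch of indices into the replay buffer.
td_errors = np.array([0.1, 2.0, 0.5, 0.05, 1.2])
batch_idx = sample_prioritized(td_errors, batch_size=3)
```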
Practical issues:
- The training loss shrinks very quickly
- Only a few transitions with high TD error get sampled repeatedly, so training is prone to overfitting on them
- Evaluating the TD error for every transition is an expensive overhead, so priorities are only updated when the loss is computed for a sampled minibatch, not after every Q update (see the sketch below)
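A minimal sketch of that lazy update, assuming a hypothetical prioritized buffer with sample/update_priorities methods and a compute_td_errors helper (none of these names come from a specific library): the TD errors are already computed for the minibatch loss, so only those transitions get refreshed priorities.

```python
import numpy as np

def train_step(buffer, q_update, compute_td_errors, batch_size=32):
    # Draw a minibatch according to the stored (possibly stale) priorities.
    batch, batch_idx = buffer.sample(batch_size)
    # The TD errors are needed for this minibatch's loss anyway...
    td_errors = compute_td_errors(batch)
    q_update(batch, td_errors)
    # ...so only the sampled transitions get their priorities refreshed;
    # re-evaluating the TD error of the whole buffer after every Q update
    # would be prohibitively expensive.
    buffer.update_priorities(batch_idx, np.abs(td_errors))
```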