Overestimation in Q-learning on OOD Actions
The Q-function is unreliable on out-of-distribution (OOD) actions because of the distributional shift between the behavior policy's data and the learned policy's actions.
This becomes a major problem in offline learning, where data is limited.
- The Q-function is unreliable on out-of-distribution (OOD) actions.
- The learned policy will seek out actions where the Q-function is over-optimistic. (Maximization Bias)
- After values propagate, Q-values become substantially overestimated. (The errors spread to other states through the Bellman update.)
Regularization, pessimism, or ensembles help address the overestimation issue.
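A minimal sketch of maximization bias and one ensemble-based fix, under a toy assumption: all true Q-values are zero and the estimates carry zero-mean noise. Taking the max over a single noisy estimate is biased upward, while taking the elementwise min over two independent estimates before the max (the Clipped Double Q trick used in TD3) is a pessimistic target that shrinks the bias. The specific numbers (10 actions, unit noise) are illustrative choices, not from the source.

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions, n_trials = 10, 10_000
true_q = np.zeros(n_actions)  # true value of every action is 0
single_max, ensemble_min_max = [], []

for _ in range(n_trials):
    # Two independent noisy Q estimates (an ensemble of size 2).
    q1 = true_q + rng.normal(0.0, 1.0, size=n_actions)
    q2 = true_q + rng.normal(0.0, 1.0, size=n_actions)
    # Naive target: max over one noisy estimate -> systematically > 0.
    single_max.append(q1.max())
    # Pessimistic target: min over the ensemble, then max.
    ensemble_min_max.append(np.minimum(q1, q2).max())

print(f"mean max(Q):          {np.mean(single_max):.3f}")
print(f"mean max(min(Q1,Q2)): {np.mean(ensemble_min_max):.3f}")
```

The first mean is clearly positive even though every true Q-value is zero, which is exactly the maximization bias that Bellman backups then propagate; the pessimistic target sits much closer to zero.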
Methods for addressing Q overestimation