Overestimation in Q-learning on OOD Actions
The Q-function is unreliable on out-of-distribution (OOD) actions because of the distributional shift between the behavior policy's data and the learned policy's actions.
This becomes a major problem in offline learning, where data is limited.
- The Q-function is unreliable on out-of-distribution (OOD) actions.
- The learned policy will seek out actions where the Q-function is over-optimistic. (Maximization Bias)
- After values propagate, Q-values become substantially overestimated. (The errors spread to other states through the Bellman update.)
Regularization, pessimism, or ensembles help address the overestimation issue.
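A minimal sketch of maximization bias and one ensemble-based fix, under a toy assumption: all true Q-values are zero and the estimates carry zero-mean noise. Taking the max over a single noisy estimate is biased upward, while taking the elementwise min over two independent estimates before the max (the Clipped Double Q trick used in TD3) is a pessimistic target that shrinks the bias. The specific numbers (10 actions, unit noise) are illustrative choices, not from the source.

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions, n_trials = 10, 10_000
true_q = np.zeros(n_actions)  # true value of every action is 0
single_max, ensemble_min_max = [], []

for _ in range(n_trials):
    # Two independent noisy Q estimates (an ensemble of size 2).
    q1 = true_q + rng.normal(0.0, 1.0, size=n_actions)
    q2 = true_q + rng.normal(0.0, 1.0, size=n_actions)
    # Naive target: max over one noisy estimate -> systematically > 0.
    single_max.append(q1.max())
    # Pessimistic target: min over the ensemble, then max.
    ensemble_min_max.append(np.minimum(q1, q2).max())

print(f"mean max(Q):          {np.mean(single_max):.3f}")
print(f"mean max(min(Q1,Q2)): {np.mean(ensemble_min_max):.3f}")
```

The first mean is clearly positive even though every true Q-value is zero, which is exactly the maximization bias that Bellman backups then propagate; the pessimistic target sits much closer to zero.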
Methods for addressing Q overestimation