Conservative Q-Learning
CQL prevents OOD actions from ever having high values
by regularizing the value function to assign low values to out-of-distribution (OOD) actions
Conservative Approaches
- Train so that OOD (out-of-distribution) actions never have high values
- Avoid evaluating actions that do not appear in the dataset
- Often too conservative, so these methods work best when the data quality is poor (pessimism is safe there)
Need to tune the tradeoff factor α
Can show that for large enough α, the learned Q lower-bounds the true one: Q̂^π(s,a) ≤ Q^π(s,a) for all (s,a) (but it might be too pessimistic)
Fix: push Q down over all actions, but push it back up by the same amount on dataset actions, so the values are only fit within the data distribution
No longer guaranteed that Q̂^π(s,a) ≤ Q^π(s,a) for all (s,a)
But, guaranteed that E_{a~π(a|s)}[Q̂^π(s,a)] ≤ E_{a~π(a|s)}[Q^π(s,a)] for all s, i.e., the value estimate is still a lower bound
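As a toy illustration (my own sketch, not from the source): doing gradient descent on just the conservative penalty α·(E_{a~μ}[Q(s,a)] − E_{a~D}[Q(s,a)]), with a uniform sampling distribution μ and the Bellman term ignored, pushes OOD action values down while pushing the dataset action's value back up:

```python
import numpy as np

# Toy example: one state, 4 actions; only action 0 appears in the dataset.
# Penalty: alpha * (mean_a Q[a] - Q[dataset_action]), with mu uniform.
Q = np.zeros(4)
alpha, lr = 1.0, 0.1
dataset_action = 0

for _ in range(100):
    grad = np.full(4, alpha / 4.0)   # d/dQ of alpha * mean(Q)
    grad[dataset_action] -= alpha    # d/dQ of -alpha * Q[dataset_action]
    Q -= lr * grad

# After training, OOD actions have negative values
# while the in-distribution action's value has been pushed up.
```

Without the Bellman error term the values drift indefinitely; in the full objective the Bellman term anchors them, so this only shows the direction each term pulls in.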
Then how do we find μ and compute this term? Adding an entropy regularizer R(μ) = H(μ(·|s)) gives a closed-form solution μ(a|s) ∝ exp(Q(s,a)), so the push-down term becomes a log-sum-exp over actions
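The closed form can be checked numerically: with μ = softmax(Q), the regularized objective E_μ[Q] + H(μ) equals log Σ_a exp(Q(s,a)), and any other μ (e.g. uniform) scores lower. A minimal numpy check:

```python
import numpy as np

Q = np.array([1.0, 2.0, 0.5])  # Q-values for one state, three actions

# Closed-form maximizer of E_mu[Q] + H(mu): the softmax of Q.
mu = np.exp(Q) / np.exp(Q).sum()
objective = (mu * Q).sum() - (mu * np.log(mu)).sum()  # E_mu[Q] + H(mu)
lse = np.log(np.exp(Q).sum())                         # log-sum-exp of Q

# A suboptimal choice, e.g. uniform mu, gives a strictly smaller objective.
uniform = np.full(3, 1.0 / 3.0)
uniform_obj = (uniform * Q).sum() - (uniform * np.log(uniform)).sum()
```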
The final form of the Q update:
min_Q  α · E_{s~D}[ log Σ_a exp(Q(s,a)) − E_{a~D}[Q(s,a)] ]  +  (1/2) · E_{(s,a,s')~D}[ (Q(s,a) − B̂^π Q̄(s,a))² ]
This is just SAC plus two extra terms (easy to implement)
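A minimal sketch of this loss for the discrete-action, tabular case (my own illustration; `cql_loss` and its arguments are hypothetical names, the Bellman targets are assumed precomputed, and a real implementation would use SAC's critic loss with function approximation):

```python
import numpy as np

def cql_loss(Q, states, actions, targets, alpha=1.0):
    """CQL(H)-style loss: Bellman error plus the two conservative terms.

    Q:        array [n_states, n_actions] of Q-values
    states:   dataset state indices      [batch]
    actions:  dataset action indices     [batch]
    targets:  Bellman backup targets     [batch]
    """
    q_sa = Q[states, actions]
    bellman = 0.5 * np.mean((q_sa - targets) ** 2)
    # Term 1: push down a soft maximum over all actions (log-sum-exp).
    lse = np.log(np.exp(Q[states]).sum(axis=1))
    # Term 2: push dataset actions back up.
    return alpha * np.mean(lse - q_sa) + bellman

# Tiny usage example: two states, two actions.
Q = np.array([[1.0, 0.0],
              [0.5, 2.0]])
states = np.array([0, 1])
actions = np.array([0, 1])
targets = np.array([1.0, 2.0])  # Q already matches the targets here
loss = cql_loss(Q, states, actions, targets, alpha=1.0)
```

With α = 0 this reduces to the plain Bellman error; the CQL terms are always nonnegative because log-sum-exp upper-bounds any single Q(s,a).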