CQL

Creator
Seonglae Cho
Created
2024 Apr 17 2:19
Edited
2024 Jun 18 15:25
Refs
DQN

Conservative Q-Learning

CQL ensures that OOD actions never receive high values

Regularizing the value function to assign low values to OOD actions. CQL is an offline-RL version of Soft actor-critic (SAC), replacing the standard Bellman-error Q update with a conservative objective.

Conservative Approaches

  • Train so that OOD (out-of-distribution) actions never have high values
  • Avoid evaluating actions that are not in the dataset
  • Often too conservative, so it works well when the data quality is poor
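A toy sketch of the idea in a tabular, discrete-action setting (all names hypothetical; actual CQL enforces this through a training objective rather than a post-hoc subtraction):

```python
import numpy as np

# Hypothetical Q-table: push down Q-values of actions the dataset never takes.
n_states, n_actions = 4, 3
rng = np.random.default_rng(0)
Q = rng.normal(size=(n_states, n_actions))

# Boolean mask of (state, action) pairs present in the offline dataset.
in_data = np.zeros((n_states, n_actions), dtype=bool)
in_data[np.arange(n_states), rng.integers(0, n_actions, n_states)] = True

alpha = 1.0
Q_conservative = Q - alpha * (~in_data)  # penalize OOD actions only

# In-dataset values are untouched; OOD values are strictly lowered.
assert np.allclose(Q_conservative[in_data], Q[in_data])
assert np.all(Q_conservative[~in_data] < Q[~in_data])
```

A greedy policy over `Q_conservative` then never prefers an OOD action unless its original value exceeds the in-data value by more than `alpha`.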

Need to tune the tradeoff factor $\alpha$ in the conservative objective:

$$\hat{Q} \leftarrow \arg\min_Q\; \alpha\,\mathbb{E}_{s\sim\mathcal{D},\,a\sim\mu(a|s)}\big[Q(s,a)\big] + \frac{1}{2}\,\mathbb{E}_{(s,a,s')\sim\mathcal{D}}\Big[\big(Q(s,a)-\mathcal{B}^{\pi}\hat{Q}(s,a)\big)^{2}\Big]$$

One can show that $\hat{Q}^{\pi}(s,a) \le Q^{\pi}(s,a)$ for all $(s,a)$ for large enough $\alpha$ (it might be too pessimistic). To tighten this, CQL also maximizes Q-values under the data distribution:

$$\hat{Q} \leftarrow \arg\min_Q\; \alpha\,\Big(\mathbb{E}_{s\sim\mathcal{D},\,a\sim\mu(a|s)}\big[Q(s,a)\big] - \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[Q(s,a)\big]\Big) + \frac{1}{2}\,\mathbb{E}_{(s,a,s')\sim\mathcal{D}}\Big[\big(Q(s,a)-\mathcal{B}^{\pi}\hat{Q}(s,a)\big)^{2}\Big]$$
 
$\hat{\pi}_{\beta}$ indicates the behavior policy of the offline dataset, and $\mu$ is the policy used to query OOD actions. The penalty pushes Q-values down over all actions and pushes them back up by the same amount within the data distribution, so the values are calibrated only inside the dataset.
It is no longer guaranteed that $\hat{Q}^{\pi}(s,a) \le Q^{\pi}(s,a)$ for all $(s,a)$. But it is guaranteed that $\mathbb{E}_{a\sim\pi(a|s)}\big[\hat{Q}^{\pi}(s,a)\big] \le V^{\pi}(s)$ for all $s \in \mathcal{D}$.
Then how do we find $\mu$ and compute this term? Adding an entropy regularizer $\mathcal{R}(\mu)=\mathcal{H}(\mu)$ yields a closed-form solution $\mu(a|s)\propto\exp\!\big(Q(s,a)\big)$, which turns the penalty into a log-sum-exp (the CQL(H) variant):

$$\min_Q\; \alpha\,\mathbb{E}_{s\sim\mathcal{D}}\Big[\log\sum_{a}\exp\!\big(Q(s,a)\big) - \mathbb{E}_{a\sim\hat{\pi}_{\beta}(a|s)}\big[Q(s,a)\big]\Big] + \frac{1}{2}\,\mathbb{E}_{(s,a,s')\sim\mathcal{D}}\Big[\big(Q(s,a)-\mathcal{B}^{\pi}\hat{Q}(s,a)\big)^{2}\Big]$$
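For discrete actions the log-sum-exp penalty can be computed exactly; a minimal NumPy sketch of the penalty term alone (function name and shapes are my own, not from the paper):

```python
import numpy as np

def cql_h_penalty(q_values: np.ndarray, data_actions: np.ndarray) -> float:
    """CQL(H) penalty for a batch of discrete-action Q-values (a sketch).

    q_values: (batch, n_actions) current Q estimates
    data_actions: (batch,) actions actually taken in the offline dataset
    Returns E_s[logsumexp_a Q(s,a) - Q(s, a_data)].
    """
    m = q_values.max(axis=1, keepdims=True)  # subtract max for stability
    lse = m.squeeze(1) + np.log(np.exp(q_values - m).sum(axis=1))
    q_data = q_values[np.arange(len(q_values)), data_actions]
    return float((lse - q_data).mean())

rng = np.random.default_rng(0)
q = rng.normal(size=(32, 4))
a = rng.integers(0, 4, size=32)
penalty = cql_h_penalty(q, a)
assert penalty > 0.0  # logsumexp >= max_a Q >= Q(s, a_data)
```

The full loss adds $\alpha$ times this penalty to the usual Bellman error; the penalty is always non-negative because $\log\sum_a e^{Q(s,a)} \ge \max_a Q(s,a) \ge Q(s,a_{\text{data}})$.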

The final form of the Q update

SAC + two terms (easy to implement)

A few hacks are required (BC pre-training, estimating the log-sum-exp)
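For continuous actions the log-sum-exp has no closed form, hence the estimation hack. A sketch of a Monte-Carlo importance-sampling estimate under a uniform proposal (the paper mixes uniform and policy samples; all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def q_fn(actions: np.ndarray) -> np.ndarray:
    return -actions ** 2  # stand-in for Q(s, .) at a fixed state

# Estimate log integral of exp(Q(s,a)) da over the action box [-1, 1]
# via importance sampling: E_unif[exp(Q(a)) / p(a)] with p(a) = 1/(high-low).
low, high, n = -1.0, 1.0, 10_000
a = rng.uniform(low, high, size=n)
log_unif = -np.log(high - low)                 # log density of the proposal
logs = q_fn(a) - log_unif
m = logs.max()                                 # stabilized log-mean-exp
log_integral = m + np.log(np.exp(logs - m).mean())

# Compare against a fine-grid numerical integral of exp(Q).
grid = np.linspace(low, high, 100_001)
ref = np.log(np.mean(np.exp(q_fn(grid))) * (high - low))
assert abs(log_integral - ref) < 0.05
```

The same estimator with policy samples instead of uniform ones reduces variance where the policy concentrates, which is why the two proposals are mixed in practice.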
