Implicit Q Learning

Creator
Seonglae Cho
Created
2024 May 1 2:16
Edited
2024 Jun 16 11:53
Refs
CQL

IQL

SARSA-style, but effectively trains on only the good actions via the $L_2^\tau$ expectile loss
$$L_V(\psi) = E_{(s,a) \sim D}\left[\, L_2^\tau\!\left(Q_{\phi'}(s,a) - V_\psi(s)\right)\right]$$
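Here $L_2^\tau$ is the expectile (asymmetric squared) loss used in the IQL paper:

$$L_2^\tau(u) = \left|\tau - \mathbb{1}(u < 0)\right| u^2, \qquad \tau \in (0, 1)$$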

Expectile regression

With $\tau > 0.5$, the prediction tends to track the higher targets (an upper expectile of the target distribution)
  • Prediction larger than target → small loss → the prediction stays large
  • Prediction smaller than target → larger loss → the prediction becomes larger
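A minimal PyTorch sketch of this value update, assuming it is applied to a batch of dataset $(s,a)$ pairs; the tensor names and the default $\tau = 0.7$ are illustrative assumptions, not fixed by this note:

```python
import torch

def expectile_loss(diff: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
    # L_2^tau(u) = |tau - 1(u < 0)| * u^2: with tau > 0.5, positive errors
    # (target above prediction) are penalized more, pushing V toward an upper expectile.
    weight = torch.abs(tau - (diff < 0).float())
    return weight * diff.pow(2)

def value_loss(q_target_values: torch.Tensor, v_values: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
    # diff = Q_{phi'}(s, a) - V_psi(s), computed only on (s, a) pairs from the dataset
    diff = q_target_values.detach() - v_values
    return expectile_loss(diff, tau).mean()
```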
 
 

Properties

  • Avoids training on any OOD actions!
  • Policy (still) only trained on actions in data
 

Implementation

  • Two hyperparameters (the expectile $\tau$ and the AWR temperature $\beta$), whereas CQL has only one, so IQL is harder to tune
  • Once converged, extract $\pi$ using AWR (a sketch follows this list)
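A minimal sketch of the AWR extraction step: weighted behavioral cloning on dataset actions, so the policy never queries out-of-distribution actions. The `q_target`, `v_net`, and `policy` modules, the `policy.log_prob` interface, and the $\beta$ and clipping values are assumptions for illustration:

```python
import torch

def awr_policy_loss(policy, q_target, v_net, states, actions, beta: float = 3.0) -> torch.Tensor:
    # Advantage-weighted regression: imitate dataset actions, weighted by exp(beta * advantage).
    with torch.no_grad():
        adv = q_target(states, actions) - v_net(states)          # A(s, a) = Q(s, a) - V(s)
        weights = torch.clamp(torch.exp(beta * adv), max=100.0)  # clip for numerical stability
    log_prob = policy.log_prob(states, actions)                  # assumed policy interface
    return -(weights * log_prob).mean()
```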

Recommendations