IQL
SARSA-style backup (bootstraps only from actions in the data), but biased toward the good actions via an asymmetric loss
Expectile regression
Prediction is pulled toward the higher targets
- Prediction is larger than target → small loss → prediction stays large
- Prediction is smaller than target → large loss → prediction is pushed up
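
A minimal sketch of the asymmetry described above, in plain Python. The names `expectile_loss` and `fit_expectile`, the toy targets, and the values of `tau`, `lr` and `steps` are illustrative assumptions, not taken from these notes.

```python
def expectile_loss(prediction, target, tau=0.7):
    """Asymmetric squared error: weight tau if prediction < target, else 1 - tau."""
    diff = target - prediction
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff ** 2


def fit_expectile(targets, tau=0.7, lr=0.1, steps=2000):
    """Fit one scalar prediction to a list of targets by gradient descent."""
    prediction = sum(targets) / len(targets)  # start at the mean
    for _ in range(steps):
        grad = 0.0
        for t in targets:
            diff = t - prediction
            weight = tau if diff > 0 else 1.0 - tau
            grad += -2.0 * weight * diff  # derivative of weight * diff**2
        prediction -= lr * grad / len(targets)
    return prediction


targets = [0.0, 1.0, 2.0, 3.0, 4.0]
mean_fit = fit_expectile(targets, tau=0.5)  # symmetric loss: recovers the mean
high_fit = fit_expectile(targets, tau=0.9)  # asymmetric loss: pulled toward high targets
```

With tau = 0.5 the loss is symmetric and the fit is the mean (2.0); with tau = 0.9 the fit moves up to roughly 3.23, between the two largest targets.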
Properties
- Avoids training on any OOD actions: Q is only ever evaluated at actions from the data!
- Policy (still) only trained on actions in data
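
To make the first property concrete, here is a hedged tabular sketch of the two critic losses for a single dataset transition. The function name `iql_losses`, the dict-based tables, and the values of `tau` and `gamma` are assumptions for illustration; target networks and function approximation are left out.

```python
def iql_losses(q, v, transition, tau=0.7, gamma=0.99):
    """Return (value_loss, q_loss) for one transition (s, a, r, s_next, done)."""
    s, a, r, s_next, done = transition
    # V(s) is regressed toward Q(s, a) at the dataset action with the expectile loss.
    diff = q[(s, a)] - v[s]
    weight = tau if diff > 0 else 1.0 - tau
    value_loss = weight * diff ** 2
    # Q(s, a) bootstraps from V(s_next): no max over actions, no policy sample.
    target = r + (0.0 if done else gamma * v[s_next])
    q_loss = (target - q[(s, a)]) ** 2
    return value_loss, q_loss


# The Q table only holds the (state, action) pair that appears in the data.
q = {("s0", "left"): 1.0}
v = {"s0": 0.5, "s1": 2.0}
value_loss, q_loss = iql_losses(q, v, ("s0", "left", 1.0, "s1", False))
```

Neither loss looks up Q at an action outside the dataset, which is why the one-entry table above is enough.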
Implementation
- Two hyperparameters (expectile τ and AWR temperature β) versus one for CQL, so harder to tune
- Once Q and V have converged, extract the policy using AWR (advantage-weighted regression)
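
A rough sketch of the AWR extraction step: behavior cloning on dataset actions, weighted by the exponentiated advantage. The names `awr_weight` and `awr_policy_loss`, the inverse-temperature convention exp(beta * A), and the values `beta=3.0` and `max_weight=100.0` are assumptions for illustration.

```python
import math


def awr_weight(q_sa, v_s, beta=3.0, max_weight=100.0):
    """Weight exp(beta * advantage), clipped at max_weight for stability."""
    advantage = q_sa - v_s
    # Clip the exponent rather than the result so math.exp cannot overflow.
    return math.exp(min(beta * advantage, math.log(max_weight)))


def awr_policy_loss(log_probs, q_values, v_values, beta=3.0):
    """Weighted negative log-likelihood over dataset (s, a) pairs.

    log_probs[i] is log pi(a_i | s_i) for the i-th dataset pair.
    """
    weights = [awr_weight(q, v, beta) for q, v in zip(q_values, v_values)]
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)


# Two dataset actions with equal likelihood under the current policy:
# the one with positive advantage dominates the loss.
loss = awr_policy_loss(
    log_probs=[math.log(0.5), math.log(0.5)],
    q_values=[2.0, 0.0],
    v_values=[1.0, 1.0],
)
```

Because the weights multiply log-probabilities of dataset actions only, the policy is still never trained on actions outside the data.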