TD3+BC

Creator
Seonglae Cho
Created
2024 Apr 27 14:36
Edited
2024 May 31 3:12
Refs

Policy constraint approaches

Constrain the learned policy so it does not deviate far from the behavior policy that collected the data.

TD3+BC

TD3 with a Behavior Cloning term.
  • Problem: the Q-function is unreliable on out-of-distribution (OOD) actions
  • Solution: keep the policy close to the known data by fitting to the dataset
This is done by adding a Behavior Cloning penalty to the policy update, so the policy also minimizes the squared difference between its actions and the dataset actions.
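The combined objective can be sketched as follows. This is a minimal numpy illustration of the TD3+BC actor loss shape (maximize a normalized Q term while penalizing distance from dataset actions); the function name, batch shapes, and the default alpha are illustrative assumptions, not the exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def td3_bc_actor_loss(q_values, policy_actions, dataset_actions, alpha=2.5):
    """Sketch of the TD3+BC actor objective: maximize lambda*Q while
    staying close to dataset actions via a behavior-cloning MSE term."""
    # lambda normalizes the Q term by the average Q magnitude,
    # balancing RL and BC (assumed scaling; alpha=2.5 is illustrative)
    lam = alpha / (np.abs(q_values).mean() + 1e-8)
    bc_penalty = np.mean((policy_actions - dataset_actions) ** 2)
    # Loss to minimize: -lambda * Q + BC penalty
    return -lam * q_values.mean() + bc_penalty

# Toy batch: 32 states, 4-dimensional actions
q = rng.normal(size=32)
pi_a = rng.normal(size=(32, 4))      # actions the policy proposes
data_a = rng.normal(size=(32, 4))    # actions from the dataset
loss_far = td3_bc_actor_loss(q, pi_a, data_a)
loss_close = td3_bc_actor_loss(q, data_a, data_a)
# matching the dataset actions removes the BC penalty, lowering the loss
```

The BC term is zero when the policy reproduces the dataset actions, so deviating from the data is only worthwhile when the Q term rewards it enough.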
 

Problems of policy constraint approaches

  • Too pessimistic
  • Not pessimistic enough
Policy constraint approaches can be too pessimistic, for example when the behavior policy is random: staying close to it throws away what the Q-function has learned. At the same time, they can be not pessimistic enough: even a small deviation of the learned policy from the data can incur overestimation.

Solution options

  • Train so that OOD actions never have high values (
    CQL
    )
  • Avoid evaluating actions not in dataset
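The first option can be sketched with the conservative term from CQL: push Q-values down on actions the policy might take (a soft maximum over actions) and up on the dataset action. This is a minimal numpy sketch assuming a small discrete action set for illustration; the function name and toy Q-values are assumptions.

```python
import numpy as np

def cql_penalty(q_all_actions, q_dataset_action):
    """Sketch of CQL's conservative term: log-sum-exp over candidate
    actions (soft maximum) minus the Q-value of the dataset action."""
    soft_max = np.log(np.sum(np.exp(q_all_actions), axis=1))
    return float(np.mean(soft_max - q_dataset_action))

# Hypothetical Q-values over 3 discrete actions for one state;
# the dataset action is index 0 in both cases.
q_ood_high = np.array([[0.0, 10.0, 0.0]])   # an OOD action is overvalued
q_data_high = np.array([[10.0, 0.0, 0.0]])  # the dataset action is highest

penalty_ood = cql_penalty(q_ood_high, q_ood_high[:, 0])
penalty_data = cql_penalty(q_data_high, q_data_high[:, 0])
# the penalty is large exactly when an OOD action has a high Q-value
```

Minimizing this penalty alongside the usual Bellman loss drives down Q on OOD actions, so they never dominate the policy improvement step.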
 
 
 
 
 
