TD3+BC

Creator: Seonglae Cho
Created: 2024 Apr 27 14:36
Edited: 2024 May 31 3:12

Policy constraint approaches

These methods constrain the learned policy to limit how far it deviates from the behavior policy.

TD3+BC

The latter term of the objective is Behavior Cloning: the squared difference (π(s) - a)² is subtracted in the policy update, so the policy also minimizes its deviation from the dataset actions.

$$\pi = \arg\max_\pi \, \mathbb{E}_{(s,a)\sim D}\big[\lambda Q(s, \pi(s)) - (\pi(s) - a)^2\big]$$

  • Problem: the Q-function is unreliable on OOD actions
  • Solution: keep π close to the unknown π_β by fitting π to the data
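A minimal sketch of this actor update in PyTorch, assuming `actor` and `critic` networks and an offline batch of `state`/`action` tensors are already defined. The helper name `td3_bc_actor_loss` is illustrative; the normalization λ = α / mean|Q| with α = 2.5 follows the TD3+BC paper, while the surrounding training loop is omitted.

```python
import torch
import torch.nn.functional as F

def td3_bc_actor_loss(actor, critic, state: torch.Tensor, action: torch.Tensor,
                      alpha: float = 2.5) -> torch.Tensor:
    """Negated policy objective: maximize lambda * Q(s, pi(s)) - (pi(s) - a)^2."""
    pi = actor(state)                      # actions proposed by the current policy
    q = critic(state, pi)                  # critic's value for those actions
    lam = alpha / q.abs().mean().detach()  # normalize the scale of the Q term
    # Optimizers minimize, so negate the Q term; the MSE is the BC penalty
    return -lam * q.mean() + F.mse_loss(pi, action)
```

As in TD3, this loss would be minimized with the actor optimizer only every other critic update (delayed policy updates).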
 

Problems of policy constraint approaches

  • Too pessimistic
  • Not pessimistic enough
Policy constraint approaches can be too pessimistic, for example when π_β is a random policy. At the same time, they are not pessimistic enough: even a small deviation of π from the data can incur overestimation.

Solution options

  • Train Q so that OOD actions never have high values (CQL; a sketch follows this list)
  • Avoid evaluating actions that are not in the dataset
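For the first option, a CQL-style regularizer added to the critic loss is one way to keep OOD Q-values low: it soft-maximizes Q over sampled (potentially OOD) actions and pushes it down relative to Q on dataset actions. A rough sketch under simplifying assumptions (uniform action sampling in [-1, 1], a `critic(state, action)` callable, illustrative names; full CQL also samples actions from the current policy):

```python
import torch

def cql_penalty(critic, state: torch.Tensor, action: torch.Tensor,
                num_samples: int = 10) -> torch.Tensor:
    """CQL-style term: keep Q low on sampled (potentially OOD) actions
    and high on dataset actions."""
    batch_size, act_dim = action.shape
    # Uniform random actions stand in for OOD actions (assumes actions in [-1, 1])
    rand_actions = torch.empty(batch_size, num_samples, act_dim,
                               device=action.device).uniform_(-1.0, 1.0)
    # Repeat each state for every sampled action and evaluate the critic
    states_rep = state.unsqueeze(1).expand(-1, num_samples, -1)
    q_rand = critic(states_rep.reshape(batch_size * num_samples, -1),
                    rand_actions.reshape(batch_size * num_samples, act_dim))
    q_rand = q_rand.reshape(batch_size, num_samples)
    q_data = critic(state, action).reshape(batch_size)
    # logsumexp soft-maximizes over sampled actions; minimizing the gap
    # pushes down Q on OOD actions relative to dataset actions
    return (torch.logsumexp(q_rand, dim=1) - q_data).mean()
```

In training, this penalty is scaled by a coefficient and added to the critic's usual Bellman (TD) error.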
 
 
 
 
 
