Policy constraint approaches
The idea is to constrain the learned policy so it cannot deviate too far from the behavior policy π_β that collected the dataset.
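A generic way to write this as a constrained objective (the divergence D and threshold ε vary by method; this is a common template rather than any single paper's formulation):

$$
\max_{\pi}\ \mathbb{E}_{s \sim \mathcal{D}}\big[Q(s, \pi(s))\big]
\quad \text{s.t.} \quad
D\big(\pi(\cdot \mid s)\,\|\,\pi_\beta(\cdot \mid s)\big) \le \epsilon
$$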
TD3+BC
The "BC" part stands for Behavior Cloning.
- Problem: the Q-function is unreliable on out-of-distribution (OOD) actions
- Solution: keep the policy inside the region the data covers by fitting it to the dataset actions

This is done by subtracting a Behavior Cloning penalty from the policy objective, which pushes the policy to minimize the difference between its actions and the dataset actions (see the sketch below).
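A minimal sketch of the TD3+BC actor loss, following the objective π = argmax_π E[λ Q(s, π(s)) − (π(s) − a)²] from the TD3+BC paper; the function name and network interfaces here are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def td3_bc_actor_loss(actor, critic, states, actions, alpha=2.5):
    """TD3+BC actor loss (sketch): maximize Q while imitating dataset actions.

    Assumes `actor`: states -> actions and `critic`: (states, actions) -> Q,
    both torch.nn.Module; `states`, `actions` are an offline-dataset batch.
    """
    pi = actor(states)                    # actions the policy proposes
    q = critic(states, pi)                # critic's value of those actions
    # Normalize the Q term so alpha has a consistent scale across tasks
    # (alpha = 2.5 is the paper's default).
    lam = alpha / q.abs().mean().detach()
    # Maximize lambda * Q (RL term) minus the BC penalty (MSE to data).
    return -(lam * q).mean() + F.mse_loss(pi, actions)
```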
Problems of policy constraint approaches
- Too pessimistic
- Not pessimistic enough
Policy constraint approaches can be too pessimistic: if the behavior policy π_β is, for example, a random policy, constraining the learned policy to stay near it rules out good actions the data barely covers. At the same time, they are not pessimistic enough: allowing π to deviate even a little from the data can already incur Q-value overestimation.
Solution options
- Train the Q-function so that OOD actions never receive high values (CQL; see the sketch after this list)
- Avoid evaluating any action that does not appear in the dataset
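For the first option, a rough sketch of the CQL idea (illustrative only; full CQL(H) also samples from the current policy and uses importance weights — this shows just the push-down/push-up structure, with hypothetical names):

```python
import torch

def cql_penalty(critic, states, dataset_actions, num_samples=10):
    """CQL-style conservative penalty (sketch): push Q-values down on
    broadly sampled (likely OOD) actions and up on dataset actions.

    Assumes `critic(states, actions) -> Q` and actions bounded in [-1, 1].
    """
    batch, act_dim = dataset_actions.shape
    # Q-values of uniformly sampled actions: shape (num_samples, batch)
    q_rand = torch.stack([
        critic(states, torch.empty(batch, act_dim).uniform_(-1.0, 1.0))
        for _ in range(num_samples)
    ])
    # logsumexp softly targets the highest sampled Q-values for push-down
    push_down = torch.logsumexp(q_rand, dim=0).mean()
    # dataset actions are pushed up, so in-distribution values stay intact
    push_up = critic(states, dataset_actions).mean()
    return push_down - push_up  # scale by a coefficient, add to the TD loss
```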