TD3+BC

Creator: Seonglae Cho
Created: 2024 Apr 27 14:36
Edited: 2024 May 31 3:12

Policy constraint approaches

These methods constrain the learned policy to limit how far it deviates from the behavior policy.

TD3+BC

The latter term of the objective is Behavior Cloning: the squared difference (π(s) - a)² is subtracted in the policy update, so the policy also minimizes its deviation from the dataset actions.

$$\pi = \arg\max_\pi \, \mathbb{E}_{(s,a)\sim D}\big[\lambda Q(s, \pi(s)) - (\pi(s) - a)^2\big]$$

  • Problem: the Q-function is unreliable on OOD actions
  • Solution: keep π close to the unknown π_β by fitting π to the data
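A minimal sketch of this actor update in PyTorch, assuming `actor` and `critic` networks and an offline batch of `state`/`action` tensors are already defined. The helper name `td3_bc_actor_loss` is illustrative; the normalization λ = α / mean|Q| with α = 2.5 follows the TD3+BC paper, while the surrounding training loop is omitted.

```python
import torch
import torch.nn.functional as F

def td3_bc_actor_loss(actor, critic, state: torch.Tensor, action: torch.Tensor,
                      alpha: float = 2.5) -> torch.Tensor:
    """Negated policy objective: maximize lambda * Q(s, pi(s)) - (pi(s) - a)^2."""
    pi = actor(state)                      # actions proposed by the current policy
    q = critic(state, pi)                  # critic's value for those actions
    lam = alpha / q.abs().mean().detach()  # normalize the scale of the Q term
    # Optimizers minimize, so negate the Q term; the MSE is the BC penalty
    return -lam * q.mean() + F.mse_loss(pi, action)
```

As in TD3, this loss would be minimized with the actor optimizer only every other critic update (delayed policy updates).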
 

Problems of policy constraint approaches

  • Too pessimistic
  • Not pessimistic enough
Policy constraint approaches can be too pessimistic, for example when π_β is a random policy. At the same time, they are not pessimistic enough: even a small deviation of π from the data can incur overestimation.

Solution options

  • Train Q so that OOD actions never have high values (CQL; a sketch follows this list)
  • Avoid evaluating actions that are not in the dataset
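For the first option, a CQL-style regularizer added to the critic loss is one way to keep OOD Q-values low: it soft-maximizes Q over sampled (potentially OOD) actions and pushes it down relative to Q on dataset actions. A rough sketch under simplifying assumptions (uniform action sampling in [-1, 1], a `critic(state, action)` callable, illustrative names; full CQL also samples actions from the current policy):

```python
import torch

def cql_penalty(critic, state: torch.Tensor, action: torch.Tensor,
                num_samples: int = 10) -> torch.Tensor:
    """CQL-style term: keep Q low on sampled (potentially OOD) actions
    and high on dataset actions."""
    batch_size, act_dim = action.shape
    # Uniform random actions stand in for OOD actions (assumes actions in [-1, 1])
    rand_actions = torch.empty(batch_size, num_samples, act_dim,
                               device=action.device).uniform_(-1.0, 1.0)
    # Repeat each state for every sampled action and evaluate the critic
    states_rep = state.unsqueeze(1).expand(-1, num_samples, -1)
    q_rand = critic(states_rep.reshape(batch_size * num_samples, -1),
                    rand_actions.reshape(batch_size * num_samples, act_dim))
    q_rand = q_rand.reshape(batch_size, num_samples)
    q_data = critic(state, action).reshape(batch_size)
    # logsumexp soft-maximizes over sampled actions; minimizing the gap
    # pushes down Q on OOD actions relative to dataset actions
    return (torch.logsumexp(q_rand, dim=1) - q_data).mean()
```

In training, this penalty is scaled by a coefficient and added to the critic's usual Bellman (TD) error.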
 
 
 
 
 
