Classifier weakness
Idea: add the states that the RL policy visits as negative examples for the classifier
Use final states as positive (success) examples → train a binary classifier
use the learned classifier as the reward
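A minimal sketch of this classifier, assuming PyTorch; `goal_states` and `policy_states` are hypothetical placeholder tensors (not named in the notes) holding success states and states the RL policy visited:

```python
# Minimal sketch, assuming PyTorch and pre-collected state tensors.
# `goal_states` (successful final states) and `policy_states` (states the
# RL policy visited) are hypothetical placeholders, not from the notes.
import torch
import torch.nn as nn

state_dim = 16  # assumed state dimensionality

classifier = nn.Sequential(
    nn.Linear(state_dim, 64), nn.ReLU(),
    nn.Linear(64, 1),  # outputs a logit for p(success | state)
)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def classifier_update(goal_states, policy_states):
    """One gradient step: goal states are positives, policy-visited states negatives."""
    states = torch.cat([goal_states, policy_states], dim=0)
    labels = torch.cat([
        torch.ones(len(goal_states), 1),     # positives: success examples
        torch.zeros(len(policy_states), 1),  # negatives: states RL visited
    ], dim=0)
    loss = bce(classifier(states), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```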
like GAN
- the policy is generator
- the goal classifier is the discriminator (see the alternation sketch below)
VQ-GAN
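The alternation could look roughly like the following, reusing `classifier` / `classifier_update` from the sketch above; `collect_states` and `rl_update` are hypothetical placeholders for rollout collection and the policy improvement step:

```python
# GAN-style alternation sketch; `collect_states` and `rl_update` are
# assumed helpers, and `classifier` / `classifier_update` come from the
# sketch above.
import torch

def train_adversarially(policy, env, goal_states, num_iters=100):
    for _ in range(num_iters):
        # "Generator" step: run the current policy and improve it using
        # the classifier's success probability as reward.
        policy_states = collect_states(policy, env)
        rewards = torch.sigmoid(classifier(policy_states)).detach()
        rl_update(policy, policy_states, rewards)

        # "Discriminator" step: retrain the goal classifier with the
        # freshly visited states as new negative examples.
        classifier_update(goal_states, policy_states)
```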
We want the policy, trained with the goal classifier as reward, to match the goal state distribution
This is slightly different from Behavior Cloning, which matches the expert state-action distribution
typically sample positives and negatives half and half, so the expected classifier output is 0.5
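A small sketch of that balanced sampling, using the same assumed tensors as above:

```python
# Balanced minibatch sketch: draw half positives, half negatives so the
# classifier's base rate / expected output is 0.5. Tensor names are assumed.
import torch

def balanced_batch(goal_states, policy_states, batch_size=128):
    half = batch_size // 2
    pos_idx = torch.randint(len(goal_states), (half,))
    neg_idx = torch.randint(len(policy_states), (half,))
    states = torch.cat([goal_states[pos_idx], policy_states[neg_idx]], dim=0)
    labels = torch.cat([torch.ones(half, 1), torch.zeros(half, 1)], dim=0)
    return states, labels
```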
General reward classifier
with demonstration trajectories
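With demonstration trajectories, every demo state can serve as a positive rather than only the final states; one common GAN-style choice (an assumption here, not stated in the notes) is to use the classifier's log-odds as the reward:

```python
# Sketch: reward from the learned classifier's log-odds, log D(s) - log(1 - D(s)),
# a standard GAN-style formulation (assumed, not quoted from the notes).
# Reuses `classifier` from the sketch above.
import torch

def classifier_reward(states, eps=1e-6):
    p = torch.sigmoid(classifier(states))
    return torch.log(p + eps) - torch.log(1.0 - p + eps)
```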