Off-policy methods update the policy using data collected so far, so they are able to improve the policy without generating new samples from that policy.
On-policy methods depend on the current policy and require generating new data each time the policy is updated. This dependency limits parallelization, making off-policy methods better suited for large-scale training.
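As a rough sketch of why no new rollouts are needed, the update below samples transitions collected by any earlier policy from a replay buffer (assuming a PyTorch-style setup; `q_net`, `target_q_net`, and the buffer layout are hypothetical, not from the source):

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

# Hypothetical components: q_net / target_q_net map states to per-action
# Q-values; the buffer stores transitions from *any* past behavior policy.
buffer = deque(maxlen=100_000)

def off_policy_update(q_net, target_q_net, optimizer, batch_size=64, gamma=0.99):
    """One Q-learning step on previously collected data (no new rollouts needed)."""
    batch = random.sample(buffer, batch_size)
    s, a, r, s2, done = map(torch.stack, zip(*batch))  # done stored as float 0/1

    # Bootstrapped target uses the greedy action under the current Q estimate,
    # so the transitions can come from an arbitrary behavior policy.
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_q_net(s2).max(dim=1).values

    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```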


The problem is that the Q-function is unreliable on OOD (out-of-distribution) actions, so the learned policy needs to stay close to the unknown behavior policy that generated the data.
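One common way to keep the learned policy close to the behavior policy is to add a behavior-cloning term to the actor objective, in the style of TD3+BC. A minimal sketch under that assumption (the `actor`, `critic`, and `alpha` names here are illustrative, not from the source):

```python
import torch
import torch.nn.functional as F

def constrained_actor_loss(actor, critic, states, dataset_actions, alpha=2.5):
    """TD3+BC-style objective: maximize Q while staying near dataset actions.

    The BC term penalizes actions the behavior policy (the dataset) never took,
    which is exactly where the Q-function is unreliable (OOD actions).
    """
    pi = actor(states)      # actions proposed by the learned policy
    q = critic(states, pi)  # critic's (possibly overestimated) value for them

    # Scale the Q term by its magnitude so the BC penalty keeps a consistent weight.
    lam = alpha / q.abs().mean().detach()
    return -(lam * q).mean() + F.mse_loss(pi, dataset_actions)
```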

Seonglae Cho