Off-policy

Off-policy methods are able to improve the policy without generating new samples from that policy.

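On-policy methods depend on the current policy: every policy update requires generating fresh data from the updated policy. This makes them hard to parallelize, so off-policy methods are better suited to large-scale training.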
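One way to picture the difference is a replay buffer: transitions are stored once and reused across many updates, regardless of which policy collected them. The sketch below is only an illustration; `ReplayBuffer`, its `capacity` argument, and the toy transitions are hypothetical names and data, not something these notes specify.

```python
import random
from collections import deque


class ReplayBuffer:
    """Stores transitions (s, a, r, s_next) collected by any policy (hypothetical helper)."""

    def __init__(self, capacity):
        # Once capacity is reached, the oldest transitions are dropped.
        self.transitions = deque(maxlen=capacity)

    def add(self, s, a, r, s_next):
        self.transitions.append((s, a, r, s_next))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement; the batch may mix data
        # from many older versions of the policy.
        return rng.sample(list(self.transitions), batch_size)


# Toy data: 50 transitions with integer states and two actions.
buffer = ReplayBuffer(capacity=1000)
for step in range(50):
    buffer.add(s=step, a=step % 2, r=1.0, s_next=step + 1)

# The same stored data can be sampled again after every update.
batch = buffer.sample(batch_size=8)
```

Because sampling never needs the current policy, data collection and learning can run as separate processes. An on-policy method would have to discard the buffer after each update.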

y_i \leftarrow r(s_i, a_i) + \max_{a_i'} Q_\phi(s_i', a_i')
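The target above can be sketched for a small tabular case. This is a minimal illustration under assumptions: `q_targets`, the dictionary `q_table`, and the toy batch are hypothetical, and the `gamma` parameter is an addition that defaults to 1.0 so the result matches the undiscounted target as written.

```python
def q_targets(batch, q_table, actions, gamma=1.0):
    """Compute y_i = r_i + gamma * max_{a'} Q(s'_i, a') for each transition.

    gamma is an assumed extra parameter; gamma=1.0 gives the target as written.
    """
    targets = []
    for s, a, r, s_next in batch:
        # The max ranges over every action, whether or not it appears in the data.
        best_next = max(q_table[(s_next, a_next)] for a_next in actions)
        targets.append(r + gamma * best_next)
    return targets


# Toy Q-table over two next states and two actions.
actions = [0, 1]
q_table = {
    ("s1", 0): 0.5, ("s1", 1): 2.0,
    ("s2", 0): -1.0, ("s2", 1): 0.0,
}
# Each transition is (s, a, r, s_next).
batch = [("s0", 0, 1.0, "s1"), ("s0", 1, 0.0, "s2")]

targets = q_targets(batch, q_table, actions)  # [3.0, 0.0]
```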

The problem is that the Q-function is unreliable on out-of-distribution (OOD) actions, so we need to fit \pi close to the unknown behavior policy \pi_\beta using the data.
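As a rough illustration of staying close to the behavior policy, the sketch below estimates it from action counts and then takes the greedy action only among actions that appear in the data. The hard support constraint is a swapped-in simplification, since these notes do not say how closeness is enforced. `estimate_behavior_policy`, `constrained_greedy_action`, and `min_prob` are hypothetical names.

```python
from collections import Counter, defaultdict


def estimate_behavior_policy(dataset):
    """Empirical pi_beta(a | s) from action counts in the dataset."""
    counts = defaultdict(Counter)
    for s, a in dataset:
        counts[s][a] += 1
    policy = {}
    for s, action_counts in counts.items():
        total = sum(action_counts.values())
        policy[s] = {a: n / total for a, n in action_counts.items()}
    return policy


def constrained_greedy_action(s, q_table, pi_beta, min_prob=0.0):
    """Pick argmax_a Q(s, a) only among actions with pi_beta(a | s) > min_prob."""
    supported = [a for a, p in pi_beta[s].items() if p > min_prob]
    return max(supported, key=lambda a: q_table[(s, a)])


# Toy dataset of (state, action) pairs: action 0 seen three times, action 1 once.
dataset = [("s0", 0), ("s0", 0), ("s0", 1), ("s0", 0)]
pi_beta = estimate_behavior_policy(dataset)

# Action 2 never appears in the data, so its large Q-value is an
# unreliable extrapolation that an unconstrained max would select.
q_table = {("s0", 0): 1.0, ("s0", 1): 1.5, ("s0", 2): 100.0}
action = constrained_greedy_action("s0", q_table, pi_beta)  # 1
```

An unconstrained argmax would choose action 2 here purely because of its unverified Q-value. Restricting the choice to supported actions keeps the policy within the data.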