Trust Region Policy Optimization

TRPO constrains each policy update to a trust region defined by the KL divergence between the new and old policies:

$$D_{KL}(\pi_\theta, \pi_{\theta_{old}}) \le \epsilon$$

Enforcing this constraint requires second-order machinery (conjugate gradient with Fisher-vector products plus a line search), which makes TRPO difficult to implement and slow to run. PPO, which approximates the trust region with a simple first-order clipped objective, is therefore generally preferred.
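For contrast, here is a minimal sketch of PPO's clipped surrogate loss, the part that stands in for TRPO's KL constraint. The function name, the NumPy implementation, and the toy log-probability arrays are illustrative assumptions, not code from any particular library:

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate loss: a first-order stand-in for TRPO's
    KL trust-region constraint. Probability ratios outside
    [1 - clip_eps, 1 + clip_eps] receive no extra gradient signal,
    which discourages large policy updates without any constrained
    second-order optimization."""
    ratio = np.exp(logp_new - logp_old)  # pi_theta(a|s) / pi_theta_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic (lower) bound of the two surrogates, averaged over samples;
    # negated so that minimizing the loss maximizes the surrogate objective.
    return -np.mean(np.minimum(unclipped, clipped))

# Toy example: three sampled actions with stale vs. fresh log-probs.
logp_old = np.log(np.array([0.20, 0.50, 0.30]))
logp_new = np.log(np.array([0.25, 0.45, 0.30]))
adv = np.array([1.0, -0.5, 0.2])
print(ppo_clip_loss(logp_new, logp_old, adv))
```

The clipping plays the same role as the KL bound above: it keeps $\pi_\theta$ close to $\pi_{\theta_{old}}$ during each update, but needs only ordinary gradient descent.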