Proximal Policy Optimization
Methods
- Use the advantage function to reduce variance
- Use GAE (Generalized Advantage Estimation) for a better bias/variance trade-off of the advantage estimate; see the sketch after this list
- Use importance sampling to take multiple gradient steps by reusing semi-off-policy data
- Constrain the optimization objective in policy space by clipping
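As a rough sketch of the GAE step referenced above (the function name, arguments, and default coefficients are illustrative, not taken from any particular library):

```python
import numpy as np

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards, dones: length-T arrays collected by the rollout policy.
    values: length T+1 array of value estimates (last entry is the bootstrap value).
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    # Recurse backwards: A_t = delta_t + gamma * lam * (1 - done_t) * A_{t+1}
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    return advantages

# Toy usage with made-up numbers:
adv = gae_advantages(
    rewards=np.array([1.0, 0.0, 1.0]),
    values=np.array([0.5, 0.6, 0.4, 0.0]),
    dones=np.array([0.0, 0.0, 1.0]),
)
print(adv)
```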
Properties
- It scales very well to parallel training, which is a rare property among RL algorithms
- It supports both discrete and continuous action spaces
- Unlike TRPO, it strikes a balance between ease of implementation, sample complexity, and ease of tuning
- Importance sampling allows reusing data collected under a slightly different policy (making it semi-off-policy), which enables multiple policy updates from a single rollout; see the toy example after this list
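As a toy illustration of the importance-sampling point above (categorical policies and made-up returns, nothing PPO-specific), the snippet below estimates an expectation under a new policy using only samples drawn from an old one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two categorical policies over 3 actions: the rollout (old) policy and a new one.
pi_old = np.array([0.5, 0.3, 0.2])
pi_new = np.array([0.4, 0.4, 0.2])
returns = np.array([1.0, 2.0, 0.5])      # toy per-action return

# Sample actions from the OLD policy only.
actions = rng.choice(3, size=100_000, p=pi_old)

# Importance weights r(a) = pi_new(a) / pi_old(a) correct for the mismatch,
# so old-policy samples estimate an expectation under the NEW policy.
weights = pi_new[actions] / pi_old[actions]
is_estimate = np.mean(weights * returns[actions])
true_value = np.sum(pi_new * returns)

print(f"importance-sampling estimate: {is_estimate:.4f}")
print(f"exact value under pi_new:     {true_value:.4f}")
```

Because the correction only needs the two action probabilities, a rollout collected under the old policy can keep serving gradient updates as the policy changes, as long as the two policies stay close.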
Background
- A model-free reinforcement learning algorithm developed by OpenAI in 2017
- However, PPO is still considered on-policy RL because the constraint forces the new policy to stay sufficiently close to the rollout policy. PPO also cannot handle data collected from arbitrarily different policies.
Importance sampling
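In PPO, importance sampling enters through the per-timestep probability ratio between the current policy $\pi_\theta$ and the rollout policy $\pi_{\theta_{\text{old}}}$:

$$
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
$$

At the start of an update $r_t(\theta) = 1$; the further the ratio drifts from 1, the less reliable the reuse of the old rollout becomes, which is what the clipping below guards against.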
Clipping
Building on TRPO, PPO uses a clipped surrogate objective to control the size of policy updates.
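Concretely, the clipped surrogate objective from the PPO paper is

$$
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right]
$$

where $\epsilon$ is the clip coefficient (commonly 0.2) and $\hat{A}_t$ is the advantage estimate, e.g., from GAE.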
In PPO's objective function, clipping bounds the ratio in both directions, but because of the minimum operation only one bound actually takes effect, depending on the value of the ratio and the sign of the advantage $\hat{A}_t$.
In other words, the two inequalities are applied one at a time, never both simultaneously.
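Written case by case over the sign of the advantage, the per-timestep objective is equivalent to

$$
L_t^{\mathrm{CLIP}}(\theta) =
\begin{cases}
\min\!\left(r_t(\theta),\,1+\epsilon\right)\hat{A}_t & \text{if } \hat{A}_t \ge 0 \\
\max\!\left(r_t(\theta),\,1-\epsilon\right)\hat{A}_t & \text{if } \hat{A}_t < 0
\end{cases}
$$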
Out of the six possible cases (two signs of the advantage × three regions of the ratio relative to the clip range), two essentially never occur in practice, because the gradient rarely pushes the ratio outside the clip range in the direction opposite to the advantage.
Since the ratio is not independent of $\theta$, differentiating the objective takes a little care, but no standalone log-probability term survives the chain rule: $\nabla_\theta r_t(\theta) = r_t(\theta)\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)$, so the log-probability is absorbed back into the ratio. There is also an approximation in the preceding derivation: the per-timestep expectation under the new policy's state distribution is replaced by an average over trajectories collected with the rollout policy, ignoring the mismatch between the two state distributions. This approximation is only reliable while the ratio stays near 1, i.e., within the clipping range.
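A minimal PyTorch sketch with toy numbers (the tensor values and names are illustrative, not from any particular codebase) showing how the ratio is typically formed from log-probabilities and how the clipped minimum zeroes the gradient for timesteps whose ratio has already moved past the clip bound:

```python
import torch

# Toy per-timestep quantities; in a real implementation these come from the
# rollout buffer and the current policy network.
new_log_probs = torch.tensor([-1.0, -0.2, -2.5], requires_grad=True)
old_log_probs = torch.tensor([-1.1, -0.9, -1.0])   # frozen rollout log-probs
advantages = torch.tensor([1.0, 2.0, -1.5])
eps = 0.2

# Ratio implemented through log-probabilities: r_t = exp(log pi_new - log pi_old).
ratio = torch.exp(new_log_probs - old_log_probs)

# Clipped surrogate, negated to get a loss for gradient descent.
unclipped = ratio * advantages
clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
loss = -torch.min(unclipped, clipped).mean()
loss.backward()

# Timesteps where the clipped branch wins the min get zero gradient, so samples
# whose ratio has already drifted outside the clip range stop driving the update.
print("ratio:   ", ratio.detach())
print("gradient:", new_log_probs.grad)
```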