Proximal Policy Optimization
Methods
- Use the advantage function to reduce variance
- Use GAE (Generalized Advantage Estimation) for a better bias/variance trade-off in the advantage estimates
- Use importance sampling to take multiple gradient steps from semi-off-policy data
- Constrain the policy update by clipping the probability ratio in the surrogate objective (see the sketch after this list)
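A minimal sketch of how these pieces fit together, assuming one rollout stored as NumPy arrays of rewards, value estimates, and per-step log-probabilities under the rollout policy (the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over a single rollout (episode boundaries ignored)."""
    values = np.append(values, last_value)                        # bootstrap value for the state after the rollout
    advantages = np.zeros_like(rewards, dtype=np.float64)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]    # one-step TD residual
        gae = delta + gamma * lam * gae                           # exponentially weighted sum of residuals
        advantages[t] = gae
    return advantages

def clipped_surrogate(new_logp, old_logp, advantages, eps=0.2):
    """PPO clipped surrogate objective (to be maximized)."""
    ratio = np.exp(new_logp - old_logp)                           # importance-sampling ratio pi_new / pi_old
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.minimum(unclipped, clipped).mean()                  # pessimistic lower-bound estimate
```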
Properties
- It scales very well to parallel training, which is a rare property among RL algorithms
- It supports both discrete and continuous action spaces
- It strikes a balance between ease of implementation, sample complexity, and ease of tuning, unlike TRPO
- Importance sampling allows the rollout policy to differ from the policy being optimized (making it semi-off-policy), which enables multiple policy updates from a single trajectory (see the sketch below)
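A rough PyTorch sketch of the "multiple policy updates from a single trajectory" point; `policy` is assumed to map a batch of states to a `torch.distributions` object, and `states`, `actions`, `old_logp`, `advantages` are placeholders for whatever rollout storage an implementation actually uses:

```python
import torch

def ppo_update(policy, optimizer, states, actions, old_logp, advantages,
               epochs=4, eps=0.2):
    """Reuse one collected rollout for several gradient steps; the importance-sampling
    ratio corrects for the mismatch between the (now stale) rollout policy and the
    current policy. old_logp and advantages are treated as constants."""
    for _ in range(epochs):
        dist = policy(states)                               # e.g. Categorical or Normal over actions
        new_logp = dist.log_prob(actions)
        ratio = torch.exp(new_logp - old_logp)              # pi_theta / pi_theta_old
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
        loss = -torch.min(unclipped, clipped).mean()        # negate because the optimizer minimizes
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```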
Background
- A model-free reinforcement learning algorithm developed by OpenAI in 2017
- However, PPO is still considered on-policy RL because the new policy must stay sufficiently close to the rollout policy due to the constraint. Also, PPO cannot handle data from multiple different policies.


Importance sampling
Clipping (surrogate objective)
Building on TRPO, it uses a clipped surrogate objective to control the size of policy updates.

In PPO's objective function, clipping is written for both directions, but because of the minimum operation it is only ever active on one side, depending on the value of the ratio $r_t(\theta)$ and the sign of the advantage $\hat{A}_t$.
In other words, in the following formula the two clipping inequalities apply one at a time, never simultaneously:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$
Out of the 6 possible cases (two signs of the advantage × three regions of the ratio relative to the $1 \pm \epsilon$ bounds), 2 effectively never occur in practice: the ratio falling below $1-\epsilon$ while the advantage is positive, or rising above $1+\epsilon$ while the advantage is negative.
Since the ratio $r_t(\theta)$ is not independent of $\theta$, the differentiation is somewhat involved, but the explicit log probability disappears when the chain rule is applied ($\nabla_\theta r_t(\theta) = r_t(\theta)\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)$). There is also an approximation in this derivation, where per-timestep calculations are simplified to the trajectory level, and it holds only near the clipping region.
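A small autograd check of this case analysis (ε = 0.2 and the ratio/advantage values below are arbitrary): the clip zeroes the gradient only when the ratio has already moved too far in the direction the advantage favors, while on the opposite side the unclipped term is selected and the gradient still pulls the ratio back.

```python
import torch

eps = 0.2
# Six cases: sign of the advantage (+/-) x ratio region (below 1-eps, inside, above 1+eps).
for adv in (+1.0, -1.0):
    for r0 in (0.5, 1.0, 1.5):
        ratio = torch.tensor(r0, requires_grad=True)
        advantage = torch.tensor(adv)
        objective = torch.min(ratio * advantage,
                              torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage)
        objective.backward()
        print(f"A={adv:+.0f}  r={r0:.1f}  objective={objective.item():+.2f}  "
              f"d(objective)/dr={ratio.grad.item():+.2f}")
```

Only the two clipped rows print a zero gradient; the two "wrong-side" rows keep a nonzero gradient pulling the ratio back, which is consistent with the claim above that those cases do not persist in practice.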

Additional implementation options

Proximal Policy Optimization — Spinning Up documentation
PPO is motivated by the same question as TRPO: how can we take the biggest possible improvement step on a policy using the data we currently have, without stepping so far that we accidentally cause performance collapse? Where TRPO tries to solve this problem with a complex second-order method, PPO is a family of first-order methods that use a few other tricks to keep new policies close to old. PPO methods are significantly simpler to implement, and empirically seem to perform at least as well as TRPO.
https://spinningup.openai.com/en/latest/algorithms/ppo.html
Proximal Policy Optimization Algorithms
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate"...
https://arxiv.org/abs/1707.06347

Proximal Policy Optimization
We’re releasing a new class of reinforcement learning algorithms, Proximal Policy Optimization (PPO), which perform comparably or better than state-of-the-art approaches while being much simpler to implement and tune. PPO has become the default reinforcement learning algorithm at OpenAI because of its ease of use and good performance.
https://openai.com/research/openai-baselines-ppo


Seonglae Cho