PPO

Proximal Policy Optimization

Methods

Use an advantage function to reduce the variance of the policy gradient estimate

Use GAE (Generalized Advantage Estimation) for a better bias/variance trade-off (see Advantage function)

Use importance sampling to take multiple gradient steps by utilizing semi-off-policy data

Constrain the optimization objective in the policy space by clipping
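As an illustration of the GAE item above, here is a minimal NumPy sketch of the usual backward recursion. The function name `compute_gae`, the argument layout (a `values` array with one extra bootstrap entry), and the default `gamma` and `lam` are assumptions made for this example, not something these notes specify.

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout of length T.

    rewards, dones: length T. values: length T + 1, where the last entry
    is the bootstrap value of the state reached after the final step.
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float64)
    gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]  # stop bootstrapping at episode ends
        # One-step TD error
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        # Exponentially weighted sum of future TD errors
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    # Value targets commonly used to train the critic
    returns = advantages + np.asarray(values[:-1], dtype=np.float64)
    return advantages, returns

adv, ret = compute_gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], [0, 0, 1],
                       gamma=1.0, lam=1.0)
print(adv, ret)
```

With `lam=0` the estimate reduces to the one-step TD error (low variance, more bias); with `lam=1` it becomes the Monte Carlo return minus the value baseline (less bias, more variance).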

Properties

It scales very well with parallel training, which is a rare property in RL

It supports both discrete and continuous action spaces

It strikes a balance between ease of implementation, sample complexity, and ease of tuning, unlike TRPO

Importance sampling allows the policy being updated to differ from the policy that collected the data (making it semi-off-policy), which enables multiple policy updates from a single trajectory
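The importance-sampling idea can be checked on a toy example. The sketch below uses a hypothetical two-action policy with made-up probabilities and rewards. It shows that reweighting by the probability ratio recovers the expectation under the new policy from quantities weighted by the old one.

```python
import numpy as np

def importance_ratio(logp_new, logp_old):
    """pi_new(a|s) / pi_old(a|s), computed from log-probabilities."""
    return np.exp(np.asarray(logp_new) - np.asarray(logp_old))

# Hypothetical two-action example (all numbers are illustrative)
p_old = np.array([0.5, 0.5])    # rollout policy
p_new = np.array([0.6, 0.4])    # policy after some gradient steps
reward = np.array([1.0, 0.0])   # reward of each action

ratio = importance_ratio(np.log(p_new), np.log(p_old))

# Expectation under the new policy, computed directly...
direct = np.sum(p_new * reward)
# ...and as an expectation under the old policy, reweighted by the ratio
reweighted = np.sum(p_old * ratio * reward)
print(ratio, direct, reweighted)
```

In a real PPO implementation the expectation over the old policy is replaced by an average over sampled actions, so the reweighted estimate is only approximate and gets noisier as the two policies drift apart.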

Background

A model-free reinforcement learning algorithm developed by OpenAI in 2017

However, PPO is still considered an on-policy algorithm, because the constraint keeps the new policy sufficiently close to the rollout policy. Also, PPO cannot handle data collected by multiple different policies.

Importance sampling

Clipping (surrogate objective)

Building on TRPO, it replaces the hard trust-region constraint with a clipped objective function that limits the size of policy updates.
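For reference, the clipped surrogate objective is usually written as follows, with r_t(\theta) denoting the probability ratio and \hat{A}_t the advantage estimate (this notation is assumed here to match the ratio used elsewhere in these notes):

```latex
r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}

L^{CLIP}(\theta) = \mathbb{E}_t\left[ \min\left( r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right]
```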

In PPO's objective function, the ratio is clipped in both directions, but because of the minimum operation the clip only takes effect on one side, depending on the ratio value and the sign of the advantage A.

In other words, the two inequalities in the following formula are enforced one at a time, not simultaneously:

1 - \epsilon \le \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} \le 1 + \epsilon

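Of the 6 possible cases (two signs of the advantage × three ratio regions: below 1 - \epsilon, inside the interval, above 1 + \epsilon), there are 2 in which the clip never takes effect: when A > 0 and the ratio is below 1 - \epsilon, and when A < 0 and the ratio is above 1 + \epsilon, the minimum selects the unclipped term.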
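The six cases can be enumerated directly. This NumPy sketch evaluates the per-sample clipped objective for one representative ratio in each region and reports whether the minimum selected the clipped term. The helper names, the value `eps=0.2`, and the sample ratios are assumptions chosen for illustration.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

def clip_selected(ratio, advantage, eps=0.2):
    """True when the minimum picks the clipped term and it differs from the unclipped one."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return bool(clipped < unclipped)

# Two advantage signs x three ratio regions (below, inside, above the clip interval)
cases = [(adv, ratio) for adv in (1.0, -1.0) for ratio in (0.5, 1.0, 1.5)]
for adv, ratio in cases:
    value = float(clipped_surrogate(ratio, adv))
    print(f"A={adv:+.0f} ratio={ratio:.1f} objective={value:+.2f} "
          f"clip_selected={clip_selected(ratio, adv)}")
```

Where the clipped term is selected, the objective is constant in the ratio, so that sample contributes no gradient and the policy is not pushed further in that direction.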

Since the ratio depends on \theta, differentiating the objective is more involved, but the log probability drops out when the chain rule is applied. The preceding derivation also contains an approximation in which per-timestep calculations are simplified to the trajectory level. This approximation holds only near the clipped region, where the new policy stays close to the old one.
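One way to read the chain-rule remark is through the standard log-derivative identity, written here with r_t(\theta) for the probability ratio (a notation assumed for this sketch). In the unclipped region, the gradient of the ratio is the ratio times the gradient of the log probability, and at \theta = \theta_{old} the ratio equals 1, so the surrogate gradient coincides with the ordinary policy gradient term:

```latex
\nabla_\theta r_t(\theta)
  = \frac{\nabla_\theta \pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}
  = r_t(\theta)\, \nabla_\theta \log \pi_\theta(a_t|s_t)

\nabla_\theta \left( r_t(\theta)\, \hat{A}_t \right) \Big|_{\theta = \theta_{old}}
  = \hat{A}_t\, \nabla_\theta \log \pi_\theta(a_t|s_t) \Big|_{\theta = \theta_{old}}
```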