PPO

Creator
Seonglae Cho
Created
2023 Jul 15 17:06
Edited
2025 Sep 7 1:29

Proximal Policy Optimization

Methods

  • Use an advantage function to reduce the variance of the gradient estimate (see the GAE sketch after this list)
  • Use importance sampling to take multiple gradient steps by reusing semi-off-policy data
  • Constrain the update in policy space by clipping the objective
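A minimal sketch of the advantage computation under one common choice, Generalized Advantage Estimation (GAE); this is an assumption, since PPO itself does not mandate a particular estimator, and the function name, tensor shapes, and hyperparameters are illustrative.

```python
import torch

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (GAE) over one rollout.

    rewards, dones: float tensors of shape [T] (dones is 0.0 or 1.0)
    values: float tensor of shape [T + 1], including a bootstrap value for the final state
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    last_gae = 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        # One-step TD error, cut off at episode boundaries
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        # GAE recursion: exponentially weighted sum of future TD errors
        last_gae = delta + gamma * lam * not_done * last_gae
        advantages[t] = last_gae
    returns = advantages + values[:-1]  # regression targets for the value function
    return advantages, returns
```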

Property

  • It scales very well with parallel training, which is a rare property among RL algorithms
  • It supports both discrete and continuous action spaces
  • It strikes a balance between ease of implementation, sample complexity, and ease of tuning, unlike TRPO
  • Importance sampling allows the update policy to differ from the roll-out policy (making it semi-off-policy), which enables multiple policy updates from a single trajectory (see the sketch below)
https://arxiv.org/pdf/2402.01306
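A minimal, self-contained sketch of this reuse, assuming a PyTorch setup with a toy discrete policy; every name, shape, and hyperparameter below is illustrative rather than taken from a particular implementation.

```python
import torch
from torch import nn
from torch.distributions import Categorical

# Toy rollout and policy (illustrative shapes and values)
obs_dim, n_actions, T = 4, 2, 128
policy = nn.Linear(obs_dim, n_actions)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

obs = torch.randn(T, obs_dim)
actions = torch.randint(n_actions, (T,))
advantages = torch.randn(T)
with torch.no_grad():  # log-probs under the roll-out (old) policy, kept fixed
    old_log_probs = Categorical(logits=policy(obs)).log_prob(actions)

clip_eps = 0.2
for epoch in range(4):  # several gradient steps from the same rollout
    new_log_probs = Categorical(logits=policy(obs)).log_prob(actions)
    ratio = torch.exp(new_log_probs - old_log_probs)  # importance weight r_t
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```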

Background

  • A model-free reinforcement learning algorithm developed by OpenAI in 2017
  • However, PPO is still considered on-policy RL because the constraint requires the new policy to stay sufficiently close to the roll-out policy. PPO also cannot handle data collected from multiple different policies.

Importance sampling
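The per-timestep importance ratio compares the current policy with the roll-out policy; weighting the advantage by it gives the unclipped surrogate objective from the PPO paper:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}, \qquad L^{\mathrm{CPI}}(\theta) = \hat{\mathbb{E}}_t\big[\, r_t(\theta)\, \hat{A}_t \,\big]$$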

Clipping (surrogate objective)

Building on TRPO, it uses a clipped surrogate objective to control the size of policy updates.
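The clipped surrogate objective from the PPO paper:

$$L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big]$$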
In PPO's objective function, clipping is written in both directions, but because of the minimum operation only one side actually takes effect, depending on the value of the ratio $r_t(\theta)$ and the sign of the advantage $\hat{A}_t$.
In other words, the objective is equivalent to a one-sided form in which each inequality is applied one at a time, not simultaneously:
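$$L_t^{\mathrm{CLIP}}(\theta) =
\begin{cases}
\min\big(r_t(\theta),\ 1+\epsilon\big)\,\hat{A}_t & \text{if } \hat{A}_t \ge 0 \\
\max\big(r_t(\theta),\ 1-\epsilon\big)\,\hat{A}_t & \text{if } \hat{A}_t < 0
\end{cases}$$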
Out of the 6 possible cases (two signs of the advantage × three ratio regions relative to $1-\epsilon$ and $1+\epsilon$), 2 of them, the lower clip with a positive advantage and the upper clip with a negative advantage, are never selected by the minimum in practice.
Since the ratio is not independent of $\theta$, the differentiation is somewhat involved, but the log probability disappears when the chain rule is applied. There is also an approximation earlier in the derivation: the trajectory-level importance weight (the product of per-timestep ratios) is simplified to the per-timestep ratio alone. This approximation holds only when the new policy stays close to the old one, i.e., near the clipped region.
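A quick numeric check of the case analysis above; the helper function and the sample values are illustrative.

```python
import torch

def ppo_clip_term(ratio, adv, eps=0.2):
    """Per-sample PPO-Clip objective and whether the clipped branch was selected."""
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    return torch.minimum(unclipped, clipped), clipped < unclipped

for adv in (1.0, -1.0):
    for ratio in (0.5, 1.0, 1.5):  # below, inside, and above the clip range
        value, clip_active = ppo_clip_term(torch.tensor(ratio), torch.tensor(adv))
        print(f"A={adv:+.0f}, r={ratio:.1f} -> objective={value.item():+.2f}, "
              f"clip active={bool(clip_active)}")
# The clipped branch is selected only for (A > 0, r > 1 + eps) and (A < 0, r < 1 - eps);
# in the other out-of-range cases the min falls back to the unclipped term.
```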
annotated PPO from Edan Toledo
 
 
 

Additional implementation options

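The options vary across implementations; common extras in reference code such as OpenAI Baselines PPO2 (linked below) are advantage normalization, a clipped value loss, and an entropy bonus. A minimal sketch of a combined loss using these, with illustrative names and coefficients:

```python
import torch

def ppo_total_loss(new_log_probs, old_log_probs, advantages,
                   values, old_values, returns, entropy,
                   clip_eps=0.2, vf_coef=0.5, ent_coef=0.01):
    """Combined PPO loss with common optional extras (all tensors share shape [N])."""
    # Advantage normalization stabilizes the scale of the policy gradient
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    # Clipped policy surrogate
    ratio = torch.exp(new_log_probs - old_log_probs)
    pg_loss = -torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages,
    ).mean()

    # Clipped value loss keeps the new value estimate near the rollout-time one
    v_clipped = old_values + torch.clamp(values - old_values, -clip_eps, clip_eps)
    vf_loss = torch.max((values - returns) ** 2, (v_clipped - returns) ** 2).mean()

    # Entropy bonus discourages premature collapse to a deterministic policy
    return pg_loss + vf_coef * vf_loss - ent_coef * entropy.mean()
```

Gradient-norm clipping and learning-rate annealing are usually applied outside the loss, in the optimizer step.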
 
 
 
Proximal Policy Optimization — Spinning Up documentation
PPO is motivated by the same question as TRPO: how can we take the biggest possible improvement step on a policy using the data we currently have, without stepping so far that we accidentally cause performance collapse? Where TRPO tries to solve this problem with a complex second-order method, PPO is a family of first-order methods that use a few other tricks to keep new policies close to old. PPO methods are significantly simpler to implement, and empirically seem to perform at least as well as TRPO.
Proximal Policy Optimization Algorithms
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate"...
Proximal Policy Optimization
We’re releasing a new class of reinforcement learning algorithms, Proximal Policy Optimization (PPO), which perform comparably or better than state-of-the-art approaches while being much simpler to implement and tune. PPO has become the default reinforcement learning algorithm at OpenAI because of its ease of use and good performance.

PPO2

github.com
 
 
 
