PPO

Creator: Seonglae Cho
Created: 2023 Jul 15 17:06
Edited: 2025 Mar 21 11:53

Proximal Policy Optimization

Methods

  • Use an advantage function to reduce the variance of the policy gradient
  • Use importance sampling to take multiple gradient steps from the same rollout, treating it as semi-off-policy data
  • Constrain the optimization objective in policy space by clipping the probability ratio (all three appear in the sketch below)
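
A minimal sketch of how these three pieces combine into the PPO policy loss, assuming PyTorch; tensor names (`new_logp`, `old_logp`, `advantages`) are illustrative, not from any particular library:

```python
import torch

def ppo_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Clipped surrogate policy loss (negated for gradient descent)."""
    # Importance-sampling ratio r_t(theta) = pi_theta / pi_theta_old;
    # old_logp was recorded at rollout time, so it carries no gradient.
    ratio = torch.exp(new_logp - old_logp.detach())
    # Unclipped surrogate: ratio-weighted advantage
    surr_unclipped = ratio * advantages
    # Clipped surrogate: the ratio is confined to [1 - eps, 1 + eps]
    surr_clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Pessimistic (elementwise minimum) bound over the two surrogates
    return -torch.min(surr_unclipped, surr_clipped).mean()
```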

Property

  • It scales very well with parallel training, which is a rare property among RL algorithms
  • It supports both discrete and continuous action spaces
  • It strikes a balance between ease of implementation, sample complexity, and ease of tuning, unlike TRPO
  • Importance sampling allows the update policy to differ from the rollout policy (making it semi-off-policy), which enables multiple policy updates from a single trajectory (see the toy example below)
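
To illustrate the last point, a self-contained toy example, assuming PyTorch and a categorical policy (the rollout numbers are made up for demonstration). One batch of data, recorded under a uniform old policy, supports many gradient steps:

```python
import math
import torch

# Toy categorical policy over 4 actions
logits = torch.zeros(4, requires_grad=True)
optimizer = torch.optim.Adam([logits], lr=0.1)

# Fake rollout recorded under the old (uniform) policy: log(1/4) per action
actions = torch.tensor([0, 1, 2, 3, 0, 1])
old_logp = torch.full((6,), math.log(0.25))
advantages = torch.tensor([1.0, -0.5, 0.2, -1.0, 0.8, 0.1])

# Multiple gradient steps reusing the single rollout above
for epoch in range(10):
    new_logp = torch.log_softmax(logits, dim=0)[actions]
    ratio = torch.exp(new_logp - old_logp)
    loss = -torch.min(ratio * advantages,
                      torch.clamp(ratio, 0.8, 1.2) * advantages).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```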

Background

  • A model-free reinforcement learning algorithm developed by OpenAI in 2017
  • However, PPO is still considered on-policy RL because the clipping constraint requires the new policy to stay sufficiently close to the rollout policy. Also, PPO cannot handle data from multiple different policies.

Importance sampling
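
The rollout data comes from the old policy, so the surrogate objective reweights each sample with an importance-sampling ratio. In the notation of the PPO paper:

$$
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}, \qquad L^{\text{CPI}}(\theta) = \hat{\mathbb{E}}_t\!\left[ r_t(\theta)\, \hat{A}_t \right]
$$

Maximizing $L^{\text{CPI}}$ without a constraint leads to excessively large policy updates, which motivates the clipping below.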

Clipping

Based on TRPO, PPO uses a surrogate objective with clipping to control the size of policy updates.
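The clipped objective from the PPO paper (Schulman et al., 2017):

$$
L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[ \min\!\left( r_t(\theta)\, \hat{A}_t,\ \operatorname{clip}\!\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_t \right) \right]
$$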
In PPO's objective function, clipping is defined in both directions, but the minimum operation means only one side actually takes effect for a given sample, depending on the ratio value and the sign of the advantage $\hat{A}_t$.
In other words, the two inequalities are applied one at a time, never simultaneously.
Out of the 6 possible cases (two signs of the advantage × three regions of the ratio relative to $1 \pm \epsilon$), clipping never takes effect in 2 of them in practice.
Since the ratio $r_t(\theta)$ is not independent of $\theta$, the differentiation is somewhat involved, but the log probability simplifies away when applying the chain rule. There is also an approximation in the preceding derivation where per-timestep calculations are simplified to the trajectory level; this approximation holds only near the clipped region.
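The case analysis can be tabulated as follows (a reconstruction consistent with the annotated figure credited below; $\hat{A}_t$ is the advantage estimate, $\epsilon$ the clip range):

| Sign of $\hat{A}_t$ | Ratio region | min selects | Gradient |
| --- | --- | --- | --- |
| $\hat{A}_t > 0$ | $r_t < 1-\epsilon$ | unclipped term | flows (lower clip never binds) |
| $\hat{A}_t > 0$ | $1-\epsilon \le r_t \le 1+\epsilon$ | unclipped term | flows |
| $\hat{A}_t > 0$ | $r_t > 1+\epsilon$ | clipped term | zero |
| $\hat{A}_t < 0$ | $r_t < 1-\epsilon$ | clipped term | zero |
| $\hat{A}_t < 0$ | $1-\epsilon \le r_t \le 1+\epsilon$ | unclipped term | flows |
| $\hat{A}_t < 0$ | $r_t > 1+\epsilon$ | unclipped term | flows (upper clip never binds) |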
annotated PPO from Edan Toledo

Additional implementation options

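Common options layered on top of the core clipped loss include advantage normalization, value-function loss clipping, and an entropy bonus. A minimal sketch, assuming PyTorch; the coefficients shown are typical defaults, not prescriptions:

```python
import torch

def ppo_total_loss(new_logp, old_logp, advantages,
                   values, old_values, returns, entropy,
                   clip_eps=0.2, vf_coef=0.5, ent_coef=0.01):
    # Option 1: normalize advantages within the batch (variance reduction)
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    # Core clipped policy loss
    ratio = torch.exp(new_logp - old_logp)
    policy_loss = -torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages,
    ).mean()

    # Option 2: clip the value update analogously to the policy update
    v_clipped = old_values + torch.clamp(values - old_values, -clip_eps, clip_eps)
    value_loss = torch.max((values - returns) ** 2,
                           (v_clipped - returns) ** 2).mean()

    # Option 3: entropy bonus to discourage premature determinism
    return policy_loss + vf_coef * value_loss - ent_coef * entropy.mean()
```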

PPO2

PPO2 refers to the later, GPU-optimized PPO implementation released in OpenAI Baselines.


Recommendations