ReST

Creator: Seonglae Cho
Created: 2023 Sep 9 16:51
Edited: 2023 Sep 10 8:22
Traditional supervised learning methods can be limited by the availability of high-quality labeled data, and reinforcement learning can be computationally expensive and sample-inefficient.
ReST is offline: unlike online RLHF, it learns from a fixed dataset of examples, which makes it more computationally efficient and less prone to reward hacking.
The algorithm alternates between training a base policy on the available data and refining that policy using offline RL with the reward function.
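A minimal sketch of this alternation (hypothetical Python, not the paper's code; `grow` and `improve` are sketched under the corresponding steps below, and the rising threshold schedule is illustrative):
```python
# Hypothetical outer loop of ReST: alternate Grow and Improve steps.
def rest(policy, tokenizer, prompts, reward_fn,
         grow_steps=3, improve_steps=4, thresholds=(0.0, 0.3, 0.6, 0.9)):
    for _ in range(grow_steps):
        dataset = grow(policy, tokenizer, prompts)        # Grow: sample new data
        for t in thresholds[:improve_steps]:              # Improve: rising reward filter
            policy = improve(policy, dataset, reward_fn, threshold=t)
    return policy
```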
 
 

Grow step

Generates a dataset of samples using the current policy.
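A sketch of the Grow step, assuming a Hugging Face-style causal LM (the names `policy.generate` and `tokenizer` are illustrative, not the paper's code):
```python
import torch

def grow(policy, tokenizer, prompts, num_samples=4, max_new_tokens=128):
    dataset = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            # Sample several candidate continuations from the current policy
            outputs = policy.generate(**inputs, do_sample=True,
                                      num_return_sequences=num_samples,
                                      max_new_tokens=max_new_tokens)
        for out in outputs:
            # Strip the prompt tokens, keep only the generated continuation
            completion = tokenizer.decode(out[inputs["input_ids"].shape[1]:],
                                          skip_special_tokens=True)
            dataset.append((prompt, completion))
    return dataset
```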
 
 

Improve step

Refines the policy using the generated dataset and the reward function. The generated samples are filtered with a reward threshold, and the policy is fine-tuned to maximize the expected reward on the filtered dataset. The Improve step is repeated more frequently than the Grow step to amortize the cost of dataset creation.
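A sketch of the Improve step under the same assumptions: keep only samples whose reward clears the threshold, then fine-tune on the survivors. `fine_tune` is a hypothetical stand-in for a standard supervised (cross-entropy) fine-tuning loop.
```python
def improve(policy, dataset, reward_fn, threshold):
    # Filter: keep only samples whose reward clears the current threshold
    filtered = [(p, y) for (p, y) in dataset if reward_fn(p, y) >= threshold]
    # Fine-tune: maximize likelihood of the high-reward samples
    return fine_tune(policy, filtered)  # hypothetical helper
```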
 
 
 
 
 
