traditional supervised learning methods can be limited by the availability of high-quality labeled data, while reinforcement learning can be computationally expensive and sample-inefficient
unlike online RLHF, the method learns from a fixed (offline) dataset of examples, which makes it more computationally efficient and less prone to reward hacking
The algorithm alternates between training a base policy on the available data and refining the policy using offline RL with the reward function
Grow step: generating a dataset of samples using the current policy (sketched below)
Improve step: refining the policy using the generated dataset and the reward function
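A minimal Python sketch of the Grow step, under the assumption of a hypothetical `generate(policy, prompt)` callable that samples one completion from the current policy; the function and parameter names here are illustrative, not taken from the source.

```python
# Hypothetical sketch of the Grow step: sample completions from the current
# policy to build a fresh dataset. `generate` is an assumed callable that
# draws one completion from `policy` for a given prompt.

def grow(policy, generate, prompts, samples_per_prompt=4):
    """Build a dataset of (prompt, completion) pairs sampled from the policy."""
    dataset = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            completion = generate(policy, prompt)  # sample from the current policy
            dataset.append((prompt, completion))
    return dataset
```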
During the Improve step, the generated samples are filtered using the reward function and the policy is fine-tuned to maximize the expected reward on the filtered dataset
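One simple way to instantiate the Improve step (a sketch, not the source's exact offline RL objective) is to keep only samples whose reward clears a threshold and then fine-tune on the survivors; `reward_fn`, `fine_tune`, and `threshold` are assumed placeholders.

```python
# Hypothetical sketch of one Improve iteration: filter the grown dataset by
# reward, then fine-tune the policy on the high-reward subset. `reward_fn`
# and `fine_tune` are assumed callables supplied by the caller.

def improve(policy, dataset, reward_fn, fine_tune, threshold):
    """Filter by reward threshold, then fine-tune the policy on what remains."""
    filtered = [
        (prompt, completion)
        for prompt, completion in dataset
        if reward_fn(prompt, completion) >= threshold  # keep only high-reward samples
    ]
    return fine_tune(policy, filtered)  # e.g., maximize likelihood on the filtered data
```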
the Improve step is repeated more frequently than the Grow step to amortize the dataset creation cost
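Putting the two sketches together, the outer loop might look like the following; the iteration counts and the rising threshold schedule are illustrative assumptions, not values from the source.

```python
# Hypothetical outer loop: each (expensive) Grow step is followed by several
# (cheaper) Improve iterations over the same grown dataset.

def grow_improve_loop(policy, generate, prompts, reward_fn, fine_tune,
                      num_grow_steps=3, thresholds=(0.0, 0.5, 0.7, 0.9)):
    for _ in range(num_grow_steps):
        dataset = grow(policy, generate, prompts)  # costly dataset creation
        for threshold in thresholds:               # repeated Improve steps amortize it
            policy = improve(policy, dataset, reward_fn, fine_tune, threshold)
    return policy
```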