traditional supervised learning methods can be limited by the availability of high-quality labeled data, while reinforcement learning can be computationally expensive and sample-inefficient
unlike online RLHF, the method learns from a fixed (offline) dataset of examples, which makes it more computationally efficient and less prone to reward hacking
The algorithm alternates between training a base policy on the available data and refining the policy using offline RL with the reward function
Grow step: generating a dataset of samples using the current policy (sketched below)
Improve step: refining the policy using the generated dataset and the reward function
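A minimal Python sketch of the Grow step, under the assumption of a hypothetical `generate(policy, prompt)` callable that samples one completion from the current policy; the function and parameter names here are illustrative, not taken from the source.

```python
# Hypothetical sketch of the Grow step: sample completions from the current
# policy to build a fresh dataset. `generate` is an assumed callable that
# draws one completion from `policy` for a given prompt.

def grow(policy, generate, prompts, samples_per_prompt=4):
    """Build a dataset of (prompt, completion) pairs sampled from the policy."""
    dataset = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            completion = generate(policy, prompt)  # sample from the current policy
            dataset.append((prompt, completion))
    return dataset
```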
During the Improve step, the generated samples are filtered using the reward function and the policy is fine-tuned to maximize the expected reward on the filtered dataset
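One simple way to instantiate the Improve step (a sketch, not the source's exact offline RL objective) is to keep only samples whose reward clears a threshold and then fine-tune on the survivors; `reward_fn`, `fine_tune`, and `threshold` are assumed placeholders.

```python
# Hypothetical sketch of one Improve iteration: filter the grown dataset by
# reward, then fine-tune the policy on the high-reward subset. `reward_fn`
# and `fine_tune` are assumed callables supplied by the caller.

def improve(policy, dataset, reward_fn, fine_tune, threshold):
    """Filter by reward threshold, then fine-tune the policy on what remains."""
    filtered = [
        (prompt, completion)
        for prompt, completion in dataset
        if reward_fn(prompt, completion) >= threshold  # keep only high-reward samples
    ]
    return fine_tune(policy, filtered)  # e.g., maximize likelihood on the filtered data
```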
the Improve step is repeated more frequently than the Grow step to amortize the dataset creation cost
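Putting the two sketches together, the outer loop might look like the following; the iteration counts and the rising threshold schedule are illustrative assumptions, not values from the source.

```python
# Hypothetical outer loop: each (expensive) Grow step is followed by several
# (cheaper) Improve iterations over the same grown dataset.

def grow_improve_loop(policy, generate, prompts, reward_fn, fine_tune,
                      num_grow_steps=3, thresholds=(0.0, 0.5, 0.7, 0.9)):
    for _ in range(num_grow_steps):
        dataset = grow(policy, generate, prompts)  # costly dataset creation
        for threshold in thresholds:               # repeated Improve steps amortize it
            policy = improve(policy, dataset, reward_fn, fine_tune, threshold)
    return policy
```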