RLHF

Creator: Seonglae Cho
Created: 2023 Apr 30 7:23
Edited: 2024 Dec 20 17:34

Reinforcement learning from human feedback

A reward function is learned from human feedback, and the policy is then updated against it.
https://openai.com/blog/chatgpt
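A minimal sketch of the two stages, under loose assumptions: toy embedding models, random placeholder tensors in place of real preference data, a Bradley-Terry loss for the reward model, and a REINFORCE-style update with a fixed KL coefficient (0.1) standing in for the full PPO pipeline used in practice.

```python
# RLHF sketch (assumptions: toy modules and random data, not a production pipeline).
# Stage 1: fit a reward model on human preference pairs (Bradley-Terry loss).
# Stage 2: update the policy on the learned reward, with a KL penalty that keeps
#          it close to the frozen reference (SFT) policy.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM = 100, 32

class RewardModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, 1)
    def forward(self, tokens):                       # tokens: (batch, seq_len)
        return self.head(self.embed(tokens).mean(1)).squeeze(-1)  # scalar reward per sequence

class Policy(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.out = nn.Linear(DIM, VOCAB)
    def forward(self, tokens):                        # next-token logits
        return self.out(self.embed(tokens).mean(1))

reward_model, policy, ref_policy = RewardModel(), Policy(), Policy()
ref_policy.load_state_dict(policy.state_dict())      # frozen copy as the reference policy

# --- Stage 1: reward model from human feedback (chosen preferred over rejected) ---
chosen   = torch.randint(0, VOCAB, (8, 16))           # placeholder preference data
rejected = torch.randint(0, VOCAB, (8, 16))
rm_opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)
rm_loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
rm_opt.zero_grad(); rm_loss.backward(); rm_opt.step()

# --- Stage 2: KL-regularized policy update against the learned reward ---
prompts = torch.randint(0, VOCAB, (8, 16))
dist = torch.distributions.Categorical(logits=policy(prompts))
actions = dist.sample()                                # sampled next tokens
with torch.no_grad():
    reward = reward_model(torch.cat([prompts, actions[:, None]], dim=1))
    ref_logp = torch.distributions.Categorical(logits=ref_policy(prompts)).log_prob(actions)
logp = dist.log_prob(actions)
advantage = reward - 0.1 * (logp - ref_logp)           # reward minus per-sample KL penalty
pg_loss = -(logp * advantage.detach()).mean()          # REINFORCE-style policy gradient
pi_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
pi_opt.zero_grad(); pg_loss.backward(); pi_opt.step()
```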
 

Limitation

It is complex, and it still cannot address fundamental LM problems such as model size and hallucination.

LLaVA-RLHF

OOD generalization is crucial given the wide range of real-world scenarios in which these models are being used, while output diversity refers to the model’s ability to generate varied outputs and is important for a variety of use cases
RLHF generalizes better than SFT to new inputs, particularly as the distribution shift between train and test becomes larger. However, RLHF significantly reduces output diversity compared to SFT across a variety of measures, implying a tradeoff in current LLM fine-tuning methods between generalization and diversity.

Recommendations