Direct Preference Optimization

- Learns from human preferences without RL.
- Uses far less memory than the PPO-based architecture in RLHF.
- Given a set of preference pairs, it uses the model's logits to fold the user's pairwise preferences directly into training.
- Related: sDPO from Upstage.

Implementation: dpo-from-scratch.ipynb (rasbt)

Links:
- Fine-tune a Mistral-7b model with Direct Preference Optimization (towardsdatascience.com): https://towardsdatascience.com/fine-tune-a-mistral-7b-model-with-direct-preference-optimization-708042745aac
- DPO paper: https://arxiv.org/pdf/2305.18290.pdf
- Paper review: Direct Preference Optimization: Your Language Model is Secretly a Reward Model: https://junbuml.ee/dpo
- Self-Rewarding Language Models: https://huggingface.co/papers/2401.10020

DPO Datasets:
- argilla/OpenHermesPreferences: https://huggingface.co/datasets/argilla/OpenHermesPreferences
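The "pairwise preferences via logits" point can be made concrete with a minimal PyTorch sketch of the DPO objective from the paper (Eq. 7): the policy is pushed to widen the log-likelihood margin between the chosen and rejected response, measured relative to a frozen reference model. Function and argument names here are illustrative, not from any particular library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Minimal DPO loss: -log sigmoid(beta * (margin_chosen - margin_rejected)).

    Each argument is the summed log-probability of the chosen/rejected
    response under the policy or the frozen reference model, shape (batch,).
    """
    # Implicit reward of each response: log-ratio of policy to reference.
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps

    # Logistic loss on the reward margin; beta controls KL strength.
    logits = beta * (chosen_margin - rejected_margin)
    return -F.logsigmoid(logits).mean()

def sequence_logprob(logits, labels, response_mask):
    """Sum per-token log-probs over the response tokens of a sequence.

    logits: (batch, seq, vocab); labels: (batch, seq);
    response_mask: 1.0 on response tokens, 0.0 on prompt/padding.
    """
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = torch.gather(logps, 2, labels.unsqueeze(-1)).squeeze(-1)
    return (token_logps * response_mask).sum(-1)
```

No reward model or PPO rollout is needed: each training step only requires forward passes of the policy and reference model on the preference pair, which is where the memory savings over the RLHF/PPO setup come from.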