https://arxiv.org/abs/2305.18290

Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn)

Oh... a very interesting approach. They combine the RL objective with the Bradley-Terry preference model (https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model) to derive an objective that requires neither an explicit reward function nor reinforcement learning. The result is that you just train directly on the preference data.
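
A minimal sketch of what that objective looks like in practice (the function, variable names, and the beta value below are illustrative, not the paper's code): the policy's implicit reward is β·log(π_θ(y|x)/π_ref(y|x)), and the loss is the negative Bradley-Terry log-likelihood that the preferred response beats the rejected one.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of the DPO loss for a batch of preference pairs.

    Each argument is the summed log-probability of the chosen / rejected
    response under the trainable policy or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Bradley-Terry preference probability with the implicit reward
    # r(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x))
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```

So the whole pipeline reduces to supervised training on (prompt, chosen, rejected) triples, with the reference model only used to compute log-probabilities.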

#alignment