- Transforming and Combining Rewards for Aligning Large Language Models, arXiv, 2402.00742, arxiv, pdf, cication: -1
  Zihao Wang, Chirag Nagpal, Jonathan Berant, Jacob Eisenstein, Alex D'Amour, Sanmi Koyejo, Victor Veitch
- Aligning Large Language Models with Counterfactual DPO, arXiv, 2401.09566, arxiv, pdf, cication: -1
  Bradley Butcher
- WARM: On the Benefits of Weight Averaged Reward Models, arXiv, 2401.12187, arxiv, pdf, cication: -1
  Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, Johan Ferret
- ReFT: Reasoning with Reinforced Fine-Tuning, arXiv, 2401.08967, arxiv, pdf, cication: -1
  Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, Hang Li
- Self-Rewarding Language Models, arXiv, 2401.10020, arxiv, pdf, cication: -1
  Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, Jason Weston
- Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation, arXiv, 2401.08417, arxiv, pdf, cication: -1
  Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, Young Jin Kim
- Secrets of RLHF in Large Language Models Part II: Reward Modeling, arXiv, 2401.06080, arxiv, pdf, cication: -1
  Binghai Wang, Rui Zheng, Lu Chen, Yan Liu, Shihan Dou, Caishuang Huang, Wei Shen, Senjie Jin, Enyu Zhou, Chenyu Shi · (jiqizhixin)
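The reward-modeling entries above train a scalar reward model on pairwise preferences with a Bradley-Terry style loss. A minimal illustrative PyTorch sketch of that objective (not code from any of the listed papers; `reward_model`, `chosen_ids`, and `rejected_ids` are placeholder names):

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_model, chosen_ids, rejected_ids):
    """Pairwise reward-model loss: -log sigmoid(r(chosen) - r(rejected)).

    reward_model is assumed to map a batch of token ids to one scalar reward
    per sequence; chosen_ids / rejected_ids are illustrative tokenized batches.
    """
    r_chosen = reward_model(chosen_ids)      # shape: (batch,)
    r_rejected = reward_model(rejected_ids)  # shape: (batch,)
    # Push the preferred response's reward above the dispreferred one's.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```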
- ICE-GRT: Instruction Context Enhancement by Generative Reinforcement based Transformers, arXiv, 2401.02072, arxiv, pdf, cication: -1
  Chen Zheng, Ke Sun, Da Tang, Yukun Ma, Yuyu Zhang, Chenguang Xi, Xun Zhou
- InstructVideo: Instructing Video Diffusion Models with Human Feedback, arXiv, 2312.12490, arxiv, pdf, cication: -1
  Hangjie Yuan, Shiwei Zhang, Xiang Wang, Yujie Wei, Tao Feng, Yining Pan, Yingya Zhang, Ziwei Liu, Samuel Albanie, Dong Ni
- Silkie: Preference Distillation for Large Visual Language Models, arXiv, 2312.10665, arxiv, pdf, cication: -1
  Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, Lingpeng Kong
- Align on the Fly: Adapting Chatbot Behavior to Established Norms, arXiv, 2312.15907, arxiv, pdf, cication: -1
  Chunpu Xu, Steffi Chern, Ethan Chern, Ge Zhang, Zekun Wang, Ruibo Liu, Jing Li, Jie Fu, Pengfei Liu · (jiqizhixin) · (OPO - GAIR-NLP) · (gair-nlp.github)
- Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking, arXiv, 2312.09244, arxiv, pdf, cication: -1
  Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alex D'Amour, DJ Dvijotham, Adam Fisch, Katherine Heller, Stephen Pfohl, Deepak Ramachandran
- Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models, arXiv, 2312.06585, arxiv, pdf, cication: -1
  Avi Singh, John D. Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Peter J. Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi
- HALOs - ContextualAI
  Human-Centered Loss Functions (HALOs) · (HALOs - ContextualAI)
- Axiomatic Preference Modeling for Longform Question Answering, arXiv, 2312.02206, arxiv, pdf, cication: -1
  Corby Rosset, Guoqing Zheng, Victor Dibia, Ahmed Awadallah, Paul Bennett · (huggingface)
- Nash Learning from Human Feedback, arXiv, 2312.00886, arxiv, pdf, cication: -1
  Rémi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Andrea Michi
- RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback, arXiv, 2312.00849, arxiv, pdf, cication: -1
  Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun · (RLHF-V - RLHF-V)
- Starling-7B: Increasing LLM Helpfulness & Harmlessness with RLAIF
- Adversarial Preference Optimization, arXiv, 2311.08045, arxiv, pdf, cication: -1
  Pengyu Cheng, Yifan Yang, Jian Li, Yong Dai, Nan Du · (mp.weixin.qq)
- Diffusion Model Alignment Using Direct Preference Optimization, arXiv, 2311.12908, arxiv, pdf, cication: -1
  Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, Nikhil Naik
- Black-Box Prompt Optimization: Aligning Large Language Models without Model Training, arXiv, 2311.04155, arxiv, pdf, cication: -1
  Jiale Cheng, Xiao Liu, Kehan Zheng, Pei Ke, Hongning Wang, Yuxiao Dong, Jie Tang, Minlie Huang · (bpo - thu-coai)
- Towards Understanding Sycophancy in Language Models, arXiv, 2310.13548, arxiv, pdf, cication: -1
  Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston · (jiqizhixin)
- Contrastive Preference Learning: Learning from Human Feedback without RL, arXiv, 2310.13639, arxiv, pdf, cication: -1
  Joey Hejna, Rafael Rafailov, Harshit Sikchi, Chelsea Finn, Scott Niekum, W. Bradley Knox, Dorsa Sadigh · (jiqizhixin)
- Don't throw away your value model! Making PPO even better via Value-Guided Monte-Carlo Tree Search decoding, arXiv, 2309.15028, arxiv, pdf, cication: 1
  Jiacheng Liu, Andrew Cohen, Ramakanth Pasunuru, Yejin Choi, Hannaneh Hajishirzi, Asli Celikyilmaz · (jiqizhixin)
- Specific versus General Principles for Constitutional AI, arXiv, 2310.13798, arxiv, pdf, cication: 1
  Sandipan Kundu, Yuntao Bai, Saurav Kadavath, Amanda Askell, Andrew Callahan, Anna Chen, Anna Goldie, Avital Balwit, Azalia Mirhoseini, Brayden McLean
- A General Theoretical Paradigm to Understand Learning from Human Preferences, arXiv, 2310.12036, arxiv, pdf, cication: 1
  Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos
- Tuna: Instruction Tuning using Feedback from Large Language Models, arXiv, 2310.13385, arxiv, pdf, cication: -1
  Haoran Li, Yiran Liu, Xingxing Zhang, Wei Lu, Furu Wei
- Safe RLHF: Safe Reinforcement Learning from Human Feedback, arXiv, 2310.12773, arxiv, pdf, cication: 1
  Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang
- ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models, arXiv, 2310.10505, arxiv, pdf, cication: -1
  Ziniu Li, Tian Xu, Yushun Zhang, Yang Yu, Ruoyu Sun, Zhi-Quan Luo · (jiqizhixin)
- Rethinking the Role of PPO in RLHF – The Berkeley Artificial Intelligence Research Blog
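Many of the PPO-based RLHF entries in this list share the same reward shaping: the reward-model score is granted at the end of the response while every token pays a KL penalty against a frozen reference model. A minimal illustrative PyTorch sketch of that standard setup (variable names are placeholders, not taken from any listed work):

```python
import torch

def kl_shaped_rewards(policy_logprobs, ref_logprobs, rm_score, beta=0.1):
    """Per-token rewards for PPO-based RLHF.

    policy_logprobs / ref_logprobs: (seq_len,) log-probs of the sampled
    response tokens under the policy and the frozen reference model.
    rm_score: scalar reward-model score for the full response.
    """
    # Penalize per-token drift from the reference model (log-ratio KL estimate).
    rewards = -beta * (policy_logprobs - ref_logprobs)
    # Add the sequence-level reward-model score at the final token.
    rewards[-1] += rm_score
    return rewards
```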
- Reinforcement Learning in the Era of LLMs: What is Essential? What is needed? An RL Perspective on RLHF, Prompting, and Beyond, arXiv, 2310.06147, arxiv, pdf, cication: -1
  Hao Sun
- A Long Way to Go: Investigating Length Correlations in RLHF, arXiv, 2310.03716, arxiv, pdf, cication: 3
  Prasann Singhal, Tanya Goyal, Jiacheng Xu, Greg Durrett
- Aligning Large Multimodal Models with Factually Augmented RLHF, arXiv, 2309.14525, arxiv, pdf, cication: 4
  Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang
- Stabilizing RLHF through Advantage Model and Selective Rehearsal, arXiv, 2309.10202, arxiv, pdf, cication: 1
  Baolin Peng, Linfeng Song, Ye Tian, Lifeng Jin, Haitao Mi, Dong Yu
- Statistical Rejection Sampling Improves Preference Optimization, arXiv, 2309.06657, arxiv, pdf, cication: -1
  Tianqi Liu, Yao Zhao, Rishabh Joshi, Misha Khalman, Mohammad Saleh, Peter J. Liu, Jialu Liu
- Efficient RLHF: Reducing the Memory Usage of PPO, arXiv, 2309.00754, arxiv, pdf, cication: 1
  Michael Santacroce, Yadong Lu, Han Yu, Yuanzhi Li, Yelong Shen
- RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback, arXiv, 2309.00267, arxiv, pdf, cication: 24
  Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, Abhinav Rastogi · (mp.weixin.qq)
- Reinforced Self-Training (ReST) for Language Modeling, arXiv, 2308.08998, arxiv, pdf, cication: 12
  Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu · (jiqizhixin)
- DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales, arXiv, 2308.01320, arxiv, pdf, cication: 4
  Zhewei Yao, Reza Yazdani Aminabadi, Olatunji Ruwase, Samyam Rajbhandari, Xiaoxia Wu, Ammar Ahmad Awan, Jeff Rasley, Minjia Zhang, Conglong Li, Connor Holmes
- Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback, arXiv, 2307.15217, arxiv, pdf, cication: 36
  Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire · (jiqizhixin)
- ICML '23 Tutorial on Reinforcement Learning from Human Feedback · (openlmlab.github) · (mp.weixin.qq)
- Fine-Tuning Language Models with Advantage-Induced Policy Alignment, arXiv, 2306.02231, arxiv, pdf, cication: 5
  Banghua Zhu, Hiteshi Sharma, Felipe Vieira Frujeri, Shi Dong, Chenguang Zhu, Michael I. Jordan, Jiantao Jiao
- System-Level Natural Language Feedback, arXiv, 2306.13588, arxiv, pdf, cication: 1
  Weizhe Yuan, Kyunghyun Cho, Jason Weston
- Fine-Grained Human Feedback Gives Better Rewards for Language Model Training, arXiv, 2306.01693, arxiv, pdf, cication: 7
  Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, Hannaneh Hajishirzi · (finegrainedrlhf.github) · (qbitai)
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model, arXiv, 2305.18290, arxiv, pdf, cication: -1
  Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn
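The DPO objective from the entry above fits in a few lines; here is a minimal illustrative PyTorch sketch over summed per-response log-probabilities (tensor names are placeholders, not the authors' reference implementation):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    Each input is a (batch,) tensor of summed token log-probs for the chosen
    or rejected response under the policy or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Implicit reward margin; no explicit reward model or RL loop is needed.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```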
- Let's Verify Step by Step, arXiv, 2305.20050, arxiv, pdf, cication: 76
  Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, arXiv, 2204.05862, arxiv, pdf, cication: 109
  Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan · (hh-rlhf - anthropics)
- PairRM - llm-blender 🤗
- OpenRLHF - OpenLLMAI
  A Ray-based high-performance RLHF framework (for 7B on RTX4090 and 34B on A100)
- direct-preference-optimization - eric-mitchell
  Reference implementation for DPO (Direct Preference Optimization)
- trl - huggingface
  Train transformer language models with reinforcement learning.
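As a quick illustration of how trl is commonly used for preference tuning, below is a hedged sketch of DPO training with `DPOTrainer`; exact argument and config names vary across trl releases, and the model and dataset identifiers are placeholders:

```python
# Sketch only: argument names differ across trl versions; identifiers are placeholders.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "your-org/your-sft-model"  # placeholder: an SFT checkpoint to align
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference data with "prompt", "chosen", and "rejected" columns (placeholder dataset).
dataset = load_dataset("your-org/your-preference-data", split="train")

config = DPOConfig(output_dir="dpo-output", beta=0.1, per_device_train_batch_size=2)
trainer = DPOTrainer(model=model, args=config, train_dataset=dataset, tokenizer=tokenizer)
trainer.train()
```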
- tril - cornell-rl
- Preference Tuning LLMs with Direct Preference Optimization Methods
- Reinforcement Learning from Human Feedback - DeepLearning.AI
- LLM Training: RLHF and Its Alternatives · (mp.weixin.qq)
- awesome-RLHF - opendilab
  A curated list of reinforcement learning with human feedback resources (continually updated)