Awesome RLHF

Survey

Papers

  • Transforming and Combining Rewards for Aligning Large Language Models, arXiv, 2402.00742, arxiv, pdf, cication: -1

    Zihao Wang, Chirag Nagpal, Jonathan Berant, Jacob Eisenstein, Alex D'Amour, Sanmi Koyejo, Victor Veitch

  • Aligning Large Language Models with Counterfactual DPO, arXiv, 2401.09566, arxiv, pdf, cication: -1

    Bradley Butcher

  • WARM: On the Benefits of Weight Averaged Reward Models, arXiv, 2401.12187, arxiv, pdf, cication: -1

    Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, Johan Ferret

  • ReFT: Reasoning with Reinforced Fine-Tuning, arXiv, 2401.08967, arxiv, pdf, cication: -1

    Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, Hang Li

  • Self-Rewarding Language Models, arXiv, 2401.10020, arxiv, pdf, cication: -1

    Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, Jason Weston

  • Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation, arXiv, 2401.08417, arxiv, pdf, cication: -1

    Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, Young Jin Kim

  • Secrets of RLHF in Large Language Models Part II: Reward Modeling, arXiv, 2401.06080, arxiv, pdf, cication: -1

    Binghai Wang, Rui Zheng, Lu Chen, Yan Liu, Shihan Dou, Caishuang Huang, Wei Shen, Senjie Jin, Enyu Zhou, Chenyu Shi

    · (jiqizhixin)

  • ICE-GRT: Instruction Context Enhancement by Generative Reinforcement based Transformers, arXiv, 2401.02072, arxiv, pdf, cication: -1

    Chen Zheng, Ke Sun, Da Tang, Yukun Ma, Yuyu Zhang, Chenguang Xi, Xun Zhou

  • InstructVideo: Instructing Video Diffusion Models with Human Feedback, arXiv, 2312.12490, arxiv, pdf, cication: -1

    Hangjie Yuan, Shiwei Zhang, Xiang Wang, Yujie Wei, Tao Feng, Yining Pan, Yingya Zhang, Ziwei Liu, Samuel Albanie, Dong Ni

  • Silkie: Preference Distillation for Large Visual Language Models, arXiv, 2312.10665, arxiv, pdf, cication: -1

    Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, Lingpeng Kong

  • Align on the Fly: Adapting Chatbot Behavior to Established Norms, arXiv, 2312.15907, arxiv, pdf, cication: -1

    Chunpu Xu, Steffi Chern, Ethan Chern, Ge Zhang, Zekun Wang, Ruibo Liu, Jing Li, Jie Fu, Pengfei Liu · (jiqizhixin) · (OPO - GAIR-NLP) Star · (gair-nlp.github)

  • Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking, arXiv, 2312.09244, arxiv, pdf, cication: -1

    Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alex D'Amour, DJ Dvijotham, Adam Fisch, Katherine Heller, Stephen Pfohl, Deepak Ramachandran

  • Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models, arXiv, 2312.06585, arxiv, pdf, cication: -1

    Avi Singh, John D. Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Peter J. Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi

  • HALOs - ContextualAI Star

    Human-Centered Loss Functions (HALOs) · (HALOs - ContextualAI) Star

  • Axiomatic Preference Modeling for Longform Question Answering, arXiv, 2312.02206, arxiv, pdf, cication: -1

    Corby Rosset, Guoqing Zheng, Victor Dibia, Ahmed Awadallah, Paul Bennett · (huggingface)

  • Nash Learning from Human Feedback, arXiv, 2312.00886, arxiv, pdf, cication: -1

    Rémi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Andrea Michi

  • RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback, arXiv, 2312.00849, arxiv, pdf, cication: -1

    Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun · (RLHF-V - RLHF-V) Star

  • Starling-7B: Increasing LLM Helpfulness & Harmlessness with RLAIF

  • Adversarial Preference Optimization, arXiv, 2311.08045, arxiv, pdf, cication: -1

    Pengyu Cheng, Yifan Yang, Jian Li, Yong Dai, Nan Du

    · (mp.weixin.qq)

  • Diffusion Model Alignment Using Direct Preference Optimization, arXiv, 2311.12908, arxiv, pdf, cication: -1

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, Nikhil Naik

  • Black-Box Prompt Optimization: Aligning Large Language Models without Model Training, arXiv, 2311.04155, arxiv, pdf, cication: -1

    Jiale Cheng, Xiao Liu, Kehan Zheng, Pei Ke, Hongning Wang, Yuxiao Dong, Jie Tang, Minlie Huang · (bpo - thu-coai) Star

  • Towards Understanding Sycophancy in Language Models, arXiv, 2310.13548, arxiv, pdf, cication: -1

    Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston · (jiqizhixin)

  • Contrastive Preference Learning: Learning from Human Feedback without RL, arXiv, 2310.13639, arxiv, pdf, cication: -1

    Joey Hejna, Rafael Rafailov, Harshit Sikchi, Chelsea Finn, Scott Niekum, W. Bradley Knox, Dorsa Sadigh · (jiqizhixin)

  • Don't throw away your value model! Making PPO even better via Value-Guided Monte-Carlo Tree Search decoding, arXiv, 2309.15028, arxiv, pdf, cication: 1

    Jiacheng Liu, Andrew Cohen, Ramakanth Pasunuru, Yejin Choi, Hannaneh Hajishirzi, Asli Celikyilmaz · (jiqizhixin)

  • The N Implementation Details of RLHF with PPO

  • Specific versus General Principles for Constitutional AI, arXiv, 2310.13798, arxiv, pdf, cication: 1

    Sandipan Kundu, Yuntao Bai, Saurav Kadavath, Amanda Askell, Andrew Callahan, Anna Chen, Anna Goldie, Avital Balwit, Azalia Mirhoseini, Brayden McLean

  • A General Theoretical Paradigm to Understand Learning from Human Preferences, arXiv, 2310.12036, arxiv, pdf, cication: 1

    Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos

  • Tuna: Instruction Tuning using Feedback from Large Language Models, arXiv, 2310.13385, arxiv, pdf, cication: -1

    Haoran Li, Yiran Liu, Xingxing Zhang, Wei Lu, Furu Wei

  • Safe RLHF: Safe Reinforcement Learning from Human Feedback, arXiv, 2310.12773, arxiv, pdf, cication: 1

    Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang

  • ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models, arXiv, 2310.10505, arxiv, pdf, cication: -1

    Ziniu Li, Tian Xu, Yushun Zhang, Yang Yu, Ruoyu Sun, Zhi-Quan Luo · (jiqizhixin)

  • Rethinking the Role of PPO in RLHF – The Berkeley Artificial Intelligence Research Blog

  • Reinforcement Learning in the Era of LLMs: What is Essential? What is needed? An RL Perspective on RLHF, Prompting, and Beyond, arXiv, 2310.06147, arxiv, pdf, cication: -1

    Hao Sun

  • A Long Way to Go: Investigating Length Correlations in RLHF, arXiv, 2310.03716, arxiv, pdf, cication: 3

    Prasann Singhal, Tanya Goyal, Jiacheng Xu, Greg Durrett

  • Aligning Large Multimodal Models with Factually Augmented RLHF, arXiv, 2309.14525, arxiv, pdf, cication: 4

    Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang

  • Stabilizing RLHF through Advantage Model and Selective Rehearsal, arXiv, 2309.10202, arxiv, pdf, cication: 1

    Baolin Peng, Linfeng Song, Ye Tian, Lifeng Jin, Haitao Mi, Dong Yu

  • Statistical Rejection Sampling Improves Preference Optimization, arXiv, 2309.06657, arxiv, pdf, cication: -1

    Tianqi Liu, Yao Zhao, Rishabh Joshi, Misha Khalman, Mohammad Saleh, Peter J. Liu, Jialu Liu

  • Efficient RLHF: Reducing the Memory Usage of PPO, arXiv, 2309.00754, arxiv, pdf, cication: 1

    Michael Santacroce, Yadong Lu, Han Yu, Yuanzhi Li, Yelong Shen

  • RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback, arXiv, 2309.00267, arxiv, pdf, cication: 24

    Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, Abhinav Rastogi · (mp.weixin.qq)

  • Reinforced Self-Training (ReST) for Language Modeling, arXiv, 2308.08998, arxiv, pdf, cication: 12

    Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu · (jiqizhixin)

  • DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales, arXiv, 2308.01320, arxiv, pdf, cication: 4

    Zhewei Yao, Reza Yazdani Aminabadi, Olatunji Ruwase, Samyam Rajbhandari, Xiaoxia Wu, Ammar Ahmad Awan, Jeff Rasley, Minjia Zhang, Conglong Li, Connor Holmes

  • Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback, arXiv, 2307.15217, arxiv, pdf, cication: 36

    Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire · (jiqizhixin)

  • ICML '23 Tutorial on Reinforcement Learning from Human Feedback

    · (openlmlab.github) · (mp.weixin.qq)

  • Fine-Tuning Language Models with Advantage-Induced Policy Alignment, arXiv, 2306.02231, arxiv, pdf, cication: 5

    Banghua Zhu, Hiteshi Sharma, Felipe Vieira Frujeri, Shi Dong, Chenguang Zhu, Michael I. Jordan, Jiantao Jiao

  • System-Level Natural Language Feedback, arXiv, 2306.13588, arxiv, pdf, cication: 1

    Weizhe Yuan, Kyunghyun Cho, Jason Weston

  • Fine-Grained Human Feedback Gives Better Rewards for Language Model Training, arXiv, 2306.01693, arxiv, pdf, cication: 7

    Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, Hannaneh Hajishirzi · (finegrainedrlhf.github) · (qbitai)

  • Direct Preference Optimization: Your Language Model is Secretly a Reward Model, arXiv, 2305.18290, arxiv, pdf, cication: -1

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn · (a minimal loss sketch follows at the end of this list)

  • Let's Verify Step by Step, arXiv, 2305.20050, arxiv, pdf, cication: 76

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe

  • Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, arXiv, 2204.05862, arxiv, pdf, cication: 109

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan · (hh-rlhf - anthropics) Star
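
Two objectives recur throughout the papers above: the Bradley-Terry pairwise loss used to train reward models (see the reward-modeling entries) and the DPO objective. As a quick reference, here is a minimal PyTorch sketch of both; the function names, input shapes, and the default β = 0.1 are illustrative assumptions and are not taken from any of the linked codebases.

```python
# Minimal sketch of two losses common in the RLHF literature above.
# Inputs are per-example scalars: a scalar reward per response for the
# reward model, and summed token log-probabilities per response for DPO.
import torch
import torch.nn.functional as F


def reward_model_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: -log sigmoid(r(x, y_w) - r(x, y_l))."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()


def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x)
    beta: float = 0.1,                    # KL-strength hyperparameter (illustrative default)
) -> torch.Tensor:
    """DPO loss: -log sigmoid(beta * ((logp_w - ref_w) - (logp_l - ref_l)))."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()


if __name__ == "__main__":
    # Toy batch of 4 preference pairs with random scores / log-probabilities.
    b = 4
    print(reward_model_loss(torch.randn(b), torch.randn(b)))
    print(dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b)))
```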

Projects

  • PairRM - llm-blender 🤗

  • OpenRLHF - OpenLLMAI Star

    A Ray-based, high-performance RLHF framework (supports 7B models on an RTX 4090 and 34B models on an A100)

  • direct-preference-optimization - eric-mitchell Star

    Reference implementation for DPO (Direct Preference Optimization)

  • trl - huggingface Star

    Train transformer language models with reinforcement learning (a minimal, hedged DPOTrainer example follows this list).

  • tril - cornell-rl Star
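
For the trl entry above, here is a hedged sketch of preference fine-tuning with DPOTrainer. It follows the trl ≈ 0.7 interface (later releases moved most of these options into DPOConfig), and the base model, toy dataset, and hyperparameters are placeholders rather than recommendations from the listed projects.

```python
# Hedged sketch of DPO fine-tuning with Hugging Face trl (trl ~0.7 interface;
# argument names differ in newer versions). Model and data are toy placeholders.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "gpt2"  # placeholder; substitute the model you actually want to align
model = AutoModelForCausalLM.from_pretrained(model_name)
ref_model = AutoModelForCausalLM.from_pretrained(model_name)  # frozen reference policy
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# DPOTrainer expects "prompt" / "chosen" / "rejected" columns; two toy pairs for illustration.
train_dataset = Dataset.from_dict({
    "prompt": ["Explain RLHF in one sentence.", "Write a friendly greeting."],
    "chosen": [
        " RLHF fine-tunes a model against a reward learned from human preference data.",
        " Hello! It is great to see you.",
    ],
    "rejected": [" RLHF is when robots vote.", " Go away."],
})

training_args = TrainingArguments(
    output_dir="dpo-sketch",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    logging_steps=1,
    remove_unused_columns=False,  # DPOTrainer manages its own column handling
    report_to="none",
)

trainer = DPOTrainer(
    model,
    ref_model,
    args=training_args,
    beta=0.1,                 # KL-strength hyperparameter
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    max_length=128,
    max_prompt_length=64,
)
trainer.train()
```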

Other

Extra references

  • awesome-RLHF - opendilab Star

    A curated list of reinforcement learning with human feedback resources (continually updated)