- Transforming and Combining Rewards for Aligning Large Language Models, arXiv, 2402.00742, arxiv, pdf, cication: -1
  Zihao Wang, Chirag Nagpal, Jonathan Berant, Jacob Eisenstein, Alex D'Amour, Sanmi Koyejo, Victor Veitch
- Aligning Large Language Models with Counterfactual DPO, arXiv, 2401.09566, arxiv, pdf, cication: -1
  Bradley Butcher
- WARM: On the Benefits of Weight Averaged Reward Models, arXiv, 2401.12187, arxiv, pdf, cication: -1
  Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, Johan Ferret
- ReFT: Reasoning with Reinforced Fine-Tuning, arXiv, 2401.08967, arxiv, pdf, cication: -1
  Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, Hang Li
- Self-Rewarding Language Models, arXiv, 2401.10020, arxiv, pdf, cication: -1
  Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, Jason Weston
- Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation, arXiv, 2401.08417, arxiv, pdf, cication: -1
  Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, Young Jin Kim
- Secrets of RLHF in Large Language Models Part II: Reward Modeling, arXiv, 2401.06080, arxiv, pdf, cication: -1
  Binghai Wang, Rui Zheng, Lu Chen, Yan Liu, Shihan Dou, Caishuang Huang, Wei Shen, Senjie Jin, Enyu Zhou, Chenyu Shi · (jiqizhixin)
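The reward-modeling entries above train a scalar reward model on pairwise preferences with a Bradley-Terry style loss. A minimal illustrative PyTorch sketch of that objective (not code from any of the listed papers; `reward_model`, `chosen_ids`, and `rejected_ids` are placeholder names):

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_model, chosen_ids, rejected_ids):
    """Pairwise reward-model loss: -log sigmoid(r(chosen) - r(rejected)).

    reward_model is assumed to map a batch of token ids to one scalar reward
    per sequence; chosen_ids / rejected_ids are illustrative tokenized batches.
    """
    r_chosen = reward_model(chosen_ids)      # shape: (batch,)
    r_rejected = reward_model(rejected_ids)  # shape: (batch,)
    # Push the preferred response's reward above the dispreferred one's.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```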
- ICE-GRT: Instruction Context Enhancement by Generative Reinforcement based Transformers, arXiv, 2401.02072, arxiv, pdf, cication: -1
  Chen Zheng, Ke Sun, Da Tang, Yukun Ma, Yuyu Zhang, Chenguang Xi, Xun Zhou
- InstructVideo: Instructing Video Diffusion Models with Human Feedback, arXiv, 2312.12490, arxiv, pdf, cication: -1
  Hangjie Yuan, Shiwei Zhang, Xiang Wang, Yujie Wei, Tao Feng, Yining Pan, Yingya Zhang, Ziwei Liu, Samuel Albanie, Dong Ni
- Silkie: Preference Distillation for Large Visual Language Models, arXiv, 2312.10665, arxiv, pdf, cication: -1
  Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, Lingpeng Kong
- Align on the Fly: Adapting Chatbot Behavior to Established Norms, arXiv, 2312.15907, arxiv, pdf, cication: -1
  Chunpu Xu, Steffi Chern, Ethan Chern, Ge Zhang, Zekun Wang, Ruibo Liu, Jing Li, Jie Fu, Pengfei Liu · (jiqizhixin) · (OPO - GAIR-NLP) · (gair-nlp.github)
- Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking, arXiv, 2312.09244, arxiv, pdf, cication: -1
  Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alex D'Amour, DJ Dvijotham, Adam Fisch, Katherine Heller, Stephen Pfohl, Deepak Ramachandran
- Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models, arXiv, 2312.06585, arxiv, pdf, cication: -1
  Avi Singh, John D. Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Peter J. Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi
- HALOs - ContextualAI
  Human-Centered Loss Functions (HALOs) · (HALOs - ContextualAI)
- Axiomatic Preference Modeling for Longform Question Answering, arXiv, 2312.02206, arxiv, pdf, cication: -1
  Corby Rosset, Guoqing Zheng, Victor Dibia, Ahmed Awadallah, Paul Bennett · (huggingface)
- Nash Learning from Human Feedback, arXiv, 2312.00886, arxiv, pdf, cication: -1
  Rémi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Andrea Michi
- RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback, arXiv, 2312.00849, arxiv, pdf, cication: -1
  Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun · (RLHF-V - RLHF-V)
- Starling-7B: Increasing LLM Helpfulness & Harmlessness with RLAIF
- Adversarial Preference Optimization, arXiv, 2311.08045, arxiv, pdf, cication: -1
  Pengyu Cheng, Yifan Yang, Jian Li, Yong Dai, Nan Du · (mp.weixin.qq)
- Diffusion Model Alignment Using Direct Preference Optimization, arXiv, 2311.12908, arxiv, pdf, cication: -1
  Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, Nikhil Naik
- Black-Box Prompt Optimization: Aligning Large Language Models without Model Training, arXiv, 2311.04155, arxiv, pdf, cication: -1
  Jiale Cheng, Xiao Liu, Kehan Zheng, Pei Ke, Hongning Wang, Yuxiao Dong, Jie Tang, Minlie Huang · (bpo - thu-coai)
- Towards Understanding Sycophancy in Language Models, arXiv, 2310.13548, arxiv, pdf, cication: -1
  Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston · (jiqizhixin)
- Contrastive Preference Learning: Learning from Human Feedback without RL, arXiv, 2310.13639, arxiv, pdf, cication: -1
  Joey Hejna, Rafael Rafailov, Harshit Sikchi, Chelsea Finn, Scott Niekum, W. Bradley Knox, Dorsa Sadigh · (jiqizhixin)
- Don't throw away your value model! Making PPO even better via Value-Guided Monte-Carlo Tree Search decoding, arXiv, 2309.15028, arxiv, pdf, cication: 1
  Jiacheng Liu, Andrew Cohen, Ramakanth Pasunuru, Yejin Choi, Hannaneh Hajishirzi, Asli Celikyilmaz · (jiqizhixin)
- Specific versus General Principles for Constitutional AI, arXiv, 2310.13798, arxiv, pdf, cication: 1
  Sandipan Kundu, Yuntao Bai, Saurav Kadavath, Amanda Askell, Andrew Callahan, Anna Chen, Anna Goldie, Avital Balwit, Azalia Mirhoseini, Brayden McLean
- A General Theoretical Paradigm to Understand Learning from Human Preferences, arXiv, 2310.12036, arxiv, pdf, cication: 1
  Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos
- Tuna: Instruction Tuning using Feedback from Large Language Models, arXiv, 2310.13385, arxiv, pdf, cication: -1
  Haoran Li, Yiran Liu, Xingxing Zhang, Wei Lu, Furu Wei
- Safe RLHF: Safe Reinforcement Learning from Human Feedback, arXiv, 2310.12773, arxiv, pdf, cication: 1
  Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang
- ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models, arXiv, 2310.10505, arxiv, pdf, cication: -1
  Ziniu Li, Tian Xu, Yushun Zhang, Yang Yu, Ruoyu Sun, Zhi-Quan Luo · (jiqizhixin)
- Rethinking the Role of PPO in RLHF – The Berkeley Artificial Intelligence Research Blog
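Many of the PPO-based RLHF entries in this list share the same reward shaping: the reward-model score is granted at the end of the response while every token pays a KL penalty against a frozen reference model. A minimal illustrative PyTorch sketch of that standard setup (variable names are placeholders, not taken from any listed work):

```python
import torch

def kl_shaped_rewards(policy_logprobs, ref_logprobs, rm_score, beta=0.1):
    """Per-token rewards for PPO-based RLHF.

    policy_logprobs / ref_logprobs: (seq_len,) log-probs of the sampled
    response tokens under the policy and the frozen reference model.
    rm_score: scalar reward-model score for the full response.
    """
    # Penalize per-token drift from the reference model (log-ratio KL estimate).
    rewards = -beta * (policy_logprobs - ref_logprobs)
    # Add the sequence-level reward-model score at the final token.
    rewards[-1] += rm_score
    return rewards
```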
- Reinforcement Learning in the Era of LLMs: What is Essential? What is needed? An RL Perspective on RLHF, Prompting, and Beyond, arXiv, 2310.06147, arxiv, pdf, cication: -1
  Hao Sun
- A Long Way to Go: Investigating Length Correlations in RLHF, arXiv, 2310.03716, arxiv, pdf, cication: 3
  Prasann Singhal, Tanya Goyal, Jiacheng Xu, Greg Durrett
- Aligning Large Multimodal Models with Factually Augmented RLHF, arXiv, 2309.14525, arxiv, pdf, cication: 4
  Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang
- Stabilizing RLHF through Advantage Model and Selective Rehearsal, arXiv, 2309.10202, arxiv, pdf, cication: 1
  Baolin Peng, Linfeng Song, Ye Tian, Lifeng Jin, Haitao Mi, Dong Yu
- Statistical Rejection Sampling Improves Preference Optimization, arXiv, 2309.06657, arxiv, pdf, cication: -1
  Tianqi Liu, Yao Zhao, Rishabh Joshi, Misha Khalman, Mohammad Saleh, Peter J. Liu, Jialu Liu
- Efficient RLHF: Reducing the Memory Usage of PPO, arXiv, 2309.00754, arxiv, pdf, cication: 1
  Michael Santacroce, Yadong Lu, Han Yu, Yuanzhi Li, Yelong Shen
- RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback, arXiv, 2309.00267, arxiv, pdf, cication: 24
  Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, Abhinav Rastogi · (mp.weixin.qq)
- Reinforced Self-Training (ReST) for Language Modeling, arXiv, 2308.08998, arxiv, pdf, cication: 12
  Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu · (jiqizhixin)
- DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales, arXiv, 2308.01320, arxiv, pdf, cication: 4
  Zhewei Yao, Reza Yazdani Aminabadi, Olatunji Ruwase, Samyam Rajbhandari, Xiaoxia Wu, Ammar Ahmad Awan, Jeff Rasley, Minjia Zhang, Conglong Li, Connor Holmes
- Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback, arXiv, 2307.15217, arxiv, pdf, cication: 36
  Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire · (jiqizhixin)
- ICML '23 Tutorial on Reinforcement Learning from Human Feedback · (openlmlab.github) · (mp.weixin.qq)
- Fine-Tuning Language Models with Advantage-Induced Policy Alignment, arXiv, 2306.02231, arxiv, pdf, cication: 5
  Banghua Zhu, Hiteshi Sharma, Felipe Vieira Frujeri, Shi Dong, Chenguang Zhu, Michael I. Jordan, Jiantao Jiao
- System-Level Natural Language Feedback, arXiv, 2306.13588, arxiv, pdf, cication: 1
  Weizhe Yuan, Kyunghyun Cho, Jason Weston
- Fine-Grained Human Feedback Gives Better Rewards for Language Model Training, arXiv, 2306.01693, arxiv, pdf, cication: 7
  Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, Hannaneh Hajishirzi · (finegrainedrlhf.github) · (qbitai)
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model, arXiv, 2305.18290, arxiv, pdf, cication: -1
  Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn
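The DPO objective from the entry above fits in a few lines; here is a minimal illustrative PyTorch sketch over summed per-response log-probabilities (tensor names are placeholders, not the authors' reference implementation):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    Each input is a (batch,) tensor of summed token log-probs for the chosen
    or rejected response under the policy or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Implicit reward margin; no explicit reward model or RL loop is needed.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```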
- Let's Verify Step by Step, arXiv, 2305.20050, arxiv, pdf, cication: 76
  Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, arXiv, 2204.05862, arxiv, pdf, cication: 109
  Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan · (hh-rlhf - anthropics)
- PairRM - llm-blender 🤗
- OpenRLHF - OpenLLMAI
  A Ray-based high-performance RLHF framework (for 7B on RTX4090 and 34B on A100)
- direct-preference-optimization - eric-mitchell
  Reference implementation for DPO (Direct Preference Optimization)
- trl - huggingface
  Train transformer language models with reinforcement learning.
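As a quick illustration of how trl is commonly used for preference tuning, below is a hedged sketch of DPO training with `DPOTrainer`; exact argument and config names vary across trl releases, and the model and dataset identifiers are placeholders:

```python
# Sketch only: argument names differ across trl versions; identifiers are placeholders.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "your-org/your-sft-model"  # placeholder: an SFT checkpoint to align
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference data with "prompt", "chosen", and "rejected" columns (placeholder dataset).
dataset = load_dataset("your-org/your-preference-data", split="train")

config = DPOConfig(output_dir="dpo-output", beta=0.1, per_device_train_batch_size=2)
trainer = DPOTrainer(model=model, args=config, train_dataset=dataset, tokenizer=tokenizer)
trainer.train()
```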
- tril - cornell-rl
- Preference Tuning LLMs with Direct Preference Optimization Methods
- Reinforcement Learning from Human Feedback - DeepLearning.AI
- LLM Training: RLHF and Its Alternatives · (mp.weixin.qq)
- awesome-RLHF - opendilab
  A curated list of reinforcement learning with human feedback resources (continually updated)