This repository contains a minimalist implementation of Proximal Policy Optimization (PPO), inspired by Reinforcement Learning from Human Feedback (RLHF). Instead of relying on human feedback, we substitute machine feedback from a BERT model finetuned for sentiment analysis.
The main objective is to finetune GPT-2 to generate positive movie reviews based on the IMDB dataset. The GPT-2 model is rewarded for generating positive sentiment continuations by leveraging feedback from a BERT classifier. The project is structured to be clean, interpretable, and suitable for educational purposes.
- Supervised Finetuning: Finetune GPT-2 on the IMDB dataset to generate movie reviews.
- Sentiment Reward: Use a finetuned BERT model to evaluate generated reviews and provide rewards based on sentiment (positive/negative).
- PPO Optimization: Train the model using Proximal Policy Optimization (PPO) to maximize the reward signal (positive sentiment) and prevent divergence from the base GPT-2 model.
This project involves three major steps:
We begin by fine-tuning GPT-2 on the IMDB movie reviews dataset. This step enables the model to generate both positive and negative reviews. Once finetuned, we will evaluate the model’s ability to generate positive continuations, even though it was trained on both types of sentiment.
Steps to Fine-Tune GPT-2:
- Install dependencies:

```bash
pip install -r requirements.txt
```
- Run the fine-tuning script:

```bash
python finetune_gpt2.py
```
The fine-tuned model will be saved under the finetune_checkpoints/ directory.
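For orientation, here is a minimal sketch of what a causal-LM fine-tuning run on IMDB can look like with the Hugging Face datasets and transformers libraries. It is illustrative only: the actual hyperparameters, preprocessing, and checkpoint layout are defined in finetune_gpt2.py.

```python
# Illustrative sketch of fine-tuning GPT-2 on IMDB (not the exact finetune_gpt2.py).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# IMDB train split: raw review text (labels are dropped for language modeling).
dataset = load_dataset("imdb", split="train").map(
    tokenize, batched=True, remove_columns=["text", "label"]
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="finetune_checkpoints",          # checkpoint directory mentioned above
        per_device_train_batch_size=4,              # hyperparameters here are placeholders
        num_train_epochs=1,
        learning_rate=5e-5,
        save_strategy="epoch",
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```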
The reward model is based on a BERT classifier that has been finetuned for sentiment analysis. For each query-response pair, the reward model provides a scalar reward reflecting the sentiment of the response (positive or negative).
We use the pretrained lvwerra/bert-imdb model from the Hugging Face Hub to evaluate the sentiment of the generated reviews.
Note: A future update will add finetuning of the BERT reward model from scratch. For now, the existing pretrained classifier is sufficient for PPO reward evaluation.
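As a concrete illustration, the snippet below scores a query-response pair with lvwerra/bert-imdb and uses the positive-class logit as the scalar reward. This is a hedged sketch rather than the exact code in this repository; in particular, it assumes label index 1 corresponds to the positive class (check the model's id2label mapping).

```python
# Sketch: turn the BERT sentiment classifier into a scalar reward for a query-response pair.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

reward_tokenizer = AutoTokenizer.from_pretrained("lvwerra/bert-imdb")
reward_model = AutoModelForSequenceClassification.from_pretrained("lvwerra/bert-imdb")
reward_model.eval()

def sentiment_reward(query: str, response: str) -> float:
    """Return a scalar reward for a query-response pair (higher = more positive)."""
    inputs = reward_tokenizer(query + response, return_tensors="pt",
                              truncation=True, max_length=512)
    with torch.no_grad():
        logits = reward_model(**inputs).logits
    # Assumption: index 1 is the "positive" class; verify via reward_model.config.id2label.
    return logits[0, 1].item()

print(sentiment_reward("This movie was", " absolutely wonderful from start to finish."))
```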
In the optimization step, query-response pairs are used to compute per-token log-probabilities under two models: the policy being trained and a frozen reference model (the GPT-2 checkpoint from before PPO training). The Kullback–Leibler (KL) divergence between the two acts as a regularization term that prevents the policy from drifting too far from the base language model during training.
The core optimization is carried out with the PPO algorithm, which updates the policy to maximize the reward signal while respecting the KL constraint. A sketch of both ingredients follows.
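To make the objective concrete, the following self-contained sketch shows a per-token KL penalty against the frozen reference model and the PPO clipped surrogate loss. Coefficients, tensor shapes, and the advantage estimate are illustrative placeholders, not the exact values or code used in ppo.py.

```python
# Illustrative sketch of the KL-penalized reward and the PPO clipped surrogate loss.
import torch

def kl_penalized_rewards(logprobs, ref_logprobs, sentiment_reward, kl_coef=0.2):
    """Per-token reward: -kl_coef * (log pi - log pi_ref), with the scalar
    sentiment reward added on the final response token (RLHF-style shaping)."""
    kl = logprobs - ref_logprobs                   # (batch, response_len)
    rewards = -kl_coef * kl
    rewards[:, -1] += sentiment_reward             # BERT reward assigned to the last token
    return rewards

def ppo_clip_loss(logprobs, old_logprobs, advantages, clip_range=0.2):
    """Clipped surrogate objective from the PPO paper (returned as a loss to minimize)."""
    ratio = torch.exp(logprobs - old_logprobs)     # pi_new / pi_old per token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    return -torch.mean(torch.min(unclipped, clipped))

# Toy usage with random tensors, only to show the shapes involved.
B, T = 2, 8
logprobs, old_logprobs, ref_logprobs = torch.randn(B, T), torch.randn(B, T), torch.randn(B, T)
rewards = kl_penalized_rewards(logprobs.detach(), ref_logprobs, sentiment_reward=1.3)
advantages = rewards - rewards.mean()              # stand-in for a proper GAE estimate
loss = ppo_clip_loss(logprobs, old_logprobs.detach(), advantages)
print(loss)
```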
- Once fine-tuning is complete, initiate PPO training:

```bash
python ppo.py
```
- PPO training takes around 3 hours on a single NVIDIA RTX 3080 GPU. Checkpoints are saved under the ppo_checkpoints/ directory (see the sampling sketch below).
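Once a checkpoint exists, you can load it and sample a review, as in the sketch below. The checkpoint path is illustrative; adjust it to however ppo.py lays out ppo_checkpoints/.

```python
# Sketch: sample a review from the PPO-trained policy (checkpoint path is a placeholder).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("ppo_checkpoints")  # or a specific subfolder

prompt = "This movie was"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```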
After PPO training, the model generates noticeably more positive reviews when prompted with movie-related queries. The following example compares the model's responses before and after the PPO training process.
Special Thanks to:
- Andrej Karpathy for his nanoGPT
- Leandro von Werra for his work on TRL