This repository contains a minimalist implementation of Proximal Policy Optimization (PPO), inspired by Reinforcement Learning from Human Feedback (RLHF). Instead of relying on human feedback, we substitute machine feedback from a BERT model finetuned for sentiment analysis.
The main objective is to finetune GPT-2 to generate positive movie reviews based on the IMDB dataset. The GPT-2 model is rewarded for generating positive sentiment continuations by leveraging feedback from a BERT classifier. The project is structured to be clean, interpretable, and suitable for educational purposes.
- Supervised Finetuning: Finetune GPT-2 on the IMDB dataset to generate movie reviews.
- Sentiment Reward: Use a finetuned BERT model to evaluate generated reviews and provide rewards based on sentiment (positive/negative).
- PPO Optimization: Train the model using Proximal Policy Optimization (PPO) to maximize the reward signal (positive sentiment) and prevent divergence from the base GPT-2 model.
This project involves three major steps:
We begin by fine-tuning GPT-2 on the IMDB movie reviews dataset. This step enables the model to generate both positive and negative reviews. Once finetuned, we will evaluate the model’s ability to generate positive continuations, even though it was trained on both types of sentiment.
Steps to Fine-Tune GPT-2:
- Install dependencies:

```bash
pip install -r requirements.txt
```
- Run the fine-tuning script:

```bash
python finetune_gpt2.py
```
The fine-tuned model will be saved under the finetune_checkpoints/ directory.
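For orientation, here is a minimal sketch of what a causal-LM fine-tuning run on IMDB can look like with the Hugging Face datasets and transformers libraries. It is illustrative only: the actual hyperparameters, preprocessing, and checkpoint layout are defined in finetune_gpt2.py.

```python
# Illustrative sketch of fine-tuning GPT-2 on IMDB (not the exact finetune_gpt2.py).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# IMDB train split: raw review text (labels are dropped for language modeling).
dataset = load_dataset("imdb", split="train").map(
    tokenize, batched=True, remove_columns=["text", "label"]
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="finetune_checkpoints",          # checkpoint directory mentioned above
        per_device_train_batch_size=4,              # hyperparameters here are placeholders
        num_train_epochs=1,
        learning_rate=5e-5,
        save_strategy="epoch",
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```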
The reward model is based on a BERT classifier that has been finetuned for sentiment analysis. For each query-response pair, the reward model provides a scalar reward reflecting the sentiment of the response (positive or negative).
We use the pretrained lvwerra/bert-imdb model from the Hugging Face Hub to evaluate the sentiment of the generated reviews.
Note: A future update will add finetuning of the BERT reward model from scratch. For now, the existing pretrained classifier is sufficient for PPO reward evaluation.
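As a concrete illustration, the snippet below scores a query-response pair with lvwerra/bert-imdb and uses the positive-class logit as the scalar reward. This is a hedged sketch rather than the exact code in this repository; in particular, it assumes label index 1 corresponds to the positive class (check the model's id2label mapping).

```python
# Sketch: turn the BERT sentiment classifier into a scalar reward for a query-response pair.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

reward_tokenizer = AutoTokenizer.from_pretrained("lvwerra/bert-imdb")
reward_model = AutoModelForSequenceClassification.from_pretrained("lvwerra/bert-imdb")
reward_model.eval()

def sentiment_reward(query: str, response: str) -> float:
    """Return a scalar reward for a query-response pair (higher = more positive)."""
    inputs = reward_tokenizer(query + response, return_tensors="pt",
                              truncation=True, max_length=512)
    with torch.no_grad():
        logits = reward_model(**inputs).logits
    # Assumption: index 1 is the "positive" class; verify via reward_model.config.id2label.
    return logits[0, 1].item()

print(sentiment_reward("This movie was", " absolutely wonderful from start to finish."))
```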
In the optimization step, query-response pairs are used to compute per-token log-probabilities under two models: the policy being trained and a frozen reference model (the GPT-2 checkpoint from before PPO training). The Kullback–Leibler (KL) divergence between the two acts as a regularization term that prevents the policy from drifting too far from the base language model during training.
The core optimization is carried out with the PPO algorithm, which updates the policy to maximize the reward signal while respecting the KL constraint. A sketch of both ingredients follows.
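To make the objective concrete, the following self-contained sketch shows a per-token KL penalty against the frozen reference model and the PPO clipped surrogate loss. Coefficients, tensor shapes, and the advantage estimate are illustrative placeholders, not the exact values or code used in ppo.py.

```python
# Illustrative sketch of the KL-penalized reward and the PPO clipped surrogate loss.
import torch

def kl_penalized_rewards(logprobs, ref_logprobs, sentiment_reward, kl_coef=0.2):
    """Per-token reward: -kl_coef * (log pi - log pi_ref), with the scalar
    sentiment reward added on the final response token (RLHF-style shaping)."""
    kl = logprobs - ref_logprobs                   # (batch, response_len)
    rewards = -kl_coef * kl
    rewards[:, -1] += sentiment_reward             # BERT reward assigned to the last token
    return rewards

def ppo_clip_loss(logprobs, old_logprobs, advantages, clip_range=0.2):
    """Clipped surrogate objective from the PPO paper (returned as a loss to minimize)."""
    ratio = torch.exp(logprobs - old_logprobs)     # pi_new / pi_old per token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    return -torch.mean(torch.min(unclipped, clipped))

# Toy usage with random tensors, only to show the shapes involved.
B, T = 2, 8
logprobs, old_logprobs, ref_logprobs = torch.randn(B, T), torch.randn(B, T), torch.randn(B, T)
rewards = kl_penalized_rewards(logprobs.detach(), ref_logprobs, sentiment_reward=1.3)
advantages = rewards - rewards.mean()              # stand-in for a proper GAE estimate
loss = ppo_clip_loss(logprobs, old_logprobs.detach(), advantages)
print(loss)
```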
- Once fine-tuning is complete, initiate PPO training:

```bash
python ppo.py
```
- PPO training takes around 3 hours on a single NVIDIA RTX 3080 GPU. Checkpoints are saved under the ppo_checkpoints/ directory (see the sampling sketch below).
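Once a checkpoint exists, you can load it and sample a review, as in the sketch below. The checkpoint path is illustrative; adjust it to however ppo.py lays out ppo_checkpoints/.

```python
# Sketch: sample a review from the PPO-trained policy (checkpoint path is a placeholder).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("ppo_checkpoints")  # or a specific subfolder

prompt = "This movie was"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```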
After PPO training, the model generates noticeably more positive reviews when prompted with movie-related queries. The following example compares the model's responses before and after the PPO training process.
Special Thanks to:
- Andrej Karpathy for his nanoGPT
- Leandro von Werra for his work on TRL