Migrate rl-skyrl from templates repo to Ray repo #58014
base: master
Changes from all commits
a4844c9
c5d835d
2b87dc3
636436b
46d0dcc
a31b972
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,137 @@ | ||||||
| { | ||||||
| "cells": [ | ||||||
| { | ||||||
| "cell_type": "markdown", | ||||||
| "metadata": {}, | ||||||
| "source": [ | ||||||
| "# Reinforcement Learning for LLMs with SkyRL\n", | ||||||
| "\n", | ||||||
| "**⏱️ Time to complete**: ~40 mins, including the time to train the models\n", | ||||||
| "\n", | ||||||
| "\n", | ||||||
| "This template walks through running [GRPO](https://arxiv.org/pdf/2402.03300) on Anyscale using the [SkyRL](https://github.com/NovaSky-AI/SkyRL) framework. \n", | ||||||
| "SkyRL is a modular full-stack RL library for LLMs developed at the Berkeley Sky Computing Lab in collaboration with Anyscale, providing a flexible framework \n", | ||||||
| "for training LLMs on tool-use tasks and multi-turn agentic workflows using popular RL algorithms (PPO, GRPO, DAPO). SkyRL uses [Ray](https://github.com/ray-project/ray) extensively for managing training and generation workers, and for orchestration of the RL training loop, allowing it to easily scale to multiple GPUs and nodes within a Ray cluster.\n", | ||||||
| "\n", | ||||||
| "This template will first show a basic example of training a model to solve math word problems from the GSM8K dataset using GRPO. Next, the template will\n", | ||||||
| "show how you can create your own new environment to train on your specific task using the SkyRL-Gym.\n" | ||||||
| ] | ||||||
| }, | ||||||
| { | ||||||
| "cell_type": "markdown", | ||||||
| "metadata": {}, | ||||||
| "source": [ | ||||||
| "## Setup\n", | ||||||
| "SkyRL uses the [uv + Ray integration](https://www.anyscale.com/blog/uv-ray-pain-free-python-dependencies-in-clusters) for dependency management, ensuring a consistent set of dependencies get shipped to all Ray workers. This template uses the `novaskyai/skyrl-train-ray-2.48.0-py3.12-cu12.8` docker image to ensure all necessary system dependencies are installed. The exact Dockerfile can be found at [SkyRL/docker/Dockerfile](https://github.com/NovaSky-AI/SkyRL/blob/skyrl_train-v0.2.0/docker/Dockerfile).\n", | ||||||
| "\n", | ||||||
| "First, clone SkyRL and cd to `skyrl-train/`.\n", | ||||||
| "\n", | ||||||
| "```bash\n", | ||||||
| "git clone --branch skyrl_train-v0.2.0 https://github.com/NovaSky-AI/SkyRL.git\n", | ||||||
| "cd SkyRL/skyrl-train/\n", | ||||||
| "```" | ||||||
| ] | ||||||
| }, | ||||||
| { | ||||||
| "cell_type": "markdown", | ||||||
| "metadata": {}, | ||||||
| "source": [ | ||||||
| "## GRPO for solving math problems (GSM8K)\n", | ||||||
| "### Dataset preparation\n", | ||||||
| "To download and prepare the GSM8K dataset from Hugging Face, run the following command:\n", | ||||||
| "\n", | ||||||
| "```bash\n", | ||||||
| "uv run --isolated examples/gsm8k/gsm8k_dataset.py --output_dir /mnt/cluster_storage/data/gsm8k\n", | ||||||
| "```\n", | ||||||
| "\n", | ||||||
| "This script converts the Hugging Face GSM8K dataset to two Parquet files with the [schema required by SkyRL](https://skyrl.readthedocs.io/en/latest/datasets/dataset-preparation.html).\n", | ||||||
| "- `train.parquet` - Training data.\n", | ||||||
| "- `validation.parquet` - Validation data.\n", | ||||||
| "\n", | ||||||
| "### Launching your training run\n", | ||||||
| "\n", | ||||||
| "Now you're ready to launch a training run! If you choose to use the W&B logger (`trainer.logger=\"wandb\"`), first set the `WANDB_API_KEY` environment variable in the [Dependencies tab](https://docs.anyscale.com/development#environment-variables). Otherwise, you can set `trainer.logger=\"console\"` to print training logs to console. \n", | ||||||
| "\n", | ||||||
| "\n", | ||||||
| "```bash\n", | ||||||
| "SKYRL_RAY_PG_TIMEOUT_IN_S=90 bash examples/gsm8k/run_gsm8k.sh \\\n", | ||||||
| " data.train_data=\"['/mnt/cluster_storage/data/gsm8k/train.parquet']\" \\\n", | ||||||
| " data.val_data=\"['/mnt/cluster_storage/data/gsm8k/validation.parquet']\" \\\n", | ||||||
| " trainer.ckpt_path=\"/mnt/cluster_storage/ckpts/gsm8k_1.5B_ckpt\" \\\n", | ||||||
| " trainer.micro_forward_batch_size_per_gpu=16 \\\n", | ||||||
| " trainer.micro_train_batch_size_per_gpu=16 \\\n", | ||||||
| " trainer.epochs=1 \\\n", | ||||||
| " trainer.logger=\"console\"\n", | ||||||
| "```\n", | ||||||
| "\n", | ||||||
| "If using W&B, you should see logs like the ones shown below, with detailed metric tracking and timing breakdowns for each stage of the RL pipeline.\n", | ||||||
| "<img src=\"assets/gsm8k_wandb.png\" width=1500px />\n" | ||||||
|
Contributor
The image
Suggested change
|
||||||
| ] | ||||||
| }, | ||||||
| { | ||||||
| "cell_type": "markdown", | ||||||
| "metadata": {}, | ||||||
| "source": [ | ||||||
| "## Creating a new environment or task\n", | ||||||
| "\n", | ||||||
| "Now that you've run a basic example to teach an LLM to solve math word problems, you might want to start training on your own custom task! Check out the SkyRL docs for [creating a new environment or task](https://skyrl.readthedocs.io/en/latest/tutorials/new_env.html) for a full walkthrough of the simple steps to implement a custom multi-turn environment using the SkyRL-Gym interface. The commands needed to run the multi-turn example in the linked tutorial on Anyscale are shown below.\n", | ||||||
| "\n", | ||||||
| "### Preparing your data\n", | ||||||
| "\n", | ||||||
| "```bash\n", | ||||||
| "uv run --isolated examples/multiply/multiply_dataset.py \\\n", | ||||||
| " --output_dir /mnt/cluster_storage/data/multiply \\\n", | ||||||
| " --num_digits 4 \\\n", | ||||||
| " --train_size 10000 \\\n", | ||||||
| " --test_size 200\n", | ||||||
| "```\n", | ||||||
| "\n", | ||||||
| "### Training your model\n", | ||||||
| "```bash\n", | ||||||
| "SKYRL_RAY_PG_TIMEOUT_IN_S=90 bash examples/multiply/run_multiply.sh \\\n", | ||||||
| " data.train_data=\"['/mnt/cluster_storage/data/multiply/train.parquet']\" \\\n", | ||||||
| " data.val_data=\"['/mnt/cluster_storage/data/multiply/validation.parquet']\" \\\n", | ||||||
| " trainer.ckpt_path=\"/mnt/cluster_storage/ckpts/multiply_1.5B_ckpt\" \\\n", | ||||||
| " trainer.micro_forward_batch_size_per_gpu=16 \\\n", | ||||||
| " trainer.micro_train_batch_size_per_gpu=16 \\\n", | ||||||
| " trainer.epochs=1 \\\n", | ||||||
| " trainer.logger=\"console\"\n", | ||||||
| "```" | ||||||
| ] | ||||||
| }, | ||||||
| { | ||||||
| "cell_type": "markdown", | ||||||
| "metadata": {}, | ||||||
| "source": [ | ||||||
| "## Next steps\n", | ||||||
| "\n", | ||||||
| "After completing this template, you can:\n", | ||||||
| "- Explore more advanced algorithms, like [PPO](https://github.com/NovaSky-AI/SkyRL/tree/main/skyrl-train/examples/ppo) or [DAPO](https://skyrl.readthedocs.io/en/latest/algorithms/dapo.html)\n", | ||||||
| "- Explore more advanced tasks like [SWE-Bench](https://skyrl.readthedocs.io/en/latest/examples/mini_swe_agent.html), or [agentic search (Search-R1)](https://skyrl.readthedocs.io/en/latest/examples/search.html).\n", | ||||||
| "- Optimize your training pipeline using [Async Training](https://skyrl.readthedocs.io/en/latest/tutorials/async.html)\n", | ||||||
| "- Deploy your trained LLM using [Ray Serve LLM on Anyscale](https://console.anyscale.com/template-preview/deployment-serve-llm?utm_source=anyscale_docs&utm_medium=docs&utm_campaign=examples_page&utm_content=deployment-serve-llm?utm_source=anyscale&utm_medium=docs&utm_campaign=examples_page&utm_content=deployment-serve-llm)." | ||||||
|
Contributor
This URL contains duplicated query parameters, which makes it malformed. The second set of UTM parameters, starting from the second
Suggested change
|
||||||
| ] | ||||||
| } | ||||||
| ], | ||||||
| "metadata": { | ||||||
| "kernelspec": { | ||||||
| "display_name": "base", | ||||||
| "language": "python", | ||||||
| "name": "python3" | ||||||
| }, | ||||||
| "language_info": { | ||||||
| "codemirror_mode": { | ||||||
| "name": "ipython", | ||||||
| "version": 3 | ||||||
| }, | ||||||
| "file_extension": ".py", | ||||||
| "mimetype": "text/x-python", | ||||||
| "name": "python", | ||||||
| "nbconvert_exporter": "python", | ||||||
| "pygments_lexer": "ipython3", | ||||||
| "version": "3.12.11" | ||||||
| } | ||||||
| }, | ||||||
| "nbformat": 4, | ||||||
| "nbformat_minor": 2 | ||||||
| } | ||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,88 @@ | ||
| # Reinforcement Learning for LLMs with SkyRL | ||
|
|
||
| **⏱️ Time to complete**: ~40 minutes, including the time to train the models | ||
|
|
||
|
|
||
| This template walks through running [Group Relative Policy Optimization](https://arxiv.org/pdf/2402.03300) on Anyscale using the [SkyRL](https://github.com/NovaSky-AI/SkyRL) framework. | ||
| SkyRL is a modular full-stack RL library for LLMs developed at the Berkeley Sky Computing Lab in collaboration with Anyscale, providing a flexible framework | ||
| for training LLMs on tool-use tasks and multi-turn agent workflows using popular RL algorithms such as PPO, Group Relative Policy Optimization, and Direct Alignment from Preference Optimization. SkyRL uses [Ray](https://github.com/ray-project/ray) extensively for managing training and generation workers, and for orchestration of the RL training loop, allowing it to easily scale to multiple GPUs and nodes within a Ray cluster. | ||
|
|
||
| This template first shows a basic example of training a model to solve math word problems from the GSM8K dataset using Group Relative Policy Optimization. Next, the template | ||
| shows how you can create your own new environment to train on your specific task using the SkyRL-Gym. | ||
|
|
||
|
|
||
| ## Setup | ||
| SkyRL uses the [uv + Ray integration](https://www.anyscale.com/blog/uv-ray-pain-free-python-dependencies-in-clusters) for dependency management, ensuring a consistent set of dependencies get shipped to all Ray workers. This template uses the `novaskyai/skyrl-train-ray-2.48.0-py3.12-cu12.8` docker image with all necessary system dependencies installed. You can find the exact Dockerfile at [SkyRL/docker/Dockerfile](https://github.com/NovaSky-AI/SkyRL/blob/skyrl_train-v0.2.0/docker/Dockerfile). | ||
|
|
||
| First, clone SkyRL and cd to `skyrl-train/`. | ||
|
|
||
| ```bash | ||
| git clone --branch skyrl_train-v0.2.0 https://github.com/NovaSky-AI/SkyRL.git | ||
| cd SkyRL/skyrl-train/ | ||
| ``` | ||
|
|
||
| ## Group Relative Policy Optimization for solving math problems on GSM8K | ||
| ### Dataset preparation | ||
| To download and prepare the GSM8K dataset from Hugging Face, run the following command: | ||
|
|
||
| ```bash | ||
| uv run --isolated examples/gsm8k/gsm8k_dataset.py --output_dir /mnt/cluster_storage/data/gsm8k | ||
| ``` | ||
|
|
||
| This script converts the Hugging Face GSM8K dataset to two Parquet files with the [schema required by SkyRL](https://skyrl.readthedocs.io/en/latest/datasets/dataset-preparation.html): | ||
| - `train.parquet` - Training data | ||
| - `validation.parquet` - Validation data | ||
|
|
||
| ### Launching your training run | ||
|
|
||
| Now you're ready to launch a training run. If you choose to use the W&B logger with `trainer.logger="wandb"`, first set the `WANDB_API_KEY` environment variable in the [Dependencies tab](https://docs.anyscale.com/development#environment-variables). Otherwise, you can set `trainer.logger="console"` to print training logs to console. | ||
|
|
||
|
|
||
| ```bash | ||
| SKYRL_RAY_PG_TIMEOUT_IN_S=90 bash examples/gsm8k/run_gsm8k.sh \ | ||
| data.train_data="['/mnt/cluster_storage/data/gsm8k/train.parquet']" \ | ||
| data.val_data="['/mnt/cluster_storage/data/gsm8k/validation.parquet']" \ | ||
| trainer.ckpt_path="/mnt/cluster_storage/ckpts/gsm8k_1.5B_ckpt" \ | ||
| trainer.micro_forward_batch_size_per_gpu=16 \ | ||
| trainer.micro_train_batch_size_per_gpu=16 \ | ||
| trainer.epochs=1 \ | ||
| trainer.logger="console" | ||
| ``` | ||
|
|
||
| If using W&B, you should see logs like the ones shown below, with detailed metric tracking and timing breakdowns for each stage of the RL pipeline. | ||
| <img src="https://raw.githubusercontent.com/anyscale/templates/main/templates/rl-skyrl/assets/gsm8k_wandb.png" width=1500px /> | ||
|
|
||
|
|
||
|
|
||
| ## Creating a new environment or task | ||
|
|
||
| Now that you've run a basic example to teach an LLM to solve math word problems, you might want to start training on your own custom task. Check out the SkyRL docs for [creating a new environment or task](https://skyrl.readthedocs.io/en/latest/tutorials/new_env.html) for a full walk-through of the simple steps to implement a custom multi-turn environment using the SkyRL-Gym interface. The following commands run the multi-turn example in the linked tutorial on Anyscale. | ||
|
|
||
| ### Preparing your data | ||
|
|
||
| ```bash | ||
| uv run --isolated examples/multiply/multiply_dataset.py \ | ||
| --output_dir /mnt/cluster_storage/data/multiply \ | ||
| --num_digits 4 \ | ||
| --train_size 10000 \ | ||
| --test_size 200 | ||
| ``` | ||
|
|
||
| ### Training your model | ||
| ```bash | ||
| SKYRL_RAY_PG_TIMEOUT_IN_S=90 bash examples/multiply/run_multiply.sh \ | ||
| data.train_data="['/mnt/cluster_storage/data/multiply/train.parquet']" \ | ||
| data.val_data="['/mnt/cluster_storage/data/multiply/validation.parquet']" \ | ||
| trainer.ckpt_path="/mnt/cluster_storage/ckpts/multiply_1.5B_ckpt" \ | ||
| trainer.micro_forward_batch_size_per_gpu=16 \ | ||
| trainer.micro_train_batch_size_per_gpu=16 \ | ||
| trainer.epochs=1 \ | ||
| trainer.logger="console" | ||
| ``` | ||
|
|
||
| ## Next steps | ||
|
|
||
| After completing this template, you can: | ||
| - Explore more advanced algorithms, such as [PPO](https://github.com/NovaSky-AI/SkyRL/tree/main/skyrl-train/examples/ppo) or [Direct Alignment from Preference Optimization](https://skyrl.readthedocs.io/en/latest/algorithms/dapo.html) | ||
| - Explore more advanced tasks such as the [Software Engineering Benchmark](https://skyrl.readthedocs.io/en/latest/examples/mini_swe_agent.html) or [agent search with Search-R1](https://skyrl.readthedocs.io/en/latest/examples/search.html) | ||
| - Optimize your training pipeline using [async training](https://skyrl.readthedocs.io/en/latest/tutorials/async.html) | ||
| - Deploy your trained LLM using [Ray Serve LLM on Anyscale](https://console.anyscale.com/template-preview/deployment-serve-llm?utm_source=anyscale_docs&utm_medium=docs&utm_campaign=examples_page&utm_content=deployment-serve-llm?utm_source=anyscale&utm_medium=docs&utm_campaign=examples_page&utm_content=deployment-serve-llm) | ||
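The Ray Serve LLM link in the final bullet carries its UTM query string twice, joined by a second `?`. The following is a minimal sketch of one way to trim such a link using only Python's standard library. The helper name `strip_repeated_query` is hypothetical and not part of this PR, and the sketch assumes that everything from the second `?` onward is an accidental repeat that can be dropped.

```python
from urllib.parse import parse_qsl, urlsplit


def strip_repeated_query(url: str) -> str:
    """Keep only the first query string of a URL.

    Assumption: anything from the second '?' onward is an accidental
    repeat, as in the link flagged in this review.
    """
    head, sep, rest = url.partition("?")
    if not sep:
        return url  # no query string at all, nothing to trim
    first_query = rest.split("?", 1)[0]
    return f"{head}?{first_query}"


# The link exactly as it appears in the diff.
link = (
    "https://console.anyscale.com/template-preview/deployment-serve-llm"
    "?utm_source=anyscale_docs&utm_medium=docs&utm_campaign=examples_page"
    "&utm_content=deployment-serve-llm"
    "?utm_source=anyscale&utm_medium=docs&utm_campaign=examples_page"
    "&utm_content=deployment-serve-llm"
)

cleaned = strip_repeated_query(link)
params = parse_qsl(urlsplit(cleaned).query)

print(cleaned)
print(params)
```

After trimming, the link keeps a single set of four UTM parameters, with `utm_source=anyscale_docs` taken from the first copy.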
Bug: Missing Image Assets Break README Notebook
The `README.ipynb` notebook references a local image, `assets/gsm8k_wandb.png`, but the `assets` directory and image file are not included in this migration. This results in a broken image display in the notebook. The corresponding `README.md` uses a working remote URL for the same image.
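A check along the following lines could catch this kind of missing asset before merge. It is a minimal sketch, not an existing Ray CI step: the function name `missing_local_images` is hypothetical, and it only looks at HTML `<img src="...">` tags in markdown cells, resolving relative paths against the notebook's own directory. The demo builds a throwaway notebook in a temporary directory rather than reading the real `README.ipynb`.

```python
import json
import re
import tempfile
from pathlib import Path

# Matches HTML image tags such as <img src="assets/foo.png" width=1500px />.
# Markdown-style ![alt](path) images are deliberately out of scope here.
IMG_SRC = re.compile(r'<img\s+[^>]*src="([^"]+)"')


def missing_local_images(notebook_path: Path) -> list[str]:
    """Return local image paths referenced by markdown cells that do not exist."""
    nb = json.loads(notebook_path.read_text(encoding="utf-8"))
    missing = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", [])
        # nbformat allows "source" to be a list of lines or a single string.
        text = "".join(source) if isinstance(source, list) else source
        for src in IMG_SRC.findall(text):
            if src.startswith(("http://", "https://")):
                continue  # remote images are not checked
            if not (notebook_path.parent / src).exists():
                missing.append(src)
    return missing


# Demo: one local reference (missing) and one remote reference (skipped).
with tempfile.TemporaryDirectory() as tmp:
    nb_path = Path(tmp) / "README.ipynb"
    demo_notebook = {
        "cells": [
            {
                "cell_type": "markdown",
                "metadata": {},
                "source": [
                    '<img src="assets/gsm8k_wandb.png" width=1500px />\n',
                    '<img src="https://example.com/remote.png" />\n',
                ],
            }
        ]
    }
    nb_path.write_text(json.dumps(demo_notebook), encoding="utf-8")
    result = missing_local_images(nb_path)

print(result)
```

Either adding the `assets` directory to the migration or switching the notebook to the remote URL used by `README.md` would make this check return an empty list.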