1 change: 1 addition & 0 deletions doc/source/ray-overview/examples/index.rst
@@ -15,3 +15,4 @@ Examples
./object-detection/README.ipynb
./e2e-rag/README.ipynb
./mcp-ray-serve/README.ipynb
./rl-skyrl/README.ipynb
137 changes: 137 additions & 0 deletions doc/source/ray-overview/examples/rl-skyrl/README.ipynb
@@ -0,0 +1,137 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Reinforcement Learning for LLMs with SkyRL\n",
"\n",
"**⏱️ Time to complete**: ~40 mins, including the time to train the models\n",
"\n",
"\n",
"This template walks through running [GRPO](https://arxiv.org/pdf/2402.03300) on Anyscale using the [SkyRL](https://github.com/NovaSky-AI/SkyRL) framework. \n",
"SkyRL is a modular full-stack RL library for LLMs developed at the Berkeley Sky Computing Lab in collaboration with Anyscale, providing a flexible framework \n",
"for training LLMs on tool-use tasks and multi-turn agentic workflows using popular RL algorithms (PPO, GRPO, DAPO). SkyRL uses [Ray](https://github.com/ray-project/ray) extensively for managing training and generation workers, and for orchestration of the RL training loop, allowing it to easily scale to multiple GPUs and nodes within a Ray cluster.\n",
"\n",
"This template will first show a basic example of training a model to solve math word problems from the GSM8K dataset using GRPO. Next, the template will\n",
"show how you can create your own new environment to train on your specific task using the SkyRL-Gym.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"SkyRL uses the [uv + Ray integration](https://www.anyscale.com/blog/uv-ray-pain-free-python-dependencies-in-clusters) for dependency management, ensuring a consistent set of dependencies get shipped to all Ray workers. This template uses the `novaskyai/skyrl-train-ray-2.48.0-py3.12-cu12.8` docker image to ensure all necessary system depedencies are installed. The exact Dockerfile can be found at [SkyRL/docker/Dockerfile](https://github.com/NovaSky-AI/SkyRL/blob/skyrl_train-v0.2.0/docker/Dockerfile).\n",
"\n",
"First, clone SkyRL and cd to `skyrl-train/`.\n",
"\n",
"```bash\n",
"git clone --branch skyrl_train-v0.2.0 https://github.com/NovaSky-AI/SkyRL.git\n",
"cd SkyRL/skyrl-train/\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## GRPO for solving math problems (GSM8K)\n",
"### Dataset preparation\n",
"To download and prepare the GSM8K dataset from HuggingFace, run the following command:\n",
"\n",
"```bash\n",
"uv run --isolated examples/gsm8k/gsm8k_dataset.py --output_dir /mnt/cluster_storage/data/gsm8k\n",
"```\n",
"\n",
"This script converts the Huggingface GSM8K dataset to two Parquet files with the [schema required by SkyRL](https://skyrl.readthedocs.io/en/latest/datasets/dataset-preparation.html).\n",
"- `train.parquet` - Training data.\n",
"- `validation.parquet` - Validation data.\n",
"\n",
"### Launching your training run\n",
"\n",
"Now you're ready to launch a training run! If you choose to use the W&B logger (`trainer.logger=\"wandb\"`), first set the `WANDB_API_KEY` environment variable in the [Dependencies tab](https://docs.anyscale.com/development#environment-variables). Otherwise, you can set `trainer.logger=\"console\"` to print training logs to console. \n",
"\n",
"\n",
"```bash\n",
"SKYRL_RAY_PG_TIMEOUT_IN_S=90 bash examples/gsm8k/run_gsm8k.sh \\\n",
" data.train_data=\"['/mnt/cluster_storage/data/gsm8k/train.parquet']\" \\\n",
" data.val_data=\"['/mnt/cluster_storage/data/gsm8k/validation.parquet']\" \\\n",
" trainer.ckpt_path=\"/mnt/cluster_storage/ckpts/gsm8k_1.5B_ckpt\" \\\n",
" trainer.micro_forward_batch_size_per_gpu=16 \\\n",
" trainer.micro_train_batch_size_per_gpu=16 \\\n",
" trainer.epochs=1 \\\n",
" trainer.logger=\"console\"\n",
"```\n",
"\n",
"If using W&B, you should see logs like the ones shown below, with detailed metric tracking and timing breakdowns for each stage of the RL pipeline.\n",
"<img src=\"assets/gsm8k_wandb.png\" width=1500px />\n"
Bug: Missing Image Assets Break README Notebook

The README.ipynb notebook references a local image, assets/gsm8k_wandb.png, but the assets directory and image file are not included in this migration. This results in a broken image display in the notebook. The corresponding README.md uses a working remote URL for the same image.

Contributor comment (severity: high):

The image assets/gsm8k_wandb.png is referenced with a relative path. This will result in a broken image when the notebook is rendered, as the assets directory is not included in this pull request. Please use a full URL to the image. I'd also recommend hosting this image within the Ray repository to avoid it breaking if the source repository is changed or removed in the future.

Suggested change:
- "<img src=\"assets/gsm8k_wandb.png\" width=1500px />\n"
+ "<img src=\"https://raw.githubusercontent.com/anyscale/templates/main/templates/rl-skyrl/assets/gsm8k_wandb.png\" width=1500px />\n"

]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating a new environment or task\n",
"\n",
"Now that you've run a basic example to teach an LLM to solve math word problems, you might want to start training on your own custom task! Check out the SkyRL docs for [creating a new environment or task](https://skyrl.readthedocs.io/en/latest/tutorials/new_env.html) for a full walkthrough of the simple steps to implement a custom multi-turn environment using the SkyRL-Gym interface. The commands needed to run the multi-turn example in the linked tutorial on Anyscale are shown below.\n",
"\n",
"### Preparing your data\n",
"\n",
"```bash\n",
"uv run --isolated examples/multiply/multiply_dataset.py \\\n",
" --output_dir /mnt/cluster_storage/data/multiply \\\n",
" --num_digits 4 \\\n",
" --train_size 10000 \\\n",
" --test_size 200\n",
"```\n",
"\n",
"### Training your model\n",
"```bash\n",
"SKYRL_RAY_PG_TIMEOUT_IN_S=90 bash examples/multiply/run_multiply.sh \\\n",
" data.train_data=\"['/mnt/cluster_storage/data/multiply/train.parquet']\" \\\n",
" data.val_data=\"['/mnt/cluster_storage/data/multiply/validation.parquet']\" \\\n",
" trainer.ckpt_path=\"/mnt/cluster_storage/ckpts/multiply_1.5B_ckpt\" \\\n",
" trainer.micro_forward_batch_size_per_gpu=16 \\\n",
" trainer.micro_train_batch_size_per_gpu=16 \\\n",
" trainer.epochs=1 \\\n",
" trainer.logger=\"console\"\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Next steps\n",
"\n",
"After completing this template, you can:\n",
"- Explore more advanced algorithms, like [PPO](https://github.com/NovaSky-AI/SkyRL/tree/main/skyrl-train/examples/ppo) or [DAPO](https://skyrl.readthedocs.io/en/latest/algorithms/dapo.html)\n",
"- Explore more advanced tasks like [SWE-Bench](https://skyrl.readthedocs.io/en/latest/examples/mini_swe_agent.html), or [agentic search (Search-R1)](https://skyrl.readthedocs.io/en/latest/examples/search.html).\n",
"- Optimize your training pipeline using [Async Training](https://skyrl.readthedocs.io/en/latest/tutorials/async.html)\n",
"- Deploy your trained LLM using [Ray Serve LLM on Anyscale](https://console.anyscale.com/template-preview/deployment-serve-llm?utm_source=anyscale_docs&utm_medium=docs&utm_campaign=examples_page&utm_content=deployment-serve-llm?utm_source=anyscale&utm_medium=docs&utm_campaign=examples_page&utm_content=deployment-serve-llm)."
Contributor comment (severity: medium):

This URL contains duplicated query parameters, which makes it malformed. The second set of UTM parameters, starting from the second ?, should be removed.

Suggested change:
- "- Deploy your trained LLM using [Ray Serve LLM on Anyscale](https://console.anyscale.com/template-preview/deployment-serve-llm?utm_source=anyscale_docs&utm_medium=docs&utm_campaign=examples_page&utm_content=deployment-serve-llm?utm_source=anyscale&utm_medium=docs&utm_campaign=examples_page&utm_content=deployment-serve-llm)."
+ "- Deploy your trained LLM using [Ray Serve LLM on Anyscale](https://console.anyscale.com/template-preview/deployment-serve-llm?utm_source=anyscale_docs&utm_medium=docs&utm_campaign=examples_page&utm_content=deployment-serve-llm)."

]
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.11"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
88 changes: 88 additions & 0 deletions doc/source/ray-overview/examples/rl-skyrl/README.md
@@ -0,0 +1,88 @@
# Reinforcement Learning for LLMs with SkyRL

**⏱️ Time to complete**: ~40 minutes, including the time to train the models


This template walks through running [Group Relative Policy Optimization](https://arxiv.org/pdf/2402.03300) on Anyscale using the [SkyRL](https://github.com/NovaSky-AI/SkyRL) framework.
SkyRL is a modular full-stack RL library for LLMs developed at the Berkeley Sky Computing Lab in collaboration with Anyscale, providing a flexible framework
for training LLMs on tool-use tasks and multi-turn agent workflows using popular RL algorithms such as PPO, Group Relative Policy Optimization, and DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization). SkyRL uses [Ray](https://github.com/ray-project/ray) extensively for managing training and generation workers, and for orchestration of the RL training loop, allowing it to easily scale to multiple GPUs and nodes within a Ray cluster.

This template first shows a basic example of training a model to solve math word problems from the GSM8K dataset using Group Relative Policy Optimization. Next, the template
shows how you can create your own new environment to train on your specific task using SkyRL-Gym.


## Setup
SkyRL uses the [uv + Ray integration](https://www.anyscale.com/blog/uv-ray-pain-free-python-dependencies-in-clusters) for dependency management, ensuring a consistent set of dependencies gets shipped to all Ray workers. This template uses the `novaskyai/skyrl-train-ray-2.48.0-py3.12-cu12.8` Docker image with all necessary system dependencies installed. You can find the exact Dockerfile at [SkyRL/docker/Dockerfile](https://github.com/NovaSky-AI/SkyRL/blob/skyrl_train-v0.2.0/docker/Dockerfile).

First, clone SkyRL and cd to `skyrl-train/`.

```bash
git clone --branch skyrl_train-v0.2.0 https://github.com/NovaSky-AI/SkyRL.git
cd SkyRL/skyrl-train/
```

## Group Relative Policy Optimization for solving math problems on GSM8K
### Dataset preparation
To download and prepare the GSM8K dataset from Hugging Face, run the following command:

```bash
uv run --isolated examples/gsm8k/gsm8k_dataset.py --output_dir /mnt/cluster_storage/data/gsm8k
```

This script converts the Hugging Face GSM8K dataset to two Parquet files with the [schema required by SkyRL](https://skyrl.readthedocs.io/en/latest/datasets/dataset-preparation.html):
- `train.parquet` - Training data
- `validation.parquet` - Validation data
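
If you want to sanity-check the prepared data before training, you can open the generated files with any Parquet reader. The following is a minimal sketch, assuming `pandas` with a Parquet engine such as `pyarrow` is available in your environment; the SkyRL documentation linked above is the authoritative reference for the column layout.

```python
# Minimal sketch: inspect the prepared Parquet files before training.
# Assumes pandas (with a Parquet engine such as pyarrow) is installed;
# see the SkyRL dataset preparation docs for the authoritative schema.
import pandas as pd

train = pd.read_parquet("/mnt/cluster_storage/data/gsm8k/train.parquet")
print(train.columns.tolist())  # column names defined by the SkyRL schema
print(train.head(2))           # peek at the first two examples
```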

### Launching your training run

Now you're ready to launch a training run. If you choose to use the W&B logger with `trainer.logger="wandb"`, first set the `WANDB_API_KEY` environment variable in the [Dependencies tab](https://docs.anyscale.com/development#environment-variables). Otherwise, you can set `trainer.logger="console"` to print training logs to the console.


```bash
SKYRL_RAY_PG_TIMEOUT_IN_S=90 bash examples/gsm8k/run_gsm8k.sh \
data.train_data="['/mnt/cluster_storage/data/gsm8k/train.parquet']" \
data.val_data="['/mnt/cluster_storage/data/gsm8k/validation.parquet']" \
trainer.ckpt_path="/mnt/cluster_storage/ckpts/gsm8k_1.5B_ckpt" \
trainer.micro_forward_batch_size_per_gpu=16 \
trainer.micro_train_batch_size_per_gpu=16 \
trainer.epochs=1 \
trainer.logger="console"
```

If using W&B, you should see logs like the ones shown below, with detailed metric tracking and timing breakdowns for each stage of the RL pipeline.
<img src="https://raw.githubusercontent.com/anyscale/templates/main/templates/rl-skyrl/assets/gsm8k_wandb.png" width=1500px />
Bug: Deprecated Repo Causes Broken Image Link

The README.md references an image from the anyscale/templates GitHub repository. This repository is being deprecated, which will result in a broken image link.

Contributor comment (severity: medium):

This image source points to the anyscale/templates repository. According to the pull request description, this repository is being deprecated. To prevent the image from breaking in the future, it should be moved to a more permanent location, such as within this repository's assets.



## Creating a new environment or task

Now that you've run a basic example to teach an LLM to solve math word problems, you might want to start training on your own custom task. Check out the SkyRL docs for [creating a new environment or task](https://skyrl.readthedocs.io/en/latest/tutorials/new_env.html) for a full walk-through of the simple steps to implement a custom multi-turn environment using the SkyRL-Gym interface. The following commands run the multi-turn example in the linked tutorial on Anyscale.
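
As a preview of what that tutorial covers, here is a minimal sketch of a single-turn environment. The module path, class names, and step-output fields are assumptions based on the linked tutorial rather than a verified API; treat the tutorial as the authoritative reference for the SkyRL-Gym interface.

```python
# Illustrative sketch of a custom SkyRL-Gym environment.
# Module path, class names, and step-output fields are assumptions
# based on the linked tutorial; consult it for the authoritative API.
from skyrl_gym.envs.base_text_env import BaseTextEnv, BaseTextEnvStepOutput


class MultiplyEnv(BaseTextEnv):
    """Rewards the model for emitting the correct product of two numbers."""

    def __init__(self, env_config, extras):
        super().__init__()
        # Each sample carries its ground-truth answer in the dataset metadata.
        self.ground_truth = extras["reward_spec"]["ground_truth"]

    def step(self, action: str) -> BaseTextEnvStepOutput:
        # Single-turn variant: score the response and end the episode.
        reward = 1.0 if self.ground_truth in action else 0.0
        return BaseTextEnvStepOutput(
            observations=[],  # no follow-up messages in a single-turn task
            reward=reward,
            done=True,
            metadata={},
        )
```

A multi-turn version would return `done=False` with new observations (for example, tool output) until the task finishes or a turn limit is reached, as the tutorial walks through.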

### Preparing your data

```bash
uv run --isolated examples/multiply/multiply_dataset.py \
--output_dir /mnt/cluster_storage/data/multiply \
--num_digits 4 \
--train_size 10000 \
--test_size 200
```

### Training your model
```bash
SKYRL_RAY_PG_TIMEOUT_IN_S=90 bash examples/multiply/run_multiply.sh \
data.train_data="['/mnt/cluster_storage/data/multiply/train.parquet']" \
data.val_data="['/mnt/cluster_storage/data/multiply/validation.parquet']" \
trainer.ckpt_path="/mnt/cluster_storage/ckpts/multiply_1.5B_ckpt" \
trainer.micro_forward_batch_size_per_gpu=16 \
trainer.micro_train_batch_size_per_gpu=16 \
trainer.epochs=1 \
trainer.logger="console"
```

## Next steps

After completing this template, you can:
- Explore more advanced algorithms, such as [PPO](https://github.com/NovaSky-AI/SkyRL/tree/main/skyrl-train/examples/ppo) or [DAPO](https://skyrl.readthedocs.io/en/latest/algorithms/dapo.html)
- Explore more advanced tasks such as the [Software Engineering Benchmark](https://skyrl.readthedocs.io/en/latest/examples/mini_swe_agent.html) or [agent search with Search-R1](https://skyrl.readthedocs.io/en/latest/examples/search.html)
- Optimize your training pipeline using [async training](https://skyrl.readthedocs.io/en/latest/tutorials/async.html)
- Deploy your trained LLM using [Ray Serve LLM on Anyscale](https://console.anyscale.com/template-preview/deployment-serve-llm?utm_source=anyscale_docs&utm_medium=docs&utm_campaign=examples_page&utm_content=deployment-serve-llm)