diff --git a/docs/examples.md b/docs/examples.md index 17df886a9..d83ccb778 100644 --- a/docs/examples.md +++ b/docs/examples.md @@ -74,6 +74,16 @@ hide: with RAGEN, verl, and Ray.

+    <a href="/examples/distributed-training/open-r1"
+       class="feature-cell sky">
+        <h3>
+            Open-R1
+        </h3>
+
+        <p>
+            Fine-tune to reproduce DeepSeek-R1 with separate inference and training nodes.
+        </p>
+    </a>
diff --git a/docs/examples/distributed-training/open-r1/index.md b/docs/examples/distributed-training/open-r1/index.md new file mode 100644 index 000000000..e69de29bb diff --git a/examples/distributed-training/open-r1/.dstack.yml b/examples/distributed-training/open-r1/.dstack.yml new file mode 100644 index 000000000..dda383a3d --- /dev/null +++ b/examples/distributed-training/open-r1/.dstack.yml @@ -0,0 +1,78 @@ +type: task +name: open-r1-grpo + +# Size of the cluster +nodes: 2 + +python: 3.12 + +nvcc: true + +# Required environment variables +env: + - HF_TOKEN + - WANDB_API_KEY + - NCCL_DEBUG=INFO + # VLLM configuration + - USE_VLLM=true + - MODEL=Qwen/Qwen2.5-Coder-7B-Instruct + # Qwen2.5-Coder-7B-Instruct has 28 attention heads and should be divisible by TP and DP + - TP=4 + - DP=2 + +# Commands of the task +commands: + - uv pip install vllm==0.8.5.post1 + - uv pip install setuptools + - uv pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl + - git clone https://github.com/huggingface/open-r1.git + - cd open-r1 + - uv pip install . + - | + if [ "$USE_VLLM" = "true" ]; then + # Get the last IP from DSTACK_NODES_IPS for vLLM node + VLLM_HOST=$(echo $DSTACK_NODES_IPS | tr ' ' '\n' | tail -n 1) + if [ "$DSTACK_NODE_RANK" -eq $(($DSTACK_NODES_NUM - 1)) ]; then + # Last Node runs VLLM server + trl vllm-serve --model $MODEL --tensor_parallel_size $TP --data_parallel_size $DP --host 0.0.0.0 + else + # Training node - adjust world size and nodes count for training + ADJUSTED_NODES_NUM=$(($DSTACK_NODES_NUM - 1)) + ADJUSTED_GPUS_TOTAL=$(($DSTACK_GPUS_PER_NODE * $ADJUSTED_NODES_NUM)) + # Other nodes run training + accelerate launch --config_file recipes/accelerate_configs/zero3.yaml \ + --num_processes=$ADJUSTED_GPUS_TOTAL \ + --num_machines=$ADJUSTED_NODES_NUM \ + --machine_rank=$DSTACK_NODE_RANK \ + --main_process_ip=$DSTACK_MASTER_NODE_IP \ + --main_process_port=8008 \ + src/open_r1/grpo.py \ + --config recipes/Qwen2.5-1.5B-Instruct/grpo/config_demo.yaml \ + --model_name_or_path $MODEL \ + --output_dir /checkpoints/Qwen2.5-Coder-7B-Instruct-GRPO \ + --hub_model_id sjbbihan/Qwen2.5-Coder-7B-Instruct \ + --vllm_server_host=$VLLM_HOST + fi + else + # Standard training mode without VLLM + echo "Running standard training without VLLM" + accelerate launch --config_file recipes/accelerate_configs/zero3.yaml \ + --num_processes=$DSTACK_GPUS_NUM \ + --num_machines=$DSTACK_NODES_NUM \ + --machine_rank=$DSTACK_NODE_RANK \ + --main_process_ip=$DSTACK_MASTER_NODE_IP \ + --main_process_port=8008 \ + src/open_r1/grpo.py \ + --config recipes/Qwen2.5-1.5B-Instruct/grpo/config_demo.yaml \ + --model_name_or_path $MODEL \ + --output_dir /checkpoints/Qwen2.5-Coder-7B-Instruct-GRPO \ + --hub_model_id sjbbihan/Qwen2.5-Coder-7B-Instruct \ + --use_vllm false + fi + +resources: + gpu: 80GB:8 + shm_size: 128GB + +volumes: + - /checkpoints:/checkpoints diff --git a/examples/distributed-training/open-r1/README.md b/examples/distributed-training/open-r1/README.md new file mode 100644 index 000000000..92aff2a9a --- /dev/null +++ b/examples/distributed-training/open-r1/README.md @@ -0,0 +1,155 @@ +# Open-R1 + +This example demonstrates how to use `dstack` and [open-r1](https://github.com/huggingface/open-r1) to run GRPO training across distributed nodes, dedicating one node for [vLLM](https://github.com/vllm-project/vllm) inference while using the remaining nodes for training. + +??? 
info "Prerequisites" + Once `dstack` is [installed](https://dstack.ai/docs/installation), go ahead clone the repo, and run `dstack init`. + +
+ + ```shell + $ git clone https://github.com/dstackai/dstack + $ cd dstack + $ dstack init + ``` +
## Create fleet

Before submitting distributed training runs, make sure to create a fleet with `placement` set to `cluster`.

> For more details on how to use clusters with `dstack`, check the [Clusters](https://dstack.ai/docs/guides/clusters) guide.
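For reference, a minimal fleet configuration might look like the sketch below. The fleet name, node count, and resources are illustrative, not prescriptive; adjust them to your infrastructure (see the Clusters guide for backend-specific options such as SSH fleets for on-prem machines).

```yaml
type: fleet
# Illustrative name; pick any name you like
name: open-r1-fleet

# Provision the nodes as one interconnected cluster
nodes: 2
placement: cluster

resources:
  gpu: 80GB:8
```

You would apply it with `dstack apply -f` pointing at the fleet configuration file before submitting the task.

## Define a configuration

Once the fleet is created, define a distributed task configuration.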
```yaml
type: task
name: open-r1-grpo

# Size of the cluster
nodes: 2

python: 3.12

nvcc: true

# Required environment variables
env:
  - HF_TOKEN
  - HUB_MODEL_ID
  - WANDB_API_KEY
  - NCCL_DEBUG=INFO
  # vLLM configuration
  - USE_VLLM=true
  - MODEL=Qwen/Qwen2.5-Coder-7B-Instruct
  - TP=4
  - DP=2

# Commands of the task
commands:
  - uv pip install vllm==0.8.5.post1
  - uv pip install setuptools
  - uv pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
  - git clone https://github.com/huggingface/open-r1.git
  - cd open-r1
  - uv pip install .
  - |
    # Get the last IP from DSTACK_NODES_IPS for the vLLM node
    VLLM_HOST=$(echo $DSTACK_NODES_IPS | tr ' ' '\n' | tail -n 1)
    echo "VLLM host IP (last node): $VLLM_HOST"

    if [ "$USE_VLLM" = "true" ]; then
      if [ "$DSTACK_NODE_RANK" -eq $(($DSTACK_NODES_NUM - 1)) ]; then
        # The last node runs the vLLM server
        echo "Starting VLLM server on Last Node (IP: $VLLM_HOST)"
        trl vllm-serve --model $MODEL --tensor_parallel_size $TP --data_parallel_size $DP --host 0.0.0.0
      else
        # Training node: adjust node and GPU counts for training
        GPUS_PER_NODE=$(($DSTACK_GPUS_NUM / $DSTACK_NODES_NUM))
        ADJUSTED_NODES_NUM=$(($DSTACK_NODES_NUM - 1))
        ADJUSTED_GPUS_TOTAL=$(($GPUS_PER_NODE * $ADJUSTED_NODES_NUM))
        # The remaining nodes run training
        echo "Starting training with VLLM on $VLLM_HOST"
        accelerate launch --config_file recipes/accelerate_configs/zero3.yaml \
          --num_processes=$ADJUSTED_GPUS_TOTAL \
          --num_machines=$ADJUSTED_NODES_NUM \
          --machine_rank=$DSTACK_NODE_RANK \
          --main_process_ip=$DSTACK_MASTER_NODE_IP \
          --main_process_port=8008 \
          src/open_r1/grpo.py \
          --config recipes/Qwen2.5-1.5B-Instruct/grpo/config_demo.yaml \
          --model_name_or_path $MODEL \
          --output_dir /checkpoints/Qwen2.5-Coder-7B-Instruct-GRPO \
          --hub_model_id $HUB_MODEL_ID \
          --vllm_server_host=$VLLM_HOST
      fi
    else
      # Standard training mode without vLLM
      echo "Running standard training without VLLM"
      accelerate launch --config_file recipes/accelerate_configs/zero3.yaml \
        --num_processes=$DSTACK_GPUS_NUM \
        --num_machines=$DSTACK_NODES_NUM \
        --machine_rank=$DSTACK_NODE_RANK \
        --main_process_ip=$DSTACK_MASTER_NODE_IP \
        --main_process_port=8008 \
        src/open_r1/grpo.py \
        --config recipes/Qwen2.5-1.5B-Instruct/grpo/config_demo.yaml \
        --model_name_or_path $MODEL \
        --output_dir /checkpoints/Qwen2.5-Coder-7B-Instruct-GRPO \
        --hub_model_id $HUB_MODEL_ID \
        --use_vllm false
    fi

resources:
  gpu: 80GB:8
  shm_size: 128GB

volumes:
  - /checkpoints:/checkpoints
```
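To make the node-role split concrete, here is a short walk-through of the values the script above computes for this configuration (`nodes: 2`, 8 GPUs per node, `USE_VLLM=true`); the numbers are derived from the config, purely for illustration:

```shell
# DSTACK_NODES_NUM = 2, DSTACK_GPUS_NUM = 16, GPUS_PER_NODE = 16 / 2 = 8
#
# Node with DSTACK_NODE_RANK = 1 (the last node):
#   runs `trl vllm-serve` with TP=4 and DP=2, using all 8 GPUs (4 x 2 = 8)
#
# Node with DSTACK_NODE_RANK = 0:
#   ADJUSTED_NODES_NUM  = 2 - 1 = 1
#   ADJUSTED_GPUS_TOTAL = 8 * 1 = 8
#   runs `accelerate launch` with --num_processes=8 and --num_machines=1
```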
!!! info "NOTE:"
    1. When `nodes: N` and `USE_VLLM=true`, `N-1` nodes are used for distributed training, while the last node is used for vLLM inference.
    2. When `USE_VLLM=false`, training happens across all nodes.
    3. The model's number of attention heads must be divisible by the `TP` and `DP` values (see the quick check below).
    4. We pin `flash-attn` to `2.7.4.post1` for compatibility with `vllm==0.8.5.post1`, as the latest vLLM causes TRL's GRPO training to fail. See [issue #3608](https://github.com/huggingface/trl/issues/3608) for details.
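As a quick, optional check before changing `TP` or `DP`, you can read the head count from the model config with `transformers` (a sketch; assumes the `transformers` package is installed locally):

```shell
$ python -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('Qwen/Qwen2.5-Coder-7B-Instruct').num_attention_heads)"
28
```

Here 28 is divisible by both `TP=4` and `DP=2`, so the values above work; if you switch models, re-check this before launching.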
### Apply the configuration

To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command.

```shell
$ HF_TOKEN=...
$ HUB_MODEL_ID=...
$ WANDB_API_KEY=...
$ dstack apply -f examples/distributed-training/open-r1/.dstack.yml

 #  BACKEND       RESOURCES                       INSTANCE TYPE  PRICE
 1  ssh (remote)  cpu=208 mem=1772GB H100:80GB:8  instance       $0 idle
 2  ssh (remote)  cpu=208 mem=1772GB H100:80GB:8  instance       $0 idle

Submit the run open-r1-grpo? [y/n]: y

Provisioning...
---> 100%
```
## Source code

The source code of this example can be found in
[`examples/distributed-training/open-r1` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/distributed-training/open-r1){:target="_blank"}.

!!! info "What's next?"
    1. Read the [clusters](https://dstack.ai/docs/guides/clusters) guide
    2. Check [dev environments](https://dstack.ai/docs/concepts/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks),
       [services](https://dstack.ai/docs/concepts/services), and [fleets](https://dstack.ai/docs/concepts/fleets)

diff --git a/mkdocs.yml b/mkdocs.yml
index 5bb5718c6..aa70e85db 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -268,6 +268,7 @@ nav:
       - TRL: examples/distributed-training/trl/index.md
       - Axolotl: examples/distributed-training/axolotl/index.md
       - Ray+RAGEN: examples/distributed-training/ray-ragen/index.md
+      - Open-R1: examples/distributed-training/open-r1/index.md
     - Clusters:
       - NCCL tests: examples/clusters/nccl-tests/index.md
       - RCCL tests: examples/clusters/rccl-tests/index.md