GRPO: Scalable training with one LLM/node #3186

jglaser · 2025-03-31T07:28:30Z

What does this PR do?

Truly scalable GRPO training with one local vLLM process per node. Also works with FSDP. Tested on 256 nodes of Frontier (2048 GPUs).
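To make the layout concrete, here is a minimal sketch of how global ranks could map onto nodes when each node runs one vLLM process. It assumes ranks are assigned contiguously per node and 8 processes per node (2048 GPUs / 256 nodes). The names `RankPlacement` and `place_rank`, and the convention that local rank 0 is the node's main process, are hypothetical and not taken from the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RankPlacement:
    node_index: int      # which node this rank lives on
    local_rank: int      # position of the rank within its node
    is_local_main: bool  # assumption: local rank 0 is the node's main process


def place_rank(rank: int, world_size: int, procs_per_node: int) -> RankPlacement:
    """Map a global rank to its node, assuming contiguous ranks per node."""
    if procs_per_node <= 0 or world_size % procs_per_node != 0:
        raise ValueError("world_size must be a positive multiple of procs_per_node")
    if not 0 <= rank < world_size:
        raise ValueError("rank out of range")
    node_index, local_rank = divmod(rank, procs_per_node)
    return RankPlacement(node_index, local_rank, local_rank == 0)


def local_main_ranks(world_size: int, procs_per_node: int) -> list:
    """List the global ranks that act as the main process of their node."""
    return [
        r for r in range(world_size)
        if place_rank(r, world_size, procs_per_node).is_local_main
    ]
```

With `world_size=2048` and `procs_per_node=8`, `local_main_ranks` yields one rank per node, so generation requests stay on the node instead of funnelling through a single global process.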

Before submitting

- This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case. Scaling bottleneck in GRPO Training (#3258), Support FSDP (#3259)
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
- Did you write any new necessary tests?
  - Developing FSDP or multinode tests is challenging and might be addressed best via a separate PR. For now we would like to make sure that existing single node tests are not breaking. Suggestions welcome!

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@qgallouedec @binary-husky

Caveat:

This PR currently disables the optimization for multiple completions per prompt, as it does not seem to be robust to duplicate user inputs.


qgallouedec · 2025-04-08T00:56:07Z

Thanks @jglaser

I'm not sure I understand why FSDP requires one vLLM instance per node?

jglaser · 2025-04-08T00:59:24Z

> Thanks @jglaser
>
> I'm not sure I understand why FSDP requires one vLLM instance per node?

It does not. The FSDP changes and the vLLM scaling in this PR are not strictly related. They arose in the same stream of work, as I was trying to train a 14B model, which also required sharding (in addition to data parallelism).

If the FSDP feature complicates review unnecessarily, this can be factored out into a separate PR. Suggestions?

qgallouedec · 2025-04-08T01:05:32Z

OK, that makes more sense. To make the review easier, can you split this into two separate PRs? 🙏

jglaser added 14 commits March 7, 2025 15:04

Support one vLLM instance per node

a793a15

Merge branch 'main' into local_main_process

7ba2ff3

Disable optimizations for multiple generation

a609ed4

- Temporarily disable optimizations for multiple generations from the same prompt
- Compute number of local processes using collective
- Strip FSDP checkpointing artifact from parameter names
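For readers unfamiliar with the parameter-name issue, the sketch below shows one way such artifacts could be stripped. It assumes the artifacts are the name segments that PyTorch's activation-checkpoint wrapper (`_checkpoint_wrapped_module`) and FSDP wrapper (`_fsdp_wrapped_module`) insert into dotted parameter names; the exact set handled by this commit is not shown here, and `clean_param_name` is a hypothetical helper.

```python
# Assumption: these are the wrapper-inserted segments to remove. The commit
# itself may strip a different or smaller set.
WRAPPER_SEGMENTS = ("_checkpoint_wrapped_module", "_fsdp_wrapped_module")


def clean_param_name(name: str) -> str:
    """Drop wrapper-inserted segments from a dotted parameter name."""
    parts = name.split(".")
    return ".".join(part for part in parts if part not in WRAPPER_SEGMENTS)


if __name__ == "__main__":
    wrapped = "model.layers.0._checkpoint_wrapped_module.self_attn.q_proj.weight"
    print(clean_param_name(wrapped))
```

Cleaning the names matters when weights are pushed to an inference engine that expects the names of the unwrapped model.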

Memory efficient parameter upload with FSDP

afa5637

remove extraneous code

ab4246a

Memory efficient FSDP subtree traversal

eb4abb3

Use provided IP for connection setup

6729cc7

Up the keepalive timeout

d0f7996

relevant for multiple processes posting requests to the same server

Do not show progress bar in server process

adb8135

Scalable communications

80b9d9c

Use an intra-node communicator to avoid sending large global messages
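A minimal sketch of how per-node groups could be derived, assuming every rank's hostname has already been gathered into a list indexed by global rank (for example with `torch.distributed.all_gather_object`). The helpers `node_groups` and `local_view` are hypothetical and only compute the rank lists; in a real run each list could be passed to `torch.distributed.new_group` to create the intra-node communicator.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def node_groups(hostnames: Sequence[str]) -> Dict[str, List[int]]:
    """Group global ranks by hostname; hostnames[i] is the host of rank i."""
    groups = defaultdict(list)
    for rank, host in enumerate(hostnames):
        groups[host].append(rank)
    return dict(groups)


def local_view(hostnames: Sequence[str], rank: int) -> Tuple[int, int]:
    """Return (local_rank, local_size) for `rank` within its own node."""
    peers = node_groups(hostnames)[hostnames[rank]]
    return peers.index(rank), len(peers)


if __name__ == "__main__":
    hosts = ["node-a", "node-a", "node-b", "node-b", "node-b"]
    print(node_groups(hosts))
    print(local_view(hosts, 3))
```

The same grouping also gives the number of local processes per node, so large payloads only need to travel between ranks that share a host.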

Update documentation

c2c345d

Merge branch 'main' into local_main_process

cbecfbd

remove left over function call

17a905f

fix merge conflict

65fbec4

jglaser marked this pull request as ready for review April 4, 2025 18:22

Update doc

b746287

This was referenced Apr 4, 2025

GRPO training fails with vllm=True due to connection issues with vLLM Server #3214

Open

GRPO failed when training with fsdp #2796

Open

Merge branch 'main' into local_main_process

0a391ce

This was referenced Apr 8, 2025

Scaling bottleneck in GRPO Training. #3258

Open

Support FSDP #3259

Open
