Add SFT LoRA support #1849
Hi! It would be great to get some feedback here and in #1850 when you have time 😌 I rebased on the latest. I also have a follow-up PR prepared for LoRA warm-start support in the RL trainer, enabling end-to-end SFT+RL with LoRA adapters (without merges). I was waiting for feedback here and in #1850 before opening the next ones. If possible, could someone also trigger CI for both PRs? Thanks a lot! 🙏🏼
Just pushed an update that adds integration tests and fixes the adapter checkpoint export so saved adapters are PEFT-compatible / vLLM-loadable:
Tested end-to-end on 1x 4090 and 2x H100; the adapter loads in vLLM.
Thanks for the feedback @Jackmin801! I cleaned this up so The filtering was from an attempt to also keep
Summary
- `model.lora` is enabled.
- `MultiRunManager` state and LoRA scaling (alpha / rank) for SFT LoRA runs.

Why
SFT LoRA was not fully wired as a first-class runtime path and could fail at startup without manual setup.
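For readers unfamiliar with the alpha / rank scaling mentioned in the summary, here is a minimal pure-Python sketch of how that factor enters a LoRA layer's forward pass. The function names (`lora_scaling`, `matvec`, `lora_forward`) are hypothetical and are not taken from this PR; the actual trainer operates on tensors, not Python lists.

```python
def lora_scaling(alpha: float, rank: int) -> float:
    """Return the LoRA scaling factor alpha / rank."""
    if rank <= 0:
        raise ValueError("rank must be a positive integer")
    return alpha / rank


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(x, base_weight, lora_a, lora_b, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x) for a single input vector."""
    base_out = matvec(base_weight, x)
    # The low-rank path first projects down with A, then back up with B.
    low_rank_out = matvec(lora_b, matvec(lora_a, x))
    scale = lora_scaling(alpha, rank)
    return [b + scale * d for b, d in zip(base_out, low_rank_out)]
```

If the scaling is left unset or set inconsistently, the adapter's contribution is weighted differently than intended, which is presumably why the PR sets it explicitly for SFT LoRA runs.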
Before
RuntimeError: MultiRunManager not initialized. Please call setup_multi_run_manager first.

After
Evidence
- `loss/mean` convergence (full-ft vs LoRA), 200 steps.
- sft_fullft_rtext_200.toml
- sft_lora_rtext_200.toml

Validation
- `MultiRunManager` initialization failures.

Scope
Note
Medium Risk
Modifies SFT training initialization and distributed checkpoint saving paths for LoRA, including collectives over FSDP/DTensor; mistakes could break training startup or produce invalid adapter checkpoints.
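The collective risk noted above can be illustrated with a small sketch. `DTensor.full_tensor()` gathers shards from all ranks, so every rank has to call it, and any rank-dependent branching (such as "only rank 0 writes the file") must come afterwards. The helper name `materialize_lora_state`, the `is_save_rank` flag, and the name-based `"lora"` filter below are assumptions for illustration, not the PR's code; the function is duck-typed so it does not import PyTorch.

```python
def materialize_lora_state(state_dict, is_save_rank):
    """Gather sharded LoRA tensors on every rank; return them only on the saving rank."""
    full_state = {}
    for name, value in state_dict.items():
        # Hypothetical filter: keep only parameters whose name mentions LoRA.
        if "lora" not in name.lower():
            continue
        # full_tensor() is a collective under FSDP/DTensor, so it must run on
        # every rank, before any rank-dependent branching.
        if hasattr(value, "full_tensor"):
            value = value.full_tensor()
        full_state[name] = value
    # Only now is it safe to diverge by rank.
    return full_state if is_save_rank else None
```

Placing the `is_save_rank` check before the loop would make non-saving ranks skip the collective, which typically hangs the saving rank while it waits for them.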
Overview
Enables SFT runs with `model.lora` by initializing the `MultiRunManager`, setting LoRA scaling (alpha / rank), and updating per-step LoRA token counts so LoRA layers receive correct partition metadata.

Updates weight checkpointing when `save_adapter_separately` is enabled to save PEFT-compatible adapter artifacts via `save_state_dict`, while also capturing MultiRun LoRA state across all ranks (handling `DTensor.full_tensor()` collectives under FSDP).

Adds GPU CI integration configs and a new integration test that runs SFT LoRA start + resume, asserts loss decreases, and validates adapter checkpoint structure/keys.
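The two checks described for the integration test (loss decreases, adapter checkpoint structure/keys) might look roughly like the sketch below. It is not the PR's test code: `loss_decreased` and `check_adapter_dir` are hypothetical helpers, the window size is arbitrary, and the expected file name and config fields follow the usual PEFT adapter layout (`adapter_config.json` with `r`, `lora_alpha`, `target_modules`). The caller is assumed to supply `tensor_keys`, for example by listing the keys of the saved adapter weights file.

```python
import json
from pathlib import Path


def loss_decreased(losses, window=5):
    """True if the mean of the last `window` losses is below the mean of the first `window`."""
    if len(losses) < 2 * window:
        raise ValueError("need at least 2 * window loss values")
    head = sum(losses[:window]) / window
    tail = sum(losses[-window:]) / window
    return tail < head


def check_adapter_dir(adapter_dir, tensor_keys):
    """Return a list of problems found in a PEFT-style adapter directory."""
    problems = []
    config_path = Path(adapter_dir) / "adapter_config.json"
    if not config_path.is_file():
        problems.append("missing adapter_config.json")
    else:
        config = json.loads(config_path.read_text())
        for field in ("r", "lora_alpha", "target_modules"):
            if field not in config:
                problems.append(f"adapter_config.json lacks '{field}'")
    # PEFT-style weights are expected to come in lora_A / lora_B pairs.
    if not any(".lora_A." in key for key in tensor_keys):
        problems.append("no lora_A tensors")
    if not any(".lora_B." in key for key in tensor_keys):
        problems.append("no lora_B tensors")
    return problems
```

Comparing windowed means rather than the first and last step makes the loss check less sensitive to per-step noise; an empty list from `check_adapter_dir` means no structural problems were found.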
Written by Cursor Bugbot for commit 833afe9.