Add a DAPO recipe under math_rl (clip-higher PPO config + dynamic-sample-keep flag, no new algorithm) #726
Open
mvanhorn wants to merge 1 commit into
…ple-keep flag, no new algorithm) Fixes thinking-machines-lab#281
Summary
- Add `tinker_cookbook/recipes/math_rl/dapo_train.py`. It imports `CLIConfig` and `cli_main` from `recipes/math_rl/train.py` and exposes a slimmer DAPO-specific CLIConfig that hardcodes the DAPO defaults:
  (The ratio thresholds 0.8 / 1.28 correspond to the DAPO paper's epsilon_low=0.2 / epsilon_high=0.28, i.e. 1 - epsilon_low and 1 + epsilon_high. tinker's `clip_low_threshold` / `clip_high_threshold` operate directly on the ratio, see `tutorials/202_loss_functions.py` lines ~219-238.)
- If `do_group_rollout_and_filter_constant_reward(..., do_remove_constant_reward_groups=...)` is exposed via the math_rl train config path, ensure the DAPO recipe sets it to True. If it is only configurable via the lower-level `Config` (in `rl/train.py`), document this in the recipe README rather than threading a new CLI flag, to keep the diff small.
- Add a `## DAPO preset` section to `tinker_cookbook/recipes/math_rl/README.md` with:
  - One-paragraph summary: DAPO is `ppo` with asymmetric clipping + dynamic sampling, designed for math RL on small/mid-size models.
  - Citation: https://arxiv.org/abs/2503.14476
  - Example launch command:
  - One-paragraph note: "This is a configuration over the existing `ppo` loss; no new algorithm in the library."
- PR title: `feat(recipes): add DAPO preset (clip-higher PPO) under math_rl`
- PR body (concise, factual):
Why this matters
Issue #281 collected three asks: (1) wandb resume - tyler-griggs greenlit a PR for that (PR #305 closed unmerged, complexity concerns - skipping in this plan); (2) loss_fn config - already resolved in PR #156; (3) "popular recipes out of the box (DAPO, GFPO, GSPO, etc.)" - tyler-griggs explicitly invited a more detailed issue and a PR.
This plan addresses ask (3) with the smallest defensible step: add a single DAPO preset under `recipes/math_rl/`. DAPO is the cleanest fit because it does not require a new loss function:
- clip_higher: asymmetric PPO clipping with `clip_low_threshold` = 1 - epsilon_low and `clip_high_threshold` = 1 + epsilon_high, where epsilon_high > epsilon_low (the paper recommends epsilon_low = 0.2 and epsilon_high = 0.28, i.e. ratio thresholds 0.8 / 1.28).
- dynamic sampling: drop rollout groups where all members produced the same reward (already supported: `do_group_rollout_and_filter_constant_reward` does this; it just needs to be the default in the DAPO recipe).
- token-level loss aggregation: tinker's `ppo` loss already operates token-level.

So a DAPO recipe is a CLIConfig preset plus a launch script and a README pointer, with no library-level code change. GFPO and GSPO are deliberately not in scope here; if this lands cleanly they can be follow-up PRs (and PR #672 is already an open lane for GSPO).
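For reviewers who want the two mechanisms spelled out, here is a minimal standalone sketch in plain Python. It is not tinker's implementation and touches no tinker API: the function names (`dapo_ratio_thresholds`, `clipped_token_objective`, `keep_group`) are hypothetical and exist only for illustration. It shows how the paper's epsilons map to ratio thresholds, how the asymmetric clip acts on one token, and what "drop constant-reward groups" means.

```python
import math
from typing import Sequence


def dapo_ratio_thresholds(eps_low: float = 0.2, eps_high: float = 0.28) -> tuple[float, float]:
    """Map DAPO's epsilons to lower/upper bounds on the importance ratio."""
    return 1.0 - eps_low, 1.0 + eps_high


def clipped_token_objective(
    logp_new: float,
    logp_old: float,
    advantage: float,
    clip_low: float,
    clip_high: float,
) -> float:
    """PPO-style clipped surrogate for a single token (to be maximized).

    Illustrative only; the names here are assumptions, not tinker's API.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, clip_low), clip_high)
    # Pessimistic bound: take the smaller of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


def keep_group(rewards: Sequence[float]) -> bool:
    """Dynamic sampling: keep a rollout group only if its rewards differ."""
    return len(set(rewards)) > 1


if __name__ == "__main__":
    low, high = dapo_ratio_thresholds()
    print("ratio thresholds:", low, high)
    # A token whose probability rose by 1.5x with a positive advantage
    # is capped at the upper threshold.
    print(clipped_token_objective(math.log(1.5), 0.0, 1.0, low, high))
    # Groups with identical rewards carry no group-relative signal.
    print(keep_group([1.0, 1.0, 1.0]), keep_group([0.0, 1.0, 1.0]))
```

Because the upper bound sits further from 1 than the lower bound, low-probability tokens with positive advantage get more room to grow before the clip engages, which is the "clip-higher" idea the preset encodes through `clip_low_threshold` and `clip_high_threshold`.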
Testing
- `uv run ruff check tinker_cookbook/recipes/math_rl/` and `uv run ruff format --check tinker_cookbook/recipes/math_rl/` pass.
- `uv run pyright tinker_cookbook/recipes/math_rl/dapo_train.py` passes; the chz inheritance and entrypoint pattern match the existing `train.py`.
- `python -m tinker_cookbook.recipes.math_rl.dapo_train --help` prints the chz-generated help and shows `loss_fn: ppo` and the populated `loss_fn_config` defaults (no real run needed; chz CLI parses without an API key).
- `uv run pytest tinker_cookbook/rl/builder_pickle_test.py` passes; the new module does not interfere with the global config registry.
- `loss_fn_config` thresholds round-trip into `Config.loss_fn_config` and are passed to `forward_backward_async` (no execution required; reading `recipes/math_rl/train.py` lines 167-170 shows the field is threaded through verbatim).
- `grep -n "DAPO\|dapo" tinker_cookbook/recipes/math_rl/README.md` shows the new recipe section; no other docs reference DAPO.

Fixes #281
AI was used for assistance.