Add a DAPO recipe under math_rl (clip-higher PPO config + dynamic-sample-keep flag, no new algorithm) by mvanhorn · Pull Request #726 · thinking-machines-lab/tinker-cookbook

mvanhorn · 2026-05-19T23:40:36Z

Summary

Add tinker_cookbook/recipes/math_rl/dapo_train.py. It imports CLIConfig and cli_main from recipes/math_rl/train.py and exposes DAPOConfig, a subclass of that CLIConfig that overrides two defaults (loss_fn and loss_fn_config) with the DAPO settings:

```python
import asyncio
import chz
from tinker_cookbook.recipes.math_rl.train import CLIConfig as MathRLCLIConfig, cli_main

@chz.chz
class DAPOConfig(MathRLCLIConfig):
    """DAPO preset for math_rl: clip-higher PPO + dynamic group filtering.

    References:
        - DAPO paper: https://arxiv.org/abs/2503.14476
    """
    loss_fn: str = "ppo"
    # DAPO recommends asymmetric clipping: tight low side, looser high side
    # so positive-advantage tokens with rising ratios can keep contributing.
    loss_fn_config: dict | None = chz.field(
        default_factory=lambda: {
            "clip_low_threshold": 0.8,
            "clip_high_threshold": 1.28,
        }
    )

if __name__ == "__main__":
    cli_config = chz.entrypoint(DAPOConfig)
    asyncio.run(cli_main(cli_config))
```

(The IS-ratio-space thresholds 0.8 / 1.28 are 1 - epsilon_low and 1 + epsilon_high for the DAPO paper's epsilon_low=0.2 / epsilon_high=0.28. The paper's epsilons are offsets from a ratio of 1, not log-ratio values; tinker's clip_low_threshold / clip_high_threshold operate directly on the ratio, see tutorials/202_loss_functions.py lines ~219-238.)
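To make the threshold arithmetic concrete, here is a minimal sketch of the standard clipped PPO surrogate for a single token. It is a generic illustration, not tinker's implementation: the helper names ratio_thresholds and clipped_surrogate are hypothetical, and only the two threshold values come from this PR.

```python
import math


def ratio_thresholds(eps_low: float, eps_high: float) -> tuple[float, float]:
    """Convert DAPO-style epsilons into bounds on the importance-sampling ratio."""
    return 1.0 - eps_low, 1.0 + eps_high


def clipped_surrogate(ratio: float, advantage: float,
                      clip_low: float, clip_high: float) -> float:
    """Standard PPO clipped objective for one token (to be maximized)."""
    clipped_ratio = min(max(ratio, clip_low), clip_high)
    return min(ratio * advantage, clipped_ratio * advantage)


# Hypothetical helper names; the numbers are the ones used in this PR.
low, high = ratio_thresholds(eps_low=0.2, eps_high=0.28)

# A positive-advantage token whose ratio has risen to 1.25:
symmetric = clipped_surrogate(1.25, 1.0, clip_low=0.8, clip_high=1.2)
asymmetric = clipped_surrogate(1.25, 1.0, clip_low=low, clip_high=high)
```

With a symmetric upper bound of 1.2 the token is already clipped and its objective is flat in the ratio, whereas with the looser upper bound of 1.28 the same token is still inside the clip range and keeps contributing gradient.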

If do_group_rollout_and_filter_constant_reward(..., do_remove_constant_reward_groups=...) is exposed via the math_rl train config path, ensure the DAPO recipe sets it to True. If it is only configurable via the lower-level Config (in rl/train.py), document this in the recipe README rather than threading a new CLI flag - keep the diff small.

Add a ## DAPO preset section to tinker_cookbook/recipes/math_rl/README.md with:
- One-paragraph summary: DAPO is ppo with asymmetric clipping + dynamic sampling, designed for math RL on small/mid-size models.
- Citation: https://arxiv.org/abs/2503.14476
- Example launch command:
```
uv run python -m tinker_cookbook.recipes.math_rl.dapo_train \
  env=math \
  model_name=Qwen/Qwen3-4B-Instruct-2507 \
  group_size=8 \
  groups_per_batch=64 \
  learning_rate=1e-5
```
- One-paragraph note: "This is a configuration over the existing ppo loss; no new algorithm in the library."

PR title: feat(recipes): add DAPO preset (clip-higher PPO) under math_rl

PR body (concise, factual):

Addresses the third bullet of issue #281 ("suggested fixes and improvements to rl/train.py"), which tyler-griggs explicitly greenlit in a comment on that issue.

Adds tinker_cookbook/recipes/math_rl/dapo_train.py as a thin CLIConfig wrapper over the existing math_rl cli_main, with DAPO's recommended clip-higher PPO settings (clip_low=0.8, clip_high=1.28) wired through loss_fn_config. No new loss function and no library code change - DAPO's other two ingredients (token-level loss aggregation, dynamic constant-reward group filtering) are already supported by tinker / the cookbook.
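As a small illustration of what token-level loss aggregation means compared with per-sequence averaging, the sketch below computes both on a toy batch. It is a generic example with hypothetical function names and made-up loss values, not code from tinker or the cookbook.

```python
from typing import List


def sample_level_mean(per_token_losses: List[List[float]]) -> float:
    """Average within each sequence first, then across sequences."""
    per_sequence = [sum(seq) / len(seq) for seq in per_token_losses]
    return sum(per_sequence) / len(per_sequence)


def token_level_mean(per_token_losses: List[List[float]]) -> float:
    """Average over every token in the batch, regardless of sequence."""
    total = sum(sum(seq) for seq in per_token_losses)
    count = sum(len(seq) for seq in per_token_losses)
    return total / count


# Toy batch: one short sequence (2 tokens) and one long sequence (6 tokens).
batch = [[1.0, 1.0], [0.0] * 6]

sample_level = sample_level_mean(batch)  # each sequence weighted equally
token_level = token_level_mean(batch)    # each token weighted equally
```

Under sample-level averaging the two sequences count equally (0.5 here), while token-level averaging weights each sequence by its length (0.25 here), so long responses are not down-weighted per token.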

Deliberately scoped to DAPO only; GFPO and GSPO would be follow-up PRs (GSPO is also in flight as PR #672, "Add GSPO recipe").

Why this matters

Issue #281 collected three asks: (1) wandb resume - tyler-griggs greenlit a PR for that (PR #305 closed unmerged, complexity concerns - skipping in this plan); (2) loss_fn config - already resolved in PR #156; (3) "popular recipes out of the box (DAPO, GFPO, GSPO, etc.)" - tyler-griggs explicitly invited a more detailed issue and a PR.

This plan addresses ask (3) with the smallest defensible step: add a single DAPO preset under recipes/math_rl/. DAPO is the cleanest fit because it does not require a new loss function:

- clip-higher: asymmetric PPO clipping with clip_low_threshold = 1 - epsilon_low and clip_high_threshold = 1 + epsilon_high, where epsilon_high > epsilon_low (the paper recommends epsilon_low=0.2 / epsilon_high=0.28, i.e. 0.8 / 1.28 in IS-ratio space).
- dynamic sampling: drop rollout groups where all members produced the same reward (already supported - do_group_rollout_and_filter_constant_reward does this; just needs to be the default in the DAPO recipe).
- token-level loss aggregation: tinker's ppo loss already operates token-level.
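The dynamic sampling item above can be sketched as follows. This is a simplified stand-in, not the cookbook's do_group_rollout_and_filter_constant_reward: the function names are hypothetical, groups are represented as plain lists of rewards, and advantages are assumed to be rewards centered on the group mean.

```python
from typing import List


def centered_advantages(rewards: List[float]) -> List[float]:
    """Group-relative advantages: reward minus the group mean (assumed form)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def filter_constant_reward_groups(groups: List[List[float]]) -> List[List[float]]:
    """Keep only groups whose members did not all receive the same reward."""
    return [g for g in groups if g and max(g) != min(g)]


# Toy rollout groups: all correct, mixed, all wrong.
groups = [
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
]

kept = filter_constant_reward_groups(groups)
```

A constant-reward group yields all-zero centered advantages, so it carries no learning signal; filtering such groups out keeps the batch made up of prompts that still distinguish good from bad samples.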

So a DAPO recipe is a CLIConfig preset plus a launch script and a README pointer - no library-level code change. GFPO and GSPO are deliberately not in scope here; if this lands cleanly they can be follow-up PRs (and PR #672 is already an open lane for GSPO).

Testing

uv run ruff check tinker_cookbook/recipes/math_rl/ and uv run ruff format --check tinker_cookbook/recipes/math_rl/ pass.
uv run pyright tinker_cookbook/recipes/math_rl/dapo_train.py passes - the chz inheritance and entrypoint pattern match the existing train.py.
python -m tinker_cookbook.recipes.math_rl.dapo_train --help prints the chz-generated help and shows loss_fn: ppo and the populated loss_fn_config defaults (no real run needed; chz CLI parses without an API key).
uv run pytest tinker_cookbook/rl/builder_pickle_test.py passes - the new module does not interfere with the global config registry.
Sanity-check by hand that the new file's loss_fn_config thresholds round-trip into Config.loss_fn_config and are passed to forward_backward_async (no execution required; reading recipes/math_rl/train.py lines 167-170 shows the field is threaded through verbatim).
grep -n "DAPO\|dapo" tinker_cookbook/recipes/math_rl/README.md shows the new recipe section; no other docs reference DAPO.
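The hand check of the threshold round-trip listed above could also be expressed as a small test. The sketch below uses stdlib dataclasses as stand-ins for the chz classes so it runs without tinker installed; BaseCLIConfig, DAPOPreset, TrainConfig, and build_train_config are all hypothetical names, and the base default for loss_fn is a placeholder rather than the cookbook's actual default.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class BaseCLIConfig:
    """Hypothetical stand-in for the math_rl CLIConfig."""
    loss_fn: str = "placeholder_default"  # placeholder, not the real default
    loss_fn_config: Optional[dict] = None


@dataclass(frozen=True)
class DAPOPreset(BaseCLIConfig):
    """Hypothetical stand-in for DAPOConfig: overrides two defaults only."""
    loss_fn: str = "ppo"
    loss_fn_config: Optional[dict] = field(
        default_factory=lambda: {
            "clip_low_threshold": 0.8,
            "clip_high_threshold": 1.28,
        }
    )


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical stand-in for the lower-level training Config."""
    loss_fn: str
    loss_fn_config: Optional[dict]


def build_train_config(cli: BaseCLIConfig) -> TrainConfig:
    # Pass both fields through verbatim, mirroring what the PR says train.py does.
    return TrainConfig(loss_fn=cli.loss_fn, loss_fn_config=cli.loss_fn_config)


train_config = build_train_config(DAPOPreset())
```

Using a default factory rather than a shared dict literal means each preset instance gets its own loss_fn_config, so mutating one run's config cannot leak into another.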

Fixes #281

AI was used for assistance.

Commit e441ef2: Add a DAPO recipe under math_rl (clip-higher PPO config + dynamic-sample-keep flag, no new algorithm) Fixes thinking-machines-lab#281

mvanhorn mentioned this pull request May 19, 2026

suggested fixes and improvements to rl/train.py #281

Open

mvanhorn wants to merge 1 commit into thinking-machines-lab:main from mvanhorn:fix/281-tinker-cookbook-dapo-recipe
