
Add a DAPO recipe under math_rl (clip-higher PPO config + dynamic-sample-keep flag, no new algorithm) #726

Open
mvanhorn wants to merge 1 commit into thinking-machines-lab:main from mvanhorn:fix/281-tinker-cookbook-dapo-recipe


Summary

  1. Add tinker_cookbook/recipes/math_rl/dapo_train.py. It imports CLIConfig and cli_main from recipes/math_rl/train.py and exposes DAPOConfig, a subclass of CLIConfig that presets the DAPO defaults (all inherited fields stay overridable from the command line):

    import asyncio
    import chz
    from tinker_cookbook.recipes.math_rl.train import CLIConfig as MathRLCLIConfig, cli_main
    
    @chz.chz
    class DAPOConfig(MathRLCLIConfig):
        """DAPO preset for math_rl: clip-higher PPO + dynamic group filtering.
    
        References:
            - DAPO paper: https://arxiv.org/abs/2503.14476
        """
        loss_fn: str = "ppo"
        # DAPO recommends asymmetric clipping: tight low side, looser high side
        # so positive-advantage tokens with rising ratios can keep contributing.
        loss_fn_config: dict | None = chz.field(
            default_factory=lambda: {
                "clip_low_threshold": 0.8,
                "clip_high_threshold": 1.28,
            }
        )
    
    if __name__ == "__main__":
        cli_config = chz.entrypoint(DAPOConfig)
        asyncio.run(cli_main(cli_config))

    (The thresholds 0.8 / 1.28 are the DAPO paper's epsilon_low=0.2 / epsilon_high=0.28 expressed as absolute bounds on the importance-sampling ratio: 1 - 0.2 = 0.8 and 1 + 0.28 = 1.28. The paper's epsilons are offsets around 1, whereas tinker's clip_low_threshold / clip_high_threshold operate directly on the ratio; see tutorials/202_loss_functions.py lines ~219-238.)

  2. If do_group_rollout_and_filter_constant_reward(..., do_remove_constant_reward_groups=...) is exposed via the math_rl train config path, ensure the DAPO recipe sets it to True. If it is only configurable via the lower-level Config (in rl/train.py), document this in the recipe README rather than threading a new CLI flag - keep the diff small.

  3. Add a ## DAPO preset section to tinker_cookbook/recipes/math_rl/README.md with:

    • One-paragraph summary: DAPO is ppo with asymmetric clipping + dynamic sampling, designed for math RL on small/mid-size models.

    • Citation: https://arxiv.org/abs/2503.14476

    • Example launch command:

      uv run python -m tinker_cookbook.recipes.math_rl.dapo_train \
        env=math \
        model_name=Qwen/Qwen3-4B-Instruct-2507 \
        group_size=8 \
        groups_per_batch=64 \
        learning_rate=1e-5

    • One-paragraph note: "This is a configuration over the existing ppo loss; no new algorithm in the library."

  4. PR title: feat(recipes): add DAPO preset (clip-higher PPO) under math_rl

    PR body (concise, factual):

    Addresses the third bullet of #281 ("suggested fixes and improvements to rl/train.py"), which tyler-griggs explicitly greenlit in a comment on that issue.

    Adds tinker_cookbook/recipes/math_rl/dapo_train.py as a thin CLIConfig wrapper over the existing math_rl cli_main, with DAPO's recommended clip-higher PPO settings (clip_low_threshold=0.8, clip_high_threshold=1.28) wired through loss_fn_config. No new loss function and no library code change: two of DAPO's other ingredients (token-level loss aggregation, dynamic constant-reward group filtering) are already supported by tinker / the cookbook.

    Deliberately scoped to DAPO only; GFPO and GSPO would be follow-up PRs (GSPO is also in flight as #672, "Add GSPO recipe").
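To make the threshold arithmetic in item 1 concrete, here is a minimal sketch of how the paper's epsilons map to absolute ratio bounds, and how asymmetric bounds change the standard PPO clipped surrogate. The helper names `ratio_thresholds` and `clipped_objective` are hypothetical and exist only for illustration; this is the textbook per-token objective, not tinker's implementation, which may differ in detail.

```python
def ratio_thresholds(eps_low: float, eps_high: float) -> tuple[float, float]:
    """Map epsilon offsets around 1 to absolute bounds on the importance ratio."""
    return 1.0 - eps_low, 1.0 + eps_high


def clipped_objective(ratio: float, advantage: float, low: float, high: float) -> float:
    """Standard PPO clipped surrogate for one token (a quantity to maximize).

    `low` and `high` are absolute ratio bounds, e.g. 0.8 and 1.28.
    """
    clipped_ratio = min(max(ratio, low), high)
    # Taking the minimum keeps the pessimistic estimate, as in PPO.
    return min(ratio * advantage, clipped_ratio * advantage)


low, high = ratio_thresholds(eps_low=0.2, eps_high=0.28)

# A positive-advantage token whose ratio has risen to 1.25:
# with the looser upper bound it is not clipped, so it still contributes gradient;
# with a symmetric upper bound of 1.2 it would be clipped.
asymmetric = clipped_objective(1.25, 1.0, low, high)
symmetric = clipped_objective(1.25, 1.0, 0.8, 1.2)
print(asymmetric, symmetric)
```

The comparison at the end is the point of clip-higher: only the upper bound moves, so the treatment of falling ratios is unchanged.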

Why this matters

Issue #281 collected three asks: (1) wandb resume: tyler-griggs greenlit a PR for that, but PR #305 was closed unmerged over complexity concerns, so it is skipped in this plan; (2) loss_fn config: already resolved in PR #156; (3) "popular recipes out of the box (DAPO, GFPO, GSPO, etc.)": tyler-griggs explicitly invited a more detailed issue and a PR.

This plan addresses ask (3) with the smallest defensible step: add a single DAPO preset under recipes/math_rl/. DAPO is the cleanest fit because it does not require a new loss function:

  • clip_higher: asymmetric PPO clipping with clip_low_threshold = 1 - epsilon_low and clip_high_threshold = 1 + epsilon_high, where epsilon_high > epsilon_low (the paper recommends epsilon_low=0.2 / epsilon_high=0.28, i.e. ratio thresholds of 0.8 / 1.28).
  • dynamic sampling: drop rollout groups where all members produced the same reward (already supported - do_group_rollout_and_filter_constant_reward does this; just needs to be the default in the DAPO recipe).
  • token-level loss aggregation: tinker's ppo loss already operates at the token level.

So a DAPO recipe is a CLIConfig preset plus a launch script and a README pointer - no library-level code change. GFPO and GSPO are deliberately not in scope here; if this lands cleanly they can be follow-up PRs (and PR #672 is already an open lane for GSPO).
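As an illustration of the dynamic-sampling bullet, the sketch below shows why constant-reward groups are dropped: with group-mean-centered advantages, every member of such a group gets an advantage of zero and contributes no learning signal. The function names are hypothetical and do not reproduce the signature or internals of do_group_rollout_and_filter_constant_reward; only the filtering step is shown, assuming rewards are plain floats.

```python
from collections.abc import Sequence


def centered_advantages(rewards: Sequence[float]) -> list[float]:
    """Advantage of each rollout relative to its group's mean reward."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def drop_constant_reward_groups(groups: Sequence[Sequence[float]]) -> list[list[float]]:
    """Keep only groups in which at least two rollouts received different rewards."""
    return [list(g) for g in groups if len(set(g)) > 1]


# Three groups of rollouts for three prompts (hypothetical 0/1 correctness rewards).
groups = [
    [1.0, 1.0, 1.0],  # all correct: every advantage is zero
    [0.0, 1.0, 0.0],  # mixed: carries signal
    [0.0, 0.0, 0.0],  # all wrong: every advantage is zero
]

kept = drop_constant_reward_groups(groups)
print(kept)
print(centered_advantages(groups[0]))
```

Dropping these groups shrinks the effective batch, which is one reason to raise groups_per_batch when the filter is enabled.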

Testing

  • uv run ruff check tinker_cookbook/recipes/math_rl/ and uv run ruff format --check tinker_cookbook/recipes/math_rl/ pass.
  • uv run pyright tinker_cookbook/recipes/math_rl/dapo_train.py passes - the chz inheritance and entrypoint pattern match the existing train.py.
  • python -m tinker_cookbook.recipes.math_rl.dapo_train --help prints the chz-generated help and shows loss_fn: ppo and the populated loss_fn_config defaults (no real run needed; chz CLI parses without an API key).
  • uv run pytest tinker_cookbook/rl/builder_pickle_test.py passes - the new module does not interfere with the global config registry.
  • Sanity-check by hand that the new file's loss_fn_config thresholds round-trip into Config.loss_fn_config and are passed to forward_backward_async (no execution required; reading recipes/math_rl/train.py lines 167-170 shows the field is threaded through verbatim).
  • grep -n "DAPO\|dapo" tinker_cookbook/recipes/math_rl/README.md shows the new recipe section; no other docs reference DAPO.

Fixes #281

AI was used for assistance.
