feat(rl): opt-in async eval scheduling by StuffByLiang · Pull Request #714 · thinking-machines-lab/tinker-cookbook

StuffByLiang · 2026-05-17T03:06:34Z

Summary

Adds Config.async_eval: bool = False (default off — no behavior change). When enabled, run_evaluations_parallel schedules each evaluator as a fire-and-forget asyncio.Task instead of awaiting them inline. Eval metrics are logged asynchronously via ml_logger.log_metrics(metrics, step=i_batch_when_scheduled) when each evaluator finishes; pending tasks are awaited up to async_eval_drain_timeout_s (default 600s) at shutdown.

Motivation

In long-CoT RL training (e.g. max_tokens=24576 against hundreds of LiveCodeBench-style problems), a single eval pass can take minutes — comparable to a training step. With the current sync-await pattern at train.py:692, train.py:1225, train.py:1694, training stalls for the entire eval. Even though evaluators internally use asyncio.gather, the train loop awaits the whole gather before the next step.

set_rollout_executor(ProcessPoolExecutor(...)) is the cookbook's existing scale-out path, but it parallelizes within an eval rather than overlapping eval with training. This PR adds the missing overlap by making the schedule fire-and-forget.

I was running this pattern as an out-of-tree wrapper around RLTestSetEvaluator for our own training and it cleanly cut the eval-induced stall to zero, so figured it might be worth upstreaming as a built-in opt-in.

Design


Backward compat	Off by default. `Config.async_eval=False` → existing behavior, byte-for-byte.
Wiring	New `_AsyncEvalDispatcher` class + `ContextVar` (`_async_eval_dispatcher`). Pattern mirrors the existing `_rollout_executor` ContextVar in `rl/rollouts.py`.
Signature change	None. `run_evaluations_parallel` consults `get_async_eval_dispatcher()` at the top of the function.
Metric logging	When async, each eval's metrics are written via `ml_logger.log_metrics(metrics, step=i_batch_when_scheduled)` after the task completes. W&B accepts retroactive step writes.
Shutdown	`main()` awaits `dispatcher.drain(timeout=config.async_eval_drain_timeout_s)` before `ml_logger.close()`. Set timeout to `None` for unbounded.

Tradeoffs (documented in the Config docstring)

Eval rollouts compete with training rollouts for per-account Tinker sampling concurrency. Heavy evals may slow training.
W&B charts will show retroactive step writes — well-supported but worth knowing when reading hover values.
Evaluator exceptions are swallowed with a warning rather than raised (a flaky eval shouldn't kill training).

Tests

Adds tinker_cookbook/rl/train_async_eval_test.py (5 tests, no API or network):

test_schedule_returns_sentinel_immediately — schedule returns without awaiting the underlying evaluator
test_drain_awaits_pending_tasks_and_logs_at_captured_step — multiple scheduled evals all log via ml_logger at their captured i_batch
test_run_evaluations_parallel_routes_through_dispatcher_when_installed — ContextVar-based dispatch works at the public API boundary
test_get_set_dispatcher_round_trip — ContextVar install/clear
test_eval_failure_is_swallowed_with_warning — exception in evaluator does not propagate

All 5 pass.

I also ran pytest tinker_cookbook/rl/ (full RL test suite): 108 passed, 3 skipped. One pre-existing failure unrelated to this change (rollout_logging_test::test_write_rollout_summaries_jsonl_handles_numpy_scalars — a deprecation that should have been removed in 0.4.0 per its own message; reproduces on main without this PR).

Test plan

New unit tests added and passing
Existing tinker_cookbook/rl/ tests still pass (one pre-existing unrelated failure on main)
Smoke test the cookbook-internal Config(async_eval=True) path against a real Tinker RL run (I've validated the same approach via an external wrapper; haven't yet run with the in-cookbook Config.async_eval=True flag wired end-to-end)

🤖 Generated with Claude Code

Add `Config.async_eval: bool = False` (default off — no behavior change for existing users). When enabled, `run_evaluations_parallel` schedules each evaluator as a fire-and-forget `asyncio.Task` rather than awaiting them inline. Evaluator metrics are logged asynchronously via `ml_logger` at the iteration step at which the eval was scheduled. Pending tasks are awaited up to `async_eval_drain_timeout_s` (default 600s) at shutdown. This unblocks training loops where the eval pass is comparable in wall-clock to a training step — e.g. long-CoT code-RL eval on hundreds of problems, where synchronous eval can stall training for minutes at a time. The plumbing uses a `ContextVar` (`_async_eval_dispatcher`) so the signature of `run_evaluations_parallel` is unchanged. The pattern mirrors the existing `_rollout_executor` ContextVar in `rl/rollouts.py`. Tradeoffs called out in the docstring: * Eval rollouts compete with training rollouts for Tinker's per-account sampling concurrency. Heavy evals may slow training. * W&B receives retroactive step writes when async evals complete out of order. * Pending evals are awaited at shutdown; set `async_eval_drain_timeout_s=None` for unbounded. Includes 5 unit tests (`tinker_cookbook/rl/train_async_eval_test.py`): * sentinel return value and non-blocking schedule * drain() awaits and logs at the captured iteration step * `run_evaluations_parallel` routes through dispatcher when installed * ContextVar install/clear round-trip * evaluator failures are swallowed (logged, not raised) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Remove redundant quoted forward refs (file has `from __future__ import annotations`), drop unused `# noqa: BLE001`, and apply ruff-format. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

StuffByLiang and others added 2 commits May 16, 2026 20:05

chore(rl): satisfy ruff lint and format

db115ca

Remove redundant quoted forward refs (file has `from __future__ import annotations`), drop unused `# noqa: BLE001`, and apply ruff-format. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(rl): opt-in async eval scheduling#714

feat(rl): opt-in async eval scheduling#714
StuffByLiang wants to merge 2 commits into
thinking-machines-lab:mainfrom
StuffByLiang:feat/async-eval

StuffByLiang commented May 17, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

StuffByLiang commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation

Design

Tradeoffs (documented in the Config docstring)

Tests

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

StuffByLiang commented May 17, 2026 •

edited

Loading