Skip to content

feat(rl): opt-in async eval scheduling#714

Draft
StuffByLiang wants to merge 2 commits into
thinking-machines-lab:mainfrom
StuffByLiang:feat/async-eval
Draft

feat(rl): opt-in async eval scheduling#714
StuffByLiang wants to merge 2 commits into
thinking-machines-lab:mainfrom
StuffByLiang:feat/async-eval

Conversation

@StuffByLiang

@StuffByLiang StuffByLiang commented May 17, 2026

Copy link
Copy Markdown

Summary

Adds Config.async_eval: bool = False (default off — no behavior change). When enabled, run_evaluations_parallel schedules each evaluator as a fire-and-forget asyncio.Task instead of awaiting them inline. Eval metrics are logged asynchronously via ml_logger.log_metrics(metrics, step=i_batch_when_scheduled) when each evaluator finishes; pending tasks are awaited up to async_eval_drain_timeout_s (default 600s) at shutdown.

Motivation

In long-CoT RL training (e.g. max_tokens=24576 against hundreds of LiveCodeBench-style problems), a single eval pass can take minutes — comparable to a training step. With the current sync-await pattern at train.py:692, train.py:1225, train.py:1694, training stalls for the entire eval. Even though evaluators internally use asyncio.gather, the train loop awaits the whole gather before the next step.

set_rollout_executor(ProcessPoolExecutor(...)) is the cookbook's existing scale-out path, but it parallelizes within an eval rather than overlapping eval with training. This PR adds the missing overlap by making the schedule fire-and-forget.

I was running this pattern as an out-of-tree wrapper around RLTestSetEvaluator for our own training and it cleanly cut the eval-induced stall to zero, so figured it might be worth upstreaming as a built-in opt-in.

Design

Backward compat Off by default. Config.async_eval=False → existing behavior, byte-for-byte.
Wiring New _AsyncEvalDispatcher class + ContextVar (_async_eval_dispatcher). Pattern mirrors the existing _rollout_executor ContextVar in rl/rollouts.py.
Signature change None. run_evaluations_parallel consults get_async_eval_dispatcher() at the top of the function.
Metric logging When async, each eval's metrics are written via ml_logger.log_metrics(metrics, step=i_batch_when_scheduled) after the task completes. W&B accepts retroactive step writes.
Shutdown main() awaits dispatcher.drain(timeout=config.async_eval_drain_timeout_s) before ml_logger.close(). Set timeout to None for unbounded.

Tradeoffs (documented in the Config docstring)

  • Eval rollouts compete with training rollouts for per-account Tinker sampling concurrency. Heavy evals may slow training.
  • W&B charts will show retroactive step writes — well-supported but worth knowing when reading hover values.
  • Evaluator exceptions are swallowed with a warning rather than raised (a flaky eval shouldn't kill training).

Tests

Adds tinker_cookbook/rl/train_async_eval_test.py (5 tests, no API or network):

  • test_schedule_returns_sentinel_immediately — schedule returns without awaiting the underlying evaluator
  • test_drain_awaits_pending_tasks_and_logs_at_captured_step — multiple scheduled evals all log via ml_logger at their captured i_batch
  • test_run_evaluations_parallel_routes_through_dispatcher_when_installed — ContextVar-based dispatch works at the public API boundary
  • test_get_set_dispatcher_round_trip — ContextVar install/clear
  • test_eval_failure_is_swallowed_with_warning — exception in evaluator does not propagate

All 5 pass.

I also ran pytest tinker_cookbook/rl/ (full RL test suite): 108 passed, 3 skipped. One pre-existing failure unrelated to this change (rollout_logging_test::test_write_rollout_summaries_jsonl_handles_numpy_scalars — a deprecation that should have been removed in 0.4.0 per its own message; reproduces on main without this PR).

Test plan

  • New unit tests added and passing
  • Existing tinker_cookbook/rl/ tests still pass (one pre-existing unrelated failure on main)
  • Smoke test the cookbook-internal Config(async_eval=True) path against a real Tinker RL run (I've validated the same approach via an external wrapper; haven't yet run with the in-cookbook Config.async_eval=True flag wired end-to-end)

🤖 Generated with Claude Code

StuffByLiang and others added 2 commits May 16, 2026 20:05
Add `Config.async_eval: bool = False` (default off — no behavior change for
existing users). When enabled, `run_evaluations_parallel` schedules each
evaluator as a fire-and-forget `asyncio.Task` rather than awaiting them inline.
Evaluator metrics are logged asynchronously via `ml_logger` at the iteration
step at which the eval was scheduled. Pending tasks are awaited up to
`async_eval_drain_timeout_s` (default 600s) at shutdown.

This unblocks training loops where the eval pass is comparable in wall-clock
to a training step — e.g. long-CoT code-RL eval on hundreds of problems, where
synchronous eval can stall training for minutes at a time.

The plumbing uses a `ContextVar` (`_async_eval_dispatcher`) so the signature of
`run_evaluations_parallel` is unchanged. The pattern mirrors the existing
`_rollout_executor` ContextVar in `rl/rollouts.py`.

Tradeoffs called out in the docstring:
* Eval rollouts compete with training rollouts for Tinker's per-account
  sampling concurrency. Heavy evals may slow training.
* W&B receives retroactive step writes when async evals complete out of order.
* Pending evals are awaited at shutdown; set `async_eval_drain_timeout_s=None`
  for unbounded.

Includes 5 unit tests (`tinker_cookbook/rl/train_async_eval_test.py`):
* sentinel return value and non-blocking schedule
* drain() awaits and logs at the captured iteration step
* `run_evaluations_parallel` routes through dispatcher when installed
* ContextVar install/clear round-trip
* evaluator failures are swallowed (logged, not raised)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Remove redundant quoted forward refs (file has `from __future__ import
annotations`), drop unused `# noqa: BLE001`, and apply ruff-format.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant