feat(rl): opt-in async eval scheduling #714
Draft
StuffByLiang wants to merge 2 commits into
Add `Config.async_eval: bool = False` (default off, so no behavior change for existing users). When enabled, `run_evaluations_parallel` schedules each evaluator as a fire-and-forget `asyncio.Task` rather than awaiting them inline. Evaluator metrics are logged asynchronously via `ml_logger` at the iteration step at which the eval was scheduled. Pending tasks are awaited up to `async_eval_drain_timeout_s` (default 600s) at shutdown.

This unblocks training loops where the eval pass is comparable in wall-clock time to a training step, e.g. long-CoT code-RL eval on hundreds of problems, where synchronous eval can stall training for minutes at a time.

The plumbing uses a `ContextVar` (`_async_eval_dispatcher`) so the signature of `run_evaluations_parallel` is unchanged. The pattern mirrors the existing `_rollout_executor` ContextVar in `rl/rollouts.py`.

Tradeoffs called out in the docstring:
* Eval rollouts compete with training rollouts for Tinker's per-account sampling concurrency. Heavy evals may slow training.
* W&B receives retroactive step writes when async evals complete out of order.
* Pending evals are awaited at shutdown; set `async_eval_drain_timeout_s=None` for unbounded.

Includes 5 unit tests (`tinker_cookbook/rl/train_async_eval_test.py`):
* sentinel return value and non-blocking schedule
* drain() awaits and logs at the captured iteration step
* `run_evaluations_parallel` routes through dispatcher when installed
* ContextVar install/clear round-trip
* evaluator failures are swallowed (logged, not raised)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Remove redundant quoted forward refs (file has `from __future__ import annotations`), drop unused `# noqa: BLE001`, and apply ruff-format. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds `Config.async_eval: bool = False` (default off, so no behavior change). When enabled, `run_evaluations_parallel` schedules each evaluator as a fire-and-forget `asyncio.Task` instead of awaiting them inline. Eval metrics are logged asynchronously via `ml_logger.log_metrics(metrics, step=i_batch_when_scheduled)` when each evaluator finishes; pending tasks are awaited up to `async_eval_drain_timeout_s` (default 600s) at shutdown.

Motivation
In long-CoT RL training (e.g. `max_tokens=24576` against hundreds of LiveCodeBench-style problems), a single eval pass can take minutes, comparable to a training step. With the current sync-await pattern at `train.py:692`, `train.py:1225`, `train.py:1694`, training stalls for the entire eval. Even though evaluators internally use `asyncio.gather`, the train loop awaits the whole gather before the next step.

`set_rollout_executor(ProcessPoolExecutor(...))` is the cookbook's existing scale-out path, but it parallelizes within an eval rather than overlapping eval with training. This PR adds the missing overlap by making the schedule fire-and-forget.

I was running this pattern as an out-of-tree wrapper around `RLTestSetEvaluator` for our own training and it cleanly cut the eval-induced stall to zero, so I figured it might be worth upstreaming as a built-in opt-in.

Design
* `Config.async_eval=False` → existing behavior, byte-for-byte.
* `_AsyncEvalDispatcher` class + `ContextVar` (`_async_eval_dispatcher`). Pattern mirrors the existing `_rollout_executor` ContextVar in `rl/rollouts.py`.
* `run_evaluations_parallel` consults `get_async_eval_dispatcher()` at the top of the function.
* `ml_logger.log_metrics(metrics, step=i_batch_when_scheduled)` after the task completes. W&B accepts retroactive step writes.
* `main()` awaits `dispatcher.drain(timeout=config.async_eval_drain_timeout_s)` before `ml_logger.close()`. Set timeout to `None` for unbounded.

Tradeoffs (documented in the Config docstring)
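One tradeoff, the retroactive step writes, follows directly from logging at the captured step: an eval scheduled early can finish after one scheduled later. The toy script below illustrates that ordering with plain `asyncio` tasks. It is a hedged sketch only; `fake_eval`, `run_and_log`, `demo`, and the in-memory `log` list are hypothetical and do not use the cookbook's or W&B's APIs.

```python
import asyncio


async def fake_eval(delay_s: float, score: float) -> dict[str, float]:
    """Hypothetical evaluator that takes `delay_s` seconds to finish."""
    await asyncio.sleep(delay_s)
    return {"eval/score": score}


async def run_and_log(
    step: int, delay_s: float, score: float, log: list[tuple[int, dict[str, float]]]
) -> None:
    metrics = await fake_eval(delay_s, score)
    # Logged under the step at which the eval was scheduled,
    # regardless of when it actually completes.
    log.append((step, metrics))


async def demo() -> list[tuple[int, dict[str, float]]]:
    log: list[tuple[int, dict[str, float]]] = []
    tasks = []
    # The eval scheduled at step 10 is slow; the one at step 20 is fast.
    for step, delay_s in [(10, 0.05), (20, 0.0)]:
        tasks.append(asyncio.create_task(run_and_log(step, delay_s, float(step), log)))
        # A real loop would run training steps here without awaiting the eval.
    # Bounded wait at shutdown, analogous to the drain timeout.
    await asyncio.wait(tasks, timeout=5.0)
    return log


if __name__ == "__main__":
    for step, metrics in asyncio.run(demo()):
        print(step, metrics)  # step 20 is logged before step 10
```

A logging backend that requires monotonically increasing steps would reject or reorder the second write, which is why the description notes that W&B accepts retroactive step writes.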
Tests
Adds `tinker_cookbook/rl/train_async_eval_test.py` (5 tests, no API or network):
* `test_schedule_returns_sentinel_immediately`: schedule returns without awaiting the underlying evaluator
* `test_drain_awaits_pending_tasks_and_logs_at_captured_step`: multiple scheduled evals all log via ml_logger at their captured i_batch
* `test_run_evaluations_parallel_routes_through_dispatcher_when_installed`: ContextVar-based dispatch works at the public API boundary
* `test_get_set_dispatcher_round_trip`: ContextVar install/clear
* `test_eval_failure_is_swallowed_with_warning`: exception in evaluator does not propagate

All 5 pass.
I also ran `pytest tinker_cookbook/rl/` (full RL test suite): 108 passed, 3 skipped. One pre-existing failure unrelated to this change (`rollout_logging_test::test_write_rollout_summaries_jsonl_handles_numpy_scalars`, a deprecation that should have been removed in 0.4.0 per its own message; reproduces on `main` without this PR).

Test plan
* `tinker_cookbook/rl/` tests still pass (one pre-existing unrelated failure on main)
* `Config(async_eval=True)` path against a real Tinker RL run (I've validated the same approach via an external wrapper; haven't yet run with the in-cookbook `Config.async_eval=True` flag wired end-to-end)

🤖 Generated with Claude Code