Make RL eval rollout temperature configurable in RLTestSetEvaluator by mvanhorn · Pull Request #724 · thinking-machines-lab/tinker-cookbook

mvanhorn · 2026-05-19T22:49:05Z

Summary

Add an optional temperature field to RLTestSetEvaluator and thread it through both rollout paths. Because RLTestSetEvaluator is a chz config, adding a field with a default is a one-line change. The default stays 1.0, so existing callers see no behavior change.

In tinker_cookbook/rl/metric_util.py, add a temperature: float = 1.0 attribute to the RLTestSetEvaluator chz config (next to the existing max_tokens / num_groups_to_log fields). Document the field with a docstring stating that eval rollouts do not feed back into the training loss or the optional KL penalty, so this value can differ from the train-time rollout temperature; set it to match your intended serving decoder.
Pass temperature=self.temperature to do_group_rollout_and_filter_constant_reward(...) in _eval_with_executor (line ~297) instead of the literal 1.0.
In the non-executor branch, pass the temperature to TinkerTokenCompleter. Inspect TinkerTokenCompleter in tinker_cookbook/completers.py to confirm the parameter name: if it accepts temperature directly, pass it; if it only accepts SamplingParams, construct one with the explicit temperature and pass that.
Surface the option in the RLTrainConfig chz config (in tinker_cookbook/rl/train.py) as eval_temperature: float = 1.0 and wire it into the RLTestSetEvaluator construction site(s). Note in the docstring that this is separate from the training rollout temperature, which the KL penalty discussion constrains.
In tinker_cookbook/rl/metric_util_test.py, add a unit test that constructs an RLTestSetEvaluator with temperature=0.5, runs against a stub sampling client that records the temperature it was called with, and asserts the recorded value is 0.5. If the existing test file already has a stub sampling client, reuse it.
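A minimal, self-contained sketch of the threading described above. It deliberately imports neither chz nor tinker: plain dataclasses stand in for the chz config, and FakeSamplingParams, FakeTokenCompleter, rollout_group, EvaluatorSketch, and RecordingSamplingClient are hypothetical stand-ins, not the real tinker_cookbook or tinker APIs. It shows the pattern only: one temperature field with a 1.0 default, read by both rollout paths, plus a recording stub that lets a test assert the value actually arrived. The completer here builds a SamplingParams-like object internally, which covers the "only accepts SamplingParams" case from the step above.

```python
from dataclasses import dataclass, field


@dataclass
class FakeSamplingParams:
    # Hypothetical stand-in for tinker's SamplingParams.
    max_tokens: int
    temperature: float = 1.0


@dataclass
class RecordingSamplingClient:
    # Test stub: records the sampling params of every call it receives.
    calls: list[FakeSamplingParams] = field(default_factory=list)

    def sample(self, prompt: str, params: FakeSamplingParams) -> str:
        self.calls.append(params)
        return "stub completion"


@dataclass
class FakeTokenCompleter:
    # Hypothetical completer; the real TinkerTokenCompleter signature must be
    # checked in tinker_cookbook/completers.py.
    sampling_client: RecordingSamplingClient
    max_tokens: int
    temperature: float = 1.0

    def complete(self, prompt: str) -> str:
        params = FakeSamplingParams(max_tokens=self.max_tokens, temperature=self.temperature)
        return self.sampling_client.sample(prompt, params)


def rollout_group(
    client: RecordingSamplingClient, prompts: list[str], max_tokens: int, temperature: float
) -> list[str]:
    # Hypothetical executor-path helper: receives temperature as an argument
    # instead of a literal 1.0.
    params = FakeSamplingParams(max_tokens=max_tokens, temperature=temperature)
    return [client.sample(p, params) for p in prompts]


@dataclass
class EvaluatorSketch:
    # Stand-in for the RLTestSetEvaluator config fields.
    max_tokens: int = 64
    # Eval rollouts do not feed back into the training loss or the KL penalty,
    # so this can differ from the train-time rollout temperature.
    temperature: float = 1.0

    def eval_with_executor(self, client: RecordingSamplingClient, prompts: list[str]) -> list[str]:
        return rollout_group(client, prompts, self.max_tokens, temperature=self.temperature)

    def eval_without_executor(self, client: RecordingSamplingClient, prompts: list[str]) -> list[str]:
        completer = FakeTokenCompleter(client, max_tokens=self.max_tokens, temperature=self.temperature)
        return [completer.complete(p) for p in prompts]


# Usage, in the spirit of the proposed unit test: both paths see 0.5.
client = RecordingSamplingClient()
evaluator = EvaluatorSketch(temperature=0.5)
evaluator.eval_with_executor(client, ["q1"])
evaluator.eval_without_executor(client, ["q2"])
assert [c.temperature for c in client.calls] == [0.5, 0.5]
```

Reading one field in both branches, rather than a literal in each, keeps the default in a single place so the executor and non-executor paths cannot drift apart.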

Closes #689 (Question: temperature != 1.0, KL penalty, and eval temperature in RL cookbook).

RLTestSetEvaluator currently hardcodes temperature=1.0 for eval rollouts. The train-time recommendation to keep temperature=1.0 is specifically about the importance-sampling (IS) and KL penalty terms in the training loss. Eval rollouts do not feed back into the loss, so users who serve at a lower temperature should be able to evaluate at that same temperature.

Adds an eval_temperature knob (default 1.0, so no behavior change for existing configs) threaded through RLTrainConfig to RLTestSetEvaluator and both the executor and non-executor rollout paths.
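A sketch of the wiring at the train-config level, again with plain dataclasses standing in for chz. TrainConfigSketch, EvaluatorConfigSketch, build_test_set_evaluator, and the train_rollout_temperature field are hypothetical names for illustration; the real construction site lives in tinker_cookbook/rl/train.py and may look different. The point is that the evaluator receives eval_temperature, never the training temperature, and that the 1.0 default leaves existing configs unchanged.

```python
from dataclasses import dataclass


@dataclass
class EvaluatorConfigSketch:
    # Stand-in for RLTestSetEvaluator's config fields.
    max_tokens: int
    temperature: float = 1.0


@dataclass
class TrainConfigSketch:
    # Stand-in for RLTrainConfig; only the fields relevant here.
    max_tokens: int = 256
    # Hypothetical train-time rollout temperature (kept at 1.0 for the IS / KL terms).
    train_rollout_temperature: float = 1.0
    # Eval-only decoding temperature, independent of the field above.
    eval_temperature: float = 1.0


def build_test_set_evaluator(cfg: TrainConfigSketch) -> EvaluatorConfigSketch:
    # Hypothetical construction site: forward eval_temperature, not the
    # training temperature, into the evaluator.
    return EvaluatorConfigSketch(max_tokens=cfg.max_tokens, temperature=cfg.eval_temperature)


default_eval = build_test_set_evaluator(TrainConfigSketch())
assert default_eval.temperature == 1.0  # unchanged behavior for existing configs

served_eval = build_test_set_evaluator(TrainConfigSketch(eval_temperature=0.7))
assert served_eval.temperature == 0.7
```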

Why this matters

RLTestSetEvaluator in tinker_cookbook/rl/metric_util.py hardcodes the eval rollout temperature in both paths: the executor path (line 297) passes a literal temperature=1.0, and the non-executor path constructs TinkerTokenCompleter(sampling_client, max_tokens=self.max_tokens) with no temperature override, so the policy samples at the client default.

The issue points out that the train-time recommendation "T=1 is near-optimal; non-1 temperatures do not play well with KL penalty" is specifically about the IS / KL term in the training loss. Eval rollouts do not contribute to the training loss or the KL penalty; they only produce metrics. Train-time and eval-time temperature settings are therefore independent.
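One way to see why the train-time caveat does not carry over to eval (an illustration, not something the issue itself spells out): sampling at temperature T draws tokens from softmax(logits / T), which coincides with the policy's own distribution only at T = 1. Training terms that score sampled tokens with the policy's log-probabilities, such as an IS ratio or a KL estimate, would then presumably refer to a different distribution than the one the tokens came from. Eval rollouts are only scored by the environment, so this mismatch does not arise there. temperature_probs and the logits below are made up for illustration.

```python
import math


def temperature_probs(logits: list[float], temperature: float) -> list[float]:
    # Sampling distribution p_T(i) proportional to exp(logit_i / T).
    # T = 1 recovers the policy's own softmax; T < 1 sharpens it.
    if temperature <= 0:
        raise ValueError("temperature must be positive; handle T = 0 (greedy) separately")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
policy = temperature_probs(logits, 1.0)   # what the policy's log-probs describe
sharper = temperature_probs(logits, 0.5)  # what a T = 0.5 sampler actually draws from

print([round(p, 3) for p in policy])
print([round(p, 3) for p in sharper])
```

At T = 0.5 the most likely token gains probability mass at the expense of the others, which is the behavior users serving at low temperature want their eval metrics to reflect.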

In practice many users serve their fine-tuned models with temperature below 1 (or with deterministic decoding) and want eval-time rollouts to match their serving decoder. Today they must monkey-patch the evaluator or otherwise work around the hardcoded value.

Testing

uv run pytest tinker_cookbook/rl/metric_util_test.py passes; the new test asserts that the stub sampling client receives the requested temperature.
uv run pytest tinker_cookbook/rl/ (full suite) passes; no other test depended on the hardcoded 1.0.
uv run pyright tinker_cookbook/rl/metric_util.py tinker_cookbook/rl/train.py passes.
uv run ruff check tinker_cookbook/rl/ and uv run ruff format --check tinker_cookbook/rl/ pass.
Manual: construct an RLTrainConfig with eval_temperature=0.0 and assert that the resulting evaluator passes temperature=0.0 through to both rollout paths (no live API key needed; mocking the sampling client is sufficient).
grep -n "temperature=" tinker_cookbook/rl/metric_util.py shows the new field and the threaded value; no literal 1.0 is left behind in the eval rollout paths.
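The manual check with eval_temperature=0.0 is a useful edge case because 0.0 is falsy in Python: a pattern such as temperature or 1.0 anywhere along the path would silently turn it back into 1.0. A sketch of that check with unittest.mock, where run_eval_rollouts, rollout_fn, and completer_cls are hypothetical stand-ins for the evaluator body, the executor rollout function, and the completer class:

```python
from unittest.mock import MagicMock


def run_eval_rollouts(rollout_fn, completer_cls, client, temperature: float = 1.0) -> None:
    # Hypothetical evaluator body exercising both rollout paths.
    # Pass temperature through unchanged: `temperature or 1.0` would
    # silently replace 0.0 with 1.0.
    rollout_fn(client, temperature=temperature)
    completer_cls(client, temperature=temperature)


rollout_fn = MagicMock()
completer_cls = MagicMock()
client = MagicMock()  # no live API key needed

run_eval_rollouts(rollout_fn, completer_cls, client, temperature=0.0)

# Both paths must see exactly 0.0, not a fallback default.
rollout_fn.assert_called_once_with(client, temperature=0.0)
completer_cls.assert_called_once_with(client, temperature=0.0)
```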

Fixes #689

AI was used for assistance.

92464f5 Make RL eval rollout temperature configurable in RLTestSetEvaluator

Fixes thinking-machines-lab#689

9b07fb9 fix(test): cast _RecordingSamplingClient to tinker.SamplingClient for pyright

mvanhorn mentioned this pull request May 19, 2026: Question: temperature != 1.0, KL penalty, and eval temperature in RL cookbook #689 (Open)

mvanhorn wants to merge 2 commits into thinking-machines-lab:main from mvanhorn:fix/689-tinker-cookbook-eval-temperature-config
