Make RL eval rollout temperature configurable in RLTestSetEvaluator #724

Open
mvanhorn wants to merge 2 commits into thinking-machines-lab:main from mvanhorn:fix/689-tinker-cookbook-eval-temperature-config

Conversation

@mvanhorn

Summary

Add an optional temperature field to RLTestSetEvaluator (it is a chz config so adding a field with a default is a single line) and thread it through both rollout paths. Default stays 1.0 so existing callers see no behavior change.

  1. In tinker_cookbook/rl/metric_util.py, add a temperature: float = 1.0 attribute to the RLTestSetEvaluator chz config (near the existing max_tokens / num_groups_to_log fields). Document the field with a docstring that says: eval rollouts do not feed back into the training loss or the optional KL penalty, so this value can differ from the train-time rollout temperature; set it to match your intended serving decoder.

  2. Pass temperature=self.temperature to do_group_rollout_and_filter_constant_reward(...) in _eval_with_executor (line ~297) instead of the literal 1.0.

  3. In the non-executor branch, add temperature to TinkerTokenCompleter. Inspect TinkerTokenCompleter in tinker_cookbook/completers.py to confirm the parameter name. If it accepts temperature directly, pass it. If it only accepts SamplingParams, construct one with the explicit temperature and pass that.

  4. Surface the option in the RLTrainConfig chz config (in tinker_cookbook/rl/train.py) as eval_temperature: float = 1.0 and wire it into the RLTestSetEvaluator construction site(s). Mention in the docstring that this is separate from the rollout temperature (which is constrained by the KL penalty discussion).

  5. In tinker_cookbook/rl/metric_util_test.py, add a unit test that constructs an RLTestSetEvaluator with temperature=0.5, runs against a stub sampling client that records the temperature it was called with, and asserts the recorded value is 0.5. If the existing test file already has a stub sampling client, reuse it.

  6. PR title: Make RL eval rollout temperature configurable in RLTestSetEvaluator

    PR body (concise, factual):

    Closes #689 ("Question: temperature != 1.0, KL penalty, and eval temperature in RL cookbook").

    RLTestSetEvaluator currently hardcodes temperature=1.0 for eval rollouts. The train-time recommendation to keep temperature=1.0 is specifically about the importance-sampling (IS) and KL penalty terms in the training loss; eval rollouts do not feed back into the loss, so users who serve at a lower temperature should be able to evaluate at that same temperature.

    Adds an eval_temperature knob (default 1.0, so no behavior change for existing configs) threaded through RLTrainConfig to RLTestSetEvaluator and both the executor and non-executor rollout paths.
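
For readers who want the shape of the change without opening the diff, here is a minimal sketch of the threading described in steps 1 through 4. It uses plain dataclasses as a stand-in for the chz configs, and every name in it (EvaluatorSketch, TrainConfigSketch, fake_group_rollout, FakeTokenCompleter) is hypothetical; none of them is the cookbook's real class or signature.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeTokenCompleter:
    """Hypothetical stand-in for a completer that accepts a temperature."""

    max_tokens: int
    temperature: float = 1.0


def fake_group_rollout(*, temperature: float) -> dict:
    """Hypothetical stand-in for the executor rollout helper."""
    return {"temperature": temperature}


@dataclass(frozen=True)
class EvaluatorSketch:
    """Illustrative evaluator config; not the real RLTestSetEvaluator."""

    max_tokens: int = 256
    # Eval rollouts do not feed the training loss, so this value may
    # differ from the train-time rollout temperature.
    temperature: float = 1.0

    def eval_with_executor(self) -> dict:
        # Executor path: forward the configured value, not a literal 1.0.
        return fake_group_rollout(temperature=self.temperature)

    def eval_without_executor(self) -> FakeTokenCompleter:
        # Non-executor path: pass the same value to the completer.
        return FakeTokenCompleter(
            max_tokens=self.max_tokens, temperature=self.temperature
        )


@dataclass(frozen=True)
class TrainConfigSketch:
    """Illustrative train config exposing the eval-only knob."""

    eval_temperature: float = 1.0

    def build_evaluator(self) -> EvaluatorSketch:
        return EvaluatorSketch(temperature=self.eval_temperature)
```

The default of 1.0 on both configs is what keeps existing callers unchanged in this sketch: only a caller that sets eval_temperature explicitly sees a different value on either path.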

Why this matters

RLTestSetEvaluator in tinker_cookbook/rl/metric_util.py leaves no way to set the temperature of evaluation rollouts. The executor path passes a literal temperature=1.0 (line 297), and the non-executor path constructs TinkerTokenCompleter(sampling_client, max_tokens=self.max_tokens) with no temperature override, so the policy samples at the client default.

The issue points out that the train-time recommendation "T=1 is near-optimal; non-1 temperatures do not play well with KL penalty" is specifically about the IS / KL term in the training loss. Eval rollouts do not contribute to the training loss or the KL penalty; they only produce metrics. So the train-time and eval-time temperature settings are independent.
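
As a rough illustration of why temperature interacts with loss terms but not with metrics, the sketch below applies standard temperature scaling to a made-up logit vector and measures how far the T=0.5 sampling distribution sits from the T=1 distribution. The logits and the helper name softmax_with_temperature are assumptions for illustration only; this is not code from the cookbook, and it is only one plausible reading of the mismatch the issue describes.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up values, not model output

p_model = softmax_with_temperature(logits, 1.0)  # distribution at T=1
p_sharp = softmax_with_temperature(logits, 0.5)  # sampling distribution at T=0.5

# KL(p_sharp || p_model): positive whenever the two distributions differ.
kl = sum(q * math.log(q / p) for q, p in zip(p_sharp, p_model))
```

A loss term that compares sampler probabilities against the policy's T=1 probabilities would see this gap; an evaluator that only scores the sampled outputs does not use those probabilities at all.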

In practice many users serve their fine-tuned models with temperature below 1 (or with deterministic decoding) and want eval-time rollouts to match their serving decoder. Today they have to either monkey-patch the evaluator or work around it.

Testing

  • uv run pytest tinker_cookbook/rl/metric_util_test.py passes; the new test asserts that the stub sampling client receives the requested temperature.
  • uv run pytest tinker_cookbook/rl/ (the full suite) passes; no other test depended on the hardcoded 1.0.
  • uv run pyright tinker_cookbook/rl/metric_util.py tinker_cookbook/rl/train.py passes.
  • uv run ruff check tinker_cookbook/rl/ and uv run ruff format --check tinker_cookbook/rl/ pass.
  • Manual: construct an RLTrainConfig with eval_temperature=0.0 and assert that the resulting evaluator passes temperature=0.0 through to both rollout paths (no live API key is needed; mocking the sampling client is sufficient).
  • grep -n "temperature=" tinker_cookbook/rl/metric_util.py shows the new field and the threaded value, with no other literal 1.0 left behind in the eval rollout paths.
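
The mocked check mentioned above could look roughly like the following. This is a sketch only: run_eval_rollout, the sample method, and its keyword arguments are hypothetical placeholders, not the Tinker sampling client API or the test added in this PR.

```python
from unittest.mock import Mock


def run_eval_rollout(sampling_client, *, max_tokens, temperature):
    """Hypothetical eval path that forwards the temperature to the client."""
    return sampling_client.sample(max_tokens=max_tokens, temperature=temperature)


def test_eval_temperature_is_forwarded():
    # A Mock records every call made on it, so no live API key is needed.
    stub_client = Mock()

    run_eval_rollout(stub_client, max_tokens=16, temperature=0.0)

    # Fails if the eval path substituted a literal 1.0 or dropped the kwarg.
    stub_client.sample.assert_called_once_with(max_tokens=16, temperature=0.0)
```

Asserting on the recorded call rather than on rollout output keeps the test deterministic at any temperature.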

Fixes #689

AI was used for assistance.
