Make RL eval rollout temperature configurable in RLTestSetEvaluator#724
Open
mvanhorn wants to merge 2 commits into
Open
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Add an optional
temperaturefield toRLTestSetEvaluator(it is a chz config so adding a field with a default is a single line) and thread it through both rollout paths. Default stays1.0so existing callers see no behavior change.In
tinker_cookbook/rl/metric_util.py, add atemperature: float = 1.0attribute to theRLTestSetEvaluatorchz config (near the existingmax_tokens/num_groups_to_logfields). Document the field with a docstring snippet that says: eval rollouts do not feed back into the training loss or the optional KL penalty, so this can differ from the train-time rollout temperature - set it to match your intended serving decoder.Pass
temperature=self.temperaturetodo_group_rollout_and_filter_constant_reward(...)in_eval_with_executor(line ~297) instead of the literal1.0.In the non-executor branch, add
temperaturetoTinkerTokenCompleter. InspectTinkerTokenCompleterintinker_cookbook/completers.pyto confirm the parameter name. If it acceptstemperaturedirectly, pass it. If it only acceptsSamplingParams, construct one with the explicit temperature and pass that.Surface the option in the
RLTrainConfigchz config (intinker_cookbook/rl/train.py) aseval_temperature: float = 1.0and wire it into theRLTestSetEvaluatorconstruction site(s). Mention in the docstring that this is separate from the rollouttemperature(which is constrained by the KL penalty discussion).In
tinker_cookbook/rl/metric_util_test.py, add a unit test that constructs anRLTestSetEvaluatorwithtemperature=0.5, runs against a stub sampling client that records the temperature it was called with, and asserts the recorded value is0.5. If the existing test file already has a stub sampling client, reuse it.PR title:
Make RL eval rollout temperature configurable in RLTestSetEvaluatorPR body (concise, factual):
Why this matters
RLTestSetEvaluatorintinker_cookbook/rl/metric_util.pyhardcodestemperature=1.0for evaluation rollouts in both the executor path (line 297) and the non-executor path (it constructsTinkerTokenCompleter(sampling_client, max_tokens=self.max_tokens)with no temperature override, so the policy samples at the client default).The issue points out that the train-time recommendation "T=1 is near-optimal; non-1 temperatures do not play well with KL penalty" is specifically about the IS / KL term in the training loss. Eval rollouts do not contribute to the training loss or the KL penalty - they just produce metrics. So train-time and eval-time temperature settings are independent.
In practice many users serve their fine-tuned models with
temperaturebelow 1 (or with deterministic decoding) and want eval-time rollouts to match their serving decoder. Today they have to either monkey-patch the evaluator or work around it.Testing
uv run pytest tinker_cookbook/rl/metric_util_test.pypasses - the new test asserts the stub sampling client receives the requested temperature.uv run pytest tinker_cookbook/rl/full suite passes - no other test depended on the hardcoded1.0.uv run pyright tinker_cookbook/rl/metric_util.py tinker_cookbook/rl/train.pypasses.uv run ruff check tinker_cookbook/rl/anduv run ruff format --check tinker_cookbook/rl/pass.RLTrainConfigwitheval_temperature=0.0and assert the resulting evaluator passestemperature=0.0through to both rollout paths (no need for a live API key; mocking the sampling client is sufficient).grep -n "temperature=" tinker_cookbook/rl/metric_util.pyshows the new field and the threaded value - no other literal1.0left behind in the eval rollout paths.Fixes #689
AI was used for assistance.