SFT: run evaluators on post-final-optim weights before final checkpoint by Butanium · Pull Request #736 · thinking-machines-lab/tinker-cookbook

Butanium · 2026-05-26T21:33:11Z

Summary

In tinker_cookbook/supervised/train.py, add an evals_final block inside the if did_train: branch that re-runs evaluators on the post-final-optim weights at step = total_steps, just before checkpoint_mgr.save_final_async(...).
Only fires when evaluators is non-empty and config.eval_every > 0 (i.e. when eval is enabled at all), so default behavior for runs without evaluators is unchanged.

Motivation

The in-loop eval check in the SFT loop is

if evaluators and config.eval_every > 0 and step % config.eval_every == 0:
    ...run_evals(evaluators, training_client, step)...

and runs before that iteration's forward_backward_async / optim_step_async, so it snapshots pre-step weights. Two consequences:

The last in-loop eval logs metrics at step = floor((total_steps - 1) / eval_every) * eval_every, on weights that have undergone that many optim steps.
The final checkpoint, saved by checkpoint_mgr.save_final_async immediately after the loop, captures weights that have undergone total_steps optim steps.

So even in the vanilla case (eval_every=10, total_steps=100), the last in-loop eval fires at step=90 and reports metrics for weights ~10 optim steps stale relative to the final checkpoint. In the worst case (when total_steps is one short of a multiple of eval_every), the in-loop eval can be eval_every - 1 steps stale. For users who treat the final-checkpoint metrics as the model's "final" evaluation, this is a silent mismatch between what got logged and what got saved.

This PR closes that gap by running the configured evaluators once more at step = total_steps, immediately before the final checkpoint save, matching the checkpoint's step label.

Notes / alternatives considered

Could be opt-out via a new final_eval: bool = True config flag. Happy to add one if reviewers prefer — the current design treats "if you have evals configured, you almost certainly want one at the genuinely-final weights" as the obvious default.
Could instead change the in-loop condition to evaluate post-step weights instead of pre-step, but that would shift the step labels of every existing run's metrics — a much more invasive change.

🤖 Drafted with Claude Code — Co-Authored-By: Claude Opus 4.7 noreply@anthropic.com

Butanium

pretty useful if you want to ensure the final checkpoint is evaled

The in-loop eval cadence in the supervised training loop stops at step total_steps-1 (loop `range` upper bound). For asymmetric eval cadences where the user wants an eval at the very end of training (e.g. eval fractions {1/6, 4/6, 1.0} of total steps), the 1.0 case would silently never fire — evaluators would never see the genuinely-final weights that get saved in the final checkpoint. Add an `evals_final` block inside `if did_train` that re-runs the evaluators once at `total_steps` before saving the final checkpoint, matching the checkpoint's label semantics.

Butanium · 2026-05-26T21:51:31Z

check failed because of hf api error in some unrelated test, reopening to retrigger test

Butanium force-pushed the supervised-final-eval branch from a035833 to 62979f6 Compare May 26, 2026 21:36

Butanium commented May 26, 2026

View reviewed changes

Butanium force-pushed the supervised-final-eval branch from 62979f6 to 7490b8e Compare May 26, 2026 21:43

Butanium marked this pull request as ready for review May 26, 2026 21:50

Butanium closed this May 26, 2026

Butanium reopened this May 26, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SFT: run evaluators on post-final-optim weights before final checkpoint#736

SFT: run evaluators on post-final-optim weights before final checkpoint#736
Butanium wants to merge 1 commit into
thinking-machines-lab:mainfrom
Butanium:supervised-final-eval

Butanium commented May 26, 2026 •

edited

Loading

Uh oh!

Butanium left a comment

Uh oh!

Butanium commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Butanium commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation

Notes / alternatives considered

Uh oh!

Butanium left a comment

Choose a reason for hiding this comment

Uh oh!

Butanium commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Butanium commented May 26, 2026 •

edited

Loading