Skip to content

SFT: run evaluators on post-final-optim weights before final checkpoint#736

Open
Butanium wants to merge 1 commit into
thinking-machines-lab:mainfrom
Butanium:supervised-final-eval
Open

SFT: run evaluators on post-final-optim weights before final checkpoint#736
Butanium wants to merge 1 commit into
thinking-machines-lab:mainfrom
Butanium:supervised-final-eval

Conversation

@Butanium

@Butanium Butanium commented May 26, 2026

Copy link
Copy Markdown

Summary

  • In tinker_cookbook/supervised/train.py, add an evals_final block inside the if did_train: branch that re-runs evaluators on the post-final-optim weights at step = total_steps, just before checkpoint_mgr.save_final_async(...).
  • Only fires when evaluators is non-empty and config.eval_every > 0 (i.e. when eval is enabled at all), so default behavior for runs without evaluators is unchanged.

Motivation

The in-loop eval check in the SFT loop is

if evaluators and config.eval_every > 0 and step % config.eval_every == 0:
    ...run_evals(evaluators, training_client, step)...

and runs before that iteration's forward_backward_async / optim_step_async, so it snapshots pre-step weights. Two consequences:

  1. The last in-loop eval logs metrics at step = floor((total_steps - 1) / eval_every) * eval_every, on weights that have undergone that many optim steps.
  2. The final checkpoint, saved by checkpoint_mgr.save_final_async immediately after the loop, captures weights that have undergone total_steps optim steps.

So even in the vanilla case (eval_every=10, total_steps=100), the last in-loop eval fires at step=90 and reports metrics for weights ~10 optim steps stale relative to the final checkpoint. In the worst case (when total_steps is one short of a multiple of eval_every), the in-loop eval can be eval_every - 1 steps stale. For users who treat the final-checkpoint metrics as the model's "final" evaluation, this is a silent mismatch between what got logged and what got saved.

This PR closes that gap by running the configured evaluators once more at step = total_steps, immediately before the final checkpoint save, matching the checkpoint's step label.

Notes / alternatives considered

  • Could be opt-out via a new final_eval: bool = True config flag. Happy to add one if reviewers prefer — the current design treats "if you have evals configured, you almost certainly want one at the genuinely-final weights" as the obvious default.
  • Could instead change the in-loop condition to evaluate post-step weights instead of pre-step, but that would shift the step labels of every existing run's metrics — a much more invasive change.

🤖 Drafted with Claude Code — Co-Authored-By: Claude Opus 4.7 noreply@anthropic.com

@Butanium Butanium force-pushed the supervised-final-eval branch from a035833 to 62979f6 Compare May 26, 2026 21:36

@Butanium Butanium left a comment

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pretty useful if you want to ensure the final checkpoint is evaled

The in-loop eval cadence in the supervised training loop stops at step
total_steps-1 (loop `range` upper bound). For asymmetric eval cadences
where the user wants an eval at the very end of training (e.g. eval
fractions {1/6, 4/6, 1.0} of total steps), the 1.0 case would silently
never fire — evaluators would never see the genuinely-final weights that
get saved in the final checkpoint.

Add an `evals_final` block inside `if did_train` that re-runs the
evaluators once at `total_steps` before saving the final checkpoint,
matching the checkpoint's label semantics.
@Butanium Butanium force-pushed the supervised-final-eval branch from 62979f6 to 7490b8e Compare May 26, 2026 21:43
@Butanium Butanium marked this pull request as ready for review May 26, 2026 21:50
@Butanium

Copy link
Copy Markdown
Author

check failed because of hf api error in some unrelated test, reopening to retrigger test

@Butanium Butanium closed this May 26, 2026
@Butanium Butanium reopened this May 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant