
Rescoring an evaluation with a new metric #3310

Open

Mamiglia wants to merge 7 commits into UKGovernmentBEIS:main from Mamiglia:metric_rescore
Conversation

Contributor

@Mamiglia commented Feb 24, 2026

This PR contains:

  • New features

What is the current behavior? (You can also link to an open issue here)

When rescoring an existing eval log with inspect score, only the metrics stored in the original log are used. There is no way to specify new or different metrics at rescore time, so users cannot backfill new metrics (e.g. rate_of_no_answers) on previous evaluation runs without re-running the entire evaluation.

Fixes #3000

What is the new behavior?

This PR adds a repeatable --metric CLI option to inspect score that overrides the metrics used during rescoring. For example:

inspect score log.eval --metric accuracy --metric my_file.py@custom_metric

Metrics can be specified by registry name or by file@function reference, consistent with how --scorer works. The score() and score_async() Python APIs also accept a new metrics parameter for programmatic use.
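A minimal sketch of how a repeatable CLI option like --metric accumulates its values. This is not the PR's actual implementation (Inspect's CLI is built on a different framework); it uses stdlib argparse purely to illustrate the repeatable-option behaviour:

```python
# Hypothetical illustration (not the PR's code): a repeatable --metric
# option collected into a list, one entry per occurrence on the command line.
import argparse

parser = argparse.ArgumentParser(prog="inspect-score-sketch")
parser.add_argument(
    "--metric",
    action="append",   # each --metric appends to a list
    default=None,      # None when the option is never given (preserves old behaviour)
    help="Metric to apply when rescoring (registry name or file.py@function).",
)

args = parser.parse_args(
    ["--metric", "accuracy", "--metric", "my_file.py@custom_metric"]
)
# args.metric == ["accuracy", "my_file.py@custom_metric"]
```

Leaving the default as None (rather than an empty list) lets the scoring code distinguish "no override requested" from "override with no metrics".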

Does this PR introduce a breaking change?

The new metrics parameter is added as the third parameter of both score() and score_async() and defaults to None, preserving existing behaviour. However, code that relies on the third positional argument being epochs_reducer will now get the wrong behaviour (let me know if you would prefer a different approach here, and how). The CLI is purely additive.
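The positional-argument hazard can be shown with toy signatures (these stubs are illustrative only; the real score()/score_async() signatures differ):

```python
# Hypothetical stand-ins for the old and new signatures, to show how a
# positional third argument silently changes meaning.
def score_old(log, scorers, epochs_reducer=None):
    return ("reducer", epochs_reducer)

def score_new(log, scorers, metrics=None, epochs_reducer=None):
    return ("reducer", epochs_reducer, "metrics", metrics)

old = score_old("log.eval", ["match"], "mean")
# old == ("reducer", "mean"): third positional arg was the reducer

new = score_new("log.eval", ["match"], "mean")
# new == ("reducer", None, "metrics", "mean"): the same call now binds
# "mean" to metrics, and epochs_reducer is silently left as None

safe = score_new("log.eval", ["match"], epochs_reducer="mean")
# safe == ("reducer", "mean", "metrics", None): keyword calls are unaffected
```

Callers who pass epochs_reducer by keyword see no change; only purely positional calls are at risk.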

Other information:

Files changed:

  • score.py: adds --metric CLI option and resolve_metrics() helper
  • loader.py: adds metric_from_spec() to load metrics from registry or file (mirrors existing scorer_from_spec pattern)
  • score.py: threads metrics parameter through score() and score_async(); when provided, overrides metrics_from_log_header()
  • __init__.py: exports metric_create
  • test_score.py: adds unscored-metric test case verifying --metric accuracy produces exactly 1 metric

I want to disclose that I was assisted by LLMs for writing this code.



Development

Successfully merging this pull request may close these issues.

Anyway of rescoring an evaluation with a new metric?

1 participant