Rescoring an evaluation with a new metric #3310
Open
Mamiglia wants to merge 7 commits into UKGovernmentBEIS:main from
This PR contains:
What is the current behavior? (You can also link to an open issue here)
When rescoring an existing eval log with `inspect score`, only the metrics stored in the original log are used. There is no way to specify new or different metrics at rescore time, so users cannot backfill new metrics (e.g. `rate_of_no_answers`) on previous evaluation runs without re-running the entire evaluation.

Fixes #3000
What is the new behavior?
A new `--metric` CLI option (repeatable) is added to `inspect score` that allows overriding the metrics used during rescoring. Metrics can be specified by registry name or by `file@function` reference, consistent with how `--scorer` works; an example follows below. The `score()` and `score_async()` Python APIs also accept a new `metrics` parameter for programmatic use.
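For example, a rescore from the CLI might look like `inspect score ./logs/example.eval --metric accuracy --metric stderr` (the log path and metric choices are illustrative). A minimal sketch of the programmatic equivalent, assuming the `metrics` parameter behaves as described in this PR:

```python
# Sketch of rescoring with overriding metrics via the proposed
# `metrics` parameter. The log path and scorer choice are illustrative.
from inspect_ai import score
from inspect_ai.log import read_eval_log, write_eval_log
from inspect_ai.scorer import accuracy, match, stderr

log = read_eval_log("./logs/example.eval")

# Rescore with `accuracy` and `stderr` instead of whatever metrics are
# stored in the log header. `metrics` is passed by keyword since it is
# a new (third) parameter in this PR.
rescored = score(log, match(), metrics=[accuracy(), stderr()])
write_eval_log(rescored, "./logs/example-rescored.eval")
```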
Does this PR introduce a breaking change?

The `metrics` parameter is added as the third parameter in both `score()` and `score_async()` and defaults to `None`, preserving existing behavior. If someone's code relies on the third positional parameter being `epochs_reducer`, they will get the wrong behavior (let me know if you prefer a modification here, and how). The CLI is purely additive.
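A hypothetical illustration of that hazard (call shapes assumed from the description above; the log path, scorer, and reducer are placeholders):

```python
# Hypothetical illustration of the positional-argument hazard; the
# log path, scorer, and reducer choices are placeholders.
from inspect_ai import score
from inspect_ai.log import read_eval_log
from inspect_ai.scorer import match, mean_score

log = read_eval_log("./logs/example.eval")
some_reducer = mean_score()

# Pre-PR, the third positional argument was epochs_reducer, so an old
# positional call would now silently bind the reducer to `metrics`:
# rescored = score(log, match(), some_reducer)  # wrong after this PR

# Passing epochs_reducer by keyword sidesteps the ambiguity:
rescored = score(log, match(), epochs_reducer=some_reducer)
```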
Other information:

Files changed:

- `score.py`: adds the `--metric` CLI option and a `resolve_metrics()` helper
- `loader.py`: adds `metric_from_spec()` to load metrics from registry or file (mirrors the existing `scorer_from_spec` pattern; a hypothetical sketch of this pattern appears at the end of this description)
- `score.py`: threads the `metrics` parameter through `score()` and `score_async()`; when provided, it overrides `metrics_from_log_header()`
- `__init__.py`: exports `metric_create`
- `test_score.py`: adds an `unscored-metric` test case verifying that `--metric accuracy` produces exactly 1 metric

I want to disclose that I was assisted by LLMs in writing this code.
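Finally, as referenced in the `loader.py` item above, a minimal hypothetical sketch of the `file@function` resolution pattern. This is illustrative only and does not reproduce the PR's `loader.py` code; the plain dict stands in for inspect's actual metric registry:

```python
# Hypothetical sketch (not the PR's loader.py code) of resolving a
# metric spec that is either a registry name or a "file@function"
# reference, mirroring the --scorer convention described above.
import importlib.util
from pathlib import Path
from typing import Any, Callable


def metric_from_spec_sketch(spec: str, registry: dict[str, Callable[..., Any]]) -> Any:
    if "@" in spec:
        # "path/to/metrics.py@my_metric": load the module from the
        # file, then call the named factory to create the metric.
        file, _, func = spec.partition("@")
        mod_spec = importlib.util.spec_from_file_location(Path(file).stem, file)
        assert mod_spec is not None and mod_spec.loader is not None
        module = importlib.util.module_from_spec(mod_spec)
        mod_spec.loader.exec_module(module)
        return getattr(module, func)()
    # Otherwise treat the spec as a registry name (e.g. "accuracy");
    # the dict here stands in for inspect's actual registry lookup.
    return registry[spec]()
```

Usage would then look like `metric_from_spec_sketch("accuracy", {"accuracy": accuracy})` for a registry name, or `metric_from_spec_sketch("metrics.py@rate_of_no_answers", {})` for a file reference.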