Feature request: Calibrated accuracy estimates for model-graded scorers #3236

@elandesberg

Problem

Inspect's model-graded scorers (model_graded_qa, model_graded_fact) are useful for scaling evaluation, but raw judge scores can be biased relative to oracle labels.

Today, there is no built-in "post-hoc calibration" workflow in Inspect that answers:

  1. What is the calibrated estimate (with uncertainty) after collecting a small oracle-labeled subset?
  2. How can users run this calibration directly from EvalLog outputs?
  3. How can users plan oracle labeling budget rather than choosing sample size ad hoc?

Proposal

Add a docs-first integration path for CJE (Causal Judge Evaluation) as an optional post-processing step.

cje-eval is already usable for this workflow (verified against current PyPI API):

  • from cje import analyze_dataset
  • analyze_dataset(..., fresh_draws_data=...)
  • Returns EstimationResult with estimates, standard_errors, method, and ci()

Minimal example from Inspect logs

from inspect_ai.log import read_eval_log
from cje import analyze_dataset

log = read_eval_log("logs/my_eval.eval")

# Inspect model-graded scores are typically C/P/I/N
grade_to_score = {"C": 1.0, "P": 0.5, "I": 0.0, "N": 0.0}

scorer_name = "model_graded_qa"  # or "model_graded_fact"
policy_name = "model_under_test"

# Oracle labels in [0, 1], keyed by prompt id (collected separately)
human_labels: dict[str, float] = {}
records = []
for i, sample in enumerate(log.samples or []):
    score_obj = (sample.scores or {}).get(scorer_name)
    if score_obj is None:
        continue

    prompt_id = str(sample.id if sample.id is not None else i)
    record = {
        "prompt_id": prompt_id,
        "judge_score": grade_to_score.get(str(score_obj.value), 0.0),
    }

    if prompt_id in human_labels:
        record["oracle_label"] = float(human_labels[prompt_id])

    records.append(record)

results = analyze_dataset(fresh_draws_data={policy_name: records})

print("Calibrated estimate:", float(results.estimates[0]))
print("SE:", float(results.standard_errors[0]))
print("95% CI:", results.ci()[0])
print("Method:", results.method)

Sample-size planning (CJE functionality)

CJE also exposes planning APIs that can answer "how many oracle labels are enough":

  • fit_variance_model(...) on pilot fresh-draw data
  • plan_evaluation(...) for optimal n (judge-scored) and m (oracle-labeled) allocation under a fixed budget
  • plan_for_mde(...) for minimum budget/allocation to hit a target MDE
  • simulate_planning(...) for quick what-if planning

This could be mentioned as an advanced follow-on in docs (after the basic calibration recipe).
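For intuition about the scale of the answer, a classic normal-approximation bound already says a lot: with label standard deviation sigma and target minimum detectable effect (MDE), m ≈ (z·sigma/MDE)² oracle labels are needed. The sketch below is this back-of-envelope baseline only, not CJE's variance-model-based planners, which refine it.

```python
import math

def naive_oracle_count(sigma: float, mde: float, z: float = 1.96) -> int:
    """Back-of-envelope oracle sample size: SE = sigma / sqrt(m),
    so m = (z * sigma / mde)**2 for a two-sided 95% interval."""
    return math.ceil((z * sigma / mde) ** 2)

# e.g. near-binary labels (sigma ~ 0.5) and a 5-point MDE
print(naive_oracle_count(0.5, 0.05))  # -> 385
```

CJE's planners should generally return smaller m than this naive bound, since calibration borrows strength from the cheap judge scores.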

Scope (suggested)

Start with documentation only:

  1. Add a "Judge score calibration" section to scorer/log-analysis docs.
  2. Show the log-to-CJE bridge pattern above.
  3. Note oracle-label requirements (numeric labels; the package requires at least 10 labels for cross-validation and recommends more).
  4. Add a short pointer to CJE planning utilities for sample-size/budget planning.
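The docs could also show a fail-fast check on the record list before calling analyze_dataset, so the too-few-labels case surfaces early. The helper below is illustrative only, not part of CJE or Inspect:

```python
def check_oracle_coverage(records: list[dict], min_labels: int = 10) -> int:
    """Count records carrying an 'oracle_label' and raise if below the
    minimum CJE needs for cross-validated calibration."""
    n_oracle = sum(1 for r in records if "oracle_label" in r)
    if n_oracle < min_labels:
        raise ValueError(
            f"Only {n_oracle} oracle-labeled records; need at least {min_labels}."
        )
    return n_oracle

records = [{"prompt_id": str(i), "judge_score": 1.0} for i in range(50)]
for r in records[:12]:
    r["oracle_label"] = 1.0
print(check_oracle_coverage(records))  # -> 12
```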

No core dependency or runtime coupling required in phase 1.

If maintainers find this useful, phase 2 could be a lightweight helper in inspect_ai.analysis that exports score records in CJE-ready format.

Why this helps

This gives users a practical path from model-graded scores to calibrated estimates with confidence intervals, plus an explicit way to plan oracle labeling effort.

Disclosure

I am a co-author of CJE.
