Problem
Inspect's model-graded scorers (`model_graded_qa`, `model_graded_fact`) are useful for scaling evaluation, but raw judge scores can be biased relative to oracle labels.
Today, there is no built-in "post-hoc calibration" workflow in Inspect that answers:
- What is the calibrated estimate (with uncertainty) after collecting a small oracle-labeled subset?
- How can users run this calibration directly from `EvalLog` outputs?
- How can users plan oracle labeling budget rather than choosing sample size ad hoc?
Proposal
Add a docs-first integration path for CJE (Causal Judge Evaluation) as an optional post-processing step.
cje-eval is already usable for this workflow (verified against current PyPI API):
- `from cje import analyze_dataset`
- `analyze_dataset(..., fresh_draws_data=...)`
- Returns `EstimationResult` with `estimates`, `standard_errors`, `method`, and `ci()`
Minimal example from Inspect logs
```python
from inspect_ai.log import read_eval_log
from cje import analyze_dataset

log = read_eval_log("logs/my_eval.eval")

# Inspect model-graded scores are typically C/P/I/N
grade_to_score = {"C": 1.0, "P": 0.5, "I": 0.0, "N": 0.0}
scorer_name = "model_graded_qa"  # or "model_graded_fact"
policy_name = "model_under_test"

# human_labels: dict[prompt_id, float] with labels in [0, 1]
records = []
for i, sample in enumerate(log.samples):
    score_obj = sample.scores.get(scorer_name)
    if score_obj is None:
        continue
    prompt_id = str(sample.id if sample.id is not None else i)
    record = {
        "prompt_id": prompt_id,
        "judge_score": grade_to_score.get(str(score_obj.value), 0.0),
    }
    if prompt_id in human_labels:
        record["oracle_label"] = float(human_labels[prompt_id])
    records.append(record)

results = analyze_dataset(fresh_draws_data={policy_name: records})

print("Calibrated estimate:", float(results.estimates[0]))
print("SE:", float(results.standard_errors[0]))
print("95% CI:", results.ci()[0])
print("Method:", results.method)
```

Sample-size planning (CJE functionality)
CJE also exposes planning APIs that can answer "how many oracle labels are enough":
- `fit_variance_model(...)` on pilot fresh-draw data
- `plan_evaluation(...)` for optimal `n` (judge-scored) and `m` (oracle-labeled) allocation under a fixed budget
- `plan_for_mde(...)` for minimum budget/allocation to hit a target MDE
- `simulate_planning(...)` for quick what-if planning
This could be mentioned as an advanced follow-on in docs (after the basic calibration recipe).
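For intuition only, the kind of allocation such planning computes can be sketched without CJE. The function below is a hypothetical stand-in, not CJE's API: it assumes a simple two-term variance model (`var_judge / n + var_resid / m`) and applies the closed-form Neyman-style split of a fixed budget between judge-scored and oracle-labeled samples.

```python
import math

def optimal_allocation(budget, cost_judge, cost_oracle, var_judge, var_resid):
    """Split a labeling budget between judge-scored samples (n) and
    oracle-labeled samples (m), minimizing var_judge / n + var_resid / m
    subject to cost_judge * n + cost_oracle * m = budget.

    Closed form: each sample size is proportional to
    sqrt(variance contribution / unit cost).
    """
    w_n = math.sqrt(var_judge / cost_judge)
    w_m = math.sqrt(var_resid / cost_oracle)
    scale = budget / (cost_judge * w_n + cost_oracle * w_m)
    return w_n * scale, w_m * scale

# Example: cheap judge scores, expensive oracle labels.
n, m = optimal_allocation(
    budget=1000.0, cost_judge=0.1, cost_oracle=10.0,
    var_judge=0.05, var_resid=0.02,
)
```

Because oracle labels cost far more per unit here, the optimum buys many judge scores and comparatively few oracle labels; CJE's `plan_evaluation(...)` addresses the same trade-off with a fitted variance model rather than these illustrative constants.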
Scope (suggested)
Start with documentation only:
- Add a "Judge score calibration" section to scorer/log-analysis docs.
- Show the log-to-CJE bridge pattern above.
- Note oracle-label requirements (numeric labels in [0, 1]; the package requires at least 10 oracle labels for cross-validation and recommends more).
- Add a short pointer to CJE planning utilities for sample-size/budget planning.
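The oracle-label minimum noted above is easy to check before calling `analyze_dataset`. A minimal sketch (the function name and the 10-label threshold constant are illustrative, not part of CJE), operating on records shaped like the bridge example:

```python
MIN_ORACLE_LABELS = 10  # minimum noted for CJE's cross-validation

def check_oracle_coverage(records, minimum=MIN_ORACLE_LABELS):
    """Count records carrying an oracle_label and fail fast if coverage
    is below the minimum.

    `records` follows the bridge-example shape: dicts with "prompt_id",
    "judge_score", and an optional "oracle_label".
    """
    labeled = sum(1 for r in records if "oracle_label" in r)
    if labeled < minimum:
        raise ValueError(
            f"Only {labeled} oracle-labeled records; need at least {minimum}."
        )
    return labeled
```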
No core dependency or runtime coupling required in phase 1.
If maintainers find this useful, phase 2 could be a lightweight helper in `inspect_ai.analysis` that exports score records in CJE-ready format.
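One possible shape for such a helper, sketched under assumptions: the function name and signature are hypothetical, and samples are duck-typed (anything exposing `.id` and a `.scores` mapping, with scorer values that are either Score-like objects with `.value` or plain grade strings).

```python
def scores_to_cje_records(samples, scorer_name, grade_to_score, human_labels=None):
    """Convert Inspect samples into CJE-ready fresh-draw records.

    Hypothetical helper: mirrors the log-to-CJE bridge pattern, returning
    a list of dicts with "prompt_id", "judge_score", and, where a human
    label exists, "oracle_label".
    """
    human_labels = human_labels or {}
    records = []
    for i, sample in enumerate(samples):
        score = sample.scores.get(scorer_name)
        if score is None:
            continue
        # Accept either a Score-like object (.value) or a bare grade string
        grade = str(getattr(score, "value", score))
        prompt_id = str(sample.id if sample.id is not None else i)
        record = {
            "prompt_id": prompt_id,
            "judge_score": grade_to_score.get(grade, 0.0),
        }
        if prompt_id in human_labels:
            record["oracle_label"] = float(human_labels[prompt_id])
        records.append(record)
    return records
```

The output plugs directly into `analyze_dataset(fresh_draws_data={policy_name: records})`, so users would only need the one call after exporting.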
Why this helps
This gives users a practical path from model-graded scores to calibrated estimates with confidence intervals, plus an explicit way to plan oracle labeling effort.
Disclosure
I am a co-author of CJE.