Feature request: Calibrated accuracy estimates for model-graded scorers #3236

@elandesberg

Problem

Inspect's model-graded scorers (model_graded_qa, model_graded_fact) are useful for scaling evaluation, but raw judge scores can be biased relative to oracle labels.

Today, there is no built-in "post-hoc calibration" workflow in Inspect that answers:

  1. What is the calibrated estimate (with uncertainty) after collecting a small oracle-labeled subset?
  2. How can users run this calibration directly from EvalLog outputs?
  3. How can users plan oracle labeling budget rather than choosing sample size ad hoc?

Proposal

Add a docs-first integration path for CJE (Causal Judge Evaluation) as an optional post-processing step.

cje-eval is already usable for this workflow (verified against current PyPI API):

  • from cje import analyze_dataset
  • analyze_dataset(..., fresh_draws_data=...)
  • Returns EstimationResult with estimates, standard_errors, method, and ci()

Minimal example from Inspect logs

from inspect_ai.log import read_eval_log
from cje import analyze_dataset

log = read_eval_log("logs/my_eval.eval")

# Inspect model-graded scores are typically C/P/I/N
grade_to_score = {"C": 1.0, "P": 0.5, "I": 0.0, "N": 0.0}

scorer_name = "model_graded_qa"  # or "model_graded_fact"
policy_name = "model_under_test"

# Oracle labels in [0, 1], keyed by prompt id (collected separately)
human_labels: dict[str, float] = {}
records = []
for i, sample in enumerate(log.samples or []):
    score_obj = (sample.scores or {}).get(scorer_name)
    if score_obj is None:
        continue

    prompt_id = str(sample.id if sample.id is not None else i)
    record = {
        "prompt_id": prompt_id,
        "judge_score": grade_to_score.get(str(score_obj.value), 0.0),
    }

    if prompt_id in human_labels:
        record["oracle_label"] = float(human_labels[prompt_id])

    records.append(record)

results = analyze_dataset(fresh_draws_data={policy_name: records})

print("Calibrated estimate:", float(results.estimates[0]))
print("SE:", float(results.standard_errors[0]))
print("95% CI:", results.ci()[0])
print("Method:", results.method)

Sample-size planning (CJE functionality)

CJE also exposes planning APIs that can answer "how many oracle labels are enough":

  • fit_variance_model(...) on pilot fresh-draw data
  • plan_evaluation(...) for optimal n (judge-scored) and m (oracle-labeled) allocation under a fixed budget
  • plan_for_mde(...) for minimum budget/allocation to hit a target MDE
  • simulate_planning(...) for quick what-if planning

This could be mentioned as an advanced follow-on in docs (after the basic calibration recipe).
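For intuition about the scale of the answer, a classic normal-approximation bound already says a lot: with label standard deviation sigma and target minimum detectable effect (MDE), m ≈ (z·sigma/MDE)² oracle labels are needed. The sketch below is this back-of-envelope baseline only, not CJE's variance-model-based planners, which refine it.

```python
import math

def naive_oracle_count(sigma: float, mde: float, z: float = 1.96) -> int:
    """Back-of-envelope oracle sample size: SE = sigma / sqrt(m),
    so m = (z * sigma / mde)**2 for a two-sided 95% interval."""
    return math.ceil((z * sigma / mde) ** 2)

# e.g. near-binary labels (sigma ~ 0.5) and a 5-point MDE
print(naive_oracle_count(0.5, 0.05))  # -> 385
```

CJE's planners should generally return smaller m than this naive bound, since calibration borrows strength from the cheap judge scores.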

Scope (suggested)

Start with documentation only:

  1. Add a "Judge score calibration" section to scorer/log-analysis docs.
  2. Show the log-to-CJE bridge pattern above.
  3. Note oracle-label requirements (numeric labels; the package requires at least 10 labels for cross-validation and recommends more).
  4. Add a short pointer to CJE planning utilities for sample-size/budget planning.
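The docs could also show a fail-fast check on the record list before calling analyze_dataset, so the too-few-labels case surfaces early. The helper below is illustrative only, not part of CJE or Inspect:

```python
def check_oracle_coverage(records: list[dict], min_labels: int = 10) -> int:
    """Count records carrying an 'oracle_label' and raise if below the
    minimum CJE needs for cross-validated calibration."""
    n_oracle = sum(1 for r in records if "oracle_label" in r)
    if n_oracle < min_labels:
        raise ValueError(
            f"Only {n_oracle} oracle-labeled records; need at least {min_labels}."
        )
    return n_oracle

records = [{"prompt_id": str(i), "judge_score": 1.0} for i in range(50)]
for r in records[:12]:
    r["oracle_label"] = 1.0
print(check_oracle_coverage(records))  # -> 12
```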

No core dependency or runtime coupling required in phase 1.

If maintainers find this useful, phase 2 could be a lightweight helper in inspect_ai.analysis that exports score records in CJE-ready format.

Why this helps

This gives users a practical path from model-graded scores to calibrated estimates with confidence intervals, plus an explicit way to plan oracle labeling effort.

Disclosure

I am a co-author of CJE.
