
Tests for Inspect adapter edge cases (empty choices, local paths, misleading dataset names)#105

Open
MattFisher wants to merge 1 commit into evaleval:main from MattFisher:inspect-edge-case-tests

Conversation

@MattFisher

Summary

Three issues discovered while converting ~100 real-world Inspect evaluation logs (from CVE-Bench and the inspect_evals dashboard):

1. Empty output.choices → IndexError

Some agentic samples (e.g. sandbox failures) have output.choices == []. The instance-level adapter unconditionally indexes choices[0], raising IndexError (wrapped in TransformationError).

Possible fix: Guard the access:

if sample.output.choices:
    message = sample.output.choices[0].message
    content = message.content
else:
    content = ''

Also needs a similar guard for stop_reason in the metadata dict.

2. Local filesystem paths in hf_repo

_extract_source_data always returns SourceDataHf with hf_repo=dataset.location. When the dataset comes from a local file (not HuggingFace), this produces paths like /root/.cache/inspect_evals/... or /opt/venv/lib/.../challenges.json. This is particularly an issue for cyber evals that don't use an external dataset.

3. Misleading dataset_name

dataset_name is derived from dataset.name which for many inspect_evals is the internal filename rather than the benchmark name:

  • cyse2_vulnerability_exploit → 'challenges'
  • gdm_intercode_ctf → 'ic_ctf'

Tests

5 tests in tests/test_inspect_edge_cases.py, all marked xfail(strict=True) to document current behavior. Fixtures are stripped-down real logs (~10–27KB each, 1 sample, events removed).

Fixtures

  • data_cvebench_empty_choices.json — CVE-Bench sample with empty choices (sandbox failed to start)
  • data_intercode_ctf_local_path.json — gdm_intercode_ctf with local cache path
  • data_cyse2_vuln_exploit_challenges.json — cyberseceval_2 with misleading dataset name

Add tests and fixtures for three issues discovered converting real-world
Inspect logs:

1. Empty output.choices — agentic samples (e.g. sandbox failures) can
   have output.choices == [], causing IndexError in
   instance_level_adapter.py at `sample.output.choices[0].message`.

2. Local filesystem paths in hf_repo — when datasets come from local
   files (not HuggingFace), dataset.location is a filesystem path like
   `/root/.cache/...` that gets passed through as hf_repo verbatim.

3. Misleading dataset_name — some inspect_evals use internal filenames
   as dataset names (e.g. 'challenges' for cyberseceval_2, 'ic_ctf' for
   gdm_intercode_ctf) rather than the benchmark name.

All tests are marked xfail(strict=True) to document the current behavior.
Fixtures are stripped-down real logs (~10-27KB each, 1 sample, events removed).
@MattFisher MattFisher marked this pull request as ready for review April 14, 2026 04:48
@damian1996
Collaborator

Hi @MattFisher! Thank you for your contribution :) Would you like to include fixes for these issues in the Inspect adapter? We could add those changes to this PR for simplicity.

  • The fix for Issue 1 looks good to me. Regarding stop_reason, we currently use:
    str(sample.output.stop_reason) if sample.output.stop_reason else ''
    Do you think this is insufficiently guarded? Given that the documentation indicates stop_reason may depend on the choices list, we could change it to:
    str(sample.output.stop_reason) if sample.output.choices and sample.output.stop_reason else ''
    Alternatively, we could modify stop_reason to include stop reasons for all messages in the metadata, rather than just the first one. What are your thoughts?

  • For the second issue, good catch! What do you think about using a heuristic to detect local paths and then crafting a SourceDataPrivate object? Does that seem like a robust enough solution?

  • Regarding the last issue: if dataset.name is consistently misleading, do you see any other valid options for what we should use as the dataset_name?
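
The local-path heuristic floated in the second bullet could be sketched roughly like this (the function name and the specific rules are assumptions for illustration, not the adapter's implemented logic):

```python
def looks_like_local_path(location: str) -> bool:
    """Heuristic: treat filesystem-style locations as local, so the adapter
    could emit a SourceDataPrivate-style record instead of SourceDataHf.

    Illustrative sketch only; the exact rules would need review.
    """
    # Absolute/relative POSIX paths, home-relative paths, and file:// URLs
    if location.startswith(('/', './', '../', '~', 'file://')):
        return True
    # Windows drive-letter paths, e.g. C:\...
    if len(location) >= 3 and location[1] == ':' and location[2] in ('\\', '/'):
        return True
    return False
```

An HF repo id like `org/name` would fall through to `False` under these rules, while paths such as `/root/.cache/...` would be flagged as local.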

Contributor

Copilot AI left a comment


Pull request overview

Adds an Inspect adapter edge-case test suite (currently xfail) and corresponding real-world log fixtures to document known conversion issues around empty outputs, local dataset paths, and ambiguous dataset names.

Changes:

  • Add tests/test_inspect_edge_cases.py with 5 xfail(strict=True) tests covering three Inspect conversion edge cases.
  • Add three stripped-down Inspect JSON log fixtures under tests/data/inspect/ to reproduce the scenarios.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

  • tests/test_inspect_edge_cases.py — New edge-case tests + shared loader helper for transforming logs and reading instance-level outputs.
  • tests/data/inspect/data_cvebench_empty_choices.json — Fixture with output.choices: [] and local dataset path.
  • tests/data/inspect/data_intercode_ctf_local_path.json — Fixture with local cached dataset path and ambiguous dataset name.
  • tests/data/inspect/data_cyse2_vuln_exploit_challenges.json — Fixture with local path + generic dataset.name ("challenges").


def _load_eval_and_instances(filepath, metadata_args=None):
    """Load a log through the adapter and return (EvaluationLog, [InstanceLevelEvaluationLog])."""
    adapter = InspectAIAdapter()
    args = dict(metadata_args or METADATA_ARGS)

Copilot AI Apr 17, 2026


metadata_args or METADATA_ARGS will ignore an explicitly passed empty dict (it falls back to the defaults) and also doesn't merge overrides with the defaults. Consider using an explicit if metadata_args is None check and merging via {**METADATA_ARGS, **metadata_args} so callers can override only the fields they need without surprising behavior.

Suggested change
-args = dict(metadata_args or METADATA_ARGS)
+if metadata_args is None:
+    args = dict(METADATA_ARGS)
+else:
+    args = {**METADATA_ARGS, **metadata_args}

Comment on lines +113 to +117
@pytest.mark.xfail(
    reason='adapter._extract_source_data always returns SourceDataHf '
    'with hf_repo=dataset.location, even for local paths',
    strict=True,
)

Copilot AI Apr 17, 2026


For the xfail tests that are meant to capture incorrect source_data values (not conversion crashes), xfail(strict=True) without a raises= clause will also treat unexpected exceptions (e.g. a parsing error) as an expected failure, potentially masking regressions. Consider adding raises=AssertionError (or otherwise asserting the conversion succeeded) so they only xfail on the intended assertion failures.
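
Concretely, the tightened marker might look like this (the test name, reason string, and placeholder body are illustrative, not the PR's actual test):

```python
import pytest

@pytest.mark.xfail(
    reason='local dataset path is passed through as hf_repo',
    strict=True,
    raises=AssertionError,  # only the intended assertion failure counts as xfail
)
def test_local_path_source_data_placeholder():
    # Placeholder for the real assertion on source_data; with raises= set,
    # a conversion crash (e.g. a parsing error) would now fail the run loudly.
    assert False
```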

Comment on lines +16 to +22
"name": "ic_ctf",
"location": "/root/.cache/inspect_evals/gdm_intercode_ctf/data/ic_ctf.json",
"samples": 1,
"sample_ids": [
0,
2,
3,

Copilot AI Apr 17, 2026


The dataset block says "samples": 1 but sample_ids includes many IDs. Since these fixtures are described as stripped-down logs, this inconsistency can confuse readers and makes it harder to reason about adapter output. Consider either trimming sample_ids to match the single retained sample or restoring samples to the original count so the metadata remains internally consistent.

"name": "challenges",
"location": "/opt/venv/lib/python3.12/site-packages/inspect_evals/cyberseceval_2/vulnerability_exploit/generated/challenges.json",
"samples": 1,
"sample_ids": [

Copilot AI Apr 17, 2026


The dataset block says "samples": 1 but sample_ids spans hundreds of IDs. If the fixture is intentionally reduced to a single sample, consider trimming sample_ids (and/or setting it to just that sample's ID) or adjusting samples to match the remaining list so the metadata stays consistent and the fixture stays lightweight.

Suggested change
-"sample_ids": [
+"sample_ids": [
+    1
+],
+"legacy_sample_ids": [
Comment on lines +30 to +36
"dataset": {
"location": "/some_path/cve-bench/src/critical/challenges",
"samples": 1,
"sample_ids": [
"CVE-2024-34340-zero_day",
"CVE-2024-34340-one_day",
"CVE-2024-30542-zero_day",

Copilot AI Apr 17, 2026


The dataset block indicates "samples": 1 but sample_ids lists many challenge IDs (and results.total_samples is much larger). If this fixture is meant to represent a single-sample trimmed log, consider trimming sample_ids accordingly or keeping samples consistent with the remaining metadata to avoid misleading adapter outputs (e.g., samples_number).
