
Tests for Inspect adapter edge cases (empty choices, local paths, misleading dataset names)#105

Open
MattFisher wants to merge 1 commit into evaleval:main from MattFisher:inspect-edge-case-tests

Conversation

@MattFisher

Summary

Three issues discovered while converting ~100 real-world Inspect evaluation logs (from CVE-Bench and the inspect_evals dashboard):

1. Empty output.choices → IndexError

Some agentic samples (e.g. sandbox failures) have output.choices == []. The instance-level adapter unconditionally indexes choices[0], raising IndexError (wrapped in TransformationError).

Possible fix: Guard the access:

if sample.output.choices:
    message = sample.output.choices[0].message
    content = message.content
else:
    content = ''

Also needs a similar guard for stop_reason in the metadata dict.

2. Local filesystem paths in hf_repo

_extract_source_data always returns SourceDataHf with hf_repo=dataset.location. When the dataset comes from a local file (not HuggingFace), this produces paths like /root/.cache/inspect_evals/... or /opt/venv/lib/.../challenges.json. This is particularly an issue for cyber evals that don't use an external dataset.

3. Misleading dataset_name

dataset_name is derived from dataset.name which for many inspect_evals is the internal filename rather than the benchmark name:

  • cyse2_vulnerability_exploit → 'challenges'
  • gdm_intercode_ctf → 'ic_ctf'

Tests

5 tests in tests/test_inspect_edge_cases.py, all marked xfail(strict=True) to document current behavior. Fixtures are stripped-down real logs (~10–27KB each, 1 sample, events removed).

Fixtures

  • data_cvebench_empty_choices.json — CVE-Bench sample with empty choices (sandbox failed to start)
  • data_intercode_ctf_local_path.json — gdm_intercode_ctf with local cache path
  • data_cyse2_vuln_exploit_challenges.json — cyberseceval_2 with misleading dataset name

Add tests and fixtures for three issues discovered converting real-world
Inspect logs:

1. Empty output.choices — agentic samples (e.g. sandbox failures) can
   have output.choices == [], causing IndexError in
   instance_level_adapter.py at `sample.output.choices[0].message`.

2. Local filesystem paths in hf_repo — when datasets come from local
   files (not HuggingFace), dataset.location is a filesystem path like
   `/root/.cache/...` that gets passed through as hf_repo verbatim.

3. Misleading dataset_name — some inspect_evals use internal filenames
   as dataset names (e.g. 'challenges' for cyberseceval_2, 'ic_ctf' for
   gdm_intercode_ctf) rather than the benchmark name.

All tests are marked xfail(strict=True) to document the current behavior.
Fixtures are stripped-down real logs (~10-27KB each, 1 sample, events removed).
@MattFisher MattFisher marked this pull request as ready for review April 14, 2026 04:48
@damian1996
Collaborator

Hi @MattFisher! Thank you for your contribution :) Would you like to include fixes for these issues in the Inspect adapter? We could add those changes to this PR for simplicity.

  • The fix for Issue 1 looks good to me. Regarding stop_reason, we currently use:
    str(sample.output.stop_reason) if sample.output.stop_reason else ''
    Do you think this is insufficiently guarded? Given that the documentation indicates stop_reason may depend on the choices list, we could change it to:
    str(sample.output.stop_reason) if sample.output.choices and sample.output.stop_reason else ''
    Alternatively, we could modify stop_reason to include stop reasons for all messages in the metadata, rather than just the first one. What are your thoughts?

  • For the second issue, good catch! What do you think about using a heuristic to detect local paths and then crafting a SourceDataPrivate object? Does that seem like a robust enough solution?

  • Regarding the last issue: if dataset.name is consistently misleading, do you see any other valid options for what we should use as the dataset_name?
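
The local-path heuristic floated in the second bullet could be sketched roughly like this (the function name and the specific rules are assumptions for illustration, not the adapter's implemented logic):

```python
def looks_like_local_path(location: str) -> bool:
    """Heuristic: treat filesystem-style locations as local, so the adapter
    could emit a SourceDataPrivate-style record instead of SourceDataHf.

    Illustrative sketch only; the exact rules would need review.
    """
    # Absolute/relative POSIX paths, home-relative paths, and file:// URLs
    if location.startswith(('/', './', '../', '~', 'file://')):
        return True
    # Windows drive-letter paths, e.g. C:\...
    if len(location) >= 3 and location[1] == ':' and location[2] in ('\\', '/'):
        return True
    return False
```

An HF repo id like `org/name` would fall through to `False` under these rules, while paths such as `/root/.cache/...` would be flagged as local.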

Contributor

Copilot AI left a comment


Pull request overview

Adds an Inspect adapter edge-case test suite (currently xfail) and corresponding real-world log fixtures to document known conversion issues around empty outputs, local dataset paths, and ambiguous dataset names.

Changes:

  • Add tests/test_inspect_edge_cases.py with 5 xfail(strict=True) tests covering three Inspect conversion edge cases.
  • Add three stripped-down Inspect JSON log fixtures under tests/data/inspect/ to reproduce the scenarios.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

  • tests/test_inspect_edge_cases.py — New edge-case tests + shared loader helper for transforming logs and reading instance-level outputs.
  • tests/data/inspect/data_cvebench_empty_choices.json — Fixture with output.choices: [] and local dataset path.
  • tests/data/inspect/data_intercode_ctf_local_path.json — Fixture with local cached dataset path and ambiguous dataset name.
  • tests/data/inspect/data_cyse2_vuln_exploit_challenges.json — Fixture with local path + generic dataset.name ("challenges").


def _load_eval_and_instances(filepath, metadata_args=None):
    """Load a log through the adapter and return (EvaluationLog, [InstanceLevelEvaluationLog])."""
    adapter = InspectAIAdapter()
    args = dict(metadata_args or METADATA_ARGS)

Copilot AI Apr 17, 2026


metadata_args or METADATA_ARGS will ignore an explicitly passed empty dict (it falls back to the defaults) and also doesn't merge overrides with the defaults. Consider using an explicit if metadata_args is None check and merging via {**METADATA_ARGS, **metadata_args} so callers can override only the fields they need without surprising behavior.

Suggested change
-args = dict(metadata_args or METADATA_ARGS)
+if metadata_args is None:
+    args = dict(METADATA_ARGS)
+else:
+    args = {**METADATA_ARGS, **metadata_args}

Comment on lines +113 to +117
@pytest.mark.xfail(
    reason='adapter._extract_source_data always returns SourceDataHf '
    'with hf_repo=dataset.location, even for local paths',
    strict=True,
)

Copilot AI Apr 17, 2026


For the xfail tests that are meant to capture incorrect source_data values (not conversion crashes), xfail(strict=True) without a raises= clause will also treat unexpected exceptions (e.g. a parsing error) as an expected failure, potentially masking regressions. Consider adding raises=AssertionError (or otherwise asserting the conversion succeeded) so they only xfail on the intended assertion failures.
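
Concretely, the tightened marker might look like this (the test name, reason string, and placeholder body are illustrative, not the PR's actual test):

```python
import pytest

@pytest.mark.xfail(
    reason='local dataset path is passed through as hf_repo',
    strict=True,
    raises=AssertionError,  # only the intended assertion failure counts as xfail
)
def test_local_path_source_data_placeholder():
    # Placeholder for the real assertion on source_data; with raises= set,
    # a conversion crash (e.g. a parsing error) would now fail the run loudly.
    assert False
```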

Comment on lines +16 to +22
"name": "ic_ctf",
"location": "/root/.cache/inspect_evals/gdm_intercode_ctf/data/ic_ctf.json",
"samples": 1,
"sample_ids": [
0,
2,
3,

Copilot AI Apr 17, 2026


The dataset block says "samples": 1 but sample_ids includes many IDs. Since these fixtures are described as stripped-down logs, this inconsistency can confuse readers and makes it harder to reason about adapter output. Consider either trimming sample_ids to match the single retained sample or restoring samples to the original count so the metadata remains internally consistent.

"name": "challenges",
"location": "/opt/venv/lib/python3.12/site-packages/inspect_evals/cyberseceval_2/vulnerability_exploit/generated/challenges.json",
"samples": 1,
"sample_ids": [

Copilot AI Apr 17, 2026


The dataset block says "samples": 1 but sample_ids spans hundreds of IDs. If the fixture is intentionally reduced to a single sample, consider trimming sample_ids (and/or setting it to just that sample's ID) or adjusting samples to match the remaining list so the metadata stays consistent and the fixture stays lightweight.

Suggested change
-"sample_ids": [
+"sample_ids": [
+    1
+],
+"legacy_sample_ids": [
Comment on lines +30 to +36
"dataset": {
"location": "/some_path/cve-bench/src/critical/challenges",
"samples": 1,
"sample_ids": [
"CVE-2024-34340-zero_day",
"CVE-2024-34340-one_day",
"CVE-2024-30542-zero_day",

Copilot AI Apr 17, 2026


The dataset block indicates "samples": 1 but sample_ids lists many challenge IDs (and results.total_samples is much larger). If this fixture is meant to represent a single-sample trimmed log, consider trimming sample_ids accordingly or keeping samples consistent with the remaining metadata to avoid misleading adapter outputs (e.g., samples_number).
