Tests for Inspect adapter edge cases (empty choices, local paths, misleading dataset names) #105
MattFisher wants to merge 1 commit into evaleval:main from
Conversation
Add tests and fixtures for three issues discovered converting real-world Inspect logs:

1. Empty `output.choices` — agentic samples (e.g. sandbox failures) can have `output.choices == []`, causing `IndexError` in `instance_level_adapter.py` at `sample.output.choices[0].message`.
2. Local filesystem paths in `hf_repo` — when datasets come from local files (not HuggingFace), `dataset.location` is a filesystem path like `/root/.cache/...` that gets passed through as `hf_repo` verbatim.
3. Misleading `dataset_name` — some inspect_evals use internal filenames as dataset names (e.g. `'challenges'` for cyberseceval_2, `'ic_ctf'` for gdm_intercode_ctf) rather than the benchmark name.

All tests are marked `xfail(strict=True)` to document the current behavior. Fixtures are stripped-down real logs (~10–27KB each, 1 sample, events removed).
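A minimal guard for the empty-choices case might look like the sketch below. The helper name `first_choice_message` and the `SimpleNamespace` stand-ins are hypothetical; the real adapter accesses `sample.output.choices[0].message` directly on Inspect's log models.

```python
from types import SimpleNamespace

def first_choice_message(sample, default=""):
    """Return sample.output.choices[0].message, or `default` when
    output.choices is empty (e.g. a sandbox that failed to start)."""
    choices = getattr(sample.output, "choices", None) or []
    if not choices:
        return default
    return choices[0].message

# Stand-in samples built with SimpleNamespace; real Inspect logs use
# typed models, but the attribute shape accessed here is the same.
failed = SimpleNamespace(output=SimpleNamespace(choices=[]))
ok = SimpleNamespace(
    output=SimpleNamespace(choices=[SimpleNamespace(message="flag{...}")])
)

print(first_choice_message(failed))  # prints an empty string
print(first_choice_message(ok))     # prints "flag{...}"
```

A similar `or []` / early-return guard would cover the `stop_reason` lookup in the metadata dict.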
Hi @MattFisher! Thank you for your contribution :) Would you like to include fixes for these issues in the Inspect adapter? We could add these changes to this PR for simplicity.
Pull request overview
Adds an Inspect adapter edge-case test suite (currently xfail) and corresponding real-world log fixtures to document known conversion issues around empty outputs, local dataset paths, and ambiguous dataset names.
Changes:
- Add `tests/test_inspect_edge_cases.py` with 5 `xfail(strict=True)` tests covering three Inspect conversion edge cases.
- Add three stripped-down Inspect JSON log fixtures under `tests/data/inspect/` to reproduce the scenarios.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `tests/test_inspect_edge_cases.py` | New edge-case tests + shared loader helper for transforming logs and reading instance-level outputs. |
| `tests/data/inspect/data_cvebench_empty_choices.json` | Fixture with `output.choices: []` and local dataset path. |
| `tests/data/inspect/data_intercode_ctf_local_path.json` | Fixture with local cached dataset path and ambiguous dataset name. |
| `tests/data/inspect/data_cyse2_vuln_exploit_challenges.json` | Fixture with local path + generic `dataset.name` ("challenges"). |
```python
def _load_eval_and_instances(filepath, metadata_args=None):
    """Load a log through the adapter and return (EvaluationLog, [InstanceLevelEvaluationLog])."""
    adapter = InspectAIAdapter()
    args = dict(metadata_args or METADATA_ARGS)
```
`metadata_args or METADATA_ARGS` will ignore an explicitly passed empty dict (it falls back to the defaults) and also doesn't merge overrides with the defaults. Consider using an explicit `if metadata_args is None` check and merging via `{**METADATA_ARGS, **metadata_args}` so callers can override only the fields they need without surprising behavior.
```diff
-    args = dict(metadata_args or METADATA_ARGS)
+    if metadata_args is None:
+        args = dict(METADATA_ARGS)
+    else:
+        args = {**METADATA_ARGS, **metadata_args}
```
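The difference is easy to demonstrate in isolation. `METADATA_ARGS` below is an illustrative stand-in for the test module's real constant, not its actual contents:

```python
METADATA_ARGS = {"evaluated_by": "demo", "split": "test"}  # stand-in defaults

def args_truthy(metadata_args=None):
    # Current behavior: an explicit {} is falsy, so the defaults silently win,
    # and a partial dict *replaces* the defaults instead of merging with them.
    return dict(metadata_args or METADATA_ARGS)

def args_merged(metadata_args=None):
    # Suggested behavior: only None falls back to the defaults; anything else
    # merges over them, so callers can override a single field.
    if metadata_args is None:
        return dict(METADATA_ARGS)
    return {**METADATA_ARGS, **metadata_args}

print(args_truthy({"split": "dev"}))  # {'split': 'dev'}  (defaults lost)
print(args_merged({"split": "dev"}))  # {'evaluated_by': 'demo', 'split': 'dev'}
```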
```python
@pytest.mark.xfail(
    reason='adapter._extract_source_data always returns SourceDataHf '
    'with hf_repo=dataset.location, even for local paths',
    strict=True,
)
```
For the xfail tests that are meant to capture incorrect `source_data` values (not conversion crashes), `xfail(strict=True)` without a `raises=` clause will also treat unexpected exceptions (e.g. a parsing error) as an expected failure, potentially masking regressions. Consider adding `raises=AssertionError` (or otherwise asserting the conversion succeeded) so they only xfail on the intended assertion failures.
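A sketch of the suggested marker shape follows; the test body, reason string, and placeholder value are illustrative, not the PR's actual tests:

```python
import pytest

@pytest.mark.xfail(
    reason="adapter emits hf_repo=dataset.location even for local paths",
    raises=AssertionError,  # only an AssertionError counts as the expected failure
    strict=True,
)
def test_local_path_not_used_as_hf_repo():
    hf_repo = "/root/.cache/inspect_evals/ic_ctf.json"  # placeholder adapter output
    # The intended xfail: this assertion fails today. If conversion instead
    # crashed with e.g. a parsing error, pytest would report that as a real
    # failure rather than silently folding it into the xfail.
    assert not hf_repo.startswith("/")
```

With `raises=AssertionError`, any other exception type escapes the xfail and surfaces in the test report.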
| "name": "ic_ctf", | ||
| "location": "/root/.cache/inspect_evals/gdm_intercode_ctf/data/ic_ctf.json", | ||
| "samples": 1, | ||
| "sample_ids": [ | ||
| 0, | ||
| 2, | ||
| 3, |
The dataset block says `"samples": 1` but `sample_ids` includes many IDs. Since these fixtures are described as stripped-down logs, this inconsistency can confuse readers and makes it harder to reason about adapter output. Consider either trimming `sample_ids` to match the single retained sample or restoring `samples` to the original count so the metadata remains internally consistent.
| "name": "challenges", | ||
| "location": "/opt/venv/lib/python3.12/site-packages/inspect_evals/cyberseceval_2/vulnerability_exploit/generated/challenges.json", | ||
| "samples": 1, | ||
| "sample_ids": [ |
The dataset block says `"samples": 1` but `sample_ids` spans hundreds of IDs. If the fixture is intentionally reduced to a single sample, consider trimming `sample_ids` (and/or setting it to just that sample's ID) or adjusting `samples` to match the remaining list so the metadata stays consistent and the fixture stays lightweight.
| "sample_ids": [ | |
| "sample_ids": [ | |
| 1 | |
| ], | |
| "legacy_sample_ids": [ |
| "dataset": { | ||
| "location": "/some_path/cve-bench/src/critical/challenges", | ||
| "samples": 1, | ||
| "sample_ids": [ | ||
| "CVE-2024-34340-zero_day", | ||
| "CVE-2024-34340-one_day", | ||
| "CVE-2024-30542-zero_day", |
The dataset block indicates `"samples": 1` but `sample_ids` lists many challenge IDs (and `results.total_samples` is much larger). If this fixture is meant to represent a single-sample trimmed log, consider trimming `sample_ids` accordingly or keeping `samples` consistent with the remaining metadata to avoid misleading adapter outputs (e.g., `samples_number`).
Summary

Three issues discovered while converting ~100 real-world Inspect evaluation logs (from CVE-Bench and the inspect_evals dashboard):

1. Empty `output.choices` → `IndexError`

Some agentic samples (e.g. sandbox failures) have `output.choices == []`. The instance-level adapter unconditionally indexes `choices[0]`, raising `IndexError` (wrapped in `TransformationError`). Possible fix: guard the access. Also needs a similar guard for `stop_reason` in the metadata dict.

2. Local filesystem paths in `hf_repo`

`_extract_source_data` always returns `SourceDataHf` with `hf_repo=dataset.location`. When the dataset comes from a local file (not HuggingFace), this produces paths like `/root/.cache/inspect_evals/...` or `/opt/venv/lib/.../challenges.json`. This is particularly an issue for cyber evals that don't use an external dataset.

3. Misleading `dataset_name`

`dataset_name` is derived from `dataset.name`, which for many inspect_evals is the internal filename rather than the benchmark name:

- `cyse2_vulnerability_exploit` → `'challenges'`
- `gdm_intercode_ctf` → `'ic_ctf'`

Tests

5 tests in `tests/test_inspect_edge_cases.py`, all marked `xfail(strict=True)` to document current behavior. Fixtures are stripped-down real logs (~10–27KB each, 1 sample, events removed).

Fixtures

- `data_cvebench_empty_choices.json` — CVE-Bench sample with empty choices (sandbox failed to start)
- `data_intercode_ctf_local_path.json` — gdm_intercode_ctf with local cache path
- `data_cyse2_vuln_exploit_challenges.json` — cyberseceval_2 with misleading dataset name
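For the `hf_repo` issue, one possible direction is to detect filesystem paths before emitting `SourceDataHf`. This is a sketch only: `looks_like_local_path` is a hypothetical helper, and the real fix would live in `_extract_source_data`.

```python
def looks_like_local_path(location: str) -> bool:
    """Heuristic: HF repo ids look like 'org/name'; filesystem locations
    start with a path prefix or point at a concrete data file."""
    return (
        location.startswith(("/", "./", "../", "~"))
        or ":\\" in location  # Windows drive path, e.g. C:\data\...
        or location.endswith((".json", ".jsonl", ".csv"))
    )

print(looks_like_local_path(
    "/root/.cache/inspect_evals/gdm_intercode_ctf/data/ic_ctf.json"))  # True
print(looks_like_local_path("some-org/some-dataset"))  # False
```

The heuristic would only be a first pass; the adapter could then fall back to a non-HF source-data variant (or omit `hf_repo`) for local locations.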