Large diffs are not rendered by default; this applies to each fixture file listed here.

2,961 changes: 2,961 additions & 0 deletions tests/data/inspect/inspect_shortened/DS-1000.json
2,154 changes: 2,154 additions & 0 deletions tests/data/inspect/inspect_shortened/browse-comp.json
2,052 changes: 2,052 additions & 0 deletions tests/data/inspect/inspect_shortened/chembench.json
3,573 changes: 3,573 additions & 0 deletions tests/data/inspect/inspect_shortened/class-eval.json
1,951 changes: 1,951 additions & 0 deletions tests/data/inspect/inspect_shortened/commonsense-qa.json
3,532 changes: 3,532 additions & 0 deletions tests/data/inspect/inspect_shortened/compute-eval.json
2,256 changes: 2,256 additions & 0 deletions tests/data/inspect/inspect_shortened/cybermetric-10000.json
2,433 changes: 2,433 additions & 0 deletions tests/data/inspect/inspect_shortened/drop.json
59,070 changes: 59,070 additions & 0 deletions tests/data/inspect/inspect_shortened/gaia.json
1,898 changes: 1,898 additions & 0 deletions tests/data/inspect/inspect_shortened/gpqa-diamond.json
2,397 changes: 2,397 additions & 0 deletions tests/data/inspect/inspect_shortened/gsm8k.json
2,292 changes: 2,292 additions & 0 deletions tests/data/inspect/inspect_shortened/hellaswag.json
1,963 changes: 1,963 additions & 0 deletions tests/data/inspect/inspect_shortened/humaneval.json
2,116 changes: 2,116 additions & 0 deletions tests/data/inspect/inspect_shortened/ifeval.json
3,133 changes: 3,133 additions & 0 deletions tests/data/inspect/inspect_shortened/ifevalcode.json
1,951 changes: 1,951 additions & 0 deletions tests/data/inspect/inspect_shortened/lab-bench-cloning-scenarios.json
1,934 changes: 1,934 additions & 0 deletions tests/data/inspect/inspect_shortened/lab-bench-dbqa.json
1,933 changes: 1,933 additions & 0 deletions tests/data/inspect/inspect_shortened/lab-bench-litqa.json
1,947 changes: 1,947 additions & 0 deletions tests/data/inspect/inspect_shortened/lab-bench-protocolqa.json
1,943 changes: 1,943 additions & 0 deletions tests/data/inspect/inspect_shortened/lab-bench-seqqa.json
1,938 changes: 1,938 additions & 0 deletions tests/data/inspect/inspect_shortened/lab-bench-suppqa.json
1,943 changes: 1,943 additions & 0 deletions tests/data/inspect/inspect_shortened/lingoly-too.json
2,075 changes: 2,075 additions & 0 deletions tests/data/inspect/inspect_shortened/lingoly.json
3,230 changes: 3,230 additions & 0 deletions tests/data/inspect/inspect_shortened/livecodebench-pro.json
2,676 changes: 2,676 additions & 0 deletions tests/data/inspect/inspect_shortened/math.json
2,500 changes: 2,500 additions & 0 deletions tests/data/inspect/inspect_shortened/mbpp.json
1,911 changes: 1,911 additions & 0 deletions tests/data/inspect/inspect_shortened/medqa.json
2,456 changes: 2,456 additions & 0 deletions tests/data/inspect/inspect_shortened/mind2web-sc.json
3,162 changes: 3,162 additions & 0 deletions tests/data/inspect/inspect_shortened/mind2web.json
1,946 changes: 1,946 additions & 0 deletions tests/data/inspect/inspect_shortened/mmlu-0-shot.json
2,020 changes: 2,020 additions & 0 deletions tests/data/inspect/inspect_shortened/mmlu-pro.json
2,247 changes: 2,247 additions & 0 deletions tests/data/inspect/inspect_shortened/musr.json
123,920 changes: 123,920 additions & 0 deletions tests/data/inspect/inspect_shortened/niah.json
2,349 changes: 2,349 additions & 0 deletions tests/data/inspect/inspect_shortened/onet-m6.json
1,987 changes: 1,987 additions & 0 deletions tests/data/inspect/inspect_shortened/paws.json
2,531 changes: 2,531 additions & 0 deletions tests/data/inspect/inspect_shortened/personality-BFI.json
2,497 changes: 2,497 additions & 0 deletions tests/data/inspect/inspect_shortened/personality-TRAIT.json
1,876 changes: 1,876 additions & 0 deletions tests/data/inspect/inspect_shortened/piqa.json
1,910 changes: 1,910 additions & 0 deletions tests/data/inspect/inspect_shortened/pre-flight.json
1,930 changes: 1,930 additions & 0 deletions tests/data/inspect/inspect_shortened/pubmedqa.json
1,963 changes: 1,963 additions & 0 deletions tests/data/inspect/inspect_shortened/race-h.json
2,408 changes: 2,408 additions & 0 deletions tests/data/inspect/inspect_shortened/sad-facts-human-defaults.json
2,408 changes: 2,408 additions & 0 deletions tests/data/inspect/inspect_shortened/sad-facts-llms.json
2,414 changes: 2,414 additions & 0 deletions tests/data/inspect/inspect_shortened/sad-influence.json
2,612 changes: 2,612 additions & 0 deletions tests/data/inspect/inspect_shortened/sad-stages-full.json
2,603 changes: 2,603 additions & 0 deletions tests/data/inspect/inspect_shortened/sad-stages-oversight.json
12,099 changes: 12,099 additions & 0 deletions tests/data/inspect/inspect_shortened/scicode.json
1,895 changes: 1,895 additions & 0 deletions tests/data/inspect/inspect_shortened/sec-qa-v1.json
1,897 changes: 1,897 additions & 0 deletions tests/data/inspect/inspect_shortened/sec-qa-v2.json
2,167 changes: 2,167 additions & 0 deletions tests/data/inspect/inspect_shortened/sevenllm-mcq-en.json

78 changes: 78 additions & 0 deletions tests/test_inspect_adapter.py
@@ -298,3 +298,81 @@ def test_convert_model_path_to_standarized_model_ids():
    for model_path, model_id in model_path_to_standarized_id_map.items():
        model_info = extract_model_info_from_model_path(model_path)
        assert model_info.id == model_id

_INSPECT_SHORTENED_EXPECTATIONS = {
"DS-1000.json": ("DS-1000", "accuracy", 0.0),
"browse-comp.json": ("a8e48c63e8a0202fcbde685141796329", "inspect_evals/browse_comp_accuracy", 0.005529225908372828),
"chembench.json": ("ChemBench", "analytical_chemistry", 0.2565789473684211),
"class-eval.json": ("ClassEval", "mean", 0.82),
"commonsense-qa.json": ("commonsense_qa", "accuracy", 0.8),
"compute-eval.json": ("compute-eval", "accuracy", 0.6),
"cybermetric-10000.json": ("CyberMetric-10000", "accuracy", 1.0),
"drop.json": ("drop", "mean", 0.7846345044572627),
"gaia.json": ("GAIA", "accuracy", 0.0),
"gpqa-diamond.json": ("gpqa_diamond_74187e36ccadd6a06b1d98d13e064fed", "accuracy", 0.5),
"gsm8k.json": ("gsm8k", "accuracy", 1.0),
"hellaswag.json": ("hellaswag", "accuracy", 1.0),
"humaneval.json": ("openai_humaneval", "accuracy", 0.8),
"ifeval.json": ("IFEval", "prompt_strict_acc", 0.7208872458410351),
"ifevalcode.json": ("IfEvalCode-testset", "inspect_evals/overall_accuracy", 0.043209876543209874),
"lab-bench-cloning-scenarios.json": ("lab-bench", "accuracy", 0.15151515151515152),
"lab-bench-dbqa.json": ("lab-bench", "accuracy", 0.026923076923076925),
"lab-bench-litqa.json": ("lab-bench", "accuracy", 0.1457286432160804),
"lab-bench-protocolqa.json": ("lab-bench", "accuracy", 0.28703703703703703),
"lab-bench-seqqa.json": ("lab-bench", "accuracy", 0.3333333333333333),
"lab-bench-suppqa.json": ("lab-bench", "accuracy", 0.036585365853658534),
"lingoly-too.json": ("LingOly-TOO", "inspect_evals/obfuscated_mean", 0.06243145821552145),
"lingoly.json": ("lingoly", "inspect_evals/no_context_delta", 0.0794351279788173),
"livecodebench-pro.json": ("livecodebench_pro", "accuracy", 0.0),
"math.json": ("MATH-lighteval", "accuracy", 0.2),
"mbpp.json": ("mbpp", "accuracy", 1.0),
"medqa.json": ("med_qa", "accuracy", 0.4),
"mind2web-sc.json": ("mind2web_sc", "accuracy", 0.6),
"mind2web.json": ("Multimodal-Mind2Web", "inspect_evals/element_accuracy", 0.3896551724137931),
"mmlu-0-shot.json": ("mmlu", "accuracy", 0.8),
"mmlu-pro.json": ("MMLU-Pro", "accuracy", 0.6),
"musr.json": ("MuSR", "accuracy", 0.4),
"niah.json": ("niah", "target_context_length_10000_accuracy", 5.0),
"onet-m6.json": ("thai-onet-m6-exam", "accuracy", 0.4),
"paws.json": ("paws", "accuracy", 0.8),
"personality-BFI.json": ("bfi_dc398081fd7520dab7af7028dd830d84", "Extraversion", 0.725),
"personality-TRAIT.json": ("personality_TRAIT", "Openness", 0.5955955955955956),
"piqa.json": ("piqa", "accuracy", 0.8),
"pre-flight.json": ("pre-flight-06", "accuracy", 0.8),
"pubmedqa.json": ("PubMedQA", "accuracy", 0.2),
"race-h.json": ("race", "accuracy", 0.8),
"sad-facts-human-defaults.json": ("sad_facts_human_defaults", "accuracy", 0.4),
"sad-facts-llms.json": ("sad_facts_llms", "accuracy", 0.8),
"sad-influence.json": ("sad_influence", "accuracy", 0.6),
"sad-stages-full.json": ("sad_stages_full", "accuracy", 0.2),
"sad-stages-oversight.json": ("sad_stages_oversight", "accuracy", 0.3),
"scicode.json": ("problems_excl_dev", "inspect_evals/percentage_main_problems_solved", 0.0),
"sec-qa-v1.json": ("secqa", "accuracy", 1.0),
"sec-qa-v2.json": ("secqa", "accuracy", 1.0),
"sevenllm-mcq-en.json": ("sevenllm_c2388953e215061b1324c268e3c108a1", "accuracy", 0.0),
}


def test_many():
    adapter = InspectAIAdapter()
    metadata_args = {
        'source_organization_name': 'TestOrg',
        'evaluator_relationship': EvaluatorRelationship.first_party,
    }

    fixture_dir = Path(__file__).parent / "data/inspect/inspect_shortened"
    for inspect_eval_path in sorted(fixture_dir.glob("*.json")):
        converted_eval = _load_eval(adapter, inspect_eval_path.resolve(), metadata_args)
        assert converted_eval.detailed_evaluation_results is not None
Collaborator


Good idea to add multiple tests, thanks! Could you also add more assertions to each test in some smart way? One option: serialize the expected values for a few important fields into a file with the same name as the provided eval file.

We do not need to check every field. At a minimum, please consider:

  • ModelInfo.id
  • ModelInfo.developer
  • SourceDataHf.dataset_name
  • fields from each EvaluationResult, especially MetricConfig.evaluation_description (currently the same as metric.name) and ScoreDetails.score

Author


I'll check soon.


        assert converted_eval.model_info.id == 'Qwen/Qwen2.5-7B-Instruct'
        assert converted_eval.model_info.developer == 'Qwen'

        expected = _INSPECT_SHORTENED_EXPECTATIONS.get(inspect_eval_path.name)
        assert expected is not None, f"No expectations defined for {inspect_eval_path.name}"

        expected_dataset_name, expected_eval_description, expected_score = expected
        result = converted_eval.evaluation_results[0]
        assert result.source_data.dataset_name == expected_dataset_name, inspect_eval_path.name
        assert result.metric_config.evaluation_description == expected_eval_description, inspect_eval_path.name
        assert result.score_details.score == expected_score, inspect_eval_path.name
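
The loop above fails when a fixture has no entry in `_INSPECT_SHORTENED_EXPECTATIONS`, but it would not notice an entry whose fixture file has been renamed or removed. A small two-way check could cover that case. The helper name `diff_expectations` is hypothetical and not part of this PR.

```python
def diff_expectations(fixture_names, expectations):
    """Return (missing, stale).

    missing: fixture file names that have no expectation entry.
    stale:   expectation keys that match no fixture file.
    """
    on_disk = set(fixture_names)
    expected = set(expectations)
    return sorted(on_disk - expected), sorted(expected - on_disk)


if __name__ == "__main__":
    # Toy data; in the test this would be the globbed fixture names and
    # the expectations dict defined in the test module.
    missing, stale = diff_expectations(
        ["gsm8k.json", "new-eval.json"],
        {"gsm8k.json": ("gsm8k", "accuracy", 1.0), "old-eval.json": ("old", "accuracy", 0.0)},
    )
    print(missing, stale)
```

Inside the test this might be called as `diff_expectations((p.name for p in fixture_dir.glob("*.json")), _INSPECT_SHORTENED_EXPECTATIONS)`, asserting that both returned lists are empty.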