Adapter for SWE-Bench Verified leaderboard evaluation results #104
Conversation
Pull request overview
Adds a utility adapter to ingest SWE-bench Verified leaderboard submissions and emit EvaluationLog JSONs in the EvalEval schema format.
Changes:
- Introduces `utils/swe_bench_verified/adapter.py` to clone the SWE-bench experiments repo, parse submission metadata/results, and write standardized evaluation logs.
- Adds `utils/swe_bench_verified/__init__.py` to define the new utils module.
- Extends developer inference patterns to recognize additional model families (`devstral`, `kimi`).
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `utils/swe_bench_verified/adapter.py` | New adapter script that converts SWE-bench Verified leaderboard submissions into EvalEval schema JSON logs. |
| `utils/swe_bench_verified/__init__.py` | Initializes the new `swe_bench_verified` utils package. |
| `every_eval_ever/helpers/developer.py` | Adds model-family-to-developer mappings for `devstral` and `kimi`. |
```python
import yaml
from datasets import load_dataset
```
yaml and datasets are imported unconditionally, but they are not listed in the project's runtime dependencies (pyproject.toml). This makes the adapter fail immediately in a fresh install. Consider either (a) adding pyyaml and datasets to an appropriate optional dependency group, or (b) importing them inside main() with a clear ImportError message telling contributors what to install to run this adapter.
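A minimal sketch of option (b), assuming the adapter's entry point is a `main()` function (the exact structure of `adapter.py` is not shown in this PR):

```python
def main():
    # Lazy imports keep the module importable in a fresh install and give
    # contributors a clear, actionable message instead of a bare
    # ModuleNotFoundError at import time.
    try:
        import yaml  # noqa: F401
        from datasets import load_dataset  # noqa: F401
    except ImportError as exc:
        raise ImportError(
            "The SWE-bench Verified adapter needs extra dependencies; "
            "install them with: pip install pyyaml datasets"
        ) from exc
    # ... rest of the adapter logic (clone repo, parse submissions, write logs)
```

Option (a) would instead add `pyyaml` and `datasets` to an optional dependency group in `pyproject.toml`, which is preferable if the adapter is expected to run in CI.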
```python
return EvaluationLog(
    schema_version="0.2.2",
    evaluation_id=eval_id,
    retrieved_timestamp=retrieved_timestamp,
```
schema_version is hard-coded to "0.2.2" here. Other adapters typically use SCHEMA_VERSION from every_eval_ever.helpers so this stays in sync with eval.schema.json when the schema version bumps. Using the constant would avoid silently emitting logs with an outdated schema_version.
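A sketch of the suggested fix, assuming `every_eval_ever.helpers` exposes a `SCHEMA_VERSION` constant as described above (the `build_log` wrapper and dict return are hypothetical stand-ins for the real `EvaluationLog` construction):

```python
try:
    # Preferred: stays in sync with eval.schema.json when the version bumps.
    from every_eval_ever.helpers import SCHEMA_VERSION
except ImportError:
    # Fallback for this illustration only; the real adapter should import it.
    SCHEMA_VERSION = "0.2.2"

def build_log(eval_id, retrieved_timestamp):
    # Pass the shared constant instead of a hard-coded literal so every
    # adapter emits the same schema_version.
    return {
        "schema_version": SCHEMA_VERSION,
        "evaluation_id": eval_id,
        "retrieved_timestamp": retrieved_timestamp,
    }
```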
@copilot I'd import `SCHEMA_VERSION` from `every_eval_ever.helpers` and replace the hard-coded `"0.2.2"` (#104 (comment))
Co-authored-by: Copilot <[email protected]>
@copilot apply changes based on the comments in this thread
Not sure why those Copilot delegations didn't work lol, but no problem. @jatinganhotra thanks for the contribution; once those fixes are in, this LGTM and I'll merge it.
Thanks @tommasocerruti. I don't know why I didn't get notifications for comments on this PR. I've made the changes requested in 95b5b35. Please take a look when you get a chance. Thanks!
tommasocerruti left a comment
Hey @jatinganhotra, thanks for addressing the issues, I just have one small nit before merging.
```python
'olmo': 'allenai',
'nova': 'amazon',
'grok': 'xai',
'kimi': 'moonshot',
```
nit: Sorry for only catching this now, but it's probably better to use `moonshotai` here to keep it consistent with the other adapters.
Sure, done. Will push the newly generated data to the HF datastore soon.
Thanks @tommasocerruti. I've addressed the comments and this PR is now ready. FYI, I also opened #110 for 2 more SWE benchmarks and public leaderboards. Please take a look when you get a chance. I'll create datastore PRs for them shortly.
@jatinganhotra lgtm thank you!