
Adapter for SWE-Bench Verified leaderboard evaluation results#104

Merged
tommasocerruti merged 5 commits intoevaleval:mainfrom
jatinganhotra:add_swe_bench_verified_leaderboard
Apr 20, 2026

Conversation

@jatinganhotra
Contributor

No description provided.


Copilot AI left a comment


Pull request overview

Adds a utility adapter to ingest SWE-bench Verified leaderboard submissions and emit EvaluationLog JSONs in the EvalEval schema format.

Changes:

  • Introduces utils/swe_bench_verified/adapter.py to clone the SWE-bench experiments repo, parse submission metadata/results, and write standardized evaluation logs.
  • Adds utils/swe_bench_verified/__init__.py to define the new utils module.
  • Extends developer inference patterns to recognize additional model families (devstral, kimi).
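The overall flow the adapter implements (parse a submission, emit a standardized log) can be sketched roughly as follows. The function name, field names, and JSON shape here are illustrative assumptions, not the adapter's actual API or the EvalEval schema:

```python
import json
from pathlib import Path

def write_log(submission: dict, out_dir: Path) -> Path:
    """Convert one parsed leaderboard submission into a standardized
    evaluation-log JSON (illustrative shape only, not the real schema)."""
    log = {
        "evaluation_id": submission["name"],
        "score": submission["resolved"] / submission["total"],
    }
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{submission['name']}.json"
    path.write_text(json.dumps(log, indent=2))
    return path
```

The real adapter additionally clones the SWE-bench experiments repo and reads submission metadata before this step.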

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 3 comments.

File Description
utils/swe_bench_verified/adapter.py New adapter script that converts SWE-bench Verified leaderboard submissions into EvalEval schema JSON logs.
utils/swe_bench_verified/__init__.py Initializes the new swe_bench_verified utils package.
every_eval_ever/helpers/developer.py Adds model-family-to-developer mappings for devstral and kimi.
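As a rough illustration of the model-family-to-developer inference that last change extends (the actual structure in every_eval_ever/helpers/developer.py may differ; the devstral-to-mistralai mapping below is an assumption):

```python
# Illustrative sketch only. olmo/nova/grok/kimi appear in the review
# thread; 'devstral': 'mistralai' is an assumed entry.
MODEL_FAMILY_TO_DEVELOPER = {
    'olmo': 'allenai',
    'nova': 'amazon',
    'grok': 'xai',
    'devstral': 'mistralai',
    'kimi': 'moonshotai',
}

def infer_developer(model_name: str):
    """Return the developer slug for a model name, or None if unknown."""
    lowered = model_name.lower()
    for family, developer in MODEL_FAMILY_TO_DEVELOPER.items():
        if family in lowered:
            return developer
    return None
```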


Comment thread utils/swe_bench_verified/adapter.py Outdated
Comment on lines +26 to +28
import yaml
from datasets import load_dataset


Copilot AI Apr 17, 2026


yaml and datasets are imported unconditionally, but they are not listed in the project's runtime dependencies (pyproject.toml). This makes the adapter fail immediately in a fresh install. Consider either (a) adding pyyaml and datasets to an appropriate optional dependency group, or (b) importing them inside main() with a clear ImportError message telling contributors what to install to run this adapter.

Contributor Author


Addressed in 95b5b35

Comment thread utils/swe_bench_verified/adapter.py Outdated
Comment on lines +187 to +190
return EvaluationLog(
schema_version="0.2.2",
evaluation_id=eval_id,
retrieved_timestamp=retrieved_timestamp,

Copilot AI Apr 17, 2026


schema_version is hard-coded to "0.2.2" here. Other adapters typically use SCHEMA_VERSION from every_eval_ever.helpers so this stays in sync with eval.schema.json when the schema version bumps. Using the constant would avoid silently emitting logs with an outdated schema_version.
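The suggested fix, sketched under the assumption that every_eval_ever.helpers exposes a SCHEMA_VERSION string (the stand-in value and the dict shape below are illustrative):

```python
# In the real code this would be:
#     from every_eval_ever.helpers import SCHEMA_VERSION
# A local stand-in keeps this sketch self-contained.
SCHEMA_VERSION = "0.2.2"

def build_log(eval_id: str, retrieved_timestamp: str) -> dict:
    """Build a log header that stays in sync with eval.schema.json
    when the schema version bumps (dict stands in for EvaluationLog)."""
    return {
        "schema_version": SCHEMA_VERSION,
        "evaluation_id": eval_id,
        "retrieved_timestamp": retrieved_timestamp,
    }
```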

Member


@copilot I’d import SCHEMA_VERSION from every_eval_ever.helpers and replace the hard-coded "0.2.2" (#104 (comment))

Contributor Author


Addressed in 95b5b35

@tommasocerruti
Member

@copilot apply changes based on the comments in this thread

@tommasocerruti
Member

Not sure why those Copilot delegations didn’t work lol, but no problem. @jatinganhotra thanks for the contribution, once those fixes are in, this LGTM and I’ll merge it.

@jatinganhotra
Contributor Author

> Not sure why those Copilot delegations didn’t work lol, but no problem. @jatinganhotra thanks for the contribution, once those fixes are in, this LGTM and I’ll merge it.

Thanks @tommasocerruti. I don't know why I didn't get notifications for comments on this PR. I've made the changes requested in 95b5b35.

Please take a look when you get a chance. Thanks


@tommasocerruti tommasocerruti left a comment


Hey @jatinganhotra, thanks for addressing the issues. I just have one small nit before merging.

Comment thread every_eval_ever/helpers/developer.py Outdated
'olmo': 'allenai',
'nova': 'amazon',
'grok': 'xai',
'kimi': 'moonshot',
Member


nit: Sorry for only catching this now, but it’s probably better to use moonshotai here to keep it consistent with the other adapters.

Contributor Author


Sure, done. I'll push the newly generated data to the HF datastore soon.

@jatinganhotra
Contributor Author

Thanks @tommasocerruti. I've addressed the comments and this PR is now ready.

FYI, I also opened #110 for 2 more SWE benchmarks and public leaderboards. Please take a look when you get a chance.

I'll create datastore PRs for them shortly.

@tommasocerruti
Copy link
Copy Markdown
Member

@jatinganhotra lgtm thank you!
Regarding the datastore PRs, feel free to open them after the adapter PR gets approved, so that if anything needs to change you don't have to run them twice.

@tommasocerruti tommasocerruti merged commit fe2cf76 into evaleval:main Apr 20, 2026
@jatinganhotra jatinganhotra deleted the add_swe_bench_verified_leaderboard branch April 20, 2026 16:46