
Adapter for SWE-Bench Verified leaderboard evaluation results#104

Merged
tommasocerruti merged 5 commits intoevaleval:mainfrom
jatinganhotra:add_swe_bench_verified_leaderboard
Apr 20, 2026

Conversation

@jatinganhotra
Contributor

No description provided.


Copilot AI left a comment


Pull request overview

Adds a utility adapter to ingest SWE-bench Verified leaderboard submissions and emit EvaluationLog JSONs in the EvalEval schema format.

Changes:

  • Introduces utils/swe_bench_verified/adapter.py to clone the SWE-bench experiments repo, parse submission metadata/results, and write standardized evaluation logs.
  • Adds utils/swe_bench_verified/__init__.py to define the new utils module.
  • Extends developer inference patterns to recognize additional model families (devstral, kimi).
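The overall flow the adapter implements (parse a submission, emit a standardized log) can be sketched roughly as follows. The function name, field names, and JSON shape here are illustrative assumptions, not the adapter's actual API or the EvalEval schema:

```python
import json
from pathlib import Path

def write_log(submission: dict, out_dir: Path) -> Path:
    """Convert one parsed leaderboard submission into a standardized
    evaluation-log JSON (illustrative shape only, not the real schema)."""
    log = {
        "evaluation_id": submission["name"],
        "score": submission["resolved"] / submission["total"],
    }
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{submission['name']}.json"
    path.write_text(json.dumps(log, indent=2))
    return path
```

The real adapter additionally clones the SWE-bench experiments repo and reads submission metadata before this step.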

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 3 comments.

File Description
utils/swe_bench_verified/adapter.py New adapter script that converts SWE-bench Verified leaderboard submissions into EvalEval schema JSON logs.
utils/swe_bench_verified/__init__.py Initializes the new swe_bench_verified utils package.
every_eval_ever/helpers/developer.py Adds model-family-to-developer mappings for devstral and kimi.
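As a rough illustration of the model-family-to-developer inference that last change extends (the actual structure in every_eval_ever/helpers/developer.py may differ; the devstral-to-mistralai mapping below is an assumption):

```python
# Illustrative sketch only. olmo/nova/grok/kimi appear in the review
# thread; 'devstral': 'mistralai' is an assumed entry.
MODEL_FAMILY_TO_DEVELOPER = {
    'olmo': 'allenai',
    'nova': 'amazon',
    'grok': 'xai',
    'devstral': 'mistralai',
    'kimi': 'moonshotai',
}

def infer_developer(model_name: str):
    """Return the developer slug for a model name, or None if unknown."""
    lowered = model_name.lower()
    for family, developer in MODEL_FAMILY_TO_DEVELOPER.items():
        if family in lowered:
            return developer
    return None
```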


Comment thread utils/swe_bench_verified/adapter.py Outdated
Comment on lines +26 to +28
import yaml
from datasets import load_dataset


Copilot AI Apr 17, 2026


yaml and datasets are imported unconditionally, but they are not listed in the project's runtime dependencies (pyproject.toml). This makes the adapter fail immediately in a fresh install. Consider either (a) adding pyyaml and datasets to an appropriate optional dependency group, or (b) importing them inside main() with a clear ImportError message telling contributors what to install to run this adapter.

Contributor Author


Addressed in 95b5b35

Comment thread utils/swe_bench_verified/adapter.py Outdated
Comment on lines +187 to +190
return EvaluationLog(
schema_version="0.2.2",
evaluation_id=eval_id,
retrieved_timestamp=retrieved_timestamp,

Copilot AI Apr 17, 2026


schema_version is hard-coded to "0.2.2" here. Other adapters typically use SCHEMA_VERSION from every_eval_ever.helpers so this stays in sync with eval.schema.json when the schema version bumps. Using the constant would avoid silently emitting logs with an outdated schema_version.
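The suggested fix, sketched under the assumption that every_eval_ever.helpers exposes a SCHEMA_VERSION string (the stand-in value and the dict shape below are illustrative):

```python
# In the real code this would be:
#     from every_eval_ever.helpers import SCHEMA_VERSION
# A local stand-in keeps this sketch self-contained.
SCHEMA_VERSION = "0.2.2"

def build_log(eval_id: str, retrieved_timestamp: str) -> dict:
    """Build a log header that stays in sync with eval.schema.json
    when the schema version bumps (dict stands in for EvaluationLog)."""
    return {
        "schema_version": SCHEMA_VERSION,
        "evaluation_id": eval_id,
        "retrieved_timestamp": retrieved_timestamp,
    }
```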

Member


@copilot I’d import SCHEMA_VERSION from every_eval_ever.helpers and replace the hard-coded "0.2.2" (#104 (comment))

Contributor Author


Addressed in 95b5b35

@tommasocerruti
Member

@copilot apply changes based on the comments in this thread

@tommasocerruti
Member

Not sure why those Copilot delegations didn’t work lol, but no problem. @jatinganhotra thanks for the contribution, once those fixes are in, this LGTM and I’ll merge it.

@jatinganhotra
Contributor Author

> Not sure why those Copilot delegations didn’t work lol, but no problem. @jatinganhotra thanks for the contribution, once those fixes are in, this LGTM and I’ll merge it.

Thanks @tommasocerruti. I don't know why I didn't get notifications for comments on this PR. I've made the changes requested in 95b5b35.

Please take a look when you get a chance. Thanks


@tommasocerruti tommasocerruti left a comment


Hey @jatinganhotra, thanks for addressing the issues. I just have one small nit before merging.

Comment thread every_eval_ever/helpers/developer.py Outdated
'olmo': 'allenai',
'nova': 'amazon',
'grok': 'xai',
'kimi': 'moonshot',
Member


nit: Sorry for only catching this now, but it’s probably better to use moonshotai here to keep it consistent with the other adapters.

Contributor Author


Sure, done. I'll push the newly generated data to the HF datastore soon.

@jatinganhotra
Contributor Author

Thanks @tommasocerruti. I've addressed the comments and this PR is now ready.

FYI, I also opened #110 for 2 more SWE benchmarks and public leaderboards. Please take a look when you get a chance.

I'll create datastore PRs for them shortly.

@tommasocerruti
Copy link
Copy Markdown
Member

@jatinganhotra lgtm thank you!
Regarding the datastore PRs, feel free to open them after the adapter PR gets approved, so that if anything needs to change you don't have to run them twice.

@tommasocerruti tommasocerruti merged commit fe2cf76 into evaleval:main Apr 20, 2026
@jatinganhotra jatinganhotra deleted the add_swe_bench_verified_leaderboard branch April 20, 2026 16:46