Conversation

@douglasiacovelli douglasiacovelli commented Oct 27, 2025

This PR is easiest to review commit by commit. Changelog:

  • Add support for the docker_image and DockerFile fields: pull the pre-built Docker image, falling back to building the Dockerfile if the pull fails.
  • Use the test_cmds and parser_content fields from the dataset to run the exact commands and correctly parse results for each test framework and instance.
  • Update test directives to run the exact tests listed in FAIL_TO_PASS and PASS_TO_PASS.
  • Remove the conda activation and per-instance installation commands (pip install -e .), which were unnecessary since the pre-built image already provides the appropriate environment.
  • Add a flag to automatically fix whitespace errors and special characters when git-applying patches.
  • Set the container network mode to host to support tests that rely on setting up dummy web servers.

Note

Introduces configurable runtime and evaluation driven by dataset fields, plus a utility to aggregate multi-run results.

  • Docker/runtime: docker_build.py now supports docker_image pull-and-tag with fallback to building from a provided dockerfile; removes pre-existing containers before creating new ones; sets container network_mode to host.
  • Test execution: Python spec uses test_cmds when present and pulls test selectors from FAIL_TO_PASS/PASS_TO_PASS; skips conda env setup when a docker_image is provided; allows custom Python version; trims conda activation from eval.
  • Log parsing: grading.py executes parser_content (if provided) to parse logs, with fallback to built-in parsers and whole-log parsing when markers are missing.
  • Resilience: Default to py when repo ext/specs missing; safer spec lookups; added build_custom_instance_image.
  • Tooling: New evaluate_pass_at_k.py to run multiple evaluations and compute pass@k; .gitignore now ignores .python-version.
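For reference, pass@k is typically computed with the unbiased estimator from the Codex paper rather than a raw fraction of passing runs. A minimal sketch follows; the function name is illustrative, and the actual formula used by evaluate_pass_at_k.py may differ:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of
    k samples drawn from n total runs (of which c passed) is correct.
    pass@k = 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        # Fewer failures than samples drawn: a pass is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```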

Written by Cursor Bugbot for commit a166590.

@cursor cursor bot left a comment


],
*get_test_directives(instance),
]
)

KeyError for custom repos without test_cmds field

The PR adds support for custom repos by changing make_test_spec to use MAP_REPO_VERSION_TO_SPECS.get(repo, {}).get(version, {}). However, when an instance doesn't provide test_cmds, the fallback path in make_eval_script_list_py directly accesses MAP_REPO_VERSION_TO_SPECS[instance["repo"]][instance["version"]]["test_cmd"]. For custom repos not in this mapping, this will raise a KeyError, crashing the evaluation script generation. The fallback should use the same safe .get() pattern or handle the missing key gracefully.
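A minimal sketch of the suggested fix, using placeholder data for MAP_REPO_VERSION_TO_SPECS; the helper name and default command below are assumptions for illustration, not code from the PR:

```python
# Stand-in for the real mapping in swebench's constants module.
MAP_REPO_VERSION_TO_SPECS = {
    "django/django": {"4.0": {"test_cmd": "./tests/runtests.py"}},
}

DEFAULT_TEST_CMD = "python -m pytest"  # assumed default for custom repos

def resolve_test_cmd(instance: dict) -> str:
    """Prefer dataset-provided test_cmds; otherwise fall back to the spec
    mapping using .get() so custom repos don't raise KeyError."""
    test_cmds = instance.get("test_cmds")
    if test_cmds:
        return " && ".join(test_cmds)
    specs = MAP_REPO_VERSION_TO_SPECS.get(instance["repo"], {}).get(
        instance.get("version", ""), {}
    )
    return specs.get("test_cmd", DEFAULT_TEST_CMD)
```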


# Fallback to hardcoded mapping if custom parser not found
log_parser = MAP_REPO_TO_PARSER[repo]
else:
log_parser = MAP_REPO_TO_PARSER[repo]

KeyError for custom repos without parser_content field

When parser_content is not provided or doesn't contain a valid parse function, the code falls back to MAP_REPO_TO_PARSER[repo]. For custom repos not in this mapping, this will raise a KeyError and crash the grading process. Since the PR adds support for custom repos, there needs to be a safe fallback when the repo isn't in MAP_REPO_TO_PARSER, such as using .get() with a default parser or raising a more descriptive error.
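One possible shape for the suggested safe fallback; the generic parser below is a hypothetical default for illustration, not part of the PR:

```python
def generic_log_parser(log: str) -> dict:
    """Crude default parser: picks up lines shaped like
    'PASSED test_name' or 'FAILED test_name'."""
    statuses = {}
    for line in log.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("PASSED", "FAILED"):
            statuses[parts[1]] = parts[0]
    return statuses

MAP_REPO_TO_PARSER = {}  # stand-in for the real mapping

def get_log_parser(repo: str):
    """Fall back to the generic parser instead of raising KeyError
    when a custom repo is not in the mapping."""
    return MAP_REPO_TO_PARSER.get(repo, generic_log_parser)
```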


if custom_log_parser:
# Custom parser returns JSON string, convert to dict
def custom_parser_wrapper(log_content: str, _: TestSpec) -> dict[str, str]:
return custom_log_parser(log_content)

Comment claims JSON conversion but none performed

The comment states "Custom parser returns JSON string, convert to dict" but the custom_parser_wrapper function simply returns custom_log_parser(log_content) without any JSON conversion. If custom parsers actually return JSON strings as the comment indicates, the result would be a string instead of a dict, causing failures when downstream code tries to use it as a dictionary. Either the comment is incorrect (existing parsers return dicts) and needs to be fixed, or a json.loads() call is missing.
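If custom parsers can indeed return JSON strings, a defensive wrapper could accept both forms; a hedged sketch (names assumed, not from the PR):

```python
import json

def make_parser_wrapper(custom_log_parser):
    """Wrap a custom parser so downstream grading always receives a dict,
    whether the parser returns a dict or a JSON-encoded string."""
    def custom_parser_wrapper(log_content: str) -> dict:
        result = custom_log_parser(log_content)
        if isinstance(result, str):
            # Perform the conversion the original comment promised.
            result = json.loads(result)
        return result
    return custom_parser_wrapper
```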


"source /opt/miniconda3/bin/activate",
f"conda activate {env_name}",
# "source /opt/miniconda3/bin/activate",
# f"conda activate {env_name}",

Conda activation removed for all instances unconditionally

The conda activation commands (source /opt/miniconda3/bin/activate and conda activate {env_name}) are commented out for ALL instances, not just custom docker_image ones. The eval script runs via /bin/bash /eval.sh which is non-interactive and doesn't source .bashrc. For standard SWE-bench instances that don't use docker_image, tests will run without the conda environment activated, causing them to use the wrong Python version and missing dependencies. The removal of conda activation should be conditional on docker_image being present, similar to how make_env_script_list_py handles it.
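A sketch of the suggested conditional, with an assumed helper name; the idea is simply to keep activation for standard instances and skip it only when a pre-built docker_image supplies the environment:

```python
def eval_preamble(env_name, docker_image):
    """Return the eval-script preamble: conda activation for standard
    instances, nothing when a pre-built docker_image is provided."""
    cmds = []
    if not docker_image:
        cmds += [
            "source /opt/miniconda3/bin/activate",
            f"conda activate {env_name}",
        ]
    return cmds
```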


if not docker_image and "install_config" in instance:
docker_image = instance["install_config"].get("docker_image")
dockerfile = instance.get("dockerfile") or instance.get("DockerFile") # Handle both cases
test_cmds = instance.get("test_cmds")

test_cmds not parsed from JSON string format

The test_cmds field is extracted with instance.get("test_cmds") without JSON parsing, unlike FAIL_TO_PASS and PASS_TO_PASS which use _from_json_or_obj() to handle JSON string inputs. If a dataset provides test_cmds as a JSON string (e.g., '["pytest", "test.py"]'), the subsequent " && ".join() operation in make_eval_script_list_py would join individual characters instead of commands, producing malformed output.
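A sketch of the _from_json_or_obj pattern the comment references, reimplemented here from its described behavior (the real helper in the codebase may differ):

```python
import json

def from_json_or_obj(value):
    """Parse JSON-encoded strings; pass lists (or None) through unchanged,
    so a dataset field works whether it arrives serialized or not."""
    if isinstance(value, str):
        return json.loads(value)
    return value
```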


print(f"{'='*80}")
print(f"Total instances: {total_instances}")
print(f"Instances with at least 1 pass: {instances_with_at_least_one_pass} ({100*instances_with_at_least_one_pass/total_instances:.1f}%)")
print(f"Instances with all {k} passes: {instances_with_all_passes} ({100*instances_with_all_passes/total_instances:.1f}%)")

Division by zero when dataset is empty

The code divides by total_instances on lines 282-283 without checking if it's zero, while line 286 correctly guards against this with a conditional. If the dataset file is empty or contains no valid instances, total_instances will be 0 and a ZeroDivisionError will crash the script.
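A small guard along the lines the later conditional already uses (helper name illustrative):

```python
def pct(numerator: int, total: int) -> str:
    """Guarded percentage formatting: an empty dataset (total == 0)
    prints 'n/a' instead of raising ZeroDivisionError."""
    return f"{100 * numerator / total:.1f}%" if total else "n/a"
```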


# f"cd {repo_directory}",
#]
# if "eval_commands" in specs:
# eval_commands = specs["eval_commands"]

Django eval_commands for locale settings now ignored

The code that processes eval_commands from specs is completely commented out. Django specs (versions 1.7-3.2) define eval_commands with critical locale settings (LANG, LC_ALL, PYTHONIOENCODING, LANGUAGE). These locale exports are required for Django tests to handle string encoding correctly. With this code commented out, Django tests will run without the required locale configuration, likely causing encoding-related test failures.
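A sketch of restoring the commented-out path (helper name assumed): prepend any spec-level eval_commands, such as Django's locale exports, before changing into the repo directory.

```python
def build_eval_commands(specs: dict, repo_directory: str) -> list:
    """Prepend spec-level eval_commands (e.g. LANG/LC_ALL exports for
    Django 1.7-3.2) before the cd into the repo directory."""
    cmds = list(specs.get("eval_commands", []))
    cmds.append(f"cd {repo_directory}")
    return cmds
```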

