Conversation

@douglasiacovelli douglasiacovelli commented Oct 27, 2025

This PR is easiest to review commit by commit. Changelog:

  • Add support for the docker_image and DockerFile fields: pull the pre-built Docker image, falling back to building the Dockerfile if the pull fails.
  • Use the test_cmds and parser_content fields from the dataset to run the exact commands and correctly parse results for each test framework and instance.
  • Update test directives to run the exact tests listed in FAIL_TO_PASS and PASS_TO_PASS.
  • Remove the conda activation and per-instance installation commands (pip install -e .), which were unnecessary since the pre-built image already provides the appropriate environment.
  • Add a flag to automatically fix whitespace errors and special characters when git-applying patches.
  • Set the container network mode to host to support tests that rely on setting up dummy web servers.

Note

Introduces configurable runtime and evaluation driven by dataset fields, plus a utility to aggregate multi-run results.

  • Docker/runtime: docker_build.py now supports docker_image pull-and-tag with fallback to building from a provided dockerfile; removes pre-existing containers before creating new ones; sets container network_mode to host.
  • Test execution: Python spec uses test_cmds when present and pulls test selectors from FAIL_TO_PASS/PASS_TO_PASS; skips conda env setup when a docker_image is provided; allows custom Python version; trims conda activation from eval.
  • Log parsing: grading.py executes parser_content (if provided) to parse logs, with fallback to built-in parsers and whole-log parsing when markers are missing.
  • Resilience: Default to py when repo ext/specs missing; safer spec lookups; added build_custom_instance_image.
  • Tooling: New evaluate_pass_at_k.py to run multiple evaluations and compute pass@k; .gitignore now ignores .python-version.
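For reference, pass@k is typically computed with the unbiased estimator from the Codex paper rather than a raw fraction of passing runs. A minimal sketch follows; the function name is illustrative, and the actual formula used by evaluate_pass_at_k.py may differ:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of
    k samples drawn from n total runs (of which c passed) is correct.
    pass@k = 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        # Fewer failures than samples drawn: a pass is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```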

Written by Cursor Bugbot for commit a166590.

@cursor cursor bot left a comment


],
*get_test_directives(instance),
]
)

KeyError for custom repos without test_cmds field

The PR adds support for custom repos by changing make_test_spec to use MAP_REPO_VERSION_TO_SPECS.get(repo, {}).get(version, {}). However, when an instance doesn't provide test_cmds, the fallback path in make_eval_script_list_py directly accesses MAP_REPO_VERSION_TO_SPECS[instance["repo"]][instance["version"]]["test_cmd"]. For custom repos not in this mapping, this will raise a KeyError, crashing the evaluation script generation. The fallback should use the same safe .get() pattern or handle the missing key gracefully.
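A minimal sketch of the suggested fix, using placeholder data for MAP_REPO_VERSION_TO_SPECS; the helper name and default command below are assumptions for illustration, not code from the PR:

```python
# Stand-in for the real mapping in swebench's constants module.
MAP_REPO_VERSION_TO_SPECS = {
    "django/django": {"4.0": {"test_cmd": "./tests/runtests.py"}},
}

DEFAULT_TEST_CMD = "python -m pytest"  # assumed default for custom repos

def resolve_test_cmd(instance: dict) -> str:
    """Prefer dataset-provided test_cmds; otherwise fall back to the spec
    mapping using .get() so custom repos don't raise KeyError."""
    test_cmds = instance.get("test_cmds")
    if test_cmds:
        return " && ".join(test_cmds)
    specs = MAP_REPO_VERSION_TO_SPECS.get(instance["repo"], {}).get(
        instance.get("version", ""), {}
    )
    return specs.get("test_cmd", DEFAULT_TEST_CMD)
```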


# Fallback to hardcoded mapping if custom parser not found
log_parser = MAP_REPO_TO_PARSER[repo]
else:
log_parser = MAP_REPO_TO_PARSER[repo]

KeyError for custom repos without parser_content field

When parser_content is not provided or doesn't contain a valid parse function, the code falls back to MAP_REPO_TO_PARSER[repo]. For custom repos not in this mapping, this will raise a KeyError and crash the grading process. Since the PR adds support for custom repos, there needs to be a safe fallback when the repo isn't in MAP_REPO_TO_PARSER, such as using .get() with a default parser or raising a more descriptive error.
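One possible shape for the suggested safe fallback; the generic parser below is a hypothetical default for illustration, not part of the PR:

```python
def generic_log_parser(log: str) -> dict:
    """Crude default parser: picks up lines shaped like
    'PASSED test_name' or 'FAILED test_name'."""
    statuses = {}
    for line in log.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("PASSED", "FAILED"):
            statuses[parts[1]] = parts[0]
    return statuses

MAP_REPO_TO_PARSER = {}  # stand-in for the real mapping

def get_log_parser(repo: str):
    """Fall back to the generic parser instead of raising KeyError
    when a custom repo is not in the mapping."""
    return MAP_REPO_TO_PARSER.get(repo, generic_log_parser)
```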


if custom_log_parser:
# Custom parser returns JSON string, convert to dict
def custom_parser_wrapper(log_content: str, _: TestSpec) -> dict[str, str]:
return custom_log_parser(log_content)

Comment claims JSON conversion but none performed

The comment states "Custom parser returns JSON string, convert to dict" but the custom_parser_wrapper function simply returns custom_log_parser(log_content) without any JSON conversion. If custom parsers actually return JSON strings as the comment indicates, the result would be a string instead of a dict, causing failures when downstream code tries to use it as a dictionary. Either the comment is incorrect (existing parsers return dicts) and needs to be fixed, or a json.loads() call is missing.
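If custom parsers can indeed return JSON strings, a defensive wrapper could accept both forms; a hedged sketch (names assumed, not from the PR):

```python
import json

def make_parser_wrapper(custom_log_parser):
    """Wrap a custom parser so downstream grading always receives a dict,
    whether the parser returns a dict or a JSON-encoded string."""
    def custom_parser_wrapper(log_content: str) -> dict:
        result = custom_log_parser(log_content)
        if isinstance(result, str):
            # Perform the conversion the original comment promised.
            result = json.loads(result)
        return result
    return custom_parser_wrapper
```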


"source /opt/miniconda3/bin/activate",
f"conda activate {env_name}",
# "source /opt/miniconda3/bin/activate",
# f"conda activate {env_name}",

Conda activation removed for all instances unconditionally

The conda activation commands (source /opt/miniconda3/bin/activate and conda activate {env_name}) are commented out for ALL instances, not just custom docker_image ones. The eval script runs via /bin/bash /eval.sh which is non-interactive and doesn't source .bashrc. For standard SWE-bench instances that don't use docker_image, tests will run without the conda environment activated, causing them to use the wrong Python version and missing dependencies. The removal of conda activation should be conditional on docker_image being present, similar to how make_env_script_list_py handles it.
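A sketch of the suggested conditional, with an assumed helper name; the idea is simply to keep activation for standard instances and skip it only when a pre-built docker_image supplies the environment:

```python
def eval_preamble(env_name, docker_image):
    """Return the eval-script preamble: conda activation for standard
    instances, nothing when a pre-built docker_image is provided."""
    cmds = []
    if not docker_image:
        cmds += [
            "source /opt/miniconda3/bin/activate",
            f"conda activate {env_name}",
        ]
    return cmds
```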


if not docker_image and "install_config" in instance:
docker_image = instance["install_config"].get("docker_image")
dockerfile = instance.get("dockerfile") or instance.get("DockerFile") # Handle both cases
test_cmds = instance.get("test_cmds")

test_cmds not parsed from JSON string format

The test_cmds field is extracted with instance.get("test_cmds") without JSON parsing, unlike FAIL_TO_PASS and PASS_TO_PASS which use _from_json_or_obj() to handle JSON string inputs. If a dataset provides test_cmds as a JSON string (e.g., '["pytest", "test.py"]'), the subsequent " && ".join() operation in make_eval_script_list_py would join individual characters instead of commands, producing malformed output.
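A sketch of the _from_json_or_obj pattern the comment references, reimplemented here from its described behavior (the real helper in the codebase may differ):

```python
import json

def from_json_or_obj(value):
    """Parse JSON-encoded strings; pass lists (or None) through unchanged,
    so a dataset field works whether it arrives serialized or not."""
    if isinstance(value, str):
        return json.loads(value)
    return value
```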


print(f"{'='*80}")
print(f"Total instances: {total_instances}")
print(f"Instances with at least 1 pass: {instances_with_at_least_one_pass} ({100*instances_with_at_least_one_pass/total_instances:.1f}%)")
print(f"Instances with all {k} passes: {instances_with_all_passes} ({100*instances_with_all_passes/total_instances:.1f}%)")

Division by zero when dataset is empty

The code divides by total_instances on lines 282-283 without checking if it's zero, while line 286 correctly guards against this with a conditional. If the dataset file is empty or contains no valid instances, total_instances will be 0 and a ZeroDivisionError will crash the script.
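A small guard along the lines the later conditional already uses (helper name illustrative):

```python
def pct(numerator: int, total: int) -> str:
    """Guarded percentage formatting: an empty dataset (total == 0)
    prints 'n/a' instead of raising ZeroDivisionError."""
    return f"{100 * numerator / total:.1f}%" if total else "n/a"
```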


# f"cd {repo_directory}",
#]
# if "eval_commands" in specs:
# eval_commands = specs["eval_commands"]

Django eval_commands for locale settings now ignored

The code that processes eval_commands from specs is completely commented out. Django specs (versions 1.7-3.2) define eval_commands with critical locale settings (LANG, LC_ALL, PYTHONIOENCODING, LANGUAGE). These locale exports are required for Django tests to handle string encoding correctly. With this code commented out, Django tests will run without the required locale configuration, likely causing encoding-related test failures.
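A sketch of restoring the commented-out path (helper name assumed): prepend any spec-level eval_commands, such as Django's locale exports, before changing into the repo directory.

```python
def build_eval_commands(specs: dict, repo_directory: str) -> list:
    """Prepend spec-level eval_commands (e.g. LANG/LC_ALL exports for
    Django 1.7-3.2) before the cd into the repo directory."""
    cmds = list(specs.get("eval_commands", []))
    cmds.append(f"cd {repo_directory}")
    return cmds
```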

