
Commit 5793428

benchmark: SWT-bench infer (#55)
* benchmark: SWT-bench
* benchmark: SWT-bench
* Added metrics
* SWT-bench: eval works but not ready for merge
* fix: swt-bench dependency
* fix: convert Metrics object to dict for EvalOutput

  The EvalOutput class expects metrics to be dict[str, Any] | None, but
  conversation.conversation_stats.get_combined_metrics() returns a Metrics object.
  Use the .get() method to convert it to a dictionary.

  Co-authored-by: openhands <[email protected]>
* Fix failing tests in CI

  - Add missing test infrastructure (tests.yml workflow, tests/ directory)
  - Add pytest dependencies to pyproject.toml
  - Fix swt_bench test data issues (missing repo field and metadata)
  - Add metrics collection to all benchmarks (swe_bench, gaia, swt_bench)
  - Convert Metrics objects to dictionaries for EvalOutput compatibility
  - Add template mocking for swt_bench tests

  Co-authored-by: openhands <[email protected]>
* SWT fix precommit error
* swt: fix test
* swt: fix test
* SWT: fix README example
* SWT: set default eval dataset to Verified
* SWT: added remote workspace
* SWT: move use evaluation_utils.py get_default_on_result_writer

---------

Co-authored-by: openhands <[email protected]>
1 parent 3ff80dc commit 5793428
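The Metrics-to-dict fix described in the commit message can be sketched as follows. This is a minimal sketch only: the helper name and the call site are illustrative; the commit message itself only states that conversation.conversation_stats.get_combined_metrics() returns a Metrics object whose .get() method yields the dictionary that EvalOutput expects.

```python
from typing import Any


def metrics_to_dict(metrics: Any) -> dict[str, Any] | None:
    """Hypothetical helper: convert a Metrics object to the plain
    dict[str, Any] that EvalOutput expects (or None if absent)."""
    if metrics is None:
        return None
    # Per the commit message, Metrics.get() returns the dictionary form.
    return metrics.get()


# Illustrative call site, mirroring the description in the commit message:
# metrics = metrics_to_dict(conversation.conversation_stats.get_combined_metrics())
```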

8 files changed: 854 additions, 0 deletions

benchmarks/swt_bench/README.md

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
# OpenHands SWT-Bench

## Prerequisites

Before running any benchmarks, set up the environment as described in the main README.md.

### 1. Run SWT-Bench Evaluation

```bash
# Run evaluation with your configured LLM
uv run swtbench-infer .llm_config/sonnet-4.json --critic pass
```
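
Once inference completes, the resulting output.jsonl can be converted to SWT-Bench prediction format and scored with the `swtbench-eval` entry point added by this commit (see `eval_infer.py` below). A minimal sketch; the output path shown is illustrative:

```bash
# Convert the OpenHands output and run the SWT-Bench harness on it
uv run swtbench-eval ./evaluation_results/output.jsonl
```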

### 2. Selecting Specific Instances

You can run evaluation on a specific subset of instances using the `--select` option:

1. Create a text file with one instance ID per line:

   **instances.txt:**
   ```
   django__django-11333
   astropy__astropy-12345
   requests__requests-5555
   ```

2. Run evaluation with the selection file:

   ```bash
   python -m benchmarks.swt_bench.run_infer \
       --llm-config llm_config.toml \
       --max-iterations 30 \
       --select instances.txt \
       --eval-output-dir ./evaluation_results \
       --max-attempts 3 \
       --critic finish_with_patch
   ```

This will only evaluate the instances listed in the file.

## Links

- **Original OpenHands**: https://github.com/All-Hands-AI/OpenHands/
- **Agent SDK**: https://github.com/All-Hands-AI/agent-sdk
- **SWT-Bench**: https://www.swtbench.com/

benchmarks/swt_bench/__init__.py

Whitespace-only changes.

benchmarks/swt_bench/eval_infer.py

Lines changed: 317 additions & 0 deletions
@@ -0,0 +1,317 @@
#!/usr/bin/env python3
"""
SWT-Bench Evaluation Script

This script converts OpenHands output.jsonl format to SWT-Bench prediction format
and runs the SWT-Bench evaluation.

Usage:
    uv run swtbench-eval <path_to_output.jsonl>
"""

import argparse
import json
import os
import shutil
import subprocess
import sys
from pathlib import Path

from benchmarks.utils.patch_utils import remove_files_from_patch
from openhands.sdk import get_logger


logger = get_logger(__name__)


def convert_to_swtbench_format(
    input_file: str, output_file: str, model_name: str = "OpenHands"
) -> None:
    """
    Convert OpenHands output.jsonl to SWT-Bench prediction format.

    OpenHands format:
    {
        "instance_id": "sympy__sympy-20590",
        "test_result": {
            "git_patch": "diff --git a/file.py b/file.py\n..."
        },
        "instruction": "...",
        "error": null,
        "history": [...]
    }

    SWT-Bench format:
    {
        "instance_id": "sympy__sympy-20590",
        "model_patch": "diff --git a/file.py b/file.py\n...",
        "model_name_or_path": "OpenHands"
    }
    """
    logger.info(f"Converting {input_file} to SWT-Bench format: {output_file}")

    converted_count = 0
    error_count = 0

    with open(input_file, "r") as infile, open(output_file, "w") as outfile:
        for line_num, line in enumerate(infile, 1):
            try:
                line = line.strip()
                if not line:
                    continue

                data = json.loads(line)

                # Extract required fields
                instance_id = data.get("instance_id")
                if not instance_id:
                    logger.warning(f"Line {line_num}: Missing instance_id")
                    error_count += 1
                    continue

                # Extract git_patch from test_result
                test_result = data.get("test_result", {})
                git_patch = test_result.get("git_patch", "")

                if not git_patch:
                    logger.warning(
                        f"Line {line_num}: Missing or empty git_patch for {instance_id}"
                    )
                    # Still create entry with empty patch
                    git_patch = ""

                # postprocess git_patch
                setup_files = ["pyproject.toml", "tox.ini", "setup.py"]
                git_patch = remove_files_from_patch(git_patch, setup_files)

                # Create SWT-Bench format entry
                swtbench_entry = {
                    "instance_id": instance_id,
                    "model_patch": git_patch,
                    "model_name_or_path": model_name,
                }

                # Write to output file
                outfile.write(json.dumps(swtbench_entry) + "\n")
                converted_count += 1

            except json.JSONDecodeError as e:
                logger.error(f"Line {line_num}: Invalid JSON - {e}")
                error_count += 1
            except Exception as e:
                logger.error(f"Line {line_num}: Unexpected error - {e}")
                error_count += 1

    logger.info(
        f"Conversion complete: {converted_count} entries converted, "
        f"{error_count} errors"
    )

    if converted_count == 0:
        raise ValueError("No valid entries were converted")


def run_swtbench_evaluation(
    predictions_file: str,
    dataset: str = "princeton-nlp/SWE-bench_Verified",
    workers: str = "12",
) -> None:
    """
    Run SWT-Bench evaluation on the predictions file.

    Note: The swt-bench package is included as a dependency in pyproject.toml
    to ensure all its dependencies are available, but the package itself is not
    properly structured for import. We use subprocess to run it from a cached
    clone since that's how the upstream package is designed to work.

    Args:
        predictions_file: Path to the SWT-Bench format predictions file
        dataset: SWT-Bench dataset to evaluate against
        workers: Number of workers to use for evaluation
    """
    logger.info(f"Running SWT-Bench evaluation on {predictions_file}")

    try:
        # Use a global cache directory for SWT-Bench source
        cache_dir = Path.home() / ".cache" / "openhands" / "swt-bench"
        swt_bench_dir = cache_dir / "swt-bench"

        # Clone SWT-Bench repository if it doesn't exist
        if not swt_bench_dir.exists():
            logger.info("Setting up SWT-Bench source in global cache...")
            cache_dir.mkdir(parents=True, exist_ok=True)

            logger.info("Cloning SWT-Bench repository...")
            clone_cmd = [
                "git",
                "clone",
                "https://github.com/logic-star-ai/swt-bench.git",
                str(swt_bench_dir),
            ]
            result = subprocess.run(clone_cmd, text=True)
            if result.returncode != 0:
                raise subprocess.CalledProcessError(result.returncode, clone_cmd)

            logger.info(f"SWT-Bench source installed at {swt_bench_dir}")

        # Get the directory and filename of the predictions file
        predictions_path = Path(predictions_file).resolve()
        predictions_filename = predictions_path.name

        # Copy predictions file to swt-bench directory
        swt_predictions_file = swt_bench_dir / predictions_filename
        shutil.copy2(predictions_file, swt_predictions_file)

        # Run SWT-Bench evaluation by running python directly from the swt-bench directory
        # but using the uv environment's python executable which has all dependencies
        benchmarks_dir = Path(__file__).parent.parent.parent

        # Get the python executable from the uv environment
        python_executable = subprocess.run(
            [
                "uv",
                "run",
                "--directory",
                str(benchmarks_dir),
                "python",
                "-c",
                "import sys; print(sys.executable)",
            ],
            capture_output=True,
            text=True,
            cwd=benchmarks_dir,
        ).stdout.strip()

        # Set up environment with PYTHONPATH to include swt-bench directory
        env = os.environ.copy()
        env["PYTHONPATH"] = str(swt_bench_dir)

        cmd = [
            python_executable,
            "src/main.py",  # Run as script instead of module
            "--dataset_name",
            dataset,
            "--predictions_path",
            predictions_filename,
            "--filter_swt",
            "--max_workers",
            str(workers),
            "--run_id",
            f"eval_{predictions_path.stem}",
        ]

        logger.info(f"Using Python executable: {python_executable}")
        logger.info(f"Running command: {' '.join(cmd)}")
        logger.info(f"Working directory: {swt_bench_dir}")
        logger.info(f"PYTHONPATH: {env['PYTHONPATH']}")
        logger.info("SWT-Bench evaluation output:")
        print("-" * 80)

        # Stream output directly to console, running from swt-bench directory
        result = subprocess.run(cmd, text=True, cwd=swt_bench_dir, env=env)

        print("-" * 80)
        if result.returncode == 0:
            logger.info("SWT-Bench evaluation completed successfully")
        else:
            logger.error(
                f"SWT-Bench evaluation failed with return code {result.returncode}"
            )
            raise subprocess.CalledProcessError(result.returncode, cmd)

    except FileNotFoundError:
        logger.error(
            "SWT-Bench evaluation command not found. "
            "Make sure git and python are available."
        )
        raise
    except Exception as e:
        logger.error(f"Error running SWT-Bench evaluation: {e}")
        raise


def main() -> None:
    """Main entry point for the script."""
    parser = argparse.ArgumentParser(
        description="Convert OpenHands output to SWT-Bench format and run evaluation",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
    uv run swtbench-eval output.jsonl
    uv run swtbench-eval /path/to/output.jsonl --dataset princeton-nlp/SWE-bench_Lite
    uv run swtbench-eval output.jsonl --model-name "MyModel-v1.0"
        """,
    )

    parser.add_argument("input_file", help="Path to the OpenHands output.jsonl file")

    parser.add_argument(
        "--dataset",
        default="princeton-nlp/SWE-bench_Verified",
        help="SWT-Bench dataset to evaluate against "
        "(default: princeton-nlp/SWE-bench_Verified)",
    )

    parser.add_argument(
        "--output-file",
        help="Output file for SWT-Bench format "
        "(default: input_file with .swtbench.jsonl extension)",
    )

    parser.add_argument(
        "--skip-evaluation",
        action="store_true",
        help="Only convert format, skip running evaluation",
    )

    parser.add_argument(
        "--model-name",
        default="OpenHands",
        help="Model name to use in the model_name_or_path field (default: OpenHands)",
    )

    parser.add_argument(
        "--workers",
        default="12",
        help="Number of workers to use when evaluating",
    )

    args = parser.parse_args()

    # Validate input file
    input_file = Path(args.input_file)
    if not input_file.exists():
        logger.error(f"Input file does not exist: {input_file}")
        sys.exit(1)

    if not input_file.suffix == ".jsonl":
        logger.warning(f"Input file does not have .jsonl extension: {input_file}")

    # Determine output file
    if args.output_file:
        output_file = Path(args.output_file)
    else:
        output_file = input_file.with_suffix(".swtbench.jsonl")

    logger.info(f"Input file: {input_file}")
    logger.info(f"Output file: {output_file}")
    logger.info(f"Dataset: {args.dataset}")
    logger.info(f"Model name: {args.model_name}")

    try:
        # Convert format
        convert_to_swtbench_format(str(input_file), str(output_file), args.model_name)

        if not args.skip_evaluation:
            # Run evaluation
            run_swtbench_evaluation(str(output_file), args.dataset, args.workers)

        logger.info("Script completed successfully!")

    except Exception as e:
        logger.error(f"Script failed: {e}")
        sys.exit(1)


if __name__ == "__main__":
    main()
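
A usage sketch for the entry point defined above (paths are illustrative; the flags are those registered in main()):

```bash
# Convert only, without running the SWT-Bench harness
uv run swtbench-eval ./evaluation_results/output.jsonl --skip-evaluation

# Convert and evaluate against a different dataset with 8 workers
uv run swtbench-eval ./evaluation_results/output.jsonl --dataset princeton-nlp/SWE-bench_Lite --workers 8
```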
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
<uploaded_files>
/workspace/{{ workspace_dir_name }}
</uploaded_files>
I've uploaded a Python code repository in the directory {{ workspace_dir_name }}. Consider the following issue description:

<issue_description>
{{ instance.problem_statement }}
</issue_description>


Can you help me implement the necessary changes to the repository to test whether the issue in <issue_description> was resolved?
I will take care of all changes to any of the non-test files. This means you DON'T have to modify the actual logic and ONLY have to update test logic and tests!
Your task is to make the minimal changes to test files in the /workspace directory to reproduce the issue in <issue_description>, i.e., such that the generated tests fail in the current state (where the issue is unresolved) and pass once the issue is resolved.
Follow these steps to reproduce the issue:
1. As a first step, it might be a good idea to explore the repo to familiarize yourself with its structure.
2. Create a script `reproduction.py` to reproduce the error and execute it with `python reproduction.py` using the BashTool, to confirm the error.
3. Edit the source code of the repo to integrate your reproduction script into the test framework.
4. Run the test framework and make sure your tests fail! Only submit FAILING tests! Never submit passing tests.
Your thinking should be thorough, so it's fine if it's very long.
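
For illustration only, a reproduction script of the kind step 2 of this prompt asks for might look like the following; the imported module, function, and expected behaviour are hypothetical placeholders that depend entirely on the issue being reproduced:

```python
# reproduction.py -- hypothetical example of a minimal issue reproduction.
# The import path and expected value are placeholders; the script should
# exit non-zero while the reported bug is still present.
import sys

from somepackage.module import buggy_function  # hypothetical target


def main() -> int:
    result = buggy_function("input that triggers the reported bug")
    expected = "behaviour described in the issue"
    if result != expected:
        print(f"Issue reproduced: got {result!r}, expected {expected!r}")
        return 1  # failing exit code while the issue is unresolved
    print("Issue appears to be resolved")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```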
