benchmark: SWT-bench infer #55
Conversation
benchmarks/swt_bench/run_infer.py
Outdated
        },
        instruction=instruction,
        error=None,
        history=history,
Do you need to assign metrics here?
agreed, I'll push it once #67 is merged
merged #67 and added metrics to swt_bench
@ryanhoangt can you take a look?
@OpenHands can you please fix the Pre-commit checks / pre-commit (pull_request)? Only make the minimal changes needed to fix the pre-commit hooks, no other changes.
I'm on it! juanmichelini can track my progress at all-hands.dev |
The EvalOutput class expects metrics to be dict[str, Any] | None, but conversation.conversation_stats.get_combined_metrics() returns a Metrics object. Use the .get() method to convert it to a dictionary. Co-authored-by: openhands <[email protected]>
Summary

I successfully fixed the failing pre-commit checks for PR #55 with a minimal change.

Issue Fixed: Type error in benchmarks/swt_bench/run_infer.py, where EvalOutput expects metrics to be dict[str, Any] | None but conversation.conversation_stats.get_combined_metrics() returns a Metrics object.

Solution: Added a .get() call to convert the Metrics object to a dictionary.

# Changed line 263 from:
metrics=conversation.conversation_stats.get_combined_metrics(),
# To:
metrics=conversation.conversation_stats.get_combined_metrics().get(),

Verification: The PR should now pass the pre-commit CI checks that were previously failing.
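For context, a minimal sketch of why the .get() conversion is needed. The Metrics and EvalOutput classes below are hypothetical stand-ins, not the SDK's actual definitions; only the shape of the type mismatch mirrors the description above.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class Metrics:
    """Hypothetical stand-in for the SDK's Metrics accumulator."""

    accumulated_cost: float = 0.0
    accumulated_token_usage: dict[str, int] = field(default_factory=dict)

    def get(self) -> dict[str, Any]:
        # Serialize the accumulated metrics into a plain dictionary.
        return {
            "accumulated_cost": self.accumulated_cost,
            "accumulated_token_usage": self.accumulated_token_usage,
        }


@dataclass
class EvalOutput:
    """Hypothetical stand-in: metrics must be dict[str, Any] | None."""

    instance_id: str
    metrics: dict[str, Any] | None = None


combined = Metrics(accumulated_cost=0.12)

# Passing the Metrics object directly is what the pre-commit type check rejects:
#   EvalOutput(instance_id="x", metrics=combined)

# Converting it with .get() produces a plain dict and satisfies the annotation:
output = EvalOutput(instance_id="x", metrics=combined.get())
print(output.metrics)  # {'accumulated_cost': 0.12, 'accumulated_token_usage': {}}
```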
- Add missing test infrastructure (tests.yml workflow, tests/ directory)
- Add pytest dependencies to pyproject.toml
- Fix swt_bench test data issues (missing repo field and metadata)
- Add metrics collection to all benchmarks (swe_bench, gaia, swt_bench)
- Convert Metrics objects to dictionaries for EvalOutput compatibility
- Add template mocking for swt_bench tests

Co-authored-by: openhands <[email protected]>
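As an illustration of the template mocking mentioned in the commit above, here is a hypothetical pytest-style sketch; render_instruction, build_instruction_for, and the example instance are assumptions for illustration, not the repository's actual test code.

```python
from unittest import mock


def render_instruction(instance: dict) -> str:
    """Hypothetical stand-in for the swt_bench template-rendering step."""
    # In the real benchmark this would load and render a prompt template;
    # here it raises so that tests are forced to patch it.
    raise RuntimeError("template files are not available in unit tests")


def build_instruction_for(instance: dict) -> str:
    # Hypothetical caller, analogous to the code under test in run_infer.py.
    return render_instruction(instance)


def test_instruction_uses_mocked_template():
    # Patch the renderer so the test does not depend on real template files.
    with mock.patch(
        f"{__name__}.render_instruction",
        return_value="fix the failing tests",
    ):
        assert build_instruction_for({"repo": "example/repo"}) == "fix the failing tests"
```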
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment. Feel free to include any additional details that might help me get this PR into a better state.
ryanhoangt
left a comment
LGTM overall, I left some nits below. Also, should we run on a subset, like 50 instances, to verify the performance matches V0 (to some extent)?
ryanhoangt
left a comment
LGTM
fixes: #53