Conversation

@juanmichelini
Collaborator

@juanmichelini commented Oct 28, 2025

fixes: #53

},
instruction=instruction,
error=None,
history=history,
Contributor

Do you need to assign metrics here?

Collaborator Author

agreed, I'll push it once #67 is merged

Collaborator Author

merged #67 and added metrics to swt_bench

@xingyaoww xingyaoww requested a review from ryanhoangt November 5, 2025 16:12
@xingyaoww
Contributor

@ryanhoangt can you take a look?

@juanmichelini juanmichelini mentioned this pull request Nov 7, 2025
3 tasks
@juanmichelini changed the title from "benchmark: SWT-bench" to "benchmark: SWT-bench infer" Nov 7, 2025
@juanmichelini
Collaborator Author

@OpenHands can you please fix the failing "Pre-commit checks / pre-commit (pull_request)" check? Make only the minimal changes needed to fix the pre-commit hooks, no other changes.

@openhands-ai

openhands-ai bot commented Nov 17, 2025

I'm on it! juanmichelini can track my progress at all-hands.dev

The EvalOutput class expects metrics to be dict[str, Any] | None, but
conversation.conversation_stats.get_combined_metrics() returns a Metrics
object. Use the .get() method to convert it to a dictionary.

Co-authored-by: openhands <[email protected]>
@openhands-ai

openhands-ai bot commented Nov 17, 2025

Summary

I successfully fixed the failing pre-commit checks for PR #55 with a minimal change:

Issue Fixed: Type error in benchmarks/swt_bench/run_infer.py where EvalOutput expected metrics parameter to be dict[str, Any] | None but was receiving a Metrics object.

Solution: Added .get() method call to convert the Metrics object to a dictionary:

# Changed line 263 from:
metrics=conversation.conversation_stats.get_combined_metrics(),
# To:
metrics=conversation.conversation_stats.get_combined_metrics().get(),

Verification:
✅ All pre-commit checks now pass (Ruff format, Ruff lint, PEP8, Pyright type checking)
✅ Only one line changed - minimal fix as requested
✅ Changes committed and pushed to jmj/swt-bench branch
✅ No extraneous changes included

The PR should now pass the pre-commit CI checks that were previously failing.
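
For readers unfamiliar with the type mismatch described above, here is a minimal sketch of the pattern. The `Metrics` and `EvalOutput` classes below are hypothetical stand-ins written for illustration, not the real classes from the repository; their field names (`accumulated_cost`, `prompt_tokens`, `instance_id`) are assumptions. The only behavior taken from this thread is that the output record expects `dict[str, Any] | None` and that calling `.get()` on the metrics object yields a dictionary.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class Metrics:
    """Hypothetical stand-in for the metrics object; fields are assumed."""

    accumulated_cost: float = 0.0
    prompt_tokens: int = 0

    def get(self) -> dict[str, Any]:
        # Convert the structured metrics into a plain dictionary.
        return asdict(self)


@dataclass
class EvalOutput:
    """Hypothetical stand-in; only the type of `metrics` matters here."""

    instance_id: str
    metrics: dict[str, Any] | None = None


combined = Metrics(accumulated_cost=0.12, prompt_tokens=3400)

# Passing `combined` directly would not match the declared field type,
# so a type checker would flag it; passing `combined.get()` matches.
output = EvalOutput(instance_id="example-1", metrics=combined.get())
```

In this sketch the conversion happens at the call site, which mirrors the one-line change described in the summary and leaves both class definitions untouched.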

openhands-agent and others added 5 commits November 17, 2025 21:57
- Add missing test infrastructure (tests.yml workflow, tests/ directory)
- Add pytest dependencies to pyproject.toml
- Fix swt_bench test data issues (missing repo field and metadata)
- Add metrics collection to all benchmarks (swe_bench, gaia, swt_bench)
- Convert Metrics objects to dictionaries for EvalOutput compatibility
- Add template mocking for swt_bench tests

Co-authored-by: openhands <[email protected]>
@openhands-ai

openhands-ai bot commented Nov 19, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Run tests

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #55 at branch `jmj/swt-bench`

Feel free to include any additional details that might help me get this PR into a better state.

Collaborator

@ryanhoangt ryanhoangt left a comment

LGTM overall; I left some nits below. Also, should we run on a subset (say, 50 instances) to verify that the performance matches V0, at least to some extent?

Collaborator

@ryanhoangt ryanhoangt left a comment

LGTM

@juanmichelini juanmichelini merged commit 5793428 into main Nov 21, 2025
2 checks passed

Development

Successfully merging this pull request may close these issues.

benchmark: SWT-bench

6 participants