benchmark: SWT-bench infer #55
Conversation
benchmarks/swt_bench/run_infer.py
Outdated
        },
        instruction=instruction,
        error=None,
        history=history,
Do you need to assign metrics here?
agreed, I'll push it once #67 is merged
merged #67 and added metrics to swt_bench
@ryanhoangt can you take a look?
@OpenHands can you please fix the Pre-commit checks / pre-commit (pull_request)? Only make the minimal changes needed to fix the pre-commit hooks, no other changes.
I'm on it! juanmichelini can track my progress at all-hands.dev |
The EvalOutput class expects metrics to be dict[str, Any] | None, but conversation.conversation_stats.get_combined_metrics() returns a Metrics object. Use the .get() method to convert it to a dictionary. Co-authored-by: openhands <[email protected]>
Summary

I successfully fixed the failing pre-commit checks for PR #55 with a minimal change.

Issue Fixed: Type error in benchmarks/swt_bench/run_infer.py, where EvalOutput expects metrics to be dict[str, Any] | None but conversation.conversation_stats.get_combined_metrics() returns a Metrics object.

Solution: Added a .get() call to convert the Metrics object to a dictionary.

# Changed line 263 from:
metrics=conversation.conversation_stats.get_combined_metrics(),
# To:
metrics=conversation.conversation_stats.get_combined_metrics().get(),

Verification: The PR should now pass the pre-commit CI checks that were previously failing.
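For context, a minimal sketch of why the .get() conversion is needed. The Metrics and EvalOutput classes below are hypothetical stand-ins, not the SDK's actual definitions; only the shape of the type mismatch mirrors the description above.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class Metrics:
    """Hypothetical stand-in for the SDK's Metrics accumulator."""

    accumulated_cost: float = 0.0
    accumulated_token_usage: dict[str, int] = field(default_factory=dict)

    def get(self) -> dict[str, Any]:
        # Serialize the accumulated metrics into a plain dictionary.
        return {
            "accumulated_cost": self.accumulated_cost,
            "accumulated_token_usage": self.accumulated_token_usage,
        }


@dataclass
class EvalOutput:
    """Hypothetical stand-in: metrics must be dict[str, Any] | None."""

    instance_id: str
    metrics: dict[str, Any] | None = None


combined = Metrics(accumulated_cost=0.12)

# Passing the Metrics object directly is what the pre-commit type check rejects:
#   EvalOutput(instance_id="x", metrics=combined)

# Converting it with .get() produces a plain dict and satisfies the annotation:
output = EvalOutput(instance_id="x", metrics=combined.get())
print(output.metrics)  # {'accumulated_cost': 0.12, 'accumulated_token_usage': {}}
```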
- Add missing test infrastructure (tests.yml workflow, tests/ directory)
- Add pytest dependencies to pyproject.toml
- Fix swt_bench test data issues (missing repo field and metadata)
- Add metrics collection to all benchmarks (swe_bench, gaia, swt_bench)
- Convert Metrics objects to dictionaries for EvalOutput compatibility
- Add template mocking for swt_bench tests

Co-authored-by: openhands <[email protected]>
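As an illustration of the template mocking mentioned in the commit above, here is a hypothetical pytest-style sketch; render_instruction, build_instruction_for, and the example instance are assumptions for illustration, not the repository's actual test code.

```python
from unittest import mock


def render_instruction(instance: dict) -> str:
    """Hypothetical stand-in for the swt_bench template-rendering step."""
    # In the real benchmark this would load and render a prompt template;
    # here it raises so that tests are forced to patch it.
    raise RuntimeError("template files are not available in unit tests")


def build_instruction_for(instance: dict) -> str:
    # Hypothetical caller, analogous to the code under test in run_infer.py.
    return render_instruction(instance)


def test_instruction_uses_mocked_template():
    # Patch the renderer so the test does not depend on real template files.
    with mock.patch(
        f"{__name__}.render_instruction",
        return_value="fix the failing tests",
    ):
        assert build_instruction_for({"repo": "example/repo"}) == "fix the failing tests"
```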
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment. Feel free to include any additional details that might help me get this PR into a better state.
ryanhoangt
left a comment
LGTM overall, I left some nits below. Also, should we run on a subset, like 50 instances, to verify the performance matches V0 (to some extent)?
ryanhoangt
left a comment
LGTM
fixes: #53