Hello ClawArena Team,
Thank you for your excellent work on this project. I am reaching out to ask about a few details regarding the evaluation setup:
- 12-scenario subset: Regarding the subset mentioned in Section 3.1 of the paper and the leaderboard image in README.md: which 12 specific samples constitute this subset, and is there a script or configuration file in the codebase that reproduces it?
- Dataset discrepancy: I noticed there are only 62 evaluation entries under data/clawarena/opencode/workspaces, while the paper states there are 64. Could you clarify whether some samples were deprecated, or whether I am looking in the wrong location?
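For reference, here is how I arrived at the count of 62. This is only a sketch of my method, assuming each top-level entry in the workspaces directory corresponds to one evaluation sample (the function name and layout assumption are mine, not from your codebase):

```python
from pathlib import Path

def count_workspaces(root: str) -> int:
    """Count top-level entries under a workspaces directory.

    Assumes one entry per evaluation sample; adjust if the repo
    nests samples differently.
    """
    return sum(1 for _ in Path(root).iterdir())
```

Running `count_workspaces("data/clawarena/opencode/workspaces")` on my checkout returns 62, hence the question above.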
Thank you for your time and assistance.