Commit 4a7fc8a

Create new_tests_results.md
1 parent a5d4785 commit 4a7fc8a

1 file changed

Lines changed: 17 additions & 0 deletions

@@ -0,0 +1,17 @@
# Results: Newly Added Tests

Assertion-weighted mean scores (0-100) on the **36 newly added tests** only: 17 Box, 5 Google Calendar, 7 Linear, and 6 Slack. All runs included API documentation.

| Model | Weighted Mean |
|---|---|
| openai/gpt-5 | 88.10 |
| openai/gpt-5-mini | 87.61 |
| deepseek/deepseek-v3.2 | 84.26 |
| mistralai/devstral-2512 | 80.38 |
| google/gemini-3.1-pro-preview | 77.69 |
| moonshotai/kimi-k2-0905 | 74.06 |
| qwen/qwen3-vl-235b-a22b-instruct | 74.06 |
| google/gemini-3-flash-preview | 67.78 |
| x-ai/grok-4.1-fast | 65.44 |
| meta-llama/llama-4-scout | 28.63 |
| openai/gpt-oss-120b | 27.90 |
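
The file does not spell out how the assertion-weighted mean is computed. One plausible reading, sketched below as an assumption rather than the benchmark's actual implementation, is that each test's 0-100 score is weighted by the number of assertions it contains. The `ScoredTest` type, its field names, and the sample values are hypothetical and are not taken from the results above.

```python
from dataclasses import dataclass


@dataclass
class ScoredTest:
    """Hypothetical per-test record; not the benchmark's real schema."""
    name: str
    score: float      # per-test score on a 0-100 scale
    assertions: int   # number of assertions in the test (used as the weight)


def assertion_weighted_mean(results: list[ScoredTest]) -> float:
    """Mean of per-test scores, weighted by each test's assertion count."""
    total_assertions = sum(r.assertions for r in results)
    if total_assertions == 0:
        raise ValueError("cannot average over zero assertions")
    weighted_sum = sum(r.score * r.assertions for r in results)
    return weighted_sum / total_assertions


# Illustrative values only: a 3-assertion test that fully passes and a
# 1-assertion test that scores 50.
example = [
    ScoredTest(name="test_a", score=100.0, assertions=3),
    ScoredTest(name="test_b", score=50.0, assertions=1),
]
print(f"{assertion_weighted_mean(example):.2f}")  # prints 87.50
```

Under this reading, tests with more assertions pull the mean harder than an unweighted per-test average would, which may matter when comparing models whose failures cluster in large tests.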
