🐺🐦‍⬛ LLM Comparison/Test: Phi-4, Qwen2 VL 72B Instruct, Aya Expanse 32B in my updated MMLU-Pro CS benchmark
Introduction
I've updated my MMLU-Pro Computer Science LLM benchmark results with new data from recently tested models: three Phi-4 variants (Microsoft's official weights, plus Unsloth's fixed HF and GGUF versions), Qwen2 VL 72B Instruct, and Aya Expanse 32B.
While adding these new models, I streamlined the graph by removing all QwQ-32B-Preview variants except the main model, which consistently showed superior performance. I also incorporated the results of a third evaluation run for Claude, gemini-1.5-pro-002, and Athene-V2-Chat, which shifted Athene's rank downward due to fluctuations in its scores.
Here's an additional visualization that represents each model as a 3D bar: the height shows the MMLU score (%), the depth represents the model size in billions of parameters, and for quantized models the bar is split into a full-color front section proportional to the quantized size and a lighter-colored back section showing the memory savings compared to full-precision (16-bit) weights:
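If you want to build a similar chart yourself, here's a rough matplotlib sketch of the idea. The model names, scores, sizes, and bits-per-weight values below are illustrative placeholders, not the data or code behind the actual graph:

```python
# Sketch of the split 3D bar idea: height = score (%), depth = model size (B),
# front segment = memory used by the quantized weights (proportional to bpw/16),
# lighter back segment = savings vs. full-precision 16-bit weights.
import matplotlib.pyplot as plt

# (name, score %, parameters in B, bits per weight) -- placeholder values
models = [
    ("Model A (16-bit)", 82.9, 70, 16.0),
    ("Model B (4.65bpw EXL2)", 79.3, 72, 4.65),
    ("Model C (Q4 GGUF)", 67.8, 14, 4.5),
]

fig = plt.figure()
ax = fig.add_subplot(projection="3d")

for i, (name, score, params_b, bpw) in enumerate(models):
    x, width = i * 2, 1
    front = params_b * bpw / 16.0  # depth proportional to quantized memory use
    ax.bar3d(x, 0, 0, width, front, score, color="tab:blue")
    if front < params_b:
        # Lighter back section: memory saved compared to 16-bit weights.
        ax.bar3d(x, front, 0, width, params_b - front, score,
                 color="tab:blue", alpha=0.3)

ax.set_xticks([i * 2 + 0.5 for i in range(len(models))])
ax.set_xticklabels([m[0] for m in models], rotation=45, ha="right")
ax.set_ylabel("Model size (B parameters)")
ax.set_zlabel("MMLU-Pro CS score (%)")
plt.show()
```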
New Models Tested
Phi-4:
- Unsloth's fixed Transformers implementation showed minimal performance differences in benchmarks, with the GGUF version achieving marginally higher accuracy. Further testing would be needed to determine whether this improvement represents a statistically significant trend or random variation.
- Temperature settings had a notable impact on performance: at 0, responses were consistent but repetitive, while at 1, outputs became erratic and unpredictable (a minimal reproduction sketch follows this list).
- German language performance improved substantially compared to previous versions. Although it's a small model and occasionally produces overly literal translations, the overall quality of its German output is good enough for most purposes.
- Censorship can be completely circumvented through basic prompt engineering techniques.
- I'm still undecided on how good Phi-4 is for general-purpose tasks in real-world use, but I generally recommend running a bigger, better model if you can.
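For reference, here's a minimal sketch of how such a temperature comparison can be run against a local OpenAI-compatible server such as TabbyAPI or llama.cpp's server. The base URL, API key, model identifier, and prompt are placeholders, not my exact benchmark settings:

```python
# Sample the same prompt at temperature 0 and 1 from a local
# OpenAI-compatible endpoint and compare the outputs by eye.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="dummy")

prompt = "Explain the difference between a process and a thread."

for temp in (0.0, 1.0):
    resp = client.chat.completions.create(
        model="phi-4",  # placeholder model identifier
        messages=[{"role": "user", "content": prompt}],
        temperature=temp,
        max_tokens=256,
    )
    print(f"--- temperature={temp} ---")
    print(resp.choices[0].message.content)
```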
Qwen2 VL 72B Instruct:
- Given its relatively low scores and reliance on the older Qwen2 series rather than the superior 2.5 architecture, I look forward to the release of a Qwen2.5 VL 72B model.
Aya Expanse 32B:
- While this model shows the lowest score on the graph, keep in mind that I only included models scoring above 50%. There are other, worse models that fell below this threshold and didn't make it onto the visualization at all.
- Its main advantage is the support of 23 languages, making it a solid choice when you need those multilingual capabilities and have no better alternatives. Of course, if your target language is supported by a better model, use that instead.
About the Benchmark
The MMLU-Pro benchmark is a comprehensive evaluation of large language models across various categories, including computer science, mathematics, physics, chemistry, and more. It's designed to assess a model's ability to understand and apply knowledge across a wide range of subjects, providing a robust measure of general intelligence. While it is still a multiple-choice test, there are now 10 answer options per question instead of the 4 used in its predecessor MMLU, which drastically reduces the probability of getting answers right by chance. Additionally, the focus is increasingly on complex reasoning tasks rather than pure factual knowledge.
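To put that in numbers, here's a quick back-of-the-envelope calculation (purely illustrative, not part of the benchmark itself) of how much harder it is to score well by pure guessing with 10 options instead of 4:

```python
# With 10 options the guessing baseline drops from 25% to 10%, and the chance
# of reaching even a 50% score by guessing alone over 410 questions is
# astronomically small in both cases -- but far smaller with 10 options.
from math import comb

def p_at_least(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n_questions = 410
cutoff = 205  # 50% of 410 questions

for options in (4, 10):
    p = 1 / options
    print(f"{options} options: guessing baseline {p:.0%}, "
          f"P(score >= 50% by guessing) = {p_at_least(n_questions, cutoff, p):.2e}")
```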
For my benchmarks, I currently limit myself to the Computer Science category with its 410 questions. This pragmatic decision is based on several factors: First, this category most closely reflects my usual work environment, and I frequently use these models in exactly this context during my daily work. Second, with local models running on consumer hardware, there are practical constraints around computation time - a single run already takes several hours with larger models, and I generally conduct at least two runs to ensure consistency.
Unlike typical benchmarks that only report single scores, I conduct multiple test runs for each model to capture performance variability. Running at least two benchmark runs per model gives a more accurate and nuanced picture of both performance level and consistency. The results feature error bars showing the standard deviation, illustrating how performance varies across test runs.
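The aggregation step itself is simple. Here's a sketch of how per-model means and standard deviations across runs could be computed and plotted as bars with error bars - the scores below are made-up placeholders, not values from the table further down:

```python
# Aggregate per-run scores into mean +/- standard deviation and plot error bars.
import numpy as np
import matplotlib.pyplot as plt

runs = {
    "Model A": [81.0, 80.5, 79.8],  # placeholder per-run percentages
    "Model B": [72.2, 71.7],
    "Model C": [67.8, 66.6],
}

names = list(runs)
means = [np.mean(v) for v in runs.values()]
stds = [np.std(v, ddof=1) for v in runs.values()]  # sample std across runs

plt.bar(names, means, yerr=stds, capsize=5)
plt.ylabel("MMLU-Pro CS score (%)")
plt.ylim(50, 100)  # the graph only includes models scoring above 50%
plt.show()
```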
The benchmarks for this study alone required over 103 hours of runtime. With additional categories or runs, the testing would have taken so long with the available resources that the tested models would have been outdated by the time it was completed. Therefore, setting practical constraints and boundaries is essential to achieve meaningful results within a reasonable timeframe.
Detailed Results
Here's the complete table, including results from previous reports:
Model | HF Main Model Name | HF Draft Model Name (speculative decoding) | Size | Format | API | GPU | GPU Mem | Run | Duration | Total | % | Correct Random Guesses | Prompt tokens | tk/s | Completion tokens | tk/s |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
claude-3-5-sonnet-20241022 | - | - | - | - | Anthropic | - | - | 1/3 | 31m 50s | 340/410 | 82.93% | | 694458 | 362.78 | 97438 | 50.90 |
claude-3-5-sonnet-20241022 | - | - | - | - | Anthropic | - | - | 2/3 | 31m 39s | 338/410 | 82.44% | | 694458 | 364.82 | 97314 | 51.12 |
🆕 claude-3-5-sonnet-20241022 | - | - | - | - | Anthropic | - | - | 3/3 | 28m 56s | 337/410 | 82.20% | | 867478 | 498.45 | 84785 | 48.72 |
gemini-1.5-pro-002 | - | - | - | - | Gemini | - | - | 1/3 | 31m 7s | 335/410 | 81.71% | | 648675 | 346.82 | 78311 | 41.87 |
🆕 gemini-1.5-pro-002 | - | - | - | - | Gemini | - | - | 2/3 | 29m 52s | 333/410 | 81.22% | | 648675 | 361.38 | 77030 | 42.91 |
gemini-1.5-pro-002 | - | - | - | - | Gemini | - | - | 3/3 | 30m 40s | 327/410 | 79.76% | | 648675 | 351.73 | 76063 | 41.24 |
QwQ-32B-Preview (8.0bpw EXL2, max_tokens=16384) | bartowski/QwQ-32B-Preview-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 38436MiB | 1/2 | 2h 3m 30s | 325/410 | 79.27% | 0/2, 0.00% | 656716 | 88.58 | 327825 | 44.22 |
QwQ-32B-Preview (8.0bpw EXL2, max_tokens=16384) | bartowski/QwQ-32B-Preview-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 38436MiB | 2/2 | 2h 3m 35s | 324/410 | 79.02% | | 656716 | 88.52 | 343440 | 46.29 |
Athene-V2-Chat (72B, 4.65bpw EXL2, Q4 cache) | wolfram/Athene-V2-Chat-4.65bpw-h6-exl2 | - | 72B | EXL2 | TabbyAPI | RTX 6000 | 44496MiB | 1/3 | 2h 13m 5s | 326/410 | 79.51% | | 656716 | 82.21 | 142256 | 17.81 |
Athene-V2-Chat (72B, 4.65bpw EXL2, Q4 cache) | wolfram/Athene-V2-Chat-4.65bpw-h6-exl2 | - | 72B | EXL2 | TabbyAPI | RTX 6000 | 44496MiB | 2/3 | 2h 14m 53s | 317/410 | 77.32% | | 656716 | 81.11 | 143659 | 17.74 |
🆕 Athene-V2-Chat (72B, 4.65bpw EXL2, Q4 cache) | wolfram/Athene-V2-Chat-4.65bpw-h6-exl2 | - | 72B | EXL2 | TabbyAPI | RTX 6000 | 44496MiB | 3/3 | 1h 49m 40s | 312/410 | 76.10% | | 805136 | 122.30 | 115284 | 17.51 |
Qwen2.5-72B-Instruct (4.65bpw EXL2, Q4 cache) | LoneStriker/Qwen2.5-72B-Instruct-4.65bpw-h6-exl2 | - | 72B | EXL2 | TabbyAPI | 2x RTX 3090 | 41150MiB | 1/2 | 3h 7m 58s | 320/410 | 78.05% | | 656716 | 58.21 | 139499 | 12.36 |
Qwen2.5-72B-Instruct (4.65bpw EXL2, Q4 cache) | LoneStriker/Qwen2.5-72B-Instruct-4.65bpw-h6-exl2 | - | 72B | EXL2 | TabbyAPI | 2x RTX 3090 | 41150MiB | 2/2 | 3h 5m 19s | 319/410 | 77.80% | | 656716 | 59.04 | 138135 | 12.42 |
DeepSeek-V3 | deepseek-ai/DeepSeek-V3 | - | 671B | HF | DeepSeek | - | - | 1/4 | 20m 22s | 320/410 | 78.05% | | 628029 | 512.38 | 66807 | 54.50 |
DeepSeek-V3 | deepseek-ai/DeepSeek-V3 | - | 671B | HF | DeepSeek | - | - | 2/4 | 27m 43s | 320/410 | 78.05% | | 628029 | 376.59 | 66874 | 40.10 |
DeepSeek-V3 | deepseek-ai/DeepSeek-V3 | - | 671B | HF | DeepSeek | - | - | 3/4 | 19m 45s | 319/410 | 77.80% | | 628029 | 528.39 | 64470 | 54.24 |
DeepSeek-V3 | deepseek-ai/DeepSeek-V3 | - | 671B | HF | DeepSeek | - | - | 4/4 | 19m 45s | 319/410 | 77.80% | | 628029 | 375.73 | 69531 | 41.60 |
gpt-4o-2024-08-06 | - | - | - | - | OpenAI | - | - | 1/2 | 34m 54s | 320/410 | 78.05% | 1/2, 50.00% | 631448 | 300.79 | 99103 | 47.21 |
gpt-4o-2024-08-06 | - | - | - | - | OpenAI | - | - | 2/2 | 42m 41s | 316/410 | 77.07% | 1/3, 33.33% | 631448 | 246.02 | 98466 | 38.36 |
mistral-large-2407 (123B) | mistralai/Mistral-Large-Instruct-2407 | - | 123B | HF | Mistral | - | - | 1/2 | 40m 23s | 310/410 | 75.61% | | 696798 | 287.13 | 79444 | 32.74 |
mistral-large-2407 (123B) | mistralai/Mistral-Large-Instruct-2407 | - | 123B | HF | Mistral | - | - | 2/2 | 46m 55s | 308/410 | 75.12% | 0/1, 0.00% | 696798 | 247.21 | 75971 | 26.95 |
Llama-3.1-405B-Instruct-FP8 | meta-llama/Llama-3.1-405B-Instruct-FP8 | - | 405B | HF | IONOS | - | - | 1/2 | 2h 5m 28s | 311/410 | 75.85% | | 648580 | 86.11 | 79191 | 10.51 |
Llama-3.1-405B-Instruct-FP8 | meta-llama/Llama-3.1-405B-Instruct-FP8 | - | 405B | HF | IONOS | - | - | 2/2 | 2h 10m 19s | 307/410 | 74.88% | | 648580 | 82.90 | 79648 | 10.18 |
mistral-large-2411 (123B) | mistralai/Mistral-Large-Instruct-2411 | - | 123B | HF | Mistral | - | - | 1/2 | 41m 46s | 302/410 | 73.66% | 1/3, 33.33% | 696798 | 277.70 | 82028 | 32.69 |
mistral-large-2411 (123B) | mistralai/Mistral-Large-Instruct-2411 | - | 123B | HF | Mistral | - | - | 2/2 | 32m 47s | 300/410 | 73.17% | 0/1, 0.00% | 696798 | 353.53 | 77998 | 39.57 |
chatgpt-4o-latest @ 2024-11-18 | - | - | - | - | OpenAI | - | - | 1/2 | 28m 17s | 302/410 | 73.66% | 2/4, 50.00% | 631448 | 371.33 | 146558 | 86.18 |
chatgpt-4o-latest @ 2024-11-18 | - | - | - | - | OpenAI | - | - | 2/2 | 28m 31s | 298/410 | 72.68% | 2/2, 100.00% | 631448 | 368.19 | 146782 | 85.59 |
gpt-4o-2024-11-20 | - | - | - | - | OpenAI | - | - | 1/2 | 25m 35s | 296/410 | 72.20% | 1/7, 14.29% | 631448 | 410.38 | 158694 | 103.14 |
gpt-4o-2024-11-20 | - | - | - | - | OpenAI | - | - | 2/2 | 26m 10s | 294/410 | 71.71% | 1/7, 14.29% | 631448 | 400.95 | 160378 | 101.84 |
Llama-3.3-70B-Instruct (4.0bpw EXL2) | LoneStriker/Llama-3.3-70B-Instruct-4.0bpw-h6-exl2 | - | 70B | EXL2 | TabbyAPI | RTX 6000 | 47148MiB | 1/2 | 2h 2m 33s | 293/410 | 71.46% | | 648580 | 88.15 | 87107 | 11.84 |
Llama-3.3-70B-Instruct (4.0bpw EXL2) | LoneStriker/Llama-3.3-70B-Instruct-4.0bpw-h6-exl2 | - | 70B | EXL2 | TabbyAPI | RTX 6000 | 47148MiB | 2/2 | 1h 33m 59s | 293/410 | 71.46% | | 534360 | 94.70 | 89510 | 15.86 |
Llama-3.1-70B-Instruct | meta-llama/Llama-3.1-70B-Instruct | - | 70B | HF | IONOS | - | - | 1/2 | 41m 12s | 291/410 | 70.98% | 3/12, 25.00% | 648580 | 261.88 | 102559 | 41.41 |
Llama-3.1-70B-Instruct | meta-llama/Llama-3.1-70B-Instruct | - | 70B | HF | IONOS | - | - | 2/2 | 39m 48s | 287/410 | 70.00% | 3/14, 21.43% | 648580 | 271.12 | 106644 | 44.58 |
Llama-3.1-Nemotron-70B-Instruct (4.25bpw EXL2) | bartowski/Llama-3.1-Nemotron-70B-Instruct-HF-exl2_4_25 | - | 70B | EXL2 | TabbyAPI | RTX 6000 | 40104MiB | 1/2 | 2h 13m 3s | 290/410 | 70.73% | | 640380 | 80.18 | 157235 | 19.69 |
Llama-3.1-Nemotron-70B-Instruct (4.25bpw EXL2) | bartowski/Llama-3.1-Nemotron-70B-Instruct-HF-exl2_4_25 | - | 70B | EXL2 | TabbyAPI | RTX 6000 | 40104MiB | 2/2 | 2h 13m 15s | 287/410 | 70.00% | 0/1, 0.00% | 640380 | 80.07 | 157471 | 19.69 |
QVQ-72B-Preview (4.65bpw EXL2, max_tokens=16384) | wolfram/QVQ-72B-Preview-4.65bpw-h6-exl2 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 72B | EXL2 | TabbyAPI | RTX 6000 | 46260MiB | 1/2 | 3h 43m 12s | 290/410 | 70.73% | 1/3, 33.33% | 656716 | 49.02 | 441187 | 32.93 |
QVQ-72B-Preview (4.65bpw EXL2, max_tokens=16384) | wolfram/QVQ-72B-Preview-4.65bpw-h6-exl2 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 72B | EXL2 | TabbyAPI | RTX 6000 | 46260MiB | 2/2 | 3h 47m 29s | 284/410 | 69.27% | 0/2, 0.00% | 656716 | 48.10 | 450363 | 32.99 |
gemini-1.5-flash-002 | - | - | - | - | Gemini | - | - | 1/2 | 13m 19s | 288/410 | 70.24% | 1/6, 16.67% | 648675 | 808.52 | 80535 | 100.38 |
gemini-1.5-flash-002 | - | - | - | - | Gemini | - | - | 2/2 | 22m 30s | 285/410 | 69.51% | 2/7, 28.57% | 648675 | 479.42 | 80221 | 59.29 |
Llama-3.2-90B-Vision-Instruct | meta-llama/Llama-3.2-90B-Vision-Instruct | - | 90B | HF | Azure | - | - | 1/2 | 33m 6s | 289/410 | 70.49% | 4/7, 57.14% | 640380 | 321.96 | 88997 | 44.74 |
Llama-3.2-90B-Vision-Instruct | meta-llama/Llama-3.2-90B-Vision-Instruct | - | 90B | HF | Azure | - | - | 2/2 | 31m 31s | 281/410 | 68.54% | 2/5, 40.00% | 640380 | 338.10 | 85381 | 45.08 |
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | Qwen/Qwen2.5-Coder-3B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 45880MiB | 1/7 | 41m 59s | 289/410 | 70.49% | | 656716 | 260.29 | 92126 | 36.51 |
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 40036MiB | 2/7 | 34m 24s | 286/410 | 69.76% | | 656716 | 317.48 | 89487 | 43.26 |
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | Qwen/Qwen2.5-Coder-3B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 45880MiB | 3/7 | 41m 27s | 283/410 | 69.02% | 0/1, 0.00% | 656716 | 263.62 | 90349 | 36.27 |
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | bartowski/Qwen2.5-Coder-7B-Instruct-exl2_8_0 | 32B | EXL2 | TabbyAPI | RTX 6000 | 43688MiB | 4/7 | 42m 32s | 283/410 | 69.02% | 0/1, 0.00% | 656716 | 256.77 | 90899 | 35.54 |
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | bartowski/Qwen2.5-Coder-7B-Instruct-exl2_8_0 | 32B | EXL2 | TabbyAPI | RTX 6000 | 43688MiB | 5/7 | 44m 34s | 282/410 | 68.78% | 0/1, 0.00% | 656716 | 245.24 | 96470 | 36.03 |
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 38620MiB | 6/7 | 1h 2m 8s | 282/410 | 68.78% | | 656716 | 175.98 | 92767 | 24.86 |
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 40036MiB | 7/7 | 34m 56s | 280/410 | 68.29% | | 656716 | 312.66 | 91926 | 43.76 |
Mistral-Large-Instruct-2411 (123B, 3.0bpw EXL2) | MikeRoz/mistralai_Mistral-Large-Instruct-2411-3.0bpw-h6-exl2 | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 47068MiB | 1/2 | 1h 26m 26s | 284/410 | 69.27% | 1/3, 33.33% | 696798 | 134.23 | 79925 | 15.40 |
Mistral-Large-Instruct-2411 (123B, 3.0bpw EXL2) | MikeRoz/mistralai_Mistral-Large-Instruct-2411-3.0bpw-h6-exl2 | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 47068MiB | 2/2 | 1h 26m 10s | 275/410 | 67.07% | 0/2, 0.00% | 696798 | 134.67 | 79778 | 15.42 |
🆕 Phi-4 (14B, Unsloth, GGUF) | unsloth/phi-4-GGUF | - | 14B | GGUF | llama.cpp | RTX 6000 | 31978MiB | 1/2 | 1h 19m 51s | 278/410 | 67.80% | 1/6, 16.67% | 639591 | 133.40 | 133610 | 27.87 |
🆕 Phi-4 (14B, Unsloth, GGUF) | unsloth/phi-4-GGUF | - | 14B | GGUF | llama.cpp | RTX 6000 | 31978MiB | 2/2 | 1h 19m 41s | 278/410 | 67.80% | 1/6, 16.67% | 639591 | 133.67 | 133610 | 27.92 |
🆕 Phi-4 (14B, Unsloth, HF) | unsloth/phi-4 | - | 14B | HF | TabbyAPI | RTX 6000 | | 1/2 | 1h 38m 29s | 274/410 | 66.83% | 1/3, 33.33% | 635081 | 107.42 | 113731 | 19.24 |
🆕 Phi-4 (14B, Unsloth, HF) | unsloth/phi-4 | - | 14B | HF | TabbyAPI | RTX 6000 | | 2/2 | 1h 39m 32s | 273/410 | 66.59% | 1/3, 33.33% | 635081 | 106.29 | 113712 | 19.03 |
🆕 Phi-4 (14B, Microsoft, HF) | microsoft/phi-4 | - | 14B | HF | TabbyAPI | RTX 6000 | 31394MiB | 1/2 | 1h 7m 44s | 272/410 | 66.34% | 1/3, 33.33% | 635081 | 156.15 | 113358 | 27.87 |
🆕 Phi-4 (14B, Microsoft, HF) | microsoft/phi-4 | - | 14B | HF | TabbyAPI | RTX 6000 | 31394MiB | 2/2 | 1h 7m 44s | 271/410 | 66.10% | 1/3, 33.33% | 635081 | 156.10 | 113384 | 27.87 |
Mistral-Large-Instruct-2407 (123B, 2.75bpw EXL2) | turboderp/Mistral-Large-Instruct-2407-123B-exl2_2.75bpw | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 45096MiB | 1/2 | 1h 8m 8s | 271/410 | 66.10% | | 696798 | 170.29 | 66670 | 16.29 |
Mistral-Large-Instruct-2407 (123B, 2.75bpw EXL2) | turboderp/Mistral-Large-Instruct-2407-123B-exl2_2.75bpw | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 45096MiB | 2/2 | 1h 10m 38s | 268/410 | 65.37% | 1/3, 33.33% | 696798 | 164.23 | 69182 | 16.31 |
🆕 Qwen2-VL-72B-Instruct (4.5bpw EXL2) | turboderp/Qwen2-VL-72B-Instruct-exl2_4.5bpw | - | 72B | EXL2 | TabbyAPI | RTX 6000 | 43554MiB | 1/2 | 1h 10m 51s | 255/410 | 62.20% | 0/3, 0.00% | 656716 | 154.36 | 71752 | 16.87 |
🆕 Qwen2-VL-72B-Instruct (4.5bpw EXL2) | turboderp/Qwen2-VL-72B-Instruct-exl2_4.5bpw | - | 72B | EXL2 | TabbyAPI | RTX 6000 | 43554MiB | 2/2 | 1h 26m 40s | 255/410 | 62.20% | 1/6, 16.67% | 656716 | 126.20 | 88249 | 16.96 |
Mistral-Large-Instruct-2411 (123B, 2.75bpw EXL2) | wolfram/Mistral-Large-Instruct-2411-2.75bpw-h6-exl2 | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 45096MiB | 1/2 | 1h 11m 50s | 267/410 | 65.12% | 1/4, 25.00% | 696798 | 161.53 | 70538 | 16.35 |
Mistral-Large-Instruct-2411 (123B, 2.75bpw EXL2) | wolfram/Mistral-Large-Instruct-2411-2.75bpw-h6-exl2 | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 45096MiB | 2/2 | 1h 13m 50s | 243/410 | 59.27% | 0/4, 0.00% | 696798 | 157.18 | 72718 | 16.40 |
Falcon3-10B-Instruct | tiiuae/Falcon3-10B-Instruct | - | 10B | HF | Ollama | RTX 6000 | 20906MiB | 1/2 | 35m 15s | 251/410 | 61.22% | 2/5, 40.00% | 702578 | 331.57 | 75501 | 35.63 |
Falcon3-10B-Instruct | tiiuae/Falcon3-10B-Instruct | - | 10B | HF | Ollama | RTX 6000 | 20906MiB | 2/2 | 35m 21s | 251/410 | 61.22% | 2/5, 40.00% | 702578 | 330.66 | 75501 | 35.53 |
mistral-small-2409 (22B) | mistralai/Mistral-Small-Instruct-2409 | - | 22B | HF | Mistral | - | - | 1/2 | 25m 3s | 243/410 | 59.27% | 1/4, 25.00% | 696798 | 462.38 | 73212 | 48.58 |
mistral-small-2409 (22B) | mistralai/Mistral-Small-Instruct-2409 | - | 22B | HF | Mistral | - | - | 2/2 | 20m 45s | 239/410 | 58.29% | 1/4, 25.00% | 696798 | 558.10 | 76017 | 60.89 |
🆕 Aya-Expanse-32B (8.0bpw EXL2) | lucyknada/CohereForAI_aya-expanse-32b-exl2_8.0bpw | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 33686MiB | 1/2 | 43m 18s | 212/410 | 51.71% | 0/1, 0.00% | 661930 | 254.04 | 60728 | 23.31 |
🆕 Aya-Expanse-32B (8.0bpw EXL2) | lucyknada/CohereForAI_aya-expanse-32b-exl2_8.0bpw | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 33686MiB | 2/2 | 42m 27s | 211/410 | 51.46% | 0/4, 0.00% | 661930 | 259.50 | 59557 | 23.35 |
- Model: Model name (with relevant parameter and setting details)
- HF Main Model Name: Full name of the tested model as listed on Hugging Face
- HF Draft Model Name (speculative decoding): Draft model used for speculative decoding (if applicable)
- Size: Parameter count
- Format: Model format type (HF, EXL2, etc.)
- API: Service provider (TabbyAPI indicates local deployment)
- GPU: Graphics card used for this benchmark run
- GPU Mem: VRAM allocated to model and configuration
- Run: Benchmark run sequence number
- Duration: Total runtime of benchmark
- Total: Number of correct answers (determines ranking!)
- %: Percentage of correct answers
- Correct Random Guesses: When MMLU-Pro cannot definitively identify a model's answer choice, it falls back to a random guess and reports both the number of these random guesses and their accuracy; a high proportion of random guessing indicates problems with following the response format (see the sketch after this list)
- Prompt tokens: Token count of input text
- tk/s: Tokens processed per second
- Completion tokens: Token count of generated response
- tk/s: Tokens generated per second
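To make the random-guess fallback mentioned above more concrete, here's a simplified sketch of the idea. This is not the actual MMLU-Pro evaluation code; the regex pattern and bookkeeping are my own illustrative assumptions:

```python
# Simplified sketch: try to extract a definitive answer letter; if none is
# found, fall back to a random guess and track how often that happens.
import random
import re

OPTIONS = "ABCDEFGHIJ"  # 10 answer options per MMLU-Pro question

def extract_choice(response):
    """Try to find a definitive answer letter in the model's response."""
    match = re.search(r"answer is \(?([A-J])\)?", response, re.IGNORECASE)
    return match.group(1).upper() if match else None

def grade(response, correct, stats):
    choice = extract_choice(response)
    if choice is None:
        # No parsable answer: random guess, counted separately, since a high
        # share of these indicates format-following problems.
        choice = random.choice(OPTIONS)
        stats["random_guesses"] += 1
        stats["random_correct"] += int(choice == correct)
    return choice == correct

stats = {"random_guesses": 0, "random_correct": 0}
print(grade("The answer is (C).", "C", stats), stats)
print(grade("I think it could be many things...", "C", stats), stats)
```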
Wolfram Ravenwolf is a German AI Engineer and an internationally active consultant and renowned researcher who's particularly passionate about local language models. You can follow him on X and Bluesky, read his previous LLM tests and comparisons on HF and Reddit, check out his models on Hugging Face, tip him on Ko-fi, or book him for a consultation.