🐺🐦‍⬛ LLM Comparison/Test: Phi-4, Qwen2 VL 72B Instruct, Aya Expanse 32B in my updated MMLU-Pro CS benchmark
Introduction
I've updated my MMLU-Pro Computer Science LLM benchmark results with new data from recently tested models: three Phi-4 variants (Microsoft's official weights, plus Unsloth's fixed HF and GGUF versions), Qwen2 VL 72B Instruct, and Aya Expanse 32B.
While adding these new models, I streamlined the graph by removing all QwQ-32B-Preview variants except the main model, which consistently showed superior performance. I also incorporated the results of a third evaluation run for Claude, gemini-1.5-pro-002, and Athene-V2-Chat, which shifted Athene's rank downward due to fluctuations in its scores.
Here's an additional visualization that represents each model as a 3D bar: the height shows the MMLU score (%), the depth represents the model size in billions of parameters, and for quantized models the bar is split into a full-color front section proportional to the quantized size and a lighter-colored back section showing the memory savings compared to full-precision (16-bit) weights:
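If you want to build a similar chart yourself, here's a rough matplotlib sketch of the idea. The model names, scores, sizes, and bits-per-weight values below are illustrative placeholders, not the data or code behind the actual graph:

```python
# Sketch of the split 3D bar idea: height = score (%), depth = model size (B),
# front segment = memory used by the quantized weights (proportional to bpw/16),
# lighter back segment = savings vs. full-precision 16-bit weights.
import matplotlib.pyplot as plt

# (name, score %, parameters in B, bits per weight) -- placeholder values
models = [
    ("Model A (16-bit)", 82.9, 70, 16.0),
    ("Model B (4.65bpw EXL2)", 79.3, 72, 4.65),
    ("Model C (Q4 GGUF)", 67.8, 14, 4.5),
]

fig = plt.figure()
ax = fig.add_subplot(projection="3d")

for i, (name, score, params_b, bpw) in enumerate(models):
    x, width = i * 2, 1
    front = params_b * bpw / 16.0  # depth proportional to quantized memory use
    ax.bar3d(x, 0, 0, width, front, score, color="tab:blue")
    if front < params_b:
        # Lighter back section: memory saved compared to 16-bit weights.
        ax.bar3d(x, front, 0, width, params_b - front, score,
                 color="tab:blue", alpha=0.3)

ax.set_xticks([i * 2 + 0.5 for i in range(len(models))])
ax.set_xticklabels([m[0] for m in models], rotation=45, ha="right")
ax.set_ylabel("Model size (B parameters)")
ax.set_zlabel("MMLU-Pro CS score (%)")
plt.show()
```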
New Models Tested
Phi-4:
- Unsloth's fixed Transformers implementation showed minimal performance differences in benchmarks, with the GGUF version achieving marginally higher accuracy. Further testing would be needed to determine whether this improvement represents a statistically significant trend or random variation.
- Temperature settings had a notable impact on performance: at 0, responses were consistent but repetitive, while at 1, outputs became erratic and unpredictable (a minimal reproduction sketch follows this list).
- German language performance improved substantially compared to previous versions. Although it's a small model and occasionally produces overly literal translations, the overall quality of its German output is good enough for most purposes.
- Censorship can be completely circumvented through basic prompt engineering techniques.
- I'm still undecided on how good Phi-4 is for general-purpose tasks in real-world use, but I generally recommend running a bigger, better model if you can.
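For reference, here's a minimal sketch of how such a temperature comparison can be run against a local OpenAI-compatible server such as TabbyAPI or llama.cpp's server. The base URL, API key, model identifier, and prompt are placeholders, not my exact benchmark settings:

```python
# Sample the same prompt at temperature 0 and 1 from a local
# OpenAI-compatible endpoint and compare the outputs by eye.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="dummy")

prompt = "Explain the difference between a process and a thread."

for temp in (0.0, 1.0):
    resp = client.chat.completions.create(
        model="phi-4",  # placeholder model identifier
        messages=[{"role": "user", "content": prompt}],
        temperature=temp,
        max_tokens=256,
    )
    print(f"--- temperature={temp} ---")
    print(resp.choices[0].message.content)
```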
Qwen2 VL 72B Instruct:
- Given its relatively low scores and reliance on the older Qwen2 series rather than the superior 2.5 architecture, I look forward to the release of a Qwen2.5 VL 72B model.
Aya Expanse 32B:
- While this model shows the lowest score on the graph, keep in mind that I only included models scoring above 50%. There are other, worse models that fell below this threshold and didn't make it onto the visualization at all.
- Its main advantage is the support of 23 languages, making it a solid choice when you need those multilingual capabilities and have no better alternatives. Of course, if your target language is supported by a better model, use that instead.
About the Benchmark
The MMLU-Pro benchmark is a comprehensive evaluation of large language models across various categories, including computer science, mathematics, physics, chemistry, and more. It's designed to assess a model's ability to understand and apply knowledge across a wide range of subjects, providing a robust measure of general intelligence. While it is still a multiple-choice test, there are now 10 answer options per question instead of the 4 used in its predecessor MMLU, which drastically reduces the probability of getting answers right by chance. Additionally, the focus is increasingly on complex reasoning tasks rather than pure factual knowledge.
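To put that in numbers, here's a quick back-of-the-envelope calculation (purely illustrative, not part of the benchmark itself) of how much harder it is to score well by pure guessing with 10 options instead of 4:

```python
# With 10 options the guessing baseline drops from 25% to 10%, and the chance
# of reaching even a 50% score by guessing alone over 410 questions is
# astronomically small in both cases -- but far smaller with 10 options.
from math import comb

def p_at_least(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n_questions = 410
cutoff = 205  # 50% of 410 questions

for options in (4, 10):
    p = 1 / options
    print(f"{options} options: guessing baseline {p:.0%}, "
          f"P(score >= 50% by guessing) = {p_at_least(n_questions, cutoff, p):.2e}")
```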
For my benchmarks, I currently limit myself to the Computer Science category with its 410 questions. This pragmatic decision is based on several factors: First, this category most closely reflects my usual work environment, and I frequently use these models in exactly this context during my daily work. Second, with local models running on consumer hardware, there are practical constraints around computation time - a single run already takes several hours with larger models, and I generally conduct at least two runs to ensure consistency.
Unlike typical benchmarks that only report single scores, I conduct multiple test runs for each model to capture performance variability. Running at least two benchmark runs per model gives a more accurate and nuanced picture of both performance level and consistency. The results feature error bars showing the standard deviation, illustrating how performance varies across test runs.
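The aggregation step itself is simple. Here's a sketch of how per-model means and standard deviations across runs could be computed and plotted as bars with error bars - the scores below are made-up placeholders, not values from the table further down:

```python
# Aggregate per-run scores into mean +/- standard deviation and plot error bars.
import numpy as np
import matplotlib.pyplot as plt

runs = {
    "Model A": [81.0, 80.5, 79.8],  # placeholder per-run percentages
    "Model B": [72.2, 71.7],
    "Model C": [67.8, 66.6],
}

names = list(runs)
means = [np.mean(v) for v in runs.values()]
stds = [np.std(v, ddof=1) for v in runs.values()]  # sample std across runs

plt.bar(names, means, yerr=stds, capsize=5)
plt.ylabel("MMLU-Pro CS score (%)")
plt.ylim(50, 100)  # the graph only includes models scoring above 50%
plt.show()
```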
The benchmarks for this study alone required over 103 hours of runtime. With additional categories or runs, the testing would have taken so long with the available resources that the tested models would have been outdated by the time it was completed. Therefore, setting practical constraints and boundaries is essential to achieve meaningful results within a reasonable timeframe.
Detailed Results
Here's the complete table, including results from previous reports:
Model | HF Main Model Name | HF Draft Model Name (speculative decoding) | Size | Format | API | GPU | GPU Mem | Run | Duration | Total | % | Correct Random Guesses | Prompt tokens | tk/s | Completion tokens | tk/s |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
claude-3-5-sonnet-20241022 | - | - | - | - | Anthropic | - | - | 1/3 | 31m 50s | 340/410 | 82.93% | | 694458 | 362.78 | 97438 | 50.90 |
claude-3-5-sonnet-20241022 | - | - | - | - | Anthropic | - | - | 2/3 | 31m 39s | 338/410 | 82.44% | | 694458 | 364.82 | 97314 | 51.12 |
🆕 claude-3-5-sonnet-20241022 | - | - | - | - | Anthropic | - | - | 3/3 | 28m 56s | 337/410 | 82.20% | | 867478 | 498.45 | 84785 | 48.72 |
gemini-1.5-pro-002 | - | - | - | - | Gemini | - | - | 1/3 | 31m 7s | 335/410 | 81.71% | | 648675 | 346.82 | 78311 | 41.87 |
🆕 gemini-1.5-pro-002 | - | - | - | - | Gemini | - | - | 2/3 | 29m 52s | 333/410 | 81.22% | | 648675 | 361.38 | 77030 | 42.91 |
gemini-1.5-pro-002 | - | - | - | - | Gemini | - | - | 3/3 | 30m 40s | 327/410 | 79.76% | | 648675 | 351.73 | 76063 | 41.24 |
QwQ-32B-Preview (8.0bpw EXL2, max_tokens=16384) | bartowski/QwQ-32B-Preview-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 38436MiB | 1/2 | 2h 3m 30s | 325/410 | 79.27% | 0/2, 0.00% | 656716 | 88.58 | 327825 | 44.22 |
QwQ-32B-Preview (8.0bpw EXL2, max_tokens=16384) | bartowski/QwQ-32B-Preview-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 38436MiB | 2/2 | 2h 3m 35s | 324/410 | 79.02% | | 656716 | 88.52 | 343440 | 46.29 |
Athene-V2-Chat (72B, 4.65bpw EXL2, Q4 cache) | wolfram/Athene-V2-Chat-4.65bpw-h6-exl2 | - | 72B | EXL2 | TabbyAPI | RTX 6000 | 44496MiB | 1/3 | 2h 13m 5s | 326/410 | 79.51% | | 656716 | 82.21 | 142256 | 17.81 |
Athene-V2-Chat (72B, 4.65bpw EXL2, Q4 cache) | wolfram/Athene-V2-Chat-4.65bpw-h6-exl2 | - | 72B | EXL2 | TabbyAPI | RTX 6000 | 44496MiB | 2/3 | 2h 14m 53s | 317/410 | 77.32% | | 656716 | 81.11 | 143659 | 17.74 |
🆕 Athene-V2-Chat (72B, 4.65bpw EXL2, Q4 cache) | wolfram/Athene-V2-Chat-4.65bpw-h6-exl2 | - | 72B | EXL2 | TabbyAPI | RTX 6000 | 44496MiB | 3/3 | 1h 49m 40s | 312/410 | 76.10% | | 805136 | 122.30 | 115284 | 17.51 |
Qwen2.5-72B-Instruct (4.65bpw EXL2, Q4 cache) | LoneStriker/Qwen2.5-72B-Instruct-4.65bpw-h6-exl2 | - | 72B | EXL2 | TabbyAPI | 2x RTX 3090 | 41150MiB | 1/2 | 3h 7m 58s | 320/410 | 78.05% | | 656716 | 58.21 | 139499 | 12.36 |
Qwen2.5-72B-Instruct (4.65bpw EXL2, Q4 cache) | LoneStriker/Qwen2.5-72B-Instruct-4.65bpw-h6-exl2 | - | 72B | EXL2 | TabbyAPI | 2x RTX 3090 | 41150MiB | 2/2 | 3h 5m 19s | 319/410 | 77.80% | | 656716 | 59.04 | 138135 | 12.42 |
DeepSeek-V3 | deepseek-ai/DeepSeek-V3 | - | 671B | HF | DeepSeek | - | - | 1/4 | 20m 22s | 320/410 | 78.05% | | 628029 | 512.38 | 66807 | 54.50 |
DeepSeek-V3 | deepseek-ai/DeepSeek-V3 | - | 671B | HF | DeepSeek | - | - | 2/4 | 27m 43s | 320/410 | 78.05% | | 628029 | 376.59 | 66874 | 40.10 |
DeepSeek-V3 | deepseek-ai/DeepSeek-V3 | - | 671B | HF | DeepSeek | - | - | 3/4 | 19m 45s | 319/410 | 77.80% | | 628029 | 528.39 | 64470 | 54.24 |
DeepSeek-V3 | deepseek-ai/DeepSeek-V3 | - | 671B | HF | DeepSeek | - | - | 4/4 | 19m 45s | 319/410 | 77.80% | | 628029 | 375.73 | 69531 | 41.60 |
gpt-4o-2024-08-06 | - | - | - | - | OpenAI | - | - | 1/2 | 34m 54s | 320/410 | 78.05% | 1/2, 50.00% | 631448 | 300.79 | 99103 | 47.21 |
gpt-4o-2024-08-06 | - | - | - | - | OpenAI | - | - | 2/2 | 42m 41s | 316/410 | 77.07% | 1/3, 33.33% | 631448 | 246.02 | 98466 | 38.36 |
mistral-large-2407 (123B) | mistralai/Mistral-Large-Instruct-2407 | - | 123B | HF | Mistral | - | - | 1/2 | 40m 23s | 310/410 | 75.61% | | 696798 | 287.13 | 79444 | 32.74 |
mistral-large-2407 (123B) | mistralai/Mistral-Large-Instruct-2407 | - | 123B | HF | Mistral | - | - | 2/2 | 46m 55s | 308/410 | 75.12% | 0/1, 0.00% | 696798 | 247.21 | 75971 | 26.95 |
Llama-3.1-405B-Instruct-FP8 | meta-llama/Llama-3.1-405B-Instruct-FP8 | - | 405B | HF | IONOS | - | - | 1/2 | 2h 5m 28s | 311/410 | 75.85% | | 648580 | 86.11 | 79191 | 10.51 |
Llama-3.1-405B-Instruct-FP8 | meta-llama/Llama-3.1-405B-Instruct-FP8 | - | 405B | HF | IONOS | - | - | 2/2 | 2h 10m 19s | 307/410 | 74.88% | | 648580 | 82.90 | 79648 | 10.18 |
mistral-large-2411 (123B) | mistralai/Mistral-Large-Instruct-2411 | - | 123B | HF | Mistral | - | - | 1/2 | 41m 46s | 302/410 | 73.66% | 1/3, 33.33% | 696798 | 277.70 | 82028 | 32.69 |
mistral-large-2411 (123B) | mistralai/Mistral-Large-Instruct-2411 | - | 123B | HF | Mistral | - | - | 2/2 | 32m 47s | 300/410 | 73.17% | 0/1, 0.00% | 696798 | 353.53 | 77998 | 39.57 |
chatgpt-4o-latest @ 2024-11-18 | - | - | - | - | OpenAI | - | - | 1/2 | 28m 17s | 302/410 | 73.66% | 2/4, 50.00% | 631448 | 371.33 | 146558 | 86.18 |
chatgpt-4o-latest @ 2024-11-18 | - | - | - | - | OpenAI | - | - | 2/2 | 28m 31s | 298/410 | 72.68% | 2/2, 100.00% | 631448 | 368.19 | 146782 | 85.59 |
gpt-4o-2024-11-20 | - | - | - | - | OpenAI | - | - | 1/2 | 25m 35s | 296/410 | 72.20% | 1/7, 14.29% | 631448 | 410.38 | 158694 | 103.14 |
gpt-4o-2024-11-20 | - | - | - | - | OpenAI | - | - | 2/2 | 26m 10s | 294/410 | 71.71% | 1/7, 14.29% | 631448 | 400.95 | 160378 | 101.84 |
Llama-3.3-70B-Instruct (4.0bpw EXL2) | LoneStriker/Llama-3.3-70B-Instruct-4.0bpw-h6-exl2 | - | 70B | EXL2 | TabbyAPI | RTX 6000 | 47148MiB | 1/2 | 2h 2m 33s | 293/410 | 71.46% | | 648580 | 88.15 | 87107 | 11.84 |
Llama-3.3-70B-Instruct (4.0bpw EXL2) | LoneStriker/Llama-3.3-70B-Instruct-4.0bpw-h6-exl2 | - | 70B | EXL2 | TabbyAPI | RTX 6000 | 47148MiB | 2/2 | 1h 33m 59s | 293/410 | 71.46% | | 534360 | 94.70 | 89510 | 15.86 |
Llama-3.1-70B-Instruct | meta-llama/Llama-3.1-70B-Instruct | - | 70B | HF | IONOS | - | - | 1/2 | 41m 12s | 291/410 | 70.98% | 3/12, 25.00% | 648580 | 261.88 | 102559 | 41.41 |
Llama-3.1-70B-Instruct | meta-llama/Llama-3.1-70B-Instruct | - | 70B | HF | IONOS | - | - | 2/2 | 39m 48s | 287/410 | 70.00% | 3/14, 21.43% | 648580 | 271.12 | 106644 | 44.58 |
Llama-3.1-Nemotron-70B-Instruct (4.25bpw EXL2) | bartowski/Llama-3.1-Nemotron-70B-Instruct-HF-exl2_4_25 | - | 70B | EXL2 | TabbyAPI | RTX 6000 | 40104MiB | 1/2 | 2h 13m 3s | 290/410 | 70.73% | | 640380 | 80.18 | 157235 | 19.69 |
Llama-3.1-Nemotron-70B-Instruct (4.25bpw EXL2) | bartowski/Llama-3.1-Nemotron-70B-Instruct-HF-exl2_4_25 | - | 70B | EXL2 | TabbyAPI | RTX 6000 | 40104MiB | 2/2 | 2h 13m 15s | 287/410 | 70.00% | 0/1, 0.00% | 640380 | 80.07 | 157471 | 19.69 |
QVQ-72B-Preview (4.65bpw EXL2, max_tokens=16384) | wolfram/QVQ-72B-Preview-4.65bpw-h6-exl2 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 72B | EXL2 | TabbyAPI | RTX 6000 | 46260MiB | 1/2 | 3h 43m 12s | 290/410 | 70.73% | 1/3, 33.33% | 656716 | 49.02 | 441187 | 32.93 |
QVQ-72B-Preview (4.65bpw EXL2, max_tokens=16384) | wolfram/QVQ-72B-Preview-4.65bpw-h6-exl2 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 72B | EXL2 | TabbyAPI | RTX 6000 | 46260MiB | 2/2 | 3h 47m 29s | 284/410 | 69.27% | 0/2, 0.00% | 656716 | 48.10 | 450363 | 32.99 |
gemini-1.5-flash-002 | - | - | - | - | Gemini | - | - | 1/2 | 13m 19s | 288/410 | 70.24% | 1/6, 16.67% | 648675 | 808.52 | 80535 | 100.38 |
gemini-1.5-flash-002 | - | - | - | - | Gemini | - | - | 2/2 | 22m 30s | 285/410 | 69.51% | 2/7, 28.57% | 648675 | 479.42 | 80221 | 59.29 |
Llama-3.2-90B-Vision-Instruct | meta-llama/Llama-3.2-90B-Vision-Instruct | - | 90B | HF | Azure | - | - | 1/2 | 33m 6s | 289/410 | 70.49% | 4/7, 57.14% | 640380 | 321.96 | 88997 | 44.74 |
Llama-3.2-90B-Vision-Instruct | meta-llama/Llama-3.2-90B-Vision-Instruct | - | 90B | HF | Azure | - | - | 2/2 | 31m 31s | 281/410 | 68.54% | 2/5, 40.00% | 640380 | 338.10 | 85381 | 45.08 |
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | Qwen/Qwen2.5-Coder-3B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 45880MiB | 1/7 | 41m 59s | 289/410 | 70.49% | | 656716 | 260.29 | 92126 | 36.51 |
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 40036MiB | 2/7 | 34m 24s | 286/410 | 69.76% | | 656716 | 317.48 | 89487 | 43.26 |
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | Qwen/Qwen2.5-Coder-3B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 45880MiB | 3/7 | 41m 27s | 283/410 | 69.02% | 0/1, 0.00% | 656716 | 263.62 | 90349 | 36.27 |
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | bartowski/Qwen2.5-Coder-7B-Instruct-exl2_8_0 | 32B | EXL2 | TabbyAPI | RTX 6000 | 43688MiB | 4/7 | 42m 32s | 283/410 | 69.02% | 0/1, 0.00% | 656716 | 256.77 | 90899 | 35.54 |
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | bartowski/Qwen2.5-Coder-7B-Instruct-exl2_8_0 | 32B | EXL2 | TabbyAPI | RTX 6000 | 43688MiB | 5/7 | 44m 34s | 282/410 | 68.78% | 0/1, 0.00% | 656716 | 245.24 | 96470 | 36.03 |
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 38620MiB | 6/7 | 1h 2m 8s | 282/410 | 68.78% | | 656716 | 175.98 | 92767 | 24.86 |
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 40036MiB | 7/7 | 34m 56s | 280/410 | 68.29% | | 656716 | 312.66 | 91926 | 43.76 |
Mistral-Large-Instruct-2411 (123B, 3.0bpw EXL2) | MikeRoz/mistralai_Mistral-Large-Instruct-2411-3.0bpw-h6-exl2 | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 47068MiB | 1/2 | 1h 26m 26s | 284/410 | 69.27% | 1/3, 33.33% | 696798 | 134.23 | 79925 | 15.40 |
Mistral-Large-Instruct-2411 (123B, 3.0bpw EXL2) | MikeRoz/mistralai_Mistral-Large-Instruct-2411-3.0bpw-h6-exl2 | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 47068MiB | 2/2 | 1h 26m 10s | 275/410 | 67.07% | 0/2, 0.00% | 696798 | 134.67 | 79778 | 15.42 |
🆕 Phi-4 (14B, Unsloth, GGUF) | unsloth/phi-4-GGUF | - | 14B | GGUF | llama.cpp | RTX 6000 | 31978MiB | 1/2 | 1h 19m 51s | 278/410 | 67.80% | 1/6, 16.67% | 639591 | 133.40 | 133610 | 27.87 |
🆕 Phi-4 (14B, Unsloth, GGUF) | unsloth/phi-4-GGUF | - | 14B | GGUF | llama.cpp | RTX 6000 | 31978MiB | 2/2 | 1h 19m 41s | 278/410 | 67.80% | 1/6, 16.67% | 639591 | 133.67 | 133610 | 27.92 |
🆕 Phi-4 (14B, Unsloth, HF) | unsloth/phi-4 | - | 14B | HF | TabbyAPI | RTX 6000 | | 1/2 | 1h 38m 29s | 274/410 | 66.83% | 1/3, 33.33% | 635081 | 107.42 | 113731 | 19.24 |
🆕 Phi-4 (14B, Unsloth, HF) | unsloth/phi-4 | - | 14B | HF | TabbyAPI | RTX 6000 | | 2/2 | 1h 39m 32s | 273/410 | 66.59% | 1/3, 33.33% | 635081 | 106.29 | 113712 | 19.03 |
🆕 Phi-4 (14B, Microsoft, HF) | microsoft/phi-4 | - | 14B | HF | TabbyAPI | RTX 6000 | 31394MiB | 1/2 | 1h 7m 44s | 272/410 | 66.34% | 1/3, 33.33% | 635081 | 156.15 | 113358 | 27.87 |
🆕 Phi-4 (14B, Microsoft, HF) | microsoft/phi-4 | - | 14B | HF | TabbyAPI | RTX 6000 | 31394MiB | 2/2 | 1h 7m 44s | 271/410 | 66.10% | 1/3, 33.33% | 635081 | 156.10 | 113384 | 27.87 |
Mistral-Large-Instruct-2407 (123B, 2.75bpw EXL2) | turboderp/Mistral-Large-Instruct-2407-123B-exl2_2.75bpw | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 45096MiB | 1/2 | 1h 8m 8s | 271/410 | 66.10% | | 696798 | 170.29 | 66670 | 16.29 |
Mistral-Large-Instruct-2407 (123B, 2.75bpw EXL2) | turboderp/Mistral-Large-Instruct-2407-123B-exl2_2.75bpw | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 45096MiB | 2/2 | 1h 10m 38s | 268/410 | 65.37% | 1/3, 33.33% | 696798 | 164.23 | 69182 | 16.31 |
🆕 Qwen2-VL-72B-Instruct (4.5bpw EXL2) | turboderp/Qwen2-VL-72B-Instruct-exl2_4.5bpw | - | 72B | EXL2 | TabbyAPI | RTX 6000 | 43554MiB | 1/2 | 1h 10m 51s | 255/410 | 62.20% | 0/3, 0.00% | 656716 | 154.36 | 71752 | 16.87 |
🆕 Qwen2-VL-72B-Instruct (4.5bpw EXL2) | turboderp/Qwen2-VL-72B-Instruct-exl2_4.5bpw | - | 72B | EXL2 | TabbyAPI | RTX 6000 | 43554MiB | 2/2 | 1h 26m 40s | 255/410 | 62.20% | 1/6, 16.67% | 656716 | 126.20 | 88249 | 16.96 |
Mistral-Large-Instruct-2411 (123B, 2.75bpw EXL2) | wolfram/Mistral-Large-Instruct-2411-2.75bpw-h6-exl2 | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 45096MiB | 1/2 | 1h 11m 50s | 267/410 | 65.12% | 1/4, 25.00% | 696798 | 161.53 | 70538 | 16.35 |
Mistral-Large-Instruct-2411 (123B, 2.75bpw EXL2) | wolfram/Mistral-Large-Instruct-2411-2.75bpw-h6-exl2 | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 45096MiB | 2/2 | 1h 13m 50s | 243/410 | 59.27% | 0/4, 0.00% | 696798 | 157.18 | 72718 | 16.40 |
Falcon3-10B-Instruct | tiiuae/Falcon3-10B-Instruct | - | 10B | HF | Ollama | RTX 6000 | 20906MiB | 1/2 | 35m 15s | 251/410 | 61.22% | 2/5, 40.00% | 702578 | 331.57 | 75501 | 35.63 |
Falcon3-10B-Instruct | tiiuae/Falcon3-10B-Instruct | - | 10B | HF | Ollama | RTX 6000 | 20906MiB | 2/2 | 35m 21s | 251/410 | 61.22% | 2/5, 40.00% | 702578 | 330.66 | 75501 | 35.53 |
mistral-small-2409 (22B) | mistralai/Mistral-Small-Instruct-2409 | - | 22B | HF | Mistral | - | - | 1/2 | 25m 3s | 243/410 | 59.27% | 1/4, 25.00% | 696798 | 462.38 | 73212 | 48.58 |
mistral-small-2409 (22B) | mistralai/Mistral-Small-Instruct-2409 | - | 22B | HF | Mistral | - | - | 2/2 | 20m 45s | 239/410 | 58.29% | 1/4, 25.00% | 696798 | 558.10 | 76017 | 60.89 |
🆕 Aya-Expanse-32B (8.0bpw EXL2) | lucyknada/CohereForAI_aya-expanse-32b-exl2_8.0bpw | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 33686MiB | 1/2 | 43m 18s | 212/410 | 51.71% | 0/1, 0.00% | 661930 | 254.04 | 60728 | 23.31 |
🆕 Aya-Expanse-32B (8.0bpw EXL2) | lucyknada/CohereForAI_aya-expanse-32b-exl2_8.0bpw | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 33686MiB | 2/2 | 42m 27s | 211/410 | 51.46% | 0/4, 0.00% | 661930 | 259.50 | 59557 | 23.35 |
- Model: Model name (with relevant parameter and setting details)
- HF Main Model Name: Full name of the tested model as listed on Hugging Face
- HF Draft Model Name (speculative decoding): Draft model used for speculative decoding (if applicable)
- Size: Parameter count
- Format: Model format type (HF, EXL2, etc.)
- API: Service provider (TabbyAPI indicates local deployment)
- GPU: Graphics card used for this benchmark run
- GPU Mem: VRAM allocated to model and configuration
- Run: Benchmark run sequence number
- Duration: Total runtime of benchmark
- Total: Number of correct answers (determines ranking!)
- %: Percentage of correct answers
- Correct Random Guesses: When MMLU-Pro cannot definitively identify a model's answer choice, it falls back to a random guess and reports both the number of these random guesses and their accuracy; a high proportion of random guessing indicates problems with following the response format (see the sketch after this list)
- Prompt tokens: Token count of input text
- tk/s: Tokens processed per second
- Completion tokens: Token count of generated response
- tk/s: Tokens generated per second
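To make the random-guess fallback mentioned above more concrete, here's a simplified sketch of the idea. This is not the actual MMLU-Pro evaluation code; the regex pattern and bookkeeping are my own illustrative assumptions:

```python
# Simplified sketch: try to extract a definitive answer letter; if none is
# found, fall back to a random guess and track how often that happens.
import random
import re

OPTIONS = "ABCDEFGHIJ"  # 10 answer options per MMLU-Pro question

def extract_choice(response):
    """Try to find a definitive answer letter in the model's response."""
    match = re.search(r"answer is \(?([A-J])\)?", response, re.IGNORECASE)
    return match.group(1).upper() if match else None

def grade(response, correct, stats):
    choice = extract_choice(response)
    if choice is None:
        # No parsable answer: random guess, counted separately, since a high
        # share of these indicates format-following problems.
        choice = random.choice(OPTIONS)
        stats["random_guesses"] += 1
        stats["random_correct"] += int(choice == correct)
    return choice == correct

stats = {"random_guesses": 0, "random_correct": 0}
print(grade("The answer is (C).", "C", stats), stats)
print(grade("I think it could be many things...", "C", stats), stats)
```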
Wolfram Ravenwolf is a German AI Engineer and an internationally active consultant and renowned researcher who's particularly passionate about local language models. You can follow him on X and Bluesky, read his previous LLM tests and comparisons on HF and Reddit, check out his models on Hugging Face, tip him on Ko-fi, or book him for a consultation.