@@ -467,31 +467,6 @@ Authorization: Bearer your_secret_api_key
 ## SOTA results on benchmarks with optillm
 
-### CePO on math and code benchmarks (Mar 2025)
-
-| Method                         | Math-L5  | MMLU-Pro (Math) | CRUX     | LiveCodeBench (pass@1) | Simple QA |
-| -----------------------------: | :------: | :-------------: | :------: | :--------------------: | :-------: |
-| Llama 3.3 70B                  | 51.0     | 78.6            | 72.6     | 27.1                   | 20.9      |
-| Llama 3.1 405B                 | 49.8     | 79.2            | 73.0     | 31.8                   | 13.5      |
-| CePO (using Llama 3.3 70B)     | 69.6     | 84.8            | 80.1     | 31.9                   | **22.6**  |
-| QwQ 32B                        | 61.4     | 90.8            | 82.5     | 44.3                   | 7.8       |
-| CePO (using QwQ 32B)           | 88.1     | **92.0**        | 86.3     | **51.5**               | 8.2       |
-| DeepSeek R1 Llama              | 83.1     | 82.0            | 84.0     | 47.3                   | 14.6      |
-| CePO (using DeepSeek R1 Llama) | **90.2** | 84.0            | **89.4** | 47.2                   | 15.5      |
-
-### coc-claude-3-5-sonnet-20241022 on AIME 2024 pass@1 (Nov 2024)
-
-| Model                          | Score |
-| ------------------------------ | ----: |
-| o1-mini                        | 56.67 |
-| coc-claude-3-5-sonnet-20241022 | 46.67 |
-| coc-gemini/gemini-exp-1121     | 46.67 |
-| o1-preview                     | 40.00 |
-| gemini-exp-1114                | 36.67 |
-| claude-3-5-sonnet-20241022     | 20.00 |
-| gemini-1.5-pro-002             | 20.00 |
-| gemini-1.5-flash-002           | 16.67 |
-
 ### LongCePO on LongBench v2 (Apr 2025)
 
 | Model¹ | Context window | Short samples (up to 32K words) | Medium samples (32–128K words) |
@@ -518,6 +493,31 @@ Authorization: Bearer your_secret_api_key
 
 ¹ Numbers in parentheses for LongCePO indicate accuracy of majority voting from 5 runs.
 
+### CePO on math and code benchmarks (Mar 2025)
+
+| Method                         | Math-L5  | MMLU-Pro (Math) | CRUX     | LiveCodeBench (pass@1) | Simple QA |
+| -----------------------------: | :------: | :-------------: | :------: | :--------------------: | :-------: |
+| Llama 3.3 70B                  | 51.0     | 78.6            | 72.6     | 27.1                   | 20.9      |
+| Llama 3.1 405B                 | 49.8     | 79.2            | 73.0     | 31.8                   | 13.5      |
+| CePO (using Llama 3.3 70B)     | 69.6     | 84.8            | 80.1     | 31.9                   | **22.6**  |
+| QwQ 32B                        | 61.4     | 90.8            | 82.5     | 44.3                   | 7.8       |
+| CePO (using QwQ 32B)           | 88.1     | **92.0**        | 86.3     | **51.5**               | 8.2       |
+| DeepSeek R1 Llama              | 83.1     | 82.0            | 84.0     | 47.3                   | 14.6      |
+| CePO (using DeepSeek R1 Llama) | **90.2** | 84.0            | **89.4** | 47.2                   | 15.5      |
+
+### coc-claude-3-5-sonnet-20241022 on AIME 2024 pass@1 (Nov 2024)
+
+| Model                          | Score |
+| ------------------------------ | ----: |
+| o1-mini                        | 56.67 |
+| coc-claude-3-5-sonnet-20241022 | 46.67 |
+| coc-gemini/gemini-exp-1121     | 46.67 |
+| o1-preview                     | 40.00 |
+| gemini-exp-1114                | 36.67 |
+| claude-3-5-sonnet-20241022     | 20.00 |
+| gemini-1.5-pro-002             | 20.00 |
+| gemini-1.5-flash-002           | 16.67 |
+
 ### readurls&memory-gpt-4o-mini on Google FRAMES Benchmark (Oct 2024)
 
 | Model | Accuracy |
 | ----- | -------- |