Reproducing the results
We are attempting to reproduce the benchmark results reported in the leaderboard, but despite using the provided evaluation code we are observing significant discrepancies between our results and the reported scores.
For this, we used the provided evaluation code from the repository (here) and attempted to reproduce the results of gemma-7b, gemma-2-9b-it, Mistral-7B-v0.3, Mistral-7B-Instruct-v0.3, Llama-3.1-8B, and a few other models on the arcx and hellaswagx benchmarks.
However, we observed that our results consistently differ from the reported scores. For example, with gemma-7b on arcx we obtain 0.54, while the value reported on the leaderboard is 0.61. Similarly, Llama-3.1-8B gets 0.56 on hellaswagx, while the reported score for this model on the leaderboard is 0.59.
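To summarize the gaps we see so far (only the two examples above, taken from our runs versus the leaderboard):

```python
# Observed vs. reported accuracies from the examples above; the "gap" is the
# discrepancy we are trying to explain.
cases = [
    ("gemma-7b",     "arcx",       0.54, 0.61),
    ("Llama-3.1-8B", "hellaswagx", 0.56, 0.59),
]
for model, task, ours, reported in cases:
    print(f"{model:>13} / {task:<11} ours={ours:.2f} "
          f"leaderboard={reported:.2f} gap={reported - ours:+.2f}")
```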
It is worth noting that for these evaluations we kept the number of few-shot examples at the default value provided by each task. We also tried a few models with the few-shot counts reported in the Open LLM Leaderboard, for example 25-shot for arc (see the sketch below). While this setting brings our scores closer to those reported in the European LLM Leaderboard, there is still a significant gap whose cause we are investigating.
We were wondering whether there are any important details we are missing in our evaluations. Could you please share, in more detail, the configurations you used to run the evaluations for the leaderboard?
Would it be possible to provide a minimal working example that reproduces the reported results for one specific case? This would greatly help in identifying any potential differences in our evaluation setup.