Code for staging LLM evaluation benchmarks in a variety of standard formats, so that they can be evaluated in a common way.
This library focuses on reading and writing benchmark data, but
it also includes one example benchmark dataset in data/eng for
illustration. Please do not use these files for fine-tuning:
training on them would compromise their ability to measure LLM
performance fairly.
CommonEval © 2025 by Biblica, Inc is licensed under CC BY 4.0.