zoe-yyx/CapaBench

CapaBench: Who's the MVP? A Game-Theoretic Evaluation Benchmark for Modular Attribution in LLM Agents

🌐 Website | 📃 Paper

Simplified Chinese | English

🔥 News

📖 Overview

Modular architectures in Large Language Model (LLM) agents integrate components like planning, reasoning, and reflection, yet quantifying their individual contributions remains challenging. We introduce CapaBench, a Shapley Value-based evaluation framework that systematically measures capability modules' marginal impacts. With 1,000+ multi-domain task scenarios, CapaBench enables combinatorial analysis through module substitution and interaction testing.
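To make the Shapley Value idea concrete, here is a minimal sketch of exact Shapley attribution over three capability modules. It is an illustration under stated assumptions, not the CapaBench implementation: the function names `shapley_values` and `success_rate`, the module list, and every number in the toy value function are hypothetical. The sketch assumes that a coalition `S` means "the modules in `S` are substituted with the model under test, and the rest stay on a default model", and that `success_rate(S)` is the task success rate of that configuration.

```python
from itertools import combinations
from math import factorial

# Module names taken from the overview; the list itself is an assumption.
MODULES = ["planning", "reasoning", "reflection"]


def success_rate(coalition):
    """Hypothetical value function v(S): made-up success rates, not real results.

    `coalition` is the set of modules substituted with the model under test.
    """
    rate = 0.2  # baseline: every module uses the default model
    if "planning" in coalition:
        rate += 0.3
    if "reasoning" in coalition:
        rate += 0.2
    if "reflection" in coalition:
        rate += 0.1
    # A toy interaction term: planning and reasoning help each other.
    if "planning" in coalition and "reasoning" in coalition:
        rate += 0.1
    return rate


def shapley_values(modules, value):
    """Exact Shapley values by enumerating every coalition.

    phi_i = sum over S not containing i of
            |S|! * (n - |S| - 1)! / n! * (v(S + {i}) - v(S))
    """
    n = len(modules)
    phi = {}
    for module in modules:
        others = [m for m in modules if m != module]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal impact of adding `module` to coalition `s`.
                total += weight * (value(s | {module}) - value(s))
        phi[module] = total
    return phi


phi = shapley_values(MODULES, success_rate)
for name, contribution in phi.items():
    print(f"{name}: {contribution:.2f}")
```

With these toy numbers the contributions sum to the gap between the all-substituted and all-default configurations (0.9 - 0.2 = 0.7), and the interaction bonus is split evenly between planning and reasoning. Exact enumeration evaluates every one of the 2^n module combinations, which is practical only for a small number of modules.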

📊 Data

Part of CapaBench is open-source. We also release the complete evaluation results for the models reported in the paper.

The rest of CapaBench is not open-source. For each of these benchmarks, we provide 5 problems and 1 trajectory per problem as examples.

📝 How to Evaluate

Part of CapaBench is open-source; the evaluation materials are coming soon!

📑 Citation

If you find our work useful, please cite us!

@misc{yang2025whosmvpgametheoreticevaluation,
      title={Who's the MVP? A Game-Theoretic Evaluation Benchmark for Modular Attribution in LLM Agents}, 
      author={Yingxuan Yang and Bo Huang and Siyuan Qi and Chao Feng and Haoyi Hu and Yuxuan Zhu and Jinbo Hu and Haoran Zhao and Ziyi He and Xiao Liu and Zongyu Wang and Lin Qiu and Xuezhi Cao and Xunliang Cai and Yong Yu and Weinan Zhang},
      year={2025},
      eprint={2502.00510},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2502.00510}, 
}

📧 Contact Us

If you have any questions, please feel free to contact us via email [email protected] and [email protected]
