- Leveraging Large Language Models for NLG Evaluation: A Survey, arXiv, 2401.07103, arxiv, pdf, citations: -1
  Zhen Li, Xiaohan Xu, Tao Shen, Can Xu, Jia-Chen Gu, Chongyang Tao
- ChatGPT's One-year Anniversary: Are Open-Source Large Language Models Catching up?, arXiv, 2311.16989, arxiv, pdf, citations: -1
  Hailin Chen, Fangkai Jiao, Xingxuan Li, Chengwei Qin, Mathieu Ravaut, Ruochen Zhao, Caiming Xiong, Shafiq Joty
- Evaluating Large Language Models: A Comprehensive Survey, arXiv, 2310.19736, arxiv, pdf, citations: -1
  Zishan Guo, Renren Jin, Chuang Liu, Yufei Huang, Dan Shi, Supryadi, Linhao Yu, Yan Liu, Jiaxuan Li, Bojian Xiong · (Awesome-LLMs-Evaluation-Papers - tjunlp-lab)
- Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment, arXiv, 2308.05374, arxiv, pdf, citations: 12
  Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, Hang Li · [jiqizhixin]
- A Survey on Evaluation of Large Language Models, arXiv, 2307.03109, arxiv, pdf, citations: -1
  Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang
- A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains, arXiv, 2402.00559, arxiv, pdf, citations: -1
  Alon Jacovi, Yonatan Bitton, Bernd Bohnet, Jonathan Herzig, Or Honovich, Michael Tseng, Michael Collins, Roee Aharoni, Mor Geva · (huggingface)
- E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models, arXiv, 2401.15927, arxiv, pdf, citations: -1
  Jinchang Hou, Chang Ao, Haihong Wu, Xiangtao Kong, Zhigang Zheng, Daijia Tang, Chengming Li, Xiping Hu, Ruifeng Xu, Shiwen Ni
- From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities, arXiv, 2401.15071, arxiv, pdf, citations: -1
  Chaochao Lu, Chen Qian, Guodong Zheng, Hongxing Fan, Hongzhi Gao, Jie Zhang, Jing Shao, Jingyi Deng, Jinlan Fu, Kexin Huang
- Benchmarking LLMs via Uncertainty Quantification, arXiv, 2401.12794, arxiv, pdf, citations: -1
  Fanghua Ye, Mingming Yang, Jianhui Pang, Longyue Wang, Derek F. Wong, Emine Yilmaz, Shuming Shi, Zhaopeng Tu · (llm-uncertainty-bench - smartyfh)
- CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark, arXiv, 2401.11944, arxiv, pdf, citations: -1
  Ge Zhang, Xinrun Du, Bei Chen, Yiming Liang, Tongxu Luo, Tianyu Zheng, Kang Zhu, Yuyang Cheng, Chunpu Xu, Shuyue Guo
- AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models, arXiv, 2401.09002, arxiv, pdf, citations: -1
  Dong Shu, Mingyu Jin, Suiyuan Zhu, Beichen Wang, Zihao Zhou, Chong Zhang, Yongfeng Zhang
- Narrowing the Knowledge Evaluation Gap: Open-Domain Question Answering with Multi-Granularity Answers, arXiv, 2401.04695, arxiv, pdf, citations: -1
  Gal Yona, Roee Aharoni, Mor Geva
- CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution, arXiv, 2401.03065, arxiv, pdf, citations: -1
  Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, Sida I. Wang
- Has Your Pretrained Model Improved? A Multi-head Posterior Based Approach, arXiv, 2401.02987, arxiv, pdf, citations: -1
  Prince Aboagye, Yan Zheng, Junpeng Wang, Uday Singh Saini, Xin Dai, Michael Yeh, Yujie Fan, Zhongfang Zhuang, Shubham Jain, Liang Wang
- Can AI Be as Creative as Humans?, arXiv, 2401.01623, arxiv, pdf, citations: -1
  Haonan Wang, James Zou, Michael Mozer, Anirudh Goyal, Alex Lamb, Linjun Zhang, Weijie J Su, Zhun Deng, Michael Qizhe Xie, Hannah Brown
- Task Contamination: Language Models May Not Be Few-Shot Anymore, arXiv, 2312.16337, arxiv, pdf, citations: -1
  Changmao Li, Jeffrey Flanigan · (mp.weixin.qq)
- Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language Models, arXiv, 2312.17661, arxiv, pdf, citations: -1
  Yuqing Wang, Yun Zhao
- Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks, arXiv, 2311.09247, arxiv, pdf, citations: -1
  Melanie Mitchell, Alessandro B. Palmarini, Arseny Moskvichev · (mp.weixin.qq)
- Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases, arXiv, 2312.15011, arxiv, pdf, citations: -1
  Zhangyang Qi, Ye Fang, Mengchen Zhang, Zeyi Sun, Tong Wu, Ziwei Liu, Dahua Lin, Jiaqi Wang, Hengshuang Zhao · (gemini-vs-gpt4v - qi-zhangyang)
- Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4, arXiv, 2312.16171, arxiv, pdf, citations: -1
  Sondos Mahmoud Bsharat, Aidar Myrzakhan, Zhiqiang Shen · (ATLAS - VILA-Lab)
- LLM4VG: Large Language Models Evaluation for Video Grounding, arXiv, 2312.14206, arxiv, pdf, citations: -1
  Wei Feng, Xin Wang, Hong Chen, Zeyang Zhang, Zihan Song, Yuwei Zhou, Wenwu Zhu
- Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model, arXiv, 2312.12423, arxiv, pdf, citations: -1
  Shraman Pramanick, Guangxing Han, Rui Hou, Sayan Nag, Ser-Nam Lim, Nicolas Ballas, Qifan Wang, Rama Chellappa, Amjad Almahairi · (shramanpramanick.github)
- A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise, arXiv, 2312.12436, arxiv, pdf, citations: -1
  Chaoyou Fu, Renrui Zhang, Zihan Wang, Yubo Huang, Zhengye Zhang, Longtian Qiu, Gaoxiang Ye, Yunhang Shen, Mengdan Zhang, Peixian Chen · (Awesome-Multimodal-Large-Language-Models - BradyFU) · (mp.weixin.qq)
- An In-depth Look at Gemini's Language Abilities, arXiv, 2312.11444, arxiv, pdf, citations: -1
  Syeda Nahida Akter, Zichun Yu, Aashiq Muhamed, Tianyue Ou, Alex Bäuerle, Ángel Alexander Cabrera, Krish Dholakia, Chenyan Xiong, Graham Neubig · (gemini-benchmark - neulab)
- Catwalk: A Unified Language Model Evaluation Framework for Many Datasets, arXiv, 2312.10253, arxiv, pdf, citations: -1
  Dirk Groeneveld, Anas Awadalla, Iz Beltagy, Akshita Bhagia, Ian Magnusson, Hao Peng, Oyvind Tafjord, Pete Walsh, Kyle Richardson, Jesse Dodge · (catwalk - allenai)
- PromptBench: A Unified Library for Evaluation of Large Language Models, arXiv, 2312.07910, arxiv, pdf, citations: -1
  Kaijie Zhu, Qinlin Zhao, Hao Chen, Jindong Wang, Xing Xie · (promptbench - microsoft)
- CritiqueLLM: Scaling LLM-as-Critic for Effective and Explainable Evaluation of Large Language Model Generation, arXiv, 2311.18702, arxiv, pdf, citations: -1
  Pei Ke, Bosi Wen, Zhuoer Feng, Xiao Liu, Xuanyu Lei, Jiale Cheng, Shengyuan Wang, Aohan Zeng, Yuxiao Dong, Hongning Wang · (critiquellm - thu-coai)
- How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation, arXiv, 2312.07424, arxiv, pdf, citations: -1
  Zhongyi Han, Guanglin Zhou, Rundong He, Jindong Wang, Tailin Wu, Yilong Yin, Salman Khan, Lina Yao, Tongliang Liu, Kun Zhang · (gpt-4v-distribution-shift - jameszhou-gl)
- Catch me if you can! How to beat GPT-4 with a 13B model | LMSYS Org · (youtube)
- Evaluating and Mitigating Discrimination in Language Model Decisions, arXiv, 2312.03689, arxiv, pdf, citations: -1
  Alex Tamkin, Amanda Askell, Liane Lovitt, Esin Durmus, Nicholas Joseph, Shauna Kravec, Karina Nguyen, Jared Kaplan, Deep Ganguli · (huggingface)
- Instruction-Following Evaluation for Large Language Models, arXiv, 2311.07911, arxiv, pdf, citations: -1
  Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, Le Hou · (google-research - google-research)
- Unnatural Error Correction: GPT-4 Can Almost Perfectly Handle Unnatural Scrambled Text, arXiv, 2311.18805, arxiv, pdf, citations: -1
  Qi Cao, Takeshi Kojima, Yutaka Matsuo, Yusuke Iwasawa · (qbitai)
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark, arXiv, 2311.12022, arxiv, pdf, citations: -1
  David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, Samuel R. Bowman
- SelfEval: Leveraging the discriminative nature of generative models for evaluation, arXiv, 2311.10708, arxiv, pdf, citations: -1
  Sai Saketh Rambhatla, Ishan Misra
- Rethinking Benchmark and Contamination for Language Models with Rephrased Samples, arXiv, 2311.04850, arxiv, pdf, citations: -1
  Shuo Yang, Wei-Lin Chiang, Lianmin Zheng, Joseph E. Gonzalez, Ion Stoica · (llm-decontaminator - lm-sys) · (jiqizhixin)
- Fusion-Eval: Integrating Evaluators with LLMs, arXiv, 2311.09204, arxiv, pdf, citations: -1
  Lei Shu, Nevan Wichers, Liangchen Luo, Yun Zhu, Yinxiao Liu, Jindong Chen, Lei Meng
- Llamas Know What GPTs Don't Show: Surrogate Models for Confidence Estimation, arXiv, 2311.08877, arxiv, pdf, citations: -1
  Vaishnavi Shrivastava, Percy Liang, Ananya Kumar
- MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks, arXiv, 2311.07463, arxiv, pdf, citations: -1
  Sanchit Ahuja, Divyanshu Aggarwal, Varun Gumma, Ishaan Watts, Ashutosh Sathe, Millicent Ochieng, Rishav Hada, Prachi Jain, Maxamed Axmed, Kalika Bali
- Generative Judge for Evaluating Alignment, arXiv, 2310.05470, arxiv, pdf, citations: -1
  Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, Hai Zhao, Pengfei Liu · (auto-j - GAIR-NLP) · (gair-nlp.github)
- Can LLMs Follow Simple Rules?, arXiv, 2311.04235, arxiv, pdf, citations: -1
  Norman Mu, Sarah Chen, Zifan Wang, Sizhe Chen, David Karamardian, Lulwa Aljeraisy, Dan Hendrycks, David Wagner · (llm_rules - normster)
- Don't Make Your LLM an Evaluation Benchmark Cheater, arXiv, 2311.01964, arxiv, pdf, citations: -1
  Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong Wen, Jiawei Han
- PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion, arXiv, 2311.01767, arxiv, pdf, citations: -1
  Yiduo Guo, Zekai Zhang, Yaobo Liang, Dongyan Zhao, Nan Duan
- GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond, arXiv, 2309.16583, arxiv, pdf, citations: 1
  Shen Zheng, Yuyu Zhang, Yijie Zhu, Chenguang Xi, Pengyang Gao, Xun Zhou, Kevin Chen-Chuan Chang · (gpt-fathom - gpt-fathom) · [qbitai]
- Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models, arXiv, 2310.20499, arxiv, pdf, citations: -1
  Tian Liang, Zhiwei He, Jen-tse Huang, Wenxuan Wang, Wenxiang Jiao, Rui Wang, Yujiu Yang, Zhaopeng Tu, Shuming Shi, Xing Wang
- Does GPT-4 Pass the Turing Test?, arXiv, 2310.20216, arxiv, pdf, citations: -1
  Cameron Jones, Benjamin Bergen
- ALCUNA: Large Language Models Meet New Knowledge, arXiv, 2310.14820, arxiv, pdf, citations: -1
  Xunjian Yin, Baizhou Huang, Xiaojun Wan
- Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks, arXiv, 2310.13800, arxiv, pdf, citations: 2
  Andrea Sottana, Bin Liang, Kai Zou, Zheng Yuan
- A Framework for Automated Measurement of Responsible AI Harms in Generative AI Applications, arXiv, 2310.17750, arxiv, pdf, citations: -1
  Ahmed Magooda, Alec Helyar, Kyle Jackson, David Sullivan, Chad Atalla, Emily Sheng, Dan Vann, Richard Edgar, Hamid Palangi, Roman Lutz
- Exploring OCR Capabilities of GPT-4V(ision): A Quantitative and In-depth Evaluation, arXiv, 2310.16809, arxiv, pdf, citations: 2
  Yongxin Shi, Dezhi Peng, Wenhui Liao, Zening Lin, Xinhong Chen, Chongyu Liu, Yuyi Zhang, Lianwen Jin · (gpt-4v_ocr - scut-dlvclab)
- JudgeLM: Fine-tuned Large Language Models are Scalable Judges, arXiv, 2310.17631, arxiv, pdf, citations: -1
  Lianghui Zhu, Xinggang Wang, Xinlong Wang
- An Early Evaluation of GPT-4V(ision), arXiv, 2310.16534, arxiv, pdf, citations: 1
  Yang Wu, Shilong Wang, Hao Yang, Tian Zheng, Hongbo Zhang, Yanyan Zhao, Bing Qin
- The Foundation Model Transparency Index, arXiv, 2310.12941, arxiv, pdf, citations: 3
  Rishi Bommasani, Kevin Klyman, Shayne Longpre, Sayash Kapoor, Nestor Maslej, Betty Xiong, Daniel Zhang, Percy Liang · [qbitai] · [crfm.stanford]
- ReForm-Eval: Evaluating Large Vision Language Models via Unified Re-Formulation of Task-Oriented Benchmarks, arXiv, 2310.02569, arxiv, pdf, citations: -1
  Zejun Li, Ye Wang, Mengfei Du, Qingwen Liu, Binhao Wu, Jiwen Zhang, Chengxing Zhou, Zhihao Fan, Jie Fu, Jingjing Chen · [jiqizhixin]
- CLEVA: Chinese Language Models EVAluation Platform, arXiv, 2308.04813, arxiv, pdf, citations: -1
  Yanyang Li, Jianqiao Zhao, Duo Zheng, Zi-Yuan Hu, Zhi Chen, Xiaohui Su, Yongfeng Huang, Shijia Huang, Dahua Lin, Michael R. Lyu · [qbitai]
- Prometheus: Inducing Fine-grained Evaluation Capability in Language Models, arXiv, 2310.08491, arxiv, pdf, citations: -1
  Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne
- A Closer Look into Automatic Evaluation Using Large Language Models, arXiv, 2310.05657, arxiv, pdf, citations: -1
  Cheng-Han Chiang, Hung-yi Lee · (mp.weixin.qq) · (A-Closer-Look-To-LLM-Evaluation - d223302)
- Probing the Moral Development of Large Language Models through Defining Issues Test, arXiv, 2309.13356, arxiv, pdf, citations: -1
  Kumar Tanmay, Aditi Khandelwal, Utkarsh Agarwal, Monojit Choudhury · [mp.weixin.qq]
- Calibrating LLM-Based Evaluator, arXiv, 2309.13308, arxiv, pdf, citations: -1
  Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang
- Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data?, arXiv, 2309.08963, arxiv, pdf, citations: 1
  Xiangru Tang, Yiming Zong, Jason Phang, Yilun Zhao, Wangchunshu Zhou, Arman Cohan, Mark Gerstein
- Investigating Answerability of LLMs for Long-Form Question Answering, arXiv, 2309.08210, arxiv, pdf, citations: -1
  Meghana Moorthy Bhat, Rui Meng, Ye Liu, Yingbo Zhou, Semih Yavuz
- Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?, arXiv, 2309.07462, arxiv, pdf, citations: 4
  Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, Sunayana Sitaram
- Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs, arXiv, 2308.13387, arxiv, pdf, citations: 3
  Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, Timothy Baldwin · [jiqizhixin]
- DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models, arXiv, 2306.11698, arxiv, pdf, citations: -1
  Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, arXiv, 2306.05685, arxiv, pdf, citations: 136
  Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric. P Xing · [twitter] · [lmsys]
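  As an illustration of the pairwise LLM-as-a-judge protocol this paper popularized, here is a minimal sketch assuming an OpenAI-style chat client; the prompt wording, model name, and verdict parsing are simplified placeholders, not the paper's exact setup.

  ```python
  # Minimal pairwise LLM-as-a-judge sketch (illustrative; not MT-Bench's exact prompt).
  from openai import OpenAI

  client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

  JUDGE_PROMPT = """You are an impartial judge. Compare two assistant answers to the
  same user question. Reply with exactly one verdict: "A", "B", or "tie".

  [Question]
  {question}

  [Answer A]
  {answer_a}

  [Answer B]
  {answer_b}
  """

  def judge_pair(question: str, answer_a: str, answer_b: str, model: str = "gpt-4") -> str:
      """Ask a strong LLM which answer is better; returns 'A', 'B', or 'tie'."""
      response = client.chat.completions.create(
          model=model,
          temperature=0,  # deterministic verdicts are easier to reproduce
          messages=[{"role": "user", "content": JUDGE_PROMPT.format(
              question=question, answer_a=answer_a, answer_b=answer_b)}],
      )
      verdict = response.choices[0].message.content.strip().lower()
      return "tie" if "tie" in verdict else ("A" if verdict.startswith("a") else "B")

  # The paper documents position bias in LLM judges: in practice, judge both
  # orderings and only count a win if the verdict survives swapping A and B.
  ```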
- The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain, arXiv, 2305.07141, arxiv, pdf, citations: 10
  Arseny Moskvichev, Victor Vikram Odouard, Melanie Mitchell · [mp.weixin.qq]
- MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities, arXiv, 2308.02490, arxiv, pdf, citations: 10
  Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, Lijuan Wang
- ARB: Advanced Reasoning Benchmark for Large Language Models, arXiv, 2307.13692, arxiv, pdf, citations: 6
  Tomohiro Sawada, Daniel Paleka, Alexander Havrilla, Pranav Tadepalli, Paula Vidas, Alexander Kranias, John J. Nay, Kshitij Gupta, Aran Komatsuzaki
- L-Eval: Instituting Standardized Evaluation for Long Context Language Models, arXiv, 2307.11088, arxiv, pdf, citations: -1
  Chenxin An, Shansan Gong, Ming Zhong, Xingjian Zhao, Mukai Li, Jun Zhang, Lingpeng Kong, Xipeng Qiu · (leval - openlmlab)
- Instruction-following Evaluation through Verbalizer Manipulation, arXiv, 2307.10558, arxiv, pdf, citations: 4
  Shiyang Li, Jun Yan, Hai Wang, Zheng Tang, Xiang Ren, Vijay Srinivasan, Hongxia Jin
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets, arXiv, 2307.10928, arxiv, pdf, citations: 5
  Seonghyeon Ye, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, Minjoon Seo
- How is ChatGPT's behavior changing over time?, arXiv, 2307.09009, arxiv, pdf, citations: 64
  Lingjiao Chen, Matei Zaharia, James Zou
- Generating Benchmarks for Factuality Evaluation of Language Models, arXiv, 2307.06908, arxiv, pdf, citations: 6
  Dor Muhlgay, Ori Ram, Inbal Magar, Yoav Levine, Nir Ratner, Yonatan Belinkov, Omri Abend, Kevin Leyton-Brown, Amnon Shashua, Yoav Shoham · (factor - AI21Labs)
- PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts, arXiv, 2306.04528, arxiv, pdf, citations: 32
  Kaijie Zhu, Jindong Wang, Jiaheng Zhou, Zichen Wang, Hao Chen, Yidong Wang, Linyi Yang, Wei Ye, Yue Zhang, Neil Zhenqiang Gong · (promptbench - microsoft)
- Empowering Cross-lingual Behavioral Testing of NLP Models with Typological Features, arXiv, 2307.05454, arxiv, pdf, citations: -1
  Ester Hlavnova, Sebastian Ruder
- GLUE-X: Evaluating Natural Language Understanding Models from an Out-of-distribution Generalization Perspective, arXiv, 2211.08073, arxiv, pdf, citations: 21
  Linyi Yang, Shuibai Zhang, Libo Qin, Yafu Li, Yidong Wang, Hanmeng Liu, Jindong Wang, Xing Xie, Yue Zhang
- M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models, arXiv, 2306.05179, arxiv, pdf, citations: 7
  Wenxuan Zhang, Sharifah Mahani Aljunied, Chang Gao, Yew Ken Chia, Lidong Bing · [jiqizhixin] · (M3Exam - DAMO-NLP-SG)
- MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models, arXiv, 2306.13394, arxiv, pdf, citations: 32
  Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng · (Awesome-Multimodal-Large-Language-Models - BradyFU) · [mp.weixin.qq]
- Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors, arXiv, 2023, arxiv, pdf, citations: 8
  Tung Phung, Victor-Alexandru Pădurean, José Cambronero, Sumit Gulwani, Tobias Kohn, Rupak Majumdar, Adish Singla, Gustavo Soares
- Benchmarking Large Language Model Capabilities for Conditional Generation, arXiv, 2306.16793, arxiv, pdf, citations: 2
  Joshua Maynez, Priyanka Agrawal, Sebastian Gehrmann
- CMMLU: Measuring massive multitask language understanding in Chinese, arXiv, 2306.09212, arxiv, pdf, citations: 14
  Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, Timothy Baldwin · [jiqizhixin] · (CMMLU - haonan-li)
- Bring Your Own Data! Self-Supervised Evaluation for Large Language Models, arXiv, 2306.13651, arxiv, pdf, citations: 7
  Neel Jain, Khalid Saifullah, Yuxin Wen, John Kirchenbauer, Manli Shu, Aniruddha Saha, Micah Goldblum, Jonas Geiping, Tom Goldstein · (byod - neelsjain)
- INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models, arXiv, 2306.04757, arxiv, pdf, citations: 19
  Yew Ken Chia, Pengfei Hong, Lidong Bing, Soujanya Poria · [jiqizhixin]
- NPHardEval-leaderboard - NPHardEval 🤗 · (huggingface)
- Introducing the Enterprise Scenarios Leaderboard: a Leaderboard for Real World Use Cases
- finetuning-subnet - NousResearch · (huggingface)
- llm_contamination_detector - Yeyito 🤗
- detect-pretrain-code-contamination - swj0419
- deepeval - confident-ai
  The Evaluation Framework for LLMs.
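  To show what a unit-test-style LLM evaluation with deepeval looks like, here is a minimal sketch following the project's documented pytest pattern; the threshold, example strings, and metric choice are placeholders, and class names may shift between releases.

  ```python
  # Minimal deepeval sketch: score one LLM response with a built-in metric
  # and assert that it passes. Based on deepeval's documented pytest-style
  # usage; verify names against the installed version.
  from deepeval import assert_test
  from deepeval.metrics import AnswerRelevancyMetric
  from deepeval.test_case import LLMTestCase

  def test_answer_relevancy():
      metric = AnswerRelevancyMetric(threshold=0.7)  # placeholder pass threshold
      test_case = LLMTestCase(
          input="What if these shoes don't fit?",                     # user query
          actual_output="We offer a 30-day full refund at no cost.",  # response under test
      )
      assert_test(test_case, [metric])
  ```

  Tests written this way are collected and run through deepeval's CLI wrapper around pytest, so failing metric thresholds surface as ordinary test failures.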
- Nexus_Function_Calling_Leaderboard - Nexusflow 🤗
- LLM Leaderboard best models ❤️🔥 - an open-llm-leaderboard Collection
- opencompass - InternLM
  OpenCompass is an LLM evaluation platform supporting a wide range of models (LLaMA, ChatGLM2, ChatGPT, Claude, etc.) over 50+ datasets. · [opencompass.org]
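  For a sense of how an OpenCompass run is wired up, here is a sketch of a config in the project's Python-config style, pairing model definitions with dataset definitions; the module paths below are assumptions modeled on the repo's demo configs and should be checked against the current layout.

  ```python
  # Sketch of an OpenCompass evaluation config. The dataset/model import
  # paths are illustrative assumptions; check the configs/ directory of the
  # current release for the real ones.
  from mmengine.config import read_base

  with read_base():
      # Pull in predefined dataset and model definitions shipped with the repo.
      from .datasets.siqa.siqa_gen import siqa_datasets
      from .models.opt.hf_opt_125m import opt125m

  datasets = [*siqa_datasets]  # which benchmarks to run
  models = [opt125m]           # which models to evaluate

  # Launched via the repo's runner, e.g.: python run.py configs/eval_demo.py
  ```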
- FlagAI - FlagAI-Open · [qbitai]
- toolqa - night-chen
  ToolQA, a dataset for evaluating the ability of LLMs to answer challenging questions with external tools. It offers two difficulty levels (easy/hard) across eight real-life scenarios.
- glue-x - yanglinyi
  Leverages 14 datasets as OOD test data and conducts evaluations on 8 NLU tasks over 21 widely used models.
- alpaca_farm - tatsu-lab
  A simulation framework for RLHF and alternatives. · [jiqizhixin] · [mp.weixin.qq]
- Open LLM Leaderboard - a Hugging Face Space by HuggingFaceH4 · [huggingface]
- Challenges in evaluating AI systems | Anthropic · [jiqizhixin]
- How Long Can Open-Source LLMs Truly Promise on Context Length? | LMSYS Org · [mp.weixin.qq]
- Chatbot Arena Leaderboard Week 8: Introducing MT-Bench and Vicuna-33B | LMSYS Org · [mp.weixin.qq]
- awesome-llms-evaluation-papers - tjunlp-lab
  The papers are organized according to our survey: Evaluating Large Language Models: A Comprehensive Survey. · [jiqizhixin]