Awesome LLM Evaluation

Survey

  • Leveraging Large Language Models for NLG Evaluation: A Survey, arXiv, 2401.07103, arxiv, pdf, citation: -1

    Zhen Li, Xiaohan Xu, Tao Shen, Can Xu, Jia-Chen Gu, Chongyang Tao

  • ChatGPT's One-year Anniversary: Are Open-Source Large Language Models Catching up?, arXiv, 2311.16989, arxiv, pdf, citation: -1

    Hailin Chen, Fangkai Jiao, Xingxuan Li, Chengwei Qin, Mathieu Ravaut, Ruochen Zhao, Caiming Xiong, Shafiq Joty

  • Evaluating Large Language Models: A Comprehensive Survey, arXiv, 2310.19736, arxiv, pdf, citation: -1

    Zishan Guo, Renren Jin, Chuang Liu, Yufei Huang, Dan Shi, Supryadi, Linhao Yu, Yan Liu, Jiaxuan Li, Bojian Xiong · (Awesome-LLMs-Evaluation-Papers - tjunlp-lab)

  • Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment, arXiv, 2308.05374, arxiv, pdf, citation: 12

    Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, Hang Li · (jiqizhixin)

  • A Survey on Evaluation of Large Language Models, arXiv, 2307.03109, arxiv, pdf, citation: -1

    Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang

Papers

  • A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains, arXiv, 2402.00559, arxiv, pdf, citation: -1

    Alon Jacovi, Yonatan Bitton, Bernd Bohnet, Jonathan Herzig, Or Honovich, Michael Tseng, Michael Collins, Roee Aharoni, Mor Geva · (huggingface)

  • E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models, arXiv, 2401.15927, arxiv, pdf, citation: -1

    Jinchang Hou, Chang Ao, Haihong Wu, Xiangtao Kong, Zhigang Zheng, Daijia Tang, Chengming Li, Xiping Hu, Ruifeng Xu, Shiwen Ni

  • From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities, arXiv, 2401.15071, arxiv, pdf, citation: -1

    Chaochao Lu, Chen Qian, Guodong Zheng, Hongxing Fan, Hongzhi Gao, Jie Zhang, Jing Shao, Jingyi Deng, Jinlan Fu, Kexin Huang

  • Benchmarking LLMs via Uncertainty Quantification, arXiv, 2401.12794, arxiv, pdf, citation: -1

    Fanghua Ye, Mingming Yang, Jianhui Pang, Longyue Wang, Derek F. Wong, Emine Yilmaz, Shuming Shi, Zhaopeng Tu · (llm-uncertainty-bench - smartyfh) · (see the predictive-entropy sketch after this list)

  • CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark, arXiv, 2401.11944, arxiv, pdf, citation: -1

    Ge Zhang, Xinrun Du, Bei Chen, Yiming Liang, Tongxu Luo, Tianyu Zheng, Kang Zhu, Yuyang Cheng, Chunpu Xu, Shuyue Guo

  • AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models, arXiv, 2401.09002, arxiv, pdf, citation: -1

    Dong Shu, Mingyu Jin, Suiyuan Zhu, Beichen Wang, Zihao Zhou, Chong Zhang, Yongfeng Zhang

  • Narrowing the Knowledge Evaluation Gap: Open-Domain Question Answering with Multi-Granularity Answers, arXiv, 2401.04695, arxiv, pdf, citation: -1

    Gal Yona, Roee Aharoni, Mor Geva

  • CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution, arXiv, 2401.03065, arxiv, pdf, citation: -1

    Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, Sida I. Wang

  • Has Your Pretrained Model Improved? A Multi-head Posterior Based Approach, arXiv, 2401.02987, arxiv, pdf, citation: -1

    Prince Aboagye, Yan Zheng, Junpeng Wang, Uday Singh Saini, Xin Dai, Michael Yeh, Yujie Fan, Zhongfang Zhuang, Shubham Jain, Liang Wang

  • Can AI Be as Creative as Humans?, arXiv, 2401.01623, arxiv, pdf, citation: -1

    Haonan Wang, James Zou, Michael Mozer, Anirudh Goyal, Alex Lamb, Linjun Zhang, Weijie J Su, Zhun Deng, Michael Qizhe Xie, Hannah Brown

  • Task Contamination: Language Models May Not Be Few-Shot Anymore, arXiv, 2312.16337, arxiv, pdf, citation: -1

    Changmao Li, Jeffrey Flanigan · (mp.weixin.qq)

  • Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language Models, arXiv, 2312.17661, arxiv, pdf, citation: -1

    Yuqing Wang, Yun Zhao

  • Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks, arXiv, 2311.09247, arxiv, pdf, citation: -1

    Melanie Mitchell, Alessandro B. Palmarini, Arseny Moskvichev · (mp.weixin.qq)

  • Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases, arXiv, 2312.15011, arxiv, pdf, citation: -1

    Zhangyang Qi, Ye Fang, Mengchen Zhang, Zeyi Sun, Tong Wu, Ziwei Liu, Dahua Lin, Jiaqi Wang, Hengshuang Zhao

    · (gemini-vs-gpt4v - qi-zhangyang)

  • Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4, arXiv, 2312.16171, arxiv, pdf, citation: -1

    Sondos Mahmoud Bsharat, Aidar Myrzakhan, Zhiqiang Shen

    · (ATLAS - VILA-Lab)

  • LLM4VG: Large Language Models Evaluation for Video Grounding, arXiv, 2312.14206, arxiv, pdf, citation: -1

    Wei Feng, Xin Wang, Hong Chen, Zeyang Zhang, Zihan Song, Yuwei Zhou, Wenwu Zhu

  • Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model, arXiv, 2312.12423, arxiv, pdf, citation: -1

    Shraman Pramanick, Guangxing Han, Rui Hou, Sayan Nag, Ser-Nam Lim, Nicolas Ballas, Qifan Wang, Rama Chellappa, Amjad Almahairi · (shramanpramanick.github)

  • A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise, arXiv, 2312.12436, arxiv, pdf, citation: -1

    Chaoyou Fu, Renrui Zhang, Zihan Wang, Yubo Huang, Zhengye Zhang, Longtian Qiu, Gaoxiang Ye, Yunhang Shen, Mengdan Zhang, Peixian Chen · (Awesome-Multimodal-Large-Language-Models - BradyFU)

    · (mp.weixin.qq)

    · (mp.weixin.qq)

  • An In-depth Look at Gemini's Language Abilities, arXiv, 2312.11444, arxiv, pdf, citation: -1

    Syeda Nahida Akter, Zichun Yu, Aashiq Muhamed, Tianyue Ou, Alex Bäuerle, Ángel Alexander Cabrera, Krish Dholakia, Chenyan Xiong, Graham Neubig · (gemini-benchmark - neulab)

  • Catwalk: A Unified Language Model Evaluation Framework for Many Datasets, arXiv, 2312.10253, arxiv, pdf, citation: -1

    Dirk Groeneveld, Anas Awadalla, Iz Beltagy, Akshita Bhagia, Ian Magnusson, Hao Peng, Oyvind Tafjord, Pete Walsh, Kyle Richardson, Jesse Dodge · (catwalk - allenai)

  • PromptBench: A Unified Library for Evaluation of Large Language Models, arXiv, 2312.07910, arxiv, pdf, citation: -1

    Kaijie Zhu, Qinlin Zhao, Hao Chen, Jindong Wang, Xing Xie

    · (promptbench - microsoft)

  • CritiqueLLM: Scaling LLM-as-Critic for Effective and Explainable Evaluation of Large Language Model Generation, arXiv, 2311.18702, arxiv, pdf, citation: -1

    Pei Ke, Bosi Wen, Zhuoer Feng, Xiao Liu, Xuanyu Lei, Jiale Cheng, Shengyuan Wang, Aohan Zeng, Yuxiao Dong, Hongning Wang

    · (critiquellm - thu-coai)

  • How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation, arXiv, 2312.07424, arxiv, pdf, citation: -1

    Zhongyi Han, Guanglin Zhou, Rundong He, Jindong Wang, Tailin Wu, Yilong Yin, Salman Khan, Lina Yao, Tongliang Liu, Kun Zhang

    · (gpt-4v-distribution-shift - jameszhou-gl)

  • Catch me if you can! How to beat GPT-4 with a 13B model | LMSYS Org

    · (youtube)

  • Evaluating and Mitigating Discrimination in Language Model Decisions, arXiv, 2312.03689, arxiv, pdf, citation: -1

    Alex Tamkin, Amanda Askell, Liane Lovitt, Esin Durmus, Nicholas Joseph, Shauna Kravec, Karina Nguyen, Jared Kaplan, Deep Ganguli · (huggingface)

  • Instruction-Following Evaluation for Large Language Models, arXiv, 2311.07911, arxiv, pdf, citation: -1

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, Le Hou · (google-research - google-research) · (see the verifiable-instruction sketch after this list)

  • Unnatural Error Correction: GPT-4 Can Almost Perfectly Handle Unnatural Scrambled Text, arXiv, 2311.18805, arxiv, pdf, citation: -1

    Qi Cao, Takeshi Kojima, Yutaka Matsuo, Yusuke Iwasawa · (qbitai)

  • GPQA: A Graduate-Level Google-Proof Q&A Benchmark, arXiv, 2311.12022, arxiv, pdf, citation: -1

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, Samuel R. Bowman

  • SelfEval: Leveraging the discriminative nature of generative models for evaluation, arXiv, 2311.10708, arxiv, pdf, citation: -1

    Sai Saketh Rambhatla, Ishan Misra

  • Rethinking Benchmark and Contamination for Language Models with Rephrased Samples, arXiv, 2311.04850, arxiv, pdf, citation: -1

    Shuo Yang, Wei-Lin Chiang, Lianmin Zheng, Joseph E. Gonzalez, Ion Stoica · (llm-decontaminator - lm-sys) · (jiqizhixin) · (see the n-gram overlap sketch after this list)

  • Fusion-Eval: Integrating Evaluators with LLMs, arXiv, 2311.09204, arxiv, pdf, citation: -1

    Lei Shu, Nevan Wichers, Liangchen Luo, Yun Zhu, Yinxiao Liu, Jindong Chen, Lei Meng

  • Llamas Know What GPTs Don't Show: Surrogate Models for Confidence Estimation, arXiv, 2311.08877, arxiv, pdf, citation: -1

    Vaishnavi Shrivastava, Percy Liang, Ananya Kumar

  • MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks, arXiv, 2311.07463, arxiv, pdf, citation: -1

    Sanchit Ahuja, Divyanshu Aggarwal, Varun Gumma, Ishaan Watts, Ashutosh Sathe, Millicent Ochieng, Rishav Hada, Prachi Jain, Maxamed Axmed, Kalika Bali

  • Generative Judge for Evaluating Alignment, arXiv, 2310.05470, arxiv, pdf, citation: -1

    Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, Hai Zhao, Pengfei Liu · (auto-j - GAIR-NLP) · (gair-nlp.github)

  • Can LLMs Follow Simple Rules?, arXiv, 2311.04235, arxiv, pdf, citation: -1

    Norman Mu, Sarah Chen, Zifan Wang, Sizhe Chen, David Karamardian, Lulwa Aljeraisy, Dan Hendrycks, David Wagner

    · (llm_rules - normster)

  • Don't Make Your LLM an Evaluation Benchmark Cheater, arXiv, 2311.01964, arxiv, pdf, citation: -1

    Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong Wen, Jiawei Han

  • PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion, arXiv, 2311.01767, arxiv, pdf, citation: -1

    Yiduo Guo, Zekai Zhang, Yaobo Liang, Dongyan Zhao, Nan Duan

  • GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond, arXiv, 2309.16583, arxiv, pdf, citation: 1

    Shen Zheng, Yuyu Zhang, Yijie Zhu, Chenguang Xi, Pengyang Gao, Xun Zhou, Kevin Chen-Chuan Chang · (gpt-fathom - gpt-fathom) · (qbitai)

  • Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models, arXiv, 2310.20499, arxiv, pdf, citation: -1

    Tian Liang, Zhiwei He, Jen-tse Huang, Wenxuan Wang, Wenxiang Jiao, Rui Wang, Yujiu Yang, Zhaopeng Tu, Shuming Shi, Xing Wang

  • Does GPT-4 Pass the Turing Test?, arXiv, 2310.20216, arxiv, pdf, citation: -1

    Cameron Jones, Benjamin Bergen

  • ALCUNA: Large Language Models Meet New Knowledge, arXiv, 2310.14820, arxiv, pdf, citation: -1

    Xunjian Yin, Baizhou Huang, Xiaojun Wan

  • Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks, arXiv, 2310.13800, arxiv, pdf, citation: 2

    Andrea Sottana, Bin Liang, Kai Zou, Zheng Yuan

  • A Framework for Automated Measurement of Responsible AI Harms in Generative AI Applications, arXiv, 2310.17750, arxiv, pdf, citation: -1

    Ahmed Magooda, Alec Helyar, Kyle Jackson, David Sullivan, Chad Atalla, Emily Sheng, Dan Vann, Richard Edgar, Hamid Palangi, Roman Lutz

  • Exploring OCR Capabilities of GPT-4V(ision): A Quantitative and In-depth Evaluation, arXiv, 2310.16809, arxiv, pdf, citation: 2

    Yongxin Shi, Dezhi Peng, Wenhui Liao, Zening Lin, Xinhong Chen, Chongyu Liu, Yuyi Zhang, Lianwen Jin · (gpt-4v_ocr - scut-dlvclab)

  • JudgeLM: Fine-tuned Large Language Models are Scalable Judges, arXiv, 2310.17631, arxiv, pdf, citation: -1

    Lianghui Zhu, Xinggang Wang, Xinlong Wang

  • An Early Evaluation of GPT-4V(ision), arXiv, 2310.16534, arxiv, pdf, citation: 1

    Yang Wu, Shilong Wang, Hao Yang, Tian Zheng, Hongbo Zhang, Yanyan Zhao, Bing Qin

  • The Foundation Model Transparency Index, arXiv, 2310.12941, arxiv, pdf, citation: 3

    Rishi Bommasani, Kevin Klyman, Shayne Longpre, Sayash Kapoor, Nestor Maslej, Betty Xiong, Daniel Zhang, Percy Liang · (qbitai) · (crfm.stanford)

  • ReForm-Eval: Evaluating Large Vision Language Models via Unified Re-Formulation of Task-Oriented Benchmarks, arXiv, 2310.02569, arxiv, pdf, citation: -1

    Zejun Li, Ye Wang, Mengfei Du, Qingwen Liu, Binhao Wu, Jiwen Zhang, Chengxing Zhou, Zhihao Fan, Jie Fu, Jingjing Chen · (jiqizhixin)

  • CLEVA: Chinese Language Models EVAluation Platform, arXiv, 2308.04813, arxiv, pdf, citation: -1

    Yanyang Li, Jianqiao Zhao, Duo Zheng, Zi-Yuan Hu, Zhi Chen, Xiaohui Su, Yongfeng Huang, Shijia Huang, Dahua Lin, Michael R. Lyu · (qbitai)

  • Prometheus: Inducing Fine-grained Evaluation Capability in Language Models, arXiv, 2310.08491, arxiv, pdf, citation: -1

    Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne

  • A Closer Look into Automatic Evaluation Using Large Language Models, arXiv, 2310.05657, arxiv, pdf, citation: -1

    Cheng-Han Chiang, Hung-yi Lee · (mp.weixin.qq) · (A-Closer-Look-To-LLM-Evaluation - d223302)

  • Probing the Moral Development of Large Language Models through Defining Issues Test, arXiv, 2309.13356, arxiv, pdf, citation: -1

    Kumar Tanmay, Aditi Khandelwal, Utkarsh Agarwal, Monojit Choudhury · (mp.weixin.qq)

  • Calibrating LLM-Based Evaluator, arXiv, 2309.13308, arxiv, pdf, citation: -1

    Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang

  • Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data?, arXiv, 2309.08963, arxiv, pdf, citation: 1

    Xiangru Tang, Yiming Zong, Jason Phang, Yilun Zhao, Wangchunshu Zhou, Arman Cohan, Mark Gerstein

  • Investigating Answerability of LLMs for Long-Form Question Answering, arXiv, 2309.08210, arxiv, pdf, citation: -1

    Meghana Moorthy Bhat, Rui Meng, Ye Liu, Yingbo Zhou, Semih Yavuz

  • Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?, arXiv, 2309.07462, arxiv, pdf, citation: 4

    Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, Sunayana Sitaram

  • Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs, arXiv, 2308.13387, arxiv, pdf, citation: 3

    Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, Timothy Baldwin · (jiqizhixin)

  • DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models, arXiv, 2306.11698, arxiv, pdf, citation: -1

    Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer

    · (decodingtrust.github)

  • The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain, arXiv, 2305.07141, arxiv, pdf, citation: 10

    Arseny Moskvichev, Victor Vikram Odouard, Melanie Mitchell · (mp.weixin.qq)

  • MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities, arXiv, 2308.02490, arxiv, pdf, citation: 10

    Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, Lijuan Wang

  • ARB: Advanced Reasoning Benchmark for Large Language Models, arXiv, 2307.13692, arxiv, pdf, citation: 6

    Tomohiro Sawada, Daniel Paleka, Alexander Havrilla, Pranav Tadepalli, Paula Vidas, Alexander Kranias, John J. Nay, Kshitij Gupta, Aran Komatsuzaki

  • L-Eval: Instituting Standardized Evaluation for Long Context Language Models, arXiv, 2307.11088, arxiv, pdf, citation: -1

    Chenxin An, Shansan Gong, Ming Zhong, Xingjian Zhao, Mukai Li, Jun Zhang, Lingpeng Kong, Xipeng Qiu · (leval - openlmlab)

  • Instruction-following Evaluation through Verbalizer Manipulation, arXiv, 2307.10558, arxiv, pdf, citation: 4

    Shiyang Li, Jun Yan, Hai Wang, Zheng Tang, Xiang Ren, Vijay Srinivasan, Hongxia Jin

  • FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets, arXiv, 2307.10928, arxiv, pdf, citation: 5

    Seonghyeon Ye, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, Minjoon Seo

  • How is ChatGPT's behavior changing over time?, arXiv, 2307.09009, arxiv, pdf, citation: 64

    Lingjiao Chen, Matei Zaharia, James Zou

  • Generating Benchmarks for Factuality Evaluation of Language Models, arXiv, 2307.06908, arxiv, pdf, citation: 6

    Dor Muhlgay, Ori Ram, Inbal Magar, Yoav Levine, Nir Ratner, Yonatan Belinkov, Omri Abend, Kevin Leyton-Brown, Amnon Shashua, Yoav Shoham

    · (factor - AI21Labs)

  • PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts, arXiv, 2306.04528, arxiv, pdf, citation: 32

    Kaijie Zhu, Jindong Wang, Jiaheng Zhou, Zichen Wang, Hao Chen, Yidong Wang, Linyi Yang, Wei Ye, Yue Zhang, Neil Zhenqiang Gong · (promptbench - microsoft)

  • Empowering Cross-lingual Behavioral Testing of NLP Models with Typological Features, arXiv, 2307.05454, arxiv, pdf, citation: -1

    Ester Hlavnova, Sebastian Ruder

  • GLUE-X: Evaluating Natural Language Understanding Models from an Out-of-distribution Generalization Perspective, arXiv, 2211.08073, arxiv, pdf, citation: 21

    Linyi Yang, Shuibai Zhang, Libo Qin, Yafu Li, Yidong Wang, Hanmeng Liu, Jindong Wang, Xing Xie, Yue Zhang

  • M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models, arXiv, 2306.05179, arxiv, pdf, citation: 7

    Wenxuan Zhang, Sharifah Mahani Aljunied, Chang Gao, Yew Ken Chia, Lidong Bing · (jiqizhixin) · (M3Exam - DAMO-NLP-SG)

  • MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models, arXiv, 2306.13394, arxiv, pdf, citation: 32

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng · (Awesome-Multimodal-Large-Language-Models - BradyFU) · (mp.weixin.qq)

  • Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, arXiv, 2306.05685, arxiv, pdf, citation: 136

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing · (twitter) · (lmsys) · (see the pairwise-judge sketch after this list)

  • Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors, arXiv, 2023, arxiv, pdf, citation: 8

    Tung Phung, Victor-Alexandru Pădurean, José Cambronero, Sumit Gulwani, Tobias Kohn, Rupak Majumdar, Adish Singla, Gustavo Soares

  • Benchmarking Large Language Model Capabilities for Conditional Generation, arXiv, 2306.16793, arxiv, pdf, citation: 2

    Joshua Maynez, Priyanka Agrawal, Sebastian Gehrmann

  • CMMLU: Measuring massive multitask language understanding in Chinese, arXiv, 2306.09212, arxiv, pdf, citation: 14

    Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, Timothy Baldwin · (jiqizhixin) · (CMMLU - haonan-li)

  • Bring Your Own Data! Self-Supervised Evaluation for Large Language Models, arXiv, 2306.13651, arxiv, pdf, citation: 7

    Neel Jain, Khalid Saifullah, Yuxin Wen, John Kirchenbauer, Manli Shu, Aniruddha Saha, Micah Goldblum, Jonas Geiping, Tom Goldstein · (byod - neelsjain)

  • INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models, arXiv, 2306.04757, arxiv, pdf, citation: 19

    Yew Ken Chia, Pengfei Hong, Lidong Bing, Soujanya Poria · (jiqizhixin)

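A few of the techniques catalogued above are concrete enough to sketch; four hedged illustrations follow. First, for Benchmarking LLMs via Uncertainty Quantification (2401.12794), a minimal sketch of the most basic uncertainty ingredient: turning per-option logits of a multiple-choice question into a softmax distribution and a predictive entropy. The logits here are invented for illustration, and the paper's actual framework builds on conformal prediction rather than this raw entropy.

```python
import math

def softmax(logits):
    """Numerically stable softmax over raw per-option scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predictive_entropy(probs):
    """Shannon entropy in nats; higher means a less confident model."""
    return -sum(p * math.log(p) for p in probs if p > 0)

option_logits = [2.1, 0.3, -0.5, -1.2]  # hypothetical scores for options A-D
probs = softmax(option_logits)
print([round(p, 3) for p in probs])         # ~[0.783, 0.13, 0.058, 0.029]
print(round(predictive_entropy(probs), 3))  # ~0.724 nats: fairly confident
```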
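Second, IFEval (Instruction-Following Evaluation, 2311.07911) is built around "verifiable instructions": constraints whose satisfaction a short program can check, with no judge model in the loop. The two checkers below illustrate that idea with assumed constraint definitions; they are not the benchmark's actual implementations.

```python
def check_min_words(response: str, min_words: int) -> bool:
    """Verifies an instruction like 'answer in at least N words'."""
    return len(response.split()) >= min_words

def check_no_commas(response: str) -> bool:
    """Verifies an instruction like 'do not use any commas'."""
    return "," not in response

response = "Paris is the capital of France"
print(check_min_words(response, 5), check_no_commas(response))  # True True
```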
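Third, Rethinking Benchmark and Contamination for Language Models with Rephrased Samples (2311.04850) shows that the classic n-gram overlap test for contamination is easy to evade by rephrasing test samples; the authors' llm-decontaminator instead combines embedding similarity search with an LLM judge. For reference, here is a minimal sketch of that overlap baseline; the n-gram size and the toy strings are illustrative choices.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Word-level n-grams of a lowercased text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_contaminated(train_doc: str, test_sample: str, n: int = 8) -> bool:
    """Flag the pair if any n-gram of the test sample appears verbatim in the training text."""
    return bool(ngrams(train_doc, n) & ngrams(test_sample, n))

train = "the quick brown fox jumps over the lazy dog"
print(overlap_contaminated(train, "quick brown fox jumps over", n=3))      # True: verbatim reuse
print(overlap_contaminated(train, "a fast brown fox leaps over it", n=3))  # False: rephrasing evades the check
```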
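Finally, Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (2306.05685) popularized pairwise LLM-as-a-judge scoring. The sketch below shows the overall flow: build a comparison prompt, query a strong judge model, and parse a [[A]]/[[B]]/[[C]] verdict. The template wording and the canned judge output are illustrative assumptions, not the paper's exact prompt or a real API call.

```python
import re

JUDGE_TEMPLATE = """You are an impartial judge. Compare the two responses to the
user question below and decide which is better.

[Question]
{question}

[Response A]
{answer_a}

[Response B]
{answer_b}

After a brief explanation, output a final verdict on its own line:
[[A]] if Response A is better, [[B]] if Response B is better, or [[C]] for a tie."""

def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    return JUDGE_TEMPLATE.format(question=question, answer_a=answer_a, answer_b=answer_b)

def parse_verdict(judge_output: str) -> str:
    """Map the judge's [[A]]/[[B]]/[[C]] marker to a winner label."""
    match = re.search(r"\[\[([ABC])\]\]", judge_output)
    return {"A": "model_a", "B": "model_b", "C": "tie"}[match.group(1)] if match else "unparsed"

prompt = build_judge_prompt("What causes tides?", "Mainly the Moon's gravity.", "Wind patterns.")
# A real evaluation would send `prompt` to a judge model here; a canned
# response keeps the sketch self-contained.
fake_judge_output = "Response A is factually correct while B is not.\n[[A]]"
print(parse_verdict(fake_judge_output))  # -> model_a
```

To reduce position bias, the paper also judges each pair with the response order swapped and treats inconsistent verdicts as ties.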

Projects

Other

Extra references