Awesome llm data

Survey

  • Data Management For Large Language Models: A Survey, arXiv, 2312.01700, arxiv, pdf, citation: -1

    Zige Wang, Wanjun Zhong, Yufei Wang, Qi Zhu, Fei Mi, Baojun Wang, Lifeng Shang, Xin Jiang, Qun Liu

Techniques

  • An Initial Exploration of Theoretical Support for Language Model Data Engineering. Part 1: Pretraining

  • Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling, arXiv, 2401.16380, arxiv, pdf, citation: -1

    Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, Navdeep Jaitly

  • Genie: Achieving Human Parity in Content-Grounded Datasets Generation, arXiv, 2401.14367, arxiv, pdf, citation: -1

    Asaf Yehudai, Boaz Carmeli, Yosi Mass, Ofir Arviv, Nathaniel Mills, Assaf Toledo, Eyal Shnarch, Leshem Choshen

  • Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI, arXiv, 2401.14019, arxiv, pdf, citation: -1

    Elron Bandel, Yotam Perlitz, Elad Venezian, Roni Friedman-Melamed, Ofir Arviv, Matan Orbach, Shachar Don-Yehyia, Dafna Sheinwald, Ariel Gera, Leshem Choshen · (unitxt - IBM)

  • The Unreasonable Effectiveness of Easy Training Data for Hard Tasks, arXiv, 2401.06751, arxiv, pdf, citation: -1

    Peter Hase, Mohit Bansal, Peter Clark, Sarah Wiegreffe · (easy-to-hard-generalization - allenai)

  • A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism, arXiv, 2401.05749, arxiv, pdf, citation: -1

    Brian Thompson, Mehak Preet Dhaliwal, Peter Frisch, Tobias Domhan, Marcello Federico

  • What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning, arXiv, 2312.15685, arxiv, pdf, citation: -1

    Wei Liu, Weihao Zeng, Keqing He, Yong Jiang, Junxian He · (deita - hkust-nlp)

  • Order Matters in the Presence of Dataset Imbalance for Multilingual Learning, arXiv, 2312.06134, arxiv, pdf, citation: -1

    Dami Choi, Derrick Xin, Hamid Dadkhahi, Justin Gilmer, Ankush Garg, Orhan Firat, Chih-Kuan Yeh, Andrew M. Dai, Behrooz Ghorbani

  • AlpaGasus: Training A Better Alpaca with Fewer Data, arXiv, 2307.08701, arxiv, pdf, citation: -1

    Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang

  • Scaling Data-Constrained Language Models, arXiv, 2305.16264, arxiv, pdf, citation: -1

    Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, Colin Raffel

    · (datablations - huggingface)

Datasets

Misc

  • Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research, arXiv, 2402.00159, arxiv, pdf, citation: -1

    Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar

  • openhathi_instruct - pacman100

    This repository contains the code for dataset curation and fine-tuning of the instruct variant of the bilingual OpenHathi model. The resulting model is meant to follow instructions and chat in Hindi and Hinglish.

  • MADLAD-400: A Multilingual And Document-Level Large Audited Dataset, arXiv, 2309.04662, arxiv, pdf, citation: -1

    Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna

    · (google-research - google-research)

  • Phi-2: The surprising power of small language models - Microsoft Research

  • What's In My Big Data?, arXiv, 2310.20707, arxiv, pdf, citation: -1

    Yanai Elazar, Akshita Bhagia, Ian Magnusson, Abhilasha Ravichander, Dustin Schwenk, Alane Suhr, Pete Walsh, Dirk Groeneveld, Luca Soldaini, Sameer Singh

  • orca - nuochenpku

    Orca: A Few-shot Benchmark for Chinese Conversational Machine Reading Comprehension

  • UltraFeedback - OpenBMB

    A large-scale, fine-grained, diverse preference dataset (and models).

  • How Abilities in Large Language Models are Affected by Supervised Fine-tuning Data Composition, arXiv, 2310.05492, arxiv, pdf, citation: -1

    Guanting Dong, Hongyi Yuan, Keming Lu, Chengpeng Li, Mingfeng Xue, Dayiheng Liu, Wei Wang, Zheng Yuan, Chang Zhou, Jingren Zhou

  • LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset, arXiv, 2309.11998, arxiv, pdf, citation: 3

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric P. Xing

  • SlimPajama-DC: Understanding Data Combinations for LLM Training, arXiv, 2309.10818, arxiv, pdf, citation: -1

    Zhiqiang Shen, Tianhua Tao, Liqun Ma, Willie Neiswanger, Zhengzhong Liu, Hongyi Wang, Bowen Tan, Joel Hestness, Natalia Vassilieva, Daria Soboleva

  • CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages, arXiv, 2309.09400, arxiv, pdf, citation: -1

    Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A. Rossi, Thien Huu Nguyen

  • Textbooks Are All You Need II: phi-1.5 technical report, arXiv, 2309.05463, arxiv, pdf, citation: 9

    Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, Yin Tat Lee

  • The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only, arXiv, 2306.01116, arxiv, pdf, citation: 108

    Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, Julien Launay

  • FunQA: Towards Surprising Video Comprehension, arXiv, 2306.14899, arxiv, pdf, citation: 1

    Binzhu Xie, Sicheng Zhang, Zitang Zhou, Bo Li, Yuanhan Zhang, Jack Hessel, Jingkang Yang, Ziwei Liu · (mp.weixin.qq)

  • The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants, arXiv, 2308.16884, arxiv, pdf, citation: -1

    Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, Madian Khabsa

  • MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records, arXiv, 2308.14089, arxiv, pdf, citation: 2

    Scott L. Fleming, Alejandro Lozano, William J. Haberkorn, Jenelle A. Jindal, Eduardo P. Reis, Rahul Thapa, Louis Blankemeier, Julian Z. Genkins, Ethan Steinberg, Ashwin Nayak

  • Platypus: Quick, Cheap, and Powerful Refinement of LLMs, arXiv, 2308.07317, arxiv, pdf, citation: 5

    Ariel N. Lee, Cole J. Hunter, Nataniel Ruiz

  • Leveraging Implicit Feedback from Deployment Data in Dialogue, arXiv, 2307.14117, arxiv, pdf, citation: 1

    Richard Yuanzhe Pang, Stephen Roller, Kyunghyun Cho, He He, Jason Weston

  • UltraChat - thunlp

    Large-scale, Informative, and Diverse Multi-round Chat Data (and Models)

  • Textbooks Are All You Need, arXiv, 2306.11644, arxiv, pdf, citation: 51

    Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi · (jiqizhixin) · (jiqizhixin)

Multimodal

  • Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding, arXiv, 2401.04575, arxiv, pdf, citation: -1

    Yatong Bai, Utsav Garg, Apaar Shanker, Haoming Zhang, Samyak Parajuli, Erhan Bas, Isidora Filipovic, Amelia N. Chu, Eugenia D Fomitcheva, Elliot Branson

  • MADLAD-400: A Multilingual And Document-Level Large Audited Dataset, arXiv, 2309.04662, arxiv, pdf, citation: 1

    Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna

  • OBELICS - HuggingFaceM4 🤗

  • Improving Multimodal Datasets with Image Captioning, arXiv, 2307.10350, arxiv, pdf, citation: 7

    Thao Nguyen, Samir Yitzhak Gadre, Gabriel Ilharco, Sewoong Oh, Ludwig Schmidt

  • InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation, arXiv, 2307.06942, arxiv, pdf, citation: 4

    Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu

  • Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models, arXiv, 2306.05424, arxiv, pdf, citation: 30

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan

  • Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks

  • M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning, arXiv, 2306.04387, arxiv, pdf, citation: 13

    Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi Wang, Shuhuai Ren, Mukai Li, Yazheng Yang, Jingjing Xu, Xu Sun

Reasoning & Action

  • Can Large Language Models Infer Causation from Correlation?, arXiv, 2306.05836, arxiv, pdf, citation: 11

    Zhijing Jin, Jiarui Liu, Zhiheng Lyu, Spencer Poff, Mrinmaya Sachan, Rada Mihalcea, Mona Diab, Bernhard Schölkopf

  • Mind2Web: Towards a Generalist Agent for the Web, arXiv, 2306.06070, arxiv, pdf, citation: 16

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, Yu Su

Synthetic

  • Synthetic Dialogue Dataset Generation using LLM Agents, arXiv, 2401.17461, arxiv, pdf, citation: -1

    Yelaman Abdullin, Diego Molla-Aliod, Bahadorreza Ofoghi, John Yearwood, Qingyang Li

  • Learning Vision from Models Rivals Learning Vision from Data, arXiv, 2312.17742, arxiv, pdf, citation: -1

    Yonglong Tian, Lijie Fan, Kaifeng Chen, Dina Katabi, Dilip Krishnan, Phillip Isola · (mp.weixin.qq)

  • Ensemble-Instruct: Generating Instruction-Tuning Data with a Heterogeneous Mixture of LMs, arXiv, 2310.13961, arxiv, pdf, citation: -1

    Young-Suk Lee, Md Arafat Sultan, Yousef El-Kurdi, Tahira Naseem, Asim Munawar, Radu Florian, Salim Roukos, Ramón Fernandez Astudillo

  • Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models, arXiv, 2310.13671, arxiv, pdf, citation: -1

    Ruida Wang, Wangchunshu Zhou, Mrinmaya Sachan

  • PIPPA: A Partially Synthetic Conversational Dataset, arXiv, 2308.05884, arxiv, pdf, citation: -1

    Tear Gosling, Alpin Dale, Yinhe Zheng

  • Simple synthetic data reduces sycophancy in large language models, arXiv, 2308.03958, arxiv, pdf, citation: 6

    Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, Quoc V. Le · (qbitai)

  • DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI, arXiv, 2307.10172, arxiv, pdf, citation: -1

    Jianguo Zhang, Kun Qian, Zhiwei Liu, Shelby Heinecke, Rui Meng, Ye Liu, Zhou Yu, Huan Wang, Silvio Savarese, Caiming Xiong

  • Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias, arXiv, 2306.15895, arxiv, pdf, citation: 10

    Yue Yu, Yuchen Zhuang, Jieyu Zhang, Yu Meng, Alexander Ratner, Ranjay Krishna, Jiaming Shen, Chao Zhang · (attrprompt - yueyu1030)

  • GPT Self-Supervision for a Better Data Annotator, arXiv, 2306.04349, arxiv, pdf, citation: 1

    Xiaohuan Pei, Yanxi Li, Chang Xu · (mp.weixin.qq)

  • The Curse of Recursion: Training on Generated Data Makes Models Forget, arXiv, 2305.17493, arxiv, pdf, citation: 3

    Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, Ross Anderson

    · (mp.weixin.qq)

  • Increasing Diversity While Maintaining Accuracy: Text Data Generation with Large Language Models and Human Interventions, arXiv, 2306.04140, arxiv, pdf, citation: 8

    John Joon Young Chung, Ece Kamar, Saleema Amershi

  • Harnessing large-language models to generate private synthetic text, arXiv, 2306.01684, arxiv, pdf, citation: 1

    Alexey Kurakin, Natalia Ponomareva, Umar Syed, Liam MacDermed, Andreas Terzis

  • LIMA: Less Is More for Alignment, arXiv, 2305.11206, arxiv, pdf, citation: 116

    Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu

  • Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning, arXiv, 2305.09246, arxiv, pdf, citation: 6

    Hao Chen, Yiming Zhang, Qi Zhang, Hantao Yang, Xiaomeng Hu, Xuetao Ma, Yifan Yanggong, Junbo Zhao

Toolkits

  • dsir - p-lambda

    DSIR large-scale data selection framework

  • data-juicer - alibaba

    A one-stop data processing system to make data higher-quality, juicier, and more digestible for LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷

Extra reference

  • awesome-instruction-datasets - jianzhnie

    A collection of prompt and instruction datasets (awesome-prompt-datasets, awesome-instruction-dataset) for training chat LLMs such as ChatGPT.