Awesome LLM data

Survey

Data Management For Large Language Models: A Survey, arXiv, 2312.01700, arxiv, pdf, citation: -1

Zige Wang, Wanjun Zhong, Yufei Wang, Qi Zhu, Fei Mi, Baojun Wang, Lifeng Shang, Xin Jiang, Qun Liu

Techs

An Initial Exploration of Theoretical Support for Language Model Data Engineering. Part 1: Pretraining
Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling, arXiv, 2401.16380, arxiv, pdf, citation: -1

Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, Navdeep Jaitly
Genie: Achieving Human Parity in Content-Grounded Datasets Generation, arXiv, 2401.14367, arxiv, pdf, citation: -1

Asaf Yehudai, Boaz Carmeli, Yosi Mass, Ofir Arviv, Nathaniel Mills, Assaf Toledo, Eyal Shnarch, Leshem Choshen
Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI, arXiv, 2401.14019, arxiv, pdf, citation: -1

Elron Bandel, Yotam Perlitz, Elad Venezian, Roni Friedman-Melamed, Ofir Arviv, Matan Orbach, Shachar Don-Yehyia, Dafna Sheinwald, Ariel Gera, Leshem Choshen · (unitxt - IBM)
The Unreasonable Effectiveness of Easy Training Data for Hard Tasks, arXiv, 2401.06751, arxiv, pdf, citation: -1

Peter Hase, Mohit Bansal, Peter Clark, Sarah Wiegreffe · (easy-to-hard-generalization - allenai)
A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism, arXiv, 2401.05749, arxiv, pdf, citation: -1

Brian Thompson, Mehak Preet Dhaliwal, Peter Frisch, Tobias Domhan, Marcello Federico
What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning, arXiv, 2312.15685, arxiv, pdf, citation: -1

Wei Liu, Weihao Zeng, Keqing He, Yong Jiang, Junxian He · (deita - hkust-nlp)
Order Matters in the Presence of Dataset Imbalance for Multilingual Learning, arXiv, 2312.06134, arxiv, pdf, citation: -1

Dami Choi, Derrick Xin, Hamid Dadkhahi, Justin Gilmer, Ankush Garg, Orhan Firat, Chih-Kuan Yeh, Andrew M. Dai, Behrooz Ghorbani
AlpaGasus: Training A Better Alpaca with Fewer Data, arXiv, 2307.08701, arxiv, pdf, citation: -1

Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang
Scaling Data-Constrained Language Models, arXiv, 2305.16264, arxiv, pdf, citation: -1

Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, Colin Raffel · (datablations - huggingface)

Datasets

WebSight - HuggingFaceM4 🤗
oasst2 - OpenAssistant 🤗
wikisource - wikimedia 🤗
pii-masking-200k - ai4privacy 🤗
SlimPajama-627B - cerebras 🤗 · (modelzoo - Cerebras)
MADLAD-400 - allenai 🤗
peS2o - allenai 🤗

Misc

Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research, arXiv, 2402.00159, arxiv, pdf, citation: -1

Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar
openhathi_instruct - pacman100

This repository contains the code for dataset curation and fine-tuning of the instruct variant of the bilingual OpenHathi model. The resulting model is meant to follow instructions and chat in Hindi and Hinglish.
MADLAD-400: A Multilingual And Document-Level Large Audited Dataset, arXiv, 2309.04662, arxiv, pdf, citation: -1

Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna · (google-research - google-research)
Phi-2: The surprising power of small language models - Microsoft Research
What's In My Big Data?, arXiv, 2310.20707, arxiv, pdf, citation: -1

Yanai Elazar, Akshita Bhagia, Ian Magnusson, Abhilasha Ravichander, Dustin Schwenk, Alane Suhr, Pete Walsh, Dirk Groeneveld, Luca Soldaini, Sameer Singh
orca - nuochenpku

Orca: A Few-shot Benchmark for Chinese Conversational Machine Reading Comprehension
UltraFeedback - OpenBMB

A large-scale, fine-grained, diverse preference dataset (and models).
How Abilities in Large Language Models are Affected by Supervised Fine-tuning Data Composition, arXiv, 2310.05492, arxiv, pdf, citation: -1

Guanting Dong, Hongyi Yuan, Keming Lu, Chengpeng Li, Mingfeng Xue, Dayiheng Liu, Wei Wang, Zheng Yuan, Chang Zhou, Jingren Zhou
LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset, arXiv, 2309.11998, arxiv, pdf, citation: 3

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric P. Xing
SlimPajama-DC: Understanding Data Combinations for LLM Training, arXiv, 2309.10818, arxiv, pdf, citation: -1

Zhiqiang Shen, Tianhua Tao, Liqun Ma, Willie Neiswanger, Zhengzhong Liu, Hongyi Wang, Bowen Tan, Joel Hestness, Natalia Vassilieva, Daria Soboleva
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages, arXiv, 2309.09400, arxiv, pdf, citation: -1

Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A. Rossi, Thien Huu Nguyen
Textbooks Are All You Need II: phi-1.5 technical report, arXiv, 2309.05463, arxiv, pdf, citation: 9

Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, Yin Tat Lee
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only, arXiv, 2306.01116, arxiv, pdf, citation: 108

Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, Julien Launay
FunQA: Towards Surprising Video Comprehension, arXiv, 2306.14899, arxiv, pdf, citation: 1

Binzhu Xie, Sicheng Zhang, Zitang Zhou, Bo Li, Yuanhan Zhang, Jack Hessel, Jingkang Yang, Ziwei Liu · (mp.weixin.qq)
The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants, arXiv, 2308.16884, arxiv, pdf, citation: -1

Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, Madian Khabsa
MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records, arXiv, 2308.14089, arxiv, pdf, citation: 2

Scott L. Fleming, Alejandro Lozano, William J. Haberkorn, Jenelle A. Jindal, Eduardo P. Reis, Rahul Thapa, Louis Blankemeier, Julian Z. Genkins, Ethan Steinberg, Ashwin Nayak
Platypus: Quick, Cheap, and Powerful Refinement of LLMs, arXiv, 2308.07317, arxiv, pdf, citation: 5

Ariel N. Lee, Cole J. Hunter, Nataniel Ruiz
Leveraging Implicit Feedback from Deployment Data in Dialogue, arXiv, 2307.14117, arxiv, pdf, citation: 1

Richard Yuanzhe Pang, Stephen Roller, Kyunghyun Cho, He He, Jason Weston
UltraChat - thunlp

Large-scale, Informative, and Diverse Multi-round Chat Data (and Models)
Textbooks Are All You Need, arXiv, 2306.11644, arxiv, pdf, citation: 51

Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi · (jiqizhixin) · (jiqizhixin)

Multimodal

Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding, arXiv, 2401.04575, arxiv, pdf, citation: -1

Yatong Bai, Utsav Garg, Apaar Shanker, Haoming Zhang, Samyak Parajuli, Erhan Bas, Isidora Filipovic, Amelia N. Chu, Eugenia D Fomitcheva, Elliot Branson
OBELICS - HuggingFaceM4 🤗
Improving Multimodal Datasets with Image Captioning, arXiv, 2307.10350, arxiv, pdf, citation: 7

Thao Nguyen, Samir Yitzhak Gadre, Gabriel Ilharco, Sewoong Oh, Ludwig Schmidt
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation, arXiv, 2307.06942, arxiv, pdf, citation: 4

Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models, arXiv, 2306.05424, arxiv, pdf, citation: 30

Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks
M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning, arXiv, 2306.04387, arxiv, pdf, citation: 13

Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi Wang, Shuhuai Ren, Mukai Li, Yazheng Yang, Jingjing Xu, Xu Sun

Reasoning & Action

Can Large Language Models Infer Causation from Correlation?, arXiv, 2306.05836, arxiv, pdf, citation: 11

Zhijing Jin, Jiarui Liu, Zhiheng Lyu, Spencer Poff, Mrinmaya Sachan, Rada Mihalcea, Mona Diab, Bernhard Schölkopf
Mind2Web: Towards a Generalist Agent for the Web, arXiv, 2306.06070, arxiv, pdf, citation: 16

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, Yu Su

Synthetic

Synthetic Dialogue Dataset Generation using LLM Agents, arXiv, 2401.17461, arxiv, pdf, citation: -1

Yelaman Abdullin, Diego Molla-Aliod, Bahadorreza Ofoghi, John Yearwood, Qingyang Li
Learning Vision from Models Rivals Learning Vision from Data, arXiv, 2312.17742, arxiv, pdf, citation: -1

Yonglong Tian, Lijie Fan, Kaifeng Chen, Dina Katabi, Dilip Krishnan, Phillip Isola · (mp.weixin.qq)
Ensemble-Instruct: Generating Instruction-Tuning Data with a Heterogeneous Mixture of LMs, arXiv, 2310.13961, arxiv, pdf, citation: -1

Young-Suk Lee, Md Arafat Sultan, Yousef El-Kurdi, Tahira Naseem, Asim Munawar, Radu Florian, Salim Roukos, Ramón Fernandez Astudillo
Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models, arXiv, 2310.13671, arxiv, pdf, citation: -1

Ruida Wang, Wangchunshu Zhou, Mrinmaya Sachan
PIPPA: A Partially Synthetic Conversational Dataset, arXiv, 2308.05884, arxiv, pdf, citation: -1

Tear Gosling, Alpin Dale, Yinhe Zheng
Simple synthetic data reduces sycophancy in large language models, arXiv, 2308.03958, arxiv, pdf, citation: 6

Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, Quoc V. Le · (qbitai)
DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI, arXiv, 2307.10172, arxiv, pdf, citation: -1

Jianguo Zhang, Kun Qian, Zhiwei Liu, Shelby Heinecke, Rui Meng, Ye Liu, Zhou Yu, Huan Wang, Silvio Savarese, Caiming Xiong
Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias, arXiv, 2306.15895, arxiv, pdf, citation: 10

Yue Yu, Yuchen Zhuang, Jieyu Zhang, Yu Meng, Alexander Ratner, Ranjay Krishna, Jiaming Shen, Chao Zhang · (attrprompt - yueyu1030)
GPT Self-Supervision for a Better Data Annotator, arXiv, 2306.04349, arxiv, pdf, citation: 1

Xiaohuan Pei, Yanxi Li, Chang Xu · (mp.weixin.qq)
The Curse of Recursion: Training on Generated Data Makes Models Forget, arXiv, 2305.17493, arxiv, pdf, citation: 3

Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, Ross Anderson · (mp.weixin.qq)
Increasing Diversity While Maintaining Accuracy: Text Data Generation with Large Language Models and Human Interventions, arXiv, 2306.04140, arxiv, pdf, citation: 8

John Joon Young Chung, Ece Kamar, Saleema Amershi
Harnessing large-language models to generate private synthetic text, arXiv, 2306.01684, arxiv, pdf, citation: 1

Alexey Kurakin, Natalia Ponomareva, Umar Syed, Liam MacDermed, Andreas Terzis
LIMA: Less Is More for Alignment, arXiv, 2305.11206, arxiv, pdf, citation: 116

Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu
h2o-llmstudio - h2oai

H2O LLM Studio - a framework and no-code GUI for fine-tuning LLMs
Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning, arXiv, 2305.09246, arxiv, pdf, citation: 6

Hao Chen, Yiming Zhang, Qi Zhang, Hantao Yang, Xiaomeng Hu, Xuetao Ma, Yifan Yanggong, Junbo Zhao

Toolkits

dsir - p-lambda

DSIR large-scale data selection framework
data-juicer - alibaba

A one-stop data processing system to make data higher-quality, juicier, and more digestible for LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷

Extra reference

awesome-instruction-datasets - jianzhnie

A collection of prompt and instruction datasets of all kinds for training chat LLMs such as ChatGPT.
