Releases: OpenDCAI/DataFlex
Initial Release v1.0.0
DataFlex v1.0.0 Release Notes
🎉🎉🎉 We are thrilled to release our Data-Centric Dynamic Training System, DataFlex! 🎉🎉🎉
Version: v1.0.0
A unified, data-centric dynamic training framework for Large Language Models, built on top of LLaMA-Factory.
🚀 Introduction
DataFlex is an advanced dynamic training framework built on top of LLaMA-Factory. Unlike traditional training methods that adopt fixed data order and proportions, DataFlex intelligently schedules training data during optimization — supporting Dynamic Data Selection, Dynamic Data Mixture, and Dynamic Data Reweighting — to improve both training efficiency and final model performance.
DataFlex seamlessly replaces LLaMA-Factory's training layer while preserving all of its core capabilities, offering researchers and developers more flexible and powerful training control.
- 📘 Documentation: DataFlex-Doc
- 📄 Technical Report: arXiv:2603.26164 (🏆 #1 on HuggingFace Daily Papers, 2026-04-04 & Weekly #1, Mar 29 – Apr 4)
- 📄 Survey: arXiv:2603.14712
🧠 Core Features
- 🔁 Modular Component Design: Selectors, Mixers, and Weighters are plug-and-play, with a registry system for easy extension.
- 🔄 Seamless LLaMA-Factory Integration: Drop-in replacement for LLaMA-Factory trainers — no changes needed to existing model management, data processing, or optimizer configurations.
- 📊 Three Dynamic Training Modes: Data Selection, Data Mixture, and Data Reweighting, each with dedicated trainers and algorithm components.
- ⚡ DeepSpeed ZeRO-3 Support: Gradient computation under DeepSpeed ZeRO-3, enabling training and analysis of larger-scale models.
- 🧩 Reproducible Implementations: Unified reproductions of hard-to-reproduce algorithms (LESS, NICE, DoReMi, ODM, etc.) in one consistent framework.
- 🛠️ Simple CLI: One-command training via `dataflex-cli train <config.yaml>`, with YAML-based configuration and OmegaConf override support.
- 🔌 Multi-GPU & Distributed Training: Automatic `torchrun` dispatch for multi-GPU setups, with DeepSpeed and Accelerate integration.
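For illustration, a training config might look like the sketch below. The field names are hypothetical placeholders, not DataFlex's documented schema; see DataFlex-Doc for the real options:

```yaml
# Hypothetical config sketch — consult DataFlex-Doc for the actual keys.
model_name_or_path: meta-llama/Llama-3-8B
dataset: alpaca_en
trainer: select        # select | mix | weight (illustrative)
selector: loss         # which Selector component to use
output_dir: ./saves/demo
```

A run is then launched with `dataflex-cli train config.yaml`; since configs go through OmegaConf, individual fields can typically be overridden on the command line in `key=value` form.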
🧱 Framework Overview
DataFlex adopts a modular, layered architecture:
┌───────────────────────────────────────────────────────────────────────────────┐
│ LlamaFactory Framework │
├───────────────────────────────────────────────────────────────────────────────┤
│ Model Management · Data Processing · Optimizers │
├───────────────────────────────────────────────────────────────────────────────┤
│ Training Layer (DataFlex replaces LlamaFactory trainer) │
│ ┌────────────────────────┬────────────────────────┬────────────────────────┐ │
│ │ Select Trainer │ Mix Trainer │ Weight Trainer │ │
│ │ (Dynamic Selection) │ (Dynamic Ratio) │ (Dynamic Weights) │ │
│ ├────────────────────────┼────────────────────────┼────────────────────────┤ │
│ │ Selector Components │ Mixer Components │ Weighter Components │ │
│ │ ┌──────────────────┐ │ ┌──────────────────┐ │ ┌───────────────────┐ │ │
│ │ │ Loss Selector │ │ │ DoReMi Mixer │ │ │ Loss Weighter │ │ │
│ │ │ LESS Selector │ │ │ ODM Mixer │ │ │ Custom Weighter │ │ │
│ │ │ NICE Selector │ │ │ Static Mixer │ │ │ ... │ │ │
│ │ │ Custom ... │ │ │ ... │ │ │ │ │ │
│ │ └──────────────────┘ │ └──────────────────┘ │ └───────────────────┘ │ │
│ └────────────────────────┴────────────────────────┴────────────────────────┘ │
└───────────────────────────────────────────────────────────────────────────────┘
| Module | Description |
|---|---|
| Trainers | SelectTrainer, MixTrainer, WeightTrainer — replace LlamaFactory's default trainers to enable dynamic data scheduling. |
| Selectors | Algorithm components for dynamic sample selection (LESS, NICE, Loss, Delta Loss, NEAR, TSDS, Random, Custom). |
| Mixers | Algorithm components for dynamic domain mixture (DoReMi, ODM, Static, Random). |
| Weighters | Algorithm components for dynamic sample reweighting (Loss-based with multiple strategies, Custom). |
| Registry | Central registration system for plug-and-play component management. |
| CLI | dataflex-cli entry point for training with YAML configs and CLI overrides. |
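The registry pattern referenced above usually amounts to a name-to-class lookup table populated by decorators. The following is a minimal generic sketch of that pattern; `SELECTORS` and `register_selector` are illustrative names, not DataFlex's actual API:

```python
import random

# Minimal registry pattern sketch (illustrative, not DataFlex's implementation).
SELECTORS = {}

def register_selector(name):
    """Decorator that registers a selector class under a string key."""
    def decorator(cls):
        SELECTORS[name] = cls
        return cls
    return decorator

@register_selector("random")
class RandomSelector:
    def select(self, samples, k):
        return random.sample(samples, k)

# A trainer can then instantiate components by the name given in a YAML config:
selector = SELECTORS["random"]()
print(selector.select(list(range(10)), 3))
```

This is what makes components plug-and-play: adding an algorithm means writing one class and one decorator line, with no changes to the trainer.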
📦 Supported Algorithms
Data Selection
| Method | Category | Requires Model-in-the-Loop? |
|---|---|---|
| LESS | Gradient-Based | ✅ Yes |
| NICE | Gradient-Based | ✅ Yes |
| Loss | Loss-Based | ✅ Yes |
| Delta Loss | Loss-Based | ✅ Yes |
| NEAR | Data Distribution-Based | ❌ No |
| TSDS | Data Distribution-Based | ❌ No |
| Static | No Selection | ❌ No |
| Random | Random Sampling | ❌ No |
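For intuition, loss-based selection in its simplest form keeps the k highest-loss (hardest) samples at each selection step. This is a generic sketch of that idea, not DataFlex's implementation:

```python
# Generic top-k loss-based selection sketch (illustrative only).
def select_top_loss(losses, k):
    """Return indices of the k samples with the highest loss."""
    return sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)[:k]

losses = [0.2, 1.5, 0.7, 3.1, 0.9]
print(select_top_loss(losses, 2))  # → [3, 1], the two hardest samples
```

Gradient-based methods like LESS and NICE follow the same select-by-score shape, but score samples by gradient similarity to a target set rather than by raw loss.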
Data Mixture
| Method | Category | Requires Model-in-the-Loop? |
|---|---|---|
| DoReMi | Offline Mixture | ✅ Yes |
| ODM | Online Mixture | ✅ Yes |
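DoReMi's core idea can be sketched as a multiplicative-weights update on domain proportions: domains whose loss exceeds a reference get upweighted, then the weights are renormalized. This is a simplified illustration; see the DoReMi paper and the framework for the actual algorithm:

```python
import math

def update_domain_weights(weights, excess_losses, lr=0.1):
    """One exponentiated-gradient step: domains with higher excess loss
    get proportionally more weight, then weights are renormalized."""
    raw = [w * math.exp(lr * l) for w, l in zip(weights, excess_losses)]
    total = sum(raw)
    return [r / total for r in raw]

w = [1/3, 1/3, 1/3]  # start uniform over three domains
w = update_domain_weights(w, [0.5, 0.0, -0.5])
print(w)  # the first domain's share grows, the third shrinks
```

Online methods such as ODM apply updates of this flavor during training, whereas DoReMi computes the mixture offline with a proxy model before the main run.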
Data Reweighting
| Method | Category | Requires Model-in-the-Loop? |
|---|---|---|
| Loss Reweighting | Loss-Based | ✅ Yes |
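Loss-based reweighting keeps every sample but scales its contribution to the loss. One common strategy, sketched generically below (a softmax over per-sample losses; one of several possible schemes, not necessarily the one DataFlex ships):

```python
import math

def loss_weights(losses, temperature=1.0):
    """Softmax over per-sample losses: harder samples get larger weights,
    normalized so the mean weight is 1 (illustrative sketch only)."""
    exps = [math.exp(l / temperature) for l in losses]
    total = sum(exps)
    n = len(losses)
    return [n * e / total for e in exps]

print(loss_weights([0.1, 1.0, 2.0]))
```

The resulting weights multiply each sample's loss term in the batch, so the optimizer effectively spends more of its update budget on hard examples.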
🛠️ Quick Start
Installation
```
pip install dataflex
```

Or install from source:

```
git clone https://github.com/OpenDCAI/DataFlex.git
cd DataFlex
pip install -e .
```

Note: Python 3.11+ is recommended. The core dependencies (including `llamafactory` and `deepspeed`) will be installed automatically. If you are using Python 3.10, you need to install a compatible version of `llamafactory` manually.
For full configuration details, please refer to DataFlex-Doc.
🔍 Why DataFlex?
| Feature | Benefit |
|---|---|
| LLaMA-Factory Native | Seamless drop-in replacement, zero migration cost |
| Unified Algorithm Framework | Reproducible implementations of LESS, NICE, DoReMi, ODM, and more in one place |
| Modular & Extensible | Registry-based architecture makes adding new algorithms straightforward |
| Multi-GPU Ready | Built-in torchrun, DeepSpeed ZeRO-3, and Accelerate support |
| YAML-Driven | Simple, declarative training configuration with CLI overrides |
| Pairs with DataFlow | End-to-end pipeline from raw data (via DataFlow) to optimized training |
🧩 Ecosystem
DataFlex focuses on data scheduling during training. For a complete pipeline starting from raw data, it pairs with DataFlow:
- DataFlow (upstream): raw data → cleaned, augmented, formatted training data
- DataFlex (downstream): training data → intelligent, dynamic LLM training
The two projects are independent with no code dependency, connected by standard data formats.
📫 Contact & Community
- GitHub: https://github.com/OpenDCAI/DataFlex
- Documentation: https://opendcai.github.io/DataFlex-Doc/
- Issues: GitHub Issues
- Email: hao.liang@stu.pku.edu.cn
📜 Citation
```
@article{liang2026dataflex,
  title={DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models},
  author={Liang, Hao and Zhao, Zhengyang and Qiang, Meiyi and Chen, Mingrui and Ma, Lu and Yu, Rongyi and Feng, Hengyi and Sun, Shixuan and Meng, Zimo and Ma, Xiaochen and others},
  journal={arXiv preprint arXiv:2603.26164},
  year={2026}
}
```

📄 License
DataFlex is released under the Apache-2.0 License.