Releases: OpenDCAI/DataFlex

Initial Release v1.0.0

16 Apr 14:45

DataFlex v1.0.0 Release Notes

🎉🎉🎉 We are thrilled to release our Data-Centric Dynamic Training System, DataFlex! 🎉🎉🎉

Version: v1.0.0
A unified, data-centric dynamic training framework for Large Language Models, built on top of LLaMA-Factory.


🚀 Introduction

DataFlex is an advanced dynamic training framework built on top of LLaMA-Factory. Unlike traditional training methods that adopt fixed data order and proportions, DataFlex intelligently schedules training data during optimization — supporting Dynamic Data Selection, Dynamic Data Mixture, and Dynamic Data Reweighting — to improve both training efficiency and final model performance.

DataFlex seamlessly replaces LLaMA-Factory's training layer while preserving all of its core capabilities, offering researchers and developers more flexible and powerful training control.


🧠 Core Features

  • 🔁 Modular Component Design: Selectors, Mixers, and Weighters are plug-and-play, with a registry system for easy extension.
  • 🔄 Seamless LLaMA-Factory Integration: Drop-in replacement for LLaMA-Factory trainers — no changes needed to existing model management, data processing, or optimizer configurations.
  • 📊 Three Dynamic Training Modes: Data Selection, Data Mixture, and Data Reweighting, each with dedicated trainers and algorithm components.
  • ⚡ DeepSpeed ZeRO-3 Support: Gradient computation under DeepSpeed ZeRO-3, enabling training and analysis of larger-scale models.
  • 🧩 Reproducible Implementations: Unified reproductions of hard-to-reproduce algorithms (LESS, NICE, DoReMi, ODM, etc.) in one consistent framework.
  • 🛠️ Simple CLI: One-command training via dataflex-cli train <config.yaml>, with YAML-based configuration and OmegaConf override support.
  • 🔌 Multi-GPU & Distributed Training: Automatic torchrun dispatch for multi-GPU setups, with DeepSpeed and Accelerate integration.

🧱 Framework Overview

DataFlex adopts a modular, layered architecture:

┌───────────────────────────────────────────────────────────────────────────────┐
│                           LlamaFactory Framework                              │
├───────────────────────────────────────────────────────────────────────────────┤
│                  Model Management · Data Processing · Optimizers              │
├───────────────────────────────────────────────────────────────────────────────┤
│            Training Layer (DataFlex replaces LlamaFactory trainer)            │
│  ┌────────────────────────┬────────────────────────┬────────────────────────┐ │
│  │      Select Trainer    │       Mix Trainer      │     Weight Trainer     │ │
│  │   (Dynamic Selection)  │      (Dynamic Ratio)   │     (Dynamic Weights)  │ │
│  ├────────────────────────┼────────────────────────┼────────────────────────┤ │
│  │  Selector Components   │    Mixer Components    │   Weighter Components  │ │
│  │  ┌──────────────────┐  │  ┌──────────────────┐  │  ┌───────────────────┐ │ │
│  │  │  Loss Selector   │  │  │  DoReMi Mixer    │  │  │   Loss Weighter   │ │ │
│  │  │  LESS Selector   │  │  │    ODM Mixer     │  │  │  Custom Weighter  │ │ │
│  │  │  NICE Selector   │  │  │  Static Mixer    │  │  │        ...        │ │ │
│  │  │   Custom ...     │  │  │       ...        │  │  │                   │ │ │
│  │  └──────────────────┘  │  └──────────────────┘  │  └───────────────────┘ │ │
│  └────────────────────────┴────────────────────────┴────────────────────────┘ │
└───────────────────────────────────────────────────────────────────────────────┘
| Module | Description |
| --- | --- |
| Trainers | SelectTrainer, MixTrainer, WeightTrainer, which replace LLaMA-Factory's default trainers to enable dynamic data scheduling. |
| Selectors | Algorithm components for dynamic sample selection (LESS, NICE, Loss, Delta Loss, NEAR, TSDS, Random, Custom). |
| Mixers | Algorithm components for dynamic domain mixture (DoReMi, ODM, Static, Random). |
| Weighters | Algorithm components for dynamic sample reweighting (Loss-based with multiple strategies, Custom). |
| Registry | Central registration system for plug-and-play component management. |
| CLI | dataflex-cli entry point for training with YAML configs and CLI overrides. |
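To illustrate how a registry enables the plug-and-play design described above, here is a minimal, self-contained sketch of the pattern: components register themselves under a string name, and a trainer can then build them from a config value. The class and method names (`Registry`, `SELECTORS`, `LossSelector`) are hypothetical for illustration, not the actual DataFlex API.

```python
# Minimal sketch of a registry-based component system
# (names are illustrative, not the DataFlex API).

class Registry:
    """Maps string names to component classes, so a YAML config
    can pick an algorithm by name."""

    def __init__(self):
        self._components = {}

    def register(self, name):
        def decorator(cls):
            self._components[name] = cls
            return cls
        return decorator

    def build(self, name, **kwargs):
        # Instantiate the registered class with config kwargs.
        return self._components[name](**kwargs)


SELECTORS = Registry()


@SELECTORS.register("loss")
class LossSelector:
    """Toy loss-based selector: keep the highest-loss samples."""

    def __init__(self, top_k=100):
        self.top_k = top_k

    def select(self, losses):
        ranked = sorted(range(len(losses)),
                        key=lambda i: losses[i], reverse=True)
        return ranked[: self.top_k]


# A trainer would read the name and kwargs from the YAML config:
selector = SELECTORS.build("loss", top_k=2)
print(selector.select([0.1, 0.9, 0.4, 0.7]))  # → [1, 3]
```

Adding a new algorithm then only requires defining a class and registering it under a new name; no trainer code changes.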

📦 Supported Algorithms

Data Selection

| Method | Category | Requires Model-in-the-Loop? |
| --- | --- | --- |
| LESS | Gradient-Based | ✅ Yes |
| NICE | Gradient-Based | ✅ Yes |
| Loss | Loss-Based | ✅ Yes |
| Delta Loss | Loss-Based | ✅ Yes |
| NEAR | Data Distribution-Based | ❌ No |
| TSDS | Data Distribution-Based | ❌ No |
| Static | No Selection | ❌ No |
| Random | Random Sampling | ❌ No |

Data Mixture

| Method | Category | Requires Model-in-the-Loop? |
| --- | --- | --- |
| DoReMi | Offline Mixture | ✅ Yes |
| ODM | Online Mixture | ✅ Yes |

Data Reweighting

| Method | Category | Requires Model-in-the-Loop? |
| --- | --- | --- |
| Loss Reweighting | Loss-Based | ✅ Yes |

🛠️ Quick Start

Installation

pip install dataflex

Or install from source:

git clone https://github.com/OpenDCAI/DataFlex.git
cd DataFlex
pip install -e .

Note: Python 3.11+ is recommended. The core dependencies (including llamafactory and deepspeed) will be installed automatically. If you are using Python 3.10, you need to install a compatible version of llamafactory manually.
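
A training run is driven by a YAML config passed to the CLI. The sketch below is a hypothetical config: the key names (`trainer`, `selector`, `top_k`, etc.) are illustrative only; consult DataFlex-Doc for the actual schema.

```yaml
# Hypothetical DataFlex config (key names are illustrative,
# not the documented schema).
model_name_or_path: meta-llama/Llama-3-8B
dataset: alpaca_en
trainer: select          # select | mix | weight
selector:
  name: loss             # registry name of the selection algorithm
  top_k: 50000
output_dir: ./outputs/dataflex-run
```

With a config in hand, training is one command: `dataflex-cli train config.yaml`. Since the CLI supports OmegaConf overrides, individual values can be changed from the command line without editing the file.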

For full configuration details, please refer to DataFlex-Doc.


🔍 Why DataFlex?

| Feature | Benefit |
| --- | --- |
| LLaMA-Factory Native | Seamless drop-in replacement, zero migration cost |
| Unified Algorithm Framework | Reproducible implementations of LESS, NICE, DoReMi, ODM, and more in one place |
| Modular & Extensible | Registry-based architecture makes adding new algorithms straightforward |
| Multi-GPU Ready | Built-in torchrun, DeepSpeed ZeRO-3, and Accelerate support |
| YAML-Driven | Simple, declarative training configuration with CLI overrides |
| Pairs with DataFlow | End-to-end pipeline from raw data (via DataFlow) to optimized training |

🧩 Ecosystem

DataFlex focuses on data scheduling during training. For a complete pipeline starting from raw data, it pairs with DataFlow:

DataFlow (upstream) — raw data → cleaned, augmented, formatted training data
DataFlex (downstream) — training data → intelligent, dynamic LLM training

The two projects are independent with no code dependency, connected by standard data formats.


📫 Contact & Community


📜 Citation

@article{liang2026dataflex,
  title={DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models},
  author={Liang, Hao and Zhao, Zhengyang and Qiang, Meiyi and Chen, Mingrui and Ma, Lu and Yu, Rongyi and Feng, Hengyi and Sun, Shixuan and Meng, Zimo and Ma, Xiaochen and others},
  journal={arXiv preprint arXiv:2603.26164},
  year={2026}
}

📄 License

DataFlex is released under the Apache-2.0 License.