---
title: "InfiAlign: A Scalable and Sample-Efficient Framework for Enhancing LLM Reasoning"
date: 2025-08-12
slug: infialign
keywords:
  - LLM Alignment
  - Reasoning Enhancement
  - Data Efficiency
author: Shuo Cai et al.
category: publication
tags:
  - LLM Alignment
  - Reasoning
  - SFT
  - DPO
categories:
  - AI Research
  - AI
  - Machine Learning
summary: InfiAlign is a novel framework that combines SFT and DPO with an advanced data selection pipeline to efficiently enhance LLM reasoning capabilities.
description: We present InfiAlign, a scalable post-training framework that achieves state-of-the-art reasoning performance while using only 12% of typical training data.
featured: true
weight: 1
---
InfiAlign Framework

InfiAlign isn't just another alignment framework: it's your new secret weapon for supercharging LLMs! By ingeniously combining supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) with our smart data selection pipeline, we achieve remarkable reasoning improvements while using only a fraction of typical training data. Talk about doing more with less! 😉

🌟 Why InfiAlign Stands Out

At its heart lies our efficient data pipeline – an automated curator that handpicks the crème de la crème from open-source reasoning datasets using multidimensional quality metrics. When tested on Qwen2.5-Math-7B-Base, the results were mind-blowing:

  • Matches DeepSeek-R1-Distill-Qwen-7B's performance using just 12% of the data! (Your GPU just breathed a sigh of relief 😌)
  • DPO magic delivers an extra boost, particularly in math tasks (+3.89% on AIME benchmarks)

🚀 Main Contributions

  1. Data-Efficient Alignment via Multi-Dimensional Filtering. We design an automated pipeline that selects high-quality instruction data from open-source corpora using diversity, difficulty, and quality metrics, achieving strong performance with only $\sim$20% of the data used by distilled baselines.
  2. Modular and Scalable Framework. InfiAlign enables seamless integration of new data sources and tasks via its modular design, allowing flexible and low-overhead adaptation across domains.
  3. Enhanced Reasoning through Multi-Stage Training. We adopt a multi-stage training regimen that balances data mixing, curriculum-guided SFT, and DPO to boost reasoning across various benchmarks.

🎉 Hot Off the Press!

🧠 Methodology Overview

Data Selection Pipeline

Traditional alignment methods often require massive amounts of training data, which is computationally expensive and can lead to diminishing returns. Our data-efficient approach addresses this by: (1) eliminating redundant or low-quality samples through multi-dimensional filtering (diversity, difficulty, quality), and (2) optimizing the training curriculum to focus on the most impactful examples. This enables comparable performance to distilled baselines using only 20% of the data, while maintaining strong generalization across reasoning tasks.

Our data curation process transforms raw datasets into high-quality reasoning corpora (a minimal code sketch follows the list below):

  1. Data Preparation: Aggregates QA pairs from multiple sources, standardizes formats, and generates missing Chain-of-Thought traces using teacher models
  2. Diversity Sampling: Balances domain representation (Algebra, Geometry, etc.) and ensures semantic variety through clustering
  3. Difficulty Sampling: Prioritizes complex problems using response length as a proxy for difficulty
  4. Quality Control: Validates answer correctness through automated verifiers and LLM scoring
  5. Benchmark Protection: Implements rigorous decontamination to prevent data leakage
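For concreteness, here is a minimal Python sketch of how the diversity, difficulty, and quality steps could be composed. The function names, sample fields (`domain`, `response`), and the `verify_fn` callback are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of the multi-dimensional selection steps; all names are
# illustrative assumptions rather than the actual InfiAlign pipeline code.
import random
from collections import defaultdict

def diversity_sample(samples, per_domain):
    """Balance domain representation by capping the number of samples per domain."""
    by_domain = defaultdict(list)
    for s in samples:
        by_domain[s["domain"]].append(s)
    selected = []
    for items in by_domain.values():
        random.shuffle(items)
        selected.extend(items[:per_domain])
    return selected

def difficulty_sample(samples, keep_ratio=0.5):
    """Use response length as a proxy for difficulty; keep the longest CoT traces."""
    ranked = sorted(samples, key=lambda s: len(s["response"]), reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_ratio))]

def quality_filter(samples, verify_fn):
    """Keep only samples whose answer passes an automated verifier or LLM scorer."""
    return [s for s in samples if verify_fn(s)]

def select_corpus(samples, per_domain=1000, keep_ratio=0.5, verify_fn=lambda s: True):
    """Compose the filters in order: diversity -> difficulty -> quality."""
    return quality_filter(
        difficulty_sample(diversity_sample(samples, per_domain), keep_ratio),
        verify_fn,
    )
```

In practice, the diversity step would additionally cluster embeddings for semantic variety, `verify_fn` would wrap a rule-based answer checker or an LLM judge, and a separate decontamination pass would remove any sample overlapping the evaluation benchmarks.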

Training Framework

Supervised Fine-Tuning:

  • Utilizes two curated datasets (95K and 165K samples) distilled from 10M+ examples
  • Employs two-phase training: foundational skills development followed by advanced reasoning tasks (a schematic sketch of this schedule follows)
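A schematic sketch of how such a two-phase schedule might be expressed; the dataset file names and epoch counts are placeholders, not the reported configuration.

```python
# Schematic two-phase SFT curriculum; dataset paths and epoch counts are
# placeholders, not the actual InfiAlign training configuration.
from dataclasses import dataclass

@dataclass
class SFTPhase:
    name: str
    dataset: str
    epochs: int

CURRICULUM = [
    SFTPhase("foundational_skills", "sft_phase1_95k.jsonl", epochs=2),
    SFTPhase("advanced_reasoning", "sft_phase2_165k.jsonl", epochs=1),
]

for phase in CURRICULUM:
    # Each phase would invoke a standard SFT loop on its curated dataset,
    # resuming from the checkpoint produced by the previous phase.
    print(f"[{phase.name}] fine-tune on {phase.dataset} for {phase.epochs} epoch(s)")
```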

Direct Preference Optimization:

  • Constructs preference pairs by comparing SFT model errors with expert solutions
  • Implements three refinement steps: data cleaning, challenge selection, and quality verification
  • Optimizes using the DPO objective function to enhance reasoning capabilities (a reference sketch of this loss follows)
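For reference, a minimal PyTorch sketch of the standard DPO objective (Rafailov et al., 2023) applied to such preference pairs; the inputs are assumed to be sequence-level log-probabilities of the chosen (expert) and rejected (erroneous SFT) responses, and `beta=0.1` is an illustrative value rather than the paper's setting.

```python
# Standard DPO loss: -log sigmoid(beta * ((log pi_w - log pi_l) - (log ref_w - log ref_l))).
# Inputs are sequence-level log-probabilities; beta is illustrative, not the paper's value.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Push the policy toward chosen (expert) responses and away from rejected
    (erroneous SFT) responses, relative to a frozen reference model."""
    policy_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_logratios - ref_logratios)).mean()
```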

πŸ† Benchmark Results: Breaking Efficiency Barriers

InfiAlign redefines the performance-efficiency tradeoff, achieving SOTA results with only 12% of typical training data. Our models outperform competitors trained on far larger corpora across mathematical, scientific, and coding domains:

Key Achievements:

  • 3.89% avg gain on AIME 24/25 math competitions and 92.7% accuracy on MATH500, with 79% less data than DeepSeek-Distill
  • Demonstrates excellent generalization across tasks in different fields

Detailed Breakdown:

| Model | Initial Checkpoint | Data Size | AIME 2025 | AIME 2024 | MATH500 | GPQA | MMLU-Pro | LiveCodeBench | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | Qwen2.5-7B-Base | 1M | 8.80 | 11.93 | 76.15 | 38.70 | 57.49 | 15.77 | 34.80 |
| DeepSeek-Distill-Qwen-7B | Qwen2.5-7B-Math-Base | 800K | 37.97 | 55.50 | 92.80 | 49.10 | 54.16 | 37.60 | 54.43 |
| InfiAlign-SFT (ours) | Qwen2.5-7B-Math-Base | 165K | 42.19 | 63.75 | 92.70 | 53.60 | 56.68 | 36.20 | 57.52 |
| InfiAlign-DPO (ours) | InfiAlign-SFT | 10K | 47.45 | 61.25 | 93.45 | 51.77 | 53.95 | 35.30 | 57.20 |

📚 Citation Information

If you find this work useful, a citation to the following paper is welcome:

@misc{cai2025infialignscalablesampleefficientframework,
      title={InfiAlign: A Scalable and Sample-Efficient Framework for Aligning LLMs to Enhance Reasoning Capabilities}, 
      author={Shuo Cai and Su Lu and Qi Zhou and Kejing Yang and Zhijie Sang and Congkai Xie and Hongxia Yang},
      year={2025},
      eprint={2508.05496},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2508.05496}, 
}
