A curated collection of tools, frameworks, and resources for AI-driven automated model training — letting AI agents autonomously run experiments, fine-tune models, optimize hyperparameters, and evolve themselves.
Inspired by Karpathy's AutoResearch, HuggingFace Skills, and the broader AutoML movement.
The paradigm is shifting: instead of manually tuning models, we now have tools that let AI agents design experiments, modify training code, evaluate results, and iterate autonomously — while you sleep.
This repository collects the best open-source tools and frameworks that make this possible across the full training lifecycle.
- Autonomous Experiment / Research Frameworks
- Agent-Driven Training Skills (HuggingFace Ecosystem)
- LLM Fine-Tuning Frameworks
- RL Alignment Training Frameworks (RLHF / GRPO)
- Automated Hyperparameter Optimization / AutoML
- Self-Evolving / Self-Play Training
- Synthetic Data Generation & Curation
- Knowledge Distillation
- Model Merging & Quantization
- Lightweight Pretraining & Distributed Training
- Inference Engines (for RL Training Loops)
- Multimodal Training Frameworks
- Experiment Tracking & Orchestration
- Benchmarks & Evaluation
- Coding Agents (for Training Script Development)
- Recommended Stacks
Core idea: AI agents autonomously design experiments, modify training code, evaluate results, and iterate. You sleep, AI experiments.
| Project | Description | Key Highlight |
|---|---|---|
| AutoResearch | AI agent runs autonomous ML experiments in a loop | 630 lines of Python, ~100 experiments overnight, 11% efficiency gain on GPT-2 training |
| AI Scientist v2 | Fully automated scientific discovery with agentic tree search | Hypothesis → Experiment → Paper, no human templates needed |
| AutoML-Agent | Multi-agent LLM framework for full-pipeline AutoML (ICML 2025) | Parallel specialized agents for preprocessing, architecture design, HPO; retrieval-augmented planning |
| auto-ml-agent | LLM-orchestrated autonomous ML pipeline | End-to-end: data preprocessing → model deployment, multi-agent architecture |
| MLAgentBench | Benchmark for evaluating AI agents on ML experimentation | 13 end-to-end ML tasks from CIFAR-10 to BabyLM |
| AutoAgent | Zero-code LLM agent framework with self-play customization | Create agents via natural language, iterative self-improvement |
| ShinkaEvolve | LLM-as-mutation-operator program evolution framework | Evolves programs for scientific discovery |
| AI-Supervisor | Autonomous research supervision via persistent Research World Model | Multi-agent consensus + Knowledge Graph; validates claims via GPU computation; self-correcting updates |
| ARIS | Lightweight Markdown-only skills for autonomous ML research overnight | Zero dependencies; cross-model review loops; 20+ GPU experiments per overnight run; works with any LLM agent |
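The loop these frameworks automate can be sketched in a few lines. The example below is a toy illustration of the AutoResearch-style pattern (propose a config mutation, run an experiment, keep improvements, iterate); the objective is a cheap surrogate standing in for a real training run, and all names are illustrative, not any framework's actual API.

```python
import random

def run_experiment(config):
    # Stand-in for a real training run; returns a fake "validation loss".
    lr, width = config["lr"], config["width"]
    return (lr - 3e-4) ** 2 * 1e6 + (width - 512) ** 2 * 1e-5

def propose(best_config, rng):
    # The agent's "mutation" step: perturb the current best configuration.
    return {
        "lr": best_config["lr"] * rng.choice([0.5, 1.0, 2.0]),
        "width": max(64, best_config["width"] + rng.choice([-128, 0, 128])),
    }

def overnight_run(n_experiments=100, seed=0):
    rng = random.Random(seed)
    best = {"lr": 1e-3, "width": 256}
    best_loss = run_experiment(best)
    for _ in range(n_experiments):
        candidate = propose(best, rng)
        candidate_loss = run_experiment(candidate)
        if candidate_loss < best_loss:  # keep only improvements
            best, best_loss = candidate, candidate_loss
    return best, best_loss

best, loss = overnight_run()
print(best, loss)
```

Real frameworks replace `propose` with an LLM that reads past results and edits training code, and `run_experiment` with an actual GPU job, but the accept/iterate skeleton is the same.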
"Vibe Training" — use natural language to drive the full model training lifecycle through coding agents.
| Project | Description | Key Highlight |
|---|---|---|
| HuggingFace Skills | Standardized ML skill packages for coding agents | 12 skills: model training (SFT/DPO/GRPO), vision training, experiment tracking, evaluation, dataset management |
| HuggingFace AutoTrain | No-code training platform | Upload data → auto model selection → training → evaluation → Hub publishing |
HF Skills covers:

- hugging-face-model-trainer — Fine-tune LLMs with TRL (SFT, DPO, GRPO), 0.5B to 70B parameters
- hugging-face-vision-trainer — Train object detection & image classification (RTDETRv2, YOLOS, ViT)
- hugging-face-jobs — Run compute jobs on HF infrastructure with cost estimation
- hugging-face-trackio — ML experiment tracking with real-time metrics
- hugging-face-evaluation — Model evaluation with lighteval
- hugging-face-datasets — Dataset creation and management
- Compatible with: Claude Code, OpenAI Codex, Google Gemini CLI, Cursor

The training engines. Upper-level agents (AutoResearch, HF Skills) ultimately call these frameworks to execute training.
| Project | Description | Key Highlight |
|---|---|---|
| Unsloth | Ultra-efficient LLM fine-tuning & RL | 2x faster, 70% less VRAM; custom CUDA kernels; MoE 12x faster; MCP Server available |
| Axolotl | Flexible, production-ready fine-tuning | YAML-driven; v0.8.x: QAT, sequence parallelism, GRPO, full RLHF pipeline |
| LlamaFactory | Unified fine-tuning with Web UI | LlamaBoard browser UI; 100+ models; SFT/RLHF/DPO/PPO |
| TRL | HuggingFace's RL training library | SFT, DPO, GRPO, PPO, KTO, ORPO; deep Transformers/PEFT integration |
| torchtune | PyTorch-native fine-tuning | No extra abstractions; multi-node support (Feb 2025) |
| NeMo AutoModel | NVIDIA's DTensor-native training library | Day-0 HuggingFace support; single-to-multi-node scaling |
| LMFlow | Extensible toolkit for fine-tuning large foundation models | LISA memory-efficient training (outperforms LoRA); FlashAttention; NAACL Best Demo Paper |
| H2O LLM Studio | No-code GUI framework for fine-tuning LLMs | Browser-based UI; LoRA/4-bit/8-bit; DPO/IPO/KTO; W&B integration |
| LitGPT | 20+ high-performance LLMs with pretrain/finetune/deploy recipes | CLI-driven; powered TinyLlama project; NeurIPS 2023 LLM Efficiency Challenge |
| InstructLab | IBM/Red Hat collaborative LLM customization via synthetic data | LAB alignment method; taxonomy-driven skill contributions; targets Granite models |
2025-2026 trend: GRPO (Group Relative Policy Optimization) is replacing PPO as the default alignment method — no critic model needed, making it simpler and more stable.
| Project | Description | Key Highlight |
|---|---|---|
| OpenRLHF | High-performance RLHF framework on Ray + vLLM | 70B+ full tuning; PPO/DAPO/REINFORCE++; async agent RLHF |
| verl | ByteDance's Volcano Engine RL for LLMs | GRPO/PPO in few lines; 3D-HybridEngine; used by ByteDance, Alibaba Qwen, UC Berkeley, LMSys |
| DAPO | Open-source RL system from ByteDance Seed + Tsinghua | 50 pts on AIME 2024 with Qwen2.5-32B; 4 key stability techniques; built on verl |
| AReaL | Fully asynchronous RL for LLM reasoning (Ant Group + Tsinghua) | 2.77x speedup vs synchronous; GSPO algorithm; Ascend NPU support |
| slime | LLM post-training framework for RL scaling (GLM team) | Powers GLM-4.5/4.6/4.7/5; Megatron + SGLang; RLVE (400 verifiable environments) |
| NeMo RL | NVIDIA's scalable post-training RL library | GRPO, SFT, DPO, DAPO; Ray-based; Megatron Core parallelism |
| NeMo Gym | Build RL environments for LLM training | Multi-step/multi-turn environments; interoperable with NeMo RL, OpenRLHF, TRL, Unsloth |
| rLLM | Post-training RL framework for language agents | Custom agents + environments → RL training → deployment; rLLM-FinQA-4B beats Qwen3-235B |
| RAGEN | Multi-turn RL framework for training reasoning agents | StarPO framework; 10 built-in environments; identifies "Echo Trap" instability |
| f-GRPO | f-Divergence based GRPO for general LLM alignment | KL/Reverse KL/Pearson/Hellinger/JS divergences; superior on both RLVR (math) and safety alignment; built on Unsloth |
| Tree-GRPO | Tree search for LLM agent RL (ICLR 2026) | 4x less rollout budget via shared prefixes; step-wise process supervision from outcome reward; tree-structured ReAct |
| SimpleRL-Reason | Simple RL recipe for reasoning (HKUST) | DeepSeek-R1-style; 7B achieves 33.3% AIME with only 8K examples; no SFT needed |
| SWE-RL | Meta's RL for software engineering reasoning | Llama3-SWE-RL-70B achieves 41% on SWE-bench Verified (NeurIPS 2025) |
| OpenManus-RL | RL tuning for LLM agents (UIUC + MetaGPT) | PPO-based; AgentGym environments + verl training |
| LlamaGym | Online RL fine-tuning for LLM agents | Define agent → create LLM → write RL loop |
| Reasoning Gym | Procedural reasoning environments for RLVR | 100+ tasks; NeurIPS 2025 Spotlight; unlimited controllable task generation |
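The reason GRPO needs no critic is that it estimates advantages relative to a group of rollouts sampled from the same prompt, rather than from a learned value network. A minimal sketch of the group-relative advantage computation (standard formulation; exact details vary by implementation):

```python
import math

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize each rollout's reward by the
    group mean and standard deviation. Because the baseline comes from
    the group itself, no separate critic model is required."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one prompt, scored by a verifier (1 = correct):
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct answers get positive advantage, incorrect ones negative, and the group sums to roughly zero — which is why verifiable-reward environments (math, code) pair so naturally with GRPO-family trainers.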
| Project | Description | Key Highlight |
|---|---|---|
| AgentHPO | LLM-driven hyperparameter optimization | Matches/surpasses human best trials on 12 ML tasks with explainable results |
| AutoML-Agent | Multi-agent LLM framework for full-pipeline AutoML (ICML 2025) | Parallel specialized agents; retrieval-augmented planning; 14 datasets tested |
| Optuna | Industry-standard HPO framework | Bayesian search, pruning, distributed execution, visualization dashboard |
| Microsoft NNI | Full AutoML toolkit | Neural Architecture Search + HPO + model compression + feature engineering |
| W&B Sweeps | Automated hyperparameter search + tracking | Bayesian/Grid/Random search; Hyperband early stopping; cross-machine parallelism |
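Under the hood, these tools combine a search strategy with early termination of unpromising trials. The toy sketch below shows random search with a crude pruning rule in that spirit; it uses a surrogate objective instead of real training, and Optuna, NNI, and W&B Sweeps all do this far more robustly (Bayesian sampling, Hyperband, distributed workers).

```python
import random

def objective(lr, dropout, step):
    # Surrogate "validation loss" that improves as training steps accrue.
    base = (lr - 1e-3) ** 2 * 1e5 + (dropout - 0.1) ** 2
    return base + 1.0 / (step + 1)

def search(n_trials=30, max_steps=10, seed=0):
    rng = random.Random(seed)
    best_loss, best_params = float("inf"), None
    for _ in range(n_trials):
        lr = 10 ** rng.uniform(-5, -1)     # log-uniform learning rate
        dropout = rng.uniform(0.0, 0.5)
        for step in range(max_steps):
            loss = objective(lr, dropout, step)
            if loss > 2 * best_loss:       # prune hopeless trials early
                break
        else:
            if loss < best_loss:           # trial ran to completion
                best_loss, best_params = loss, (lr, dropout)
    return best_loss, best_params

loss, params = search()
print(loss, params)
```

The LLM-driven entries above (AgentHPO, AutoML-Agent) replace the random sampler with an agent that reads past trial results and proposes the next configuration with an explanation.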
Core idea: Models generate their own training data to train themselves, reducing dependence on human annotations.
| Project | Description | Key Highlight |
|---|---|---|
| SPIN | Self-Play Fine-Tuning | Model plays against its previous iterations; outperforms DPO + GPT-4 preference data without extra annotations |
| SPPO | Self-Play Preference Optimization | Iterative policy updates approximating Nash equilibrium with convergence guarantees |
| SPC (Self-Play Critic) | Adversarial self-play for evolving reasoning critics | "Sneaky generator" vs "critic" game; eliminates manual step-level annotation |
| SPELL | Self-Play RL for Evolving Long-Context Language Models | Label-free self-play; base model surpasses instruction-tuned counterpart on long-context tasks |
| Multi-Agent Evolve | One LLM plays Proposer + Solver + Judge roles | Verified improvements on math, coding, reasoning with Qwen2.5-3B |
| Multiagent Finetuning | Multi-agent society from same base model | Multi-agent iteration keeps improving where single-model self-training plateaus |
| CORY | Cooperative multi-agent RL fine-tuning | Pioneer + Observer dual-agent paradigm (NeurIPS 2024) |
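SPIN makes the self-play idea concrete with a DPO-style objective: the current model is trained to assign higher likelihood to human data than to its own previous iteration's generations. A hedged sketch of that per-example loss, computed from sequence log-probabilities (variable names here are illustrative; see the SPIN paper for the exact formulation):

```python
import math

def spin_loss(logp_real_cur, logp_real_prev, logp_syn_cur, logp_syn_prev, beta=0.1):
    """SPIN-style logistic loss: reward the current model for raising its
    likelihood on the human response (relative to the previous iteration)
    while lowering it on the previous iteration's own generation."""
    margin = beta * ((logp_real_cur - logp_real_prev)
                     - (logp_syn_cur - logp_syn_prev))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Current model already prefers the human answer -> loss below log(2):
print(spin_loss(-10.0, -12.0, -15.0, -11.0))
```

Each round, the "opponent" is frozen at the previous checkpoint and fresh synthetic responses are regenerated, which is what lets the model keep improving without new human annotations.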
Critical for automated training pipelines: generate high-quality training data at scale without manual annotation.
| Project | Description | Key Highlight |
|---|---|---|
| Distilabel | Framework for synthetic data and AI feedback pipelines | Modular pipeline; SFT/DPO/UltraFeedback techniques; any LLM provider |
| Magpie | Alignment data synthesis from scratch (ICLR 2025) | No prompt engineering needed; 4M instructions generated; matches Llama-3 Instruct |
| DataDreamer | Reproducible synthetic data generation (ACL 2024) | Multi-step prompting; generate/align/fine-tune/distill; built-in caching |
| Cosmopedia | Large-scale synthetic pretraining data pipeline | 25B tokens of synthetic textbooks/blogs; uses Mixtral-8x7B |
| InstructLab SDG | Synthetic data via LAB methodology (IBM/Red Hat) | Skills-SDG + Knowledge-SDG; minimal seed taxonomy → large-scale data |
| Persona Hub | Persona-driven synthetic data at billion scale (Tencent) | 1B diverse personas; 370M elite personas released |
| synth_gen | Execution-verified synthetic data (Meta) | Modular verifier system; parser-based verification for code |
| Evidently | Open-source synthetic data generation with user profiles | Model-agnostic; customizable personas & goals; no-code UI in Evidently Cloud; outputs to pandas DataFrame |
| NVIDIA Nemotron-4 340B | Open models for synthetic data generation pipeline | Base + Instruct + Reward models; commercial use allowed |
| Project | Description | Key Highlight |
|---|---|---|
| NeMo Curator | GPU-accelerated data preprocessing & curation | 30+ filters; fuzzy dedup 1.1T tokens in 1.8h on 64 A100s; 16x faster |
| DataTrove | Platform-agnostic data processing pipeline | Used for FineWeb and Cosmopedia; low memory; Slurm support |
| Dolma | High-performance dataset curation toolkit (AllenAI) | Built-in parallelism for billions of docs; used for OLMo (3T tokens) |
| Data Prep Kit | Unstructured data preparation (IBM) | Python/Ray/Spark runtimes; laptop to datacenter scaling |
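A core operation in these toolkits is fuzzy deduplication: dropping documents that are near-copies rather than exact matches. The brute-force sketch below compares Jaccard similarity over word shingles; production systems (NeMo Curator, Dolma, DataTrove) use MinHash/LSH to make this tractable at trillion-token scale, so treat this as an illustration of the idea only.

```python
def shingles(text, n=3):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def dedup(docs, threshold=0.7):
    """Keep a document only if it is not too similar to any kept one."""
    kept = []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, shingles(k)) < threshold for k in kept):
            kept.append(doc)
    return kept

docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy cat",   # near-duplicate
    "training data quality matters more than size",
]
kept_docs = dedup(docs)
print(kept_docs)
```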
Compress large models into smaller, deployable ones while preserving capabilities.
| Project | Description | Key Highlight |
|---|---|---|
| EasyDistill | Comprehensive distillation toolkit (Alibaba/ModelScope, EMNLP 2025) | Black-box + white-box KD; data synthesis + SFT + logits distillation + RL |
| DistillKit | Production-ready LLM distillation (Arcee AI) | Online and offline workflows; powers Arcee Virtuoso, SuperNova models |
| MiniPLM | Knowledge distillation for pre-training (Tsinghua, ICLR 2025) | Improved DPKD variant |
| DistiLLM | Streamlined distillation with contrastive approach (ICML 2024) | DistiLLM-2 contrastive distillation |
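The white-box ("logits") distillation these toolkits implement traces back to a simple loss: match the student's output distribution to the teacher's, softened by a temperature. A minimal sketch of that classic term (Hinton et al.); real recipes mix it with hard-label cross-entropy and, in the frameworks above, with data synthesis and RL stages:

```python
import math

def softmax(logits, T=1.0):
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, T=2.0):
    """Temperature-softened KL(teacher || student), scaled by T^2 so
    gradients keep comparable magnitude across temperatures."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return T * T * kl

# A student matching the teacher exactly drives the loss to zero:
print(distill_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0]))
```

The temperature is the interesting knob: higher T exposes the teacher's relative preferences among wrong answers ("dark knowledge"), which is much of what the student learns from.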
Combine multiple models or compress them for efficient deployment and training.
| Project | Description | Key Highlight |
|---|---|---|
| MergeKit | Leading toolkit for merging pretrained LLMs | SLERP, TIES, DARE, Passthrough, Evolutionary merge; works on CPU with 8GB VRAM |
| MergeLM | Language model merging codebase (ICML 2024) | Research-grade implementations |
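SLERP, the workhorse merge method in MergeKit, interpolates along the arc between two weight vectors instead of the straight line, which tends to preserve weight norms better than plain averaging. A sketch of the primitive on plain Python lists (real merges operate tensor-by-tensor, often with per-layer interpolation factors):

```python
import math

def slerp(a, b, t=0.5):
    """Spherical linear interpolation between two weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    cos = max(-1.0, min(1.0, dot / (na * nb)))
    theta = math.acos(cos)
    if theta < 1e-6:  # nearly parallel: fall back to linear interpolation
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    s0 = math.sin((1 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return [s0 * x + s1 * y for x, y in zip(a, b)]

# Midpoint between two orthogonal unit vectors stays on the unit sphere:
print(slerp([1.0, 0.0], [0.0, 1.0], 0.5))
```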
| Project | Description | Key Highlight |
|---|---|---|
| GPTQModel | Production-ready LLM quantization toolkit | GPTQ, AWQ, QQQ, GPTAQ, EoRA, GAR; multi-backend CPU/GPU |
| AutoGPTQ | Easy-to-use GPTQ quantization | 8/4/3/2-bit; Marlin int4*fp16 kernel; ~150-200K monthly PyPI downloads |
| AutoRound | Advanced quantization via sign-gradient descent (Intel) | High accuracy at 2-4 bits; exports to GPTQ/AWQ/GGUF; broad HW compatibility |
| NVIDIA Model Optimizer | Unified quantization, pruning, distillation & speculative decoding | FP8/INT8/INT4; exports to TensorRT-LLM/vLLM; NeMo Megatron integration |
| TurboQuant | Google's KV cache compression (ICLR 2026) | 6x memory reduction at 3-bit with zero accuracy loss; PolarQuant + QJL; 8x perf on H100 |
| llama.cpp | LLM inference in C/C++ with GGUF quantization | Q4_K_M sweet spot: 92% quality, 75% size reduction; runs everywhere |
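At its simplest, weight quantization maps floats to a small integer grid plus a scale factor. The toy below does symmetric per-tensor int4 (values in [-8, 7]) to show the round trip and its error; the toolkits above (GPTQ, AWQ, AutoRound) get far better accuracy by using calibration data, per-group scales, and error-compensating updates.

```python
def quantize_int4(weights):
    """Toy symmetric int4 quantization: one scale for the whole tensor."""
    scale = max(abs(w) for w in weights) / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

w = [0.12, -0.53, 0.97, -1.4, 0.08]
q, scale = quantize_int4(w)
w_hat = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
print(q, round(max_err, 4))
```

The worst-case round-trip error is about half the scale step, which is why per-group scales (smaller groups, smaller steps) are the first trick every serious quantizer applies.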
Pair these with autonomous experiment frameworks — fast, small-scale training is the foundation for autonomous experimentation.
| Project | Description | Key Highlight |
|---|---|---|
| nanochat | Minimal LLM training harness (AutoResearch's engine) | Single GPU; tokenization → pretrain → finetune → eval → chat; GPT-2 for ~$48 |
| Nanotron | Minimal 3D-parallel LLM pretraining | Data + Tensor + Pipeline parallelism; scales from experiments to production |
| Project | Description | Key Highlight |
|---|---|---|
| TorchTitan | PyTorch-native large-scale training platform | Up to 4D parallelism without model code changes; MXFP8 on Blackwell; elastic scaling |
| Open-dLLM | First open-source full stack for diffusion LLMs | Raw data → training → checkpoints → evaluation → inference, all-in-one |
Inference engines are critical for RL training — 80% of RLHF training time is spent on sample generation. Fast inference = fast training.
| Project | Description | Key Highlight |
|---|---|---|
| vLLM | Most mature open-source LLM serving engine | PagedAttention; 4x higher throughput on Blackwell; core engine for OpenRLHF |
| SGLang | High-performance serving for LLMs & multimodal | ~16,200 tok/sec on H100; RadixAttention; used by slime for RL training |
| TensorRT-LLM | NVIDIA's optimized inference library | FP8/FP4/INT4; EAGLE-3 speculative decoding; max GPU performance |
| LMDeploy | LLM compression, deployment & serving | TurboMind MXFP4; 1.5x vLLM performance; DeepSeek PD disaggregation |
| HuggingFace TGI | Multi-backend LLM serving (TensorRT-LLM, vLLM, llama.cpp) | Unified frontend; token streaming; HF Hub native; CPU/GPU/Inferentia support |
| NVIDIA Dynamo | Datacenter-scale distributed inference | 30x request throughput on DeepSeek-R1; disaggregated prefill/decode; Rust + Python |
Training models that understand text, images, video, and audio simultaneously.
| Project | Description | Key Highlight |
|---|---|---|
| LLaVA-OneVision-1.5 | Fully open-source multimodal training | Native-resolution images; SOTA performance; lower training costs |
| LLaVA-OneVision-1.5-RL | Democratized multimodal RL training | Open code, data, and models for multimodal RLHF |
| OpenRLHF-M | Multimodal model RLHF training | Extension of OpenRLHF for VLMs |
| LLaVA-KD | Multimodal knowledge distillation (ICCV 2025) | Distills large MLLMs into smaller ones |
| MoE-LLaVA | Mixture-of-Experts for vision-language models (TMM 2025) | Efficient multimodal MoE architecture |
| Project | Description | Key Highlight |
|---|---|---|
| Weights & Biases | Experiment tracking + sweeps + model registry | Industry standard; integrates with all major frameworks |
| MLflow 3.0 | Open-source experiment tracking + model serving | Self-hosted; nested experiments; model registry |
| ClearML | Open-source MLOps platform | 150K+ users at Fortune 500; auto-logging; pipeline orchestration; dataset versioning |
| HF Trackio | Lightweight experiment tracking in HF ecosystem | Deep integration with HF Skills; agents can read metrics and make decisions |
| Benchmark | Description | Key Highlight |
|---|---|---|
| MLE-bench | 75 Kaggle ML engineering competition tasks | Evaluates AI agents on real ML engineering: training, data prep, experiments |
| MLAgentBench | 13 end-to-end ML experimentation tasks | Stanford SNAP; Claude v3 Opus best at 37.5% |
| PaperBench | Evaluates AI's ability to replicate ICML 2024 papers | 8,316 gradable tasks across 20 papers; best agent scores 21% |
| CORE-Bench | Computational Reproducibility Agent Benchmark | 270 tasks from 90 papers across CS, social science, medicine |
| MLRC-Bench | ML Research Competition challenges | Tests novel methodology development |
| AgentBench | Multi-dimensional benchmark for LLM agents | Tests across OS, database, knowledge graph, web, and game environments |
| SWE-bench Verified | Human-verified GitHub issue resolution | Industry standard for coding agents; top scores 70%+ |
| LiveBench | Monthly-updated contamination-free LLM benchmark | 6 categories (Math/Reasoning/Coding/Language/Data/IF); objective auto-scoring; no LLM judge needed |
| Tool | Description | Key Highlight |
|---|---|---|
| DeepEval | Pytest-like LLM evaluation framework | v3.0: 14+ metrics; multi-turn simulation; DeepTeam for red teaming |
| Opik | Open-source LLM observability & evaluation (Comet) | Deep tracing; LLM-as-a-judge; hallucination detection; production dashboards |
| LMMs-Eval | Multimodal evaluation across text, image, video, audio | v0.6: eval-as-a-service; 7.5x throughput; 50+ tasks |
| Arize Phoenix | Open-source LLM observability and evaluation | Fully self-hosted; tracing, evaluation, retrieval analysis |
| LiveCodeBench | Contamination-free coding benchmark | Fresh problems from LeetCode/AtCoder/Codeforces |
These agents don't train models directly, but can write and debug training code, completing the automation loop when paired with HF Skills.
| Project | Description | Key Highlight |
|---|---|---|
| Aider | Terminal AI pair programming | Git integration; supports Claude/GPT/DeepSeek/local models |
| OpenHands | AI-driven software development (open-source Devin) | Autonomous code editing + execution + debugging; MIT license |
| SWE-agent | Autonomous GitHub issue fixer | SWE-bench open-source SOTA (NeurIPS 2024) |
| Open-SWE | LangChain's async cloud-hosted coding agent | Multi-agent (Planner + Reviewer); GitHub integration; auto PR creation |
| SERA | Ai2's open coding agent family | 54.2% on SWE-bench; trains in 40 GPU-days (~$2K); all open |
| Cline | VS Code AI coding agent with 60K+ GitHub stars | MCP tool creation; 5M+ developers; human-in-the-loop approval; native subagents |
| OpenCode | Go-based terminal AI agent with 95K+ GitHub stars | Bubble Tea TUI; 75+ LLM providers; 6.5M monthly developers; SQLite persistence |
| Plandex | Terminal agent for large projects with 2M token context | Tree-sitter project maps; diff review sandbox; auto-debugging; 30+ languages |
| Roo Code | Terminal agent with 95K+ GitHub stars | 75+ LLM providers; plan-first development; 2.5M monthly developers |
HuggingFace Skills + Claude Code + Unsloth + W&B
Natural language → Claude Code orchestrates → HF Skills calls Unsloth for training → W&B tracks experiments.
AutoResearch + nanochat (single GPU)
Start before bed, wake up to ~100 autonomous experiment results.
Axolotl / LlamaFactory + OpenRLHF + Optuna + MLflow
YAML-configured training + automated HPO + full experiment tracking.
verl / OpenRLHF + vLLM/SGLang + Reasoning Gym + W&B
State-of-the-art RL training with fast inference engines and rich environments.
Distilabel / Magpie → Unsloth / TRL → DeepEval / LMMs-Eval
Generate data at scale → train efficiently → evaluate comprehensively.
- AutoResearch Paradigm: Karpathy proved "AI autonomously doing ML research" works with just 630 lines of code — now spawning derivatives like ARIS and AI-Supervisor
- "Vibe Training": HF Skills enables natural-language-driven model training lifecycle
- GRPO Variants Proliferate: f-GRPO (f-divergence family), Tree-GRPO (tree search, ICLR 2026), DAPO — GRPO is the new default, and specialized variants are emerging fast
- RL Framework Explosion: verl, DAPO, AReaL, slime — every major lab now has an open-source RL training framework
- Self-Play Breakthrough: Multi-agent self-evolution (SPIN, MAE, SPC) overcomes single-model self-training plateaus
- Synthetic Data as Infrastructure: Distilabel, Magpie, Evidently make data generation a first-class pipeline stage; model collapse mitigation (Evol-Instruct) becoming standard
- MCP Standardization: Model Context Protocol adopted by OpenAI/Google/Microsoft as the "USB-C for AI agents"
- Single-GPU Research: Unsloth + nanochat + AutoResearch enables individual developers to do serious LLM research
- Inference-Training Convergence: vLLM/SGLang/TGI are now core components of RL training loops, not just serving
- Multimodal RL: LLaVA-OneVision-1.5-RL and OpenRLHF-M bring RL alignment to vision-language models
- Extreme Quantization: Google TurboQuant achieves 6x KV cache compression at zero accuracy loss (ICLR 2026); NVIDIA Model Optimizer unifies quantization/pruning/distillation
- Multi-Agent Coding Wave: Feb 2026 saw every major tool ship multi-agent capabilities (Grok Build, Windsurf, Claude Code, Codex CLI, Devin) — coding agents now routinely write training scripts
- Awesome LLM Synthetic Data
- Awesome Knowledge Distillation of LLMs
- Awesome Model Merging
- Awesome LLM Quantization
- Awesome LLM Inference Engine
- LLM Datasets
- LLM Distillation Playbook
Contributions are welcome! Please open an issue or submit a PR if you know of tools that fit this collection.
Criteria for inclusion:
- Must be directly usable for automated model training workflows
- Preference for open-source projects with active maintenance
- Focus on tools that leverage AI/LLMs to automate the training process itself
This curated list is released under CC0 1.0.
Compiled March 2026, updated April 2026. Project statuses may change — check individual GitHub repos for the latest.