BinHPdev/awesome-algorithm-auto-tools


# Awesome Algorithm Auto Tools

A curated collection of tools, frameworks, and resources for AI-driven automated model training — letting AI agents autonomously run experiments, fine-tune models, optimize hyperparameters, and evolve themselves.

Inspired by Karpathy's AutoResearch, HuggingFace Skills, and the broader AutoML movement.


## Why This List?

The paradigm is shifting: instead of manually tuning models, we now have tools that let AI agents design experiments, modify training code, evaluate results, and iterate autonomously — while you sleep.

This repository collects the best open-source tools and frameworks that make this possible across the full training lifecycle.


## Table of Contents

- Autonomous Experiment / Research Frameworks
- Agent-Driven Training Skills (HuggingFace Ecosystem)
- LLM Fine-Tuning Frameworks
- RL Alignment Training Frameworks (RLHF / GRPO)
- Automated Hyperparameter Optimization / AutoML
- Self-Evolving / Self-Play Training
- Synthetic Data Generation & Curation
- Knowledge Distillation
- Model Merging & Quantization
- Lightweight Pretraining & Distributed Training
- Inference Engines (for RL Training Loops)
- Multimodal Training Frameworks
- Experiment Tracking & Orchestration
- Benchmarks & Evaluation
- Coding Agents (for Training Script Development)
- Recommended Stacks
- Trends (2026 Q2 Update)
- Related Awesome Lists
- Contributing
- License


## Autonomous Experiment / Research Frameworks

Core idea: AI agents autonomously design experiments, modify training code, evaluate results, and iterate. You sleep, the AI experiments.

| Project | Description | Key Highlight |
|---|---|---|
| AutoResearch | AI agent runs autonomous ML experiments in a loop | 630 lines of Python, ~100 experiments overnight, 11% efficiency gain on GPT-2 training |
| AI Scientist v2 | Fully automated scientific discovery with agentic tree search | Hypothesis → Experiment → Paper, no human templates needed |
| AutoML-Agent | Multi-agent LLM framework for full-pipeline AutoML (ICML 2025) | Parallel specialized agents for preprocessing, architecture design, HPO; retrieval-augmented planning |
| auto-ml-agent | LLM-orchestrated autonomous ML pipeline | End-to-end: data preprocessing → model deployment, multi-agent architecture |
| MLAgentBench | Benchmark for evaluating AI agents on ML experimentation | 13 end-to-end ML tasks from CIFAR-10 to BabyLM |
| AutoAgent | Zero-code LLM agent framework with self-play customization | Create agents via natural language, iterative self-improvement |
| ShinkaEvolve | LLM-as-mutation-operator program evolution framework | Evolves programs for scientific discovery |
| AI-Supervisor | Autonomous research supervision via persistent Research World Model | Multi-agent consensus + Knowledge Graph; validates claims via GPU computation; self-correcting updates |
| ARIS | Lightweight Markdown-only skills for autonomous ML research overnight | Zero dependencies; cross-model review loops; 20+ GPU experiments per overnight run; works with any LLM agent |
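To make the loop concrete, here is a deliberately toy sketch of the propose → train → evaluate → iterate cycle these frameworks implement. All names here (`propose_change`, `run_and_evaluate`, `overnight_loop`) are illustrative, not any project's real API, and the "training run" is a stand-in function:

```python
import random

def propose_change(history, rng):
    """Stand-in for an LLM proposing the next experiment from past results:
    here, a naive hill-climb that perturbs the best learning rate so far."""
    best_lr = max(history, key=lambda h: h["score"])["lr"] if history else 1e-3
    return {"lr": best_lr * rng.choice([0.5, 1.0, 2.0])}

def run_and_evaluate(config):
    """Stand-in for a real training + eval run; a toy score peaking near lr=3e-4."""
    return 1.0 / (1.0 + abs(config["lr"] - 3e-4) * 1e3)

def overnight_loop(n_experiments=100, seed=0):
    """Propose -> run -> evaluate -> iterate, with no human in the loop."""
    rng = random.Random(seed)
    history = []
    for _ in range(n_experiments):
        config = propose_change(history, rng)
        history.append({**config, "score": run_and_evaluate(config)})
    return max(history, key=lambda h: h["score"])
```

Real frameworks replace the hill-climb with an LLM that reads logs and edits training code, but the control flow is essentially the same.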

## Agent-Driven Training Skills (HuggingFace Ecosystem)

"Vibe Training" — use natural language to drive the full model training lifecycle through coding agents.

| Project | Description | Key Highlight |
|---|---|---|
| HuggingFace Skills | Standardized ML skill packages for coding agents | 12 skills: model training (SFT/DPO/GRPO), vision training, experiment tracking, evaluation, dataset management |
| HuggingFace AutoTrain | No-code training platform | Upload data → auto model selection → training → evaluation → Hub publishing |

HF Skills covers:

- `hugging-face-model-trainer` — Fine-tune LLMs with TRL (SFT, DPO, GRPO), 0.5B to 70B parameters
- `hugging-face-vision-trainer` — Train object detection & image classification (RTDETRv2, YOLOS, ViT)
- `hugging-face-jobs` — Run compute jobs on HF infrastructure with cost estimation
- `hugging-face-trackio` — ML experiment tracking with real-time metrics
- `hugging-face-evaluation` — Model evaluation with lighteval
- `hugging-face-datasets` — Dataset creation and management
- Compatible with: Claude Code, OpenAI Codex, Google Gemini CLI, Cursor

## LLM Fine-Tuning Frameworks

These are the training engines: upper-level agents (AutoResearch, HF Skills) ultimately call these frameworks to execute training.

| Project | Description | Key Highlight |
|---|---|---|
| Unsloth | Ultra-efficient LLM fine-tuning & RL | 2x faster, 70% less VRAM; custom CUDA kernels; MoE 12x faster; MCP Server available |
| Axolotl | Flexible, production-ready fine-tuning | YAML-driven; v0.8.x: QAT, sequence parallelism, GRPO, full RLHF pipeline |
| LlamaFactory | Unified fine-tuning with Web UI | LlamaBoard browser UI; 100+ models; SFT/RLHF/DPO/PPO |
| TRL | HuggingFace's RL training library | SFT, DPO, GRPO, PPO, KTO, ORPO; deep Transformers/PEFT integration |
| torchtune | PyTorch-native fine-tuning | No extra abstractions; multi-node support (Feb 2025) |
| NeMo AutoModel | NVIDIA's DTensor-native training library | Day-0 HuggingFace support; single-to-multi-node scaling |
| LMFlow | Extensible toolkit for fine-tuning large foundation models | LISA memory-efficient training (outperforms LoRA); FlashAttention; NAACL Best Demo Paper |
| H2O LLM Studio | No-code GUI framework for fine-tuning LLMs | Browser-based UI; LoRA/4-bit/8-bit; DPO/IPO/KTO; W&B integration |
| LitGPT | 20+ high-performance LLMs with pretrain/finetune/deploy recipes | CLI-driven; powered TinyLlama project; NeurIPS 2023 LLM Efficiency Challenge |
| InstructLab | IBM/Red Hat collaborative LLM customization via synthetic data | LAB alignment method; taxonomy-driven skill contributions; targets Granite models |
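To show what "YAML-driven" means in practice, here is a minimal Axolotl-style config sketch for a QLoRA fine-tune. The field names follow Axolotl's documented YAML schema, but the values are illustrative; check the project's own example configs for current options.

```yaml
# Illustrative Axolotl-style config: QLoRA fine-tune of an 8B base model.
# Values are examples, not recommendations.
base_model: meta-llama/Llama-3.1-8B
load_in_4bit: true

datasets:
  - path: tatsu-lab/alpaca   # any HF dataset in a supported format
    type: alpaca

adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true

sequence_len: 2048
micro_batch_size: 2
gradient_accumulation_steps: 8
num_epochs: 1
learning_rate: 2.0e-4
optimizer: adamw_torch
lr_scheduler: cosine

output_dir: ./outputs/llama3-qlora
```

Because the whole run is declared in one file, an agent can generate, diff, and re-launch configs like this without touching training code.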

## RL Alignment Training Frameworks (RLHF / GRPO)

2025-2026 trend: GRPO (Group Relative Policy Optimization) is replacing PPO as the default alignment method — no critic model needed, simpler and more stable.

| Project | Description | Key Highlight |
|---|---|---|
| OpenRLHF | High-performance RLHF framework on Ray + vLLM | 70B+ full tuning; PPO/DAPO/REINFORCE++; async agent RLHF |
| verl | ByteDance's Volcano Engine RL for LLMs | GRPO/PPO in few lines; 3D-HybridEngine; used by ByteDance, Alibaba Qwen, UC Berkeley, LMSys |
| DAPO | Open-source RL system from ByteDance Seed + Tsinghua | 50 pts on AIME 2024 with Qwen2.5-32B; 4 key stability techniques; built on verl |
| AReaL | Fully asynchronous RL for LLM reasoning (Ant Group + Tsinghua) | 2.77x speedup vs synchronous; GSPO algorithm; Ascend NPU support |
| slime | LLM post-training framework for RL scaling (GLM team) | Powers GLM-4.5/4.6/4.7/5; Megatron + SGLang; RLVE (400 verifiable environments) |
| NeMo RL | NVIDIA's scalable post-training RL library | GRPO, SFT, DPO, DAPO; Ray-based; Megatron Core parallelism |
| NeMo Gym | Build RL environments for LLM training | Multi-step/multi-turn environments; interoperable with NeMo RL, OpenRLHF, TRL, Unsloth |
| rLLM | Post-training RL framework for language agents | Custom agents + environments → RL training → deployment; rLLM-FinQA-4B beats Qwen3-235B |
| RAGEN | Multi-turn RL framework for training reasoning agents | StarPO framework; 10 built-in environments; identifies "Echo Trap" instability |
| f-GRPO | f-Divergence based GRPO for general LLM alignment | KL/Reverse KL/Pearson/Hellinger/JS divergences; superior on both RLVR (math) and safety alignment; built on Unsloth |
| Tree-GRPO | Tree search for LLM agent RL (ICLR 2026) | 4x less rollout budget via shared prefixes; step-wise process supervision from outcome reward; tree-structured ReAct |
| SimpleRL-Reason | Simple RL recipe for reasoning (HKUST) | DeepSeek-R1-style; 7B achieves 33.3% AIME with only 8K examples; no SFT needed |
| SWE-RL | Meta's RL for software engineering reasoning | Llama3-SWE-RL-70B achieves 41% on SWE-bench Verified (NeurIPS 2025) |
| OpenManus-RL | RL tuning for LLM agents (UIUC + MetaGPT) | PPO-based; AgentGym environments + verl training |
| LlamaGym | Online RL fine-tuning for LLM agents | Define agent → create LLM → write RL loop |
| Reasoning Gym | Procedural reasoning environments for RLVR | 100+ tasks; NeurIPS 2025 Spotlight; unlimited controllable task generation |
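The group-relative trick that makes GRPO critic-free (noted above) can be sketched in a few lines: sample a group of completions per prompt, score them, and normalize each reward against its own group. Real implementations apply this per token inside a clipped policy-gradient objective; this is only the advantage computation.

```python
from statistics import mean, stdev

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: each completion's reward is normalized
    against the mean/std of its own group, so no learned critic is needed."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, a group of 4 sampled completions scored by a reward function:
advs = grpo_advantages([1.0, 0.0, 0.5, 0.5])
```

Completions above the group mean get positive advantage and are reinforced; those below are suppressed, all relative to siblings from the same prompt.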

## Automated Hyperparameter Optimization / AutoML

| Project | Description | Key Highlight |
|---|---|---|
| AgentHPO | LLM-driven hyperparameter optimization | Matches/surpasses human best trials on 12 ML tasks with explainable results |
| AutoML-Agent | Multi-agent LLM framework for full-pipeline AutoML (ICML 2025) | Parallel specialized agents; retrieval-augmented planning; 14 datasets tested |
| Optuna | Industry-standard HPO framework | Bayesian search, pruning, distributed execution, visualization dashboard |
| Microsoft NNI | Full AutoML toolkit | Neural Architecture Search + HPO + model compression + feature engineering |
| W&B Sweeps | Automated hyperparameter search + tracking | Bayesian/Grid/Random search; Hyperband early stopping; cross-machine parallelism |
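Stripped of smart samplers and pruning, the loop these tools automate reduces to something like the pure-Python random-search sketch below. Here `objective` is a hypothetical stand-in for a real training run; Optuna, NNI, and W&B Sweeps add Bayesian samplers, early stopping, and distributed execution on top of this skeleton.

```python
import math
import random

def objective(lr, batch_size):
    """Stand-in for a real training run returning a validation loss.
    (Hypothetical surface with a minimum near lr=1e-2, batch_size=64.)"""
    return (math.log10(lr) + 2) ** 2 + (math.log2(batch_size) - 6) ** 2 / 10

def random_search(n_trials=200, seed=0):
    rng = random.Random(seed)
    best_loss, best_params = float("inf"), None
    for _ in range(n_trials):
        lr = 10 ** rng.uniform(-5, -1)           # log-uniform learning rate
        bs = rng.choice([16, 32, 64, 128, 256])  # categorical batch size
        loss = objective(lr, bs)
        if loss < best_loss:
            best_loss, best_params = loss, {"lr": lr, "batch_size": bs}
    return best_loss, best_params

loss, params = random_search()
```

Swapping the `rng` draws for Optuna's `trial.suggest_*` calls turns this directly into an Optuna study.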

## Self-Evolving / Self-Play Training

Core idea: Models generate their own training data to train themselves, reducing dependence on human annotations.

| Project | Description | Key Highlight |
|---|---|---|
| SPIN | Self-Play Fine-Tuning | Model plays against its previous iterations; outperforms DPO + GPT-4 preference data without extra annotations |
| SPPO | Self-Play Preference Optimization | Iterative policy updates approximating Nash equilibrium with convergence guarantees |
| SPC (Self-Play Critic) | Adversarial self-play for evolving reasoning critics | "Sneaky generator" vs "critic" game; eliminates manual step-level annotation |
| SPELL | Self-Play RL for Evolving Long-Context Language Models | Label-free self-play; base model surpasses instruction-tuned counterpart on long-context tasks |
| Multi-Agent Evolve | One LLM plays Proposer + Solver + Judge roles | Verified improvements on math, coding, reasoning with Qwen2.5-3B |
| Multiagent Finetuning | Multi-agent society from same base model | Multi-agent iteration keeps improving where single-model self-training plateaus |
| CORY | Cooperative multi-agent RL fine-tuning | Pioneer + Observer dual-agent paradigm (NeurIPS 2024) |

## Synthetic Data Generation & Curation

Critical for automated training pipelines: generate high-quality training data at scale without manual annotation.

### Data Generation

| Project | Description | Key Highlight |
|---|---|---|
| Distilabel | Framework for synthetic data and AI feedback pipelines | Modular pipeline; SFT/DPO/UltraFeedback techniques; any LLM provider |
| Magpie | Alignment data synthesis from scratch (ICLR 2025) | No prompt engineering needed; 4M instructions generated; matches Llama-3 Instruct |
| DataDreamer | Reproducible synthetic data generation (ACL 2024) | Multi-step prompting; generate/align/fine-tune/distill; built-in caching |
| Cosmopedia | Large-scale synthetic pretraining data pipeline | 25B tokens of synthetic textbooks/blogs; uses Mixtral-8x7B |
| InstructLab SDG | Synthetic data via LAB methodology (IBM/Red Hat) | Skills-SDG + Knowledge-SDG; minimal seed taxonomy → large-scale data |
| Persona Hub | Persona-driven synthetic data at billion scale (Tencent) | 1B diverse personas; 370M elite personas released |
| synth_gen | Execution-verified synthetic data (Meta) | Modular verifier system; parser-based verification for code |
| Evidently | Open-source synthetic data generation with user profiles | Model-agnostic; customizable personas & goals; no-code UI in Evidently Cloud; outputs to pandas DataFrame |
| NVIDIA Nemotron-4 340B | Open models for synthetic data generation pipeline | Base + Instruct + Reward models; commercial use allowed |

### Data Curation & Filtering

| Project | Description | Key Highlight |
|---|---|---|
| NeMo Curator | GPU-accelerated data preprocessing & curation | 30+ filters; fuzzy dedup 1.1T tokens in 1.8h on 64 A100s; 16x faster |
| DataTrove | Platform-agnostic data processing pipeline | Used for FineWeb and Cosmopedia; low memory; Slurm support |
| Dolma | High-performance dataset curation toolkit (AllenAI) | Built-in parallelism for billions of docs; used for OLMo (3T tokens) |
| Data Prep Kit | Unstructured data preparation (IBM) | Python/Ray/Spark runtimes; laptop to datacenter scaling |
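As a minimal illustration of what the deduplication stage in these pipelines does, here is an exact-dedup sketch. This is the simplest of the filters; tools like NeMo Curator and DataTrove layer fuzzy (MinHash-based) dedup and dozens of quality filters on top.

```python
import hashlib

def normalize(text):
    """Cheap canonical form: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())

def exact_dedup(docs):
    """Keep the first occurrence of each normalized document,
    dropping later documents whose normalized text hashes identically."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept
```

Hashing instead of storing the text itself keeps memory roughly constant per document, which is why this pattern scales to trillion-token corpora.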

## Knowledge Distillation

Compress large models into smaller, deployable ones while preserving capabilities.

| Project | Description | Key Highlight |
|---|---|---|
| EasyDistill | Comprehensive distillation toolkit (Alibaba/ModelScope, EMNLP 2025) | Black-box + white-box KD; data synthesis + SFT + logits distillation + RL |
| DistillKit | Production-ready LLM distillation (Arcee AI) | Online and offline workflows; powers Arcee Virtuoso, SuperNova models |
| MiniPLM | Knowledge distillation for pre-training (Tsinghua, ICLR 2025) | Improved DPKD variant |
| DistiLLM | Streamlined distillation with contrastive approach (ICML 2024) | DistiLLM-2 contrastive distillation |
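The white-box "logits distillation" these toolkits implement builds on the classic temperature-softened KL objective (Hinton-style), sketched here in plain Python for a single token position; production code does this over batches of tensors and mixes in a standard cross-entropy term.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: higher T produces softer distributions."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions,
    scaled by T^2 so gradients stay comparable across temperatures."""
    p = softmax(teacher_logits, temperature)  # soft teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl
```

The soft targets carry the teacher's relative preferences between wrong answers, which is where much of the "dark knowledge" transfer comes from.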

## Model Merging & Quantization

Combine multiple models or compress them for efficient deployment and training.

### Model Merging

| Project | Description | Key Highlight |
|---|---|---|
| MergeKit | Leading toolkit for merging pretrained LLMs | SLERP, TIES, DARE, Passthrough, Evolutionary merge; works on CPU with 8GB VRAM |
| MergeLM | Language model merging codebase (ICML 2024) | Research-grade implementations |
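SLERP, one of the merge methods MergeKit implements, can be sketched for a single pair of weight vectors. This is a simplified 1-D illustration of the math, not MergeKit's actual code, which applies it tensor-by-tensor across two checkpoints.

```python
import math

def slerp(a, b, t):
    """Spherical linear interpolation between weight vectors a and b:
    interpolates along the great-circle arc rather than the straight line,
    preserving vector norm better than plain averaging."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    dot = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    dot = max(-1.0, min(1.0, dot))            # guard acos domain
    theta = math.acos(dot)                    # angle between the vectors
    if theta < 1e-6:                          # nearly parallel: fall back to lerp
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    s1 = math.sin((1 - t) * theta) / math.sin(theta)
    s2 = math.sin(t * theta) / math.sin(theta)
    return [s1 * x + s2 * y for x, y in zip(a, b)]
```

At `t=0` this returns the first model's weights, at `t=1` the second's, and in between it blends along the arc connecting them.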

### Quantization

| Project | Description | Key Highlight |
|---|---|---|
| GPTQModel | Production-ready LLM quantization toolkit | GPTQ, AWQ, QQQ, GPTAQ, EoRA, GAR; multi-backend CPU/GPU |
| AutoGPTQ | Easy-to-use GPTQ quantization | 8/4/3/2-bit; Marlin int4*fp16 kernel; ~150-200K monthly PyPI downloads |
| AutoRound | Advanced quantization via sign-gradient descent (Intel) | High accuracy at 2-4 bits; exports to GPTQ/AWQ/GGUF; broad HW compatibility |
| NVIDIA Model Optimizer | Unified quantization, pruning, distillation & speculative decoding | FP8/INT8/INT4; exports to TensorRT-LLM/vLLM; NeMo Megatron integration |
| TurboQuant | Google's KV cache compression (ICLR 2026) | 6x memory reduction at 3-bit with zero accuracy loss; PolarQuant + QJL; 8x perf on H100 |
| llama.cpp | LLM inference in C/C++ with GGUF quantization | Q4_K_M sweet spot: 92% quality, 75% size reduction; runs everywhere |
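The round-to-nearest step underneath these toolkits can be sketched in a few lines. This is an illustrative baseline, not any library's implementation; GPTQ/AWQ improve on it with calibration data and error compensation, and GGUF formats group weights with per-block scales much like this.

```python
def quantize_int4(weights):
    """Symmetric round-to-nearest int4 quantization of one weight group:
    pick a per-group scale so the largest magnitude maps to +/-7,
    then round every weight to the nearest representable level."""
    scale = max(abs(w) for w in weights) / 7   # int4 holds -8..7; use +/-7
    if scale == 0:
        return [0] * len(weights), 0.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate fp weights from int codes and the group scale."""
    return [qi * scale for qi in q]

q, s = quantize_int4([0.10, -0.70, 0.35, 0.02])
restored = dequantize(q, s)
```

Each weight is stored in 4 bits plus one shared scale per group, which is where the roughly 4x memory saving over fp16 comes from.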

## Lightweight Pretraining & Distributed Training

Pair these with autonomous experiment frameworks — fast, small-scale training is the foundation for autonomous experimentation.

### Lightweight Pretraining

| Project | Description | Key Highlight |
|---|---|---|
| nanochat | Minimal LLM training harness (AutoResearch's engine) | Single GPU; tokenization → pretrain → finetune → eval → chat; GPT-2 for ~$48 |
| Nanotron | Minimal 3D-parallel LLM pretraining | Data + Tensor + Pipeline parallelism; scales from experiments to production |

### Distributed Training

| Project | Description | Key Highlight |
|---|---|---|
| TorchTitan | PyTorch-native large-scale training platform | Up to 4D parallelism without model code changes; MXFP8 on Blackwell; elastic scaling |
| Open-dLLM | First open-source full stack for diffusion LLMs | Raw data → training → checkpoints → evaluation → inference, all-in-one |

## Inference Engines (for RL Training Loops)

Inference engines are critical for RL training: rollout generation typically dominates RLHF wall-clock time (often cited at around 80%). Fast inference = fast training.

| Project | Description | Key Highlight |
|---|---|---|
| vLLM | Most mature open-source LLM serving engine | PagedAttention; 4x higher throughput on Blackwell; core engine for OpenRLHF |
| SGLang | High-performance serving for LLMs & multimodal | ~16,200 tok/sec on H100; RadixAttention; used by slime for RL training |
| TensorRT-LLM | NVIDIA's optimized inference library | FP8/FP4/INT4; EAGLE-3 speculative decoding; max GPU performance |
| LMDeploy | LLM compression, deployment & serving | TurboMind MXFP4; 1.5x vLLM performance; DeepSeek PD disaggregation |
| HuggingFace TGI | Multi-backend LLM serving (TensorRT-LLM, vLLM, llama.cpp) | Unified frontend; token streaming; HF Hub native; CPU/GPU/Inferentia support |
| NVIDIA Dynamo | Datacenter-scale distributed inference | 30x request throughput on DeepSeek-R1; disaggregated prefill/decode; Rust + Python |

## Multimodal Training Frameworks

Training models that understand text, images, video, and audio simultaneously.

| Project | Description | Key Highlight |
|---|---|---|
| LLaVA-OneVision-1.5 | Fully open-source multimodal training | Native-resolution images; SOTA performance; lower training costs |
| LLaVA-OneVision-1.5-RL | Democratized multimodal RL training | Open code, data, and models for multimodal RLHF |
| OpenRLHF-M | Multimodal model RLHF training | Extension of OpenRLHF for VLMs |
| LLaVA-KD | Multimodal knowledge distillation (ICCV 2025) | Distills large MLLMs into smaller ones |
| MoE-LLaVA | Mixture-of-Experts for vision-language models (TMM 2025) | Efficient multimodal MoE architecture |

## Experiment Tracking & Orchestration

| Project | Description | Key Highlight |
|---|---|---|
| Weights & Biases | Experiment tracking + sweeps + model registry | Industry standard; integrates with all major frameworks |
| MLflow 3.0 | Open-source experiment tracking + model serving | Self-hosted; nested experiments; model registry |
| ClearML | Open-source MLOps platform | 150K+ users at Fortune 500; auto-logging; pipeline orchestration; dataset versioning |
| HF Trackio | Lightweight experiment tracking in HF ecosystem | Deep integration with HF Skills; agents can read metrics and make decisions |

## Benchmarks & Evaluation

### ML Agent Benchmarks

| Benchmark | Description | Key Highlight |
|---|---|---|
| MLE-bench | 75 Kaggle ML engineering competition tasks | Evaluates AI agents on real ML engineering: training, data prep, experiments |
| MLAgentBench | 13 end-to-end ML experimentation tasks | Stanford SNAP; Claude v3 Opus best at 37.5% |
| PaperBench | Evaluates AI's ability to replicate ICML 2024 papers | 8,316 gradable tasks across 20 papers; best agent scores 21% |
| CORE-Bench | Computational Reproducibility Agent Benchmark | 270 tasks from 90 papers across CS, social science, medicine |
| MLRC-Bench | ML Research Competition challenges | Tests novel methodology development |
| AgentBench | Multi-dimensional benchmark for LLM agents | Tests across OS, database, knowledge graph, web, and game environments |
| SWE-bench Verified | Human-verified GitHub issue resolution | Industry standard for coding agents; top scores 70%+ |
| LiveBench | Monthly-updated contamination-free LLM benchmark | 6 categories (Math/Reasoning/Coding/Language/Data/IF); objective auto-scoring; no LLM judge needed |

### Model Evaluation Frameworks

| Tool | Description | Key Highlight |
|---|---|---|
| DeepEval | Pytest-like LLM evaluation framework | v3.0: 14+ metrics; multi-turn simulation; DeepTeam for red teaming |
| Opik | Open-source LLM observability & evaluation (Comet) | Deep tracing; LLM-as-a-judge; hallucination detection; production dashboards |
| LMMs-Eval | Multimodal evaluation across text, image, video, audio | v0.6: eval-as-a-service; 7.5x throughput; 50+ tasks |
| Arize Phoenix | Open-source LLM observability and evaluation | Fully self-hosted; tracing, evaluation, retrieval analysis |
| LiveCodeBench | Contamination-free coding benchmark | Fresh problems from LeetCode/AtCoder/Codeforces |

## Coding Agents (for Training Script Development)

These agents don't train models directly, but can write and debug training code, completing the automation loop when paired with HF Skills.

| Project | Description | Key Highlight |
|---|---|---|
| Aider | Terminal AI pair programming | Git integration; supports Claude/GPT/DeepSeek/local models |
| OpenHands | AI-driven software development (open-source Devin) | Autonomous code editing + execution + debugging; MIT license |
| SWE-agent | Autonomous GitHub issue fixer | SWE-bench open-source SOTA (NeurIPS 2024) |
| Open-SWE | LangChain's async cloud-hosted coding agent | Multi-agent (Planner + Reviewer); GitHub integration; auto PR creation |
| SERA | Ai2's open coding agent family | 54.2% on SWE-Bench; trains in 40 GPU-days (~$2K); all open |
| Cline | VS Code AI coding agent with 60K+ GitHub stars | MCP tool creation; 5M+ developers; human-in-the-loop approval; native subagents |
| OpenCode | Go-based terminal AI agent with 95K+ GitHub stars | Bubble Tea TUI; 75+ LLM providers; 6.5M monthly developers; SQLite persistence |
| Plandex | Terminal agent for large projects with 2M token context | Tree-sitter project maps; diff review sandbox; auto-debugging; 30+ languages |
| Roo Code | Terminal agent with 95K+ GitHub stars | 75+ LLM providers; plan-first development; 2.5M monthly developers |

## Recommended Stacks

### Most Complete Automation

**HuggingFace Skills + Claude Code + Unsloth + W&B**

Natural language → Claude Code orchestrates → HF Skills calls Unsloth for training → W&B tracks experiments.

### Lightest Autonomous Research

**AutoResearch + nanochat (single GPU)**

Start before bed, wake up to ~100 autonomous experiment results.

### Most Flexible Production Setup

**Axolotl / LlamaFactory + OpenRLHF + Optuna + MLflow**

YAML-configured training + automated HPO + full experiment tracking.

### Full RL Training Pipeline (2026 SOTA)

**verl / OpenRLHF + vLLM/SGLang + Reasoning Gym + W&B**

State-of-the-art RL training with fast inference engines and rich environments.

### Synthetic Data → Training → Eval

**Distilabel / Magpie → Unsloth / TRL → DeepEval / LMMs-Eval**

Generate data at scale → train efficiently → evaluate comprehensively.


## Trends (2026 Q2 Update)

1. **AutoResearch Paradigm**: Karpathy proved "AI autonomously doing ML research" works with just 630 lines of code — now spawning derivatives like ARIS and AI-Supervisor
2. **"Vibe Training"**: HF Skills enables natural-language-driven model training lifecycle
3. **GRPO Variants Proliferate**: f-GRPO (f-divergence family), Tree-GRPO (tree search, ICLR 2026), DAPO — GRPO is the new default, and specialized variants are emerging fast
4. **RL Framework Explosion**: verl, DAPO, AReaL, slime — every major lab now has an open-source RL training framework
5. **Self-Play Breakthrough**: Multi-agent self-evolution (SPIN, MAE, SPC) overcomes single-model self-training plateaus
6. **Synthetic Data as Infrastructure**: Distilabel, Magpie, Evidently make data generation a first-class pipeline stage; model collapse mitigation (Evol-Instruct) becoming standard
7. **MCP Standardization**: Model Context Protocol adopted by OpenAI/Google/Microsoft as the "USB-C for AI agents"
8. **Single-GPU Research**: Unsloth + nanochat + AutoResearch enables individual developers to do serious LLM research
9. **Inference-Training Convergence**: vLLM/SGLang/TGI are now core components of RL training loops, not just serving
10. **Multimodal RL**: LLaVA-OneVision-1.5-RL and OpenRLHF-M bring RL alignment to vision-language models
11. **Extreme Quantization**: Google TurboQuant achieves 6x KV cache compression at zero accuracy loss (ICLR 2026); NVIDIA Model Optimizer unifies quantization/pruning/distillation
12. **Multi-Agent Coding Wave**: Feb 2026 saw every major tool ship multi-agent capabilities (Grok Build, Windsurf, Claude Code, Codex CLI, Devin) — coding agents now routinely write training scripts

## Related Awesome Lists


## Contributing

Contributions are welcome! Please open an issue or submit a PR if you know of tools that fit this collection.

Criteria for inclusion:

- Must be directly usable for automated model training workflows
- Preference for open-source projects with active maintenance
- Focus on tools that leverage AI/LLMs to automate the training process itself

## License

This curated list is released under CC0 1.0.


*Compiled March 2026, updated April 2026. Project statuses may change — check individual GitHub repos for the latest.*
