A curated collection of tools, frameworks, and resources for AI-driven automated model training — letting AI agents autonomously run experiments, fine-tune models, optimize hyperparameters, and evolve themselves.
Inspired by Karpathy's AutoResearch, HuggingFace Skills, and the broader AutoML movement.
The paradigm is shifting: instead of manually tuning models, we now have tools that let AI agents design experiments, modify training code, evaluate results, and iterate autonomously — while you sleep.
This repository collects the best open-source tools and frameworks that make this possible across the full training lifecycle.
- Autonomous Experiment / Research Frameworks
- Agent-Driven Training Skills (HuggingFace Ecosystem)
- LLM Fine-Tuning Frameworks
- RL Alignment Training Frameworks (RLHF / GRPO)
- Automated Hyperparameter Optimization / AutoML
- Self-Evolving / Self-Play Training
- Synthetic Data Generation & Curation
- Knowledge Distillation
- Model Merging & Quantization
- Lightweight Pretraining & Distributed Training
- Inference Engines (for RL Training Loops)
- Multimodal Training Frameworks
- Experiment Tracking & Orchestration
- Benchmarks & Evaluation
- Coding Agents (for Training Script Development)
- Recommended Stacks
Core idea: AI agents autonomously design experiments, modify training code, evaluate results, and iterate. You sleep, AI experiments.
| Project | Description | Key Highlight |
|---|---|---|
| AutoResearch | AI agent runs autonomous ML experiments in a loop | 630 lines of Python, ~100 experiments overnight, 11% efficiency gain on GPT-2 training |
| AI Scientist v2 | Fully automated scientific discovery with agentic tree search | Hypothesis → Experiment → Paper, no human templates needed |
| AutoML-Agent | Multi-agent LLM framework for full-pipeline AutoML (ICML 2025) | Parallel specialized agents for preprocessing, architecture design, HPO; retrieval-augmented planning |
| auto-ml-agent | LLM-orchestrated autonomous ML pipeline | End-to-end: data preprocessing → model deployment, multi-agent architecture |
| MLAgentBench | Benchmark for evaluating AI agents on ML experimentation | 13 end-to-end ML tasks from CIFAR-10 to BabyLM |
| AutoAgent | Zero-code LLM agent framework with self-play customization | Create agents via natural language, iterative self-improvement |
| ShinkaEvolve | LLM-as-mutation-operator program evolution framework | Evolves programs for scientific discovery |
| AI-Supervisor | Autonomous research supervision via persistent Research World Model | Multi-agent consensus + Knowledge Graph; validates claims via GPU computation; self-correcting updates |
| ARIS | Lightweight Markdown-only skills for autonomous ML research overnight | Zero dependencies; cross-model review loops; 20+ GPU experiments per overnight run; works with any LLM agent |
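The loop these frameworks automate can be sketched in a few lines. The example below is a toy illustration of the AutoResearch-style pattern (propose a config mutation, run an experiment, keep improvements, iterate); the objective is a cheap surrogate standing in for a real training run, and all names are illustrative, not any framework's actual API.

```python
import random

def run_experiment(config):
    # Stand-in for a real training run; returns a fake "validation loss".
    lr, width = config["lr"], config["width"]
    return (lr - 3e-4) ** 2 * 1e6 + (width - 512) ** 2 * 1e-5

def propose(best_config, rng):
    # The agent's "mutation" step: perturb the current best configuration.
    return {
        "lr": best_config["lr"] * rng.choice([0.5, 1.0, 2.0]),
        "width": max(64, best_config["width"] + rng.choice([-128, 0, 128])),
    }

def overnight_run(n_experiments=100, seed=0):
    rng = random.Random(seed)
    best = {"lr": 1e-3, "width": 256}
    best_loss = run_experiment(best)
    for _ in range(n_experiments):
        candidate = propose(best, rng)
        candidate_loss = run_experiment(candidate)
        if candidate_loss < best_loss:  # keep only improvements
            best, best_loss = candidate, candidate_loss
    return best, best_loss

best, loss = overnight_run()
print(best, loss)
```

Real frameworks replace `propose` with an LLM that reads past results and edits training code, and `run_experiment` with an actual GPU job, but the accept/iterate skeleton is the same.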
"Vibe Training" — use natural language to drive the full model training lifecycle through coding agents.
| Project | Description | Key Highlight |
|---|---|---|
| HuggingFace Skills | Standardized ML skill packages for coding agents | 12 skills: model training (SFT/DPO/GRPO), vision training, experiment tracking, evaluation, dataset management |
| HuggingFace AutoTrain | No-code training platform | Upload data → auto model selection → training → evaluation → Hub publishing |
HF Skills covers:

- hugging-face-model-trainer — Fine-tune LLMs with TRL (SFT, DPO, GRPO), 0.5B to 70B parameters
- hugging-face-vision-trainer — Train object detection & image classification (RTDETRv2, YOLOS, ViT)
- hugging-face-jobs — Run compute jobs on HF infrastructure with cost estimation
- hugging-face-trackio — ML experiment tracking with real-time metrics
- hugging-face-evaluation — Model evaluation with lighteval
- hugging-face-datasets — Dataset creation and management
- Compatible with: Claude Code, OpenAI Codex, Google Gemini CLI, Cursor

The training engines. Upper-level agents (AutoResearch, HF Skills) ultimately call these frameworks to execute training.
| Project | Description | Key Highlight |
|---|---|---|
| Unsloth | Ultra-efficient LLM fine-tuning & RL | 2x faster, 70% less VRAM; custom CUDA kernels; MoE 12x faster; MCP Server available |
| Axolotl | Flexible, production-ready fine-tuning | YAML-driven; v0.8.x: QAT, sequence parallelism, GRPO, full RLHF pipeline |
| LlamaFactory | Unified fine-tuning with Web UI | LlamaBoard browser UI; 100+ models; SFT/RLHF/DPO/PPO |
| TRL | HuggingFace's RL training library | SFT, DPO, GRPO, PPO, KTO, ORPO; deep Transformers/PEFT integration |
| torchtune | PyTorch-native fine-tuning | No extra abstractions; multi-node support (Feb 2025) |
| NeMo AutoModel | NVIDIA's DTensor-native training library | Day-0 HuggingFace support; single-to-multi-node scaling |
| LMFlow | Extensible toolkit for fine-tuning large foundation models | LISA memory-efficient training (outperforms LoRA); FlashAttention; NAACL Best Demo Paper |
| H2O LLM Studio | No-code GUI framework for fine-tuning LLMs | Browser-based UI; LoRA/4-bit/8-bit; DPO/IPO/KTO; W&B integration |
| LitGPT | 20+ high-performance LLMs with pretrain/finetune/deploy recipes | CLI-driven; powered TinyLlama project; NeurIPS 2023 LLM Efficiency Challenge |
| InstructLab | IBM/Red Hat collaborative LLM customization via synthetic data | LAB alignment method; taxonomy-driven skill contributions; targets Granite models |
2025-2026 trend: GRPO (Group Relative Policy Optimization) is replacing PPO as the default alignment method — no critic model needed, making it simpler and more stable.
| Project | Description | Key Highlight |
|---|---|---|
| OpenRLHF | High-performance RLHF framework on Ray + vLLM | 70B+ full tuning; PPO/DAPO/REINFORCE++; async agent RLHF |
| verl | ByteDance's Volcano Engine RL for LLMs | GRPO/PPO in few lines; 3D-HybridEngine; used by ByteDance, Alibaba Qwen, UC Berkeley, LMSys |
| DAPO | Open-source RL system from ByteDance Seed + Tsinghua | 50 pts on AIME 2024 with Qwen2.5-32B; 4 key stability techniques; built on verl |
| AReaL | Fully asynchronous RL for LLM reasoning (Ant Group + Tsinghua) | 2.77x speedup vs synchronous; GSPO algorithm; Ascend NPU support |
| slime | LLM post-training framework for RL scaling (GLM team) | Powers GLM-4.5/4.6/4.7/5; Megatron + SGLang; RLVE (400 verifiable environments) |
| NeMo RL | NVIDIA's scalable post-training RL library | GRPO, SFT, DPO, DAPO; Ray-based; Megatron Core parallelism |
| NeMo Gym | Build RL environments for LLM training | Multi-step/multi-turn environments; interoperable with NeMo RL, OpenRLHF, TRL, Unsloth |
| rLLM | Post-training RL framework for language agents | Custom agents + environments → RL training → deployment; rLLM-FinQA-4B beats Qwen3-235B |
| RAGEN | Multi-turn RL framework for training reasoning agents | StarPO framework; 10 built-in environments; identifies "Echo Trap" instability |
| f-GRPO | f-Divergence based GRPO for general LLM alignment | KL/Reverse KL/Pearson/Hellinger/JS divergences; superior on both RLVR (math) and safety alignment; built on Unsloth |
| Tree-GRPO | Tree search for LLM agent RL (ICLR 2026) | 4x less rollout budget via shared prefixes; step-wise process supervision from outcome reward; tree-structured ReAct |
| SimpleRL-Reason | Simple RL recipe for reasoning (HKUST) | DeepSeek-R1-style; 7B achieves 33.3% AIME with only 8K examples; no SFT needed |
| SWE-RL | Meta's RL for software engineering reasoning | Llama3-SWE-RL-70B achieves 41% on SWE-bench Verified (NeurIPS 2025) |
| OpenManus-RL | RL tuning for LLM agents (UIUC + MetaGPT) | PPO-based; AgentGym environments + verl training |
| LlamaGym | Online RL fine-tuning for LLM agents | Define agent → create LLM → write RL loop |
| Reasoning Gym | Procedural reasoning environments for RLVR | 100+ tasks; NeurIPS 2025 Spotlight; unlimited controllable task generation |
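The reason GRPO needs no critic is that it estimates advantages relative to a group of rollouts sampled from the same prompt, rather than from a learned value network. A minimal sketch of the group-relative advantage computation (standard formulation; exact details vary by implementation):

```python
import math

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize each rollout's reward by the
    group mean and standard deviation. Because the baseline comes from
    the group itself, no separate critic model is required."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one prompt, scored by a verifier (1 = correct):
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct answers get positive advantage, incorrect ones negative, and the group sums to roughly zero — which is why verifiable-reward environments (math, code) pair so naturally with GRPO-family trainers.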
| Project | Description | Key Highlight |
|---|---|---|
| AgentHPO | LLM-driven hyperparameter optimization | Matches/surpasses human best trials on 12 ML tasks with explainable results |
| AutoML-Agent | Multi-agent LLM framework for full-pipeline AutoML (ICML 2025) | Parallel specialized agents; retrieval-augmented planning; 14 datasets tested |
| Optuna | Industry-standard HPO framework | Bayesian search, pruning, distributed execution, visualization dashboard |
| Microsoft NNI | Full AutoML toolkit | Neural Architecture Search + HPO + model compression + feature engineering |
| W&B Sweeps | Automated hyperparameter search + tracking | Bayesian/Grid/Random search; Hyperband early stopping; cross-machine parallelism |
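Under the hood, these tools combine a search strategy with early termination of unpromising trials. The toy sketch below shows random search with a crude pruning rule in that spirit; it uses a surrogate objective instead of real training, and Optuna, NNI, and W&B Sweeps all do this far more robustly (Bayesian sampling, Hyperband, distributed workers).

```python
import random

def objective(lr, dropout, step):
    # Surrogate "validation loss" that improves as training steps accrue.
    base = (lr - 1e-3) ** 2 * 1e5 + (dropout - 0.1) ** 2
    return base + 1.0 / (step + 1)

def search(n_trials=30, max_steps=10, seed=0):
    rng = random.Random(seed)
    best_loss, best_params = float("inf"), None
    for _ in range(n_trials):
        lr = 10 ** rng.uniform(-5, -1)     # log-uniform learning rate
        dropout = rng.uniform(0.0, 0.5)
        for step in range(max_steps):
            loss = objective(lr, dropout, step)
            if loss > 2 * best_loss:       # prune hopeless trials early
                break
        else:
            if loss < best_loss:           # trial ran to completion
                best_loss, best_params = loss, (lr, dropout)
    return best_loss, best_params

loss, params = search()
print(loss, params)
```

The LLM-driven entries above (AgentHPO, AutoML-Agent) replace the random sampler with an agent that reads past trial results and proposes the next configuration with an explanation.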
Core idea: Models generate their own training data to train themselves, reducing dependence on human annotations.
| Project | Description | Key Highlight |
|---|---|---|
| SPIN | Self-Play Fine-Tuning | Model plays against its previous iterations; outperforms DPO + GPT-4 preference data without extra annotations |
| SPPO | Self-Play Preference Optimization | Iterative policy updates approximating Nash equilibrium with convergence guarantees |
| SPC (Self-Play Critic) | Adversarial self-play for evolving reasoning critics | "Sneaky generator" vs "critic" game; eliminates manual step-level annotation |
| SPELL | Self-Play RL for Evolving Long-Context Language Models | Label-free self-play; base model surpasses instruction-tuned counterpart on long-context tasks |
| Multi-Agent Evolve | One LLM plays Proposer + Solver + Judge roles | Verified improvements on math, coding, reasoning with Qwen2.5-3B |
| Multiagent Finetuning | Multi-agent society from same base model | Multi-agent iteration keeps improving where single-model self-training plateaus |
| CORY | Cooperative multi-agent RL fine-tuning | Pioneer + Observer dual-agent paradigm (NeurIPS 2024) |
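SPIN makes the self-play idea concrete with a DPO-style objective: the current model is trained to assign higher likelihood to human data than to its own previous iteration's generations. A hedged sketch of that per-example loss, computed from sequence log-probabilities (variable names here are illustrative; see the SPIN paper for the exact formulation):

```python
import math

def spin_loss(logp_real_cur, logp_real_prev, logp_syn_cur, logp_syn_prev, beta=0.1):
    """SPIN-style logistic loss: reward the current model for raising its
    likelihood on the human response (relative to the previous iteration)
    while lowering it on the previous iteration's own generation."""
    margin = beta * ((logp_real_cur - logp_real_prev)
                     - (logp_syn_cur - logp_syn_prev))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Current model already prefers the human answer -> loss below log(2):
print(spin_loss(-10.0, -12.0, -15.0, -11.0))
```

Each round, the "opponent" is frozen at the previous checkpoint and fresh synthetic responses are regenerated, which is what lets the model keep improving without new human annotations.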
Critical for automated training pipelines: generate high-quality training data at scale without manual annotation.
| Project | Description | Key Highlight |
|---|---|---|
| Distilabel | Framework for synthetic data and AI feedback pipelines | Modular pipeline; SFT/DPO/UltraFeedback techniques; any LLM provider |
| Magpie | Alignment data synthesis from scratch (ICLR 2025) | No prompt engineering needed; 4M instructions generated; matches Llama-3 Instruct |
| DataDreamer | Reproducible synthetic data generation (ACL 2024) | Multi-step prompting; generate/align/fine-tune/distill; built-in caching |
| Cosmopedia | Large-scale synthetic pretraining data pipeline | 25B tokens of synthetic textbooks/blogs; uses Mixtral-8x7B |
| InstructLab SDG | Synthetic data via LAB methodology (IBM/Red Hat) | Skills-SDG + Knowledge-SDG; minimal seed taxonomy → large-scale data |
| Persona Hub | Persona-driven synthetic data at billion scale (Tencent) | 1B diverse personas; 370M elite personas released |
| synth_gen | Execution-verified synthetic data (Meta) | Modular verifier system; parser-based verification for code |
| Evidently | Open-source synthetic data generation with user profiles | Model-agnostic; customizable personas & goals; no-code UI in Evidently Cloud; outputs to pandas DataFrame |
| NVIDIA Nemotron-4 340B | Open models for synthetic data generation pipeline | Base + Instruct + Reward models; commercial use allowed |
| Project | Description | Key Highlight |
|---|---|---|
| NeMo Curator | GPU-accelerated data preprocessing & curation | 30+ filters; fuzzy dedup 1.1T tokens in 1.8h on 64 A100s; 16x faster |
| DataTrove | Platform-agnostic data processing pipeline | Used for FineWeb and Cosmopedia; low memory; Slurm support |
| Dolma | High-performance dataset curation toolkit (AllenAI) | Built-in parallelism for billions of docs; used for OLMo (3T tokens) |
| Data Prep Kit | Unstructured data preparation (IBM) | Python/Ray/Spark runtimes; laptop to datacenter scaling |
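A core operation in these toolkits is fuzzy deduplication: dropping documents that are near-copies rather than exact matches. The brute-force sketch below compares Jaccard similarity over word shingles; production systems (NeMo Curator, Dolma, DataTrove) use MinHash/LSH to make this tractable at trillion-token scale, so treat this as an illustration of the idea only.

```python
def shingles(text, n=3):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def dedup(docs, threshold=0.7):
    """Keep a document only if it is not too similar to any kept one."""
    kept = []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, shingles(k)) < threshold for k in kept):
            kept.append(doc)
    return kept

docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy cat",   # near-duplicate
    "training data quality matters more than size",
]
kept_docs = dedup(docs)
print(kept_docs)
```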
Compress large models into smaller, deployable ones while preserving capabilities.
| Project | Description | Key Highlight |
|---|---|---|
| EasyDistill | Comprehensive distillation toolkit (Alibaba/ModelScope, EMNLP 2025) | Black-box + white-box KD; data synthesis + SFT + logits distillation + RL |
| DistillKit | Production-ready LLM distillation (Arcee AI) | Online and offline workflows; powers Arcee Virtuoso, SuperNova models |
| MiniPLM | Knowledge distillation for pre-training (Tsinghua, ICLR 2025) | Improved DPKD variant |
| DistiLLM | Streamlined distillation with contrastive approach (ICML 2024) | DistiLLM-2 contrastive distillation |
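The white-box ("logits") distillation these toolkits implement traces back to a simple loss: match the student's output distribution to the teacher's, softened by a temperature. A minimal sketch of that classic term (Hinton et al.); real recipes mix it with hard-label cross-entropy and, in the frameworks above, with data synthesis and RL stages:

```python
import math

def softmax(logits, T=1.0):
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, T=2.0):
    """Temperature-softened KL(teacher || student), scaled by T^2 so
    gradients keep comparable magnitude across temperatures."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return T * T * kl

# A student matching the teacher exactly drives the loss to zero:
print(distill_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0]))
```

The temperature is the interesting knob: higher T exposes the teacher's relative preferences among wrong answers ("dark knowledge"), which is much of what the student learns from.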
Combine multiple models or compress them for efficient deployment and training.
| Project | Description | Key Highlight |
|---|---|---|
| MergeKit | Leading toolkit for merging pretrained LLMs | SLERP, TIES, DARE, Passthrough, Evolutionary merge; works on CPU with 8GB VRAM |
| MergeLM | Language model merging codebase (ICML 2024) | Research-grade implementations |
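SLERP, the workhorse merge method in MergeKit, interpolates along the arc between two weight vectors instead of the straight line, which tends to preserve weight norms better than plain averaging. A sketch of the primitive on plain Python lists (real merges operate tensor-by-tensor, often with per-layer interpolation factors):

```python
import math

def slerp(a, b, t=0.5):
    """Spherical linear interpolation between two weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    cos = max(-1.0, min(1.0, dot / (na * nb)))
    theta = math.acos(cos)
    if theta < 1e-6:  # nearly parallel: fall back to linear interpolation
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    s0 = math.sin((1 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return [s0 * x + s1 * y for x, y in zip(a, b)]

# Midpoint between two orthogonal unit vectors stays on the unit sphere:
print(slerp([1.0, 0.0], [0.0, 1.0], 0.5))
```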
| Project | Description | Key Highlight |
|---|---|---|
| GPTQModel | Production-ready LLM quantization toolkit | GPTQ, AWQ, QQQ, GPTAQ, EoRA, GAR; multi-backend CPU/GPU |
| AutoGPTQ | Easy-to-use GPTQ quantization | 8/4/3/2-bit; Marlin int4*fp16 kernel; ~150-200K monthly PyPI downloads |
| AutoRound | Advanced quantization via sign-gradient descent (Intel) | High accuracy at 2-4 bits; exports to GPTQ/AWQ/GGUF; broad HW compatibility |
| NVIDIA Model Optimizer | Unified quantization, pruning, distillation & speculative decoding | FP8/INT8/INT4; exports to TensorRT-LLM/vLLM; NeMo Megatron integration |
| TurboQuant | Google's KV cache compression (ICLR 2026) | 6x memory reduction at 3-bit with zero accuracy loss; PolarQuant + QJL; 8x perf on H100 |
| llama.cpp | LLM inference in C/C++ with GGUF quantization | Q4_K_M sweet spot: 92% quality, 75% size reduction; runs everywhere |
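At its simplest, weight quantization maps floats to a small integer grid plus a scale factor. The toy below does symmetric per-tensor int4 (values in [-8, 7]) to show the round trip and its error; the toolkits above (GPTQ, AWQ, AutoRound) get far better accuracy by using calibration data, per-group scales, and error-compensating updates.

```python
def quantize_int4(weights):
    """Toy symmetric int4 quantization: one scale for the whole tensor."""
    scale = max(abs(w) for w in weights) / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

w = [0.12, -0.53, 0.97, -1.4, 0.08]
q, scale = quantize_int4(w)
w_hat = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
print(q, round(max_err, 4))
```

The worst-case round-trip error is about half the scale step, which is why per-group scales (smaller groups, smaller steps) are the first trick every serious quantizer applies.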
Pair these with autonomous experiment frameworks — fast, small-scale training is the foundation for autonomous experimentation.
| Project | Description | Key Highlight |
|---|---|---|
| nanochat | Minimal LLM training harness (AutoResearch's engine) | Single GPU; tokenization → pretrain → finetune → eval → chat; GPT-2 for ~$48 |
| Nanotron | Minimal 3D-parallel LLM pretraining | Data + Tensor + Pipeline parallelism; scales from experiments to production |
| Project | Description | Key Highlight |
|---|---|---|
| TorchTitan | PyTorch-native large-scale training platform | Up to 4D parallelism without model code changes; MXFP8 on Blackwell; elastic scaling |
| Open-dLLM | First open-source full stack for diffusion LLMs | Raw data → training → checkpoints → evaluation → inference, all-in-one |
Inference engines are critical for RL training — 80% of RLHF training time is spent on sample generation. Fast inference = fast training.
| Project | Description | Key Highlight |
|---|---|---|
| vLLM | Most mature open-source LLM serving engine | PagedAttention; 4x higher throughput on Blackwell; core engine for OpenRLHF |
| SGLang | High-performance serving for LLMs & multimodal | ~16,200 tok/sec on H100; RadixAttention; used by slime for RL training |
| TensorRT-LLM | NVIDIA's optimized inference library | FP8/FP4/INT4; EAGLE-3 speculative decoding; max GPU performance |
| LMDeploy | LLM compression, deployment & serving | TurboMind MXFP4; 1.5x vLLM performance; DeepSeek PD disaggregation |
| HuggingFace TGI | Multi-backend LLM serving (TensorRT-LLM, vLLM, llama.cpp) | Unified frontend; token streaming; HF Hub native; CPU/GPU/Inferentia support |
| NVIDIA Dynamo | Datacenter-scale distributed inference | 30x request throughput on DeepSeek-R1; disaggregated prefill/decode; Rust + Python |
Training models that understand text, images, video, and audio simultaneously.
| Project | Description | Key Highlight |
|---|---|---|
| LLaVA-OneVision-1.5 | Fully open-source multimodal training | Native-resolution images; SOTA performance; lower training costs |
| LLaVA-OneVision-1.5-RL | Democratized multimodal RL training | Open code, data, and models for multimodal RLHF |
| OpenRLHF-M | Multimodal model RLHF training | Extension of OpenRLHF for VLMs |
| LLaVA-KD | Multimodal knowledge distillation (ICCV 2025) | Distills large MLLMs into smaller ones |
| MoE-LLaVA | Mixture-of-Experts for vision-language models (TMM 2025) | Efficient multimodal MoE architecture |
| Project | Description | Key Highlight |
|---|---|---|
| Weights & Biases | Experiment tracking + sweeps + model registry | Industry standard; integrates with all major frameworks |
| MLflow 3.0 | Open-source experiment tracking + model serving | Self-hosted; nested experiments; model registry |
| ClearML | Open-source MLOps platform | 150K+ users at Fortune 500; auto-logging; pipeline orchestration; dataset versioning |
| HF Trackio | Lightweight experiment tracking in HF ecosystem | Deep integration with HF Skills; agents can read metrics and make decisions |
| Benchmark | Description | Key Highlight |
|---|---|---|
| MLE-bench | 75 Kaggle ML engineering competition tasks | Evaluates AI agents on real ML engineering: training, data prep, experiments |
| MLAgentBench | 13 end-to-end ML experimentation tasks | Stanford SNAP; Claude v3 Opus best at 37.5% |
| PaperBench | Evaluates AI's ability to replicate ICML 2024 papers | 8,316 gradable tasks across 20 papers; best agent scores 21% |
| CORE-Bench | Computational Reproducibility Agent Benchmark | 270 tasks from 90 papers across CS, social science, medicine |
| MLRC-Bench | ML Research Competition challenges | Tests novel methodology development |
| AgentBench | Multi-dimensional benchmark for LLM agents | Tests across OS, database, knowledge graph, web, and game environments |
| SWE-bench Verified | Human-verified GitHub issue resolution | Industry standard for coding agents; top scores 70%+ |
| LiveBench | Monthly-updated contamination-free LLM benchmark | 6 categories (Math/Reasoning/Coding/Language/Data/IF); objective auto-scoring; no LLM judge needed |
| Tool | Description | Key Highlight |
|---|---|---|
| DeepEval | Pytest-like LLM evaluation framework | v3.0: 14+ metrics; multi-turn simulation; DeepTeam for red teaming |
| Opik | Open-source LLM observability & evaluation (Comet) | Deep tracing; LLM-as-a-judge; hallucination detection; production dashboards |
| LMMs-Eval | Multimodal evaluation across text, image, video, audio | v0.6: eval-as-a-service; 7.5x throughput; 50+ tasks |
| Arize Phoenix | Open-source LLM observability and evaluation | Fully self-hosted; tracing, evaluation, retrieval analysis |
| LiveCodeBench | Contamination-free coding benchmark | Fresh problems from LeetCode/AtCoder/Codeforces |
These agents don't train models directly, but can write and debug training code, completing the automation loop when paired with HF Skills.
| Project | Description | Key Highlight |
|---|---|---|
| Aider | Terminal AI pair programming | Git integration; supports Claude/GPT/DeepSeek/local models |
| OpenHands | AI-driven software development (open-source Devin) | Autonomous code editing + execution + debugging; MIT license |
| SWE-agent | Autonomous GitHub issue fixer | SWE-bench open-source SOTA (NeurIPS 2024) |
| Open-SWE | LangChain's async cloud-hosted coding agent | Multi-agent (Planner + Reviewer); GitHub integration; auto PR creation |
| SERA | Ai2's open coding agent family | 54.2% on SWE-bench; trains in 40 GPU-days (~$2K); all open |
| Cline | VS Code AI coding agent with 60K+ GitHub stars | MCP tool creation; 5M+ developers; human-in-the-loop approval; native subagents |
| OpenCode | Go-based terminal AI agent with 95K+ GitHub stars | Bubble Tea TUI; 75+ LLM providers; 6.5M monthly developers; SQLite persistence |
| Plandex | Terminal agent for large projects with 2M token context | Tree-sitter project maps; diff review sandbox; auto-debugging; 30+ languages |
| Roo Code | Terminal agent with 95K+ GitHub stars | 75+ LLM providers; plan-first development; 2.5M monthly developers |
HuggingFace Skills + Claude Code + Unsloth + W&B
Natural language → Claude Code orchestrates → HF Skills calls Unsloth for training → W&B tracks experiments.
AutoResearch + nanochat (single GPU)
Start before bed, wake up to ~100 autonomous experiment results.
Axolotl / LlamaFactory + OpenRLHF + Optuna + MLflow
YAML-configured training + automated HPO + full experiment tracking.
verl / OpenRLHF + vLLM/SGLang + Reasoning Gym + W&B
State-of-the-art RL training with fast inference engines and rich environments.
Distilabel / Magpie → Unsloth / TRL → DeepEval / LMMs-Eval
Generate data at scale → train efficiently → evaluate comprehensively.
- AutoResearch Paradigm: Karpathy proved "AI autonomously doing ML research" works with just 630 lines of code — now spawning derivatives like ARIS and AI-Supervisor
- "Vibe Training": HF Skills enables natural-language-driven model training lifecycle
- GRPO Variants Proliferate: f-GRPO (f-divergence family), Tree-GRPO (tree search, ICLR 2026), DAPO — GRPO is the new default, and specialized variants are emerging fast
- RL Framework Explosion: verl, DAPO, AReaL, slime — every major lab now has an open-source RL training framework
- Self-Play Breakthrough: Multi-agent self-evolution (SPIN, MAE, SPC) overcomes single-model self-training plateaus
- Synthetic Data as Infrastructure: Distilabel, Magpie, Evidently make data generation a first-class pipeline stage; model collapse mitigation (Evol-Instruct) becoming standard
- MCP Standardization: Model Context Protocol adopted by OpenAI/Google/Microsoft as the "USB-C for AI agents"
- Single-GPU Research: Unsloth + nanochat + AutoResearch enables individual developers to do serious LLM research
- Inference-Training Convergence: vLLM/SGLang/TGI are now core components of RL training loops, not just serving
- Multimodal RL: LLaVA-OneVision-1.5-RL and OpenRLHF-M bring RL alignment to vision-language models
- Extreme Quantization: Google TurboQuant achieves 6x KV cache compression at zero accuracy loss (ICLR 2026); NVIDIA Model Optimizer unifies quantization/pruning/distillation
- Multi-Agent Coding Wave: Feb 2026 saw every major tool ship multi-agent capabilities (Grok Build, Windsurf, Claude Code, Codex CLI, Devin) — coding agents now routinely write training scripts
- Awesome LLM Synthetic Data
- Awesome Knowledge Distillation of LLMs
- Awesome Model Merging
- Awesome LLM Quantization
- Awesome LLM Inference Engine
- LLM Datasets
- LLM Distillation Playbook
Contributions are welcome! Please open an issue or submit a PR if you know of tools that fit this collection.
Criteria for inclusion:
- Must be directly usable for automated model training workflows
- Preference for open-source projects with active maintenance
- Focus on tools that leverage AI/LLMs to automate the training process itself
This curated list is released under CC0 1.0.
Compiled March 2026, updated April 2026. Project statuses may change — check individual GitHub repos for the latest.