A playbook for speeding up deep learning training: measure bottlenecks, apply engineering + training-policy fixes, and verify results.
TrainFlashAgent is a set of Skills (Markdown playbooks) for improving deep learning training performance. It is designed to be read and executed by an AI coding assistant inside your editor.
- Core workflow: diagnose with low-overhead timing → optimize engineering + training policy → verify performance and fidelity → write back with approval.
- Primary targets: input pipeline stalls, long-tail batches, padding waste/shape instability, sync points, logging/eval/checkpoint overhead, optimizer-step policy.
- Optional accelerators: AMP / torch.compile / TF32 can be layered on later, but the skills here aim to find and fix the underlying system-level bottlenecks first.
- Evidence first: prove the bottleneck before tuning.
- Top-down workflow: macro timing → engineering/training-policy fixes → (profiler only if needed).
- Safety by default: work in a sandbox; write back only after verification and explicit approval.
- Reviewable changes: plan with files, change one variable per experiment, keep a tuning log.
- Copy instructions into the training project you want to optimize:
# Cursor users
cp .cursorrules /path/to/your/project/
# GitHub Copilot users
mkdir -p /path/to/your/project/.github
cp .github/copilot-instructions.md /path/to/your/project/.github/
# Claude Code users
cp CLAUDE.md /path/to/your/project/
- Ask your AI assistant to follow the skills:
“Speed up training in this repo. Follow TrainFlashAgent skills strictly and report measurable results.”
Expected flow: sandbox → diagnose → optimize → verify → (optional) write back.
TrainFlashAgent now includes a standalone TrainFlash MCP package at tools/trainflash_mcp.
It is intended for low-overhead training diagnosis before any heavy profiler run, and can combine:
- GPU telemetry from NVML (gpu_util, memory utilization, PCIe RX/TX throughput, power, temperature)
- optional host telemetry from psutil (CPU / memory / disk / network)
- phase timing events for Data, H2D, Fwd, Bwd, Opt, Eval, Ckpt, Log
- aggregated diagnosis summaries and bottleneck hints
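To make the idea of phase timing concrete, here is a minimal sketch of a host-side timer that accumulates wall-clock time per phase and reports the dominant one. `PhaseTimer` and its methods are hypothetical names used for illustration only; they are not the trainflash_mcp API. The `time.sleep` calls stand in for real work. On a GPU, kernel launches are asynchronous, so host-side timings like these attribute time accurately only where the loop already synchronizes.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock time per named phase (hypothetical helper)."""

    def __init__(self):
        self.totals = defaultdict(float)  # seconds spent in each phase
        self.counts = defaultdict(int)    # number of times each phase ran

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def shares(self):
        """Return each phase's fraction of the total measured time."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: t / total for name, t in self.totals.items()}

    def dominant(self):
        """Return the phase with the largest accumulated time, or None."""
        if not self.totals:
            return None
        return max(self.totals, key=self.totals.get)


timer = PhaseTimer()
for _ in range(3):
    with timer.phase("Data"):
        time.sleep(0.03)   # stand-in for fetching a batch
    with timer.phase("Fwd"):
        time.sleep(0.001)  # stand-in for the forward pass

print(timer.dominant(), timer.shares())
```

Two `perf_counter` calls per phase cost very little, which is why this kind of measurement can stay enabled for a whole run before deciding whether a heavier profiler is needed.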
Typical setup:
cd tools/trainflash_mcp
pip install -e ".[mcp,host,test]"
python -m trainflash_mcp
Hermes host config example:
mcp_servers:
  trainflash:
    command: "python"
    args: ["-m", "trainflash_mcp"]
    cwd: "/absolute/path/to/TrainFlashAgent/tools/trainflash_mcp"
    timeout: 180
    connect_timeout: 60
The MCP server exposes:
- get_trainflash_capabilities
- get_trainflash_system_snapshot
- start_trainflash_session
- record_trainflash_phase_event
- ingest_trainflash_phase_trace
- get_trainflash_summary
- stop_trainflash_session
- 01-sandbox.md: isolate a sandbox workspace before any edits
- 02-diagnose.md: identify the dominant bottleneck with low-overhead timing or MCP-driven phase diagnosis
- 03-optimize.md: apply engineering + training-strategy optimizations (effective batch, sampler/bucketing, loop overheads)
- 04-verify.md: verify performance + fidelity (and convergence sanity when needed), then write back safely
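The sampler/bucketing item in 03-optimize.md targets padding waste. As a rough illustration, the sketch below measures the fraction of padded token slots when each batch is padded to its longest sequence, first in arrival order and then after sorting by length. The function name and the sequence lengths are made up for this example.

```python
def padding_waste(lengths, batch_size):
    """Fraction of token slots that are padding when batching in the given order.

    Each batch is padded to the length of its longest sequence.
    """
    real = padded = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        real += sum(batch)
        padded += max(batch) * len(batch)
    return 1.0 - real / padded if padded else 0.0


lengths = [12, 500, 30, 480, 25, 510, 18, 495]  # made-up token counts
arrival_waste = padding_waste(lengths, batch_size=4)
sorted_waste = padding_waste(sorted(lengths), batch_size=4)
print(f"arrival order: {arrival_waste:.1%}, length-sorted: {sorted_waste:.1%}")
```

A full sort is used here only to show the effect. Practical bucketing samplers usually group sequences into length ranges and shuffle within and across buckets, so that batch composition still varies between epochs.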
- Performance: measurable improvement on the target workload (report mean and tail, not just one number).
- Fidelity: within the agreed tolerance (default: mean loss delta < 1e-3, unless the project defines a better metric).
- Safety: no writeback to the original project directory without explicit user approval.
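The performance and fidelity criteria above can be checked with a few lines of standard-library Python. This is a sketch under stated assumptions: "tail" is taken to be the 95th-percentile step time, and "mean loss delta" is read as the absolute difference between the mean losses of the baseline and optimized runs over the same steps. The function names and sample numbers are illustrative.

```python
import math
import statistics


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def verify(baseline_ms, optimized_ms, baseline_loss, optimized_loss, tol=1e-3):
    """Compare step times (mean and p95) and check the mean loss delta."""
    report = {
        "mean_speedup": statistics.mean(baseline_ms) / statistics.mean(optimized_ms),
        "p95_speedup": percentile(baseline_ms, 95) / percentile(optimized_ms, 95),
        # Assumption: delta between run-level mean losses, not per-step deltas.
        "loss_delta": abs(statistics.mean(baseline_loss) - statistics.mean(optimized_loss)),
    }
    report["fidelity_ok"] = report["loss_delta"] < tol
    return report


report = verify(
    baseline_ms=[100, 102, 98, 101, 180],   # made-up step times with one long-tail step
    optimized_ms=[80, 82, 79, 81, 95],
    baseline_loss=[2.3101, 2.1052, 1.9870],
    optimized_loss=[2.3103, 2.1049, 1.9874],
)
print(report)
```

Reporting the p95 alongside the mean matters here because a single long-tail step (180 ms in the made-up baseline) shifts the two numbers very differently.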
Training performance issues are usually systems issues:
- input pipeline stalls and long-tail batches
- padding waste / shape instability
- sync points and logging overhead
- optimizer-step policy (micro-batch vs accumulation vs world size)
These require measurement, iteration discipline, and careful verification, which is exactly what the skills encode.
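For the optimizer-step policy item, the quantity to hold fixed is usually the effective batch size. In data-parallel training it is the product of the per-device micro-batch, the number of gradient-accumulation steps, and the world size. The helper names below are hypothetical and only illustrate the arithmetic.

```python
def effective_batch_size(micro_batch, accumulation_steps, world_size):
    """Samples contributing to one optimizer step in data-parallel training."""
    return micro_batch * accumulation_steps * world_size


def accumulation_steps_for(target_effective, micro_batch, world_size):
    """Accumulation steps needed to keep the effective batch at the target."""
    per_pass = micro_batch * world_size  # samples processed per forward/backward pass
    if target_effective % per_pass != 0:
        raise ValueError(
            f"target {target_effective} is not a multiple of {per_pass} samples per pass"
        )
    return target_effective // per_pass


# Doubling the micro-batch from 8 to 16 on 2 devices: halve accumulation to stay at 64.
print(effective_batch_size(8, 4, 2))       # 8 * 4 * 2 = 64
print(accumulation_steps_for(64, 16, 2))   # 64 // (16 * 2) = 2
```

Keeping the effective batch constant while changing the micro-batch or world size is one way to change a single variable per experiment, so that a speedup is not confounded with a change in optimization behavior.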
TrainFlashAgent/
├── skills/ # Playbooks (skills)
│ ├── 01-sandbox.md
│ ├── 02-diagnose.md
│ ├── 03-optimize.md
│ └── 04-verify.md
├── tools/
│ └── trainflash_mcp/ # Standalone MCP helper for telemetry + phase timing
├── .cursorrules # Cursor instructions
├── .github/copilot-instructions.md # Copilot instructions
├── README.md # English (this file)
└── README_zh.md # Chinese README
MIT License