Problem Statement
Customers deploying LLMs on NVIDIA GPUs in AWS face a recurring question: "Which EC2 instance type and how many GPUs do I need for my model?" This requires intersecting three domains — model architecture (parameter count, GQA head structure, KV cache scaling), NVIDIA GPU hardware specs (VRAM, HBM bandwidth, FP8/INT4 support, NVLink/NVSwitch topology), and AWS EC2 instance topology (instance families, EFA networking, regional availability). Getting any of these wrong leads to OOM crashes, over-provisioning, or missed latency SLAs.
The current NVIDIA skills catalog covers deployment (Dynamo, NemoClaw, RAG Blueprint), training (Megatron-Core, NeMo), and model development (Nemotron, TAO) — but there is no skill for the pre-deployment sizing decision that every AWS customer must make before touching any of those deployment tools.
Proposed Design
A read-only advisory skill — llm-aws-sizing — that takes a model name + workload parameters and produces a structured sizing report: recommended EC2 instance type, GPU count, tensor/pipeline parallelism configuration, quantization recommendation, and estimated tokens-per-second.
9-step workflow:
| Step | Function |
|---|---|
| 0. Refresh | Fetches live EC2 specs via AWS Knowledge MCP (search_documentation, get_regional_availability) or web search fallback. Discovers new instance families (P6-B200, G7, G7e) not in static reference tables. |
| 1–2. Memory math | Static weight memory (P × bytes/param) + GQA-aware KV cache (M_kv = 2 × B × L × H_kv × D × S × dtype). Always reads num_key_value_heads from model config.json, which prevents the common 8× KV cache overestimation on Llama-3/Mistral/Qwen models. |
| 3. VRAM total | Combines weights + KV cache + overhead, applies gpu_memory_utilization factor (0.85–0.9), computes minimum GPU count. |
| 4. Instance selection | Decision matrix mapping model size → EC2 family (G5/G6/G6e/P4de/P5/P5e/P5en/P6/Inf2) with NVLink, EFA, and FP8 support constraints. |
| 5. Parallelism | TP/PP strategy with single-node NVLink requirement enforcement for TP. |
| 6. Quantization | FP8 (Hopper/Ada) vs INT4 AWQ/GPTQ (Ampere) trade-off matrix with GPU compatibility. |
| 7. Performance estimation | Deterministic decode speed model: TPS = effective_bandwidth / weights_read_per_token, with worked examples for GPU-resident and offloaded MoE (KTransformers-style) scenarios. Prefill TTFT estimation via compute-bound model. |
| 8. SLA validation | Maps to vLLM/TensorRT-LLM benchmark suites; bottleneck identification by workload type (prefill-heavy, decode-heavy, high-concurrency, long-context). |
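The memory math in steps 1–3 can be sketched in a few lines. This is a minimal illustration, not the skill's actual code: the function names are hypothetical, the symbols are read as B = batch size, L = layers, H_kv = KV heads, D = head dimension, S = sequence length, and dtype = bytes per element, and the config values are illustrative numbers for a 70B-class GQA model. The 80 GB per-GPU figure is a nominal assumption.

```python
import math

def weight_bytes(num_params, bytes_per_param):
    """Static weight memory: P x bytes/param."""
    return num_params * bytes_per_param

def kv_cache_bytes(config, batch_size, seq_len, dtype_bytes=2):
    """GQA-aware KV cache: 2 x B x L x H_kv x D x S x dtype bytes."""
    layers = config["num_hidden_layers"]
    heads = config["num_attention_heads"]
    # Fall back to full multi-head attention only when the config has no GQA field.
    kv_heads = config.get("num_key_value_heads", heads)
    head_dim = config.get("head_dim", config["hidden_size"] // heads)
    return 2 * batch_size * layers * kv_heads * head_dim * seq_len * dtype_bytes

def min_gpu_count(weights, kv_cache, vram_per_gpu_bytes,
                  overhead_bytes=0.0, gpu_memory_utilization=0.9):
    """Smallest GPU count whose usable VRAM holds weights + KV cache + overhead."""
    usable = vram_per_gpu_bytes * gpu_memory_utilization
    return math.ceil((weights + kv_cache + overhead_bytes) / usable)

# Illustrative config fields for a 70B-class model with grouped-query attention.
cfg = {
    "num_hidden_layers": 80,
    "num_attention_heads": 64,
    "num_key_value_heads": 8,
    "hidden_size": 8192,
}

kv = kv_cache_bytes(cfg, batch_size=8, seq_len=8192)  # FP16 cache
# Same model sized as if every attention head had its own K/V (the common mistake).
naive_cfg = {**cfg, "num_key_value_heads": cfg["num_attention_heads"]}
naive_kv = kv_cache_bytes(naive_cfg, batch_size=8, seq_len=8192)

weights = weight_bytes(70e9, 2)  # FP16 weights
gpus = min_gpu_count(weights, kv, vram_per_gpu_bytes=80e9)

print(kv / 2**30)      # 20.0 (GiB)
print(naive_kv // kv)  # 8
print(gpus)            # 3
```

Three GPUs is only the memory floor here. Tensor parallelism usually needs a degree that divides the attention head count evenly, so in practice this would likely round up to 4 GPUs on a single NVLink-connected node.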
NVIDIA alignment: The skill is centered on NVIDIA GPU hardware — it uses NVIDIA datasheets for VRAM, HBM bandwidth, FP8/INT8/INT4 TFLOPS, and NVLink/NVSwitch specs. It guides customers toward the optimal NVIDIA GPU (L4, L40S, A100, H100, H200, B200) for their workload, directly supporting NVIDIA hardware adoption on AWS.
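The GPU compatibility side of the quantization step can be illustrated with a small lookup. This is a rough sketch under stated assumptions: the function and table names are hypothetical, the rule "prefer FP8 where the architecture supports it, otherwise INT4" is a simplification of the trade-off matrix, and 0.5 bytes per parameter for INT4 ignores the extra space taken by quantization scales.

```python
# GPU architecture per model name. FP8 tensor-core support starts with Ada and Hopper.
GPU_ARCH = {
    "A10G": "ampere",
    "A100": "ampere",
    "L4": "ada",
    "L40S": "ada",
    "H100": "hopper",
    "H200": "hopper",
    "B200": "blackwell",
}
FP8_ARCHS = {"ada", "hopper", "blackwell"}

# Approximate storage cost per parameter for each scheme (assumption: no scale overhead).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def recommend_quantization(gpu):
    """Pick FP8 when the GPU has native support, otherwise fall back to INT4."""
    arch = GPU_ARCH[gpu]
    return "fp8" if arch in FP8_ARCHS else "int4"

def quantized_weight_gb(num_params, scheme):
    """Weight footprint in GB (10^9 bytes) under the given scheme."""
    return num_params * BYTES_PER_PARAM[scheme] / 1e9

for gpu in ("A100", "L40S", "H100"):
    scheme = recommend_quantization(gpu)
    print(gpu, scheme, quantized_weight_gb(70e9, scheme))
```

A real recommendation would also weigh accuracy loss and serving-engine kernel support, which this lookup does not model.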
The performance estimation model is grounded in published work:
- Pope et al., "Efficiently Scaling Transformer Inference" (MLSys 2023, arXiv:2211.05102) — foundational bandwidth-bound decode model
- "LLM Inference Unveiled: Survey and Roofline Model Insights" (arXiv:2402.16363) — roofline characterization of prefill (compute-bound) vs decode (bandwidth-bound)
- Splitwise (ISCA 2024, arXiv:2311.18677) — phase disaggregation
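The bandwidth-bound decode model and compute-bound prefill model referenced above reduce to two short formulas. The sketch below is illustrative only: the function names, the efficiency factors, and the hardware figures (3.35 TB/s memory bandwidth, 1e15 FLOP/s peak compute) are assumptions chosen for the example, and the prefill estimate uses the standard approximation of about 2 FLOPs per active parameter per token.

```python
def decode_tps(bandwidth_bytes_per_s, bytes_read_per_token, efficiency=0.7):
    """Decode speed: TPS = effective_bandwidth / weights_read_per_token.

    For a dense model, bytes_read_per_token is the full weight footprint.
    For an MoE model, it is only the weights of the experts active per token.
    """
    effective_bandwidth = bandwidth_bytes_per_s * efficiency
    return effective_bandwidth / bytes_read_per_token

def prefill_ttft_s(active_params, prompt_tokens, peak_flops, efficiency=0.5):
    """Compute-bound time to first token: ~2 FLOPs per active parameter per token."""
    total_flops = 2 * active_params * prompt_tokens
    return total_flops / (peak_flops * efficiency)

# Example: 8B-parameter dense model in FP16 (16e9 bytes of weights) on one GPU.
weights_bytes = 8e9 * 2
upper_bound = decode_tps(3.35e12, weights_bytes, efficiency=1.0)
realistic = decode_tps(3.35e12, weights_bytes)  # assumed 70% of peak bandwidth
ttft = prefill_ttft_s(8e9, prompt_tokens=2048, peak_flops=1e15)

print(f"decode upper bound: {upper_bound:.1f} tok/s")
print(f"decode at 70% bandwidth: {realistic:.1f} tok/s")
print(f"prefill TTFT: {ttft * 1000:.1f} ms")
```

These figures are per-sequence estimates for a single stream. Batching reuses each weight read across requests, so aggregate throughput can be much higher until the KV cache reads or compute become the bottleneck.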
File structure:
llm-aws-sizing/
├── SKILL.md                      # 9-step workflow, formulas, decision matrices
└── references/
    ├── ec2-instance-catalog.md   # Full instance tables (G5–P6, Inf2, Trn2)
    └── gpu-specs.md              # NVIDIA GPU specs with FP8 support matrix
Working implementation: Already built and tested — https://github.com/gcasilva/llm-aws-sizing-skill (Apache 2.0). The skill is agent-agnostic and follows the Agent Skills spec (portable directory with SKILL.md at root, YAML frontmatter with name + description).
Example output:
Alternatives Considered
- Add sizing guidance to the Dynamo skill — Rejected. Dynamo is a deployment framework; sizing is a pre-deployment advisory step that should be model/engine-agnostic, not coupled to a specific serving stack.
- Documentation page instead of a skill — Rejected. A static doc can't do the math (parameter count × bytes/param, KV cache with model-specific architecture, GPU count calculation). An agent skill can read the model's config.json, fetch live EC2 specs, and produce a computed recommendation.
- AWS-only tool (not in NVIDIA catalog) — Considered, but the skill is fundamentally about NVIDIA GPU selection. Every recommendation centers on NVIDIA hardware capabilities (bandwidth, FP8 support, NVLink topology). It belongs alongside other NVIDIA GPU-adjacent skills.
What's needed for catalog onboarding
Per CONTRIBUTING.md, publishing requires:
- skill-card.md governance card
- evals/evals.json
- BENCHMARK.md from evaluation runs
The skill content, formulas, and reference data are complete. What's needed is NVIDIA product team sponsorship to run it through the signing/evaluation pipeline.
Category
enhancement: new skill request
Checklist