
Feature: LLM inference sizing skill for AWS EC2 (NVIDIA GPU selection, VRAM math, performance estimation) #291

Description

@gcasilva

Problem Statement

Customers deploying LLMs on NVIDIA GPUs in AWS face a recurring question: "Which EC2 instance type and how many GPUs do I need for my model?" This requires intersecting three domains — model architecture (parameter count, GQA head structure, KV cache scaling), NVIDIA GPU hardware specs (VRAM, HBM bandwidth, FP8/INT4 support, NVLink/NVSwitch topology), and AWS EC2 instance topology (instance families, EFA networking, regional availability). Getting any of these wrong leads to OOM crashes, over-provisioning, or missed latency SLAs.

The current NVIDIA skills catalog covers deployment (Dynamo, NemoClaw, RAG Blueprint), training (Megatron-Core, NeMo), and model development (Nemotron, TAO) — but there is no skill for the pre-deployment sizing decision that every AWS customer must make before touching any of those deployment tools.

Proposed Design

The proposal is a read-only advisory skill, llm-aws-sizing, that takes a model name plus workload parameters and produces a structured sizing report: recommended EC2 instance type, GPU count, tensor/pipeline parallelism configuration, quantization recommendation, and estimated tokens per second.

9-step workflow:

| Step | Function |
| --- | --- |
| 0. Refresh | Fetches live EC2 specs via AWS Knowledge MCP (search_documentation, get_regional_availability) or web search fallback. Discovers new instance families (P6-B200, G7, G7e) not in static reference tables. |
| 1–2. Memory math | Static weight memory (P × bytes/param) + GQA-aware KV cache (M_kv = 2 × B × L × H_kv × D × S × dtype). Always reads num_key_value_heads from the model's config.json, which prevents the common 8× KV cache overestimation on Llama-3/Mistral/Qwen models. |
| 3. VRAM total | Combines weights + KV cache + overhead, applies gpu_memory_utilization factor (0.85–0.9), computes minimum GPU count. |
| 4. Instance selection | Decision matrix mapping model size → EC2 family (G5/G6/G6e/P4de/P5/P5e/P5en/P6/Inf2) with NVLink, EFA, and FP8 support constraints. |
| 5. Parallelism | TP/PP strategy with single-node NVLink requirement enforcement for TP. |
| 6. Quantization | FP8 (Hopper/Ada) vs INT4 AWQ/GPTQ (Ampere) trade-off matrix with GPU compatibility. |
| 7. Performance estimation | Deterministic decode speed model: TPS = effective_bandwidth / weights_read_per_token, with worked examples for GPU-resident and offloaded MoE (KTransformers-style) scenarios. Prefill TTFT estimation via compute-bound model. |
| 8. SLA validation | Maps to vLLM/TensorRT-LLM benchmark suites; bottleneck identification by workload type (prefill-heavy, decode-heavy, high-concurrency, long-context). |
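
As a rough illustration of steps 1–3, the Python sketch below computes weight memory, the GQA-aware KV cache, and a minimum GPU count. It is not taken from the skill's implementation. The function names, the reading of the formula symbols (B = batch size, L = layers, H_kv = KV heads, D = head dimension, S = sequence length, dtype = bytes per element), the zero default for overhead, and the Llama-3-70B-style config values are all assumptions made for illustration.

```python
import math

def weight_bytes(num_params: float, bytes_per_param: float) -> float:
    """Static weight memory: P x bytes/param."""
    return num_params * bytes_per_param

def kv_cache_bytes(config: dict, batch_size: int, seq_len: int,
                   dtype_bytes: int = 2, use_gqa: bool = True) -> int:
    """M_kv = 2 x B x L x H_kv x D x S x dtype (the 2 covers keys and values)."""
    layers = config["num_hidden_layers"]
    n_heads = config["num_attention_heads"]
    # Fall back to the attention head count only when the config has no
    # num_key_value_heads entry (plain multi-head attention).
    n_kv = config.get("num_key_value_heads", n_heads) if use_gqa else n_heads
    head_dim = config.get("head_dim") or config["hidden_size"] // n_heads
    return 2 * batch_size * layers * n_kv * head_dim * seq_len * dtype_bytes

def min_gpu_count(weights: float, kv: float, gpu_vram_bytes: float,
                  overhead_bytes: float = 0.0,
                  gpu_memory_utilization: float = 0.9) -> int:
    """Smallest GPU count whose usable VRAM holds weights + KV cache + overhead."""
    usable_per_gpu = gpu_vram_bytes * gpu_memory_utilization
    return math.ceil((weights + kv + overhead_bytes) / usable_per_gpu)

# Illustrative config in the shape of a Hugging Face config.json
# (values are assumed here, not read from a real file).
config = {
    "num_hidden_layers": 80,
    "num_attention_heads": 64,
    "num_key_value_heads": 8,
    "hidden_size": 8192,
}

weights = weight_bytes(70e9, 2)                        # 70B params at 2 bytes/param
kv_gqa = kv_cache_bytes(config, batch_size=8, seq_len=8192)
kv_mha = kv_cache_bytes(config, batch_size=8, seq_len=8192, use_gqa=False)
gpus = min_gpu_count(weights, kv_gqa, gpu_vram_bytes=80e9)

print(f"KV cache (GQA): {kv_gqa / 2**30:.1f} GiB")
print(f"KV cache (if KV heads were ignored): {kv_mha / 2**30:.1f} GiB")
print(f"Minimum GPU count: {gpus}")
```

With these inputs the GQA-aware cache comes to 20 GiB, while ignoring num_key_value_heads inflates it eightfold to 160 GiB. The GPU count here is a pure memory lower bound; it does not apply the tensor-parallel divisibility or single-node constraints from steps 4 and 5.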

NVIDIA alignment: The skill is centered on NVIDIA GPU hardware — it uses NVIDIA datasheets for VRAM, HBM bandwidth, FP8/INT8/INT4 TFLOPS, and NVLink/NVSwitch specs. It guides customers toward the optimal NVIDIA GPU (L4, L40S, A100, H100, H200, B200) for their workload, directly supporting NVIDIA hardware adoption on AWS.

Performance estimation is grounded in published research:

  • Pope et al., "Efficiently Scaling Transformer Inference" (MLSys 2023, arXiv:2211.05102) — foundational bandwidth-bound decode model
  • "LLM Inference Unveiled: Survey and Roofline Model Insights" (arXiv:2402.16363) — roofline characterization of prefill (compute-bound) vs decode (bandwidth-bound)
  • Splitwise (ISCA 2024, arXiv:2311.18677) — phase disaggregation
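
A minimal Python sketch of the two estimates from step 7, assuming the bandwidth-bound decode model and a compute-bound prefill model as characterized in the papers above. The function names, the efficiency factors, the 2 × parameters × tokens FLOP approximation for prefill, and every numeric input are illustrative assumptions, not values from the skill or from any datasheet.

```python
def decode_tps(peak_bandwidth_bytes_s: float, bytes_read_per_token: float,
               bandwidth_efficiency: float = 0.8) -> float:
    """Bandwidth-bound decode: TPS = effective_bandwidth / weights_read_per_token.

    For a dense model, bytes_read_per_token is roughly the full weight size.
    For an MoE model, it would be only the weights of the active experts.
    """
    effective_bandwidth = peak_bandwidth_bytes_s * bandwidth_efficiency
    return effective_bandwidth / bytes_read_per_token

def prefill_ttft_s(num_params: float, prompt_tokens: int, peak_flops: float,
                   compute_efficiency: float = 0.5) -> float:
    """Compute-bound prefill: time = required FLOPs / achievable FLOP/s.

    Uses the common approximation of about 2 FLOPs per parameter per token
    and ignores the attention term (an assumption of this sketch).
    """
    required_flops = 2 * num_params * prompt_tokens
    return required_flops / (peak_flops * compute_efficiency)

# Hypothetical accelerator and model, chosen for round numbers.
tps = decode_tps(peak_bandwidth_bytes_s=2.0e12,   # 2 TB/s memory bandwidth
                 bytes_read_per_token=16e9)       # 8B params at 2 bytes/param
ttft = prefill_ttft_s(num_params=8e9, prompt_tokens=1000, peak_flops=4.0e14)

print(f"Estimated decode speed: {tps:.0f} tokens/s per sequence")
print(f"Estimated TTFT: {ttft * 1000:.0f} ms")
```

Both functions give upper-level estimates for a single sequence. Batching, KV cache reads, and interconnect overhead under tensor parallelism would shift the real numbers, which is why step 8 defers to benchmark runs.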

File structure:

llm-aws-sizing/
├── SKILL.md                          # 9-step workflow, formulas, decision matrices
└── references/
    ├── ec2-instance-catalog.md       # Full instance tables (G5–P6, Inf2, Trn2)
    └── gpu-specs.md                  # NVIDIA GPU specs with FP8 support matrix

Working implementation: Already built and tested — https://github.com/gcasilva/llm-aws-sizing-skill (Apache 2.0). The skill is agent-agnostic and follows the Agent Skills spec (portable directory with SKILL.md at root, YAML frontmatter with name + description).

Example output:

[Image: example output of the skill]

Alternatives Considered

  1. Add sizing guidance to the Dynamo skill — Rejected. Dynamo is a deployment framework; sizing is a pre-deployment advisory step that should be model/engine-agnostic, not coupled to a specific serving stack.
  2. Documentation page instead of a skill — Rejected. A static doc can't do the math (parameter count × bytes/param, KV cache with model-specific architecture, GPU count calculation). An agent skill can read the model's config.json, fetch live EC2 specs, and produce a computed recommendation.
  3. AWS-only tool (not in NVIDIA catalog) — Considered, but the skill is fundamentally about NVIDIA GPU selection. Every recommendation centers on NVIDIA hardware capabilities (bandwidth, FP8 support, NVLink topology). It belongs alongside other NVIDIA GPU-adjacent skills.

What's needed for catalog onboarding

Per CONTRIBUTING.md, publishing requires:

  • Product repo to host the source of truth (this skill currently lives in a standalone community repo)
  • skill-card.md governance card
  • Tier-3 evaluation dataset (evals/evals.json)
  • BENCHMARK.md from evaluation runs
  • OMS signature via NVIDIA signing pipeline
  • IP review (six-question check)

The skill content, formulas, and reference data are complete. What's needed is NVIDIA product team sponsorship to run it through the signing/evaluation pipeline.

Category

enhancement: new skill request

Checklist

  • I searched existing issues and this is not a duplicate
  • This is a design proposal, not a "please build this" request
