Feature: LLM inference sizing skill for AWS EC2 (NVIDIA GPU selection, VRAM math, performance estimation) #291

### Problem Statement

Customers deploying LLMs on NVIDIA GPUs in AWS face a recurring question: **"Which EC2 instance type and how many GPUs do I need for my model?"** This requires intersecting three domains — model architecture (parameter count, GQA head structure, KV cache scaling), NVIDIA GPU hardware specs (VRAM, HBM bandwidth, FP8/INT4 support, NVLink/NVSwitch topology), and AWS EC2 instance topology (instance families, EFA networking, regional availability). Getting any of these wrong leads to OOM crashes, over-provisioning, or missed latency SLAs.

The current NVIDIA skills catalog covers deployment (Dynamo, NemoClaw, RAG Blueprint), training (Megatron-Core, NeMo), and model development (Nemotron, TAO) — but there is **no skill for the pre-deployment sizing decision** that every AWS customer must make before touching any of those deployment tools.

### Proposed Design

A read-only advisory skill — `llm-aws-sizing` — that takes a model name + workload parameters and produces a structured sizing report: recommended EC2 instance type, GPU count, tensor/pipeline parallelism configuration, quantization recommendation, and estimated tokens-per-second.

**9-step workflow:**

| Step | Function |
|------|----------|
| **0. Refresh** | Fetches live EC2 specs via AWS Knowledge MCP (`search_documentation`, `get_regional_availability`) or web search fallback. Discovers new instance families (P6-B200, G7, G7e) not in static reference tables. |
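| **1–2. Memory math** | Static weight memory (`P × bytes/param`) plus GQA-aware KV cache: `M_kv = 2 × B × L × H_kv × D × S × dtype`, where the 2 covers keys and values, `B` is batch size, `L` is layer count, `H_kv` is KV head count, `D` is head dimension, `S` is sequence length, and `dtype` is bytes per element. Always reads `num_key_value_heads` from the model's `config.json`; using the query head count instead overestimates the KV cache by the GQA ratio on Llama-3/Mistral/Qwen models (for example 4× on Llama-3-8B, 8× on Llama-3-70B). |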
| **3. VRAM total** | Combines weights + KV cache + overhead, applies `gpu_memory_utilization` factor (0.85–0.9), computes minimum GPU count. |
| **4. Instance selection** | Decision matrix mapping model size → EC2 family (G5/G6/G6e/P4de/P5/P5e/P5en/P6/Inf2) with NVLink, EFA, and FP8 support constraints. |
| **5. Parallelism** | TP/PP strategy with single-node NVLink requirement enforcement for TP. |
| **6. Quantization** | FP8 (Ada/Hopper/Blackwell) vs INT4 AWQ/GPTQ (Ampere) trade-off matrix with GPU compatibility. |
| **7. Performance estimation** | Deterministic decode speed model: `TPS = effective_bandwidth / weights_read_per_token`, with worked examples for GPU-resident and offloaded MoE (KTransformers-style) scenarios. Prefill TTFT estimation via compute-bound model. |
| **8. SLA validation** | Maps to vLLM/TensorRT-LLM benchmark suites; bottleneck identification by workload type (prefill-heavy, decode-heavy, high-concurrency, long-context). |
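
A minimal Python sketch of steps 1–5 may make the arithmetic concrete. It is not taken from the skill's source: the function names, the `overhead` figure, the 80 GB per-GPU capacity, and the power-of-two TP heuristic are all assumptions chosen for illustration. Only the two formulas (`P × bytes/param` and `M_kv`) and the `gpu_memory_utilization` factor come from the table above.

```python
import math

def weight_bytes(params: float, bytes_per_param: float) -> float:
    """Static weight memory: P x bytes/param."""
    return params * bytes_per_param

def kv_cache_bytes(batch: int, layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int) -> int:
    """M_kv = 2 x B x L x H_kv x D x S x dtype (the 2 covers K and V)."""
    return 2 * batch * layers * kv_heads * head_dim * seq_len * dtype_bytes

def min_gpu_count(total_bytes: float, vram_bytes_per_gpu: float,
                  gpu_memory_utilization: float = 0.9) -> int:
    """Smallest GPU count whose usable VRAM covers the total footprint."""
    usable = vram_bytes_per_gpu * gpu_memory_utilization
    return math.ceil(total_bytes / usable)

def plan_parallelism(min_gpus: int, gpus_per_node: int) -> tuple[int, int]:
    """Assumed heuristic (not from the skill): round TP up to a power of two,
    cap it at one node so TP stays on NVLink, and cover the rest with PP.
    Assumes gpus_per_node is itself a power of two."""
    tp = 1
    while tp < min_gpus and tp < gpus_per_node:
        tp *= 2
    tp = min(tp, gpus_per_node)
    pp = math.ceil(min_gpus / tp)
    return tp, pp

# Illustrative inputs shaped like a 70B GQA model served in FP16.
weights = weight_bytes(70e9, 2)
kv_gqa = kv_cache_bytes(batch=8, layers=80, kv_heads=8, head_dim=128,
                        seq_len=8192, dtype_bytes=2)
# Same shapes, but wrongly using the 64 query heads instead of the 8 KV heads.
kv_mha = kv_cache_bytes(batch=8, layers=80, kv_heads=64, head_dim=128,
                        seq_len=8192, dtype_bytes=2)
overhead = 4e9  # hypothetical activation/runtime overhead, in bytes

total = weights + kv_gqa + overhead
gpus = min_gpu_count(total, vram_bytes_per_gpu=80e9)  # hypothetical 80 GB GPU
tp, pp = plan_parallelism(gpus, gpus_per_node=8)

print(f"KV cache (GQA): {kv_gqa / 2**30:.0f} GiB, naive MHA: {kv_mha / 2**30:.0f} GiB")
print(f"min GPUs: {gpus}, TP={tp}, PP={pp}")
```

With these inputs the GQA-aware KV cache is 20 GiB, the naive estimate is eight times larger, and the three-GPU minimum rounds up to TP=4 on a single node.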

**NVIDIA alignment:** The skill is centered on NVIDIA GPU hardware: it uses NVIDIA datasheets for VRAM, HBM bandwidth, FP8 TFLOPS, INT8/INT4 TOPS, and NVLink/NVSwitch specs. It guides customers toward the best-fit NVIDIA GPU (L4, L40S, A100, H100, H200, B200) for their workload, directly supporting NVIDIA hardware adoption on AWS.

**Performance estimation is grounded in published work:**
- Pope et al., "Efficiently Scaling Transformer Inference" (MLSys 2023, arXiv:2211.05102) — foundational bandwidth-bound decode model
- "LLM Inference Unveiled: Survey and Roofline Model Insights" (arXiv:2402.16363) — roofline characterization of prefill (compute-bound) vs decode (bandwidth-bound)
- Splitwise (ISCA 2024, arXiv:2311.18677) — phase disaggregation
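
The bandwidth-bound decode model and the compute-bound prefill model referenced above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the skill's implementation: the efficiency factors, the hardware numbers, and the `2 × P` FLOPs-per-token approximation for prefill are assumptions, and only `TPS = effective_bandwidth / weights_read_per_token` comes from the workflow table.

```python
def decode_tps(peak_bandwidth_bytes_per_s: float, bytes_read_per_token: float,
               bandwidth_efficiency: float) -> float:
    """Single-stream decode estimate: TPS = effective_bandwidth / weights_read_per_token.
    For MoE models, bytes_read_per_token would cover only the active experts."""
    effective_bandwidth = peak_bandwidth_bytes_per_s * bandwidth_efficiency
    return effective_bandwidth / bytes_read_per_token

def prefill_ttft_seconds(params: float, prompt_tokens: int, peak_flops: float,
                         compute_efficiency: float) -> float:
    """Compute-bound prefill estimate, assuming roughly 2 x P FLOPs per prompt token."""
    flops_needed = 2 * params * prompt_tokens
    return flops_needed / (peak_flops * compute_efficiency)

# Hypothetical hardware and model numbers, chosen only to keep the arithmetic readable:
# an 8B-parameter model in FP16 (16e9 bytes of weights) on a GPU with 2 TB/s of
# memory bandwidth and 400 TFLOPS of peak compute, both at 50% achieved efficiency.
tps = decode_tps(2.0e12, 16e9, bandwidth_efficiency=0.5)
ttft = prefill_ttft_seconds(8e9, prompt_tokens=2048, peak_flops=400e12,
                            compute_efficiency=0.5)

print(f"decode: {tps:.1f} tokens/s, prefill TTFT: {ttft * 1000:.0f} ms")
```

Both numbers are first-order bounds for a single request; batching, KV cache reads at long context, and kernel overheads shift real measurements, which is why the workflow ends with benchmark-based SLA validation.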

**File structure:**
```
llm-aws-sizing/
├── SKILL.md                          # 9-step workflow, formulas, decision matrices
└── references/
    ├── ec2-instance-catalog.md       # Full instance tables (G5–P6, Inf2, Trn2)
    └── gpu-specs.md                  # NVIDIA GPU specs with FP8 support matrix
```

**Working implementation:** Already built and tested — https://github.com/gcasilva/llm-aws-sizing-skill (Apache 2.0). The skill is agent-agnostic and follows the Agent Skills spec (portable directory with `SKILL.md` at root, YAML frontmatter with `name` + `description`).

Example output:

<img width="931" height="649" alt="Image" src="https://github.com/user-attachments/assets/3518a685-7255-4c17-aa62-b239121fc77e" />

### Alternatives Considered

1. **Add sizing guidance to the Dynamo skill** — Rejected. Dynamo is a deployment framework; sizing is a pre-deployment advisory step that should be model/engine-agnostic, not coupled to a specific serving stack.
2. **Documentation page instead of a skill** — Rejected. A static doc can't do the math (parameter count × bytes/param, KV cache with model-specific architecture, GPU count calculation). An agent skill can read the model's `config.json`, fetch live EC2 specs, and produce a computed recommendation.
3. **AWS-only tool (not in NVIDIA catalog)** — Considered, but the skill is fundamentally about NVIDIA GPU selection. Every recommendation centers on NVIDIA hardware capabilities (bandwidth, FP8 support, NVLink topology). It belongs alongside other NVIDIA GPU-adjacent skills.
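
To illustrate the point in alternative 2 about reading `config.json`, here is a small Python sketch of extracting the attention shape that drives the KV cache math. The function name and the sample values are hypothetical; the keys are the ones Hugging Face Llama-style configs commonly use, and the fallback to multi-head attention when `num_key_value_heads` is absent is an assumption of this sketch.

```python
import json

def attention_shape(config_text: str) -> dict:
    """Extract layer and head counts from a Hugging Face style config.json string."""
    cfg = json.loads(config_text)
    heads = cfg["num_attention_heads"]
    # Assumption: a missing num_key_value_heads means plain multi-head attention.
    kv_heads = cfg.get("num_key_value_heads", heads)
    # Some configs state head_dim explicitly; otherwise derive it from hidden_size.
    head_dim = cfg.get("head_dim") or cfg["hidden_size"] // heads
    return {
        "layers": cfg["num_hidden_layers"],
        "kv_heads": kv_heads,
        "head_dim": head_dim,
        "gqa_ratio": heads // kv_heads,
    }

# Sample values shaped like an 8B GQA model config (illustrative, not fetched).
sample_config = """
{
  "hidden_size": 4096,
  "num_hidden_layers": 32,
  "num_attention_heads": 32,
  "num_key_value_heads": 8
}
"""

shape = attention_shape(sample_config)
print(shape)
```

The `gqa_ratio` field is the factor by which a sizing estimate would be inflated if it used the query head count in place of `num_key_value_heads`.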

### What's needed for catalog onboarding

Per CONTRIBUTING.md, publishing requires:
- [ ] Product repo to host the source of truth (this skill currently lives in a standalone community repo)
- [ ] `skill-card.md` governance card
- [ ] Tier-3 evaluation dataset (`evals/evals.json`)
- [ ] `BENCHMARK.md` from evaluation runs
- [ ] OMS signature via NVIDIA signing pipeline
- [ ] IP review (six-question check)

The skill content, formulas, and reference data are complete. What's needed is NVIDIA product team sponsorship to run it through the signing/evaluation pipeline.

### Category

`enhancement: new skill request`

### Checklist

- [x] I searched existing issues and this is not a duplicate
- [x] This is a design proposal, not a "please build this" request
