wyann22/aios

AIOS: Build an LLM Inference Engine from Scratch

Have you ever wondered how ChatGPT generates responses? How a model with billions of parameters actually runs on your GPU? Or why inference optimization matters so much?

AIOS is a hands-on learning project where you build an LLM inference engine from scratch. By the end, you won't just use LLMs—you'll understand how inference works under the hood.

Who This Is For

  • Software engineers curious about AI/ML systems
  • ML practitioners who want to understand inference beyond training
  • System engineers interested in GPU programming and optimization
  • Students looking for practical, implementable knowledge

What You'll Build

By completing this project, you will:

  • Build a production-grade LLM inference engine, starting from a simple Hugging Face model
  • Implement every major optimization: KV cache, paged attention, continuous batching, FlashAttention, CUDA graphs, tensor parallelism
  • Go from ~5 tok/s to ~1200+ tok/s — a 240x improvement
  • Serve an OpenAI-compatible API

The Key Mental Model: LLM Inference Engine as an "Operating System"

Traditional computing has an OS that bridges applications and hardware. AI computing has an inference engine that bridges LLMs and GPUs:

| Operating System | Inference Engine |
| --- | --- |
| Process scheduling | Request batching & scheduling |
| Memory management (virtual memory, paging) | KV cache management (paged attention) |
| I/O scheduling | Prefill/decode scheduling |
| Device drivers | Kernel/operator optimization |
| Multi-core parallelism | Multi-GPU tensor parallelism |
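
The memory-management row can be made concrete with a toy block table: each sequence's logical token positions map to fixed-size physical blocks, much as a page table maps virtual pages to frames. The sketch below is only an illustration of that idea, not the project's `block_manager.py`; the class name `ToyBlockManager`, its methods, and the block size of 4 are assumptions made for this example.

```python
class ToyBlockManager:
    """Maps each sequence's logical token positions to fixed-size physical blocks."""

    def __init__(self, num_blocks: int, block_size: int = 4):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))    # physical block ids not in use
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical block ids
        self.num_tokens: dict[int, int] = {}          # seq_id -> tokens stored so far

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a slot for one more token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.num_tokens.get(seq_id, 0)
        if n % self.block_size == 0:  # current block is full (or none allocated yet)
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.num_tokens[seq_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def free(self, seq_id: int) -> None:
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.num_tokens.pop(seq_id, None)


manager = ToyBlockManager(num_blocks=4, block_size=4)
slots_a = [manager.append_token(seq_id=0) for _ in range(5)]  # fifth token spills into a second block
slots_b = [manager.append_token(seq_id=1) for _ in range(2)]
print(manager.block_tables)
```

Because blocks are allocated on demand, a sequence wastes at most one partially filled block, and freed blocks can be handed to any other sequence.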

Performance Progression

Each lesson adds one major optimization. Here's the throughput progression:

| Lesson | Throughput (32 req) | Key Optimization | Multiplier |
| --- | --- | --- | --- |
| 3 | ~5 tok/s | Baseline (no KV cache) | 1x |
| 4 | ~25 tok/s | KV cache reuse | 5x |
| 5 | ~30 tok/s | Pre-allocated cache | 1.2x |
| 6 | ~30 tok/s | Paged cache | (memory efficiency) |
| 7 | ~400 tok/s | Batching | 13x |
| 8 | ~600 tok/s | Continuous batching | 1.5x |
| 9 | ~900 tok/s | FlashAttention | 1.5x |
| 10 | ~1000 tok/s | Fused layers | 1.1x |
| 11 | ~1200 tok/s | CUDA graphs | 1.2x |
| 12 | ~1200 tok/s | Sampling | (quality) |
| 13 | ~1200 tok/s + prefix | Prefix caching | prefill savings |
| 14 | ~2000 tok/s (2 GPU) | Tensor parallelism | 1.7x |
| 15 | Production API | Serving layer | |
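
The multipliers are per-lesson ratios, so the overall single-GPU gain is simply the last throughput divided by the first. The short check below uses the approximate single-GPU numbers from the table (Lessons 3 to 11); the variable names are chosen for this example only.

```python
# Approximate throughput (tok/s) after each lesson, copied from the table above.
throughput = {3: 5, 4: 25, 5: 30, 6: 30, 7: 400, 8: 600, 9: 900, 10: 1000, 11: 1200}

lessons = sorted(throughput)
# Ratio of each lesson's throughput to the previous lesson's.
step_speedup = {
    cur: throughput[cur] / throughput[prev]
    for prev, cur in zip(lessons, lessons[1:])
}
total_speedup = throughput[lessons[-1]] / throughput[lessons[0]]

for lesson, ratio in step_speedup.items():
    print(f"Lesson {lesson}: {ratio:.1f}x over the previous lesson")
print(f"Single-GPU total: {total_speedup:.0f}x")
```

Lessons whose ratio is 1.0x (such as the paged cache) change memory behavior or output quality rather than raw throughput.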

Course Roadmap

Foundation (Lessons 0–2)

Building the Engine (Lessons 3–8)

Optimization (Lessons 9–12)

Scaling and Serving (Lessons 13–15)

Engine Architecture (Final State)

aios/
├── config.py                    # Engine configuration
├── sampling_params.py           # Per-request sampling parameters
├── llm.py                       # User-facing API
├── engine/
│   ├── llm_engine.py            # Orchestrator (model + scheduler + tokenizer)
│   ├── scheduler.py             # Prefill-first continuous batching scheduler
│   ├── model_runner.py          # GPU execution, KV cache, CUDA graphs, TP
│   ├── sequence.py              # Per-request state machine
│   └── block_manager.py         # Paged KV cache + prefix caching
├── models/
│   └── qwen3.py                 # Inference-only Qwen3 (no HF model deps)
├── layers/
│   ├── attention.py             # FlashAttention + Triton KV write
│   ├── linear.py                # TP-aware linear layers
│   ├── layernorm.py             # RMSNorm with fused residual add
│   ├── rotary_embedding.py      # Precomputed RoPE
│   ├── activation.py            # SiluAndMul (fused gate*up)
│   ├── embed_head.py            # Vocab-parallel embedding + LM head
│   └── sampler.py               # Gumbel-max sampling
└── utils/
    ├── loader.py                # Weight loader (safetensors → fused modules)
    └── context.py               # ThreadLocal context for attention metadata
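
The tree lists `sampler.py` as using Gumbel-max sampling. The idea is that adding independent Gumbel(0, 1) noise to each logit and taking the argmax draws an index with probability `softmax(logits)`. The pure-Python sketch below illustrates the trick on a short list of logits; it is not the project's sampler, and the function name `gumbel_max_sample` and its parameters are assumptions for this example.

```python
import math
import random


def gumbel_max_sample(logits, temperature=1.0, rng=random):
    """Draw one index with probability softmax(logits / temperature)."""
    best_index, best_score = -1, -math.inf
    for i, logit in enumerate(logits):
        u = rng.random()
        while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
            u = rng.random()
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        score = logit / temperature + gumbel
        if score > best_score:
            best_index, best_score = i, score
    return best_index


# Empirical check: sampled frequencies should approach the target probabilities.
target = [0.7, 0.2, 0.1]
logits = [math.log(p) for p in target]
rng = random.Random(0)
num_samples = 20_000
counts = [0] * len(target)
for _ in range(num_samples):
    counts[gumbel_max_sample(logits, rng=rng)] += 1
frequencies = [c / num_samples for c in counts]
print(frequencies)
```

On a GPU the same computation is one elementwise operation followed by an argmax over the vocabulary, which is why the trick suits batched decoding.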

Model Support

Supports the Qwen3 sizes listed below. Exercises are model-size agnostic: the model configuration is loaded from each checkpoint's config.json.

| Model | Parameters | Recommended Use |
| --- | --- | --- |
| Qwen3-0.6B | 0.6B | Development, fast iteration |
| Qwen3-1.7B | 1.7B | Development, basic testing |
| Qwen3-4B | 4B | Testing, light benchmarking |
| Qwen3-8B | 8B | Benchmarking |
| Qwen3-14B | 14B | Benchmarking (needs TP=2) |
| Qwen3-32B | 32B | Benchmarking (needs TP=4) |
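
Since the configuration comes from config.json, quantities such as the KV cache footprint can be derived from it rather than hard-coded per model size. The sketch below assumes the usual Hugging Face key names (`num_hidden_layers`, `num_key_value_heads`, `head_dim`, and so on) and a 2-byte dtype. The helper names and the example values are illustrative, so read the real values from your own checkpoint.

```python
import json
from pathlib import Path


def load_config(model_dir: str) -> dict:
    """Read config.json from a local model directory."""
    return json.loads((Path(model_dir) / "config.json").read_text())


def kv_bytes_per_token(config: dict, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers (K and V)."""
    layers = config["num_hidden_layers"]
    kv_heads = config["num_key_value_heads"]
    # Fall back to hidden_size / num_attention_heads when head_dim is absent.
    head_dim = config.get("head_dim") or (
        config["hidden_size"] // config["num_attention_heads"]
    )
    return 2 * layers * kv_heads * head_dim * dtype_bytes


# Illustrative values only; use load_config("/path/to/model") for a real checkpoint.
example_config = {
    "hidden_size": 1024,
    "num_hidden_layers": 28,
    "num_attention_heads": 16,
    "num_key_value_heads": 8,
    "head_dim": 128,
}
per_token = kv_bytes_per_token(example_config)
print(f"{per_token / 1024:.0f} KiB of KV cache per token")
```

Multiplying this figure by the number of cached tokens gives a rough estimate of how much GPU memory to reserve for the KV cache under these assumptions.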

Quick Start

Prerequisites

  • Python 3.10+
  • PyTorch 2.0+
  • CUDA-capable GPU (A100/H100 recommended for full benchmarks; any CUDA GPU is enough for learning)

Install Dependencies

pip install torch transformers safetensors flash-attn triton xxhash numpy tqdm

Run the Engine

# Simple generation
python generate.py --model /path/to/Qwen3-0.6B

# Benchmark throughput
python benchmark.py --model /path/to/Qwen3-0.6B --num-prompts 32

# Using the Python API
python -c "
from aios import LLM, SamplingParams
llm = LLM('/path/to/Qwen3-0.6B')
outputs = llm.generate(['Hello world'], SamplingParams(max_tokens=64))
print(outputs[0]['text'])
"

Follow the Course

Start from the beginning:

  1. Lesson 0: Introduction
  2. Lesson 1: LLM Basics
  3. Continue through Lessons 2–15

Each lesson includes:

  • README.md — Concepts, diagrams, step-by-step guide
  • run_lessonN.py — Standalone demo script
  • requirements.txt — Dependencies for this lesson

References

This project is inspired by:

  • nano-vllm (~1,200 lines) — Minimal vLLM clone with clean architecture
  • mini-sglang (~6,400 lines) — Feature-complete SGLang reference

License

Educational project. See individual files for details.
