Cache-Augmented Generation (CAG) System

This entire implementation — setup, debugging, GPU validation, and documentation — was built autonomously using NEO, an autonomous AI agent. NEO wrote, debugged, and tested all the code, fixed 9 bugs across CUDA/Python/shell, and ran 11 GPU validation tests end-to-end. Try NEO in your IDE with the VS Code extension.

A RAG-less document QA system that loads entire documents into an LLM's KV cache once, saves it to disk, and restores it instantly before every query. No embeddings, no vector DB, no chunking.

What is CAG?

Traditional RAG breaks documents into chunks, embeds them, stores them in a vector DB, and retrieves the most relevant chunks at query time. This works but loses context — the model only ever sees fragments.

Cache-Augmented Generation takes a different approach: load the entire document into the model's context window, let the model prefill the full KV cache, save that cache to disk, and restore it in seconds before every query. The model sees everything, every time.

	Traditional RAG	Cache-Augmented Generation
Context quality	Chunked fragments	Full document, always
Query latency	Retrieval + generation	Instant restore + generation
Setup complexity	Embeddings + vector DB	One prefill, done
Repeated queries	Re-retrieve every time	Restore once per restart

The original concept was demonstrated by Han Xiao showing 1M-token inference on a single 24 GB GPU — the entire book in context at once.

How It Works

Step 1 — Ingest (done once per document)

The document is wrapped in a structured prompt and sent to llama-server. The model runs a full prefill pass, loading every token into the KV cache. This takes time — proportional to document size — but only happens once. The KV cache is then saved to a .bin file on disk.

Step 2 — Query (instant, repeatable)

Before each query, the saved .bin file is restored into llama-server's KV cache in ~1 second. The user's question is appended and the model generates an answer with full document context active. No re-reading, no re-embedding.

Step 3 — Persistence

KV slots survive server restarts. Kill the server, restart it, and your next query restores the cache from disk just as fast. The 24-minute prefill for War and Peace only needs to happen once — ever.

Validated Results

All 11 GPU tests were run on an NVIDIA RTX A6000 (48 GB VRAM) with Qwen3.5-35B-A3B Q3_K_M at 1,048,576 token context.

Performance

Metric	Reference (L4 24 GB)	Our Results (A6000 48 GB)
Cold prefill — War & Peace (922K tokens)	~68 minutes	24.3 minutes
KV slot restore from disk	~2.3 seconds	~1.2 seconds
Decode speed at 1M context	~9 tok/s	~100 tok/s
KV cache size at 1M context	23 GB (f16)	4 GB (turbo3) — 5.75× compression
VRAM used (model + KV cache)	~96% of 24 GB	43% of 48 GB

Test Results Summary

Test	Result
TurboQuant build — `turbo2/3/4` cache types available	✅ Pass
KV compression — 4 GB at 1M context (vs 23 GB f16)	✅ Pass
YaRN context extension — 262K → 1,048,576 tokens	✅ Pass
Slot save/restore timing	✅ Pass
VRAM profile at each stage	✅ Pass
Flash Attention active	✅ Pass
End-to-end demo — Alice in Wonderland + Peter Pan, 6/6 queries answered	✅ Pass
Concurrent query handling — 5 simultaneous requests, no corruption	✅ Pass
War & Peace stress test — 922K tokens, correct answers on characters & battles	✅ Pass
API authentication — key auth working	✅ Pass
Slot persistence across server restarts	✅ Pass

Sample Queries — War and Peace (922K tokens, 1M context)

Question	Result
"Who is Pierre Bezukhov?"	Correct, detailed answer
"What happened at the Battle of Borodino?"	Correct, detailed answer
"How does the novel end?"	Partial — lost-in-middle at extreme context depth (see Limitations)

Demo Run — Alice in Wonderland + Peter Pan

Metric	Result
Documents ingested	2/2
Queries answered correctly	6/6
Average slot restore time	0.1 ms
Average decode speed	103 tok/s
Average latency	9.7 ms/token
OOM errors	None

Quick Start

Prerequisites: Linux, NVIDIA GPU (8 GB+ VRAM), Python 3.8+

# 1. Build llama.cpp + download model (one-time, ~35 min)
./setup.sh

# 2. Start the LLM server
./start_server.sh

# 3. Start the API server
python3 src/api_server.py

# 4. Ingest a document
python3 src/ingest.py my_document.txt --corpus-id my_doc

# 5. Query it
python3 src/query.py my_doc "What is this document about?"

That's it. After step 4, the KV cache is saved to kv_slots/my_doc.bin. Every future query restores it instantly, and it survives server restarts.

Model Selection

setup.sh auto-detects GPU VRAM and picks the right model:

VRAM	Model	Context	Notes
24 GB+	Qwen3.5-35B-A3B Q3_K_M	1M tokens	Full CAG — matches original demo
16–24 GB	Qwen2.5-7B Q4_K_M	32K	Good quality
8–16 GB	Qwen2.5-7B Q4_0	16K	Lighter
CPU / <8 GB	Qwen2.5-3B Q4_K_M	4K	Demo only — very slow

The 24 GB+ path uses unsloth/Qwen3.5-35B-A3B-GGUF on HuggingFace and requires a free HF account + access token.

REST API

Start the API server with python3 src/api_server.py --port 8000 (optionally set CAG_API_KEY env var to enable key auth).

Endpoint	Method	Description
`/ingest`	POST	Ingest documents — returns a `job_id` immediately, runs in background
`/status/{job_id}`	GET	Poll ingestion job status (`pending` → `processing` → `completed`)
`/query`	POST	Query a corpus — supports `stream: true/false`
`/corpora`	GET	List all ingested corpora
`/corpora/{id}`	DELETE	Delete a corpus and its KV slot file
`/health`	GET	Health check (no auth required)

Full API docs available at http://localhost:8000/docs when the server is running.

Directory Structure

.
├── setup.sh              # Builds llama.cpp, downloads model
├── start_server.sh       # Launches llama-server with CAG flags
├── requirements.txt
├── src/
│   ├── api_server.py     # FastAPI REST API
│   ├── ingest.py         # CLI: ingest a document
│   ├── query.py          # CLI: query a corpus
│   └── demo.py           # End-to-end demo
├── docker/
│   ├── Dockerfile
│   └── docker-compose.yml
├── docs/
│   ├── REPORT.md         # Full GPU validation report with all 11 test results
│   └── GPU_TESTING.md    # GPU test checklist
├── models/               # GGUF weights (not committed)
├── kv_slots/             # Saved KV cache .bin files (not committed)
└── logs/                 # Runtime logs (not committed)

Limitations

Limitation	Detail
Linux + NVIDIA only	TurboQuant CUDA kernels require Linux + NVIDIA. No Windows, macOS, or AMD.
Long first prefill	900K tokens takes ~24 min on an A6000. One-time cost — all future queries restore in ~1 sec.
VRAM gating	Below 24 GB you get a smaller model with shorter context (see table above).
One active corpus	Single llama.cpp slot (slot 0). Switching corpora costs ~1 sec restore.
Lost-in-the-middle	YaRN 4× extrapolation attends strongly to document start/end. Mid-document content can be missed on very long docs.
Build time	First `./setup.sh` takes ~35 min to compile CUDA kernels. One-time cost.
HF token for large model	Qwen3.5-35B is gated. Free HF account + token required. Smaller models are public.

Troubleshooting

Server won't start — check nvidia-smi to confirm GPU is visible, check ports 8080/8000 aren't in use, tail logs/server.log.

Ingestion fails — confirm llama-server is healthy at http://localhost:8080/health, verify the model file exists in models/.

Query returns empty — run python3 src/query.py --list to confirm the corpus exists, check kv_slots/ for the .bin file.

Acknowledgments

NEO — Autonomous AI agent that built, debugged, and validated this entire implementation
llama-cpp-turboquant — TurboQuant KV compression fork
Han Xiao / Jina AI — Original CAG concept and demo
Qwen / Alibaba — Qwen3.5-35B-A3B MoE model
FastAPI — REST API framework

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Cache-Augmented Generation (CAG) System

What is CAG?

How It Works

Validated Results

Performance

Test Results Summary

Sample Queries — War and Peace (922K tokens, 1M context)

Demo Run — Alice in Wonderland + Peter Pan

Quick Start

Model Selection

REST API

Directory Structure

Limitations

Troubleshooting

Acknowledgments

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
docker		docker
docs		docs
src		src
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt
setup.sh		setup.sh
start_server.sh		start_server.sh

Folders and files

Latest commit

History

Repository files navigation

Cache-Augmented Generation (CAG) System

What is CAG?

How It Works

Validated Results

Performance

Test Results Summary

Sample Queries — War and Peace (922K tokens, 1M context)

Demo Run — Alice in Wonderland + Peter Pan

Quick Start

Model Selection

REST API

Directory Structure

Limitations

Troubleshooting

Acknowledgments

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages