Inference engine for Qwen3-4B. WIP.
hayate requires Python 3.12+.

```sh
uv venv
source .venv/bin/activate
uv pip install -e .
```

Run the benchmark with a batch size:
```sh
python benchmark.py 10
```

Use `--verbose` if you want the detailed per-mode breakdown:
```sh
python benchmark.py 10 --verbose
```

Pass `--compile` to wrap the model in `torch.compile(..., dynamic=True)`. The
default compile mode avoids CUDA Graph private pools, which keeps memory use
lower on 24GB GPUs.

```sh
python benchmark.py 10 --compile
```
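For reference, this is roughly what the wrapping looks like; a minimal sketch, assuming the engine exposes a plain `torch.nn.Module` (the helper name is illustrative, not hayate's API):

```python
import torch

def maybe_compile(model: torch.nn.Module, enabled: bool) -> torch.nn.Module:
    if not enabled:
        return model
    # dynamic=True asks the compiler for shape-polymorphic kernels, so
    # changing batch sizes and sequence lengths don't trigger recompiles.
    # The default mode (unlike "reduce-overhead") does not capture CUDA
    # Graphs, so no private memory pools are allocated.
    return torch.compile(model, dynamic=True)
```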
PyTorch SDPA backend selection is exposed through `Engine` and the benchmark. The
default `auto` mode lets PyTorch pick the best available kernel; use `flash`,
`efficient`, or `math` to force a backend for GPU experiments:

```sh
python benchmark.py 10 --sdpa-backend flash
```
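Forcing a backend corresponds to PyTorch's SDPA kernel-selection context manager; a standalone sketch of that mechanism (not hayate's code; shapes and dtype are illustrative):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# (batch, heads, seq_len, head_dim) in half precision on the GPU, which
# the flash kernel requires.
q = k = v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)

# FLASH_ATTENTION / EFFICIENT_ATTENTION / MATH correspond to the flash,
# efficient, and math choices; outside the context PyTorch chooses
# automatically, which is the `auto` default.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```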
Prefix caching is available as an opt-in engine feature for workloads where prompts share token prefixes:

```python
from hayate.engine.engine import Engine

engine = Engine("Qwen/Qwen3-4B", enable_prefix_cache=True)
```

The cache stores reusable prompt KV prefixes with a token budget. The default budget
is 4096 cached prefix tokens; pass `prefix_cache_max_tokens=...` to tune VRAM usage.
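hayate's actual cache layout isn't shown here; as a sketch of the general idea, a token-budget prefix cache can be modeled as an LRU map keyed by prefix token IDs, where eviction keeps the total number of cached tokens under budget (all names below are illustrative, not the engine's API):

```python
from collections import OrderedDict

class PrefixCache:
    """Illustrative LRU cache keyed by prompt-prefix token IDs.

    `kv` stands in for whatever per-layer KV tensors the engine stores.
    """

    def __init__(self, max_tokens: int = 4096):
        self.max_tokens = max_tokens
        self.total_tokens = 0
        self._entries: OrderedDict[tuple[int, ...], object] = OrderedDict()

    def get(self, prefix: tuple[int, ...]):
        kv = self._entries.get(prefix)
        if kv is not None:
            self._entries.move_to_end(prefix)  # mark as recently used
        return kv

    def put(self, prefix: tuple[int, ...], kv) -> None:
        if len(prefix) > self.max_tokens:
            return  # would never fit; don't evict everything for it
        if prefix in self._entries:
            self._entries.move_to_end(prefix)
            return
        self._entries[prefix] = kv
        self.total_tokens += len(prefix)
        # Evict least-recently-used prefixes until back under budget.
        while self.total_tokens > self.max_tokens:
            evicted, _ = self._entries.popitem(last=False)
            self.total_tokens -= len(evicted)
```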
To benchmark prefix caching, run the benchmark with a generated shared-prefix workload:

```sh
python benchmark.py 10 --prefix-cache
```

Tune the synthetic workload with `--prefix-shared-tokens`,
`--prefix-suffix-tokens`, and `--prefix-cache-max-tokens`.
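For intuition, a shared-prefix workload is just a batch of prompts that begin with the same token span, so only the short per-request suffix misses the cache; a hand-rolled equivalent (token counts and wording are illustrative):

```python
# Every request repeats the same long instruction block (the shared
# prefix); only the per-request suffix differs.
shared_prefix = "You are a helpful assistant. " * 64
requests = [shared_prefix + f"Summarize document #{i}." for i in range(10)]
```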
Qwen/Qwen3-4B on an RTX 3090 (24GB), 10 requests, 5 reps.

Without `--compile`:

```
mode                 mean       p50        p95        total tok/s
-------------------- ---------- ---------- ---------- ------------
single request       9.872s     9.801s     10.254s    48.73
submit all upfront   12.925s    12.875s    13.072s    372.15
staggered arrivals   13.105s    13.080s    13.508s    361.49
```

With `--compile`:

```
mode                 mean       p50        p95        total tok/s
-------------------- ---------- ---------- ---------- ------------
single request       4.382s     4.373s     4.571s     109.78
submit all upfront   9.459s     9.396s     9.727s     508.51
staggered arrivals   9.704s     9.627s     10.230s    485.16
```
Features:

- model architecture
- kv caching
- greedy decoding (see the sketch after this list)
- continuous batching
- torch.compile
- prefix caching
- pytorch fused sdpa
- paged attention
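As a reference point for the kv caching and greedy decoding items above, here is a minimal KV-cached greedy decode loop written against Hugging Face transformers rather than hayate's internals; a sketch of the technique, not the engine's code path:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B", torch_dtype=torch.bfloat16
).cuda().eval()

ids = tok("The capital of France is", return_tensors="pt").input_ids.cuda()
past = None
with torch.no_grad():
    for _ in range(16):
        # With KV caching, only the newest token is fed after the first
        # step; keys/values for earlier positions come from the cache.
        inputs = ids if past is None else ids[:, -1:]
        out = model(input_ids=inputs, past_key_values=past, use_cache=True)
        past = out.past_key_values
        # Greedy decoding: always pick the highest-probability next token.
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)

print(tok.decode(ids[0]))
```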