openkernel

Self-recursive GPU kernel optimization engine.

Give it a PyTorch operation and a target GPU -- it produces an optimized CUDA or Triton kernel through structured search with hardware profiler feedback.

What It Does

3-level hybrid search: Strategy evolution selects what to optimize, a world model plans the approach, and a refinement loop implements and validates with profiler feedback. Strategies that work are extracted into a skill library that compounds across problems.
Bring your own model (BYOM): Any LLM with an OpenAI-compatible API works. MiniMax M2.5 is the default; Claude, GLM, Kimi, and Qwen are also supported. No lock-in.
Cloud-native on Modal: Compile, benchmark, and profile kernels on cloud GPUs (H100, A100, L40S). No local NVIDIA hardware required.

Quick Start

pip install openkernel kernel-code
export MINIMAX_API_KEY=your-key
openkernel optimize --reference my_kernel.py --backend triton

Other commands and options:

openkernel optimize --reference my_kernel.py --backend cuda --model claude-sonnet-4-20250514
openkernel evaluate --kernel optimized.py --reference my_kernel.py --eval-mode thorough
openkernel info

kernel code

kernel code is the terminal-native developer tool built on top of openkernel. It wraps the optimization engine with a Textual TUI purpose-built for kernel engineers:

+---------------------------------------------------------+
|  kernel code v0.1          [H100]  [Triton]  [L1#23]   |
+----------------------+----------------------------------+
|                      |  Optimization Trajectory          |
|   Chat / Agent       |  ████████████▓▓░░░  1.8x        |
|   Panel              +----------------------------------+
|                      |  Profiling Summary                |
|   > Analyzing        |  Bottleneck: memory_bound         |
|     reference...     |  Bandwidth:  72% of peak          |
|                      |  L2 hit:     45% (poor)           |
|                      +----------------------------------+
|   Critic: "L2 hit    |  Experiment Log                   |
|   rate improved to   |  #1  1.0x  keep   baseline        |
|   78%. Next: try     |  #3  1.3x  keep   shared mem      |
|   register blocking" |  #5  1.8x  keep   vectorized      |
+----------------------+----------------------------------+
|  [d]ashboard  [k]ernel diff  [r]oofline  [q]uit        |
+---------------------------------------------------------+

Press d to open a web dashboard with roofline plots, 3D optimization landscapes, strategy trees, and side-by-side kernel diffs with performance annotations.
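The roofline plot and the memory_bound label in the profiling summary both follow from the standard roofline model: attainable throughput is the lesser of peak compute and arithmetic intensity times memory bandwidth. The sketch below shows that classification. The function name and the hardware numbers are illustrative assumptions, not values or APIs taken from kernel code.

```python
def roofline(flops: float, bytes_moved: float,
             peak_flops: float, peak_bandwidth: float) -> tuple[str, float]:
    """Classify a kernel with the standard roofline model.

    Returns (label, attainable FLOP/s).
    """
    intensity = flops / bytes_moved        # FLOPs per byte of memory traffic
    ridge = peak_flops / peak_bandwidth    # intensity where the two roofs meet
    attainable = min(peak_flops, intensity * peak_bandwidth)
    label = "memory_bound" if intensity < ridge else "compute_bound"
    return label, attainable


# Hypothetical device: 1000 TFLOP/s peak compute, 3 TB/s memory bandwidth.
PEAK_FLOPS = 1000e12
PEAK_BW = 3e12

# Elementwise ReLU over 2**20 fp32 values, counted as 1 FLOP per element
# and 8 bytes of traffic per element (one 4-byte read, one 4-byte write).
n = 2**20
label, attainable = roofline(flops=n, bytes_moved=8 * n,
                             peak_flops=PEAK_FLOPS, peak_bandwidth=PEAK_BW)
print(label, attainable)
```

An elementwise op like this sits far to the left of the ridge point, which is why bandwidth utilization and cache hit rate are the counters worth watching for it.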

kernel-code optimize --reference my_kernel.py --backend triton --no-mock
kernel-code dashboard

Supported Models

Model	Provider	Price, input / output (USD per 1M tokens)	Recommended for
MiniMax M2.5 (default)	MiniMax	$0.30 / $1.20	General use, sweeps
GLM-5.1	Zhipu AI	$1.40 / $4.40	Deep optimization, hard problems
Kimi K2.5	Moonshot AI	$0.50 / $2.80	Speed, parallel exploration
Qwen3.5 397B	Alibaba	$0.20 / $0.80	Budget, local inference
Claude Sonnet 4	Anthropic	$3.00 / $15.00	Structured output, frontier fallback

All models are accessed via OpenAI-compatible APIs through litellm. Set the appropriate environment variable (MINIMAX_API_KEY, ANTHROPIC_API_KEY, OPENAI_API_KEY, etc.) and pass --model <model-id>.
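To compare models on price, the per-token rates in the table can be turned into a per-run estimate. The sketch below is plain arithmetic over those rates. The token counts are hypothetical, and the dictionary keys are display names from the table, not the model IDs that --model expects.

```python
# USD per 1M tokens as (input, output), copied from the table above.
PRICES = {
    "MiniMax M2.5": (0.30, 1.20),
    "GLM-5.1": (1.40, 4.40),
    "Kimi K2.5": (0.50, 2.80),
    "Qwen3.5 397B": (0.20, 0.80),
    "Claude Sonnet 4": (3.00, 15.00),
}


def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one optimization run for the given token usage."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical run: 2M input tokens and 500k output tokens.
for model in PRICES:
    print(f"{model:16s} ${run_cost(model, 2_000_000, 500_000):.2f}")
```

Under those assumed token counts the default model comes to $1.20 per run and Claude Sonnet 4 to $13.50. Actual usage depends on the problem and the number of iterations.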

Supported Backends

openkernel supports Triton and CUDA. The kernel engineer chooses the backend; the engine applies backend-specific optimization strategies:

Triton: @triton.autotune parametric search, Proton profiling, shared memory tiling
CUDA: Warp-level primitives, Tensor Core MMA, CUTLASS CuTe templates, inline PTX

Both backends share strategies for fusion discovery, algorithmic improvements, and memory access pattern optimization.

Architecture

openkernel uses a 3-level hybrid search:

Outer loop -- Strategy evolution: Maintains a Pareto frontier of optimization strategies. Strategies that produce results survive; dominated strategies are pruned. Successful strategies are persisted to a skill library for future problems.
Middle loop -- World model search: An LLM-managed tree of optimization intents. Decouples what to optimize from how to implement it. If a good strategy produces buggy code, the strategy survives for retry.
Inner loop -- Refinement: Generator produces kernel code, evaluates on Modal (compile + benchmark + profile), Critic diagnoses bottlenecks from profiler data, Generator produces an improved version. Repeat.
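The pruning step in the outer loop can be illustrated with a small Pareto filter. This is a sketch under stated assumptions: the README does not say which objectives the frontier uses, so speedup (higher is better) and cost in USD (lower is better) are assumed here, and the Strategy class and strategy names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    speedup: float   # assumed objective, higher is better
    cost_usd: float  # assumed objective, lower is better


def dominates(a: Strategy, b: Strategy) -> bool:
    """a dominates b if it is no worse on both objectives and better on one."""
    no_worse = a.speedup >= b.speedup and a.cost_usd <= b.cost_usd
    strictly_better = a.speedup > b.speedup or a.cost_usd < b.cost_usd
    return no_worse and strictly_better


def pareto_frontier(strategies: list[Strategy]) -> list[Strategy]:
    """Keep every strategy that no other strategy dominates."""
    return [s for s in strategies
            if not any(dominates(other, s) for other in strategies)]


candidates = [
    Strategy("shared_mem_tiling", 1.3, 0.40),
    Strategy("vectorized_loads", 1.8, 0.90),
    Strategy("naive_unroll", 1.1, 0.60),       # dominated by shared_mem_tiling
    Strategy("register_blocking", 1.8, 1.50),  # dominated by vectorized_loads
]
frontier = pareto_frontier(candidates)
print([s.name for s in frontier])
```

The quadratic scan is fine for a handful of strategies. A larger population would call for a sorted sweep instead.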

Two LLM roles -- Generator and Critic -- produce structured, inspectable reasoning. The Critic reads hardware profiler output and provides specific diagnoses ("L2 hit rate 45%, restructure to coalesced access with BLOCK_K=64 tiles").
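The Generator/Critic cycle described above can be sketched as a plain loop over three callables. Everything here is hypothetical scaffolding for illustration: the EvalResult fields, the refine signature, and the stand-in functions are assumptions, and a real run would call an LLM and evaluate on Modal instead.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EvalResult:
    correct: bool
    speedup: float
    profile: dict  # profiler counters, e.g. {"l2_hit_rate": 0.45}


def refine(generate: Callable[[Optional[str]], str],
           evaluate: Callable[[str], EvalResult],
           critique: Callable[[EvalResult], str],
           max_iters: int = 5) -> tuple[Optional[str], float]:
    """Generate, evaluate, critique, repeat; keep the fastest correct kernel."""
    best_kernel, best_speedup = None, 0.0
    feedback = None
    for _ in range(max_iters):
        kernel = generate(feedback)    # Generator: propose kernel source
        result = evaluate(kernel)      # compile + benchmark + profile
        if result.correct and result.speedup > best_speedup:
            best_kernel, best_speedup = kernel, result.speedup
        feedback = critique(result)    # Critic: diagnosis for the next round
    return best_kernel, best_speedup


# Stand-in components so the loop runs without an LLM or a GPU.
def fake_generate(feedback: Optional[str]) -> str:
    return "kernel_v1" if feedback is None else f"kernel after: {feedback}"


_speedups = iter([1.0, 1.3, 1.8])


def fake_evaluate(kernel: str) -> EvalResult:
    return EvalResult(correct=True, speedup=next(_speedups),
                      profile={"l2_hit_rate": 0.45})


def fake_critique(result: EvalResult) -> str:
    return f"L2 hit rate {result.profile['l2_hit_rate']:.0%}"


best_kernel, best_speedup = refine(fake_generate, fake_evaluate,
                                   fake_critique, max_iters=3)
print(best_speedup, best_kernel)
```

Passing the three roles in as callables keeps the loop independent of any particular model or evaluation backend, which matches the bring-your-own-model idea.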

See docs/openkernel-design.md for the full system design.

KernelBench

openkernel is designed to hill-climb KernelBench (Stanford, ICML 2025) -- 250 problems across 4 difficulty levels. We benchmark against KernelBench and will publish comparison results against other systems.

Metrics tracked: fast_p at p={1.0, 1.5, 2.0}, geomean speedup, correctness rate, cost per kernel, iterations to convergence.
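For readers new to these metrics, fast_p is the fraction of all problems whose generated kernel is both correct and more than p times faster than the reference. The sketch below computes it alongside geomean speedup and correctness rate on made-up data. Restricting the geomean to correct kernels is an assumption, since the README does not say how incorrect kernels are counted.

```python
import math


def fast_p(results: list[tuple[bool, float]], p: float) -> float:
    """Fraction of all problems with a correct kernel and speedup > p."""
    hits = sum(1 for correct, speedup in results if correct and speedup > p)
    return hits / len(results)


def geomean_speedup(results: list[tuple[bool, float]]) -> float:
    """Geometric mean speedup over correct kernels only (an assumption)."""
    speedups = [speedup for correct, speedup in results if correct]
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


def correctness_rate(results: list[tuple[bool, float]]) -> float:
    return sum(1 for correct, _ in results if correct) / len(results)


# Made-up (correct, speedup) pairs for four problems.
results = [(True, 2.5), (True, 1.2), (False, 3.0), (True, 0.8)]

for p in (1.0, 1.5, 2.0):
    print(f"fast_{p}: {fast_p(results, p):.2f}")
print(f"geomean speedup: {geomean_speedup(results):.3f}")
print(f"correctness rate: {correctness_rate(results):.2f}")
```

Note that the incorrect kernel with a 3.0x speedup counts toward none of the fast_p thresholds. A fast but wrong kernel earns nothing.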

Results will be published when available.

Contributing

openkernel is open-core under the Apache 2.0 license. Contributions are welcome.

kernel code is the commercial product built on top of openkernel.

Docs

File	Description
docs/pitch.md	Project pitch and market thesis
docs/openkernel-design.md	Full system architecture (5 layers, search engine, memory system)
docs/kernel-code-design.md	kernel code product design (TUI, dashboards, trace capture)
docs/visualization-design.md	Dashboard and visualization specifications
docs/research-synthesis.md	Research survey of kernel optimization systems
docs/five-layer-cake.md	Detailed breakdown of the 5-layer architecture
docs/data-and-integrations.md	Data pipeline and external integrations
docs/codebase-structure.md	Repository layout and module organization
docs/build-plan.md	Build plan and implementation phases
docs/gtm.md	Go-to-market strategy
