<!--- Licensed to the Apache Software Foundation (ASF) under one -->
<!--- or more contributor license agreements. See the NOTICE file -->
<!--- distributed with this work for additional information -->
<!--- regarding copyright ownership. The ASF licenses this file -->
<!--- to you under the Apache License, Version 2.0 (the -->
<!--- "License"); you may not use this file except in compliance -->
<!--- with the License. You may obtain a copy of the License at -->

<!--- http://www.apache.org/licenses/LICENSE-2.0 -->

<!--- Unless required by applicable law or agreed to in writing, -->
<!--- software distributed under the License is distributed on an -->
<!--- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -->
<!--- KIND, either express or implied. See the License for the -->
<!--- specific language governing permissions and limitations -->
<!--- under the License. -->

# Example NPU Backend

A hands-on example showing how to build a Neural Processing Unit (NPU) backend for TVM's Relax framework using Bring Your Own Codegen (BYOC).

## What This Is

This is an educational template that demonstrates real NPU concepts without requiring actual NPU hardware. It shows developers how to:

- **Pattern-based partitioning**: Identify and group operations that should run on specialized hardware
- **Memory hierarchy management**: Handle different memory tiers (L0/L1/L2/L3) common in NPUs
- **Automatic tiling**: Break large tensors into smaller chunks that fit in on-chip memory
- **Quantization support**: Handle different data precisions efficiently
- **BYOC integration**: Connect custom backends to TVM's compilation pipeline
- **Operator availability checking**: Gracefully handle operators that may not be available in all TVM builds

## Quick Start

```python
import tvm
from tvm import relax
from tvm.relax.backend.pattern_registry import get_patterns_with_prefix
from tvm.relax.transform import FuseOpsByPattern, RunCodegen

# Import to register patterns
import tvm.relax.backend.contrib.example_npu

# Get available patterns
patterns = get_patterns_with_prefix("example_npu")
print(f"Available patterns: {[p.name for p in patterns]}")

# Your model gets automatically partitioned
# Operations matching patterns get fused into "Composite" functions
# Those get lowered to the example NPU backend
```

The snippet above shows how to discover registered patterns. A minimal runnable example that demonstrates the BYOC flow (partition -> merge -> codegen) using the example test module looks like this:

```python
# This imports the example module used in the tests. Importing the test
# module path directly works when running from the repo root (pytest does
# this automatically).
from tests.python.contrib.test_example_npu import MatmulReLU
from tvm.relax.backend.pattern_registry import get_patterns_with_prefix
from tvm.relax.transform import FuseOpsByPattern, MergeCompositeFunctions, RunCodegen
import tvm.relax.backend.contrib.example_npu  # registers patterns

mod = MatmulReLU
patterns = get_patterns_with_prefix("example_npu")

# Apply partitioning and codegen annotation
mod = FuseOpsByPattern(patterns, bind_constants=False, annotate_codegen=True)(mod)
mod = MergeCompositeFunctions()(mod)
mod = RunCodegen()(mod)

print(mod)
```

A compact visualization of the BYOC flow:

```
Model source (Relax)
        │
        ▼
Pattern-based partition (FuseOpsByPattern)
        │
        ▼
Composite functions (MergeCompositeFunctions)
        │
        ▼
Lower/Codegen for example NPU (RunCodegen / relax.ext.example_npu)
        │
        ▼
Runtime dispatch to NPU runtime (runtime.ExampleNPUJSONRuntimeCreate)
```
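
To complete the flow end to end, the partitioned module can be compiled and executed with the Relax virtual machine. A minimal sketch, assuming the module produced by the snippet above is in `mod`; the (32, 32) float32 inputs are illustrative, so use the shapes actually declared by `MatmulReLU`'s `main`:

```python
import numpy as np
import tvm
from tvm import relax

# The example NPU runtime is a CPU-backed emulation, so an "llvm" host
# target is sufficient; no real NPU hardware or target is needed.
ex = relax.build(mod, target=tvm.target.Target("llvm"))
vm = relax.VirtualMachine(ex, tvm.cpu())

# Illustrative inputs; shapes/dtypes must match the module's "main" signature.
x = tvm.nd.array(np.random.rand(32, 32).astype("float32"))
w = tvm.nd.array(np.random.rand(32, 32).astype("float32"))
out = vm["main"](x, w)
print(out.numpy())
```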

## Supported Operations

The backend recognizes these common neural network patterns:

### Core Operations (always available)
- `example_npu.dense` - Dense/fully connected layers
- `example_npu.matmul` - Matrix multiplication operations
- `example_npu.conv1d` - 1D convolution for sequence processing
- `example_npu.conv2d` - 2D convolution for image processing
- `example_npu.depthwise_conv2d` - Depthwise separable convolutions
- `example_npu.max_pool2d` - 2D max pooling
- `example_npu.avg_pool2d` - 2D average pooling
- `example_npu.batch_norm` - Batch normalization

### Activation Functions (availability depends on TVM build)
- `example_npu.relu` - ReLU activation
- `example_npu.relu6` - ReLU6 activation (if available)
- `example_npu.sigmoid` - Sigmoid activation (if available)
- `example_npu.tanh` - Hyperbolic tangent (if available)
- `example_npu.gelu` - Gaussian Error Linear Unit (if available)

### Element-wise Operations
- `example_npu.add` - Element-wise addition
- `example_npu.multiply` - Element-wise multiplication
- `example_npu.subtract` - Element-wise subtraction
- `example_npu.divide` - Element-wise division

### Quantization Support
- `example_npu.quantize` - Quantization operations (if available)
- `example_npu.dequantize` - Dequantization operations (if available)

### Fused Patterns
- `example_npu.conv2d_relu_fused` - Optimized Conv2D+ReLU fusion

**Note**: Some operators may not be available in all TVM builds. The backend automatically skips registration for unavailable operators.
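
The availability check can be as simple as probing TVM's operator registry before registering a pattern. A minimal sketch; the helper `op_is_available` and the optional operator names probed below are illustrative, not the actual code in `patterns.py`:

```python
import tvm
from tvm.ir import Op

def op_is_available(op_name: str) -> bool:
    """Return True if the operator is registered in the running TVM build."""
    # Op.list_op_names() enumerates every operator the build knows about.
    return op_name in Op.list_op_names()

# Only register patterns whose underlying operator actually exists.
optional_ops = ["relax.nn.gelu", "relax.nn.relu6", "relax.quantize", "relax.dequantize"]
print([name for name in optional_ops if op_is_available(name)])
```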

## Files

### Backend Implementation
- `patterns.py` - Defines which operations get fused together, along with pattern metadata and architectural annotations used by the partitioner. Includes operator availability checking and NPU-specific constraints (a sketch of a typical registration follows this list).
- `__init__.py` - Registers the backend and its BYOC entry points with TVM so the compiler can discover and use the example NPU.
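
Roughly, a registration in `patterns.py` can be sketched with the standard Relax dataflow-pattern helpers. This assumes `register_patterns` accepts `(name, pattern)` entries, as other Relax BYOC backends do; the exact entries and constraints live in `patterns.py` itself:

```python
from tvm.relax.backend.pattern_registry import register_patterns
from tvm.relax.dpl.pattern import is_op, wildcard

def _relu_pattern():
    # Match a single relax.nn.relu call on any producer.
    return is_op("relax.nn.relu")(wildcard())

register_patterns(
    [
        ("example_npu.relu", _relu_pattern()),
    ]
)
```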

### Runtime Implementation
- `src/runtime/contrib/example_npu/example_npu_runtime.cc` - C++ runtime implementation that handles JSON-based graph execution for the NPU backend.

### Tests and Examples
- `tests/python/contrib/test_example_npu.py` - Comprehensive test suite containing example IRModules (e.g. `MatmulReLU`, `Conv2dReLU`) and demonstrating the complete BYOC flow from pattern registration to runtime execution.

## Status / Build

- The example backend is an educational, CPU-backed emulation. It does not require real NPU hardware.
- The backend includes robust operator availability checking: patterns are only registered for operators that exist in the current TVM build.
- Tests and runtime features are skipped automatically when the example codegen/runtime are not built into TVM. The test checks for the presence of these global functions before running:

```python
import tvm

has_codegen = tvm.get_global_func("relax.ext.example_npu", True)
has_runtime = tvm.get_global_func("runtime.ExampleNPUJSONRuntimeCreate", True)
has_example_npu = has_codegen and has_runtime
```

If `has_example_npu` evaluates to false (either function is missing), the tests are skipped. This ensures compatibility across different TVM build configurations.
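
In the test file this check typically feeds a pytest skip marker so the tests are skipped on builds without the backend. A minimal sketch; the marker name and test body are illustrative:

```python
import pytest
import tvm

has_example_npu = tvm.get_global_func("relax.ext.example_npu", True) and tvm.get_global_func(
    "runtime.ExampleNPUJSONRuntimeCreate", True
)

# Applied per test (or via pytestmark for the whole module).
example_npu_enabled = pytest.mark.skipif(
    not has_example_npu,
    reason="example NPU codegen/runtime not enabled in this TVM build",
)

@example_npu_enabled
def test_pattern_registration():
    ...
```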

## Testing

Run the tests to see it in action:

```bash
pytest tests/python/contrib/test_example_npu.py -v
```

Tests are skipped if the backend isn't built; see the test file for the exact runtime/codegen checks. Running `pytest` from the repository root ensures imports like `tests.python.contrib.test_example_npu` resolve correctly.

The test suite includes:
- Pattern registration verification (checks that core patterns are available)
- Graph partitioning validation (ensures operations get grouped correctly)
- End-to-end execution testing (verifies runtime integration)
- Operator availability testing (graceful handling of missing operators)

### Example output

When you run the quick-start snippet or the test, you should see output similar to the following (truncated for brevity):

```
Available patterns: ['example_npu.dense', 'example_npu.matmul', 'example_npu.conv1d', 'example_npu.conv2d', 'example_npu.depthwise_conv2d', 'example_npu.max_pool2d', 'example_npu.avg_pool2d', 'example_npu.batch_norm', 'example_npu.relu', 'example_npu.add', 'example_npu.multiply', 'example_npu.conv2d_relu_fused']

Relax IRModule
def @main(...) -> ...
  %0 = call_extern("relax.ext.example_npu", ...)

# composite functions
def @composite_0(...) /* Composite */ = ...
```

This shows the registered patterns and that matched subgraphs were turned into composite functions and lowered to the example NPU codegen/runtime.

## Key Features Demonstrated

### NPU Architectural Concepts
- **Multi-tier memory hierarchy**: SRAM (256KB), CMX (512KB), and DRAM management
- **Tiling constraints**: 32x32 tiles with 16-element vectors for optimal NPU utilization (see the worked example after this list)
- **Quantization support**: INT8/INT16 for inference acceleration, mixed precision handling
- **Specialized execution units**: Matrix engines (16x16), vector units (64-wide), pooling units
- **Power management**: Support for different power modes (high_performance, balanced, low_power)
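
To make the tiling constraint concrete: a 128x256 float32 tensor split into 32x32 tiles needs 4 x 8 = 32 tiles of 4 KB each, which is the kind of arithmetic used to fit work into the on-chip memories. A small sketch of that calculation; the helper is illustrative and not part of the backend API:

```python
import math

TILE = 32        # 32x32 tiles, per the constraint above
ELEM_BYTES = 4   # float32

def tile_grid(rows: int, cols: int, tile: int = TILE):
    """Tiles needed to cover a rows x cols matrix (edge tiles are padded)."""
    return math.ceil(rows / tile), math.ceil(cols / tile)

tr, tc = tile_grid(128, 256)
tile_bytes = TILE * TILE * ELEM_BYTES
print(f"{tr} x {tc} = {tr * tc} tiles, {tile_bytes} bytes each")  # 4 x 8 = 32 tiles, 4096 bytes each
```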

### Pattern Matching Features
- **Operator availability detection**: Gracefully handles missing operators in different TVM builds
- **Memory constraint checking**: Validates tensor sizes against NPU memory limits (a sketch follows this list)
- **Fusion opportunities**: Identifies conv+activation and other beneficial fusions
- **Layout preferences**: Prefers NHWC (channel-last) layouts, which NPUs commonly favor
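
A memory constraint check can be sketched as a simple size test against the on-chip budget. The 256 KB figure mirrors the SRAM size above; the helper and the way shapes are obtained are illustrative assumptions, not the backend's actual code:

```python
SRAM_BYTES = 256 * 1024  # on-chip budget from the memory hierarchy above

def fits_in_sram(shape, dtype_bytes: int = 4) -> bool:
    """Conservatively check that a single tensor fits the on-chip budget."""
    size = dtype_bytes
    for dim in shape:
        size *= int(dim)
    return size <= SRAM_BYTES

print(fits_in_sram((256, 256)))  # True: 256*256*4 bytes = 256 KB, exactly fits
print(fits_in_sram((512, 512)))  # False: 1 MB, must be tiled or left on the host
```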

### Error Handling
- **Robust exception handling**: Uses the specific `TVMError` instead of generic exceptions (see the sketch after this list)
- **Graceful degradation**: Continues operation when optional operators are unavailable
- **Comprehensive testing**: Validates both successful cases and error conditions
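
For instance, probing an operator that may not be registered can catch only TVM's own error type rather than using a bare `except`. A minimal sketch, assuming mainline TVM's `tvm.TVMError`; the helper name and the missing-op name are hypothetical:

```python
import tvm
from tvm.ir import Op

def try_get_op(op_name: str):
    """Return the registered Op, or None when this TVM build lacks it."""
    try:
        return Op.get(op_name)
    except tvm.TVMError:
        # Only TVM's error type is swallowed; anything else surfaces as a real bug.
        return None

print(try_get_op("relax.nn.relu"))        # an Op handle
print(try_get_op("relax.nn.no_such_op"))  # None (hypothetical name)
```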

## Context

NPUs are specialized for neural network workloads and can be 10-100x more efficient than general-purpose CPUs/GPUs for inference. This example shows the architectural patterns you'll encounter when building real NPU backends, making it easier to adapt to specific hardware like:

- Mobile NPUs (AMD XDNA, Google Edge TPU, Samsung NPU)
- Dedicated AI chips (Intel Movidius, Qualcomm Hexagon, MediaTek APU)
- Cloud AI accelerators (AWS Inferentia, Google TPU, Microsoft Azure Maia)
- Custom ASIC designs and embedded AI processors

## Learn More

This backend serves as both a working example and educational resource for understanding NPU integration patterns. The implementation demonstrates vendor-neutral concepts that apply across different NPU architectures, making it a valuable starting point for real NPU backend development.