Conversation

Aristide021 commented Aug 9, 2025

This PR adds an Intel GNA (Gaussian & Neural Accelerator) backend for TVM Relax, designed as a foundation for Intel NPU support. While GNA hardware is present in many Intel Core processors, this backend is primarily a stepping stone toward Intel's current NPU path with OpenVINO runtime integration.

Features:

  • Pattern-based graph partitioning for GNA/NPU-compatible operations
  • JSON serialization approach enabling seamless NPU migration
  • Software emulation mode for testing without dedicated hardware
  • Support for dense/linear, 1D convolution, and ReLU operations
  • Automatic shape and dtype extraction for optimization
  • Comprehensive test coverage with CI integration

Supported operations:

  • Dense/Linear layers (relax.matmul)
  • 1D Convolution (relax.nn.conv1d)
  • ReLU activation (relax.nn.relu)

This implementation provides a clean, minimal pattern for backend development while preparing the foundation for Intel's recommended NPU acceleration path through TVM's compilation pipeline.
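For context, here is a minimal sketch of how a backend like this is typically driven through Relax's standard BYOC flow. The "gna" pattern prefix and the `partition_for_gna` helper name are illustrative assumptions about this PR, not confirmed names; the transform pipeline itself is TVM's usual Relax partitioning sequence.

```python
# Sketch only: the "gna" pattern prefix and helper name are assumptions about
# this PR; the transform pipeline is TVM's standard Relax BYOC flow.
import tvm
from tvm import relax
from tvm.relax.backend.pattern_registry import get_patterns_with_prefix


def partition_for_gna(mod: tvm.IRModule) -> tvm.IRModule:
    """Offload GNA-compatible subgraphs (matmul, conv1d, relu) to external codegen."""
    patterns = get_patterns_with_prefix("gna")  # assumed registry prefix
    seq = tvm.transform.Sequential(
        [
            # Group supported ops into composite functions tagged for the backend.
            relax.transform.FuseOpsByPattern(patterns, annotate_codegen=True),
            # Merge neighbouring composites into one region per external target.
            relax.transform.MergeCompositeFunctions(),
            # Invoke the external codegen, which serializes each region to JSON.
            relax.transform.RunCodegen(),
        ]
    )
    return seq(mod)
```

After partitioning, the remaining module is compiled with `relax.build` as usual, with the offloaded regions handled by the external (GNA or emulation) runtime.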

Aristide021 marked this pull request as draft on August 10, 2025 at 11:54.
Aristide021 force-pushed the feature/gna_codegen branch 5 times, most recently from 141157b to 77b312a, on August 11, 2025 at 19:52.
Aristide021 marked this pull request as ready for review on August 11, 2025 at 19:58.
mshr-h (Contributor) commented Aug 21, 2025

@Aristide021
Thanks for the PR! A couple of points and questions:

  1. Status of GNA vs NPU
    • The GNA plugin has been archived in OpenVINO, so how does this backend relate to Intel's current NPU path?
  2. CI & Software Emulation Mode
    • According to the OpenVINO docs, GNA plugin supports Software Emulation Mode (CPU fallback) when GNA HW isn't present. If we enable that in tests, we could run E2E coverage in our CI.

I also think this backend can serve as a very good example for codegen in Relax. It shows a clean and minimal pattern: partitioning with basic ops, handing off to JSON, and keeping the implementation relatively lightweight. Adding a short HOWTO or developer note ("Writing a minimal Relax backend") that references this code could be very helpful for the community.

cc @tqchen @Hzfengsy @cbalint13

Aristide021 (Author) replied:


Thanks for the review and the excellent points! You're correct about GNA being archived. I designed this backend as a stepping stone toward NPU support with OpenVINO runtime integration in mind. The JSON serialization approach should make the transition to Intel's current NPU path relatively straightforward.

For the CI integration with Software Emulation Mode, I think that's a great suggestion. I can add CPU fallback support to enable E2E testing without requiring actual GNA hardware.

I'd also be happy to add documentation, positioning this as a foundation for NPU backends, and include a developer guide if that would be helpful for the community.

I'll go ahead and update the PR description to clarify the NPU migration path. My next step will be to add CPU emulation support for testing. Please let me know if you have any other suggestions.
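To make that concrete, an E2E test along the following lines could exercise the emulation path in CI. The `partition_for_gna` helper is the sketch from the PR description above, and the "relax.ext.gna" registration name is an assumption about this PR, not a confirmed API.

```python
# Hypothetical test sketch: "relax.ext.gna" and partition_for_gna are assumed
# names for this PR's codegen registration and partitioning helper.
import numpy as np
import pytest
import tvm
from tvm import relax
from tvm.script import ir_module, relax as R

has_gna = tvm.get_global_func("relax.ext.gna", allow_missing=True) is not None


@pytest.mark.skipif(not has_gna, reason="GNA codegen not built")
def test_dense_relu_via_emulation():
    @ir_module
    class Model:
        @R.function
        def main(
            x: R.Tensor((1, 16), "float32"), w: R.Tensor((16, 16), "float32")
        ) -> R.Tensor((1, 16), "float32"):
            with R.dataflow():
                y = R.matmul(x, w)
                z = R.nn.relu(y)
                R.output(z)
            return z

    mod = partition_for_gna(Model)          # offload matmul + relu
    ex = relax.build(mod, target="llvm")    # emulation runtime needs no GNA hardware
    vm = relax.VirtualMachine(ex, tvm.cpu())
    x = np.random.randn(1, 16).astype("float32")
    w = np.random.randn(16, 16).astype("float32")
    out = vm["main"](tvm.nd.array(x), tvm.nd.array(w)).numpy()
    np.testing.assert_allclose(out, np.maximum(x @ w, 0.0), rtol=1e-5, atol=1e-5)
```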

Aristide021 force-pushed the feature/gna_codegen branch 6 times, most recently from 9b955d4 to 2c036cc, on August 23, 2025 at 19:42.
This commit introduces the Intel GNA (Gaussian & Neural Accelerator) backend
for TVM's Relax IR with a clean separation between hardware and emulation
runtimes to enable CI testing without GNA hardware.

Key components:
- GNA codegen for Relax IR (graph partitioning and code generation)
- Hardware runtime (gna_json_runtime.cc) for systems with GNA SDK
- CPU emulation runtime (gna_json_runtime_emulation.cc) for CI/testing
- Conditional CMake build based on GNA SDK availability
- Pattern registry for dense, conv1d, and relu operations
- Comprehensive test suite

Architecture decisions:
- Clean separation: Hardware and emulation in separate files (no mocking)
- CI-friendly: Emulation runtime has no GNA SDK dependencies
- Follows OpenVINO's Software Emulation Mode pattern
- Same API surface for both runtime implementations

The emulation runtime provides simplified reference implementations
sufficient for testing graph partitioning and codegen correctness.
For production CPU inference, use TVM's standard CPU backend.

This backend serves as a stepping stone toward Intel NPU support
and provides a minimal example for Relax backend development.
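For readers using this as a backend-development example, the pattern registry mentioned above might look roughly like the following. The "gna." names are assumed for illustration, and the sketch assumes register_patterns accepts (name, pattern) pairs as other Relax contrib backends do; only the dataflow-pattern and registry APIs are standard TVM.

```python
# Rough sketch of the dense / conv1d / relu pattern registry; the "gna." names
# and the exact registration form are assumptions, not taken from this commit.
from tvm.relax.backend.pattern_registry import register_patterns
from tvm.relax.dpl.pattern import is_op, wildcard


def _gna_patterns():
    return [
        # (pattern name, dataflow pattern matching the offloadable call)
        ("gna.dense", is_op("relax.matmul")(wildcard(), wildcard())),
        ("gna.conv1d", is_op("relax.nn.conv1d")(wildcard(), wildcard())),
        ("gna.relu", is_op("relax.nn.relu")(wildcard())),
    ]


register_patterns(_gna_patterns())
```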
tqchen (Member) commented Aug 24, 2025

Thanks for the contribution. Given that GNA is archived, it perhaps does not make sense to maintain it in the main tree, and adding CI would also add extra overhead here. However, I agree that having generic tutorials for BYOC NPU would be useful; if we can have something that supports a current NPU, that would be great.

Aristide021 (Author) replied:


I'd be happy to refactor this into a generic NPU tutorial targeting Intel's current NPU plugin. Should this live in the tutorials section or as a contrib module? I can adapt the JSON architecture for educational purposes.

tqchen (Member) commented Aug 24, 2025

I think starting as contrib is fine, and we can have a tutorial explanation pointing to the code.

Aristide021 added a commit to Aristide021/tvm that referenced this pull request Aug 28, 2025
  This commit introduces an educational NPU backend example that teaches
  key architectural concepts common across Neural Processing Units.

  Key features:
  - Multi-tier memory hierarchy (L0/L1/L2/L3) management with spilling
  - Tiling engine for large tensors that exceed on-chip SRAM
  - Quantization support (INT8/INT16) with dedicated patterns
  - Multiple execution engines (matrix, vector, conv, pooling, activation)
  - Operation fusion patterns to reduce memory traffic
  - Power mode management for efficiency tuning

  Educational value:
  - Demonstrates NPU memory management strategies
  - Shows how tiling enables large model execution
  - Explains quantization's role in NPU acceleration
  - Illustrates operation-to-engine mapping
  - Provides CPU emulation for testing without hardware

This vendor-neutral implementation serves as a template for developers creating custom NPU backends, teaching BYOC integration patterns while demonstrating real NPU architectural concepts.

Addresses feedback from apache#18201 requesting generic NPU BYOC tutorials.
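One of the bullet points above is a tiling engine for tensors that exceed on-chip SRAM. As a rough illustration of the kind of decision such an engine makes, here is a toy tile-size planner; the SRAM budget and one-byte-per-element assumption are invented for the example and not taken from the commit.

```python
# Toy tile-size planner: shrink a matmul tiling until the A, B and C tiles all
# fit in an assumed on-chip SRAM budget. Illustrative only.
def plan_matmul_tiles(m, n, k, sram_bytes=512 * 1024, dtype_bytes=1):
    tm, tn, tk = m, n, k

    def working_set(tm, tn, tk):
        # A tile (tm x tk) + B tile (tk x tn) + C tile (tm x tn)
        return (tm * tk + tk * tn + tm * tn) * dtype_bytes

    while working_set(tm, tn, tk) > sram_bytes:
        # Halve the largest dimension until the working set fits on chip.
        if tm >= tn and tm >= tk:
            tm = max(1, tm // 2)
        elif tn >= tk:
            tn = max(1, tn // 2)
        else:
            tk = max(1, tk // 2)
    return tm, tn, tk


# Example: a 4096x4096x4096 INT8 matmul is tiled down to fit a 512 KiB budget.
print(plan_matmul_tiles(4096, 4096, 4096))
```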
Aristide021 added a commit to Aristide021/tvm that referenced this pull request Aug 28, 2025
…cepts

This commit introduces a vendor-neutral NPU backend that demonstrates
architectural patterns common across Neural Processing Units.

The implementation covers key NPU concepts including multi-tier memory
hierarchy management, automatic tiling for large tensors, quantization
handling, and specialized execution engines. It shows how NPUs manage
memory across different tiers (L0/L1/L2/L3), tile operations to fit
in on-chip SRAM, and dispatch operations to dedicated compute units.

This serves as an educational template for developers creating NPU
backends, demonstrating BYOC integration while teaching NPU-specific
optimization strategies. Uses CPU emulation for testing without
requiring actual NPU hardware.

Addresses feedback from apache#18201 requesting generic NPU BYOC tutorials.