[Feature Request] Add ARM64 NEON and BF16 acceleration support #7

@brokestar233

Description

Hi firelzrd,

After seeing the great results from the ARM64 NEON support in the other project (thanks for the integration!), I’d like to propose a similar SIMD backend for Nap.

Currently, the Nap governor is primarily tailored for x86_64 due to its reliance on SSE2/AVX. To extend its high-performance capabilities to modern ARM64 platforms (such as the Snapdragon 8 Elite / Oryon cores), I propose implementing a dual-path ARM64 SIMD backend.

Proposed Implementation: BF16 with NEON Fallback

To ensure both maximum performance on cutting-edge hardware and broad compatibility with older ARM64 devices, I suggest a tiered approach:

  1. Primary Path: ARMv8.6-A / v9 BF16
    • Leverage native bfmmla (Matrix Multiply-Accumulate) or bfdot instructions.
    • Advantage: BF16 maintains FP32’s dynamic range while doubling throughput and halving cache footprint. This is the "gold standard" for the 16-16-1 MLP used in Nap.
  2. Fallback Path: Standard NEON (ASIMD)
    • If the hardware does not support BF16 extensions, the implementation should automatically fall back to standard NEON FP32 vectorization.
    • Advantage: Ensures functional parity with the SSE2 backend on all ARM64 devices (ARMv8.0-A and later), since ASIMD/NEON is mandatory in AArch64 and therefore always available.

Why BF16 is the Optimal Choice

Given the MLP architecture, BF16 presents several key advantages over FP32 or INT8:

  • Hardware Acceleration: On ARMv9 (like the Snapdragon 8 Elite), BF16 instructions typically deliver 2x to 4x the throughput of standard NEON operations.
  • Zero Quantization Pain: Unlike INT8, BF16 doesn't require complex scale/offset management, making it ideal for the log2-space regression used in sleep duration prediction.
  • Energy Efficiency: Reducing register pressure and memory bandwidth usage directly translates to lower power consumption during governor execution.

Technical Context

Modern ARM SoCs offer specific hardware features we can target:

  • BF16: Native hardware acceleration via FEAT_BF16.
  • NEON: 128-bit vectorization, serving as a direct functional equivalent to SSE2.
  • I8MM / DotProd: (Optional future path) Dedicated hardware for integer math if quantization is ever explored.

Impact

Enabling this tiered ARM64 acceleration would significantly enhance energy efficiency on mobile devices. Since CPUIdle governor precision is a primary driver of battery longevity, reducing the overhead of the prediction model is a high-value optimization.

As before, I have the hardware (Snapdragon 8 Elite) ready and would be more than happy to assist with benchmarking and debugging this "BF16-preferred" implementation if you decide to move forward.

Best regards!
