[Feature Request] Add ARM64 NEON and BF16 acceleration support #7

@brokestar233

Description

Hi firelzrd,

After seeing the great results from the ARM64 NEON support in the other project (thanks for the integration!), I’d like to propose a similar SIMD backend for Nap.

Currently, the Nap governor is primarily tailored for x86_64 due to its reliance on SSE2/AVX. To extend its high-performance capabilities to modern ARM64 platforms (such as the Snapdragon 8 Elite / Oryon cores), I propose implementing a dual-path ARM64 SIMD backend.

Proposed Implementation: BF16 with NEON Fallback

To ensure both maximum performance on cutting-edge hardware and broad compatibility with older ARM64 devices, I suggest a tiered approach:

  1. Primary Path: ARMv8.6-A / v9 BF16
    • Leverage native bfmmla (Matrix Multiply-Accumulate) or bfdot instructions.
    • Advantage: BF16 maintains FP32’s dynamic range while doubling throughput and halving cache footprint. This is the "gold standard" for the 16-16-1 MLP used in Nap.
  2. Fallback Path: Standard NEON (ASIMD)
    • If the hardware does not support BF16 extensions, the implementation should automatically fall back to standard NEON FP32 vectorization.
    • Advantage: Ensures functional parity with the SSE2 backend on all ARM64 devices (ARMv8.0-A and later), since ASIMD/NEON is mandatory in AArch64 and therefore always available.

Why BF16 is the Optimal Choice

Given the MLP architecture, BF16 presents several key advantages over FP32 or INT8:

  • Hardware Acceleration: On ARMv9 (like the Snapdragon 8 Elite), BF16 instructions typically deliver 2x to 4x the throughput of standard NEON operations.
  • Zero Quantization Pain: Unlike INT8, BF16 doesn't require complex scale/offset management, making it ideal for the log2-space regression used in sleep duration prediction.
  • Energy Efficiency: Reducing register pressure and memory bandwidth usage directly translates to lower power consumption during governor execution.

Technical Context

Modern ARM SoCs offer specific hardware features we can target:

  • BF16: Native hardware acceleration via FEAT_BF16.
  • NEON: 128-bit vectorization, serving as a direct functional equivalent to SSE2.
  • I8MM / DotProd: (Optional future path) Dedicated hardware for integer math if quantization is ever explored.

Impact

Enabling this tiered ARM64 acceleration would significantly enhance energy efficiency on mobile devices. Since CPUIdle governor precision is a primary driver of battery longevity, reducing the overhead of the prediction model is a high-value optimization.

As before, I have the hardware (Snapdragon 8 Elite) ready and would be more than happy to assist with benchmarking and debugging this "BF16-preferred" implementation if you decide to move forward.

Best regards!
