@@ -8,7 +8,7 @@ layout: learningpathall

## LiteRT, XNNPACK, KleidiAI and SME2

LiteRT (Lite Runtime), formerly known as TensorFlow Lite, is a runtime for on-device AI.
The default CPU acceleration library used by LiteRT is XNNPACK.

XNNPACK is an open-source library that provides highly optimized implementations of neural-network operators. It continuously integrates the KleidiAI library to leverage new CPU features such as SME2.
@@ -50,4 +50,4 @@ When KleidiAI and SME2 are enabled at building stage, the KleidiAI SME2 micro-ke

During the model loading stage, when XNNPACK optimizes the subgraph, it checks the operator’s data type to determine whether a KleidiAI implementation is available. If KleidiAI supports it, XNNPACK bypasses its own default implementation. As a result, RHS packing is performed using the KleidiAI SME packing micro-kernel. In addition, because KleidiAI typically requires packing of the LHS, a flag is also set during this stage.

During model inference, the LHS packing micro-kernel is invoked. After the LHS is packed, XNNPACK performs the matrix multiplication. At this point, the KleidiAI SME micro-kernel is used to compute the matrix product.
@@ -6,49 +6,67 @@ weight: 3
layout: learningpathall
---

### Build the LiteRT benchmark tool with KleidiAI and SME2 enabled

LiteRT provides a standalone benchmarking utility called `benchmark_model` for evaluating the performance of LiteRT models.

In this section, you will build two versions of the benchmark tool:
* With KleidiAI + SME/SME2 enabled → uses Arm-optimized micro-kernels
* Without KleidiAI + SME/SME2 → baseline performance (NEON/SVE2 fallback)

This comparison clearly demonstrates the gains provided by SME2 acceleration.

First, clone the LiteRT repository.

``` bash
cd $WORKSPACE
git clone https://github.com/google-ai-edge/LiteRT.git
```
Because LiteRT integrates KleidiAI through XNNPACK, you must build LiteRT from source to enable SME2 micro-kernels.

Next, set up your Android build environment using Docker on your Linux development machine.
Google provides a Dockerfile that installs the toolchain needed for TFLite/LiteRT Android builds.

Download the Dockerfile:
``` bash
wget https://raw.githubusercontent.com/tensorflow/tensorflow/master/tensorflow/lite/tools/tflite-android.Dockerfile
```
Build the Docker image:
```bash
docker build . -t tflite-builder -f tflite-android.Dockerfile
```
The Docker image includes Bazel, NDK, CMake, toolchains, and Python required for cross-compiling Android binaries.
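
Before moving on, you can optionally confirm that the image exists locally; this check only assumes a standard Docker installation:

```bash
# Optional sanity check: the tflite-builder image should appear in the local image list
docker image ls tflite-builder
```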

You will now install the Android SDK and NDK components inside the container.
Launch the docker container:
```bash
docker run -it -v $PWD:/host_dir tflite-builder bash
```
Install the Android build tools, platform tools, and platform packages:
``` bash
sdkmanager \
"build-tools;${ANDROID_BUILD_TOOLS_VERSION}" \
"platform-tools" \
"platforms;android-${ANDROID_API_LEVEL}"
```
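
Optionally, you can confirm that the packages were installed. This is a convenience check that assumes `sdkmanager --list` is available inside the container:

```bash
# Optional: verify the requested SDK packages are present
sdkmanager --list | grep -E "build-tools|platform-tools|platforms;android"
```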

Configure the LiteRT build options inside your running container:

``` bash
cd /host_dir/LiteRT
./configure
```

Accept the default values for all prompts except the following:
```output
Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]
```
Type in `y`.

LiteRT's configuration script detects the SDK and NDK paths, sets the toolchain versions, configures the Android ABI (arm64-v8a), and initializes the Bazel workspace rules.
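
If you want to verify the result, the sketch below assumes the configure script records the SDK and NDK locations as `android_sdk_repository` and `android_ndk_repository` rules in `./WORKSPACE`; adjust the file name if your LiteRT revision stores them elsewhere:

```bash
# Optional: inspect the generated Android repository rules (the file location is an assumption)
grep -A 3 -E "android_(sdk|ndk)_repository" WORKSPACE
```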

Now, build the benchmark tool with KleidiAI and SME2 enabled.

Enable XNNPACK, quantization paths, and SME2 acceleration:
``` bash
export BENCHMARK_TOOL_PATH="litert/tools:benchmark_model"
export XNNPACK_OPTIONS="--define tflite_with_xnnpack=true \
@@ -58,16 +76,21 @@ export XNNPACK_OPTIONS="--define tflite_with_xnnpack=true \
--define=xnn_enable_arm_sme=true \
--define=xnn_enable_arm_sme2=true \
--define=xnn_enable_kleidiai=true"

```
Build for Android:
```bash
bazel build -c opt --config=android_arm64 \
${XNNPACK_OPTIONS} "${BENCHMARK_TOOL_PATH}" \
--repo_env=HERMETIC_PYTHON_VERSION=3.12
```

This build enables the KleidiAI and SME2 micro-kernels integrated into XNNPACK and produces an Android binary under:

```output
bazel-bin/litert/tools/benchmark_model
```

### Build the LiteRT benchmark tool without KleidiAI (baseline comparison)

To compare the performance of the KleidiAI SME2 implementation against XNNPACK's original implementation, build another version of the LiteRT benchmark tool without KleidiAI and SME2 enabled.

@@ -80,11 +103,18 @@ export XNNPACK_OPTIONS="--define tflite_with_xnnpack=true \
--define=xnn_enable_arm_sme=false \
--define=xnn_enable_arm_sme2=false \
--define=xnn_enable_kleidiai=false"
```

Then rebuild:
```bash
bazel build -c opt --config=android_arm64 \
${XNNPACK_OPTIONS} "${BENCHMARK_TOOL_PATH}" \
--repo_env=HERMETIC_PYTHON_VERSION=3.12
```

This build of `benchmark_model` disables all SME2 micro-kernels and falls back to XNNPACK's NEON/SVE2 kernels.
You can then use ADB to push the benchmark tool to your Android device.

```bash
adb push bazel-bin/litert/tools/benchmark_model /data/local/tmp/
adb shell chmod +x /data/local/tmp/benchmark_model
```
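
Note that this push overwrites any previously pushed build, because both configurations produce a binary with the same name. If you want to keep the KleidiAI + SME2 build and this baseline build on the device at the same time, push them under distinct names; the name below is only a suggestion:

```bash
# Example: keep the baseline build alongside the KleidiAI + SME2 build
adb push bazel-bin/litert/tools/benchmark_model /data/local/tmp/benchmark_model_nokleidiai
adb shell chmod +x /data/local/tmp/benchmark_model_nokleidiai
```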
@@ -8,11 +8,13 @@ layout: learningpathall

### KleidiAI SME2 support in LiteRT

LiteRT uses XNNPACK as its default CPU backend. KleidiAI micro-kernels are integrated through XNNPACK in LiteRT.
Only a subset of the KleidiAI SME and SME2 micro-kernels has been integrated into XNNPACK.
These micro-kernels support operators that use the following data types and quantization configurations in the LiteRT model.
Other operators use XNNPACK's default implementation during inference.

Fully connected:

| Activations | Weights | Output |
| ---------------------------- | --------------------------------------- | ---------------------------- |
| FP32 | FP32 | FP32 |
@@ -21,15 +23,16 @@ Other operators are using XNNPACK’s default implementation during the inferenc
| Asymmetric INT8 quantization | Per-channel symmetric INT8 quantization | Asymmetric INT8 quantization |
| FP32 | Per-channel symmetric INT4 quantization | FP32 |

Batch Matrix Multiply:

| Input A | Input B |
| ------- | --------------------------------------- |
| FP32 | FP32 |
| FP16 | FP16 |
| FP32 | Per-channel symmetric INT8 quantization |

Conv2D:

| Activations | Weights | Output |
| ---------------------------- | ----------------------------------------------------- | ---------------------------- |
| FP32 | FP32, pointwise (kernel size is 1) | FP32 |
@@ -38,15 +41,24 @@ Other operators are using XNNPACK’s default implementation during the inferenc
| Asymmetric INT8 quantization | Per-channel or per-tensor symmetric INT8 quantization | Asymmetric INT8 quantization |


TransposeConv:

| Activations | Weights | Output |
| ---------------------------- | ----------------------------------------------------- | ---------------------------- |
| Asymmetric INT8 quantization | Per-channel or per-tensor symmetric INT8 quantization | Asymmetric INT8 quantization |


### Create LiteRT models using Keras
To demonstrate SME2 acceleration on Android, you will construct simple single-layer models (e.g., Fully Connected) using Keras and convert them into LiteRT (.tflite) format.
This allows you to benchmark isolated operators and directly observe SME2 improvements.

Install the TensorFlow package required by the script:

```bash
sudo pip3 install tensorflow
```
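
If you prefer not to install packages system-wide, a Python virtual environment works just as well; this is an optional alternative:

```bash
# Optional: install TensorFlow into an isolated virtual environment instead
python3 -m venv litert-venv
source litert-venv/bin/activate
pip install tensorflow
```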

To evaluate the performance of SME2 acceleration per operator, the following example script uses Keras to create a simple model containing only a single fully connected operator and converts it into a LiteRT model. Save the script in a file named `model.py`:

``` python
import tensorflow as tf
@@ -74,10 +86,24 @@ fc_fp32 = converter.convert()
save_litert_model(fc_fp32, "fc_fp32.tflite")
```

Now run the script:

```bash
python3 model.py
```

The model `fc_fp32.tflite` is created in FP32 format. As mentioned in the previous section, this operator can invoke the KleidiAI SME2 micro-kernel for acceleration.

You can then use ADB to push the model for benchmarking to your Android device.
```bash
adb push fc_fp32.tflite /data/local/tmp/
adb shell chmod +x /data/local/tmp/fc_fp32.tflite
```

You can also optimize this Keras model using post-training quantization to create a LiteRT model that suits your requirements.

### Post-training quantization options

* Post-training FP16 quantization
``` python
# Convert to model with FP16 weights and FP32 activations
@@ -5,33 +5,44 @@ weight: 4
### FIXED, DO NOT MODIFY
layout: learningpathall
---

### Use the benchmark tool


Once you have a LiteRT model (for example, `fc_fp32.tflite`) and the `benchmark_model` binaries built with and without KleidiAI + SME2, you can run benchmarks directly on an SME2-capable Android device.

### Verify SME2 support on the device

First, you should check if your Android phone supports SME2.
On the device (via `adb shell`), run:
``` bash
cat /proc/cpuinfo
```
You should see a Features line similar to:
```output
...
processor : 7
BogoMIPS : 2000.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti mte ecv afp mte3 sme smei8i32 smef16f32 smeb16f32 smef32f32 wfxt rprfm sme2 smei16i32 smebi32i32 hbc lrcpc3
```

As you can see from the `Features` line, CPU 7 supports SME2.
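
As a shortcut, you can filter the same file from your development machine. This is only a convenience sketch; the exact layout of `/proc/cpuinfo` varies between kernels:

```bash
# Print processor IDs together with any Features lines that mention SME
adb shell "grep -E 'processor|sme' /proc/cpuinfo"
```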

## Run `benchmark_model` on an SME2 core

Next, run the benchmark tool and bind execution to a core that supports SME2.
In this example, you will pin to CPU 7, use a single thread, and run enough iterations to get stable timing.

``` bash
taskset 80 ./benchmark_model --graph=./fc_fp32.tflite --num_runs=1000 --num_threads=1 --use_cpu=true --use_profiler=true
```

This example uses the `taskset` command with the hexadecimal CPU mask `80` to pin the benchmark workload to core 7. The `--num_threads=1` option runs inference on a single thread, `--num_runs=1000` runs at least 1000 inference iterations, `--use_cpu=true` selects the CPU backend, and `--use_profiler=true` produces operator-level profiling during inference.

You should see output similar to:
```output
...
INFO: [litert/runtime/accelerators/auto_registration.cc:148] CPU accelerator registered.
INFO: [litert/runtime/compiled_model.cc:415] Flatbuffer model initialized directly from incoming litert model.
INFO: Initialized TensorFlow Lite runtime.
INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
VERBOSE: Replacing 1 out of 1 node(s) with delegate (TfLiteXNNPackDelegate) node, yielding 1 partitions for subgraph 0.
INFO: The input model file size (MB): 3.27774
@@ -90,7 +101,7 @@ Memory (bytes): count=0
5 nodes observed
```

From the results above, you can see the time spent on model initialization, warm-up, and inference, as well as memory usage. Because the profiler was enabled, the output also reports the execution time of each operator.

Because the model contains only a single fully connected layer, the node type `Fully Connected (NC, PF32) GEMM` shows an average execution time of 0.382 ms, accounting for 93.171% of the total inference time.

Expand All @@ -100,13 +111,15 @@ To verify the KleidiAI SME2 micro-kernels are invoked for the Fully Connected op

## Measure the performance impact of KleidiAI SME2 micro-kernels

To compare the performance of the KleidiAI SME2 implementation with the original XNNPACK implementation, run the `benchmark_model` build without KleidiAI enabled, using the same parameters.

Run with the same parameters:

``` bash
taskset 80 ./benchmark_model --graph=./fc_fp32.tflite --num_runs=1000 --num_threads=1 --use_cpu=true --use_profiler=true
```
The output should look like:
```output
...
INFO: [litert/runtime/accelerators/auto_registration.cc:148] CPU accelerator registered.
INFO: [litert/runtime/compiled_model.cc:415] Flatbuffer model initialized directly from incoming litert model.
@@ -169,6 +182,9 @@ Memory (bytes): count=0
5 nodes observed
```

From these benchmarking results, you should notice a significant uplift in throughput and a reduction in inference time when the KleidiAI SME2 micro-kernels are enabled.

### Interpreting node type names for KleidiAI
As you can see from the results, the XNNPACK node type name differs for the same model. For the non-KleidiAI implementation, the node type is `Fully Connected (NC, F32) GEMM`, whereas for the KleidiAI implementation it is `Fully Connected (NC, PF32) GEMM`.

For other operators supported by KleidiAI, the per-operator profiling node types differ between the implementations with and without KleidiAI enabled in XNNPACK as follows:
@@ -188,10 +204,12 @@ For other operators supported by KleidiAI, the per-operator profiling node types
| Conv2D | Convolution (NHWC, PQS8, QS8, QC8W) | Convolution (NHWC, QC8) |
| TransposeConv | Deconvolution (NHWC, PQS8, QS8, QC8W) | Deconvolution (NC, QS8, QC8W) |

The letter “P” in the node type indicates that it corresponds to a KleidiAI implementation.

For example, `Convolution (NHWC, PQS8, QS8, QC8W)` represents a Conv2D operator computed by a KleidiAI micro-kernel, where the tensor is in NHWC layout:

* The input is packed INT8 quantized,
* The weights are per-channel INT8 quantized,
* The output is INT8 quantized.

By comparing `benchmark_model` runs with and without KleidiAI + SME2, and inspecting the profiled node types (PF32, PF16, QP8, PQS8), you can reliably confirm that LiteRT is dispatching to SME2-optimized KleidiAI micro-kernels and quantify their performance impact on your Android device.
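
If you pushed the two builds under different names (for example, `benchmark_model` and `benchmark_model_nokleidiai`, as suggested earlier), a small loop like the sketch below runs both with identical parameters and saves the logs on the host for comparison; the binary and log names are only illustrative:

```bash
# Run both builds back to back and capture their output for comparison
for bin in benchmark_model benchmark_model_nokleidiai; do
  adb shell "taskset 80 /data/local/tmp/$bin --graph=/data/local/tmp/fc_fp32.tflite --num_runs=1000 --num_threads=1 --use_cpu=true --use_profiler=true" > "$bin.log"
done
```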
@@ -7,7 +7,7 @@ cascade:

minutes_to_complete: 30

who_is_this_for: This is an advanced topic for developers looking to leverage Arm's Scalable Matrix Extension Version 2 (SME2) instructions to accelerate LiteRT model inference on Android.

learning_objectives:
- Understand how KleidiAI works in LiteRT.
@@ -17,8 +17,8 @@ learning_objectives:


prerequisites:
- An x86_64 Linux development machine.
- An Android device that supports the Arm SME2 architecture features - see the list of [devices with SME2 support](/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started/#devices).

author: Jiaming Guo

@@ -27,6 +27,7 @@ skilllevels: Advanced
subjects: ML
armips:
- Cortex-A
- Cortex-X
tools_software_languages:
- C
- Python