
Commit 278f5fb

Merge pull request #2631 from pareenaverma/content_review
Kleidi+SME2 Android LP Tech review
2 parents db26dc4 + 13f773c commit 278f5fb

5 files changed: +120 additions, -45 deletions


content/learning-paths/mobile-graphics-and-gaming/litert-sme/1-litert-kleidiai-sme2.md

Lines changed: 2 additions & 2 deletions
@@ -8,7 +8,7 @@ layout: learningpathall

## LiteRT, XNNPACK, KleidiAI and SME2

LiteRT (Lite Runtime), formerly known as TensorFlow Lite, is a runtime for on-device AI.
The default CPU acceleration library used by LiteRT is XNNPACK.

XNNPACK is an open-source library that provides highly optimized implementations of neural-network operators. It continuously integrates the KleidiAI library to take advantage of new CPU features such as SME2.
@@ -50,4 +50,4 @@ When KleidiAI and SME2 are enabled at building stage, the KleidiAI SME2 micro-ke
During the model loading stage, when XNNPACK optimizes the subgraph, it checks the operator’s data type to determine whether a KleidiAI implementation is available. If KleidiAI supports it, XNNPACK bypasses its own default implementation. As a result, RHS packing is performed using the KleidiAI SME packing micro-kernel. In addition, because KleidiAI typically requires packing of the LHS, a flag is also set during this stage.

During model inference, the LHS packing micro-kernel is invoked. After the LHS is packed, XNNPACK performs the matrix multiplication. At this point, the KleidiAI SME micro-kernel is used to compute the matrix product.
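
The two-stage flow described above can be summarized in pseudocode. This is a minimal conceptual sketch only; the helper names below are placeholders, not real XNNPACK or KleidiAI APIs.

```python
# Conceptual sketch of the dispatch flow described above - not real XNNPACK/KleidiAI code.
# The helpers are stand-ins so the flow can be read (and run) end to end.

SUPPORTED_DTYPES = {"fp32", "fp16", "int8"}   # illustrative subset only

def kleidiai_supports(dtype):
    return dtype in SUPPORTED_DTYPES

def pack(tensor):
    return ("packed", tensor)                 # placeholder for an SME packing micro-kernel

def sme_matmul(lhs, rhs):
    return ("sme_product", lhs, rhs)          # placeholder for the SME matmul micro-kernel

class FullyConnected:
    def __init__(self, weights, dtype):
        self.weights, self.dtype = weights, dtype
        self.packed_rhs, self.needs_lhs_packing = None, False

    def load(self):
        # Model loading: check the data type; if supported, pack the RHS and flag LHS packing.
        if kleidiai_supports(self.dtype):
            self.packed_rhs = pack(self.weights)
            self.needs_lhs_packing = True

    def infer(self, lhs):
        # Inference: pack the LHS, then compute the matrix product with the SME micro-kernel.
        if self.needs_lhs_packing:
            return sme_matmul(pack(lhs), self.packed_rhs)
        return ("default_xnnpack_product", lhs, self.weights)

op = FullyConnected(weights=[[1.0, 2.0]], dtype="fp32")
op.load()
print(op.infer([[3.0], [4.0]]))
```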

content/learning-paths/mobile-graphics-and-gaming/litert-sme/2-build-tool.md

Lines changed: 47 additions & 17 deletions
@@ -6,49 +6,67 @@ weight: 3
layout: learningpathall
---

### Build the LiteRT benchmark tool with KleidiAI and SME2 enabled

LiteRT provides a standalone performance measurement utility called `benchmark_model` for evaluating the performance of LiteRT models.

In this section, you will build two versions of the benchmark tool:

* With KleidiAI + SME/SME2 enabled → uses Arm-optimized micro-kernels
* Without KleidiAI + SME/SME2 → baseline performance (NEON/SVE2 fallback)

This comparison clearly demonstrates the gains provided by SME2 acceleration.

First, clone the LiteRT repository.

``` bash
cd $WORKSPACE
git clone https://github.com/google-ai-edge/LiteRT.git
```

Because LiteRT integrates KleidiAI through XNNPACK, you must build LiteRT from source to enable SME2 micro-kernels.

Next, set up your Android build environment using Docker on your Linux development machine.
Google provides a Dockerfile that installs the toolchain needed for TFLite/LiteRT Android builds.

Download the Dockerfile:

``` bash
wget https://raw.githubusercontent.com/tensorflow/tensorflow/master/tensorflow/lite/tools/tflite-android.Dockerfile
```

Build the Docker image:

```bash
docker build . -t tflite-builder -f tflite-android.Dockerfile
```

The Docker image includes Bazel, NDK, CMake, toolchains, and Python required for cross-compiling Android binaries.
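
If you want to confirm the image was created before moving on, you can list it. This is an optional check, not part of the original steps:

```bash
# Optional: verify the tflite-builder image exists locally.
docker images tflite-builder
```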

You will now install the Android SDK/NDK components inside the container.

Launch the Docker container:

```bash
docker run -it -v $PWD:/host_dir tflite-builder bash
```

Install the Android platform tools:

``` bash
sdkmanager \
  "build-tools;${ANDROID_BUILD_TOOLS_VERSION}" \
  "platform-tools" \
  "platforms;android-${ANDROID_API_LEVEL}"
```

Configure the LiteRT build options inside your running container:

``` bash
cd /host_dir/LiteRT
./configure
```

Use the default values for all the prompts except when prompted:

```output
Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]
```

Type in `y`.

LiteRT's configuration script will detect the SDK and NDK paths, set toolchain versions, configure the Android ABI (arm64-v8a), and initialize the Bazel workspace rules.

Now you can build the benchmark tool with KleidiAI + SME2 enabled.

Enable XNNPACK, quantization paths, and SME2 acceleration:

``` bash
export BENCHMARK_TOOL_PATH="litert/tools:benchmark_model"
export XNNPACK_OPTIONS="--define tflite_with_xnnpack=true \
@@ -58,16 +76,21 @@ export XNNPACK_OPTIONS="--define tflite_with_xnnpack=true \
  --define=xnn_enable_arm_sme=true \
  --define=xnn_enable_arm_sme2=true \
  --define=xnn_enable_kleidiai=true"
```

Build for Android:

```bash
bazel build -c opt --config=android_arm64 \
  ${XNNPACK_OPTIONS} "${BENCHMARK_TOOL_PATH}" \
  --repo_env=HERMETIC_PYTHON_VERSION=3.12
```

This build enables the KleidiAI and SME2 micro-kernels integrated into XNNPACK and produces an Android binary under:

```output
bazel-bin/litert/tools/benchmark_model
```
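
As an optional sanity check (not part of the original instructions), you can confirm on the build host that the binary was cross-compiled for AArch64 before pushing it to the device:

```bash
# Optional: the output should mention an ELF 64-bit ARM aarch64 executable.
file bazel-bin/litert/tools/benchmark_model
```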

### Build the LiteRT benchmark tool without KleidiAI (Baseline Comparison)

To compare the performance of the KleidiAI SME2 implementation against XNNPACK’s original implementation, you can build another version of the LiteRT benchmark tool without KleidiAI and SME2 enabled.

@@ -80,11 +103,18 @@ export XNNPACK_OPTIONS="--define tflite_with_xnnpack=true \
  --define=xnn_enable_arm_sme=false \
  --define=xnn_enable_arm_sme2=false \
  --define=xnn_enable_kleidiai=false"
```

Then rebuild:

```bash
bazel build -c opt --config=android_arm64 \
  ${XNNPACK_OPTIONS} "${BENCHMARK_TOOL_PATH}" \
  --repo_env=HERMETIC_PYTHON_VERSION=3.12
```

This build of `benchmark_model` disables all SME2 micro-kernels and forces fallback to XNNPACK’s NEON/SVE2 kernels.
You can then use ADB to push the benchmark tool to your Android device.

```bash
adb push bazel-bin/litert/tools/benchmark_model /data/local/tmp/
adb shell chmod +x /data/local/tmp/benchmark_model
```
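
Note that both configurations produce a binary at the same path, so pushing the second build overwrites the first on the device unless you rename it. The file name below is only illustrative:

```bash
# Illustrative only: keep both builds side by side on the device under different names.
adb push bazel-bin/litert/tools/benchmark_model /data/local/tmp/benchmark_model_nokleidiai
adb shell chmod +x /data/local/tmp/benchmark_model_nokleidiai
```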

content/learning-paths/mobile-graphics-and-gaming/litert-sme/3-buid-model.md

Lines changed: 33 additions & 7 deletions
@@ -8,11 +8,13 @@ layout: learningpathall

### KleidiAI SME2 support in LiteRT

LiteRT uses XNNPACK as its default CPU backend. KleidiAI micro-kernels are integrated through XNNPACK in LiteRT.
Only a subset of the KleidiAI SME and SME2 micro-kernels has been integrated into XNNPACK.
These micro-kernels support operators using the following data types and quantization configurations in the LiteRT model.
Other operators use XNNPACK’s default implementation during inference.

Fully connected:

| Activations | Weights | Output |
| ---------------------------- | --------------------------------------- | ---------------------------- |
| FP32 | FP32 | FP32 |
@@ -21,15 +23,16 @@ Other operators are using XNNPACK’s default implementation during the inferenc
| Asymmetric INT8 quantization | Per-channel symmetric INT8 quantization | Asymmetric INT8 quantization |
| FP32 | Per-channel symmetric INT4 quantization | FP32 |

Batch Matrix Multiply:

| Input A | Input B |
| ------- | --------------------------------------- |
| FP32 | FP32 |
| FP16 | FP16 |
| FP32 | Per-channel symmetric INT8 quantization |

Conv2D:

| Activations | Weights | Output |
| ---------------------------- | ----------------------------------------------------- | ---------------------------- |
| FP32 | FP32, pointwise (kernel size is 1) | FP32 |
@@ -38,15 +41,24 @@ Other operators are using XNNPACK’s default implementation during the inferenc
| Asymmetric INT8 quantization | Per-channel or per-tensor symmetric INT8 quantization | Asymmetric INT8 quantization |

TransposeConv:

| Activations | Weights | Output |
| ---------------------------- | ----------------------------------------------------- | ---------------------------- |
| Asymmetric INT8 quantization | Per-channel or per-tensor symmetric INT8 quantization | Asymmetric INT8 quantization |

### Create LiteRT models using Keras

To demonstrate SME2 acceleration on Android, you will construct simple single-layer models (for example, Fully Connected) using Keras and convert them into LiteRT (.tflite) format.
This allows you to benchmark isolated operators and directly observe SME2 improvements.

Install the tensorflow package dependency for your script:

```bash
sudo pip3 install tensorflow
```

Save the following example script in a file named `model.py`:

``` python
import tensorflow as tf
@@ -74,10 +86,24 @@ fc_fp32 = converter.convert()
save_litert_model(fc_fp32, "fc_fp32.tflite")
```

Now run the script:

```bash
python3 model.py
```

The model `fc_fp32.tflite` is created in FP32 format. As mentioned in the previous section, this operator can invoke the KleidiAI SME2 micro-kernel for acceleration.
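
If you want to confirm what the converter produced, you can inspect the model's input and output types with the TensorFlow Lite interpreter. This is an optional verification step, not part of the original flow:

```python
import tensorflow as tf

# Load the converted model and print its I/O dtypes; both should be float32.
interpreter = tf.lite.Interpreter(model_path="fc_fp32.tflite")
interpreter.allocate_tensors()
print(interpreter.get_input_details()[0]["dtype"])
print(interpreter.get_output_details()[0]["dtype"])
```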

You can then use ADB to push the model for benchmarking to your Android device.

```bash
adb push fc_fp32.tflite /data/local/tmp/
adb shell chmod +x /data/local/tmp/fc_fp32.tflite
```

You can also optimize this Keras model using post-training quantization to create a LiteRT model that suits your requirements.

### Post-training quantization options

* Post-training FP16 quantization
``` python
# Convert to model with FP16 weights and FP32 activations

content/learning-paths/mobile-graphics-and-gaming/litert-sme/4-benchmark.md

Lines changed: 34 additions & 16 deletions
@@ -5,33 +5,44 @@ weight: 4
### FIXED, DO NOT MODIFY
layout: learningpathall
---

Once you have:

* A LiteRT model (for example, `fc_fp32.tflite`), and
* The `benchmark_model` binary built with and without KleidiAI + SME2,

you can run benchmarks directly on an SME2-capable Android device.

### Verify SME2 Support on the Device

First, you should check if your Android phone supports SME2.
On the device (via `adb shell`), run:

``` bash
cat /proc/cpuinfo
```

You should see a `Features` line similar to:

```output
...
processor : 7
BogoMIPS : 2000.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti mte ecv afp mte3 sme smei8i32 smef16f32 smeb16f32 smef32f32 wfxt rprfm sme2 smei16i32 smebi32i32 hbc lrcpc3
```

As you can see from the `Features` entry, CPU 7 supports SME2.
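
If you prefer a quick scriptable check over reading the full `Features` line, a simple grep from your development machine works. This is an optional convenience, not part of the original steps:

```bash
# Prints the number of CPUs whose Features line advertises SME2 (0 means unsupported).
adb shell "grep -c sme2 /proc/cpuinfo"
```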

## Run benchmark_model on an SME2 Core

Next, run the benchmark tool and bind execution to a core that supports SME2.
In this example, you will pin to CPU 7, use a single thread, and run enough iterations to get stable timing.

``` bash
taskset 80 ./benchmark_model --graph=./fc_fp32.tflite --num_runs=1000 --num_threads=1 --use_cpu=true --use_profiler=true
```

This example uses the `taskset` command to pin the benchmark workload to core 7. It uses one thread (`--num_threads=1`), runs the inference at least 1000 times (`--num_runs=1000`), selects the CPU backend (`--use_cpu=true`), and passes `--use_profiler=true` to produce operator-level profiling during inference.
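
The `80` argument to `taskset` is a hexadecimal CPU affinity mask rather than a core index. If you need to target a different core, you can derive the mask from the core number, as sketched below:

```bash
# The mask is 1 << core_index, written in hex: core 7 -> 0x80, which taskset accepts as "80".
CORE=7
printf 'taskset mask for core %d: %x\n' "$CORE" $(( 1 << CORE ))
```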

You should see output similar to:

```output
...
INFO: [litert/runtime/accelerators/auto_registration.cc:148] CPU accelerator registered.
INFO: [litert/runtime/compiled_model.cc:415] Flatbuffer model initialized directly from incoming litert model.
INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
VERBOSE: Replacing 1 out of 1 node(s) with delegate (TfLiteXNNPackDelegate) node, yielding 1 partitions for subgraph 0.
INFO: The input model file size (MB): 3.27774
@@ -90,7 +101,7 @@ Memory (bytes): count=0
5 nodes observed
```

From the results above, you can see the time spent on model initialization, warm-up, and inference, as well as memory usage. Since the profiler was enabled, the output also reports the execution time of each operator.

Because the model contains only a single fully connected layer, the node type `Fully Connected (NC, PF32) GEMM` shows the average execution time is 0.382 ms, accounting for 93.171% of the total inference time.

@@ -100,13 +111,15 @@ To verify the KleidiAI SME2 micro-kernels are invoked for the Fully Connected op

## Measure the performance impact of KleidiAI SME2 micro-kernels

To compare the performance of the KleidiAI SME2 implementation with the original XNNPACK implementation, you can run the `benchmark_model` tool built without KleidiAI enabled.

Run with the same parameters:

``` bash
taskset 80 ./benchmark_model --graph=./fc_fp32.tflite --num_runs=1000 --num_threads=1 --use_cpu=true --use_profiler=true
```

The output should look like:

```output
...
INFO: [litert/runtime/accelerators/auto_registration.cc:148] CPU accelerator registered.
INFO: [litert/runtime/compiled_model.cc:415] Flatbuffer model initialized directly from incoming litert model.
@@ -169,6 +182,9 @@ Memory (bytes): count=0
5 nodes observed
```

From these benchmarking results, you should notice a significant throughput uplift and a speedup in inference time when the KleidiAI SME2 micro-kernels are enabled.

### Interpreting Node Type Names for KleidiAI

As you can see from the results, for the same model, the XNNPACK node type name is different. For the non-KleidiAI implementation, the node type is `Fully Connected (NC, F32) GEMM`, whereas for the KleidiAI implementation, it is `Fully Connected (NC, PF32) GEMM`.

For other operators supported by KleidiAI, the per-operator profiling node types differ between the implementations with and without KleidiAI enabled in XNNPACK, as follows:
@@ -188,10 +204,12 @@ For other operators supported by KleidiAI, the per-operator profiling node types
| Conv2D | Convolution (NHWC, PQS8, QS8, QC8W) | Convolution (NHWC, QC8) |
| TransposeConv | Deconvolution (NHWC, PQS8, QS8, QC8W) | Deconvolution (NC, QS8, QC8W) |

The letter “P” in the node type indicates that it corresponds to a KleidiAI implementation.

For example, `Convolution (NHWC, PQS8, QS8, QC8W)` represents a Conv2D operator computed by a KleidiAI micro-kernel, where the tensor is in NHWC layout.

* The input is packed INT8 quantized,
* The weights are per-channel INT8 quantized,
* The output is INT8 quantized.

By comparing `benchmark_model` runs with and without KleidiAI + SME2, and inspecting the profiled node types (PF32, PF16, QP8, PQS8), you can reliably confirm that LiteRT is dispatching to SME2-optimized KleidiAI micro-kernels and quantify their performance impact on your Android device.
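
A practical way to check which implementation a run used is to capture the benchmark output to a file and filter the profiled node types. The log file name below is only illustrative:

```bash
# Illustrative only: save a run's output, then look for the fully connected node type.
taskset 80 ./benchmark_model --graph=./fc_fp32.tflite --num_runs=1000 --num_threads=1 \
  --use_cpu=true --use_profiler=true > benchmark.log 2>&1
grep "Fully Connected" benchmark.log   # "(NC, PF32)" indicates the KleidiAI SME2 path
```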

content/learning-paths/mobile-graphics-and-gaming/litert-sme/_index.md

Lines changed: 4 additions & 3 deletions
@@ -7,7 +7,7 @@ cascade:

minutes_to_complete: 30

who_is_this_for: This is an advanced topic for developers looking to leverage Arm's Scalable Matrix Extension Version 2 (SME2) instructions to accelerate LiteRT model inference on Android.

learning_objectives:
- Understand how KleidiAI works in LiteRT.
@@ -17,8 +17,8 @@ learning_objectives:


prerequisites:
- An x86_64 Linux development machine.
- An Android device that supports the Arm SME2 architecture features - see the list of [devices with SME2 support](/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started/#devices)

author: Jiaming Guo

@@ -27,6 +27,7 @@ skilllevels: Advanced
subjects: ML
armips:
- Cortex-A
- Cortex-X
tools_software_languages:
- C
- Python
