content/learning-paths/mobile-graphics-and-gaming/litert-sme/1-litert-kleidiai-sme2.md
+2 -2: 2 additions & 2 deletions
@@ -8,7 +8,7 @@ layout: learningpathall
## LiteRT, XNNPACK, KleidiAI and SME2
- LiteRT (short for Lite Runtime), formerly known as TensorFlow Lite, is a runtime for on-device AI.
+ LiteRT (Lite Runtime), formerly known as TensorFlow Lite, is a runtime for on-device AI.
The default CPU acceleration library used by LiteRT is XNNPACK.
XNNPACK is an open-source library that provides highly optimized implementations of neural-network operators. It continuously integrates the KleidiAI library to take advantage of new CPU features such as SME2.
@@ -50,4 +50,4 @@ When KleidiAI and SME2 are enabled at building stage, the KleidiAI SME2 micro-ke
During the model loading stage, when XNNPACK optimizes the subgraph, it checks the operator’s data type to determine whether a KleidiAI implementation is available. If KleidiAI supports it, XNNPACK bypasses its own default implementation. As a result, RHS packing is performed using the KleidiAI SME packing micro-kernel. In addition, because KleidiAI typically requires packing of the LHS, a flag is also set during this stage.
- During model inference, the LHS packing micro-kernel is invoked. After the LHS is packed, XNNPACK performs the matrix multiplication. At this point, the KleidiAI SME micro-kernel is used to compute the matrix product.
+ During model inference, the LHS packing micro-kernel is invoked. After the LHS is packed, XNNPACK performs the matrix multiplication. At this point, the KleidiAI SME micro-kernel is used to compute the matrix product.
content/learning-paths/mobile-graphics-and-gaming/litert-sme/2-build-tool.md
+47 -17: 47 additions & 17 deletions
@@ -6,49 +6,67 @@ weight: 3
layout: learningpathall
---
- ### Build the LiteRT benchamrk tool with KleidiAI and SME2 enabled
+ ### Build the LiteRT benchmark tool with KleidiAI and SME2 enabled
- LiteRT provides a tool called `benchmark_model` for evaluating the performance of LiteRT models. Use the following steps to build the LiteRT benchamrk tool.
+ LiteRT provides a standalone performance measurement utility called `benchmark_model` for evaluating the performance of LiteRT models.
+ In this section, you will build two versions of the benchmark tool:
+ * With KleidiAI + SME/SME2 enabled → uses Arm-optimized micro-kernels
+ * Without KleidiAI + SME/SME2 → baseline performance (NEON/SVE2 fallback)
+ This comparison clearly demonstrates the gains provided by SME2 acceleration.
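A rough sketch of what the SME2-enabled build invocation can look like is shown below. The `--config` and `--define` flag names are assumptions and the exact command used in the Learning Path may differ; the Bazel target matches the binary path reported afterwards.

```bash
# Hypothetical build command for the KleidiAI + SME2 enabled benchmark tool.
# --config=android_arm64 and the xnn_enable_kleidiai define are assumed names;
# check the Learning Path or the LiteRT build documentation for the exact flags.
bazel build -c opt --config=android_arm64 \
  --define=xnn_enable_kleidiai=true \
  //litert/tools:benchmark_model
```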
- The above build enables the KleidiAI and SME2 micro-kernels integrated into XNNPACK.
+ This build enables the KleidiAI and SME2 micro-kernels integrated into XNNPACK and produces an Android binary under:
+ ```output
+ bazel-bin/litert/tools/benchmark_model
+ ```
- ### Build the LiteRT benchamrk tool without KleidiAI
+ ### Build the LiteRT benchmark tool without KleidiAI (Baseline Comparison)
To compare the performance of the KleidiAI SME2 implementation against XNNPACK's original implementation, you can build another version of the LiteRT benchmark tool without KleidiAI and SME2 enabled.
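A minimal sketch of the baseline build, assuming the same Bazel target and that disabling the KleidiAI define is sufficient; the actual flags used by the Learning Path may differ.

```bash
# Hypothetical baseline build: same target, KleidiAI explicitly disabled.
bazel build -c opt --config=android_arm64 \
  --define=xnn_enable_kleidiai=false \
  //litert/tools:benchmark_model
```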
+ To demonstrate SME2 acceleration on Android, you will construct simple single-layer models (e.g., Fully Connected) using Keras and convert them into LiteRT (.tflite) format.
+ This allows you to benchmark isolated operators and directly observe SME2 improvements.
+ Install the tensorflow package dependency for your script:
+ ```bash
+ sudo pip3 install tensorflow
+ ```
- To evaluate the performance of SME2 acceleration per operator, the following script is provided as an example. It uses the Keras to create a simple model containing only a single fully connected operator and convert it into the LiteRT model.
+ Save the following script as an example in a file named `model.py`:
```python
import tensorflow as tf
@@ -74,10 +86,24 @@ fc_fp32 = converter.convert()
save_litert_model(fc_fp32, "fc_fp32.tflite")
```
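Since the diff shows only the beginning and end of the script, here is a self-contained sketch of what `model.py` can look like. The layer size, input shape, and the `save_litert_model` helper are illustrative assumptions and may differ from the actual Learning Path script.

```python
# Illustrative sketch of model.py (shapes and helper are assumptions).
import tensorflow as tf

def save_litert_model(model_bytes, path):
    # Write the converted flatbuffer to disk.
    with open(path, "wb") as f:
        f.write(model_bytes)

# A single fully connected (Dense) layer so the benchmark isolates one operator.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1024,), batch_size=1),
    tf.keras.layers.Dense(1024),
])

# Convert the Keras model to a LiteRT (.tflite) flatbuffer in FP32.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
fc_fp32 = converter.convert()
save_litert_model(fc_fp32, "fc_fp32.tflite")
```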
- The model above is created in FP32 format. As mentioned in the previous section, this operator can invoke the KleidiAI SME2 micro-kernel for acceleration.
+ Now run the script:
+ ```bash
+ python3 model.py
+ ```
+ The model `fc_fp32.tflite` is created in FP32 format. As mentioned in the previous section, this operator can invoke the KleidiAI SME2 micro-kernel for acceleration.
+ You can then use ADB to push the model, along with the `benchmark_model` binary you built earlier, to your Android device:
+ ```bash
+ adb push fc_fp32.tflite /data/local/tmp/
+ adb push bazel-bin/litert/tools/benchmark_model /data/local/tmp/
+ adb shell chmod +x /data/local/tmp/benchmark_model
+ ```
You can also optimize this Keras model using post-training quantization to create a LiteRT model that suits your requirements.
+ ## Post-Training Quantization Options
* Post-training FP16 quantization
```python
# Convert to model with FP16 weights and FP32 activations
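# Illustrative sketch of the remaining FP16 conversion steps; the variable
# names reuse the earlier script and are assumptions, so the original example
# may differ.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
fc_fp16 = converter.convert()
save_litert_model(fc_fp16, "fc_fp16.tflite")
```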
As you can see from the `Features` field, CPU 7 supports SME2.
- Then, you can run the `benchmark_model` tool on the CPU that supports the SME2. One of the example is as follows.
- This example uses the `taskset` command to configure the benchmark workload to run on cores 7. It specifies to utilize 1 thread by the option `--num_threads=1`, and running the inferences at least 1000 times by the option `--num_runs=1000`. The CPU is selected. Also, it passes the option `--use_profiler=true` to produce a operator level profiling during inference.
+ ## Run benchmark_model on an SME2 Core
+ Next, run the benchmark tool and bind execution to a core that supports SME2.
+ In this example, you will pin to CPU 7, use a single thread, and run enough iterations to get stable timing.
This example uses the `taskset` command to bind the benchmark workload to core 7. It uses a single thread via `--num_threads=1`, runs at least 1000 inferences via `--num_runs=1000`, and passes `--use_profiler=true` to produce operator-level profiling during inference.
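A sketch of what the invocation can look like, using the paths from the earlier `adb push` steps. The core mask `80`, which selects core 7, is an assumption about your device's core numbering.

```bash
# Hypothetical invocation; adjust the core mask and paths for your device.
adb shell taskset 80 /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/fc_fp32.tflite \
  --num_threads=1 \
  --num_runs=1000 \
  --use_profiler=true
```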
+ You should see output similar to:
+ ```output
...
INFO: [litert/runtime/accelerators/auto_registration.cc:148] CPU accelerator registered.
INFO: [litert/runtime/compiled_model.cc:415] Flatbuffer model initialized directly from incoming litert model.
- INFO: Initialized TensorFlow Lite runtime.
+ INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
VERBOSE: Replacing 1 out of 1 node(s) with delegate (TfLiteXNNPackDelegate) node, yielding 1 partitions for subgraph 0.
INFO: The input model file size (MB): 3.27774
@@ -90,7 +101,7 @@ Memory (bytes): count=0
5 nodes observed
```
- As you can see from the results above, the results include the time spent on model initialization, warm-up, and inference, as well as memory usage. Since the profiler was enabled, the output also reports the execution time of each operator.
+ From the results above, you will see the time spent on model initialization, warm-up, and inference, as well as memory usage. Since the profiler was enabled, the output also reports the execution time of each operator.
Because the model contains only a single fully connected layer, the node type `Fully Connected (NC, PF32) GEMM` shows an average execution time of 0.382 ms, accounting for 93.171% of the total inference time.
@@ -100,13 +111,15 @@ To verify the KleidiAI SME2 micro-kernels are invoked for the Fully Connected op
## Measure the performance impact of KleidiAI SME2 micro-kernels
- To compare the performance of the KleidiAI SME2 implementation with the original XNNPACK implementation, you can run the `benchmark_model` tool without KleidiAI enabled and then benchmark again using the same parameters.
+ To compare the performance of the KleidiAI SME2 implementation with the original XNNPACK implementation, you can run the `benchmark_model` tool without KleidiAI enabled using the same parameters.
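For example, a sketch of the baseline run, assuming the non-KleidiAI build was pushed to the device under a distinct name; the name `benchmark_model_no_kleidiai` is purely illustrative.

```bash
# Hypothetical: same parameters, pointing at the baseline (non-KleidiAI) binary.
adb shell taskset 80 /data/local/tmp/benchmark_model_no_kleidiai \
  --graph=/data/local/tmp/fc_fp32.tflite \
  --num_threads=1 \
  --num_runs=1000 \
  --use_profiler=true
```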
INFO: [litert/runtime/accelerators/auto_registration.cc:148] CPU accelerator registered.
INFO: [litert/runtime/compiled_model.cc:415] Flatbuffer model initialized directly from incoming litert model.
@@ -169,6 +182,9 @@ Memory (bytes): count=0
5 nodes observed
```
+ From these benchmarking results, you should notice a significant throughput uplift and speedup in inference time when KleidiAI SME2 micro-kernels are enabled.
+ ### Interpreting Node Type Names for KleidiAI
As you can see from the results, the XNNPACK node type name differs for the same model: for the non-KleidiAI implementation the node type is `Fully Connected (NC, F32) GEMM`, whereas for the KleidiAI implementation it is `Fully Connected (NC, PF32) GEMM`.
For other operators supported by KleidiAI, the per-operator profiling node types differ between the implementations with and without KleidiAI enabled in XNNPACK as follows:
@@ -188,10 +204,12 @@ For other operators supported by KleidiAI, the per-operator profiling node types
- As you can see from the list, the letter “P” indicates that the node type corresponds to a KleidiAI implementation.
+ The letter “P” in the node type indicates that it corresponds to a KleidiAI implementation.
For example, `Convolution (NHWC, PQS8, QS8, QC8W)` represents a Conv2D operator computed by a KleidiAI micro-kernel, where the tensor is in NHWC layout.
* The input is packed INT8 quantized,
* The weights are per-channel INT8 quantized,
- * The output is INT8 quantized.
+ * The output is INT8 quantized.
+ By comparing `benchmark_model` runs with and without KleidiAI + SME2, and inspecting the profiled node types (PF32, PF16, QP8, PQS8), you can reliably confirm that LiteRT is dispatching to SME2-optimized KleidiAI micro-kernels and quantify their performance impact on your Android device.
content/learning-paths/mobile-graphics-and-gaming/litert-sme/_index.md
+4 -3: 4 additions & 3 deletions
@@ -7,7 +7,7 @@ cascade:
minutes_to_complete: 30
- who_is_this_for: This is an advanced topic for developers looking to leverage the SME2 instrutions to accelerate LiteRT models inference on Android.
+ who_is_this_for: This is an advanced topic for developers looking to leverage Arm's Scalable Matrix Extension Version 2 (SME2) instructions to accelerate LiteRT model inference on Android.
learning_objectives:
- Understand how KleidiAI works in LiteRT.
@@ -17,8 +17,8 @@ learning_objectives:
prerequisites:
- - A Linux development machine.
- - An Android device that supports the SME2 Arm architecture features.
+ - An x86_64 Linux development machine.
+ - An Android device that supports the Arm SME2 architecture features - see the list of [devices with SME2 support](/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started/#devices)