
Commit 278f5fb

Merge pull request #2631 from pareenaverma/content_review
Kleidi+SME2 Android LP Tech review
2 parents db26dc4 + 13f773c commit 278f5fb

5 files changed: +120 additions, -45 deletions


content/learning-paths/mobile-graphics-and-gaming/litert-sme/1-litert-kleidiai-sme2.md

Lines changed: 2 additions & 2 deletions
@@ -8,7 +8,7 @@ layout: learningpathall

## LiteRT, XNNPACK, KleidiAI and SME2

LiteRT (Lite Runtime), formerly known as TensorFlow Lite, is a runtime for on-device AI.
The default CPU acceleration library used by LiteRT is XNNPACK.

XNNPACK is an open-source library that provides highly optimized implementations of neural-network operators. It continuously integrates the KleidiAI library to take advantage of new CPU features such as SME2.
@@ -50,4 +50,4 @@ When KleidiAI and SME2 are enabled at building stage, the KleidiAI SME2 micro-ke
During the model loading stage, when XNNPACK optimizes the subgraph, it checks the operator’s data type to determine whether a KleidiAI implementation is available. If KleidiAI supports it, XNNPACK bypasses its own default implementation. As a result, RHS packing is performed using the KleidiAI SME packing micro-kernel. In addition, because KleidiAI typically requires packing of the LHS, a flag is also set during this stage.

During model inference, the LHS packing micro-kernel is invoked. After the LHS is packed, XNNPACK performs the matrix multiplication. At this point, the KleidiAI SME micro-kernel is used to compute the matrix product.
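
The two-stage flow described above can be summarized in pseudocode. This is a minimal conceptual sketch only; the helper names below are placeholders, not real XNNPACK or KleidiAI APIs.

```python
# Conceptual sketch of the dispatch flow described above - not real XNNPACK/KleidiAI code.
# The helpers are stand-ins so the flow can be read (and run) end to end.

SUPPORTED_DTYPES = {"fp32", "fp16", "int8"}   # illustrative subset only

def kleidiai_supports(dtype):
    return dtype in SUPPORTED_DTYPES

def pack(tensor):
    return ("packed", tensor)                 # placeholder for an SME packing micro-kernel

def sme_matmul(lhs, rhs):
    return ("sme_product", lhs, rhs)          # placeholder for the SME matmul micro-kernel

class FullyConnected:
    def __init__(self, weights, dtype):
        self.weights, self.dtype = weights, dtype
        self.packed_rhs, self.needs_lhs_packing = None, False

    def load(self):
        # Model loading: check the data type; if supported, pack the RHS and flag LHS packing.
        if kleidiai_supports(self.dtype):
            self.packed_rhs = pack(self.weights)
            self.needs_lhs_packing = True

    def infer(self, lhs):
        # Inference: pack the LHS, then compute the matrix product with the SME micro-kernel.
        if self.needs_lhs_packing:
            return sme_matmul(pack(lhs), self.packed_rhs)
        return ("default_xnnpack_product", lhs, self.weights)

op = FullyConnected(weights=[[1.0, 2.0]], dtype="fp32")
op.load()
print(op.infer([[3.0], [4.0]]))
```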

content/learning-paths/mobile-graphics-and-gaming/litert-sme/2-build-tool.md

Lines changed: 47 additions & 17 deletions
@@ -6,49 +6,67 @@ weight: 3
layout: learningpathall
---

### Build the LiteRT benchmark tool with KleidiAI and SME2 enabled

LiteRT provides a standalone performance measurement utility called `benchmark_model` for evaluating the performance of LiteRT models.

In this section, you will build two versions of the benchmark tool:

* With KleidiAI + SME/SME2 enabled → uses Arm-optimized micro-kernels
* Without KleidiAI + SME/SME2 → baseline performance (NEON/SVE2 fallback)

This comparison clearly demonstrates the gains provided by SME2 acceleration.

First, clone the LiteRT repository.

``` bash
cd $WORKSPACE
git clone https://github.com/google-ai-edge/LiteRT.git
```

Because LiteRT integrates KleidiAI through XNNPACK, you must build LiteRT from source to enable SME2 micro-kernels.

Next, set up your Android build environment using Docker on your Linux development machine.
Google provides a Dockerfile that installs the toolchain needed for TFLite/LiteRT Android builds.

Download the Dockerfile:

``` bash
wget https://raw.githubusercontent.com/tensorflow/tensorflow/master/tensorflow/lite/tools/tflite-android.Dockerfile
```

Build the Docker image:

```bash
docker build . -t tflite-builder -f tflite-android.Dockerfile
```

The Docker image includes Bazel, NDK, CMake, toolchains, and Python required for cross-compiling Android binaries.
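
If you want to confirm the image was created before moving on, you can list it. This is an optional check, not part of the original steps:

```bash
# Optional: verify the tflite-builder image exists locally.
docker images tflite-builder
```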

You will now install the Android SDK/NDK components inside the container.

Launch the Docker container:

```bash
docker run -it -v $PWD:/host_dir tflite-builder bash
```

Install the Android platform tools:

``` bash
sdkmanager \
  "build-tools;${ANDROID_BUILD_TOOLS_VERSION}" \
  "platform-tools" \
  "platforms;android-${ANDROID_API_LEVEL}"
```

Configure the LiteRT build options inside your running container:

``` bash
cd /host_dir/LiteRT
./configure
```

Use the default values for all the prompts except when prompted:

```output
Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]
```

Type in `y`.

LiteRT's configuration script will detect the SDK and NDK paths, set toolchain versions, configure the Android ABI (arm64-v8a), and initialize the Bazel workspace rules.

Now you can build the benchmark tool with KleidiAI + SME2 enabled.

Enable XNNPACK, quantization paths, and SME2 acceleration:

``` bash
export BENCHMARK_TOOL_PATH="litert/tools:benchmark_model"
export XNNPACK_OPTIONS="--define tflite_with_xnnpack=true \
@@ -58,16 +76,21 @@ export XNNPACK_OPTIONS="--define tflite_with_xnnpack=true \
  --define=xnn_enable_arm_sme=true \
  --define=xnn_enable_arm_sme2=true \
  --define=xnn_enable_kleidiai=true"
```

Build for Android:

```bash
bazel build -c opt --config=android_arm64 \
  ${XNNPACK_OPTIONS} "${BENCHMARK_TOOL_PATH}" \
  --repo_env=HERMETIC_PYTHON_VERSION=3.12
```

This build enables the KleidiAI and SME2 micro-kernels integrated into XNNPACK and produces an Android binary under:

```output
bazel-bin/litert/tools/benchmark_model
```
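
As an optional sanity check (not part of the original instructions), you can confirm on the build host that the binary was cross-compiled for AArch64 before pushing it to the device:

```bash
# Optional: the output should mention an ELF 64-bit ARM aarch64 executable.
file bazel-bin/litert/tools/benchmark_model
```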

### Build the LiteRT benchmark tool without KleidiAI (Baseline Comparison)

To compare the performance of the KleidiAI SME2 implementation against XNNPACK’s original implementation, you can build another version of the LiteRT benchmark tool without KleidiAI and SME2 enabled.

@@ -80,11 +103,18 @@ export XNNPACK_OPTIONS="--define tflite_with_xnnpack=true \
  --define=xnn_enable_arm_sme=false \
  --define=xnn_enable_arm_sme2=false \
  --define=xnn_enable_kleidiai=false"
```

Then rebuild:

```bash
bazel build -c opt --config=android_arm64 \
  ${XNNPACK_OPTIONS} "${BENCHMARK_TOOL_PATH}" \
  --repo_env=HERMETIC_PYTHON_VERSION=3.12
```

This build of `benchmark_model` disables all SME2 micro-kernels and forces fallback to XNNPACK’s NEON/SVE2 kernels.
You can then use ADB to push the benchmark tool to your Android device.

```bash
adb push bazel-bin/litert/tools/benchmark_model /data/local/tmp/
adb shell chmod +x /data/local/tmp/benchmark_model
```
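
Note that both configurations produce a binary at the same path, so pushing the second build overwrites the first on the device unless you rename it. The file name below is only illustrative:

```bash
# Illustrative only: keep both builds side by side on the device under different names.
adb push bazel-bin/litert/tools/benchmark_model /data/local/tmp/benchmark_model_nokleidiai
adb shell chmod +x /data/local/tmp/benchmark_model_nokleidiai
```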

content/learning-paths/mobile-graphics-and-gaming/litert-sme/3-buid-model.md

Lines changed: 33 additions & 7 deletions
@@ -8,11 +8,13 @@ layout: learningpathall

### KleidiAI SME2 support in LiteRT

LiteRT uses XNNPACK as its default CPU backend. KleidiAI micro-kernels are integrated through XNNPACK in LiteRT.
Only a subset of the KleidiAI SME and SME2 micro-kernels has been integrated into XNNPACK.
These micro-kernels support operators using the following data types and quantization configurations in the LiteRT model.
Other operators use XNNPACK’s default implementation during inference.

Fully connected:

| Activations | Weights | Output |
| ---------------------------- | --------------------------------------- | ---------------------------- |
| FP32 | FP32 | FP32 |
@@ -21,15 +23,16 @@ Other operators are using XNNPACK’s default implementation during the inferenc
| Asymmetric INT8 quantization | Per-channel symmetric INT8 quantization | Asymmetric INT8 quantization |
| FP32 | Per-channel symmetric INT4 quantization | FP32 |

Batch Matrix Multiply:

| Input A | Input B |
| ------- | --------------------------------------- |
| FP32 | FP32 |
| FP16 | FP16 |
| FP32 | Per-channel symmetric INT8 quantization |

Conv2D:

| Activations | Weights | Output |
| ---------------------------- | ----------------------------------------------------- | ---------------------------- |
| FP32 | FP32, pointwise (kernel size is 1) | FP32 |
@@ -38,15 +41,24 @@ Other operators are using XNNPACK’s default implementation during the inferenc
| Asymmetric INT8 quantization | Per-channel or per-tensor symmetric INT8 quantization | Asymmetric INT8 quantization |

TransposeConv:

| Activations | Weights | Output |
| ---------------------------- | ----------------------------------------------------- | ---------------------------- |
| Asymmetric INT8 quantization | Per-channel or per-tensor symmetric INT8 quantization | Asymmetric INT8 quantization |

### Create LiteRT models using Keras

To demonstrate SME2 acceleration on Android, you will construct simple single-layer models (for example, Fully Connected) using Keras and convert them into LiteRT (.tflite) format.
This allows you to benchmark isolated operators and directly observe SME2 improvements.

Install the tensorflow package dependency for your script:

```bash
sudo pip3 install tensorflow
```

Save the following example script in a file named `model.py`:

``` python
import tensorflow as tf
@@ -74,10 +86,24 @@ fc_fp32 = converter.convert()
save_litert_model(fc_fp32, "fc_fp32.tflite")
```

Now run the script:

```bash
python3 model.py
```

The model `fc_fp32.tflite` is created in FP32 format. As mentioned in the previous section, this operator can invoke the KleidiAI SME2 micro-kernel for acceleration.
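
If you want to confirm what the converter produced, you can inspect the model's input and output types with the TensorFlow Lite interpreter. This is an optional verification step, not part of the original flow:

```python
import tensorflow as tf

# Load the converted model and print its I/O dtypes; both should be float32.
interpreter = tf.lite.Interpreter(model_path="fc_fp32.tflite")
interpreter.allocate_tensors()
print(interpreter.get_input_details()[0]["dtype"])
print(interpreter.get_output_details()[0]["dtype"])
```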

You can then use ADB to push the model for benchmarking to your Android device.

```bash
adb push fc_fp32.tflite /data/local/tmp/
adb shell chmod +x /data/local/tmp/fc_fp32.tflite
```

You can also optimize this Keras model using post-training quantization to create a LiteRT model that suits your requirements.

### Post-training quantization options

* Post-training FP16 quantization
``` python
# Convert to model with FP16 weights and FP32 activations

content/learning-paths/mobile-graphics-and-gaming/litert-sme/4-benchmark.md

Lines changed: 34 additions & 16 deletions
@@ -5,33 +5,44 @@ weight: 4
### FIXED, DO NOT MODIFY
layout: learningpathall
---

Once you have:

* A LiteRT model (for example, `fc_fp32.tflite`), and
* The `benchmark_model` binary built with and without KleidiAI + SME2,

you can run benchmarks directly on an SME2-capable Android device.

### Verify SME2 Support on the Device

First, you should check if your Android phone supports SME2.
On the device (via `adb shell`), run:

``` bash
cat /proc/cpuinfo
```

You should see a `Features` line similar to:

```output
...
processor : 7
BogoMIPS : 2000.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti mte ecv afp mte3 sme smei8i32 smef16f32 smeb16f32 smef32f32 wfxt rprfm sme2 smei16i32 smebi32i32 hbc lrcpc3
```

As you can see from the `Features` entry, CPU 7 supports SME2.
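
If you prefer a quick scriptable check over reading the full `Features` line, a simple grep from your development machine works. This is an optional convenience, not part of the original steps:

```bash
# Prints the number of CPUs whose Features line advertises SME2 (0 means unsupported).
adb shell "grep -c sme2 /proc/cpuinfo"
```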

## Run benchmark_model on an SME2 Core

Next, run the benchmark tool and bind execution to a core that supports SME2.
In this example, you will pin to CPU 7, use a single thread, and run enough iterations to get stable timing.

``` bash
taskset 80 ./benchmark_model --graph=./fc_fp32.tflite --num_runs=1000 --num_threads=1 --use_cpu=true --use_profiler=true
```

This example uses the `taskset` command to pin the benchmark workload to core 7. It uses one thread (`--num_threads=1`), runs the inference at least 1000 times (`--num_runs=1000`), selects the CPU backend (`--use_cpu=true`), and passes `--use_profiler=true` to produce operator-level profiling during inference.
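
The `80` argument to `taskset` is a hexadecimal CPU affinity mask rather than a core index. If you need to target a different core, you can derive the mask from the core number, as sketched below:

```bash
# The mask is 1 << core_index, written in hex: core 7 -> 0x80, which taskset accepts as "80".
CORE=7
printf 'taskset mask for core %d: %x\n' "$CORE" $(( 1 << CORE ))
```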

You should see output similar to:

```output
...
INFO: [litert/runtime/accelerators/auto_registration.cc:148] CPU accelerator registered.
INFO: [litert/runtime/compiled_model.cc:415] Flatbuffer model initialized directly from incoming litert model.
INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
VERBOSE: Replacing 1 out of 1 node(s) with delegate (TfLiteXNNPackDelegate) node, yielding 1 partitions for subgraph 0.
INFO: The input model file size (MB): 3.27774
@@ -90,7 +101,7 @@ Memory (bytes): count=0
5 nodes observed
```

From the results above, you can see the time spent on model initialization, warm-up, and inference, as well as memory usage. Since the profiler was enabled, the output also reports the execution time of each operator.

Because the model contains only a single fully connected layer, the node type `Fully Connected (NC, PF32) GEMM` shows the average execution time is 0.382 ms, accounting for 93.171% of the total inference time.

@@ -100,13 +111,15 @@ To verify the KleidiAI SME2 micro-kernels are invoked for the Fully Connected op

## Measure the performance impact of KleidiAI SME2 micro-kernels

To compare the performance of the KleidiAI SME2 implementation with the original XNNPACK implementation, you can run the `benchmark_model` tool built without KleidiAI enabled.

Run with the same parameters:

``` bash
taskset 80 ./benchmark_model --graph=./fc_fp32.tflite --num_runs=1000 --num_threads=1 --use_cpu=true --use_profiler=true
```

The output should look like:

```output
...
INFO: [litert/runtime/accelerators/auto_registration.cc:148] CPU accelerator registered.
INFO: [litert/runtime/compiled_model.cc:415] Flatbuffer model initialized directly from incoming litert model.
@@ -169,6 +182,9 @@ Memory (bytes): count=0
5 nodes observed
```

From these benchmarking results, you should notice a significant throughput uplift and a speedup in inference time when the KleidiAI SME2 micro-kernels are enabled.

### Interpreting Node Type Names for KleidiAI

As you can see from the results, for the same model, the XNNPACK node type name is different. For the non-KleidiAI implementation, the node type is `Fully Connected (NC, F32) GEMM`, whereas for the KleidiAI implementation, it is `Fully Connected (NC, PF32) GEMM`.

For other operators supported by KleidiAI, the per-operator profiling node types differ between the implementations with and without KleidiAI enabled in XNNPACK, as follows:
@@ -188,10 +204,12 @@ For other operators supported by KleidiAI, the per-operator profiling node types
| Conv2D | Convolution (NHWC, PQS8, QS8, QC8W) | Convolution (NHWC, QC8) |
| TransposeConv | Deconvolution (NHWC, PQS8, QS8, QC8W) | Deconvolution (NC, QS8, QC8W) |

The letter “P” in the node type indicates that it corresponds to a KleidiAI implementation.

For example, `Convolution (NHWC, PQS8, QS8, QC8W)` represents a Conv2D operator computed by a KleidiAI micro-kernel, where the tensor is in NHWC layout.

* The input is packed INT8 quantized,
* The weights are per-channel INT8 quantized,
* The output is INT8 quantized.

By comparing `benchmark_model` runs with and without KleidiAI + SME2, and inspecting the profiled node types (PF32, PF16, QP8, PQS8), you can reliably confirm that LiteRT is dispatching to SME2-optimized KleidiAI micro-kernels and quantify their performance impact on your Android device.
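
A practical way to check which implementation a run used is to capture the benchmark output to a file and filter the profiled node types. The log file name below is only illustrative:

```bash
# Illustrative only: save a run's output, then look for the fully connected node type.
taskset 80 ./benchmark_model --graph=./fc_fp32.tflite --num_runs=1000 --num_threads=1 \
  --use_cpu=true --use_profiler=true > benchmark.log 2>&1
grep "Fully Connected" benchmark.log   # "(NC, PF32)" indicates the KleidiAI SME2 path
```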

content/learning-paths/mobile-graphics-and-gaming/litert-sme/_index.md

Lines changed: 4 additions & 3 deletions
@@ -7,7 +7,7 @@ cascade:

minutes_to_complete: 30

who_is_this_for: This is an advanced topic for developers looking to leverage Arm's Scalable Matrix Extension Version 2 (SME2) instructions to accelerate LiteRT model inference on Android.

learning_objectives:
- Understand how KleidiAI works in LiteRT.
@@ -17,8 +17,8 @@ learning_objectives:


prerequisites:
- An x86_64 Linux development machine.
- An Android device that supports the Arm SME2 architecture features - see the list of [devices with SME2 support](/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started/#devices)

author: Jiaming Guo

@@ -27,6 +27,7 @@ skilllevels: Advanced
subjects: ML
armips:
- Cortex-A
- Cortex-X
tools_software_languages:
- C
- Python
