Merge pull request #2240 from pareenaverma/content_review

pareenaverma · web-flow · commit 5cb308b8ae52 · 2025-08-19T13:45:57.000-04:00
Tech review of Azure Spark LP
diff --git a/content/learning-paths/servers-and-cloud-computing/spark-on-azure/_index.md b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/_index.md
@@ -1,22 +1,26 @@
 ---
 title: Run Spark applications on the Microsoft Azure Cobalt 100 processors
 
+draft: true
+cascade:
+    draft: true
+
 minutes_to_complete: 60
 
-who_is_this_for: This Learning Path introduces Spark deployment on Microsoft Azure Cobalt 100 (Arm-based) virtual machines. It is designed for developers migrating Spark applications from x86_64 to Arm with minimal or no changes.
+who_is_this_for: This is an advanced topic that introduces Spark deployment on Microsoft Azure Cobalt 100 (Arm-based) virtual machines. It is designed for developers migrating Spark applications from x86_64 to Arm.
 
 learning_objectives: 
-    - Provision an Azure Arm64 virtual machine using Azure console, with Ubuntu as the base image.
+    - Provision an Azure Arm64 virtual machine using the Azure console.
     - Learn how to create an Azure Linux 3.0 Docker container.
-    - Deploy a Spark application inside an Azure Linux 3.0 Arm64-based Docker container and an Azure Linux 3.0 custom-image based Azure virtual machine.
-    - Perform Spark benchmarking inside the container as well as the custom virtual machine.
+    - Deploy a Spark application inside an Azure Linux 3.0 Arm64-based Docker container or an Azure Linux 3.0 custom-image based Azure virtual machine.
+    - Run a suite of Spark benchmarks to understand and evaluate performance on the Azure Cobalt 100 virtual machine.
 
 prerequisites:
     - A [Microsoft Azure](https://azure.microsoft.com/) account with access to Cobalt 100 based instances (Dpsv6).
     - A machine with [Docker](/install-guides/docker/) installed.
     - Familiarity with distributed computing concepts and the [Apache Spark architecture](https://spark.apache.org/docs/latest/).
 
-author: Jason Andrews
+author: Pareena Verma
 
 ### Tags
 skilllevels: Advanced
diff --git a/content/learning-paths/servers-and-cloud-computing/spark-on-azure/background.md b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/background.md
@@ -1,14 +1,14 @@
 ---
-title: "About Cobalt 100 Arm-based processor and Apache Spark"
+title: "Overview"
 
 weight: 2
 
 layout: "learningpathall"
 ---
 
-## What is Cobalt 100 Arm-based processor?
+## What is the Azure Cobalt 100 processor?
 
-Azure’s Cobalt 100 is built on Microsoft's first-generation, in-house Arm-based processor: the Cobalt 100. Designed entirely by Microsoft and based on Arm’s Neoverse N2 architecture, this 64-bit CPU delivers improved performance and energy efficiency across a broad spectrum of cloud-native, scale-out Linux workloads. These include web and application servers, data analytics, open-source databases, caching systems, and more. Running at 3.4 GHz, the Cobalt 100 processor allocates a dedicated physical core for each vCPU, ensuring consistent and predictable performance.
+Azure’s Cobalt 100 is built on Microsoft's first-generation Arm-based processor: the Cobalt 100. Designed entirely by Microsoft and based on Arm’s Neoverse-N2 architecture, this 64-bit CPU delivers improved performance and energy efficiency across a broad spectrum of cloud-native, scale-out Linux workloads. These include web and application servers, data analytics, open-source databases, caching systems, and more. Running at 3.4 GHz, the Cobalt 100 processor allocates a dedicated physical core for each vCPU, ensuring consistent and predictable performance.
 
 To learn more about Cobalt 100, refer to the blog [Announcing the preview of new Azure virtual machine based on the Azure Cobalt 100 processor](https://techcommunity.microsoft.com/blog/azurecompute/announcing-the-preview-of-new-azure-vms-based-on-the-azure-cobalt-100-processor/4146353).
 
diff --git a/content/learning-paths/servers-and-cloud-computing/spark-on-azure/baseline.md b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/baseline.md
@@ -1,16 +1,16 @@
 ---
-title: Baseline Testing
+title: Functional Validation
 weight: 6
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
 
-## Baseline Testing
+## Functional Validation
 Since Apache Spark is installed successfully on your Arm virtual machine, let's now perform simple baseline testing to validate that Spark runs correctly and gives expected output.
 
-Run a simple PySpark script, create a file named `test_spark.py`, and add the below content to it:
+Using a file editor of your choice, create a file named `test_spark.py`, and add the below content to it:
 
 ```python
 from pyspark.sql import SparkSession
diff --git a/content/learning-paths/servers-and-cloud-computing/spark-on-azure/benchmarking.md b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/benchmarking.md
@@ -1,5 +1,5 @@
 ---
-title: Spark Internal Benchmarking
+title: Benchmark Spark
 weight: 7
 
 ### FIXED, DO NOT MODIFY
@@ -32,9 +32,9 @@ This compiles Spark and its dependencies, enabling the benchmarks build profile
 ```console
 ./build/sbt -Pbenchmarks "sql/test:runMain org.apache.spark.sql.execution.benchmark.JoinBenchmark"
 ```
-This executes the **JoinBenchmark**, which measures the performance of various SQL join operations (e.g., SortMergeJoin, BroadcastHashJoin) under different query plans. It helps evaluate how Spark SQL optimizes and executes join strategies, especially with and without WholeStageCodegen, a technique that compiles entire query stages into efficient bytecode for faster execution.
+This executes the `JoinBenchmark`, which measures the performance of various SQL join operations (e.g., SortMergeJoin, BroadcastHashJoin) under different query plans. It helps evaluate how Spark SQL optimizes and executes join strategies, especially with and without WholeStageCodegen, a technique that compiles entire query stages into efficient bytecode for faster execution.
 
-You should see an output similar to:
+The output should look similar to:
 ```output
 [info] Running benchmark: Join w long
 [info]   Running case: Join w long wholestage off
@@ -183,36 +183,9 @@ You should see an output similar to:
 Benchmarking was performed in both an Azure Linux 3.0 Docker container and an Azure Linux 3.0 virtual machine. The benchmark results were found to be comparable.
 {{% /notice %}}
 
-Accordingly, this Learning path includes benchmark results from virtual machines only, for both x86 and Arm64 platforms.
-### Benchmark summary on x86_64:
-The following benchmark results are collected on an x86_64 **D4s_v4 Azure virtual machine using the Azure Linux 3.0 image published by Ntegral Inc**.
-| Benchmark                                | Wholestage | Best Time (ms) | Avg Time (ms) | Stdev (ms) | Rate (M/s) | Per Row (ns) | Relative |
-|------------------------------------------|------------|----------------|----------------|------------|-------------|----------------|----------|
-| Join w long                               | Off        | 3168           | 3185           | 24         | 6.6         | 151.1          | 1.0X     |
-|                                           | On         | 1509           | 1562           | 61         | 13.9        | 72.0           | 2.1X     |
-| Join w long duplicated                    | Off        | 2490           | 2504           | 20         | 8.4         | 118.7          | 1.0X     |
-|                                           | On         | 1151           | 1181           | 27         | 18.2        | 54.9           | 2.2X     |
-| Join w 2 ints                             | Off        | 217074         | 219364         | 3239       | 0.1         | 10350.9        | 1.0X     |
-|                                           | On         | 119692         | 119756         | 74         | 0.2         | 5707.4         | 1.8X     |
-| Join w 2 longs                            | Off        | 4367           | 4401           | 49         | 4.8         | 208.2          | 1.0X     |
-|                                           | On         | 2952           | 3003           | 35         | 7.1         | 140.8          | 1.5X     |
-| Join w 2 longs duplicated                 | Off        | 10255          | 10286          | 45         | 2.0         | 489.0          | 1.0X     |
-|                                           | On         | 7243           | 7300           | 36         | 2.9         | 345.4          | 1.4X     |
-| Outer join w long                         | Off        | 2401           | 2422           | 30         | 8.7         | 114.5          | 1.0X     |
-|                                           | On         | 1544           | 1564           | 17         | 13.6        | 73.6           | 1.6X     |
-| Semi join w long                          | Off        | 1344           | 1350           | 8          | 15.6        | 64.1           | 1.0X     |
-|                                           | On         | 673            | 685            | 12         | 31.2        | 32.1           | 2.0X     |
-| Sort merge join                           | Off        | 1144           | 1145           | 1          | 1.8         | 545.6          | 1.0X     |
-|                                           | On         | 1177           | 1228           | 46         | 1.8         | 561.4          | 1.0X     |
-| Sort merge join w/ duplicates             | Off        | 2075           | 2113           | 55         | 1.0         | 989.4          | 1.0X     |
-|                                           | On         | 1704           | 1720           | 14         | 1.2         | 812.3          | 1.2X     |
-| Shuffle hash join                         | Off        | 672            | 674            | 2          | 6.2         | 160.3          | 1.0X     |
-|                                           | On         | 524            | 525            | 1          | 8.0         | 124.9          | 1.3X     |
-| Broadcast nested loop join               | Off        | 36060          | 36103          | 62         | 0.6         | 1719.5         | 1.0X     |
-|                                           | On         | 31254          | 31346          | 78         | 0.7         | 1490.3         | 1.2X     |
 
 ### Benchmark summary on Arm64:
-The following benchmark results were collected on an Arm64 **D4ps_v6 Azure virtual machine created from a custom Azure Linux 3.0 image using the AArch64 ISO**. 
+For easier comparison, shown here is a summary of benchmark results collected on an Arm64 `D4ps_v6` Azure virtual machine created from a custom Azure Linux 3.0 image using the AArch64 ISO.
 | Benchmark                              | Wholestage | Best Time (ms) | Avg Time (ms) | Stdev (ms) | Rate (M/s) | Per Row (ns) | Relative |
 |----------------------------------------|------------|----------------|----------------|------------|-------------|----------------|----------|
 | Join w long                             | Off        | 2345           | 2649           | 429        | 8.9         | 111.8          | 1.0X     |
@@ -238,10 +211,41 @@ The following benchmark results were collected on an Arm64 **D4ps_v6 Azure virtu
 | Broadcast nested loop join             | Off        | 26847          | 26870          | 32         | 0.8         | 1280.2         | 1.0X     |
 |                                        | On         | 18857          | 18928          | 84         | 1.1         | 899.2          | 1.4X     |
 
-### **Highlights from Azure Linux Arm64 virtual machine**
+### Benchmark summary on x86_64:
+Shown here is a summary of the benchmark results collected on an `x86_64` `D4s_v4` Azure virtual machine using the Azure Linux 3.0 image published by Ntegral Inc.
+| Benchmark                                | Wholestage | Best Time (ms) | Avg Time (ms) | Stdev (ms) | Rate (M/s) | Per Row (ns) | Relative |
+|------------------------------------------|------------|----------------|----------------|------------|-------------|----------------|----------|
+| Join w long                               | Off        | 3168           | 3185           | 24         | 6.6         | 151.1          | 1.0X     |
+|                                           | On         | 1509           | 1562           | 61         | 13.9        | 72.0           | 2.1X     |
+| Join w long duplicated                    | Off        | 2490           | 2504           | 20         | 8.4         | 118.7          | 1.0X     |
+|                                           | On         | 1151           | 1181           | 27         | 18.2        | 54.9           | 2.2X     |
+| Join w 2 ints                             | Off        | 217074         | 219364         | 3239       | 0.1         | 10350.9        | 1.0X     |
+|                                           | On         | 119692         | 119756         | 74         | 0.2         | 5707.4         | 1.8X     |
+| Join w 2 longs                            | Off        | 4367           | 4401           | 49         | 4.8         | 208.2          | 1.0X     |
+|                                           | On         | 2952           | 3003           | 35         | 7.1         | 140.8          | 1.5X     |
+| Join w 2 longs duplicated                 | Off        | 10255          | 10286          | 45         | 2.0         | 489.0          | 1.0X     |
+|                                           | On         | 7243           | 7300           | 36         | 2.9         | 345.4          | 1.4X     |
+| Outer join w long                         | Off        | 2401           | 2422           | 30         | 8.7         | 114.5          | 1.0X     |
+|                                           | On         | 1544           | 1564           | 17         | 13.6        | 73.6           | 1.6X     |
+| Semi join w long                          | Off        | 1344           | 1350           | 8          | 15.6        | 64.1           | 1.0X     |
+|                                           | On         | 673            | 685            | 12         | 31.2        | 32.1           | 2.0X     |
+| Sort merge join                           | Off        | 1144           | 1145           | 1          | 1.8         | 545.6          | 1.0X     |
+|                                           | On         | 1177           | 1228           | 46         | 1.8         | 561.4          | 1.0X     |
+| Sort merge join w/ duplicates             | Off        | 2075           | 2113           | 55         | 1.0         | 989.4          | 1.0X     |
+|                                           | On         | 1704           | 1720           | 14         | 1.2         | 812.3          | 1.2X     |
+| Shuffle hash join                         | Off        | 672            | 674            | 2          | 6.2         | 160.3          | 1.0X     |
+|                                           | On         | 524            | 525            | 1          | 8.0         | 124.9          | 1.3X     |
+| Broadcast nested loop join               | Off        | 36060          | 36103          | 62         | 0.6         | 1719.5         | 1.0X     |
+|                                           | On         | 31254          | 31346          | 78         | 0.7         | 1490.3         | 1.2X     |
+
+
+### Benchmark comparison insights
+
+When you compare the benchmark results, you will notice that on the Azure Linux Arm64 virtual machine:
+
+- Whole-stage codegen improves performance by up to 2.8× on complex joins (e.g., with long columns).
+- Simple joins (e.g., on integers) show negligible performance gain, remaining comparable to performance on `x86_64`.
+- Broadcast and shuffle-based joins see 1.4× to 1.5× improvements.
+- Overall, enabling whole-stage codegen consistently improves performance across most join types.
 
-- **Whole-stage codegen improves performance by up to 2.8×** on complex joins (e.g., with long columns).
-- **Simple joins (e.g., on integers)** show **negligible performance gain**, remaining close to 1.0×.
-- **Broadcast and shuffle-based joins**benefit moderately, with **1.4× to 1.5× improvements**.
-- **Overall**, enabling whole-stage codegen consistently improves performance across most join types.
-- **Benchmark results were consistent**across both Docker and virtual machine on Arm64.
+You have successfully learnt how to deploy Apache Spark on an Azure Cobalt 100 virtual machine and measure the performance uplift. 
diff --git a/content/learning-paths/servers-and-cloud-computing/spark-on-azure/container-setup.md b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/container-setup.md
@@ -7,28 +7,28 @@ layout: learningpathall
 ---
 
 
-You have an option to choose between working with the Azure Linux 3.0 Docker image or inside the virtual machine created with the OS image.
+You can choose between deploying your Spark workload either in an Azure Linux 3.0 Docker container or on a virtual machine created from a custom Azure Linux 3.0 image.
 
 ### Working inside Azure Linux 3.0 Docker container
-The Azure Linux Container Host is an operating system image that's optimized for running container workloads on Azure Kubernetes Service (AKS). Microsoft maintains the Azure Linux Container Host and based it on CBL-Mariner, an open-source Linux distribution created by Microsoft. To know more about Azure Linux 3.0, kindly refer [What is Azure Linux Container Host for AKS](https://learn.microsoft.com/en-us/azure/azure-linux/intro-azure-linux).
+The Azure Linux Container Host is an operating system image that's optimized for running container workloads on Azure Kubernetes Service (AKS). Microsoft maintains the Azure Linux Container Host and based it on CBL-Mariner, an open-source Linux distribution created by Microsoft. To learn more about Azure Linux 3.0, refer to [What is Azure Linux Container Host for AKS](https://learn.microsoft.com/en-us/azure/azure-linux/intro-azure-linux).
  
-Azure Linux 3.0 offers support for AArch64. However, the standalone virtual machine image for Azure Linux 3.0 or CBL Mariner 3.0 is not available for Arm. Hence, to use the default software stack provided by the Microsoft team, you can create a docker container with Azure Linux 3.0 as a base image, and run the Spark application inside the container.
+Azure Linux 3.0 offers support for AArch64. However, the standalone virtual machine image for Azure Linux 3.0 or CBL Mariner 3.0 is not available for Arm. To use the default software stack provided by Microsoft, you can run a Docker container with Azure Linux 3.0 as a base image, and run the Spark application inside the container.
 
-#### Create Azure Linux 3.0 Docker Container
+#### Option 1: Run an Azure Linux 3.0 Docker Container
 The [Microsoft Artifact Registry](https://mcr.microsoft.com/en-us/artifact/mar/azurelinux/base/core/about) offers updated docker image for the Azure Linux 3.0.
 
-To create a docker container, install docker, and then follow the below instructions:
+To run a Docker container with Azure Linux 3.0, install [Docker](/install-guides/docker/docker-engine/), and then run the following command:
 
 ```console
 sudo docker run -it --rm mcr.microsoft.com/azurelinux/base/core:3.0
 ```
-The default container startup command is bash. tdnf and dnf are the default package managers.
+By default, the container starts with a bash shell. `tdnf` and `dnf` are the default package managers available in the container.
 
-### Working with Azure Linux 3.0 OS image
-As of now, the Azure Marketplace offers official virtual machine images of Azure Linux 3.0 only for x64-based architectures, published by Ntegral Inc. However, native Arm64 (AArch64) images are not yet officially available. Hence, for this Learning Path, you can create your own custom Azure Linux 3.0 virtual machine image for AArch64 using the [AArch64 ISO for Azure Linux 3.0](https://github.com/microsoft/azurelinux#iso).
+### Option 2: Create a virtual machine instance with Azure Linux 3.0 OS image
+As of now, the Azure Marketplace offers official virtual machine images of Azure Linux 3.0 only for `x86_64`-based architectures, published by Ntegral Inc. While native Arm64 (AArch64) images are not yet officially available, you can create your own custom Azure Linux 3.0 virtual machine image for AArch64 using the [AArch64 ISO for Azure Linux 3.0](https://github.com/microsoft/azurelinux#iso).
 
-Refer [Create an Azure Linux 3.0 virtual machine with Cobalt 100 processors](https://learn.arm.com/learning-paths/servers-and-cloud-computing/azure-vm) for the details.
+Refer to [Create an Azure Linux 3.0 virtual machine with Cobalt 100 processors](/learning-paths/servers-and-cloud-computing/azure-vm) for the detailed steps.
 
-Whether you're using an Azure Linux 3.0 Docker container, or a virtual machine created from a custom Azure Linux 3.0 image, the deployment and benchmarking steps remain the same.
+Whether you choose to use an Azure Linux 3.0 Docker container, or a virtual machine created from a custom Azure Linux 3.0 image, the Spark deployment and benchmarking steps in the following sections will remain the same.
 
-Once the setup has been established, you can proceed with the Spark Installation ahead.
+Once the setup is complete, you can proceed with installing and running Spark in the next section.
diff --git a/content/learning-paths/servers-and-cloud-computing/spark-on-azure/create-instance.md b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/create-instance.md
diff --git a/content/learning-paths/servers-and-cloud-computing/spark-on-azure/deploy.md b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/deploy.md