Commit 5cb308b

Merge pull request #2240 from pareenaverma/content_review
Tech review of Azure Spark LP
2 parents 2441f72 + f834d8d commit 5cb308b

7 files changed: +98 -73 lines changed

content/learning-paths/servers-and-cloud-computing/spark-on-azure/_index.md

Lines changed: 9 additions & 5 deletions
@@ -1,22 +1,26 @@
 ---
 title: Run Spark applications on the Microsoft Azure Cobalt 100 processors

+draft: true
+cascade:
+    draft: true
+
 minutes_to_complete: 60

-who_is_this_for: This Learning Path introduces Spark deployment on Microsoft Azure Cobalt 100 (Arm-based) virtual machines. It is designed for developers migrating Spark applications from x86_64 to Arm with minimal or no changes.
+who_is_this_for: This is an advanced topic that introduces Spark deployment on Microsoft Azure Cobalt 100 (Arm-based) virtual machines. It is designed for developers migrating Spark applications from x86_64 to Arm.

 learning_objectives:
-- Provision an Azure Arm64 virtual machine using Azure console, with Ubuntu as the base image.
+- Provision an Azure Arm64 virtual machine using Azure console.
 - Learn how to create an Azure Linux 3.0 Docker container.
-- Deploy a Spark application inside an Azure Linux 3.0 Arm64-based Docker container and an Azure Linux 3.0 custom-image based Azure virtual machine.
-- Perform Spark benchmarking inside the container as well as the custom virtual machine.
+- Deploy a Spark application inside an Azure Linux 3.0 Arm64-based Docker container or an Azure Linux 3.0 custom-image based Azure virtual machine.
+- Run a suite of Spark benchmarks to understand and evaluate performance on the Azure Cobalt 100 virtual machine.

 prerequisites:
 - A [Microsoft Azure](https://azure.microsoft.com/) account with access to Cobalt 100 based instances (Dpsv6).
 - A machine with [Docker](/install-guides/docker/) installed.
 - Familiarity with distributed computing concepts and the [Apache Spark architecture](https://spark.apache.org/docs/latest/).

-author: Jason Andrews
+author: Pareena Verma

 ### Tags
 skilllevels: Advanced

content/learning-paths/servers-and-cloud-computing/spark-on-azure/background.md

Lines changed: 3 additions & 3 deletions
@@ -1,14 +1,14 @@
 ---
-title: "About Cobalt 100 Arm-based processor and Apache Spark"
+title: "Overview"

 weight: 2

 layout: "learningpathall"
 ---

-## What is Cobalt 100 Arm-based processor?
+## What is the Azure Cobalt 100 processor?

-Azure’s Cobalt 100 is built on Microsoft's first-generation, in-house Arm-based processor: the Cobalt 100. Designed entirely by Microsoft and based on Arm’s Neoverse N2 architecture, this 64-bit CPU delivers improved performance and energy efficiency across a broad spectrum of cloud-native, scale-out Linux workloads. These include web and application servers, data analytics, open-source databases, caching systems, and more. Running at 3.4 GHz, the Cobalt 100 processor allocates a dedicated physical core for each vCPU, ensuring consistent and predictable performance.
+Azure’s Cobalt 100 is built on Microsoft's first-generation Arm-based processor: the Cobalt 100. Designed entirely by Microsoft and based on Arm’s Neoverse-N2 architecture, this 64-bit CPU delivers improved performance and energy efficiency across a broad spectrum of cloud-native, scale-out Linux workloads. These include web and application servers, data analytics, open-source databases, caching systems, and more. Running at 3.4 GHz, the Cobalt 100 processor allocates a dedicated physical core for each vCPU, ensuring consistent and predictable performance.

 To learn more about Cobalt 100, refer to the blog [Announcing the preview of new Azure virtual machine based on the Azure Cobalt 100 processor](https://techcommunity.microsoft.com/blog/azurecompute/announcing-the-preview-of-new-azure-vms-based-on-the-azure-cobalt-100-processor/4146353).

content/learning-paths/servers-and-cloud-computing/spark-on-azure/baseline.md

Lines changed: 3 additions & 3 deletions
@@ -1,16 +1,16 @@
 ---
-title: Baseline Testing
+title: Functional Validation
 weight: 6

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---


-## Baseline Testing
+## Functional Validation
 Since Apache Spark is installed successfully on your Arm virtual machine, let's now perform simple baseline testing to validate that Spark runs correctly and gives expected output.

-Run a simple PySpark script, create a file named `test_spark.py`, and add the below content to it:
+Using a file editor of your choice, create a file named `test_spark.py`, and add the below content to it:

 ```python
 from pyspark.sql import SparkSession
content/learning-paths/servers-and-cloud-computing/spark-on-azure/benchmarking.md

Lines changed: 41 additions & 37 deletions
@@ -1,5 +1,5 @@
 ---
-title: Spark Internal Benchmarking
+title: Benchmark Spark
 weight: 7

 ### FIXED, DO NOT MODIFY
@@ -32,9 +32,9 @@ This compiles Spark and its dependencies, enabling the benchmarks build profile
 ```console
 ./build/sbt -Pbenchmarks "sql/test:runMain org.apache.spark.sql.execution.benchmark.JoinBenchmark"
 ```
-This executes the **JoinBenchmark**, which measures the performance of various SQL join operations (e.g., SortMergeJoin, BroadcastHashJoin) under different query plans. It helps evaluate how Spark SQL optimizes and executes join strategies, especially with and without WholeStageCodegen, a technique that compiles entire query stages into efficient bytecode for faster execution.
+This executes the `JoinBenchmark`, which measures the performance of various SQL join operations (e.g., SortMergeJoin, BroadcastHashJoin) under different query plans. It helps evaluate how Spark SQL optimizes and executes join strategies, especially with and without WholeStageCodegen, a technique that compiles entire query stages into efficient bytecode for faster execution.

-You should see an output similar to:
+The output should look similar to:
 ```output
 [info] Running benchmark: Join w long
 [info] Running case: Join w long wholestage off
@@ -183,36 +183,9 @@ You should see an output similar to:
 Benchmarking was performed in both an Azure Linux 3.0 Docker container and an Azure Linux 3.0 virtual machine. The benchmark results were found to be comparable.
 {{% /notice %}}

-Accordingly, this Learning path includes benchmark results from virtual machines only, for both x86 and Arm64 platforms.
-### Benchmark summary on x86_64:
-The following benchmark results are collected on an x86_64 **D4s_v4 Azure virtual machine using the Azure Linux 3.0 image published by Ntegral Inc**.
-| Benchmark | Wholestage | Best Time (ms) | Avg Time (ms) | Stdev (ms) | Rate (M/s) | Per Row (ns) | Relative |
-|------------------------------------------|------------|----------------|----------------|------------|-------------|----------------|----------|
-| Join w long | Off | 3168 | 3185 | 24 | 6.6 | 151.1 | 1.0X |
-| | On | 1509 | 1562 | 61 | 13.9 | 72.0 | 2.1X |
-| Join w long duplicated | Off | 2490 | 2504 | 20 | 8.4 | 118.7 | 1.0X |
-| | On | 1151 | 1181 | 27 | 18.2 | 54.9 | 2.2X |
-| Join w 2 ints | Off | 217074 | 219364 | 3239 | 0.1 | 10350.9 | 1.0X |
-| | On | 119692 | 119756 | 74 | 0.2 | 5707.4 | 1.8X |
-| Join w 2 longs | Off | 4367 | 4401 | 49 | 4.8 | 208.2 | 1.0X |
-| | On | 2952 | 3003 | 35 | 7.1 | 140.8 | 1.5X |
-| Join w 2 longs duplicated | Off | 10255 | 10286 | 45 | 2.0 | 489.0 | 1.0X |
-| | On | 7243 | 7300 | 36 | 2.9 | 345.4 | 1.4X |
-| Outer join w long | Off | 2401 | 2422 | 30 | 8.7 | 114.5 | 1.0X |
-| | On | 1544 | 1564 | 17 | 13.6 | 73.6 | 1.6X |
-| Semi join w long | Off | 1344 | 1350 | 8 | 15.6 | 64.1 | 1.0X |
-| | On | 673 | 685 | 12 | 31.2 | 32.1 | 2.0X |
-| Sort merge join | Off | 1144 | 1145 | 1 | 1.8 | 545.6 | 1.0X |
-| | On | 1177 | 1228 | 46 | 1.8 | 561.4 | 1.0X |
-| Sort merge join w/ duplicates | Off | 2075 | 2113 | 55 | 1.0 | 989.4 | 1.0X |
-| | On | 1704 | 1720 | 14 | 1.2 | 812.3 | 1.2X |
-| Shuffle hash join | Off | 672 | 674 | 2 | 6.2 | 160.3 | 1.0X |
-| | On | 524 | 525 | 1 | 8.0 | 124.9 | 1.3X |
-| Broadcast nested loop join | Off | 36060 | 36103 | 62 | 0.6 | 1719.5 | 1.0X |
-| | On | 31254 | 31346 | 78 | 0.7 | 1490.3 | 1.2X |

 ### Benchmark summary on Arm64:
-The following benchmark results were collected on an Arm64 **D4ps_v6 Azure virtual machine created from a custom Azure Linux 3.0 image using the AArch64 ISO**.
+For easier comparison, shown here is a summary of benchmark results collected on an Arm64 `D4ps_v6` Azure virtual machine created from a custom Azure Linux 3.0 image using the AArch64 ISO.
 | Benchmark | Wholestage | Best Time (ms) | Avg Time (ms) | Stdev (ms) | Rate (M/s) | Per Row (ns) | Relative |
 |----------------------------------------|------------|----------------|----------------|------------|-------------|----------------|----------|
 | Join w long | Off | 2345 | 2649 | 429 | 8.9 | 111.8 | 1.0X |
@@ -238,10 +211,41 @@ The following benchmark results were collected on an Arm64 **D4ps_v6 Azure virtu
 | Broadcast nested loop join | Off | 26847 | 26870 | 32 | 0.8 | 1280.2 | 1.0X |
 | | On | 18857 | 18928 | 84 | 1.1 | 899.2 | 1.4X |

-### **Highlights from Azure Linux Arm64 virtual machine**
+### Benchmark summary on x86_64:
+Shown here is a summary of the benchmark results collected on an `x86_64` `D4s_v4` Azure virtual machine using the Azure Linux 3.0 image published by Ntegral Inc.
+| Benchmark | Wholestage | Best Time (ms) | Avg Time (ms) | Stdev (ms) | Rate (M/s) | Per Row (ns) | Relative |
+|------------------------------------------|------------|----------------|----------------|------------|-------------|----------------|----------|
+| Join w long | Off | 3168 | 3185 | 24 | 6.6 | 151.1 | 1.0X |
+| | On | 1509 | 1562 | 61 | 13.9 | 72.0 | 2.1X |
+| Join w long duplicated | Off | 2490 | 2504 | 20 | 8.4 | 118.7 | 1.0X |
+| | On | 1151 | 1181 | 27 | 18.2 | 54.9 | 2.2X |
+| Join w 2 ints | Off | 217074 | 219364 | 3239 | 0.1 | 10350.9 | 1.0X |
+| | On | 119692 | 119756 | 74 | 0.2 | 5707.4 | 1.8X |
+| Join w 2 longs | Off | 4367 | 4401 | 49 | 4.8 | 208.2 | 1.0X |
+| | On | 2952 | 3003 | 35 | 7.1 | 140.8 | 1.5X |
+| Join w 2 longs duplicated | Off | 10255 | 10286 | 45 | 2.0 | 489.0 | 1.0X |
+| | On | 7243 | 7300 | 36 | 2.9 | 345.4 | 1.4X |
+| Outer join w long | Off | 2401 | 2422 | 30 | 8.7 | 114.5 | 1.0X |
+| | On | 1544 | 1564 | 17 | 13.6 | 73.6 | 1.6X |
+| Semi join w long | Off | 1344 | 1350 | 8 | 15.6 | 64.1 | 1.0X |
+| | On | 673 | 685 | 12 | 31.2 | 32.1 | 2.0X |
+| Sort merge join | Off | 1144 | 1145 | 1 | 1.8 | 545.6 | 1.0X |
+| | On | 1177 | 1228 | 46 | 1.8 | 561.4 | 1.0X |
+| Sort merge join w/ duplicates | Off | 2075 | 2113 | 55 | 1.0 | 989.4 | 1.0X |
+| | On | 1704 | 1720 | 14 | 1.2 | 812.3 | 1.2X |
+| Shuffle hash join | Off | 672 | 674 | 2 | 6.2 | 160.3 | 1.0X |
+| | On | 524 | 525 | 1 | 8.0 | 124.9 | 1.3X |
+| Broadcast nested loop join | Off | 36060 | 36103 | 62 | 0.6 | 1719.5 | 1.0X |
+| | On | 31254 | 31346 | 78 | 0.7 | 1490.3 | 1.2X |
+
+
+### Benchmark comparison insights
+
+When you compare the benchmark results you will notice that on the Azure Linux Arm64 virtual machine:
+
+- Whole-stage codegen improves performance by up to 2.8× on complex joins (e.g., with long columns).
+- Simple joins (e.g., on integers) show negligible performance gain, remains comparable to performance on `x86_64`.
+- Broadcast and shuffle-based joins benefit with 1.4× to 1.5× improvements.
+- Overall enabling whole-stage codegen consistently improves performance across most join types.

-- **Whole-stage codegen improves performance by up to 2.8×** on complex joins (e.g., with long columns).
-- **Simple joins (e.g., on integers)** show **negligible performance gain**, remaining close to 1.0×.
-- **Broadcast and shuffle-based joins**benefit moderately, with **1.4× to 1.5× improvements**.
-- **Overall**, enabling whole-stage codegen consistently improves performance across most join types.
-- **Benchmark results were consistent**across both Docker and virtual machine on Arm64.
+You have successfully learnt how to deploy Apache Spark on an Azure Cobalt 100 virtual machine and measure the performance uplift.
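
Review note: as a quick sanity check on the tables in this file, the sketch below recomputes the `Relative` column for the two "Broadcast nested loop join" rows and the x86_64-to-Arm64 best-time ratio. It assumes `Relative` is the wholestage-off best time divided by the row's best time, which matches the values shown but is not stated in the diff. The helper names `relative` and `cross_ratio` are hypothetical. The two VMs belong to different series (`D4s_v4` and `D4ps_v6`), so the cross-platform ratios are indicative only.

```python
# Best times (ms) copied from the "Broadcast nested loop join" rows above.
BEST_MS = {
    "arm64": {"off": 26847, "on": 18857},   # D4ps_v6
    "x86_64": {"off": 36060, "on": 31254},  # D4s_v4
}


def relative(off_ms: float, on_ms: float) -> float:
    """Speedup of wholestage-on over wholestage-off (assumed definition)."""
    return round(off_ms / on_ms, 1)


def cross_ratio(x86_ms: float, arm_ms: float) -> float:
    """How many times longer the x86_64 run took than the Arm64 run."""
    return round(x86_ms / arm_ms, 2)


arm = BEST_MS["arm64"]
x86 = BEST_MS["x86_64"]

print("Arm64 relative:", relative(arm["off"], arm["on"]))    # 1.4, as in the table
print("x86_64 relative:", relative(x86["off"], x86["on"]))   # 1.2, as in the table
print("x86_64/Arm64, wholestage off:", cross_ratio(x86["off"], arm["off"]))
print("x86_64/Arm64, wholestage on:", cross_ratio(x86["on"], arm["on"]))
```

The same two helpers can be applied to any other pair of rows by swapping in their best-time values.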

content/learning-paths/servers-and-cloud-computing/spark-on-azure/container-setup.md

Lines changed: 11 additions & 11 deletions
@@ -7,28 +7,28 @@ layout: learningpathall
 ---


-You have an option to choose between working with the Azure Linux 3.0 Docker image or inside the virtual machine created with the OS image.
+You can choose between deploying your Spark workload either in an Azure Linux 3.0 Docker container or on a virtual machine created from a custom Azure Linux 3.0 image.

 ### Working inside Azure Linux 3.0 Docker container
-The Azure Linux Container Host is an operating system image that's optimized for running container workloads on Azure Kubernetes Service (AKS). Microsoft maintains the Azure Linux Container Host and based it on CBL-Mariner, an open-source Linux distribution created by Microsoft. To know more about Azure Linux 3.0, kindly refer [What is Azure Linux Container Host for AKS](https://learn.microsoft.com/en-us/azure/azure-linux/intro-azure-linux).
+The Azure Linux Container Host is an operating system image that's optimized for running container workloads on Azure Kubernetes Service (AKS). Microsoft maintains the Azure Linux Container Host and based it on CBL-Mariner, an open-source Linux distribution created by Microsoft. To know more about Azure Linux 3.0, refer to [What is Azure Linux Container Host for AKS](https://learn.microsoft.com/en-us/azure/azure-linux/intro-azure-linux).

-Azure Linux 3.0 offers support for AArch64. However, the standalone virtual machine image for Azure Linux 3.0 or CBL Mariner 3.0 is not available for Arm. Hence, to use the default software stack provided by the Microsoft team, you can create a docker container with Azure Linux 3.0 as a base image, and run the Spark application inside the container.
+Azure Linux 3.0 offers support for AArch64. However, the standalone virtual machine image for Azure Linux 3.0 or CBL Mariner 3.0 is not available for Arm. To use the default software stack provided by the Microsoft, you can run a docker container with Azure Linux 3.0 as a base image, and run the Spark application inside the container.

-#### Create Azure Linux 3.0 Docker Container
+#### Option 1: Run an Azure Linux 3.0 Docker Container
 The [Microsoft Artifact Registry](https://mcr.microsoft.com/en-us/artifact/mar/azurelinux/base/core/about) offers updated docker image for the Azure Linux 3.0.

-To create a docker container, install docker, and then follow the below instructions:
+To run a docker container with Azure Linux 3.0, install [docker](/install-guides/docker/docker-engine/), and then run the command:

 ```console
 sudo docker run -it --rm mcr.microsoft.com/azurelinux/base/core:3.0
 ```
-The default container startup command is bash. tdnf and dnf are the default package managers.
+The default container starts up with a bash shell. `tdnf` and `dnf` are the default package managers available to use on the container.

-### Working with Azure Linux 3.0 OS image
-As of now, the Azure Marketplace offers official virtual machine images of Azure Linux 3.0 only for x64-based architectures, published by Ntegral Inc. However, native Arm64 (AArch64) images are not yet officially available. Hence, for this Learning Path, you can create your own custom Azure Linux 3.0 virtual machine image for AArch64 using the [AArch64 ISO for Azure Linux 3.0](https://github.com/microsoft/azurelinux#iso).
+### Option 2: Create a virtual machine instance with Azure Linux 3.0 OS image
+As of now, the Azure Marketplace offers official virtual machine images of Azure Linux 3.0 only for `x86_64` based architectures, published by Ntegral Inc. While native Arm64 (AArch64) images are not yet officially available, you can create your own custom Azure Linux 3.0 virtual machine image for AArch64 using the [AArch64 ISO for Azure Linux 3.0](https://github.com/microsoft/azurelinux#iso).

-Refer [Create an Azure Linux 3.0 virtual machine with Cobalt 100 processors](https://learn.arm.com/learning-paths/servers-and-cloud-computing/azure-vm) for the details.
+Refer to [Create an Azure Linux 3.0 virtual machine with Cobalt 100 processors](/learning-paths/servers-and-cloud-computing/azure-vm) for the detailed steps.

-Whether you're using an Azure Linux 3.0 Docker container, or a virtual machine created from a custom Azure Linux 3.0 image, the deployment and benchmarking steps remain the same.
+Whether you choose to use an Azure Linux 3.0 Docker container, or a virtual machine created from a custom Azure Linux 3.0 image, the Spark deployment and benchmarking steps in the following sections will remain the same.

-Once the setup has been established, you can proceed with the Spark Installation ahead.
+Once the setup is complete, you can proceed with installing and running Spark in the next section.
