Pipelines with Spark

Setup

Install kind

macOS

brew install kind

Linux

# For AMD64 / x86_64
[ $(uname -m) = x86_64 ] && curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.23.0/kind-linux-amd64
# For ARM64
[ $(uname -m) = aarch64 ] && curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.23.0/kind-linux-arm64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind
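The two architecture checks above can be folded into one helper. This is a minimal sketch and not part of the repository: the function names kind_arch and kind_url are hypothetical, and the version v0.23.0 is simply the one used in the commands above.

```shell
# Hypothetical helper: map `uname -m` output to the suffix used in kind's download URL.
kind_arch() {
  case "$1" in
    x86_64) echo amd64 ;;
    aarch64) echo arm64 ;;
    *) echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: build the Linux download URL for a given machine type.
kind_url() {
  arch="$(kind_arch "$1")" || return 1
  echo "https://kind.sigs.k8s.io/dl/v0.23.0/kind-linux-${arch}"
}

# Example with a fixed machine type.
kind_url x86_64
# prints https://kind.sigs.k8s.io/dl/v0.23.0/kind-linux-amd64
```

With these defined, the download becomes curl -Lo ./kind "$(kind_url "$(uname -m)")", and an unsupported machine type fails loudly instead of silently skipping the download.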

Install kubectl

macOS

brew install kubernetes-cli

Linux

# We assume you have curl installed.
# This URL targets AMD64 / x86_64; on ARM64, replace amd64 with arm64 in the path.
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
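Before running the install line above, the downloaded binary can optionally be checked against its published checksum. The sketch below assumes GNU sha256sum is available and that the checksum file contains only the hex digest, as the kubectl.sha256 file published next to the binary does. The function name verify_sha256 is hypothetical.

```shell
# Hypothetical helper: compare a file's SHA-256 digest with the digest stored in a checksum file.
# Returns 0 when the digests match and non-zero otherwise.
verify_sha256() {
  file="$1"
  sum_file="$2"
  # sha256sum --check expects lines of the form "<digest>  <filename>".
  echo "$(cat "$sum_file")  $file" | sha256sum --check
}
```

To use it, download the checksum by appending .sha256 to the kubectl URL above, then run verify_sha256 kubectl kubectl.sha256 and continue with the install only if it reports OK.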

Set Up Kubernetes

Create Docker registry and Kubernetes cluster

./kubernetes/kind-with-registry.sh

Create and configure Kubernetes service account for Spark

kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default

Set Up Python Environment

python -m venv .venv
source .venv/bin/activate
python -m pip install -e .
python -m pip install build

Build Spark Image with Python 3.12

Why? Because I am angry that the official Spark image still uses Python 3.8.10, which was released in May 2021.

docker build -t localhost:5001/spark:python3.12 -f Dockerfile-spark .
docker push localhost:5001/spark:python3.12

Get the Data

On Kaggle, download the India Higher Education Analytics dataset and unpack it into a folder named data in this project directory.
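A quick pre-flight check can confirm the dataset landed where the step above says it should. This is a sketch only: the function name check_data_dir is hypothetical, and it checks just that the folder exists and is non-empty, since the dataset's file names are not listed here.

```shell
# Hypothetical pre-flight check: succeed only if the directory exists
# and contains at least one entry. Defaults to ./data.
check_data_dir() {
  dir="${1:-data}"
  if [ ! -d "$dir" ]; then
    echo "missing directory: $dir" >&2
    return 1
  fi
  if [ -z "$(ls -A "$dir")" ]; then
    echo "directory is empty: $dir" >&2
    return 1
  fi
  echo "found data in $dir"
}
```

Run check_data_dir data from the project root before building the application image.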

Run

Note that the code as-is will likely hit an OutOfMemory (OOM) error.

docker build -t localhost:5001/pipelines-with-spark:latest .
docker push localhost:5001/pipelines-with-spark:latest
spark-submit \
  --master "k8s://$(kubectl config view --minify --output jsonpath='{.clusters[*].cluster.server}')" \
  --deploy-mode cluster \
  --name pipelines-with-spark \
  --conf spark.kubernetes.namespace=default \
  --conf spark.kubernetes.container.image=localhost:5001/pipelines-with-spark:latest \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  local:///opt/spark/work-dir/main.py
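One possible response to the OOM note above is to raise the driver and executor memory, which both default to 1g in Spark. The sketch below is an assumption, not a tested fix for this project: spark.driver.memory and spark.executor.memory are standard Spark properties, but the helper name memory_conf and the example values are hypothetical, and the kind node must have enough memory to honor them.

```shell
# Hypothetical helper: print extra --conf flags that raise driver and executor memory.
# The 2g defaults are placeholders, not values tested with this project.
memory_conf() {
  driver_mem="${1:-2g}"
  executor_mem="${2:-2g}"
  printf '%s\n' "--conf spark.driver.memory=${driver_mem} --conf spark.executor.memory=${executor_mem}"
}

# Example: 2g for the driver, 4g for each executor.
memory_conf 2g 4g
# prints --conf spark.driver.memory=2g --conf spark.executor.memory=4g
```

The output can be spliced into the spark-submit call above as an extra line, $(memory_conf 2g 4g) \, placed before the local:///opt/spark/work-dir/main.py argument. Leave the substitution unquoted so the shell splits it into separate flags.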

Cleanup

Delete Kubernetes Cluster

kind delete cluster
docker stop kind-registry && docker rm kind-registry

CodeSeoul/pipelines-with-spark