Developed a lakehouse-based data pipeline on the Sakila dataset to analyze movie sales and rental trends. The lakehouse is designed according to the Delta architecture.
- Extracted change events from the database via CDC (Debezium) and published them to Kafka, which provides scalable, fault-tolerant message processing
- Processed the streaming events with Spark Streaming and wrote them to Delta tables in MinIO, combined with the Trino query engine to serve real-time insights through Superset dashboards
- Periodically transformed the event data in the Delta tables into staging and mart tables with DBT for deeper analytics and machine learning
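The staging and mart tables themselves are built with DBT (see the dbt-airflow section below). Purely to illustrate the shape of that transformation, the PySpark sketch below derives a daily rental-count mart from a raw rental event table; the bucket name, table paths, and the Debezium `op` column are assumptions, not the repo's actual layout.

```python
# Illustrative only: the real staging/mart transformations are DBT models.
# Paths and column names below are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mart-sketch").getOrCreate()

# Raw change events landed by the streaming job
rentals = spark.read.format("delta").load("s3a://lakehouse/bronze/rental")

# Staging: keep only inserts and the columns the marts need
staging = (rentals
           .where(F.col("op") == "c")   # Debezium op 'c' = insert
           .select("rental_id", "inventory_id", "customer_id", "rental_date"))

# Mart: daily rental counts, the kind of table a Superset chart would use
daily_rentals = (staging
                 .groupBy(F.to_date("rental_date").alias("rental_day"))
                 .agg(F.count("*").alias("rentals")))

daily_rentals.write.format("delta").mode("overwrite") \
    .save("s3a://lakehouse/mart/daily_rentals")
```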
Config files are in these folders: spark, notebook, hive-metastore
Run this command to create the Docker containers for the Apache Spark cluster:
docker-compose -f ./docker-compose.yaml up
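Once the containers are up, a quick way to confirm that the notebook can reach MinIO with Delta support is to build a Delta-enabled SparkSession. This is a smoke-test sketch, assuming the notebook image bundles the Delta Lake packages; the endpoint and credentials mirror the Trino catalog settings later in this README, and the bucket name is hypothetical.

```python
# Smoke test from the spark-notebook container: Delta + MinIO (S3A) access.
# Endpoint/credentials mirror the Trino delta.properties further below;
# the "lakehouse" bucket is a placeholder.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("lakehouse-smoke-test")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .config("spark.hadoop.fs.s3a.endpoint", "http://160.191.244.13:9000")
         .config("spark.hadoop.fs.s3a.access.key", "minio")
         .config("spark.hadoop.fs.s3a.secret.key", "minio123")
         .config("spark.hadoop.fs.s3a.path.style.access", "true")
         .getOrCreate())

# Write and read back a tiny Delta table to verify the round trip.
spark.range(5).write.format("delta").mode("overwrite") \
    .save("s3a://lakehouse/_smoke_test")
spark.read.format("delta").load("s3a://lakehouse/_smoke_test").show()
```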
Config files are in this folder: kafka
Run this command to create the Apache Kafka containers:
docker-compose -f ./kafka/docker-compose.yaml up
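To check that Debezium is actually publishing change events, you can tail one of the CDC topics from Python with kafka-python. The broker address and topic name below are assumptions (Debezium names topics server.database.table); substitute the values from your connector configuration.

```python
# Peek at a few Debezium change events to confirm CDC is flowing into Kafka.
# Broker address and topic name are assumptions; use your connector's values.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "dbserver1.sakila.rental",           # hypothetical Debezium topic name
    bootstrap_servers="localhost:9092",   # adjust to your broker address
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,            # stop if nothing arrives in 10 s
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

count = 0
for message in consumer:
    if message.value is None:             # skip tombstone records
        continue
    payload = message.value.get("payload", message.value)
    print(payload.get("op"), payload.get("after"))
    count += 1
    if count >= 5:                        # five events are enough for a check
        break
```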
Config files are in this folder: trino-superset
In trino-superset/trino-conf/catalog, create delta.properties with the following parameters:
connector.name=delta-lake
hive.metastore.uri=thrift://160.191.244.13:9083
hive.s3.aws-access-key=minio
hive.s3.aws-secret-key=minio123
hive.s3.endpoint=http://160.191.244.13:9000
hive.s3.path-style-access=true
Then run this command:
docker-compose up --build
Note: this version runs sequentially, not in parallel.
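With the delta catalog in place, you can sanity-check the Trino side from Python using the trino client (pip install trino). The host, port, user, and schema below are assumptions for this sketch; only the catalog name comes from the properties file above.

```python
# Quick sanity check of the Trino 'delta' catalog defined above.
# Host, port, user, and schema are assumptions for this sketch.
import trino

conn = trino.dbapi.connect(
    host="localhost",     # or the host running the trino-superset stack
    port=8080,            # Trino's default HTTP port
    user="admin",
    catalog="delta",      # matches delta.properties above
    schema="default",
)
cur = conn.cursor()
cur.execute("SHOW TABLES")
print(cur.fetchall())
```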
Config files are in this folder: dbt-airflow
In dbt-airflow, run this command to create the Airflow container:
docker-compose up --build
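The container ships the DAG that wraps the DBT run (triggered as described at the end of this README). As a rough sketch only, such a DAG could look like the following; the DAG id, dbt project path, and task layout are assumptions, not the repo's actual code, and Airflow 2.x is assumed.

```python
# Rough sketch of a DAG wrapping the DBT transformations (Airflow 2.x assumed).
# DAG id, dbt project path, and task layout are assumptions.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dbt_transformations",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 23 * * *",   # daily at 23:00, as described below
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/airflow/dbt && dbt run --profiles-dir .",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="cd /opt/airflow/dbt && dbt test --profiles-dir .",
    )
    dbt_run >> dbt_test
```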
To start the streaming process, run this command in the Jupyter notebook terminal running inside the spark-notebook container:
python3 stream_events.py
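stream_events.py is the repo's own script; the sketch below only illustrates the general shape of such a Kafka-to-Delta streaming job, not its actual contents. The bootstrap server, topic, checkpoint location, and output path are hypothetical.

```python
# Illustrative shape of a Kafka -> Delta streaming job; the repository's
# stream_events.py is the authoritative version. Broker, topic, checkpoint
# and output paths below are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("stream-events-sketch").getOrCreate()

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")
       .option("subscribe", "dbserver1.sakila.rental")
       .option("startingOffsets", "earliest")
       .load())

# Keep the Debezium envelope as JSON text; downstream jobs parse what they need.
events = raw.select(
    F.col("key").cast("string").alias("key"),
    F.col("value").cast("string").alias("value"),
    F.col("timestamp"),
)

query = (events.writeStream.format("delta")
         .option("checkpointLocation", "s3a://lakehouse/_checkpoints/rental")
         .outputMode("append")
         .start("s3a://lakehouse/bronze/rental"))

query.awaitTermination()
```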
To run the data warehouse transformations, trigger the DAG in Airflow's UI, or let it run automatically every day at 23:00.
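Besides the UI, the DAG can also be triggered programmatically through Airflow's stable REST API (Airflow 2.x). The DAG id, host, and credentials below are assumptions for this sketch; use your deployment's values.

```python
# Trigger the DBT DAG via Airflow's stable REST API (Airflow 2.x) instead of the UI.
# DAG id, host, and credentials are assumptions; use your deployment's values.
import requests

AIRFLOW_URL = "http://localhost:8080/api/v1"
DAG_ID = "dbt_transformations"            # hypothetical DAG id

response = requests.post(
    f"{AIRFLOW_URL}/dags/{DAG_ID}/dagRuns",
    json={"conf": {}},
    auth=("airflow", "airflow"),          # assumed web login credentials
)
response.raise_for_status()
print(response.json()["dag_run_id"])
```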