Developed a lakehouse-based data pipeline on the Sakila dataset to analyze movie sales and rental trends. The lakehouse is designed according to the Delta architecture.
- Extracted change events from the database via CDC (Debezium) and published them to Kafka, which provides scalable, fault-tolerant message processing
- Processed the streaming events with Spark Streaming and wrote them to Delta tables in MinIO, combined with the Trino query engine to serve real-time insights through Superset dashboards
- Periodically transformed the event data in the Delta tables into staging and mart tables with DBT for deeper analytics and machine learning
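The staging and mart tables themselves are built with DBT (see the dbt-airflow section below). Purely to illustrate the shape of that transformation, the PySpark sketch below derives a daily rental-count mart from a raw rental event table; the bucket name, table paths, and the Debezium `op` column are assumptions, not the repo's actual layout.

```python
# Illustrative only: the real staging/mart transformations are DBT models.
# Paths and column names below are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mart-sketch").getOrCreate()

# Raw change events landed by the streaming job
rentals = spark.read.format("delta").load("s3a://lakehouse/bronze/rental")

# Staging: keep only inserts and the columns the marts need
staging = (rentals
           .where(F.col("op") == "c")   # Debezium op 'c' = insert
           .select("rental_id", "inventory_id", "customer_id", "rental_date"))

# Mart: daily rental counts, the kind of table a Superset chart would use
daily_rentals = (staging
                 .groupBy(F.to_date("rental_date").alias("rental_day"))
                 .agg(F.count("*").alias("rentals")))

daily_rentals.write.format("delta").mode("overwrite") \
    .save("s3a://lakehouse/mart/daily_rentals")
```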
Config files are in these folders: spark, notebook, hive-metastore
Run this command to create the Docker containers for the Apache Spark cluster:
docker-compose -f ./docker-compose.yaml up
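Once the containers are up, a quick way to confirm that the notebook can reach MinIO with Delta support is to build a Delta-enabled SparkSession. This is a smoke-test sketch, assuming the notebook image bundles the Delta Lake packages; the endpoint and credentials mirror the Trino catalog settings later in this README, and the bucket name is hypothetical.

```python
# Smoke test from the spark-notebook container: Delta + MinIO (S3A) access.
# Endpoint/credentials mirror the Trino delta.properties further below;
# the "lakehouse" bucket is a placeholder.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("lakehouse-smoke-test")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .config("spark.hadoop.fs.s3a.endpoint", "http://160.191.244.13:9000")
         .config("spark.hadoop.fs.s3a.access.key", "minio")
         .config("spark.hadoop.fs.s3a.secret.key", "minio123")
         .config("spark.hadoop.fs.s3a.path.style.access", "true")
         .getOrCreate())

# Write and read back a tiny Delta table to verify the round trip.
spark.range(5).write.format("delta").mode("overwrite") \
    .save("s3a://lakehouse/_smoke_test")
spark.read.format("delta").load("s3a://lakehouse/_smoke_test").show()
```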
Config files are in this folder: kafka
Run this command to create the Apache Kafka containers:
docker-compose -f ./kafka/docker-compose.yaml up
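To check that Debezium is actually publishing change events, you can tail one of the CDC topics from Python with kafka-python. The broker address and topic name below are assumptions (Debezium names topics server.database.table); substitute the values from your connector configuration.

```python
# Peek at a few Debezium change events to confirm CDC is flowing into Kafka.
# Broker address and topic name are assumptions; use your connector's values.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "dbserver1.sakila.rental",           # hypothetical Debezium topic name
    bootstrap_servers="localhost:9092",   # adjust to your broker address
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,            # stop if nothing arrives in 10 s
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

count = 0
for message in consumer:
    if message.value is None:             # skip tombstone records
        continue
    payload = message.value.get("payload", message.value)
    print(payload.get("op"), payload.get("after"))
    count += 1
    if count >= 5:                        # five events are enough for a check
        break
```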
Config files are in this folder: trino-superset
In trino-superset/trino-conf/catalog, create delta.properties with the following parameters:
connector.name=delta-lake
hive.metastore.uri=thrift://160.191.244.13:9083
hive.s3.aws-access-key=minio
hive.s3.aws-secret-key=minio123
hive.s3.endpoint=http://160.191.244.13:9000
hive.s3.path-style-access=true
Then run this command:
docker-compose up --build
Note: this version runs sequentially, not in parallel.
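With the delta catalog in place, you can sanity-check the Trino side from Python using the trino client (pip install trino). The host, port, user, and schema below are assumptions for this sketch; only the catalog name comes from the properties file above.

```python
# Quick sanity check of the Trino 'delta' catalog defined above.
# Host, port, user, and schema are assumptions for this sketch.
import trino

conn = trino.dbapi.connect(
    host="localhost",     # or the host running the trino-superset stack
    port=8080,            # Trino's default HTTP port
    user="admin",
    catalog="delta",      # matches delta.properties above
    schema="default",
)
cur = conn.cursor()
cur.execute("SHOW TABLES")
print(cur.fetchall())
```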
Config files are in this folder: dbt-airflow
In dbt-airflow, run this command to create the Airflow container:
docker-compose up --build
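The container ships the DAG that wraps the DBT run (triggered as described at the end of this README). As a rough sketch only, such a DAG could look like the following; the DAG id, dbt project path, and task layout are assumptions, not the repo's actual code, and Airflow 2.x is assumed.

```python
# Rough sketch of a DAG wrapping the DBT transformations (Airflow 2.x assumed).
# DAG id, dbt project path, and task layout are assumptions.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dbt_transformations",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 23 * * *",   # daily at 23:00, as described below
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/airflow/dbt && dbt run --profiles-dir .",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="cd /opt/airflow/dbt && dbt test --profiles-dir .",
    )
    dbt_run >> dbt_test
```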
To start the streaming process, run this command in the Jupyter notebook terminal running inside the spark-notebook container:
python3 stream_events.py
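stream_events.py is the repo's own script; the sketch below only illustrates the general shape of such a Kafka-to-Delta streaming job, not its actual contents. The bootstrap server, topic, checkpoint location, and output path are hypothetical.

```python
# Illustrative shape of a Kafka -> Delta streaming job; the repository's
# stream_events.py is the authoritative version. Broker, topic, checkpoint
# and output paths below are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("stream-events-sketch").getOrCreate()

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")
       .option("subscribe", "dbserver1.sakila.rental")
       .option("startingOffsets", "earliest")
       .load())

# Keep the Debezium envelope as JSON text; downstream jobs parse what they need.
events = raw.select(
    F.col("key").cast("string").alias("key"),
    F.col("value").cast("string").alias("value"),
    F.col("timestamp"),
)

query = (events.writeStream.format("delta")
         .option("checkpointLocation", "s3a://lakehouse/_checkpoints/rental")
         .outputMode("append")
         .start("s3a://lakehouse/bronze/rental"))

query.awaitTermination()
```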
To run the data warehouse transformations, trigger the DAG in Airflow's UI, or let it run automatically every day at 23:00.
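Besides the UI, the DAG can also be triggered programmatically through Airflow's stable REST API (Airflow 2.x). The DAG id, host, and credentials below are assumptions for this sketch; use your deployment's values.

```python
# Trigger the DBT DAG via Airflow's stable REST API (Airflow 2.x) instead of the UI.
# DAG id, host, and credentials are assumptions; use your deployment's values.
import requests

AIRFLOW_URL = "http://localhost:8080/api/v1"
DAG_ID = "dbt_transformations"            # hypothetical DAG id

response = requests.post(
    f"{AIRFLOW_URL}/dags/{DAG_ID}/dagRuns",
    json={"conf": {}},
    auth=("airflow", "airflow"),          # assumed web login credentials
)
response.raise_for_status()
print(response.json()["dag_run_id"])
```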