
Realtime Data Streaming: End-to-End Data Engineering Project

A Comprehensive Guide to Building a Modern Data Pipeline

Table of Contents

  1. Introduction
  2. System Architecture
  3. Key Learnings
  4. Technology Stack

Introduction

This project provides hands-on experience in building a complete, end-to-end data engineering pipeline. It demonstrates how to ingest, process, and store data using a combination of industry-standard tools. The entire stack is containerized with Docker, making the environment easy to deploy and replicate.

System Architecture

[System architecture diagram]

The project is designed with the following components:

  • Data Source: The randomuser.me API generates random user data for the pipeline.
  • Apache Airflow: Orchestrates the pipeline and stores the fetched data in a PostgreSQL database (a condensed sketch of this ingestion step follows the list).
  • Apache Kafka and Zookeeper: Stream the data from PostgreSQL to the processing engine.
  • Control Center and Schema Registry: Help with monitoring and schema management of the Kafka streams.
  • Apache Spark: Processes the data with its master and worker nodes.
  • Cassandra: Stores the processed data.
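As a rough illustration of the ingestion step, the sketch below fetches one profile from randomuser.me inside an Airflow task and publishes it to Kafka. It is a minimal sketch under assumptions: the topic name `users_created`, the broker address `broker:29092`, and the use of Airflow 2's TaskFlow API are placeholders, and the intermediate PostgreSQL write described above is omitted for brevity.

```python
import json

import pendulum
import requests
from airflow.decorators import dag, task
from kafka import KafkaProducer


@dag(schedule="@daily", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def user_stream():
    @task
    def fetch_and_publish():
        # Fetch one random user profile from the public randomuser.me API.
        response = requests.get("https://randomuser.me/api/", timeout=10)
        response.raise_for_status()
        user = response.json()["results"][0]

        # Publish the raw record to a Kafka topic; the broker address and
        # topic name are illustrative, not taken from this repository.
        producer = KafkaProducer(
            bootstrap_servers="broker:29092",
            value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        )
        producer.send("users_created", user)
        producer.flush()

    fetch_and_publish()


user_stream()
```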

Key Learnings

  • Setting up a data pipeline with Apache Airflow
  • Real-time data streaming with Apache Kafka
  • Distributed synchronization with Apache Zookeeper
  • Data processing techniques with Apache Spark (see the streaming sketch after this list)
  • Data storage solutions with Cassandra and PostgreSQL
  • Containerizing your entire data engineering setup with Docker
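To make the Spark-to-Cassandra leg concrete, here is a minimal Structured Streaming sketch. It assumes the spark-sql-kafka and spark-cassandra-connector packages are on the Spark classpath, and the hostnames, topic, keyspace, and table names (`broker:29092`, `users_created`, `spark_streams.created_users`) are illustrative placeholders rather than this project's actual configuration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

# Spark session pointed at the Cassandra node; hostname is an assumption.
spark = (
    SparkSession.builder.appName("user-stream")
    .config("spark.cassandra.connection.host", "cassandra")
    .getOrCreate()
)

# A simplified schema for the randomuser.me payload.
schema = StructType([
    StructField("first_name", StringType()),
    StructField("last_name", StringType()),
    StructField("email", StringType()),
])

# Read the Kafka topic as a stream and parse the JSON value column.
users = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:29092")
    .option("subscribe", "users_created")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

# Write each micro-batch into Cassandra via the spark-cassandra-connector.
query = (
    users.writeStream.format("org.apache.spark.sql.cassandra")
    .option("keyspace", "spark_streams")
    .option("table", "created_users")
    .option("checkpointLocation", "/tmp/checkpoint")
    .start()
)
query.awaitTermination()
```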

Technology Stack

  • Apache Airflow: Data pipeline orchestration.
  • Python: For data processing and Airflow DAGs.
  • Apache Kafka: Real-time stream processing.
  • Apache Zookeeper: Distributed coordination for Kafka.
  • Apache Spark: Distributed data processing.
  • Cassandra: Highly scalable database for the processed records (see the setup sketch after this list).
  • PostgreSQL: Relational database for initial data storage.
  • Docker: Containerization platform.
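For completeness, a small sketch of preparing the Cassandra side with the Python cassandra-driver is shown below. The contact point `cassandra`, the keyspace `spark_streams`, the table `created_users`, and the single-node replication settings are assumptions for illustration only.

```python
from cassandra.cluster import Cluster

# Connect to the Cassandra node started by the Docker setup; hostname and
# port are illustrative defaults, not taken from this repository's config.
cluster = Cluster(["cassandra"], port=9042)
session = cluster.connect()

# Create a keyspace and a table to hold the processed user records.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS spark_streams
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS spark_streams.created_users (
        email text PRIMARY KEY,
        first_name text,
        last_name text
    )
""")

cluster.shutdown()
```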
