πŸš€ PAYFLOW CASE STUDY β€” End‑to‑End Data Engineering Pipeline

A production‑grade ETL + Data Warehouse project built with Python, SQL, and PostgreSQL


πŸ“Œ Overview

This project implements a fully automated, reproducible, production‑style data pipeline for the Brazilian E‑Commerce Public Dataset (Olist).
It demonstrates real data engineering skills across:

  • Raw data ingestion
  • Data cleaning & standardization
  • Staging schema modeling
  • Star schema warehouse design
  • Fact & dimension construction
  • Orchestration & observability
  • Idempotent environment resets

The pipeline is modular, testable, and mirrors real‑world enterprise ETL workflows.


🧱 Architecture Summary

The pipeline follows a classic multi‑layer warehouse architecture:

Raw Data β†’ Staging β†’ Transform β†’ Analytics Warehouse β†’ BI Layer

Layers

  • Raw Layer

    • Stores downloaded Kaggle CSVs
    • Immutable source of truth
  • Staging Layer

    • Cleaned, standardized tables
    • 1:1 with raw data but normalized
    • Loaded via SQLAlchemy
  • Analytics Layer (Star Schema)

    • Dimensions: Customer, Merchant, Product, Payment Type, Date
    • Facts: Orders, Order Items, Payments
    • Surrogate keys, FKs, indexes
  • Orchestration Layer

    • run_all.py executes the full DAG
    • wipe_all.py resets schemas & folders
    • Logging + timing decorators

πŸ—‚ Project Structure

PAYFLOW_CASE_STUDY/
β”‚
β”œβ”€β”€ data_base/
β”‚   β”œβ”€β”€ raw_data/          # Downloaded Kaggle data
β”‚   └── cleaned_data/      # Cleaned CSV outputs
β”‚
β”œβ”€β”€ etl/
β”‚   β”œβ”€β”€ extract.py         # Download + extract + validate raw data
β”‚   β”œβ”€β”€ explore.py         # Automated dataset exploration
β”‚   β”œβ”€β”€ clean.py           # Cleaning + staging load
β”‚   β”œβ”€β”€ transform.py       # Star schema builder
β”‚   β”œβ”€β”€ run_all.py         # Full pipeline orchestrator
β”‚   β”œβ”€β”€ wipe_all.py        # Environment reset tool
β”‚   β”œβ”€β”€ logger.py          # Color logging + timing
β”‚   └── db_config.py       # DB connection loader
β”‚
β”œβ”€β”€ sql/
β”‚   β”œβ”€β”€ create_staging_tables.sql
β”‚   β”œβ”€β”€ create_analytics_tables.sql
β”‚   └── setup_database.sql
β”‚
β”œβ”€β”€ .env
β”œβ”€β”€ .gitignore
β”œβ”€β”€ README.md
└── requirements.txt

πŸ”„ Pipeline Flow

1. Wipe Phase

Resets the environment to a clean state:

  • Deletes raw + cleaned folders
  • Drops & recreates staging and analytics schemas
  • Ensures deterministic pipeline runs
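
A minimal sketch of this step, assuming the folder paths from the project structure above; the function name and the schema names staging and analytics are assumptions:

import shutil
from pathlib import Path

from sqlalchemy import create_engine, text

# Folder paths come from the project structure; schema names are assumed.
DATA_DIRS = [Path("data_base/raw_data"), Path("data_base/cleaned_data")]
SCHEMAS = ["staging", "analytics"]

def wipe(db_url: str) -> None:
    # Delete and recreate the data folders so every run starts empty.
    for d in DATA_DIRS:
        shutil.rmtree(d, ignore_errors=True)
        d.mkdir(parents=True, exist_ok=True)
    # Drop and recreate each schema; CASCADE removes dependent tables.
    engine = create_engine(db_url)
    with engine.begin() as conn:
        for schema in SCHEMAS:
            conn.execute(text(f"DROP SCHEMA IF EXISTS {schema} CASCADE"))
            conn.execute(text(f"CREATE SCHEMA {schema}"))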

2. Extract Phase

  • Downloads dataset from Kaggle
  • Extracts ZIP
  • Validates all CSVs
  • Logs row counts & missing values
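
A sketch of this step using the kaggle Python package; the dataset slug and the use of print for logging are assumptions (the real module logs through logger.py):

from pathlib import Path

import kaggle  # reads credentials from ~/.kaggle/kaggle.json on import
import pandas as pd

RAW_DIR = Path("data_base/raw_data")

def extract() -> None:
    # Download and unzip the Olist dataset into the raw layer
    # (dataset slug assumed).
    kaggle.api.dataset_download_files(
        "olistbr/brazilian-ecommerce", path=str(RAW_DIR), unzip=True
    )
    # Validate every CSV: report row counts and missing-value totals.
    for csv in sorted(RAW_DIR.glob("*.csv")):
        df = pd.read_csv(csv)
        print(f"{csv.name}: {len(df):,} rows, {int(df.isna().sum().sum()):,} missing")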

3. Explore Phase

  • Auto-discovers CSVs
  • Logs:
    • shape
    • head
    • dtypes
    • missing values
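
Discovery is just a glob over the raw folder, and each summary maps directly to a pandas call; print stands in for the project's logger:

from pathlib import Path

import pandas as pd

RAW_DIR = Path("data_base/raw_data")

def explore() -> None:
    # Auto-discover every raw CSV and print a structured summary.
    for csv in sorted(RAW_DIR.glob("*.csv")):
        df = pd.read_csv(csv)
        print(f"=== {csv.name} ===")
        print("shape:", df.shape)
        print(df.head())          # first rows
        print(df.dtypes)          # column types
        print(df.isna().sum())    # missing values per column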

4. Clean Phase

  • Cleans customers, sellers, transactions
  • Handles cancellations
  • Converts timestamps
  • Saves cleaned CSVs
  • Loads into staging schema
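
A sketch of one cleaning path for the orders file; the Olist column names are real, but the cancellation flag and the staging table name are assumptions:

import pandas as pd
from sqlalchemy import create_engine

def clean_orders(db_url: str) -> None:
    df = pd.read_csv("data_base/raw_data/olist_orders_dataset.csv")

    # Convert timestamp columns from strings to proper datetimes.
    for col in ["order_purchase_timestamp", "order_delivered_customer_date"]:
        df[col] = pd.to_datetime(df[col], errors="coerce")

    # Flag cancelled orders instead of silently dropping them (assumed rule).
    df["is_cancelled"] = df["order_status"].eq("canceled")

    # Persist the cleaned CSV, then append into the staging schema;
    # the tables themselves are created by sql/create_staging_tables.sql.
    df.to_csv("data_base/cleaned_data/orders.csv", index=False)
    engine = create_engine(db_url)
    df.to_sql("orders", engine, schema="staging", if_exists="append", index=False)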

5. Transform Phase

Builds a full star schema:

Dimensions

  • dim_customer
  • dim_merchant
  • dim_product
  • dim_payment_type
  • dim_date

Facts

  • fact_orders
  • fact_order_items
  • fact_payments

Includes:

  • surrogate keys
  • date key mapping
  • lifecycle status
  • item counts
  • payment sequences
  • referential integrity
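
A sketch of one slice of this phase, the date dimension with its integer surrogate key; the YYYYMMDD key format and the column names are assumptions consistent with the date key mapping above:

import pandas as pd
from sqlalchemy import create_engine

def build_dim_date(db_url: str) -> None:
    engine = create_engine(db_url)
    orders = pd.read_sql(
        "SELECT order_purchase_timestamp FROM staging.orders", engine
    )

    # One row per calendar day, keyed by an integer YYYYMMDD surrogate.
    days = orders["order_purchase_timestamp"].dropna().dt.normalize().unique()
    dim = pd.DataFrame({"date": pd.to_datetime(days)})
    dim["date_key"] = dim["date"].dt.strftime("%Y%m%d").astype(int)
    dim["year"] = dim["date"].dt.year
    dim["month"] = dim["date"].dt.month
    dim["day"] = dim["date"].dt.day

    # Facts later join on date_key instead of storing raw timestamps.
    dim.to_sql("dim_date", engine, schema="analytics",
               if_exists="append", index=False)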

6. Orchestration

run_all.py executes:

  1. wipe_all
  2. extract
  3. clean
  4. transform

All steps are timed, logged, and fail‑fast.
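
A minimal sketch of the orchestrator, matching the "Python subprocess DAG" entry in the tech stack; the exact log output is an assumption:

import subprocess
import sys
import time

# The four stages, run in order as modules (python -m etl.<stage>).
STEPS = ["etl.wipe_all", "etl.extract", "etl.clean", "etl.transform"]

def main() -> None:
    for step in STEPS:
        start = time.perf_counter()
        result = subprocess.run([sys.executable, "-m", step])
        # Fail fast: a non-zero exit code aborts the remaining stages.
        if result.returncode != 0:
            sys.exit(f"{step} failed with exit code {result.returncode}")
        print(f"{step} finished in {time.perf_counter() - start:.1f}s")

if __name__ == "__main__":
    main()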


🧠 Key Engineering Concepts Demonstrated

βœ” Modular ETL Architecture

Each stage is isolated, testable, and reusable.

βœ” Production‑style Logging

Color‑coded logs, section banners, and timing decorators.
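
The timing half can be sketched as a standard decorator; ColorFormatter itself is a custom class in logger.py, so it is omitted here, and the names below are assumptions:

import functools
import logging
import time

log = logging.getLogger("payflow")

def timed(func):
    # Wrap a pipeline stage and log its wall-clock duration.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        log.info("%s completed in %.2fs", func.__name__,
                 time.perf_counter() - start)
        return result
    return wrapper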

βœ” Schema‑Driven Warehouse Design

All tables defined explicitly in SQL, not implicitly in Python.
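
In practice this means Python only applies the DDL scripts in sql/; a plausible helper (the function name is an assumption):

from pathlib import Path

from sqlalchemy import create_engine

def run_sql_file(db_url: str, path: str) -> None:
    # Execute a DDL script verbatim, so the schema lives in SQL files,
    # not in Python code.
    engine = create_engine(db_url)
    with engine.begin() as conn:
        conn.exec_driver_sql(Path(path).read_text())

# e.g. run_sql_file(db_url, "sql/create_staging_tables.sql")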

βœ” Idempotent Pipeline Execution

wipe_all.py ensures clean, repeatable runs.

βœ” Star Schema Modeling

Optimized for analytics and BI workloads.

βœ” SQL + Python Integration

SQLAlchemy used for staging + analytics loads.

βœ” Data Quality Awareness

Validation at extract, clean, and transform stages.


πŸ›  Tech Stack

Layer             Tools
Language          Python 3.x
Data Processing   pandas
Database          PostgreSQL
ORM / Loader      SQLAlchemy
Environment       dotenv
Logging           custom ColorFormatter
Orchestration     Python subprocess DAG
Source Data       Kaggle (Olist Brazilian E‑Commerce)

▢️ Running the Pipeline

1. Install dependencies

pip install -r requirements.txt

2. Set your .env

DB_URL=postgresql://user:password@localhost:5432/payflow
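
db_config.py presumably turns this variable into a reusable engine; a minimal version using python-dotenv (the function name is an assumption):

import os

from dotenv import load_dotenv
from sqlalchemy import create_engine

def get_engine():
    # Load DB_URL from .env so credentials stay out of the codebase.
    load_dotenv()
    return create_engine(os.environ["DB_URL"])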

3. Run the full pipeline

python -m etl.run_all

πŸ“Š Warehouse Schema (Star Model)

Dimensions

  • Customer
  • Merchant
  • Product
  • Payment Type
  • Date

Facts

  • Orders
  • Order Items
  • Payments

Each fact table links to dimensions via surrogate keys.
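
For example, a BI-style query resolves those keys with plain joins; the measure and key column names below are assumptions:

import pandas as pd
from sqlalchemy import create_engine

# Monthly revenue: fact_payments joined to dim_date via the surrogate key.
# payment_value and date_key are assumed column names.
QUERY = """
SELECT d.year, d.month, SUM(f.payment_value) AS revenue
FROM analytics.fact_payments AS f
JOIN analytics.dim_date AS d ON d.date_key = f.date_key
GROUP BY d.year, d.month
ORDER BY d.year, d.month
"""

engine = create_engine("postgresql://user:password@localhost:5432/payflow")
print(pd.read_sql(QUERY, engine))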


Author

Yomi Ismail
Data Engineer & Product Operations Specialist

