A production-grade ETL + Data Warehouse project built with Python, SQL, and PostgreSQL
This project implements a fully automated, reproducible, production-style data pipeline for the Brazilian E-Commerce Public Dataset (Olist).
It demonstrates real data engineering skills across:
- Raw data ingestion
- Data cleaning & standardization
- Staging schema modeling
- Star schema warehouse design
- Fact & dimension construction
- Orchestration & observability
- Idempotent environment resets
The pipeline is modular, testable, and mirrors real-world enterprise ETL workflows.
The pipeline follows a classic multi-layer warehouse architecture:

Raw Data → Staging → Transform → Analytics Warehouse → BI Layer
**Raw Layer**
- Stores downloaded Kaggle CSVs
- Immutable source of truth
**Staging Layer**
- Cleaned, standardized tables
- 1:1 with raw data but normalized
- Loaded via SQLAlchemy
**Analytics Layer (Star Schema)**
- Dimensions: Customer, Seller, Product, Payment Type, Date
- Facts: Orders, Order Items, Payments
- Surrogate keys, FKs, indexes
**Orchestration Layer**
- `run_all.py` executes the full DAG
- `wipe_all.py` resets schemas & folders
- Logging + timing decorators (sketched below)
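A minimal sketch of the timing-decorator idea (the names here are illustrative, not the actual `logger.py` API, which also adds color formatting):

```python
import functools
import logging
import time

logger = logging.getLogger("etl")

def timed(func):
    """Log how long a pipeline stage takes, and re-raise on failure (fail-fast)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        logger.info("START %s", func.__name__)
        try:
            result = func(*args, **kwargs)
        except Exception:
            logger.exception("FAILED %s after %.2fs", func.__name__,
                             time.perf_counter() - start)
            raise
        logger.info("DONE %s in %.2fs", func.__name__,
                    time.perf_counter() - start)
        return result
    return wrapper
```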
Project structure:

```
PAYFLOW_CASE_STUDY/
│
├── data_base/
│   ├── raw_data/          # Downloaded Kaggle data
│   └── cleaned_data/      # Cleaned CSV outputs
│
├── etl/
│   ├── extract.py         # Download + extract + validate raw data
│   ├── explore.py         # Automated dataset exploration
│   ├── clean.py           # Cleaning + staging load
│   ├── transform.py       # Star schema builder
│   ├── run_all.py         # Full pipeline orchestrator
│   ├── wipe_all.py        # Environment reset tool
│   ├── logger.py          # Color logging + timing
│   └── db_config.py       # DB connection loader
│
├── sql/
│   ├── create_staging_tables.sql
│   ├── create_analytics_tables.sql
│   └── setup_database.sql
│
├── .env
├── .gitignore
├── README.md
└── requirements.txt
```
`wipe_all.py` resets the environment to a clean state (see the sketch below):
- Deletes the raw + cleaned data folders
- Drops & recreates the `staging` and `analytics` schemas
- Ensures deterministic pipeline runs
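A minimal sketch of what such a reset might look like with SQLAlchemy (folder paths and schema names follow the project layout; the real `wipe_all.py` may differ):

```python
import shutil
from pathlib import Path

from sqlalchemy import create_engine, text

FOLDERS = [Path("data_base/raw_data"), Path("data_base/cleaned_data")]

def wipe(db_url: str) -> None:
    # Delete and recreate the data folders
    for folder in FOLDERS:
        shutil.rmtree(folder, ignore_errors=True)
        folder.mkdir(parents=True, exist_ok=True)

    # Drop and recreate both schemas so every run starts from zero
    engine = create_engine(db_url)
    with engine.begin() as conn:
        for schema in ("staging", "analytics"):
            conn.execute(text(f"DROP SCHEMA IF EXISTS {schema} CASCADE"))
            conn.execute(text(f"CREATE SCHEMA {schema}"))
```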
`extract.py` handles raw data ingestion (sketched below):
- Downloads the dataset from Kaggle
- Extracts the ZIP
- Validates all CSVs
- Logs row counts & missing values
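A sketch of the extract step using the official Kaggle API client (the dataset slug is the public Olist dataset; the print-based validation stands in for the project's logger):

```python
from pathlib import Path

import pandas as pd
from kaggle.api.kaggle_api_extended import KaggleApi

RAW_DIR = Path("data_base/raw_data")

def extract() -> None:
    # Download and unzip the Olist dataset (requires Kaggle API credentials)
    api = KaggleApi()
    api.authenticate()
    api.dataset_download_files("olistbr/brazilian-ecommerce",
                               path=str(RAW_DIR), unzip=True)

    # Validate every CSV: log row counts and missing values
    for csv in sorted(RAW_DIR.glob("*.csv")):
        df = pd.read_csv(csv)
        print(f"{csv.name}: {len(df)} rows, "
              f"{int(df.isna().sum().sum())} missing values")
```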
`explore.py` (sketched below):
- Auto-discovers CSVs
- Logs:
  - shape
  - head
  - dtypes
  - missing values
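A sketch of the exploration loop (prints stand in for the project's logger):

```python
from pathlib import Path

import pandas as pd

def explore(raw_dir: str = "data_base/raw_data") -> None:
    # Auto-discover every CSV and log its basic profile
    for csv in sorted(Path(raw_dir).glob("*.csv")):
        df = pd.read_csv(csv)
        print(f"=== {csv.name} ===")
        print(f"shape: {df.shape}")
        print(df.head())
        print(df.dtypes)
        print(df.isna().sum())
```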
`clean.py` (sketched below):
- Cleans customers, sellers, and transactions
- Handles cancellations
- Converts timestamps
- Saves cleaned CSVs
- Loads them into the `staging` schema
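A condensed sketch of the cleaning pattern for the orders table (the real `clean.py` covers more tables; the file and column names come from the public Olist dataset, and the folder is assumed to exist after the wipe step):

```python
import pandas as pd
from sqlalchemy import create_engine

def clean_orders(db_url: str) -> None:
    engine = create_engine(db_url)
    orders = pd.read_csv("data_base/raw_data/olist_orders_dataset.csv")

    # Convert timestamp columns to proper datetimes
    for col in orders.filter(like="timestamp").columns:
        orders[col] = pd.to_datetime(orders[col], errors="coerce")

    # Flag cancellations rather than dropping them, so lifecycle status survives
    orders["is_canceled"] = orders["order_status"].eq("canceled")

    # Persist the cleaned copy, then load it into the staging schema
    orders.to_csv("data_base/cleaned_data/orders.csv", index=False)
    orders.to_sql("orders", engine, schema="staging",
                  if_exists="replace", index=False)
```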
`transform.py` builds the full star schema:

Dimensions: `dim_customer`, `dim_merchant`, `dim_product`, `dim_payment_type`, `dim_date`

Facts: `fact_orders`, `fact_order_items`, `fact_payments`
Includes:
- surrogate keys
- date key mapping
- lifecycle status
- item counts
- payment sequences
- referential integrity
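As an illustration of the date-key mapping, here is a minimal sketch of how `dim_date` could be derived from staged orders (it assumes `staging.orders` stores `order_purchase_timestamp` as a timestamp and that the DDL has already created `analytics.dim_date`; the real `transform.py` may differ):

```python
import pandas as pd
from sqlalchemy import create_engine

def build_dim_date(db_url: str) -> None:
    engine = create_engine(db_url)
    orders = pd.read_sql_table("orders", engine, schema="staging")

    # One row per calendar date seen in the data, keyed as YYYYMMDD
    seen = orders["order_purchase_timestamp"].dropna().dt.normalize().unique()
    dim_date = pd.DataFrame({"date": seen})
    dim_date["date_key"] = dim_date["date"].dt.strftime("%Y%m%d").astype(int)
    dim_date["year"] = dim_date["date"].dt.year
    dim_date["month"] = dim_date["date"].dt.month
    dim_date["day_of_week"] = dim_date["date"].dt.dayofweek

    # Facts later join to dim_date through the same YYYYMMDD mapping
    dim_date.to_sql("dim_date", engine, schema="analytics",
                    if_exists="append", index=False)
```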
`run_all.py` executes the full pipeline in order:
- wipe_all
- extract
- clean
- transform
All steps are timed, logged, and fail-fast (orchestrator sketched below).
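A minimal sketch of a subprocess-based orchestrator (module names follow the project layout; `check=True` provides the fail-fast behavior):

```python
import subprocess
import sys
import time

# The stages, in dependency order
STAGES = ["etl.wipe_all", "etl.extract", "etl.clean", "etl.transform"]

def run_all() -> None:
    for stage in STAGES:
        start = time.perf_counter()
        print(f"--> running {stage}")
        # A non-zero exit code raises CalledProcessError and stops the run
        subprocess.run([sys.executable, "-m", stage], check=True)
        print(f"<-- {stage} finished in {time.perf_counter() - start:.1f}s")

if __name__ == "__main__":
    run_all()
```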
Key features:
- Each stage is isolated, testable, and reusable.
- Color-coded logs, section banners, and timing decorators.
- All tables are defined explicitly in SQL, not implicitly in Python.
- `wipe_all.py` ensures clean, repeatable runs.
- Optimized for analytics and BI workloads.
- SQLAlchemy is used for staging + analytics loads.
- Validation at the extract, clean, and transform stages.
| Component | Tool |
|---|---|
| Language | Python 3.x |
| Data Processing | pandas |
| Database | PostgreSQL |
| ORM / Loader | SQLAlchemy |
| Environment | dotenv |
| Logging | custom ColorFormatter |
| Orchestration | Python subprocess DAG |
| Source Data | Kaggle (Olist Brazilian E-Commerce) |
Install dependencies:

```
pip install -r requirements.txt
```

Add the connection string to `.env`:

```
DB_URL=postgresql://user:password@localhost:5432/payflow
```

Run the full pipeline:

```
python -m etl.run_all
```
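A sketch of how `db_config.py` might load that connection string with python-dotenv (the `get_engine` helper name is an assumption):

```python
import os

from dotenv import load_dotenv
from sqlalchemy import create_engine

def get_engine():
    # Read DB_URL from .env so credentials stay out of the codebase
    load_dotenv()
    db_url = os.getenv("DB_URL")
    if not db_url:
        raise RuntimeError("DB_URL is not set; check your .env file")
    return create_engine(db_url)
```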
Dimensions:
- Customer
- Merchant
- Product
- Payment Type
- Date
Facts:
- Orders
- Order Items
- Payments
Each fact table links to dimensions via surrogate keys.
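For example, a monthly-revenue query against the warehouse might look like this (column names such as `order_key`, `order_date_key`, and `payment_value` are assumptions about the DDL; `get_engine` is the hypothetical helper sketched earlier):

```python
import pandas as pd

from etl.db_config import get_engine  # hypothetical helper, sketched above

QUERY = """
SELECT d.year, d.month, SUM(p.payment_value) AS revenue
FROM analytics.fact_payments p
JOIN analytics.fact_orders  o ON o.order_key = p.order_key
JOIN analytics.dim_date     d ON d.date_key  = o.order_date_key
GROUP BY d.year, d.month
ORDER BY d.year, d.month;
"""

print(pd.read_sql(QUERY, get_engine()))
```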
**Yomi Ismail**
Data Engineer & Product Operations Specialist