This project implements a complete ETL (Extract, Transform, Load) pipeline for an e-commerce dataset. It extracts raw CSV data, cleans and transforms it using Pandas, models it into relational tables, and loads it into a PostgreSQL database under a dedicated schema (yinc).
The pipeline is modular, reproducible, and structured to reflect real-world data engineering workflows.
This case study demonstrates:
- Data extraction from raw CSV files
- Data cleaning and transformation using Pandas
- Creation of relational tables
- Automated table creation in PostgreSQL
- Loading cleaned data into relational tables
- Environment-variable-based DB connection handling
- A fully reproducible ETL workflow
The final dataset is organized into five tables:
- customer
- product
- shipping
- order
- payment_method

yinc_ecommerce_case_study/
│
├── dataset/
│ ├── raw_data/
│ │ └── yinc_ecommerce.csv
│ └── cleaned_data/
│ ├── customer.csv
│ ├── product.csv
│ ├── shipping.csv
│ ├── order.csv
│ └── payment_method.csv
│
├── yinc_etl.ipynb
├── .gitignore
└── README.md

The raw dataset is loaded using Pandas:

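import pandas as pd

# Forward slashes work on Windows, macOS, and Linux
path = 'dataset/raw_data/yinc_ecommerce.csv'
yinc_df = pd.read_csv(path)

# Normalize column names: trim whitespace, lowercase, spaces to underscores
yinc_df.columns = (
    yinc_df.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_")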
)

Cleaning and transformation steps:
- Dropping rows missing critical identifiers
- Converting order_date to datetime
- Generating synthetic email_address
- Splitting dataset into normalized tables

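As a minimal sketch of the first two steps, the snippet below drops rows with missing identifiers and parses `order_date`. It assumes `order_id` and `customer_id` are the critical identifiers (the README does not say which columns are) and runs on a small made-up frame rather than the real dataset; `clean_orders` is a hypothetical helper name.

```python
import pandas as pd

# Made-up sample rows; the real data comes from yinc_ecommerce.csv
sample_df = pd.DataFrame({
    "order_id": [1, 2, None, 4],
    "customer_id": ["c1", None, "c3", "c4"],
    "order_date": ["2024-01-05", "2024-01-06", "2024-01-07", "not a date"],
})

def clean_orders(df, id_columns=("order_id", "customer_id")):
    """Drop rows missing any identifier and parse order_date.

    id_columns is an assumption; adjust it to the identifiers
    that matter in the real dataset.
    """
    cleaned = df.dropna(subset=list(id_columns)).copy()
    # errors="coerce" turns unparseable dates into NaT instead of raising
    cleaned["order_date"] = pd.to_datetime(cleaned["order_date"], errors="coerce")
    return cleaned.reset_index(drop=True)

cleaned_df = clean_orders(sample_df)
```

Coercing bad dates to `NaT` keeps the run going and leaves the decision about those rows (drop or repair) to a later, explicit step.
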
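# Keep one row per customer
customer_df = (
    yinc_df[['customer_id', 'customer_name', 'email', 'phone_number']]
    .drop_duplicates()
    .reset_index(drop=True)
)
# Replace the raw email column with a synthetic address
customer_df = customer_df.drop(columns='email')
# astype(str) keeps the concatenation valid when customer_id is numeric
customer_df['email_address'] = (
    customer_df['customer_name']
    .str.lower()
    .str.replace(" ", "_")
    + '@' + customer_df['customer_id'].astype(str) + '.com'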
)

from dotenv import load_dotenv
import os
import psycopg2

def get_db_connection():
    # Read DB_URL from the .env file and open a PostgreSQL connection
    load_dotenv()
    db_url = os.getenv("DB_URL")
    return psycopg2.connect(db_url)

CREATE SCHEMA IF NOT EXISTS yinc;
CREATE TABLE IF NOT EXISTS yinc.customer (...);
CREATE TABLE IF NOT EXISTS yinc.product (...);
CREATE TABLE IF NOT EXISTS yinc.shipping (...);
-- order is a reserved word in PostgreSQL, so the table name must be quoted
CREATE TABLE IF NOT EXISTS yinc."order" (...);
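CREATE TABLE IF NOT EXISTS yinc.payment_method (...);

import csv

def load_data_from_csv(csv_path):
    conn = get_db_connection()
    cursor = conn.cursor()
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row (assumes the cleaned CSV has one)
        for row in reader:
            cursor.execute(
                "INSERT INTO yinc.customer (...) VALUES (%s, %s, %s, %s)",
                row
            )
    conn.commit()
    cursor.close()
    conn.close()

Cleaned CSV files:
- customer.csv
- product.csv
- shipping.csv
- order.csv
- payment_method.csv
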
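The loading step can also be written so that one function serves every table. The sketch below is one possible shape, not the notebook's actual code: `load_rows` is a hypothetical helper that takes any DB-API connection (such as the one returned by `get_db_connection()`) plus the INSERT statement, sends all rows with a single `executemany` call, and rolls back if any row fails. It assumes each cleaned CSV starts with a header row.

```python
import csv

def load_rows(conn, insert_sql, csv_path):
    """Insert every data row of csv_path using one executemany call.

    conn is any DB-API connection; load_rows is a hypothetical helper,
    not part of the original notebook.
    """
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row (assumed present)
        rows = [tuple(row) for row in reader]

    cursor = conn.cursor()
    try:
        cursor.executemany(insert_sql, rows)
        conn.commit()
    except Exception:
        conn.rollback()  # leave the table unchanged if any row fails
        raise
    finally:
        cursor.close()
    return len(rows)
```

Passing the connection in, rather than opening it inside the function, lets the five tables share one connection and makes the function easy to exercise with a stand-in connection object. The caller remains responsible for closing the connection.
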
The connection string is read from a .env file:

DB_URL=postgresql://username:password@localhost:5432/database

⚠️ Ensure .env is in .gitignore

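If `DB_URL` is missing, `os.getenv` returns `None` and the problem only surfaces when the connection is attempted. A small check before connecting gives a clearer message. The sketch below is an optional addition, not part of the original pipeline; `read_db_url` is a hypothetical helper, and the accepted schemes are an assumption based on the example URL above.

```python
import os
from urllib.parse import urlparse

def read_db_url(env=os.environ):
    """Return DB_URL from env, raising a clear error if it is unusable.

    read_db_url is a hypothetical helper; the original code passes
    os.getenv("DB_URL") straight to psycopg2.connect.
    """
    db_url = env.get("DB_URL")
    if not db_url:
        raise RuntimeError("DB_URL is not set; add it to .env")
    parsed = urlparse(db_url)
    if parsed.scheme not in ("postgresql", "postgres"):
        raise RuntimeError("DB_URL must start with postgresql://")
    if not parsed.path.strip("/"):
        raise RuntimeError("DB_URL does not name a database")
    return db_url
```

Call it after `load_dotenv()` so that values from .env are already present in the environment.
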
pip install pandas numpy psycopg2 python-dotenv

- Place raw data in: dataset/raw_data/
- Run: yinc_etl.ipynb
- Output:
  - Cleaned CSVs
  - PostgreSQL tables
  - Loaded data

Resulting tables in the yinc schema:
- yinc.customer
- yinc.product
- yinc.shipping
- yinc.order
- yinc.payment_method

Yomi Ismail
Data Engineer & Product Operations Specialist