ETL pipelines are the backbone of any data-related task. ETL is an acronym for Extract, Transform, and Load. Real-world data is often messy and requires a certain degree of preprocessing before it can be fed to end applications. ETL pipelines are designed to achieve this preprocessing: they take in raw data and spit out clean data stored in SQL or NoSQL databases.
Every ETL pipeline has three stages:
- Data extraction from one or more sources (e.g., flat files or a database)
- Data transformation using appropriate tools
- Data loading into SQL or NoSQL databases
The loaded data is then used in end applications. One possible use case is analyzing the cleaned data for business reporting using tools like Tableau and MS Power BI.
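As a rough illustration, a minimal ETL flow in Python might look like the sketch below; the file name, column name, and load target are placeholders rather than anything used in this repository:

```python
import sqlite3

import pandas as pd

# Extract: read raw data from a flat file (hypothetical path)
raw = pd.read_csv("raw_data.csv")

# Transform: a couple of basic cleaning steps as an example
clean = (
    raw.drop_duplicates()
       .dropna(subset=["id"])  # assumes an "id" column exists
       .rename(columns=str.lower)
)

# Load: write the cleaned data to a target database (SQLite here, purely for illustration)
with sqlite3.connect("etl_demo.db") as conn:
    clean.to_sql("clean_table", conn, if_exists="replace", index=False)
```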
This repository builds custom ETL pipelines in Python. Two data pipelines have been designed using the files in this repository:
- In the first ETL pipeline, raw data is extracted from flat files using Pandas. The extracted data is then transformed using Pandas and NumPy. Finally, the transformed data is loaded into a NoSQL database, MongoDB, hosted on MongoDB Atlas. The data loading step is carried out with the help of PyMongo, MongoDB's native driver for Python.
- In the second ETL pipeline, data is extracted from flat files using Pandas and transformed according to the requirements using Pandas and NumPy. Finally, the data is loaded into a SQL database, SQL Server, hosted locally. MS SQL Server's recommended connector for Python, pyodbc, is used in the data loading step. Due diligence has been exercised to ensure compliance with key database engineering concepts such as database normalization.
The repository contains the following files:

- `tk-5-normalization.py`: This file implements the complete pipeline, from extracting the data from flat files to creating three relational tables (two fact tables and one dimension table) in SQL Server. The script populates over 25k rows of data across 24+ columns in the three tables while keeping the database in Second Normal Form (2NF).
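  A rough sketch of the schema-creation step such a script might perform is shown below; the table and column names are hypothetical and not taken from the repository:

  ```python
  import pyodbc

  # Connect to a locally hosted SQL Server instance (connection details are placeholders)
  conn = pyodbc.connect(
      "DRIVER={ODBC Driver 17 for SQL Server};"
      "SERVER=localhost;DATABASE=etl_demo;Trusted_Connection=yes;"
  )
  cursor = conn.cursor()

  # One dimension table and two fact tables referencing it (an illustrative 2NF layout)
  cursor.execute("""
      CREATE TABLE dim_product (
          product_id INT PRIMARY KEY,
          product_name NVARCHAR(100),
          category NVARCHAR(50)
      )
  """)
  cursor.execute("""
      CREATE TABLE fact_sales (
          sale_id INT PRIMARY KEY,
          product_id INT FOREIGN KEY REFERENCES dim_product(product_id),
          quantity INT,
          amount DECIMAL(10, 2)
      )
  """)
  cursor.execute("""
      CREATE TABLE fact_returns (
          return_id INT PRIMARY KEY,
          product_id INT FOREIGN KEY REFERENCES dim_product(product_id),
          quantity INT
      )
  """)
  conn.commit()
  ```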
- `tk-5.py`: This file creates a simple database (First Normal Form) in SQL Server and populates a table with values by iterating over a Pandas DataFrame.
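  A minimal sketch of populating a table by iterating over a DataFrame with pyodbc might look like the following; the file, table, and column names are assumptions:

  ```python
  import pandas as pd
  import pyodbc

  df = pd.read_csv("raw_data.csv")  # hypothetical flat file

  conn = pyodbc.connect(
      "DRIVER={ODBC Driver 17 for SQL Server};"
      "SERVER=localhost;DATABASE=etl_demo;Trusted_Connection=yes;"
  )
  cursor = conn.cursor()

  # Insert one row per DataFrame record using a parameterized query
  for row in df.itertuples(index=False):
      cursor.execute(
          "INSERT INTO sales (order_id, product, quantity) VALUES (?, ?, ?)",
          row.order_id, row.product, row.quantity,
      )
  conn.commit()
  ```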
- `sqlserver.py`: This file was created to test the connection to a SQL Server instance hosted on the local machine. It also creates a table in the database and inserts a few rows into it.
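  A connection test of this kind typically boils down to a few lines; the driver name, server, and database below are assumptions:

  ```python
  import pyodbc

  # Open a connection to the local SQL Server instance using Windows authentication
  conn = pyodbc.connect(
      "DRIVER={ODBC Driver 17 for SQL Server};"
      "SERVER=localhost;DATABASE=master;Trusted_Connection=yes;"
  )
  cursor = conn.cursor()

  # Run a trivial query to confirm the connection works
  cursor.execute("SELECT @@VERSION")
  print(cursor.fetchone()[0])
  conn.close()
  ```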
- `mongo.py`: This file was created to establish a connection with a MongoDB database hosted on AWS through MongoDB Atlas. It uses a `config` module that contains the required credentials (connection string, user, password) to connect to MongoDB. The config module is not part of this repo; you can use your own config file with mongo.py to establish the connection.
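  A minimal sketch of such a connection, assuming a hypothetical `config` module that exposes a `CONNECTION_STRING` attribute (the real module may differ), might be:

  ```python
  import config  # local module holding credentials; not part of this repo
  from pymongo import MongoClient

  # Connect to the Atlas cluster using a connection string kept out of version control
  client = MongoClient(config.CONNECTION_STRING)

  # Ping the server to confirm the connection is alive
  client.admin.command("ping")
  print("Connected to MongoDB Atlas")
  ```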
- `data.py`: This file uses a custom Python class to extract and transform raw data from flat files so that it can be inserted into a database on a MongoDB cluster on Atlas. Two collections are created in the database, each storing half a dozen documents.
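  A sketch of how such a class might be structured is given below; the class name, file paths, and collection names are illustrative assumptions:

  ```python
  import pandas as pd
  from pymongo import MongoClient


  class FlatFileETL:
      """Extract a flat file, apply simple transformations, and emit MongoDB-ready documents."""

      def __init__(self, path):
          self.path = path

      def extract(self):
          return pd.read_csv(self.path)

      def transform(self, df):
          # Example transformations: drop fully empty rows and normalize column names
          df = df.dropna(how="all")
          df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
          return df

      def to_documents(self, df):
          # Convert the DataFrame into a list of dicts that PyMongo can insert
          return df.to_dict(orient="records")


  client = MongoClient("mongodb+srv://<user>:<password>@<cluster-url>/")  # placeholder URI
  db = client["etl_demo"]

  # Insert half a dozen documents into each of two collections (names are hypothetical)
  for csv_file, collection in [("customers.csv", "customers"), ("orders.csv", "orders")]:
      etl = FlatFileETL(csv_file)
      docs = etl.to_documents(etl.transform(etl.extract()))
      db[collection].insert_many(docs[:6])
  ```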