Skip to content

HKasie/Data-Modelling-with-Postgres

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Data-Modelling-with-Postgres

Project to build ETL pipeline in Postgres

Purpose of Project:

This Postgres database was created for Sparkify (a startup) to aid the analytics team in understanding what songs users are listening to, by analysing the data they’ve been collecting on songs and user activity on their new music streaming app. The Analytics team don’t have an easy way to query these data which resides in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.

How To Create tables

The Postgres database was created with tables designed to optimize queries on song play analysis. To create the database, a fact table (songplays) and four-dimension tables (songs, artists, time and users) were defined for a star schema for particular analytics focus. And an ETL pipeline written that to transfer data from files in two local directories into these tables in Postgres using Python and SQL.

Justification for creating Fact and Dimensions tables

The decision to normalize the database by defining facts and dimension tables for a star schema was to reduce redundancy and increase the data integrity. Also, the normalization of the database will make the data more intuitive and easy to understand by the users thus they will be able to make simplified queries to retrieve the data insight they are looking for which solves the problem the analytics team in Sparkify are facing.

Explanation of the files in the project:

Song data file: consists of files in JSON format which are a subset of real data from the Million Song Dataset. Each file contains metadata about a song and the artist of that song and are partitioned by the first three letters of each song's track ID.

Log data file: the file consists of activity log files in JSON format generated by an event simulator based on the songs in the song data file. The log files in the dataset are partitioned by year and month.

Test.ipynb : displays the first few rows of each table in the database. Hence can be used to check the database for tables created and data inserted.

Create_tables.py: this python file drops and creates tables. File should be run always to reset the tables before each time the etl.py file is run.

Etl.ipynb: this notebook reads and processes a single file from song and log data files and loads it into the database tables. It contains instructions for the ETL process for each table.

Etl.py: This file is for reading and processing files from song and log data files and loading the data into the database tables created. the file is completed with the content from the Etl.ipynb notebook above.

Sql_queries: the file contains the SQL queries for the ETL processes. The table has been imported into the following files: create_tables.py, etl.ipynb and etl.py files.

Readme: this file introduces and explains other files in the project. It contains information that is commonly required to understand what the project is about.

How to run the Python scripts:

Step 1: Run create_tables.py to create the database and tables. Step 2: Run test.ipynb to confirm the creation of the tables with the correct columns. Click "Restart kernel" to close the connection to the database. Step 3: Run etl.ipynb notebook to develop ETL processes for each table. Step 4: Run test.ipynb notebook to confirm data is inserted into the tables. Step 5: Rerun create_tables.py to reset the tables. Step 6: Run etl.py file to process the entire datasets. Step 7: Run test.ipynb notebook to confirm records were inserted into all tables. Note: Run the create_tables.py file always before running the etl.py file.

About

Project to build ETL pipeline in Postgres

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors