Sparkify: Creating a Relational Data Model using PostgreSQL

Author: Jocelyn Lutes

Project Notes: Created as part of Udacity's Data Engineering Nanodegree

Background

Sparkify is a new startup that is looking to revolutionize music streaming through the use of its Sparkify Music App. The analytics team at Sparkify is interested in understanding user behavior on the app (particularly which songs users are listening to). Currently, there is no easy way for the analytics team to retrieve the data that they need to answer their questions.

Problem Statement: For this project, the data engineering team has been tasked to build a relational database that is optimized for queries on song play analysis.

Contents of Repository

Data
sql_queries.py - Python script containing SQL queries to create tables, insert data into tables, drop tables, and perform other queries
create_tables.py - Python script that creates a postgres database containing empty tables
etl.py - Python script that extracts data from JSON files, transforms it to the appropriate data type or format, and loads it into a SQL table
test.ipynb - Jupyter Notebook containing sample queries

Instructions for Repository

Run create_tables.py to create database and tables
Run etl.py to load data into appropriate tables
Run cells in test.ipynb to test that data was loaded correctly

NOTE: To reset data tables, create_tables.py should ALWAYS be run before running etl.py

Summary of Workflow

Data

Song Dataset (JSON) - Contains data about a song, including song_id, title, song_duration, artist_id, artist_name, and information about the artist's location.
Log Dataset (JSON) - Contains simulated data related to a music streaming event, including user information and details about the session

Relational Data Schema

For this database, a star schema was used in which a fact table containing information about a listening session was related to four dimensional tables that provided expanded information about the session.
NOTE: For description of tables, see schema.md.

ETL Pipeline

Create postgreSQL database named sparkifydb
Create fact and dimension tables in sparkifydb according to schema
Extract data necessary for the songs table from song dataset using Python and pandas
Use psycopg2 and song_table_insert from sql_queries.py to insert song data into songs table
Extract data necessary for the artists table from song dataset using Python and pandas
Use psycopg2 and artist_table_insert from sql_queries.py to insert artist data into artist table
Extract time information from the log dataset and transform it into a pandas datetime object
Use psycopg2 and time_table_insert from sql_queries.py to insert time data into time table
Extract user information from the log dataset
Use psycopg2 and user_table_insert from sql_queries.py to insert user data into user table
Use select_song query from sql_queries.py to retrieve the song ID and artist ID for songs in the log data set and extract the remaining songplay data from the log data set
Use psycopg2 and the songplay_table_insert from sql_queries.py to insert data related to the songplay session into the songplay table

Sample Queries

SELECT * FROM songs
LIMIT 5;

SELECT * FROM songplays
WHERE song_id IS NOT NULL;

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
code		code
data		data
.gitignore		.gitignore
README.md		README.md
schema.md		schema.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Sparkify: Creating a Relational Data Model using PostgreSQL

Author: Jocelyn Lutes

Project Notes: Created as part of Udacity's Data Engineering Nanodegree

Background

Contents of Repository

Instructions for Repository

Summary of Workflow

Data

Relational Data Schema

ETL Pipeline

Sample Queries

About

Releases

Packages

Languages

jlu90/Sparkify-Postgres

Folders and files

Latest commit

History

Repository files navigation

Sparkify: Creating a Relational Data Model using PostgreSQL

Author: Jocelyn Lutes

Project Notes: Created as part of Udacity's Data Engineering Nanodegree

Background

Contents of Repository

Instructions for Repository

Summary of Workflow

Data

Relational Data Schema

ETL Pipeline

Sample Queries

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages