Sparkify is a new startup that is looking to revolutionize music streaming through the use of its Sparkify Music App. The analytics team at Sparkify is interested in understanding user behavior on the app (particularly which songs users are listening to). Currently, there is no easy way for the analytics team to retrieve the data that they need to answer their questions.
Problem Statement: For this project, the data engineering team has been tasked with building a relational database optimized for queries on song play analysis.
- Data - directory containing the song and log JSON datasets
- sql_queries.py - Python script containing SQL queries to create tables, insert data into tables, drop tables, and perform other queries
- create_tables.py - Python script that creates a postgres database containing empty tables
- etl.py - Python script that extracts data from JSON files, transforms it to the appropriate data type or format, and loads it into a SQL table
- test.ipynb - Jupyter Notebook containing sample queries
- Run `create_tables.py` to create the database and tables
- Run `etl.py` to load data into the appropriate tables
- Run the cells in `test.ipynb` to test that the data was loaded correctly
- NOTE: To reset the data tables, `create_tables.py` should ALWAYS be run before running `etl.py`
- Song Dataset (JSON) - Contains data about a song, including `song_id`, `title`, `song_duration`, `artist_id`, `artist_name`, and information about the artist's location
- Log Dataset (JSON) - Contains simulated data related to a music streaming event, including user information and details about the session (a minimal read sketch follows below)
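As a minimal sketch of the extract step, both datasets can be read into pandas DataFrames. The file names below are placeholders, and the files are assumed to be line-delimited JSON (one record per line); `etl.py` handles the actual traversal of the data directory.

```python
import pandas as pd

# Placeholder paths -- etl.py walks the data directory for every .json file.
song_df = pd.read_json("song_file.json", lines=True)  # song metadata records
log_df = pd.read_json("log_file.json", lines=True)    # streaming-event records

print(song_df.columns.tolist())
print(log_df.columns.tolist())
```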
- For this database, a star schema was used: a fact table containing information about each listening session is related to four dimension tables that provide expanded information about the session (a rough sketch follows below)
- NOTE: For a description of the tables, see `schema.md`
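As a rough illustration of the star layout, the fact table holds one row per listening session and carries keys into the four dimensions (songs, artists, time, and users, per the ETL steps below). The snippet is a hedged sketch in the style of `sql_queries.py`; the column names are assumptions, and `schema.md` remains the authoritative reference.

```python
# Illustrative sketch only -- the real CREATE statements live in sql_queries.py,
# and the authoritative table descriptions are in schema.md. Column names here
# (start_time, user_id, etc.) are assumptions about a typical songplays fact table.
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id SERIAL PRIMARY KEY,
        start_time  timestamp,  -- links to the time dimension
        user_id     int,        -- links to the users dimension
        song_id     varchar,    -- links to the songs dimension
        artist_id   varchar     -- links to the artists dimension
    );
"""
```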
- Create a PostgreSQL database named `sparkifydb`
- Create the fact and dimension tables in `sparkifydb` according to the schema
- Extract the data needed for the songs table from the song dataset using Python and pandas (see the sketches after this list)
- Use psycopg2 and `song_table_insert` from `sql_queries.py` to insert song data into the songs table
- Extract the data needed for the artists table from the song dataset using Python and pandas
- Use psycopg2 and `artist_table_insert` from `sql_queries.py` to insert artist data into the artists table
- Extract time information from the log dataset and transform it into a pandas datetime object
- Use psycopg2 and `time_table_insert` from `sql_queries.py` to insert time data into the time table
- Extract user information from the log dataset
- Use psycopg2 and `user_table_insert` from `sql_queries.py` to insert user data into the users table
- Use the `select_song` query from `sql_queries.py` to retrieve the song ID and artist ID for songs in the log dataset, and extract the remaining songplay data from the log dataset
- Use psycopg2 and `songplay_table_insert` from `sql_queries.py` to insert data related to the songplay session into the songplays table
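A hedged sketch of the songs/artists load described above, assuming `sql_queries.py` exposes the named insert statements and that they accept the fields listed in the song dataset description (the exact column order and the `artist_location` name are assumptions; `schema.md` has the definitive layout).

```python
import pandas as pd
import psycopg2
from sql_queries import song_table_insert, artist_table_insert

# Connection parameters are placeholders; adjust for your environment.
conn = psycopg2.connect("dbname=sparkifydb")
cur = conn.cursor()

df = pd.read_json("song_file.json", lines=True)  # placeholder path

# Field names follow the song dataset description; the order each insert
# statement expects is an assumption.
song_data = df[["song_id", "title", "artist_id", "song_duration"]].values[0].tolist()
cur.execute(song_table_insert, song_data)

artist_data = df[["artist_id", "artist_name", "artist_location"]].values[0].tolist()
cur.execute(artist_table_insert, artist_data)

conn.commit()
conn.close()
```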
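A similar sketch for the time and user loads. It assumes the log events carry an epoch-millisecond timestamp column (called `ts` here) and a handful of user fields; those column names are assumptions about the log format, not taken from this project's docs.

```python
import pandas as pd
import psycopg2
from sql_queries import time_table_insert, user_table_insert

conn = psycopg2.connect("dbname=sparkifydb")  # placeholder connection parameters
cur = conn.cursor()

log_df = pd.read_json("log_file.json", lines=True)  # placeholder path

# Transform the raw timestamp into pandas datetime objects, then break each one
# into the parts a typical time dimension stores (this breakdown is an assumption).
timestamps = pd.to_datetime(log_df["ts"], unit="ms")  # "ts" is an assumed field name
for ts in timestamps:
    cur.execute(time_table_insert,
                (ts, ts.hour, ts.day, ts.isocalendar()[1], ts.month, ts.year, ts.weekday()))

# User fields below are assumed names; duplicates are dropped so each user is inserted once.
users = log_df[["userId", "firstName", "lastName", "gender", "level"]].drop_duplicates()
for _, row in users.iterrows():
    cur.execute(user_table_insert, list(row))

conn.commit()
conn.close()
```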
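Finally, a sketch of the songplay load: `select_song` looks up the song and artist IDs for each log event, and the remaining fields come from the event itself. The lookup parameters and the fact-table column order are assumptions about `sql_queries.py` and the log format.

```python
import pandas as pd
import psycopg2
from sql_queries import select_song, songplay_table_insert

conn = psycopg2.connect("dbname=sparkifydb")  # placeholder connection parameters
cur = conn.cursor()

log_df = pd.read_json("log_file.json", lines=True)  # placeholder path

for _, row in log_df.iterrows():
    # Match the event to the songs/artists dimensions; the lookup keys
    # (song title, artist name, duration) are assumed to be what select_song expects.
    cur.execute(select_song, (row.song, row.artist, row.length))
    match = cur.fetchone()
    song_id, artist_id = match if match else (None, None)

    # Remaining songplay fields come from the log event (field names are assumptions).
    cur.execute(songplay_table_insert,
                (pd.to_datetime(row.ts, unit="ms"), row.userId, row.level,
                 song_id, artist_id, int(row.sessionId), row.location, row.userAgent))

conn.commit()
conn.close()
```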
```sql
SELECT * FROM songs
LIMIT 5;

SELECT * FROM songplays
WHERE song_id IS NOT NULL;
```
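The sample queries above are meant to be run from `test.ipynb`; a query can also be issued directly with psycopg2 (connection details below are placeholders):

```python
import psycopg2

conn = psycopg2.connect("dbname=sparkifydb")  # adjust connection parameters as needed
cur = conn.cursor()
cur.execute("SELECT * FROM songs LIMIT 5;")
for row in cur.fetchall():
    print(row)
conn.close()
```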