Data-Lake is a data lake for the music streaming startup Sparkify. The data is a collection of songs and user activity from their new music streaming app, and the goal of the project is to understand what songs users are listening to. The source data comes from two JSON directories stored in AWS S3. The project is written in PySpark and builds tables designed to optimize queries for song play analysis, together with the ETL pipeline that populates them.
ETL explanation
Data Lake Architecture
The architecture runs on AWS and consists of an EMR cluster that processes the data in parallel. The cluster accesses two S3 folders and loads two JSON datasets. From these two datasets the dimension and fact tables are created and written to another S3 folder.
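A minimal sketch of how the Spark session behind this setup can be created, assuming the hadoop-aws package is needed for s3a access (the package version and app name are assumptions, not taken from the project code):

```python
from pyspark.sql import SparkSession

def create_spark_session():
    """Create (or reuse) a SparkSession able to read from and write to S3 via s3a."""
    # On EMR the S3 connector is normally preinstalled, so the packages config
    # below may be unnecessary; the version is an assumption.
    spark = (
        SparkSession.builder
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0")
        .appName("sparkify-data-lake")
        .getOrCreate()
    )
    return spark
```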
Song Dataset
Folder path: 's3://udacity-dend/song_data'
The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song.
columns: artist_id, artist_latitude, artist_location, artist_longitude, artist_name, duration, num_songs, song_id, title, year
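To make the structure concrete, the song files can be read into a Spark dataframe roughly as follows (the wildcard depth assumes the usual song_data/<A>/<B>/<C>/<file>.json nesting and is an assumption):

```python
# Sketch: read every song JSON file under the song_data prefix.
song_data = "s3a://udacity-dend/song_data/*/*/*/*.json"  # wildcard depth is an assumption
song_df = spark.read.json(song_data)
song_df.printSchema()  # shows the columns listed above
```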
Log Dataset
Folder path: 's3://udacity-dend/log_data'
The second dataset consists of log files in JSON format generated by an event simulator. These simulate activity logs from a music streaming app based on specified configurations.
columns: artist, auth, firstName, gender, itemInSession, lastName, length, level, location, method, page, registration, sessionId, song, status, ts, userAgent, userId
Only rows where page = NextSong are loaded.
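As a hedged sketch, reading the log files and applying the NextSong filter looks roughly like this (the wildcard depth is an assumption):

```python
# Sketch: read the log JSON files and keep only the NextSong page events.
log_data = "s3a://udacity-dend/log_data/*/*/*.json"  # wildcard depth is an assumption
df = spark.read.json(log_data)
df = df.filter(df.page == "NextSong")
```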
Dimension Tables
Folder path: 's3a://datalakeeu/'
- Data from the Song Dataset (song_table, artist_table)
- song_table: song_id, title, artist_id, year, duration
- artist_table: artist_id, artist_name, artist_location, artist_latitude, artist_longitude
- Data from the Log Dataset (time_table, users), filtered on page = NextSong
- time_table: start_time, hour, day, week, month, year, weekday
- users: userId, firstName, lastName, gender, level
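Once etl.py has run, each dimension table lives as parquet under this output folder and can be loaded back for analysis. A small sketch, assuming an artists/ sub-folder (the sub-folder name is an assumption):

```python
# Sketch: read a dimension table back from the data lake and query it with Spark SQL.
artists = spark.read.parquet("s3a://datalakeeu/artists/")
artists.createOrReplaceTempView("artists")
spark.sql("SELECT artist_name, artist_location FROM artists LIMIT 10").show()
```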
Fact Table
Folder path: 's3a://datalakeeu/'
JOIN between the Song Dataset and the Log Dataset (songplay)
songplay: songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
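Continuing from the log dataframe df (filtered to NextSong) and the song dataframe song_df sketched above, the join could look like the following; generating the surrogate songplay_id with monotonically_increasing_id is an assumption:

```python
from pyspark.sql import functions as F

# Sketch: build the songplay fact table by joining log events with song metadata.
songplays_table = (
    df.join(song_df, song_df.title == df.song, "inner")
      .select(
          F.monotonically_increasing_id().alias("songplay_id"),
          # ts is still in epoch milliseconds here; the ETL converts it to a
          # timestamp first (see step 2.5 below).
          df.ts.alias("start_time"),
          df.userId.alias("user_id"),
          df.level,
          song_df.song_id,
          song_df.artist_id,
          df.sessionId.alias("session_id"),
          df.location,
          df.userAgent.alias("user_agent"),
      )
)
```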
etl.py is the only script in this project. It creates the dimension and fact tables from the two datasets and writes the results to an S3 folder. The script has two functions, process_song_data and process_log_data. The program first connects to Spark and then runs process_song_data followed by process_log_data, as in the sketch below.
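A minimal sketch of that flow (the exact function signatures and the paths passed in are assumptions):

```python
def main():
    spark = create_spark_session()
    input_data = "s3a://udacity-dend/"   # source datasets
    output_data = "s3a://datalakeeu/"    # target data lake folder

    process_song_data(spark, input_data, output_data)
    process_log_data(spark, input_data, output_data)

if __name__ == "__main__":
    main()
```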
1) def process_song_data()
1.1) read song data as dataframe (df)
1.2) extract from df the columns that correspond to songs_table as dataframe
1.3) save songs_table in parquet format to S3
1.4) extract from df the columns that correspond to artist_table as dataframe
1.5) save artist_table in parquet format to S3
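A hedged skeleton of steps 1.1-1.5 (the output sub-folder names and the partitioning of the songs table by year and artist_id are assumptions):

```python
def process_song_data(spark, input_data, output_data):
    # 1.1 read song data as dataframe
    df = spark.read.json(input_data + "song_data/*/*/*/*.json")

    # 1.2 + 1.3 songs table -> parquet (partition columns are an assumption)
    songs_table = df.select("song_id", "title", "artist_id", "year", "duration")
    songs_table.write.mode("overwrite") \
        .partitionBy("year", "artist_id") \
        .parquet(output_data + "songs/")

    # 1.4 + 1.5 artists table -> parquet
    artists_table = df.select("artist_id", "artist_name", "artist_location",
                              "artist_latitude", "artist_longitude").dropDuplicates()
    artists_table.write.mode("overwrite").parquet(output_data + "artists/")
```

Partitioning the songs table by year and artist_id keeps the parquet files small and aligned with common query filters.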
2) def process_log_data()
2.1) read log data as dataframe (df)
2.2) filter the dataframe so that the column page equals NextSong
2.3) extract from df the columns that correspond to users_table as dataframe
2.4) save users_table in parquet format to S3
2.5) create and add the timestamp column to the df with a UDF (user-defined function)
2.6) create and add the columns hour, day, week, month, year and weekday to the df
2.7) extract from df the columns that correspond to time_table as dataframe
2.8) save time_table in parquet format to S3
2.9) join df with the song data (song_df) on song_df.title == df.song; the columns for the songplay table are extracted into a dataframe
2.10) save songplay_table in parquet format to S3
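A hedged skeleton of steps 2.1-2.10. The output sub-folder names, the partitioning of the time table, and the exact UDF are assumptions; for brevity the hour/day/week columns are derived with Spark's built-in date functions rather than separate UDFs:

```python
from datetime import datetime
from pyspark.sql.functions import udf, col, monotonically_increasing_id
from pyspark.sql.types import TimestampType

def process_log_data(spark, input_data, output_data):
    # 2.1 + 2.2 read log data and keep only NextSong events
    df = spark.read.json(input_data + "log_data/*/*/*.json")
    df = df.filter(col("page") == "NextSong")

    # 2.3 + 2.4 users table -> parquet
    users_table = df.select("userId", "firstName", "lastName",
                            "gender", "level").dropDuplicates()
    users_table.write.mode("overwrite").parquet(output_data + "users/")

    # 2.5 add a timestamp column derived from the epoch-millisecond ts field
    get_timestamp = udf(lambda ts: datetime.fromtimestamp(ts / 1000.0), TimestampType())
    df = df.withColumn("start_time", get_timestamp(col("ts")))

    # 2.6 + 2.7 + 2.8 time table -> parquet (partition columns are an assumption)
    time_table = df.selectExpr(
        "start_time",
        "hour(start_time) as hour",
        "dayofmonth(start_time) as day",
        "weekofyear(start_time) as week",
        "month(start_time) as month",
        "year(start_time) as year",
        "dayofweek(start_time) as weekday",
    ).dropDuplicates()
    time_table.write.mode("overwrite").partitionBy("year", "month") \
        .parquet(output_data + "time/")

    # 2.9 + 2.10 songplay fact table (see the Fact Table sketch above
    # for the full column mapping) -> parquet
    song_df = spark.read.json(input_data + "song_data/*/*/*/*.json")
    songplays_table = (
        df.join(song_df, song_df.title == df.song, "inner")
          .withColumn("songplay_id", monotonically_increasing_id())
          .select("songplay_id", "start_time",
                  col("userId").alias("user_id"), "level",
                  "song_id", "artist_id",
                  col("sessionId").alias("session_id"),
                  "location", col("userAgent").alias("user_agent"))
    )
    songplays_table.write.mode("overwrite").parquet(output_data + "songplays/")
```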