The NBA Data Lake is an end-to-end data analytics pipeline designed to fetch, process, store, and transform NBA player data for analytics and visualization. This README provides an overview of the project structure, components, and instructions to set up and use the system.
The NBA Data Lake performs the following tasks:

- Fetches raw NBA data from an external API.
- Stores the raw data in an S3 bucket.
- Processes and cleans the data using an AWS Glue ETL job.
- Queries the processed data using Amazon Athena.
- Visualizes the data in Amazon QuickSight.
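As a rough sketch, the fetch-and-store steps might look like the following. The endpoint URL, request header, bucket name, and S3 key layout are illustrative assumptions, not values taken from the project:

```python
import json
import urllib.request
from datetime import date


def raw_s3_key(run_date: date, dataset: str = "players") -> str:
    """Build a date-partitioned S3 key for a raw data drop (hypothetical layout)."""
    return (f"raw/{dataset}/year={run_date.year}"
            f"/month={run_date.month:02d}/day={run_date.day:02d}/{dataset}.json")


def fetch_players(api_key: str, url: str) -> list:
    """Fetch raw player records from the external API (placeholder endpoint/header)."""
    req = urllib.request.Request(url, headers={"Ocp-Apim-Subscription-Key": api_key})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def upload_raw(players: list, bucket: str, run_date: date) -> None:
    """Upload the raw JSON payload to S3 (requires boto3 and AWS credentials)."""
    import boto3  # imported here so the pure helpers above have no AWS dependency
    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=raw_s3_key(run_date),
        Body=json.dumps(players),
    )
```

The date-partitioned key layout keeps each day's raw drop separate, which makes reprocessing a single day cheap and plays well with Glue crawlers.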
The architecture comprises the following AWS components:

- Amazon S3: Stores raw and processed data.
- AWS Glue:
  - Crawler: Catalogs the data and creates a schema in the Glue Data Catalog.
  - ETL Job: Transforms raw data into a structured format.
- Amazon Athena: Queries the processed data for analytics.
- Amazon QuickSight: Provides data visualization and reporting.
- AWS IAM: Manages permissions for resources and services.
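To make the Glue ETL stage concrete, here is a minimal sketch of the kind of cleaning such a job performs. The column names (`PlayerID`, `Team`, `Position`, etc.) are illustrative assumptions, not the project's actual schema:

```python
def clean_player_records(raw: list) -> list:
    """Keep a fixed set of columns and drop records missing a player ID.

    Column names are hypothetical; the real Glue job defines its own schema.
    """
    columns = ("PlayerID", "FirstName", "LastName", "Team", "Position")
    cleaned = []
    for record in raw:
        if record.get("PlayerID") is None:
            continue  # a record without a key is unusable downstream
        # Project onto the known columns, filling gaps with empty strings
        cleaned.append({col: record.get(col, "") for col in columns})
    return cleaned
```

Normalizing to a fixed column set like this is what lets Athena query the processed data with a stable schema.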
```
nba-data-lake/
├── media/               # Media files
│
├── src/
│   └── nba_data_lake.py # Python script
│
├── .env.example         # Configurable environment variables
├── .gitignore           # Ignored files
├── manifest.json        # Manifest file for QuickSight
├── README.md            # Project documentation
└── requirements.txt     # Python dependencies
```
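The `manifest.json` tells QuickSight where to find the processed files in S3. A minimal example of the standard manifest shape, with a placeholder bucket path and format:

```json
{
  "fileLocations": [
    {"URIPrefixes": ["s3://my-nba-data-lake/processed/"]}
  ],
  "globalUploadSettings": {"format": "JSON"}
}
```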
Prerequisites

- An AWS account with access to the following services: S3, Glue, Athena, and QuickSight.
- AWS CLI installed and configured.
- An API key from sportsdata.io.
1. Clone the repository:

   ```bash
   git clone https://github.com/oyogbeche/nba_data_lake.git
   cd nba_data_lake
   ```

2. Configure the `.env` file using `.env.example`.

3. Create and activate a virtual environment:

   ```bash
   python -m venv myenv
   myenv\Scripts\Activate
   ```

4. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

5. Configure AWS credentials:

   ```bash
   aws configure
   ```

6. Run the application:

   ```bash
   python src/nba_data_lake.py
   ```
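Once the pipeline has run, you can query the cataloged data through Athena. A hedged sketch with boto3; the database name, table name (`nba_players`), and results prefix are placeholders, not the project's actual identifiers:

```python
def athena_output_uri(bucket: str, prefix: str = "athena-results") -> str:
    """Build the S3 URI where Athena writes query results (layout is an assumption)."""
    return f"s3://{bucket}/{prefix}/"


def run_sample_query(database: str, bucket: str) -> str:
    """Submit a sample query to Athena and return its execution id.

    Database/table names are placeholders; requires boto3 and AWS credentials.
    """
    import boto3
    client = boto3.client("athena")
    response = client.start_query_execution(
        QueryString="SELECT FirstName, LastName, Team FROM nba_players LIMIT 10",
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": athena_output_uri(bucket)},
    )
    return response["QueryExecutionId"]
```

Athena writes result files to the `OutputLocation`, so the bucket must be writable by the role that submits the query.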