Movies_Analysis

Overview

Architecture Diagram

Tech Stack

Big Data Frameworks

PySpark
Hive

Python Libraries

requests
boto3
Beautiful Soup
IMDbPY

AWS

EMR
S3
Athena
QuickSight

Pipeline Orchestration

Apache Airflow

Project Directory Structure

Movie_Analysis
├─ .github
│  └─ ISSUE_TEMPLATE
│     ├─ story.md
│     └─ task.md
├─ .gitignore
├─ app
│  ├─ code
│  │  ├─ create_bucket.py
│  │  ├─ data_cleaning.py
│  │  ├─ data_quality_check.py
│  │  ├─ get_movie_data.py
│  │  ├─ schema_creation.py
│  │  └─ __init__.py
│  ├─ conf
│  │  ├─ config.yml
│  │  └─ __init__.py
│  ├─ data
│  ├─ parquet
│  ├─ run.py
│  ├─ utils
│  │  ├─ helper.py
│  │  └─ s3_helper.py
│  └─ __init__.py
├─ dags
│  └─ movie_data_dag.py
├─ images
│  ├─ movie_data_dag.PNG
│  └─ movie_ER.jpg
├─ LICENSE
├─ main.py
├─ README.md
└─ run.sh

Files Description

config.yml - configuration file
main.py - main file to run all the modules
get_movie_data.py - extracts recent movie data using IMDbPY
data_cleaning.py - preprocesses the data using PySpark
schema_creation.py - creates schema for tables in AWS Athena
data_quality_check.py - performs the following checks:
- row count check
- null values check
- table exists check
helper.py - general purpose helper functions
s3_helper.py - AWS S3 helper functions
movie_data_dag.py - Airflow DAG that runs every Friday to extract data on recently released movies
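
The three checks listed for data_quality_check.py (row count, null values, table exists) can be illustrated with a minimal sketch. This is not the project's implementation: it uses Python's built-in sqlite3 module as a stand-in for Athena so that it runs locally, and the function names (`table_exists`, `row_count`, `null_count`, `run_checks`), the `min_rows` parameter, and the failure messages are all hypothetical.

```python
import sqlite3


def table_exists(conn, table):
    """Table exists check: True if a table with this name is present."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    return row is not None


def row_count(conn, table):
    """Row count check: number of rows currently in the table."""
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


def null_count(conn, table, column):
    """Null values check: number of rows where the column is NULL."""
    return conn.execute(
        f'SELECT COUNT(*) FROM "{table}" WHERE "{column}" IS NULL'
    ).fetchone()[0]


def run_checks(conn, table, required_columns, min_rows=1):
    """Run all three checks; return a list of failure messages (empty if all pass)."""
    # Without the table, the remaining checks cannot run, so stop early.
    if not table_exists(conn, table):
        return [f"table {table} does not exist"]

    failures = []
    found = row_count(conn, table)
    if found < min_rows:
        failures.append(f"{table}: expected at least {min_rows} rows, found {found}")

    for column in required_columns:
        nulls = null_count(conn, table, column)
        if nulls:
            failures.append(f"{table}.{column}: {nulls} null value(s)")
    return failures
```

Returning a list of failures rather than raising on the first one lets a caller report every problem in a single run. Against Athena, the same three queries would be submitted through an Athena client rather than a local connection, and the table lookup would go through the data catalog instead of `sqlite_master`.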

Setup

License

MIT
