# IMDB Analytics Platform

A comprehensive data platform built on IMDB datasets, featuring automated data ingestion, dbt transformations, and an LLM-powered chat interface for natural language queries.
## Table of Contents

- Overview
- Architecture
- Features
- Tech Stack
- Getting Started
- Project Structure
- Usage
- Deployment
- Data Model
- Development
- Acknowledgments
- License
## Overview

This project demonstrates a complete modern data platform workflow:
- Data Ingestion: Automated daily ingestion from IMDB datasets to BigQuery
- Data Transformation: SQL-based transformations using dbt for analytical models
- LLM Interface: Natural language query interface powered by OpenAI GPT-4o (the core loop is sketched below)
The platform enables users to ask questions like:
- "What are the top 10 highest-rated movies?"
- "Show me Christopher Nolan's filmography"
- "Which actors worked with Martin Scorsese?"
## Architecture

```
┌──────────────────────────────────────────────┐
│                IMDB Datasets                 │
│         (7 TSV files, updated daily)         │
└──────────────────────┬───────────────────────┘
                       │
                       ▼
              ┌────────────────┐
              │   Ingestion    │
              │    (Python)    │
              └────────┬───────┘
                       │
                       ▼
              ┌────────────────┐
              │    BigQuery    │
              │  Bronze Layer  │
              │   (7 tables)   │
              └────────┬───────┘
                       │
                       ▼
              ┌────────────────┐
              │ dbt Transform  │
              │  Silver Layer  │
              │   (8 models)   │
              └────────┬───────┘
                       │
         ┌─────────────┴─────────────┐
         │                           │
         ▼                           ▼
┌────────────────┐          ┌─────────────────┐
│   Streamlit    │          │    Cron Job     │
│    + GPT-4o    │          │  (Daily 2 AM)   │
│ Chat Interface │          │ Auto-ingestion  │
└────────────────┘          └─────────────────┘
```
## Features

### Data Ingestion

- ✅ Automated download from IMDB datasets (7 TSV files)
- ✅ Conversion to Parquet format for efficiency
- ✅ Load into BigQuery bronze layer
- ✅ Idempotent pipeline (safe to re-run; sketched below)
- ✅ Scheduled daily execution via cron job
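A minimal sketch of one dataset's convert-and-load step (download elided), assuming placeholder file paths and table names; the real pipeline is `ingestion/ingest.py`. The `WRITE_TRUNCATE` disposition is what makes re-runs idempotent: each run replaces the bronze table rather than appending duplicates.

```python
# Sketch of one dataset's TSV -> Parquet -> BigQuery load. Paths, project,
# and dataset names are placeholders; the real pipeline is ingestion/ingest.py.
import pandas as pd
from google.cloud import bigquery

def load_dataset(tsv_path: str, table_id: str) -> None:
    # IMDB TSVs mark missing values with a literal "\N".
    df = pd.read_csv(tsv_path, sep="\t", na_values="\\N", low_memory=False)

    # Parquet gives a compact, typed intermediate file (via PyArrow).
    parquet_path = tsv_path.replace(".tsv", ".parquet")
    df.to_parquet(parquet_path)

    # WRITE_TRUNCATE replaces the table on every run, so re-running the
    # pipeline converges to the same state instead of appending duplicates.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    client = bigquery.Client()
    with open(parquet_path, "rb") as f:
        client.load_table_from_file(f, table_id, job_config=job_config).result()

load_dataset("data/title.ratings.tsv", "my-project.imdb_bronze.title_ratings")
```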
### dbt Transformations

- ✅ 5 staging models for data cleaning
- ✅ 3 marts models (movies, dim_actors, dim_directors)
- ✅ Partitioned and clustered tables for performance
- ✅ Comprehensive data quality tests
- ✅ Full documentation and lineage
### LLM Chat Interface

- ✅ Natural language to SQL conversion via GPT-4o
- ✅ Automatic SQL execution on BigQuery
- ✅ Interactive chat with conversation history
- ✅ Model selection (GPT-4o, GPT-4o-mini, GPT-4-turbo)
- ✅ Usage statistics tracking (tokens, questions, session time; sketched below)
- ✅ CSV export functionality
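Session statistics can be kept in Streamlit's `st.session_state`, which survives script reruns within a browser session. A sketch of the idea with illustrative names (the actual tracking lives in `app/stats.py`):

```python
# Sketch of session-level usage tracking in Streamlit; names are
# illustrative, the real logic lives in app/stats.py.
import time
import streamlit as st

def init_stats() -> None:
    # session_state persists across Streamlit reruns within one session.
    if "stats" not in st.session_state:
        st.session_state.stats = {"questions": 0, "tokens": 0, "start": time.time()}

def record(usage_tokens: int) -> None:
    # Call once per answered question with the token count from the API response.
    st.session_state.stats["questions"] += 1
    st.session_state.stats["tokens"] += usage_tokens

def render_sidebar() -> None:
    s = st.session_state.stats
    st.sidebar.metric("Questions", s["questions"])
    st.sidebar.metric("Tokens used", s["tokens"])
    st.sidebar.metric("Session time (min)", round((time.time() - s["start"]) / 60, 1))
```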
## Tech Stack

| Layer | Technology |
|---|---|
| Cloud Platform | Google Cloud Platform (GCP) |
| Data Warehouse | BigQuery |
| Data Pipeline | Python 3.11, Pandas, PyArrow |
| Transformation | dbt |
| LLM | OpenAI GPT-4o |
| Frontend | Streamlit |
| Containerization | Docker |
| CI/CD | GitHub Actions |
| Package Management | uv |
| Code Quality | Ruff (linter + formatter) |
## Getting Started

### Prerequisites

- Python 3.11+
- Google Cloud Platform account with BigQuery API enabled
- OpenAI API key
- `uv` package manager (see installation instructions)
### Installation

1. Clone the repository

   ```bash
   git clone https://github.com/Richard-GOZAN/imdb-analytics-platform.git
   cd imdb-analytics-platform
   ```

2. Install dependencies

   ```bash
   uv sync
   ```

3. Configure environment

   ```bash
   cp .env.template .env
   # Edit .env with your credentials
   ```

4. Run ingestion

   ```bash
   uv run python ingestion/ingest.py
   ```

5. Run dbt transformations

   ```bash
   cd dbt/transform
   uv run dbt run
   ```

6. Launch chat app

   ```bash
   uv run streamlit run app/chat.py
   ```

   Access: http://localhost:8501
## Project Structure

```
imdb-analytics-platform/
├── ingestion/              # Data ingestion pipeline
│   ├── config.py           # Configuration
│   ├── ingest.py           # Main script
│   └── utils.py            # Helper functions
├── dbt/transform/          # dbt transformation project
│   ├── models/
│   │   ├── staging/        # Staging models (5)
│   │   └── marts/          # Marts models (3)
│   ├── dbt_project.yml
│   └── profiles.yml
├── app/                    # LLM chat application
│   ├── chat.py             # Streamlit interface
│   ├── agent.py            # OpenAI agent
│   ├── bigquery_tool.py    # BigQuery execution
│   └── stats.py            # Usage tracking
├── docker/                 # Docker configurations
│   ├── Dockerfile.app
│   └── Dockerfile.ingestion
├── scripts/                # Automation scripts
│   ├── run_ingestion.sh    # Pipeline wrapper
│   └── setup_cron.sh       # Cron installation
├── .github/workflows/      # CI/CD
│   ├── test.yml            # Tests
│   └── deploy-app.yml      # Deployment
├── docker-compose.yml
├── pyproject.toml
└── README.md
```
## Usage

### Data Ingestion

```bash
# Full ingestion
uv run python ingestion/ingest.py

# Force re-download
uv run python ingestion/ingest.py --force-download
```

Datasets ingested:

- `name.basics` - Person information
- `title.basics` - Movie information
- `title.ratings` - Ratings and votes
- `title.crew` - Director assignments
- `title.principals` - Principal cast/crew
- `title.akas` - Alternative titles
- `title.episode` - TV episode info
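For illustration, here is one way the `--force-download` flag could be wired with `argparse`; names and structure are assumptions, not a copy of `ingestion/ingest.py`:

```python
# Hypothetical sketch of the --force-download flag; the real CLI is in
# ingestion/ingest.py and may differ.
import argparse
from pathlib import Path

DATASETS = ["name.basics", "title.basics", "title.ratings"]  # abbreviated list

def main() -> None:
    parser = argparse.ArgumentParser(description="Ingest IMDB datasets into BigQuery.")
    parser.add_argument(
        "--force-download",
        action="store_true",
        help="Re-download source files even if a cached copy exists.",
    )
    args = parser.parse_args()
    for name in DATASETS:
        local = Path("data") / f"{name}.tsv.gz"
        if args.force_download or not local.exists():
            print(f"Downloading {name} ...")  # download + load steps elided
        else:
            print(f"Using cached {local}")

if __name__ == "__main__":
    main()
```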
### dbt Transformations

```bash
cd dbt/transform

# Run all models
uv run dbt run

# Run tests
uv run dbt test

# Generate docs
uv run dbt docs generate && uv run dbt docs serve
```

Models:

- Staging (5 models): Data cleaning and filtering
- Marts (3 models):
  - `movies` - Fact table (11.9K rows)
  - `dim_actors` - Actor dimension (30.7K actors)
  - `dim_directors` - Director dimension (3.6K directors)
### Chat Application

```bash
uv run streamlit run app/chat.py
```

Example queries:
- "Top 10 highest-rated movies"
- "Christopher Nolan's filmography"
- "Actors who worked with Scorsese"
## Deployment

### Docker

```bash
# Build
docker compose build

# Run app
docker compose up app

# Run ingestion
docker compose run --rm ingestion
```

### Cron Job

```bash
# Set up the cron job (runs daily at 2 AM)
./scripts/setup_cron.sh
```

### CI/CD

GitHub Actions workflows:
- Lint: Automatic linting on push
- Deploy: Manual deployment to VM via SSH
## Data Model

`movies` - Fact table:

```
movie_id        STRING
title           STRING
release_year    INTEGER
average_rating  FLOAT64
num_votes       INTEGER
directors       ARRAY<STRUCT<id, name, birth_year>>
actors          ARRAY<STRUCT<id, name, characters, gender>>
```

Dimension tables:

- `dim_directors` - Director details with filmography
- `dim_actors` - Actor details with career stats
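Because `directors` and `actors` are nested `ARRAY<STRUCT>` columns, queries flatten them with `UNNEST`. A sketch of a typical query from Python (the project and dataset path is a placeholder):

```python
# Example query over the nested directors array via UNNEST.
# The `my-project.imdb_silver.movies` path is a placeholder.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT d.name, COUNT(*) AS n_movies, AVG(m.average_rating) AS avg_rating
FROM `my-project.imdb_silver.movies` AS m, UNNEST(m.directors) AS d
GROUP BY d.name
ORDER BY n_movies DESC
LIMIT 10
"""
for row in client.query(sql).result():
    print(row["name"], row["n_movies"], round(row["avg_rating"], 2))
```

Nesting people directly on the fact table keeps common questions such as "top directors by average rating" to a single table scan instead of multi-table joins.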
## Development

Pre-commit hooks automatically run:
- Code formatting (Ruff)
- Import sorting (Ruff)
- Style checks (Ruff)
```bash
# Run linting
uv run ruff check .

# Run formatting
uv run ruff format .
```

## License

MIT License - see LICENSE.
⭐ If you find this project useful, please star it!