# IMDB Analytics Platform

A comprehensive data platform built on IMDB datasets, featuring automated data ingestion, dbt transformations, and an LLM-powered chat interface for natural language queries.
## Table of Contents

- Overview
- Architecture
- Features
- Tech Stack
- Getting Started
- Project Structure
- Usage
- Deployment
- Data Model
- Development
- Acknowledgments
- License
## Overview

This project demonstrates a complete modern data platform workflow:
- Data Ingestion: Automated daily ingestion from IMDB datasets to BigQuery
- Data Transformation: SQL-based transformations using dbt for analytical models
- LLM Interface: Natural language query interface powered by OpenAI GPT-4o (the core loop is sketched below)
The platform enables users to ask questions like:
- "What are the top 10 highest-rated movies?"
- "Show me Christopher Nolan's filmography"
- "Which actors worked with Martin Scorsese?"
## Architecture

```
┌──────────────────────────────────────────────┐
│                IMDB Datasets                 │
│         (7 TSV files, updated daily)         │
└──────────────────────┬───────────────────────┘
                       │
                       ▼
              ┌────────────────┐
              │   Ingestion    │
              │    (Python)    │
              └────────┬───────┘
                       │
                       ▼
              ┌────────────────┐
              │    BigQuery    │
              │  Bronze Layer  │
              │   (7 tables)   │
              └────────┬───────┘
                       │
                       ▼
              ┌────────────────┐
              │ dbt Transform  │
              │  Silver Layer  │
              │   (8 models)   │
              └────────┬───────┘
                       │
         ┌─────────────┴─────────────┐
         │                           │
         ▼                           ▼
┌────────────────┐          ┌─────────────────┐
│   Streamlit    │          │    Cron Job     │
│    + GPT-4o    │          │  (Daily 2 AM)   │
│ Chat Interface │          │ Auto-ingestion  │
└────────────────┘          └─────────────────┘
```
## Features

### Data Ingestion

- ✅ Automated download from IMDB datasets (7 TSV files)
- ✅ Conversion to Parquet format for efficiency
- ✅ Load into BigQuery bronze layer
- ✅ Idempotent pipeline (safe to re-run; sketched below)
- ✅ Scheduled daily execution via cron job
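A minimal sketch of one dataset's convert-and-load step (download elided), assuming placeholder file paths and table names; the real pipeline is `ingestion/ingest.py`. The `WRITE_TRUNCATE` disposition is what makes re-runs idempotent: each run replaces the bronze table rather than appending duplicates.

```python
# Sketch of one dataset's TSV -> Parquet -> BigQuery load. Paths, project,
# and dataset names are placeholders; the real pipeline is ingestion/ingest.py.
import pandas as pd
from google.cloud import bigquery

def load_dataset(tsv_path: str, table_id: str) -> None:
    # IMDB TSVs mark missing values with a literal "\N".
    df = pd.read_csv(tsv_path, sep="\t", na_values="\\N", low_memory=False)

    # Parquet gives a compact, typed intermediate file (via PyArrow).
    parquet_path = tsv_path.replace(".tsv", ".parquet")
    df.to_parquet(parquet_path)

    # WRITE_TRUNCATE replaces the table on every run, so re-running the
    # pipeline converges to the same state instead of appending duplicates.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    client = bigquery.Client()
    with open(parquet_path, "rb") as f:
        client.load_table_from_file(f, table_id, job_config=job_config).result()

load_dataset("data/title.ratings.tsv", "my-project.imdb_bronze.title_ratings")
```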
### dbt Transformations

- ✅ 5 staging models for data cleaning
- ✅ 3 marts models (movies, dim_actors, dim_directors)
- ✅ Partitioned and clustered tables for performance
- ✅ Comprehensive data quality tests
- ✅ Full documentation and lineage
### LLM Chat Interface

- ✅ Natural language to SQL conversion via GPT-4o
- ✅ Automatic SQL execution on BigQuery
- ✅ Interactive chat with conversation history
- ✅ Model selection (GPT-4o, GPT-4o-mini, GPT-4-turbo)
- ✅ Usage statistics tracking (tokens, questions, session time; sketched below)
- ✅ CSV export functionality
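Session statistics can be kept in Streamlit's `st.session_state`, which survives script reruns within a browser session. A sketch of the idea with illustrative names (the actual tracking lives in `app/stats.py`):

```python
# Sketch of session-level usage tracking in Streamlit; names are
# illustrative, the real logic lives in app/stats.py.
import time
import streamlit as st

def init_stats() -> None:
    # session_state persists across Streamlit reruns within one session.
    if "stats" not in st.session_state:
        st.session_state.stats = {"questions": 0, "tokens": 0, "start": time.time()}

def record(usage_tokens: int) -> None:
    # Call once per answered question with the token count from the API response.
    st.session_state.stats["questions"] += 1
    st.session_state.stats["tokens"] += usage_tokens

def render_sidebar() -> None:
    s = st.session_state.stats
    st.sidebar.metric("Questions", s["questions"])
    st.sidebar.metric("Tokens used", s["tokens"])
    st.sidebar.metric("Session time (min)", round((time.time() - s["start"]) / 60, 1))
```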
## Tech Stack

| Layer | Technology |
|---|---|
| Cloud Platform | Google Cloud Platform (GCP) |
| Data Warehouse | BigQuery |
| Data Pipeline | Python 3.11, Pandas, PyArrow |
| Transformation | dbt |
| LLM | OpenAI GPT-4o |
| Frontend | Streamlit |
| Containerization | Docker |
| CI/CD | GitHub Actions |
| Package Management | uv |
| Code Quality | Ruff (linter + formatter) |
## Getting Started

### Prerequisites

- Python 3.11+
- Google Cloud Platform account with BigQuery API enabled
- OpenAI API key
- `uv` package manager (see installation instructions)
### Installation

1. Clone the repository

   ```bash
   git clone https://github.com/Richard-GOZAN/imdb-analytics-platform.git
   cd imdb-analytics-platform
   ```

2. Install dependencies

   ```bash
   uv sync
   ```

3. Configure environment

   ```bash
   cp .env.template .env
   # Edit .env with your credentials
   ```

4. Run ingestion

   ```bash
   uv run python ingestion/ingest.py
   ```

5. Run dbt transformations

   ```bash
   cd dbt/transform
   uv run dbt run
   ```

6. Launch chat app

   ```bash
   uv run streamlit run app/chat.py
   ```

   Access: http://localhost:8501
## Project Structure

```
imdb-analytics-platform/
├── ingestion/              # Data ingestion pipeline
│   ├── config.py           # Configuration
│   ├── ingest.py           # Main script
│   └── utils.py            # Helper functions
├── dbt/transform/          # dbt transformation project
│   ├── models/
│   │   ├── staging/        # Staging models (5)
│   │   └── marts/          # Marts models (3)
│   ├── dbt_project.yml
│   └── profiles.yml
├── app/                    # LLM chat application
│   ├── chat.py             # Streamlit interface
│   ├── agent.py            # OpenAI agent
│   ├── bigquery_tool.py    # BigQuery execution
│   └── stats.py            # Usage tracking
├── docker/                 # Docker configurations
│   ├── Dockerfile.app
│   └── Dockerfile.ingestion
├── scripts/                # Automation scripts
│   ├── run_ingestion.sh    # Pipeline wrapper
│   └── setup_cron.sh       # Cron installation
├── .github/workflows/      # CI/CD
│   ├── test.yml            # Tests
│   └── deploy-app.yml      # Deployment
├── docker-compose.yml
├── pyproject.toml
└── README.md
```
## Usage

### Data Ingestion

```bash
# Full ingestion
uv run python ingestion/ingest.py

# Force re-download
uv run python ingestion/ingest.py --force-download
```

Datasets ingested:

- `name.basics` - Person information
- `title.basics` - Movie information
- `title.ratings` - Ratings and votes
- `title.crew` - Director assignments
- `title.principals` - Principal cast/crew
- `title.akas` - Alternative titles
- `title.episode` - TV episode info
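For illustration, here is one way the `--force-download` flag could be wired with `argparse`; names and structure are assumptions, not a copy of `ingestion/ingest.py`:

```python
# Hypothetical sketch of the --force-download flag; the real CLI is in
# ingestion/ingest.py and may differ.
import argparse
from pathlib import Path

DATASETS = ["name.basics", "title.basics", "title.ratings"]  # abbreviated list

def main() -> None:
    parser = argparse.ArgumentParser(description="Ingest IMDB datasets into BigQuery.")
    parser.add_argument(
        "--force-download",
        action="store_true",
        help="Re-download source files even if a cached copy exists.",
    )
    args = parser.parse_args()
    for name in DATASETS:
        local = Path("data") / f"{name}.tsv.gz"
        if args.force_download or not local.exists():
            print(f"Downloading {name} ...")  # download + load steps elided
        else:
            print(f"Using cached {local}")

if __name__ == "__main__":
    main()
```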
### dbt Transformations

```bash
cd dbt/transform

# Run all models
uv run dbt run

# Run tests
uv run dbt test

# Generate docs
uv run dbt docs generate && uv run dbt docs serve
```

Models:

- Staging (5 models): Data cleaning and filtering
- Marts (3 models):
  - `movies` - Fact table (11.9K rows)
  - `dim_actors` - Actor dimension (30.7K actors)
  - `dim_directors` - Director dimension (3.6K directors)
### Chat Application

```bash
uv run streamlit run app/chat.py
```

Example queries:
- "Top 10 highest-rated movies"
- "Christopher Nolan's filmography"
- "Actors who worked with Scorsese"
## Deployment

### Docker

```bash
# Build
docker compose build

# Run app
docker compose up app

# Run ingestion
docker compose run --rm ingestion
```

### Cron Job

```bash
# Set up the cron job (runs daily at 2 AM)
./scripts/setup_cron.sh
```

### CI/CD

GitHub Actions workflows:
- Lint: Automatic linting on push
- Deploy: Manual deployment to VM via SSH
## Data Model

`movies` - Fact table:

```
movie_id        STRING
title           STRING
release_year    INTEGER
average_rating  FLOAT64
num_votes       INTEGER
directors       ARRAY<STRUCT<id, name, birth_year>>
actors          ARRAY<STRUCT<id, name, characters, gender>>
```

Dimension tables:

- `dim_directors` - Director details with filmography
- `dim_actors` - Actor details with career stats
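Because `directors` and `actors` are nested `ARRAY<STRUCT>` columns, queries flatten them with `UNNEST`. A sketch of a typical query from Python (the project and dataset path is a placeholder):

```python
# Example query over the nested directors array via UNNEST.
# The `my-project.imdb_silver.movies` path is a placeholder.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT d.name, COUNT(*) AS n_movies, AVG(m.average_rating) AS avg_rating
FROM `my-project.imdb_silver.movies` AS m, UNNEST(m.directors) AS d
GROUP BY d.name
ORDER BY n_movies DESC
LIMIT 10
"""
for row in client.query(sql).result():
    print(row["name"], row["n_movies"], round(row["avg_rating"], 2))
```

Nesting people directly on the fact table keeps common questions such as "top directors by average rating" to a single table scan instead of multi-table joins.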
## Development

Pre-commit hooks automatically run:
- Code formatting (Ruff)
- Import sorting (Ruff)
- Style checks (Ruff)
```bash
# Run linting
uv run ruff check .

# Run formatting
uv run ruff format .
```

## License

MIT License - see LICENSE.
⭐ If you find this project useful, please star it!