Skip to content

datamindedacademy/skill-boost-data-modeling

Repository files navigation

Skill Boost: Data Modeling Techniques

Open in GitHub Codespaces

Overview

Welcome to the Data Modeling Skill Boost training session! This hands-on workshop explores three distinct data modeling techniques using the same dataset (TPC-H). You'll experience firsthand how different modeling approaches affect query complexity, maintainability, and analytical ease.

What You'll Learn

This training uses the TPC-H benchmark dataset (a standard dataset for database benchmarking) to implement three different modeling techniques:

  1. Dimensional Modeling (Kimball) - Star schema with facts and dimensions
  2. Data Vault 2.0 - Flexible, auditable enterprise data warehouse
  3. One Big Table (OBT) - Fully denormalized approach

Each technique is implemented as a separate dbt project using DuckDB. You'll write queries against each model to answer business questions and compare the experience across approaches.

Getting Started

Option 1: GitHub Codespaces (Recommended)

Click the "Open in GitHub Codespaces" badge above. Everything is pre-configured and will be ready in 3-5 minutes!

The setup automatically:

  • Installs uv and all dependencies
  • Creates three DuckDB databases with TPC-H data
  • Configures the Python environment

Option 2: Local Setup

Prerequisites:

  • Python 3.11 or higher
  • Git

Installation:

# Clone the repository
git clone https://github.com/datamindedacademy/skill-boost-data-modeling
cd skill-boost-data-modeling

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies
uv sync

# Activate the virtual environment
source .venv/bin/activate

# Follow the setup instructions in each project directory to initialize data and run dbt

Project Structure

.
├── dimensional-modeling/    # Star schema implementation
├── data-vault-20/          # Data Vault 2.0 implementation
├── one-big-table/          # Denormalized OBT implementation
├── questions/              # Business questions to answer
└── pyproject.toml          # Python dependencies

The Three Modeling Techniques

Each modeling approach directory contains detailed information about the philosophy, benefits, drawbacks, and references. Explore each to understand the trade-offs:

  • dimensional-modeling/ - Star schema with facts and dimensions (Kimball approach)
  • data-vault-20/ - "Flexible", auditable enterprise warehouse with hubs, links, and satellites
  • one-big-table/ - Fully denormalized single table approach

Training Exercises

Navigate to the questions/ directory to find business questions to answer using each modeling technique. Compare:

  • Query Complexity: How many joins? How readable is the SQL?
  • Performance: How fast do queries execute?
  • Flexibility: How easy is it to answer new questions?
  • Maintainability: How would schema changes impact the model?

Visualizing with Apache Superset

Explore your data models visually using Apache Superset with automatic DuckDB connections to all three modeling approaches.

Quick Start

# One-time setup (first time only)
./superset-cli setup

# Start Superset
./superset-cli start

# Wait 1-2 minutes, then access at http://localhost:8088
# Login: admin / admin

# Add DuckDB database connections (if not already added)
./superset-cli connect

# Stop Superset when done
./superset-cli stop

Once connected, you can query and visualize data from all three models

Tip: Run ./superset-cli help to see all available commands.

Additional Resources

Contributing

Found an issue or have suggestions? Open an issue or submit a pull request!

License

This training material is provided for educational purposes.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages