
Modern Data Lakehouse Stack: Iceberg + Hive + Trino + Spark + Shiny

This project demonstrates a production-ready modern data lakehouse architecture in which multiple query and processing engines share the same Apache Iceberg tables. It is built with Docker Compose for easy deployment and includes a Shiny for Python web interface for end users.

🏗️ Software Stack & Architecture

📊 Query & Processing Engines

  • 🔍 Trino (435) - High-performance distributed SQL query engine

    • Purpose: Interactive analytics, ad-hoc queries, cross-engine data access
    • Use Case: Business intelligence, data exploration, real-time analytics
  • ⚡ Apache Spark (3.5.0) - Unified analytics engine for large-scale data processing

    • Purpose: ETL processing, machine learning, batch/stream processing, Iceberg branch management
    • Use Case: Data transformations, feature engineering, model training, data pipeline orchestration
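
As a concrete illustration of engine access, here is a minimal Python sketch that runs a query through Trino using the official trino client package (pip install trino). The host, port, user, and table names are assumptions based on this stack's defaults, not part of the shipped code:

import trino  # official Trino Python client

# Connect to the Trino coordinator (port 8081 as mapped by this stack;
# the user name is arbitrary in an unauthenticated demo setup)
conn = trino.dbapi.connect(
    host="localhost",
    port=8081,
    user="demo",
    catalog="iceberg",
    schema="demo",
)

cur = conn.cursor()
cur.execute("SELECT country, COUNT(*) AS n FROM customers GROUP BY country")
for row in cur.fetchall():
    print(row)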

🗄️ Data Storage & Metadata

  • 🧊 Apache Iceberg (1.4.2) - Modern table format with ACID transactions

    • Purpose: Schema evolution, time travel, partition management, cross-engine compatibility
    • Use Case: Data versioning, snapshot isolation, efficient data updates/deletes
  • 🏛️ Apache Hive Metastore (4.0.0) - Centralized metadata repository

    • Purpose: Shared schema registry, table definitions, partition information across engines
    • Use Case: Cross-engine table discovery, metadata consistency, governance
  • 🐘 PostgreSQL (15.4) - Relational database backend

    • Purpose: Persistent storage for Hive Metastore metadata
    • Use Case: ACID metadata operations, concurrent access, backup/recovery
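
Because every engine registers its tables in the same metastore, the metadata can also be inspected directly in PostgreSQL. Below is a hedged sketch using psycopg2; the database name "metastore" is an assumption, so check docker-compose.yml for the actual value:

import psycopg2  # PostgreSQL driver

# Peek at the Hive Metastore's backing tables (user/password: hive per
# this stack's defaults; dbname "metastore" is an assumption)
conn = psycopg2.connect(host="localhost", port=5432,
                        user="hive", password="hive", dbname="metastore")
with conn.cursor() as cur:
    # HMS keeps its catalog in quoted upper-case tables such as "DBS" and "TBLS"
    cur.execute('SELECT d."NAME", t."TBL_NAME" '
                'FROM "TBLS" t JOIN "DBS" d ON t."DB_ID" = d."DB_ID"')
    for db, table in cur.fetchall():
        print(f"{db}.{table}")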

🌐 User Interface & Storage

  • ✨ Shiny for Python - Interactive web application framework

    • Purpose: User-friendly data exploration interface, query execution, visualization
    • Use Case: Self-service analytics, business user data access, dashboard creation
  • 📁 Local File Storage - Parquet format data warehouse

    • Purpose: Persistent data storage with columnar format optimization
    • Use Case: Analytics workloads, compression efficiency, schema evolution support
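
To give a feel for the frontend, here is a stripped-down sketch of a Shiny for Python query page; it is not the app.py shipped in shiny-app/, and the in-network host name trino with port 8080 follows the container-name service discovery described later:

import pandas as pd
import trino
from shiny import App, render, ui

app_ui = ui.page_fluid(
    ui.input_text("sql", "SQL query", value="SHOW TABLES"),
    ui.output_table("results"),
)

def server(input, output, session):
    @render.table
    def results():
        # Re-executes whenever the SQL input changes
        conn = trino.dbapi.connect(host="trino", port=8080, user="shiny",
                                   catalog="iceberg", schema="demo")
        cur = conn.cursor()
        cur.execute(input.sql())
        return pd.DataFrame(cur.fetchall(),
                            columns=[c[0] for c in cur.description])

app = App(app_ui, server)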

🚀 Building the Stack

📋 Prerequisites

# Required software
- Docker & Docker Compose
- Make (for simplified commands)
- 8GB+ RAM recommended
- 10GB+ disk space for warehouse data

🛠️ Quick Build & Start

# Clone or navigate to the project directory
cd /path/to/trino-iceberg-stack

# Build and start the entire stack
make start

# This command:
# 1. Pulls all Docker images (Trino, Spark, Hive, PostgreSQL, Shiny)
# 2. Creates shared networks and volumes
# 3. Initializes PostgreSQL with Hive schema
# 4. Starts Hive Metastore service
# 5. Launches Trino with Iceberg connector
# 6. Starts Spark cluster (master + worker)
# 7. Deploys Shiny web application
# 8. Creates persistent warehouse directory

🔧 Step-by-Step Manual Build

# 1. Start infrastructure services (PostgreSQL + Hive)
docker-compose up -d postgres hive-metastore

# 2. Wait for Hive Metastore initialization
sleep 30

# 3. Start query engines
docker-compose up -d trino spark-master spark-worker

# 4. Launch web frontend
docker-compose up -d shiny-app

# 5. Verify all services
make status

🏗️ Architecture Flow

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Shiny App     │    │      Trino       │    │     Spark       │
│  (Port 8000)    │────│   (Port 8081)    │────│  (Port 8082)    │
│  Web Interface  │    │  Query Engine    │    │Processing Engine│
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
                    ┌─────────────────────┐
                    │   Hive Metastore    │
                    │    (Port 9083)      │
                    │  Metadata Service   │
                    └─────────────────────┘
                                 │
                    ┌─────────────────────┐
                    │    PostgreSQL       │
                    │    (Port 5432)      │
                    │  Metadata Storage   │
                    └─────────────────────┘
                                 │
                    ┌─────────────────────┐
                    │   Iceberg Tables    │
                    │     ./warehouse     │
                    │   Parquet Files     │
                    └─────────────────────┘

🚀 Quick Start Guide

🎯 Makefile Commands (Recommended)

# Show all available commands with descriptions
make help

# 🏗️ Build & Start the Complete Stack
make start
# - Pulls Docker images for all services
# - Creates networks and volumes
# - Initializes PostgreSQL + Hive Metastore
# - Starts Trino, Spark, and Shiny services
# - Sets up Iceberg warehouse directory

# ✅ Verify Stack Health
make verify
# - Tests database connectivity
# - Validates service endpoints
# - Checks Iceberg table access
# - Confirms cross-engine functionality

# 🎬 Initialize Demo Data
make init-data
# - Creates sample Iceberg tables
# - Inserts customer and product data
# - Demonstrates schema evolution
# - Sets up cross-engine branching examples

# 📚 Show Interactive Demo Guide
make demo

🌐 Access Points

| Service | URL | Purpose |
|---------|-----|---------|
| 🌟 Shiny Frontend | http://localhost:8000 | Main user interface: query execution, data visualization |
| 📊 Trino Web UI | http://localhost:8081 | Query monitoring, cluster status, performance metrics |
| ⚡ Spark Master UI | http://localhost:8082 | Spark cluster management, job monitoring |
| 🐘 PostgreSQL | localhost:5432 | Direct database access (user: hive, password: hive) |

🔍 Service Health Check

# Quick status overview
make status

# Individual service logs
make logs-trino    # Trino query engine logs
make logs-spark    # Spark processing logs  
make logs-shiny    # Shiny web app logs
make logs-hive     # Hive Metastore logs

⚡ Common Commands

make start        # Start everything
make stop         # Stop all containers
make shiny        # Restart only Shiny app
make status       # Check container status
make logs-shiny   # View Shiny app logs
make clean        # Clean up everything

# Demo & Testing
make init-data    # Initialize demo data with cross-engine branching
make test-all     # Run comprehensive Iceberg feature tests
make test-branching     # Test cross-engine branching specifically
make test-time-travel   # Test time travel functionality
make test-metadata      # Test metadata table access

📊 What the Demo Does

Core Iceberg Features

  1. Creates Iceberg tables with customer data using Trino
  2. Inserts sample records (5 customers from different countries)
  3. Demonstrates schema evolution (adding customer_tier column)
  4. Shows time travel queries across historical snapshots
  5. Displays metadata tables (snapshots, files, history, refs)

🌟 Cross-Engine Branching (Advanced Feature)

  1. Spark creates Iceberg branches for development/testing
  2. Trino queries Spark-created branches seamlessly
  3. Demonstrates cross-engine compatibility via shared Hive Metastore
  4. Shows branch metadata and data isolation between branches

Analytics & Visualization

  1. Runs analytical queries showing:
    • Total customers and revenue by country
    • Customer tier segmentation
    • Revenue evolution across time
  2. Generates Parquet files in the local warehouse directory
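
Those Parquet files can be inspected directly from Python, for example with pyarrow. The glob pattern below is illustrative only, since Iceberg manages the actual directory layout under ./warehouse:

import glob
import pyarrow.parquet as pq

# Print basic stats for every Parquet data file Iceberg has written
for path in glob.glob("warehouse/**/*.parquet", recursive=True):
    meta = pq.read_metadata(path)
    print(f"{path}: {meta.num_rows} rows, {meta.num_columns} columns")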

🌐 Shiny Frontend Features

The included Shiny for Python web application provides an intuitive interface for end users:

Query Interface

  • Pre-built queries: Show catalogs, schemas, tables, and sample data
  • Custom SQL: Execute arbitrary Trino SQL queries
  • Real-time results: Immediate feedback on query execution

Data Visualization

  • Automatic charts: Scatter plots, histograms, and bar charts
  • Interactive plots: Built with Plotly for rich interactivity
  • Smart detection: Chooses appropriate visualization based on data types
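
The "smart detection" idea reduces to a dtype inspection; the sketch below illustrates the approach and is not the app's actual logic:

import pandas as pd
import plotly.express as px

def auto_chart(df: pd.DataFrame):
    # Two numeric columns -> scatter; one numeric plus a label -> bar;
    # otherwise fall back to a histogram of the first column
    numeric = df.select_dtypes("number").columns
    if len(numeric) >= 2:
        return px.scatter(df, x=numeric[0], y=numeric[1])
    if len(numeric) == 1 and len(df.columns) >= 2:
        label = [c for c in df.columns if c not in numeric][0]
        return px.bar(df, x=label, y=numeric[0])
    return px.histogram(df, x=df.columns[0])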

Monitoring

  • Connection status: Real-time Trino connection monitoring
  • Query feedback: Clear success/error messages
  • Performance info: Row and column counts
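
Connection monitoring can be as simple as polling Trino's REST info endpoint (the same /v1/info endpoint listed under Monitoring & Observability below); a minimal sketch:

import requests

def trino_status(base_url: str = "http://localhost:8081") -> str:
    # /v1/info returns cluster metadata, including the server version
    try:
        info = requests.get(f"{base_url}/v1/info", timeout=2).json()
        version = info.get("nodeVersion", {}).get("version", "?")
        return f"connected (Trino {version})"
    except requests.RequestException as exc:
        return f"disconnected: {exc}"

print(trino_status())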

Example Workflows

  1. Explore data structure: Start with "Show Catalogs" → "Show Schemas" → "Show Tables"
  2. Sample data: Use "Sample Data" to preview table contents
  3. Custom analysis: Switch to "Custom Query" for specific business questions
  4. Visualize results: Automatic charts help identify patterns

🔍 Manual Exploration

Connect to Trino CLI

docker exec -it trino-cli /usr/bin/trino --server trino:8080 --catalog iceberg --schema demo

Example Queries

-- Show all tables
SHOW TABLES;

-- Query customer data
SELECT * FROM customers;

-- Show table snapshots (Iceberg feature)
SELECT * FROM "customers$snapshots";

-- Show table files (see Parquet files)
SELECT file_path, record_count, file_size_in_bytes FROM "customers$files";

📁 File Structure

./
├── docker-compose.yml          # Main orchestration with Spark + Trino
├── hive-site.xml              # Shared Hive Metastore configuration
├── Makefile                   # Comprehensive commands and testing
├── trino/
│   ├── etc/                   # Trino server configuration
│   │   ├── config.properties
│   │   ├── node.properties
│   │   └── log.properties
│   └── catalog/               # Catalog configurations
│       └── iceberg.properties # Iceberg connector config
├── jars/                      # Persistent Iceberg JAR storage
│   └── iceberg-spark-runtime-3.5_2.12-1.4.2.jar
├── warehouse/                 # Data files (Parquet) with cross-engine access
├── scripts/                   # Demo and comprehensive testing scripts
│   ├── init-demo-data.sh     # Cross-engine demo initialization
│   ├── test-branching.sh     # Branching functionality tests  
│   ├── test-time-travel.sh   # Time travel feature tests
│   └── test-*.sh             # Additional feature test scripts
├── shiny-app/                 # Shiny for Python web interface
│   ├── app.py                # Main Shiny application
│   └── shared/               # Shared query modules
└── archive/                   # Legacy demo scripts


🛠️ Advanced Usage

Time Travel Queries (Iceberg Feature)

-- Query data as of a specific timestamp (Trino requires a time zone)
SELECT * FROM customers FOR TIMESTAMP AS OF TIMESTAMP '2023-12-01 10:00:00 UTC';

-- Query data from a specific snapshot
SELECT * FROM customers FOR VERSION AS OF 123456789;

-- View all available snapshots
SELECT * FROM "customers$snapshots" ORDER BY committed_at DESC;

Cross-Engine Branching (Advanced Iceberg Feature)

-- Trino: Query main branch
SELECT COUNT(*) FROM iceberg.branching_demo.products;

-- Trino: Query Spark-created dev branch  
SELECT COUNT(*) FROM iceberg.branching_demo.products FOR VERSION AS OF 'dev';

-- View available branches
SELECT name, type FROM "products$refs" WHERE type = 'BRANCH';

Note: Branch creation requires Spark. Use the demo scripts or:

# Spark SQL: Create a branch (from Spark container)
docker exec spark-iceberg /opt/spark/bin/spark-sql \
  --jars /opt/spark/jars/iceberg-spark-runtime-3.5_2.12-1.4.2.jar \
  -e "ALTER TABLE iceberg.demo.products CREATE BRANCH dev;"

Schema Evolution (Iceberg Feature)

-- Add a new column
ALTER TABLE customers ADD COLUMN phone VARCHAR(20);

-- Update data
UPDATE customers SET phone = '+1-555-0123' WHERE id = 1;

Partitioning

-- Create a partitioned table
CREATE TABLE sales (
    id BIGINT,
    customer_id BIGINT,
    amount DECIMAL(10,2),
    sale_date DATE
) WITH (
    format = 'PARQUET',
    partitioning = ARRAY['month(sale_date)']
);

🏭 Production Deployment Guide

📦 Container Images & Versions

| Component | Version | Image | Purpose |
|-----------|---------|-------|---------|
| Trino | 435 | trinodb/trino:435 | Latest stable with Iceberg 1.4.2 support |
| Spark | 3.5.0 | apache/spark:3.5.0 | Current stable with Scala 2.12 |
| Hive Metastore | 4.0.0 | apache/hive:4.0.0 | Latest stable metastore release |
| PostgreSQL | 15.4 | postgres:15.4 | LTS version for metadata persistence |
| Python/Shiny | 3.11 | python:3.11-slim | Web interface runtime |

🔧 Cross-Engine Architecture Features

# Key architectural decisions for production readiness:

Persistent JAR Management:
  - Iceberg runtime JARs survive container rebuilds
  - Automatic version alignment across engines
  - Shared JAR volume: ./jars:/opt/shared/jars

Shared Metadata Layer:
  - Single Hive Metastore for all engines
  - ACID metadata operations via PostgreSQL
  - Cross-engine table discovery and governance

Warehouse Consistency:
  - Unified data path: ./warehouse:/data/warehouse
  - Parquet format optimization
  - Iceberg manifest management

Network Architecture:
  - Internal Docker network for service communication
  - External port exposure for user interfaces
  - Service discovery via container names

🚀 Scaling for Production

Infrastructure Scaling

# Scale Spark workers
docker-compose up -d --scale spark-worker=3

# Add Trino worker nodes (requires cluster configuration)
# Update trino/etc/config.properties:
# coordinator=false
# discovery.uri=http://trino-coordinator:8080

# Scale Shiny app instances (with load balancer)
docker-compose up -d --scale shiny-app=2

Performance Optimization

# trino/etc/config.properties
query.max-memory=50GB
query.max-memory-per-node=8GB
query.max-total-memory-per-node=10GB

# Spark configuration (spark-defaults.conf)
spark.sql.adaptive.enabled=true
spark.sql.adaptive.coalescePartitions.enabled=true
spark.serializer=org.apache.spark.serializer.KryoSerializer

🔐 Production Security Checklist

  • Authentication: Configure LDAP/OAuth for Trino and Spark
  • Authorization: Implement role-based access control (RBAC)
  • Network Security: Use TLS/SSL for all inter-service communication
  • Data Encryption: Enable encryption at rest and in transit
  • Monitoring: Deploy Prometheus + Grafana for observability
  • Backup: Automated PostgreSQL backups and Iceberg snapshots
  • Secrets Management: Use Docker secrets or external vault

☁️ Cloud Deployment Options

AWS Deployment

# Replace local storage with S3
# Update iceberg.properties:
iceberg.catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
s3.endpoint=https://s3.amazonaws.com
s3.path-style-access=false

# Use RDS for PostgreSQL
# Use ECS/EKS for container orchestration
# Use ALB for load balancing

Kubernetes Deployment

# Example helm values for production K8s
trino:
  replicas: 3
  resources:
    requests:
      memory: "8Gi"
      cpu: "2"
    limits:
      memory: "16Gi" 
      cpu: "4"

spark:
  master:
    replicas: 1
  worker:
    replicas: 5
    resources:
      requests:
        memory: "4Gi" 
        cpu: "2"

📊 Monitoring & Observability

# Add monitoring stack to docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
  
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin

# Trino metrics endpoint: http://localhost:8081/v1/info
# Spark metrics endpoint: http://localhost:8082/metrics/json

🛑 Cleanup

# Clean stop with Makefile (recommended)
make clean

# Manual cleanup
docker-compose down -v

# Remove warehouse data (optional - skip this step to keep demo data for restart)
rm -rf warehouse/

# Note: JAR files in ./jars/ are preserved for persistence

🧪 Testing & Validation

The project includes comprehensive test suites:

# Test everything
make test-all

# Individual feature tests  
make test-branching     # Cross-engine branching
make test-time-travel   # Historical queries
make test-metadata      # Iceberg metadata tables
make test-query         # Basic connectivity

Expected Results:

  • ✅ Cross-engine branching: Spark creates branches, Trino queries them
  • ✅ Time travel: Query historical snapshots and timestamps
  • ✅ Schema evolution: Add columns and query across versions
  • ✅ Metadata access: Explore internal Iceberg metadata
  • ✅ JAR persistence: Automatic setup survives rebuilds
