A Big Data analytics system that provides real-time traffic congestion predictions using machine learning, streaming data processing, and interactive visualization. The system ingests live traffic and weather data, processes it through a Lambda architecture, and delivers one-minute-ahead congestion forecasts via a web dashboard.
This project demonstrates a complete end-to-end big data pipeline for real-time traffic analysis, implementing:
- Real-time data ingestion from multiple streaming sources (TomTom Traffic API, Open-Meteo Weather API)
- Lambda architecture with batch and streaming processing layers
- Machine learning model (LSTM) for time-series traffic prediction
- Interactive web dashboard for visualization and real-time insights
The system predicts traffic congestion levels one minute ahead by analyzing a moving window of the last 30 minutes of traffic and weather data, enabling users to make informed routing decisions.
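The moving-window setup described above can be illustrated with a minimal sketch. It assumes one feature row per minute, which the text implies but does not state; the function name `make_windows` and the sample readings are hypothetical and are not taken from the project code.

```python
from typing import List, Sequence, Tuple

WINDOW_MINUTES = 30   # input window length, as stated in the text above
HORIZON_MINUTES = 1   # forecast horizon, as stated in the text above


def make_windows(
    series: Sequence[Sequence[float]],
    window: int = WINDOW_MINUTES,
    horizon: int = HORIZON_MINUTES,
    target_index: int = 0,
) -> Tuple[List[List[List[float]]], List[float]]:
    """Slice a per-minute feature series into (input window, target) pairs.

    Each row of `series` holds the features for one minute (assumption).
    The target is column `target_index`, taken `horizon` rows after the
    last row of the window.
    """
    inputs, targets = [], []
    last_start = len(series) - window - horizon
    for start in range(last_start + 1):
        end = start + window
        inputs.append([list(row) for row in series[start:end]])
        targets.append(series[end + horizon - 1][target_index])
    return inputs, targets


# 35 minutes of hypothetical (congestion, temperature) readings.
readings = [(minute / 100.0, 15.0) for minute in range(35)]
X, y = make_windows(readings)
print(len(X), len(X[0]), len(X[0][0]))  # 5 30 2
print(y[0])  # 0.3
```

Each element of `X` has the shape (30 time steps, number of features), which is the layout a sequence model such as an LSTM typically expects for one sample.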
The system follows a Lambda architecture pattern:
- Ingestion Layer: Apache NiFi for data collection and preprocessing
- Streaming Layer: Apache Kafka + Apache Spark Streaming for real-time processing
- Batch Layer: Apache Spark + Apache Hive for historical data storage and batch analytics
- Serving Layer: Apache HBase for low-latency query serving
- Presentation Layer: Streamlit web application with interactive map visualization
The technology stack comprises:
- Data Processing: Apache Spark (PySpark), Apache NiFi, Apache Kafka
- Storage: Apache Hive (data warehouse), Apache HBase (real-time serving)
- Machine Learning: TensorFlow/Keras (LSTM model), scikit-learn
- Data Sources: TomTom Traffic API, OpenStreetMap API, Open-Meteo Weather API
- Visualization: Streamlit, GeoPandas, Folium
- Languages: Python 3.9, SQL
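In a Lambda architecture, a query is typically answered by combining a batch view with a fresher real-time view. The sketch below illustrates that idea with plain dictionaries standing in for the serving store. The segment ids, values, and the `query_congestion` helper are hypothetical, and the project's actual HBase access code may look quite different.

```python
from typing import Dict, Optional

# Hypothetical views keyed by road segment id; values are congestion levels.
# The batch view stands in for results computed from historical data,
# the speed view for results produced by the streaming layer.
batch_view: Dict[str, float] = {"seg-1": 0.20, "seg-2": 0.55}
speed_view: Dict[str, float] = {"seg-2": 0.80, "seg-3": 0.10}


def query_congestion(
    segment_id: str,
    batch: Dict[str, float],
    speed: Dict[str, float],
) -> Optional[float]:
    """Return the freshest known value: speed view first, then batch view."""
    if segment_id in speed:
        return speed[segment_id]
    return batch.get(segment_id)


print(query_congestion("seg-1", batch_view, speed_view))  # 0.2 (batch only)
print(query_congestion("seg-2", batch_view, speed_view))  # 0.8 (speed wins)
print(query_congestion("seg-9", batch_view, speed_view))  # None (unknown)
```

Letting the speed view take precedence keeps answers current, while the batch view fills in segments that have no recent streaming result.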
TrafficCongestionPrediction/
├── Data/ # Historical data for model training
├── DataPreprocessing/ # Data ingestion and transformation modules
│ ├── TomTom/ # Traffic data preprocessing (OSM matching algorithm)
│ └── OpenMeteo/ # Weather data preprocessing (Gaussian noise simulation)
├── Model/ # Machine learning module
│ ├── train.py # Model training script
│ ├── predict.py # Prediction inference
│ ├── eda/ # Exploratory data analysis
│ └── plots/ # Training and evaluation visualizations
├── Spark/ # Batch and streaming processing
│ ├── Batch/ # Batch processing jobs
│ ├── stream_prediction_app.py # Real-time prediction application
│ └── Common/ # Shared schemas and utilities
├── Hive/ # Data warehouse table definitions
├── kafka/ # Kafka topic management scripts
├── Nifi/ # Apache NiFi flow configurations
└── PresentationLayer/ # Streamlit web application
    └── src/ # Application source code
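The tree above mentions Gaussian noise simulation for the weather data without describing it. One plausible reading, shown here purely as an assumption, is that zero-mean Gaussian noise is added to weather readings to simulate variation between measurements. The helper `add_gaussian_noise`, the sample temperatures, and the chosen `sigma` are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def add_gaussian_noise(
    values: Sequence[float],
    sigma: float,
    seed: Optional[int] = None,
) -> List[float]:
    """Return a copy of `values` with zero-mean Gaussian noise added.

    `sigma` is the standard deviation of the noise, in the same unit as
    the values. Passing a `seed` makes the result reproducible.
    """
    rng = random.Random(seed)
    # random.gauss(mu, sigma) draws from a normal distribution centred on mu.
    return [rng.gauss(value, sigma) for value in values]


# Hypothetical temperature readings in degrees Celsius.
temperatures = [14.2, 14.5, 15.1, 15.0]
noisy = add_gaussian_noise(temperatures, sigma=0.3, seed=42)
print(noisy)
```

Using a dedicated `random.Random` instance keeps the simulation reproducible without touching the global random state.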
Key features:
- Real-time Traffic Prediction: LSTM-based model predicting congestion 1 minute ahead
- Interactive Map Visualization: Color-coded traffic levels with zoom and pan capabilities
- Top 5 Rankings: Most and least congested streets in real time
- Geographic Matching Algorithm: Custom algorithm to match TomTom traffic segments with OpenStreetMap road geometries
- Multi-source Data Integration: Combines traffic, weather, and road infrastructure data
Key achievements:
- Designed and implemented a complete Lambda architecture for real-time big data analytics
- Developed a custom geometric matching algorithm to align dynamic traffic segments with static road networks
- Built an LSTM time-series model for short-term (one-minute-ahead) traffic prediction
- Integrated multiple streaming data sources with different update frequencies
- Created an end-to-end pipeline from raw API data to interactive visualizations
- Handled real-world challenges: API rate limits, data schema mismatches, temporal alignment
- Demonstrated production-ready practices: modular architecture, error handling, data validation
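The geometric matching mentioned above is not specified in this README. As an illustration only, the sketch below matches a traffic segment to the road whose polyline is closest on average. It is a generic nearest-polyline approach, not the project's custom algorithm. It assumes planar (projected) coordinates and polylines with at least two points; all names and sample geometries are hypothetical.

```python
import math
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]


def point_to_segment_distance(p: Point, a: Point, b: Point) -> float:
    """Distance from point p to the line segment a-b in planar coordinates."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto a-b, clamped so it stays on the segment.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def point_to_polyline_distance(p: Point, line: List[Point]) -> float:
    """Smallest distance from p to any segment of the polyline."""
    return min(point_to_segment_distance(p, a, b) for a, b in zip(line, line[1:]))


def match_segment(
    traffic: List[Point],
    roads: Dict[str, List[Point]],
    max_mean_distance: float,
) -> Optional[str]:
    """Return the id of the road closest on average, or None if too far."""
    best_id, best_score = None, float("inf")
    for road_id, geometry in roads.items():
        score = sum(point_to_polyline_distance(p, geometry) for p in traffic) / len(traffic)
        if score < best_score:
            best_id, best_score = road_id, score
    return best_id if best_score <= max_mean_distance else None


# Hypothetical road geometries and one traffic segment, in metres.
roads = {"main": [(0.0, 0.0), (10.0, 0.0)], "side": [(0.0, 5.0), (10.0, 5.0)]}
traffic_segment = [(1.0, 0.5), (5.0, 0.4), (9.0, 0.6)]
print(match_segment(traffic_segment, roads, max_mean_distance=2.0))  # main
```

The distance threshold lets a segment remain unmatched when no road is plausibly close, which avoids forcing a wrong match onto sparse road data.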
Prerequisites:
- Python 3.9.4
- Apache Spark, Kafka, NiFi, Hive, HBase (for full deployment)
- Docker (for local development with PresentationLayer)
Each module contains detailed setup instructions:
- Model: See Model/README.md for training and EDA
- Data Preprocessing: See DataPreprocessing/README.md for ingestion setup
- Presentation Layer: See PresentationLayer/README.md for web app deployment
- Report: See report.pdf for the final project report
This project was developed as part of a Big Data Analytics course at Warsaw University of Technology, demonstrating practical application of distributed systems, real-time analytics, and machine learning in a production-like environment.
See the LICENSE file for details.