Real-Time NLP Data Pipeline using Docker, FastAPI, Kafka, Spark, MongoDB, Streamlit, and Poetry
The News Stream Recommender continuously ingests breaking news from live sources (NewsAPI), assigns topics in real time using OpenAI, stores the results in MongoDB, and serves them through a FastAPI backend to a Streamlit dashboard.
- 🌍 Real-Time Ingestion – fetches live headlines via Kafka producers
- 🧩 Streaming NLP Pipeline – maps each headline title to a topic using OpenAI
- 💾 Data Persistence – stores clustered topics in MongoDB
- ⚡ RESTful API – built with FastAPI for frontend data access
- 📊 Interactive Dashboard – live topic view via Streamlit
- 🐳 Fully Containerized – built and orchestrated using Docker Compose
- 📦 Poetry-Managed – clean and reproducible Python environments
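
As a rough illustration of the ingestion step, the sketch below flattens one article into the JSON bytes a Kafka producer could send as a message value. The input keys (`title`, `url`, `source.name`, `publishedAt`) follow NewsAPI's article shape; the function name `article_to_message` and the output field names are hypothetical and not taken from this repository. The actual send call is left out because it depends on whichever Kafka client the project uses.

```python
import json
from datetime import datetime, timezone


def article_to_message(article: dict) -> bytes:
    """Flatten one NewsAPI-style article into a UTF-8 JSON payload.

    Hypothetical helper: the output field names are assumptions.
    """
    record = {
        "title": (article.get("title") or "").strip(),
        "url": article.get("url"),
        "source": (article.get("source") or {}).get("name"),
        "published_at": article.get("publishedAt"),
        # Ingestion time in UTC, so later stages can measure pipeline lag.
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record, ensure_ascii=False).encode("utf-8")


# Made-up sample article for illustration only.
sample = {
    "source": {"id": None, "name": "Example News"},
    "title": "  Markets rally after rate decision ",
    "url": "https://example.com/markets-rally",
    "publishedAt": "2024-01-01T12:00:00Z",
}
payload = article_to_message(sample)
print(json.loads(payload)["title"])
```

Keeping the serialization in a pure function makes it easy to unit-test without a running broker.
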
| Layer | Technology |
|---|---|
| Ingestion | Kafka, NewsAPI, RSS Feeds |
| Processing | Apache Spark (OpenAI for deterministic title → topic mapping) |
| Storage | MongoDB |
| Backend | FastAPI |
| Frontend | Streamlit |
| Management | Docker Compose, Poetry |
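
The Processing row describes the title → topic mapping as deterministic. One way to get that property, sketched here as an assumption rather than as this project's implementation, is to cache the result under a hash of the normalized title so that a repeated headline always returns the topic it was first given. `TopicMapper`, `normalize_title`, and `fake_classifier` are hypothetical names; `fake_classifier` stands in for the real OpenAI call.

```python
import hashlib
from typing import Callable, Dict, List


def normalize_title(title: str) -> str:
    """Lowercase and collapse whitespace so trivial variants share a key."""
    return " ".join(title.lower().split())


class TopicMapper:
    """Caches topics by title hash (hypothetical sketch, in-memory only)."""

    def __init__(self, classify: Callable[[str], str]) -> None:
        self._classify = classify
        self._cache: Dict[str, str] = {}

    def topic_for(self, title: str) -> str:
        normalized = normalize_title(title)
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in self._cache:
            # Only the first occurrence of a title reaches the classifier.
            self._cache[key] = self._classify(normalized)
        return self._cache[key]


calls: List[str] = []


def fake_classifier(title: str) -> str:
    """Stand-in for the model call; records each invocation."""
    calls.append(title)
    return "finance" if "market" in title else "general"


mapper = TopicMapper(fake_classifier)
first = mapper.topic_for("Markets Rally After Rate Decision")
second = mapper.topic_for("  markets rally after   rate decision ")
print(first, second, len(calls))
```

In a real deployment the cache would more plausibly live in MongoDB than in process memory, so that the mapping survives restarts.
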
This project includes a comprehensive test suite with both simple and advanced testing options.
# Run all simple tests
make test
pytest --verbose
# Run with coverage (requires the pytest-cov plugin)
pytest --cov --verbose
# Run specific component
make test-spark
pytest tests/test_spark_streaming.py --verbose

The simple test suite covers:
- ✅ Core Business Logic - Data processing, filtering, validation
- ✅ API Structure - Endpoint responses and data formats
- ✅ Utility Functions - Date formatting, text processing, URL validation
- ✅ Error Handling - Exception handling patterns
- ✅ Data Flow - Basic integration between components
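
For a sense of what a test in the Utility Functions category might look like, here is a minimal pytest-style sketch for URL validation. The helper `is_valid_url` is hypothetical and defined inline so the example is self-contained; the project's real utilities may have different names and rules.

```python
from urllib.parse import urlparse


def is_valid_url(value: str) -> bool:
    """Accept only http(s) URLs that include a host (hypothetical helper)."""
    parsed = urlparse(value)
    return parsed.scheme in {"http", "https"} and bool(parsed.netloc)


def test_accepts_http_and_https():
    assert is_valid_url("https://example.com/news/1")
    assert is_valid_url("http://example.com")


def test_rejects_malformed_values():
    assert not is_valid_url("example.com/news/1")  # missing scheme
    assert not is_valid_url("ftp://example.com/file")  # unsupported scheme
    assert not is_valid_url("https://")  # missing host
```

Saved under `tests/`, a file like this would be collected by the `pytest --verbose` command shown earlier.
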
poetry install