- General Architecture of Data Lake
- Real-time Monitoring & Scheduling
- MLOps Cycle
- FastAPI-based Microservice
Technology | Purpose |
---|---|
Spark Streaming | Stream processing for data ingestion |
Apache Kafka | Event and data streaming |
Apache Airflow | Workflow orchestration & task scheduling |
MinIO & MongoDB | Data Storage and Catalog |
Trino | Federated SQL queries & seamless integration with BI tools |
Superset | BI tool for business analytics |
Resource | Specification |
---|---|
VPS OS | Ubuntu 24.0.2 |
CPU | 4-core Intel Xeon |
GPU | ❌ No GPU |
RAM | 10GB |
Storage | 200GB SSD |
Networking | 1Gbps Bandwidth |
- Built a Data Lake following the Medallion architecture, with a catalog layer and a storage layer for storing images and their metadata
- Streamed file-upload events and captured images from the mobile app (sent via API) into the raw storage area, which makes the data available for AI training more varied
- Integrated NLP and image processing into the ETL pipeline to periodically normalize images and metadata
Metadata Layer
Monitoring Dashboard for Data Lake
Schedule tasks on Airflow
More details are in this repo.
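
To make the layered layout above more concrete, here is a minimal sketch of how an image could move from the raw area to a normalized layer while its catalog entry is updated. It is an illustration only, not the project's code: the layer names `bronze`/`silver`/`gold` are the conventional Medallion names, while `object_key`, `CatalogEntry`, `promote`, the `mobile_app` source label and the whitespace/lowercase caption cleanup (a stand-in for the real NLP step) are all assumptions.

```python
from dataclasses import dataclass, replace
from datetime import date

# Conventional Medallion layer names; the project may name its areas differently.
LAYERS = ("bronze", "silver", "gold")


def object_key(layer: str, source: str, filename: str, day: date) -> str:
    """Build a date-partitioned object key for one layer of the lake."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    return f"{layer}/{source}/{day:%Y/%m/%d}/{filename}"


@dataclass(frozen=True)
class CatalogEntry:
    """Hypothetical catalog record linking an image to its stored object."""
    image_id: str
    layer: str
    object_key: str
    caption: str


def promote(entry: CatalogEntry, new_layer: str) -> CatalogEntry:
    """Return the catalog entry for the same image one step further along.

    The caption cleanup is a placeholder for the real normalization step.
    """
    if new_layer not in LAYERS:
        raise ValueError(f"unknown layer: {new_layer!r}")
    # Swap only the leading layer prefix, keep the rest of the key unchanged.
    _, rest = entry.object_key.split("/", 1)
    clean_caption = " ".join(entry.caption.split()).lower()
    return replace(
        entry,
        layer=new_layer,
        object_key=f"{new_layer}/{rest}",
        caption=clean_caption,
    )


raw = CatalogEntry(
    image_id="img-1",
    layer="bronze",
    object_key=object_key("bronze", "mobile_app", "cat.jpg", date(2024, 1, 15)),
    caption="  A  Cat ",
)
normalized = promote(raw, "silver")
```

Keeping the layer as the first path segment means a single bucket can hold every stage, and the catalog entry alone is enough to locate the object for a given image.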
- Query Data Service: Developed APIs to retrieve the metadata and images that were normalized in the Data Lake, for the automated incremental learning process.
- Model Deploying Service: Developed APIs to deploy the model on the VPS and to receive the stream of captured images and metadata from the mobile app into the data lake for incremental learning.
- Used Nginx to route and load-balance requests among the API service containers, reducing latency and avoiding overload on any single service.
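
The incremental-learning retrieval performed by the Query Data Service can be sketched as a watermark filter: return only normalized records that arrived after the previous training run, then advance the watermark. This is a plain-Python illustration under assumed names, not the service's actual API; the record fields `normalized` and `ingested_at` and the functions `new_since` and `next_watermark` are hypothetical.

```python
from datetime import datetime, timezone


def new_since(records, last_trained_at):
    """Return normalized records ingested after the last training run, oldest first."""
    fresh = [
        r for r in records
        if r["normalized"] and r["ingested_at"] > last_trained_at
    ]
    return sorted(fresh, key=lambda r: r["ingested_at"])


def next_watermark(batch, last_trained_at):
    """Advance the watermark to the newest record in the batch, if there is one."""
    return max((r["ingested_at"] for r in batch), default=last_trained_at)


def ts(day, hour=0):
    """Helper for readable UTC timestamps in the example data."""
    return datetime(2024, 1, day, hour, tzinfo=timezone.utc)


records = [
    {"image_id": "a", "normalized": True, "ingested_at": ts(1)},
    {"image_id": "b", "normalized": True, "ingested_at": ts(3)},
    {"image_id": "c", "normalized": False, "ingested_at": ts(4)},  # still raw
    {"image_id": "d", "normalized": True, "ingested_at": ts(2, 12)},
]

last_run = ts(2)
batch = new_since(records, last_run)
watermark = next_watermark(batch, last_run)
```

With the sample data, `batch` holds images `d` and `b` in that order, and an empty batch leaves the watermark unchanged, so a training job that finds nothing new can simply run again later.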