The sales department of a grocery chain wants to build a unit-sales prediction engine (a web service application). Their ML engineer already has a predictive ML model but does not know how to bring it into production. The task here is to use the necessary MLOps tools to design and manage a production workflow.
- MLFlow for experiment tracking
- Prefect 2.0 as workflow orchestration tool
- AWS S3 bucket to store workflow artifacts
- Docker for local container deployment
- AWS ECR to store the built Docker image
- AWS Lambda to build a serverless deployment solution
- Terraform to automate infrastructure
Streamlit
For a quick demo, check out the deployed Streamlit app.
AWS API Gateway triggering Lambda function
Use a REST API client and set the method to POST. Send the request to this link, supplying the following JSON object as the body. You should receive a JSON object back as the output (response body) with the prediction.
{"find": {"date1": "2017-08-17", "store_nbr": 20}}
If you're looking to test the workflow comprehensively, jump to the "how to run" section.
The data comes from the Kaggle competition - Corporación Favorita Grocery Sales Forecasting.
Since the uncompressed dataset is too slow to read, I created Parquet equivalents of all the files on Kaggle (660MB). The Parquet format allows files to be read into memory much faster. You will need a Kaggle account to download the files.
To learn more about Parquet files, Databricks has a nice summary.
Note: Although loading is faster, the training dataset still needs about 6GB of RAM.
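The conversion itself is straightforward; a minimal sketch with pandas (file names are illustrative) looks like this:

```python
import pandas as pd

# Illustrative CSV -> Parquet conversion; requires pyarrow or fastparquet.
df = pd.read_csv("train.csv", parse_dates=["date"])
df.to_parquet("train.parquet", index=False)
```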
The training dataset contains data from 2016 to July 2017. This data is used to predict future sales in 2017.
Using the preprocessed data, we compute new features.
Basic features:
- Categorical features - item, family
- Promotion
Statistical features (a code sketch follows this list):
- time windows
- nearest days: [3, 7, 14]
- key: store x item, item
- target: unit_sales
- method:
- mean, median, max, min, std
- difference of mean value between adjacent time windows (only for equal time windows)
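A minimal sketch of the window statistics, assuming a dataframe with `date`, `store_nbr`, `item_nbr` and `unit_sales` columns (not the repository's exact code):

```python
import pandas as pd

def window_stats(sales: pd.DataFrame, end_date: str, days: int) -> pd.DataFrame:
    """Aggregate unit_sales over the `days` days ending at `end_date`,
    keyed by (store_nbr, item_nbr)."""
    end = pd.Timestamp(end_date)
    start = end - pd.Timedelta(days=days - 1)
    window = sales[(sales["date"] >= start) & (sales["date"] <= end)]
    return (window.groupby(["store_nbr", "item_nbr"])["unit_sales"]
                  .agg(["mean", "median", "max", "min", "std"])
                  .add_prefix(f"sales_{days}d_"))

# Features from the 3-, 7- and 14-day windows before the forecast period:
# feats = pd.concat([window_stats(sales, "2017-08-15", d) for d in (3, 7, 14)], axis=1)
```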
Since we have a set of features and a single target variable, unit_sales, we can treat this as a regression problem.
We use LightGBM as our model algorithm. We first train with the default hyperparameters to establish a baseline, then tune the parameters until we get the best model.
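A baseline along these lines (synthetic placeholder data is used so the sketch runs standalone; the real features come from the step above):

```python
import lightgbm as lgb
import numpy as np

# Placeholder data; in the project these come from the feature-engineering step.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(1000, 10)), rng.normal(size=1000)
X_valid, y_valid = rng.normal(size=(200, 10)), rng.normal(size=200)

# Default-ish hyperparameters as the baseline; tune from here.
params = {"objective": "regression", "metric": "rmse", "learning_rate": 0.1}

train_set = lgb.Dataset(X_train, label=y_train)
valid_set = lgb.Dataset(X_valid, label=y_valid, reference=train_set)

model = lgb.train(
    params,
    train_set,
    num_boost_round=500,
    valid_sets=[valid_set],
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)
print(model.predict(X_valid)[:5])
```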
The feature engineering ideas are heavily borrowed from the 1st place solution of the competition.
Since it is a regression problem, the independent variables serve as the input and the target variable is unit sales. As input we supply the store number and a date between 2017-08-16 and 2017-08-31; an item number is chosen at random. With these three inputs, the unit sales are predicted.
To download the raw data, preprocess and model, the following system is recommended -
Operating System: Ubuntu 20.04 x64 Linux
vCPU: minimum 4 cores (an AWS EC2 t2.xlarge instance would be sufficient)
RAM: minimum 16GB
Storage: minimum 10GB
- Install the prerequisites:
Miniconda3:
cd ~ && git clone https://github.com/dr563105/mlops-project-grocery-sales.git # clone the repo
sudo apt update && sudo apt install make -y
cd mlops-project-grocery-sales
make conda_docker_install # downloads and installs miniconda3, docker, docker-compose
Log out and log back in to the instance. The base conda env should now be activated.
(Optional) Test if docker is working:
docker run --rm hello-world # should print the hello-world output without errors
Note - You can skip the next steps and go directly to deployment. Just make sure to add the environment variables wherever necessary.
- Setup virtual environment:
cd ~/mlops-project-grocery-sales
make pipenv_setup # install Pipenv and other packages in Pipfile.
- Export Kaggle secrets as env variables to download the dataset:
export KAGGLE_USERNAME=datadinosaur
export KAGGLE_KEY=xxxxxxxxxxxxxx
- Create input directory and download Kaggle dataset into it:
cd ~/mlops-project-grocery-sales
make kaggle_dataset_download
- Follow the aws-rds guide to set up an AWS EC2 instance, an S3 bucket and AWS RDS for the MLflow tracking server.
Note: Make sure ports 4200, 5000 and 5432 are added to the inbound rules. The security group of the RDS instance must be linked with the EC2 server instance so that the server can connect to the database. Port 4200 is for Prefect, 5000 for MLflow and 5432 for PostgreSQL.
- Run the Prefect Orion server:
In terminal 1
export EC2_IP="" # replace double quotes with the EC2 IP address
make setup_prefect && make start_prefect
- Run data_processing:
In terminal 2
make run_data_preprocess # runs preprocessing script. Training, validation datasets are created.
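A minimal sketch of how a preprocessing step can be wrapped as a Prefect 2.0 flow (task and flow names are illustrative, not necessarily how the repository's script is structured):

```python
import pandas as pd
from prefect import flow, task

@task
def load_raw_data(path: str) -> pd.DataFrame:
    return pd.read_parquet(path)

@task
def build_features(df: pd.DataFrame) -> pd.DataFrame:
    # feature engineering as described earlier
    return df

@flow(name="data-preprocess")
def data_preprocess(path: str = "input/train.parquet"):
    df = load_raw_data(path)
    return build_features(df)

if __name__ == "__main__":
    data_preprocess()  # the run shows up in the Prefect Orion UI
```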
- Run MLflow with remote tracking and S3 as artifact store:
In terminal 3
Export the DB secrets as environment variables. Replace the double quotes with the values obtained while setting up RDS.
export DB_USER=""
export DB_PASSWORD=""
export DB_ENDPOINT=""
export DB_NAME=""
export S3_BUCKET_NAME=""
export AWS_ACCESS_KEY_ID=""
export AWS_SECRET_ACCESS_KEY=""
export AWS_DEFAULT_REGION=us-east-1
Run MLflow:
make start_mlflow
The MLflow dashboard can be viewed at http://<EC2_PUBLIC_IP_DNS>:5000
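Inside the training script, pointing MLflow at this remote server is enough for runs to land in the RDS backend and artifacts in S3; a sketch, with an illustrative experiment name and placeholder values:

```python
import mlflow

# Point MLflow at the remote tracking server started above.
EC2_IP = "<EC2_PUBLIC_IP_DNS>"  # same address as the dashboard URL
mlflow.set_tracking_uri(f"http://{EC2_IP}:5000")
mlflow.set_experiment("grocery-sales-forecast")  # illustrative name

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.1)
    mlflow.log_metric("rmse", 0.55)  # placeholder value
    # logged model artifacts go to the S3 bucket configured as the artifact store
```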
- Run model training:
In terminal 2
export EC2_IP="" # replace double quotes with the EC2 IP address
export AWS_ACCESS_KEY_ID=""
export AWS_SECRET_ACCESS_KEY=""
export AWS_DEFAULT_REGION=us-east-1
make run_model_training
Copy the predictions to both the deployment/webservice-flask and deployment/webservice-lambda directories.
make copy_preds
As deployment uses a separate Pipfile, it is best to close all open terminal windows and open new ones for this section. If you just want to test deployment, run the above copy command. For the flask-mlflow app, the script handles this automatically.
Basic setup
cd ~/mlops-project-grocery-sales/deployment/webservice-flask
pipenv install --dev # since this directory has a separate Pipfile
Run test_predict.py and test_requests.py:
pipenv run python test_predict.py # in terminal 1. To test as a normal python module
pipenv run python flask_sales_predictor.py # in terminal 1
pipenv run python test_requests.py # in terminal 2. To test with a request endpoint
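test_requests.py does roughly the following (the /predict endpoint path is an assumption):

```python
import requests

# POST the same JSON event to the locally running Flask app.
event = {"find": {"date1": "2017-08-17", "store_nbr": 20}}
resp = requests.post("http://localhost:9696/predict", json=event, timeout=60)
print(resp.json())
```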
Test lambda deployment:
cd ~/mlops-project-grocery-sales/deployment/webservice-lambda
# Exit out of flask venv and create new one for lambda testing
pipenv install --dev
pipenv run python test_lambda.py # in terminal 1. To test with lambda handler before creating AWS Lambda resource
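test_lambda.py essentially calls the handler directly; a sketch, assuming the conventional lambda_function.lambda_handler module and function names:

```python
# Run from inside deployment/webservice-lambda so the handler module is importable.
import lambda_function  # assumed module name

event = {"find": {"date1": "2017-08-26", "store_nbr": 20}}
result = lambda_function.lambda_handler(event, None)  # no Lambda context needed locally
print(result)
```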
Containerise Flask application:
cd ~/mlops-project-grocery-sales/deployment/webservice-flask
docker build -t flask-sales-predictor:v1 .
docker run -it --rm -p 9696:9696 flask-sales-predictor:v1 # Terminal 1
python test_requests.py # Terminal 2. No need for pipenv since the app runs inside the container.
Containerise Lambda function:
To run and test the Lambda function locally, the AWS Lambda runtime interface emulator acts as a proxy, converting HTTP requests into JSON events that are passed to the Lambda function in the container image. We don't expose any port inside the Dockerfile when building the image; instead we map local port 9000 to container port 8080 when running it. Inside test_lambda_fn_docker.py, we send the event to localhost:9000/2015-03-31/functions/function/invocations. More info on this here.
cd ~/mlops-project-grocery-sales/deployment/webservice-lambda
docker build -t lambda-sales-predictor:v1 .
docker run \
-it --rm \
-p 9000:8080 \
-e AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID} \
-e AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY} \
-e RUN_ID=5651db4644334361b10296c51ba3af3e \
-e S3_BUCKET_NAME=mlops-project-sales-forecast-bucket \
lambda-sales-predictor:v1 # in terminal 1
python test_lambda_fn_docker.py # in terminal 2. No need for pipenv since the app runs inside the container.
{"find": {"date1": "2017-08-26", "store_nbr": 20}}
The variable date1 can be any date between 2017-08-16 and 2017-08-31. Please follow the exact date format to avoid errors.
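For reference, test_lambda_fn_docker.py boils down to a POST against the emulator's invocation URL, roughly like this:

```python
import requests

# The runtime interface emulator converts this HTTP request into a Lambda event.
RIE_URL = "http://localhost:9000/2015-03-31/functions/function/invocations"
event = {"find": {"date1": "2017-08-26", "store_nbr": 20}}
resp = requests.post(RIE_URL, json=event, timeout=60)
print(resp.json())
```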
To manage the data pipeline using Terraform, kindly refer to this repo.
This capstone project was created as part of the MLOps Zoomcamp course from DataTalks.Club. I'd like to thank the staff for providing a high-quality learning resource, and I appreciate their time, effort and assistance in helping me complete the project. I highly recommend this course to any aspiring data scientist.
- Model development
- ✅ Choose/collect dataset
- ✅ Convert huge raw CSV to parquet file formats
- ✅ Use Kaggle to store preprocessed datasets
- ✅ Preprocess and feature engineer
- ✅ Implement logging
- ✅ Prepare dataset for model training
- ✅ Implement LGBM model
- ✅ Validate and forecast predictions
- Workflow orchestration
- ✅ Do basic workflow orchestration with local API server
- ✅ Use a cloud(AWS) as API server
- ✅ Use local storage to store persisting flow code
- Experiment tracking
- ✅ Track experiments local backend(sqlite)
- ✅ Track experiments with a cloud(AWS RDS) backend
- ✅ Store model artifacts in a cloud storage(S3)
- Model deployment
- ✅ As a Flask application with an endpoint
- ✅ As a Lambda function with a handler
- ✅ As a docker container to test lambda function locally
- ✅ Use AWS ECR repository image as Lambda function source
- ✅ Create an AWS Lambda function with ECR image source and test it manually
- ✅ Connect API Gateway and lambda function using ECR image
- ✅ Deploy model as Streamlit app
- Infrastructure as Code
- ✅ Use Terraform to deploy the model to production using AWS ECR, Lambda, S3 and API Gateway
- Best practices and monitoring
- Continuous Integration
- ✅ Unit testing
- Continuous Deployment
- Implement model monitoring