The sales department of a grocery chain wants to build a unit-sales prediction engine (a web service application). Their ML engineer already has a predictive ML model but does not know how to bring it into production. The task here is to use the necessary MLOps tools to design and manage a production workflow.
- MLFlow for experiment tracking
- Prefect 2.0 as workflow orchestration tool
- AWS S3 bucket to store workflow artifacts
- Docker for local container deployment
- AWS ECR to store the built Docker image
- AWS Lambda to build a serverless deployment solution
- Terraform to automate infrastructure
Streamlit
For a quick demo, check out the deployed Streamlit app.
AWS API Gateway triggering Lambda function
Use a REST API client and set the method to POST. Send the request to this link, supplying the following JSON object as the body. You should receive a JSON object back as the output (response body) with the prediction.
{"find": {"date1": "2017-08-17", "store_nbr": 20}}
If you're looking to test the workflow comprehensively, jump to the "how to run" section.
The data comes from the Kaggle competition - Corporación Favorita Grocery Sales Forecasting.
Since the uncompressed dataset is too slow to read, I created Parquet equivalents of all the files on Kaggle (660MB). The Parquet format allows files to be read into memory much faster. You will need a Kaggle account to download the files.
To learn more about Parquet files, Databricks has a nice summary.
Note: Although loading is faster, the training dataset still needs about 6GB of RAM.
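The conversion itself is straightforward; a minimal sketch with pandas (file names are illustrative) looks like this:

```python
import pandas as pd

# Illustrative CSV -> Parquet conversion; requires pyarrow or fastparquet.
df = pd.read_csv("train.csv", parse_dates=["date"])
df.to_parquet("train.parquet", index=False)
```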
The training dataset contains data from 2016 to July 2017. This data is used to predict future sales in 2017.
Using the preprocessed data, we compute new features.
Basic features:
- Categorical features - item, family
- Promotion
Statistical features (a code sketch follows this list):
- time windows
- nearest days: [3, 7, 14]
- key: store x item, item
- target: unit_sales
- method:
- mean, median, max, min, std
- difference of mean value between adjacent time windows (only for equal time windows)
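A minimal sketch of the window statistics, assuming a dataframe with `date`, `store_nbr`, `item_nbr` and `unit_sales` columns (not the repository's exact code):

```python
import pandas as pd

def window_stats(sales: pd.DataFrame, end_date: str, days: int) -> pd.DataFrame:
    """Aggregate unit_sales over the `days` days ending at `end_date`,
    keyed by (store_nbr, item_nbr)."""
    end = pd.Timestamp(end_date)
    start = end - pd.Timedelta(days=days - 1)
    window = sales[(sales["date"] >= start) & (sales["date"] <= end)]
    return (window.groupby(["store_nbr", "item_nbr"])["unit_sales"]
                  .agg(["mean", "median", "max", "min", "std"])
                  .add_prefix(f"sales_{days}d_"))

# Features from the 3-, 7- and 14-day windows before the forecast period:
# feats = pd.concat([window_stats(sales, "2017-08-15", d) for d in (3, 7, 14)], axis=1)
```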
Since we have a set of features and a single target variable, unit_sales, we can treat this as a regression problem.
We use LightGBM as our model algorithm. We first train with the default hyperparameters to establish a baseline, then tune the parameters until we get the best model.
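A baseline along these lines (synthetic placeholder data is used so the sketch runs standalone; the real features come from the step above):

```python
import lightgbm as lgb
import numpy as np

# Placeholder data; in the project these come from the feature-engineering step.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(1000, 10)), rng.normal(size=1000)
X_valid, y_valid = rng.normal(size=(200, 10)), rng.normal(size=200)

# Default-ish hyperparameters as the baseline; tune from here.
params = {"objective": "regression", "metric": "rmse", "learning_rate": 0.1}

train_set = lgb.Dataset(X_train, label=y_train)
valid_set = lgb.Dataset(X_valid, label=y_valid, reference=train_set)

model = lgb.train(
    params,
    train_set,
    num_boost_round=500,
    valid_sets=[valid_set],
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)
print(model.predict(X_valid)[:5])
```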
The feature engineering ideas are heavily borrowed from the 1st place solution of the competition.
Since it is a regression problem, the independent variables serve as the input and the target variable is unit sales. As input we supply the store number and a date between 2017-08-16 and 2017-08-31; an item number is chosen at random. With these three inputs, the unit sales are predicted.
To download the raw data, preprocess and model, the following system is recommended -
Operating System: Ubuntu 20.04 x64 Linux
vCPU: minimum 4 cores (an AWS EC2 t2.xlarge instance would be sufficient)
RAM: minimum 16GB
Storage: minimum 10GB
- Install the prerequisites:
Miniconda3:
cd ~ && git clone https://github.com/dr563105/mlops-project-grocery-sales.git # clone the repo
sudo apt update && sudo apt install make -y
cd mlops-project-grocery-sales
make conda_docker_install # downloads and installs miniconda3, docker, docker-compose
Log out and log back in to the instance. The base conda env should now be activated.
(Optional) Test if docker is working:
docker run --rm hello-world # should print the hello-world output without errors
Note - You can skip the next steps and go directly to deployment. Just make sure to add the environment variables wherever necessary.
- Setup virtual environment:
cd ~/mlops-project-grocery-sales
make pipenv_setup # install Pipenv and other packages in Pipfile.
- Export Kaggle secrets as env variables to download the dataset:
export KAGGLE_USERNAME=datadinosaur
export KAGGLE_KEY=xxxxxxxxxxxxxx
- Create input directory and download Kaggle dataset into it:
cd ~/mlops-project-grocery-sales
make kaggle_dataset_download
- Follow the aws-rds guide to set up an AWS EC2 instance, an S3 bucket and AWS RDS for the MLflow tracking server.
Note: Make sure ports 4200, 5000 and 5432 are added to the inbound rules. The security group of the RDS instance must be linked with the EC2 server instance so that the server can connect to the database. Port 4200 is for Prefect, 5000 for MLflow and 5432 for PostgreSQL.
- Run the Prefect Orion server:
In terminal 1
export EC2_IP="" # replace double quotes with the EC2 IP address
make setup_prefect && make start_prefect
- Run data_processing:
In terminal 2
make run_data_preprocess # runs preprocessing script. Training, validation datasets are created.
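A minimal sketch of how a preprocessing step can be wrapped as a Prefect 2.0 flow (task and flow names are illustrative, not necessarily how the repository's script is structured):

```python
import pandas as pd
from prefect import flow, task

@task
def load_raw_data(path: str) -> pd.DataFrame:
    return pd.read_parquet(path)

@task
def build_features(df: pd.DataFrame) -> pd.DataFrame:
    # feature engineering as described earlier
    return df

@flow(name="data-preprocess")
def data_preprocess(path: str = "input/train.parquet"):
    df = load_raw_data(path)
    return build_features(df)

if __name__ == "__main__":
    data_preprocess()  # the run shows up in the Prefect Orion UI
```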
- Run MLflow with remote tracking and S3 as artifact store:
In terminal 3
Export the DB secrets as environment variables. Replace the double quotes with the values obtained while setting up RDS.
export DB_USER=""
export DB_PASSWORD=""
export DB_ENDPOINT=""
export DB_NAME=""
export S3_BUCKET_NAME=""
export AWS_ACCESS_KEY_ID=""
export AWS_SECRET_ACCESS_KEY=""
export AWS_DEFAULT_REGION=us-east-1
Run MLflow:
make start_mlflow
The MLflow dashboard can be viewed at http://<EC2_PUBLIC_IP_DNS>:5000
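Inside the training script, pointing MLflow at this remote server is enough for runs to land in the RDS backend and artifacts in S3; a sketch, with an illustrative experiment name and placeholder values:

```python
import mlflow

# Point MLflow at the remote tracking server started above.
EC2_IP = "<EC2_PUBLIC_IP_DNS>"  # same address as the dashboard URL
mlflow.set_tracking_uri(f"http://{EC2_IP}:5000")
mlflow.set_experiment("grocery-sales-forecast")  # illustrative name

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.1)
    mlflow.log_metric("rmse", 0.55)  # placeholder value
    # logged model artifacts go to the S3 bucket configured as the artifact store
```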
- Run model training:
In terminal 2
export EC2_IP="" # replace double quotes with the EC2 IP address
export AWS_ACCESS_KEY_ID=""
export AWS_SECRET_ACCESS_KEY=""
export AWS_DEFAULT_REGION=us-east-1
make run_model_training
Copy the predictions to both the deployment/webservice-flask and deployment/webservice-lambda directories.
make copy_preds
As deployment uses a separate Pipfile, it is best to close all open terminal windows and open new ones for this section. If you just want to test deployment, run the above copy command. For the flask-mlflow app, the script handles this automatically.
Basic setup
cd ~/mlops-project-grocery-sales/deployment/webservice-flask
pipenv install --dev # since this directory has a separate Pipfile
Run test_predict.py and test_requests.py:
pipenv run python test_predict.py # in terminal 1. To test as a normal python module
pipenv run python flask_sales_predictor.py # in terminal 1
pipenv run python test_requests.py # in terminal 2. To test with a request endpoint
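test_requests.py does roughly the following (the /predict endpoint path is an assumption):

```python
import requests

# POST the same JSON event to the locally running Flask app.
event = {"find": {"date1": "2017-08-17", "store_nbr": 20}}
resp = requests.post("http://localhost:9696/predict", json=event, timeout=60)
print(resp.json())
```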
Test lambda deployment:
cd ~/mlops-project-grocery-sales/deployment/webservice-lambda
# Exit out of flask venv and create new one for lambda testing
pipenv install --dev
pipenv run python test_lambda.py # in terminal 1. To test with lambda handler before creating AWS Lambda resource
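test_lambda.py essentially calls the handler directly; a sketch, assuming the conventional lambda_function.lambda_handler module and function names:

```python
# Run from inside deployment/webservice-lambda so the handler module is importable.
import lambda_function  # assumed module name

event = {"find": {"date1": "2017-08-26", "store_nbr": 20}}
result = lambda_function.lambda_handler(event, None)  # no Lambda context needed locally
print(result)
```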
Containerise Flask application:
cd ~/mlops-project-grocery-sales/deployment/webservice-flask
docker build -t flask-sales-predictor:v1 .
docker run -it --rm -p 9696:9696 flask-sales-predictor:v1 # Terminal 1
python test_requests.py # Terminal 2. No need for pipenv since the app runs inside the container.
Containerise Lambda function:
To run and test the Lambda function locally, the AWS Lambda runtime interface emulator acts as a proxy, converting HTTP requests into JSON events that are passed to the Lambda function in the container image. We don't expose any port inside the Dockerfile when building the image; instead we map local port 9000 to container port 8080 when running it. Inside test_lambda_fn_docker.py, we send the event to localhost:9000/2015-03-31/functions/function/invocations. More info on this here.
cd ~/mlops-project-grocery-sales/deployment/webservice-lambda
docker build -t lambda-sales-predictor:v1 .
docker run \
-it --rm \
-p 9000:8080 \
-e AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID} \
-e AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY} \
-e RUN_ID=5651db4644334361b10296c51ba3af3e \
-e S3_BUCKET_NAME=mlops-project-sales-forecast-bucket \
lambda-sales-predictor:v1 # in terminal 1
python test_lambda_fn_docker.py # in terminal 2. No need for pipenv since the app runs inside the container.
{"find": {"date1": "2017-08-26", "store_nbr": 20}}
The variable date1 can be any date between 2017-08-16 and 2017-08-31. Please follow the exact date format to avoid errors.
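For reference, test_lambda_fn_docker.py boils down to a POST against the emulator's invocation URL, roughly like this:

```python
import requests

# The runtime interface emulator converts this HTTP request into a Lambda event.
RIE_URL = "http://localhost:9000/2015-03-31/functions/function/invocations"
event = {"find": {"date1": "2017-08-26", "store_nbr": 20}}
resp = requests.post(RIE_URL, json=event, timeout=60)
print(resp.json())
```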
To manage the data pipeline using Terraform, kindly refer to this repo.
This capstone project was created as part of the MLOps Zoomcamp course from DataTalks.Club. I'd like to thank the staff for providing a high-quality learning resource, and I appreciate their time, effort and assistance in helping me complete the project. I highly recommend this course to any aspiring data scientist.
- Model development
- ✅ Choose/collect dataset
- ✅ Convert huge raw CSV to parquet file formats
- ✅ Use Kaggle to store preprocessed datasets
- ✅ Preprocess and feature engineer
- ✅ Implement logging
- ✅ Prepare dataset for model training
- ✅ Implement LGBM model
- ✅ Validate and forecast predictions
- Workflow orchestration
- ✅ Do basic workflow orchestration with local API server
- ✅ Use a cloud(AWS) as API server
- ✅ Use local storage to store persisting flow code
- Experiment tracking
- ✅ Track experiments local backend(sqlite)
- ✅ Track experiments with a cloud(AWS RDS) backend
- ✅ Store model artifacts in a cloud storage(S3)
- Model deployment
- ✅ As a Flask application with an endpoint
- ✅ As a Lambda function with a handler
- ✅ As a docker container to test lambda function locally
- ✅ Use AWS ECR repository image as Lambda function source
- ✅ Create an AWS Lambda function with ECR image source and test it manually
- ✅ Connect API Gateway and lambda function using ECR image
- ✅ Deploy model as Streamlit app
- Infrastructure as Code
- ✅ Use Terraform to deploy the model to production using AWS ECR, Lambda, S3 and API Gateway
- Best practices and monitoring
- Continuous Integration
- ✅ Unit testing
- Continuous Deployment
- Implement model monitoring