Food Data Pipeline

Overview

This project fetches nutrition data from several data sources and merges it into one standardized dataset, which a companion project uses for data analysis and meal plan generation.

I've built pipelines for the following data sources, only some of which turned out to be useful:

FoodData Central SR Legacy Food
REWE Online Shop Data
Insulin Index (two sources: the Bell KJ thesis and Foodstruct)
Fullness Factor, calculated through a formula
Preparation Time, estimated via the OpenAI API

Installation

Python 3.x
Required Python libraries (listed in requirements.txt)

Clone the repository:

git clone https://github.com/ArthurZakirov/food-data-pipeline.git

Navigate to the project directory:
```
cd food-data-pipeline
```

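Install the dependencies (the activation command below is for Windows; on Linux or macOS use source venv/bin/activate instead):

python3 -m venv venv
venv\Scripts\activate
pip install -r requirements.txt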

Create an environment variable named OPENAI_API_KEY on your local machine and set it to your OpenAI API key.
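Before running the LLM steps, it can help to confirm that Python actually sees the variable. The check below is a minimal sketch; the helper name `require_openai_key` is hypothetical and not part of this repository.

```python
import os


def require_openai_key(env=None):
    """Return the OpenAI API key from the environment or raise a clear error."""
    # `env` is only a parameter so the check can be tried with a plain dict.
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; define it before running the pipeline")
    return key
```

Calling `require_openai_key()` at the top of a script fails early with a readable message instead of failing later inside an API call.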

REWE Online Shop Data

1. Set Up Edge WebDriver

Download and set up the Microsoft Edge WebDriver following Microsoft's installation instructions.

2. Start Edge Browser With Debugging Port

Run the following command in a new terminal:

start msedge --remote-debugging-port=9222 --user-data-dir=<path\to\your\user\data>

| Argument | Example | Description |
| --- | --- | --- |
| `--remote-debugging-port` | `9222` | Debugging port |
| `--user-data-dir` | `"C:\Users\arthu\Documents\EdgeUserData"` | Where Edge stores its browsing data |
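To verify that Edge really is listening on the debugging port, you can query the `/json/version` endpoint that Chromium-based browsers expose on that port. This is a minimal sketch using only the standard library; the function name `debugger_is_up` is hypothetical, and the host `127.0.0.1` is an assumption about a local setup.

```python
import json
import urllib.error
import urllib.request


def debugger_is_up(port=9222, host="127.0.0.1", timeout=2.0):
    """Return True if a Chromium-based browser answers on the debugging port."""
    url = f"http://{host}:{port}/json/version"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            info = json.load(response)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a reply that is not valid JSON.
        return False
    return isinstance(info, dict) and "Browser" in info
```

If `debugger_is_up()` returns False, the browser was most likely started without the flag or on a different port.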

3. Manually Navigate To Website

In the Edge window opened by the previous command, manually navigate to https://shop.rewe.de/ and accept the cookies to avoid triggering the bot protection.

4. Scrape The Dataset

Run the following command in a new terminal:

python src/rewe_data/scrape_rewe_online_shop.py --output_path=<path/to/your/data> --remote-debugging-port=<port> --edge_driver_path=<path/to/your/driver> --url=<url>

| Argument | Example | Description |
| --- | --- | --- |
| `--output_path` | `"data/rewe_dataset.csv"` | Where to store the scraped dataset |
| `--remote-debugging-port` | `9222` | Debugging port |
| `--edge_driver_path` | `"C:\\Users\\arthu\\Tools\\WebDriver\\edgedriver_win64\\msedgedriver.exe"` | Path to msedgedriver.exe |
| `--url` | `"https://shop.rewe.de/"` | URL of the REWE website |
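The arguments above are enough to attach Selenium to the already running Edge instance instead of launching a fresh one. The sketch below shows one way this could look; it is not the repository's actual script. The defaults, the function names `build_parser` and `attach_to_edge`, and the host `127.0.0.1` are assumptions, and Selenium 4 must be installed for the second function.

```python
import argparse


def build_parser():
    """Mirror the four command-line arguments from the table (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Scrape the REWE online shop")
    parser.add_argument("--output_path", default="data/rewe_dataset.csv")
    parser.add_argument("--remote-debugging-port", type=int, default=9222)
    parser.add_argument("--edge_driver_path", required=True)
    parser.add_argument("--url", default="https://shop.rewe.de/")
    return parser


def attach_to_edge(args):
    """Attach to the Edge instance that was started with the debugging port."""
    # Imported lazily so the argument parsing works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.edge.options import Options
    from selenium.webdriver.edge.service import Service

    options = Options()
    # debuggerAddress makes the driver reuse the running browser session.
    options.add_experimental_option(
        "debuggerAddress", f"127.0.0.1:{args.remote_debugging_port}"
    )
    driver = webdriver.Edge(service=Service(args.edge_driver_path), options=options)
    driver.get(args.url)
    return driver
```

Reusing the manually opened session is what lets the scraper inherit the accepted cookies from the previous step.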

5. Process REWE Dataset

Run the following:

python src/rewe_data/process_rewe_dataset.py

Set the following config inside config/config.yaml:

defaults.chain to extract_regulated_food_name
data.input_path to data\raw\cleaned_rewe_dataset.csv
data.output_path to data\raw\cleaned_rewe_dataset.csv
data.input_column to Name
data.output_column to Regulated Name

Then run

python src/my_langchain/run_llm_processing_of_df.py

Set the following config inside config/config.yaml:

defaults.chain to translate_ger_to_eng
data.input_path to data\raw\cleaned_rewe_dataset.csv
data.output_path to data\raw\cleaned_rewe_dataset.csv
data.input_column to Regulated Name
data.output_column to Regulated Name English

Then run

python src/my_langchain/run_llm_processing_of_df.py
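Both passes follow the same pattern: read one column, send each value through a chain, and write the result to another column. The sketch below illustrates that pattern only. It replaces the LLM call with a hypothetical lookup function `fake_translate`, uses made-up sample rows, and does not reproduce the repository's LangChain code.

```python
import csv
import io


def process_column(rows, input_column, output_column, transform):
    """Apply `transform` to one column and store the result in another."""
    processed = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay unchanged
        new_row[output_column] = transform(row[input_column])
        processed.append(new_row)
    return processed


def fake_translate(text):
    # Placeholder for the LLM call; a real chain would query the model here.
    glossary = {"Apfel": "Apple", "Milch": "Milk"}
    return glossary.get(text, text)


raw = io.StringIO("Name,Regulated Name\nBio Apfel 1kg,Apfel\nFrische Milch 1l,Milch\n")
rows = list(csv.DictReader(raw))
rows = process_column(rows, "Regulated Name", "Regulated Name English", fake_translate)
print([row["Regulated Name English"] for row in rows])
```

Because the input and output paths are the same file in the config above, each pass adds one column to the same dataset.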

FoodData Central

1. Download Dataset

Latest Downloads > SR Legacy > April 2018 (CSV)

I selected SR Legacy over the other USDA datasets because it is the only one that satisfies both of the following properties:

It provides the highest number of micronutrients
It uses general food names (instead of branded food names), which makes it easier to match with foods from other data sources

Download the dataset
Unzip the folder
Move the folder so that the path becomes "data/raw/FoodData_Central_sr_legacy_food_csv_2018-04"
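The unzip and move steps can also be scripted. This is a minimal sketch, assuming the downloaded archive contains the CSV files directly; the function name `extract_fdc_archive` and the archive location you pass in are hypothetical.

```python
import zipfile
from pathlib import Path

# Target directory taken from the step above.
EXPECTED_DIR = Path("data/raw/FoodData_Central_sr_legacy_food_csv_2018-04")


def extract_fdc_archive(zip_path, target_dir=EXPECTED_DIR):
    """Extract the downloaded archive into the directory the pipeline expects."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
    # Return the CSV file names found so the result is easy to inspect.
    return sorted(path.name for path in target_dir.rglob("*.csv"))
```

If the returned list is empty, the archive layout differs from the assumption and the folder should be moved by hand.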

Run this

python src/food_data_central/process_fdc_data.py

Insulin Index

University Of Sydney: Bell KJ Thesis

Upload the thesis PDF to ChatGPT and ask it to extract the data into a CSV file.

Foodstruct

Export the Foodstruct page as a PDF, upload it to ChatGPT, and ask it to extract the data into a CSV file.
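Since an LLM can misread tables in a PDF, it is worth scanning the extracted CSV for obviously broken rows before merging. The sketch below is one possible check; the column names `Food` and `Insulin Index`, the value bounds, and the function name are all assumptions to adapt to the file you actually get back.

```python
import csv
import io


def find_suspicious_rows(csv_text, name_column, value_column, low=0.0, high=200.0):
    """Return (line_number, row) pairs whose value is missing, non-numeric, or out of range."""
    suspicious = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        name = (row.get(name_column) or "").strip()
        try:
            value = float(row.get(value_column) or "")
        except ValueError:
            suspicious.append((line_number, row))
            continue
        # A NaN value fails the range comparison and is flagged as well.
        if not name or not (low <= value <= high):
            suspicious.append((line_number, row))
    return suspicious
```

Rows flagged this way can then be compared against the original PDF by hand.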

Merging Datasets

1. Merge REWE & FDC

Note: If creating embeddings fails, a previous LangChain step may have written NaN values into the output column of the dataframe. In that case, open a Jupyter notebook, drop the affected rows with dropna, and save the dataset again.
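In pandas, the cleanup is a call such as `df.dropna(subset=[column])` followed by saving the file. The sketch below shows the same idea with only the standard library, for cases where you want to inspect the CSV without a notebook. The column name and sample rows are made up, and it assumes missing values appear in the CSV as empty cells or the text `nan`.

```python
import csv
import io


def drop_missing(rows, column):
    """Keep only rows whose `column` holds a real value (not empty and not 'nan')."""
    kept = []
    for row in rows:
        value = (row.get(column) or "").strip()
        if value and value.lower() != "nan":
            kept.append(row)
    return kept


sample = io.StringIO("Name,Regulated Name English\nApfel,Apple\nMilch,\nBrot,nan\n")
cleaned = drop_missing(list(csv.DictReader(sample)), "Regulated Name English")
print([row["Name"] for row in cleaned])
```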

Inside config/config.yaml set defaults.embedding to rewe_embedding. Then run

python src/data_merging/create_embeddings.py

Inside config/config.yaml set defaults.embedding to fdc_embedding. Then run

python src/data_merging/create_embeddings.py

Run this

python src/data_merging/merge_rewe_and_fdc_using_embeddings.py
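Conceptually, merging by embeddings means pairing each REWE item with the FDC item whose embedding vector is most similar. The sketch below shows that idea with cosine similarity and toy three-dimensional vectors. It is an illustration under assumptions: the repository's script may use a different similarity measure, thresholds, or a vector library, and the names and vectors here are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_by_embedding(left, right):
    """Map each name in `left` to the most similar name in `right`."""
    matches = {}
    for left_name, left_vec in left.items():
        best_name = max(right, key=lambda name: cosine_similarity(left_vec, right[name]))
        matches[left_name] = best_name
    return matches


# Toy embeddings; real ones have hundreds or thousands of dimensions.
rewe = {"Apple": [0.9, 0.1, 0.0], "Whole milk": [0.0, 0.2, 0.9]}
fdc = {"Apples, raw": [1.0, 0.0, 0.1], "Milk, whole": [0.1, 0.1, 1.0]}
print(match_by_embedding(rewe, fdc))
```

This brute-force comparison is quadratic in the number of items, which is acceptable for a sketch but may be slow for large datasets.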

2. Merge REWE & FDC & Insulin Index

Inside config/config.yaml set defaults.embedding to insulin_index_embedding. Then run

python src/data_merging/create_embeddings.py

Inside config/config.yaml set defaults.embedding to rewe_fdc_embedding. Then run

python src/data_merging/create_embeddings.py

Run this

python src/data_merging/merge_rewe_and_fdc_with_insulin_using_embeddings.py

3. Merge REWE & FDC & Insulin Index & Fullness Factor

python src/fullness_factor/append_fullness_factor.py --data_path <path/to/your/dataset>
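This README does not state which formula the script applies. One widely cited definition is the Fullness Factor published by NutritionData, which is computed from calories, protein, dietary fiber, and total fat per 100 g. The sketch below assumes that formula; the function name is hypothetical and the repository's implementation may differ.

```python
def fullness_factor(calories, protein, fiber, fat):
    """Fullness Factor for 100 g of food (kcal and grams), clamped to [0.5, 5.0]."""
    cal = max(calories, 30.0)  # calories have a floor of 30
    pr = min(protein, 30.0)    # protein is capped at 30 g
    df = min(fiber, 12.0)      # dietary fiber is capped at 12 g
    tf = min(fat, 50.0)        # total fat is capped at 50 g
    raw = 41.7 / cal ** 0.7 + 0.05 * pr + 6.17e-4 * df ** 3 - 7.25e-6 * tf ** 3 + 0.617
    return max(0.5, min(5.0, raw))
```

Under this formula, foods that are low in calories and high in protein or fiber score higher, while calorie-dense, fatty foods score lower.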

4. Merge REWE & FDC & Insulin Index & Fullness Factor & Preparation Time

Set the following config inside config/config.yaml:

defaults.chain to food_preparation_time
data.input_path to data\processed\merged_rewe_fdc_insulin.csv
data.output_path to data/final/merged_rewe_fdc_insulin_time.csv
data.input_column to Non Nutrient Data.FDC Name
data.output_column to Non Nutrient Data.Preparation Time

Then run

python src/my_langchain/run_llm_processing_of_df.py
