- 1. Overview
- 2. Installation
- 3. Rewe-Online-Shop-Data Pipeline
- 4. FoodData Central Pipeline
- 5. Insulin Index Data Pipeline
- 6. Merging Datasets
This project fetches nutrition data from several data sources and merges it into one standardized dataset, which is used for data analysis and meal plan generation in a related project.
I've built pipelines for the following data sources, only some of which turned out to be useful:
- FoodData Central SR Legacy Food
- REWE Online Shop Data
- Insulin Index (two sources)
- Fullness Factor, calculated through a formula
- Preparation Time, estimated by the OpenAI API
- Python 3.x
- Required Python libraries (listed in requirements.txt)
- Clone the repository:
git clone https://github.com/ArthurZakirov/food-data-pipeline.git
- Navigate to the project directory:
cd food-data-pipeline
- Install the dependencies:
python3 -m venv venv
venv/Scripts/activate
pip install -r requirements.txt
- Create an environment variable named OPENAI_API_KEY on your local device and set it to your OpenAI API key.
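A quick way to confirm the variable is visible to Python before running any of the LLM steps is sketched below. The helper name `require_openai_api_key` is hypothetical and not part of this repository; only the variable name OPENAI_API_KEY comes from the instructions above.

```python
import os


def require_openai_api_key(env=os.environ):
    """Return the value of OPENAI_API_KEY, or raise a clear error if it is missing.

    `env` defaults to the process environment; passing a plain dict makes
    the check easy to try out without touching real settings.
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set. Create the environment variable "
            "and open a new terminal before running the pipeline."
        )
    return key
```

Calling `require_openai_api_key()` at the top of a script fails early with a readable message instead of a later authentication error.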
Download and set up the Microsoft Edge WebDriver, following its official setup instructions.
Run the following command in a new terminal:
start msedge --remote-debugging-port=9222 --user-data-dir=<path\to\your\user\data>
Argument | Example | Description |
---|---|---|
--remote-debugging-port | 9222 | Debugging port |
--user-data-dir | "C:\Users\arthu\Documents\EdgeUserData" | Where the Edge browsing data is stored |
In the Edge browser tab opened by the previous command, manually navigate to https://shop.rewe.de/ and accept the cookies. Doing this by hand helps avoid the site's bot protection.
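Before starting the scraper, it can help to confirm that Edge is actually listening on the debugging port. The following is a minimal sketch using only the standard library; the function name `is_debug_port_open` is hypothetical, and the default port 9222 is simply the example value from the table above.

```python
import socket


def is_debug_port_open(port=9222, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError (e.g. connection refused, timeout)
        # when nothing is listening on the port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If `is_debug_port_open()` returns False, the Edge instance was probably started without the --remote-debugging-port argument or has already been closed.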
Run the following command in a new Terminal
python src/rewe_data/scrape_rewe_online_shop.py --output_path=<path/to/your/data> --remote-debugging-port=<port> --edge_driver_path=<path/to/your/driver> --url=<url>
Argument | Example | Description |
---|---|---|
--output_path | "data/rewe_dataset.csv" | Where to store the scraped dataset |
--remote-debugging-port | 9222 | Debugging port |
--edge_driver_path | "C:\\Users\\arthu\\Tools\\WebDriver\\edgedriver_win64\\msedgedriver.exe" | Path to msedgedriver.exe |
--url | "https://shop.rewe.de/" | URL of the REWE website |
- Run the following
python src/rewe_data/process_rewe_dataset.py
- Set the following config inside config/config.yaml:
  - defaults.chain to extract_regulated_food_name
  - data.input_path to data\raw\cleaned_rewe_dataset.csv
  - data.output_path to data\raw\cleaned_rewe_dataset.csv
  - data.input_column to Name
  - data.output_column to Regulated Name
Then run
python src/my_langchain/run_llm_processing_of_df.py
- Set the following config inside config/config.yaml:
  - defaults.chain to translate_ger_to_eng
  - data.input_path to data\raw\cleaned_rewe_dataset.csv
  - data.output_path to data\raw\cleaned_rewe_dataset.csv
  - data.input_column to Regulated Name
  - data.output_column to Regulated Name English
Then run
python src/my_langchain/run_llm_processing_of_df.py
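To make the roles of data.input_column and data.output_column concrete, here is a minimal sketch of a column-to-column processing step. It does not reproduce run_llm_processing_of_df.py: the function `process_column` and the stand-in transform `fake_chain` are hypothetical, and a real run would call an LLM chain instead of upper-casing text.

```python
import csv
import io


def process_column(rows, input_column, output_column, transform):
    """Apply `transform` to `input_column` of each row and store it in `output_column`."""
    processed = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay unchanged
        new_row[output_column] = transform(row[input_column])
        processed.append(new_row)
    return processed


def fake_chain(text):
    """Stand-in for an LLM chain (assumption): simply upper-cases the text."""
    return text.upper()


# Tiny in-memory CSV with a single "Name" column, mirroring the config above.
raw_csv = "Name\nbio apfel\nvollmilch\n"
rows = list(csv.DictReader(io.StringIO(raw_csv)))
result = process_column(rows, "Name", "Regulated Name", fake_chain)
```

Because the output is written to a new column, running a second chain (for example the translation step) can read the first chain's output column as its input.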
Latest Downloads > SR Legacy > April 2018 (CSV)
I selected the SR Legacy version over all other USDA datasets because it is the only one that satisfies both of the following properties:
- It provides the highest number of micronutrients
- It uses general food names (instead of branded food names), which makes it easier to match with foods from other data sources
- Download the dataset (SR Legacy, April 2018, CSV)
- Unzip the archive
- Move the folder so that its path becomes "data/raw/FoodData_Central_sr_legacy_food_csv_2018-04"
Run this
python src/food_data_central/process_fdc_data.py
For the Insulin Index source that is a PDF: upload the PDF to ChatGPT and ask it to extract the data into a CSV file.
For the Insulin Index source that is a website: export the page as a PDF, upload it to ChatGPT, and ask it to extract the data into a CSV file.
Note: If creating embeddings fails, the cause may be that a previous LangChain operation turned some values in the DataFrame's output column into NaN.
If that's the case, open a Jupyter notebook, run dropna on the dataset, and save it again.
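The clean-up described in the note could look like the sketch below. It assumes pandas is installed and that the affected column is the chain's output column, here "Regulated Name English"; the helper name `drop_missing_rows` and the toy data are hypothetical.

```python
import pandas as pd


def drop_missing_rows(df, column):
    """Return a copy of df without the rows whose `column` value is NaN."""
    return df.dropna(subset=[column]).reset_index(drop=True)


# Toy data standing in for the real dataset; None becomes a missing value.
df = pd.DataFrame(
    {
        "Name": ["Apfel", "Milch", "Brot"],
        "Regulated Name English": ["apple", None, "bread"],
    }
)
cleaned = drop_missing_rows(df, "Regulated Name English")

# To persist the result, write it back to the path used in your config, e.g.:
# cleaned.to_csv("<path/to/your/dataset>", index=False)
```

Restricting dropna to one column with `subset` avoids discarding rows that merely lack an unrelated nutrient value.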
Inside config/config.yaml, set defaults.embedding to rewe_embedding. Then run
python src/data_merging/create_embeddings.py
Inside config/config.yaml, set defaults.embedding to fdc_embedding. Then run
python src/data_merging/create_embeddings.py
Run this
python src/data_merging/merge_rewe_and_fdc_using_embeddings.py
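The general idea behind merging two datasets through embeddings can be sketched as nearest-neighbour matching by cosine similarity. This illustrates the technique only and is not the logic of the merge script itself: the function names, the toy three-dimensional vectors, and the food names are all assumptions, and real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_matches(left, right):
    """Map each name in `left` to the name in `right` with the most similar embedding."""
    matches = {}
    for left_name, left_vec in left.items():
        matches[left_name] = max(
            right, key=lambda name: cosine_similarity(left_vec, right[name])
        )
    return matches


# Toy embeddings (assumption): one dict per dataset, keyed by food name.
rewe = {"apple": [0.9, 0.1, 0.0], "whole milk": [0.1, 0.9, 0.2]}
fdc = {
    "Apples, raw, with skin": [1.0, 0.0, 0.1],
    "Milk, whole, 3.25% milkfat": [0.0, 1.0, 0.1],
}
matches = best_matches(rewe, fdc)
```

In practice a minimum-similarity threshold is often added so that foods without a good counterpart stay unmatched rather than being paired with a poor candidate.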
Inside config/config.yaml, set defaults.embedding to insulin_index_embedding. Then run
python src/data_merging/create_embeddings.py
Inside config/config.yaml, set defaults.embedding to rewe_fdc_embedding. Then run
python src/data_merging/create_embeddings.py
Run this
python src/data_merging/merge_rewe_and_fdc_with_insulin_using_embeddings.py
python src/fullness_factor/append_fullness_factor.py --data_path <path/to/your/dataset>
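This README does not state which formula append_fullness_factor.py applies. As an illustration only, the sketch below assumes the Fullness Factor formula published by NutritionData, which takes per-100 g values for calories, protein, dietary fiber, and total fat. The function name and argument names are hypothetical.

```python
def fullness_factor(calories, protein, fiber, fat):
    """Fullness Factor per the NutritionData formula (assumed, not confirmed for this repo).

    All inputs are per 100 g: calories in kcal; protein, fiber, and fat in grams.
    The result is clamped to the range 0.5 to 5.0.
    """
    cal = max(30.0, calories)  # calories are floored at 30
    pr = min(30.0, protein)    # protein is capped at 30 g
    df = min(12.0, fiber)      # dietary fiber is capped at 12 g
    tf = min(50.0, fat)        # total fat is capped at 50 g
    raw = (
        41.7 / cal ** 0.7
        + 0.05 * pr
        + 6.17e-4 * df ** 3
        - 7.25e-6 * tf ** 3
        + 0.617
    )
    return max(0.5, min(5.0, raw))
```

Under this formula, low-calorie foods that are rich in protein or fiber score high, while energy-dense, fatty foods fall toward the lower bound.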
- Set the following config inside config/config.yaml:
  - defaults.chain to food_preparation_time
  - data.input_path to data\processed\merged_rewe_fdc_insulin.csv
  - data.output_path to data/final/merged_rewe_fdc_insulin_time.csv
  - data.input_column to Non Nutrient Data.FDC Name
  - data.output_column to Non Nutrient Data.Preparation Time
python src/my_langchain/run_llm_processing_of_df.py