Open code accompanying the paper: A Reproducible Approach for Mining Business Activities from Emails for Process Analytics, by Raphael Azorin, Daniela Grigori, and Khalid Belhajjame, presented at ICSOC AI-PA 2021, the 2nd International Workshop on AI-enabled Process Automation.
This repository contains notebooks for three dependent tasks:
- Raw email extraction from MBOX to CSV (see 0_apache_camel_email_dataset_extraction.ipynb)
- Email CSV preprocessing (see 1_dataset_preprocessing.ipynb)
- Experiments (see experimentations.ipynb, to be run in Google Colab®)
The last task can be run independently as the preprocessed input data is already included in this repository at /data/camel_emails_emb_s2v.csv.
To reproduce the experimental results presented in the paper:
- Upload the input CSV of preprocessed emails (available at /data/camel_emails_emb_s2v.csv) to Google Drive®
- Upload the experimentation notebook experimentations.ipynb to Google Colab®
- Set up and run the notebook. The default parameters are those used in the article; the only required setup is the location of the input and output files on your Drive.
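As an illustration of the setup step above, here is a minimal sketch of how the input and output locations could be defined in a Colab cell. The folder name `email_activity_mining`, the `results` subfolder, and the helpers `build_paths` and `mount_drive` are hypothetical and are not taken from the notebook; the actual variable names in experimentations.ipynb may differ.

```python
from pathlib import Path

# Standard Colab mount point; the project folder name below is hypothetical.
DRIVE_MOUNT = Path("/content/drive")
PROJECT_DIR = DRIVE_MOUNT / "MyDrive" / "email_activity_mining"


def build_paths(project_dir):
    """Return the input CSV path and an output folder under project_dir."""
    project_dir = Path(project_dir)
    return {
        "input_csv": project_dir / "camel_emails_emb_s2v.csv",
        "output_dir": project_dir / "results",  # hypothetical output folder
    }


def mount_drive(mount_point=DRIVE_MOUNT):
    """Mount Google Drive when running inside Colab; return False elsewhere."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return False
    drive.mount(str(mount_point))
    return True


if __name__ == "__main__":
    mount_drive()
    paths = build_paths(PROJECT_DIR)
    # Quick sanity check that the uploaded CSV is where the notebook expects it.
    print(paths["input_csv"], paths["input_csv"].exists())
```

If the printed flag is `False`, the CSV was probably uploaded to a different Drive folder than the one configured.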
Preprocessed input data is already included in the /data/ folder. To fully reproduce the data preparation steps, first:
- Download the raw Apache Camel MBOX files covering the period 2017-04-14 10:42:39 UTC to 2017-04-19 13:27:37 UTC and store them in the /data/mailbox/ folder.
- Download the corresponding email labels and store them in the /data/labels/ folder.
- Download and decompress the Sense2Vec 2019 model in the /helper/ folder.
- Download and install spaCy with its English model "en_core_web_sm".
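Before running the notebooks, it can help to check that these prerequisites are in place. The sketch below is an assumption about how such a check could look, not part of the repository: the function name `missing_prerequisites` is hypothetical, and it only verifies that the three folders are non-empty and that the `spacy` and `en_core_web_sm` packages are importable.

```python
import importlib.util
from pathlib import Path


def missing_prerequisites(repo_root="."):
    """List the data preparation prerequisites that appear to be missing."""
    root = Path(repo_root)
    missing = []

    # Folders named in the steps above, relative to the repository root.
    for folder in ("data/mailbox", "data/labels", "helper"):
        path = root / folder
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(f"empty or missing folder: {folder}")

    # spaCy and its small English model are both installed as Python packages.
    for package in ("spacy", "en_core_web_sm"):
        if importlib.util.find_spec(package) is None:
            missing.append(f"Python package not installed: {package}")

    return missing


if __name__ == "__main__":
    for problem in missing_prerequisites("."):
        print(problem)
```

An empty result only means the folders contain something; it does not validate the contents of the MBOX files, labels, or the Sense2Vec model.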
Then run the 0_apache_camel_email_dataset_extraction.ipynb and 1_dataset_preprocessing.ipynb notebooks, in that order; together they contain the whole preprocessing of the raw emails.
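For readers who want a feel for the extraction step without opening the notebook, the following is a rough sketch of an MBOX to CSV conversion restricted to the period given above, using only the Python standard library. It is not the notebook's implementation: the function names, the CSV column names (`date`, `sender`, `subject`, `body`), and the choice to keep only the first text/plain part are assumptions made for illustration.

```python
import csv
import mailbox
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Period stated in the data preparation steps, in UTC.
START = datetime(2017, 4, 14, 10, 42, 39, tzinfo=timezone.utc)
END = datetime(2017, 4, 19, 13, 27, 37, tzinfo=timezone.utc)


def message_date(message):
    """Parse the Date header into an aware UTC datetime, or None if unusable."""
    raw = message.get("Date")
    if raw is None:
        return None
    try:
        parsed = parsedate_to_datetime(str(raw))
    except (TypeError, ValueError):
        return None
    if parsed.tzinfo is None:
        # Assumption: dates without a timezone are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def plain_text_body(message):
    """Return the first text/plain part, decoded leniently."""
    for part in message.walk():
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            return payload.decode(charset, errors="replace")
        except LookupError:  # unknown charset name in the header
            return payload.decode("utf-8", errors="replace")
    return ""


def mbox_to_csv(mbox_path, csv_path, start=START, end=END):
    """Write messages dated within [start, end] to a CSV; return the row count."""
    rows = 0
    box = mailbox.mbox(mbox_path)
    try:
        with open(csv_path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(["date", "sender", "subject", "body"])
            for message in box:
                sent = message_date(message)
                if sent is None or not (start <= sent <= end):
                    continue
                writer.writerow([
                    sent.isoformat(),
                    str(message.get("From", "")),
                    str(message.get("Subject", "")),
                    plain_text_body(message),
                ])
                rows += 1
    finally:
        box.close()
    return rows
```

Messages with a missing or unparseable Date header are skipped in this sketch; the notebook may handle them differently.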