The goal is to classify the Cause and Body Part from the description of an injury using the BERT model. The dataset was processed and saved as CSV files. For training, the bert-base-uncased model from HuggingFace was used. Two models were built: one for classifying Cause and another for classifying Body Part.
The Jupyter notebook files are in the Model Training folder. The train.csv and test.csv files are inside the ‘data’ directory. The data directory is also used to store the processed CSV files that are loaded later during model training.
Loading the train.csv file, we can see that it has 6 columns [LossDescription, ResultingInjuryDesc, PartInjuredDesc, Cause - Hierarchy 1, Body Part - Hierarchy 1, Index]; it is loaded using the Index column as the index.
The texts in [LossDescription, ResultingInjuryDesc, PartInjuredDesc] were merged into a single column called ‘description’, and the original three columns were dropped.
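A minimal sketch of this preprocessing step, assuming pandas and the path data/train.csv (the exact notebook code may differ):

```python
import pandas as pd

# Load train.csv, using the Index column as the dataframe index (assumed path).
df = pd.read_csv("data/train.csv", index_col="Index")

# Merge the three free-text columns into a single 'description' column,
# then drop the originals.
text_cols = ["LossDescription", "ResultingInjuryDesc", "PartInjuredDesc"]
df["description"] = df[text_cols].fillna("").agg(" ".join, axis=1)
df = df.drop(columns=text_cols)
```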
Now our dataframe only has 3 columns, namely [Cause - Hierarchy 1, Body Part - Hierarchy 1, description].
The text inside the dataframe is converted to lowercase.
Then we split the dataframe into two separate dataframes, because we need to classify the two targets separately. Once split, the null values are dropped from each dataframe.
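A sketch of the lowercasing and splitting step (column selections follow the description above; the notebook's exact code may differ):

```python
# Lowercase the merged description text.
df["description"] = df["description"].str.lower()

# Split into one dataframe per classification target and drop rows with
# a missing target value.
cause_df = df[["description", "Cause - Hierarchy 1"]].dropna()
bodypart_df = df[["description", "Body Part - Hierarchy 1"]].dropna()
```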
We can see that there are 12 unique classes in Cause - Hierarchy 1 and 7 in Body Part - Hierarchy 1.
Now the class names are encoded as integer labels in a new column called label.
These dataframes are then saved as CSV files inside the data directory, containing only the columns [description, label].
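A sketch of the label-encoding and export step; the use of pd.factorize and the output file names cause.csv / bodypart.csv are assumptions:

```python
# Encode each class name as an integer label; factorize also returns the
# class-name order, which is needed later to map predictions back to names.
cause_df["label"], cause_classes = pd.factorize(cause_df["Cause - Hierarchy 1"])
bodypart_df["label"], bodypart_classes = pd.factorize(bodypart_df["Body Part - Hierarchy 1"])

# Keep only the columns used in training and save to the data directory.
cause_df[["description", "label"]].to_csv("data/cause.csv", index=False)
bodypart_df[["description", "label"]].to_csv("data/bodypart.csv", index=False)
```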
I used the bert-base-uncased pre-trained model from HuggingFace for classification. I loaded the model and fine-tuned it in PyTorch on the dataset so that it can classify the class labels. The processed dataset was loaded, split into train and test sets, and the description column was tokenized so that it is accepted by the model.
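A sketch of the loading, splitting and tokenization step for the cause model, using the HuggingFace datasets library; the file name and the 80/20 split ratio are assumptions:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Load the processed CSV and split it into train and test sets (assumed ratio).
raw = load_dataset("csv", data_files="data/cause.csv")["train"]
dataset = raw.train_test_split(test_size=0.2)

# Tokenize the description column so the inputs are accepted by BERT.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["description"], padding="max_length", truncation=True)

tokenized = dataset.map(tokenize, batched=True)
```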
Removed the unnecessary columns
Loaded the data as a PyTorch DataLoader object
Loaded the BERT model
Initialized the optimizer and learning rate scheduler
Fine-tuned the pre-trained model
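These steps follow the standard HuggingFace PyTorch fine-tuning workflow; a condensed sketch for the cause model is below (batch size, learning rate, epoch count and num_labels=12 are assumptions, and the body-part model would use num_labels=7):

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, get_scheduler

# Drop the raw text column, rename 'label' to 'labels' (expected by the model)
# and switch the dataset format to PyTorch tensors.
tokenized = tokenized.remove_columns(["description"])
tokenized = tokenized.rename_column("label", "labels")
tokenized.set_format("torch")

train_loader = DataLoader(tokenized["train"], shuffle=True, batch_size=16)

# Load bert-base-uncased with a classification head for the 12 cause classes.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=12)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Optimizer and linear learning rate scheduler.
optimizer = AdamW(model.parameters(), lr=5e-5)
num_epochs = 3
num_training_steps = num_epochs * len(train_loader)
lr_scheduler = get_scheduler("linear", optimizer=optimizer,
                             num_warmup_steps=0, num_training_steps=num_training_steps)

# Fine-tuning loop.
model.train()
for epoch in range(num_epochs):
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

model.save_pretrained("cause_model")
```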
This method was used for both models, namely the cause_model to classify Cause - Hierarchy 1 and the bodypart_model to classify Body Part - Hierarchy 1.
I got the following accuracy on the validation set for the respective models.

Model for Classifying Cause - Hierarchy 1
Saving the model as cause_model
Model for Classifying Body Part - Hierarchy 1
Saving this model as bodypart_model
Prediction follows the same method as before for processing and loading the dataset: the test data is processed and saved in a CSV file.
Loading the processed test data and tokenizing it.
Loading both of the saved classification models
Running inference and obtaining the predictions.
Saving the predictions into the test.csv dataset.
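A self-contained sketch of the inference step; the saved model directories and the mapping of predictions back to class names follow the description above, but the details are assumptions:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
cause_model = AutoModelForSequenceClassification.from_pretrained("cause_model")
bodypart_model = AutoModelForSequenceClassification.from_pretrained("bodypart_model")

def predict(model, text):
    # Tokenize a single description and return the index of the highest logit.
    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return int(model(**inputs).logits.argmax(dim=-1))

text = "punched ee in the face numerous times by person supported."
cause_id = predict(cause_model, text)        # map back to the class name using the
bodypart_id = predict(bodypart_model, text)  # label encoding from preprocessing
```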
- Ensure the packages in requirements.txt are installed
- First run EDA.ipynb – this contains the exploratory data analysis of the dataset; the processing is done here and the results are saved as CSV files, one for Cause - Hierarchy 1 and another for Body Part - Hierarchy 1.
- Then run cause_train.ipynb followed by bodypart_train.ipynb (or vice versa); this fine-tunes the bert-base-uncased model and saves the models locally.
- Run test.ipynb - this is where we load both models, predict the labels for test.csv, and save the results in a new test.csv file containing the predicted labels.
Sample input: "punched ee in the face numerous times by person supported., struck or injured by, head"
We get the output as:
```json
{
  "answer": [
    "struck or injured by",
    "head"
  ]
}
```
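For reference, a minimal sketch of how such a FastAPI endpoint could look; the /predict route and request schema are assumptions rather than the project's actual code:

```python
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
cause_model = AutoModelForSequenceClassification.from_pretrained("cause_model")
bodypart_model = AutoModelForSequenceClassification.from_pretrained("bodypart_model")

app = FastAPI()

class Query(BaseModel):
    description: str

def predict(model, text: str) -> int:
    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return int(model(**inputs).logits.argmax(dim=-1))

@app.post("/predict")
def classify(query: Query):
    # The real app maps these label ids back to the class names produced
    # during preprocessing before returning the answer.
    cause_id = predict(cause_model, query.description)
    bodypart_id = predict(bodypart_model, query.description)
    return {"answer": [cause_id, bodypart_id]}
```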
- Dockerized the FastAPI application.
- All files to dockerize the application are in the app folder
- Obtain the fine-tuned models by running bodypart_train.ipynb and cause_train.ipynb
- Then copy the model files into the bodypart_model and causes_model directories inside the app folder
- Finally, run `docker compose up --build` to build the Docker image