---
title: "German Credit Model Training Notebook"
output: html_document
---
# German Credit model training
This notebook walks through training a logistic regression model on the German Credit dataset. It produces a trained model artifact that will be loaded at model runtime, along with sample datasets in an MOC-compliant format.
## Library imports
First, let's import the necessary libraries.
```{r eval=FALSE}
library(tidymodels)
library(readr)
library(yardstick)
library(jsonlite)
library(purrr)
library(stringr)
library(xgboost)
library(dplyr)
```
MOC works best with newline-delimited JSON (one JSON object per line). Since there isn't native support for this format, we'll use the following helper to write JSON files from the data frames we pass in.
```{r}
# helper to write a data frame as newline-delimited JSON
# (note: this masks jsonlite::write_json)
write_json <- function(object, filename) {
  s <- toJSON(object)
  # drop the surrounding [ ] of the JSON array
  s <- substring(s, 2, nchar(s) - 1)
  # put each object on its own line
  s <- str_replace_all(s, "\\},", "\\}\n")
  write(s, filename)
}
```
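As a quick sanity check, here's an optional sketch (using a small made-up data frame) showing that each row ends up as its own JSON line:
```{r}
# optional sanity check: each row should become its own JSON line
tmp <- tempfile(fileext = ".json")
write_json(data.frame(a = 1:2, b = c("x", "y")), tmp)
readLines(tmp)
```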
## Preparing the data
Let's import the data. This specific version of the German Credit dataset has a `label` column that specifies whether someone defaulted on or paid off the loan.
```{r}
df <- readr::read_csv("german_credit_data.csv")
```
For classification models in R, we need the outcome variable to be a factor. Currently, `label` contains 0s and 1s, so we'll recode it.
```{r}
# recode the 0/1 label to descriptive strings and convert it to a factor
df <- df %>%
  mutate(label = factor(ifelse(label > 0, "Default", "Pay Off")))
```
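As an optional quick check, we can look at the class balance after recoding:
```{r}
# optional: check the class balance of the recoded label
df %>% count(label)
```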
Let's also set a random seed for reproducibility.
```{r}
set.seed(777)
```
## Preparing for preprocessing
At this point, we'll split the dataset into training and test sets. We'll also save these datasets as newline-delimited JSON files so we can run them on MOC.
```{r}
# train/test split
df_split <- initial_split(df, prop=0.8, strata=label)
df_baseline <- training(df_split)
df_sample <- testing(df_split)
glimpse(df_baseline)
```
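Since we stratified on `label`, here's an optional quick check that the class proportions are similar in both splits:
```{r}
# optional: verify the stratified split roughly preserved the class balance
df_baseline %>% count(label) %>% mutate(prop = n / sum(n))
df_sample %>% count(label) %>% mutate(prop = n / sum(n))
```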
Saving the datasets:
```{r}
# save the train and test data for later use
write_json(df_baseline, 'df_baseline.json')
write_json(df_sample, 'df_sample.json')
```
## Preprocessing and Feature Engineering
Now that the datasets are saved, let's get into preprocessing the data. We'll use a recipe inside a workflow to transform the data, which also lets us feed data into MOC seamlessly. Here are the steps in our recipe:
- First, we'll drop the `id` and `gender` columns, since using them as predictors would introduce bias. We'll still want the `gender` column later when we do bias monitoring.
- Next, we'll dummy-encode the nominal features.
- After that, we'll drop columns with zero variance. This specific dataset doesn't have any zero-variance features, but it's good practice to include the step anyway.
- Finally, we'll normalize all the predictor features to have a mean of 0 and a standard deviation of 1.
```{r}
gc_recipe <-
  # predict `label` using all other columns
  recipe(label ~ ., data = df_baseline) %>%
  # remove the id and gender columns from the predictors
  step_rm(id, gender) %>%
  # dummy-encode all categorical features
  step_dummy(all_nominal(), -all_outcomes()) %>%
  # drop columns with zero variance, i.e. columns containing only one value
  step_zv(all_predictors()) %>%
  # normalize all predictors to a mean of 0 and a standard deviation of 1
  step_normalize(all_predictors())

# check out a summary of the recipe
summary(gc_recipe)
```
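If you want to see what the preprocessed data will look like, an optional sketch is to prep the recipe on the training set and bake it:
```{r}
# optional: prep the recipe and inspect the preprocessed training data
gc_prepped <- prep(gc_recipe, training = df_baseline)
glimpse(bake(gc_prepped, new_data = NULL))
```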
## Model Training
Now that the data is ready, let's instantiate our model and create the workflow. We'll use a simple logistic regression model for our purposes, and the workflow will simply combine the recipe and the model.
```{r}
# specifying the model: a simple logistic regression using the glm engine
logreg <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# creating the workflow by combining the recipe and the model
logreg_wflow <-
  workflow() %>%
  add_model(logreg) %>%
  add_recipe(gc_recipe)
```
Let's go ahead and fit the model. This will be the trained model artifact that we'll want to save later.
```{r}
# training the model
logreg_fit <- fit(logreg_wflow, df_baseline)
```
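If you'd like to inspect the fitted coefficients, one option (a sketch, assuming a reasonably recent version of the workflows package) is to pull the underlying parsnip fit out of the workflow and tidy it:
```{r}
# optional: peek at the fitted glm coefficients
logreg_fit %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  head()
```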
Let's make predictions on both the training dataset and the testing dataset. We'll add these back into the original datasets, since the monitoring portion of MOC needs predictions to compare against the actual ground-truth labels.
```{r}
# predicting on baseline and sample data
train_preds <- predict(logreg_fit, df_baseline)
test_preds <- predict(logreg_fit, df_sample)
# binding predictions to original dataframes
df_baseline_scored <- bind_cols(train_preds, df_baseline)
df_sample_scored <- bind_cols(test_preds, df_sample)
glimpse(df_baseline_scored)
```
## Evaluating the Model
Now that predictions have been made, let's evaluate our model. We'll go with a standard set of metrics; picking the metrics that matter most for your own business use case is, of course, up to you.
```{r}
# evaluate outcomes
metrics <- metric_set(recall, precision, f_meas, accuracy, kap)
metrics(df_baseline_scored, truth=label, estimate=.pred_class)
metrics(df_sample_scored, truth=label, estimate=.pred_class)
```
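These metrics are all computed from hard class predictions. If you'd also like a threshold-independent view, one option (a sketch; it assumes `"Default"` is the first factor level, which is yardstick's default event level) is to predict class probabilities and compute ROC AUC on the test set:
```{r}
# optional: probability predictions and ROC AUC on the test set
test_probs <- predict(logreg_fit, df_sample, type = "prob")
bind_cols(test_probs, df_sample) %>%
  roc_auc(truth = label, .pred_Default)
```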
Not the best model, but the purpose of this notebook is just to produce the trained model artifacts to run on MOC, so we'll make do with this model.
## Saving the Artifacts
Let's go ahead and save all the necessary artifacts: our datasets with predictions (also with some columns renamed to be MOC-compliant - this is a crucial step), and our trained model artifact (the fit workflow).
```{r}
# rename columns to be MOC-compliant: label -> label_value, .pred_class -> score
df_baseline_scored <- df_baseline_scored %>%
  rename(label_value = label, score = .pred_class) %>%
  relocate(label_value, score)
df_sample_scored <- df_sample_scored %>%
  rename(label_value = label, score = .pred_class) %>%
  relocate(label_value, score)

# save the "scored" dataframes for later analysis
write_json(df_baseline_scored, "df_baseline_scored.json")
write_json(df_sample_scored, "df_sample_scored.json")

# persist the fit model (the trained workflow)
save(logreg_fit, file = "trained_model.RData")
```
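As an aside, `saveRDS()`/`readRDS()` is a common single-object alternative to `save()`/`load()` for persisting the workflow; a sketch (the `.rds` filename is just illustrative):
```{r eval=FALSE}
# alternative: persist the trained workflow as a single-object RDS file
saveRDS(logreg_fit, "trained_model.rds")
# model <- readRDS("trained_model.rds")
```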
And that's a wrap! The code below can be run on its own to test loading the trained model artifact and the data.
```{r}
# --------------------------------------------------------
# testing loading and predicting with the fit model
# run the code below without running the code above to test

# import the necessary libraries
library(tidymodels)
library(readr)
library(jsonlite)

# import the test data (flatten to prevent nesting)
data_in <- stream_in(file("df_sample.json"), flatten = TRUE)
test_data <- tibble(data_in)

# load the fit model
load("trained_model.RData")
# re-assign the model for clarity
model <- logreg_fit

# predict on the test data
predict(model, test_data)
```