---
title: "42-naivebayes-modeling"
output:
  pdf_document:
    toc: yes
    toc_depth: '3'
  html_notebook:
    toc: yes
    toc_depth: 3
    toc_float: yes
    number_sections: yes
  html_document:
    toc: yes
    toc_depth: '3'
    df_print: paged
---
In this notebook, we use Naive Bayes to perform modeling. Our general approach will be to use hyperparameter tuning via cross-validation to identify the best-performing hyperparameters. In another notebook, we will investigate the performance of the resulting model.
### Useful packages
```{r naive bayes specific modeling packages, results='hide'}
#load previous notebook data
source(knitr::purl("40-modeling.Rmd"))
fs::file_delete("40-modeling.R")
pacman::p_load(glmnet, tictoc, tidytext, naivebayes, discrim)
```
Here, we define the specs for the feature engineering, the model, the generalized workflow, and the parameters that we'll tune, with candidate values selected from a maximum entropy grid. For the Naive Bayes model, the general `usemodels` template code was not supported, but using the `naivebayes` package engine allowed for an easy 1:1 translation.
```{r naive bayes tidymodel specs}
nb_recipe <-
recipe(formula = particle_class ~ ., data = train_data) %>%
update_role(id, img_id, starts_with('filter'), hash, new_role='no_model') %>%
step_zv(all_predictors()) %>%
step_normalize(all_predictors(), -all_nominal())
nb_spec <-
naive_Bayes(smoothness = tune(), Laplace = tune() ) %>%
set_mode("classification") %>%
set_engine("naivebayes")
nb_workflow <-
workflow() %>%
add_recipe(nb_recipe) %>%
add_model(nb_spec)
nb_parameters <- parameters(nb_spec)
nb_grid <- grid_max_entropy(nb_parameters, size=20)
```
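Before tuning, we can take a quick peek at the grid itself; this is just an optional sanity check (using only the `nb_grid` object defined above) to confirm that the maximum entropy grid contains 20 sampled (smoothness, Laplace) pairs rather than a full factorial:
```{r peek at hyperparameter grid}
#quick look at the sampled hyperparameter candidates
nb_grid %>%
  arrange(smoothness) %>%
  head(10)
```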
# Hyperparameter tuning (model selection) via cross-validation
```{r perform hyperparameter tuning via xvalidation, results='hide'}
tic()
nb_tune <- nb_workflow %>%
tune_grid(resamples = cv_folds,
grid = nb_grid,
metrics = metric_set(accuracy, roc_auc, pr_auc, sens, yardstick::spec, ppv, npv, f_meas),
control = control_grid(verbose = TRUE))
toc()
```
## Cross-validation metric distributions
In this section, we're going to take a look at the individualized performance of the models, taking each fold into account. This will satisfy our academic curiosity in terms of machine learning and also provide some insight into the behavior of the models. We'll look more at the aggregated measures in a moment.
We'll first decompress the tuning metrics a bit to get them into a more friendly form for processing.
```{r arrange cross validation metrics}
#extract the cross validation metrics for the naive bayes model by fold (i.e., unsummarized)
nb_fold_metrics <- nb_tune %>%
dplyr::select(id, .metrics, .notes) %>%
unnest(.metrics)
head(nb_fold_metrics, 10)
```
Now, let's visualize this generalized performance over all the models.
```{r visualize fold performance distributions, fig.height=6, fig.width=12}
nb_fold_metrics %>%
mutate(facet_val = if_else(.metric== 'roc_auc' | .metric=='pr_auc' | .metric=='f_meas', 'Aggregate metrics', 'Confusion matrix metrics')) %>%
ggplot(aes(x=.metric, y=.estimate, fill=id)) +
geom_boxplot(outlier.colour = 'red', na.rm=TRUE) +
facet_wrap(facet='.metric', scales='free', nrow=2) +
labs(title='Distribution of performance by 20 model candidates',
subtitle='By fold and metric',
x='metric',
y='metric distribution') +
scale_x_discrete(labels=NULL)
```
Here we can see some interesting fold-specific results. There tended to be a wide distribution for Fold 2 and Fold 3 performance, while Folds 3 and 1 were consistently narrow.
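To put a rough number on this impression, here is a small optional sketch (using only the `nb_fold_metrics` tibble from above) that summarizes the spread of the estimates within each fold and metric:
```{r summarize metric spread by fold}
#standard deviation and IQR of the metric estimates within each fold/metric combination
nb_fold_metrics %>%
  group_by(id, .metric) %>%
  summarize(sd_estimate = sd(.estimate, na.rm = TRUE),
            iqr_estimate = IQR(.estimate, na.rm = TRUE),
            .groups = 'drop') %>%
  arrange(desc(sd_estimate)) %>%
  head(10)
```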
# Identifying the "best" model
Now, let's collect the metrics to see how the models did over all of the folds and all of the metrics in order to identify the best model from these candidates. Note that this table looks similar to the prior tibble; the main difference here is that the results are aggregated over the folds (hence the `mean` and `n` columns).
```{r aggregating hyperparameter tuning/cross validation metrics}
tune_metrics <- nb_tune %>%
collect_metrics()
head(tune_metrics, 5)
```
## Basic performance overview
Let's just look at the overall (fold-less) distribution of the metrics.
```{r overall performance}
tune_metrics %>%
ggplot(aes(x=.metric, y=mean)) +
geom_boxplot(aes(fill=.metric), outlier.shape=NA, na.rm=TRUE) +
geom_jitter(na.rm=TRUE) +
facet_wrap(facets='.metric', nrow=2, scales='free') +
theme(legend.position = 'none') +
labs(title='Distribution of mean cv performance by 20 model candidates',
subtitle='By metric',
x='metric',
y='mean cv metric') +
scale_x_discrete(labels=NULL)
```
Here, we can see a relatively tight distribution for ppv and rather large distributions for spec, sens, and npv. However, all of these have a very small scale on their axes, so this may be entirely normal.
One additional observation is that for accuracy and pr_auc, there are several very high-performing outliers. These will likely be our candidates for the "best" model, but the fact that they are outliers is an interesting area to explore.
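To see which configurations those high performers correspond to, here is a small sketch (using only the `tune_metrics` tibble above) listing the top few candidates for accuracy and pr_auc:
```{r top candidates for accuracy and pr auc}
#top 3 model configurations by mean cv value for the two metrics with high outliers
tune_metrics %>%
  filter(.metric %in% c('accuracy', 'pr_auc')) %>%
  group_by(.metric) %>%
  slice_max(mean, n = 3) %>%
  dplyr::select(.config, .metric, mean, n, std_err)
```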
## Making sense of the hyperparameters and their influence
Let's visualize the hyperparameter combinations and their performance so we can make some sense of them.
```{r fig.height=6}
tune_metrics %>%
mutate(.metric=fct_relevel(.metric, 'roc_auc', 'pr_auc', 'f_meas', 'accuracy', 'sens', 'spec')) %>%
ggplot(aes(x=Laplace, y=smoothness)) +
geom_point(aes(fill=mean), shape=22, size=6) +
scale_x_log10(guide=guide_axis(angle=45)) +
facet_wrap(ncol=4, facet='.metric') +
scale_fill_gradient2(low='red', mid='yellow', high='green', midpoint=0.5) +
labs(title='Mean performance of smoothness/Laplace hyperparameter combinations',
subtitle='By performance metric',
x='Laplace',
y='smoothness',
fill='mean cv value')
```
Here is a difficult-to-understand plot. The objective of this visualization is to begin to digest the relationship between the two hyperparameters and the performance on a given metric. Recall that hyperparameter tuning evaluates each candidate combination of hyperparameters from the grid. Each combination is shown as a square on a particular "subplot" for a metric of interest. There are 20 squares because there are 20 candidate models, and the arrangement of the squares is identical across subplots.
What is of interest here is the color of the squares: red indicates poor performance and green indicates good performance. Interestingly, all of the squares within a given subplot typically have roughly the same color regardless of the values of the two hyperparameters. This may suggest that these hyperparameters affect classification performance much less than the choice of model itself does.
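As a rough, hedged way to quantify that impression (again using only `tune_metrics`), we can correlate the mean cv value with each hyperparameter within each metric; correlations near zero would be consistent with the flat colors above:
```{r correlate hyperparameters with mean performance}
#within each metric, correlation of the mean cv value with each hyperparameter
tune_metrics %>%
  group_by(.metric) %>%
  summarize(cor_laplace = cor(Laplace, mean, use = 'complete.obs'),
            cor_smoothness = cor(smoothness, mean, use = 'complete.obs'),
            .groups = 'drop')
```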
## Looking at all of the metrics together to select a model
One possible way of evaluating a good model is to rank the models according to their performance across all of the metrics. This lets us step back from the raw values a bit. However, we can also look at the values themselves and investigate the relationship.
```{r overall performance assessment, fig.height=6}
#calculate mean metrics and rank
mdl_overall <- tune_metrics %>%
group_by(.metric) %>%
mutate(metr_rank=rank(-mean, ties.method='average')) %>% #-mean so that rank increases (so, worse) with decreasing metric
group_by(.config, .add=FALSE) %>%
mutate(mean_rank = mean(metr_rank)) %>% #add mean rank
mutate(mean_value = -mean(mean, na.rm=TRUE)) %>% #add mean value
pivot_longer(cols=c(mean_rank, mean_value), names_to = 'agg_perf_type', values_to='agg_perf') %>%
group_by(agg_perf_type) %>%
filter(.metric=='pr_auc') #just pick one set of values; all these aggregated values will be identical
#plot; note that there is manipulation of negatives for the directionality and absolute value
mdl_overall %>%
ggplot(aes(x=reorder_within(str_remove(.config, 'Preprocessor1_'), -agg_perf, agg_perf_type),
y=abs(agg_perf),
width=smoothness)) +
geom_col(aes(fill=-log(Laplace))) +
geom_label(aes(label=round(abs(agg_perf),3)), label.r=unit(0.0, "lines"), label.size=0, size=3) +
facet_wrap(~agg_perf_type, ncol=2, scales='free') +
scale_x_reordered() +
coord_flip() +
labs(title='General model performance over all metrics by mean overall rank',
subtitle='Bar appearance shows parameters (width=smoothness, color=Laplace)',
y='Mean over all metrics',
x='Model name')
```
Here, the thickness of each bar represents smoothness, while the color represents the Laplace parameter. Overall, we see that skinnier bars, i.e., those with a lower smoothness, perform better than those with a higher smoothness. Additionally, the color (Laplace) does not immediately appear to correlate with model performance.
## Selecting the best model
With this information in mind as well as more help from tidymodels, we can then select the "best" model. One way to do this is to simply choose according to some metric. We'll decide to use `pr_auc` here just because our training data is so imbalanced.
```{r get best hyperparameters from resampling}
eval_metric <- 'pr_auc'
#show best parameters in terms of pr_auc
nb_tune %>% show_best(eval_metric)
```
We find here that this is exactly in line with our previous assessment of overall model performance: Laplace does not seem to have much of an effect on the mean, while lower smoothness levels are critical.
```{r select best parameters}
#select best parameters
best_nb_params <- nb_tune %>%
select_best(eval_metric)
#show selected parameters
best_nb_params
```
We can see that Model 14 (best in overall rank and mean metric performance) predictably had the highest `pr_auc`.
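As a quick consistency check (a sketch reusing the `mdl_overall` tibble built earlier and the `best_nb_params` object above), we can confirm that the configuration chosen by `select_best()` is also the top-ranked candidate under both aggregate views; lower `agg_perf` is better in both groups:
```{r cross check selected model against overall ranking}
#best candidate under each aggregation (mean rank and negated mean value); lower is better
mdl_overall %>%
  slice_min(agg_perf, n = 1) %>%
  dplyr::select(.config, agg_perf_type, agg_perf)
#configuration selected via pr_auc
best_nb_params$.config
```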
# Training fit
Having identified the best hyperparameters, we can create the final fit on all of the training data:
```{r fit workflow on training data}
#finalize workflow with model hyperparameters
nb_final_wf <- nb_workflow %>%
finalize_workflow(best_nb_params)
nb_final_wf
#using final workflow, fit on training data
nb_final_fit <- nb_final_wf %>%
fit(data = train_data)
```
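As a quick optional sanity check before saving (not a substitute for the held-out evaluation in the next notebook), we can confirm that the finalized workflow produces class and probability predictions on the training data:
```{r sanity check training predictions}
#class and probability predictions from the finalized workflow on the training data
nb_train_preds <- predict(nb_final_fit, new_data = train_data) %>%
  bind_cols(predict(nb_final_fit, new_data = train_data, type = 'prob')) %>%
  bind_cols(train_data %>% dplyr::select(particle_class))
head(nb_train_preds)
```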
# Save markdown file
```{r save markdown}
#fs::file_copy('42-naivebayes-modeling.nb.html', './html_results/42-naivebayes-modeling.nb.html', overwrite=TRUE)
```
```{r save naive bayes model to box}
#move_model_info('nb', 'save', box_dir = path.expand('~/../Box/DSI_AncientArtifacts/'))
```