---
title: "Bayesian GLM Part4"
author: "Murray Logan"
date: today
date-format: "DD/MM/YYYY"
format:
  html:
    ## Format
    theme: [default, ../resources/ws-style.scss]
    css: ../resources/ws_style.css
    html-math-method: mathjax
    ## Table of contents
    toc: true
    toc-float: true
    ## Numbering
    number-sections: true
    number-depth: 3
    ## Layout
    page-layout: full
    fig-caption-location: "bottom"
    fig-align: "center"
    fig-width: 4
    fig-height: 4
    fig-dpi: 72
    tbl-cap-location: top
    ## Code
    code-fold: false
    code-tools: true
    code-summary: "Show the code"
    code-line-numbers: true
    code-block-border-left: "#ccc"
    code-copy: true
    highlight-style: atom-one
    ## Execution
    execute:
      echo: true
      cache: true
    ## Rendering
    embed-resources: true
crossref:
  fig-title: '**Figure**'
  fig-labels: arabic
  tbl-title: '**Table**'
  tbl-labels: arabic
engine: knitr
output_dir: "docs"
documentclass: article
fontsize: 12pt
mainfont: Arial
mathfont: LiberationMono
monofont: DejaVu Sans Mono
classoption: a4paper
bibliography: ../resources/references.bib
---
```{r}
#| label: setup
#| include: false
knitr::opts_chunk$set(cache.lazy = FALSE, tidy = "styler")
options(tinytex.engine = "xelatex")
```
# Preparations
Load the necessary libraries
```{r}
#| label: libraries
#| output: false
#| eval: true
#| warning: false
#| message: false
#| cache: false
library(tidyverse) #for data wrangling etc
library(rstanarm) #for fitting models in STAN
library(cmdstanr) #for cmdstan
library(brms) #for fitting models in STAN
library(coda) #for diagnostics
library(bayesplot) #for diagnostics
library(ggmcmc) #for MCMC diagnostics
library(DHARMa) #for residual diagnostics
library(rstan) #for interfacing with STAN
library(emmeans) #for marginal means etc
library(broom) #for tidying outputs
library(tidybayes) #for more tidying outputs
library(ggeffects) #for partial plots
library(broom.mixed) #for summarising models
library(bayestestR) #for ROPE
library(see) #for some plots
library(easystats) #for the easystats ecosystem
library(patchwork) #for multiple plots
library(modelsummary) #for data and model summaries
library(car) #for scatterplot matrices
library(ggridges) #for ridge plots
theme_set(theme_grey()) #put the default ggplot theme back
source("helperFunctions.R")
```
# Scenario
@Loyn-1987-1987 modelled the abundance of forest birds with six predictor
variables (patch area, distance to nearest patch, distance to nearest
larger patch, grazing intensity, altitude and years since the patch had
been isolated).
<!-- figure: honeyeater ({#fig-honeyeater width="165" height="240"}); image path missing -->
ABUND     DIST     LDIST    AREA     GRAZE    ALT     YR.ISOL
--------- -------- -------- -------- -------- ------- ---------
..        ..       ..       ..       ..       ..      ..

: Format of loyn.csv data file {#tbl-loyn .table-condensed}
------------- ------------------------------------------------------------------------------
**ABUND**     Abundance of forest birds in patch - response variable
**DIST**      Distance to nearest patch - predictor variable
**LDIST**     Distance to nearest larger patch - predictor variable
**AREA**      Size of the patch - predictor variable
**GRAZE**     Grazing intensity (1 to 5, representing light to heavy) - predictor variable
**ALT**       Altitude - predictor variable
**YR.ISOL**   Number of years since the patch was isolated - predictor variable
------------- ------------------------------------------------------------------------------

: Description of the variables in the loyn data file {#tbl-loyn1 .table-condensed}

```{r}
# We are not trying to do prediction; we are trying to see what impact these
# six predictors have in driving bird abundance patterns.
```
The aim of the analysis is to investigate the effects of a range of
predictors on the abundance of forest birds.
# Read in the data
```{r}
# We are dealing with bird counts, so a Poisson model is the natural first
# thought. However, each patch was surveyed with a different number of
# quadrats (depending on patch size), and the quadrat counts were AVERAGED
# within each patch. Poisson wants to deal with raw integer counts; averaged
# data are not integers, which disrupts the concept of a count. The data are
# not proportions either, so Binomial is out. We could use a Gaussian,
# entertain a Gamma distribution, try a t distribution, or try a log-normal.
# The original authors square-root transformed the response. That is a
# dangerous thing, because the back-transformation misbehaves: negative
# predictions back-transform to positives, and values below 1 back-transform
# to less than their original value (0.5 -> 0.25). We would rather find a
# distribution that matches the data than transform the data to match a
# certain distribution.
```
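A quick numeric sketch of the back-transformation problem, using made-up values (purely illustrative):

```{r}
#| label: sqrt-backtransform-sketch
# Made-up quadrat averages, purely illustrative
x <- c(0.25, 0.5, 2, 9)
mean(sqrt(x))^2  # the back-transformed mean (~1.97)...
mean(x)          # ...underestimates the mean of the raw data (2.9375)
(-0.5)^2         # a negative prediction on the sqrt scale back-transforms to a positive
```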
```{r}
# The predictors are continuous, so we have to centre them: the y-intercept
# at predictor values of zero (distance = 0, altitude = 0, etc.) does not
# make sense. How can the distance from one patch to another be 0? We will
# also scale them, which provides an advantage: it puts each predictor on
# the same scale, so their slopes become comparable and we can determine
# which predictor has the biggest impact. It is also important that the
# predictors are not correlated: correlated predictors compete in the model,
# grossly over- or under-estimating each other's effects, and the estimates
# can be wildly out.
```
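A minimal sketch of what `scale()` does (not evaluated here, since the data are only read in below):

```{r}
#| label: scaling-sketch
#| eval: false
# scale() centres a variable and divides by its standard deviation, so all
# predictors end up on a common scale and their slopes become comparable
x <- log(loyn$AREA)
all.equal(as.vector(scale(x)), (x - mean(x)) / sd(x))  # TRUE
```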
```{r}
# Assumptions to check before modelling:
# - the predictors should be symmetrical, ideally uniform or normal
# - the trends are assumed to be linear
# - the data should be equally varied (homogeneity of variance)
# - the predictors should not be correlated with one another
```
```{r readData, results='markdown', eval=TRUE}
loyn <- read_csv('../data/loyn.csv', trim_ws=TRUE)
```
# Exploratory data analysis
Model formula:
$$
\begin{aligned}
y_i &\sim{} \mathcal{N}(\mu_i, \sigma^2)\\
\log(\mu_i) &= \boldsymbol{\beta} \mathbf{X}_i\\
\beta_0 &\sim{} \mathcal{N}(3, 0.5)\\
\beta_{1-9} &\sim{} \mathcal{N}(0, 2.5)\\
\sigma &\sim{} \mathrm{Gamma}(2, 1)\\
&\quad\text{or}\\
\sigma &\sim{} t(3, 0, 2.5)
\end{aligned}
$$
where $\boldsymbol{\beta}$ is a vector of effects parameters and $\mathbf{X}$ is a model matrix representing the additive effects of
the scaled versions of distance to the nearest patch (ln), distance to the nearest larger patch (ln), patch area (ln), grazing
intensity, years of isolation and altitude on the abundance of forest birds.
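To see what the model matrix $\mathbf{X}$ looks like for this formula, we can build it directly (a sketch; `fGRAZE` is only created further below, so plain `factor(GRAZE)` is used here):

```{r}
#| label: model-matrix-sketch
# First few rows of the design matrix implied by the additive formula:
# an intercept column, the scaled (log) continuous predictors, and dummy
# columns for the grazing levels
model.matrix(~ scale(log(DIST)) + scale(log(LDIST)) + scale(log(AREA)) +
               factor(GRAZE) + scale(ALT) + scale(YR.ISOL),
             data = loyn) |>
  head()
```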
```{r}
#| label: scatterplot-matrix
scatterplotMatrix(~ABUND + DIST + LDIST + AREA + GRAZE + ALT + YR.ISOL,
                  data = loyn, diagonal = list(method = 'boxplot'))
# The boxplots on the diagonal help us assess normality.
# The top row shows the response (abundance) against each predictor and is
# for assessing linearity and homogeneity of variance.
# All the other panels are for assessing correlations between predictors.
# If a variable is not normally distributed, it is difficult to assess the
# other assumptions (linearity, homogeneity).
```
```{r}
#| label: log-transform
# A log transformation should normalise the skewed predictors: the log
# compresses large values and stretches small ones, pulling a right-skewed
# boxplot back towards symmetry. We only log-transform the skewed ones.
scatterplotMatrix(~ABUND + log(DIST) + log(LDIST) + log(AREA) + GRAZE + ALT + YR.ISOL,
                  data = loyn, diagonal = list(method = 'boxplot'))
# GRAZE was recorded as categories (1 to 5), so we should stop treating it
# as continuous.
# The variance looks okay: there is no drastic change in spread within each
# panel, so homogeneity looks alright.
# DIST and LDIST look correlated, which makes sense given how closely
# related the two measures are. We can take one of them out, take both out,
# or leave both in, but they should not be allowed to compete in the model.
```
```{r}
#| label: variance-inflation
vif(lm(ABUND ~ log(DIST) + log(LDIST) + log(AREA) + GRAZE + YR.ISOL + ALT,
       data = loyn))
# These are variance inflation factors (VIFs). A VIF reflects how much of a
# predictor is explained by the other predictors: VIF = 1/(1 - R^2), so a
# VIF of 2.5 corresponds to 60% shared variance and a VIF of 5 to 80%
# (strong correlation). A predictor with a VIF above about 3 is correlated
# with the rest of the pool and will compete in the model, so it should not
# stay in the same model. All of our VIFs are below 3, so we can leave all
# the predictors in.
```
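The thresholds quoted above come from the relationship $\mathrm{VIF} = 1/(1 - R^2)$; a one-line sketch of the arithmetic:

```{r}
#| label: vif-arithmetic
# R^2 of 0.6 (60% shared variance) and 0.8 (80%) translate to VIFs of 2.5 and 5
1 / (1 - c(0.6, 0.8))
```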
```{r}
# GRAZE is categorical, so convert it to a factor; level 1 becomes the
# reference group.
loyn <- loyn |> mutate(fGRAZE = factor(GRAZE))
# Although the boxplot for ABUND looks normal enough, a Gaussian model with
# an identity link predicted a negative number of birds for very small
# areas, which is nonsense. That is why we move to a log scale (here a
# Gaussian with a log link). If abundance were high enough everywhere, a
# plain Gaussian would be okay.
```
```{r}
#| label: priors
# Start by getting the typical number of birds per grazing level (on the
# log scale, to match the link). We particularly want the median and MAD
# for grazing level 1, because that is our starting point (the intercept).
loyn |> group_by(fGRAZE) |>
  summarise(Median = median(log(ABUND)),
            MAD = mad(log(ABUND)))
# From grazing level 1: prior mean ~3.4, prior sd ~0.1.
# For scaled predictors a N(0, 1) prior would usually be wide enough, but
# with a categorical predictor in the model it might need to be 2 instead
# of 1. We have to try and see.
# b0 ~ N(3.4, 0.1)
# b1 ~ N(0, 2)
# sigma ~ t(3, 0, 1) -> a (half-)t distribution is the usual choice for
# sigma, and its first value (the degrees of freedom) is conventionally 3.
# With priors for the intercept (b0), the slopes (b1) and sigma, we can code.
priors <- prior(normal(3.4, 0.1), class = 'Intercept') +
  prior(normal(0, 2), class = 'b') +
  prior(student_t(3, 0, 1), class = 'sigma')
# Note: you can scale log-transformed data, but you cannot log-transform
# scaled data (centred data contain negatives, which have no logarithm).
```
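Before overriding them, it can help to list the parameters (and default priors) that brms defines for this model; a sketch with `get_prior()`, spelling out the same formula that is assembled in the next section:

```{r}
#| label: default-priors
# Which parameter classes ('Intercept', 'b', 'sigma') exist for this model,
# and what brms would use by default if we supplied no priors
get_prior(ABUND ~ scale(log(DIST)) + scale(log(LDIST)) + scale(log(AREA)) +
            fGRAZE + scale(ALT) + scale(YR.ISOL),
          data = loyn, family = gaussian(link = 'log'))
```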
# Fit the model
```{r}
loyn.form <- bf(ABUND ~ scale(log(DIST)) +
                  scale(log(LDIST)) +
                  scale(log(AREA)) +
                  fGRAZE +
                  scale(ALT) +
                  scale(YR.ISOL),
                family = gaussian(link = 'log'))
loyn.brm2 <- brm(loyn.form,
                 data = loyn,
                 prior = priors,
                 sample_prior = 'only',  # sample from the priors alone (ignore the data)
                 iter = 5000,
                 warmup = 1000,
                 chains = 3, cores = 3,
                 thin = 5,
                 refresh = 0,
                 backend = "cmdstanr")
```
# MCMC sampling diagnostics
# Model validation
# Partial effects plots
```{r}
loyn.brm2 |>
  conditional_effects() |>
  plot(points = TRUE, ask = FALSE, plot = FALSE) |>
  wrap_plots() &  # '&' applies a modification (here a log y-axis) to every panel of the patchwork
  scale_y_log10()
```
```{r}
loyn.brm3 <- update(loyn.brm2, sample_prior = 'yes', refresh = 0)  # now also condition on the data
loyn.brm3 |>
  conditional_effects() |>
  plot(points = TRUE, ask = FALSE, plot = FALSE) |>
  wrap_plots() &
  scale_y_log10()
# Compare where the posterior sits relative to the prior for each parameter
loyn.brm3 |> SUYR_prior_and_posterior()
loyn.brm3 |>
  conditional_effects(effects = "AREA") |>
  plot(points = TRUE) |>
  _[[1]] + scale_y_log10()
```
# Model investigation
```{r}
loyn.brm3$fit |> stan_trace()  # with no pars specified, all parameters are shown
loyn.brm3$fit |> stan_trace(pars = "sigma")
loyn.brm3$fit |> get_variables()
loyn.brm3$fit |> get_variables() |> str_subset(pattern = '^b_.*')
# In the pattern, '^' anchors the match at the start of the string (a 'b_'
# in the middle will not match), '.' matches any single character and '*'
# means any number of the preceding one, so '^b_.*' matches anything that
# starts with 'b_'. '$' anchors at the end instead, so '^sigma$' matches
# sigma and sigma only, with nothing before or after.
```
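A tiny demonstration of those patterns on hypothetical names (purely illustrative):

```{r}
#| label: regex-sketch
vars <- c("b_Intercept", "b_scalelogAREA", "sigma", "lp__", "prior_sigma")
str_subset(vars, '^b_.*')          # anything starting with 'b_'
str_subset(vars, '^sigma$')        # exactly 'sigma' (so not 'prior_sigma')
str_subset(vars, '^b_.*|^sigma$')  # either of the above
```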
# Further analyses
```{r}
loyn.brm3$fit |> get_variables() |> str_subset(pattern = '^b_.*|^sigma$') -> par
par
loyn.brm3$fit |> stan_ac(pars = par)
loyn.brm3$fit |> stan_rhat()
loyn.brm3$fit |> stan_ess()
loyn.brm3 |> pp_check(type = 'dens_overlay', ndraws = 100)
# Check the residuals via DHARMa
loyn.resids <- make_brms_dharma_res(loyn.brm3, integerResponse = FALSE)
testUniformity(loyn.resids)
plotResiduals(loyn.resids)
# Plot the residuals against each predictor in turn
plot(as.vector(loyn.resids$scaledResiduals) ~ log(loyn$AREA))
plot(as.vector(loyn.resids$scaledResiduals) ~ log(loyn$DIST))
plot(as.vector(loyn.resids$scaledResiduals) ~ log(loyn$LDIST))
plot(as.vector(loyn.resids$scaledResiduals) ~ loyn$YR.ISOL)
plot(as.vector(loyn.resids$scaledResiduals) ~ loyn$ALT)
plot(as.vector(loyn.resids$scaledResiduals) ~ log(loyn$GRAZE))
testDispersion(loyn.resids)  # not strictly needed here: the Gaussian model estimates sigma, so dispersion is already modelled
```
```{r}
loyn.brm3 |>
  as_draws_df() |>
  dplyr::select(matches("b_.*|^sigma$")) |>
  exp() |>
  summarise_draws(
    median,
    HDInterval::hdi,
    rhat,
    ess_bulk,
    ess_tail
  )
# The draws are exponentiated, so the values are on the response scale:
# ~23 birds for grazing level 1 (the Intercept).
# The b values are called partial slopes, because each one is the effect of
# one predictor while the others are held constant.
# The scale(log(DIST)) median is the multiplier applied to abundance for
# every one (scaled) unit change in distance: because we exponentiated, the
# numbers are multipliers. Its credible interval spans 1, so the effect
# could be positive or negative, i.e. there is no evidence of an effect.
# Scaling also lets us assess which predictor has the biggest impact; check
# the table.
# 1/0.4623 = 2.163 -> grazing has the biggest impact (we take the inverse
# because it is a negative effect, i.e. a multiplier below 1).
```
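One way (a sketch) to rank the scaled predictors by impact without taking inverses is to compare absolute posterior medians on the log (link) scale, where positive and negative effects of equal strength have equal magnitude:

```{r}
#| label: effect-magnitudes
# Absolute slope medians on the log scale: the largest |median| has the
# biggest impact, regardless of the direction of the effect
loyn.brm3 |>
  as_draws_df() |>
  dplyr::select(starts_with("b_"), -b_Intercept) |>
  summarise_draws(median) |>
  arrange(desc(abs(median)))
```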
```{r}
loyn.brm4a <- update(loyn.brm3, . ~ scale(log(DIST)) * scale(log(LDIST)),
                     save_pars = save_pars(all = TRUE), refresh = 0)
loyn.brm4b <- update(loyn.brm3, . ~ scale(log(AREA)) * fGRAZE,
                     save_pars = save_pars(all = TRUE), refresh = 0)
loyn.brm4c <- update(loyn.brm3, . ~ scale(log(AREA)) * scale(YR.ISOL) * fGRAZE,
                     save_pars = save_pars(all = TRUE), refresh = 0)
loyn.brm4d <- update(loyn.brm3, . ~ scale(log(DIST)) * scale(ALT),
                     save_pars = save_pars(all = TRUE), refresh = 0)
loyn.brm4e <- update(loyn.brm3, . ~ 1,  # intercept-only NULL model, good for comparison
                     save_pars = save_pars(all = TRUE), refresh = 0)
# An information criterion takes the deviance (the unexplained part of the
# model), multiplies it by 2 and penalises it for the number of parameters;
# lower is better, implying less left unexplained.
# We will use the LOO (Leave One Out) information criterion: assess the
# model with one observation left out and check how well it predicts that
# observation, repeated across all observations.
loo(loyn.brm4a)  # elpd_loo of -217.2 * -2 = 434.3 (the looic in the table); by itself it means nothing at all, only comparisons do
loo(loyn.brm4e)  # p_loo is the effective number of parameters: the null model has ~1.2 whereas model a had ~4, and model a is not a good model
loo(loyn.brm4b)  # a lower information criterion than the null model
loo_compare(loo(loyn.brm4a), loo(loyn.brm4e))
loo_compare(loo(loyn.brm4b), loo(loyn.brm4e))  # models are ranked from best (top) to worst (bottom)
```
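All five candidates can also be ranked in one call (a sketch; differences in `elpd_diff` smaller than about twice their standard error are generally treated as indistinguishable):

```{r}
#| label: loo-all-models
# Rank all candidate models at once: the best model is listed first with
# elpd_diff = 0 and the rest are shown relative to it
loo_compare(loo(loyn.brm4a), loo(loyn.brm4b), loo(loyn.brm4c),
            loo(loyn.brm4d), loo(loyn.brm4e))
```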
```{r}
loyn.brm4b |>
  conditional_effects(effects = "AREA:fGRAZE") |>
  plot() |>
  _[[1]] +
  scale_y_log10() +
  scale_x_log10()
```
```{r}
loyn.brm4b |>
  as_draws_df() |>
  dplyr::select(matches("b_.*|^sigma$")) |>
  exp() |>
  summarise_draws(
    median,
    HDInterval::hdi,
    rhat,
    Pl1 = ~mean(.x < 1),
    Pg1 = ~mean(.x > 1)
  )
# Look at the interaction terms first: if they matter, the main effects
# cannot be interpreted in isolation.
# The magnitude of the effect of area depends on grazing, and the magnitude
# of the effect of grazing depends on area.
```
```{r}
loyn.list <- with(loyn, list(AREA = c(min(AREA), median(AREA), max(AREA))))
loyn.list
loyn.brm4b |>
  emmeans(~fGRAZE | AREA, at = loyn.list, type = "response") |>
  pairs()
# In small areas, even drastic grazing has little effect
```
# Summary figure
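This section was left empty in the notes; a minimal sketch of one option, predicting abundance over a grid of areas for each grazing level with `emmeans` on the interaction model (the grid, axis labels and theme are illustrative assumptions):

```{r}
#| label: summary-figure-sketch
#| eval: false
# Predicted abundance (response scale) across patch areas for each grazing
# level, with highest-density ribbons; grid and labels are illustrative
loyn.grid <- with(loyn, list(AREA = seq(min(AREA), max(AREA), length = 100)))
loyn.brm4b |>
  emmeans(~AREA | fGRAZE, at = loyn.grid, type = "response") |>
  as.data.frame() |>
  ggplot(aes(x = AREA, y = response, colour = fGRAZE, fill = fGRAZE)) +
  geom_ribbon(aes(ymin = lower.HPD, ymax = upper.HPD), alpha = 0.2, colour = NA) +
  geom_line() +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Patch area", y = "Bird abundance", colour = "Grazing", fill = "Grazing") +
  theme_classic()
```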
# References