cleanup website

ConnorDonegan · Sep 18, 2024 · 2a5568b · 2a5568b
1 parent 24551b4
commit 2a5568b
Show file tree

Hide file tree

Showing 144 changed files with 690 additions and 906 deletions.
diff --git a/NEWS.md b/NEWS.md
@@ -1,14 +1,20 @@
 # geostan 0.7.0
 
-Changes that increase sampling speed:
+V0.7.0 includes various adjustments to speed things up, and the DIC is now provided for model comparison (in addition to WAIC).
 
- - Samples for the pointwise log likelihood are no longer collected in the Stan models. Instead, they are calculated in R when needed for calculating waic or dic. 
+New features:
+
+ - Deviance information criteria (DIC). WAIC is considered to be more robust than DIC, but DIC remains common in spatial statistics because (unlike WAIC) it does not assume independent observations.
+
+Changes in the background that should improve the user experience:
+
+ - Samples for the pointwise log likelihood are no longer collected in the Stan models. Instead, they are calculated in R when needed for calculating waic or dic. This should improve sampling speed and lower memory and storage use.
  - The QR decomposition is now implemented (automatically) for models with covariates. (Note that this cannot be used for measurement error models.) The benefits of the QR decomposition in terms of sampling speed and efficiency are greatest when covariates are centered (which you can do using the `centerx` argument). 
  - A change recommended by Roger Bivand takes advantage of an improvement in `spdep`'s creation of neighbors objects (https://github.com/ConnorDonegan/geostan/issues/19). This will speed up the `shape2mat` function in some cases.
 
 Other changes:
 
- - Stan code for the CAR and SAR models has been simplified a bit. The change is reflected in the vignette on custom spatial models.
+ - Stan code for the CAR and SAR models has been simplified a bit. The changes are reflected in the vignette on custom spatial models. The output of `prep_car_data` and `prep_sar_data` have changed somewhat, but the user workflow is the same.
 
 # geostan 0.6.2
 

diff --git a/R/convenience-functions.R b/R/convenience-functions.R
@@ -44,7 +44,7 @@ n_eff <- function(n, rho) {
 #'
 #' @md
 #' 
-#' @description The approximate-profile likelihood estimator for the spatial autocorrelation parameter from a simultaneous autoregressive (SAR) model (Li et al. 2007). Note, the `APLE` approximation is not reliable when the number of observations is large.
+#' @description The approximate-profile likelihood estimator for the spatial autocorrelation parameter from a simultaneous autoregressive (SAR) model (Li et al. 2007). 
 #' 
 #' @param x Numeric vector of values, length `n`. This will be standardized internally with \code{scale(x)}.
 #' @param w An `n x n` row-standardized spatial connectivity matrix. See \link[geostan]{shape2mat}.
@@ -54,7 +54,7 @@ n_eff <- function(n, rho) {
 #'
 #' @seealso \link[geostan]{mc}, \link[geostan]{moran_plot}, \link[geostan]{lisa}, \link[geostan]{sim_sar}
 #'
-#' @details The \code{APLE} is an estimate of the spatial autocorrelation parameter one would obtain from fitting an intercept-only SAR model.
+#' @details The \code{APLE} is an estimate of the spatial autocorrelation parameter one would obtain from fitting an intercept-only SAR model. Note, the `APLE` approximation is not reliable when the number of observations is large.
 #'
 #' @source
 #'
@@ -100,7 +100,7 @@ aple <- function(x, w, digits = 3) {
 #'
 #' If \code{m = 1} a vector of the same length as \code{mu}, otherwise an \code{m x length(mu)} matrix with one sample in each row.
 #'
-#' @details Calls \code{MASS::mvrnorm} internally to draw from the multivariate normal distribution. The covariance matrix is specified following the simultaneous autoregressive (SAR) model. 
+#' @details Calls \code{MASS::mvrnorm} internally to draw from the multivariate normal distribution. The covariance matrix is specified following the simultaneous autoregressive (SAR, aka spatial error) model. 
 #'
 #' @seealso \code{\link[geostan]{aple}}, \code{\link[geostan]{mc}}, \code{\link[geostan]{moran_plot}}, \code{\link[geostan]{lisa}}, \code{\link[geostan]{shape2mat}}
 #' 
@@ -136,7 +136,7 @@ sim_sar <- function(m = 1, mu = rep(0, nrow(w)), w, rho, sigma = 1, ...) {
 #' 
 #' @param shape An object of class \code{sf} or another spatial object coercible to \code{sf} with \code{sf::st_as_sf} such as \code{SpatialPolygonsDataFrame}.
 #'
-#' @param ... Additional arguments passed to \code{\link[geostan]{residuals.geostan_fit}}. For binomial and Poisson models, this includes the option to view the outcome variable as a rate (the default) rather than a count; for \code{\link[geostan]{stan_car}} models with auto-Gaussian likelihood (`fit$family$family = "auto_gaussian"), the residuals will be detrended by default, but this can be changed using `detrend = FALSE`.
+#' @param ... Additional arguments passed to \code{\link[geostan]{residuals.geostan_fit}}. 
 #' 
 #' @return
 #'
@@ -146,9 +146,7 @@ sim_sar <- function(m = 1, mu = rep(0, nrow(w)), w, rho, sigma = 1, ...) {
 #' 
 #' When provided with a numeric vector, this function plots a histogram, Moran scatter plot, and map.
 #'
-#' When provided with a fitted `geostan` model, the function returns a point-interval plot of observed values against fitted values (mean and 95 percent credible interval), either a Moran scatter plot of residuals or a histogram of Moran coefficient values calculated from the joint posterior distribution of the residuals, and a map of the mean posterior residuals (means of the marginal distributions). 
-#'
-#' When `y` is a fitted CAR or SAR model with `family = auto_gaussian()`, the fitted values will include implicit spatial trend term, i.e. the call to \link[geostan]{fitted.geostan_fit} will use the default `trend = TRUE` and the call to \link[geostan]{residuals.geostan_fit} will use the default `detrend = TRUE`. (See \link[geostan]{stan_car} or \link[geostan]{stan_sar} for additional details on their implicit spatial trend components.) 
+#' When provided with a fitted `geostan` model, the function returns a point-interval plot of observed values against fitted values (mean and 95 percent credible interval), a Moran scatter plot for the model residuals, and a map of the mean posterior residuals (means of the marginal distributions). If if `mc_style = 'hist'`, the Moran scatter plot is replaced by a histogram of Moran coefficient values calculated from the joint posterior distribution of the residuals.
 #'
 #' @seealso \code{\link[geostan]{me_diag}}, \code{\link[geostan]{mc}}, \code{\link[geostan]{moran_plot}}, \code{\link[geostan]{aple}}
 #' 
@@ -164,7 +162,8 @@ sim_sar <- function(m = 1, mu = rep(0, nrow(w)), w, rho, sigma = 1, ...) {
 #' \donttest{
 #' fit <- stan_glm(log(rate.male) ~ log(income),
 #'                 data = georgia,
-#'                 chains = 2, iter = 800) # for speed only
+#'                 centerx = TRUE,
+#'                 chains = 1, iter = 1e3) # for speed only
 #' sp_diag(fit, georgia)
 #' }
 #' @importFrom stats sd
@@ -403,7 +402,7 @@ me_diag <- function(fit,
                     size = 0.25,
                     index = 0,
                     style = c("W", "B"),
-                    w = shape2mat(shape, match.arg(style)),
+                    w = shape2mat(shape, match.arg(style), quiet = TRUE),
                     binwidth = function(x) 0.5 * sd(x)
                     ) {
     stopifnot(length(varname) == 1)    
@@ -599,7 +598,7 @@ make_EV <- function(C, nsa = FALSE, threshold = 0.2, values = FALSE) {
 #'
 #' Haining and Li (Ch. 4) provide a helpful discussion of spatial connectivity matrices (Ch. 4).
 #'
-#' The space-time connectivity matrix can be used for eigenvector space-time filtering (\code{\link[geostan]{stan_esf}}. The `lagged' space-time structure connects each observation to its own past (one period lagged) value and the past value of its neighbors. The `contemporaneous' specification links each observation to its neighbors and to its own in situ past (one period lagged) value (Griffith 2012, p. 23).
+#' The space-time connectivity matrix can be used for eigenvector space-time filtering (\code{\link[geostan]{stan_esf}}. The 'lagged' space-time structure connects each observation to its own past (one period lagged) value and the past value of its neighbors. The 'contemporaneous' specification links each observation to its neighbors and to its own in situ past (one period lagged) value (Griffith 2012, p. 23).
 #' 
 #' @source
 #'

diff --git a/R/geary.R b/R/geary.R
@@ -18,7 +18,7 @@ GRfun <- function(i, var, weights) matrix(weights[i,], nrow=1) %*% ((var[i] - va
 #' 
 #' Local Geary's C is found in the numerator of the Geary Ratio (GR). For the \eqn{i^{th}} observation, the local Geary statistic is
 #' \deqn{C_i = \sum_j w_{i,j} * (x_i - x_j)^2}
-#' Hence, local Geary values will be largest for those observations that are most unlike their neighboring values. If a binary connectivity matrix is used (rather than row-standardized), then having many neighbors will also increase the value of the local Geary statistic. For most purposes, the row-standardized spatial weights matrix may be the more appropriate choice.
+#' Hence, local Geary values will be largest for those observations that are most unlike their neighboring values. If a binary connectivity matrix is used (rather than row-standardized), then having many neighbors can also increase the value of the local Geary statistic. For most purposes, the row-standardized spatial weights matrix may be the more appropriate choice.
 #'
 #' @return The function returns a vector of numeric values, each value being a local indicator of spatial association (or dissimilarity), ordered as `x`.
 #'
@@ -33,10 +33,10 @@ GRfun <- function(i, var, weights) matrix(weights[i,], nrow=1) %*% ((var[i] - va
 #' data(georgia)
 #' x <- log(georgia$income)
 #' w <- shape2mat(georgia, "W")
-#' lisd <- lg(x, w)
-#' hist(lisd)
+#' lisa <- lg(x, w)
+#' hist(lisa)
 #' ggplot(georgia) +
-#'   geom_sf(aes(fill = lisd)) +
+#'   geom_sf(aes(fill = lisa)) +
 #' scale_fill_gradient(high = "navy",  
 #'                    low = "white") 
 #'  ## or try: scale_fill_viridis() 
@@ -87,11 +87,14 @@ lg <- function(x, w, digits = 3, scale = TRUE, na.rm = FALSE, warn = TRUE) {
 #' autocorrelation is represented by a GR of 1. Negative
 #' autocorrelation pushes the GR above 1, towards 2.  \deqn{GR =
 #' \frac{n-1}{2K} \frac{M}{D}} \deqn{M = \sum_i \sum_j w_{i,j} (x_i -
-#' x_j)^2 } \deqn{D = \sum_i (x_i - \overline{x})^2 } Observations
-#' with no neighbors are removed before calculating the GR. The
-#' alternative is for those observations to contribute zero to the
+#' x_j)^2 } \deqn{D = \sum_i (x_i - \overline{x})^2 }
+#'
+#' Observations
+#' with no neighbors are removed before calculating the GR. (The
+#' alternative would be for those observations to contribute zero to the
 #' numerator---but zero is not a neutral value, it represents strong
-#' positive autocorrelation.
+#' positive autocorrelation.)
+#' 
 #' @source
 #' 
 #' Chun, Yongwan, and Daniel A. Griffith. Spatial Statistics and Geostatistics: Theory and Applications for Geographic Information Science and Technology. Sage, 2013.

diff --git a/R/geostan_fit-methods.R b/R/geostan_fit-methods.R
@@ -156,29 +156,30 @@ plot.geostan_fit <- function(x,
 #' @examples
 #' \donttest{
 #' data(georgia)
-#' A <- shape2mat(georgia, "B")
+#' C <- shape2mat(georgia, "B")
 #' 
 #' fit <- stan_esf(deaths.male ~ offset(log(pop.at.risk.male)),
-#'                 C = A,
+#'                 C = C,
+#'                 re = ~ GEOID,
 #'                 data = georgia,
 #'                 family = poisson(),
 #'                 chains = 1, iter = 600) # for speed only
 #'
 #' 
 #' # Residuals
 #' r <- resid(fit)
-#' moran_plot(r$mean, A)
 #' head(r)
+#' moran_plot(r$mean, C)
 #' 
 #' # Fitted values
 #' f <- fitted(fit)
-#'
-#' # Fitted values, unstandardized
-#' f <- fitted(fit, rates = FALSE)
 #' head(f)
+#' 
+#' f2 <- fitted(fit, rates = FALSE)
+#' head(f2)
 #'
 #' # Spatial trend
-#' esf <- spatial(fit)
+#' esf  <- spatial(fit)
 #' head(esf)
 #' }
 #' @export

diff --git a/R/information-criteria.R b/R/information-criteria.R
@@ -1,31 +1,41 @@
 
 #' Model comparison
 #'
-#' @description Deviance Information Criteria (DIC) and Widely Application Information Criteria (WAIC) for model comparison
+#' @description Deviance Information Criteria (DIC) and Widely Application Information Criteria (WAIC) for model comparison.
 #' 
 #' @param object A fitted \code{geostan} model
 #' 
 #' @param pointwise Logical (defaults to `FALSE`), should a vector of values for each observation be returned?
 #' 
 #' @param digits Round results to this many digits.
 #' 
-#' @return A vector of length 3 with \code{WAIC}, a rough measure of the effective number of parameters estimated by the model \code{Eff_pars}, and log predictive density \code{Lpd}. If \code{pointwise = TRUE}, results are returned in a \code{data.frame}.
+#' @return
+#'
+#' WAIC returns a vector of length 3 with the \code{WAIC} value, a penalty term which measures the effective number of parameters estimated by the model \code{Eff_pars}, and log predictive density \code{Lpd}. If \code{pointwise = TRUE}, results are returned in a \code{data.frame}.
+#'
+#' DIC returns a vector of length 2: the DIC value and the penalty term (which is part of the DIC calculation).
 #'
 #' @details
 #' 
 #' WAIC (widely applicable information criteria) and DIC (deviance information criteria) are used for model comparison. They are based on theories of out-of-sample predictive accuracy. The DIC is implemented with penalty term defined as 1/2 times the posterior variance of the deviance (Spiegelhatler et al. 2014).
 #'
-#' The limitations of these methods include that DIC is less robust than WAIC and that WAIC is not strictly valid for autocorrelated data (viz. geostan's spatial models). 
+#' The limitations of these methods include that DIC is less robust than WAIC and that WAIC is not strictly valid for autocorrelated data (viz. geostan's spatial models).
+#'
+#' For both DIC and WAIC, lower values indicate better models. 
 #' 
 #' @examples
-#' 
 #' data(georgia)
 #' 
-#' fit <- stan_glm(log(rate.male) ~ 1, data = georgia, iter=500)
-#'
+#' fit <- stan_glm(log(rate.male) ~ 1, data = georgia,
+#'                 iter=600, chains = 2, quiet = TRUE)
+#' fit2 <- stan_glm(log(rate.male) ~ log(income), data = georgia,
+#'                 centerx = TRUE, iter=600, chains = 2, quiet = TRUE)
+#' 
 #' dic(fit)
-#' waic(fit)
+#' dic(fit2)
 #' 
+#' waic(fit)
+#' waic(fit2)
 #' @source
 #'
 #' D. Spiegelhatler, N. G. Best, B. P. Carlin and G. Linde (2014) The Deviance Information Criterion: 12 Years on. J. Royal Statistical Society Series B: Stat Methodology. 76(3): 485-493.

diff --git a/R/moran.R b/R/moran.R
@@ -18,7 +18,7 @@
 #' \deqn{MC = \frac{n}{K}\frac{\sum_i \sum_j w_{ij} (y_i - \overline{y})(y_j - \overline{y})}{\sum_i (y_i - \overline{y})^2}}
 #' where \eqn{n} is the number of observations and \eqn{K} is the sum of all values in the spatial connectivity matrix \eqn{W}, i.e., the sum of all row-sums: \eqn{K = \sum_i \sum_j w_{ij}}.
 #'
-#' If any observations with no neighbors are found (i.e. \code{any(Matrix::rowSums(w) == 0)}) they will be dropped automatically and a message will print stating how many were dropped. The alternative is for those observations to have a spatial lage of zero---but zero is not a neutral value, see the Moran scatter plot. 
+#' If any observations with no neighbors are found (i.e. \code{any(Matrix::rowSums(w) == 0)}) they will be dropped automatically and a message will print stating how many were dropped. (The alternative would be for those observations to have a spatial lage of zero, but zero is not a neutral value.)
 #'
 #' @seealso \link[geostan]{moran_plot}, \link[geostan]{lisa}, \link[geostan]{aple}, \link[geostan]{gr}, \link[geostan]{lg}
 #' 
@@ -232,7 +232,7 @@ lisa <- function(x, w, type = TRUE, scale = TRUE, digits = 3) {
 #' @examples
 #' data(georgia)
 #' C <- shape2mat(georgia)
-#' X <- model.matrix(~ ICE + college, georgia)
+#' X <- model.matrix(~ college, georgia)
 #' expected_mc(X, C)
 expected_mc <- function(X, C) {
     C <- as.matrix(C)

diff --git a/R/priors.R b/R/priors.R
@@ -16,23 +16,24 @@
 #' 
 #' @examples
 #' \donttest{
-#' # std. normal priors to k=2 covariates
+#' # normal priors for k=2 covariates
 #' data(georgia)
 #' prior <- list()
 #' k <- 2
 #' prior$beta <- normal(location = rep(0, times = k),
-#'                      scale = rep(1, times = k))
+#'                      scale = rep(2, times = k))
 #' prior$intercept <- normal(-5, 3)
 #' print(prior)
-#' fit <- stan_glm(deaths.male ~ offset(log(pop.at.risk.male)) + ICE + college,
+#' fit <- stan_glm(deaths.male ~ offset(log(pop.at.risk.male)) + log(income) + college,
 #'                 re = ~ GEOID,
+#'                 centerx = TRUE,
 #'                 data = georgia,
 #'                 family = poisson(),
 #'                 prior = prior,
-#'                 prior_only = TRUE,
 #'                 chains = 2, iter = 600) # for speed only
 #' plot(fit)
 #'
+#' # setting (hyper-) priors in ME models
 #' se <- data.frame(insurance = georgia$insurance.se)
 #' prior <- list()
 #' prior$df <- gamma2(3, 0.2)
@@ -42,8 +43,8 @@
 #' ME <- prep_me_data(se = se, prior = prior)
 #' fit <- stan_glm(log(rate.male) ~ insurance, 
 #'                 data = georgia,
+#'                 centerx = TRUE,
 #'                 ME = ME,
-#'                 prior_only = TRUE,
 #'                chains = 2, iter = 600) # for speed only
 #' }
 #' @name priors

diff --git a/R/stan_car.R b/R/stan_car.R
@@ -177,8 +177,6 @@
 #' 
 #' Cressie, Noel and Wikle, Christopher (2011). *Statistics for Spatio-Temporal Data*. Wiley.
 #'
-#' Donegan, Connor and Chun, Yongwan and Griffith, Daniel A. (2021). Modeling community health with areal data: Bayesian inference with survey standard errors and spatial structure. *Int. J. Env. Res. and Public Health* 18 (13): 6856. DOI: 10.3390/ijerph18136856 Data and code: \url{https://github.com/ConnorDonegan/survey-HBM}.
-#'
 #' Donegan, Connor (2021). Building spatial conditional autoregressive (CAR) models in the Stan programming language. *OSF Preprints*. \doi{10.31219/osf.io/3ey65}.
 #' 
 #' Haining, Robert and Li, Guangquan (2020). *Modelling Spatial and Spatial-Temporal Data: A Bayesian Approach*. CRC Press.

diff --git a/R/stan_esf.R b/R/stan_esf.R
@@ -130,8 +130,6 @@
 #' 
 #' Donegan, C., Y. Chun and A. E. Hughes (2020). Bayesian estimation of spatial filters with Moran’s Eigenvectors and hierarchical shrinkage priors. *Spatial Statistics*. \doi{10.1016/j.spasta.2020.100450} (open access: \doi{10.31219/osf.io/fah3z}).
 #'
-#' Donegan, Connor (2021). Building spatial conditional autoregressive (CAR) models in the Stan programming language. *OSF Preprints*. \doi{10.31219/osf.io/3ey65}.
-#'
 #' Griffith, Daniel A., and P. R. Peres-Neto (2006). Spatial modeling in ecology: the flexibility of eigenfunction spatial analyses. *Ecology* 87(10), 2603-2613.
 #' 
 #' Griffith, D., and Y. Chun (2014). Spatial autocorrelation and spatial filtering, Handbook of Regional Science. Fischer, MM and Nijkamp, P. eds.
@@ -144,32 +142,28 @@
 #' \donttest{
 #' data(sentencing)
 #' # spatial weights matrix with binary coding scheme
-#' C <- shape2mat(sentencing, style = "B")
+#' C <- shape2mat(sentencing, style = "B", quiet = TRUE)
 #'
-#' # log-expected number of sentences
-#' ## expected counts are based on county racial composition and mean sentencing rates
+#' # expected number of sentences
 #' log_e <- log(sentencing$expected_sents)
 #'
 #' # fit spatial Poisson model with ESF + unstructured 'random effects'
 #' fit.esf <- stan_esf(sents ~ offset(log_e),
 #'                    re = ~ name,
 #'                    family = poisson(),
 #'                    data = sentencing,
-#'                    C = C,
+#'                    C = C, 
 #'                    chains = 2, iter = 800) # for speed only
 #' 
 #' # spatial diagnostics 
 #' sp_diag(fit.esf, sentencing)
-#' plot(fit.esf)
 #' 
 #' # plot marginal posterior distributions of beta_ev (eigenvector coefficients)
 #' plot(fit.esf, pars = "beta_ev")
 #'
-#' # plot the marginal posterior distributions of the spatial filter 
-#' plot(fit.esf, pars = "esf")
-#'
-#' # calculate log-standardized incidence ratios 
-#  # (observed/exected counts)
+#' # calculate log-standardized incidence ratios (SIR)
+#' #  # SIR = observed/exected number of cases
+#' # in this case, prison sentences
 #' library(ggplot2)
 #' library(sf)
 #' f <- fitted(fit.esf, rates = FALSE)$mean