Note: An updated version of NNDM with enhanced capabilities has been
included in the CAST
package (available on CRAN).
The goal of the NNDM
package is to provide tools to perform Nearest
Neighbour Distance Matching (NNDM) Leave-One-Out (LOO) Cross-Validation
(CV) to estimate map accuracy for a given prediction task. In NNDM LOO
CV, the nearest neighbour distance distribution function between the
test and training data during the LOO CV process is matched to the
nearest neighbour distance distribution function between the target
prediction and training points, for those distances in which
autocorrelation is present. NNDM CV can be used for various prediction
areas (geographic interpolation/extrapolation), sampling patterns
(regular/random/clustered), and landscape autocorrelation ranges. The
package also includes tools to compute buffered-LOO (bLOO) CV.
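To make the matching idea concrete, here is an illustrative base-R sketch (not part of the NNDM package; `nn_dist_fun` is a hypothetical helper) of the empirical nearest neighbour distance distribution function G(r): for each point in one set, take the distance to its nearest neighbour in the other set, and evaluate the empirical CDF of those distances.

```r
# Sketch only: empirical nearest neighbour distance distribution function.
# For each point in `from`, distance to its nearest neighbour in `to`;
# G(r) is the empirical CDF of these distances.
nn_dist_fun <- function(from, to) {
  # from, to: two-column matrices of point coordinates
  d <- as.matrix(dist(rbind(from, to)))              # all pairwise distances
  n <- nrow(from)
  cross <- d[seq_len(n), -seq_len(n), drop = FALSE]  # from -> to block
  nn <- apply(cross, 1, min)                         # nearest neighbour distances
  ecdf(nn)                                           # empirical CDF
}

set.seed(1)
pts  <- cbind(runif(50), runif(50))    # hypothetical sample locations
grid <- cbind(runif(200), runif(200))  # hypothetical prediction locations
G <- nn_dist_fun(grid, pts)
G(0.1)  # proportion of prediction points within 0.1 of a sample
```

NNDM LOO CV aims to make this function, computed between test and training data during CV, resemble the one computed between prediction and training points.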
Reference:
C. Milà, J. Mateu, E. Pebesma, and H. Meyer, ‘Nearest Neighbour Distance Matching Leave-One-Out Cross-Validation for map validation’, Methods in Ecology and Evolution (2022).
You can install the development version from GitHub with:
library("devtools")
devtools::install_github("carlesmila/NNDM")
In this example we will use LOO, bLOO and NNDM LOO CV for map validation
in a spatial interpolation task to illustrate the typical use of the
package. We will use the ranger
implementation of the Random Forest
(RF) algorithm wrapped in the caret
package, but the CV indices
generated by NNDM
can also be used with other machine learning
packages.
We will use the meuse
dataset included in the sp
package, which
includes samples of topsoil zinc concentrations in a flood plain of the
river Meuse, in the Netherlands. More information on the dataset can be
obtained by using ?meuse.
library("NNDM")
library("caret")
library("sp")
library("sf")
library("knitr")
library("gstat")
library("gridExtra")
library("ggplot2")
# Sample data
data("meuse")
meuse <- st_as_sf(meuse, coords = c("x", "y"), crs = 28992, remove = F)
# AOI polygon
data("meuse.area")
meuse.area <- SpatialPolygons(list(Polygons(list(Polygon(meuse.area)), "area")))
meuse.area <- st_as_sf(meuse.area)
st_crs(meuse.area) <- 28992
# Prediction grid
data("meuse.grid")
meuse.grid <- st_as_sf(meuse.grid, coords = c("x", "y"), crs = 28992, remove = F)
We start by fitting an RF model with the selected predictors (no
hyperparameter tuning is done in this example) and validating it with
LOO CV, which is readily available in caret.
# Fit model
trainControl_LOO <- trainControl(method = "LOOCV", savePredictions = T)
paramGrid <- data.frame(mtry = 2, min.node.size = 5, splitrule = "variance")
mod_LOO <- train(zinc ~ x + y + dist + ffreq + soil,
                 method = "ranger",
                 trControl = trainControl_LOO,
                 tuneGrid = paramGrid,
                 data = meuse,
                 seed = 12345)
Both bLOO and NNDM LOO CV require an autocorrelation range parameter to be estimated. There are many proposals for estimating the radius of bLOO/NNDM LOO CV; the most common are the residual and outcome autocorrelation ranges (see our article for more details). Here we estimate the residual autocorrelation range:
# Estimate variogram on the residual and return range
meuse$res <- meuse$zinc - predict(mod_LOO, newdata=meuse)
empvar <- variogram(res~1, data = meuse)
fitvar <- fit.variogram(empvar, vgm(model="Sph", nugget = T))
plot(empvar, fitvar, cutoff=1500, main = "Residual semi-variogram estimation")
(resrange <- fitvar$range[2]) # Residual autocorrelation range in m
#> [1] 842.0515
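As a reminder of what the fitted range means, here is a small sketch (using the standard spherical variogram formula; the nugget and partial sill values below are hypothetical, not the fitted ones): the semivariance grows with distance and plateaus at the sill once the range is reached.

```r
# Sketch only: standard spherical variogram model (for h > 0).
sph <- function(h, nugget, psill, range) {
  ifelse(h < range,
         nugget + psill * (1.5 * h / range - 0.5 * (h / range)^3),
         nugget + psill)  # beyond the range, semivariance stays at the sill
}

# At the range, the model reaches nugget + partial sill (hypothetical values):
sph(842.05, nugget = 2000, psill = 30000, range = 842.05)
```

Beyond the range, pairs of observations are treated as spatially uncorrelated, which is why this distance is used as the bLOO/NNDM radius.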
With the estimated range, we can now compute the bLOO CV indices using
the bLOO function. Printing the object, we observe that with this
radius, the average training size of each bLOO iteration is roughly
110 points. We can also plot the object for a given observation and
explore which observations are used for training, which for testing, and
which are excluded. We then use these indices in a caret model to
perform bLOO CV.
# Compute bLOO indices
(bLOO_indices <- bLOO(meuse, resrange, min_train = 0.5))
#> bLOO object
#> Total number of points: 155
#> Mean number of training points: 110.01
#> Minimum number of training points: 92
#> Mean buffer radius: 842.05
# Plot for one CV iteration
plot(bLOO_indices, 53) +
theme(axis.text = element_blank(), axis.ticks = element_blank())
# Evaluate RF model using bLOO CV
trainControl_bLOO <- trainControl(method = "cv", savePredictions = T,
                                  index = bLOO_indices$indx_train,
                                  indexOut = bLOO_indices$indx_test)
mod_bLOO <- train(zinc ~ x + y + dist + ffreq + soil,
                  method = "ranger",
                  trControl = trainControl_bLOO,
                  tuneGrid = paramGrid,
                  data = meuse,
                  seed = 12345)
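The fold construction behind bLOO can be sketched in a few lines of base R (this is an illustration, not the package implementation; `bloo_folds` is a hypothetical helper): for each held-out point, every observation within the buffer radius is excluded, and the remaining points form the training set.

```r
# Sketch only: buffered-LOO fold construction.
bloo_folds <- function(coords, radius) {
  d <- as.matrix(dist(coords))  # pairwise distances between observations
  lapply(seq_len(nrow(coords)), function(i) {
    list(test  = i,
         train = which(d[i, ] > radius))  # keep only points outside the buffer
  })
}

set.seed(1)
coords <- cbind(runif(20), runif(20))  # hypothetical observation locations
folds <- bloo_folds(coords, radius = 0.3)
folds[[1]]$train  # training indices when point 1 is held out
```

Note that the test point itself is always excluded from its training set (its distance to itself is 0), and that the package's min_train argument additionally caps how many points may be dropped.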
Similarly, we compute NNDM LOO CV indices using the nndm function.
When plotting the object, the estimated nearest neighbour distance
distribution function during CV can be compared with that of the
prediction task to check the matching.
# Compute NNDM indices
(NNDM_indices <- nndm(meuse, meuse.grid, resrange, min_train = 0.5))
#> nndm object
#> Total number of points: 155
#> Mean number of training points: 153.88
#> Minimum number of training points: 150
# Plot NNDM functions
plot(NNDM_indices)
# Evaluate RF model using NNDM LOO CV
trainControl_NNDM <- trainControl(method = "cv", savePredictions = T,
                                  index = NNDM_indices$indx_train,
                                  indexOut = NNDM_indices$indx_test)
mod_NNDM <- train(zinc ~ x + y + dist + ffreq + soil,
                  method = "ranger",
                  trControl = trainControl_NNDM,
                  tuneGrid = paramGrid,
                  data = meuse,
                  seed = 12345)
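Why does plain LOO CV need this correction at all? A short base-R experiment (an illustration under assumed clustered sampling, not package code) shows the mismatch nndm() is designed to remove: under clustered sampling, the nearest neighbour distances seen during LOO CV are much shorter than those the model will face at prediction time.

```r
# Sketch only: LOO CV nearest neighbour distances (Gj) vs
# prediction-to-sample nearest neighbour distances (Gij).
set.seed(1)
samples <- cbind(runif(40, 0, 0.3), runif(40, 0, 0.3))   # clustered sampling
grid    <- expand.grid(x = seq(0, 1, 0.05), y = seq(0, 1, 0.05))

d_ss <- as.matrix(dist(samples))
diag(d_ss) <- Inf                 # exclude self-distances
Gj <- apply(d_ss, 1, min)         # NN distances during LOO CV

d_ps <- as.matrix(dist(rbind(as.matrix(grid), samples)))
np <- nrow(grid)
Gij <- apply(d_ps[seq_len(np), -seq_len(np)], 1, min)  # prediction -> sample

c(mean(Gj), mean(Gij))  # LOO distances are systematically shorter
```

NNDM LOO CV excludes training points around each test point until the CV distance distribution matches the prediction one, for distances within the autocorrelation range.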
We compare the different CV results, computing the statistics manually
for bLOO and NNDM LOO CV (the results caret aggregates for customised
folds are not appropriate here):
# Compute and return results
stats_LOO <- data.frame(validation = "LOO CV",
                        RMSE = mod_LOO$results$RMSE,
                        R2 = mod_LOO$results$Rsquared)
stats_bLOO <- data.frame(validation = "bLOO CV",
                         RMSE = with(mod_bLOO$pred, sqrt(mean((obs - pred)^2))),
                         R2 = with(mod_bLOO$pred, cor(obs, pred)^2))
stats_NNDM <- data.frame(validation = "NNDM LOO CV",
                         RMSE = with(mod_NNDM$pred, sqrt(mean((obs - pred)^2))),
                         R2 = with(mod_NNDM$pred, cor(obs, pred)^2))
| validation  | RMSE   | R2   |
|-------------|--------|------|
| LOO CV      | 202.13 | 0.71 |
| bLOO CV     | 288.36 | 0.42 |
| NNDM LOO CV | 203.43 | 0.70 |
In this example, we see that LOO and NNDM LOO CV return very similar results while bLOO CV results in a much lower estimated map accuracy.