Brian Richardson
The sparcc package can be loaded locally using the devtools package.
## load the package
library(devtools)
load_all()
## other necessary packages
library(dplyr)
library(ggplot2)
library(statmod)The sparcc package contains functions to analyze data with a randomly
right-censored covariate using the SPARCC estimator.
The methods implemented are introduced in the paper, “SPARCC: Semi-Parametric Robust Estimation in a Right-Censored Covariate Model,” which is currently under revision.
The code implemented in this package is specific to the scenario where
Below is a tutorial for how the SPARCC estimator can be used on a data set with a randomly right-censored covariate.
We first need to simulate a data set with a randomly right-censored
covariate. This data generation can be done, for example, using the
built in gen.data.beta function in the sparcc package.
The gen.data.beta function generates the following variables:
-
$Z \sim \textrm{Bernoulli}(0.5)$ , a binary uncensored observed covariate, -
$X$ , a randomly right-censored covariate with$X|Z \sim \textrm{beta}(\alpha_{11} + \alpha_{12}Z, \alpha_{13} + \alpha_{14}Z)$ , -
$C$ , the censoring variable with$C|Z \sim \textrm{beta}(\alpha_{21} + \alpha_{22}Z, \alpha_{23} + \alpha_{24}Z)$ , -
$Y$ , the outcome with$Y|X,Z \sim \textrm{Normal}(\beta_0 + \beta_1X + \beta_2Z, \sigma^2)$ .
To ensure that gen.data.beta function:
The user specifies desired values for
gen.data.beta function then finds
an appropriate
From the complete data
The gen.data.beta function returns a list of four data frames:
datf: The full data, including the outcomeY, the covariateX, and the censoring variableC.dat: The observed data, including the outcomeY, the observed valueWof the possibly right-censored covariate, and the censoring indicatorDelta.dat0: The oracle data, a version of the observed data where no observations are right-censored (essentially settingCequal to infinity for all observations).datcc: The complete case data, or the subset ofdatwithDelta == 1.
## define parameters
set.seed(123) # random number seed for reproducibility
n <- 800 # sample size
q <- 0.6 # right-censoring proportion
B <- c(1, 10, 2) # outcome model parameters
s2 <- 1 # Var(Y|X,Z)
x.thetas <- 0.5 * c(-1, 1) # parameters governing X|Z and C|Z
x.gamma <- 1
c.gamma <- 2
## generate data
dat.list <- gen.data.beta(
n = n, q = q, B = B, s2 = s2,
x.thetas = x.thetas, x.gamma = x.gamma, c.gamma)
datf <- dat.list$datf # full data
dat0 <- dat.list$dat0 # oracle data
dat <- dat.list$dat # observed data
datcc <- dat.list$datcc # complete case data
zs <- sort(unique(dat$Z)) # unique z valuesWe can visualize the full data, observed data, and complete case data below.
For the full data, all n = 800 observations appear in the scatterplot.
The colors of points indicate the level of the uncensored binary
covariate Z=1) tend to be above the red points
(Z=0) by 2.
## plot full data
datf %>%
ggplot(aes(x = X,
y = Y,
color = factor(Z))) +
geom_point() +
labs(x = "True X Value",
color = "Z",
title = "Full Data") +
theme_bw() +
theme(legend.position = "bottom")For the observed data, all n = 800 observations again appear in the
scatterplot, with colors indicating the level of
## plot observed data
dat %>%
ggplot(aes(x = W,
y = Y,
color = factor(Z))) +
geom_point() +
labs(x = "Observed W = min(X, Delta)",
color = "Z",
title = "Observed Data") +
theme_bw() +
theme(legend.position = "bottom")For the complete case data, only the 325 uncensored observations appear in the scatterplot. Like for the full data, the outcome model parameters can be read off of the plot, but now with less precision because 475 of the observations are ignored.
## plot complete case data
datcc %>%
ggplot(aes(x = W,
y = Y,
color = factor(Z))) +
geom_point() +
labs(x = "Observed W = min(X, Delta)",
color = "Z",
title = "Complete Case Data") +
theme_bw() +
theme(legend.position = "bottom")Using the observed data, we can fit the SPARCC estimator using the
sparcc function. The SPARCC estimator can be obtained by either (i)
using parametric models for the nuisance parameters
These nuisance parameters
To use the SPARCC estimator with parametric models, use the
nuisance.models = "parametric" option.
Then supply the posited parametric models for distr.x and distr.c options, respectively. Each should be
a character string "name" naming a distribution for which the
corresponding density function dname is defined.
For example, the default is "beta", for which the density dbeta is
defined in base R. It is assumed that shape1 and shape2.
name = "gamma". In this case, the shape and
rate parameters would be estimated at each level of
Importantly, this implementation of the SPARCC estimator is consistent and asymptotically normal provided at least one of these two model specifications is correct.
The following additional function arguments have default values, but are included in the function call below for the sake of this tutorial:
-
xz.interaction: a logical indicator for whether anX*Zinteraction is to be included in the outcome model (TRUE) or not (FALSE), -
mx: a positive integer, the number of quadrature nodes used forX, -
mc: a positive integer, the number of quadrature nodes used forC, -
my: a positive integer, the number of Gauss-Hermite quadrature nodes used for$Y$ , -
range.x: a numeric vector of length 2, the range of the support ofX, -
range.c: a numeric vector of length 2, the range of the support ofC.
mx, mc, and my determine how fine of a grid of points are used for
quadrature approximations of integrals with respect to the cumulative
distribution functions of mx, mc, and my tend to give more accurate numerical results but
require more computation time. Since Gauss-Hermite quadrature is used
for integrals over my than mx and
mc. In the simulated and real data considered in the accompanying
paper, the default values of mx, mc, and my appeared to be
sufficient.
range.x and range.x describe the ranges of the right-censored
covariate
sparcc.param <- sparcc(
data = dat,
xz.interaction = T,
nuisance.models = "parametric",
distr.x = "beta",
distr.c = "beta",
mx = 40,
mc = 40,
my = 5,
range.x = c(1E-6, 1-1E-6),
range.c = c(1E-6, 1-1E-6),
)## STEP 1: estimate parametric nuisance parameters
## STEP 1 complete (0.45 seconds)
## STEP 2: obtain SPARCC estimator
## STEP 2 complete (70.74 seconds)
## STEP 3: obtain SPARCC variance estimator
## STEP 3 complete (17.55 seconds)
The sparcc function returns a list with three items: x.model,
c.model, and outcome.model, which are themselves lists with results
for the
The x.model and c.model lists contain the estimated eta1 and eta2). With the
parametric model implementation (i.e., if
nuisance.models = "parametric"), the x.model and c.model lists
also contain the posited distributional families (distr.x and
distr.c) and estimated parameters (x.params.hat and c.params.hat).
As a demonstration, we can assess here how well the posited beta
distribution for
To create the plot below, the data set eta1 is used, which contains
information about the estimated x.nds column contains
the quadrature nodes for eta1 column contains the
estimate values of
## extract nuisance parameter f_{X|Z}
eta1 <- sparcc.param$x.model$eta1
## plot full data vs estimated f_{X|Z}
ggplot(data = NULL) +
geom_line(data = eta1,
aes(x = x.nds,
y = eta1,
color = factor(Z)),
linewidth = 2) +
geom_histogram(data = datf,
aes(x = X,
y = after_stat(density),
group = factor(Z),
fill = factor(Z)),
position = "identity",
alpha = 0.5,
bins = 50) +
labs(x = "X",
y = "Density",
title = "Estimated and Observed X|Z",
subtitle = "(Parametric Model)",
fill = "Z",
color = "Z") +
theme_bw() +
theme(legend.position = "bottom")The outcome.model list contains the outcome model formula, the
estimated model coefficients, and the covariance matrix for these
coefficients.
## outcome model formula
sparcc.param$outcome.model$outcome.fmla## Y ~ X * Z
## <environment: 0x000002ece293dd90>
## estimated coefficient; truth is c(1, 10, 2, 0, 0)
sparcc.est <- sparcc.param$outcome.model$coef
round(sparcc.est, 3)## B1 B2 B3 B4 B5
## 0.659 10.689 2.502 -0.830 -0.144
## covariance matrix
sparcc.cov <- sparcc.param$outcome.model$cov
round(sparcc.cov, 3)## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.040 -0.078 -0.040 0.078 0.000
## [2,] -0.078 0.173 0.078 -0.174 0.000
## [3,] -0.040 0.078 0.055 -0.123 -0.001
## [4,] 0.078 -0.174 -0.123 0.397 0.003
## [5,] 0.000 0.000 -0.001 0.003 0.006
We can use these estimated coefficients and corresponding covariance
matrix to plot the regression line of
-
Build a new dataset
$D_{\text{new}}$ that contains a grid of$X$ values ranging from 0 to 1, for each possible value of$Z \in {0,1}$ . This gives us the points where we will evaluate the fitted regression line. -
The regression line is obtained by multiplying
$D_{\text{new}}$ with the estimated regression coefficients$\widehat{\boldsymbol{\beta}}$ :
- To account for sampling variability, we use the delta method to compute the variance of the fitted values:
Then we construct 95% Wald-type confidence intervals:
## new X and Z data for fitting the regression model
Dnew <- data.frame(
int = 1,
X = seq(0, 1, length = 100),
Z = rep(0:1, each = 100)) %>%
mutate(XZ = X * Z) %>%
as.matrix()
## fitted values
Yhat <- Dnew %*% sparcc.est[1:4]
## standard errors for fitted values
Yhat.se <- sqrt(diag(Dnew %*% sparcc.cov[1:4, 1:4] %*% t(Dnew)))
## data for plotting
plot.dat <- data.frame(
X = Dnew[,2],
Z = factor(Dnew[,3]),
Yhat = Yhat,
Ylower = Yhat - qnorm(0.975) * Yhat.se,
Yupper = Yhat + qnorm(0.975) * Yhat.se)
## plot fitted lines and 95% confidence intervals
ggplot(data = plot.dat,
aes(x = X,
y = Yhat,
ymin = Ylower,
ymax = Yupper,
color = Z,
fill = Z)) +
geom_line() +
geom_ribbon(alpha = 0.5) +
labs(x = "X",
y = "Estimated Regression Line (95% CI)",
color = "Z",
fill = "Z",
title = "SPARCC Outcome Model Results",
subtitle = "(Using Parametric Models for Nuisance Parameters)") +
theme_bw() +
theme(legend.position = "bottom")To use the SPARCC estimator with nonparametric/machine learning methods
to estimate the nuisance parameters, use the
nuisance.models = "nonparametric" option.
Under this option, deg and m.knots arguments.
Larger values of deg (spline degree) or m.knots (number of knots)
make the spline basis more flexible, allowing it to capture more complex
patterns in the data, but at the risk of overfitting. In the simulated
and real data considered in the accompanying paper, the default values
of deg and m.knots seemed appropriate.
sparcc.nonpar <- sparcc(
data = dat,
xz.interaction = T,
nuisance.models = "nonparametric",
deg = 3,
m.knots = 5,
mx = 40,
mc = 40,
my = 5,
range.x = c(1E-6, 1-1E-6),
range.c = c(1E-6, 1-1E-6),
)## STEP 1: estimate nonparametric nuisance parameters
## STEP 1 complete (1.45 seconds)
## STEP 2: obtain SPARCC estimator
## STEP 2 complete (51.08 seconds)
## STEP 3: obtain SPARCC variance estimator
## STEP 3 complete (18.04 seconds)
The output is a list containing x.model, c.model, and
outcome.model, similar to that for the parametric model
implementation. We can again assess the fit of the B-spline estimator of
## extract nonparametrically estimated f_{X|Z}
eta1.np <- sparcc.nonpar$x.model$eta1
## plot full data vs estimated f_{X|Z}
ggplot(data = NULL) +
geom_line(data = eta1.np,
aes(x = x.nds,
y = eta1,
color = factor(Z)),
linewidth = 2) +
geom_histogram(data = datf,
aes(x = X,
y = after_stat(density),
group = factor(Z),
fill = factor(Z)),
position = "identity",
alpha = 0.5,
bins = 50) +
labs(x = "X",
y = "Density",
title = "Estimated and Observed X|Z",
subtitle = "(Nonparametric Model)",
fill = "Z",
color = "Z") +
theme_bw() +
theme(legend.position = "bottom")Finally, we can extract the estimated outcome model coefficients and
corresponding covariance matrix, and plot the regression line of
## estimated coefficient; truth is c(1, 10, 2, 0, 0)
sparcc.est.np <- sparcc.nonpar$outcome.model$coef
round(sparcc.est.np, 3)## B1 B2 B3 B4 B5
## 0.658 10.693 2.502 -0.814 -0.144
## covariance matrix
sparcc.cov.np <- sparcc.nonpar$outcome.model$cov
round(sparcc.cov.np, 3)## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.040 -0.078 -0.040 0.078 0.000
## [2,] -0.078 0.173 0.078 -0.173 0.000
## [3,] -0.040 0.078 0.055 -0.122 -0.001
## [4,] 0.078 -0.173 -0.122 0.390 0.003
## [5,] 0.000 0.000 -0.001 0.003 0.006
## fitted values
Yhat.np <- Dnew %*% sparcc.est.np[1:4]
## standard errors for fitted values
Yhat.se.np <- sqrt(diag(Dnew %*% sparcc.cov.np[1:4, 1:4] %*% t(Dnew)))
## data for plotting
plot.dat.np <- data.frame(
X = Dnew[,2],
Z = factor(Dnew[,3]),
Yhat = Yhat.np,
Ylower = Yhat.np - qnorm(0.975) * Yhat.se.np,
Yupper = Yhat.np + qnorm(0.975) * Yhat.se.np)
## plot fitted lines and 95% confidence intervals
ggplot(data = plot.dat.np,
aes(x = X,
y = Yhat,
ymin = Ylower,
ymax = Yupper,
color = Z,
fill = Z)) +
geom_line() +
geom_ribbon(alpha = 0.5) +
labs(x = "X",
y = "Estimated Regression Line (95% CI)",
color = "Z",
fill = "Z",
title = "SPARCC Outcome Model Results",
subtitle = "(Using Nonparametric Estimator for Nuisance Parameters)") +
theme_bw() +
theme(legend.position = "bottom")The four simulation studies for the paper accompanying this R package
can all be reproduced using the code in simulations/. For each of
these four simulation studies (x = 1,2,3,4), the following exist:
simx_function.R: An R script defining a function to run a single simulation.simx_run.R: An R script to run repeated simulations in parallel on a computing cluster using a variety of desired settings.simx_data/: A folder containing simulation results produced by runningsimx_run.Ron a computing cluster.simx_analyze.Rmd: An R Markdown report to analyze the results from the simulation study (found insimx_data/).
Simulation figures are output to the folder sim_plots/.
The workflow used to analyze the ENROLL-HD data set in the accompanying
paper is replicated in the folder synth_data_analysis/, but using
synthetic data. Details and a data dictionary are provided in
synth_data_analysis/00-SynthData-Dictionary.Rmd.
The entire synthetic data analysis can be reproduced by running the following R scripts in order:
01-SynthData-Generation.R02-SynthData-Analysis-TMS.R03-SynthData-Analysis-SDMT.R04-SynthData-Analysis-cUHDRS.R05-SynthData-Results.R







