Skip to content

brian-d-richardson/sparcc

Repository files navigation

sparcc: Semi-Parametric Robust Estimation in a Right-Censored Covariate Model

Brian Richardson

Installation

The sparcc package can be loaded locally using the devtools package.

## load the package
library(devtools)
load_all()

## other necessary packages
library(dplyr)
library(ggplot2)
library(statmod)

The sparcc package contains functions to analyze data with a randomly right-censored covariate using the SPARCC estimator.

The methods implemented are introduced in the paper, “SPARCC: Semi-Parametric Robust Estimation in a Right-Censored Covariate Model,” which is currently under revision.

The code implemented in this package is specific to the scenario where $Y|X,Z$ has a normal distribution with mean $\textrm{E}(Y|X,Z)=\beta_0+\beta_1X+\beta_2Z + \beta_3XZ$, for a randomly right-censored covariate $X$ and an uncensored discrete-valued covariate $Z$.

Tutorial

Below is a tutorial for how the SPARCC estimator can be used on a data set with a randomly right-censored covariate.

Data Generation

We first need to simulate a data set with a randomly right-censored covariate. This data generation can be done, for example, using the built in gen.data.beta function in the sparcc package.

The gen.data.beta function generates the following variables:

  • $Z \sim \textrm{Bernoulli}(0.5)$, a binary uncensored observed covariate,
  • $X$, a randomly right-censored covariate with $X|Z \sim \textrm{beta}(\alpha_{11} + \alpha_{12}Z, \alpha_{13} + \alpha_{14}Z)$,
  • $C$, the censoring variable with $C|Z \sim \textrm{beta}(\alpha_{21} + \alpha_{22}Z, \alpha_{23} + \alpha_{24}Z)$,
  • $Y$, the outcome with $Y|X,Z \sim \textrm{Normal}(\beta_0 + \beta_1X + \beta_2Z, \sigma^2)$.

To ensure that $X$ and $C$ are generated with the desired right-censoring proportion $q=P(X>C)$, the following reparametrization is used within the gen.data.beta function:

$$ \begin{bmatrix} \alpha_{11} \\ \alpha_{12} \\ \alpha_{13} \\ \alpha_{14} \end{bmatrix} = \begin{bmatrix} 1 + \gamma_x - \theta_{x1} \\ \theta_{x1} - \theta_{x2} \\ 1 + \gamma_x + \theta_{x1} \\ -\theta_{x1} + \theta_{x2} \end{bmatrix} $$ $$ \begin{bmatrix} \alpha_{21} \\ \alpha_{22} \\ \alpha_{23} \\ \alpha_{24} \end{bmatrix} = \begin{bmatrix} 1 + \gamma_c - \theta_{c1} \\ \theta_{c1} - \theta_{c2} \\ 1 + \gamma_c + \theta_{c1} \\ -\theta_{c1} + \theta_{c2} \end{bmatrix} $$

The user specifies desired values for $\gamma_x, \gamma_c, \theta_{x1}, \theta_{x2}$ along with a desired right-censoring proportion $q$. The gen.data.beta function then finds an appropriate $\theta_{c1}, \theta_{c2}$ in order to satisfy the constraint $q=\textrm{P}(X>C)$.

From the complete data $(Y, X, C, Z)$, the observed data $(Y, W, \Delta, Z)$ are generated using $W = \min(X, C)$ and $\Delta = I(X \leq C)$.

The gen.data.beta function returns a list of four data frames:

  1. datf: The full data, including the outcome Y, the covariate X, and the censoring variable C.
  2. dat: The observed data, including the outcome Y, the observed value W of the possibly right-censored covariate, and the censoring indicator Delta.
  3. dat0: The oracle data, a version of the observed data where no observations are right-censored (essentially setting C equal to infinity for all observations).
  4. datcc: The complete case data, or the subset of dat with Delta == 1.
## define parameters
set.seed(123)                 # random number seed for reproducibility
n <- 800                      # sample size
q <- 0.6                      # right-censoring proportion
B <- c(1, 10, 2)              # outcome model parameters
s2 <- 1                       # Var(Y|X,Z)
x.thetas <- 0.5 * c(-1, 1)    # parameters governing X|Z and C|Z
x.gamma <- 1
c.gamma <- 2

## generate data
dat.list <- gen.data.beta(
  n = n, q = q, B = B, s2 = s2,
  x.thetas = x.thetas, x.gamma = x.gamma, c.gamma)
datf <- dat.list$datf          # full data
dat0 <- dat.list$dat0          # oracle data
dat <- dat.list$dat            # observed data
datcc <- dat.list$datcc        # complete case data
zs <- sort(unique(dat$Z))      # unique z values

We can visualize the full data, observed data, and complete case data below.

For the full data, all n = 800 observations appear in the scatterplot. The colors of points indicate the level of the uncensored binary covariate $Z$. The trends in the scatterplot clearly reflect the parameters in the outcome model $\textrm{E}(Y|X,Z)=\beta_0+\beta_1X+\beta_2Z + \beta_3XZ$, for $\mathbf{\beta} = (\beta_0, \beta_1, \beta_2, \beta_3) = (1, 10, 2, 0)$. For example, the average slope of $Y$ with respect to $X$ appears to be around 10, and the blue dots (Z=1) tend to be above the red points (Z=0) by 2.

## plot full data
datf %>% 
  ggplot(aes(x = X,
             y = Y, 
             color = factor(Z))) +
  geom_point() +
  labs(x = "True X Value",
       color = "Z",
       title = "Full Data") +
  theme_bw() +
  theme(legend.position = "bottom")

For the observed data, all n = 800 observations again appear in the scatterplot, with colors indicating the level of $Z$. But in this plot, the outcome $Y$ is plotted against the observed value $W = min(X, C)$ of the right-censored covariate. Using this plot, the outcome model parameters are not so easily ascertained. For example, the average slope of $Y$ with respect to $X$ does not appear to be 10.

## plot observed data
dat %>% 
  ggplot(aes(x = W,
             y = Y, 
             color = factor(Z))) +
  geom_point() +
  labs(x = "Observed W = min(X, Delta)",
       color = "Z",
       title = "Observed Data") +
  theme_bw() +
  theme(legend.position = "bottom")

For the complete case data, only the 325 uncensored observations appear in the scatterplot. Like for the full data, the outcome model parameters can be read off of the plot, but now with less precision because 475 of the observations are ignored.

## plot complete case data
datcc %>% 
  ggplot(aes(x = W,
             y = Y, 
             color = factor(Z))) +
  geom_point() +
  labs(x = "Observed W = min(X, Delta)",
       color = "Z",
       title = "Complete Case Data") +
  theme_bw() +
  theme(legend.position = "bottom")

Estimation

Using the observed data, we can fit the SPARCC estimator using the sparcc function. The SPARCC estimator can be obtained by either (i) using parametric models for the nuisance parameters $f_{X|Z}$ and $f_{C|Z}$, or (ii) using nonparametric/machine learning methods for the nuisance parameters.

These nuisance parameters $f_{X|Z}$ and $f_{C|Z}$ are so named because, while they are involved in the estimation $(\beta_0, \beta_1, \beta_2, \beta_3)$, they are not directly of scientific interest. While they are called “parameters,” they are in fact infinite dimensional (because they are density functions).

Parametric Working Models

To use the SPARCC estimator with parametric models, use the nuisance.models = "parametric" option.

Then supply the posited parametric models for $f_{X|Z}$ and $f_{C|Z}$ using the distr.x and distr.c options, respectively. Each should be a character string "name" naming a distribution for which the corresponding density function dname is defined.

For example, the default is "beta", for which the density dbeta is defined in base R. It is assumed that $X$ (and $C$) then follow these posited distributions with possibly different parameter values at each level of $Z$. For the beta distribution in base R, these parameter values are shape1 and shape2.

$f_{X|Z}$ and $f_{C|Z}$ could also be specified as, e.g., having gamma distributions using name = "gamma". In this case, the shape and rate parameters would be estimated at each level of $Z$.

Importantly, this implementation of the SPARCC estimator is consistent and asymptotically normal provided at least one of these two model specifications is correct.

The following additional function arguments have default values, but are included in the function call below for the sake of this tutorial:

  • xz.interaction: a logical indicator for whether an X*Z interaction is to be included in the outcome model (TRUE) or not (FALSE),
  • mx: a positive integer, the number of quadrature nodes used for X,
  • mc: a positive integer, the number of quadrature nodes used for C,
  • my: a positive integer, the number of Gauss-Hermite quadrature nodes used for $Y$,
  • range.x: a numeric vector of length 2, the range of the support of X,
  • range.c: a numeric vector of length 2, the range of the support of C.

mx, mc, and my determine how fine of a grid of points are used for quadrature approximations of integrals with respect to the cumulative distribution functions of $X$, $C$, and $Y$, respectively. Larger values of mx, mc, and my tend to give more accurate numerical results but require more computation time. Since Gauss-Hermite quadrature is used for integrals over $Y$, fewer nodes are needed for my than mx and mc. In the simulated and real data considered in the accompanying paper, the default values of mx, mc, and my appeared to be sufficient.

range.x and range.x describe the ranges of the right-censored covariate $X$ and the censoring variable $C$. Since, in practice, only the minimum $W$ of $X$ and $C$ is observed for each observation, it may be difficult to know exactly what the maximum values of these two variables are. Plausible maximum values could be chosen based on a combination of the observed data (e.g., the maximum observed values of $W$ among censored and uncensored observations) and subject matter knowledge about $X$ and $C$.

sparcc.param <- sparcc(
  data = dat,
  xz.interaction = T,
  nuisance.models = "parametric",
  distr.x = "beta",
  distr.c = "beta",
  mx = 40,
  mc = 40,
  my = 5,
  range.x = c(1E-6, 1-1E-6),
  range.c = c(1E-6, 1-1E-6), 
)
## STEP 1: estimate parametric nuisance parameters

## STEP 1 complete (0.45 seconds)

## STEP 2: obtain SPARCC estimator

## STEP 2 complete (70.74 seconds)

## STEP 3: obtain SPARCC variance estimator

## STEP 3 complete (17.55 seconds)

The sparcc function returns a list with three items: x.model, c.model, and outcome.model, which are themselves lists with results for the $f_{X|Z}$ nuisance parameter, the $f_{C|Z}$ nuisance parameter, and the $Y|X,Z$ outcome model, respectively.

The x.model and c.model lists contain the estimated $f_{X|Z}$ and $f_{C|Z}$ at their quadrature nodes (eta1 and eta2). With the parametric model implementation (i.e., if nuisance.models = "parametric"), the x.model and c.model lists also contain the posited distributional families (distr.x and distr.c) and estimated parameters (x.params.hat and c.params.hat).

As a demonstration, we can assess here how well the posited beta distribution for $f_{X|Z}$ fits the data. Note that in reality, we would not observe most of these $X$ values due to right-censoring.

To create the plot below, the data set eta1 is used, which contains information about the estimated $f_{X|Z}$. The x.nds column contains the quadrature nodes for $X$, and the eta1 column contains the estimate values of $f_{X|Z}$ at these nodes.

## extract nuisance parameter f_{X|Z}
eta1 <- sparcc.param$x.model$eta1

## plot full data vs estimated f_{X|Z}
ggplot(data = NULL) +
  geom_line(data = eta1,
            aes(x = x.nds,
                y = eta1,
                color = factor(Z)),
            linewidth = 2) +
  geom_histogram(data = datf,
                 aes(x = X,
                     y = after_stat(density),
                     group = factor(Z),
                     fill = factor(Z)),
                 position = "identity",
                 alpha = 0.5,
                 bins = 50) +
  labs(x = "X",
       y = "Density",
       title = "Estimated and Observed X|Z",
       subtitle = "(Parametric Model)",
       fill = "Z",
       color = "Z") +
  theme_bw() +
  theme(legend.position = "bottom")

The outcome.model list contains the outcome model formula, the estimated model coefficients, and the covariance matrix for these coefficients.

## outcome model formula
sparcc.param$outcome.model$outcome.fmla
## Y ~ X * Z
## <environment: 0x000002ece293dd90>
## estimated coefficient; truth is c(1, 10, 2, 0, 0)
sparcc.est <- sparcc.param$outcome.model$coef
round(sparcc.est, 3)
##     B1     B2     B3     B4     B5 
##  0.659 10.689  2.502 -0.830 -0.144
## covariance matrix
sparcc.cov <- sparcc.param$outcome.model$cov
round(sparcc.cov, 3)
##        [,1]   [,2]   [,3]   [,4]   [,5]
## [1,]  0.040 -0.078 -0.040  0.078  0.000
## [2,] -0.078  0.173  0.078 -0.174  0.000
## [3,] -0.040  0.078  0.055 -0.123 -0.001
## [4,]  0.078 -0.174 -0.123  0.397  0.003
## [5,]  0.000  0.000 -0.001  0.003  0.006

We can use these estimated coefficients and corresponding covariance matrix to plot the regression line of $Y$ on $X$ and $Z$, along with a corresponding 95% confidence interval. To do this, we proceed in three steps:

  1. Build a new dataset $D_{\text{new}}$ that contains a grid of $X$ values ranging from 0 to 1, for each possible value of $Z \in {0,1}$. This gives us the points where we will evaluate the fitted regression line.

  2. The regression line is obtained by multiplying $D_{\text{new}}$ with the estimated regression coefficients $\widehat{\boldsymbol{\beta}}$:

$$\widehat{Y} = D_{\text{new}}\widehat{\boldsymbol{\beta}} $$

  1. To account for sampling variability, we use the delta method to compute the variance of the fitted values:

$$\widehat{\text{Var}}(\widehat{Y}) = D_{new}\widehat{\text{Var}}(\widehat{\mathbf{\beta}})D_{new}^T$$

Then we construct 95% Wald-type confidence intervals:

$$\widehat{Y} \pm z_{0.975} \sqrt{\widehat{\text{Var}}(\widehat{Y})}$$

## new X and Z data for fitting the regression model
Dnew <- data.frame(
  int = 1,
  X = seq(0, 1, length = 100),
  Z = rep(0:1, each = 100)) %>% 
  mutate(XZ = X * Z) %>% 
  as.matrix()

## fitted values
Yhat <- Dnew %*% sparcc.est[1:4]

## standard errors for fitted values
Yhat.se <- sqrt(diag(Dnew %*% sparcc.cov[1:4, 1:4] %*% t(Dnew)))

## data for plotting
plot.dat <- data.frame(
  X = Dnew[,2],
  Z = factor(Dnew[,3]),
  Yhat = Yhat,
  Ylower = Yhat - qnorm(0.975) * Yhat.se,
  Yupper = Yhat + qnorm(0.975) * Yhat.se)

## plot fitted lines and 95% confidence intervals
ggplot(data = plot.dat,
       aes(x = X,
           y = Yhat,
           ymin = Ylower,
           ymax = Yupper,
           color = Z,
           fill = Z)) +
  geom_line() +
  geom_ribbon(alpha = 0.5) +
  labs(x = "X",
       y = "Estimated Regression Line (95% CI)",
       color = "Z",
       fill = "Z",
       title = "SPARCC Outcome Model Results",
       subtitle = "(Using Parametric Models for Nuisance Parameters)") +
  theme_bw() +
  theme(legend.position = "bottom")

Nonparametric Models

To use the SPARCC estimator with nonparametric/machine learning methods to estimate the nuisance parameters, use the nuisance.models = "nonparametric" option.

Under this option, $f_{X|Z}$ and $f_{C|Z}$ are fit using a nonparametric B-spline estimator, which is described in detail in Section 3.2.1 of the accompanying paper. Instead of supplying posited parametric distributions, we specify the polynomial degree and number of spline knots using the deg and m.knots arguments.

Larger values of deg (spline degree) or m.knots (number of knots) make the spline basis more flexible, allowing it to capture more complex patterns in the data, but at the risk of overfitting. In the simulated and real data considered in the accompanying paper, the default values of deg and m.knots seemed appropriate.

sparcc.nonpar <- sparcc(
  data = dat,
  xz.interaction = T,
  nuisance.models = "nonparametric",
  deg = 3,
  m.knots = 5,
  mx = 40,
  mc = 40,
  my = 5,
  range.x = c(1E-6, 1-1E-6),
  range.c = c(1E-6, 1-1E-6), 
)
## STEP 1: estimate nonparametric nuisance parameters

## STEP 1 complete (1.45 seconds)

## STEP 2: obtain SPARCC estimator

## STEP 2 complete (51.08 seconds)

## STEP 3: obtain SPARCC variance estimator

## STEP 3 complete (18.04 seconds)

The output is a list containing x.model, c.model, and outcome.model, similar to that for the parametric model implementation. We can again assess the fit of the B-spline estimator of $f_{X|Z}$.

## extract nonparametrically estimated f_{X|Z}
eta1.np <- sparcc.nonpar$x.model$eta1

## plot full data vs estimated f_{X|Z}
ggplot(data = NULL) +
  geom_line(data = eta1.np,
            aes(x = x.nds,
                y = eta1,
                color = factor(Z)),
            linewidth = 2) +
  geom_histogram(data = datf,
                 aes(x = X,
                     y = after_stat(density),
                     group = factor(Z),
                     fill = factor(Z)),
                 position = "identity",
                 alpha = 0.5,
                 bins = 50) +
  labs(x = "X",
       y = "Density",
       title = "Estimated and Observed X|Z",
       subtitle = "(Nonparametric Model)",
       fill = "Z",
       color = "Z") +
  theme_bw() +
  theme(legend.position = "bottom")

Finally, we can extract the estimated outcome model coefficients and corresponding covariance matrix, and plot the regression line of $Y$ on $X$ and $Z$, just as we did for the SPARCC estimator using parametric models.

## estimated coefficient; truth is c(1, 10, 2, 0, 0)
sparcc.est.np <- sparcc.nonpar$outcome.model$coef
round(sparcc.est.np, 3)
##     B1     B2     B3     B4     B5 
##  0.658 10.693  2.502 -0.814 -0.144
## covariance matrix
sparcc.cov.np <- sparcc.nonpar$outcome.model$cov
round(sparcc.cov.np, 3)
##        [,1]   [,2]   [,3]   [,4]   [,5]
## [1,]  0.040 -0.078 -0.040  0.078  0.000
## [2,] -0.078  0.173  0.078 -0.173  0.000
## [3,] -0.040  0.078  0.055 -0.122 -0.001
## [4,]  0.078 -0.173 -0.122  0.390  0.003
## [5,]  0.000  0.000 -0.001  0.003  0.006
## fitted values
Yhat.np <- Dnew %*% sparcc.est.np[1:4]

## standard errors for fitted values
Yhat.se.np <- sqrt(diag(Dnew %*% sparcc.cov.np[1:4, 1:4] %*% t(Dnew)))

## data for plotting
plot.dat.np <- data.frame(
  X = Dnew[,2],
  Z = factor(Dnew[,3]),
  Yhat = Yhat.np,
  Ylower = Yhat.np - qnorm(0.975) * Yhat.se.np,
  Yupper = Yhat.np + qnorm(0.975) * Yhat.se.np)

## plot fitted lines and 95% confidence intervals
ggplot(data = plot.dat.np,
       aes(x = X,
           y = Yhat,
           ymin = Ylower,
           ymax = Yupper,
           color = Z,
           fill = Z)) +
  geom_line() +
  geom_ribbon(alpha = 0.5) +
  labs(x = "X",
       y = "Estimated Regression Line (95% CI)",
       color = "Z",
       fill = "Z",
       title = "SPARCC Outcome Model Results",
       subtitle = "(Using Nonparametric Estimator for Nuisance Parameters)") +
  theme_bw() +
  theme(legend.position = "bottom")

Workflow

Simulations

The four simulation studies for the paper accompanying this R package can all be reproduced using the code in simulations/. For each of these four simulation studies (x = 1,2,3,4), the following exist:

  • simx_function.R: An R script defining a function to run a single simulation.
  • simx_run.R: An R script to run repeated simulations in parallel on a computing cluster using a variety of desired settings.
  • simx_data/: A folder containing simulation results produced by running simx_run.R on a computing cluster.
  • simx_analyze.Rmd: An R Markdown report to analyze the results from the simulation study (found in simx_data/).

Simulation figures are output to the folder sim_plots/.

Synthetic Data Analysis

The workflow used to analyze the ENROLL-HD data set in the accompanying paper is replicated in the folder synth_data_analysis/, but using synthetic data. Details and a data dictionary are provided in synth_data_analysis/00-SynthData-Dictionary.Rmd.

The entire synthetic data analysis can be reproduced by running the following R scripts in order:

  • 01-SynthData-Generation.R
  • 02-SynthData-Analysis-TMS.R
  • 03-SynthData-Analysis-SDMT.R
  • 04-SynthData-Analysis-cUHDRS.R
  • 05-SynthData-Results.R

About

semiparametric estimation with censored covariates

Resources

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors