zzw20/spire

spire: Semi-Parametric Informative Right-censored covariate Estimator

Zhewei Zhang

Installation

The spire package can be loaded locally using the devtools package. Run the commands below from the root directory of the package source.

## load the package
library(devtools)
load_all()

## other necessary packages
library(dplyr)
library(ggplot2)
library(statmod)

The spire package contains functions to analyze data with a randomly right-censored covariate using the SPIRE estimator.

The methods implemented are introduced in the paper, “SPARCC: Semi-Parametric Robust Estimation in a Right-Censored Covariate Model,” which is currently under revision.

The code implemented in this package is specific to two scenarios: 1. $Y|X,Z$ has a normal distribution with mean $\textrm{E}(Y|X,Z)=\beta_0+\beta_1X+\beta_2Z$; 2. $Y|X,Z$ has a normal distribution with mean $\textrm{E}(Y|X,Z)=\beta_0+\beta_1X+\beta_2Z_1 + \beta_3Z_2 + \beta_4XZ_2$, for a randomly right-censored covariate $X$ and $h$ uncensored covariates $Z$. Specifically, in the second scenario, $Z_1$ should be continuous and $Z_2$ should be categorical.
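To make the first scenario concrete, the sketch below simulates one data set of the form used throughout the tutorial (y, w, delta, z). It uses base R only. The sample size, coefficient values, and the exponential distributions for $X$ and $C$ are illustrative assumptions, not settings taken from the paper.

```r
## Simulate data from scenario 1 (all numeric settings are illustrative assumptions)
set.seed(1)
n <- 500
beta_true <- c(1, 2, -1)            # (beta_0, beta_1, beta_2), chosen arbitrarily

z <- rnorm(n)                       # fully observed covariate Z
x <- rexp(n, rate = 1)              # covariate X, subject to right-censoring
c_time <- rexp(n, rate = 0.5)       # censoring variable C, drawn independently of X

w <- pmin(x, c_time)                # observed W = min(C, X)
delta <- as.integer(x <= c_time)    # 1 if X is observed, 0 if X is censored
y <- beta_true[1] + beta_true[2] * x + beta_true[3] * z + rnorm(n)

mean(delta == 0)                    # empirical censoring rate
```

Only y, w, delta, and z would be available in practice; x and c_time are kept here so the simulated data can be checked.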

Tutorial

Below is a tutorial for how the SPIRE estimator can be used on a data set with a randomly right-censored covariate.

Estimation

Using the observed data, we can fit four estimators (CC, IPW, MLE, and SPIRE) for the parameter of interest $\mathbf{\beta}$ using functions in the ‘spire’ package.

Complete Case (CC) estimator

To use the CC estimator, use the ‘cc’ function in the ‘spire’ package.

There are six arguments in total for the ‘cc’ function:

  • y: Numeric vector of responses.
  • w: Numeric vector of observed values of the censored covariate, $W=\min(C,X)$.
  • delta: 0/1 vector: 1 if $X \leq C$, 0 otherwise.
  • z: Numeric vector (one uncensored covariate) or matrix (multiple uncensored covariates) of covariates $Z$.
  • beta_init: Numeric initial value for ${\boldsymbol\beta}$ (length p), which could be any reasonable estimator.
  • S_beta_fun: Function: Score function based on the model of $Y|X,Z$, i.e., $\partial\log f_{Y|X,Z}(y,x,z;{\boldsymbol\beta})/\partial{\boldsymbol\beta}$.
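As an illustration of the score $\partial\log f_{Y|X,Z}/\partial{\boldsymbol\beta}$, here is a minimal sketch for the first scenario, assuming unit error variance and ${\boldsymbol\beta}=(\beta_0,\beta_1,\beta_2)$. The function name score_normal_sketch and its argument order (beta, y, x, z) are assumptions made for this example; the signature that S_beta_fun must have is not specified above, so check the package documentation before passing a function like this to cc.

```r
## Score of the normal model Y | X, Z ~ N(beta0 + beta1 * x + beta2 * z, 1).
## Hypothetical helper: name and argument order are assumptions for this sketch.
score_normal_sketch <- function(beta, y, x, z) {
  resid <- y - (beta[1] + beta[2] * x + beta[3] * z)
  c(resid, resid * x, resid * z)    # derivative of the log-density w.r.t. beta
}

## Check against a central finite difference of the log-density
log_dens <- function(beta, y, x, z) {
  dnorm(y, mean = beta[1] + beta[2] * x + beta[3] * z, sd = 1, log = TRUE)
}
b0 <- c(1, 2, -1)
eps <- 1e-6
numeric_score <- sapply(1:3, function(j) {
  e <- replace(numeric(3), j, eps)
  (log_dens(b0 + e, 0.4, 1.5, -0.3) - log_dens(b0 - e, 0.4, 1.5, -0.3)) / (2 * eps)
})
analytic_score <- score_normal_sketch(b0, y = 0.4, x = 1.5, z = -0.3)
max(abs(analytic_score - numeric_score))   # should be close to zero
```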

The cc function returns a list with five items:

  • coef: the estimated model coefficients (or parameter of interest) $\widehat{\boldsymbol\beta}$ (length $p$).
  • cov: estimated covariance matrix of $\widehat{\boldsymbol\beta}$ ($p\times p$ matrix).
  • se: standard errors of $\widehat{\boldsymbol\beta}$.
  • PSI: $p\times n$ matrix of per-observation scores at $\widehat{\boldsymbol\beta}$.
  • J_hat: $p\times p$ average Jacobian at $\widehat{\boldsymbol\beta}$.

We can then use the estimated coefficients and the standard errors to obtain a 95% confidence interval. For example,

## obtain the cc estimator from a given dataset
res_cc <- spire::cc(y=y, w=w, delta=delta, z=z, beta_init = beta, S_beta_fun = S.beta.f)

## calculate 95% confidence interval
cbind(res_cc$coef - qnorm(0.975)*res_cc$se, res_cc$coef + qnorm(0.975)*res_cc$se)
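The same calculation can be wrapped in a small helper so that it can be reused for every estimator. The sketch below is base R only; wald_ci is a hypothetical helper, not a function of the spire package, and res_mock is a made-up list that only mimics the coef and se items described above.

```r
## Wald-type confidence interval from coefficients and standard errors.
## Hypothetical helper, not part of the spire package.
wald_ci <- function(coef, se, level = 0.95) {
  zq <- qnorm(1 - (1 - level) / 2)
  cbind(estimate = coef, lower = coef - zq * se, upper = coef + zq * se)
}

## Made-up numbers standing in for the output of an estimator
res_mock <- list(coef = c(1.02, 1.95, -0.97), se = c(0.08, 0.05, 0.06))
ci <- wald_ci(res_mock$coef, res_mock$se)
ci
```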

In addition, PSI and J_hat are used in the test for noninformative covariate censoring, which is introduced later.

Inverse Probability Weighting (IPW) estimator

To use the IPW estimator, use the ‘ipw’ function in the ‘spire’ package.

In addition to the six arguments y, w, delta, z, beta_init, and S_beta_fun, which are already defined in the documentation of the cc function, there are several more arguments in the ipw function:

  • dc_cxz: Optional density $f_{C|X,Z}$. If provided (and pr_fun not), tail probability $P(C\geq x|x,z)$ is integrate(dc_cxz, lower = x, upper = upper).
  • pr_fun: Optional direct tail-prob function pr(x, z_row) $= P(C\geq x|x,z)$. If supplied, it takes precedence over dc_cxz.
  • lower,upper: Numeric scalar integration bounds used when dc_cxz is provided. Defaults are -Inf and Inf; set finite bounds if your support is bounded.

You can plug the true $f_{C|X,Z}$ into dc_cxz, or use any misspecified version (or its corresponding tail probability via pr_fun), and the estimator remains consistent.
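To see how dc_cxz and pr_fun relate, the sketch below assumes $C$ is exponential with rate 0.5 and independent of $(X,Z)$, and compares the tail probability obtained by numerical integration of the density with the closed-form tail probability. The names dc_cxz_demo and pr_fun_demo, the one-argument form of the density, and the exponential model are assumptions for illustration only.

```r
## Working model for C: exponential with rate 0.5 (an assumption for this sketch)
rate_c <- 0.5

## Density of C, in the form that integrate() expects (first argument is c)
dc_cxz_demo <- function(c) dexp(c, rate = rate_c)

## Direct tail probability P(C >= x | x, z), with the pr(x, z_row) signature
pr_fun_demo <- function(x, z_row) pexp(x, rate = rate_c, lower.tail = FALSE)

x0 <- 1.3
tail_numeric <- integrate(dc_cxz_demo, lower = x0, upper = Inf)$value
tail_closed <- pr_fun_demo(x0, z_row = 0)
c(numeric = tail_numeric, closed_form = tail_closed)
```

When a closed form is available, supplying it through pr_fun avoids one numerical integration per observation.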

The ipw function returns a list with the same five items as the cc function.

Maximum Likelihood Estimator (MLE)

To use the MLE, use the ‘mle’ function in the ‘spire’ package.

In addition to the six arguments y, w, delta, z, beta_init, and S_beta_fun, which are already defined in the documentation of the cc function, there are several more arguments in the mle function:

  • f_x_cz: Function (x, c, z_row) for density $f_{X|C,Z}$. Ignored when method = "KM".
  • f_y_xz: Function (beta, y_i, x, z_row) for density $f_{Y|X,Z}$.
  • method: “continuous”, “discrete”, or “KM” (default “continuous”).
  • m: Integer, number of X grid points (only used for “discrete”). Default 30.
  • x_grid_range: numeric(2): c(x_min, x_max) for the grid (only for “discrete”). If NULL, uses mean(w) ± 3*sd(w).
  • upper: Numeric, upper limit for x-integrals when method="continuous" (default Inf).
  • h: Positive bandwidth for the Gaussian kernel in method="KM". Required if method="KM".
  • l: Integer grid length for z1 interpolation in method="KM" (default 20).

To explain the method argument more specifically:

  • method="continuous" means you treat $X|C,Z$ as a parametric density f_x_cz. For censored cases ($\delta=0$), the estimating equations integrate over $x\in [w_i,\text{upper}]$ using f_x_cz. It is used when you have a workable model for $f_{X|C,Z}$ and want full precision.

Requires: f_x_cz, upper.

  • method="discrete" means you approximate the same integrals by a finite grid of $x$-values. The conditional density f_x_cz supplies weights on the grid. It is used when numerical integration is inconvenient/unstable, or you prefer fixed-cost sums.

Requires: f_x_cz; optional m, x_grid_range to control the grid.

  • method="KM" means you use a nonparametric working model for $X|Z$: the function builds a Kaplan-Meier estimate of the distribution of $X|Z$. The KM mass function is then used in place of f_x_cz inside the censored-part scores. It is used when you are going to conduct the test for noninformative covariate censoring and have a $(Z_1, Z_2)$ structure where $Z_1$ is continuous and $Z_2$ is discrete.

Requires: h (bandwidth), l ($Z_1$ grid length)
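The difference between method="continuous" and method="discrete" can be sketched on a single censored observation. The code below computes $\int_{w_i}^{\text{upper}} f_{Y|X,Z}(y_i|x,z)\,f_{X|C,Z}(x|w_i,z)\,dx$ once with integrate() and once as a sum over a grid of $x$-values. It is a generic illustration under assumed models ($X\sim N(0,1)$, unit error variance), not the package's internal code; all function names ending in _demo and the two cens_lik_* helpers are hypothetical.

```r
## Assumed working models, using the argument orders listed above
f_y_xz_demo <- function(beta, y_i, x, z_row) {
  dnorm(y_i, mean = beta[1] + beta[2] * x + beta[3] * z_row, sd = 1)
}
f_x_cz_demo <- function(x, c, z_row) dnorm(x, mean = 0, sd = 1)

## "continuous": numerical integration over x in [w_i, upper]
cens_lik_continuous <- function(beta, y_i, w_i, z_row, upper = Inf) {
  integrand <- function(x) {
    f_y_xz_demo(beta, y_i, x, z_row) * f_x_cz_demo(x, w_i, z_row)
  }
  integrate(integrand, lower = w_i, upper = upper)$value
}

## "discrete": Riemann sum over the grid points that lie above w_i
cens_lik_discrete <- function(beta, y_i, w_i, z_row, m = 30,
                              x_grid_range = c(-3, 3)) {
  x_grid <- seq(x_grid_range[1], x_grid_range[2], length.out = m)
  dx <- x_grid[2] - x_grid[1]
  keep <- x_grid > w_i
  sum(f_y_xz_demo(beta, y_i, x_grid[keep], z_row) *
        f_x_cz_demo(x_grid[keep], w_i, z_row)) * dx
}

beta_demo <- c(0, 1, 0)
lik_cont <- cens_lik_continuous(beta_demo, y_i = 1, w_i = 0.2, z_row = 0)
lik_disc <- cens_lik_discrete(beta_demo, y_i = 1, w_i = 0.2, z_row = 0,
                              m = 4001, x_grid_range = c(-8, 8))
c(continuous = lik_cont, discrete = lik_disc)
```

A fine grid is used here so that the two values nearly agree; with a coarse grid such as m = 30 the sum is cheaper but less accurate, which is the trade-off behind the m and x_grid_range arguments.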

The mle function returns a list with the same five items as the cc function.

Semi-Parametric Informative Right-censored covariate Estimator (SPIRE)

To use SPIRE, use the ‘spire’ function in the ‘spire’ package.

In addition to the arguments y, w, delta, z, beta_init, S_beta_fun, f_x_cz, f_y_xz, method, m, x_grid_range, upper, and l, which are already defined in the documentation of the mle function, there are several more arguments in the spire function:

  • h_z1: Bandwidth for KM weighting over $Z_1$ (required when method="KM").
  • h_x: Bandwidth for smoothing KM-estimator of $f_{X|Z}$ to x_grid (required when method="KM").

The spire function returns a list with the same five items as the cc function.
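To give some intuition for a bandwidth such as h_z1, the sketch below implements a generic kernel-weighted Kaplan-Meier estimator in base R: observations whose $Z_1$ is close to a target value z0 receive larger weights, and the weighted KM mass function estimates the distribution of $X$ near $Z_1 = z0$. This is a textbook-style construction written for illustration; the helper weighted_km_mass, the Gaussian kernel, and the toy data are assumptions, and the package's own implementation may differ.

```r
## Kaplan-Meier mass function with observation weights.
## Hypothetical helper, not part of the spire package.
weighted_km_mass <- function(w, delta, weights = rep(1, length(w))) {
  ord <- order(w)
  w <- w[ord]; delta <- delta[ord]; weights <- weights[ord]
  times <- unique(w[delta == 1])          # distinct uncensored values, ascending
  surv <- 1
  mass <- numeric(length(times))
  for (k in seq_along(times)) {
    at_risk <- sum(weights[w >= times[k]])
    events <- sum(weights[w == times[k] & delta == 1])
    hazard <- events / at_risk
    mass[k] <- surv * hazard              # probability mass placed at times[k]
    surv <- surv * (1 - hazard)
  }
  data.frame(x = times, mass = mass)
}

## Toy data (made up) and Gaussian kernel weights around z0 with bandwidth h_z1
w_toy <- c(0.4, 0.9, 1.3, 2.0, 2.6)
delta_toy <- c(1, 0, 1, 1, 1)
z1_toy <- c(-0.5, 0.1, 0.3, 0.8, 1.5)
z0 <- 0.2
h_z1 <- 0.5
kernel_weights <- dnorm((z1_toy - z0) / h_z1)

km_local <- weighted_km_mass(w_toy, delta_toy, kernel_weights)
km_local
```

A smaller bandwidth concentrates the weights on observations with $Z_1$ very close to z0, which reduces bias but increases variability.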

Test for noninformative covariate censoring

To conduct the test for noninformative covariate censoring, use the noninfo_test function in the spire package.

In addition to the six arguments y, w, delta, z, beta_init, and S_beta_fun, which are already defined in the documentation of the cc function, there are several more arguments in the noninfo_test function:

  • mle_args: List of extra arguments that get passed to mle function.
  • cc_args: List of extra arguments that get passed to cc function.
  • ipw_args: List of extra arguments that get passed to ipw function.
  • spire_args: List of extra arguments that get passed to spire function.
  • compare_with: “cc”, “ipw”, or “spire”. Specifies which estimator will be compared against the MLE when testing noninformative covariate censoring. “cc” means comparing the CC estimator with the MLE, “ipw” means comparing the IPW estimator with the MLE, and “spire” means comparing SPIRE with the MLE.
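The *_args arguments are plain R lists of estimator-specific settings. How noninfo_test forwards them internally is not shown here; a common R idiom for this pattern is do.call, sketched below with a stand-in function. demo_fit is hypothetical and is not part of the spire package; only the argument names method and upper are borrowed from the mle documentation above.

```r
## Stand-in for an estimator function; NOT part of the spire package
demo_fit <- function(y, w, delta, method = "continuous", upper = Inf, ...) {
  list(method = method, upper = upper, n = length(y))
}

## Arguments shared by all estimators (made-up toy values)
common_args <- list(y = c(1.2, 0.7, 2.1), w = c(0.5, 1.1, 0.9), delta = c(1, 0, 1))

## Extra, estimator-specific arguments collected in a list
mle_args_demo <- list(method = "discrete")

## Combine both lists and call the function with them
fit <- do.call(demo_fit, c(common_args, mle_args_demo))
fit$method
```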

Workflow

Simulations

All the simulation studies for the paper accompanying this R package can be reproduced using the code in simulations/. Specifically:

  • sim1_run.R: An R script to run simulations comparing methods under misspecification in the normal setting.

  • sim2_run.R: An R script to run simulations comparing methods under misspecification in the beta setting.

  • sim3_run.R: An R script to run simulations calculating the empirical size and power of the test across different settings.

  • sim4_run.R: An R script to run simulations calculating the empirical power of the test in the beta setting.

Synthetic Data Analysis

As mentioned in the paper accompanying this R package, the beta setting uses synthetic data based on the ENROLL-HD data set. Thus, sim2_run.R and sim4_run.R provide the synthetic data analysis for the ENROLL-HD data set.
