README

This repository contains simulated air quality data, covariates, and R code to illustrate the methodology and analysis published in ``Fine-scale spatiotemporal air pollution analysis using mobile monitors on Google Street View vehicles" (add DOI).

Data Files

oakland_data_simulated.csv

Contains simulated Oakland Air Quality data
Variables include:
- Date_Time: Data and time (seconds) of each measurement
- Car_Identifier: identifies which car (A or B) obtained the measurement
- Latitude: latitude of car position at time of measurement
- Longitude: longitude of car position at time of measurement
- Car_Speed: speed of car (km/s)
- NO2: measured nitrogen dioxide (ppm)

Simulation Details

Data are simulated for a subset of 60 days between Sept. 11, 2015 and May 5, 2016. Spatial locations and measurement times are sampled from actual observed car paths, with hour of observation altered to build a sequence of data from 9:00 to 18:00 on each day. Measurements of N02 at the simulated paths and times are constructed as white noise perturbed predictions from the fitted ST model estimated using the actual Google air quality data. The full sequence of simulated data is then subsampled to approximately 1/6 the original size to produce 76,484 total simulated measurements, roughly 10% the size of the actual Oakland data.

Note: the simulated data is meant to help understand how the following code can be used but is not intended to show any interesting results. A interested user should obtain the actual data from Google. The data is freely available upon request here.

Data_Covariates.csv

Contains (real) spatial covariate data for 30m road segments in Oakland.
Variables include:
- ID: unique identifier for each 30m road segment
- Long30m: longitude at the center of the segment
- Lat30m: latitude at the center of the segment
- Elevation_50: mean elevation within a circular buffer of 50 meters
- Total_Road_50: total length of highways, major road, residential roads within a circular 50 meter buffer region
- Hwy_Road_50: highways road length within a circular buffer of 50 meters
- Maj_Road_50: major arterials road length within a circular buffer of 50 meters
- Res_Road_50: residential roads within a circular buffer of 50 meters
- Pop_50: population density (people/sq-km) within a circular buffer 50 meters
- Binary road classification variables:
  - Hwy_Roads: highways
  - Major_Roads: major through road
  - Res_Roads: residential road
  - Local_Trucks: designated heavy-duty truck routes
  - Local_Restricted_Trucks: restricted heavy-duty truck routes
  - Commerical.Zone: commercial zoning
  - Industrial.Zone: industrial zoning
  - Residential.Zone: residential zoning
  - NDVI_50: The average Normalized Difference Vegetative Index within circular buffer of 50 meters
- Variables constructed based on the National Land Cover Database satellite imagery file. Each variable shows the percent of land cover of type within a circular buffer of 50 meters.
  - NLCD_Water_50: water
  - NLCD_DevOpen_50: open areas in developments
  - NLCD_DevLow_50: low developed areas
  - NLCD_DevMed_50: medium developed areas
  - NLCD_DevHigh_50: highly developed areas
  - NLCD_Barren_50: barren
  - NLCD_Deciduous_50: deciduous forest
  - NLCD_Evergreen_50: evergreen forest
  - NLCD_MixForest_50: mixed forest
  - NLCD_Shrub_50: shrublands
  - NLCD_Herbaceous_50: herbaceous
  - NLCD_Pasture_50: pasture
  - NLCD_Crops_50: crops
  - NLCD_WoodyWet_50: woody wetlands
  - NLCD_EmergWet_50: emerging wetlands
  - NLCD_Impervious_50: impervious surface
- Variables representing cumulative exponentially decaying contribution from point sources
  - Distance_to_NPL: National Priority Listing (NPL) sites
  - Distance_to_Rail: railroads
  - Distance_to_TRI: Toxic Release Inventory (TRI) sites
- Variables representing minimum inverse distance to point sources
  - MinDist2Port: all principal port and facility locations
  - MinDist2MainPort: principal port locations
  - Dist2MainAirport: Major airports
  - Dist2Airport: airports

R/C++ Files

Rfunctions_GC.R

contains all R functions to perform estimation and prediction for the STx, ST and S models

cpp_code_GC.cpp

contains helper c++ files to speed up estimation and prediction

run_all_code.R

R script to perform analysis included in the paper on simulated data provided in the Data folder
Calls the follwing helper scripts
- data_setup.R
  - cleans up raw data and performs temporal aggregation at 15 sec and 1 min block medians
  - each aggregated dataset is saved and stored in data_blockmed.Rda
- st_stx_estimation_prediction.R
  - performs parameter estimation and prediction for the ST and STx models
  - results are stored as R objects and saved out in .Rda files
- spatial_only_estimation_prediciton.R
  - performs parameter estimation and prediction for the S (spatial only) model
  - results are stored as R objects and saved out in .Rda files
- 15min_map_forecasts.R
  - creates 15 min ahead spatial map forecasts for the ST model for two days
  - used to recreate maps contained in Figures 5 and 6 of the paper
- rolling_window_estimation.R
  - performs ST and STx model estimation on data in 21-week rolling windows
  - results are stored in .Rda files
- deployment_design.R
  - performs the simulation experiment comparing mspe for mobile vs fixed-location monitors

plotting_code.R

Contains code to construct the plots include in the paper after obtaining all results from run_all_code.R

Reproducing the analysis on simulated data

The R script run_all_code.R illustrates the analysis performed in ``Fine-scale spatiotemporal air pollution analysis using mobile monitors on Google Street View vehicles" using simulated data. If the user obtains the actual Google data, run_all_code.R should recreate the analysis included in the paper.

The following sections walk through the various components of run_all_code.R . Note that this R script and all others assume the data and scripts are all stored in a single, common folder.

Set Parameters

rolling_window = FALSE # should rolling window estimation be performed? (computationally expensive)

type <- 2; # type in 1:3, defines which data product to use 
# 1 == raw data, 2 == 15sec aggregates, 3 == 1min aggregates

j <- 3 # j in 1:4, which estimation lag, index of h <- c(0.02, 5, 15, 60) min
jj <- 3 # jj in 1:3, which size of conditioning set, index of m <- c(10, 30, 60) min

lag <- 21 # lag for moving window 

if(type==1){
  ns <- 100 # if using 1sec data, define the size to subsample the conditioning set in Vecchia
  nsA <- 800 # if using 1sec data, define the nearest neighbor size for Car A prediction
}

nnbs <- 800 # number of nearest neighbors for spatial only prediction

maxit <- 1e3 # maximum number of iterations for Nelder-Mead maximization of Vecchia log-likelihood

Load R Packages

pkg.names <- c("Rcpp", "RcppArmadillo", "doParallel", "RANN", "chron", "fields", 
               "dplyr", "FNN", "ggplot2", "ggmap")

for(i in 1:length(pkg.names)){
  if(pkg.names[i] %in% rownames(installed.packages()) == FALSE) {
    install.packages(pkg.names[i])
    cat("\r", pkg.names[i])
  }
}

lp <- sapply(pkg.names, require, character.only = TRUE)

if(sum(lp)!=length(pkg.names)){
  stop(paste0("Unable to load packages: ", paste(pkg.names[!lp], collapse = ", ")))
}

Source R Functions

source("Rfunctions_GC.R")
sourceCpp("cpp_code_GC.cpp")

Process Raw Data and Perform Temporal Aggregation

source("data_setup.R")
# Necessary files loaded in data_setup.R:
# Data_IDs.csv # road segment ID
# Data_Covariates.csv #GIS covariates
# oakland_data_simulated.csv # simulated Google car dataset
#
# Saves out temporally aggregated datasets in "data_blockmed.Rda"

Model Fitting Setup

########### Set Up Data For Model Fitting ##########
h <- c(0.02, 5, 15, 60) # minutes
m <- c(10, 30, 60) # minutes 


load("data_blockmed.Rda")
df$Y_block_med <- df$Y
df$Car = ifelse(df$Car=="Car_B",2,1)

covar = c("Longitude","Latitude","t",paste0("PC",1:7))

if(type == 1) df_block = df
if(type == 2) df_block = df_block_tseg15sec
if(type == 3) df_block = df_block_tseg1min

keep = c("Y_block_med","locID","Longitude","Latitude","Car","speed","year",
         "month","day","hour","min","sec","wday","t",paste0("PC",1:7),paste0("H",1:4))
df_block[,colnames(df_block)%in%paste0("PC",1:7)] = scale(df_block[,colnames(df_block)%in%paste0("PC",1:7)])
df_block=df_block[order(df_block$t),keep]

regressorname = c()
foo = paste0("(",paste0("PC",1:7,collapse ="+"),")")
for(i in 1:4) regressorname = c(regressorname,paste0("H",i,"*",foo))

PCidx <- which(colnames(df_block)%in%c(paste0("PC",1:7),paste0("H",1:4)))
cov_names <- c(paste0("PC",1:7),paste0("H",1:4))

days = sort(unique(floor(df_block$t)))
df_block$day = floor(df_block$t)

if(rolling_window){
  #  create blocks of two weeks
  wk_block = seq(192,max(days),by = 7*1)
  daysint = findInterval(df_block$day, wk_block)
  df_block$daysint = daysint; 
  df_block$idx = 1:nrow(df_block);
}

Fit Models

source("st_stx_estimation_prediction.R") 
# NOTE: this estimation can take several hours to run, depending on the number of cores used

ST and STx Vecchia estimation and prediction for type, and combination of h (j) and m (jj) specified above. Saves out estimated parameters; 5min, 15min and 60min forecasts; Car A predictions; mspe and correlation in .Rda files.

source("spatial_only_estimation_prediction.R")
# NOTE: this can take several hours to run, depending on the number of cores used.

S model Vecchia approximation estimation and prediction for type, and combination of h (j) and m (jj) specified above. Saves out estimated parameters, spatial predictions, Car A predictions, mspe and correlation in .Rda files

if(rolling_window){
  source("rolling_window_estimation.R")
}

ST and STx Vecchia estimation and prediction using rolling windows with window size lag = 21 weeks. For example, data from week 1-21 are used for parameter estimation, then prediction is made for week 22. This procedure is then repeated for the next window of week 2-22. The rolling window analysis to compare ST and STx Vecchia models is performed in parallel for each window in practice. It is not performed here by default (rolling_window = FALSE) because the example simulated data spans only two months. The results are saved out for each window in .Rda files

Map Forecasts

# run for dayidx = 253 and dayidx = 490 
if(type == 2){
  for(dayidx in c(253, 490)){
    source("15min_map_forecasts.R")
  }
}

Creates full spatial 15-min ahead map forecasts for the two different days and times included in the paper (Figures 5 and 6).

Mobile vs. Stationary Simulation

source("deployment_design.R")

# NOTE: this can take several hours to run, depending on the number of cores used.

Performs a simulation experiment comparing mean squared prediction error (MSPE) for mobile vs fixed-location monitors for short-term forecasting and spatial interpolation. For ncar=1,...,15 and nstation=1,...,15, we sample ncar number of routes from the Google data and randomly select nstation number of locations in the study region as fixed-location monitors. The MSPE is computed for both type of monitors conditional on locations from ncar and nstation of monitors respectively. This procedure is repeated 30 times to obtain uncertainty of the MSPE.

Name		Name	Last commit message	Last commit date
Latest commit History 35 Commits
Code		Code
Data		Data
.gitignore		.gitignore
LICENSE		LICENSE
README.Rmd		README.Rmd
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

README

Data Files

oakland_data_simulated.csv

Simulation Details

Data_Covariates.csv

R/C++ Files

Reproducing the analysis on simulated data

Set Parameters

Load R Packages

Source R Functions

Process Raw Data and Perform Temporal Aggregation

Model Fitting Setup

Fit Models

Map Forecasts

Mobile vs. Stationary Simulation

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

License

majohnso/JASA-mobile-AQ

Folders and files

Latest commit

History

Repository files navigation

README

Data Files

oakland_data_simulated.csv

Simulation Details

Data_Covariates.csv

R/C++ Files

Reproducing the analysis on simulated data

Set Parameters

Load R Packages

Source R Functions

Process Raw Data and Perform Temporal Aggregation

Model Fitting Setup

Fit Models

Map Forecasts

Mobile vs. Stationary Simulation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages