
Conversation

@dlebauer (Contributor) commented Sep 23, 2025

Update downscaling to support multiple PFTs including mixed systems, with demonstration of orchard + ground cover. Combine reports into webpage and publish to gh-pages.

  • New Utilities & Modular Scripts
    • Update and rename combine_mixed_crops.R for mixed‐PFT aggregation
    • New ggsave_optimized.R to streamline plot saving; use svg and webp to reduce size and increase loading speed
    • Generalize helper match_site_ids_by_location.R to match site‐IDs based on lat/lon
  • Workflow Documentation & Site
    • Update README.md with Quarto build/publish instructions
    • Add _quarto.yml, index.qmd, and expanded docs (docs/, reports/) for a full Quarto‐powered website
    • Rename and streamline docs/workflow_documentation.md for clarity
  • Configuration & Environment
    • Update .Rprofile for consistent repository messaging
    • Update 000-config.R, including using argparse to select development vs. production modes
    • Auto-source R/ scripts

dlebauer and others added 29 commits September 23, 2025 01:00
…uture.R (set in config); use all outputs in production to aid in development of multi-crop workflow
…add change since start as delta_<variable> to standard output; aggregate output to monthly means
no longer load renv in .Rprofile; hard code path to renv libraries in .Renviron
Revise mixed_system_prototype.qmd for clarity and to fix incorrect labels.
Co-authored-by: Chris Black <chris@ckblack.org>
…on density and stock and mask counties that cumulatively contribute <1% of total (consider this below detection limit given uncertainty)
…y and stock, save variable importance results; add logging and memory tracking
…qmd` from `downscaling_report.qmd`

- Updated `040_downscale.R` to improve naming conventions and ensure training covariate matrices are saved correctly.
- Refactor `041_aggregate_to_county.R` to streamline the aggregation process and improve readability.
- Enhance `042_downscale_analysis.R` to load predictions and covariates, compute variable importance, and generate plots for partial dependence, ALE, and ICE effects using the `iml` package.
- Improved logging.
@dlebauer changed the title from "Mixed downscaling" to "Downscaling mixed systems" on Sep 23, 2025
#' For California analyses, use "EPSG:3310" (NAD83 / California Albers).
#' @param map_all logical. If TRUE, compute nearest distances for all target rows.
#' If FALSE, only compute nearest distances for IDs missing from reference;
#' matched-by-ID rows are returned with distance 0.
@dlebauer (Contributor, Author) commented Oct 30, 2025
Define `max_distance` in the docs.

@divine7022 left a comment
Thanks! LGTM overall; the way you are thinking about scaling is excellent.
I have dropped a few inline comments below (mostly validation improvements). All non-blocking; feel free to take a look and address them in this PR or a follow-up.

# ca_field_attributes
# )

# TODO: Need to put a canonical design_points CSV in repository

Good that this temporary approach is documented with ### FOR NOW.
Training an RF on only 10 design points with 7 predictors, then predicting on all California agricultural fields without validation, is statistically dangerous: for very small datasets (<100 samples), Random Forest is prone to severe overfitting.

} else if (!is.null(mdl$y)) {
y_train <- mdl$y
}
r2_oob <- extract_oob_r2(mdl, y_train)
@divine7022 commented Nov 7, 2025
OOB samples are still drawn from the same 10 sites, so OOB error doesn't test spatial extrapolation to new locations and can give a falsely optimistic R² with small samples.
Here I would lean toward spatial cross-validation.
This is the key validation gap regardless of design point count.
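To make the suggestion concrete, here is a minimal leave-one-site-out CV sketch (one simple form of spatial cross-validation). The data and the lm() model are stand-ins; the real workflow would refit the randomForest on each training fold:

```r
# Leave-one-site-out CV: hold out all rows from one site, fit on the rest,
# and score predictions at the held-out site. Synthetic data for illustration.
set.seed(1)
dat <- data.frame(
  site_id = rep(1:10, each = 3),
  x1 = rnorm(30),
  x2 = rnorm(30)
)
dat$y <- 2 * dat$x1 - dat$x2 + rnorm(30, sd = 0.1)

loso_rmse <- sapply(unique(dat$site_id), function(s) {
  train <- dat[dat$site_id != s, ]
  test  <- dat[dat$site_id == s, ]
  fit   <- lm(y ~ x1 + x2, data = train)  # stand-in for the RF fit
  sqrt(mean((predict(fit, test) - test$y)^2))
})
mean(loso_rmse)  # honest estimate of skill at unseen locations
```

Comparing this held-out-site RMSE against the OOB-based score would show how optimistic the OOB R² is.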

dp <- dplyr::bind_rows(dp, mixed_df_all)
}
}

@divine7022 commented Nov 7, 2025
Is there a plan to validate the "woody + annual" combination against real mixed-crop field data?
I am wondering whether we have observational data (e.g., from orchards with cover crops) to check that predicted carbon increments are realistic.
Could we add a sanity check comparing predicted (woody + annual) / woody ratios against published ranges?

Thoughts on whether this validation is in scope for this PR or a future iteration?
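A rough sketch of that ratio sanity check; the 1.0 and 1.3 bounds and the data here are placeholders, not published values:

```r
# Flag predictions whose (woody + annual)/woody ratio falls outside a
# plausible range. Data and bounds are illustrative only.
pred <- data.frame(
  woody    = c(50, 60),   # woody-only carbon
  combined = c(55, 90)    # woody + annual carbon
)
ratio <- pred$combined / pred$woody
out_of_range <- ratio < 1.0 | ratio > 1.3
sum(out_of_range)  # number of implausible mixed-system predictions
```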


ens_members_by_county <- ens_county_preds |>
dplyr::group_by(model_output, pft, county) |>
dplyr::summarize(n_vals = dplyr::n_distinct(total_c_Mg), .groups = "drop")

This checks whether all counties have the same number of ensemble members, but is there a case where the values are identical and we end up with no variability? I suspect we should also check that the values are actually different.
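Something like this base-R sketch (hypothetical data) would catch groups whose values are all identical and therefore carry no ensemble spread:

```r
# Count distinct values per county; a single distinct value means the
# "ensemble" is degenerate even if the member count looks right.
ens <- data.frame(
  county = rep(c("Yolo", "Kern"), each = 3),
  total_c_Mg = c(10, 10, 10, 5, 6, 7)  # Yolo's members are identical
)
spread_check <- aggregate(total_c_Mg ~ county, data = ens,
                          FUN = function(v) length(unique(v)))
degenerate <- spread_check$county[spread_check$total_c_Mg < 2]
degenerate  # counties with no between-member variability
```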

)
}

county_summaries <- ens_county_preds |>


If I'm understanding this correctly: when aggregating ensemble predictions, this calculates sd(total_c_Tg), which is the between-ensemble variability, but why does it ignore within-county spatial heterogeneity?

par(mar = c(5, 5, 4, 2))

randomForest::partialPlot(rf,
pred.data = design_covariates, x.var = top_predictors[1], ylim = yl,


`yl` is not defined. Either remove `ylim` and let R auto-scale the y-axis, or calculate `ylim` from the training response values:

y_range <- range(rf$y, na.rm = TRUE)
yl <- c(floor(min(y_range)), ceiling(max(y_range)))

dev.off()
PEcAn.logger::logger.info("Saved importance/PDP figure:", importance_partial_plot_fig)
}


The script currently creates PDPs first and ALE plots afterward.
This order is backwards, since the covariates (clay, temp, precip) are highly correlated.
PDPs fail with correlated features because they marginalize over unrealistic or physically impossible feature combinations, while ALE plots remain unbiased and valid in such cases.

The script already generates ALE plots, but they’re treated as secondary.
Suggestion: swap the priority, generate ALE plots first, and use PDPs only if features are uncorrelated.

#' @param map_all logical. If TRUE, compute nearest distances for all target rows.
#' If FALSE, only compute nearest distances for IDs missing from reference;
#' matched-by-ID rows are returned with distance 0.
#' @return a tibble with mapping and distances (same number of rows as target_df)


Suggested change
#' @return a tibble with mapping and distances (same number of rows as target_df)
#' @return a tibble with one row per target row, containing:
#' \describe{
#' \item{target_site_id}{Original target site ID}
#' \item{matched_site_id}{Matched reference site ID (may differ from target if not found by ID)}
#' \item{target_lat, target_lon}{Original target coordinates}
#' \item{ref_lat, ref_lon}{Matched reference coordinates}
#' \item{distance_m}{Distance between target and matched reference (meters)}
#' \item{close}{Proximity classification: "same location", "very close", "close", "moderate", "far"}
#' }
#' Rows are returned in the same order as target_df.

paste(setdiff(req_ref, colnames(reference_df)), collapse = ", ")
)
}


Nit: there is no check for duplicate target site_ids. Could target_df contain duplicate site_ids with different coordinates?
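One possible guard; the data and column names here are hypothetical, matching the function's documented defaults:

```r
# Detect target site_ids that appear more than once with conflicting
# coordinates, which would make the ID-based match ambiguous.
targ <- data.frame(
  site_id = c(100, 100, 200),
  lat = c(40.0, 40.5, 38.0),  # site 100 has two different locations
  lon = c(-120, -120, -121)
)
dup_conflict <- any(tapply(paste(targ$lat, targ$lon), targ$site_id,
                           function(coords) length(unique(coords)) > 1))
dup_conflict  # TRUE: at least one ID maps to multiple locations
```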

@divine7022

scripts/021_clustering_diagnostics.R

dplyr::sample_n(min(nrow(sites_clustered), 1000)) |>

Consider setting a seed here for reproducibility, and potentially migrating to slice_sample() since sample_n() is superseded.

@divine7022

ca_fields_pts <- ca_fields |> 
sf::st_as_sf(coords = c("lon", "lat"), crs = 4326) 

Is ca_fields already an sf object (from line 5)? If so, this conversion will fail.

ca_fields_pts <- ca_fields |>
 sf::st_centroid() |>
 sf::st_transform(crs = 4326)

This makes the intent explicit.

@divine7022

sites_clustered <- readRDS(file.path(cache_dir, "sites_clustered.rds")) 

No validation that clustering was successful or that the file exists.
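A fail-fast sketch of what that validation could look like; stop() stands in for PEcAn.logger::logger.severe, and the expected column names are assumptions:

```r
# Validate the cached clustering result before using it.
path <- file.path(tempdir(), "sites_clustered.rds")  # stand-in for the cache_dir path
saveRDS(data.frame(site_id = 1:3, cluster = c(1, 1, 2)), path)

read_clusters <- function(path) {
  if (!file.exists(path)) {
    stop("clustering cache not found: ", path)
  }
  out <- readRDS(path)
  if (!all(c("site_id", "cluster") %in% names(out))) {
    stop("sites_clustered.rds is missing expected columns")
  }
  out
}

sites_clustered <- read_clusters(path)
```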

@divine7022

divine7022 commented Nov 7, 2025

scripts/031_aggregate_sipnet_output.R

ens_combined <- ens_wide |> 
dplyr::rowwise() |> 
dplyr::mutate( 
value_combined = combine_value(...) 
) 

rowwise() is slow for large datasets; consider using purrr::pmap() here.
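For comparison, a base-R mapply() version avoids the per-row grouping overhead (purrr::pmap() would look similar); combine_value here is a toy stand-in, not the project's function:

```r
# Apply combine_value across columns without rowwise(): mapply() iterates
# the column vectors in lockstep, so no per-row grouping is created.
combine_value <- function(woody, annual, f) woody + f * annual  # toy stand-in
ens_wide <- data.frame(woody = c(10, 20), annual = c(2, 4), f = 0.5)

ens_wide$value_combined <- mapply(
  combine_value,
  ens_wide$woody, ens_wide$annual, ens_wide$f
)
ens_wide$value_combined
```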

@divine7022

annual_init = dplyr::case_when( 
# TODO: think through this 
# we want to add AGB to the ecosystem level value 
# for SOC, we only want the diff 
# this probably isn't the best place to store this logic 
# also, 
variable == "AGB" ~ 0, 
variable == "TotSoilCarb" ~ value 
) 

Why treat AGB and SOC differently?

@divine7022

dplyr::filter(mix_description == "100% woody + 50% annual") 

Why is 50% ground cover the "canonical" scenario for EFI output?

@divine7022

C_combined = C_woody + f × C_annual 

The current approach assumes linear additivity, but don't mixed systems have non-linear interactions?

@infotroph (Contributor) left a comment

Just realized I wrote these comments ages ago and they've been sitting unposted -- sorry for the wait.

"- model_outdir. : {model_outdir}\n\n",
"### Other Settings ###\n",
"- will extract variables: {paste(outputs_to_extract, collapse = ', ')}\n",
"- ca_albers_crs : {ca_albers_crs}{if(ca_albers_crs == 3310) ', which is NAD83 / California Albers' else ''}\n"

Nits:

  • This seems like a lot of custom logic for an output that's only helpful in one case. Would terra::crs(ca_albers_crs, describe = TRUE)$name help?
  • Between the variable name and the special handling of 3310 in the summary, I'm hearing an implicit assumption that the projection will be Albers of at least some flavor. Does the code rely on that? If yes consider enforcing more vigorously than just notes in the printed summary, if no then probably clearer to drop "albers" from the variable name.

Comment on lines +23 to +24
#' @param woody_value numeric. Pool size for the woody PFT (kg/m2).
#' @param annual_value numeric. Pool size for the annual PFT (kg/m2).

Does this work on truly any pool, or only select ones? I think I knew this at one point but can't remember on a cold reread, so probably worth documenting.

#' @param reference_lat_col character. Latitude column in reference_df (default "lat")
#' @param reference_lon_col character. Longitude column in reference_df (default "lon")
#' @param crs character. Coordinate reference system of input points (default "EPSG:4326").
#' For California analyses, use "EPSG:3310" (NAD83 / California Albers).

I read this as recommending the user pass crs = "EPSG:3310" for any input located in CA, but that will be incorrect any time the locations are given in degrees lat/lon.

As ?terra::crs says (their emphasis):

set the CRS of a dataset (if it does not come with one) to what it is, not to what you would like it to be. See project() to transform an object from one CRS to another.

So when the input units are angular degrees the CRS should be geographic (as 4326 is), not projected (as 3310 is). Only use 3310 if both targ and ref really do give their lat and lon in units of meters from the CA Albers origin point.

If the CRS is set wrong, the distance calculations will be (probably very) incorrect, and could even make it match the wrong site:

> targ <- data.frame(lat = 40, lon = -120, site_id = 100)
> ref <- data.frame(lat = c(40.0, 40.001), lon = c(-120.0011, -120.0), site_id = 1:2)
> match_site_ids_by_location(targ, ref, crs = "epsg:4326")
2025-11-05 13:40:48.787131 WARN   [match_site_ids_by_location.R#65: PEcAn.logger::logger.warn] : 
   1 target sites not in reference by ID; matching by nearest location. 
# A tibble: 1 × 8
  target_site_id matched_site_id target_lat target_lon ref_lat ref_lon distance_m
           <dbl>           <int>      <dbl>      <dbl>   <dbl>   <dbl>      <dbl>
1            100               1         40       -120      40   -120.       93.9
# ℹ 1 more variable: close <chr>
> match_site_ids_by_location(targ, ref, crs = "epsg:3310")
2025-11-05 13:40:54.977141 WARN   [match_site_ids_by_location.R#65: PEcAn.logger::logger.warn] : 
   1 target sites not in reference by ID; matching by nearest location. 
# A tibble: 1 × 8
  target_site_id matched_site_id target_lat target_lon ref_lat ref_lon distance_m
           <dbl>           <int>      <dbl>      <dbl>   <dbl>   <dbl>      <dbl>
1            100               2         40       -120    40.0    -120    0.00100
# ℹ 1 more variable: close <chr>

@@ -0,0 +1,209 @@
#' Match site IDs by geographic location
#'
#' Matches site IDs from a target data.frame to the nearest reference site IDs based on latitude/longitude.

High level impression: If I'm following the code right, this function tries to match first by ID without considering location and then by location for sites whose IDs didn't match -- sort of a fuzzy join between two datasets assumed to be drawn from the same set of IDs and almost the same set of points. But this documentation seems to present it as a match by location first (or only?), and expecting that left me confused for my first few test runs. I think in general I'd want to do an id-then-location join as an explicitly two-step process.

My immediate use case for this function would be to generate mappings between datasets that are already known to have different reference and target site IDs. In those cases I want every target point to match the closest reference location, even if there's a further reference point whose ID collides with the target's ID. As far as I can tell that's not available with the current design: Even with map_all = TRUE, the returned match will be to the matching ID not the nearest location, yeah?

#'
#' @param target_df data.frame
#' @param reference_df data.frame
#' @inheritParams match_site_ids_by_location

Nit: Roxygen will probably complain about conflicting definitions of target_df and reference_df (the others come in from inheritParams) and generate a confusing Rd file from them

Comment on lines +8 to +12
reference <- data.frame(
site_id = c("A", "B"),
lat = c(34.0, 35.0),
lon = c(-118.0, -119.0)
)

Nit: If these are supposed to really be identical, copy one so readers don't have to play spot the difference

Suggested change
reference <- data.frame(
site_id = c("A", "B"),
lat = c(34.0, 35.0),
lon = c(-118.0, -119.0)
)
reference <- target

dlebauer and others added 4 commits November 30, 2025 19:02
Co-authored-by: Chris Black <chris@ckblack.org>
Co-authored-by: Chris Black <chris@ckblack.org>
Co-authored-by: Chris Black <chris@ckblack.org>
Comment on lines +8 to +9
type = "logical", default = TRUE,
help = "Set to true for production mode, false for faster development (default: FALSE)"

default/doc mismatch?

Suggested change
type = "logical", default = TRUE,
help = "Set to true for production mode, false for faster development (default: FALSE)"
type = "logical", default = TRUE,
help = "Set to true for production mode, false for faster development (default: TRUE)"

@dlebauer (Contributor, Author) left a comment

Consider that canopy_cover != fraction of impact. This is a model-specific assumption that should be integrated into write.configs.SIPNET.

out_of_range_woody <- (woody_cover < 0 - tol) | (woody_cover > 1 + tol)
if (any(out_of_range_annual | out_of_range_woody, na.rm = TRUE)) {
n_bad <- sum(out_of_range_annual | out_of_range_woody, na.rm = TRUE)
PEcAn.logger::logger.severe(paste0(n_bad, " rows have cover fractions outside [0,1] (<U+00B1>tol)."))

Suggested change
PEcAn.logger::logger.severe(paste0(n_bad, " rows have cover fractions outside [0,1] (<U+00B1>tol)."))
PEcAn.logger::logger.severe("weighted: cover fractions must be in the range [0,1] (+/- tol);", n_bad, "rows violate.")

not_one <- abs(woody_cover - 1) > tol
if (any(not_one, na.rm = TRUE)) {
n_bad <- sum(not_one, na.rm = TRUE)
PEcAn.logger::logger.severe(paste0("incremental: woody_cover must be 1 (+/- ", tol, "); ", n_bad, " rows violate."))

Why are errors phrased differently?

I've rearranged the sentences. For incremental, woody must be 100%.

Suggested change
PEcAn.logger::logger.severe(paste0("incremental: woody_cover must be 1 (+/- ", tol, "); ", n_bad, " rows violate."))
PEcAn.logger::logger.severe("incremental: woody_cover must be 1 (+/- ", tol, "); ", n_bad, " rows violate.")

PEcAn.logger::logger.severe(paste0("incremental: woody_cover must be 1 (+/- ", tol, "); ", n_bad, " rows violate."))
}
res <- woody_value + annual_cover * (annual_value - annual_init)
return(as.numeric(res))

Suggested change
return(as.numeric(res))
return(res)
