@dlebauer
Contributor

@dlebauer dlebauer commented Feb 3, 2025

First draft of code to select design points, downscale SIPNET output to fields, and aggregate to county level summaries.

This PR implements the core functionalities for the MVP of Phase 1b (Scale up Statewide Perennial Woody Crop Inventory).

Key features include:

  • Anchor sites: CSV file of 25 locations for calibration / validation
  • Data Preparation: extract environmental covariates for each LandIQ parcel and join anchor sites to LandIQ parcels
  • Design Point Selection: Integration of k-means clustering and anchor site merging to generate design points.
  • Output Extraction: extract outputs from multi-site ensemble SIPNET runs.
  • Downscaling & Aggregation: Application of Random Forest to downscale SIPNET outputs and aggregate predictions to county-level inventories.
  • Documentation & Outputs: Summarize workflow and results in 04_downscaling_documentation_results.qmd. This report can be reviewed by downloading and uncompressing
    04_downscaling_documentation_results 2.html.zip (updated 2025-04-15)
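
To illustrate the design point selection step, here is a minimal base R sketch, assuming that design points are taken as the parcel nearest each k-means centroid and then merged with the anchor sites. The PR's scripts may select within clusters differently. The data, column names (`temp`, `precip`, `clay`), the value of `k`, and the anchor IDs are all synthetic placeholders, not values from this repository.

```r
# Minimal sketch: choose design points by k-means clustering of covariates.
# All data, column names, and IDs below are synthetic placeholders.
set.seed(42)

parcels <- data.frame(
  parcel_id = seq_len(200),
  temp      = rnorm(200, mean = 17, sd = 3),
  precip    = rnorm(200, mean = 500, sd = 150),
  clay      = runif(200, min = 5, max = 45)
)
covariate_cols <- c("temp", "precip", "clay")

# Scale covariates so no single variable dominates the distance metric.
x <- scale(parcels[, covariate_cols])

k <- 10
fit <- kmeans(x, centers = k, nstart = 10)

# Euclidean distance from each parcel to the centroid of its own cluster.
dist_to_center <- sqrt(rowSums((x - fit$centers[fit$cluster, ])^2))

# For each cluster, keep the row index of the parcel closest to the centroid.
design_idx <- vapply(
  seq_len(k),
  function(cl) {
    members <- which(fit$cluster == cl)
    members[which.min(dist_to_center[members])]
  },
  integer(1)
)

# Merge in anchor sites (hypothetical parcel IDs) and drop duplicates.
anchor_ids <- c(3L, 57L, 120L)
design_ids <- union(parcels$parcel_id[design_idx], anchor_ids)
```

Picking the parcel nearest each centroid is deterministic given the clustering. Random sampling within clusters is an alternative that trades reproducibility for better coverage of within-cluster variation.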

Note: I don't expect this to run flawlessly from beginning to end, but it should work if install_github is used to install the PEcAn packages (assim.sequential and data.land) from specific branches, and if the input data in this repository that is on geo.bu.edu is used.

Related: This code uses branches from the dlebauer/pecan repository in two places. This includes:

Review Requests

I am primarily interested in feedback on the logic of the workflows and correctness of implementation.
This PR focuses on MVP requirements for Phase 1b. When providing suggestions please distinguish what is required for phase 1b MVP and what should or could be included in future iterations.

@dlebauer changed the title from "First draft of 1b deliverables" to "[WIP] First draft of 1b deliverables" on Feb 3, 2025
Contributor Author

@dlebauer dlebauer left a comment

Still halfway through, but I've finished fixing the 00 and 01 scripts

#' TODO: move a copy of these files to data_dir
#'
## ----load-soilgrids-----------------------------------------------------------
soilgrids_north_america_clay_tif <- '/projectnb/dietzelab/dongchen/anchorSites/NA_runs/soil_nc/soilgrids_250m/clay/clay_0-5cm_mean/clay/clay_0-5cm_mean.tif'
Contributor Author

(thus the TODO on line 123), but ideally running the workflows wouldn't require changing the scripts. I'm still wondering what the best way is to handle all of these machine-specific paths. Perhaps a bespoke workflow-level settings file?
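
One possible shape for such a settings file, sketched in base R. This is only an illustration of the idea, not something in this PR: the helper `read_workflow_settings()`, the DCF file format, and the keys `data_dir` and `soilgrids_clay_tif` are all hypothetical, and the paths are placeholders.

```r
# Hypothetical helper: read machine-specific paths from a small settings file
# so that scripts do not need to be edited on each machine.
read_workflow_settings <- function(path) {
  if (!file.exists(path)) {
    stop("Settings file not found: ", path)
  }
  fields <- read.dcf(path)  # parses "key: value" lines into a character matrix
  as.list(fields[1, ])      # first (only) record as a named list
}

# Example: write a throwaway settings file and read it back.
# Keys and paths are placeholders, not real locations.
settings_file <- tempfile(fileext = ".dcf")
writeLines(
  c("data_dir: /path/to/data",
    "soilgrids_clay_tif: /path/to/clay_0-5cm_mean.tif"),
  settings_file
)

settings <- read_workflow_settings(settings_file)
soilgrids_clay_tif <- settings$soilgrids_clay_tif
```

Each script would then call the helper once at the top, and the settings file itself could stay out of version control, with a committed template alongside it.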

Comment on lines +220 to +223
#' ### Check Clustering
#'
## ----check-clustering---------------------------------------------------------
# Summarize clusters
Contributor Author

These comment styles are the result of using knitr::purl when I converted from Rmd or qmd to an R script. Although I don't have immediate plans, the idea is that this could still be knit into a useful document. But yes this will clean up a lot of redundant comments!

@@ -0,0 +1,10 @@
# This file will load any time the future package is loaded.
Contributor

Thanks for moving this out of the script; it's definitely an improvement. I do have some reservations about this approach too:

  • Having it as a dotfile makes it harder to know it's there. I can imagine myself trying to run one script while not having any future-specific details at the front of my mind and being baffled trying to work out where the "using cores" message was coming from.
  • Goes against general Git advice that when the point of a file is to contain user-specific values, it's usually better not to commit it even if there's nothing sensitive/secret about the values it stores.
  • I assume you're intending this to set one plan for the project and have all runs use it. Will there be times that's not true (e.g. are some steps IO-bound such that throwing more cores at them causes slowdown rather than speedup)?

Contributor Author

> having it as a dotfile ...

This filename is recommended in the docs. We can change it; I'm trying to reduce the amount of reused code.

> when the point of a file is to contain user-specific values

How are these user-specific?

> set one plan for the project

Could be moved so that it is workflow-specific.

> Will there be times that's not true

There could be, in which case a script could temporarily change the plan.
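
A minimal sketch of what temporarily changing the plan inside a script could look like, assuming the future package is installed. The wrapper `with_sequential_plan()` is a hypothetical name, not a function from this PR or from future. It relies on `plan()` returning the previous plan when a new one is set.

```r
library(future)

# Hypothetical wrapper: run one step under a sequential plan, then restore
# whatever plan was active before (for example, the one set in .future.R).
with_sequential_plan <- function(step) {
  old_plan <- plan(sequential)          # switch plan and remember the old one
  on.exit(plan(old_plan), add = TRUE)   # restore on exit, even after an error
  step()
}

# Example: an IO-bound step that should not be spread across workers.
result <- with_sequential_plan(function() {
  value(future(sum(1:10)))
})
```

Restoring with `on.exit()` keeps the change scoped to the one step, so later steps still pick up the project-wide plan.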

Contributor

> How are these user-specific?

I'm channeling the standard frustrating-but-persuasive-to-me advice that "the only safe default is one core until the user explicitly says otherwise." Whether or not future's available core detection is robust, it's still not a given that the user wants to use all or all-but-one of them, and similarly the user's choice of plan ("multisession"/"multicore"/"cluster"/etc.) is one that we can't really predict. I get that everything here is overridable, but I'm starting from the belief that it's better to leave this entirely to the user rather than assume anything.

To resolve this thread I'm OK with merging .future.R and revisiting in the next iteration, but we should put it where it will be useful:

  • future only checks the working directory and doesn't walk up the parent tree, so I think this should be placed in downscaling/ rather than the repo root, yeah?

BTW, a caution I discovered while testing: .future.R is actually only read when the package is attached, e.g. compare Rscript -e 'future::plan()' to Rscript -e 'library(future); plan()'
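
For the next iteration, a conservative `downscaling/.future.R` along these lines would follow the "one core until the user says otherwise" advice. This is a sketch of one option, not the contents of the file in this PR. It assumes the future package is installed, and `DOWNSCALING_WORKERS` is a hypothetical environment variable name.

```r
# Sketch of an opt-in .future.R: sequential unless the user asks for more.
# DOWNSCALING_WORKERS is a hypothetical environment variable.
requested <- suppressWarnings(
  as.integer(Sys.getenv("DOWNSCALING_WORKERS", unset = "1"))
)
if (is.na(requested) || requested < 1L) {
  requested <- 1L  # fall back to one worker on missing or invalid input
}

# Never request more workers than the cores future reports as available.
workers <- min(requested, future::availableCores())

if (workers > 1L) {
  future::plan(future::multisession, workers = workers)
} else {
  future::plan(future::sequential)
}
message("future plan set with ", workers, " worker(s)")
```

This sketch always uses `multisession` when more than one worker is requested, so it still leaves the choice of backend unaddressed. A user who wants "multicore" or "cluster" would need to call `plan()` themselves.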

Contributor

@infotroph infotroph left a comment

Thank you for all the refinements! I think the remaining conversations can be for future improvement.

Last two requests before merging: Please move .future.R into downscaling/ and rename 03_downscale_and_agre... to 03_downscale_and_aggre..
