-
Notifications
You must be signed in to change notification settings - Fork 277
Multisite SA support with input coordination and README tracking #3660
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
…g issue of sample.Rdata for multisite setting
…t to track all inputs used per run
infotroph
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this! Big picture, I don't follow how appending to samples.Rdata will work for the processes that need to read from it downstream and I wonder how much of the information appended is already captured by the input design. I left some more specific comments inline, but it's worth being sure the overall design is right before worrying about most of them.
| # This checks for met.id tag in the settings under sensitivity analysis - | ||
| # if it's not there it creates it. Then it's gonna use what it created. | ||
| if (is.null(settings$sensitivity.analysis$met.id)) { | ||
| settings$sensitivity.analysis$met.id <- 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 to the overall goal of picking one met (/other input) path to use across the whole SA.
But calling this "met.id" is confusing to me -- looks like this is basically an index into the ensemble, whereas I would usually expect "met ID" to mean an identifier for which source (e.g. CRUNCEP ERA5, etc) the met data came from.
Could it instead be something like settings$sensitivity.analysis$met_path and contain the actual filepath to be used by the SA instead of an index that needs to be extracted from the ensemble? That'd be clearer to me, and would also allow a user to pass an arbitrary path to the SA even if it doesn't exist in the ensemble (not that I expect most people to want that, but it could come in handy sometimes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...But wait! Doesn't input_design handle this for us just below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good question, it duplicates with input_design( new approach )
- and also one of the reason is to be consistent with the ensemble pattern, so SA doing different (path) would be inconsistent
- and works when
input_designis NOT provided (single-site or legacy workflows) - The met.id resolution is indeed redundant when input_design is present. The met.id logic serves as a backward compatibility fallback for workflows that don't use
input_design. However, wheninput_designexists it supersedes met.id and handles ALL inputs (met, IC, soil, etc.) . - Alternatively we could skip met.id resolution entirely when
input_designexists
short-term fix (This PR):
keep current approach, add clarifying comment:
# select single met path for SA runs without input_design (backward compatibility)
# when input_design is provided (recommended), it supersedes this selection
# and allows different met files per SA run
if (is.list(settings$run$inputs$met$path)) {
if (is.null(settings$sensitivity.analysis$met.id)) {
settings$sensitivity.analysis$met.id <- 1
}
settings$run$inputs$met$path <- settings$run$inputs$met$path[[settings$sensitivity.analysis$met.id]]
}
Long-term (Further iteration):
Document input_design as recommended approach , consider deprecating met.id in favor of input_design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re met path: You've talked me out of the idea, especially since it would need to be site-specific -- so my proposal wouldn't need just one settings$sensitivity.analysis$met_path, it'd be at least a settings$sensitivity.analysis$<site>$met_path for every site. I do still suggest naming it something like met.index instead of met.id.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re input design: I'll defer to @dlebauer but my sense is that since we've bought into making input_design mandatory for other parts of the workflow and it's specifically designed to be where we keep all the pieces of the sampling design together in one place, it may be sensible to make it mandatory here too rather than duplicate its functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do still suggest naming it something like
met.indexinstead ofmet.id
that's renaming to met.index would be nice clarity, but i am afraid of breaking backward compatibility for existing workflows may have <met.id> , and a direct rename would silently ignore those settings ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm in favor of having a consistent input_design object. It is hard to see an advantage of different approaches if one generalize solution is available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am curious - where is met.id used? I can't find the string in either the PEcAn repository or in xml settings files used by CCMMF or Dongchen's workflows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dlebauer it's only used immediately below to set the met$path.
I agree with the consensus to drop the met.id solution proposed here and just use input_design. If one wants to only vary parameters and hold everything else fixed it's really easy to generate an input_design that's all 1's for everything other than parameters. The met.id "solution" isn't really a solution anyways as it only fixes met ensemble members and not the other ensembled inputs (phenology, soil physics, initial conditions, etc.)
| # combine parameter samples for same PFT across sites | ||
| ensemble.samples[[pft_name]] <- rbind( | ||
| existing_env$ensemble.samples[[pft_name]], | ||
| ensemble.samples[[pft_name]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not seeing how one figures out after the fact which samples were used for which site
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The site information is preserved through the run id's; Each run id contains the site id
(e.g, ENS-00001-1000004925 --- > site 1000004925), and this becomes the output folder name. Downstream functions don't parse the site id, they just iterate through all run id's and read from folders named with those id's.
# site 1 :
runs.samples$ensemble <- data.frame(
id = c("ENS-00001-1000004925", # site id
"ENS-00002-1000004925")
)
ensemble.samples$temperate <- data.frame(
growth_resp = c(0.28, 0.31) # rows 1-2
)
# after site 2 merging:
runs.samples$ensemble <- data.frame(
id = c("ENS-00001-1000004925", # site 1, run 1
"ENS-00002-1000004925", # site 1, run 2
"ENS-00001-1000004928", # site 2, run 1 <-- Different site id
"ENS-00002-1000004928") # site 2, run 2
)
ensemble.samples$temperate <- data.frame(
growth_resp = c(0.28, 0.31, 0.33, 0.30) # rows match run id's
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case I don't think ensemble.samples should grow at all -- The point of passing in an input design is to enforce that every run with a given ensemble member uses the same value of growth_resp (and every other parameter too), so ensemble.samples needs to stay n_ens lines long. I guess one could repeat the same set of values for each site, but that seems VERY inefficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@infotroph @mdietze @dlebauer want to make sure that i got right logic with the merging approach ?
Explained in comments
PEcAn.logger::logger.info("###### Finished writing model run config files #####")
PEcAn.logger::logger.info("config files samples in ", file.path(settings$outdir, "run"))
# -----------------------------------------------------------------------
# multisite merging logic
# -----------------------------------------------------------------------
# In multisite ensemble analysis, we maintain parameter
# consistency across sites while allowing run tracking to grow.
# This ensures that the same parameter samples are applied across all sites,
# enabling valid comparison of model performance under different environmental
# conditions while tracking site-specific run execution.
# -----------------------------------------------------------------------
samples.file <- file.path(settings$outdir, "samples.Rdata")
if (file.exists(samples.file)) {
PEcAn.logger::logger.info("Merging with existing samples.Rdata to support multisite run...")
# load existing data
existing_env <- new.env()
load(samples.file, envir = existing_env)
# Ensemble parameters must remain identical across sites to answer:
# How do the SAME parameter sets perform under DIFFERENT conditions ?
if (!is.null(existing_env$ensemble.samples)) {
# does the existing list have names ?
# this validates we have a properly structured ensemble.samples object
if (length(names(existing_env$ensemble.samples)) > 0) {
# use existing parameter set to maintain ensemble consistency
# all sites use identical parameter samples for ensemble analysis
ensemble.samples <- existing_env$ensemble.samples
} else {
# existing parameter structure is invalid - use current site's parameters
PEcAn.logger::logger.warn("Existing ensemble.samples structure is invalid. Using current site parameters as new baseline.")
}
}
# maintain consistency in trait and sensitivity analysis samples
if (!is.null(existing_env$trait.samples)) {
if (length(names(existing_env$trait.samples)) > 0) {
trait.samples <- existing_env$trait.samples
}
}
if (!is.null(existing_env$sa.samples)) {
if (length(names(existing_env$sa.samples)) > 0) {
sa.samples <- existing_env$sa.samples
}
}
# Run IDs append (this SHOULD grow)
# ---------------------------------------------------------------------
# Run IDs accumulate to track which parameter sets execute at which sites
# eg. with 2 ens members across 2 sites:
# - site 1: ENS-00001-1000004925, ENS-00002-1000004925
# - site 2: ENS-00001-1000004928, ENS-00002-1000004928
# - final: 4 run IDs tracking same 2 parameter sets across 2 sites
# ---------------------------------------------------------------------
if (!is.null(existing_env$runs.samples)) {
# ensemble runs tracking (data frame -> rbind)
# rbind here because run IDs must accumulate across sites
if (!is.null(runs.samples$ensemble)) {
if (!is.null(existing_env$runs.samples$ensemble)) {
# check we are not rbinding garbage. Check for 'id' column.
if("id" %in% names(existing_env$runs.samples$ensemble)) {
runs.samples$ensemble <- rbind(existing_env$runs.samples$ensemble, runs.samples$ensemble)
# run IDs grow to track which sites use which ensemble members
}
}
} else if (!is.null(existing_env$runs.samples$ensemble)) {
runs.samples$ensemble <- existing_env$runs.samples$ensemble
}
# SA runs tracking (list structure -> modifyList)
if (!is.null(runs.samples$sa)) {
if (!is.null(existing_env$runs.samples$sa)) {
runs.samples$sa <- utils::modifyList(existing_env$runs.samples$sa, runs.samples$sa)
}
} else if (!is.null(existing_env$runs.samples$sa)) {
runs.samples$sa <- existing_env$runs.samples$sa
}
}
# metadata (pft names, trait names) should merge to create
# coverage across all sites while avoiding duplication
# ---------------------------------------------------------------------
if (!is.null(existing_env$pft.names)) {
pft.names <- unique(c(existing_env$pft.names, pft.names))
}
if (!is.null(existing_env$trait.names)) {
if (length(names(existing_env$trait.names)) > 0) {
trait.names <- utils::modifyList(existing_env$trait.names, trait.names)
}
}
}
save(ensemble.samples, trait.samples, sa.samples, runs.samples, pft.names, trait.names,
file = samples.file
)
PEcAn.logger::logger.info("parameter values for runs in ", samples.file)
options(scipen = scipen)
invisible(settings)
return(settings)
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems overly complicated. get.param.samples should generate one matrix of parameters per PFT and currently there is not within-PFT variability in parameter samples across sites (though this may change in the future). For a particular site, the PFT and input_design should determine which parameter matrix you use, and which row in that matrix you select, respectively. There should be no need to rbind matrices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Mike here and am leaning toward an even stronger claim that run.write.configs should not be rewriting samples.Rdata at all. Its contents are all either generated once by get.parameter.samples before run.write.configs starts (ensemble.samples, sa.samples, trait.samples) or are not actually samples and can be taken from runs.txt (runs.samples)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not actually samples and can be taken from runs.txt (runs.samples)
does the downstream analysis expect runs.samples to be in samples.Rdata ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked into read.ensemble.output and read.sa.output to verify if we can stop writing samples.Rdata, but those functions explicitly load samples.Rdata to find the run ids :
ens.run.ids <- samples$runs.samples$ensemble
sa.run.ids <- samples$runs.samples$sa
they do not currently have logic to parse runs.txt
so here we can ---
1 ) reload the existing parameters (avoid duplication/growth) but I must append the run ids to runs.samples and save the file. (run.write.cofigs)
2 ) or refactor downstream to read runs.txt ( does this feel like to be a separate PR ? )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't know what you mean by "stop writing samples.Rdata". No one is saying this isn't needed, we're saying it should already exist before run.write.configs is called (it is now generated by the new input design function, which should be called before the write configs functions) and it shouldn't need to be modified.
Only start model runs should rely on runs.txt. There's not enough information in there for anything else downstream
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't know what you mean by "stop writing samples.Rdata".
I was responding specifically to chris's suggestion that runs.samples might be redundant cuz that info
"can be taken from runs.txt".
I was verifying if run.write.configs could stop appending even the new run ids(runs.samples) to samples.Rdata ( relaying on runs.txt instead) , but downstream (read.ensemble.output , .. ) depends on finding those ids in samples.Rdata.
But it's clear now
Only start model runs should rely on runs.txt.
| # write run information to disk | ||
|
|
||
| # Apply input design coordination for SA runs | ||
| settings_copy <- settings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's confusing that below here some values are taken from settings and others from settings_copy. Is there actually a need for two separate objects or could these mutations be applied straight to settings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SA loop without settings_copy:
for (trait in traits) {
for (quantile in quantiles) {
settings$run$inputs$met$path <- input_paths[[run_index]] # mutates
...
# write config
write.config.SIPNET(settings, ...)
# next iteration uses MUTATED settings;
# quantile 2 sees met path from quantile 1
}
}
- so by having
settings_copyit modify copy only (settings_copy$run$inputs$met$path <- input_paths[[run_index]]) andwrite.config.*gets clean settings each iteration. - If we mutated
settingsdirectly, run 2's config would inherit run 1's modified input paths, causing incorrect configs.
The answer for question: why settings is used for some values ?
cat("model id : ", settings$model$id, "\n") # never changes across runs
cat("site id : ", settings$run$site$id, "\n") # same for all SA runs
BUT inputs must come from settings_copy:
do.call(my.write.config, settings = settings_copy) # has resolved input paths
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, but the path is updated earlier in the loop than the write.config call, so by the time write.config gets called for run 2 the path from run 1 has already been overwritten by the path from run 2. As long as you're sure that every iteration will update the same fields of the settings object, it doesn't matter whether it started from a "clean" object or not.
Similarly, if you are using settings_copy then it also contains all the fields from settings that aren't updated each iteration, so there's definitely no reason to interleave uses of both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it makes sense to use settings_copy instead of settings. It seems this only concerns the README block that pulls fields from the original settings. If we just switch those cat() calls to read from settings_copy, it’ll be obvious that each iteration uses one clean settings object end-to-end.
I've proposed this change below in https://github.com/PecanProject/pecan/pull/3660/files/f075e4d7d0f9ad85ee340bbe01f7dca464988677#r2561591261
| # combine parameter samples for same PFT across sites | ||
| ensemble.samples[[pft_name]] <- rbind( | ||
| existing_env$ensemble.samples[[pft_name]], | ||
| ensemble.samples[[pft_name]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case I don't think ensemble.samples should grow at all -- The point of passing in an input design is to enforce that every run with a given ensemble member uses the same value of growth_resp (and every other parameter too), so ensemble.samples needs to stay n_ens lines long. I guess one could repeat the same set of values for each site, but that seems VERY inefficient.
| if (!is.null(existing_data$trait.samples)) { | ||
| trait.samples <- utils::modifyList(existing_data$trait.samples, trait.samples) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trait.samples ought to be generated exactly once at the beginning of setup (to define the set of posteriors that ensemble.samples is selected from) and not differ from site to site. Since it may contain MCMC results with potentially as many as millions of samples per trait, I really don't think we want to be duplicating it here.
| # write run information to disk | ||
|
|
||
| # Apply input design coordination for SA runs | ||
| settings_copy <- settings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, but the path is updated earlier in the loop than the write.config call, so by the time write.config gets called for run 2 the path from run 1 has already been overwritten by the path from run 2. As long as you're sure that every iteration will update the same fields of the settings object, it doesn't matter whether it started from a "clean" object or not.
Similarly, if you are using settings_copy then it also contains all the fields from settings that aren't updated each iteration, so there's definitely no reason to interleave uses of both.
| "workflow id : ", workflow.id, "\n", | ||
| "ensemble id : ", ensemble.id, "\n", | ||
| "pft name : ", names(trait.samples)[i], "\n", | ||
| "pft name : ", names(trait.samples)[pft_idx], "\n", | ||
| "quantile : ", quantile.str, "\n", | ||
| "trait : ", trait, "\n", | ||
| "run id : ", run.id, "\n", | ||
| "model : ", model, "\n", | ||
| "model id : ", settings$model$id, "\n", | ||
| "site : ", settings$run$site$name, "\n", | ||
| "site id : ", settings$run$site$id, "\n", | ||
| "met data : ", settings$run$site$met, "\n", | ||
| sa_input_info, | ||
| "start date : ", settings$run$start.date, "\n", | ||
| "end date : ", settings$run$end.date, "\n", | ||
| "hostname : ", settings$host$name, "\n", | ||
| "rundir : ", file.path(settings$host$rundir, run.id), "\n", | ||
| "outdir : ", file.path(settings$host$outdir, run.id), "\n", | ||
| file = file.path(settings$rundir, run.id, "README.txt"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should address the suggestion in #3660 (comment) to avoid mixing use of settings and settings_copy once the copy is made.
| "workflow id : ", workflow.id, "\n", | |
| "ensemble id : ", ensemble.id, "\n", | |
| "pft name : ", names(trait.samples)[i], "\n", | |
| "pft name : ", names(trait.samples)[pft_idx], "\n", | |
| "quantile : ", quantile.str, "\n", | |
| "trait : ", trait, "\n", | |
| "run id : ", run.id, "\n", | |
| "model : ", model, "\n", | |
| "model id : ", settings$model$id, "\n", | |
| "site : ", settings$run$site$name, "\n", | |
| "site id : ", settings$run$site$id, "\n", | |
| "met data : ", settings$run$site$met, "\n", | |
| sa_input_info, | |
| "start date : ", settings$run$start.date, "\n", | |
| "end date : ", settings$run$end.date, "\n", | |
| "hostname : ", settings$host$name, "\n", | |
| "rundir : ", file.path(settings$host$rundir, run.id), "\n", | |
| "outdir : ", file.path(settings$host$outdir, run.id), "\n", | |
| file = file.path(settings$rundir, run.id, "README.txt"), | |
| "workflow id : ", workflow.id, "\n", | |
| "ensemble id : ", ensemble.id, "\n", | |
| "pft name : ", names(trait.samples)[pft_idx], "\n", | |
| "quantile : ", quantile.str, "\n", | |
| "trait : ", trait, "\n", | |
| "run id : ", run.id, "\n", | |
| "model : ", model, "\n", | |
| "model id : ", settings_copy$model$id, "\n", | |
| "site : ", settings_copy$run$site$name, "\n", | |
| "site id : ", settings_copy$run$site$id, "\n", | |
| sa_input_info, | |
| "start date : ", settings_copy$run$start.date, "\n", | |
| "end date : ", settings_copy$run$end.date, "\n", | |
| "hostname : ", settings_copy$host$name, "\n", | |
| "rundir : ", file.path(settings_copy$host$rundir, run.id), "\n", | |
| "outdir : ", file.path(settings_copy$host$outdir, run.id), "\n", | |
| file = file.path(settings_copy$rundir, run.id, "README.txt"), |
| run.id)) | ||
| cat( | ||
| run.id, | ||
| file = file.path(settings$rundir, "runs.txt"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| file = file.path(settings_copy$rundir, "runs.txt"), |
Description
workaround: add
input_designparameter towrite.sa.configto coordinate multiple inputs (met, IC, soil, etc.) across SA runs.sample.Rdatain case of multisite setting;workaround : implemented logic to append each site's data to existing
samples.Rdatainstead of overwritingMotivation and Context
Review Time Estimate
Types of changes
Checklist: