Skip to content

Conversation

@divine7022
Copy link
Collaborator

@divine7022 divine7022 commented Nov 10, 2025

Description

  • enable SA + ensemble size > 1 through input coordination
    workaround: add input_design parameter to write.sa.config to coordinate multiple inputs (met, IC, soil, etc.) across SA runs.
  • enhanced README.txt files to track all inputs used per run
  • fix overwriting issue of sample.Rdata in case of multisite setting;
    workaround : implemented logic to append each site's data to existing samples.Rdata instead of overwriting

Motivation and Context

Review Time Estimate

  • Immediately
  • Within one week
  • When possible

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation.
  • My name is in the list of CITATION.cff
  • I agree that PEcAn Project may distribute my contribution under any or all of
    • the same license as the existing code,
    • and/or the BSD 3-clause license.
  • I have updated the CHANGELOG.md.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Copy link
Member

@infotroph infotroph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! Big picture, I don't follow how appending to samples.Rdata will work for the processes that need to read from it downstream and I wonder how much of the information appended is already captured by the input design. I left some more specific comments inline, but it's worth being sure the overall design is right before worrying about most of them.

# This checks for met.id tag in the settings under sensitivity analysis -
# if it's not there it creates it. Then it's gonna use what it created.
if (is.null(settings$sensitivity.analysis$met.id)) {
settings$sensitivity.analysis$met.id <- 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 to the overall goal of picking one met (/other input) path to use across the whole SA.

But calling this "met.id" is confusing to me -- looks like this is basically an index into the ensemble, whereas I would usually expect "met ID" to mean an identifier for which source (e.g. CRUNCEP ERA5, etc) the met data came from.

Could it instead be something like settings$sensitivity.analysis$met_path and contain the actual filepath to be used by the SA instead of an index that needs to be extracted from the ensemble? That'd be clearer to me, and would also allow a user to pass an arbitrary path to the SA even if it doesn't exist in the ensemble (not that I expect most people to want that, but it could come in handy sometimes)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...But wait! Doesn't input_design handle this for us just below?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question, it duplicates with input_design( new approach )

  • and also one of the reason is to be consistent with the ensemble pattern, so SA doing different (path) would be inconsistent
  • and works when input_design is NOT provided (single-site or legacy workflows)
  • The met.id resolution is indeed redundant when input_design is present. The met.id logic serves as a backward compatibility fallback for workflows that don't use input_design. However, when input_design exists it supersedes met.id and handles ALL inputs (met, IC, soil, etc.) .
  • Alternatively we could skip met.id resolution entirely when input_design exists

short-term fix (This PR):
keep current approach, add clarifying comment:

# select single met path for SA runs without input_design (backward compatibility)
# when input_design is provided (recommended), it supersedes this selection 
# and allows different met files per SA run 
if (is.list(settings$run$inputs$met$path)) { 
if (is.null(settings$sensitivity.analysis$met.id)) { 
settings$sensitivity.analysis$met.id <- 1 
} 
settings$run$inputs$met$path <- settings$run$inputs$met$path[[settings$sensitivity.analysis$met.id]] 
}

Long-term (Further iteration):
Document input_design as recommended approach , consider deprecating met.id in favor of input_design.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re met path: You've talked me out of the idea, especially since it would need to be site-specific -- so my proposal wouldn't need just one settings$sensitivity.analysis$met_path, it'd be at least a settings$sensitivity.analysis$<site>$met_path for every site. I do still suggest naming it something like met.index instead of met.id.

Copy link
Member

@infotroph infotroph Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re input design: I'll defer to @dlebauer but my sense is that since we've bought into making input_design mandatory for other parts of the workflow and it's specifically designed to be where we keep all the pieces of the sampling design together in one place, it may be sensible to make it mandatory here too rather than duplicate its functionality.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do still suggest naming it something like met.index instead of met.id

that's renaming to met.index would be nice clarity, but i am afraid of breaking backward compatibility for existing workflows may have <met.id> , and a direct rename would silently ignore those settings ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in favor of having a consistent input_design object. It is hard to see an advantage of different approaches if one generalize solution is available.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious - where is met.id used? I can't find the string in either the PEcAn repository or in xml settings files used by CCMMF or Dongchen's workflows.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dlebauer it's only used immediately below to set the met$path.

I agree with the consensus to drop the met.id solution proposed here and just use input_design. If one wants to only vary parameters and hold everything else fixed it's really easy to generate an input_design that's all 1's for everything other than parameters. The met.id "solution" isn't really a solution anyways as it only fixes met ensemble members and not the other ensembled inputs (phenology, soil physics, initial conditions, etc.)

# combine parameter samples for same PFT across sites
ensemble.samples[[pft_name]] <- rbind(
existing_env$ensemble.samples[[pft_name]],
ensemble.samples[[pft_name]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not seeing how one figures out after the fact which samples were used for which site

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The site information is preserved through the run id's; Each run id contains the site id
(e.g, ENS-00001-1000004925 --- > site 1000004925), and this becomes the output folder name. Downstream functions don't parse the site id, they just iterate through all run id's and read from folders named with those id's.

# site 1 : 
runs.samples$ensemble <- data.frame( 
id = c("ENS-00001-1000004925", # site id
"ENS-00002-1000004925") 
) 
ensemble.samples$temperate <- data.frame( 
growth_resp = c(0.28, 0.31) # rows 1-2 
) 

# after site 2 merging: 
runs.samples$ensemble <- data.frame( 
id = c("ENS-00001-1000004925", # site 1, run 1 
"ENS-00002-1000004925", # site 1, run 2 
"ENS-00001-1000004928", # site 2, run 1 <-- Different site id
"ENS-00002-1000004928") # site 2, run 2 
) 
ensemble.samples$temperate <- data.frame( 
growth_resp = c(0.28, 0.31, 0.33, 0.30) # rows match run id's 
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case I don't think ensemble.samples should grow at all -- The point of passing in an input design is to enforce that every run with a given ensemble member uses the same value of growth_resp (and every other parameter too), so ensemble.samples needs to stay n_ens lines long. I guess one could repeat the same set of values for each site, but that seems VERY inefficient.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@infotroph @mdietze @dlebauer want to make sure that i got right logic with the merging approach ?

Explained in comments

  PEcAn.logger::logger.info("###### Finished writing model run config files #####")
  PEcAn.logger::logger.info("config files samples in ", file.path(settings$outdir, "run"))
  
  # -----------------------------------------------------------------------
  # multisite merging logic
  # -----------------------------------------------------------------------
  # In multisite ensemble analysis, we maintain parameter
  # consistency across sites while allowing run tracking to grow. 
  # This ensures that the same parameter samples are applied across all sites,
  # enabling valid comparison of model performance under different environmental
  # conditions while tracking site-specific run execution.
  # -----------------------------------------------------------------------
  
  samples.file <- file.path(settings$outdir, "samples.Rdata")
  
  if (file.exists(samples.file)) {
    PEcAn.logger::logger.info("Merging with existing samples.Rdata to support multisite run...")
    
    # load existing data
    existing_env <- new.env()
    load(samples.file, envir = existing_env)
    
    # Ensemble parameters must remain identical across sites to answer:
    # How do the SAME parameter sets perform under DIFFERENT conditions ?
    
    if (!is.null(existing_env$ensemble.samples)) {
      # does the existing list have names ?
      # this validates we have a properly structured ensemble.samples object
      if (length(names(existing_env$ensemble.samples)) > 0) {
        # use existing parameter set to maintain ensemble consistency
        # all sites use identical parameter samples for ensemble analysis
        ensemble.samples <- existing_env$ensemble.samples
      } else {
        # existing parameter structure is invalid - use current site's parameters
        PEcAn.logger::logger.warn("Existing ensemble.samples structure is invalid. Using current site parameters as new baseline.")
      }
    }
    
    # maintain consistency in trait and sensitivity analysis samples
    if (!is.null(existing_env$trait.samples)) {
      if (length(names(existing_env$trait.samples)) > 0) {
        trait.samples <- existing_env$trait.samples
      }
    }
    if (!is.null(existing_env$sa.samples)) {
      if (length(names(existing_env$sa.samples)) > 0) {
        sa.samples <- existing_env$sa.samples
      }
    }
    
    # Run IDs append (this SHOULD grow)
    # ---------------------------------------------------------------------
    # Run IDs accumulate to track which parameter sets execute at which sites
    # eg. with 2 ens members across 2 sites:
    # - site 1: ENS-00001-1000004925, ENS-00002-1000004925
    # - site 2: ENS-00001-1000004928, ENS-00002-1000004928  
    # - final: 4 run IDs tracking same 2 parameter sets across 2 sites
    # ---------------------------------------------------------------------
    
    if (!is.null(existing_env$runs.samples)) {
      
      # ensemble runs tracking (data frame -> rbind)
      # rbind here because run IDs must accumulate across sites
      if (!is.null(runs.samples$ensemble)) {
        if (!is.null(existing_env$runs.samples$ensemble)) {
          # check we are not rbinding garbage. Check for 'id' column.
          if("id" %in% names(existing_env$runs.samples$ensemble)) {
            runs.samples$ensemble <- rbind(existing_env$runs.samples$ensemble, runs.samples$ensemble)
            # run IDs grow to track which sites use which ensemble members
          }
        }
      } else if (!is.null(existing_env$runs.samples$ensemble)) {
        runs.samples$ensemble <- existing_env$runs.samples$ensemble
      }
      
      # SA runs tracking (list structure -> modifyList)
      if (!is.null(runs.samples$sa)) {
        if (!is.null(existing_env$runs.samples$sa)) {
          runs.samples$sa <- utils::modifyList(existing_env$runs.samples$sa, runs.samples$sa)
        }
      } else if (!is.null(existing_env$runs.samples$sa)) {
        runs.samples$sa <- existing_env$runs.samples$sa
      }
    }
    
    # metadata (pft names, trait names) should merge to create 
    # coverage across all sites while avoiding duplication
    # ---------------------------------------------------------------------
    
    if (!is.null(existing_env$pft.names)) {
      pft.names <- unique(c(existing_env$pft.names, pft.names))
    }
    if (!is.null(existing_env$trait.names)) {
      if (length(names(existing_env$trait.names)) > 0) {
        trait.names <- utils::modifyList(existing_env$trait.names, trait.names)
      }
    }
  }
  
  save(ensemble.samples, trait.samples, sa.samples, runs.samples, pft.names, trait.names,
       file = samples.file
  )
  
  PEcAn.logger::logger.info("parameter values for runs in ", samples.file)
  options(scipen = scipen)
  invisible(settings)
  return(settings)
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems overly complicated. get.param.samples should generate one matrix of parameters per PFT and currently there is not within-PFT variability in parameter samples across sites (though this may change in the future). For a particular site, the PFT and input_design should determine which parameter matrix you use, and which row in that matrix you select, respectively. There should be no need to rbind matrices.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Mike here and am leaning toward an even stronger claim that run.write.configs should not be rewriting samples.Rdata at all. Its contents are all either generated once by get.parameter.samples before run.write.configs starts (ensemble.samples, sa.samples, trait.samples) or are not actually samples and can be taken from runs.txt (runs.samples)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not actually samples and can be taken from runs.txt (runs.samples)

does the downstream analysis expect runs.samples to be in samples.Rdata ?

Copy link
Collaborator Author

@divine7022 divine7022 Nov 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into read.ensemble.output and read.sa.output to verify if we can stop writing samples.Rdata, but those functions explicitly load samples.Rdata to find the run ids :
ens.run.ids <- samples$runs.samples$ensemble
sa.run.ids <- samples$runs.samples$sa

they do not currently have logic to parse runs.txt

so here we can ---
1 ) reload the existing parameters (avoid duplication/growth) but I must append the run ids to runs.samples and save the file. (run.write.cofigs)
2 ) or refactor downstream to read runs.txt ( does this feel like to be a separate PR ? )

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't know what you mean by "stop writing samples.Rdata". No one is saying this isn't needed, we're saying it should already exist before run.write.configs is called (it is now generated by the new input design function, which should be called before the write configs functions) and it shouldn't need to be modified.

Only start model runs should rely on runs.txt. There's not enough information in there for anything else downstream

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't know what you mean by "stop writing samples.Rdata".

I was responding specifically to chris's suggestion that runs.samples might be redundant cuz that info

"can be taken from runs.txt".

I was verifying if run.write.configs could stop appending even the new run ids(runs.samples) to samples.Rdata ( relaying on runs.txt instead) , but downstream (read.ensemble.output , .. ) depends on finding those ids in samples.Rdata.

But it's clear now

Only start model runs should rely on runs.txt.

# write run information to disk

# Apply input design coordination for SA runs
settings_copy <- settings
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's confusing that below here some values are taken from settings and others from settings_copy. Is there actually a need for two separate objects or could these mutations be applied straight to settings?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SA loop without settings_copy:

for (trait in traits) {
  for (quantile in quantiles) {
    settings$run$inputs$met$path <- input_paths[[run_index]]  # mutates
    ...
    # write config
    write.config.SIPNET(settings, ...)
    # next iteration uses MUTATED settings; 
    # quantile 2 sees met path from quantile 1
  }
}

  • so by having settings_copy it modify copy only (settings_copy$run$inputs$met$path <- input_paths[[run_index]]) and write.config.* gets clean settings each iteration.
  • If we mutated settings directly, run 2's config would inherit run 1's modified input paths, causing incorrect configs.

The answer for question: why settings is used for some values ?
cat("model id : ", settings$model$id, "\n") # never changes across runs
cat("site id : ", settings$run$site$id, "\n") # same for all SA runs

BUT inputs must come from settings_copy:
do.call(my.write.config, settings = settings_copy) # has resolved input paths

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but the path is updated earlier in the loop than the write.config call, so by the time write.config gets called for run 2 the path from run 1 has already been overwritten by the path from run 2. As long as you're sure that every iteration will update the same fields of the settings object, it doesn't matter whether it started from a "clean" object or not.

Similarly, if you are using settings_copy then it also contains all the fields from settings that aren't updated each iteration, so there's definitely no reason to interleave uses of both.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it makes sense to use settings_copy instead of settings. It seems this only concerns the README block that pulls fields from the original settings. If we just switch those cat() calls to read from settings_copy, it’ll be obvious that each iteration uses one clean settings object end-to-end.

I've proposed this change below in https://github.com/PecanProject/pecan/pull/3660/files/f075e4d7d0f9ad85ee340bbe01f7dca464988677#r2561591261

@dlebauer dlebauer self-requested a review November 13, 2025 18:33
# combine parameter samples for same PFT across sites
ensemble.samples[[pft_name]] <- rbind(
existing_env$ensemble.samples[[pft_name]],
ensemble.samples[[pft_name]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case I don't think ensemble.samples should grow at all -- The point of passing in an input design is to enforce that every run with a given ensemble member uses the same value of growth_resp (and every other parameter too), so ensemble.samples needs to stay n_ens lines long. I guess one could repeat the same set of values for each site, but that seems VERY inefficient.

Comment on lines +240 to +241
if (!is.null(existing_data$trait.samples)) {
trait.samples <- utils::modifyList(existing_data$trait.samples, trait.samples)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trait.samples ought to be generated exactly once at the beginning of setup (to define the set of posteriors that ensemble.samples is selected from) and not differ from site to site. Since it may contain MCMC results with potentially as many as millions of samples per trait, I really don't think we want to be duplicating it here.

# write run information to disk

# Apply input design coordination for SA runs
settings_copy <- settings
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but the path is updated earlier in the loop than the write.config call, so by the time write.config gets called for run 2 the path from run 1 has already been overwritten by the path from run 2. As long as you're sure that every iteration will update the same fields of the settings object, it doesn't matter whether it started from a "clean" object or not.

Similarly, if you are using settings_copy then it also contains all the fields from settings that aren't updated each iteration, so there's definitely no reason to interleave uses of both.

Comment on lines 423 to 439
"workflow id : ", workflow.id, "\n",
"ensemble id : ", ensemble.id, "\n",
"pft name : ", names(trait.samples)[i], "\n",
"pft name : ", names(trait.samples)[pft_idx], "\n",
"quantile : ", quantile.str, "\n",
"trait : ", trait, "\n",
"run id : ", run.id, "\n",
"model : ", model, "\n",
"model id : ", settings$model$id, "\n",
"site : ", settings$run$site$name, "\n",
"site id : ", settings$run$site$id, "\n",
"met data : ", settings$run$site$met, "\n",
sa_input_info,
"start date : ", settings$run$start.date, "\n",
"end date : ", settings$run$end.date, "\n",
"hostname : ", settings$host$name, "\n",
"rundir : ", file.path(settings$host$rundir, run.id), "\n",
"outdir : ", file.path(settings$host$outdir, run.id), "\n",
file = file.path(settings$rundir, run.id, "README.txt"),
Copy link
Member

@dlebauer dlebauer Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should address the suggestion in #3660 (comment) to avoid mixing use of settings and settings_copy once the copy is made.

Suggested change
"workflow id : ", workflow.id, "\n",
"ensemble id : ", ensemble.id, "\n",
"pft name : ", names(trait.samples)[i], "\n",
"pft name : ", names(trait.samples)[pft_idx], "\n",
"quantile : ", quantile.str, "\n",
"trait : ", trait, "\n",
"run id : ", run.id, "\n",
"model : ", model, "\n",
"model id : ", settings$model$id, "\n",
"site : ", settings$run$site$name, "\n",
"site id : ", settings$run$site$id, "\n",
"met data : ", settings$run$site$met, "\n",
sa_input_info,
"start date : ", settings$run$start.date, "\n",
"end date : ", settings$run$end.date, "\n",
"hostname : ", settings$host$name, "\n",
"rundir : ", file.path(settings$host$rundir, run.id), "\n",
"outdir : ", file.path(settings$host$outdir, run.id), "\n",
file = file.path(settings$rundir, run.id, "README.txt"),
"workflow id : ", workflow.id, "\n",
"ensemble id : ", ensemble.id, "\n",
"pft name : ", names(trait.samples)[pft_idx], "\n",
"quantile : ", quantile.str, "\n",
"trait : ", trait, "\n",
"run id : ", run.id, "\n",
"model : ", model, "\n",
"model id : ", settings_copy$model$id, "\n",
"site : ", settings_copy$run$site$name, "\n",
"site id : ", settings_copy$run$site$id, "\n",
sa_input_info,
"start date : ", settings_copy$run$start.date, "\n",
"end date : ", settings_copy$run$end.date, "\n",
"hostname : ", settings_copy$host$name, "\n",
"rundir : ", file.path(settings_copy$host$rundir, run.id), "\n",
"outdir : ", file.path(settings_copy$host$outdir, run.id), "\n",
file = file.path(settings_copy$rundir, run.id, "README.txt"),

run.id))
cat(
run.id,
file = file.path(settings$rundir, "runs.txt"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
file = file.path(settings_copy$rundir, "runs.txt"),

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants