Automate testing of specific cases #698

Draft: forsyth2 wants to merge 1 commit into base: main

Conversation

forsyth2 (Collaborator)

Summary

Refactor image check testing to be more robust.

Objectives:

  • Automate the process of launching the zppy jobs not just for the comprehensive/weekly tests, but for all the min-case tests as well. Initial implementation ideas are in Automate zppy tests #520.
  • Have tests collect data on the lists of missing and mismatched images, such that they can be automatically summarized in a Markdown table. Initial implementation ideas are in Automate organization of test results #695.
  • Print the lists of missing and mismatched images to files in the image_check_failures directories, so that they are easily accessible (a sketch of this summarization follows this list).
  • Automatically summarize mismatched image diffs in a grid, as initially demonstrated in Script to generate image diff grid #685.
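As a rough illustration of the last two objectives, here is a minimal sketch (all names and paths are hypothetical, not existing zppy code) of writing the failure lists into an image_check_failures directory and rendering per-cfg image counts as a Markdown table:

```python
# Hypothetical sketch of the summarization step; not existing zppy code.
import os
from typing import Dict, List


def write_image_check_failures(directory: str, missing: List[str], mismatched: List[str]) -> None:
    """Write the missing/mismatched image lists to files for later inspection."""
    os.makedirs(directory, exist_ok=True)
    for filename, images in (("missing_images.txt", missing), ("mismatched_images.txt", mismatched)):
        with open(os.path.join(directory, filename), "w") as f:
            f.write("\n".join(images) + "\n")


def markdown_summary(results: Dict[str, Dict[str, List[str]]]) -> str:
    """Render per-cfg image counts as a Markdown table.

    `results` maps a cfg name to lists keyed by "total", "missing", "mismatched".
    """
    lines = [
        "| cfg | total images | missing | mismatched |",
        "| --- | --- | --- | --- |",
    ]
    for cfg, r in sorted(results.items()):
        lines.append(
            f"| {cfg} | {len(r['total'])} | {len(r['missing'])} | {len(r['mismatched'])} |"
        )
    return "\n".join(lines)
```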

Issue resolution:

Select one: This pull request is...

  • a bug fix: increment the patch version
  • a small improvement: increment the minor version
  • a new feature: increment the minor version
  • an incompatible (non-backwards compatible) API change: increment the major version

Big Change

  • To merge, I will use "Create a merge commit". That is, this change is large enough to require multiple units of work (i.e., it should be multiple commits).

1. Does this do what we want it to do?

Required:

  • Product Management: I have confirmed with the stakeholders that the objectives above are correct and complete.
  • Testing: I have added or modified at least one "min-case" configuration file to test this change. Every objective above is represented in at least one cfg.
  • Testing: I have considered likely and/or severe edge cases and have included them in testing.

If applicable:

  • Testing: this pull request introduces an important feature or bug fix that we must test often. I have updated the weekly-test configuration files, not just a "min-case" one.
  • Testing: this pull request adds at least one new possible parameter to the cfg. I have tested using this parameter with and without any other parameter that may interact with it.

2. Are the implementation details accurate & efficient?

Required:

  • Logic: I have visually inspected the entire pull request myself.
  • Logic: I have left GitHub comments highlighting important pieces of code logic. I have had these code blocks reviewed by at least one other team member.

If applicable:

  • Dependencies: This pull request introduces a new dependency. I have discussed this requirement with at least one other team member. The dependency is noted in zppy/conda, not just an import statement.

3. Is this well documented?

Required:

  • Documentation: by looking at the docs, a new user could easily understand the functionality introduced by this pull request.

4. Is this code clean?

Required:

  • Readability: The code is as simple as possible and well-commented, such that a new team member could understand what's happening.
  • Pre-commit checks: All the pre-commit checks have passed.

If applicable:

  • Software architecture: I have discussed relevant trade-offs in design decisions with at least one other team member. It is unlikely that this pull request will increase tech debt.

forsyth2 added the Testing (Files in `tests` modified) label on Mar 27, 2025
forsyth2 self-assigned this on Mar 27, 2025
forsyth2 (Collaborator, Author)

Why is this important?

While the weekly/comprehensive tests cover many cases, one or two cfg files cannot truly cover zppy's myriad parameter options and combinations. We need to test many more cases.

In #697, we saw that while errors in the weekly tests had been resolved, some use cases of parameters had gone untested. In particular, those noted in #516 (comment).

In addition to resolving #516, these changes would address the "Challenges to easily displaying test output" section of #634 (comment).

Current TODO list:

  • Set up automated bash script to build conda environments for e3sm_diags and zppy_interfaces, and launch zppy jobs for ALL min-case tests. Potential concerns:
    • I was running into issues getting the conda environments to work in Automate zppy tests #520, but I think I was just missing a source line.
    • Some min-case tests have a "part 2" that needs to be run after "part 1" runs.
    • Is there any way to know all the IDs of the jobs that are launched and then wait for them all to complete? (One possible approach is sketched after this list.) If not, there will always be the manual step of repeatedly checking squeue output until determining that the jobs have completed and the tests themselves can be run.
    • Just running zppy for the weekly/comprehensive tests launches many jobs. Will launching an even larger number of jobs (to cover all the min-case cfg files) overwhelm a job quota? Is it realistic to run this many jobs on one machine, let alone 3 (chrysalis, perlmutter, compy), for every pull request, or even every week?
  • Set up expected results for the min-case cfg files too. Potential concerns:
    • How much output can I keep on /lcrc/group/e3sm/public_html/zppy_test_resources/ versus /lcrc/group/e3sm/ac.forsyth2/zppy_min_case_resources/?
  • Automate the creation of an image diff grid, as in Script to generate image diff grid #685.
  • Update all min-case cfg files to use common prefixes/suffixes for output and www. In particular, grouping paths at the highest level by the unique_id, so tests run on the same code are grouped together in the file structure.
  • Add a check_images line in the test for every min-case cfg
  • Once everything is working for the min-case tests, port the weekly/comprehensive tests over to using the new structure as well.
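Regarding the job-tracking question above, a minimal sketch of one possible approach, assuming Slurm is available. The function names are hypothetical; zppy submits its batch jobs itself, so in practice the job IDs would need to be parsed from zppy's output or from the generated run scripts' .status/.o files:

```python
# Sketch only (assumes Slurm); helper names are hypothetical.
import subprocess
import time
from typing import List


def submit(batch_scripts: List[str]) -> List[str]:
    """Submit each script with sbatch and return the Slurm job IDs."""
    job_ids = []
    for script in batch_scripts:
        # --parsable makes sbatch print just the job ID (plus cluster name, if any).
        result = subprocess.run(
            ["sbatch", "--parsable", script], capture_output=True, text=True, check=True
        )
        job_ids.append(result.stdout.strip().split(";")[0])
    return job_ids


def wait_for_jobs(job_ids: List[str], poll_seconds: int = 60) -> None:
    """Poll squeue until none of the given jobs are still queued or running."""
    while True:
        result = subprocess.run(
            ["squeue", "-h", "-j", ",".join(job_ids)], capture_output=True, text=True
        )
        if not result.stdout.strip():
            return  # no jobs left in the queue; safe to run the image checks
        time.sleep(poll_seconds)
```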

forsyth2 mentioned this pull request on Mar 28, 2025
forsyth2 (Collaborator, Author)

> Just running zppy for the weekly/comprehensive tests launches many jobs.

A path forward here is to make a list of specific cases that need to be tested and then, as much as possible, combine them into a minimal set of cfg files. E.g., if we need feature X in climo and feature Y in ts to be tested, those could easily be done in one cfg. However, if we need to test feature Z 3 different ways in global_time_series, then we'll necessarily need 3 cfg files -- notably because that task, like some others, doesn't have subtasks.

That is, we don't need to automate every "min-case" cfg, we just need to create a minimal testing set of cfg files.

forsyth2 (Collaborator, Author) commented on Apr 1, 2025

  • Update existing test_weekly.py cfg files to run on fewer years and then update expected results accordingly. (That way we know any diffs are solely because of the reduction in years.)
  • Make a version of test_weekly.py that tests the existing test_weekly.py cfg files, but all in one test function that keeps track of the missing/mismatched images, etc., and then 1) prints image counts to a Markdown table and 2) prints the lists of failing files to files that can be looked at later (as opposed to just printing to the command line); see the sketch after this list. [EDIT: Improve image checker #699, Update test names to reflect new image checker #700, Further image checker updates #701]
  • Add more cfg files -- enough to cover important test cases, but combining such cases as much as possible. We'll end up testing more than the current number of cfgs, but fewer than every "min-case" cfg.
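A self-contained sketch of that "one test function" idea, under stated assumptions: the expected/actual directory layout and cfg names are hypothetical, and the real image check compares image contents, which is approximated here with a byte-level filecmp:

```python
# Hypothetical sketch: one test function that aggregates image-check results
# across cfgs instead of failing on the first bad image.
import filecmp
import os
from typing import Dict, List


def check_cfg(expected_dir: str, actual_dir: str) -> Dict[str, List[str]]:
    """Compare actual images against expected ones for a single cfg."""
    missing: List[str] = []
    mismatched: List[str] = []
    total: List[str] = []
    for root, _, files in os.walk(expected_dir):
        for name in files:
            if not name.endswith(".png"):
                continue
            rel = os.path.relpath(os.path.join(root, name), expected_dir)
            total.append(rel)
            actual = os.path.join(actual_dir, rel)
            if not os.path.exists(actual):
                missing.append(rel)
            # The real checker does an image diff; byte comparison is a stand-in.
            elif not filecmp.cmp(os.path.join(root, name), actual, shallow=False):
                mismatched.append(rel)
    return {"total": total, "missing": missing, "mismatched": mismatched}


def test_images():
    # One function covering every cfg; counts/lists would then be written to a
    # Markdown table and to files under image_check_failures (see earlier sketch).
    cfgs = ["comprehensive_v2", "comprehensive_v3", "bundles"]  # illustrative names
    results = {cfg: check_cfg(f"expected/{cfg}", f"actual/{cfg}") for cfg in cfgs}
    assert all(not r["missing"] and not r["mismatched"] for r in results.values())
```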

forsyth2 (Collaborator, Author) commented on Apr 1, 2025

  • Add performance benchmarking to the tests by checking for "Elapsed Time" in the job .o files (see the sketch below).
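A minimal sketch of such a check, assuming the .o files sit in one directory and that each relevant line contains "Elapsed Time" followed by a numeric value; the path pattern and regex are assumptions, not the existing log format:

```python
# Hypothetical sketch of an elapsed-time benchmark check.
import glob
import re
from typing import Dict


def elapsed_times(output_dir: str) -> Dict[str, float]:
    """Extract the last 'Elapsed Time' value found in each .o file."""
    times: Dict[str, float] = {}
    for path in glob.glob(f"{output_dir}/*.o*"):
        with open(path, errors="replace") as f:
            for line in f:
                match = re.search(r"Elapsed [Tt]ime.*?([\d.]+)", line)
                if match:
                    times[path] = float(match.group(1))
    return times


# A benchmarking test could then assert that each task stays within some factor
# (e.g., 1.5x) of a recorded baseline time.
```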

forsyth2 mentioned this pull request on Apr 2, 2025
forsyth2 (Collaborator, Author) commented on Apr 4, 2025

> A path forward here is to make a list of specific cases that need to be tested and then, as much as possible, combine them into a minimal set of cfg files.

  • Make a list of important cases tested in all the min-case cfg files + the weekly cfg files
  • See which cases can be combined into the same cfg files
  • Use these new cfg files for test_images.py, updating expected results as necessary.

forsyth2 (Collaborator, Author) commented on Apr 4, 2025

> Make a list of important cases tested in all the min-case cfg files + the weekly cfg files

Current test analysis (2025-04-04)

The "min-case" cfgs are useful because they allow developers to run as few jobs as possible to test a change. That said, these min cases can be combined for automated testing on main, where we want to test all facets of the code.

Understanding what cases (e.g., parameter combinations) are currently covered

"min_case" cfgs

Run on v3 data, unless otherwise noted

| cfg | This tests... | It runs these tasks | Notes |
| --- | --- | --- | --- |
| add_dependencies | all tasks/diags sets using the add_dependencies function | ts, e3sm_to_cmip, e3sm_diags, global_time_series, ilamb | |
| carryover_dependencies | that dependencies aren't carried over from one instantiation to the next, except for mpas_analysis, where that is the desired behavior | climo, ts, e3sm_to_cmip, tc_analysis, e3sm_diags, mpas_analysis, global_time_series, ilamb | Requires manually reviewing the .o file |
| deprecated_parameters | that deprecated parameters have no effect | ts, tc_analysis, global_time_series | |
| e3sm_diags_depend_on_climo_mvm_{1,2} | e3sm_diags sets that depend on the climo atm monthly subtask, mvm | climo, e3sm_diags | |
| e3sm_diags_depend_on_climo | e3sm_diags sets that depend on the climo atm monthly subtask | climo, e3sm_diags | |
| e3sm_diags_depend_on_ts_mvm_{1,2} | e3sm_diags sets that depend on the ts atm monthly subtask, mvm | ts, e3sm_diags | |
| e3sm_diags_depend_on_ts | e3sm_diags sets that depend on the ts atm monthly subtask | ts, e3sm_diags | |
| e3sm_diags_depend_on_diurnal_cycle_mvm_{1,2} | e3sm_diags sets that depend on the climo diurnal subtask, mvm | climo, e3sm_diags | |
| e3sm_diags_depend_on_diurnal_cycle | e3sm_diags sets that depend on the climo diurnal subtask | climo, e3sm_diags | |
| e3sm_diags_depend_on_lat_lon_land_mvm_{1,2} | e3sm_diags sets that depend on the climo land monthly subtask, mvm | climo, e3sm_diags | There is no model-vs-obs test, because that's not a common use case |
| e3sm_diags_depend_on_streamflow_mvm_{1,2} | e3sm_diags sets that depend on the ts rof monthly subtask, mvm | ts, e3sm_diags | |
| e3sm_diags_depend_on_streamflow | e3sm_diags sets that depend on the ts rof monthly subtask | ts, e3sm_diags | |
| e3sm_diags_depend_on_tc_analysis_mvm_{1,2} | e3sm_diags sets that depend on the tc_analysis task, mvm | tc_analysis, e3sm_diags | |
| e3sm_diags_depend_on_tc_analysis_parallel | e3sm_diags sets that depend on the tc_analysis task, running in parallel | tc_analysis, e3sm_diags | This is a special test case because of #180 & #631 |
| e3sm_diags_depend_on_tc_analysis_v2_mvm_{1,2} | e3sm_diags sets that depend on the tc_analysis task, mvm, v2 | tc_analysis, e3sm_diags | Important because of tc_analysis issues on v3 data |
| e3sm_diags_depend_on_tc_analysis_parallel_v2 | e3sm_diags sets that depend on the tc_analysis task, running in parallel, v2 | tc_analysis, e3sm_diags | |
| e3sm_diags_depend_on_tc_analysis_v2 | e3sm_diags sets that depend on the tc_analysis task, v2 | tc_analysis, e3sm_diags | |
| e3sm_diags_depend_on_tc_analysis | e3sm_diags sets that depend on the tc_analysis task | tc_analysis, e3sm_diags | |
| e3sm_diags_depend_on_tropical_subseasonal_mvm_{1,2} | e3sm_diags sets that depend on the ts atm daily subtask, mvm | ts, e3sm_diags | |
| e3sm_diags_depend_on_tropical_subseasonal | e3sm_diags sets that depend on the ts atm daily subtask | ts, e3sm_diags | |
| global_time_series_comprehensive_v3_setup_only | N/A | ts, mpas_analysis | This is for creating the input to zppy-interfaces's integration tests |
| global_time_series_custom | specifying atm/lnd vars in the ts task, explicitly set plots_atm/lnd in the global_time_series task | ts, global_time_series | |
| global_time_series_original_8_missing_ocn | that global_time_series is skipped if requesting all original plots but mpas_analysis is not run | ts, global_time_series | |
| global_time_series_original_8_no_ocn | default atm vars in the ts task, the 5 atm plots of plots_original in global_time_series | ts, global_time_series | |
| global_time_series_original_8 | default atm vars in the ts task, default all 8 of plots_original in global_time_series | ts, mpas_analysis, global_time_series | |
| global_time_series_viewers_all_land_variables | specifying atm/lnd vars in the ts task, explicitly set plots_atm + plots_lnd="all" in the global_time_series task, viewers on | ts, global_time_series | |
| global_time_series_viewers_original_8 | default atm vars in the ts task, default all 8 of plots_original in the global_time_series task, viewers on | ts, mpas_analysis, global_time_series | |
| global_time_series_viewers_original_atm_plus_land | default atm vars + vars="" for lnd in the ts task, explicitly set 5 atm plots of plots_original + 1 atm plot + 4 lnd plots in the global_time_series task, viewers on | ts, global_time_series | Important for performance testing: E3SM-Project/zppy-interfaces#22 |
| global_time_series_viewers_undefined_variables | default atm vars + vars="" for lnd in the ts task, explicitly set plots_atm + plots_lnd="all" in the global_time_series task, viewers on | ts, global_time_series | Important for performance testing: E3SM-Project/zppy-interfaces#19 |
| global_time_series_viewers | specifying atm/lnd vars in the ts task, explicitly set plots_atm/lnd in the global_time_series task, viewers on | ts, global_time_series | Important for performance testing: E3SM-Project/zppy-interfaces#19 |
| ilamb_diff_years | that a 30-year ilamb run can use output from 2 15-year ts/e3sm_to_cmip tasks | ts, e3sm_to_cmip, ilamb | |
| ilamb_land_only | ilamb's land_only parameter | ts, e3sm_to_cmip, ilamb | |
| ilamb | ilamb | ts, e3sm_to_cmip, ilamb | |
| mpas_analysis | mpas_analysis | mpas_analysis | |
| nco | tasks that use NCO | climo, ts | |
| tc_analysis_only | tc_analysis | tc_analysis | |
| tc_analysis_res | tc_analysis with explicitly set --res | tc_analysis | |
| tc_analysis_simultaneous_{1,2} | tc_analysis works when run simultaneously with another tc_analysis run | tc_analysis | This is a special test case because of #180 & #631 |
| tc_analysis_simultaneous_v2_{1,2} | tc_analysis works when run simultaneously with another tc_analysis run, v2 | tc_analysis | This is a special test case because of #180 & #631 |

"weekly" cfgs

Both comprehensive_v2 and comprehensive_v3:

  • climo > atm monthly, atm diurnal cycle, lnd monthly
  • ts > atm_monthly, atm_daily, rof_monthly, atm monthly glb, lnd monthly glb, land monthly
  • e3sm_to_cmip > atm_monthly, lnd_monthly
  • tc_analysis
  • e3sm_diags > atm, atm mvm, lnd mvm
  • mpas_analysis
  • global_time_series
  • ilamb

Differences:

  • comprehensive_v3 has parameter inference turned off, meaning it's a check on which parameters must be specified somehow, whereas comprehensive_v2 is a check that parameter inference is working.

bundles runs a reduced task list:

  • bundle 1
    • climo > atm monthly, atm diurnal cycle
    • ts > atm monthly, atm monthly glb, land monthly
    • e3sm_to_cmip > atm monthly, lnd monthly
    • e3sm_diags > atm
  • bundle 2
    • global_time_series
  • bundle 3
    • ts > rof monthly
    • tc_analysis
    • e3sm_diags > atm mvm
  • No bundle
    • ilamb

Which "min_case" cases are already covered in the weekly cfgs?

```
add_dependencies
carryover_dependencies
global_time_series_comprehensive_v3_setup_only
global_time_series_custom
ilamb_diff_years # comprehensive_v3 only; that's why https://github.com/E3SM-Project/zppy/pull/705 was caught only on the v3 test
ilamb
mpas_analysis
nco
tc_analysis_only
```

The following cases are tested by the weekly cfgs, but with an important caveat: in some sense, the "min_case" cfgs also test that the sets do not depend on any other subtask. That is, if we're running only sets that depend on climo, zppy shouldn't care that the ts task is absent from the cfg. That independence is obviously not exercised by a comprehensive cfg that runs as much as possible.

```
e3sm_diags_depend_on_climo_mvm_{1,2}
e3sm_diags_depend_on_climo
e3sm_diags_depend_on_ts_mvm_{1,2}
e3sm_diags_depend_on_ts
e3sm_diags_depend_on_diurnal_cycle_mvm_{1,2}
e3sm_diags_depend_on_diurnal_cycle
e3sm_diags_depend_on_lat_lon_land_mvm_{1,2}
e3sm_diags_depend_on_streamflow_mvm_{1,2}
e3sm_diags_depend_on_streamflow
e3sm_diags_depend_on_tc_analysis_mvm_{1,2}
e3sm_diags_depend_on_tc_analysis_v2_mvm_{1,2}
e3sm_diags_depend_on_tc_analysis_v2
e3sm_diags_depend_on_tc_analysis
e3sm_diags_depend_on_tropical_subseasonal_mvm_{1,2}
e3sm_diags_depend_on_tropical_subseasonal
```

That leaves the following cases untested in weekly testing:

```
deprecated_parameters
e3sm_diags_depend_on_tc_analysis_parallel
e3sm_diags_depend_on_tc_analysis_parallel_v2
global_time_series_original_8_missing_ocn
global_time_series_original_8_no_ocn
global_time_series_original_8
global_time_series_viewers_all_land_variables
global_time_series_viewers_original_8
global_time_series_viewers_original_atm_plus_land
global_time_series_viewers_undefined_variables
global_time_series_viewers
ilamb_land_only
tc_analysis_res
tc_analysis_simultaneous_{1,2}
tc_analysis_simultaneous_v2_{1,2}
```

Given the number of global_time_series parameter combinations to test, it may be wise to permit subtasks of the global_time_series task; then we could do multiple runs with one cfg.

What are other improvements we can make to the tests?

  • Test on as few years as possible
  • Since it takes a long time to get nodes on Perlmutter, perhaps a hybrid testing scheme could be used: run comprehensive tests on Chrysalis and the bundles tests on Perlmutter (since bundles were designed with Perlmutter's job scheduling in mind)

forsyth2 (Collaborator, Author) commented on Apr 4, 2025

It may be best to test the global_time_series test cases on the zppy-interfaces side. That would require a couple things:

  • Add the image checker functionality to zppy-interfaces
  • Have multiple up-to-date input directories: one with 353 land variables, one with only a few land variables, and one with data from mpas_analysis/atm/lnd

forsyth2 (Collaborator, Author) commented on Apr 8, 2025

> Automate the process of launching the zppy jobs not just for the comprehensive/weekly tests, but for all the min-case tests as well. Initial implementation ideas are in #520.

#709

> Update existing test_weekly.py cfg files to run on fewer years and then update expected results accordingly. (That way we know any diffs are solely because of the reduction in years.)

I don't think this will actually help that much. e3sm_diags is the longest-running job (1 hour or more), and its subtasks only operate on 2-4 years.

Successfully merging this pull request may close these issues: Further automate tests