Automate testing of specific cases #698

Draft: forsyth2 wants to merge 1 commit into base: main

Conversation

forsyth2 (Collaborator)

Summary

Refactor image check testing to be more robust.

Objectives:

  • Automate the process of launching the zppy jobs not just for the comprehensive/weekly tests, but for all the min-case tests as well. Initial implementation ideas are in Automate zppy tests #520.
  • Have tests collect data on the lists of missing and mismatched images, such that they can be automatically summarized in a Markdown table. Initial implementation ideas are in Automate organization of test results #695.
  • Print the lists of missing and mismatched images to files in the image_check_failures directories, so that they are easily accessible (a sketch of this summarization follows this list).
  • Automatically summarize mismatched image diffs in a grid, as initially demonstrated in Script to generate image diff grid #685.
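As a rough illustration of the last two objectives, here is a minimal sketch (all names and paths are hypothetical, not existing zppy code) of writing the failure lists into an image_check_failures directory and rendering per-cfg image counts as a Markdown table:

```python
# Hypothetical sketch of the summarization step; not existing zppy code.
import os
from typing import Dict, List


def write_image_check_failures(directory: str, missing: List[str], mismatched: List[str]) -> None:
    """Write the missing/mismatched image lists to files for later inspection."""
    os.makedirs(directory, exist_ok=True)
    for filename, images in (("missing_images.txt", missing), ("mismatched_images.txt", mismatched)):
        with open(os.path.join(directory, filename), "w") as f:
            f.write("\n".join(images) + "\n")


def markdown_summary(results: Dict[str, Dict[str, List[str]]]) -> str:
    """Render per-cfg image counts as a Markdown table.

    `results` maps a cfg name to lists keyed by "total", "missing", "mismatched".
    """
    lines = [
        "| cfg | total images | missing | mismatched |",
        "| --- | --- | --- | --- |",
    ]
    for cfg, r in sorted(results.items()):
        lines.append(
            f"| {cfg} | {len(r['total'])} | {len(r['missing'])} | {len(r['mismatched'])} |"
        )
    return "\n".join(lines)
```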

Issue resolution:

Select one: This pull request is...

  • a bug fix: increment the patch version
  • a small improvement: increment the minor version
  • a new feature: increment the minor version
  • an incompatible (non-backwards compatible) API change: increment the major version

Big Change

  • To merge, I will use "Create a merge commit". That is, this change is large enough to require multiple units of work (i.e., it should be multiple commits).

1. Does this do what we want it to do?

Required:

  • Product Management: I have confirmed with the stakeholders that the objectives above are correct and complete.
  • Testing: I have added or modified at least one "min-case" configuration file to test this change. Every objective above is represented in at least one cfg.
  • Testing: I have considered likely and/or severe edge cases and have included them in testing.

If applicable:

  • Testing: this pull request introduces an important feature or bug fix that we must test often. I have updated the weekly-test configuration files, not just a "min-case" one.
  • Testing: this pull request adds at least one new possible parameter to the cfg. I have tested using this parameter with and without any other parameter that may interact with it.

2. Are the implementation details accurate & efficient?

Required:

  • Logic: I have visually inspected the entire pull request myself.
  • Logic: I have left GitHub comments highlighting important pieces of code logic. I have had these code blocks reviewed by at least one other team member.

If applicable:

  • Dependencies: This pull request introduces a new dependency. I have discussed this requirement with at least one other team member. The dependency is noted in zppy/conda, not just an import statement.

3. Is this well documented?

Required:

  • Documentation: by looking at the docs, a new user could easily understand the functionality introduced by this pull request.

4. Is this code clean?

Required:

  • Readability: The code is as simple as possible and well-commented, such that a new team member could understand what's happening.
  • Pre-commit checks: All the pre-commit checks have passed.

If applicable:

  • Software architecture: I have discussed relevant trade-offs in design decisions with at least one other team member. It is unlikely that this pull request will increase tech debt.

forsyth2 added the Testing (Files in `tests` modified) label on Mar 27, 2025
forsyth2 self-assigned this on Mar 27, 2025
forsyth2 (Collaborator, Author)

Why is this important?

While the weekly/comprehensive tests cover many cases, one or two cfg files cannot truly cover zppy's myriad parameter options and combinations. We need to test many more cases.

In #697, we saw that while errors in the weekly tests had been resolved, some use cases of parameters had gone untested. In particular, those noted in #516 (comment).

In addition to resolving #516, these changes would address the "Challenges to easily displaying test output" section of #634 (comment).

Current TODO list:

  • Set up automated bash script to build conda environments for e3sm_diags and zppy_interfaces, and launch zppy jobs for ALL min-case tests. Potential concerns:
    • I was running into issues getting the conda environments to work in Automate zppy tests #520, but I think I was just missing a source line.
    • Some min-case tests have a "part 2" that needs to be run after "part 1" runs.
    • Is there any way to know all the IDs of the jobs that are launched and then wait for them all to complete? (One possible approach is sketched after this list.) If not, there will always be the manual step of repeatedly checking squeue output until determining that the jobs have completed and the tests themselves can be run.
    • Just running zppy for the weekly/comprehensive tests launches many jobs. Will launching an even larger number of jobs (to cover all the min-case cfg files) overwhelm a job quota? Is it realistic to run this many jobs on one machine, let alone 3 (chrysalis, perlmutter, compy), for every pull request, or even every week?
  • Set up expected results for the min-case cfg files too. Potential concerns:
    • How much output can I keep on /lcrc/group/e3sm/public_html/zppy_test_resources/ versus /lcrc/group/e3sm/ac.forsyth2/zppy_min_case_resources/?
  • Automate the creation of an image diff grid, as in Script to generate image diff grid #685.
  • Update all min-case cfg files to use common prefixes/suffixes for output and www. In particular, grouping paths at the highest level by the unique_id, so tests run on the same code are grouped together in the file structure.
  • Add a check_images line in the test for every min-case cfg
  • Once everything is working for the min-case tests, port the weekly/comprehensive tests over to using the new structure as well.
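Regarding the job-tracking question above, a minimal sketch of one possible approach, assuming Slurm is available. The function names are hypothetical; zppy submits its batch jobs itself, so in practice the job IDs would need to be parsed from zppy's output or from the generated run scripts' .status/.o files:

```python
# Sketch only (assumes Slurm); helper names are hypothetical.
import subprocess
import time
from typing import List


def submit(batch_scripts: List[str]) -> List[str]:
    """Submit each script with sbatch and return the Slurm job IDs."""
    job_ids = []
    for script in batch_scripts:
        # --parsable makes sbatch print just the job ID (plus cluster name, if any).
        result = subprocess.run(
            ["sbatch", "--parsable", script], capture_output=True, text=True, check=True
        )
        job_ids.append(result.stdout.strip().split(";")[0])
    return job_ids


def wait_for_jobs(job_ids: List[str], poll_seconds: int = 60) -> None:
    """Poll squeue until none of the given jobs are still queued or running."""
    while True:
        result = subprocess.run(
            ["squeue", "-h", "-j", ",".join(job_ids)], capture_output=True, text=True
        )
        if not result.stdout.strip():
            return  # no jobs left in the queue; safe to run the image checks
        time.sleep(poll_seconds)
```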

forsyth2 mentioned this pull request on Mar 28, 2025
forsyth2 (Collaborator, Author)

> Just running zppy for the weekly/comprehensive tests launches many jobs.

A path forward here is to make a list of specific cases that need to be tested and then, as much as possible, combine them into a minimal set of cfg files. E.g., if we need feature X in climo and feature Y in ts to be tested, those could easily be done in one cfg. However, if we need to test feature Z 3 different ways in global_time_series, then we'll necessarily need 3 cfg files -- notably because that task, like some others, doesn't have subtasks.

That is, we don't need to automate every "min-case" cfg, we just need to create a minimal testing set of cfg files.

forsyth2 (Collaborator, Author) commented on Apr 1, 2025

  • Update existing test_weekly.py cfg files to run on fewer years and then update expected results accordingly. (That way we know any diffs are solely because of the reduction in years.)
  • Make a version of test_weekly.py that tests the existing test_weekly.py cfg files, but all in one test function that keeps track of the missing/mismatched images, etc., and then 1) prints image counts to a Markdown table and 2) prints the lists of failing files to files that can be looked at later (as opposed to just printing to the command line); see the sketch after this list. [EDIT: Improve image checker #699, Update test names to reflect new image checker #700, Further image checker updates #701]
  • Add more cfg files -- enough to cover important test cases, but combining such cases as much as possible. We'll end up testing more than the current number of cfgs, but fewer than every "min-case" cfg.
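A self-contained sketch of that "one test function" idea, under stated assumptions: the expected/actual directory layout and cfg names are hypothetical, and the real image check compares image contents, which is approximated here with a byte-level filecmp:

```python
# Hypothetical sketch: one test function that aggregates image-check results
# across cfgs instead of failing on the first bad image.
import filecmp
import os
from typing import Dict, List


def check_cfg(expected_dir: str, actual_dir: str) -> Dict[str, List[str]]:
    """Compare actual images against expected ones for a single cfg."""
    missing: List[str] = []
    mismatched: List[str] = []
    total: List[str] = []
    for root, _, files in os.walk(expected_dir):
        for name in files:
            if not name.endswith(".png"):
                continue
            rel = os.path.relpath(os.path.join(root, name), expected_dir)
            total.append(rel)
            actual = os.path.join(actual_dir, rel)
            if not os.path.exists(actual):
                missing.append(rel)
            # The real checker does an image diff; byte comparison is a stand-in.
            elif not filecmp.cmp(os.path.join(root, name), actual, shallow=False):
                mismatched.append(rel)
    return {"total": total, "missing": missing, "mismatched": mismatched}


def test_images():
    # One function covering every cfg; counts/lists would then be written to a
    # Markdown table and to files under image_check_failures (see earlier sketch).
    cfgs = ["comprehensive_v2", "comprehensive_v3", "bundles"]  # illustrative names
    results = {cfg: check_cfg(f"expected/{cfg}", f"actual/{cfg}") for cfg in cfgs}
    assert all(not r["missing"] and not r["mismatched"] for r in results.values())
```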

forsyth2 (Collaborator, Author) commented on Apr 1, 2025

  • Add performance benchmarking to the tests by checking for "Elapsed Time" in the job .o files (see the sketch below).
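A minimal sketch of such a check, assuming the .o files sit in one directory and that each relevant line contains "Elapsed Time" followed by a numeric value; the path pattern and regex are assumptions, not the existing log format:

```python
# Hypothetical sketch of an elapsed-time benchmark check.
import glob
import re
from typing import Dict


def elapsed_times(output_dir: str) -> Dict[str, float]:
    """Extract the last 'Elapsed Time' value found in each .o file."""
    times: Dict[str, float] = {}
    for path in glob.glob(f"{output_dir}/*.o*"):
        with open(path, errors="replace") as f:
            for line in f:
                match = re.search(r"Elapsed [Tt]ime.*?([\d.]+)", line)
                if match:
                    times[path] = float(match.group(1))
    return times


# A benchmarking test could then assert that each task stays within some factor
# (e.g., 1.5x) of a recorded baseline time.
```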

forsyth2 mentioned this pull request on Apr 2, 2025
forsyth2 (Collaborator, Author) commented on Apr 4, 2025

> A path forward here is to make a list of specific cases that need to be tested and then, as much as possible, combine them into a minimal set of cfg files.

  • Make a list of important cases tested in all the min-case cfg files + the weekly cfg files
  • See which cases can be combined into the same cfg files
  • Use these new cfg files for test_images.py, updating expected results as necessary.

forsyth2 (Collaborator, Author) commented on Apr 4, 2025

> Make a list of important cases tested in all the min-case cfg files + the weekly cfg files

Current test analysis (2025-04-04)

The "min-case" cfgs are useful because they allow developers to run as few jobs as possible to test a change. That said, these min cases can be combined for automated testing on main, where we want to test all facets of the code.

Understanding what cases (e.g., parameter combinations) are currently covered

"min_case" cfgs

Run on v3 data, unless otherwise noted

| cfg | This tests... | It runs these tasks | Notes |
| --- | --- | --- | --- |
| add_dependencies | all tasks/diags sets using the add_dependencies function | ts, e3sm_to_cmip, e3sm_diags, global_time_series, ilamb | |
| carryover_dependencies | that dependencies aren't carried over from one instantiation to the next, except for mpas_analysis, where that is the desired behavior | climo, ts, e3sm_to_cmip, tc_analysis, e3sm_diags, mpas_analysis, global_time_series, ilamb | Requires manually reviewing the .o file |
| deprecated_parameters | that deprecated parameters have no effect | ts, tc_analysis, global_time_series | |
| e3sm_diags_depend_on_climo_mvm_{1,2} | e3sm_diags sets that depend on the climo atm monthly subtask, mvm | climo, e3sm_diags | |
| e3sm_diags_depend_on_climo | e3sm_diags sets that depend on the climo atm monthly subtask | climo, e3sm_diags | |
| e3sm_diags_depend_on_ts_mvm_{1,2} | e3sm_diags sets that depend on the ts atm monthly subtask, mvm | ts, e3sm_diags | |
| e3sm_diags_depend_on_ts | e3sm_diags sets that depend on the ts atm monthly subtask | ts, e3sm_diags | |
| e3sm_diags_depend_on_diurnal_cycle_mvm_{1,2} | e3sm_diags sets that depend on the climo diurnal subtask, mvm | climo, e3sm_diags | |
| e3sm_diags_depend_on_diurnal_cycle | e3sm_diags sets that depend on the climo diurnal subtask | climo, e3sm_diags | |
| e3sm_diags_depend_on_lat_lon_land_mvm_{1,2} | e3sm_diags sets that depend on the climo land monthly subtask, mvm | climo, e3sm_diags | There is no model-vs-obs test, because that's not a common use case |
| e3sm_diags_depend_on_streamflow_mvm_{1,2} | e3sm_diags sets that depend on the ts rof monthly subtask, mvm | ts, e3sm_diags | |
| e3sm_diags_depend_on_streamflow | e3sm_diags sets that depend on the ts rof monthly subtask | ts, e3sm_diags | |
| e3sm_diags_depend_on_tc_analysis_mvm_{1,2} | e3sm_diags sets that depend on the tc_analysis task, mvm | tc_analysis, e3sm_diags | |
| e3sm_diags_depend_on_tc_analysis_parallel | e3sm_diags sets that depend on the tc_analysis task, running in parallel | tc_analysis, e3sm_diags | This is a special test case because of #180 & #631 |
| e3sm_diags_depend_on_tc_analysis_v2_mvm_{1,2} | e3sm_diags sets that depend on the tc_analysis task, mvm, v2 | tc_analysis, e3sm_diags | Important because of tc_analysis issues on v3 data |
| e3sm_diags_depend_on_tc_analysis_parallel_v2 | e3sm_diags sets that depend on the tc_analysis task, running in parallel, v2 | tc_analysis, e3sm_diags | |
| e3sm_diags_depend_on_tc_analysis_v2 | e3sm_diags sets that depend on the tc_analysis task, v2 | tc_analysis, e3sm_diags | |
| e3sm_diags_depend_on_tc_analysis | e3sm_diags sets that depend on the tc_analysis task | tc_analysis, e3sm_diags | |
| e3sm_diags_depend_on_tropical_subseasonal_mvm_{1,2} | e3sm_diags sets that depend on the ts atm daily subtask, mvm | ts, e3sm_diags | |
| e3sm_diags_depend_on_tropical_subseasonal | e3sm_diags sets that depend on the ts atm daily subtask | ts, e3sm_diags | |
| global_time_series_comprehensive_v3_setup_only | N/A | ts, mpas_analysis | This is for creating the input to zppy-interfaces's integration tests |
| global_time_series_custom | specifying atm/lnd vars in the ts task, explicitly set plots_atm/lnd in the global_time_series task | ts, global_time_series | |
| global_time_series_original_8_missing_ocn | that global_time_series is skipped if requesting all original plots but mpas_analysis is not run | ts, global_time_series | |
| global_time_series_original_8_no_ocn | default atm vars in the ts task, the 5 atm plots of plots_original in global_time_series | ts, global_time_series | |
| global_time_series_original_8 | default atm vars in the ts task, default all 8 of plots_original in global_time_series | ts, mpas_analysis, global_time_series | |
| global_time_series_viewers_all_land_variables | specifying atm/lnd vars in the ts task, explicitly set plots_atm + plots_lnd="all" in the global_time_series task, viewers on | ts, global_time_series | |
| global_time_series_viewers_original_8 | default atm vars in the ts task, default all 8 of plots_original in the global_time_series task, viewers on | ts, mpas_analysis, global_time_series | |
| global_time_series_viewers_original_atm_plus_land | default atm vars + vars="" for lnd in the ts task, explicitly set 5 atm plots of plots_original + 1 atm plot + 4 lnd plots in the global_time_series task, viewers on | ts, global_time_series | Important for performance testing: E3SM-Project/zppy-interfaces#22 |
| global_time_series_viewers_undefined_variables | default atm vars + vars="" for lnd in the ts task, explicitly set plots_atm + plots_lnd="all" in the global_time_series task, viewers on | ts, global_time_series | Important for performance testing: E3SM-Project/zppy-interfaces#19 |
| global_time_series_viewers | specifying atm/lnd vars in the ts task, explicitly set plots_atm/lnd in the global_time_series task, viewers on | ts, global_time_series | Important for performance testing: E3SM-Project/zppy-interfaces#19 |
| ilamb_diff_years | that a 30-year ilamb run can use output from 2 15-year ts/e3sm_to_cmip tasks | ts, e3sm_to_cmip, ilamb | |
| ilamb_land_only | ilamb's land_only parameter | ts, e3sm_to_cmip, ilamb | |
| ilamb | ilamb | ts, e3sm_to_cmip, ilamb | |
| mpas_analysis | mpas_analysis | mpas_analysis | |
| nco | tasks that use NCO | climo, ts | |
| tc_analysis_only | tc_analysis | tc_analysis | |
| tc_analysis_res | tc_analysis with explicitly set --res | tc_analysis | |
| tc_analysis_simultaneous_{1,2} | tc_analysis works when run simultaneously with another tc_analysis run | tc_analysis | This is a special test case because of #180 & #631 |
| tc_analysis_simultaneous_v2_{1,2} | tc_analysis works when run simultaneously with another tc_analysis run, v2 | tc_analysis | This is a special test case because of #180 & #631 |

"weekly" cfgs

Both comprehensive_v2 and comprehensive_v3:

  • climo > atm monthly, atm diurnal cycle, lnd monthly
  • ts > atm_monthly, atm_daily, rof_monthly, atm monthly glb, lnd monthly glb, land monthly
  • e3sm_to_cmip > atm_monthly, lnd_monthly
  • tc_analysis
  • e3sm_diags > atm, atm mvm, lnd mvm
  • mpas_analysis
  • global_time_series
  • ilamb

Differences:

  • comprehensive_v3 has parameter inference turned off, meaning it's a check on which parameters must be specified somehow, whereas comprehensive_v2 is a check that parameter inference is working.

bundles runs a reduced task list:

  • bundle 1
    • climo > atm monthly, atm diurnal cycle
    • ts > atm monthly, atm monthly glb, land monthly
    • e3sm_to_cmip > atm monthly, lnd monthly
    • e3sm_diags > atm
  • bundle 2
    • global_time_series
  • bundle 3
    • ts > rof monthly
    • tc_analysis
    • e3sm_diags > atm mvm
  • No bundle
    • ilamb

Which "min_case" cases are already covered in the weekly cfgs?

```
add_dependencies
carryover_dependencies
global_time_series_comprehensive_v3_setup_only
global_time_series_custom
ilamb_diff_years # comprehensive_v3 only; that's why https://github.com/E3SM-Project/zppy/pull/705 was caught only on the v3 test
ilamb
mpas_analysis
nco
tc_analysis_only
```

The following cases are tested by the weekly cfgs, but with an important caveat: in some sense, the "min_case" cfgs also test that the sets do not depend on any other subtask. That is, if we're running only sets that depend on climo, zppy shouldn't care that the ts task is absent from the cfg. That independence is obviously not exercised by a comprehensive cfg that runs as much as possible.

```
e3sm_diags_depend_on_climo_mvm_{1,2}
e3sm_diags_depend_on_climo
e3sm_diags_depend_on_ts_mvm_{1,2}
e3sm_diags_depend_on_ts
e3sm_diags_depend_on_diurnal_cycle_mvm_{1,2}
e3sm_diags_depend_on_diurnal_cycle
e3sm_diags_depend_on_lat_lon_land_mvm_{1,2}
e3sm_diags_depend_on_streamflow_mvm_{1,2}
e3sm_diags_depend_on_streamflow
e3sm_diags_depend_on_tc_analysis_mvm_{1,2}
e3sm_diags_depend_on_tc_analysis_v2_mvm_{1,2}
e3sm_diags_depend_on_tc_analysis_v2
e3sm_diags_depend_on_tc_analysis
e3sm_diags_depend_on_tropical_subseasonal_mvm_{1,2}
e3sm_diags_depend_on_tropical_subseasonal
```

That leaves the following cases untested in weekly testing:

```
deprecated_parameters
e3sm_diags_depend_on_tc_analysis_parallel
e3sm_diags_depend_on_tc_analysis_parallel_v2
global_time_series_original_8_missing_ocn
global_time_series_original_8_no_ocn
global_time_series_original_8
global_time_series_viewers_all_land_variables
global_time_series_viewers_original_8
global_time_series_viewers_original_atm_plus_land
global_time_series_viewers_undefined_variables
global_time_series_viewers
ilamb_land_only
tc_analysis_res
tc_analysis_simultaneous_{1,2}
tc_analysis_simultaneous_v2_{1,2}
```

Given the number of global_time_series parameter combinations to test, it may be wise to permit subtasks of the global_time_series task; then we could do multiple runs with one cfg.

What are other improvements we can make to the tests?

  • Test on as few years as possible
  • Since it takes a long time to get nodes on Perlmutter, perhaps a hybrid testing scheme could be used: run comprehensive tests on Chrysalis and the bundles tests on Perlmutter (since bundles were designed with Perlmutter's job scheduling in mind)

forsyth2 (Collaborator, Author) commented on Apr 4, 2025

It may be best to test the global_time_series test cases on the zppy-interfaces side. That would require a couple things:

  • Add the image checker functionality to zppy-interfaces
  • Have multiple up-to-date input directories: one with 353 land variables, one with only a few land variables, and one with data from mpas_analysis/atm/lnd

forsyth2 (Collaborator, Author) commented on Apr 8, 2025

> Automate the process of launching the zppy jobs not just for the comprehensive/weekly tests, but for all the min-case tests as well. Initial implementation ideas are in #520.

#709

> Update existing test_weekly.py cfg files to run on fewer years and then update expected results accordingly. (That way we know any diffs are solely because of the reduction in years.)

I don't think this will actually help that much. e3sm_diags is the longest-running job (1 hour or more), and its subtasks only operate on 2-4 years.

Successfully merging this pull request may close these issues: Further automate tests