
Conversation

VeraChristina
Contributor

@VeraChristina VeraChristina commented Jul 8, 2025

This PR adds an ecFlow suite that can be deployed and monitored by a GitHub workflow. The purpose is to test the Anemoi pipeline end-to-end, from dataset creation to inference.

The suite added by this PR consists of

  • set-up
  • dataset creation
  • training

The suite is set up such that new test cases for dataset creation or training can be added as configs (no pyflow knowledge required).

The GitHub workflow

  • deploys and runs the suite
  • monitors whether it has finished or a task failed
  • prints results
  • cleans up the suite if the tests were successful

Before merging, we need to

  • document how to add new test cases
  • remove the pull request trigger and set a schedule for the regular action
  • draft issues for next steps
  • set the schedule: run nightly, and only run the tests if there have been commits in any of the repos tested (currently datasets and training)

Not part of this PR:

  • checks for the datasets task to verify that the created dataset looks as expected
  • periodic clean-up of directories created by tasks that are older than 2 days
  • inference family
  • fine-tuning family: forking, resuming, (rollout) etc.
  • review of the monitor action -- it should only finish when all parts are done/aborted/queued (currently it also finishes when only some tasks are aborted)
  • more test cases -- need to gather requirements for these

📚 Documentation preview 📚: https://anemoi--38.org.readthedocs.build/en/38/

@VeraChristina
Contributor Author

As discussed yesterday, I have added a workflow for nightly tests that first checks for commits within the past 24h in any of the repos (currently: anemoi-docs, anemoi-datasets, anemoi-core), and only runs the tests (on all mains) if there has been a change in any of them.
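
For illustration, here is a minimal sketch of that "only run if something changed" check, querying the GitHub REST API for commits on main in the last 24 hours; the actual workflow step may be implemented differently (e.g. with the gh CLI inside the action):

```python
# Illustrative sketch only -- not the actual workflow code.
import json
import urllib.request
from datetime import datetime, timedelta, timezone

# Repos named in the comment above; adjust as the suite grows.
REPOS = ["anemoi-docs", "anemoi-datasets", "anemoi-core"]
since = (datetime.now(timezone.utc) - timedelta(hours=24)).isoformat()

def has_recent_commits(repo: str) -> bool:
    # Unauthenticated requests are rate-limited; a real action would pass a token.
    url = f"https://api.github.com/repos/ecmwf/{repo}/commits?sha=main&since={since}"
    with urllib.request.urlopen(url) as resp:
        return len(json.load(resp)) > 0

if any(has_recent_commits(repo) for repo in REPOS):
    print("Changes in the last 24h -- run the nightly suite")
else:
    print("No changes -- skip this nightly run")
```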

Workflow run here: https://github.com/ecmwf/anemoi-docs/actions/runs/17433383922 (The monitor action fails here because the workflow already points to main of anemoi-docs where the test suite isn't added yet. This should be fine once merged.)

jjlk
jjlk previously approved these changes Sep 3, 2025

@jjlk jjlk left a comment


Minor optional comments, ok for me to merge as it is!

Contributor

@aaron-hopkinson aaron-hopkinson left a comment


A couple of minor comments – sorry!

@VeraChristina VeraChristina merged commit 0981601 into main Sep 5, 2025
5 checks passed
@VeraChristina VeraChristina deleted the test/system-level-prototype-ecflow branch September 5, 2025 14:27
@github-project-automation github-project-automation bot moved this from Under Review to Done in Anemoi-dev Sep 5, 2025
@MeraX
Contributor

MeraX commented Sep 5, 2025

If needed, we can look into ways of surfacing git hashes more directly, rather than just in the logs.

Thank you @VeraChristina, for your reply. Could you help me to find the git hashes in https://github.com/ecmwf/anemoi-docs/actions/runs/17433383922/job/49499118253

And why has this PR been merged without any ATS tag?

@anaprietonem
Contributor

If needed, we can look into ways of surfacing git hashes more directly, rather than just in the logs.

Thank you @VeraChristina, for your reply. Could you help me to find the git hashes in https://github.com/ecmwf/anemoi-docs/actions/runs/17433383922/job/49499118253

And why has this PR been merged without any ATS tag?

Hey @MeraX, good catch that we need to update the labels of this repo and also add the GitHub action so that it follows the same PR-labelling workflow as the other repos. I will help with this. Regarding ATS, this PR was discussed at ATS, where we also presented the overall approach. We clarified that the initial focus would be to support testing from main branches and then scale to the main use cases such as global, LAM and stretched grid. Since there were no major concerns and overall this was seen as a useful feature, we told Vera it was approved. Note that this doesn't mean the work on system-level tests is done; as @VeraChristina captured in https://github.com/ecmwf/anemoi-docs/issues?q=is%3Aissue%20state%3Aopen%20milestone%3A%22System-level%20tests%22, there are still improvements and features to be added. Nonetheless, having this merged is a great step towards making Anemoi more robust!

@VeraChristina
Contributor Author

VeraChristina commented Sep 8, 2025

Thank you @VeraChristina, for your reply. Could you help me to find the git hashes in https://github.com/ecmwf/anemoi-docs/actions/runs/17433383922/job/49499118253

Hi @MeraX, sure thing! -- You navigate to the summary, then open the logs for the package whose version you want to check, and scroll down to find the local version tag, which includes a short commit hash.

(Three screenshots showing the navigation from the workflow summary through the install logs down to the local version tag.)

Clearly, this is not ideal, since the information is buried deep in the logs and is only a short commit hash. Once we start using the test suite more comprehensively, i.e. beyond just testing all main branches nightly (which will make it much harder to know which versions were tested together), we will definitely need a better way of surfacing this.
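
For reference, a small illustrative sketch of how such a local version tag can be parsed; the version string below is the one visible in the logs, and the helper itself is hypothetical, not part of the suite:

```python
# Given a local version string as seen in the install logs
# (e.g. "0.1.dev1+gfa0b7e809"), extract the short git hash.
# The "+g<hash>" local segment is typically produced by setuptools-scm.
import re

def short_hash(version: str) -> str | None:
    match = re.search(r"\+g([0-9a-f]{7,40})", version)
    return match.group(1) if match else None

print(short_hash("0.1.dev1+gfa0b7e809"))  # -> "fa0b7e809"
```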

I've added a ticket to the milestone. Feel free to add any details or considerations I've missed!

@MeraX
Contributor

MeraX commented Sep 8, 2025

Hi Vera,

Thanks for the insight. It appears to me that this action does not test compatibility across all main branches; rather, the dependencies for each test are taken from PyPI. For example, one task tests anemoi-datasets==0.1.dev1+gfa0b7e809 while the core tests use anemoi-datasets==0.5.26 from PyPI. Is this intended?

From the anemoi-datasets tests:
(Screenshot of the install log from the anemoi-datasets tests.)

From the anemoi-core tests:

 + anemoi-datasets==0.5.26
 + anemoi-graphs==0.6.5.post3 (from file:///lus/h1resw02/project/prepml/ecflow_server/workdirs/testing/anemoi_tests/nightly/local/build/training_env/anemoi_training/graphs)
 + anemoi-models==0.9.4.post3 (from file:///lus/h1resw02/project/prepml/ecflow_server/workdirs/testing/anemoi_tests/nightly/local/build/training_env/anemoi_training/models)
 + anemoi-training==0.6.4.post3 (from file:///lus/h1resw02/project/prepml/ecflow_server/workdirs/testing/anemoi_tests/nightly/local/build/training_env/anemoi_training/training)
 + anemoi-transform==0.1.16
 + anemoi-utils==0.4.35

@VeraChristina
Contributor Author

VeraChristina commented Sep 8, 2025

@MeraX Thanks for pointing this out. What I meant by “across anemoi branches” is more about the end-to-end user workflow: creating a dataset → training → inference, with the various anemoi packages installed from main at each step. That way, we check whether training works when the dataset was created with the current anemoi-datasets (and soon, whether the resulting checkpoint is compatible with current inference).

So yes, this is intentional in the sense that it’s closer to the workflow we expect users to follow (installing training from main or PyPI, not mixing versions within the same environment). As you note, we may need to update dependencies, and when things break we probably want to see that, since we don’t do synchronized releases across the pipeline. That said, I agree we should have a way to test whether the pipeline with updated dependencies will work.
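
As an illustration only (not something the suite currently does), each test environment could report which anemoi packages were installed from a local checkout versus from an index, using the PEP 610 metadata that pip records at install time:

```python
# Hypothetical sketch: report the installed version and install origin of each
# anemoi package in the current environment.
import json
from importlib import metadata

PACKAGES = [
    "anemoi-datasets", "anemoi-graphs", "anemoi-models",
    "anemoi-training", "anemoi-transform", "anemoi-utils",
]

for name in PACKAGES:
    try:
        dist = metadata.distribution(name)
    except metadata.PackageNotFoundError:
        print(f"{name}: not installed")
        continue
    # PEP 610: pip writes direct_url.json for installs from a path/VCS/URL,
    # but not for installs from an index such as PyPI.
    direct = dist.read_text("direct_url.json")
    origin = json.loads(direct).get("url", "unknown") if direct else "index (e.g. PyPI)"
    print(f"{name}=={dist.version}  <- {origin}")
```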

This setup is not meant to be comprehensive — just a starting point. We’ll likely want to extend what’s tested over time, and it’d be great if others can share/document ideas. If you think the current approach should be changed, could you please open an issue so we can track the discussion there instead of on this closed PR?

@MeraX
Contributor

MeraX commented Sep 9, 2025

Thanks for this clarification, I might have missed that point of view in the discussion.

I believe there are myriad ways to set up Anemoi, and it's just important to make clear what has been proven to work and what has not.
