
feat: Add the possibility to train a model with a dry MLflow run ID #164

Merged
merged 28 commits into main from feature/mlflow-dry-run on Apr 2, 2025

Conversation

jjlk
Contributor

@jjlk jjlk commented Feb 28, 2025

Description

When several trainings are chained using a scheduler, it would be useful to be able to:

  • Create an MLflow run ID using the API, outside of the training CLI.
  • Propagate this run_id through the different trainings and to external registries.
  • Get the lineage on the MLflow server.

The first bullet point is handled by a new CLI command that generates a new run ID and saves it both on disk and on the MLflow server:

>$ anemoi-training mlflow prepare --config-name=dev_mlflow
2025-02-28 16:24:06 INFO Access token refreshed: 0.2 seconds.
2025-02-28 16:24:06 INFO Creating new run_id: 00da1078efe4467b81f75587df3696ae
2025-02-28 16:24:06 INFO Saving mlflow_id in file in mlflow_run_id.yaml.
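For context, the sketch below shows roughly what such a preparation step amounts to when done directly against the MLflow API: an empty run is created up front and its ID is written to a file so that a later training job can reuse it. This is illustrative only, not the actual CLI implementation; the tracking URI, experiment name and YAML key are assumptions.

```python
# Illustrative sketch only -- not the anemoi-training implementation.
# Create an (empty) MLflow run up front and save its ID to a YAML file,
# so that a scheduler can hand the same run ID to a later training job.
import yaml
from mlflow.tracking import MlflowClient

TRACKING_URI = "https://mlflow.ecmwf.int/"  # assumed server
EXPERIMENT_NAME = "anemoi-debug"            # assumed experiment name

client = MlflowClient(tracking_uri=TRACKING_URI)
experiment = client.get_experiment_by_name(EXPERIMENT_NAME)
experiment_id = (
    experiment.experiment_id
    if experiment is not None
    else client.create_experiment(EXPERIMENT_NAME)
)

run = client.create_run(experiment_id=experiment_id)
run_id = run.info.run_id
print(f"Creating new run_id: {run_id}")

# Persist the ID so that later steps in the pipeline can pick it up.
with open("mlflow_run_id.yaml", "w") as f:
    yaml.safe_dump({"run_id": run_id}, f)  # key name is an assumption
```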

The second bullet point is achieved by:

  • Introducing a new internal variable, dry_run_id, to run a training without an existing checkpoint.
  • Adding a use case to the run ID setup to handle the situation where we have both a forked run_id and a previously generated run ID.

The third bullet point is achieved by:

  • Adding a use case in the MLflow logger to deal with the fact that we can have both a run ID and a forked run ID (see the sketch after this list).
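Taken together, the run ID setup now has to distinguish roughly the cases below. This is a purely hypothetical sketch of that decision logic; the function, variable names and return values are illustrative and do not match the actual anemoi-training code.

```python
# Hypothetical sketch of the cases described above -- names, structure and
# return values are illustrative, not the actual anemoi-training code.
def resolve_run_id(run_id, fork_run_id, checkpoint_exists):
    """Decide how the MLflow logger should treat the configured IDs."""
    if run_id and not checkpoint_exists:
        if fork_run_id:
            # Pre-generated ("dry") run ID combined with a fork: log to the
            # prepared run and tag it with the run it was forked from.
            return {"dry_run_id": run_id, "fork_run_id": fork_run_id}
        # Pre-generated ("dry") run ID, no checkpoint to resume from.
        return {"dry_run_id": run_id}
    # Previous behaviour: resume the given run, or let a new ID be created.
    return {"run_id": run_id, "fork_run_id": fork_run_id}
```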

Examples

Standard training

Creation of a training run ID:

>$ anemoi-training mlflow prepare --config-name=dev_mlflow
2025-03-04 14:36:59 INFO Access token refreshed: 0.2 seconds.
2025-03-04 14:37:00 INFO Creating new run_id: 4d7f151592534836b0ca8507f2d78a21
2025-03-04 14:37:00 INFO Saving run id in file in mlflow_run_id.yaml. 

Training:

$ anemoi-training train --config-name=dev_mlflow hardware.paths.output=/perm/ecm6116/shared/anemoi/training/ training.run_id="4d7f151592534836b0ca8507f2d78a21"
[2025-03-04 14:39:49,139][anemoi.training.train.train][INFO] - Config validated.
[2025-03-04 14:39:49,139][anemoi.training.train.train][INFO] - Run id: 4d7f151592534836b0ca8507f2d78a21
[2025-03-04 14:39:49,154][anemoi.training.diagnostics.logger][INFO] - AnemoiMLFlow logging to https://mlflow.ecmwf.int/
[2025-03-04 14:39:49,160][anemoi.training.diagnostics.mlflow.logger][INFO] - MLflow token authentication enabled for https://mlflow.ecmwf.int/
...
[2025-03-04 14:41:58,440][anemoi.training.diagnostics.mlflow.auth][INFO] - Your MLflow login token is valid until 2025-04-02 14:39:49 UTC
2025/03/04 14:41:58 INFO mlflow.system_metrics.system_metrics_monitor: Stopping system metrics monitoring...
2025/03/04 14:41:58 INFO mlflow.system_metrics.system_metrics_monitor: Successfully terminated system metrics monitoring!
[2025-03-04 14:41:58,526][anemoi.training.diagnostics.mlflow.logger][INFO] - Stopping terminal log monitoring and saving buffered terminal outputs. Final status: SUCCESS
🏃 View run abundant-perch-919 at: https://mlflow.ecmwf.int/#/experiments/54/runs/54af33a5a8a6459c84bba7c22e6a5ad0
🧪 View experiment at: https://mlflow.ecmwf.int/#/experiments/54 


Forked training

Creation of a training run ID:

>$ anemoi-training mlflow prepare --config-name=dev_mlflow
2025-03-04 14:54:26 INFO Access token refreshed: 0.1 seconds.
2025-03-04 14:54:26 INFO Creating new run_id: c38a48ad48884d44a27fec1773746930
2025-03-04 14:54:26 INFO Saving run id in file in mlflow_run_id.yaml.

Training:

>$ anemoi-training train --config-name=dev_mlflow hardware.paths.output=/perm/ecm6116/shared/anemoi/training/ training.run_id="c38a48ad48884d44a27fec1773746930" training.fork_run_id="4d7f151592534836b0ca8507f2d78a21"
[2025-03-04 15:44:07,161][anemoi.training.train.train][INFO] - Config validated.
[2025-03-04 15:44:07,161][anemoi.training.train.train][INFO] - Run id: c38a48ad48884d44a27fec1773746930
[2025-03-04 15:44:07,174][anemoi.training.diagnostics.logger][INFO] - AnemoiMLFlow logging to https://mlflow.ecmwf.int/
[2025-03-04 15:44:07,177][anemoi.training.diagnostics.mlflow.logger][INFO] - MLflow token authentication enabled for https://mlflow.ecmwf.int/
[2025-03-04 15:44:07,338][anemoi.training.diagnostics.mlflow.auth][INFO] - Access token refreshed: 0.2 seconds.
...
[2025-03-04 15:45:12,594][anemoi.training.diagnostics.mlflow.auth][INFO] - Your MLflow login token is valid until 2025-04-02 15:44:07 UTC
2025/03/04 15:45:12 INFO mlflow.system_metrics.system_metrics_monitor: Stopping system metrics monitoring...
2025/03/04 15:45:12 INFO mlflow.system_metrics.system_metrics_monitor: Successfully terminated system metrics monitoring!
[2025-03-04 15:45:12,662][anemoi.training.diagnostics.mlflow.logger][INFO] - Stopping terminal log monitoring and saving buffered terminal outputs. Final status: SUCCESS
🏃 View run gifted-lamb-292 at: https://mlflow.ecmwf.int/#/experiments/54/runs/74549608b8034bbcab375e1a2bc376fe
🧪 View experiment at: https://mlflow.ecmwf.int/#/experiments/54

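When the trainings are driven by a scheduler, the two examples above can be combined into a small driver script. The sketch below is illustrative only: it assumes the prepared ID is stored under a run_id key in mlflow_run_id.yaml (the exact key name may differ) and that the commands run in the directory where that file is written.

```python
# Illustrative driver for chained trainings -- the YAML key, config name and
# overrides are assumptions, not part of anemoi-training itself.
import subprocess
import yaml


def prepare_run_id(config_name: str) -> str:
    """Run `anemoi-training mlflow prepare` and read back the generated ID."""
    subprocess.run(
        ["anemoi-training", "mlflow", "prepare", f"--config-name={config_name}"],
        check=True,
    )
    with open("mlflow_run_id.yaml") as f:
        return yaml.safe_load(f)["run_id"]  # key name is an assumption


config = "dev_mlflow"

# First training in the chain: pre-generated run ID only.
first_id = prepare_run_id(config)
subprocess.run(
    ["anemoi-training", "train", f"--config-name={config}",
     f"training.run_id={first_id}"],
    check=True,
)

# Second training: a new pre-generated run ID, forked from the first run.
second_id = prepare_run_id(config)
subprocess.run(
    ["anemoi-training", "train", f"--config-name={config}",
     f"training.run_id={second_id}",
     f"training.fork_run_id={first_id}"],
    check=True,
)
```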

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Code Compatibility

  • I have performed a self-review of my code

Code Testing

  • I have added tests that prove my fix is effective or that my feature works
  • I ran the complete Pytest test suite locally, and they pass
  • I have tested the changes on a single GPU
  • I have tested the changes on multiple GPUs / multi-node setups
  • I have run the Benchmark Profiler against the old version of the code
  • If the new feature introduces modifications at the config level, I have made sure to update Pydantic Schemas and default configs accordingly

Dependencies

  • I have ensured that the code is still pip-installable after the changes and runs
  • I have tested that new dependencies themselves are pip-installable.
  • I have not introduced new dependencies in the inference portion of the pipeline

Documentation

  • My code follows the style guidelines of this project
  • I have updated the documentation and docstrings to reflect the changes
  • I have added comments to my code, particularly in hard-to-understand areas

@jjlk jjlk changed the title feat: Add an option to MLflow CLI to generate and create a new dry run feat: gives the possibility to train a model with a dry MLflow run ID Mar 4, 2025
@jjlk jjlk changed the title feat: gives the possibility to train a model with a dry MLflow run ID feat: Add the possibility to train a model with a dry MLflow run ID Mar 4, 2025
Member

@gmertes gmertes left a comment

Really cool feature! I mainly reviewed the code in the command, not so much the functionality during logging and training.

@anaprietonem
Collaborator

I left some comments, @jjlk! Let me know if something is not clear. Thanks again for the great work!

@jjlk
Contributor Author

jjlk commented Mar 11, 2025

Hey @anaprietonem,

I think I took your comments into account, except the two related to the fork_id, because I would like to be sure I really understand the situation.

Right now, when you create a forked run, a new MLflow run ID (or a uuid) is generated by anemoi.

What I would like is to generate this MLflow run ID with the anemoi-training mlflow prepare CLI before running anemoi-training train "config" training.fork_run_id=xxx. I think the second example given in the description is what we want to achieve, right? The run has the main MLflow run ID we generated and it is tagged as a forked run with the details of the previous run. If not, I need more explanations! =)

If yes, I need a way to pass the MLflow run ID to anemoi-training train, and in the PR I used training.run_id in combination with training.fork_run_id. It changes the previous philosophy a bit, but it looks like the quickest way to propagate the run ID without adding other variables. However, I understand that it could be confusing for users. Do you have any thoughts on how we could tackle this use case?

@jjlk
Contributor Author

jjlk commented Mar 12, 2025

Added further tests for metadata on the MLflow server.

@jjlk jjlk marked this pull request as ready for review March 12, 2025 10:34
@jjlk
Contributor Author

jjlk commented Mar 25, 2025

Tested that everything still works in offline mode:

$> anemoi-training train --config-name=dev_mlflow hardware.paths.output=/perm/ecm6116/shared/anemoi/training/
[2025-03-25 16:27:53,783][anemoi.training.train.train][INFO] - Config validated.
[2025-03-25 16:27:53,792][anemoi.training.diagnostics.mlflow.logger][INFO] - MLflow is logging offline.
[2025-03-25 16:27:55,907][pytorch_lightning.loggers.mlflow][WARNING] - Experiment with name anemoi-debug-ecm6116 not found. Creating it.
[2025-03-25 16:27:57,121][anemoi.training.diagnostics.mlflow.logger][INFO] - Terminal Log Path: /perm/ecm6116/shared/anemoi/training/plots/cdddcefc6297409285d22329f9d20e92/plots/terminal_log.txt
2025/03/25 16:27:57 INFO mlflow.system_metrics.system_metrics_monitor: Started monitoring system metrics.
[2025-03-25 16:27:57,164][anemoi.training.train.train][INFO] - Mlflow Run id: cdddcefc6297409285d22329f9d20e92
[2025-03-25 16:27:57,164][anemoi.training.train.train][INFO] - Parent run server2server: None
[2025-03-25 16:27:57,164][anemoi.training.train.train][INFO] - Fork run server2server: None
[2025-03-25 16:27:57,164][anemoi.training.train.train][INFO] - Checkpoints path: /perm/ecm6116/shared/anemoi/training/checkpoint/cdddcefc6297409285d22329f9d20e92
[2025-03-25 16:27:57,165][anemoi.training.train.train][INFO] - Plots path: /perm/ecm6116/shared/anemoi/training/plots/cdddcefc6297409285d22329f9d20e92
...
2025/03/25 16:29:50 INFO mlflow.system_metrics.system_metrics_monitor: Stopping system metrics monitoring...
2025/03/25 16:29:50 INFO mlflow.system_metrics.system_metrics_monitor: Successfully terminated system metrics monitoring!
[2025-03-25 16:29:50,788][anemoi.training.diagnostics.mlflow.logger][INFO] - Stopping terminal log monitoring and saving buffered terminal outputs. Final status: SUCCESS

Then, syncing everything to the server:

$> anemoi-training mlflow sync -s /perm/ecm6116/shared/anemoi/training/logs/mlflow/ -r cdddcefc6297409285d22329f9d20e92 -d https://mlflow.ecmwf.int -e anemoi-debug-ecm6116 -a 
25-Mar-25 16:44:16 - INFO - Using default logging config with output log file '/dev/shm/_tmpdir_.ecm6116.34275883/ecm6116_wrqm56eo'
25-Mar-25 16:44:17 - INFO - 🌐 Logging in to https://mlflow.ecmwf.int
25-Mar-25 16:44:17 - INFO - Your MLflow login token is valid until 2025-04-23 16:44:17 UTC
25-Mar-25 16:44:17 - INFO - ✅ Successfully logged in to MLflow. Happy logging!
25-Mar-25 16:44:17 - INFO - Access token refreshed: 0.1 seconds.
25-Mar-25 16:44:18 - INFO - Exporting run: {'run_id': 'cdddcefc6297409285d22329f9d20e92', 'lifecycle_stage': 'active', 'experiment_id': '456299991003986338'}
25-Mar-25 16:44:18 - INFO - Starting to export run data
🏃 View run live-mule at: https://mlflow.ecmwf.int/#/experiments/182/runs/bac31bb8cb0549d6b87c05740f56d655
🧪 View experiment at: https://mlflow.ecmwf.int/#/experiments/182
25-Mar-25 16:45:01 - INFO - Imported run bac31bb8cb0549d6b87c05740f56d655 into experiment anemoi-debug-ecm6116

The result is here: https://mlflow.ecmwf.int/#/experiments/182/runs/bac31bb8cb0549d6b87c05740f56d655.
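For completeness, the sync step could also be scripted as part of a scheduler job. A rough sketch, where the run ID and paths are placeholders copied from the log above rather than values a script would know up front:

```python
# Rough sketch of syncing an offline run to the server -- the run ID and
# paths below are placeholders taken from the log above.
import subprocess

offline_run_id = "cdddcefc6297409285d22329f9d20e92"  # placeholder
subprocess.run(
    ["anemoi-training", "mlflow", "sync",
     "-s", "/perm/ecm6116/shared/anemoi/training/logs/mlflow/",
     "-r", offline_run_id,
     "-d", "https://mlflow.ecmwf.int",
     "-e", "anemoi-debug-ecm6116",
     "-a"],
    check=True,
)
```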

@anaprietonem anaprietonem self-requested a review March 27, 2025 07:52
@anaprietonem anaprietonem self-requested a review March 31, 2025 07:20
Collaborator

@anaprietonem anaprietonem left a comment

LGTM, great work @jjlk !

@jjlk jjlk self-assigned this Mar 31, 2025
@jjlk jjlk merged commit 9849d21 into main Apr 2, 2025
28 checks passed
@jjlk jjlk deleted the feature/mlflow-dry-run branch April 2, 2025 08:11