
Implement per channel logging again #440


Open
kacpnowak wants to merge 44 commits into base: develop

Conversation

@kacpnowak (Contributor) commented Jul 3, 2025

Description

Implements per-channel loss logging, now taking into account that ranks can have different numbers of samples.
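
As an illustration (not the PR's actual implementation), the key point is that per-channel losses have to be pooled over all samples from all ranks rather than averaging per-rank means with equal weight. A minimal sketch, assuming an initialized torch.distributed process group; the helper name is hypothetical:

# Illustrative sketch only: average per-channel losses across ranks that may
# hold different numbers of samples, by gathering the full per-sample
# histories and averaging after concatenation.
import torch
import torch.distributed as dist

def mean_loss_per_channel(local_losses: torch.Tensor) -> torch.Tensor:
    # local_losses: (num_local_samples, num_channels) on this rank.
    gathered = [None] * dist.get_world_size()
    # all_gather_object handles per-rank tensors of different lengths.
    dist.all_gather_object(gathered, local_losses.cpu())
    return torch.cat(gathered).mean(dim=0)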

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Issue Number

Closes #282

Code Compatibility

  • I have performed a self-review of my code

Code Performance and Testing

  • I ran uv run train and (if necessary) uv run evaluate on at least one GPU node and it works
  • If the new feature introduces modifications at the config level, I have made sure to notify the other software developers through Mattermost and to update the paths in the $WEATHER_GENERATOR_PRIVATE directory

Dependencies

  • I have ensured that the code is still pip-installable after the changes and runs
  • I have tested that new dependencies themselves are pip-installable.
  • I have not introduced new dependencies in the inference portion of the pipeline

Documentation

  • My code follows the style guidelines of this project
  • I have updated the documentation and docstrings to reflect the changes
  • I have added comments to my code, particularly in hard-to-understand areas

Additional Notes

kacpnowak and others added 29 commits May 30, 2025 14:48
@kacpnowak kacpnowak marked this pull request as draft July 3, 2025 16:30
@kacpnowak (Contributor, Author)

Closes #282

@kacpnowak kacpnowak marked this pull request as ready for review July 4, 2025 11:03
@kacpnowak (Contributor, Author)

Fixes #282

@tjhunter (Collaborator) left a comment

@kacpnowak I have a few comments. It is tricky code and I am a bit limited for reviewing capacity in the coming days. Is it working as intended? It would be great if someone else tried it out as well. @clessig , if this current implementation is good for you, then I think someone else should have a look and try it too. Any thoughts?

Returns:
int: world size
"""
if not dist.is_available():
Inline review comment (Collaborator):

_is_distributed_initialized

return dist.get_world_size()


def get_rank() -> int:
Inline review comment (Collaborator):

Please update is_root; that function is following the best practices for PyTorch. Maybe Seb Hoffman also has something to say about that part too.

return dist.get_rank()
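
For reference, a minimal sketch (not the PR's code) of guarded rank helpers in the spirit of the _is_distributed_initialized suggestion above; the exact names and fallbacks are illustrative:

import torch.distributed as dist

def _is_distributed_initialized() -> bool:
    # True only when torch.distributed is both available and initialized.
    return dist.is_available() and dist.is_initialized()

def get_rank() -> int:
    # Fall back to rank 0 when not running under torch.distributed.
    return dist.get_rank() if _is_distributed_initialized() else 0

def is_root() -> bool:
    # Rank 0 is treated as the root process for logging and checkpointing.
    return get_rank() == 0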


def all_gather(data: Tensor) -> list[Tensor]:
Inline review comment (Collaborator):

You should make it very explicit that this implementation does not allow gradient propagation (or does it? I would assume it breaks the tape tracking of the tensors, but stranger things have happened in PyTorch).

Inline reply (Contributor, Author):

It's a great question. I couldn't find anything explicit on this topic, but in my understanding it does not allow gradients to flow. The requires_grad flag is preserved, but once the tensor is reconstructed it is detached from the autograd graph.
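
For reference, a minimal sketch (not from this PR) of why the gathered copies do not carry gradients: dist.all_gather writes into freshly allocated buffers, and that write is not tracked by autograd.

import torch
import torch.distributed as dist

def gather_detached(t: torch.Tensor) -> list[torch.Tensor]:
    # The outputs are plain buffers filled by the collective; they are not
    # connected to the autograd graph of the input tensor.
    outs = [torch.empty_like(t) for _ in range(dist.get_world_size())]
    dist.all_gather(outs, t)
    return outs

# If gradient flow across ranks were ever needed,
# torch.distributed.nn.functional.all_gather is the autograd-aware variant;
# for pure logging it is unnecessary.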

# Convert the list of losses into a tensor; each rank has its own tensor
real_loss = torch.tensor(self.loss_model_hist, device=self.devices[0])
# Gather the tensors from all ranks and concatenate them into one tensor again
real_loss = torch.cat(all_gather(real_loss))
Inline review comment (Collaborator):

I am surprised it works as expected

@clessig (Collaborator) commented Jul 7, 2025 via email

@tjhunter tjhunter requested a review from MatKbauer July 7, 2025 12:52
@Jubeku (Contributor) commented Jul 11, 2025

Testing on a single node on Leonardo: training has been running without error for 3 hours already.

@clessig (Collaborator) left a comment

@kacpnowak : I just tried the code and wanted to plot the loss values. I get:

Traceback (most recent call last):
  File "/lus/h2resw01/hpcperm/nacl/WeatherGenerator/.venv/bin/plot_train", line 10, in <module>
    sys.exit(plot_train())
             ^^^^^^^^^^^^
  File "/lus/h2resw01/hpcperm/nacl/WeatherGenerator/src/weathergen/utils/plot_training.py", line 671, in plot_train
    runs_data = [TrainLogger.read(run_id, model_path=model_base_dir) for run_id in runs_ids]
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lus/h2resw01/hpcperm/nacl/WeatherGenerator/src/weathergen/utils/train_logger.py", line 249, in read
    log_train_df = read_metrics(cf, run_id, "train", cols1, result_dir_base)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lus/h2resw01/hpcperm/nacl/WeatherGenerator/src/weathergen/utils/train_logger.py", line 371, in read_metrics
    df = clean_df(df, cols)
         ^^^^^^^^^^^^^^^^^^
  File "/lus/h2resw01/hpcperm/nacl/WeatherGenerator/src/weathergen/utils/train_logger.py", line 395, in clean_df
    df = df.select(columns)
         ^^^^^^^^^^^^^^^^^^
  File "/lus/h2resw01/hpcperm/nacl/WeatherGenerator/.venv/lib/python3.12/site-packages/polars/dataframe/frame.py", line 9632, in select
    return self.lazy().select(*exprs, **named_exprs).collect(_eager=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lus/h2resw01/hpcperm/nacl/WeatherGenerator/.venv/lib/python3.12/site-packages/polars/_utils/deprecation.py", line 88, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lus/h2resw01/hpcperm/nacl/WeatherGenerator/.venv/lib/python3.12/site-packages/polars/lazyframe/frame.py", line 2188, in collect
    return wrap_df(ldf.collect(engine, callback))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ColumnNotFoundError: loss_avg_0_mean

Resolved plan until failure:

        ---> FAILED HERE RESOLVING 'sink' <---
DF ["stream.NPPATMS.loss_mse.loss_obsvaluerawbt3", "stream.SurfaceCombined.loss_mse.loss_obsvaluet2m0", "stream.SurfaceCombined.loss_mse.loss_avg", "weathergen.time", ...]; PROJECT */114 COLUMNS
nacl@ac6-318:WeatherGenerator$ uv run plot_train

Can you implement this patch please:

--- a/src/weathergen/utils/train_logger.py
+++ b/src/weathergen/utils/train_logger.py
@@ -199,7 +199,7 @@ class TrainLogger:
 
         # define cols for training
         cols_train = ["dtime", "samples", "mse", "lr"]
-        cols1 = [_weathergen_timestamp, "num_samples", "loss_avg_0_mean", "learning_rate"]
+        cols1 = [_weathergen_timestamp, "num_samples", "loss_avg_mean", "learning_rate"]

plot_training.py and train_logger need to be adapted so that one can select the columns to plot (loss_avg_mean is a good default, but now I also want to plot q850 etc.). Please open a PR on this.
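
Not part of the requested patch, but as an illustration of such a configurable, defensive column selection (hypothetical helper name, using polars as train_logger already does):

import polars as pl

def select_metrics(df: pl.DataFrame, requested: list[str]) -> pl.DataFrame:
    # Keep only the requested columns that actually exist in this run's log,
    # so a missing per-channel column does not raise ColumnNotFoundError as
    # in the traceback above.
    present = [c for c in requested if c in df.columns]
    return df.select(present)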

I still need to test with mlflow.

@kacpnowak (Contributor, Author)

Thanks for finding this bug. I've patched it.

kacpnowak and others added 3 commits July 16, 2025 18:17
@clessig (Collaborator) left a comment

Everything looks good and is working, but can we replace the all_gather with one that doesn't go through raw byte buffers? This would do the job:

import torch
import torch.distributed as dist


def all_gather_vdim(tensor: torch.Tensor, group=None) -> list[torch.Tensor]:
    """Gather tensors whose shapes differ across ranks.

    Relies on an all_gather_vlen helper (not shown here) that gathers
    variable-length 1-D tensors, used for the per-rank shape vectors.
    """
    world_size = dist.get_world_size(group=group)
    # Exchange the per-rank shapes first so every rank can allocate
    # correctly sized receive buffers.
    shapes = all_gather_vlen(
        torch.as_tensor(tensor.shape, device=tensor.device), group=group
    )
    # Send this rank's tensor to every rank and receive one tensor per rank,
    # each with its own shape.
    inputs = [tensor] * world_size
    outputs = [
        torch.empty(*_shape, dtype=tensor.dtype, device=tensor.device)
        for _shape in shapes
    ]
    dist.all_to_all(outputs, inputs, group=group)
    return outputs
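
For context, a hypothetical drop-in usage at the call site quoted earlier (sketch only, names as in that snippet):

# Sketch: replacing torch.cat(all_gather(real_loss)) with the shape-aware variant above.
real_loss = torch.tensor(self.loss_model_hist, device=self.devices[0])
real_loss = torch.cat(all_gather_vdim(real_loss))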

Development

Successfully merging this pull request may close these issues.

Implement logging loss per-channel
4 participants