
Loss class refactoring #533


Draft: wants to merge 50 commits into base: develop
Conversation


@Jubeku (Contributor) commented Jul 16, 2025

Description

In this PR we refactor the trainer.compute_loss() function into a standalone LossModule class.
Note: This PR works off of Kacper's branch kacpnowak:kacpnowak/develop/per-channel-logginig

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Issue Number

Closes #568

Code Compatibility

  • I have performed a self-review of my code

Code Performance and Testing

  • I ran uv run train and (if necessary) uv run evaluate on at least one GPU node and it works
  • If the new feature introduces modifications at the config level, I have made sure to have notified the other software developers through Mattermost and updated the paths in the $WEATHER_GENERATOR_PRIVATE directory

Dependencies

  • I have ensured that the code is still pip-installable after the changes and runs
  • I have tested that new dependencies themselves are pip-installable.
  • I have not introduced new dependencies in the inference portion of the pipeline

Documentation

  • My code follows the style guidelines of this project
  • I have updated the documentation and docstrings to reflect the changes
  • I have added comments to my code, particularly in hard-to-understand areas

Additional Notes

kacpnowak and others added 30 commits May 30, 2025 14:48
@Jubeku Jubeku added the enhancement New feature or request label Jul 16, 2025
@clessig (Collaborator) left a comment:

Great progress, thanks! Two more points:

  • Please add doc strings to all functions
  • loss_module is not a great name; "module" is an overloaded term in CS and the class is not a module in most definitions. LossComputer is not great but descriptive. Open for other suggestions.
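For illustration, a minimal skeleton of what such a renamed class could look like (all names are hypothetical, and plain Python floats stand in for torch tensors):

```python
from typing import Callable, Sequence


class LossComputer:
    """Illustrative skeleton for the class replacing trainer.compute_loss().

    Purely a sketch: the real class works on torch tensors and the
    project's stream configuration.
    """

    def __init__(self, loss_fcts: dict[str, Callable]):
        self.loss_fcts = loss_fcts

    def compute_loss(self, pred: Sequence[float], target: Sequence[float],
                     mask: Sequence[bool]) -> float:
        # Keep only the points selected by the boolean mask.
        p = [x for x, m in zip(pred, mask) if m]
        t = [x for x, m in zip(target, mask) if m]
        if not p:
            # No valid data under the mask: contribute nothing to the loss.
            return 0.0
        # Average over all configured loss functions.
        return sum(f(p, t) for f in self.loss_fcts.values()) / len(self.loss_fcts)


def mse(pred, target):
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(pred)


lc = LossComputer({"mse": mse})
```

Whatever the final name, keeping the masking and averaging inside one class is what lets the trainer shed its compute_loss() method.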

i_batch = 0  # TODO: Iterate over batch dimension here in future
for i_strm, strm in enumerate(self.cf.streams):
    targets = streams_data[i_batch][i_strm].target_tokens[self.cf.forecast_offset:]
    # assert len(targets) == self.cf.forecast_steps + 1, "Length of targets does not match number of forecast_steps."
Collaborator:

Why is this commented out?

@MatKbauer (Contributor) commented Jul 22, 2025:

While this assertion works well for different configurations under training_mode: "forecast", it crashes with

training_mode: "masking"
forecast_offset : 0
forecast_steps: 0

since len(targets) = 1 and forecast_offset + forecast_steps = 0. The assertion therefore seems incompatible with autoencoder training. I have put the assertion inside an if training_mode == "masking" check.

Contributor:

Edit: After testing more configurations, this assertion does not seem valid in general. It only seems to hold for training_mode: "forecast" with forecast_policy: "fixed". Furthermore, it should then be

if self.cf.training_mode == "forecast" and self.cf.forecast_policy == "fixed":
    assert len(targets) == self.forecast_steps

But that seems too constraining.

targets = streams_data[i_batch][i_strm].target_tokens[self.cf.forecast_offset:]
# assert len(targets) == self.cf.forecast_steps + 1, "Length of targets does not match number of forecast_steps."

for fstep, target in enumerate(targets):
Collaborator:

Add more comments to the individual lines, e.g. why 108.

Collaborator:

l114-122: overwriting in the VAL case is not very clean; better to have

if TRAIN:
    ...
elif VAL:
    ...

Collaborator:

l. 125: we should document what the shape of pred at this point is: (ensemble, target_points, target_channels)

ctr_chs += 1
else:
Collaborator:

I think the code would be simpler if mask is just identity when tok_spacetime = False and this could be handled completely in _construct_masks.
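The suggestion could be sketched roughly like this (hypothetical helper name and mask pattern; plain Python for brevity):

```python
def construct_mask(n_points: int, tok_spacetime: bool, spacing: int = 2) -> list[bool]:
    """Hypothetical _construct_masks-style helper.

    When tok_spacetime is False the mask is simply the identity
    (all True), so the calling loss code needs no special case.
    """
    if not tok_spacetime:
        return [True] * n_points
    # Illustrative spacetime pattern; the real mask logic is more involved.
    return [i % spacing == 0 for i in range(n_points)]
```

With this shape, the per-channel loop can apply the mask unconditionally.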


### Original logging preparation
# TODO: remove dependency from old trainer.compute_loss() function
_, _, _, logging_items = trainer.compute_loss(
Collaborator:

This would be high priority since we cannot merge with the old dependency.

pred: The prediction tensor, potentially with an ensemble dimension.
mask: A boolean mask tensor, indicating which elements to consider for loss computation.
i_ch: The index of the channel for which to compute the loss.
loss_fct: The specific loss function to apply. It is expected to accept
Contributor:

The loss function definition is very rigid here: for instance, a latent loss cannot work this way, and indeed many regularisation losses might not work with this either.

Contributor (author):

Wondering if we are introducing a dead end here, or if we can work on this generalization in a new PR?

Collaborator:

you mean the KL divergence in the ELBO, in case of a VAE? Since it is coming from an intermediate stage in the model, I would say that we can add it later as an extra parameter.

Thinking about it, we have roughly 3 losses depending on the stage in the model:

  • initial (stage 0): regularization terms (no need for samples)
  • mid-way: variational terms in the ELBO
  • end-to-end: expected empirical risk
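Combining the three stages might look like this (illustrative sketch; the weighting scheme and names are assumptions, not part of the PR):

```python
def total_loss(reg_terms, kl_term, risk_term, beta=1.0, lambda_reg=0.01):
    """Combine the three loss stages sketched above (illustrative only).

    reg_terms : stage-0 regularization terms (list of floats)
    kl_term   : mid-way variational term (KL divergence in the ELBO)
    risk_term : end-to-end expected empirical risk
    """
    return risk_term + beta * kl_term + lambda_reg * sum(reg_terms)
```

The point of the split is that each stage can be supplied (or omitted) independently, e.g. kl_term is only present for a VAE.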

Contributor:

Regularisation terms can very much need samples, e.g. the z-loss needed for stability in multi-modal transformers (https://arxiv.org/pdf/2405.09818), or representation regularisation such as a dispersion loss (https://arxiv.org/abs/2506.09027).

# If no valid data under the mask, return 0 to avoid errors and not contribute to loss
return 0

def compute_loss(
Contributor:

My suggestion would be to rename this to compute_input_space_loss and let people register custom functions that all get executed in a wrapper compute_loss function.

Each registered function could come with a list of arguments to recover out of an argument dict (kwargs), or we could extend the loss function through inheritance, for instance.
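The registration idea might be sketched as follows (class and function names are hypothetical):

```python
from typing import Callable


class LossRegistry:
    """Sketch of the proposed pattern: an input-space loss plus user-registered
    custom losses, all executed by a wrapper compute_loss (names hypothetical)."""

    def __init__(self):
        self._loss_fns: dict[str, Callable] = {}

    def register(self, name: str, fn: Callable) -> None:
        self._loss_fns[name] = fn

    def compute_loss(self, **kwargs) -> float:
        # Wrapper: each registered function recovers what it needs from kwargs.
        return sum(fn(**kwargs) for fn in self._loss_fns.values())


reg = LossRegistry()
reg.register("input_space", lambda pred, target, **_: abs(pred - target))
reg.register("latent", lambda latent_norm, **_: 0.1 * latent_norm)
```

Passing everything through kwargs is what lets a latent loss and an input-space loss coexist behind one wrapper.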

Contributor (author):

Isn't src/weathergen/train/loss.py such a register of custom functions?

Collaborator:

I am not a fan of registration because it introduces state to think about. Could it be described fully in the input of the class constructor?
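In the spirit of this comment, a stateless variant where the full set of losses is fixed at construction time (hypothetical sketch):

```python
class StatelessLossComputer:
    """Hypothetical: all loss functions given to the constructor; no register()
    method, so there is no mutable registry state to reason about."""

    def __init__(self, loss_fns):
        self._loss_fns = dict(loss_fns)  # copied once, never mutated afterwards

    def compute_loss(self, **kwargs) -> float:
        return sum(fn(**kwargs) for fn in self._loss_fns.values())


loss_computer = StatelessLossComputer(
    {"l1": lambda pred, target, **_: abs(pred - target)}
)
```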

Contributor:

I think there should be an easy way to compute a loss on the latents, given that we are doing representation learning and in particular SSL. This will come quite soon, I suspect. We can leave it for a future PR, but my point is that this is quite restrictive.

@tjhunter (Collaborator) left a comment:

I had a light review, but thanks for this hard work. Make sure it does not expand in scope, it is already heavy.

@@ -40,7 +40,7 @@ pred_mlp_adaln: True

# number of steps offset applied to first target window; if set to zero and forecast_steps=0 then
# one is training an auto-encoder
forecast_offset : 0
Collaborator:

just to check: are we ok with changing the default?

Contributor (author):

No, we have to revert to the default configs before merging.

Collaborator:

This comes from #440. We need to clean up before it's merged. Same below.

@@ -9,7 +9,7 @@

FESOM :
  type : fesom
- filenames : ['fesom_ifs_awi']
+ filenames : ['test4.zarr']
Collaborator:

needed?

@@ -11,6 +11,8 @@
import numpy as np
import torch

stat_loss_fcts = ["stats", "kernel_crps"] # Names of loss functions that need std computed
Collaborator:

what do you mean by std computed?

loss: Tensor
# Dictionaries containing detailed loss values and standard deviation statistics for each
# stream, channel, and loss function.
losses_all: dict
Collaborator:

dict[str, Tensor] ? Be more precise.
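For illustration, a precisely-typed sketch of the result container (field names from the snippet above; float stands in for Tensor, and the key convention is an assumption):

```python
from dataclasses import dataclass, field


@dataclass
class LossOutput:
    # Overall scalar loss (Tensor in the real code; float here for the sketch).
    loss: float
    # Detailed loss values keyed per stream/channel/loss-function
    # (hypothetical "stream.channel.loss_fct" key convention).
    losses_all: dict[str, float] = field(default_factory=dict)


out = LossOutput(loss=0.25, losses_all={"ERA5.t2m.mse": 0.1})
```

Spelling out the value type in the annotation is all the reviewer is asking for.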

Returns:
int: world size
"""
if not dist.is_available():
Collaborator:

there is a _is_distributed_initialized below.

return dist.get_world_size()


def get_rank() -> int:
Collaborator:

Why not pass process groups? See is_root below. I still don't fully grasp why we need them.

@sophie-xhonneux or Sebastian Hoffmann would probably know better.

Contributor:

get_rank is, afaik, often used for things you only want to do on one GPU, e.g. logging.

Collaborator:

This is again #440. Wrong place to discuss.

We need this function because if distributed is not initialized when we run interactively, all the calls to distributed will fail.
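The guard being described might look roughly like this (standalone sketch; the import fallback is only for illustration, the real helper lives in the PR):

```python
try:
    import torch.distributed as dist
except ImportError:  # running in an environment without torch
    dist = None


def get_rank() -> int:
    """Return the distributed rank, falling back to 0 for interactive runs.

    Without this guard, calling dist.get_rank() before init_process_group()
    raises an error when running interactively.
    """
    if dist is None or not dist.is_available() or not dist.is_initialized():
        return 0
    return dist.get_rank()


# Typical use: act only on one process, e.g. for logging.
if get_rank() == 0:
    print("logging only on rank 0")
```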

@@ -388,6 +410,13 @@ def _key_loss(st_name: str, lf_name: str) -> str:
    return f"stream.{st_name}.loss_{lf_name}.loss_avg"


def _key_loss_chn(st_name: str, lf_name: str, ch_name: str) -> str:
    st_name = _clean_name(st_name)
    lf_name = _clean_name(lf_name)
Collaborator:

do not clean the loss or channel names, they should be standard enough.

@Jubeku (Contributor, author) commented Jul 23, 2025:

Thanks @tjhunter. I think some of the additions come from @kacpnowak's PR, so it probably doesn't fully make sense to review everything before his PR is merged.

@clessig (Collaborator) commented Jul 24, 2025:

This PR is just refactoring, so we should merge without adding functionality for latent losses. It's already too big, if anything, and designing something now without actually using it will just lead to misalignment between what we think is needed and how we actually use it.

There will be a subsequent PR for the weighting functions but I would still keep it in physical (output) space. Once this is merged, we can add the latent loss.

@Jubeku (Contributor, author) commented Jul 24, 2025:

Tests are currently failing in masking mode because of issue #553

Labels: enhancement (New feature or request)

Successfully merging this pull request may close these issues: Loss function refactoring.

6 participants