Enable FesomDataReader to have different source and target datasets #1046

kacpnowak · 2025-10-06T15:48:50Z

Description

This PR implements separate source and target datasets for FesomDataReader.

It was also necessary to modify the masking strategies, because a healpix cell can be empty for the source but not for the target. Additionally, the number of tokens per cell can differ between source and target. Each strategy needed a different adjustment:

random: when the mask is missing for a token, it is assumed to be False.
healpix: when the mask is missing for an entire healpix cell, it is assumed to be False.
causal: if the mask is missing for an entire cell, it is assumed to be True for every token inside. If the number of tokens in a cell differs between source and target, the source masked ratio is calculated and used to mask the same fraction of target tokens.
channels: unsupported

Huge thanks to @shmh40 for all the help.

Issue Number

Closes #911

Checklist before asking for review

I have performed a self-review of my code
My changes comply with basic sanity checks:
- I have fixed formatting issues with ./scripts/actions.sh lint
- I have run unit tests with ./scripts/actions.sh unit-test
- I have documented my code and I have updated the docstrings.
- I have added unit tests, if relevant
I have tried my changes with data and code:
- I have run the integration tests with ./scripts/actions.sh integration-test
- (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
- (bigger changes and experiments) I have shared a hedgedoc in the github issue with all the configurations and runs for these experiments
I have informed and aligned with people impacted by my change:
- for config changes: the MatterMost channels and/or a design doc
- for changes of dependencies: the MatterMost software development channel

shmh40 · 2025-10-17T19:00:37Z

config/streams/fesom/fesom.yml


 FESOM_NODE :
  type : fesom
  filenames : ['ocean_node']


Separately, I think the paths in WG-private need to be updated for FESOM data

shmh40 · 2025-10-17T19:01:55Z

src/weathergen/datasets/masking.py

                # select only the target times where mask is True
-                selected_tensors = [c for i, c in enumerate(cc) if pp[i]]
-
+                if len(cc) == len(pp):


Do we want this if statement for the other strategies to be safer?

I'm not sure if this makes sense for the others. In the case of random, the same code handles both cases.

shmh40 · 2025-10-17T19:03:43Z

src/weathergen/datasets/masking.py

+
+            elif self.current_strategy == "healpix":
+                selected_tensors = (
+                    cc if len(pp) > 0 and pp[0] else []


I need to double check this

shmh40 · 2025-10-17T19:08:27Z

src/weathergen/datasets/masking.py

+            elif self.current_strategy == "random":
+                # For random masking, we simply select the tensors where the mask is True.
+                # When there's no mask it's assumed to be False. This is done via strict=False
+                selected_tensors = [c for c, p in zip(cc, pp, strict=False) if p]


In the case where cc is longer than pp, this just drops the "extra tokens", right? Do we want instead to set them to True (select them) so they are part of the target?

That's right, they are dropped. I'm not sure what's the best choice, but dropping is definitely easier

shmh40 · 2025-10-17T19:11:27Z

src/weathergen/datasets/masking.py

+                elif len(pp) == 0:
+                    selected_tensors = cc
+                else:  # If length of target and mask doesn't match, create new mask
+                    ratio = np.sum(cc) / len(pp)  # Ratio of masked tokens in source


Shouldn't this be np.sum(pp) or similar? You are trying to look at the ratio of masking in the source right? I am not exactly sure what the right code should be here
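
A minimal sketch of the ratio this comment seems to be asking for, assuming `pp` is the boolean source mask and `cc` holds the token tensors. The value of `n_target` is a made-up example, and rounding to the nearest integer is an assumption rather than something the PR specifies.

```python
import numpy as np

pp = np.array([False, True, True, True])  # source mask: 3 of 4 tokens masked
n_target = 8                              # hypothetical token count in the target cell

# Fraction of masked source tokens; depends only on the mask, not on token values.
ratio = np.sum(pp) / len(pp)

# Number of target tokens to mask so the same fraction is covered.
n_masked = int(round(ratio * n_target))

print(ratio, n_masked)  # 0.75 6
```

Summing `cc` instead would add up token values rather than count masked entries, so the result would not be a masking fraction.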

shmh40

Thank you for the PR @kacpnowak! Added some comments. It runs for me with random and healpix on the FESOM data, but it doesn't work for causal; I don't think np.sum(cc) is right here.

I will try to test again properly, including with the existing data, which we want to make sure still works. There are some plans to rewrite some of this code for upcoming work, so we will try to see how it fits in.

Implement separate target and source files, adjust masking

1bb1d2a

github-project-automation bot added this to WeatherGen-dev Oct 6, 2025

kacpnowak marked this pull request as draft October 6, 2025 15:49

shmh40 self-requested a review October 6, 2025 15:57

shmh40 added the model and data:reading labels Oct 6, 2025

Merge branch 'develop' into kacpnowak/develop/source-target

44a23c1

clessig self-requested a review October 16, 2025 09:32

shmh40 reviewed Oct 17, 2025

kacpnowak added 2 commits October 20, 2025 11:35

Merge branch 'develop' into kacpnowak/develop/source-target

cdad3f1

Fix casual masking

748f4ca

kacpnowak marked this pull request as ready for review October 20, 2025 14:13
