kacpnowak
Contributor

Description

This PR implements separate source and target datasets for FesomDataReader.

It was also necessary to modify the masking strategies, because a healpix cell can be empty for the source but not for the target. Additionally, the number of tokens per cell can differ between source and target. Each strategy needed a different adjustment:

  • random: when the mask is missing for a token, it's assumed to be False.
  • healpix: when the mask is missing for the entire healpix cell, it's assumed to be False.
  • causal: if the mask is missing for an entire cell, it's assumed to be True for every token inside. If the number of tokens in a cell differs between source and target, the source masked ratio is calculated and used to mask the same fraction of target tokens.
  • channels: unsupported
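
The rules above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the function name `select_target_tokens` is hypothetical, `cc` and `pp` follow the naming used in the diff (target tokens and source mask for one cell), and masking the trailing tokens in the causal case is an assumption about how the fraction would be applied.

```python
def select_target_tokens(strategy, cc, pp):
    """Pick target tokens for one healpix cell (hypothetical helper).

    cc: target tokens in the cell
    pp: source mask for the cell; may be shorter than cc or empty
    """
    if strategy == "random":
        # Tokens with no mask entry are treated as False and dropped.
        return [c for c, p in zip(cc, pp) if p]
    if strategy == "healpix":
        # One decision per cell; a missing mask means the cell is not selected.
        return list(cc) if len(pp) > 0 and pp[0] else []
    if strategy == "causal":
        if len(pp) == 0:
            # Missing mask for the whole cell: treat every token as masked.
            return list(cc)
        if len(pp) == len(cc):
            return [c for c, p in zip(cc, pp) if p]
        # Lengths differ: reuse the masked fraction of the source.
        ratio = sum(bool(p) for p in pp) / len(pp)
        n_masked = round(ratio * len(cc))
        # Assumption: causal masking hides the trailing tokens.
        return list(cc[len(cc) - n_masked:])
    raise ValueError(f"unsupported strategy: {strategy}")


# Source masked 2 of 4 tokens, target has 6 tokens: the last 3 are selected.
print(select_target_tokens("causal", list("abcdef"), [False, False, True, True]))
```

Whether the causal case should take the trailing tokens or some other subset is a design choice this sketch does not settle.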

Huge thanks to @shmh40 for all the help.

Issue Number

Closes #911

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a HedgeDoc in the GitHub issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@kacpnowak kacpnowak marked this pull request as draft October 6, 2025 15:49
@shmh40 shmh40 self-requested a review October 6, 2025 15:57
@shmh40 shmh40 added the model and data:reading labels Oct 6, 2025
@clessig clessig self-requested a review October 16, 2025 09:32

FESOM_NODE :
  type : fesom
  filenames : ['ocean_node']
Contributor

Separately, I think the paths in WG-private need to be updated for FESOM data

# select only the target times where mask is True
selected_tensors = [c for i, c in enumerate(cc) if pp[i]]

if len(cc) == len(pp):
Contributor

Do we want this if statement for the other strategies to be safer?

Contributor Author

I'm not sure this makes sense for the other strategies. In the case of random, the same code handles both cases.


elif self.current_strategy == "healpix":
selected_tensors = (
cc if len(pp) > 0 and pp[0] else []
Contributor

I need to double check this

elif self.current_strategy == "random":
# For random masking, we simply select the tensors where the mask is True.
# When there's no mask it's assumed to be False. This is done via strict=False
selected_tensors = [c for c, p in zip(cc, pp, strict=False) if p]
Contributor

@shmh40 shmh40 Oct 17, 2025

In the case where cc is longer than pp, this just drops the "extra tokens", right? Do we want instead to set them to True (select them), so they are part of the target?

Contributor Author

That's right, they are dropped. I'm not sure what the best choice is, but dropping is definitely easier.

elif len(pp) == 0:
selected_tensors = cc
else: # If length of target and mask doesn't match, create new mask
ratio = np.sum(cc) / len(pp) # Ratio of masked tokens in source
Contributor

Shouldn't this be np.sum(pp) or similar? You are trying to look at the ratio of masking in the source, right? I am not exactly sure what the right code should be here.
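
A quick way to see the difference between the two expressions, assuming `pp` is a boolean source mask and `cc` is a list of token arrays holding data values (the concrete arrays below are invented for illustration, and the real tokens may be tensors of a different type):

```python
import numpy as np

pp = np.array([False, True, True, False])  # source mask: 2 of 4 tokens masked
cc = [np.full(3, 10.0), np.full(3, 20.0)]  # target tokens with data values

# Summing the tokens adds up data values, which is not a fraction.
ratio_from_tokens = np.sum(cc) / len(pp)

# Summing the boolean mask counts the masked tokens.
ratio_from_mask = np.sum(pp) / len(pp)

print(ratio_from_tokens)  # 22.5
print(ratio_from_mask)    # 0.5
```

Only the mask-based value is guaranteed to lie in [0, 1], which is what a masked ratio needs.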

Contributor

@shmh40 shmh40 left a comment

Thank you for the PR @kacpnowak! I added some comments. It runs for me with the random and healpix strategies on the FESOM data, but it doesn't work for causal; I don't think np.sum(cc) is right here.

I will try to test again properly, including with existing data that we want to make sure still works safely. There are some plans to rewrite some of this code for upcoming work, so we will try to see how it fits in.

@kacpnowak kacpnowak marked this pull request as ready for review October 20, 2025 14:13
Labels

data:reading - Everything related to data reading
model - Related to model training or definition (not generic infra)

Development

Successfully merging this pull request may close these issues.

Enable FesomDataReader to have different source and target datasets

2 participants