-
Notifications
You must be signed in to change notification settings - Fork 38
Enable FesomDataReader to have different source and target datasets #1046
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Enable FesomDataReader to have different source and target datasets #1046
Conversation
|
||
FESOM_NODE : | ||
type : fesom | ||
filenames : ['ocean_node'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Separately, I think the paths in WG-private need to be updated for FESOM data
# select only the target times where mask is True | ||
selected_tensors = [c for i, c in enumerate(cc) if pp[i]] | ||
|
||
if len(cc) == len(pp): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want this if statement for the other strategies to be safer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this makes sens for others. In case of random the same code handles both cases
|
||
elif self.current_strategy == "healpix": | ||
selected_tensors = ( | ||
cc if len(pp) > 0 and pp[0] else [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to double check this
elif self.current_strategy == "random": | ||
# For random masking, we simply select the tensors where the mask is True. | ||
# When there's no mask it's assumed to be False. This is done via strict=False | ||
selected_tensors = [c for c, p in zip(cc, pp, strict=False) if p] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the case where cc is longer that pp this just drops the "extra tokens" right? Do we want instead to set them to be true/select them, so they are part of the target?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right, they are dropped. I'm not sure what's the best choice, but dropping is definitely easier
src/weathergen/datasets/masking.py
Outdated
elif len(pp) == 0: | ||
selected_tensors = cc | ||
else: # If length of target and mask doesn't match, create new mask | ||
ratio = np.sum(cc) / len(pp) # Ratio of masked tokens in source |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be np.sum(pp) or similar? You are trying to look at the ratio of masking in the source right? I am not exactly sure what the right code should be here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR @kacpnowak! Added some comments. It runs for me for random, healpix for the fesom data but it doesn't work for causal, np.sum(cc) I don't think is right here.
Will may try to test again properly including for existing data that we want to make sure still works safely. There are some plans to rewrite some of this code for upcoming work, so we will try to see how it fits in.
Description
This PR implements separate source and target datasets for FesomDataReader.
It was also necessary modify masking strategies, as healpix cell can empty for the source but not for the target. Additionally there can be different number of tokens per cell. Different adjustments had to be made for each strategy:
Huge thanks for @shmh40 for all the help.
Issue Number
Closes #911
Checklist before asking for review
./scripts/actions.sh lint
./scripts/actions.sh unit-test
./scripts/actions.sh integration-test
launch-slurm.py --time 60