Implementation of healpix cell masking #407

shmh40 · 2025-06-27T16:45:52Z

Description

This draft PR implements masking of cells based on HEALPix cells and the HEALPix level. Masking can be done at arbitrary HEALPix levels and makes use of the nested indexing of the HEALPix cells.

One question (maybe @tjhunter): is the given implementation OK for passing args for specific masking strategies? I am not sure what style we want. For healpix masking, for example, we need to pass the healpix level of the data and the healpix level at which the masking should occur, e.g. our data is at healpix level 5 and we want to do very large-scale masking (level 0 or 1). These args are only relevant for healpix masking, so they are implemented here as a dictionary "strategy_kwargs" that can be passed in the config with args specific to the masking strategy (hl_data, hl_mask); for other strategies it is ignored. Hope that is OK.

Note this PR extends PR #383, and is currently set to merge into that branch shmh40/dev/masking_class.

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update

Issue Number

Fixes #397.

Code Compatibility

I have performed a self-review of my code

Code Performance and Testing

I ran the uv run train and (if necessary) uv run evaluate on at least one GPU node and it works
If the new feature introduces modifications at the config level, I have notified the other software developers through Mattermost and updated the paths in the $WEATHER_GENERATOR_PRIVATE directory

Dependencies

I have ensured that the code is still pip-installable after the changes and runs
I have tested that new dependencies themselves are pip-installable.
I have not introduced new dependencies in the inference portion of the pipeline

Documentation

My code follows the style guidelines of this project
I have updated the documentation and docstrings to reflect the changes
I have added comments to my code, particularly in hard-to-understand areas

Additional Notes


clessig

Thanks for implementing it! Looks good already, just some minor comments.

clessig · 2025-06-28T11:14:01Z

src/weathergen/datasets/masking.py

+        # NOTE: adding strategy_kwargs to allow for strategy-specific configurations
+        # e.g., for healpix strategy, we might need hl_data and hl_mask parameters
+        # or for different strategies, we might need different parameters?
+        strategy_kwargs: dict,


Thanks. This looks good

clessig · 2025-06-28T11:16:19Z

src/weathergen/datasets/masking.py

    ):
        self.masking_rate = masking_rate
        self.masking_strategy = masking_strategy
        self.masking_rate_sampling = masking_rate_sampling

+        # NOTE: strategy_kwargs is a dictionary that can hold any additional parameters
+        self.strategy_kwargs = strategy_kwargs or {}


It seems better to make sure that {} is passed to the function when strategy_kwargs is not set. This should do it:

cf.get("strategy_kwargs", {})

Also please don't use NOTE in a comment ... that's the whole point of a code comment :)

Using NOTE for myself! Will remove before merging

Implemented this, I think it works neatly

True, didn't read carefully enough.
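The suggestion in this thread amounts to supplying the empty-dict default where the config is read, rather than inside the class. A minimal sketch, assuming the config object behaves like a mapping with a `get` method (a plain `dict` stands in for `cf` here, and the values are illustrative):

```python
# A plain dict stands in for the real config object `cf` (assumption).
cf = {"masking_rate": 0.6, "masking_strategy": "random"}

# The default is supplied at the call site, so downstream code always
# receives a dict and never has to handle None.
strategy_kwargs = cf.get("strategy_kwargs", {})

# When the key is present, the configured values are returned unchanged.
cf_healpix = {
    "masking_strategy": "healpix",
    "strategy_kwargs": {"hl_data": 5, "hl_mask": 0},
}
healpix_kwargs = cf_healpix.get("strategy_kwargs", {})
```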

clessig · 2025-06-28T11:16:35Z

src/weathergen/datasets/masking.py

@@ -54,6 +61,9 @@ def mask_source(
        token_lens = [len(t) for t in tokenized_data]
        num_tokens = sum(token_lens)

+        # print("Length of each token t in tokenized_data:", token_lens)
+        print("Number of tokens in the batch:", num_tokens)


Remove before we merge the PR

clessig · 2025-06-28T11:17:41Z

src/weathergen/datasets/masking.py

@@ -89,6 +99,10 @@ def mask_source(
            if block_size > 0 and num_tokens > 0:
                start_index = self.rng.integers(0, max(1, num_tokens - block_size + 1))
                flat_mask[start_index : start_index + block_size] = True
+
+        elif self.masking_strategy == "healpix":
+            flat_mask = self._generate_healpix_mask(token_lens, rate)


Then we should have a separate function for each of the masking strategies. My feeling is that we might want to implement them as small classes at some point for generality, but a separate function for every strategy seems like a good starting point.

Will raise another issue once this is merged
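One possible shape for "a separate function per strategy" is sketched below. The function names, the registry, and the body of the random strategy are hypothetical and only illustrate the dispatch; the block strategy follows the snippet quoted in the diff above.

```python
import numpy as np


def random_mask(num_tokens: int, rate: float, rng: np.random.Generator) -> np.ndarray:
    """Mask each token independently with probability `rate` (illustrative)."""
    return rng.random(num_tokens) < rate


def block_mask(num_tokens: int, rate: float, rng: np.random.Generator) -> np.ndarray:
    """Mask one contiguous block covering roughly `rate` of the tokens."""
    mask = np.zeros(num_tokens, dtype=bool)
    block_size = int(np.round(rate * num_tokens))
    if block_size > 0 and num_tokens > 0:
        start = rng.integers(0, max(1, num_tokens - block_size + 1))
        mask[start : start + block_size] = True
    return mask


# Hypothetical registry: a new strategy needs one function and one entry.
STRATEGIES = {"random": random_mask, "block": block_mask}


def make_mask(strategy: str, num_tokens: int, rate: float, rng: np.random.Generator) -> np.ndarray:
    """Dispatch to the function registered for `strategy`."""
    assert strategy in STRATEGIES, f"Unknown masking strategy: {strategy}"
    return STRATEGIES[strategy](num_tokens, rate, rng)
```

Moving from functions to small classes later would only change what the registry stores, not the call site.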

clessig · 2025-06-28T11:18:06Z

src/weathergen/datasets/masking.py

+            np.ndarray: A flat boolean array (the token-level mask).
+        """
+
+        print("Generating HEALPix mask...")


Remove before merging

clessig · 2025-06-28T11:19:11Z

src/weathergen/datasets/masking.py

+
+        # NOTE: hl_data and hl_mask are expected to be provided in strategy_kwargs?
+        hl_data = self.strategy_kwargs.get("hl_data")
+        hl_mask = self.strategy_kwargs.get("hl_mask")


We should fail as early as possible when hl_data and hl_mask are not set. Or can we fall back to default values in this case?

@tjhunter is it ok to fall back to default values for a case like this?

yes! This is a good place for default values.

personally, in this case, I would say self.strategy_kwargs.hl_data etc. rather than looking (yet) for a sensible default.

clessig · 2025-06-28T11:19:29Z

src/weathergen/datasets/masking.py

+        hl_data = self.strategy_kwargs.get("hl_data")
+        hl_mask = self.strategy_kwargs.get("hl_mask")
+
+        print(


clessig · 2025-06-28T11:20:21Z

src/weathergen/datasets/masking.py

+        # print(f"[HEALPix Setup] Each parent cell at L{hl_mask} contains {num_children_per_parent} child cells at L{hl_data}.")
+
+        # Choose parent cells to mask based on the specified rate.
+        num_parents_to_mask = int(np.round(rate * num_parent_cells))


This should be sampled when masking_rate_sampling = True

clessig · 2025-06-28T11:20:46Z

src/weathergen/datasets/masking.py

+        # print(f"[HEALPix Masking] Parent IDs selected: {parent_ids_to_mask}")
+
+        # Now determine which child cells (and their tokens) are masked.
+        # This is cells.


clessig · 2025-06-28T11:21:31Z

src/weathergen/datasets/masking.py

+        # This is cells.
+        cell_mask = np.zeros(num_data_cells, dtype=bool)
+        # print("[HEALPix Masking] Mapping parent cells to child cell indices:")
+        for parent_id in parent_ids_to_mask:


Can this loop be vectorized?

Yes, good point, done
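For reference, a sketch of how the parent-to-child mapping can be vectorized. It relies on the nested indexing mentioned in the description: the children of parent cell `p` occupy a contiguous index range at the finer level. The levels and selected parent IDs below are illustrative, and this is not necessarily how the PR implements it.

```python
import numpy as np

hl_data, hl_mask = 3, 1  # illustrative levels; hl_mask is coarser than hl_data
num_data_cells = 12 * 4**hl_data
num_children_per_parent = 4 ** (hl_data - hl_mask)

parent_ids_to_mask = np.array([0, 5])  # illustrative selection

# In nested ordering, parent p covers child indices [p * n, (p + 1) * n),
# with n = num_children_per_parent. Broadcasting builds all ranges at once.
child_offsets = np.arange(num_children_per_parent)
child_indices = (
    parent_ids_to_mask[:, None] * num_children_per_parent + child_offsets
).reshape(-1)

cell_mask = np.zeros(num_data_cells, dtype=bool)
cell_mask[child_indices] = True
```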


tjhunter

@shmh40 thanks! I have some style comments. 2 high level comments:

it is missing how you would use it. when writing doc / updating the config, think how a newcomer would want to use this feature
(not you): the rng in masking is depending on time. This makes the code non-deterministic. This issue to me is larger than biased training. I would love for us to just have a single seed for everything set in config

tjhunter · 2025-07-03T08:44:01Z

src/weathergen/datasets/masking.py

+
+        # NOTE: hl_data and hl_mask are expected to be provided in strategy_kwargs?
+        hl_data = self.strategy_kwargs.get("hl_data")
+        hl_mask = self.strategy_kwargs.get("hl_mask")


yes! This is a good place for default values.

tjhunter · 2025-07-03T08:48:34Z

src/weathergen/datasets/masking.py

+        hl_data = self.strategy_kwargs.get("hl_data")
+        hl_mask = self.strategy_kwargs.get("hl_mask")
+
+        if hl_data is None or hl_mask is None:


style: it is better to say: assert hl_data is not None and hl_mask is not None, "If ..."

testing systems such as pytest can then do clever things to present to you the offending values if it fails.

But can we test this in the constructor? This will give a much cleaner stack trace and it will fail much earlier.

This has been moved to be tested in the constructor, thanks.

I have not included default values at the moment, which I think is fine for now, we can include them later/in another PR if we like. I am not sure what our plan is overall with defaults, or @tjhunter the reasons for not having defaults. Did you write up a brief doc on your discussion with Sophie? Sorry if I missed it.
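A minimal sketch of what checking in the constructor could look like, using the assert style suggested above. `MaskerSketch` is a hypothetical stand-in, not the project's `Masker`, and the parameter name `masking_strategy_config` follows the rename discussed later in this review.

```python
class MaskerSketch:
    """Hypothetical stand-in that only shows early validation."""

    def __init__(self, masking_rate, masking_strategy, masking_rate_sampling, masking_strategy_config):
        self.masking_rate = masking_rate
        self.masking_strategy = masking_strategy
        self.masking_rate_sampling = masking_rate_sampling
        self.masking_strategy_config = masking_strategy_config

        # Validate strategy-specific settings once, at construction time,
        # so a bad config fails before any data is loaded.
        if masking_strategy == "healpix":
            hl_data = masking_strategy_config.get("hl_data")
            hl_mask = masking_strategy_config.get("hl_mask")
            assert hl_data is not None and hl_mask is not None, (
                "If masking with HEALPix, hl_data and hl_mask must be provided."
            )
            assert hl_mask < hl_data, "hl_mask must be less than hl_data for HEALPix masking."
```

With this in place, the mask-generation method can read the two values without re-checking them on every call.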

tjhunter · 2025-07-03T08:48:40Z

src/weathergen/datasets/masking.py

+                "If masking with HEALPix, hl_data and hl_mask must be provided in strategy_kwargs."
+            )
+
+        if hl_mask >= hl_data:


tjhunter · 2025-07-03T08:50:03Z

src/weathergen/datasets/masking.py

+            assert False, "hl_mask must be less than hl_data for HEALPix masking."
+
+        num_data_cells = 12 * (4**hl_data)
+        if len(token_lens) != num_data_cells:


Thank you for pointing these out

tjhunter · 2025-07-03T08:52:35Z

src/weathergen/datasets/masking.py

+
+        # if masking_rate_sampling is enabled, sample the rate from a normal distribution.
+        if self.masking_rate_sampling:
+            rate = np.clip(


This chunk is copied over. Can you factor it into a single class method? I am sure we will need to adjust this formula eventually; better to adjust it once.

There is an existing issue closely related to this, will do it at some point in a separate PR
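As a rough illustration of the refactoring requested here, the rate sampling could live in one helper that every strategy calls. The class name, the standard deviation, and the clipping bounds below are placeholder assumptions; the actual formula in the PR is not shown in this thread.

```python
import numpy as np


class RateSampler:
    """Hypothetical helper that keeps the rate-sampling formula in one place."""

    def __init__(self, masking_rate: float, masking_rate_sampling: bool, seed: int = 0):
        self.masking_rate = masking_rate
        self.masking_rate_sampling = masking_rate_sampling
        self.rng = np.random.default_rng(seed)

    def get_rate(self) -> float:
        """Return the fixed rate, or a sample around it when sampling is enabled."""
        if not self.masking_rate_sampling:
            return self.masking_rate
        # Placeholder spread and bounds (assumptions, not the project's values).
        sampled = self.rng.normal(loc=self.masking_rate, scale=0.1)
        return float(np.clip(sampled, 0.01, 0.99))
```

Each strategy would then call `get_rate()` once per batch instead of carrying its own copy of the formula.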

tjhunter · 2025-07-03T08:57:38Z

config/default_config.yml

 # sample the masking rate (with normal distribution centered at masking_rate)
 masking_rate_sampling: True
 # sample a subset of all target points, useful e.g. to reduce memory requirements
 sampling_rate_target: 1.0
 # include a masking strategy here, currently only supporting "random" and "block"
 masking_strategy: "random"

+


can you put a commented out example of the healpix strategy? Right now, we do not have documentation for what is expected in hl_data and hl_mask

Yes, good point, thank you! Added!
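To make the meaning of the two parameters concrete, here is a Python mirror of the relevant config entries together with the cell counts they imply. The key name `masking_strategy_config` follows the rename discussed later in this review and the values come from the example in the PR description; the exact layout of the YAML file is an assumption.

```python
# Python mirror of the relevant config entries (illustrative values).
config = {
    "masking_strategy": "healpix",
    "masking_strategy_config": {
        "hl_data": 5,  # HEALPix level of the data
        "hl_mask": 0,  # coarser HEALPix level at which cells are masked
    },
}


def num_cells(level: int) -> int:
    """Number of HEALPix cells at a given level: 12 * 4**level."""
    return 12 * 4**level


strategy_config = config["masking_strategy_config"]
num_parent_cells = num_cells(strategy_config["hl_mask"])
num_data_cells = num_cells(strategy_config["hl_data"])
children_per_parent = num_data_cells // num_parent_cells
```

With these values, each masked cell at level 0 removes 1024 of the 12288 data cells at level 5, which is the very large-scale masking described in the PR description.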

tjhunter · 2025-07-03T08:59:02Z

src/weathergen/datasets/masking.py

+
+        # NOTE: hl_data and hl_mask are expected to be provided in strategy_kwargs?
+        hl_data = self.strategy_kwargs.get("hl_data")
+        hl_mask = self.strategy_kwargs.get("hl_mask")


personally, in this case, I would say self.strategy_kwargs.hl_data etc. rather than looking (yet) for a sensible default.

tjhunter · 2025-07-03T09:00:21Z

src/weathergen/datasets/masking.py

+            rate (float): The desired masking rate, applied to the parent cells.
+
+        Returns:
+            np.ndarray: A flat boolean array (the token-level mask).


what is the len() of this array? the indexing is unclear to me

number of tokens
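A small sketch of why the returned array has one entry per token. It assumes, as the length check in the diff suggests, that `token_lens` holds the number of tokens in each data cell; the use of `np.repeat` is one way to expand a cell-level mask and is not necessarily what the PR does.

```python
import numpy as np

token_lens = [2, 0, 3, 1]  # tokens per data cell (illustrative)
cell_mask = np.array([True, False, False, True])  # one entry per data cell

# Repeat each cell's decision once per token in that cell. Cells with zero
# tokens contribute nothing, so the result has sum(token_lens) entries.
flat_mask = np.repeat(cell_mask, token_lens)
```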

tjhunter · 2025-07-03T09:01:14Z

src/weathergen/datasets/masking.py

    ):
        self.masking_rate = masking_rate
        self.masking_strategy = masking_strategy
        self.masking_rate_sampling = masking_rate_sampling

+        # strategy_kwargs is a dictionary that can hold any additional parameters
+        self.strategy_kwargs = strategy_kwargs


I would call it strategy_config. kwargs is an implementation choice, and it will be later replaced by a class (likely)

tjhunter · 2025-07-03T09:02:31Z

src/weathergen/datasets/multi_stream_data_sampler.py

@@ -182,7 +182,9 @@ def __init__(
        if cf.training_mode == "forecast":
            self.tokenizer = TokenizerForecast(cf.healpix_level, cf.data_loader_rng_seed)
        elif cf.training_mode == "masking":
-            masker = Masker(cf.masking_rate, cf.masking_strategy, cf.masking_rate_sampling)
+            masker = Masker(
+                cf.masking_rate, cf.masking_strategy, cf.masking_rate_sampling, cf.get("strategy_kwargs", {})


masking_strategy_extra instead of strategy_kwargs ? strategy is vague.

Changed name to masking_strategy_config

clessig · 2025-07-06T08:48:21Z

(not you): the rng in masking is depending on time. This makes the code non-deterministic. This issue to me is larger than biased training. I would love for us to just have a single seed for everything set in config

In my review I already made a suggestion how to fix it ;)

tjhunter · 2025-07-07T07:11:19Z

(not you): the rng in masking is depending on time. This makes the code non-deterministic. This issue to me is larger than biased training. I would love for us to just have a single seed for everything set in config

In my review I already made a suggestion how to fix it ;)

Yes! Somehow it did not appear in the history.
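The determinism concern can be illustrated with NumPy's generator API: two generators built from the same seed produce identical draws, whereas a time-based seed does not. The seed value and variable names are illustrative; how the seed is wired through the config is left open here.

```python
import numpy as np

seed = 42  # would come from a single config entry (hypothetical)

# Two generators constructed from the same seed yield the same sequence,
# so a run can be reproduced exactly.
rng_a = np.random.default_rng(seed)
rng_b = np.random.default_rng(seed)

perm_a = rng_a.permutation(10)
perm_b = rng_b.permutation(10)
```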


shmh40 and others added 29 commits June 24, 2025 08:14

creating masking class and adapting tokenizer_masking to use this class

72fb7de

minor changes to masking.py and tokenizer_masking

570056d

removed old tokenizer_masking

7a383b1

include masking_strategy in default_config

c0b726c

change ValueError to assert

370d5e1

linting formatting changes files

08d676f

further linting of docstrings

5fbc9b1

create mask_source and mask_target in Masker, and update tokenizer_masking to use these, then style improvements

cbe0a09

linted masking, tokenizer_masking

0388176

Merge branch 'develop' into shmh40/dev/masking_class

7a75a36

Merge remote-tracking branch 'origin/develop' into shmh40/dev/masking_class

e872ba3

modify masker, rng and perm_sel now part of class, remove extra masking_rate, update comments, remove archived class

15504df

remove check if all masked, not masked

4a7a43d

remove self.masking_rate from MultiStreamDS class, and masking args from batchify_source

fe4224b

update tokenizer utils with description of idx_ord_lens in comment

6170a69

remove masking args from batchify_, perm_sel removed now internal to Masker class, remove handling special cases of masking (all masked)

4c80c13

working implementation of healpix level masking in Masker, with too many prints and hardcoded hl_mask and hl_data

82db544

adding masking_strategy: to config

4d5e947

Merge remote-tracking branch 'origin/develop' into shmh40/dev/masking_class

d8038d5

remove unused mentions of masking_combination

0126820

removed comment about streams

ea726ca

changed assert to check self perm_sel is not None

20df58b

ruff masking, tokenizer_masking

099fd39

implementation of healpix masking code with lots of printing

9881f0a

removed print statements from masking.py

a228c41

merged with latest shmh40/dev/masking_class

caf01b1

minor line change

02de1c9

remove default for strategy_kwargs

0b76fb0

add strategy_kwargs to config, and pass through masker to pass masking strategy specific args

9d80c0a

shmh40 requested a review from clessig June 27, 2025 16:47

shmh40 added the enhancement New feature or request label Jun 27, 2025

shmh40 added this to WeatherGen-dev Jun 27, 2025

shmh40 moved this to In Progress in WeatherGen-dev Jun 27, 2025

shmh40 added the model Related to model training or definition (not generic infra) label Jun 27, 2025

clessig reviewed Jun 28, 2025


Base automatically changed from shmh40/dev/masking_class to develop June 28, 2025 12:31

shmh40 added 8 commits June 30, 2025 10:08

Merge remote-tracking branch 'origin/develop' into shmh40/dev/masking_strategies

ef27602

vectorise child indices calcs, implement masking_rate_sampling, minorly updated docs

00c5733

remove print statements

d09db10

cf.strategy_kwargs passed to Masker in multi_stream_data_sampler

86205e5

masking_strategy random and strategy kwargs passed to config

44aa6d5

ruffed

ef451c8

pass cf.get(strategy_kwargs or {}) to the Masker and update masking to reflect this

a981733

update config so it does not include strategy_kwargs, no longer needed

8098357

shmh40 linked an issue Jun 30, 2025 that may be closed by this pull request

Masking strategies based on healpix cells #397

Open

shmh40 marked this pull request as ready for review June 30, 2025 11:41

tjhunter approved these changes Jul 3, 2025


shmh40 added 3 commits July 8, 2025 12:42

move asserts for healpix to constructor, rename to masking_strategy_config, update config with example of healpix

e596a87

Merge remote-tracking branch 'origin/develop' into shmh40/dev/masking_strategies

e5d64b6

merged latest changes from develop

a99ec2b

tjhunter added the merge-hold Do not merge this PR yet, being tested. label Jul 16, 2025

shmh40 added 6 commits July 16, 2025 10:04

updated default config to match latest

2a6fbe0

Merge remote-tracking branch 'origin/develop' into shmh40/dev/masking_strategies

e190ee6

merge develop including reset_rng

797a734

fixed docstring of masker

c11bc09

ruffed linted

58a3ae1

rename l in token_lens

fbbecad

shmh40 requested a review from sophie-xhonneux July 21, 2025 14:55
