
Conversation

@fatemetkl commented Oct 2, 2025 (Collaborator)

PR Type

Feature

Short Description

Sorry for the big PR. I had initially intended not to include the synthesizer module, but I needed it to test the full shadow training pipeline. Therefore, it is added here almost unchanged from the original MIDSTModels codebase, except for some fixes to make it executable.

Clickup Ticket(s):
Shadow model training pipeline: https://app.clickup.com/t/868fp4tw3
Synthesizer module: https://app.clickup.com/t/868fp4vk7

  • Shadow model training pipeline of Ensemble Attack:
    --> ensemble/rmia/shadow_model_training.py:
    Contains the main function for running the ensemble attack’s shadow model training, as designed by the CITADEL & UQAM team.
    --> ensemble/shadow_model_utils.py:
    Provides helper functions for model training and data synthesis, as well as utilities for modifying configs for each shadow model set.
    --> ensemble/tabddpm_fine_tuning.py:
    Implements fine-tuning versions of the main training functions. Unlike their counterparts in models.clavaddpm, these functions start from pre-trained models. This module should eventually be unified with the toolkit’s training module in future PRs to reduce refactoring overhead.

  • Synthesizer module (synthesizer.py): This module contains all the code needed to synthesize data for the attack implementation. It is adapted from the original MIDSTModels codebase with minor modifications, including ruff and mypy fixes. Significant refactoring is still needed in future PRs to align it with project standards. Note: no dedicated tests were added for this module. Its functions are currently tested only indirectly through shadow model synthesis tests.

Tests Added

  • tests/unit/attacks/ensemble/test_shadow_model_training.py
  • tests/unit/attacks/ensemble/test_shadow_model_utils.py

@fatemetkl changed the title from "Ft/shadow models" to "shadow models training and synthesizer" on Oct 2, 2025
@fatemetkl marked this pull request as ready for review on October 3, 2025 16:41


# TODO: Too many statements and branches, refactor.
def sample_from_diffusion( # noqa: PLR0915, PLR0912
Collaborator:

I know that this is a copy from MIDST to get the pipeline running, but perhaps in the long run we should not require df as an input to find the numerical and categorical columns and just use df_info for that.

Collaborator Author:

I agree! Definitely needs a big refactor.

Collaborator:

I can take care of that as part of the other refactors I am doing with MIDSTModels code.

@lotif (Collaborator) left a comment:

Only had time to review half of this PR, will review the other half tomorrow. Lots of small comments but otherwise it looks good so far.

"clustering": {
"parent_scale": 1.0,
"num_clusters": 50,
"num_clusters": 1,
Collaborator:

Is this right? Just 1 cluster?

Collaborator Author:

This is just me trying to minimize the run time of the example 😂
Will fix these values.

"batch_size": 4096,
"lr": 0.0006,
"iterations": 2,
"batch_size": 1,
Collaborator:

These values also seem odd, as do the ones a few lines above and the classifier values below.

Collaborator Author:

I agree, these values aren’t realistic. I ran the attack example with them on a very small portion of the data just to make sure it works. That’s why iterations is set to 2 and batch_size to 1. I’ll keep some of the values as they are for now so the example can run quickly.
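As a rough illustration of keeping the checked-in config realistic while still shrinking an example run, one option is to override the smoke-test values at compose time (the tests already use Hydra's compose). The key paths below are assumptions for illustration only, not the PR's actual config layout.

from hydra import compose, initialize

# Hypothetical sketch: keep realistic defaults in shadow_training_config and
# shrink only the example/smoke run via overrides. The key paths
# (clustering.num_clusters, diffusion.iterations, diffusion.batch_size) are
# assumed and may not match the real config structure.
with initialize(version_base=None, config_path="configs"):
    config = compose(
        config_name="shadow_training_config",
        overrides=[
            "clustering.num_clusters=1",
            "diffusion.iterations=2",
            "diffusion.batch_size=1",
        ],
    )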

# Pipeline control
pipeline:
run_data_processing: true
run_data_processing: false # Set this to false if you have already saved the processed data
Collaborator:

With that comment, I would assume you wanted to keep this as true

pipeline:
run_data_processing: true
run_data_processing: false # Set this to false if you have already saved the processed data
run_shadow_model_training: True
Collaborator:

Let's keep this consistent and use lowercase true.



# TODO: This function and the next one can be unified later.
def train_fine_tuning_shadows(
Collaborator:

I think train_fine_tuned_shadow_models would be a better name.

will be created if it does not exist, and all the relevant configs will be copied here automatically.
training_json_config_paths: Configuration dictionary containing paths to the data JSON config files.
fine_tuning_config: Configuration dictionary containing shadow model fine-tuning specific information.
n_models_per_set: Number of shadow models to train by each approach, must be even, defaults to 4.
Collaborator:

The wording here could use a little bit of love:

Number of shadow models to train by each approach. Must be an even number. Defaults to 4.
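A minimal sketch of how the suggested rename and docstring wording could look together; the parameter list is abbreviated and the exact signature in the PR may differ.

def train_fine_tuned_shadow_models(
    training_json_config_paths: dict,
    fine_tuning_config: dict,
    n_models_per_set: int = 4,
) -> None:
    """
    Train the fine-tuned shadow model sets.

    Args:
        training_json_config_paths: Configuration dictionary containing paths to the data JSON config files.
        fine_tuning_config: Configuration dictionary containing shadow model fine-tuning specific information.
        n_models_per_set: Number of shadow models to train by each approach. Must be an even number. Defaults to 4.
    """
    ...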

will be saved. Model artifacts and synthetic data will be saved under this directory as well.
training_json_config_paths: Configuration dictionary containing paths to the data JSON config files.
fine_tuning_config: Configuration dictionary containing shadow model fine-tuning specific information.
init_model_id: Distinguishes the pre-trained initial models.
Collaborator:

I think a better description of this parameter would be "An ID to assign to the pre-trained initial models." The fact that it distinguishes models from others will depend on the value the caller assigns to it.

INFO,
f"Second set of shadow model training completed and saved at {second_set_result_path}.",
)
# Attack codebase comment: "The following eight models are trained from scratch on the challenge points,
Collaborator:

Same here.

INFO,
f"First set of shadow model training completed and saved at {first_set_result_path}",
)
# Attack codebase comment: "The following four models are trained in the same way, with a new initial training set
Collaborator:

I think this can say Original codebase comment instead.

) -> tuple[Path, Path, Path]:
"""
Runs the shadow model training pipeline of the ensemble attack. This pipeline consists of three sets of
shadow model training.
Collaborator:

I think "This pipeline trains three sets of shadow models." is a better wording for this.

@lotif (Collaborator) left a comment:

Second batch of comments. Still not done yet :)

# The following function is the slightly modified version of
# ``midst_toolkit.models.clavaddpm.data_loaders`` by the CITADEL & UQAM team.
def load_multi_table(
data_dir: Path, train_df: pd.DataFrame | None = None, verbose: bool = True
Collaborator:

What do you think about renaming train_df to train_data? I've been doing that in the training code but sometimes I second guess those decisions.

Collaborator Author:

I agree, this is definitely a better name!
I also merged the attack's load_multi_table() with the existing load_multi_table() in midst_toolkit.models.clavaddpm.data_loader, and it seems to work fine. Now we can optionally pass the table dataframes to load_multi_table(), which will make the upcoming refactors easier.
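For reference, a sketch of what the merged signature might look like after the train_df to train_data rename; the body and return value are omitted, and the optional parameter is the point being illustrated.

from pathlib import Path

import pandas as pd


def load_multi_table(data_dir: Path, train_data: pd.DataFrame | None = None, verbose: bool = True):
    """Load the multi-table dataset, optionally using a caller-provided training dataframe."""
    ...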

"domain": domain,
"children": meta["children"],
"parents": meta["parents"],
}
Collaborator:

Optional: I'm going to do this at one point on my copy of this function, but if you have the bandwidth it would be nice to have a Table dataclass to hold this information. This way we don't need to add explanations about the format of this dictionary in the docstrings.
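A minimal sketch of such a dataclass, with field names taken from the dictionary keys shown above and types that are assumptions:

from dataclasses import dataclass, field

import pandas as pd


@dataclass
class Table:
    df: pd.DataFrame
    domain: dict
    children: list = field(default_factory=list)
    parents: list = field(default_factory=list)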

id_cols = [col for col in tables[table]["df"].columns if "_id" in col]
df_no_id = tables[table]["df"].drop(columns=id_cols)
info = get_info_from_domain(df_no_id, tables[table]["domain"])
data, info = pipeline_process_data(
Collaborator:

Sorry, I just renamed this function in my refactor 😬

along with their corresponding labels, to the specified path under ``processed_attack_data_path``.
The size of the master challenge datasets (train and test) is half of the total population data size each,
as determined by the attack design.
Collaborator:

This might be a newbie question, but how can this attack design be determined? It's not one of the parameters of the function.

@fatemetkl (Collaborator Author) commented Oct 10, 2025:

The attack design is currently determined by the original codebase. I’ve added a comment.

Long answer:
I agree this is very annoying, but for now the attack implementation is heavily based on the original attack design with little flexibility. For example, the size of the master challenge datasets is currently set to half of the population. This could potentially be adjusted by modifying the proportion input variable in split_real_data, but doing so would require changes to the codebase.

The plan is to refactor the code in multiple stages to make it more flexible once the full pipeline is implemented.

processed_attack_data_path: Path,
column_to_stratify: str,
num_total_samples: int = 40000,
random_seed: int = 42,
Collaborator:

Same problem with the seed here. I think it should default to None and be passed in as a parameter when the user wants to ensure reproducibility.
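A minimal sketch of the pattern being suggested, assuming a helper of this shape (the name set_all_random_seeds is hypothetical): the seed defaults to None and randomness is only pinned when the caller provides one.

import random

import numpy as np
import torch


def set_all_random_seeds(random_seed: int | None = None) -> None:
    # Leave all generators in their default, non-deterministic state unless a seed is given.
    if random_seed is None:
        return
    random.seed(random_seed)
    np.random.seed(random_seed)
    torch.manual_seed(random_seed)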

from midst_toolkit.models.clavaddpm.train import clava_training


def config_tabddpm(
Collaborator:

I think this needs a better name. Maybe save_additional_tabddpm_config?

def config_tabddpm(
data_dir: Path,
training_json_path: Path,
final_json_path: Path,
Collaborator:

I think those two parameters need to say they are the paths for the configs: training_config_json_path, final_config_json_path. Maybe remove the json part for shorter names, but that's optional.
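Putting the two naming suggestions together, a sketch of the resulting signature (body omitted):

from pathlib import Path


def save_additional_tabddpm_config(
    data_dir: Path,
    training_config_json_path: Path,
    final_config_json_path: Path,
) -> None:
    ...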

"models": {},
"configs": configs,
"synth_data": {},
}
Collaborator:

This can definitely be turned into a dataclass; it will make this simpler. Name it TrainingResult, which makes more sense:

from dataclasses import dataclass, field

@dataclass
class TrainingResult:
    save_dir: Path
    configs: Configs
    tables: dict = field(default_factory=dict)
    relation_order: dict = field(default_factory=dict)
    all_group_lengths_probabilities: dict = field(default_factory=dict)
    models: dict = field(default_factory=dict)
    synthetic_data: dict = field(default_factory=dict)

result = TrainingResult(save_dir=save_dir, configs=configs)

Also, the types are too simple; please add type hints for those dicts if you know them.

Collaborator Author:

Awesome suggestion! Thank you!

This data class could be very useful for any example involving training and synthesis. We could likely move it out of the attack module and into a more general inference module in future refactors.

"new_models": {},
"configs": configs,
"synth_data": {},
}
Collaborator:

Same TrainingResult class can be used here. If there are additional attributes, they should be added to the class as optional fields.

@lotif (Collaborator) commented Oct 7, 2025:

Actually, this could be a FineTuningResult class that inherits from TrainingResult:

@dataclass
class FineTuningResult(TrainingResult):
    fine_tuned_models: dict = field(default_factory=dict)

Collaborator Author:

I think it should be sufficient to keep only the fine-tuned models; there’s no need to save both the pre-trained and fine-tuned versions simultaneously.


def fine_tune_tabddpm_and_synthesize(
trained_models: dict[tuple[str, str], dict[str, Any]],
new_train_set: pd.DataFrame,
Collaborator:

Could this be called fine_tune_set instead?

@lotif (Collaborator) left a comment:

Final comments :)

I didn't review the synthesizer.py file; I am planning to do a bigger refactor on it, similar to what I am doing with the other training files.

shadow_models_data_path = tmp_path
# Input
# Population data is used to pre-train some of the shadow models.
population_data = load_dataframe(Path("tests/unit/attacks/ensemble/assets/population_data"), "all_population.csv")
Collaborator:

Can you put this in a constant at the top of the file and use it here and on the method below?
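A sketch of the suggested constant; load_dataframe is the helper already used in the test module, so only the hoisted path is new here.

from pathlib import Path

# Module-level constant shared by both test functions.
POPULATION_DATA_DIR = Path("tests/unit/attacks/ensemble/assets/population_data")

population_data = load_dataframe(POPULATION_DATA_DIR, "all_population.csv")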

return compose(config_name="shadow_training_config")


def test_train_fine_tuning_shadows(cfg: DictConfig, tmp_path: Path) -> None:
Collaborator:

This should be marked with @pytest.mark.integration_test() since it looks like it's running the whole pipeline from the top with no mocks.

Collaborator:

Also, it should be moved to the integration tests folder.
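A sketch of what the marked test could look like, assuming integration_test is registered as a custom marker in the project's pytest configuration:

from pathlib import Path

import pytest
from omegaconf import DictConfig


@pytest.mark.integration_test()
def test_train_fine_tuning_shadows(cfg: DictConfig, tmp_path: Path) -> None:
    ...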

n_models=2,
n_reps=1,
population_data=population_data,
master_challenge_data=population_data[0:20], # For testing purposes only.
Collaborator:

A better comment here would be good, since this is already in a test file this comment does not add much information. Maybe # Limiting the data to 20 samples for faster test execution or something along those lines.

result_path = train_shadow_on_half_challenge_data(
n_models=2,
n_reps=1,
master_challenge_data=population_data[0:40], # For testing purposes only.
Collaborator:

Same comment issue here.

classifier_config: Configs | None,
fine_tuning_diffusion_iterations: int,
fine_tuning_classifier_iterations: int,
device: str = "cuda" if torch.cuda.is_available() else "cpu",
Collaborator:

Same thing for the device here.
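A minimal sketch of the alternative implied here (the helper name resolve_device is hypothetical): default the device to None in the signature and resolve it when the function runs, instead of evaluating torch.cuda.is_available() in a default argument at import time.

import torch


def resolve_device(device: str | None = None) -> str:
    # Fall back to CUDA when available only if the caller did not pin a device.
    if device is not None:
        return device
    return "cuda" if torch.cuda.is_available() else "cpu"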

"""
if parent_name is None:
y_col = "placeholder"
Collaborator:

Can we rename this to target_column? I am in the process of renaming all x to feature and y to target.

child_info,
child_model_params,
child_transformations,
fine_tuning_diffusion_iterations, # fine_tuning_diffusion_iterations used here.
Collaborator:

This comment does not add a lot of information...


def clava_fine_tuning(
trained_models: dict[tuple[str, str], dict[str, Any]],
new_tables: Tables,
Collaborator:

This can be called fine_tuning_tables.

Collaborator Author:

Should we do the renaming now? Currently, it aligns well with the corresponding training function name clava_training.
I think we can rename this whenever the main function's name is also changed. Ideally, they should be merged sooner than that if possible.

Collaborator:

Is this part of the new ticket you made? If yes, it's fine to keep it like this until you work on it.



# TODO: Too many statements and branches, refactor.
def sample_from_diffusion( # noqa: PLR0915, PLR0912
Collaborator:

I can take care of that as part of the other refactors I am doing with MIDSTModels code.

@fatemetkl requested a review from lotif on October 15, 2025 16:57
@lotif (Collaborator) left a comment:

Approved with minor comments. Thanks for addressing all the comments!!!

run_data_processing(config)
# Note: Importing the following two modules causes a segmentation fault error if imported together in this file.
# A quick solution is to load modules dynamically if any of the pipelines is called.
# TODO: Investigate the source of error.
Collaborator:

Wow this is wild... We should definitely investigate.
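A rough sketch of the dynamic-import workaround described in that comment; the module path and entry point below are assumptions based on the PR description, not the exact code.

import importlib
from typing import Any


def run_shadow_model_training(config: Any) -> None:
    # Defer the heavy import until this pipeline is actually invoked, so the two
    # offending modules are never imported together at the top of this file.
    shadow_model_training = importlib.import_module(
        "midst_toolkit.attacks.ensemble.rmia.shadow_model_training"
    )
    shadow_model_training.main(config)  # hypothetical entry point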

transformations: Transformations,
steps: int,
batch_size: int,
# model_type: ModelType,
Collaborator:

Delete this line.

# fine_tuning_params.d_in = input_dimension

# model = model_type.get_model(fine_tuning_params)
# model.to(device)
Collaborator:

Delete these commented lines.

schedule_sampler = ScheduleSamplerType.UNIFORM.create_named_schedule_sampler(num_timesteps)
key_value_logger = KeyValueLogger()
classifier.train()
for _step in range(classifier_steps):
Collaborator:

Remove the _ as it is normally used for private functions and variables, which is not the case here.

for _step in range(classifier_steps):
key_value_logger.save_entry("step", float(_step))
key_value_logger.save_entry("samples", float((_step + 1) * batch_size))
_numerical_forward_backward_log(
Collaborator:

I'll add a TODO on the train code to remove the leading _ from this one.

)

# Train the initial model if it is not already trained and saved.
if not (save_dir / f"initial_model_rmia_{init_model_id}.pkl").exists():
Collaborator:

3-strike rule: extract save_dir / f"initial_model_rmia_{init_model_id}.pkl" into a variable as it has been used 3 times in this code block.
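A sketch of the extraction, wrapped in a helper for illustration (the function name and return type are assumptions): compute the repeated path once and reuse it throughout the block.

import pickle
from pathlib import Path
from typing import Any


def load_initial_model_results(save_dir: Path, init_model_id: int) -> Any:
    initial_model_path = save_dir / f"initial_model_rmia_{init_model_id}.pkl"
    if not initial_model_path.exists():
        ...  # train the initial model and save it to initial_model_path
    with open(initial_model_path, "rb") as f:
        return pickle.load(f)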

with open(save_dir / f"initial_model_rmia_{init_model_id}.pkl", "rb") as f:
initial_model_training_results = pickle.load(f)

# assert initial_model_training_results.models[("", table_name)]["diffusion"] is not None
Collaborator:

Remove this commented line.

@fatemetkl merged commit 7186253 into main on Oct 15, 2025
5 checks passed
@fatemetkl deleted the ft/shadow_models branch on October 15, 2025 21:43