
Conversation

@fatemetkl commented Oct 2, 2025 (Collaborator)

PR Type

Feature

Short Description

Sorry for the big PR. I had initially intended not to include the synthesizer module, but I needed it to test the full shadow training pipeline. Therefore, it is added here almost unchanged from the original MIDSTModels codebase, except for some fixes to make it executable.

Clickup Ticket(s):
Shadow model training pipeline: https://app.clickup.com/t/868fp4tw3
Synthesizer module: https://app.clickup.com/t/868fp4vk7

  • Shadow model training pipeline of Ensemble Attack:
    --> ensemble/rmia/shadow_model_training.py:
    Contains the main function for running the ensemble attack’s shadow model training, as designed by the CITADEL & UQAM team.
    --> ensemble/shadow_model_utils.py:
    Provides helper functions for model training and data synthesis, as well as utilities for modifying configs for each shadow model set.
    --> ensemble/tabddpm_fine_tuning.py:
    Implements fine-tuning versions of the main training functions. Unlike their counterparts in models.clavaddpm, these functions start from pre-trained models. This module should eventually be unified with the toolkit’s training module in future PRs to reduce refactoring overhead.

  • Synthesizer module (synthesizer.py): This module contains all the code needed to synthesize data for the attack implementation. It is adapted from the original MIDSTModels codebase with minor modifications, including ruff and mypy fixes. Significant refactoring is still needed in future PRs to align it with project standards. Note: no dedicated tests were added for this module. Its functions are currently tested only indirectly through shadow model synthesis tests.

Tests Added

  • tests/unit/attacks/ensemble/test_shadow_model_training.py
  • tests/unit/attacks/ensemble/test_shadow_model_utils.py

@fatemetkl changed the title from "Ft/shadow models" to "shadow models training and synthesizer" on Oct 2, 2025
@fatemetkl marked this pull request as ready for review on October 3, 2025 16:41


# TODO: Too many statements and branches, refactor.
def sample_from_diffusion( # noqa: PLR0915, PLR0912
Collaborator:

I know that this is a copy from MIDST to get the pipeline running, but perhaps in the long run we should not require df as an input to find the numerical and categorical columns and just use df_info for that.

Collaborator Author:

I agree! Definitely needs a big refactor.

Collaborator:

I can take care of that as part of the other refactors I am doing with MIDSTModels code.

@lotif (Collaborator) left a comment:

Only had time to review half of this PR, will review the other half tomorrow. Lots of small comments but otherwise it looks good so far.

"clustering": {
"parent_scale": 1.0,
"num_clusters": 50,
"num_clusters": 1,
Collaborator:

Is this right? Just 1 cluster?

Collaborator Author:

This is just me trying to minimize the run time of the example 😂
Will fix these values.

"batch_size": 4096,
"lr": 0.0006,
"iterations": 2,
"batch_size": 1,
Collaborator:

These values also seem odd, as do the ones a few lines above and the classifier values below.

Collaborator Author:

I agree, these values aren’t realistic. I ran the attack example with them on a very small portion of the data just to make sure it works. That’s why iterations is set to 2 and batch_size to 1. I’ll keep some of the values as they are for now so the example can run quickly.
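As a rough illustration of keeping the checked-in config realistic while still shrinking an example run, one option is to override the smoke-test values at compose time (the tests already use Hydra's compose). The key paths below are assumptions for illustration only, not the PR's actual config layout.

from hydra import compose, initialize

# Hypothetical sketch: keep realistic defaults in shadow_training_config and
# shrink only the example/smoke run via overrides. The key paths
# (clustering.num_clusters, diffusion.iterations, diffusion.batch_size) are
# assumed and may not match the real config structure.
with initialize(version_base=None, config_path="configs"):
    config = compose(
        config_name="shadow_training_config",
        overrides=[
            "clustering.num_clusters=1",
            "diffusion.iterations=2",
            "diffusion.batch_size=1",
        ],
    )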

# Pipeline control
pipeline:
run_data_processing: true
run_data_processing: false # Set this to false if you have already saved the processed data
Collaborator:

With that comment, I would assume you wanted to keep this as true

pipeline:
run_data_processing: true
run_data_processing: false # Set this to false if you have already saved the processed data
run_shadow_model_training: True
Collaborator:

Let's keep this consistent and use lowercase true.



# TODO: This function and the next one can be unified later.
def train_fine_tuning_shadows(
Collaborator:

I think train_fine_tuned_shadow_models would be a better name.

will be created if it does not exist, and all the relevant configs will be copied here automatically.
training_json_config_paths: Configuration dictionary containing paths to the data JSON config files.
fine_tuning_config: Configuration dictionary containing shadow model fine-tuning specific information.
n_models_per_set: Number of shadow models to train by each approach, must be even, defaults to 4.
Collaborator:

The wording here could use a little bit of love:

Number of shadow models to train by each approach. Must be an even number. Defaults to 4.
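A minimal sketch of how the suggested rename and docstring wording could look together; the parameter list is abbreviated and the exact signature in the PR may differ.

def train_fine_tuned_shadow_models(
    training_json_config_paths: dict,
    fine_tuning_config: dict,
    n_models_per_set: int = 4,
) -> None:
    """
    Train the fine-tuned shadow model sets.

    Args:
        training_json_config_paths: Configuration dictionary containing paths to the data JSON config files.
        fine_tuning_config: Configuration dictionary containing shadow model fine-tuning specific information.
        n_models_per_set: Number of shadow models to train by each approach. Must be an even number. Defaults to 4.
    """
    ...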

will be saved. Model artifacts and synthetic data will be saved under this directory as well.
training_json_config_paths: Configuration dictionary containing paths to the data JSON config files.
fine_tuning_config: Configuration dictionary containing shadow model fine-tuning specific information.
init_model_id: Distinguishes the pre-trained initial models.
Collaborator:

I think a better description of this parameter would be "An ID to assign to the pre-trained initial models." The fact that it distinguishes models from others will depend on the value the caller assigns to it.

INFO,
f"Second set of shadow model training completed and saved at {second_set_result_path}.",
)
# Attack codebase comment: "The following eight models are trained from scratch on the challenge points,
Collaborator:

Same here.

INFO,
f"First set of shadow model training completed and saved at {first_set_result_path}",
)
# Attack codebase comment: "The following four models are trained in the same way, with a new initial training set
Collaborator:

I think this can say Original codebase comment instead.

) -> tuple[Path, Path, Path]:
"""
Runs the shadow model training pipeline of the ensemble attack. This pipeline consists of three sets of
shadow model training.
Collaborator:

I think "This pipeline trains three sets of shadow models." is a better wording for this.

@lotif (Collaborator) left a comment:

Second batch of comments. Still not done yet :)

# The following function is the slightly modified version of
# ``midst_toolkit.models.clavaddpm.data_loaders`` by the CITADEL & UQAM team.
def load_multi_table(
data_dir: Path, train_df: pd.DataFrame | None = None, verbose: bool = True
Collaborator:

What do you think about renaming train_df to train_data? I've been doing that in the training code but sometimes I second guess those decisions.

Collaborator Author:

I agree, this is definitely a better name!
I also merged the attack's load_multi_table() with the existing load_multi_table() in midst_toolkit.models.clavaddpm.data_loader, and it seems to work fine. Now we can optionally pass the table dataframes to load_multi_table(), which will make the upcoming refactors easier.
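For reference, a sketch of what the merged signature might look like after the train_df to train_data rename; the body and return value are omitted, and the optional parameter is the point being illustrated.

from pathlib import Path

import pandas as pd


def load_multi_table(data_dir: Path, train_data: pd.DataFrame | None = None, verbose: bool = True):
    """Load the multi-table dataset, optionally using a caller-provided training dataframe."""
    ...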

"domain": domain,
"children": meta["children"],
"parents": meta["parents"],
}
Collaborator:

Optional: I'm going to do this at one point on my copy of this function, but if you have the bandwidth it would be nice to have a Table dataclass to hold this information. This way we don't need to add explanations about the format of this dictionary in the docstrings.
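A minimal sketch of such a dataclass, with field names taken from the dictionary keys shown above and types that are assumptions:

from dataclasses import dataclass, field

import pandas as pd


@dataclass
class Table:
    df: pd.DataFrame
    domain: dict
    children: list = field(default_factory=list)
    parents: list = field(default_factory=list)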

id_cols = [col for col in tables[table]["df"].columns if "_id" in col]
df_no_id = tables[table]["df"].drop(columns=id_cols)
info = get_info_from_domain(df_no_id, tables[table]["domain"])
data, info = pipeline_process_data(
Collaborator:

Sorry, I just renamed this function in my refactor 😬

along with their corresponding labels, to the specified path under ``processed_attack_data_path``.
The size of the master challenge datasets (train and test) is half of the total population data size each,
as determined by the attack design.
Collaborator:

This might be a newbie question, but how can this attack design be determined? It's not one of the parameters of the function.

@fatemetkl (Collaborator Author) commented Oct 10, 2025:

The attack design is currently determined by the original codebase. I’ve added a comment.

Long answer:
I agree this is very annoying, but for now the attack implementation is heavily based on the original attack design with little flexibility. For example, the size of the master challenge datasets is currently set to half of the population. This could potentially be adjusted by modifying the proportion input variable in split_real_data, but doing so would require changes to the codebase.

The plan is to refactor the code in multiple stages to make it more flexible once the full pipeline is implemented.

processed_attack_data_path: Path,
column_to_stratify: str,
num_total_samples: int = 40000,
random_seed: int = 42,
Collaborator:

Same problem with the seed here. I think it should default to None and be passed in as a parameter when the user wants to ensure reproducibility.
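A minimal sketch of the pattern being suggested, assuming a helper of this shape (the name set_all_random_seeds is hypothetical): the seed defaults to None and randomness is only pinned when the caller provides one.

import random

import numpy as np
import torch


def set_all_random_seeds(random_seed: int | None = None) -> None:
    # Leave all generators in their default, non-deterministic state unless a seed is given.
    if random_seed is None:
        return
    random.seed(random_seed)
    np.random.seed(random_seed)
    torch.manual_seed(random_seed)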

from midst_toolkit.models.clavaddpm.train import clava_training


def config_tabddpm(
Collaborator:

I think this needs a better name. Maybe save_additional_tabddpm_config?

def config_tabddpm(
data_dir: Path,
training_json_path: Path,
final_json_path: Path,
Collaborator:

I think those two parameters need to say they are the paths for the configs: training_config_json_path, final_config_json_path. Maybe remove the json part for shorter names, but that's optional.
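Putting the two naming suggestions together, a sketch of the resulting signature (body omitted):

from pathlib import Path


def save_additional_tabddpm_config(
    data_dir: Path,
    training_config_json_path: Path,
    final_config_json_path: Path,
) -> None:
    ...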

"models": {},
"configs": configs,
"synth_data": {},
}
Collaborator:

This can definitely be turned into a dataclass; it will make this simpler. Name it TrainingResult, which makes more sense:

from dataclasses import dataclass, field

@dataclass
class TrainingResult:
    save_dir: Path
    configs: Configs
    tables: dict = field(default_factory=dict)
    relation_order: dict = field(default_factory=dict)
    all_group_lengths_probabilities: dict = field(default_factory=dict)
    models: dict = field(default_factory=dict)
    synthetic_data: dict = field(default_factory=dict)

result = TrainingResult(save_dir=save_dir, configs=configs)

Also, the types are too simple; please add type hints for those dicts if you know them.

Collaborator Author:

Awesome suggestion! Thank you!

This data class could be very useful for any example involving training and synthesis. We could likely move it out of the attack module and into a more general inference module in future refactors.

"new_models": {},
"configs": configs,
"synth_data": {},
}
Collaborator:

Same TrainingResult class can be used here. If there are additional attributes, they should be added to the class as optional fields.

@lotif (Collaborator) commented Oct 7, 2025:

Actually, this could be a FineTuningResult class that inherits from TrainingResult:

@dataclass
class FineTuningResult(TrainingResult):
    fine_tuned_models: dict = field(default_factory=dict)

Collaborator Author:

I think it should be sufficient to keep only the fine-tuned models; there’s no need to save both the pre-trained and fine-tuned versions simultaneously.


def fine_tune_tabddpm_and_synthesize(
trained_models: dict[tuple[str, str], dict[str, Any]],
new_train_set: pd.DataFrame,
Collaborator:

Could this be called fine_tune_set instead?

@lotif (Collaborator) left a comment:

Final comments :)

I didn't review the synthesizer.py file; I am planning to do a bigger refactor on it, similar to what I am doing with the other training files.

shadow_models_data_path = tmp_path
# Input
# Population data is used to pre-train some of the shadow models.
population_data = load_dataframe(Path("tests/unit/attacks/ensemble/assets/population_data"), "all_population.csv")
Collaborator:

Can you put this in a constant at the top of the file and use it here and on the method below?
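A sketch of the suggested constant; load_dataframe is the helper already used in the test module, so only the hoisted path is new here.

from pathlib import Path

# Module-level constant shared by both test functions.
POPULATION_DATA_DIR = Path("tests/unit/attacks/ensemble/assets/population_data")

population_data = load_dataframe(POPULATION_DATA_DIR, "all_population.csv")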

return compose(config_name="shadow_training_config")


def test_train_fine_tuning_shadows(cfg: DictConfig, tmp_path: Path) -> None:
Collaborator:

This should be marked with @pytest.mark.integration_test() since it looks like it's running the whole pipeline from the top with no mocks.

Collaborator:

Also, it should be moved to the integration tests folder.
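A sketch of what the marked test could look like, assuming integration_test is registered as a custom marker in the project's pytest configuration:

from pathlib import Path

import pytest
from omegaconf import DictConfig


@pytest.mark.integration_test()
def test_train_fine_tuning_shadows(cfg: DictConfig, tmp_path: Path) -> None:
    ...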

n_models=2,
n_reps=1,
population_data=population_data,
master_challenge_data=population_data[0:20], # For testing purposes only.
Collaborator:

A better comment here would be good, since this is already in a test file this comment does not add much information. Maybe # Limiting the data to 20 samples for faster test execution or something along those lines.

result_path = train_shadow_on_half_challenge_data(
n_models=2,
n_reps=1,
master_challenge_data=population_data[0:40], # For testing purposes only.
Collaborator:

Same comment issue here.

classifier_config: Configs | None,
fine_tuning_diffusion_iterations: int,
fine_tuning_classifier_iterations: int,
device: str = "cuda" if torch.cuda.is_available() else "cpu",
Collaborator:

Same thing for the device here.
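A minimal sketch of the alternative implied here (the helper name resolve_device is hypothetical): default the device to None in the signature and resolve it when the function runs, instead of evaluating torch.cuda.is_available() in a default argument at import time.

import torch


def resolve_device(device: str | None = None) -> str:
    # Fall back to CUDA when available only if the caller did not pin a device.
    if device is not None:
        return device
    return "cuda" if torch.cuda.is_available() else "cpu"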

"""
if parent_name is None:
y_col = "placeholder"
Collaborator:

Can we rename this to target_column? I am in the process of renaming all x to feature and y to target.

child_info,
child_model_params,
child_transformations,
fine_tuning_diffusion_iterations, # fine_tuning_diffusion_iterations used here.
Collaborator:

This comment does not add a lot of information...


def clava_fine_tuning(
trained_models: dict[tuple[str, str], dict[str, Any]],
new_tables: Tables,
Collaborator:

This can be called fine_tuning_tables.

Collaborator Author:

Should we do the renaming now? Currently, it aligns well with the corresponding training function name clava_training.
I think we can rename this whenever the main function's name is also changed. Ideally, they should be merged sooner than that if possible.

Collaborator:

Is this part of the new ticket you made? If yes, it's fine to keep it like this until you work on it.



# TODO: Too many statements and branches, refactor.
def sample_from_diffusion( # noqa: PLR0915, PLR0912
Collaborator:

I can take care of that as part of the other refactors I am doing with MIDSTModels code.

@fatemetkl requested a review from lotif on October 15, 2025 16:57
@lotif (Collaborator) left a comment:

Approved with minor comments. Thanks for addressing all the comments!!!

run_data_processing(config)
# Note: Importing the following two modules causes a segmentation fault error if imported together in this file.
# A quick solution is to load modules dynamically if any of the pipelines is called.
# TODO: Investigate the source of error.
Collaborator:

Wow this is wild... We should definitely investigate.
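A rough sketch of the dynamic-import workaround described in that comment; the module path and entry point below are assumptions based on the PR description, not the exact code.

import importlib
from typing import Any


def run_shadow_model_training(config: Any) -> None:
    # Defer the heavy import until this pipeline is actually invoked, so the two
    # offending modules are never imported together at the top of this file.
    shadow_model_training = importlib.import_module(
        "midst_toolkit.attacks.ensemble.rmia.shadow_model_training"
    )
    shadow_model_training.main(config)  # hypothetical entry point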

transformations: Transformations,
steps: int,
batch_size: int,
# model_type: ModelType,
Collaborator:

Delete this line.

# fine_tuning_params.d_in = input_dimension

# model = model_type.get_model(fine_tuning_params)
# model.to(device)
Collaborator:

Delete these commented lines.

schedule_sampler = ScheduleSamplerType.UNIFORM.create_named_schedule_sampler(num_timesteps)
key_value_logger = KeyValueLogger()
classifier.train()
for _step in range(classifier_steps):
Collaborator:

Remove the _ as it is normally used for private functions and variables, which is not the case here.

for _step in range(classifier_steps):
key_value_logger.save_entry("step", float(_step))
key_value_logger.save_entry("samples", float((_step + 1) * batch_size))
_numerical_forward_backward_log(
Collaborator:

I'll add a TODO on the train code to remove the leading _ from this one.

)

# Train the initial model if it is not already trained and saved.
if not (save_dir / f"initial_model_rmia_{init_model_id}.pkl").exists():
Collaborator:

3-strike rule: extract save_dir / f"initial_model_rmia_{init_model_id}.pkl" into a variable as it has been used 3 times in this code block.
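A sketch of the extraction, wrapped in a helper for illustration (the function name and return type are assumptions): compute the repeated path once and reuse it throughout the block.

import pickle
from pathlib import Path
from typing import Any


def load_initial_model_results(save_dir: Path, init_model_id: int) -> Any:
    initial_model_path = save_dir / f"initial_model_rmia_{init_model_id}.pkl"
    if not initial_model_path.exists():
        ...  # train the initial model and save it to initial_model_path
    with open(initial_model_path, "rb") as f:
        return pickle.load(f)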

with open(save_dir / f"initial_model_rmia_{init_model_id}.pkl", "rb") as f:
initial_model_training_results = pickle.load(f)

# assert initial_model_training_results.models[("", table_name)]["diffusion"] is not None
Collaborator:

Remove this commented line.

@fatemetkl merged commit 7186253 into main on Oct 15, 2025
5 checks passed
@fatemetkl deleted the ft/shadow_models branch on October 15, 2025 21:43