shadow models training and synthesizer #51
Conversation
# TODO: Too many statements and branches, refactor.
def sample_from_diffusion(  # noqa: PLR0915, PLR0912
I know that this is a copy from MIDST to get the pipeline running, but perhaps in the long run we should not require df as an input to find the numerical and categorical columns and just use df_info for that.
I agree! Definitely needs a big refactor.
I can take care of that as part of the other refactors I am doing with MIDSTModels code.
Only had time to review half of this PR, will review the other half tomorrow. Lots of small comments but otherwise it looks good so far.
"clustering": { | ||
"parent_scale": 1.0, | ||
"num_clusters": 50, | ||
"num_clusters": 1, |
Is this right? Just 1 cluster?
This is just me trying to minimize the run time of the example 😂
Will fix these values.
"batch_size": 4096, | ||
"lr": 0.0006, | ||
"iterations": 2, | ||
"batch_size": 1, |
These values, along with the ones a few lines above and the classifier values below, seem odd.
I agree, these values aren’t realistic. I ran the attack example with them on a very small portion of the data just to make sure it works. That’s why iterations is set to 2 and batch_size to 1. I’ll keep some of the values as they are for now so the example can run quickly.
# Pipeline control
pipeline:
  run_data_processing: true
  run_data_processing: false  # Set this to false if you have already saved the processed data
With that comment, I would assume you wanted to keep this as true
examples/ensemble_attack/config.yaml
pipeline:
  run_data_processing: true
  run_data_processing: false  # Set this to false if you have already saved the processed data
  run_shadow_model_training: True
Let's keep the consistency here and use true, lowercase.
# TODO: This function and the next one can be unified later.
def train_fine_tuning_shadows(
I think train_fine_tuned_shadow_models would be a better name.
will be created if it does not exist, and all the relevant configs will be copied here automatically.
training_json_config_paths: Configuration dictionary containing paths to the data JSON config files.
fine_tuning_config: Configuration dictionary containing shadow model fine-tuning specific information.
n_models_per_set: Number of shadow models to train by each approach, must be even, defaults to 4.
The wording here could use a little bit of love:
Number of shadow models to train by each approach. Must be an even number. Defaults to 4.
will be saved. Model artifacts and synthetic data will be saved under this directory as well.
training_json_config_paths: Configuration dictionary containing paths to the data JSON config files.
fine_tuning_config: Configuration dictionary containing shadow model fine-tuning specific information.
init_model_id: Distinguishes the pre-trained initial models.
I think a better description of this parameter would be "An ID to assign to the pre-trained initial models". Whether it actually distinguishes models from others will depend on the value the caller assigns to it.
INFO,
f"Second set of shadow model training completed and saved at {second_set_result_path}.",
)
# Attack codebase comment: "The following eight models are trained from scratch on the challenge points,
Same here.
INFO,
f"First set of shadow model training completed and saved at {first_set_result_path}",
)
# Attack codebase comment: "The following four models are trained in the same way, with a new initial training set
I think this can say "Original codebase comment" instead.
) -> tuple[Path, Path, Path]:
    """
    Runs the shadow model training pipeline of the ensemble attack. This pipeline consists of three sets of
    shadow model training.
I think "This pipeline trains three sets of shadow models." is a better wording for this.
Second batch of comments. Still not done yet :)
# The following function is the slightly modified version of
# ``midst_toolkit.models.clavaddpm.data_loaders`` by the CITADEL & UQAM team.
def load_multi_table(
    data_dir: Path, train_df: pd.DataFrame | None = None, verbose: bool = True
What do you think about renaming train_df to train_data? I've been doing that in the training code but sometimes I second guess those decisions.
I agree, this is definitely a better name!
I also merged the attack's load_multi_table() with the existing load_multi_table() in midst_toolkit.models.clavaddpm.data_loader, and it seems to work fine. Now, we can optionally send the table dataframes to load_multi_table(). This will make the following refactors easier.
"domain": domain, | ||
"children": meta["children"], | ||
"parents": meta["parents"], | ||
} |
Optional: I'm going to do this at some point on my copy of this function, but if you have the bandwidth it would be nice to have a Table dataclass to hold this information. That way we don't need to explain the format of this dictionary in the docstrings.
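A minimal sketch of what such a dataclass could look like, assuming the keys currently stored in the dictionary (df, domain, children, parents); the exact value types here are guesses rather than a final design:

```python
from dataclasses import dataclass, field

import pandas as pd


@dataclass
class Table:
    # Field names mirror the dict keys used around load_multi_table;
    # the types are assumptions, not the repo's actual annotations.
    df: pd.DataFrame
    domain: dict
    children: list[str] = field(default_factory=list)
    parents: list[str] = field(default_factory=list)
```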
id_cols = [col for col in tables[table]["df"].columns if "_id" in col]
df_no_id = tables[table]["df"].drop(columns=id_cols)
info = get_info_from_domain(df_no_id, tables[table]["domain"])
data, info = pipeline_process_data(
Sorry, I just renamed this function in my refactor 😬
along with their corresponding labels, to the specified path under ``processed_attack_data_path``.
The size of the master challenge datasets (train and test) is half of the total population data size each,
as determined by the attack design.
This might be a newbie question, but how can this attack design be determined? It's not one of the parameters of the function.
The attack design is currently determined by the original codebase. I’ve added a comment.
Long answer: I agree this is very annoying, but for now the attack implementation is heavily based on the original attack design, with little flexibility. For example, the size of the master challenge datasets is currently set to half of the population. This could potentially be adjusted by modifying the proportion input variable in split_real_data, but doing so would require changes to the codebase.
The plan is to refactor the code in multiple stages to make it more flexible once the full pipeline is implemented.
processed_attack_data_path: Path,
column_to_stratify: str,
num_total_samples: int = 40000,
random_seed: int = 42,
Same problem with the seed here. I think it has to be None and passed in as a parameter in case the user wants to ensure reproducibility.
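For illustration, one way the seeding could be made opt-in; the function name below is a placeholder, not the actual signature in the PR:

```python
import random

import numpy as np


def prepare_master_challenge_data(num_total_samples: int = 40000, random_seed: int | None = None) -> None:
    # Seed only when the caller explicitly asks for reproducibility.
    if random_seed is not None:
        random.seed(random_seed)
        np.random.seed(random_seed)
    ...
```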
from midst_toolkit.models.clavaddpm.train import clava_training

def config_tabddpm(
I think this needs a better name. Maybe save_additional_tabddpm_config?
def config_tabddpm(
    data_dir: Path,
    training_json_path: Path,
    final_json_path: Path,
I think those two parameters need to say they are the paths for the configs: training_config_json_path, final_config_json_path. Maybe remove the json part for shorter names, but that's optional.
"models": {}, | ||
"configs": configs, | ||
"synth_data": {}, | ||
} |
This can definitely be turned into a dataclass; it will make this simpler. Name it TrainingResult, which makes more sense:
from dataclasses import dataclass, field

@dataclass
class TrainingResult:
    save_dir: Path
    configs: Configs
    tables: dict = field(default_factory=dict)
    relation_order: dict = field(default_factory=dict)
    all_group_lengths_probabilities: dict = field(default_factory=dict)
    models: dict = field(default_factory=dict)
    synthetic_data: dict = field(default_factory=dict)

result = TrainingResult(save_dir=save_dir, configs=configs)
Also, the types are too simple; please add type hints for those dicts if you know them.
Awesome suggestion! Thank you!
This data class could be very useful for any example involving training and synthesis. We could likely move it out of the attack module and into a more general inference module in future refactors.
"new_models": {}, | ||
"configs": configs, | ||
"synth_data": {}, | ||
} |
Same TrainingResult class can be used here. If there are additional attributes, they should be added to the class as optional fields.
Actually, this could be a FineTuningResult class that inherits from TrainingResult:
@dataclass
class FineTuningResult(TrainingResult):
    # Mutable default needs default_factory (field imported from dataclasses, as above).
    fine_tuned_models: dict = field(default_factory=dict)
I think it should be sufficient to keep only the fine-tuned models; there’s no need to save both the pre-trained and fine-tuned versions simultaneously.
def fine_tune_tabddpm_and_synthesize(
    trained_models: dict[tuple[str, str], dict[str, Any]],
    new_train_set: pd.DataFrame,
Could this be called fine_tune_set instead?
Final comments :)
I didn't review the synthesizer.py file; I am planning to do a bigger refactor on it similar to what I am doing with the other training files.
shadow_models_data_path = tmp_path
# Input
# Population data is used to pre-train some of the shadow models.
population_data = load_dataframe(Path("tests/unit/attacks/ensemble/assets/population_data"), "all_population.csv")
Can you put this in a constant at the top of the file and use it here and on the method below?
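Something along these lines, with the constant name being just a suggestion; load_dataframe is the helper already imported by the test module:

```python
from pathlib import Path

# Suggested module-level constant at the top of the test file.
POPULATION_DATA_DIR = Path("tests/unit/attacks/ensemble/assets/population_data")

# Reused in each test that needs the population data.
population_data = load_dataframe(POPULATION_DATA_DIR, "all_population.csv")
```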
return compose(config_name="shadow_training_config")

def test_train_fine_tuning_shadows(cfg: DictConfig, tmp_path: Path) -> None:
This should be marked with @pytest.mark.integration_test() since it looks like it's running the whole pipeline from the top with no mocks.
Also, it should be moved to the integration tests folder.
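For reference, applying the marker would look like the sketch below; this assumes the integration_test marker is registered in the project's pytest configuration, which I haven't verified here:

```python
from pathlib import Path

import pytest
from omegaconf import DictConfig


@pytest.mark.integration_test()
def test_train_fine_tuning_shadows(cfg: DictConfig, tmp_path: Path) -> None:
    ...
```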
n_models=2,
n_reps=1,
population_data=population_data,
master_challenge_data=population_data[0:20],  # For testing purposes only.
A better comment would be good here; since this is already a test file, this comment does not add much information. Maybe # Limiting the data to 20 samples for faster test execution or something along those lines.
result_path = train_shadow_on_half_challenge_data(
n_models=2,
n_reps=1,
master_challenge_data=population_data[0:40],  # For testing purposes only.
Same comment issue here.
classifier_config: Configs | None,
fine_tuning_diffusion_iterations: int,
fine_tuning_classifier_iterations: int,
device: str = "cuda" if torch.cuda.is_available() else "cpu",
Same thing for the device here.
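The earlier comment isn't shown in this excerpt, but presumably the point is the same as with the seed: a default like "cuda" if torch.cuda.is_available() else "cpu" in the signature is evaluated once at import time. A rough sketch of the alternative, with an illustrative function name rather than the one in the PR:

```python
import torch


def fine_tune_shadow_models(fine_tuning_diffusion_iterations: int, device: str | None = None) -> None:
    # Resolve the device at call time instead of baking the check into the default argument.
    resolved_device = device if device is not None else ("cuda" if torch.cuda.is_available() else "cpu")
    model_device = torch.device(resolved_device)
    ...
```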
""" | ||
if parent_name is None: | ||
y_col = "placeholder" |
Can we rename this to target_column? I am in the process of renaming all x to feature and y to target.
child_info,
child_model_params,
child_transformations,
fine_tuning_diffusion_iterations,  # fine_tuning_diffusion_iterations used here.
This comment does not add a lot of information...
def clava_fine_tuning(
    trained_models: dict[tuple[str, str], dict[str, Any]],
    new_tables: Tables,
This can be called fine_tuning_tables.
Should we do the renaming now? Currently, it aligns well with the corresponding training function name clava_training.
I think we can rename this whenever the main function's name is also changed. Ideally, they should be merged sooner than that if possible.
Is this part of the new ticket you made? If yes, it's fine to keep it like this until you work on it.
…rts, and fixed a unit test
Approved with minor comments. Thanks for addressing all the comments!!!
run_data_processing(config)
# Note: Importing the following two modules causes a segmentation fault error if imported together in this file.
# A quick solution is to load modules dynamically if any of the pipelines is called.
# TODO: Investigate the source of error.
Wow this is wild... We should definitely investigate.
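For context, the dynamic-import workaround described in that note usually looks something like the sketch below; the module path and function name here are made up for illustration and are not the ones in the example script:

```python
import importlib


def run_pipeline(config, run_shadow_model_training: bool) -> None:
    # Import lazily so the two modules that segfault when imported together
    # are never both loaded at module import time.
    if run_shadow_model_training:
        shadow_training = importlib.import_module("examples.ensemble_attack.shadow_model_training")
        shadow_training.run_shadow_model_training(config)
```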
transformations: Transformations,
steps: int,
batch_size: int,
# model_type: ModelType,
Delete this line.
# fine_tuning_params.d_in = input_dimension

# model = model_type.get_model(fine_tuning_params)
# model.to(device)
Delete these commented lines.
schedule_sampler = ScheduleSamplerType.UNIFORM.create_named_schedule_sampler(num_timesteps)
key_value_logger = KeyValueLogger()
classifier.train()
for _step in range(classifier_steps):
Remove the _ as it is normally used for private functions and variables, which is not the case here.
for _step in range(classifier_steps):
    key_value_logger.save_entry("step", float(_step))
    key_value_logger.save_entry("samples", float((_step + 1) * batch_size))
    _numerical_forward_backward_log(
I'll add a TODO on the train code to remove the leading _ from this one.
)

# Train the initial model if it is not already trained and saved.
if not (save_dir / f"initial_model_rmia_{init_model_id}.pkl").exists():
3-strike rule: extract save_dir / f"initial_model_rmia_{init_model_id}.pkl" into a variable, as it has been used 3 times in this code block.
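Something like this, with the variable name being just a suggestion; save_dir, init_model_id, and pickle are reused from the surrounding code:

```python
initial_model_path = save_dir / f"initial_model_rmia_{init_model_id}.pkl"

# Train the initial model if it is not already trained and saved.
if not initial_model_path.exists():
    ...  # training of the initial model, unchanged from the current code

with open(initial_model_path, "rb") as f:
    initial_model_training_results = pickle.load(f)
```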
with open(save_dir / f"initial_model_rmia_{init_model_id}.pkl", "rb") as f:
    initial_model_training_results = pickle.load(f)

# assert initial_model_training_results.models[("", table_name)]["diffusion"] is not None
Remove this commented line.
PR Type
Feature
Short Description
Sorry for the big PR. I had initially intended not to include the synthesizer module, but I needed it to test the full shadow training pipeline. Therefore, it is added here almost unchanged from the original MIDSTModels codebase, except for some fixes to make it executable.
Clickup Ticket(s):
Shadow model training pipeline: https://app.clickup.com/t/868fp4tw3
Synthesizer module: https://app.clickup.com/t/868fp4vk7
Shadow model training pipeline of Ensemble Attack:
- ensemble/rmia/shadow_model_training.py: Contains the main function for running the ensemble attack's shadow model training, as designed by the CITADEL & UQAM team.
- ensemble/shadow_model_utils.py: Provides helper functions for model training and data synthesis, as well as utilities for modifying configs for each shadow model set.
- ensemble/tabddpm_fine_tuning.py: Implements fine-tuning versions of the main training functions. Unlike their counterparts in models.clavaddpm, these functions start from pre-trained models. This module should eventually be unified with the toolkit's training module in future PRs to reduce refactoring overhead.

Synthesizer module (synthesizer.py): This module contains all the code needed to synthesize data for the attack implementation. It is adapted from the original MIDSTModels codebase with minor modifications, including ruff and mypy fixes. Significant refactoring is still needed in future PRs to align it with project standards. Note: no dedicated tests were added for this module. Its functions are currently tested only indirectly through shadow model synthesis tests.

Tests Added