Update DNN for HistTuple Inputs #70
Conversation
…ds and batch configs
@kandrosov please review, I think I'm done with this PR. It successfully creates training datasets, models, and validation plots. Next step for me is to work on the AnalysisCache payload to run this new style of DNN, for which I'd like to merge this and clean my local branch.
```yaml
  - TWminustoLNu2Q
  - WW
  - WZ
```
Why is ZZ missing? What about single Higgs backgrounds?
Same questions for the resolved setup.
Since I'm focusing on the 'main backgrounds' right now, I considered these too small to include. This PR is mostly to formalize the code and share it with Artem, not to make the configuration perfect yet.
If you want a real review of how the training is done, that's okay, but my understanding was that we'd do a bare-minimum proof of concept and then, if it works, make it official.
Pull request overview
This pull request represents a major architectural change to the DNN training infrastructure, migrating from AnaTuple-based inputs to HistTuple-based inputs. The change simplifies the data preparation pipeline by eliminating complex batching logic and using ROOT's RDataFrame for filtering and processing. The new approach directly reads from HistTuple files that already contain high-level features and HME estimates, removing the need for friend files and batch configuration files.
Changes:
- Refactored dataset creation from ~1700 lines to ~250 lines using RDataFrame instead of custom batching
- Updated tasks to use the `fs_histograms` filesystem instead of `fs_anaTuple`
- Created a new `DNN_Class_HistTuples.py` module for training with HistTuple inputs
- Added new configuration files for resolved and boosted topologies
- Removed obsolete batch configuration parameters and HME friend file handling
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| Studies/DNN/tasks.py | Changed filesystem from fs_anaTuple to fs_histograms; removed batch_config and hme_friend_file parameters |
| Studies/DNN/create_dataset.py | Complete rewrite using RDataFrame; simplified from complex batching to direct filtering and merging |
| Studies/DNN/create_condor_configs.py | Updated to reference new training setup configs; removed batch_config and hme_friend_file references |
| Studies/DNN/config/training_setup_doubleLep_resolved.yaml | New training configuration for resolved topology with HME features |
| Studies/DNN/config/training_setup_doubleLep_boosted.yaml | New training configuration for boosted topology with fat jet features |
| Studies/DNN/config/dataset_setup_doubleLep_resolved.yaml | New dataset configuration for resolved topology with selection cuts |
| Studies/DNN/config/dataset_setup_doubleLep_boosted.yaml | New dataset configuration for boosted topology with selection cuts |
| Studies/DNN/config/step2_training_setup_doubleLep.yaml | Removed obsolete configuration file |
| Studies/DNN/config/default_training_setup_singleLep.yaml | Removed obsolete configuration file |
| Studies/DNN/config/default_training_setup_doubleLep.yaml | Removed obsolete configuration file |
| Studies/DNN/config/default_split_singleLep.yaml | Removed obsolete split configuration |
| Studies/DNN/config/default_split_doubleLep.yaml | Removed obsolete split configuration |
| Studies/DNN/config/dataset_split.yaml | Removed obsolete dataset split configuration |
| Studies/DNN/config/YH_split_singleLep.yaml | Removed obsolete YH split configuration |
| Studies/DNN/config/YH_split_doubleLep.yaml | Removed obsolete YH split configuration |
| Studies/DNN/TupleMaker.h | Removed obsolete C++ header for custom batching |
| Studies/DNN/DNNEntryQueue.h | Removed obsolete C++ header for thread-safe queue |
| Studies/DNN/DNN_Validator_Condor.py | Updated to use DNN_Class_HistTuples; removed batch_config and hme_friend_file parameters |
| Studies/DNN/DNN_Trainer_Condor.py | Updated to use DNN_Class_HistTuples; removed batch_config and hme_friend_file parameters |
| Studies/DNN/DNN_Class_HistTuples.py | New module implementing DataWrapper and Model classes for HistTuple-based training |
Studies/DNN/create_dataset.py
Outdated
```python
parity_cut = parity_cut.format(
    nParity=config_dict["nParity"], parity_scan=nParity
)
```
The variable name parity_cut is being reused and mutated inside the loop. On line 21, parity_cut is loaded from the config, then on lines 82 and 125 it's reassigned with a formatted string. This overwrites the original template string from the config, which would break if the loop were to iterate again.
Consider renaming the formatted version to parity_cut_formatted or similar to avoid confusion and potential bugs. The same pattern occurs in both signal and background loops.
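A minimal sketch of the suggested rename (the template string and loop bounds below are illustrative placeholders, not the actual config values):

```python
# Sketch of the suggested fix: keep the config's template string intact and
# bind the formatted cut to a new name inside the loop. The "parity_cut"
# expression here is an illustrative placeholder, not the real config value.
config_dict = {
    "parity_cut": "event % {nParity} != {parity_scan}",
    "nParity": 2,
}

parity_cut_template = config_dict["parity_cut"]  # loaded once, never mutated
formatted_cuts = []
for nParity in range(config_dict["nParity"]):
    parity_cut_formatted = parity_cut_template.format(
        nParity=config_dict["nParity"], parity_scan=nParity
    )
    formatted_cuts.append(parity_cut_formatted)
    # rdf.Filter(parity_cut_formatted) would go here

# The template survives every iteration, so each parity gets a correct cut.
```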
Studies/DNN/create_dataset.py
Outdated
```python
parity_cut = parity_cut.format(
    nParity=config_dict["nParity"], parity_scan=nParity
)
```
Same issue as lines 82-84: the variable parity_cut is reassigned inside the loop, overwriting the template string. This should use a differently named variable like parity_cut_formatted to preserve the original template for subsequent iterations.
Studies/DNN/create_dataset.py
Outdated
```python
if subprocess == config_dict["signal"][process]["combined_name"]:
    continue
process_dict[process]["total"] += process_dict[process][subprocess][
process_dict[nParity_string][signal_name][mass_point]["total"] += total
```
The total count is being added multiple times (once per nParity iteration) to the same mass_point entry. Since total is computed once outside the nParity loop but added inside it, the dictionary will accumulate total * nParity rather than just total.
If the intent is to track total events per mass_point regardless of parity, move this line outside the nParity loop. If the intent is to track per-parity totals, restructure the dictionary accordingly.
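The difference can be illustrated with a stripped-down sketch (`process_dict` and its keys are simplified stand-ins for the real structure):

```python
# Simplified stand-in for the real process_dict bookkeeping.
nParity = 2
total = 100  # computed once, outside the parity loop

# Buggy pattern: "total" is incremented once per parity iteration,
# so it ends up holding total * nParity.
buggy_entry = {"total": 0}
for parity in range(nParity):
    buggy_entry["total"] += total  # runs nParity times

# Suggested fix: accumulate once, outside the parity loop.
fixed_entry = {"total": 0}
fixed_entry["total"] += total
for parity in range(nParity):
    pass  # only per-parity work remains here
```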
```python
process_dict[nParity_string][background_name][dataset_name][
    "total"
] += total
process_dict[nParity_string][background_name][dataset_name][
```
Similar to the signal loop issue, the total count is being added multiple times (once per nParity iteration) to the same dataset_name entry. This will accumulate total * nParity instead of just total.
Move this line outside the nParity loop if the intent is to track total events per dataset regardless of parity.
Suggested change:
```diff
-process_dict[nParity_string][background_name][dataset_name][
-    "total"
-] += total
+if nParity == 0:
+    process_dict[nParity_string][background_name][dataset_name][
+        "total"
+    ] += total
 process_dict[nParity_string][background_name][dataset_name][
```
Studies/DNN/DNN_Class_HistTuples.py
Outdated
```python
    self, file_name, entry_start=None, entry_stop=None, hme_friend_file=None
):
    if self.feature_names == None:
        print("Uknown branches to read! DefineInputFeatures first!")
```
Spelling error in the error message: "Uknown" should be "Unknown".
Suggested change:
```diff
-print("Uknown branches to read! DefineInputFeatures first!")
+print("Unknown branches to read! DefineInputFeatures first!")
```
```python
output_file = os.path.join(output_nParity, f"{dataset_name}_merge.root")

current_index = 0
for process in process_dict.keys():
    for subprocess in process_dict[process].keys():
        if subprocess.startswith("total") or subprocess.startswith("weight"):
            continue
        process_dict[process][subprocess]["batch_start"] = current_index
        current_index += process_dict[process][subprocess]["batch_size"]
rdf_tmp = rdf.Filter(parity_cut)
cut = rdf_tmp.Count().GetValue()
weighted_cut = rdf_tmp.Sum("weight_Central").GetValue()

nBatches = 1e100
for process in process_dict.keys():
    for subprocess in process_dict[process].keys():
        if subprocess.startswith("total") or subprocess.startswith("weight"):
            continue
        if process_dict[process][subprocess]["nBatches"] < nBatches and (
            process_dict[process][subprocess]["nBatches"] != 0
        ):
            nBatches = process_dict[process][subprocess]["nBatches"]

print(f"Creating {nBatches} batches, according to distribution. ")
print(process_dict)
print(f"And total batch size is {batch_dict['batch_size']}")

machine_yaml = {
    "meta_data": {},
    "processes": [],
}

machine_yaml["meta_data"]["storage_folder"] = storage_folder
machine_yaml["meta_data"]["batch_dict"] = batch_dict
machine_yaml["meta_data"]["selection_branches"] = selection_branches
machine_yaml["meta_data"]["selection_cut"] = total_cut
machine_yaml["meta_data"]["iterate_cut"] = config_dict["iterate_cut"].format(
    nParity=config_dict["nParity"], parity_scan=nParity
)
machine_yaml["meta_data"][
    "empty_dict_example"
] = empty_dict_example_file  # Example for empty dict structure
machine_yaml["meta_data"]["input_filename"] = f"batchfile{nParity}.root"
machine_yaml["meta_data"][
    "hme_friend_filename"
] = f"batchfile{nParity}_HME_Friend.root"
machine_yaml["meta_data"][
    "output_DNNname"
] = f"ResHH_Classifier_parity{nParity}"

machine_yaml["meta_data"][
    "spin_mass_dist"
] = spin_mass_dist  # Dict of spin/mass distribution values for random choice parametric

for process in process_dict:
    for subprocess in process_dict[process]:
        if subprocess.startswith("total") or subprocess.startswith("weight"):
            continue
        print("Using subprocess ", subprocess)
        subprocess_dict = process_dict[process][subprocess]
        datasets_full_pathway = [
            os.path.join(subprocess_dict["storage_folder"], fn)
            for fn in subprocess_dict["all_extensions"]
        ]
        tmp_process_dict = {
            "datasets": datasets_full_pathway,
            "class_value": subprocess_dict["class_value"],
            "batch_start": subprocess_dict["batch_start"],
            "batch_size": subprocess_dict["batch_size"],
            "nBatches": subprocess_dict["nBatches"],
        }
        machine_yaml["processes"].append(tmp_process_dict)
process_dict[nParity_string][background_name][dataset_name][
    "total"
] += total
process_dict[nParity_string][background_name][dataset_name][
    "total_cut"
] += cut
process_dict[nParity_string][background_name][dataset_name][
    "total_cut_weighted"
] += weighted_cut
rdf_tmp = rdf_tmp.Define("class_value", f"{class_value}")
rdf_tmp = rdf_tmp.Define("X_mass", f"{X_mass}")
rdf_tmp.Snapshot(treeName, output_file)
```
The same file overwriting issue exists in the background loop. Each dataset_name writes to the same output_file path for a given nParity, which will overwrite previous files.
Additionally, line 157 uses hadd to merge all .root files in the nParity{n}_Merged directory, but if files are being overwritten as noted, this won't produce the expected merged result. The logic needs to ensure each dataset writes to a unique file name (e.g., including dataset_name in the filename) before the hadd step.
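A minimal sketch of that suggestion, with illustrative folder and dataset names (`unique_snapshot_path` is a hypothetical helper, not a function in the PR):

```python
import os

# Hypothetical helper: one snapshot path per (parity, dataset) pair so no
# two datasets write to the same file before the hadd step. Folder and
# dataset names below are illustrative assumptions.
def unique_snapshot_path(output_folder, nParity, dataset_name):
    merged_dir = os.path.join(output_folder, f"nParity{nParity}_Merged")
    return os.path.join(merged_dir, f"{dataset_name}_merge.root")

# Each dataset gets a distinct output file, so hadd later sees all of them.
paths = [unique_snapshot_path("datasets", 0, d) for d in ("DY", "TT", "WW")]
```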
Studies/DNN/create_dataset.py
Outdated
```python
for nParity in range(config_dict["nParity"]):
    nParity_string = f"nParity_{nParity}"
    out_yaml = f"dataset_distritubution_parity{nParity}.yaml"
```
Spelling error in the output filename: "distritubution" should be "distribution".
Suggested change:
```diff
-out_yaml = f"dataset_distritubution_parity{nParity}.yaml"
+out_yaml = f"dataset_distribution_parity{nParity}.yaml"
```
Studies/DNN/create_dataset.py
Outdated
```python
# hadd the files together to make a final merged.root
hadd_out = os.path.join(output_folder, f"nParity{nParity}_Merged.root")
hadd_in = os.path.join(output_folder, f"nParity{nParity}_Merged/*.root")
os.system(f"hadd {hadd_out} {hadd_in}")
```
Using os.system() for critical operations like hadd is risky because it doesn't check for errors. If the hadd command fails (e.g., due to missing files, disk space issues, or command not found), the script will continue silently.
Consider using subprocess.run() with check=True to ensure the command succeeds, or at least check the return code from os.system() and handle failures appropriately.
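A possible sketch of that fix, with illustrative paths; note that `subprocess.run()` with an argument list performs no shell globbing, so the wildcard must be expanded with `glob` first:

```python
import glob
import os
import subprocess

def hadd_merge(output_folder, nParity):
    """Merge per-dataset ROOT files with hadd, failing loudly on any error."""
    hadd_out = os.path.join(output_folder, f"nParity{nParity}_Merged.root")
    # subprocess.run with a list does no shell globbing, so expand it here.
    hadd_in = sorted(
        glob.glob(os.path.join(output_folder, f"nParity{nParity}_Merged", "*.root"))
    )
    if not hadd_in:
        raise FileNotFoundError(f"no inputs to merge for parity {nParity}")
    # check=True raises CalledProcessError if hadd exits non-zero.
    subprocess.run(["hadd", hadd_out, *hadd_in], check=True)
```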
Studies/DNN/DNN_Class_HistTuples.py
Outdated
```python
def ReadFile(
    self, file_name, entry_start=None, entry_stop=None, hme_friend_file=None
):
```
The parameter hme_friend_file is defined but never used in the function body. This appears to be leftover from the old AnaTuple-based implementation. Since the PR is migrating to HistTuple inputs where HME data is embedded in the main file, this parameter should be removed for clarity.
Suggested change:
```diff
-def ReadFile(
-    self, file_name, entry_start=None, entry_stop=None, hme_friend_file=None
-):
+def ReadFile(self, file_name, entry_start=None, entry_stop=None):
```
Studies/DNN/DNN_Class_HistTuples.py
Outdated
```python
print(f"Reading file {file_name}")

features_to_load = self.feature_names
features_to_load = features_to_load + self.feature_names
```
Line 95 duplicates self.feature_names by concatenating it with itself, resulting in features_to_load containing each feature twice. This will cause issues when loading branches from the ROOT file.
This line should likely be removed, keeping only line 94 which initializes features_to_load with self.feature_names.
Suggested change:
```diff
-features_to_load = features_to_load + self.feature_names
```
No description provided.