
Update DNN for HistTuple Inputs#70

Merged
kandrosov merged 16 commits into cms-flaf:main from aebid:DNN_HistTuple_Training_Support
Feb 27, 2026

Conversation

@aebid (Contributor) commented Feb 25, 2026

No description provided.

@aebid (Contributor, Author) commented Feb 26, 2026

@kandrosov please review, I think I'm done with this PR. It successfully creates training datasets, models, and validation plots

Next step for me is to work on the AnalysisCache payload to run this new style of DNN, for which I'd like to merge this and clean my local branch.

- TWminustoLNu2Q

- WW
- WZ
Contributor:

Why is ZZ missing? What about single Higgs backgrounds?
Same questions for the resolved setup.

@aebid (Contributor, Author):

Since I'm focusing on the main backgrounds right now, I considered these to be too small to bother with. This PR is mostly to formalize the code and share it with Artem, not to make the configuration perfect yet.

If you want to do a real review of how the training is done, that's okay, but I was under the impression that we'd do a bare-minimum proof of concept first and then, if it works, make it official.

Copilot AI left a comment

Pull request overview

This pull request represents a major architectural change to the DNN training infrastructure, migrating from AnaTuple-based inputs to HistTuple-based inputs. The change simplifies the data preparation pipeline by eliminating complex batching logic and using ROOT's RDataFrame for filtering and processing. The new approach directly reads from HistTuple files that already contain high-level features and HME estimates, removing the need for friend files and batch configuration files.

Changes:

  • Refactored dataset creation from ~1700 lines to ~250 lines using RDataFrame instead of custom batching
  • Updated tasks to use fs_histograms filesystem instead of fs_anaTuple
  • Created new DNN_Class_HistTuples.py module for training with HistTuple inputs
  • Added new configuration files for resolved and boosted topologies
  • Removed obsolete batch configuration parameters and HME friend file handling

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 13 comments.

Summary per file:

  • Studies/DNN/tasks.py: Changed filesystem from fs_anaTuple to fs_histograms; removed batch_config and hme_friend_file parameters
  • Studies/DNN/create_dataset.py: Complete rewrite using RDataFrame; simplified from complex batching to direct filtering and merging
  • Studies/DNN/create_condor_configs.py: Updated to reference new training setup configs; removed batch_config and hme_friend_file references
  • Studies/DNN/config/training_setup_doubleLep_resolved.yaml: New training configuration for resolved topology with HME features
  • Studies/DNN/config/training_setup_doubleLep_boosted.yaml: New training configuration for boosted topology with fat jet features
  • Studies/DNN/config/dataset_setup_doubleLep_resolved.yaml: New dataset configuration for resolved topology with selection cuts
  • Studies/DNN/config/dataset_setup_doubleLep_boosted.yaml: New dataset configuration for boosted topology with selection cuts
  • Studies/DNN/config/step2_training_setup_doubleLep.yaml: Removed obsolete configuration file
  • Studies/DNN/config/default_training_setup_singleLep.yaml: Removed obsolete configuration file
  • Studies/DNN/config/default_training_setup_doubleLep.yaml: Removed obsolete configuration file
  • Studies/DNN/config/default_split_singleLep.yaml: Removed obsolete split configuration
  • Studies/DNN/config/default_split_doubleLep.yaml: Removed obsolete split configuration
  • Studies/DNN/config/dataset_split.yaml: Removed obsolete dataset split configuration
  • Studies/DNN/config/YH_split_singleLep.yaml: Removed obsolete YH split configuration
  • Studies/DNN/config/YH_split_doubleLep.yaml: Removed obsolete YH split configuration
  • Studies/DNN/TupleMaker.h: Removed obsolete C++ header for custom batching
  • Studies/DNN/DNNEntryQueue.h: Removed obsolete C++ header for thread-safe queue
  • Studies/DNN/DNN_Validator_Condor.py: Updated to use DNN_Class_HistTuples; removed batch_config and hme_friend_file parameters
  • Studies/DNN/DNN_Trainer_Condor.py: Updated to use DNN_Class_HistTuples; removed batch_config and hme_friend_file parameters
  • Studies/DNN/DNN_Class_HistTuples.py: New module implementing DataWrapper and Model classes for HistTuple-based training

Comment on lines 82 to 84
parity_cut = parity_cut.format(
    nParity=config_dict["nParity"], parity_scan=nParity
)
Copilot AI Feb 27, 2026

The variable name parity_cut is being reused and mutated inside the loop. On line 21, parity_cut is loaded from the config, then on lines 82 and 125 it's reassigned with a formatted string. This overwrites the original template string from the config, which would break if the loop were to iterate again.

Consider renaming the formatted version to parity_cut_formatted or similar to avoid confusion and potential bugs. The same pattern occurs in both signal and background loops.
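A minimal sketch of the suggested rename (the config keys and the cut template below are hypothetical stand-ins, not taken from the PR):

```python
# Hypothetical config; the template string must survive every loop iteration.
config_dict = {
    "nParity": 4,
    "parity_cut": "entry_index % {nParity} != {parity_scan}",
}

parity_cut_template = config_dict["parity_cut"]  # loaded once, never mutated

formatted_cuts = []
for nParity in range(config_dict["nParity"]):
    # Bind the formatted result to a new name so later iterations
    # still see the untouched template.
    parity_cut_formatted = parity_cut_template.format(
        nParity=config_dict["nParity"], parity_scan=nParity
    )
    formatted_cuts.append(parity_cut_formatted)
```

With the original pattern, the second iteration would call `.format()` on an already-formatted string that no longer contains the placeholders.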

Comment on lines 125 to 127
parity_cut = parity_cut.format(
    nParity=config_dict["nParity"], parity_scan=nParity
)
Copilot AI Feb 27, 2026

Same issue as lines 82-84: the variable parity_cut is reassigned inside the loop, overwriting the template string. This should use a differently named variable like parity_cut_formatted to preserve the original template for subsequent iterations.

if subprocess == config_dict["signal"][process]["combined_name"]:
    continue
process_dict[process]["total"] += process_dict[process][subprocess][
process_dict[nParity_string][signal_name][mass_point]["total"] += total
Copilot AI Feb 27, 2026

The total count is being added multiple times (once per nParity iteration) to the same mass_point entry. Since total is computed once outside the nParity loop but added inside it, the dictionary will accumulate total * nParity rather than just total.

If the intent is to track total events per mass_point regardless of parity, move this line outside the nParity loop. If the intent is to track per-parity totals, restructure the dictionary accordingly.
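A toy reproduction of the double counting (all numbers hypothetical, not from the real pipeline):

```python
# `total` is computed once per dataset, but the "+=" sits inside the
# parity loop, so it gets added nParity times.
nParity = 4
total = 10  # computed once, outside the parity loop

buggy = {"total": 0}
fixed = {"total": 0}
for parity in range(nParity):
    buggy["total"] += total  # runs every iteration: accumulates total * nParity
    if parity == 0:          # one possible guard: add each total only once
        fixed["total"] += total
```

The `buggy` counter ends at 40 instead of 10; moving the accumulation outside the loop (or guarding it as above) restores the intended count.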

Comment on lines +135 to +138
process_dict[nParity_string][background_name][dataset_name][
    "total"
] += total
process_dict[nParity_string][background_name][dataset_name][
Copilot AI Feb 27, 2026

Similar to the signal loop issue, the total count is being added multiple times (once per nParity iteration) to the same dataset_name entry. This will accumulate total * nParity instead of just total.

Move this line outside the nParity loop if the intent is to track total events per dataset regardless of parity.

Suggested change

-process_dict[nParity_string][background_name][dataset_name][
-    "total"
-] += total
-process_dict[nParity_string][background_name][dataset_name][
+if nParity == 0:
+    process_dict[nParity_string][background_name][dataset_name][
+        "total"
+    ] += total
+    process_dict[nParity_string][background_name][dataset_name][

    self, file_name, entry_start=None, entry_stop=None, hme_friend_file=None
):
    if self.feature_names == None:
        print("Uknown branches to read! DefineInputFeatures first!")
Copilot AI Feb 27, 2026

Spelling error in the error message: "Uknown" should be "Unknown".

Suggested change

-print("Uknown branches to read! DefineInputFeatures first!")
+print("Unknown branches to read! DefineInputFeatures first!")

Comment on lines +129 to +146
output_file = os.path.join(output_nParity, f"{dataset_name}_merge.root")

current_index = 0
for process in process_dict.keys():
    for subprocess in process_dict[process].keys():
        if subprocess.startswith("total") or subprocess.startswith("weight"):
            continue
        process_dict[process][subprocess]["batch_start"] = current_index
        current_index += process_dict[process][subprocess]["batch_size"]
rdf_tmp = rdf.Filter(parity_cut)
cut = rdf_tmp.Count().GetValue()
weighted_cut = rdf_tmp.Sum("weight_Central").GetValue()

nBatches = 1e100
for process in process_dict.keys():
    for subprocess in process_dict[process].keys():
        if subprocess.startswith("total") or subprocess.startswith("weight"):
            continue
        if process_dict[process][subprocess]["nBatches"] < nBatches and (
            process_dict[process][subprocess]["nBatches"] != 0
        ):
            nBatches = process_dict[process][subprocess]["nBatches"]

print(f"Creating {nBatches} batches, according to distribution. ")
print(process_dict)
print(f"And total batch size is {batch_dict['batch_size']}")

machine_yaml = {
    "meta_data": {},
    "processes": [],
}

machine_yaml["meta_data"]["storage_folder"] = storage_folder
machine_yaml["meta_data"]["batch_dict"] = batch_dict
machine_yaml["meta_data"]["selection_branches"] = selection_branches
machine_yaml["meta_data"]["selection_cut"] = total_cut
machine_yaml["meta_data"]["iterate_cut"] = config_dict["iterate_cut"].format(
    nParity=config_dict["nParity"], parity_scan=nParity
)
machine_yaml["meta_data"][
    "empty_dict_example"
] = empty_dict_example_file  # Example for empty dict structure
machine_yaml["meta_data"]["input_filename"] = f"batchfile{nParity}.root"
machine_yaml["meta_data"][
    "hme_friend_filename"
] = f"batchfile{nParity}_HME_Friend.root"
machine_yaml["meta_data"][
    "output_DNNname"
] = f"ResHH_Classifier_parity{nParity}"

machine_yaml["meta_data"][
    "spin_mass_dist"
] = spin_mass_dist  # Dict of spin/mass distribution values for random choice parametric

for process in process_dict:
    for subprocess in process_dict[process]:
        if subprocess.startswith("total") or subprocess.startswith("weight"):
            continue
        print("Using subprocess ", subprocess)
        subprocess_dict = process_dict[process][subprocess]
        datasets_full_pathway = [
            os.path.join(subprocess_dict["storage_folder"], fn)
            for fn in subprocess_dict["all_extensions"]
        ]
        tmp_process_dict = {
            "datasets": datasets_full_pathway,
            "class_value": subprocess_dict["class_value"],
            "batch_start": subprocess_dict["batch_start"],
            "batch_size": subprocess_dict["batch_size"],
            "nBatches": subprocess_dict["nBatches"],
        }
        machine_yaml["processes"].append(tmp_process_dict)
process_dict[nParity_string][background_name][dataset_name][
    "total"
] += total
process_dict[nParity_string][background_name][dataset_name][
    "total_cut"
] += cut
process_dict[nParity_string][background_name][dataset_name][
    "total_cut_weighted"
] += weighted_cut
rdf_tmp = rdf_tmp.Define("class_value", f"{class_value}")
rdf_tmp = rdf_tmp.Define("X_mass", f"{X_mass}")
rdf_tmp.Snapshot(treeName, output_file)
Copilot AI Feb 27, 2026

The same file overwriting issue exists in the background loop. Each dataset_name writes to the same output_file path for a given nParity, which will overwrite previous files.

Additionally, line 157 uses hadd to merge all .root files in the nParity{n}_Merged directory, but if files are being overwritten as noted, this won't produce the expected merged result. The logic needs to ensure each dataset writes to a unique file name (e.g., including dataset_name in the filename) before the hadd step.
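One way the fix could look, sketched with made-up folder and dataset names (not the PR's actual variables):

```python
import os

# Hedged sketch: give each dataset its own Snapshot target inside the
# nParity{n}_Merged directory, so the later hadd glob over *.root sees
# one distinct file per dataset instead of a single overwritten file.
output_folder = "output"
nParity = 0
merged_dir = os.path.join(output_folder, f"nParity{nParity}_Merged")

def snapshot_path(dataset_name):
    # Unique per-dataset filename; hadd would then merge merged_dir/*.root
    return os.path.join(merged_dir, f"{dataset_name}_merge.root")

paths = [snapshot_path(d) for d in ("TWminustoLNu2Q", "WW", "WZ")]
```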


for nParity in range(config_dict["nParity"]):
    nParity_string = f"nParity_{nParity}"
    out_yaml = f"dataset_distritubution_parity{nParity}.yaml"
Copilot AI Feb 27, 2026

Spelling error in the output filename: "distritubution" should be "distribution".

Suggested change

-out_yaml = f"dataset_distritubution_parity{nParity}.yaml"
+out_yaml = f"dataset_distribution_parity{nParity}.yaml"

# hadd the files together to make a final merged.root
hadd_out = os.path.join(output_folder, f"nParity{nParity}_Merged.root")
hadd_in = os.path.join(output_folder, f"nParity{nParity}_Merged/*.root")
os.system(f"hadd {hadd_out} {hadd_in}")
Copilot AI Feb 27, 2026

Using os.system() for critical operations like hadd is risky because it doesn't check for errors. If the hadd command fails (e.g., due to missing files, disk space issues, or command not found), the script will continue silently.

Consider using subprocess.run() with check=True to ensure the command succeeds, or at least check the return code from os.system() and handle failures appropriately.
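A hedged sketch of that suggestion, with `echo` and `false` standing in for the real `hadd` invocation:

```python
import subprocess

# subprocess.run with check=True raises CalledProcessError on a non-zero
# exit code, instead of letting the script continue silently like os.system.
def run_checked(cmd):
    return subprocess.run(cmd, check=True, capture_output=True, text=True)

# "echo" stands in for hadd; the real call would pass hadd_out and the inputs.
result = run_checked(["echo", "hadd", "merged.root", "in1.root", "in2.root"])

try:
    run_checked(["false"])  # a command that exits non-zero
    merge_failed_silently = True
except subprocess.CalledProcessError:
    merge_failed_silently = False  # the failure is surfaced, not swallowed
```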

Comment on lines +85 to +87
def ReadFile(
    self, file_name, entry_start=None, entry_stop=None, hme_friend_file=None
):
Copilot AI Feb 27, 2026

The parameter hme_friend_file is defined but never used in the function body. This appears to be leftover from the old AnaTuple-based implementation. Since the PR is migrating to HistTuple inputs where HME data is embedded in the main file, this parameter should be removed for clarity.

Suggested change

-def ReadFile(
-    self, file_name, entry_start=None, entry_stop=None, hme_friend_file=None
-):
+def ReadFile(self, file_name, entry_start=None, entry_stop=None):

print(f"Reading file {file_name}")

features_to_load = self.feature_names
features_to_load = features_to_load + self.feature_names
Copilot AI Feb 27, 2026

Line 95 duplicates self.feature_names by concatenating it with itself, resulting in features_to_load containing each feature twice. This will cause issues when loading branches from the ROOT file.

This line should likely be removed, keeping only line 94 which initializes features_to_load with self.feature_names.

Suggested change

-features_to_load = features_to_load + self.feature_names

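Beyond deleting the duplicate line, an order-preserving de-duplication could guard against future repeats if feature lists are ever concatenated from several config sections; the feature names below are made up for illustration:

```python
# Hypothetical feature names, not from the actual training config.
feature_names = ["lep1_pt", "lep2_pt", "hme_mass"]

features_to_load = list(feature_names)
features_to_load = features_to_load + feature_names  # the accidental repeat

# dict.fromkeys drops duplicates while keeping first-seen order.
features_to_load = list(dict.fromkeys(features_to_load))
```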
@kandrosov kandrosov merged commit 19bfc78 into cms-flaf:main Feb 27, 2026
3 checks passed