Update DNN for HistTuple Inputs #70
Conversation
…ds and batch configs
@kandrosov please review, I think I'm done with this PR. It successfully creates training datasets, models, and validation plots. Next step for me is to work on the AnalysisCache payload to run this new style of DNN, for which I'd like to merge this and clean my local branch.
```yaml
  - TWminustoLNu2Q
  - WW
  - WZ
```
Why is ZZ missing? What about single Higgs backgrounds?
Same questions for the resolved setup.
Since I'm focusing on the 'main backgrounds' right now, I considered these too small to include. This PR is mostly to formalize the code and share it with Artem, not to make the configuration perfect yet.
If you want a real review of how the training is done, that's okay, but my understanding was that we'd do a bare-minimum proof of concept and then, if it works, make it official.
Pull request overview
This pull request represents a major architectural change to the DNN training infrastructure, migrating from AnaTuple-based inputs to HistTuple-based inputs. The change simplifies the data preparation pipeline by eliminating complex batching logic and using ROOT's RDataFrame for filtering and processing. The new approach directly reads from HistTuple files that already contain high-level features and HME estimates, removing the need for friend files and batch configuration files.
Changes:
- Refactored dataset creation from ~1700 lines to ~250 lines using RDataFrame instead of custom batching
- Updated tasks to use the `fs_histograms` filesystem instead of `fs_anaTuple`
- Created a new `DNN_Class_HistTuples.py` module for training with HistTuple inputs
- Added new configuration files for resolved and boosted topologies
- Removed obsolete batch configuration parameters and HME friend file handling
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| Studies/DNN/tasks.py | Changed filesystem from fs_anaTuple to fs_histograms; removed batch_config and hme_friend_file parameters |
| Studies/DNN/create_dataset.py | Complete rewrite using RDataFrame; simplified from complex batching to direct filtering and merging |
| Studies/DNN/create_condor_configs.py | Updated to reference new training setup configs; removed batch_config and hme_friend_file references |
| Studies/DNN/config/training_setup_doubleLep_resolved.yaml | New training configuration for resolved topology with HME features |
| Studies/DNN/config/training_setup_doubleLep_boosted.yaml | New training configuration for boosted topology with fat jet features |
| Studies/DNN/config/dataset_setup_doubleLep_resolved.yaml | New dataset configuration for resolved topology with selection cuts |
| Studies/DNN/config/dataset_setup_doubleLep_boosted.yaml | New dataset configuration for boosted topology with selection cuts |
| Studies/DNN/config/step2_training_setup_doubleLep.yaml | Removed obsolete configuration file |
| Studies/DNN/config/default_training_setup_singleLep.yaml | Removed obsolete configuration file |
| Studies/DNN/config/default_training_setup_doubleLep.yaml | Removed obsolete configuration file |
| Studies/DNN/config/default_split_singleLep.yaml | Removed obsolete split configuration |
| Studies/DNN/config/default_split_doubleLep.yaml | Removed obsolete split configuration |
| Studies/DNN/config/dataset_split.yaml | Removed obsolete dataset split configuration |
| Studies/DNN/config/YH_split_singleLep.yaml | Removed obsolete YH split configuration |
| Studies/DNN/config/YH_split_doubleLep.yaml | Removed obsolete YH split configuration |
| Studies/DNN/TupleMaker.h | Removed obsolete C++ header for custom batching |
| Studies/DNN/DNNEntryQueue.h | Removed obsolete C++ header for thread-safe queue |
| Studies/DNN/DNN_Validator_Condor.py | Updated to use DNN_Class_HistTuples; removed batch_config and hme_friend_file parameters |
| Studies/DNN/DNN_Trainer_Condor.py | Updated to use DNN_Class_HistTuples; removed batch_config and hme_friend_file parameters |
| Studies/DNN/DNN_Class_HistTuples.py | New module implementing DataWrapper and Model classes for HistTuple-based training |
Studies/DNN/create_dataset.py
Outdated
```python
parity_cut = parity_cut.format(
    nParity=config_dict["nParity"], parity_scan=nParity
)
```
The variable name parity_cut is being reused and mutated inside the loop. On line 21, parity_cut is loaded from the config, then on lines 82 and 125 it's reassigned with a formatted string. This overwrites the original template string from the config, which would break if the loop were to iterate again.
Consider renaming the formatted version to parity_cut_formatted or similar to avoid confusion and potential bugs. The same pattern occurs in both signal and background loops.
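A minimal sketch of the suggested rename (the template string and loop bounds below are illustrative placeholders, not the actual config values):

```python
# Sketch of the suggested fix: keep the config's template string intact and
# bind the formatted cut to a new name inside the loop. The "parity_cut"
# expression here is an illustrative placeholder, not the real config value.
config_dict = {
    "parity_cut": "event % {nParity} != {parity_scan}",
    "nParity": 2,
}

parity_cut_template = config_dict["parity_cut"]  # loaded once, never mutated
formatted_cuts = []
for nParity in range(config_dict["nParity"]):
    parity_cut_formatted = parity_cut_template.format(
        nParity=config_dict["nParity"], parity_scan=nParity
    )
    formatted_cuts.append(parity_cut_formatted)
    # rdf.Filter(parity_cut_formatted) would go here

# The template survives every iteration, so each parity gets a correct cut.
```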
Studies/DNN/create_dataset.py
Outdated
```python
parity_cut = parity_cut.format(
    nParity=config_dict["nParity"], parity_scan=nParity
)
```
Same issue as lines 82-84: the variable parity_cut is reassigned inside the loop, overwriting the template string. This should use a differently named variable like parity_cut_formatted to preserve the original template for subsequent iterations.
Studies/DNN/create_dataset.py
Outdated
```python
if subprocess == config_dict["signal"][process]["combined_name"]:
    continue
process_dict[process]["total"] += process_dict[process][subprocess][
process_dict[nParity_string][signal_name][mass_point]["total"] += total
```
The total count is being added multiple times (once per nParity iteration) to the same mass_point entry. Since total is computed once outside the nParity loop but added inside it, the dictionary will accumulate total * nParity rather than just total.
If the intent is to track total events per mass_point regardless of parity, move this line outside the nParity loop. If the intent is to track per-parity totals, restructure the dictionary accordingly.
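The difference can be illustrated with a stripped-down sketch (`process_dict` and its keys are simplified stand-ins for the real structure):

```python
# Simplified stand-in for the real process_dict bookkeeping.
nParity = 2
total = 100  # computed once, outside the parity loop

# Buggy pattern: "total" is incremented once per parity iteration,
# so it ends up holding total * nParity.
buggy_entry = {"total": 0}
for parity in range(nParity):
    buggy_entry["total"] += total  # runs nParity times

# Suggested fix: accumulate once, outside the parity loop.
fixed_entry = {"total": 0}
fixed_entry["total"] += total
for parity in range(nParity):
    pass  # only per-parity work remains here
```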
```python
process_dict[nParity_string][background_name][dataset_name][
    "total"
] += total
process_dict[nParity_string][background_name][dataset_name][
```
Similar to the signal loop issue, the total count is being added multiple times (once per nParity iteration) to the same dataset_name entry. This will accumulate total * nParity instead of just total.
Move this line outside the nParity loop if the intent is to track total events per dataset regardless of parity.
Suggested change:
```diff
-process_dict[nParity_string][background_name][dataset_name][
-    "total"
-] += total
+if nParity == 0:
+    process_dict[nParity_string][background_name][dataset_name][
+        "total"
+    ] += total
 process_dict[nParity_string][background_name][dataset_name][
```
Studies/DNN/DNN_Class_HistTuples.py
Outdated
```python
    self, file_name, entry_start=None, entry_stop=None, hme_friend_file=None
):
    if self.feature_names == None:
        print("Uknown branches to read! DefineInputFeatures first!")
```
Spelling error in the error message: "Uknown" should be "Unknown".
Suggested change:
```diff
-print("Uknown branches to read! DefineInputFeatures first!")
+print("Unknown branches to read! DefineInputFeatures first!")
```
```python
output_file = os.path.join(output_nParity, f"{dataset_name}_merge.root")

current_index = 0
for process in process_dict.keys():
    for subprocess in process_dict[process].keys():
        if subprocess.startswith("total") or subprocess.startswith("weight"):
            continue
        process_dict[process][subprocess]["batch_start"] = current_index
        current_index += process_dict[process][subprocess]["batch_size"]
rdf_tmp = rdf.Filter(parity_cut)
cut = rdf_tmp.Count().GetValue()
weighted_cut = rdf_tmp.Sum("weight_Central").GetValue()

nBatches = 1e100
for process in process_dict.keys():
    for subprocess in process_dict[process].keys():
        if subprocess.startswith("total") or subprocess.startswith("weight"):
            continue
        if process_dict[process][subprocess]["nBatches"] < nBatches and (
            process_dict[process][subprocess]["nBatches"] != 0
        ):
            nBatches = process_dict[process][subprocess]["nBatches"]

print(f"Creating {nBatches} batches, according to distribution. ")
print(process_dict)
print(f"And total batch size is {batch_dict['batch_size']}")

machine_yaml = {
    "meta_data": {},
    "processes": [],
}

machine_yaml["meta_data"]["storage_folder"] = storage_folder
machine_yaml["meta_data"]["batch_dict"] = batch_dict
machine_yaml["meta_data"]["selection_branches"] = selection_branches
machine_yaml["meta_data"]["selection_cut"] = total_cut
machine_yaml["meta_data"]["iterate_cut"] = config_dict["iterate_cut"].format(
    nParity=config_dict["nParity"], parity_scan=nParity
)
machine_yaml["meta_data"][
    "empty_dict_example"
] = empty_dict_example_file  # Example for empty dict structure
machine_yaml["meta_data"]["input_filename"] = f"batchfile{nParity}.root"
machine_yaml["meta_data"][
    "hme_friend_filename"
] = f"batchfile{nParity}_HME_Friend.root"
machine_yaml["meta_data"][
    "output_DNNname"
] = f"ResHH_Classifier_parity{nParity}"

machine_yaml["meta_data"][
    "spin_mass_dist"
] = spin_mass_dist  # Dict of spin/mass distribution values for random choice parametric

for process in process_dict:
    for subprocess in process_dict[process]:
        if subprocess.startswith("total") or subprocess.startswith("weight"):
            continue
        print("Using subprocess ", subprocess)
        subprocess_dict = process_dict[process][subprocess]
        datasets_full_pathway = [
            os.path.join(subprocess_dict["storage_folder"], fn)
            for fn in subprocess_dict["all_extensions"]
        ]
        tmp_process_dict = {
            "datasets": datasets_full_pathway,
            "class_value": subprocess_dict["class_value"],
            "batch_start": subprocess_dict["batch_start"],
            "batch_size": subprocess_dict["batch_size"],
            "nBatches": subprocess_dict["nBatches"],
        }
        machine_yaml["processes"].append(tmp_process_dict)
process_dict[nParity_string][background_name][dataset_name][
    "total"
] += total
process_dict[nParity_string][background_name][dataset_name][
    "total_cut"
] += cut
process_dict[nParity_string][background_name][dataset_name][
    "total_cut_weighted"
] += weighted_cut
rdf_tmp = rdf_tmp.Define("class_value", f"{class_value}")
rdf_tmp = rdf_tmp.Define("X_mass", f"{X_mass}")
rdf_tmp.Snapshot(treeName, output_file)
```
The same file overwriting issue exists in the background loop. Each dataset_name writes to the same output_file path for a given nParity, which will overwrite previous files.
Additionally, line 157 uses hadd to merge all .root files in the nParity{n}_Merged directory, but if files are being overwritten as noted, this won't produce the expected merged result. The logic needs to ensure each dataset writes to a unique file name (e.g., including dataset_name in the filename) before the hadd step.
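A minimal sketch of that suggestion, with illustrative folder and dataset names (`unique_snapshot_path` is a hypothetical helper, not a function in the PR):

```python
import os

# Hypothetical helper: one snapshot path per (parity, dataset) pair so no
# two datasets write to the same file before the hadd step. Folder and
# dataset names below are illustrative assumptions.
def unique_snapshot_path(output_folder, nParity, dataset_name):
    merged_dir = os.path.join(output_folder, f"nParity{nParity}_Merged")
    return os.path.join(merged_dir, f"{dataset_name}_merge.root")

# Each dataset gets a distinct output file, so hadd later sees all of them.
paths = [unique_snapshot_path("datasets", 0, d) for d in ("DY", "TT", "WW")]
```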
Studies/DNN/create_dataset.py
Outdated
```python
for nParity in range(config_dict["nParity"]):
    nParity_string = f"nParity_{nParity}"
    out_yaml = f"dataset_distritubution_parity{nParity}.yaml"
```
Spelling error in the output filename: "distritubution" should be "distribution".
Suggested change:
```diff
-out_yaml = f"dataset_distritubution_parity{nParity}.yaml"
+out_yaml = f"dataset_distribution_parity{nParity}.yaml"
```
Studies/DNN/create_dataset.py
Outdated
```python
# hadd the files together to make a final merged.root
hadd_out = os.path.join(output_folder, f"nParity{nParity}_Merged.root")
hadd_in = os.path.join(output_folder, f"nParity{nParity}_Merged/*.root")
os.system(f"hadd {hadd_out} {hadd_in}")
```
Using os.system() for critical operations like hadd is risky because it doesn't check for errors. If the hadd command fails (e.g., due to missing files, disk space issues, or command not found), the script will continue silently.
Consider using subprocess.run() with check=True to ensure the command succeeds, or at least check the return code from os.system() and handle failures appropriately.
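A possible sketch of that fix, with illustrative paths; note that `subprocess.run()` with an argument list performs no shell globbing, so the wildcard must be expanded with `glob` first:

```python
import glob
import os
import subprocess

def hadd_merge(output_folder, nParity):
    """Merge per-dataset ROOT files with hadd, failing loudly on any error."""
    hadd_out = os.path.join(output_folder, f"nParity{nParity}_Merged.root")
    # subprocess.run with a list does no shell globbing, so expand it here.
    hadd_in = sorted(
        glob.glob(os.path.join(output_folder, f"nParity{nParity}_Merged", "*.root"))
    )
    if not hadd_in:
        raise FileNotFoundError(f"no inputs to merge for parity {nParity}")
    # check=True raises CalledProcessError if hadd exits non-zero.
    subprocess.run(["hadd", hadd_out, *hadd_in], check=True)
```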
Studies/DNN/DNN_Class_HistTuples.py
Outdated
```python
def ReadFile(
    self, file_name, entry_start=None, entry_stop=None, hme_friend_file=None
):
```
The parameter hme_friend_file is defined but never used in the function body. This appears to be leftover from the old AnaTuple-based implementation. Since the PR is migrating to HistTuple inputs where HME data is embedded in the main file, this parameter should be removed for clarity.
Suggested change:
```diff
-def ReadFile(
-    self, file_name, entry_start=None, entry_stop=None, hme_friend_file=None
-):
+def ReadFile(self, file_name, entry_start=None, entry_stop=None):
```
Studies/DNN/DNN_Class_HistTuples.py
Outdated
```python
print(f"Reading file {file_name}")

features_to_load = self.feature_names
features_to_load = features_to_load + self.feature_names
```
Line 95 duplicates self.feature_names by concatenating it with itself, resulting in features_to_load containing each feature twice. This will cause issues when loading branches from the ROOT file.
This line should likely be removed, keeping only line 94 which initializes features_to_load with self.feature_names.
Suggested change:
```diff
-features_to_load = features_to_load + self.feature_names
```
No description provided.