Simultaneous Multi-Category Binning Optimization #15
akhter-towsifa wants to merge 7 commits into cms-flaf:main from
Conversation
Pull request overview
This pull request implements simultaneous multi-category binning optimization for statistical inference, moving from a sequential approach where categories are optimized one-by-one to a simultaneous approach where all categories in a channel are optimized together. This is computationally heavier but produces more consistent binning schemes across categories.
Changes:
- Adds new `multi_optimize_binning.py` module with `MultiBinning` and `MultiBayesianOptimization` classes for simultaneous multi-category optimization
- Updates `optimize_channel.py` to support a simultaneous multi-category optimization mode controlled by configuration
- Replaces deprecated `distutils.util.strtobool` with a custom implementation in `optimize_binning.py`
- Adds a `multi_category_optimization` flag to the `bin_optimization.yaml` configuration
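Since `distutils` (and with it `strtobool`) was removed in Python 3.12, a local replacement is needed. A minimal sketch mirroring the old `distutils.util.strtobool` semantics — the actual implementation in `optimize_binning.py` may differ:

```python
def strtobool(val: str) -> int:
    """Drop-in replacement for the removed distutils.util.strtobool:
    returns 1 for truthy strings, 0 for falsy ones, else raises ValueError."""
    val = val.lower()
    if val in ('y', 'yes', 't', 'true', 'on', '1'):
        return 1
    if val in ('n', 'no', 'f', 'false', 'off', '0'):
        return 0
    raise ValueError(f"invalid truth value {val!r}")
```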
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 16 comments.
| File | Description |
|---|---|
| bin_opt/multi_optimize_binning.py | New file implementing simultaneous multi-category binning optimization with MultiBinning class and MultiBayesianOptimization algorithm |
| bin_opt/optimize_channel.py | Adds multi-category optimization path, new helper functions for multi-category result selection and input file discovery |
| bin_opt/optimize_binning.py | Replaces deprecated distutils.util.strtobool with custom implementation for Python 3.12+ compatibility |
| bin_opt/bin_optimization.yaml | Adds multi_category_optimization configuration flag (set to True) |
Comments suppressed due to low confidence (1)
bin_opt/optimize_channel.py:286
- The variable `other_cat_file` is only assigned inside the inner loop (line 284) if a matching file is found. If no matching file exists in `best_dir`, the variable will be undefined when checked at line 286, causing a NameError. The check should be moved inside the loop, or a default value should be set before the loop.
```python
for cat_idx in range(cat_index):
    cat = categories[cat_idx][0]
    for file in os.listdir(best_dir):
        if file.endswith('.txt') and cat in file:
            other_cat_file = f"{best_dir}/{file}"
    if not os.path.isfile(other_cat_file):
```
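One way to make the check well-defined, sketched here with a hypothetical helper rather than the PR's actual code, is to collect the match (or `None`) before testing it:

```python
import os

def find_category_file(best_dir, category):
    """Hypothetical helper: return the path of the last .txt file in
    best_dir whose name contains `category`, or None if no file matches,
    so the caller's existence check never hits an undefined variable."""
    found = None
    for file in os.listdir(best_dir):
        if file.endswith('.txt') and category in file:
            found = f"{best_dir}/{file}"
    return found
```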
```python
if os.path.isfile(self.log_output):
    with open(self.log_output, 'r') as f:
        prev_binnins = json.loads('[' + ', '.join(f.readlines()) + ']')
```
Variable name mismatch: `prev_binnins` is assigned but `prev_binnings` is used in the loop. This will cause a NameError at runtime.
Suggested change:
```python
prev_binnings = json.loads('[' + ', '.join(f.readlines()) + ']')
```
```python
n_best = len(self.best_binning.edges)
self.optimizer_lock.acquire()
if n_best - 2 < self.max_n_bins:
    for k in range(n_best - 1):
        edges = np.zeros(n_best + 1)
        edges[0:k+1] = self.best_binning.edges[0:k+1]
        edges[k+2:] = self.best_binning.edges[k+1:]
        for delta_edge in [ 0.1, 0.5, 0.9 ]:
            edges[k+1] = self.best_binning.edges[k] \
                + (self.best_binning.edges[k+1] - self.best_binning.edges[k]) * delta_edge
            self.addSuggestion(edges)
if n_best > 2:
    for k in range(1, n_best):
        edges = np.zeros(n_best - 1)
        edges[0:k] = self.best_binning.edges[0:k]
        edges[k:] = self.best_binning.edges[k+1:]
```
The `tryBestBinningSplit` method references `self.best_binning.edges`, but `MultiBinning` objects don't have an `edges` attribute; they have `edges_by_category`, which is a dictionary. This logic appears to have been copied from the single-category optimization without proper adaptation.
Suggested change:
```python
# In the multi-binning case, edges are stored per category.
edges_by_category = getattr(self.best_binning, "edges_by_category", None)
if not edges_by_category:
    # Nothing to split if there are no categories or edges
    self.binning_lock.release()
    return
# Use the first category's edges as the reference for splitting
first_category = next(iter(edges_by_category))
best_edges = edges_by_category[first_category]
n_best = len(best_edges)
self.optimizer_lock.acquire()
if n_best - 2 < self.max_n_bins:
    for k in range(n_best - 1):
        edges = np.zeros(n_best + 1)
        edges[0:k+1] = best_edges[0:k+1]
        edges[k+2:] = best_edges[k+1:]
        for delta_edge in [ 0.1, 0.5, 0.9 ]:
            edges[k+1] = best_edges[k] \
                + (best_edges[k+1] - best_edges[k]) * delta_edge
            self.addSuggestion(edges)
if n_best > 2:
    for k in range(1, n_best):
        edges = np.zeros(n_best - 1)
        edges[0:k] = best_edges[0:k]
        edges[k:] = best_edges[k+1:]
```
```python
for i, (category, _poi) in enumerate(categories):
    if category not in best_binnings[channel + "_" + era]:
        continue
    f.write('\t\t "{}": '.format(category))
    json.dump(best_binnings[channel + "_" + era][category], f)
    # trailing commas only between written entries
    f.write(",\n")
f.write("\t}\n}\n")
```
The JSON output will have invalid trailing commas. After the last category entry, a comma is always written (line 234), but this will be a trailing comma before the closing brace on line 235, which is invalid JSON syntax.
Suggested change:
```python
first_written = True
for i, (category, _poi) in enumerate(categories):
    if category not in best_binnings[channel + "_" + era]:
        continue
    if not first_written:
        # comma between written entries
        f.write(",\n")
    first_written = False
    f.write('\t\t "{}": '.format(category))
    json.dump(best_binnings[channel + "_" + era][category], f)
if first_written:
    # no categories were written
    f.write("\t}\n}\n")
else:
    # close JSON object after the last written entry
    f.write("\n\t}\n}\n")
```
```python
def pointToBundle(self, point):

    edges_by_cat = {}
    for c in range(len(self.datacards)):
```
The `pointToBundle` method references `self.datacards`, which does not exist; based on `__init__`, the attribute is named `self.input_datacards`.
Suggested change:
```python
for c in range(len(self.input_datacards)):
```
```python
    point = self.optimizer.suggest()
else:
    point = self.suggestions.pop(0)
    self.suggestions.remove(point)
```
The `self.suggestions.remove(point)` call is redundant, since `pop(0)` already removes the item from the list. Calling `remove` again will raise a ValueError if the list held only one item.
Suggested change (delete the redundant line):
```python
self.suggestions.remove(point)
```
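A toy illustration (stand-in list, not the actual suggestions queue) of why the extra call fails:

```python
suggestions = ['edges_a']      # toy stand-in for the suggestions queue
point = suggestions.pop(0)     # pop(0) removes AND returns the item
assert suggestions == []       # the queue is already empty here

failed = False
try:
    suggestions.remove(point)  # the redundant second removal
except ValueError:
    failed = True              # with one queued item, this always raises
assert failed
```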
```python
if os.path.isfile(result_file):
    with open(result_file, 'r') as f:
        result = json.load(f)
        if result['input_datacard'] in self.input_datacards.values():
```
The condition checks whether `result['input_datacard']` is in `self.input_datacards.values()`, but `self.input_datacards` is a list (not a dict), so membership should be checked with `in self.input_datacards` instead.
Suggested change:
```python
if result['input_datacard'] in self.input_datacards:
```
```python
else:
    point = self.suggestions.pop(0)
    self.suggestions.remove(point)
self.optimize_lock.release()
```
Incorrect lock variable name: `self.optimize_lock` should be `self.optimizer_lock` to match the lock defined in `__init__`.
Suggested change:
```python
self.optimizer_lock.release()
```
```python
class MultiBayesianOptimization:
    def __init__(self, max_n_bins, working_area, workers_dir, acq, kappa, xi,
                 input_datacards_list, poi, bkg_yields_dict, input_queue_size,
```
The parameter is named `bkg_yields_dict`, but it is passed as `bkg_yields` in the instantiation call, causing a parameter name mismatch.
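No suggestion accompanies this comment, but the failure mode is easy to reproduce with a toy class (the class body here is invented; only the parameter names come from the comment):

```python
class Demo:
    """Toy stand-in for MultiBayesianOptimization's constructor."""
    def __init__(self, bkg_yields_dict):
        self.bkg_yields_dict = bkg_yields_dict

# Passing the dict under the mismatched keyword raises TypeError;
# making the names agree on either side fixes the call.
try:
    Demo(bkg_yields={'some_category': 12.3})              # mismatched name
except TypeError:
    demo = Demo(bkg_yields_dict={'some_category': 12.3})  # matching name
```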
```python
input_arg = ",".join([os.path.abspath(p) for p in datacards])
shape_arg = ",".join([os.path.abspath(p) for p in shape_files])
opt_cmd = f"python3 bin_opt/multi_optimize_binning.py --input {input_arg} --shape-file {shape_arg} --output {cat_dir} --workers-dir {workers_dir} --max_n_bins {max_n_bins} --poi {poi}"
```
In the sequential loop (lines 271-274), when `multi_category_optimization` is enabled, the code attempts to use `multi_optimize_binning.py` for a single category. However, the variables `datacards` and `shape_files` may be empty at this point if no matching files were found, since the loop at lines 251-259 may not populate them. Additionally, this conflicts with the intended design: the multi-category path starting at line 142 should handle all categories simultaneously and return, so this code path should never execute when `multi_category_optimization` is True.
Suggested change:
```python
raise RuntimeError(
    "Reached per-category optimization loop with multi_category_optimization=True. "
    "The multi-category optimization path should have handled all categories and returned "
    "before this point."
)
```
```python
if __name__ == '__main__':
    import argparse

    parser = argparse.ArgumentParser(description='Find optimal binning for multiple catgories that minimizes the combined expected limits.')
```
Spelling error in the docstring: "catgories" should be "categories".
Suggested change:
```python
parser = argparse.ArgumentParser(description='Find optimal binning for multiple categories that minimizes the combined expected limits.')
```
Currently, the binning optimization algorithm handles multi-category cases sequentially, optimizing the categories in a channel one at a time. This assumes some correlation in kinematics between the categories; we want a more rigorous option that optimizes all the categories in a channel simultaneously.
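As a rough sketch of what "simultaneously" means here — a hypothetical helper loosely modeled on the `pointToBundle` / `edges_by_category` names in the new module — a single flat optimizer point can be split into per-category edge lists, so that one objective evaluation (e.g. the combined expected limit) covers every category in the channel:

```python
def point_to_bundle(point, categories, n_edges):
    """Split one flat optimizer point into per-category bin edges.

    Hypothetical sketch: `point` is a flat sequence of edge positions,
    `categories` is a list of category names, and `n_edges[cat]` is how
    many edges are allotted to each category.
    """
    edges_by_category = {}
    offset = 0
    for cat in categories:
        n = n_edges[cat]
        # Sort so each category receives monotonically increasing edges.
        edges_by_category[cat] = sorted(point[offset:offset + n])
        offset += n
    return edges_by_category
```

Because every category's edges come from the same optimizer point, the joint space is explored at once, which is what makes this approach computationally heavier but more consistent than the sequential one-category-at-a-time scheme.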