Simultaneous Multi-Category Binning Optimization #15

Open
akhter-towsifa wants to merge 7 commits into cms-flaf:main from akhter-towsifa:multi_bin_opt

@akhter-towsifa

Currently, the binning optimization algorithm handles multi-category cases in the following way:

  • takes the most sensitive category (e.g. res2b) in a given channel and finds the binning that gives the lowest expected limit,
  • freezes this optimal binning and uses it as a starting point for the next category (e.g. res1b),
  • calculates the best binning for the current category (e.g. res1b), freezes this one and uses it as a starting point for the next category (e.g. boosted, and so on).

The current algorithm assumes some correlation in kinematics between the categories, but we want a more rigorous option that optimizes all the categories in a channel simultaneously.

  • This approach will be computationally heavier (each evaluation computes limits over multiple categories at a time),
  • but it will create a more consistent binning scheme across the different categories.
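
The difference between the two strategies can be sketched on a toy problem. This is only an illustration under stated assumptions: `toy_limit` is a hypothetical stand-in for the combined expected limit, the candidate binnings are made up, and an exhaustive search replaces the Bayesian optimizer used in the PR.

```python
from itertools import product


def toy_limit(binnings):
    """Hypothetical stand-in for the combined expected limit (lower is better)."""
    # More bins in a category lowers the toy limit for that category.
    total = sum(1.0 / (len(edges) - 1) for edges in binnings.values())
    # Toy coupling term: penalize very different bin counts between categories.
    counts = [len(edges) for edges in binnings.values()]
    return total + 0.1 * (max(counts) - min(counts))


def optimize_sequential(categories, candidates, limit):
    """Pick the best binning for one category at a time, freezing earlier choices."""
    frozen = {}
    for cat in categories:
        frozen[cat] = min(candidates[cat], key=lambda b: limit({**frozen, cat: b}))
    return frozen


def optimize_simultaneous(categories, candidates, limit):
    """Score every joint combination of binnings across all categories."""
    combos = product(*(candidates[cat] for cat in categories))
    best = min(combos, key=lambda combo: limit(dict(zip(categories, combo))))
    return dict(zip(categories, best))


categories = ["res2b", "res1b", "boosted"]
candidates = {
    "res2b": [[0.0, 1.0], [0.0, 0.5, 1.0]],
    "res1b": [[0.0, 1.0], [0.0, 0.5, 1.0], [0.0, 0.3, 0.6, 1.0]],
    "boosted": [[0.0, 1.0], [0.0, 0.5, 1.0]],
}
sequential = optimize_sequential(categories, candidates, toy_limit)
simultaneous = optimize_simultaneous(categories, candidates, toy_limit)
```

Because the sequential result is itself one of the joint combinations, the simultaneous search can never end up with a worse value of the objective on the same candidate set. The cost is that the number of evaluations grows with the product of the per-category candidate counts rather than their sum.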

@akhter-towsifa akhter-towsifa marked this pull request as ready for review February 25, 2026 15:39

Copilot AI left a comment

Pull request overview

This pull request implements simultaneous multi-category binning optimization for statistical inference, moving from a sequential approach where categories are optimized one-by-one to a simultaneous approach where all categories in a channel are optimized together. This is computationally heavier but produces more consistent binning schemes across categories.

Changes:

  • Adds new multi_optimize_binning.py module with MultiBinning and MultiBayesianOptimization classes for simultaneous multi-category optimization
  • Updates optimize_channel.py to support simultaneous multi-category optimization mode controlled by configuration
  • Replaces deprecated distutils.util.strtobool with a custom implementation in optimize_binning.py
  • Adds multi_category_optimization flag to bin_optimization.yaml configuration
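
For context on the strtobool item: distutils was removed from the standard library in Python 3.12, so distutils.util.strtobool can no longer be imported there. The PR's own replacement is not shown in this conversation. A minimal sketch of what such a helper could look like, accepting the same strings as the original, follows. Two differences are assumptions of this sketch: it returns a bool where the distutils version returned 1 or 0, and it strips surrounding whitespace.

```python
def strtobool(value):
    """Convert a truth-like string such as 'yes' or '0' to a bool."""
    normalized = str(value).strip().lower()
    if normalized in ("y", "yes", "t", "true", "on", "1"):
        return True
    if normalized in ("n", "no", "f", "false", "off", "0"):
        return False
    # Mirror the original behaviour of rejecting anything else.
    raise ValueError(f"invalid truth value {value!r}")
```

A function with this shape can be passed as `type=strtobool` to an argparse argument, since argparse reports a ValueError raised by the converter as a usage error.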

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 16 comments.

File | Description
bin_opt/multi_optimize_binning.py | New file implementing simultaneous multi-category binning optimization with MultiBinning class and MultiBayesianOptimization algorithm
bin_opt/optimize_channel.py | Adds multi-category optimization path, new helper functions for multi-category result selection and input file discovery
bin_opt/optimize_binning.py | Replaces deprecated distutils.util.strtobool with custom implementation for Python 3.12+ compatibility
bin_opt/bin_optimization.yaml | Adds multi_category_optimization configuration flag (set to True)
Comments suppressed due to low confidence (1)

bin_opt/optimize_channel.py:286

  • The variable other_cat_file is only assigned inside the inner loop (line 284) if a matching file is found. If no matching file exists in best_dir, the variable will be undefined when checked at line 286, causing a NameError. The check should be inside the loop or a default value should be set before the loop.
        for cat_idx in range(cat_index):
            cat = categories[cat_idx][0]
            for file in os.listdir(best_dir):
                if file.endswith('.txt') and cat in file:
                    other_cat_file = f"{best_dir}/{file}"

            if not os.path.isfile(other_cat_file):
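
One way to avoid the undefined variable is to move the lookup into a small helper that always returns a value. The sketch below is an assumption-laden illustration, not the PR's code: `find_category_file` is a hypothetical name, and it returns the first match in sorted order, whereas the loop above keeps the last match in directory order.

```python
import os


def find_category_file(best_dir, category):
    """Return the path of a .txt file in best_dir whose name contains category.

    Returns None when nothing matches, so the caller can test the result
    explicitly instead of relying on a variable that may never be assigned.
    """
    for name in sorted(os.listdir(best_dir)):
        if name.endswith(".txt") and category in name:
            return os.path.join(best_dir, name)
    return None
```

The caller would then check `if other_cat_file is None:` before using the path.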

if os.path.isfile(self.log_output):
    with open(self.log_output, 'r') as f:
        prev_binnins = json.loads('[' + ', '.join(f.readlines()) + ']')

Copilot AI Feb 25, 2026

Variable name mismatch: prev_binnins is assigned but prev_binnings is used in the loop. This will cause a NameError at runtime.

Suggested change
- prev_binnins = json.loads('[' + ', '.join(f.readlines()) + ']')
+ prev_binnings = json.loads('[' + ', '.join(f.readlines()) + ']')
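
Separately from the variable name, the log-reading expression joins all lines with commas, which assumes every line holds exactly one JSON document and that the file contains no blank lines. A minimal sketch of a line-by-line alternative, assuming the log is written as one JSON object per line (the name `read_previous_binnings` is hypothetical and not part of the PR):

```python
import json


def read_previous_binnings(log_path):
    """Parse a log file holding one JSON document per line, skipping blank lines."""
    with open(log_path, "r") as f:
        return [json.loads(line) for line in f if line.strip()]
```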

Comment on lines +265 to +280
n_best = len(self.best_binning.edges)
self.optimizer_lock.acquire()
if n_best - 2 < self.max_n_bins:
    for k in range(n_best - 1):
        edges = np.zeros(n_best + 1)
        edges[0:k+1] = self.best_binning.edges[0:k+1]
        edges[k+2:] = self.best_binning.edges[k+1:]
        for delta_edge in [ 0.1, 0.5, 0.9 ]:
            edges[k+1] = self.best_binning.edges[k] \
                       + (self.best_binning.edges[k+1] - self.best_binning.edges[k]) * delta_edge
            self.addSuggestion(edges)
if n_best > 2:
    for k in range(1, n_best):
        edges = np.zeros(n_best - 1)
        edges[0:k] = self.best_binning.edges[0:k]
        edges[k:] = self.best_binning.edges[k+1:]

Copilot AI Feb 25, 2026

The tryBestBinningSplit method references self.best_binning.edges, but MultiBinning objects don't have an edges attribute. They have edges_by_category which is a dictionary. This logic appears to be copied from the single-category optimization without proper adaptation.

Suggested change
- n_best = len(self.best_binning.edges)
- self.optimizer_lock.acquire()
- if n_best - 2 < self.max_n_bins:
-     for k in range(n_best - 1):
-         edges = np.zeros(n_best + 1)
-         edges[0:k+1] = self.best_binning.edges[0:k+1]
-         edges[k+2:] = self.best_binning.edges[k+1:]
-         for delta_edge in [ 0.1, 0.5, 0.9 ]:
-             edges[k+1] = self.best_binning.edges[k] \
-                        + (self.best_binning.edges[k+1] - self.best_binning.edges[k]) * delta_edge
-             self.addSuggestion(edges)
- if n_best > 2:
-     for k in range(1, n_best):
-         edges = np.zeros(n_best - 1)
-         edges[0:k] = self.best_binning.edges[0:k]
-         edges[k:] = self.best_binning.edges[k+1:]
+ # In the multi-binning case, edges are stored per category.
+ edges_by_category = getattr(self.best_binning, "edges_by_category", None)
+ if not edges_by_category:
+     # Nothing to split if there are no categories or edges
+     self.binning_lock.release()
+     return
+ # Use the first category's edges as the reference for splitting
+ first_category = next(iter(edges_by_category))
+ best_edges = edges_by_category[first_category]
+ n_best = len(best_edges)
+ self.optimizer_lock.acquire()
+ if n_best - 2 < self.max_n_bins:
+     for k in range(n_best - 1):
+         edges = np.zeros(n_best + 1)
+         edges[0:k+1] = best_edges[0:k+1]
+         edges[k+2:] = best_edges[k+1:]
+         for delta_edge in [ 0.1, 0.5, 0.9 ]:
+             edges[k+1] = best_edges[k] \
+                        + (best_edges[k+1] - best_edges[k]) * delta_edge
+             self.addSuggestion(edges)
+ if n_best > 2:
+     for k in range(1, n_best):
+         edges = np.zeros(n_best - 1)
+         edges[0:k] = best_edges[0:k]
+         edges[k:] = best_edges[k+1:]
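
The suggestion above uses only the first category's edges as the reference. An alternative that fits a per-category edges_by_category dictionary is to vary one category at a time while keeping the others fixed. The sketch below illustrates that idea with plain lists. All function names are hypothetical, the max_n_bins check is omitted, and removing only interior edges is an assumption of this sketch rather than the behaviour of the code in the PR.

```python
def split_candidates(edges, fractions=(0.1, 0.5, 0.9)):
    """Yield copies of edges with one extra edge inserted inside each bin."""
    for k in range(len(edges) - 1):
        for frac in fractions:
            new_edge = edges[k] + (edges[k + 1] - edges[k]) * frac
            yield list(edges[:k + 1]) + [new_edge] + list(edges[k + 1:])


def merge_candidates(edges):
    """Yield copies of edges with one interior edge removed (two bins merged)."""
    for k in range(1, len(edges) - 1):
        yield list(edges[:k]) + list(edges[k + 1:])


def neighbours_by_category(edges_by_category):
    """Yield full multi-category binnings that differ in exactly one category."""
    for cat, edges in edges_by_category.items():
        variations = list(split_candidates(edges)) + list(merge_candidates(edges))
        for new_edges in variations:
            # Copy the dictionary so each suggestion is independent.
            yield {**edges_by_category, cat: new_edges}
```

Each yielded dictionary could then be turned into a suggestion for the optimizer, which keeps the suggestion shape consistent with what a multi-category binning object stores.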

Comment on lines +228 to +235
for i, (category, _poi) in enumerate(categories):
    if category not in best_binnings[channel + "_" + era]:
        continue
    f.write('\t\t "{}": '.format(category))
    json.dump(best_binnings[channel + "_" + era][category], f)
    # trailing commas only between written entries
    f.write(",\n")
f.write("\t}\n}\n")

Copilot AI Feb 25, 2026

The JSON output will have invalid trailing commas. After the last category entry, a comma is always written (line 234), but this will be a trailing comma before the closing brace on line 235, which is invalid JSON syntax.

Suggested change
- for i, (category, _poi) in enumerate(categories):
-     if category not in best_binnings[channel + "_" + era]:
-         continue
-     f.write('\t\t "{}": '.format(category))
-     json.dump(best_binnings[channel + "_" + era][category], f)
-     # trailing commas only between written entries
-     f.write(",\n")
- f.write("\t}\n}\n")
+ first_written = True
+ for i, (category, _poi) in enumerate(categories):
+     if category not in best_binnings[channel + "_" + era]:
+         continue
+     if not first_written:
+         # comma between written entries
+         f.write(",\n")
+     first_written = False
+     f.write('\t\t "{}": '.format(category))
+     json.dump(best_binnings[channel + "_" + era][category], f)
+ if first_written:
+     # no categories were written
+     f.write("\t}\n}\n")
+ else:
+     # close JSON object after the last written entry
+     f.write("\n\t}\n}\n")
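
Another option is to avoid hand-written separators entirely: collect the selected entries in a dictionary and let json.dump serialize the whole structure. The sketch below assumes the output is a single object keyed by channel and era, which is inferred from the closing braces in the snippet; the function name and signature are hypothetical.

```python
import json


def write_best_binnings(path, key, categories, best_binnings):
    """Write {key: {category: binning}} for the categories that have a result."""
    selected = {
        category: best_binnings[key][category]
        for category, _poi in categories
        if category in best_binnings[key]
    }
    with open(path, "w") as f:
        # json.dump handles commas and braces, so the output is always valid JSON,
        # including the case where no category was selected.
        json.dump({key: selected}, f, indent=2)
        f.write("\n")
```

Since dictionaries preserve insertion order, the categories are written in the order they appear in `categories`.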

def pointToBundle(self, point):

    edges_by_cat = {}
    for c in range(len(self.datacards)):

Copilot AI Feb 25, 2026

The pointToBundle method references self.datacards, which does not exist. The attribute is named self.input_datacards based on the __init__ method.

Suggested change
- for c in range(len(self.datacards)):
+ for c in range(len(self.input_datacards)):

    point = self.optimizer.suggest()
else:
    point = self.suggestions.pop(0)
    self.suggestions.remove(point)

Copilot AI Feb 25, 2026

The suggestions.remove(point) call is redundant since pop(0) already removes the item from the list. Calling remove afterwards raises a ValueError unless an equal point is still in the list, for example whenever the list held only one item.

Suggested change
- self.suggestions.remove(point)

if os.path.isfile(result_file):
    with open(result_file, 'r') as f:
        result = json.load(f)
    if result['input_datacard'] in self.input_datacards.values():

Copilot AI Feb 25, 2026

The condition checks if result['input_datacard'] is in self.input_datacards.values(), but self.input_datacards is a list (not a dict), so it should check membership with in self.input_datacards instead.

Suggested change
- if result['input_datacard'] in self.input_datacards.values():
+ if result['input_datacard'] in self.input_datacards:

else:
    point = self.suggestions.pop(0)
    self.suggestions.remove(point)
self.optimize_lock.release()

Copilot AI Feb 25, 2026

Incorrect lock variable name: self.optimize_lock should be self.optimizer_lock to match the lock defined in __init__.

Suggested change
- self.optimize_lock.release()
+ self.optimizer_lock.release()
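
Mismatched acquire/release names like this can be avoided by holding the lock with a `with` statement, which releases it on every exit path. The sketch below is a simplified illustration under assumptions, not the PR's class: `SuggestionQueue` and its method names are hypothetical, and the fallback callable stands in for a call such as the optimizer's suggest method.

```python
import threading


class SuggestionQueue:
    """Thread-safe queue of suggested points with an optimizer fallback."""

    def __init__(self, suggest_fallback):
        self._lock = threading.Lock()
        self._suggestions = []
        # Called when no queued suggestion is available.
        self._suggest_fallback = suggest_fallback

    def add(self, point):
        with self._lock:
            self._suggestions.append(point)

    def next_point(self):
        # The with-statement releases the lock on every exit path, so there is
        # no separate release() call whose attribute name could be mistyped.
        with self._lock:
            if self._suggestions:
                # pop(0) both returns and removes the item; no remove() needed.
                return self._suggestions.pop(0)
            return self._suggest_fallback()
```

Queued suggestions are served first, in insertion order, and the fallback is only consulted once the queue is empty.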


class MultiBayesianOptimization:
    def __init__(self, max_n_bins, working_area, workers_dir, acq, kappa, xi,
                 input_datacards_list, poi, bkg_yields_dict, input_queue_size,

Copilot AI Feb 25, 2026

The parameter is named bkg_yields_dict but it's being passed as bkg_yields in the instantiation call, causing a parameter name mismatch.

Comment on lines +272 to +275
input_arg = ",".join([os.path.abspath(p) for p in datacards])
shape_arg = ",".join([os.path.abspath(p) for p in shape_files])
opt_cmd = f"python3 bin_opt/multi_optimize_binning.py --input {input_arg} --shape-file {shape_arg} --output {cat_dir} --workers-dir {workers_dir} --max_n_bins {max_n_bins} --poi {poi}"

Copilot AI Feb 25, 2026

In the sequential loop (lines 271-274), when multi_category_optimization is enabled, the code attempts to use multi_optimize_binning.py for a single category. However, the variables datacards and shape_files at this point may be empty if no matching files were found, since the loop at lines 251-259 may not populate them. Additionally, this conflicts with the intended design: the multi-category path starting at line 142 should handle all categories simultaneously and return, so this code path should never execute when multi_category_optimization is True.

Suggested change
- input_arg = ",".join([os.path.abspath(p) for p in datacards])
- shape_arg = ",".join([os.path.abspath(p) for p in shape_files])
- opt_cmd = f"python3 bin_opt/multi_optimize_binning.py --input {input_arg} --shape-file {shape_arg} --output {cat_dir} --workers-dir {workers_dir} --max_n_bins {max_n_bins} --poi {poi}"
+ raise RuntimeError(
+     "Reached per-category optimization loop with multi_category_optimization=True. "
+     "The multi-category optimization path should have handled all categories and returned "
+     "before this point."
+ )

if __name__ == '__main__':
    import argparse

    parser = argparse.ArgumentParser(description='Find optimal binning for multiple catgories that minimizes the combined expected limits.')

Copilot AI Feb 25, 2026

Spelling error in the argparse description string: "catgories" should be "categories".

Suggested change
- parser = argparse.ArgumentParser(description='Find optimal binning for multiple catgories that minimizes the combined expected limits.')
+ parser = argparse.ArgumentParser(description='Find optimal binning for multiple categories that minimizes the combined expected limits.')
