Simultaneous Multi-Category Binning Optimization #15
akhter-towsifa wants to merge 7 commits into cms-flaf:main from
Conversation
Pull request overview
This pull request implements simultaneous multi-category binning optimization for statistical inference, moving from a sequential approach where categories are optimized one-by-one to a simultaneous approach where all categories in a channel are optimized together. This is computationally heavier but produces more consistent binning schemes across categories.
Changes:
- Adds new `multi_optimize_binning.py` module with `MultiBinning` and `MultiBayesianOptimization` classes for simultaneous multi-category optimization
- Updates `optimize_channel.py` to support a simultaneous multi-category optimization mode controlled by configuration
- Replaces deprecated `distutils.util.strtobool` with a custom implementation in `optimize_binning.py`
- Adds a `multi_category_optimization` flag to the `bin_optimization.yaml` configuration
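Since `distutils` (and with it `strtobool`) was removed in Python 3.12, a local replacement is needed. A minimal sketch mirroring the old `distutils.util.strtobool` semantics — the actual implementation in `optimize_binning.py` may differ:

```python
def strtobool(val: str) -> int:
    """Drop-in replacement for the removed distutils.util.strtobool:
    returns 1 for truthy strings, 0 for falsy ones, else raises ValueError."""
    val = val.lower()
    if val in ('y', 'yes', 't', 'true', 'on', '1'):
        return 1
    if val in ('n', 'no', 'f', 'false', 'off', '0'):
        return 0
    raise ValueError(f"invalid truth value {val!r}")
```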
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 16 comments.
| File | Description |
|---|---|
| bin_opt/multi_optimize_binning.py | New file implementing simultaneous multi-category binning optimization with MultiBinning class and MultiBayesianOptimization algorithm |
| bin_opt/optimize_channel.py | Adds multi-category optimization path, new helper functions for multi-category result selection and input file discovery |
| bin_opt/optimize_binning.py | Replaces deprecated distutils.util.strtobool with custom implementation for Python 3.12+ compatibility |
| bin_opt/bin_optimization.yaml | Adds multi_category_optimization configuration flag (set to True) |
Comments suppressed due to low confidence (1)
bin_opt/optimize_channel.py:286
- The variable `other_cat_file` is only assigned inside the inner loop (line 284) if a matching file is found. If no matching file exists in `best_dir`, the variable will be undefined when checked at line 286, causing a NameError. The check should be moved inside the loop, or a default value should be set before the loop.
```python
for cat_idx in range(cat_index):
    cat = categories[cat_idx][0]
    for file in os.listdir(best_dir):
        if file.endswith('.txt') and cat in file:
            other_cat_file = f"{best_dir}/{file}"
    if not os.path.isfile(other_cat_file):
```
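One way to make the check well-defined, sketched here with a hypothetical helper rather than the PR's actual code, is to collect the match (or `None`) before testing it:

```python
import os

def find_category_file(best_dir, category):
    """Hypothetical helper: return the path of the last .txt file in
    best_dir whose name contains `category`, or None if no file matches,
    so the caller's existence check never hits an undefined variable."""
    found = None
    for file in os.listdir(best_dir):
        if file.endswith('.txt') and category in file:
            found = f"{best_dir}/{file}"
    return found
```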
```python
if os.path.isfile(self.log_output):
    with open(self.log_output, 'r') as f:
        prev_binnins = json.loads('[' + ', '.join(f.readlines()) + ']')
```
Variable name mismatch: `prev_binnins` is assigned but `prev_binnings` is used in the loop. This will cause a NameError at runtime.
Suggested change:
```python
prev_binnings = json.loads('[' + ', '.join(f.readlines()) + ']')
```
```python
n_best = len(self.best_binning.edges)
self.optimizer_lock.acquire()
if n_best - 2 < self.max_n_bins:
    for k in range(n_best - 1):
        edges = np.zeros(n_best + 1)
        edges[0:k+1] = self.best_binning.edges[0:k+1]
        edges[k+2:] = self.best_binning.edges[k+1:]
        for delta_edge in [ 0.1, 0.5, 0.9 ]:
            edges[k+1] = self.best_binning.edges[k] \
                + (self.best_binning.edges[k+1] - self.best_binning.edges[k]) * delta_edge
            self.addSuggestion(edges)
if n_best > 2:
    for k in range(1, n_best):
        edges = np.zeros(n_best - 1)
        edges[0:k] = self.best_binning.edges[0:k]
        edges[k:] = self.best_binning.edges[k+1:]
```
The `tryBestBinningSplit` method references `self.best_binning.edges`, but `MultiBinning` objects don't have an `edges` attribute; they have `edges_by_category`, which is a dictionary. This logic appears to have been copied from the single-category optimization without proper adaptation.
Suggested change:
```python
# In the multi-binning case, edges are stored per category.
edges_by_category = getattr(self.best_binning, "edges_by_category", None)
if not edges_by_category:
    # Nothing to split if there are no categories or edges
    self.binning_lock.release()
    return
# Use the first category's edges as the reference for splitting
first_category = next(iter(edges_by_category))
best_edges = edges_by_category[first_category]
n_best = len(best_edges)
self.optimizer_lock.acquire()
if n_best - 2 < self.max_n_bins:
    for k in range(n_best - 1):
        edges = np.zeros(n_best + 1)
        edges[0:k+1] = best_edges[0:k+1]
        edges[k+2:] = best_edges[k+1:]
        for delta_edge in [ 0.1, 0.5, 0.9 ]:
            edges[k+1] = best_edges[k] \
                + (best_edges[k+1] - best_edges[k]) * delta_edge
            self.addSuggestion(edges)
if n_best > 2:
    for k in range(1, n_best):
        edges = np.zeros(n_best - 1)
        edges[0:k] = best_edges[0:k]
        edges[k:] = best_edges[k+1:]
```
```python
for i, (category, _poi) in enumerate(categories):
    if category not in best_binnings[channel + "_" + era]:
        continue
    f.write('\t\t "{}": '.format(category))
    json.dump(best_binnings[channel + "_" + era][category], f)
    # trailing commas only between written entries
    f.write(",\n")
f.write("\t}\n}\n")
```
The JSON output will have invalid trailing commas. After the last category entry, a comma is always written (line 234), but this will be a trailing comma before the closing brace on line 235, which is invalid JSON syntax.
Suggested change:
```python
first_written = True
for i, (category, _poi) in enumerate(categories):
    if category not in best_binnings[channel + "_" + era]:
        continue
    if not first_written:
        # comma between written entries
        f.write(",\n")
    first_written = False
    f.write('\t\t "{}": '.format(category))
    json.dump(best_binnings[channel + "_" + era][category], f)
if first_written:
    # no categories were written
    f.write("\t}\n}\n")
else:
    # close JSON object after the last written entry
    f.write("\n\t}\n}\n")
```
```python
def pointToBundle(self, point):

    edges_by_cat = {}
    for c in range(len(self.datacards)):
```
The `pointToBundle` method references `self.datacards`, which does not exist; based on `__init__`, the attribute is named `self.input_datacards`.
Suggested change:
```python
for c in range(len(self.input_datacards)):
```
```python
    point = self.optimizer.suggest()
else:
    point = self.suggestions.pop(0)
    self.suggestions.remove(point)
```
The `self.suggestions.remove(point)` call is redundant, since `pop(0)` already removes the item from the list. Calling `remove` again will raise a ValueError if the list held only one item.
Suggested change (delete the redundant line):
```python
self.suggestions.remove(point)
```
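A toy illustration (stand-in list, not the actual suggestions queue) of why the extra call fails:

```python
suggestions = ['edges_a']      # toy stand-in for the suggestions queue
point = suggestions.pop(0)     # pop(0) removes AND returns the item
assert suggestions == []       # the queue is already empty here

failed = False
try:
    suggestions.remove(point)  # the redundant second removal
except ValueError:
    failed = True              # with one queued item, this always raises
assert failed
```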
```python
if os.path.isfile(result_file):
    with open(result_file, 'r') as f:
        result = json.load(f)
        if result['input_datacard'] in self.input_datacards.values():
```
The condition checks whether `result['input_datacard']` is in `self.input_datacards.values()`, but `self.input_datacards` is a list (not a dict), so membership should be checked with `in self.input_datacards` instead.
Suggested change:
```python
if result['input_datacard'] in self.input_datacards:
```
```python
else:
    point = self.suggestions.pop(0)
    self.suggestions.remove(point)
self.optimize_lock.release()
```
Incorrect lock variable name: `self.optimize_lock` should be `self.optimizer_lock` to match the lock defined in `__init__`.
Suggested change:
```python
self.optimizer_lock.release()
```
```python
class MultiBayesianOptimization:
    def __init__(self, max_n_bins, working_area, workers_dir, acq, kappa, xi,
                 input_datacards_list, poi, bkg_yields_dict, input_queue_size,
```
The parameter is named `bkg_yields_dict`, but it is passed as `bkg_yields` in the instantiation call, causing a parameter name mismatch.
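No suggestion accompanies this comment, but the failure mode is easy to reproduce with a toy class (the class body here is invented; only the parameter names come from the comment):

```python
class Demo:
    """Toy stand-in for MultiBayesianOptimization's constructor."""
    def __init__(self, bkg_yields_dict):
        self.bkg_yields_dict = bkg_yields_dict

# Passing the dict under the mismatched keyword raises TypeError;
# making the names agree on either side fixes the call.
try:
    Demo(bkg_yields={'some_category': 12.3})              # mismatched name
except TypeError:
    demo = Demo(bkg_yields_dict={'some_category': 12.3})  # matching name
```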
```python
input_arg = ",".join([os.path.abspath(p) for p in datacards])
shape_arg = ",".join([os.path.abspath(p) for p in shape_files])
opt_cmd = f"python3 bin_opt/multi_optimize_binning.py --input {input_arg} --shape-file {shape_arg} --output {cat_dir} --workers-dir {workers_dir} --max_n_bins {max_n_bins} --poi {poi}"
```
In the sequential loop (lines 271-274), when `multi_category_optimization` is enabled, the code attempts to use `multi_optimize_binning.py` for a single category. However, the variables `datacards` and `shape_files` may be empty at this point if no matching files were found, since the loop at lines 251-259 may not populate them. Additionally, this conflicts with the intended design: the multi-category path starting at line 142 should handle all categories simultaneously and return, so this code path should never execute when `multi_category_optimization` is True.
Suggested change:
```python
raise RuntimeError(
    "Reached per-category optimization loop with multi_category_optimization=True. "
    "The multi-category optimization path should have handled all categories and returned "
    "before this point."
)
```
```python
if __name__ == '__main__':
    import argparse

    parser = argparse.ArgumentParser(description='Find optimal binning for multiple catgories that minimizes the combined expected limits.')
```
Spelling error in the docstring: "catgories" should be "categories".
Suggested change:
```python
parser = argparse.ArgumentParser(description='Find optimal binning for multiple categories that minimizes the combined expected limits.')
```
Currently, the binning optimization algorithm handles multi-category cases sequentially, optimizing the categories in a channel one at a time. This assumes some correlation in kinematics between the categories; we want a more rigorous option that optimizes all the categories in a channel simultaneously.
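As a rough sketch of what "simultaneously" means here — a hypothetical helper loosely modeled on the `pointToBundle` / `edges_by_category` names in the new module — a single flat optimizer point can be split into per-category edge lists, so that one objective evaluation (e.g. the combined expected limit) covers every category in the channel:

```python
def point_to_bundle(point, categories, n_edges):
    """Split one flat optimizer point into per-category bin edges.

    Hypothetical sketch: `point` is a flat sequence of edge positions,
    `categories` is a list of category names, and `n_edges[cat]` is how
    many edges are allotted to each category.
    """
    edges_by_category = {}
    offset = 0
    for cat in categories:
        n = n_edges[cat]
        # Sort so each category receives monotonically increasing edges.
        edges_by_category[cat] = sorted(point[offset:offset + n])
        offset += n
    return edges_by_category
```

Because every category's edges come from the same optimizer point, the joint space is explored at once, which is what makes this approach computationally heavier but more consistent than the sequential one-category-at-a-time scheme.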