Conversation

@gbeane (Collaborator) commented Oct 17, 2025

Overview: Probability Calibration via CalibratedClassifierCV

What it is
CalibratedClassifierCV wraps any classifier (RF/GBT/XGBoost, etc.) and learns a mapping from raw model scores to well-calibrated probabilities. It does this with an internal cross-validation loop: in each fold, it fits the base model on train-fold data and learns a calibration function on the fold’s held-out data (a minimal sketch follows the list below). Two calibration methods are supported:

  • isotonic (non-parametric, flexible; best with enough data)
  • sigmoid (Platt scaling; smoother, works better on smaller data)
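
For context, here is a minimal scikit-learn sketch of the wrapper on synthetic data (not the JABS training code; X, y, and the forest settings are placeholders):

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # toy feature matrix and binary labels
    X, y = make_classification(n_samples=500, random_state=0)

    # each of the 3 internal folds fits the forest on the fold's training
    # portion and learns an isotonic mapping on the held-out portion
    clf = CalibratedClassifierCV(
        RandomForestClassifier(random_state=0), method="isotonic", cv=3
    )
    clf.fit(X, y)
    calibrated_probs = clf.predict_proba(X)[:, 1]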

Why calibration matters
Tree models often output overconfident probabilities (lots of ~0.0 or ~1.0). Calibration fixes this so that predicted probabilities reflect reality (e.g., among samples with p≈0.7, ~70% are positive). Better calibration improves:

  • Thresholding: a chosen cutoff corresponds to a real event rate rather than an arbitrary score.
  • Loss-based metrics: log-loss / Brier score reflect actual probability quality (see the sketch after this list).
  • User trust: fewer “certain-but-wrong” predictions in JABS.
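
A quick way to see this is to score predicted probabilities directly (a sketch using scikit-learn's metrics; truth and probs are toy arrays, not JABS data):

    import numpy as np
    from sklearn.calibration import calibration_curve
    from sklearn.metrics import brier_score_loss, log_loss

    truth = np.array([0, 0, 1, 1, 1, 0, 1, 0])
    probs = np.array([0.1, 0.4, 0.8, 0.9, 0.6, 0.2, 0.7, 0.3])

    print("Brier score:", brier_score_loss(truth, probs))  # lower is better
    print("log-loss:", log_loss(truth, probs))             # lower is better

    # fraction of positives per probability bin; a well-calibrated model
    # tracks the diagonal (frac_pos close to mean_pred in each bin)
    frac_pos, mean_pred = calibration_curve(truth, probs, n_bins=4)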

How we use it in JABS

  • Added optional settings, which are saved in the project.json file: calibrate_probabilities: bool, calibration_method: "isotonic"|"sigmoid", calibration_cv: int.
  • During training (including LOGO cross-validation), probability calibration is fit separately inside each fold. The calibrator is trained only on that fold’s training data and never sees the validation data, which prevents data leakage and keeps validation metrics honest (see the sketch after this list).
  • For feature importance, when calibration is enabled we aggregate importances across the calibrated folds’ base estimators.
  • UI: a JABS Settings dialog lets users toggle calibration and choose the method/CV. The settings are persisted in the JABS project.json file.
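
A hedged sketch of how per-fold calibration and importance aggregation fit together (variable names and the .estimator attribute access are illustrative, not the JABS implementation; the attribute is .base_estimator in scikit-learn releases before 1.2):

    import numpy as np
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import LeaveOneGroupOut

    X, y = make_classification(n_samples=300, random_state=0)
    groups = np.repeat(np.arange(6), 50)  # e.g., one group per identity/video

    importances = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        # the calibrator's internal CV runs entirely within the training split,
        # so the held-out group never influences the calibration mapping
        model = CalibratedClassifierCV(
            RandomForestClassifier(random_state=0), method="isotonic", cv=3
        )
        model.fit(X[train_idx], y[train_idx])
        probs = model.predict_proba(X[test_idx])[:, 1]

        # aggregate importances across the calibrated folds' base estimators
        for cal in model.calibrated_classifiers_:
            importances.append(cal.estimator.feature_importances_)

    mean_importance = np.mean(importances, axis=0)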

Practical guidance

  • Reasonable default: calibrate_probabilities=True, method="isotonic", calibration_cv=3. (This PR does not change current behavior, though, so calibrate_probabilities defaults to False; see the settings sketch after this list.)
  • Use "sigmoid" if folds are small; isotonic needs more data.
  • Avoid very high calibration_cv—typically 3–5 is enough.
  • Always calibrate during validation if you’ll deploy a calibrated final model (metrics should reflect deployed classifier).
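
For reference, the three settings above might be persisted like this (a sketch; only the key names and defaults come from this PR, and the surrounding project.json layout may differ):

    calibration_settings = {
        "calibrate_probabilities": True,   # PR default is False (no behavior change)
        "calibration_method": "isotonic",  # or "sigmoid" for small folds
        "calibration_cv": 3,
    }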

Trade-offs

  • Extra compute (fits base model multiple times).
  • Slight variance increase; mitigated by sensible CV (3–5).
  • For extremely imbalanced or tiny folds, prefer "sigmoid" or reduce CV.

Net impact for JABS

  • More honest probabilities -> cleaner threshold selection for behaviors / improved search for low-confidence predictions.
  • Better UX and trust in “confidence” displays.
  • Fewer brittle 0/1 outputs; improved stability across datasets and sessions.

See Also

User Guide

I'm going to hold off on updating the user guide until I'm sure we're going to merge these changes.

Settings Dialog Screenshots

[Two screenshots of the JABS settings dialog showing the calibration options]

@gbeane gbeane marked this pull request as draft October 17, 2025 21:42
@gbeane gbeane self-assigned this Nov 7, 2025
@gbeane gbeane requested review from bergsalex and ptuan5 November 7, 2025 14:58
@ptuan5 left a comment


Changes in the scripts overall look good to me.
Cannot comment on the changes to the UI part because it's beyond my comprehension.

    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=FutureWarning)
        self._classifier = self._fit_xgboost(features, labels, random_seed=random_seed)
    self._classifier.fit(self._clean_features_for_training(features), labels)


You use clean_features here for the calibrated branch, but not for the old, original branch?

    Args:
        truth: Binary ground truth labels (0 or 1).
        probabilities: Predicted probabilities (2D array where second column is positive class).


For brier_score you accept both 1D and 2D arrays, but here you only accept 2D arrays. Is there a reason for this? Should we enforce 2D altogether?
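
To illustrate the question, a hypothetical normalization that would let both call sites accept either shape (the helper name is mine, not from the PR):

    import numpy as np

    def _positive_class_probs(probabilities: np.ndarray) -> np.ndarray:
        """Accept a 1D positive-class vector or a 2D (n, 2) array."""
        probs = np.asarray(probabilities)
        return probs[:, 1] if probs.ndim == 2 else probs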

    if pos == 0 or neg == 0:
        warnings.warn(
            "plot_reliability: need both positive and negative labels.", stacklevel=2
        )


If we need both positive and negative labels for this, should we either raise an error or return early?
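
For example, the early-return option might look like this (a sketch, not the PR's code; the surrounding function body is assumed):

    import warnings
    import numpy as np

    def plot_reliability(truth: np.ndarray, probabilities: np.ndarray) -> None:
        pos = int(np.sum(truth == 1))
        neg = int(np.sum(truth == 0))
        if pos == 0 or neg == 0:
            warnings.warn(
                "plot_reliability: need both positive and negative labels.", stacklevel=2
            )
            return  # bail out early rather than plot a degenerate curve
        ...  # plotting proceeds here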
