V1.2 by natalie-23-gill · Pull Request #29 · gladstone-institutes/clustOpt

natalie-23-gill · 2026-03-13T02:47:37Z

closes #28 and #10

New features

Added KL divergence (calculate_kl_divergence()) and Hellinger distance
(calculate_hellinger_distance()) metrics for evaluating cluster distribution
consistency between training and held-out subjects.
Added modularity score (calculate_modularity()) computed on the precomputed
SNN graph.
Added MSE and MAD scores (calculate_mse_score()) for centroid-based cluster
quality evaluation.
New suggest_resolution() function that ranks resolutions using two
complementary methods: direct rank aggregation across four metrics, and
curvature-based local optima detection via second-order finite differences.
New summarize_cv_metrics() for per-resolution metric summaries.
New visualization functions plot_rank_metrics() and plot_mean_rank().
Logging improvements to make clustOpt messages distinct from its dependencies.
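To make the two distribution metrics above concrete, here is a minimal sketch of Hellinger distance and KL divergence between cluster-proportion vectors for training and held-out cells. The helper names, the pseudocount, and the direction of the KL term are assumptions for illustration; they are not the package's `calculate_hellinger_distance()` or `calculate_kl_divergence()` implementations.

```r
# Hypothetical helpers for illustration only; not clustOpt's own functions.

# Turn per-cluster cell counts into a probability vector.
# The pseudocount (an assumption) avoids log(0) for empty clusters.
normalize_counts <- function(counts, pseudocount = 1e-10) {
  p <- counts + pseudocount
  p / sum(p)
}

# Hellinger distance: bounded in [0, 1], symmetric in p and q.
hellinger_distance <- function(p, q) {
  sqrt(sum((sqrt(p) - sqrt(q))^2) / 2)
}

# KL divergence D(p || q): non-negative, not symmetric.
kl_divergence <- function(p, q) {
  sum(p * log(p / q))
}

# Toy cluster sizes for training subjects and one held-out subject.
train_counts <- c(c1 = 500, c2 = 300, c3 = 200)
test_counts  <- c(c1 = 40,  c2 = 35,  c3 = 25)

p_train <- normalize_counts(train_counts)
p_test  <- normalize_counts(test_counts)

h  <- hellinger_distance(p_train, p_test)
kl <- kl_divergence(p_test, p_train)  # direction chosen arbitrarily here
```

Because KL divergence is never negative for valid probability vectors, a negative value from code like this is a quick signal that the sign or the ratio inside the logarithm has been flipped.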

Performance improvements

Pre-allocate the results list in the main cross-validation loop instead of
growing it with c(), reducing memory allocation overhead for many subjects.
Use crossprod() instead of explicit t() %*% in PCA projections, avoiding
materialization of transposed gene-by-cell matrices.
Use Matrix::crossprod() in modularity calculation to stay in sparse matrix
space and avoid dense transposition of the cluster indicator matrix.
Replace do.call(rbind, lapply(...)) with t(vapply(...)) in MAD
calculation to avoid intermediate list allocation.
Precompute SNN graph and distance matrix once per held-out subject (shared
across resolutions) instead of recomputing per resolution.
Pre-extract cluster assignments from metadata by resolution to avoid repeated
column lookups.
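The base-R idioms named in this list can be sketched on toy data as follows. The matrices, subject names, and cluster split are invented for illustration, and the sketch uses base `crossprod()` on dense matrices rather than the sparse `Matrix::crossprod()` path.

```r
set.seed(1)
expr     <- matrix(rnorm(200 * 30), nrow = 200)  # genes x cells (toy data)
loadings <- matrix(rnorm(200 * 5),  nrow = 200)  # genes x PCs (toy data)

# crossprod(x, y) computes t(x) %*% y without forming t(x) explicitly.
proj_explicit  <- t(expr) %*% loadings
proj_crossprod <- crossprod(expr, loadings)

# Pre-allocate the results list instead of growing it with c().
subjects <- paste0("subject_", 1:4)
results  <- vector("list", length(subjects))
for (i in seq_along(subjects)) {
  results[[i]] <- data.frame(subject = subjects[i], n_cells = i * 10)
}
cv_results <- do.call(rbind, results)

# Per-cluster MAD of each PC: list-based version vs. vapply version.
n_cells  <- nrow(proj_crossprod)
clusters <- split(seq_len(n_cells), rep(1:3, length.out = n_cells))
cluster_mad <- function(idx) {
  apply(proj_crossprod[idx, , drop = FALSE], 2, mad)
}
mad_rbind  <- do.call(rbind, lapply(clusters, cluster_mad))
mad_vapply <- t(vapply(clusters, cluster_mad, numeric(ncol(proj_crossprod))))
```

Both projection forms give the same cells-by-PCs matrix, and both MAD forms give the same clusters-by-PCs matrix; `vapply()` additionally checks that every cluster returns a numeric vector of the expected length.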

Logging and verbosity

verbose parameter now accepts integer levels (0-3) for fine-grained control:
0 = silent, 1 = key milestones, 2 = detailed progress, 3 = Seurat output.
Backward-compatible: verbose = TRUE maps to level 1, FALSE to 0.
Added per-step timing via [step_name] Xs log messages at verbose >= 1.
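One way the level mapping and per-step timing described above could look is sketched below. `normalize_verbose()` and `log_step()` are hypothetical names, not functions exported by clustOpt; only the level meanings and the `[step_name] Xs` message format come from the description.

```r
# Hypothetical helper: map TRUE/FALSE or 0-3 onto an integer level.
normalize_verbose <- function(verbose) {
  if (is.logical(verbose) && length(verbose) == 1 && !is.na(verbose)) {
    return(if (verbose) 1L else 0L)
  }
  if (is.numeric(verbose) && length(verbose) == 1 && !is.na(verbose) &&
      verbose %in% 0:3) {
    return(as.integer(verbose))
  }
  stop("`verbose` must be TRUE, FALSE, or an integer from 0 to 3.")
}

# Hypothetical helper: evaluate `expr`, then report elapsed time as
# "[step_name] Xs" when the verbosity level is at least 1.
log_step <- function(step_name, expr, verbose = 1L) {
  level  <- normalize_verbose(verbose)
  start  <- proc.time()[["elapsed"]]
  result <- force(expr)  # lazy evaluation: the work happens here
  elapsed <- proc.time()[["elapsed"]] - start
  if (level >= 1L) {
    message(sprintf("[%s] %.1fs", step_name, elapsed))
  }
  result
}

total <- log_step("toy_sum", sum(1:10), verbose = TRUE)
```

Relying on R's lazy argument evaluation keeps the call site short, at the cost of the timer also covering any promise-forcing overhead.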

Bug fixes

Fixed flipped sign in KL divergence calculation.
Fixed ARI estimate for comparing singleton clusters.
Handled edge cases in sample validation and metric computation.

Package reorganization

Split monolithic clustOpt.R and utils.R into focused modules:
clust_opt.R, data_preparation.R, metrics.R, sketching.R,
validation.R, visualization.R.
Removed vignette build cache from version control.

natalie-23-gill · 2026-03-26T19:34:07Z

@reubenthomas I added the changes to suggest_resolution for the local optima filtering step. Let me know if it looks good and I'll merge this in.

clustOpt/R/visualization.R

Lines 223 to 229 in b40b66b

    # Scale thresholds dynamically based on number of subjects
    n_subjects <- max(table(cv_results$resolution))
    scale_factor <- sqrt(11 / n_subjects)
    if (is.null(upper_Hell_score_thresh))
      upper_Hell_score_thresh <- 0.1 * scale_factor
    if (is.null(upper_Hell_score_thresh_relaxed))
      upper_Hell_score_thresh_relaxed <- 0.2 * scale_factor

reubenthomas · 2026-03-26T19:52:29Z

@natalie-23-gill, what is the reason for scaling the threshold by the number of subjects? The Hellinger distance takes values between 0 and 1. The threshold you have employed makes the reproducibility requirement more stringent as the number of samples grows, specifically in proportion to the square root of the reciprocal of the number of samples. The choice of that specific function (square root of the reciprocal) would probably need to be optimized further. For now, I think it may be easier to have fixed thresholds of 0.1 and 0.2, given that they have a reasonable interpretation in terms of what the Hellinger distance is capturing. Of course, I'm open to changing this after hearing your thoughts.
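For reference, the effect of the `sqrt(11 / n_subjects)` scaling from the snippet in b40b66b can be tabulated directly. The subject counts below are arbitrary illustration values, not numbers from the package's simulations.

```r
# Arbitrary subject counts, chosen only to show the trend.
n_subjects   <- c(5, 11, 20, 44)
scale_factor <- sqrt(11 / n_subjects)

thresholds <- data.frame(
  n_subjects   = n_subjects,
  scale_factor = round(scale_factor, 3),
  strict       = round(0.1 * scale_factor, 3),  # scaled 0.1 threshold
  relaxed      = round(0.2 * scale_factor, 3)   # scaled 0.2 threshold
)
print(thresholds)
```

At 11 subjects the scaled thresholds equal the fixed values 0.1 and 0.2; at 44 subjects they are halved to 0.05 and 0.1, and below 11 subjects they become looser than the fixed values.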

natalie-23-gill · 2026-03-26T20:22:06Z

@reubenthomas That makes sense, I misunderstood how the thresholds were derived for the simulations. I reverted to just the simple thresholds:

clustOpt/R/visualization.R

Lines 230 to 242 in 51a5456

    # --- Filter for reproducible resolutions ---
    min_res <- min(min_resolutions, nrow(summ))
    keep <- summ$upper_Hell_95ci < upper_Hell_score_thresh
    if (sum(keep) < min_res)
      keep <- summ$upper_Hell_95ci < upper_Hell_score_thresh_relaxed
    if (sum(keep) == 0) {
      stop("No resolutions passed the Hellinger reproducibility filter ",
           "(upper_Hell_95ci < ", round(upper_Hell_score_thresh_relaxed, 3),
           "). Consider increasing upper_Hell_score_thresh_relaxed or adding ",
           "more subjects to reduce the confidence interval width.")
    }
    summ <- summ[keep, ]
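A self-contained way to try the filter above on toy data is sketched below. The wrapper name `filter_reproducible()`, the default `min_resolutions = 3`, and the example `summ` values are assumptions for illustration; only the column name `upper_Hell_95ci` and the 0.1 / 0.2 thresholds come from the thread.

```r
# Hypothetical wrapper around the filtering logic shown above.
filter_reproducible <- function(summ,
                                min_resolutions = 3,  # assumed default
                                strict = 0.1,
                                relaxed = 0.2) {
  min_res <- min(min_resolutions, nrow(summ))
  keep <- summ$upper_Hell_95ci < strict
  # Fall back to the relaxed threshold if too few resolutions pass.
  if (sum(keep) < min_res) {
    keep <- summ$upper_Hell_95ci < relaxed
  }
  if (sum(keep) == 0) {
    stop("No resolutions passed the Hellinger reproducibility filter ",
         "(upper_Hell_95ci < ", round(relaxed, 3), ").")
  }
  summ[keep, , drop = FALSE]
}

# Toy per-resolution summary (values invented for illustration).
summ <- data.frame(
  resolution      = c(0.2, 0.4, 0.6, 0.8, 1.0),
  upper_Hell_95ci = c(0.05, 0.08, 0.15, 0.18, 0.30)
)

kept_default <- filter_reproducible(summ)                       # relaxed fallback
kept_strict  <- filter_reproducible(summ, min_resolutions = 2)  # strict suffices
```

With these toy values only two resolutions pass the strict threshold, so the default call falls back to the relaxed threshold and keeps four rows, while requiring just two resolutions keeps the strict result.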

Reuben Thomas and others added 29 commits August 30, 2025 09:09

add MSE and MSD evaluation

7eb09f9

updated NAMESPACE

82302e2

add Hellinger and KL divergence metrics

a17e0a9

add functions to compute modularity score

e525ee4

correct ARI estimate for comparing singleton clusters

eab57c1

v1.1 updates, fix CyTOF

814320e

try fix for large package files due to vignette caching

b3929ea

Merge branch 'add_mse_assessment' into v1.1

dc37d05

fix docs

6aea877

Reorganize package, fix the flipped sign for KL divergence

0895652

Add KL/Hellinger metric functions and rank-based resolution suggestion with visualizations

fc4c02b

fixes

58ed899

account for edge case

55d126f

another edge case

d23470f

improve verbosity

8959093

add another method for suggest resolution

ac9b94a

logging and performance improvements

8d99744

performance improvements

9434aa8

bump version to 1.2, add changelog

15408de

logging improvements

7275e3a

bump R version

9e29d63

Add vignette-check CI workflow and contribution guide

e7a11e9

Update visualization, suggest_resolution docs, and rebuild site

c5b4851

Add code of conduct, vignette cache, and update gitignore

23ae916

Add vignette cache to gitignore

867cc10

remove vignette cache

e69a818

fix tests

4cfe3d4

fix CI

e18b030

fix tests

b40b66b

natalie-23-gill requested a review from reubenthomas March 26, 2026 19:29

use simpler thresholds

51a5456

natalie-23-gill merged commit fbb1306 into main Mar 26, 2026
4 checks passed

natalie-23-gill deleted the v1.2 branch March 26, 2026 22:07

natalie-23-gill merged 30 commits into main from v1.2
