V1.2 #29

Merged
natalie-23-gill merged 30 commits into main from v1.2 on Mar 26, 2026

@natalie-23-gill
Collaborator

Closes #28 and #10.

New features

  • Added KL divergence (calculate_kl_divergence()) and Hellinger distance
    (calculate_hellinger_distance()) metrics for evaluating cluster distribution
    consistency between training and held-out subjects.
  • Added modularity score (calculate_modularity()) computed on the precomputed
    SNN graph.
  • Added MSE and MAD scores (calculate_mse_score()) for centroid-based cluster
    quality evaluation.
  • New suggest_resolution() function that ranks resolutions using two
    complementary methods: direct rank aggregation across four metrics, and
    curvature-based local optima detection via second-order finite differences.
  • New summarize_cv_metrics() for per-resolution metric summaries.
  • New visualization functions plot_rank_metrics() and plot_mean_rank().
  • Logging improvements to make clustOpt messages distinct from those of its dependencies.
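
As a rough illustration of the two distribution metrics listed above, the following sketch compares cluster proportions between training and held-out cells. The helper names (`cluster_proportions`, `hellinger_distance`, `kl_divergence`), the smoothing constant `eps`, and the direction of the KL divergence are assumptions made for this example; they are not the package's `calculate_kl_divergence()` or `calculate_hellinger_distance()` implementations.

```r
# Hypothetical helpers for illustration only; not the package's calculate_* functions.

# Proportion of cells in each cluster, with a fixed set of cluster levels so that
# both distributions have the same length and ordering.
cluster_proportions <- function(labels, levels) {
  counts <- table(factor(labels, levels = levels))
  as.numeric(counts) / sum(counts)
}

# Hellinger distance between two discrete distributions; bounded in [0, 1].
hellinger_distance <- function(p, q) {
  sqrt(sum((sqrt(p) - sqrt(q))^2) / 2)
}

# KL(P || Q) = sum(p * log(p / q)). The choice of P as the reference distribution
# is an assumption here. Smoothing avoids log(0) for empty clusters.
kl_divergence <- function(p, q, eps = 1e-10) {
  p <- (p + eps) / sum(p + eps)
  q <- (q + eps) / sum(q + eps)
  sum(p * log(p / q))
}

lv <- c("0", "1", "2")
train <- c(rep("0", 50), rep("1", 30), rep("2", 20))
held_out <- c(rep("0", 20), rep("1", 20), rep("2", 10))

p <- cluster_proportions(train, lv)
q <- cluster_proportions(held_out, lv)

h <- hellinger_distance(p, q)
kl <- kl_divergence(p, q)
```

Both quantities are zero when the two distributions are identical. Writing the KL term as `sum(p * log(q / p))` instead would flip its sign, which is the kind of error a sign fix addresses.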

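The curvature-based method mentioned in the feature list relies on second-order finite differences. A minimal sketch of that idea follows, assuming evenly spaced resolutions and a score where higher is better; the function name `second_difference`, the toy scores, and the peak rule are hypothetical and may differ from what `suggest_resolution()` does internally.

```r
# Hypothetical sketch; not the package's suggest_resolution() logic.

# Unscaled central second difference at each interior point:
# y[i + 1] - 2 * y[i] + y[i - 1]
second_difference <- function(y) {
  n <- length(y)
  stopifnot(n >= 3)
  y[3:n] - 2 * y[2:(n - 1)] + y[1:(n - 2)]
}

resolutions <- seq(0.2, 1.0, by = 0.2)
score <- c(0.30, 0.55, 0.70, 0.60, 0.35)  # toy metric values, one per resolution

curv <- second_difference(score)

# An interior point is treated as a local optimum when it exceeds both
# neighbours and the curvature there is negative (concave).
interior <- 2:(length(score) - 1)
is_peak <- score[interior] > score[interior - 1] &
  score[interior] > score[interior + 1] &
  curv < 0
peak_resolutions <- resolutions[interior][is_peak]
```

With these toy values only the middle resolution qualifies as a local optimum. Dividing `curv` by the squared grid spacing would turn it into an estimate of the second derivative, but the sign test is unaffected by that scaling.
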
Performance improvements

  • Pre-allocate the results list in the main cross-validation loop instead of
    growing it with c(), reducing memory allocation overhead for many subjects.
  • Use crossprod() instead of explicit t() %*% in PCA projections, avoiding
    materialization of transposed gene-by-cell matrices.
  • Use Matrix::crossprod() in modularity calculation to stay in sparse matrix
    space and avoid dense transposition of the cluster indicator matrix.
  • Replace do.call(rbind, lapply(...)) with t(vapply(...)) in MAD
    calculation to avoid intermediate list allocation.
  • Precompute SNN graph and distance matrix once per held-out subject (shared
    across resolutions) instead of recomputing per resolution.
  • Pre-extract cluster assignments from metadata by resolution to avoid repeated
    column lookups.
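
Two of the replacements above can be checked for equivalence on small inputs. The sketch below uses made-up matrices and a made-up per-row summary function, so it only demonstrates that the rewritten forms return the same values, not how the package applies them.

```r
# Illustrative data only; shapes and the summary function are assumptions.
set.seed(1)
x <- matrix(rnorm(20), nrow = 5)  # 5 x 4
y <- matrix(rnorm(15), nrow = 5)  # 5 x 3

# crossprod(x, y) computes t(x) %*% y without forming t(x) explicitly.
explicit <- t(x) %*% y
fused <- crossprod(x, y)

# Building a matrix row by row: list + rbind versus vapply + transpose.
summarise_row <- function(i) c(med = median(x[i, ]), mad = mad(x[i, ]))
via_rbind <- do.call(rbind, lapply(seq_len(nrow(x)), summarise_row))
via_vapply <- t(vapply(seq_len(nrow(x)), summarise_row, numeric(2)))
```

`vapply()` also checks that every call returns a numeric vector of length 2, so a malformed result fails immediately instead of producing a ragged matrix. The same `crossprod()` identity holds for `Matrix::crossprod()` on sparse inputs.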

Logging and verbosity

  • verbose parameter now accepts integer levels (0-3) for fine-grained control:
    0 = silent, 1 = key milestones, 2 = detailed progress, 3 = Seurat output.
  • Backward-compatible: verbose = TRUE maps to level 1, FALSE to 0.
  • Added per-step timing via [step_name] Xs log messages at verbose >= 1.
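
One way the backward-compatible mapping and the timing messages described above could be wired together is sketched here. `normalize_verbose()` and `log_step()` are hypothetical names introduced for this example; the package's internal helpers and exact message text may differ.

```r
# Hypothetical helpers; not the package's internal API.

# Map TRUE/FALSE or an integer 0-3 onto a single integer verbosity level.
normalize_verbose <- function(verbose) {
  if (is.logical(verbose) && length(verbose) == 1 && !is.na(verbose)) {
    return(if (verbose) 1L else 0L)
  }
  if (is.numeric(verbose) && length(verbose) == 1 && !is.na(verbose) &&
      verbose %in% 0:3) {
    return(as.integer(verbose))
  }
  stop("`verbose` must be TRUE, FALSE, or an integer from 0 to 3")
}

# Evaluate `expr`, then emit a "[step_name] Xs" message when verbose >= 1.
# R evaluates arguments lazily, so `expr` only runs at force() below,
# after the start time has been recorded.
log_step <- function(step_name, expr, verbose = 1L) {
  level <- normalize_verbose(verbose)
  start <- Sys.time()
  result <- force(expr)
  elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))
  if (level >= 1L) message(sprintf("[%s] %.1fs", step_name, elapsed))
  result
}

total <- log_step("sum", sum(1:10), verbose = TRUE)
```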

Bug fixes

  • Fixed flipped sign in KL divergence calculation.
  • Fixed ARI estimate for comparing singleton clusters.
  • Handled edge cases in sample validation and metric computation.

Package reorganization

  • Split monolithic clustOpt.R and utils.R into focused modules:
    clust_opt.R, data_preparation.R, metrics.R, sketching.R,
    validation.R, visualization.R.
  • Removed vignette build cache from version control.

@natalie-23-gill
Collaborator Author

@reubenthomas I added the changes to suggest_resolution() for the local optima filtering step. Let me know if it looks good and I'll merge this in.

clustOpt/R/visualization.R

Lines 223 to 229 in b40b66b

# Scale thresholds dynamically based on number of subjects
n_subjects <- max(table(cv_results$resolution))
scale_factor <- sqrt(11 / n_subjects)
if (is.null(upper_Hell_score_thresh))
  upper_Hell_score_thresh <- 0.1 * scale_factor
if (is.null(upper_Hell_score_thresh_relaxed))
  upper_Hell_score_thresh_relaxed <- 0.2 * scale_factor

@reubenthomas
Contributor

@natalie-23-gill, what is the reason for scaling the thresholds by the number of subjects? The Hellinger distance takes values between 0 and 1. The scaling you have used makes the reproducibility requirement more stringent as the number of samples grows, specifically in proportion to the square root of the reciprocal of the number of samples. The choice of that specific function (square root of the reciprocal) would probably need to be optimized further. For now, I think it may be easier to have fixed thresholds of 0.1 and 0.2, given that they have a reasonable interpretation in terms of what the Hellinger distance captures. Of course, I am open to changing this after hearing your thoughts.
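
To make the effect of the scaling concrete, the arithmetic from the excerpt above can be evaluated for a few subject counts. The counts 5, 11, and 44 are arbitrary values chosen for illustration; only the `sqrt(11 / n_subjects)` factor and the base thresholds 0.1 and 0.2 come from the quoted code.

```r
# Subject counts are illustrative; the formula is the one in the quoted excerpt.
n_subjects <- c(5, 11, 44)
scale_factor <- sqrt(11 / n_subjects)

thresholds <- data.frame(
  n_subjects = n_subjects,
  strict = 0.1 * scale_factor,
  relaxed = 0.2 * scale_factor
)
print(thresholds)
```

With 11 subjects the thresholds equal the fixed values 0.1 and 0.2. With 44 subjects they are halved, and with 5 subjects they grow by roughly 48 percent, so the cutoffs move with the study size even though the Hellinger distance itself is always bounded between 0 and 1.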

@natalie-23-gill
Collaborator Author

@reubenthomas That makes sense. I misunderstood how the thresholds were derived for the simulations, so I reverted to the simple fixed thresholds:

clustOpt/R/visualization.R

Lines 230 to 242 in 51a5456

# --- Filter for reproducible resolutions ---
min_res <- min(min_resolutions, nrow(summ))
keep <- summ$upper_Hell_95ci < upper_Hell_score_thresh
if (sum(keep) < min_res)
  keep <- summ$upper_Hell_95ci < upper_Hell_score_thresh_relaxed
if (sum(keep) == 0) {
  stop("No resolutions passed the Hellinger reproducibility filter ",
       "(upper_Hell_95ci < ", round(upper_Hell_score_thresh_relaxed, 3),
       "). Consider increasing upper_Hell_score_thresh_relaxed or adding ",
       "more subjects to reduce the confidence interval width.")
}
summ <- summ[keep, ]
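
The fallback behaviour of this filter can be tried on a toy summary table. The wrapper `filter_reproducible()`, the default `min_resolutions = 3`, and the data values are assumptions for this sketch; the column name `upper_Hell_95ci` and the 0.1 and 0.2 thresholds follow the excerpt.

```r
# Hypothetical wrapper around the filtering logic quoted above.
filter_reproducible <- function(summ, strict = 0.1, relaxed = 0.2,
                                min_resolutions = 3) {
  min_res <- min(min_resolutions, nrow(summ))
  keep <- summ$upper_Hell_95ci < strict
  # Too few resolutions pass the strict cutoff: fall back to the relaxed one.
  if (sum(keep) < min_res) {
    keep <- summ$upper_Hell_95ci < relaxed
  }
  if (sum(keep) == 0) {
    stop("No resolutions passed the Hellinger reproducibility filter ",
         "(upper_Hell_95ci < ", round(relaxed, 3), ").")
  }
  summ[keep, , drop = FALSE]
}

# Toy summary: two resolutions pass the strict cutoff, three pass the relaxed one.
summ <- data.frame(
  resolution = c(0.2, 0.4, 0.6, 0.8),
  upper_Hell_95ci = c(0.05, 0.08, 0.15, 0.30)
)

kept_default <- filter_reproducible(summ)                    # falls back to relaxed
kept_two <- filter_reproducible(summ, min_resolutions = 2)   # strict is sufficient
```

Here the strict threshold keeps only two rows, which is below the assumed minimum of three, so the relaxed threshold is applied and three rows remain. Lowering the minimum to two keeps the strict result.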

natalie-23-gill merged commit fbb1306 into main on Mar 26, 2026 (4 checks passed)
natalie-23-gill deleted the v1.2 branch on March 26, 2026 22:07
Development

Successfully merging this pull request may close these issues.

Support for Layers in Seurat objects