Baysor: cell segmentation #14

Open · lpantano opened this issue Feb 20, 2025 · 2 comments
@lpantano (Contributor)

webpage: https://github.com/kharchenkolab/Baysor
10x instructions: https://www.10xgenomics.com/analysis-guides/using-baysor-to-perform-xenium-cell-segmentation

Risk: medium.
It ships Linux binaries and is a Julia package. A Docker image is also available; a possible problem is that Docker can only be used on FAS.
Usage: it seems easy from the documentation.

Notes: it requires pre-processing the data, and the examples I saw use Julia scripts, so that could be another blocker in the process.

Successfully installed in o2:

/n/app/bcbio/baysor/bin/baysor/bin/baysor

And the downstream scripts:

/n/app/bcbio/baysor/script/map_transcripts.py
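
For reference, a minimal sketch of calling the installed binary from R via system2(); the column names follow Xenium's transcripts.csv as in the 10x guide above, and the -m (min molecules per cell) value is an assumption to adapt:

system2(
  "/n/app/bcbio/baysor/bin/baysor/bin/baysor",
  args = c("run",
           "-x", "x_location", "-y", "y_location", "-z", "z_location",
           "-g", "feature_name", "-m", "30",
           "transcripts.csv")
)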
@lpantano lpantano added this to Tools Feb 3, 2025
@lpantano lpantano converted this from a draft issue Feb 20, 2025
@lpantano (Contributor, Author)

RunningBaysor.md

@lpantano (Contributor, Author)

# Billingsley 3_7_25
# Creating a Seurat Object with Spatial data taken from Baysor
library(Seurat)
library(tidyverse)
library(Matrix)
library(RCurl)
library(knitr)
library(SeuratDisk)
library(patchwork)
library(reshape2)
library(data.table) # Fast data loading
# Load the transcript-level data from Baysor output (segmentation.csv)
TrxDat <- fread("~/Desktop/Test5/segmentation.csv")
length(rownames(TrxDat)) # 3e6
# segmentation.csv has quality information for each transcript that we can use to filter out poor-quality transcripts: noise, or transcripts not assigned to a cell with high confidence
noiseF <- TrxDat[which(TrxDat$is_noise == FALSE), ]
noiseT <- TrxDat[which(TrxDat$is_noise == TRUE), ]
table(TrxDat$is_noise)
# FALSE    TRUE
# 2882348  250945
# We could filter using $is_noise, which flags each transcript as noise or not, but the is_noise == TRUE transcripts get filtered out by the other quality metrics below, so we won't need to filter on Baysor's noise calls per se. But check your own data.
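# If you did want to filter on Baysor's noise calls directly, it's one line
# (left commented out here, since the confidence filters below remove these anyway):
# TrxDat <- TrxDat[TrxDat$is_noise == FALSE, ]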
hist(noiseF$confidence) # $confidence assigns a probability that a transcript is real
hist(noiseT$confidence) # noiseT has most transcripts very close to confidence 0
hist(TrxDat$assignment_confidence) # $assignment_confidence assigns a probability that a transcript is associated with a particular called cell.
length(which(TrxDat$assignment_confidence >= 0.5))
length(which(TrxDat$assignment_confidence >= 0.9))
length(which(TrxDat$assignment_confidence >= 0.99)) # would remove half the transcripts
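# (Optional sketch) Survey several cutoffs at once before committing to one:
for (th in c(0.5, 0.9, 0.95, 0.99)) {
  cat(sprintf("assignment_confidence >= %.2f keeps %.1f%% of transcripts\n",
              th, 100 * mean(TrxDat$assignment_confidence >= th)))
}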
# TrxDat <- TrxDat %>% filter(assignment_confidence >= 0.9) # Adjust as needed. The default in the python script provided by 10x uses 0.9. I've found this is probably too low for my CosMx data; I'll use 0.99.
TrxDat <- TrxDat %>% filter(assignment_confidence >= 0.99)
length(rownames(TrxDat)) # 1.66e6
# $confidence is the likelihood that a transcript is "real"
length(which(TrxDat$confidence >= 0.97))
# The 10x Genomics Baysor-to-Seurat helper script did not use this "confidence" metric for filtering, but with my data I've found that a high threshold here helped clean up the final output. Try a few values and see how many transcripts remain, and think about whether you want sensitivity or specificity in your experiment's final output.
TrxDat <- TrxDat %>% filter(confidence >= 0.97)
length(rownames(TrxDat)) # 878171
table(TrxDat$is_noise) # still some True noise
# in the $cell column, blanks indicate that the transcript was not assigned to a cell, i.e. is noise. Here we remove transcripts with no cell call (blank values).
TrxDat$cell[TrxDat$cell == ""] <- NA
TrxDat <- TrxDat %>% filter(!is.na(cell))
length(rownames(TrxDat)) # 878025
table(TrxDat$is_noise) # this removed the is_noise == T transcripts
# examine some other transcript annotation
table(TrxDat$cluster) # the Baysor cluster the transcript is assigned to
tb <- table(TrxDat$ncv_color)
str(tb) # these are the NCV clusters, i.e. "pseudocells"
str(TrxDat$prior_segmentation) # identifies the prior used, if you used one
length(rownames(TrxDat)) # 878025
# create transcript count matrix
gene_counts <- TrxDat %>%
  count(gene, cell) %>%
  pivot_wider(names_from = cell, values_from = n, values_fill = list(n = 0))
gene_matrix <- as.matrix(gene_counts[, -1]) # Remove gene column for matrix conversion
rownames(gene_matrix) <- gene_counts$gene
str(gene_matrix) # 27617 cells
sum(gene_matrix) # 878025
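# (Optional sketch) For larger panels the dense pivot above can get heavy; the
# Matrix package (loaded above) builds the same counts sparsely. sparseMatrix()
# sums repeated (gene, cell) pairs, so x = 1 per transcript yields counts, and
# CreateSeuratObject() accepts the sparse form directly.
g <- factor(TrxDat$gene)
cl <- factor(TrxDat$cell)
sparse_counts <- sparseMatrix(
  i = as.integer(g), j = as.integer(cl), x = 1,
  dims = c(nlevels(g), nlevels(cl)),
  dimnames = list(levels(g), levels(cl))
)
sum(sparse_counts) # should match sum(gene_matrix) above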
# Remove control genes. (You can remove them at this step or later; also consider removing highly expressed "nuisance" genes like MALAT1, and maybe the RPLs etc. I'll leave these in for now.)
CtrlTranscripts <- grep("^Sys|^Neg", rownames(gene_matrix))
str(CtrlTranscripts)
gene_matrix <- gene_matrix[-CtrlTranscripts, ]
grep("^Sys|^Neg", rownames(gene_matrix))
# Create the Seurat object
SO_b5 <- CreateSeuratObject(counts = gene_matrix)
print(sum(GetAssayData(SO_b5, slot = "counts"))) # 861681
# Read in cell segmentation data , these will be used to create xy cell centroid coordinates later.
scs <- fread(file = "~/Desktop/Test5/segmentation_cell_stats.csv")
length(rownames(scs))
colnames(scs)
range(na.omit(scs$area)) # 30070 px^2 is not quite high enough for the cells I'm interested in; I'll probably go back and redo Baysor with a different scale parameter
length(which(scs$cell %in% colnames(SO_b5)))
scs <- scs %>%
  filter(cell %in% colnames(SO_b5)) # keep only cells called in the transcript annotation
scs <- as.data.frame(scs) # data.table doesn't carry rownames, so convert before setting them
rownames(scs) <- scs$cell
SO_b5 <- AddMetaData(SO_b5, metadata = scs)
# Filtering out poor quality cells
# Filter by *cell area*, this is a major filtering step
range(na.omit(SO_b5$area)) # 1 px^2 to 30070 px^2
hist(na.omit(SO_b5$area), breaks = 1000) # most "cells" are below 5000 px^2
# With CosMx we're working in pixels, so we need to convert to microns to think about cell diameters and areas
# there are 0.12028 µm per pixel for CosMx output, so ~0.0145 µm^2 per px^2
# so at the low end of the "cell" areas in my data...
1 * .12028^2 # ≈ 0.0145 µm^2; that's not a cell
# at the high end
30070 * .12028^2 # ≈ 435 µm^2; that's a pretty small cell for a keratinocyte, I probably need to increase the scale parameter in Baysor
# Some AtoMx output I have shows an area range of 457.0 to 75800.5 px^2
457.0 * .12028^2 # ≈ 6.6 µm^2; that seems a reasonable size for a low threshold
# the high end
75800.5 * .12028^2 # ≈ 1096.6 µm^2; this is probably better
# I want to make sure I find the pericytes, which have a diameter of 5-10 µm
# let's try a 3 µm^2 area filter threshold
3 / .12028^2 # ≈ 207 px^2
length(colnames(SO_b5)[which(SO_b5$area <= 207)]) # 1237
low_area_threshold_pixels <- 3 / .12028^2
low_area_threshold_pixels # 207
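# Small helper for the px^2 -> µm^2 conversions above (assumes the 0.12028
# µm/px CosMx scale; swap in your platform's value):
px2_to_um2 <- function(area_px2, um_per_px = 0.12028) area_px2 * um_per_px^2
px2_to_um2(c(1, 207, 30070)) # ≈ 0.0145, 3, 435 µm^2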
filtered_cells_lost <- SO_b5@meta.data %>%
  filter(area <= low_area_threshold_pixels)
length(rownames(filtered_cells_lost)) # 1238
filtered_cells_kept <- SO_b5@meta.data %>%
  filter(area >= low_area_threshold_pixels)
length(rownames(filtered_cells_kept)) # 26379. Adjust the transcript-filtering stringencies above, thinking about numbers of cells vs. noise
SO_b5 <- subset(SO_b5, cells = filtered_cells_kept$cell)
print(sum(GetAssayData(SO_b5, slot = "counts"))) # 857044 keeping an eye on number of transcripts in the data set
hist(SO_b5$area, breaks = 1000, xlim = c(0, 20000)) # we might want to pick an inflection point for filtering on the histogram instead?
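# Sketch of the inflection-point idea: a density curve smooths the histogram
# bins and can make a valley easier to spot by eye
d <- density(SO_b5$area, na.rm = TRUE)
plot(d, xlim = c(0, 20000), main = "cell area density (px^2)")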
range(SO_b5$area)
# high range is 30070 px^2
30070 * .12028^2 # ≈ 435 µm^2, which is fairly small for some keratinocytes, so I will leave these intact, but you could filter out abnormally high-area cells as well if needed
# What about $avg_confidence and $avg_assignment_confidence in the metadata? These averages were computed per called cell *before* I manually filtered transcripts above, so they are no longer accurate for the cells in the Seurat object
# some additional plots
countMat <- data.frame(GetAssayData(SO_b5))
MeanVec <- apply(countMat, 1, mean) # mean counts per feature
SrtMeanVec <- sort(MeanVec, decreasing = T)
head(SrtMeanVec) # "MALAT1", "MHC I", "B2M" are commonly seen as high expressors, also KRT1 and KRT10 in my data. Think about removing some of the "nuisance" genes before running FindAllMarkers, or before PCA?
cmc <- data.frame(GetAssayData(SO_b5, assay = "RNA", layer = "counts"))
# features are in rows; let's count the number of positives per row, i.e. positive cells per feature. Do we need to remove any poorly expressed features?
RS <- rowSums(cmc > 0)
range(RS) # "No feature has fewer than 72 positive cells in my data"
CS <- colSums(cmc > 0)
range(CS) # number of detected features per cell
length(which(CS < 4)) # we will remove these low-feature cells
hist(MeanVec, breaks = 1000, main = "mean counts per feature")
plot(sort(CS), log = "y", main = "detected features per cell") # I'm going to remove the cells with fewer than 4 detected features; some analysts remove cells with fewer than 10...
SO_b5$novelty <- log10(SO_b5@meta.data$nFeature_RNA) / log10(SO_b5@meta.data$nCount_RNA)
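# Quick look at the novelty distribution; 0.7 is used as a cutoff further below
hist(SO_b5$novelty, breaks = 100, main = "novelty score")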
VlnPlot(object = SO_b5, features = c("nCount_RNA"), ncol = 1, pt.size = 0.05) +
  xlab("") +
  ylab("Number of counts per cell") #
VlnPlot(object = SO_b5, features = c("nCount_RNA"), ncol = 1, pt.size = 0.05) +
  ylim(0, 10) +
  xlab("") +
  ylab("Number of counts per cell") # some zero-count cells
VlnPlot(object = SO_b5, features = c("nFeature_RNA"), ncol = 1, pt.size = 0.05) +
  xlab("") +
  ylab("Number of genes detected")
SO_b5@meta.data %>%
  ggplot(aes(x = nCount_RNA, y = nFeature_RNA, color = novelty)) +
  geom_point() +
  stat_smooth(method = lm) +
  scale_x_log10() +
  scale_y_log10() +
  xlab("nCount") +
  ylab("nFeature") +
  facet_wrap(~orig.ident)
length(which(CS < 4)) # cells to remove
BadCellIndex <- which(CS < 4)
SubS1 <- subset(SO_b5, cells = BadCellIndex, invert = TRUE)
fDatBay5 <- subset(SubS1, subset = nFeature_RNA > 3 & novelty > 0.7) #
print(sum(GetAssayData(fDatBay5, slot = "counts"))) # 829468
fDatBay5 <- SCTransform(fDatBay5)
fDatBay5 <- RunPCA(fDatBay5, npcs = 30, features = rownames(fDatBay5))
Idents(fDatBay5) <- "cluster"
DimPlot(fDatBay5, reduction = "pca")
fDatBay5 <- RunUMAP(fDatBay5, dims = 1:30, return.model = T)
DimPlot(fDatBay5, reduction = "umap", group.by = "cluster") +
  ggtitle("Baysor clusters") # pretty squirrely, try other umap dims
# Try a number of different umap dims, find one that shows good separation of clusters without being overly-fit and ragged
u10 <- RunUMAP(fDatBay5, dims = 1:10, return.model = T)
DimPlot(u10, reduction = "umap", group.by = "cluster") +
  ggtitle("Baysor clusters") # looks a lot better.
FeaturePlot(u10, reduction = "umap", features = "n_transcripts", cols = c("lightblue", "red"), max.cutoff = "q99")
FeaturePlot(u10, reduction = "umap", features = "area", cols = c("lightblue", "red"), max.cutoff = "q99")
FeaturePlot(u10, reduction = "umap", features = "avg_confidence")
u10 <- FindNeighbors(object = u10, reduction = "umap", dims = 1:2)
u10 <- FindClusters(object = u10, algorithm = 1, resolution = c(0.01, 0.025, 0.05, 0.075, 0.1), random.seed = 42)
for (res in c(0.01, 0.025, 0.05, 0.075, 0.1)) {
  cat("\n")
  cat("", "Cluster resolution: ", res, "\n")
  Idents(object = u10) <- paste0("SCT_snn_res.", res)
  p <- DimPlot(u10,
    reduction = "umap",
    label = TRUE
  ) +
    ggtitle(paste0("SCT_snn.", res))
  print(p)
  cat("\n")
}
# go with 0.1 for 12 clusters
DimPlot(u10, reduction = "umap", group.by = "SCT_snn_res.0.1", label = T)
Idents(u10) <- "SCT_snn_res.0.1"
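# Quick sanity check on cluster sizes at the chosen resolution
table(Idents(u10))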
FAMc12 <- FindAllMarkers(u10, only.pos = T)
write.table(FAMc12, file = "~/Desktop/FAMc12.txt", sep = "\t", col.names = NA)
FeaturePlot(u10, reduction = "umap", features = "ACTA2", order = T)
# not a high enough cluster resolution to resolve a good pericyte cluster, which would be a subcluster of c4 (see the subclustering sketch below)
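# Sketch (not run in the original analysis): Seurat's FindSubCluster() can
# split just c4 on the existing SNN graph instead of re-clustering everything.
# "SCT_snn" is the default graph name after SCTransform + FindNeighbors here;
# the 0.2 resolution is an assumption to tune.
u10_sub <- FindSubCluster(u10, cluster = "4", graph.name = "SCT_snn", resolution = 0.2)
DimPlot(u10_sub, reduction = "umap", group.by = "sub.cluster")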
# Loading spatial data
centroids <- data.frame(x = u10@meta.data[["x"]], y = u10@meta.data[["y"]], cell = colnames(u10))
cents <- CreateCentroids(centroids)
coords <- CreateFOV(coords = list("centroids" = cents), type = "centroids")
u10[["FOV"]] <- coords
ImageDimPlot(u10, fov = "FOV")
ImageDimPlot(u10, fov = "FOV", cells = colnames(u10)[which(u10$SCT_snn_res.0.1 == 2)], cols = "red") +
  ggtitle("Baysor u10 c4")
# Also look at original cluster calls from Baysor
Idents(u10) <- "cluster"
ImageDimPlot(u10, fov = "FOV")
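# (Optional) Crop() can zoom the FOV to a region of interest; the x/y ranges
# below are placeholders, not values from this dataset
# zoomed <- Crop(u10[["FOV"]], x = c(1000, 3000), y = c(1000, 3000))
# u10[["zoom"]] <- zoomed
# ImageDimPlot(u10, fov = "zoom")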

@lpantano lpantano self-assigned this Mar 10, 2025
@lpantano lpantano moved this from In review to Ready in Tools Mar 10, 2025