Hi,
I am running STACAS on a large Seurat object (≈281k cells) with many batches of very unbalanced size, and I would like to clarify whether the behaviour I observe is expected or indicates a problem with the integration.
Below I summarize three different approaches I tried. In all cases, the pipeline runs to completion, but in some cases STACAS prints an error during anchor finding, even though the integration and downstream UMAP are produced.
Dataset / setup
Seurat / SeuratObject: recently updated to Seurat v5; RNA assay, log-normalized
Total cells of the object: ~281,600
Number of samples (orig.ident, used as batch): 57
Highly unbalanced batches
11/57 samples have < 1,000 cells
Smallest sample: 265 cells
Largest sample: 13,669 cells
anchor.features = 1000
Cell labels used for semi-supervised mode
"clusters" metadata contains ~38 annotated clusters
Annotated cells: ~38,800
Unannotated cells (NA): 242,828 (majority of the dataset)
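For reference, the batch sizes and label coverage listed above can be tabulated per sample. The sketch below is only illustrative: the toy `meta` data frame stands in for `All1@meta.data`, and the `min_cells` cutoff is a hypothetical value, not a STACAS parameter.

```r
# Toy stand-in for All1@meta.data (assumed columns: orig.ident, clusters)
meta <- data.frame(
  orig.ident = c("S1", "S1", "S1", "S2", "S2", "S3"),
  clusters   = c("T cell", NA, NA, "B cell", "T cell", NA)
)

# Cells per batch and annotated (non-NA) cells per batch
cells_per_batch <- table(meta$orig.ident)
batch_summary <- data.frame(
  n_cells     = as.vector(cells_per_batch),
  n_annotated = as.vector(tapply(!is.na(meta$clusters), meta$orig.ident, sum)),
  row.names   = names(cells_per_batch)
)
batch_summary$frac_annotated <- batch_summary$n_annotated / batch_summary$n_cells

# Hypothetical cutoff for "small" batches (e.g. 1000 on the real data)
min_cells <- 3
small_batches <- rownames(batch_summary)[batch_summary$n_cells < min_cells]

# Batches that carry no labels at all
no_labels <- rownames(batch_summary)[batch_summary$n_annotated == 0]

print(batch_summary)
```

On the real object, `small_batches` and `no_labels` would show which samples are both small and largely unannotated.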
Method 1: Stepwise STACAS, semi-supervised, ndim = 28
ndim was chosen based on PCA variance (95% cumulative variance).
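A minimal sketch of how such a cutoff can be computed from per-PC standard deviations is shown below. The helper `n_pcs_for_variance` is hypothetical and not part of Seurat or STACAS. On a Seurat object the standard deviations could be taken from `Stdev(All1, reduction = "pca")`, in which case the share is relative to the computed PCs only, not to the total variance of the data.

```r
# Hypothetical helper: smallest number of PCs whose cumulative share of
# variance reaches `threshold`, given the per-PC standard deviations.
n_pcs_for_variance <- function(sdev, threshold = 0.95) {
  var_explained <- sdev^2 / sum(sdev^2)
  which(cumsum(var_explained) >= threshold)[1]
}

# Toy standard deviations; variances are 9, 4 and 1
sdev_toy <- c(3, 2, 1)
n95 <- n_pcs_for_variance(sdev_toy, threshold = 0.95)
n90 <- n_pcs_for_variance(sdev_toy, threshold = 0.90)
```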
library(Seurat)    # SplitObject, Idents, ScaleData, RunPCA, RunUMAP
library(STACAS)
library(magrittr)  # provides %>%

nfeatures <- 1000
ndim <- 28

# One Seurat object per sample (batch)
obj.list <- SplitObject(All1, split.by = "orig.ident")
for (n in seq_along(obj.list)) {
  print(n)
  print(obj.list[[n]])
  Idents(obj.list[[n]]) <- "clusters"
}

# Semi-supervised anchor finding, using the "clusters" metadata column
stacas_anchors <- FindAnchors.STACAS(obj.list,
                                     anchor.features = nfeatures,
                                     dims = 1:ndim,
                                     cell.labels = "clusters")

# Integration order derived from the anchor set
st1 <- SampleTree.STACAS(anchorset = stacas_anchors,
                         obj.names = names(obj.list))

object_integrated <- IntegrateData.STACAS(stacas_anchors,
                                          sample.tree = st1,
                                          dims = 1:ndim)

object_integrated <- object_integrated %>%
  ScaleData() %>%
  RunPCA(npcs = ndim) %>%
  RunUMAP(dims = 1:ndim)
This finishes successfully, but during FindAnchors.STACAS I observe errors like:
Error in if (totalCols == 0) return(NULL) : argument is of length zero
The pipeline does not stop and produces an integrated object.
Method 2: One-liner Run.STACAS, semi-supervised, ndim = 20
library(STACAS)
nfeatures <- 1000
ndim <- 20
Idents(All1) <- "clusters"
object_integrated1 <- All1 %>%
  SplitObject(split.by = "orig.ident") %>%
  Run.STACAS(dims = 1:ndim, anchor.features = nfeatures, cell.labels = "clusters") %>%
  RunUMAP(dims = 1:ndim)
This also runs to completion, but I again see the same error message during the run:
Warning: sparse->dense coercion: allocating vector of size 1.0 GiB
Warning: pseudoinverse used at -2.2162
Warning: neighborhood radius 0.30103
Warning: reciprocal condition number 1.4523e-14
Preparing PCA embeddings for objects...
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=44s
|+++++++ | 13% ~07h 26m 18s Error in if (totalCols == 0) return(NULL) : argument is of length zero
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=06h 03m 47s
Method 3: One-liner Run.STACAS, unsupervised, ndim = 20
object_integrated2 <- All1 %>%
  SplitObject(split.by = "orig.ident") %>%
  Run.STACAS(dims = 1:ndim, anchor.features = nfeatures) %>%
  RunUMAP(dims = 1:ndim)
This version:
finishes without explicit errors
produces an integrated object and UMAP
Questions
Is it expected that STACAS completes even when FindAnchors.STACAS encounters cases where totalCols == 0?
When using cell.labels, does this error indicate that some batch pairs have no compatible anchors and are effectively skipped?
Is there a recommended way to diagnose which datasets or batch pairs fail to form anchors?
For large, heterogeneous datasets, is semi-supervised STACAS still recommended, or should the unsupervised mode be preferred?
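To make the diagnostic question concrete, one possible check is to count anchors per dataset pair. The sketch below assumes that the object returned by FindAnchors.STACAS is a Seurat AnchorSet whose `@anchors` slot is a data frame with integer `dataset1` and `dataset2` columns; the toy `anchors` data frame stands in for `stacas_anchors@anchors`, and `n_datasets` would be `length(obj.list)` on the real data. Whether a pair with zero anchors corresponds to the error above is exactly what I am unsure about.

```r
# Toy stand-in for stacas_anchors@anchors (assumed columns: dataset1, dataset2)
anchors <- data.frame(
  dataset1 = c(1, 1, 2, 2, 3, 1),
  dataset2 = c(2, 2, 1, 1, 1, 3)
)
n_datasets <- 4  # would be length(obj.list) on the real data

# Count anchors for every ordered pair, keeping pairs that have none
pair_counts <- unclass(table(
  factor(anchors$dataset1, levels = seq_len(n_datasets)),
  factor(anchors$dataset2, levels = seq_len(n_datasets))
))

# Pool both directions (i -> j and j -> i) of each pair
pair_total <- pair_counts + t(pair_counts)

# Unordered pairs (i < j) without any anchor
zero_idx <- which(pair_total == 0 & upper.tri(pair_total), arr.ind = TRUE)
zero_pairs <- data.frame(
  dataset1 = unname(zero_idx[, 1]),
  dataset2 = unname(zero_idx[, 2])
)

print(zero_pairs)
```

In this toy example dataset 4 has no anchors with any other dataset, so every pair involving it is listed in `zero_pairs`.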
Thank you very much for your help, and for developing STACAS!