Celltypist symbol col assignmnet fix #222

ECM893 · 2025-08-27T16:19:15Z

PR Notes

I fully do NOT intend for this PR to be merged as is.

Problem 1: When using the .h5ed from scrnaseq pipeline, you must set symbol_col to a column that is categorical, but this is incompatible with anndata objects's var_names (when calling the function to make unique indexes). The first fix is to simply cast the pd.Series to a list when assigning it to adata.var_names.

Problem 2: This gets a little more dicey, When trying to boot strap scanvi from celltypist, it's hard to know/non-obvious the name of the celltypist column before hand that's needed for Celltypist labels, as it expect them to be in the 'label_col' column, which is empty. I still believe this can be changed elsewhere, or maybe I'm just wrong in my approach, Please correct me, im not an expert with scrnaseq.

Problem 3: Similar to 1, adata.var_names becomes categorical again, There should be a nicer place to change this upstream, but I cant find it right now.

Additional: I couldn't the debug profile to behave nicely, maybe something is wrong?

PR checklist

nf-core/scrnaseq saves the scanvi compatable symbols in the column "gene_symbols" (index is the ensbl IDs). So to be compatible with nf-core/scdownstream You must define "symbol_col" in the sample sheet and "gene_symbol" as the value for each sample in the column. However, the .h5ad from nf-core/scrnaseq this column as categorical. in line 50 of celltypist.py is simple replaces the ver_names (index, type = Object) with the categorical column. However, in line 60 celltypist.annotate will attempt to make the var_names unique. This will error as it cannot handle categorical restructuring. It must be converted to a different column type, To fix this simple, call .to_list() for the reassignment.

…. It's possible this comes from an error somewhere else and should be fixed there..

nictru

So what you want to do is use the celltypist annotation for scANVI integration. More generally, you want to use an anndata column other than label_col for scANVI integration.

You probably want this because you do not have any proper annotation prior to running the pipeline - otherwise you would use this. This means, the label column contains only unknown during pipeline execution, making scANVI unusable.

To make scANVI usable in this situation, you want to utilize one of the available automated annotation methods (in your case celltypist, could also be singleR) and use it as annotation for scANVI.

The label_col is currently only used in two places in the pipeline:

UNIFY: To rename all the label_cols to a common name, so that they match during concatenation
SCANVI then utilizes the result of the concatenation

An important fact is that the CELLTYPE_ASSIGNMENT subworkflow (which contains celltypist and singleR) happens before the UNIFY takes place - so before the first access to the label_col. This makes it possible to set any column that will be created during CELLTYPE_ASSIGNMENT as a label_col without the pipeline noticing that it wasn't there from the beginning. You just have to know the name of the created column before running the pipeline, but with a bit of documentation this should be possible.

I think this replaces your complex logic in the scanvi.py file pretty much, except for the confidence thresholding. But this should go into the CELLTYPIST module anyway.

What do you think @ECM893 ?

nictru · 2025-09-10T18:18:48Z

modules/local/celltypes/celltypist/templates/celltypist.py

    if symbol_col not in adata_celltypist.var.columns:
        raise ValueError(f"Symbol column {symbol_col} not found in adata.var.columns")
-    adata_celltypist.var_names = adata_celltypist.var[symbol_col]
+    adata_celltypist.var_names = adata_celltypist.var[symbol_col].to_list()


This looks fine

nictru · 2025-09-10T18:20:50Z

modules/local/liana/rankaggregate/templates/rank_aggregate.py

+# Ensure that var_names are strings (not categorical) and unique
+# For some reason this happens when coming from scanvi, maybe also given the other changes
+adata.var_names = adata.var_names.astype(str)
+adata.var_names_make_unique()
+


Making the var names unique in this location should not be necessary, after unification anndata objects can not contain duplicate var names. The casting to string also looks odd to me, but I can do some testing in that matter. If it is necessary, this would still not be the optimal location

I think this is fixed with your suggestion over all. It was caused as a downstream byproduct of my hack, I believe.

ECM893 · 2025-09-10T20:24:56Z

You are correct in the assumption that I'm trying to do celltyping WITHOUT my own reference dataset. From Literature, it seems like well defined cell references that celltypist is trained on perform very well. I'm trying to see if a thresholding on celltypists confidence to seed into scANVI preforms well, and I assumed from your pipeline that this might have been part of the design due to the stacking nature. Was this an intended design or have I bastardized this workflow? If not, what is the intention of having celltypist separate from scANVI etc.? I'm somewhat new to scRNA-seq, so I apologize for my ignorance.

Perhaps the documentation and fore-knowledge of the celltypist column should be added like you say, or maybe a simpler flag like "--use-celltypist-labels" and have the logic about the column name handled in the backend?

Ill soon remove this draft or make additional edits when I have time for this project later next week (probably)

nictru · 2025-09-14T09:38:59Z

scANVI is a semi-supervised method, that considers annotations for its integration, and aims on bringing the cells with the same annotation together. If you perform the annotation with celltypist, you will end up with an scANVI embedding that reflects the celltypist annotation.

Now imagine celltypist does not work well in a particular setting, then you will maybe end up with an scANVI embedding that looks good, but does not really represent the underlying biology, but rather the problematic annotations.

So I don't say you can't do it this way, but you should pay close attention to what is going on, also maybe investigate the differences between multiple integration methods etc.

Originally celltypist was placed after the integration in the pipeline workflow, but at this point all the datasets are already merged into a single object, which comes with two problems:

Object is large, takes a lot of memory (single datasets can be processed in parallel or serial)
We can either use the union or the intersection of genes, both are not optimal

These problems were overcome by putting the cell type annotation before the merging.

or maybe a simpler flag like "--use-celltypist-labels" and have the logic about the column name handled in the backend

As you already noticed, users can potentially have multiple annotation columns, not only celltypist but also singleR. So in some way, the user needs to provide the name of the actual annotation model that should be used as label_col. Now if the user needs to provide the full name anyway, we can also use the label_col field for this purpose

What could be annoying is that it needs to be set for all input samples individually, but I still think that's preferable over having both the samplesheet column and a pipeline parameter

ECM893 added 3 commits August 24, 2025 14:55

Hack to fix celltypist to scanvi cell label column issue.

678438d

change to liana to handle the case where var_names become categorical…

2411346

…. It's possible this comes from an error somewhere else and should be fixed there..

nictru reviewed Sep 10, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Celltypist symbol col assignmnet fix #222

Celltypist symbol col assignmnet fix #222

Uh oh!

ECM893 commented Aug 27, 2025

Uh oh!

nictru left a comment

Uh oh!

nictru Sep 10, 2025

Uh oh!

nictru Sep 10, 2025

Uh oh!

ECM893 Sep 10, 2025

Uh oh!

ECM893 commented Sep 10, 2025 •

edited

Loading

Uh oh!

nictru commented Sep 14, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Celltypist symbol col assignmnet fix #222

Are you sure you want to change the base?

Celltypist symbol col assignmnet fix #222

Uh oh!

Conversation

ECM893 commented Aug 27, 2025

PR Notes

PR checklist

Uh oh!

nictru left a comment

Choose a reason for hiding this comment

Uh oh!

nictru Sep 10, 2025

Choose a reason for hiding this comment

Uh oh!

nictru Sep 10, 2025

Choose a reason for hiding this comment

Uh oh!

ECM893 Sep 10, 2025

Choose a reason for hiding this comment

Uh oh!

ECM893 commented Sep 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nictru commented Sep 14, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ECM893 commented Sep 10, 2025 •

edited

Loading