Data-centric feature distillation for robust zero-shot cell-type annotation.
- Docs: https://llm-sccurator.readthedocs.io/
- Paper reproducibility: Zenodo v0.1.0 (DOI: 10.5281/zenodo.17970494).
- Cite: bioRxiv preprint (DOI: 10.64898/2025.12.28.696778). Newer tags (see GitHub Releases) add usability features without changing paper benchmarks.
LLM-scCurator is a Large Language Model-based curator for single-cell and spatial transcriptomics. It performs noise-aware marker distillation, suppressing technical programs (e.g., ribosomal/mitochondrial), clonotype signals (TCR/Ig), and stress signatures while rescuing lineage markers, and applies leakage-safe lineage filters before prompting an LLM. It supports hierarchical annotation (coarse-to-fine clustering and labeling) for single-cell and spatial transcriptomics data.
- Noise-aware filtering: Automatically removes lineage-specific noise (TCR/Ig) and state-dependent noise (ribosomal/mitochondrial).
- Context-aware inference: Automatically infers lineage context (e.g., "T cell") to guide LLM reasoning.
- Hierarchical discovery: One-line function to dissect complex tissues into major lineages and fine-grained subtypes.
- Spatial ready: Validated on scRNA-seq (10x) and spatial transcriptomics (Xenium, Visium).
- Privacy-first, institutional-ready: Feature distillation runs locally; annotation works with cloud or local LLM backends, or institution-approved chat UIs (no tool-side API calls).
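To make the idea of noise-aware filtering concrete, here is a minimal sketch that masks ribosomal, mitochondrial, TCR, and Ig genes from a ranked marker list by gene-symbol pattern. This is not the package's implementation: `NOISE_PATTERNS` and `split_noise` are hypothetical names, and the regular expressions are illustrative, incomplete patterns for human gene symbols.

```python
import re

# Illustrative patterns only (human gene-symbol conventions, incomplete).
# These are assumptions for this sketch, not LLM-scCurator's actual masks.
NOISE_PATTERNS = {
    "ribosomal": re.compile(r"^RP[SL]\d"),
    "mitochondrial": re.compile(r"^MT-"),
    "tcr": re.compile(r"^TR[ABDG][VJC]"),
    "ig": re.compile(r"^IG[HKL][VJC]"),
}


def split_noise(ranked_genes):
    """Return (kept, masked): kept preserves rank order, masked maps gene -> category."""
    kept, masked = [], {}
    for gene in ranked_genes:
        for category, pattern in NOISE_PATTERNS.items():
            if pattern.match(gene):
                masked[gene] = category
                break
        else:
            # No noise pattern matched, so the gene stays in the marker list.
            kept.append(gene)
    return kept, masked


ranked = ["RPS27", "TRBV20-1", "CD3E", "MT-CO1", "IGKC", "CD8A", "RPL13"]
kept, masked = split_noise(ranked)
print(kept)
print(masked)
```

In this toy list, only the lineage markers CD3E and CD8A survive, which is the kind of list one would want to hand to an LLM.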
Fig. 1a. Overview of LLM-scCurator: data-centric, adaptive feature distillation recovers identity markers despite biological-noise programs (e.g., ribosomal, cell-cycle, stress). (bioRxiv preprint)
- Install from PyPI:
  pip install llm-sc-curator
  (See PyPI project page: https://pypi.org/project/llm-sc-curator/)
- Install from source:
  # 1. Clone the repository
  git clone https://github.com/kenflab/LLM-scCurator.git
  # 2. Navigate to the directory
  cd LLM-scCurator
  # 3. Install the package (and dependencies)
  pip install .
Notes:
If you already have a Scanpy/Seurat pipeline environment, you can install LLM-scCurator into that environment.
We provide an official Docker environment (Python + R + Jupyter) that is sufficient to run LLM-scCurator and most of the paper figure generation.
The image optionally includes Ollama for local LLM annotation (no cloud API key required).
-
Use the published image from GitHub Container Registry (GHCR).
# from the repo root (optional, for notebooks / file access)
docker pull ghcr.io/kenflab/llm-sc-curator:official
Run Jupyter:
docker run --rm -it \
  -p 8888:8888 \
  -v "$PWD":/work \
  -e GEMINI_API_KEY \
  -e OPENAI_API_KEY \
  ghcr.io/kenflab/llm-sc-curator:official
Open Jupyter: http://localhost:8888
(Use the token printed in the container logs.)
Notes: For manuscript reproducibility, we also provide versioned tags (e.g., :0.1.0). Prefer a version tag when matching a paper release.
-
-
# from the repo root
docker compose -f docker/docker-compose.yml build
docker compose -f docker/docker-compose.yml up
B-1.1) Open Jupyter
- http://localhost:8888
  Workspace mount: /work
B-1.2) If prompted for "Password or token"
- Get the tokenized URL from container logs:
docker compose -f docker/docker-compose.yml logs -f llm-sc-curator
- Then either:
  - open the printed URL (contains ?token=...) in your browser, or
  - paste the token value into the login prompt.
-
# from the repo root
docker build -f docker/Dockerfile -t llm-sc-curator:official .
B-2.1) Run Jupyter
docker run --rm -it \
  -p 8888:8888 \
  -v "$PWD":/work \
  -e GEMINI_API_KEY \
  -e OPENAI_API_KEY \
  llm-sc-curator:official
B-2.2) Open Jupyter
- http://localhost:8888
  Workspace mount: /work
-
-
Use the published image from GitHub Container Registry (GHCR).
apptainer build llm-sc-curator.sif docker://ghcr.io/kenflab/llm-sc-curator:official
-
docker compose -f docker/docker-compose.yml build
apptainer build llm-sc-curator.sif docker-daemon://llm-sc-curator:official
Run Jupyter (either image):
apptainer exec --cleanenv \
--bind "$PWD":/work \
llm-sc-curator.sif \
bash -lc 'jupyter lab --ip=0.0.0.0 --port=8888 --no-browser'
We respect the sensitivity of clinical and biological data. LLM-scCurator is designed so that raw expression matrices and cell-level metadata can remain within your local environment.
- Local execution: Preprocessing, confounder masking, and feature ranking run locally on your machine.
- Minimal transmission (optional): If you choose to use an external LLM API, only anonymized, cluster-level marker lists (e.g., top 50 gene symbols) and minimal tissue context are sent.
- User control: You decide what additional context (e.g., disease state, treatment, platform) to include. Always follow institutional policy and the LLM provider's terms before sharing any information.
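As an illustration of the "minimal transmission" point above, the following sketch builds the kind of payload that would leave the machine: cluster IDs, a capped list of gene symbols, and a short context string, with no expression values or cell-level metadata. `build_llm_payload` and the payload layout are hypothetical and only meant to show the shape of the data; they are not the package's actual request format.

```python
import json


def build_llm_payload(markers_by_cluster, tissue_context, top_n=50):
    """Keep only cluster IDs, the top-N gene symbols, and a short context string.

    Hypothetical helper for illustration; not part of LLM-scCurator.
    """
    return {
        "context": tissue_context,
        "clusters": {
            str(cluster): list(genes)[:top_n]
            for cluster, genes in markers_by_cluster.items()
        },
    }


# Toy curated marker lists keyed by cluster ID.
markers = {0: ["CD3E", "CD8A", "GZMK"], 1: ["MS4A1", "CD79A", "CD19"]}
payload = build_llm_payload(markers, "human PBMC", top_n=2)
print(json.dumps(payload, indent=2))
```

Printing the payload before sending it is a simple way to confirm that it contains nothing beyond what your institution permits.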
Many institutions restrict which AI tools can be used with internal clinical or research datasets. To support these real-world constraints, we provide two end-to-end workflows that keep raw matrices and cell-level metadata local and avoid external LLM API calls unless explicitly permitted:
- Fully local LLM (Ollama): Curate features and optionally annotate clusters using a local LLM backend (no external transmission).
  examples/local/local_quickstart_ollama.ipynb
- Local feature distillation → approved chat LLM annotation (no external LLM API calls): Curate features locally, export a curated cluster-genes table, then annotate it via an institution-approved chat interface (e.g., Microsoft Copilot "Work") by uploading the CSV/Excel or pasting markers.
  examples/local/local_quickstart_approved_ai_workflow.ipynb
Documentation (Getting started, Concepts, User guide, and the full API reference): https://llm-sccurator.readthedocs.io/
- Set your API key (simplest: paste in the notebook)
import scanpy as sc
from llm_sc_curator import LLMscCurator
GEMINI_API_KEY = "PASTE_YOUR_KEY_HERE"
# OPENAI_API_KEY = "PASTE_YOUR_KEY_HERE" # optional
# Load your data
adata = sc.read_h5ad("my_data.h5ad")
# Initialize with your API Key (Google AI Studio)
curator = LLMscCurator(api_key=GEMINI_API_KEY)
curator.set_global_context(adata)
- Run LLM-scCurator
- Option A: hierarchical discovery mode (iterative coarse-to-fine clustering and labeling)
  # Fully automated hierarchical annotation (includes clustering)
  adata = curator.run_hierarchical_discovery(adata)
  # Visualize
  sc.pl.umap(adata, color=['major_type', 'fine_type'])
- Option B: Annotate your existing clusters (cluster → table/CSV → per-cell labels)
Use this when you already have clusters (e.g., Seurat seurat_clusters, Leiden, etc.) and want to annotate each cluster once, then propagate labels to cells.
# v0.1.1+
from llm_sc_curator import (
    export_cluster_annotation_table,
    apply_cluster_map_to_cells,
)

cluster_col = "seurat_clusters"  # change if needed

# 1) Annotate each cluster (once)
clusters = sorted(adata.obs[cluster_col].astype(str).unique())
cluster_results = {}
genes_by_cluster = {}
for cl in clusters:
    genes = curator.curate_features(
        adata,
        group_col=cluster_col,
        target_group=str(cl),
        use_statistics=True,
    )
    genes_by_cluster[str(cl)] = genes or []
    if genes:
        cluster_results[str(cl)] = curator.annotate(genes, use_auto_context=True)
    else:
        cluster_results[str(cl)] = {
            "cell_type": "NoGenes",
            "confidence": "Low",
            "reasoning": "Curated gene list empty",
        }

# 2) Export a shareable cluster table (CSV/DataFrame)
df_cluster = export_cluster_annotation_table(
    adata,
    cluster_col=cluster_col,
    cluster_results=cluster_results,
    genes_by_cluster=genes_by_cluster,
    prefix="Curated",
)
df_cluster.to_csv("cluster_curated_map.csv", index=False)

# 3) Propagate cluster labels to per-cell labels
apply_cluster_map_to_cells(
    adata,
    cluster_col=cluster_col,
    df_cluster=df_cluster,
    label_col="Curated_CellType",
    new_col="Curated_CellType",
)
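Conceptually, step 3 is a lookup from cluster ID to label, applied once per cell. The standard-library sketch below shows that idea on a stand-in for cluster_curated_map.csv. It does not reproduce what apply_cluster_map_to_cells does internally; the CSV layout (a seurat_clusters column next to Curated_CellType) and the "Unassigned" fallback are assumptions made for illustration.

```python
import csv
import io

# Stand-in for cluster_curated_map.csv. The column layout is an assumption
# for this sketch; inspect the real file for its actual columns.
csv_text = """seurat_clusters,Curated_CellType
0,CD8 T cell
1,B cell
"""

# Build the cluster -> label lookup from the cluster-level table.
cluster_to_label = {
    row["seurat_clusters"]: row["Curated_CellType"]
    for row in csv.DictReader(io.StringIO(csv_text))
}

# One cluster ID per cell (as strings, matching the CSV keys).
cell_clusters = ["0", "1", "1", "2"]

# Cells whose cluster is missing from the table get an explicit fallback label.
cell_labels = [cluster_to_label.get(c, "Unassigned") for c in cell_clusters]
print(cell_labels)
```

Casting cluster IDs to strings on both sides, as the snippet above does with astype(str), avoids silent mismatches between integer and string keys.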
Notes:
Manuscript results correspond to v0.1.0; later minor releases add user-facing utilities without changing core behavior.
You can use LLM-scCurator in two ways:
- Option A (recommended): Export → run in Python
We provide a helper script, examples/R/export_to_curator.R, to export your Seurat object seamlessly for processing in Python.
source("examples/R/export_to_curator.R")

Rscript examples/R/export_to_curator.R \
  --in_rds path/to/seurat_object.rds \
  --outdir out_seurat \
  --cluster_col seurat_clusters
Output:
- counts.mtx (raw counts; recommended)
- features.tsv (gene list)
- obs.csv (cell metadata; includes seurat_clusters)
- umap.csv (optional, if available)
Notes:
- The folder will contain: counts.mtx, features.tsv, obs.csv (and umap.csv if available).
- Then continue in the Python/Colab tutorial to run LLM-scCurator and write cluster_curated_map.csv, which can be re-imported into Seurat for plotting.
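Before continuing in Python, it can help to sanity-check the exported folder. The sketch below uses only the standard library to compare the dimensions declared in counts.mtx against the row counts of features.tsv and obs.csv. `check_export` is a hypothetical helper, not part of LLM-scCurator, and it assumes features.tsv has no header row while obs.csv has one; adjust if the export script writes these files differently.

```python
from pathlib import Path


def read_mtx_shape(path):
    """Return (rows, cols) from the size line of a Matrix Market file."""
    with open(path) as handle:
        for line in handle:
            # Skip the banner and comment lines, which start with '%'.
            if line.startswith("%") or not line.strip():
                continue
            parts = line.split()
            return int(parts[0]), int(parts[1])
    raise ValueError(f"no size line found in {path}")


def count_lines(path):
    """Count non-empty lines in a text file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())


def check_export(outdir):
    """Compare matrix dimensions with gene and cell counts (hypothetical helper)."""
    outdir = Path(outdir)
    shape = read_mtx_shape(outdir / "counts.mtx")
    n_genes = count_lines(outdir / "features.tsv")  # assumes no header row
    n_cells = count_lines(outdir / "obs.csv") - 1   # assumes one header row
    # Orientation (genes x cells or cells x genes) is not assumed here.
    consistent = shape in {(n_genes, n_cells), (n_cells, n_genes)}
    return {
        "mtx_shape": shape,
        "n_genes": n_genes,
        "n_cells": n_cells,
        "consistent": consistent,
    }


# Example usage (uncomment once the export folder exists):
# print(check_export("out_seurat"))
```

A mismatch usually points to a header-row assumption that does not hold or to an incomplete export, and is easier to diagnose here than after loading the data.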
- Option B: If you prefer to stay in R, you can invoke the Python package via reticulate (Python-in-R). This is more sensitive to Python environment configuration, so we recommend Option A for most users.
- Use the official Docker (Python + R + Jupyter) and follow the step-by-step tutorial notebook: examples/R/run_llm_sccurator_R_reticulate.ipynb
The notebook includes:
- Use LLM-scCurator for robust marker distillation (no API key required)
- Optional: For annotation, use Gemini/OpenAI APIs (API key required) or Ollama (no API key).
For manuscript-facing verification (benchmarks, figures, and Source Data), use the versioned assets under paper/. See paper/README.md for the primary instructions.
Notes:
- Figures are supported by exported Source Data in paper/source_data/ (see paper/FIGURE_MAP.csv for panel → file mapping).
- Re-running LLM/API calls or external reference annotators is optional; LLM API outputs may vary across runs even with temperature=0.
- For transparency, we include read-only provenance notebooks with example run logs in paper/notebooks/
- Python / Scanpy quickstart (recommended: colab_quickstart.ipynb)
  - Runs end-to-end on a public Scanpy dataset (no API key required by default).
  - Optional: If an API key is provided (replace GEMINI_API_KEY = "YOUR_KEY_HERE"), the notebook can also run LLM-scCurator automatic hierarchical cell annotation.
- OpenAI quickstart (OpenAI backend: colab_quickstart_openai.ipynb)
  - Same workflow as the Python / Scanpy quickstart, but configured for the OpenAI backend.
  - Optional: If an API key is provided (replace OPENAI_API_KEY = "YOUR_KEY_HERE"), the notebook can also run LLM-scCurator automatic hierarchical cell annotation. OPENAI_API_KEY requires OpenAI API billing (paid API credits).
- R / Seurat quickstart (export → Python LLM-scCurator → back to Seurat: colab_quickstart_R.ipynb)
  - Runs a minimal Seurat workflow in R, exports a Seurat object to an AnnData-ready folder, runs LLM-scCurator in Python, then re-imports labels into Seurat for visualization and marker sanity checks.
  - Optional: Requires an API key for LLM-scCurator annotation (same setup as above).
  - Recommended for Seurat users who want to keep Seurat clustering/UMAP but use LLM-scCurator for robust marker distillation and annotation.
LLM-scCurator supports both cloud LLM APIs (Gemini / OpenAI) and a local LLM backend (Ollama).
No manual installation is required: the official Docker environment already includes LLM-scCurator and its dependencies. If you use the local Ollama backend, no API key is needed.
Set your provider API key as an environment variable (Cloud LLM APIs):
- GEMINI_API_KEY for Google Gemini
- OPENAI_API_KEY for OpenAI API
See each provider's documentation for how to obtain an API key and for current usage policies.
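One way to work with these environment variables in a notebook is to read them at startup and fall back to the local backend when neither is set. The sketch below is an assumption-laden illustration: `pick_backend` is a hypothetical helper, and the returned strings are plain labels, not arguments accepted by LLM-scCurator.

```python
import os


def pick_backend(env=None):
    """Return a backend label based on which API key is set (hypothetical helper).

    The labels are illustrative only; they are not LLM-scCurator parameters.
    """
    env = os.environ if env is None else env
    if env.get("GEMINI_API_KEY"):
        return "gemini"
    if env.get("OPENAI_API_KEY"):
        return "openai"
    # No cloud key found: use the local backend, which needs no API key.
    return "ollama"


print(pick_backend())
```

Reading keys from the environment, rather than pasting them into a notebook cell, keeps them out of saved notebooks and version control.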

- Option A (Gemini steps):
  A-1. Go to Google AI Studio.
  A-2. Log in with your Google Account.
  A-3. Click Get API key (top-left) → Create API key.
  A-4. Copy the key and use it in your code.
- Option B (OpenAI steps):
  B-1. Go to OpenAI Platform.
  B-2. Log in with your OpenAI Account.
  B-3. Click Create new secret key → Create secret key.
  B-4. Copy the key and use it in your code.
Notes:
Google Gemini can be used within its free-tier limits.
OpenAI API usage requires enabling billing (paid API credits); ChatGPT subscriptions (e.g. Plus) do NOT include API usage.
- bioRxiv preprint: 10.64898/2025.12.28.696778
- Zenodo archive (v0.1.0): 10.5281/zenodo.17970494
- GitHub release tag: v0.1.0
