LLM-scCurator

LLM-scCurator Data-centric feature distillation for robust zero-shot cell-type annotation.

Docs bioRxiv DOI License: MIT


🚀 Overview

LLM-scCurator is a Large Language Model–based curator for single-cell and spatial transcriptomics. It performs noise-aware marker distillation, which suppresses technical programs (e.g., ribosomal/mitochondrial), clonotype signals (TCR/Ig), and stress signatures while rescuing lineage markers, and it applies leakage-safe lineage filters before prompting an LLM. It also supports hierarchical annotation (coarse-to-fine clustering and labeling).

Key Features

  • 🛡️ Noise-aware filtering: Automatically removes lineage-specific noise (TCR/Ig) and state-dependent noise (ribosomal/mitochondrial).
  • 🧠 Context-aware inference: Automatically infers lineage context (e.g., "T cell") to guide LLM reasoning.
  • 🔬 Hierarchical discovery: One-line function to dissect complex tissues into major lineages and fine-grained subtypes.
  • 🌍 Spatial ready: Validated on scRNA-seq (10x) and spatial transcriptomics (Xenium, Visium).
  • 🔒 Privacy-first, institutional-ready: Feature distillation runs locally; annotation works with cloud or local LLM backends, or institution-approved chat UIs (no tool-side API calls).
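
To make the noise-aware filtering idea concrete, here is a minimal sketch that masks noise genes by symbol pattern before taking the top markers. It is an illustration only, not the package's implementation: the regular expressions, the `NOISE_PATTERNS` table, and the `distill_markers` helper are assumptions made for this example, and the patterns are deliberately crude (for instance, a prefix match on `RPS` would also catch non-ribosomal genes such as kinases whose symbols start the same way).

```python
import re

# Illustrative patterns only (assumption); the package's actual masks may differ.
NOISE_PATTERNS = {
    "ribosomal": re.compile(r"^RP[SL]\d"),
    "mitochondrial": re.compile(r"^MT-"),
    "tcr_variable": re.compile(r"^TR[ABGD][VJ]\d"),
    "ig_variable": re.compile(r"^IG[HKL][VJ]\d"),
}


def distill_markers(ranked_genes, top_n=50):
    """Split a ranked gene list into kept markers and masked noise genes."""
    kept, masked = [], {}
    for gene in ranked_genes:
        # Find the first noise category whose pattern matches this symbol.
        label = next(
            (name for name, pat in NOISE_PATTERNS.items() if pat.match(gene)),
            None,
        )
        if label is None:
            kept.append(gene)
        else:
            masked[gene] = label
    return kept[:top_n], masked


ranked = ["RPS27", "CD3E", "MT-CO1", "TRBV20-1", "CD8A", "IGKV1-5", "GZMK"]
kept, masked = distill_markers(ranked)
print(kept)    # ['CD3E', 'CD8A', 'GZMK']
print(masked)  # which genes were masked, and why
```

Returning the masked genes alongside the kept ones makes the filtering auditable: you can inspect what was removed before anything is sent to an LLM.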

LLM-scCurator overview: data-centric feature distillation

Fig. 1a. Overview of LLM-scCurator: data-centric, adaptive feature distillation recovers identity markers despite biological-noise programs (e.g., ribosomal, cell-cycle, stress). (bioRxiv preprint)


📦 Installation

PyPI version Python 3.9+ R Docker Jupyter

  • Option A (recommended): Install from PyPI

    pip install llm-sc-curator
    

    (See PyPI project page: https://pypi.org/project/llm-sc-curator/)

  • Option B: Install from GitHub (development)

    # 1. Clone the repository
    git clone https://github.com/kenflab/LLM-scCurator.git
    
    # 2. Navigate to the directory
    cd LLM-scCurator
    
    # 3. Install the package (and dependencies)
    pip install .

Notes:

If you already have a Scanpy/Seurat pipeline environment, you can install LLM-scCurator into that environment.


🐳 Docker (official environment)

We provide an official Docker environment (Python + R + Jupyter), sufficient to run LLM-scCurator and reproduce most of the paper's figures.
The image optionally includes Ollama for local LLM annotation (no cloud API key required).

  • Option A: Prebuilt image (recommended)

    Use the published image from GitHub Container Registry (GHCR).

    # from the repo root (optional, for notebooks / file access)
    docker pull ghcr.io/kenflab/llm-sc-curator:official

    Run Jupyter:

    docker run --rm -it \
      -p 8888:8888 \
      -v "$PWD":/work \
      -e GEMINI_API_KEY \
      -e OPENAI_API_KEY \
      ghcr.io/kenflab/llm-sc-curator:official
    

    Open Jupyter: http://localhost:8888

    (Use the token printed in the container logs.)

    Notes:

    For manuscript reproducibility, we also provide versioned tags (e.g., :0.1.0). Prefer a version tag when matching a paper release.

  • Option B: Build locally (development)

    • Option B-1: Build locally with Compose (recommended for dev)
      # from the repo root
      docker compose -f docker/docker-compose.yml build
      docker compose -f docker/docker-compose.yml up

      B-1.1) Open Jupyter

      B-1.2) If prompted for "Password or token"

      • Get the tokenized URL from container logs:
        docker compose -f docker/docker-compose.yml logs -f llm-sc-curator
      • Then either:
        • open the printed URL (contains ?token=...) in your browser, or
        • paste the token value into the login prompt.
    • Option B-2: Build locally without Compose (alternative)
      # from the repo root
      docker build -f docker/Dockerfile -t llm-sc-curator:official .

      B-2.1) Run Jupyter

      docker run --rm -it \
        -p 8888:8888 \
        -v "$PWD":/work \
        -e GEMINI_API_KEY \
        -e OPENAI_API_KEY \
        llm-sc-curator:official

      B-2.2) Open Jupyter


🖥️ Apptainer / Singularity (HPC)

  • Option A: Prebuilt image (recommended)

    Use the published image from GitHub Container Registry (GHCR).

    apptainer build llm-sc-curator.sif docker://ghcr.io/kenflab/llm-sc-curator:official
  • Option B: Build a .sif from the local Docker image (development)

    docker compose -f docker/docker-compose.yml build
    apptainer build llm-sc-curator.sif docker-daemon://llm-sc-curator:official

Run Jupyter (either image):

apptainer exec --cleanenv \
  --bind "$PWD":/work \
  llm-sc-curator.sif \
  bash -lc 'jupyter lab --ip=0.0.0.0 --port=8888 --no-browser'

🔒 Privacy

We respect the sensitivity of clinical and biological data. LLM-scCurator is designed so that raw expression matrices and cell-level metadata can remain within your local environment.

  • Local execution: Preprocessing, confounder masking, and feature ranking run locally on your machine.
  • Minimal transmission (optional): If you choose to use an external LLM API, only anonymized, cluster-level marker lists (e.g., top 50 gene symbols) and minimal tissue context are sent.
  • User control: You decide what additional context (e.g., disease state, treatment, platform) to include. Always follow institutional policy and the LLM provider's terms before sharing any information.
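
As a rough illustration of what "minimal transmission" means in practice, the sketch below builds a payload that contains only cluster IDs, a capped list of gene symbols per cluster, and a tissue hint. The function name `build_llm_payload` and the payload layout are assumptions for this example; the package's actual request format is not documented here and may differ.

```python
import json


def build_llm_payload(markers_by_cluster, tissue, top_n=50):
    """Keep only cluster IDs, the top-N gene symbols, and a tissue hint.

    No expression values, cell barcodes, or per-cell metadata are included.
    """
    return {
        "tissue": tissue,
        "clusters": {
            str(cluster): list(genes)[:top_n]
            for cluster, genes in markers_by_cluster.items()
        },
    }


# Hypothetical curated markers for two clusters.
markers = {0: ["CD3E", "CD8A", "GZMK"], 1: ["MS4A1", "CD79A"]}
payload = build_llm_payload(markers, tissue="PBMC", top_n=2)
print(json.dumps(payload, indent=2))
```

Printing the payload before sending it is a simple way to confirm that nothing beyond gene symbols and the context you chose leaves your machine.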

Institutional-friendly workflows

Many institutions restrict which AI tools can be used with internal clinical or research datasets. To support these real-world constraints, we provide two end-to-end workflows that keep raw matrices and cell-level metadata local and avoid external LLM API calls unless explicitly permitted:

  • Fully local LLM (Ollama): Curate features and optionally annotate clusters using a local LLM backend (no external transmission). Notebook: examples/local/local_quickstart_ollama.ipynb

  • Local feature distillation → Approved chat LLM annotation (no external LLM API calls): Curate features locally, export a curated cluster→genes table, then annotate it via an institution-approved chat interface (e.g., Microsoft Copilot “Work”) by uploading the CSV/Excel or pasting markers. Notebook: examples/local/local_quickstart_approved_ai_workflow.ipynb
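
For the approved-chat workflow, the exported table can be as simple as one row per cluster. The following sketch serializes a cluster→genes mapping to CSV text using only the standard library. The column names and the `clusters_to_csv` helper are assumptions for illustration; the example notebook may export a different layout.

```python
import csv
import io


def clusters_to_csv(genes_by_cluster):
    """Serialize a cluster -> genes mapping as CSV text, one row per cluster."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["cluster", "n_genes", "genes"])
    for cluster, genes in sorted(genes_by_cluster.items()):
        # Join symbols with ';' so the gene list stays in a single CSV cell.
        writer.writerow([cluster, len(genes), ";".join(genes)])
    return buf.getvalue()


csv_text = clusters_to_csv({"1": ["MS4A1", "CD79A"], "0": ["CD3E", "CD8A"]})
print(csv_text)
```

The resulting text can be saved to a file for upload or pasted directly into the chat interface, whichever your institution permits.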


⚡ Quick Start

Documentation (Getting started, Concepts, User guide, and the full API reference): https://llm-sccurator.readthedocs.io/

🐍 For Python / Scanpy Users

  1. Set your API key (simplest: paste in the notebook)

    import scanpy as sc
    from llm_sc_curator import LLMscCurator
    
    GEMINI_API_KEY = "PASTE_YOUR_KEY_HERE"
    # OPENAI_API_KEY = "PASTE_YOUR_KEY_HERE"  # optional
    
    # Load your data
    adata = sc.read_h5ad("my_data.h5ad")
    
    # Initialize with your API Key (Google AI Studio)
    curator = LLMscCurator(api_key=GEMINI_API_KEY)
    curator.set_global_context(adata)

  2. Run LLM-scCurator

  • Option A: hierarchical discovery mode (iterative coarse-to-fine clustering and labeling)

    # Fully automated hierarchical annotation (includes clustering)
    adata = curator.run_hierarchical_discovery(adata)
    
    # Visualize
    sc.pl.umap(adata, color=['major_type', 'fine_type'])
  • Option B: Annotate your existing clusters (cluster → table/CSV → per-cell labels)
    Use this when you already have clusters (e.g., Seurat seurat_clusters, Leiden, etc.) and want to annotate each cluster once, then propagate labels to cells.

    # v0.1.1+
    from llm_sc_curator import (
        export_cluster_annotation_table,
        apply_cluster_map_to_cells,
    )
    
    cluster_col = "seurat_clusters"  # change if needed
    
    # 1) Annotate each cluster (once)
    clusters = sorted(adata.obs[cluster_col].astype(str).unique())
    cluster_results = {}
    genes_by_cluster = {}
    
    for cl in clusters:
        genes = curator.curate_features(
            adata,
            group_col=cluster_col,
            target_group=str(cl),
            use_statistics=True,
        )
        genes_by_cluster[str(cl)] = genes or []
    
        if genes:
            cluster_results[str(cl)] = curator.annotate(genes, use_auto_context=True)
        else:
            cluster_results[str(cl)] = {
                "cell_type": "NoGenes",
                "confidence": "Low",
                "reasoning": "Curated gene list empty",
            }
    
    # 2) Export a shareable cluster table (CSV/DataFrame)
    df_cluster = export_cluster_annotation_table(
        adata,
        cluster_col=cluster_col,
        cluster_results=cluster_results,
        genes_by_cluster=genes_by_cluster,
        prefix="Curated",
    )
    df_cluster.to_csv("cluster_curated_map.csv", index=False)
    
    # 3) Propagate cluster labels to per-cell labels
    apply_cluster_map_to_cells(
        adata,
        cluster_col=cluster_col,
        df_cluster=df_cluster,
        label_col="Curated_CellType",
        new_col="Curated_CellType",
    )

Notes:

Manuscript results correspond to v0.1.0; later minor releases add user-facing utilities without changing core behavior.
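
Conceptually, step 3 of Option B above is a lookup from each cell's cluster ID to that cluster's label. The sketch below shows the idea in plain Python. `propagate_labels` is a hypothetical helper written for this illustration, not the package's `apply_cluster_map_to_cells`, whose internals and handling of unmapped clusters may differ.

```python
def propagate_labels(cell_clusters, cluster_to_label, default="Unassigned"):
    """Map each cell's cluster ID to its cluster-level label.

    Cluster IDs are compared as strings, mirroring the str() casts in Option B.
    Cells whose cluster has no label receive `default`.
    """
    return [cluster_to_label.get(str(cl), default) for cl in cell_clusters]


cell_clusters = ["0", "1", "0", "2"]                     # one entry per cell
cluster_to_label = {"0": "CD8 T cell", "1": "B cell"}   # one entry per cluster
labels = propagate_labels(cell_clusters, cluster_to_label)
print(labels)  # ['CD8 T cell', 'B cell', 'CD8 T cell', 'Unassigned']
```

Because each cluster is annotated once, the number of LLM calls scales with the number of clusters rather than the number of cells.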

📊 For R / Seurat Users

You can use LLM-scCurator in two ways:

  • Option A (recommended): Export → run in Python. We provide a helper script, examples/R/export_to_curator.R, to export your Seurat object for processing in Python.

    source("examples/R/export_to_curator.R")
    Rscript examples/R/export_to_curator.R \
      --in_rds path/to/seurat_object.rds \
      --outdir out_seurat \
      --cluster_col seurat_clusters

    Output:

    • counts.mtx (raw counts; recommended)
    • features.tsv (gene list)
    • obs.csv (cell metadata; includes seurat_clusters)
    • umap.csv (optional, if available)

    Notes:

    • Continue in the Python/Colab tutorial to run LLM-scCurator and write cluster_curated_map.csv, which can be re-imported into Seurat for plotting.
  • Option B: Run from R via reticulate (advanced)

    If you prefer to stay in R, you can invoke the Python package via reticulate (Python-in-R). This is more sensitive to Python environment configuration, so we recommend Option A for most users.
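
Before continuing in Python with the Option A export, a quick consistency check on the export folder can catch mismatched files early. The sketch below uses only the standard library and rests on several assumptions that are not confirmed by the export script: counts.mtx is in Matrix Market coordinate format, features.tsv has one gene per line with no header, and obs.csv has a header row that includes the cluster column. `check_export` is a hypothetical helper, not part of the package.

```python
import csv
from pathlib import Path


def check_export(outdir, cluster_col="seurat_clusters"):
    """Return (n_genes, n_cells) after basic consistency checks on the folder."""
    outdir = Path(outdir)

    # Matrix Market: skip '%' comment lines; the first data line is "rows cols nnz".
    with open(outdir / "counts.mtx") as fh:
        for line in fh:
            if not line.startswith("%"):
                n_rows, n_cols, _nnz = (int(x) for x in line.split())
                break
        else:
            raise ValueError("counts.mtx has no size line")

    # Assumption: one gene per non-empty line, no header row.
    with open(outdir / "features.tsv") as fh:
        n_genes = sum(1 for line in fh if line.strip())

    # Assumption: header row present; each following row is one cell.
    with open(outdir / "obs.csv", newline="") as fh:
        reader = csv.DictReader(fh)
        if cluster_col not in (reader.fieldnames or []):
            raise ValueError(f"obs.csv has no '{cluster_col}' column")
        n_cells = sum(1 for _ in reader)

    # The matrix may be genes x cells or cells x genes, so compare as a pair.
    if sorted((n_rows, n_cols)) != sorted((n_genes, n_cells)):
        raise ValueError(
            f"counts.mtx is {n_rows}x{n_cols}, but found "
            f"{n_genes} genes and {n_cells} cells"
        )
    return n_genes, n_cells


# Usage, with the output directory from the export command:
# n_genes, n_cells = check_export("out_seurat")
```

The check deliberately ignores matrix orientation, since whether the exported matrix is genes × cells or the transpose is not stated here; confirm the orientation when loading the matrix.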


📄 Manuscript reproduction

For manuscript-facing verification (benchmarks, figures, and Source Data), use the versioned assets under paper/. See paper/README.md for the primary instructions.

Notes:

  • Figures are supported by exported Source Data in paper/source_data/ (see paper/FIGURE_MAP.csv for panel → file mapping).
  • Re-running LLM/API calls or external reference annotators is optional; LLM API outputs may vary across runs even with temperature=0.
  • For transparency, we include read-only provenance notebooks with example run logs in paper/notebooks/.

📓 Colab notebooks

  • Python / Scanpy quickstart (recommended: colab_quickstart.ipynb)

    • Open In Colab
      ☝️ Runs end-to-end on a public Scanpy dataset (no API key required by default).

      • 🔑 Optional: If an API key is provided (replace GEMINI_API_KEY = "YOUR_KEY_HERE"), the notebook can also run LLM-scCurator automatic hierarchical cell annotation.
  • OpenAI quickstart (OpenAI backend: colab_quickstart_openai.ipynb)

    • Open In Colab
      ☝️ Same workflow as the Python / Scanpy quickstart, but configured for the OpenAI backend.

      • 🔑 Optional: If an API key is provided (replace OPENAI_API_KEY = "YOUR_KEY_HERE"), the notebook can also run LLM-scCurator automatic hierarchical cell annotation. OPENAI_API_KEY requires OpenAI API billing (paid API credits).
  • R / Seurat quickstart (export → Python LLM-scCurator → back to Seurat: colab_quickstart_R.ipynb)

    • Open In Colab
      ☝️ Runs a minimal Seurat workflow in R, exports a Seurat object to an AnnData-ready folder, runs LLM-scCurator in Python, then re-imports labels into Seurat for visualization and marker sanity checks.
      • 🔑 Optional: Requires an API key for LLM-scCurator annotation (same setup as above).
      • Recommended for Seurat users who want to keep Seurat clustering/UMAP but use LLM-scCurator for robust marker distillation and annotation.

🔑 Backends setup (API keys or local Ollama)

LLM-scCurator supports both cloud LLM APIs (Gemini / OpenAI) and a local LLM backend (Ollama).
No manual installation is required: the official Docker environment already includes LLM-scCurator and its dependencies. If you use the local Ollama backend, no API key is needed.

Set your provider API key as an environment variable (Cloud LLM APIs):

  • GEMINI_API_KEY for Google Gemini
  • OPENAI_API_KEY for OpenAI API

See each provider's documentation for how to obtain an API key and for current usage policies.

  • Option A (Gemini steps):
    A-1. Go to Google AI Studio.
    A-2. Log in with your Google Account.
    A-3. Click Get API key (top-left) → Create API key.
    A-4. Copy the key and use it in your code.

  • Option B (OpenAI steps):
    B-1. Go to OpenAI Platform.
    B-2. Log in with your OpenAI Account.
    B-3. Click Create new secret key → Create secret key.
    B-4. Copy the key and use it in your code.

Notes:

Google Gemini can be used within its free-tier limits.
OpenAI API usage requires enabling billing (paid API credits); ChatGPT subscriptions (e.g. Plus) do NOT include API usage.
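
Once a key is set as an environment variable (for example, passed into the container with `-e GEMINI_API_KEY`), it can be read in Python rather than pasted into a notebook. The helper below is a small sketch of that pattern; `read_key` is a hypothetical function for this example, and only the `LLMscCurator(api_key=...)` call shown in the usage comment comes from the Quick Start.

```python
import os


def read_key(name, env=None):
    """Return the API key stored in environment variable `name`.

    Raises RuntimeError with a clear message if the variable is unset or blank,
    so a missing key fails early instead of at the first LLM call.
    """
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before starting Python")
    return value


# Usage with the Gemini backend, as in the Quick Start:
# from llm_sc_curator import LLMscCurator
# curator = LLMscCurator(api_key=read_key("GEMINI_API_KEY"))
```

Reading the key from the environment keeps it out of saved notebooks and version control.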


Citation