SCAML

Machine-learning for single-cell RNA-seq microglia. Train scikit-learn models (LR / RF / XGB / MLP) to predict cell types/states directly from expression, benchmarked across harvests (H4 vs H8) and with HEK novelty detection. This extends the Seurat analysis from scrna_ipsc and turns label transfer into a full ML benchmark.

What’s new (high-level)

Robust splits & labels
- Train/test by harvest, culture, etc. with consistent label encoding.
- Works even when train/test see different class sets (reverse split). XGBoost label remapping + probability expansion included.
Safer metrics
- classification_report aligned to the full class list.
- Macro OVR ROC-AUC computed robustly; returns NaN when test has <2 classes.
Saved artifacts for plotting
- Per-model: predictions.csv, confusion_matrix.csv, metrics.json, proba.csv.
- Top-level: summary_metrics.json, classes.txt, features.txt.
- New utility: scaml_plot.py renders model comparison + confusion matrices.
Seurat v5 → AnnData export guidance
- Reliable .h5ad creation even with Seurat v5 multi-layer assays.

Data expectations (AnnData `.h5ad`)

adata.X — recommended: raw counts (SCAML log1p’s internally for HVGs). If you prefer, you can skip HVGs and pass an embedding via --embedding.
adata.obs (columns expected):
- label_ref: string labels (e.g. Olah 2020 microglia clusters).
- harvest: e.g. "H4" / "H8" (or any split key you choose).
- is_hek (optional): boolean for novelty detection.
adata.obsm (optional but supported):
- X_pca, X_umap, and optionally X_harmony (if you ran Harmony in Seurat).
- Use --embedding X_harmony (or X_pca, etc.) to train on embeddings instead of genes.

Export from Seurat v5 → `.h5ad` (robust path)

Seurat v5’s multi-layer assays can break older converters. A reliable approach is to write the AnnData directly from R using the anndata R package:

# In R
library(Seurat)
library(anndata)

obj <- readRDS("seurat_with_labels.rds")

# Map fields SCAML expects
stopifnot("olah_label" %in% colnames(obj@meta.data))
obj$label_ref <- as.character(obj$olah_label)
obj$is_hek    <- if ("mg_like" %in% colnames(obj@meta.data)) !obj$mg_like else FALSE

# Choose expression for adata.X (SCT preferred if present; otherwise RNA)
assay <- if ("SCT" %in% names(obj@assays)) "SCT" else "RNA"
DefaultAssay(obj) <- assay
# Ensure 'data' exists; recompute if needed
obj <- NormalizeData(obj, assay = assay, verbose = FALSE)

# Build AnnData
X <- GetAssayData(obj, assay = assay, slot = "data")  # log-normalized
obsm <- list()
if ("pca" %in% names(obj@reductions))     obsm[["X_pca"]]     <- as.matrix(Embeddings(obj, "pca"))
if ("umap" %in% names(obj@reductions))    obsm[["X_umap"]]    <- as.matrix(Embeddings(obj, "umap"))
if ("harmony" %in% names(obj@reductions)) obsm[["X_harmony"]] <- as.matrix(Embeddings(obj, "harmony"))

ad <- anndata::AnnData(
  X   = X,
  obs = obj@meta.data,
  var = data.frame(gene = rownames(obj), row.names = rownames(obj)),
  obsm = obsm
)

dir.create("data", showWarnings = FALSE)
ad$write_h5ad("data/ipsc_mg_all.h5ad", compression = "gzip")

Verify the file (size >> 1MB and h5ls shows groups like obs, var, X, obsm). If you prefer HVGs on raw counts, set adata.layers["counts"] too; otherwise SCAML will still work (it log1p’s the selected genes).

⚠️ SeuratDisk sometimes creates tiny (invalid) .h5ad files under Seurat v5. If your file is ~800 bytes or h5ls shows no keys, use the anndata route above.

Setup

# Python (aligns with CI)
conda create -y -n scaml python=3.10
conda activate scaml

# Install
pip install --upgrade pip
pip install -r SCAML/requirements.txt

Train & Evaluate

1) Baseline cross-harvest split

python SCAML/scaml_train.py --adata data/ipsc_mg_all.h5ad \
  --label-key label_ref \
  --split-key harvest --split-train H4 --split-test H8 \
  --features hvg --n-hvg 3000 \
  --models lr rf xgb mlp \
  --outdir results/baseline_H4train_H8test

2) Reverse split (robust to class-set differences)

python SCAML/scaml_train.py --adata data/ipsc_mg_all.h5ad \
  --label-key label_ref \
  --split-key harvest --split-train H8 --split-test H4 \
  --features hvg --n-hvg 3000 \
  --models lr rf xgb mlp \
  --outdir results/baseline_H8train_H4test

3) Learning curves (group-aware; avoids leakage)

python SCAML/scaml_learning_curve.py \
  --adata data/ipsc_mg_all.h5ad \
  --label-key label_ref --group-key culture \
  --features hvg --n-hvg 3000 \
  --model lr \
  --outdir results/learning_curves

4) Novelty detection (HEK outliers)

python SCAML/scaml_novelty.py \
  --adata data/ipsc_mg_all.h5ad \
  --hek-key is_hek --hek-positive true \
  --features hvg --n-hvg 3000 \
  --method iforest \
  --outdir results/novelty

5) Plots

# Compare models + per-model confusion matrices
python SCAML/scaml_plot.py \
  --results results/baseline_H4train_H8test \
  --outdir  results/baseline_H4train_H8test/plots \
  --models  lr rf xgb mlp

python SCAML/scaml_plot.py \
  --results results/baseline_H8train_H4test \
  --outdir  results/baseline_H8train_H4test/plots \
  --models  lr rf xgb mlp

Tip: If you exported Harmony to obsm['X_harmony'], you can train on it:

python SCAML/scaml_train.py \
  --adata data/ipsc_mg_all.h5ad \
  --label-key label_ref \
  --split-key harvest --split-train H4 --split-test H8 \
  --embedding X_harmony \
  --models lr rf xgb mlp \
  --outdir results/with_harmony_H4train_H8test

Outputs

Top-level (per run)

summary_metrics.json — per-model summary: accuracy, F1 (macro), ROC-AUC (macro OVR; may be NaN if test has one class)
classes.txt — integer ↔︎ label mapping
features.txt — HVG list or embedding column names

Per model (e.g., lr/, rf/, xgb/, mlp/)

classification_report.txt — precision/recall/F1 by class
confusion_matrix.csv — rows = true, cols = predicted
predictions.csv — cell, true/pred labels + indices
proba.csv — per-class probabilities (expanded to the full class set)
metrics.json — accuracy, F1 (macro), ROC-AUC (macro OVR)

Plots (from scaml_plot.py)

plots/model_compare.png — accuracy & F1 bars per model
plots/<model>_confusion_matrix.png — heatmaps (row-normalized)

Models

LR — LogisticRegression (with MaxAbsScaler)
RF — RandomForestClassifier
XGB — XGBClassifier (auto-remaps labels when train/test class sets differ)
MLP — MLPClassifier (dense path with standardization)

Notes & gotchas

HVGs vs counts: The HVG selector is Seurat-style. For cleanest results, put raw counts in adata.X; SCAML applies log1p on the selected genes. If your adata.X is already log data, consider using --embedding to avoid HVG warnings.
ROC-AUC showing NaN: This is expected if your test split contains only one class. Accuracy/F1 are still valid.
SeuratDisk tiny .h5ad: If h5ad is ~800 bytes or missing groups (obs, var, etc.), re-export with the R anndata approach above.
Group-aware splits: Prefer --split-key culture (or donor/patient) for fair generalization tests; avoid random cell splits.

Dev / CI

Local:

pip3 install flake8 pytest
flake8 .
pytest

GitHub Actions (name: SCAML) runs:

Lint: flake8 (syntax/complexity)
Tests: pytest on Python 3.10

Acknowledgements

Built on top of your Seurat v5 pipeline (SCT, optional Harmony, label transfer to Olah 2020).
Uses: Scanpy/AnnData, scikit-learn, (optional) XGBoost, and R packages for export.

Example results

🧬 Context

SCAML here is being used as a machine learning framework for transcriptomic classification — evaluating how well expression patterns from induced pluripotent stem cell (iPSC)-derived microglia at one differentiation stage (harvest) can predict or distinguish those from another.

H4 (Harvest 4) = Early differentiation stage.
H8 (Harvest 8) = Later, more mature differentiation stage.
Each has three biological replicates, and models are trained on one harvest and tested on the other to measure cross-harvest generalization — i.e., whether transcriptomic signatures of maturation are reproducible and not chip-specific.

Analysis of SCAML H4-trained, H8-tested model results from summary_metrics.json:

Algorithm	Accuracy	F1-macro	ROC-AUC (macro, OVR)	Interpretation
Logistic Regression (lr)	0.764	0.509	0.898	Strong linear baseline; balanced precision/recall; good generalization.
Random Forest (rf)	0.731	0.363	0.900	Slightly lower accuracy and poor F1 → likely class imbalance issue.
XGBoost (xgb)	0.771	0.544	0.901	Best performer overall — highest accuracy, F1, and AUC; robust ensemble learner.
MLP (mlp)	0.745	0.389	0.871	Decent but underperforms on F1 and AUC; possibly undertrained or overfitting.

Model Performance (iPSC-microglia scRNA-seq)

Cross-harvest generalization

Train on H4 → Test on H8

Train on H8 → Test on H4

🧠 Interpretation

Best Algorithm: ✅ XGBoost
- Highest accuracy (77.1%), F1-macro (0.54), and ROC-AUC (0.90) → excellent balance between precision and recall.
- Performs slightly better than logistic regression, meaning nonlinear relationships in your SCAML data are important.
Random Forest shows good AUC but poor F1 → indicates it's predicting the majority class too often (class imbalance sensitivity).
MLP (neural net) underperforms, suggesting either limited data, suboptimal architecture, or insufficient epochs.

📊 Recommended next steps

Confirm class distribution — the low F1 for RF/MLP hints at imbalance; consider class weighting or SMOTE.
Inspect confusion matrices for XGB vs LR to see which cluster classes are misclassified.
Cross-validate XGB on multiple train/test splits (not just H4→H8) to check consistency across harvests/chips.
Model explainability: use SHAP to interpret XGB feature contributions — useful for biological relevance in transcript curation.

Conclusion:

For the SCAML dataset (H4-trained, H8-tested), XGBoost is the most effective algorithm — it achieves the best balance of accuracy, class-balanced F1, and overall discrimination (ROC-AUC ≈ 0.90).

Analysis of SCAML run where models were trained on H8 and tested on H4:

Algorithm	Accuracy	F1-macro	ROC-AUC (macro OVR)	Interpretation
Logistic Regression (lr)	0.752	0.423	0.904	Strong baseline; high AUC; slightly reduced F1 due to imbalance.
Random Forest (rf)	0.744	0.350	0.899	Similar to before — AUC fine, but poor recall on minority classes.
XGBoost (xgb)	0.779	0.571	0.905	Best across all metrics; consistent top performance between H4↔H8 tests.
MLP (mlp)	0.757	0.420	0.875	Weaker AUC and F1 — again suggests underfitting or sensitivity to training size.

🔍 Comparison to H4-trained → H8-tested results

Direction	Best Model	Accuracy	F1-macro	ROC-AUC	Notes
H4 → H8	XGBoost	0.771	0.544	0.901	Strong cross-harvest generalization
H8 → H4	XGBoost	0.779	0.571	0.905	Slightly improved in both F1 and AUC

✅ XGBoost again emerges as the top model — and importantly, it generalizes well in both directions (H4→H8 and H8→H4). This symmetry suggests the learned patterns are stable across harvests/chips, which supports biological rather than purely technical signal.

🧠 Interpretation summary

Cross-harvest stability: XGBoost achieves AUC ≈ 0.90 in both directions — excellent discriminative power.
F1-macro ≈ 0.55–0.57: indicates balanced detection of both true/false or multi-class Seurat cluster labels.
MLP underperformance likely due to limited data or architecture mismatch.
Logistic regression performs surprisingly well given simplicity → worth keeping as a fast baseline.

🧠 Biological Interpretation

Cross-harvest reproducibility: High AUC values (~0.90) in both directions confirm that transcriptomic maturation signatures of iPSC-microglia are robust to chip and replicate variation.
F1-macro (~0.54–0.57) shows the model captures both early and late populations with balanced recall.
XGBoost consistently outperforms simpler (LR) and more complex (MLP) models — suggesting that nonlinear but structured gene-expression relationships dominate differentiation dynamics.

🏁 Conclusion

Across both training directions (H4 → H8 and H8 → H4), XGBoost is the most effective algorithm for distinguishing early and late iPSC-microglial transcriptomic states. It achieves the highest accuracy, balanced F1, and ROC-AUC, demonstrating that the maturation-associated gene-expression signatures it learns are biologically meaningful and reproducible across independent harvests and chips.

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
.github/workflows		.github/workflows
figures		figures
scaml		scaml
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
scaml_learning_curve.py		scaml_learning_curve.py
scaml_novelty.py		scaml_novelty.py
scaml_plot.py		scaml_plot.py
scaml_predict.py		scaml_predict.py
scaml_train.py		scaml_train.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

SCAML

What’s new (high-level)

Data expectations (AnnData `.h5ad`)

Export from Seurat v5 → `.h5ad` (robust path)

Setup

Train & Evaluate

1) Baseline cross-harvest split

2) Reverse split (robust to class-set differences)

3) Learning curves (group-aware; avoids leakage)

4) Novelty detection (HEK outliers)

5) Plots

Outputs

Models

Notes & gotchas

Dev / CI

Acknowledgements

Example results

🧬 Context

Model Performance (iPSC-microglia scRNA-seq)

Train on H4 → Test on H8

Train on H8 → Test on H4

🧠 Interpretation

📊 Recommended next steps

🔍 Comparison to H4-trained → H8-tested results

🧠 Interpretation summary

🧠 Biological Interpretation

🏁 Conclusion

About

Uh oh!

Releases

Packages

Languages

License

pasted/SCAML

Folders and files

Latest commit

History

Repository files navigation

SCAML

What’s new (high-level)

Data expectations (AnnData .h5ad)

Export from Seurat v5 → .h5ad (robust path)

Setup

Train & Evaluate

1) Baseline cross-harvest split

2) Reverse split (robust to class-set differences)

3) Learning curves (group-aware; avoids leakage)

4) Novelty detection (HEK outliers)

5) Plots

Outputs

Models

Notes & gotchas

Dev / CI

Acknowledgements

Example results

🧬 Context

Model Performance (iPSC-microglia scRNA-seq)

Train on H4 → Test on H8

Train on H8 → Test on H4

🧠 Interpretation

📊 Recommended next steps

🔍 Comparison to H4-trained → H8-tested results

🧠 Interpretation summary

🧠 Biological Interpretation

🏁 Conclusion

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Data expectations (AnnData `.h5ad`)

Export from Seurat v5 → `.h5ad` (robust path)

Packages