Machine-learning for single-cell RNA-seq microglia. Train scikit-learn models (LR / RF / XGB / MLP) to predict cell types/states directly from expression, benchmarked across harvests (H4 vs H8) and with HEK novelty detection. This extends the Seurat analysis from scrna_ipsc and turns label transfer into a full ML benchmark.
-
Robust splits & labels
- Train/test by
harvest,culture, etc. with consistent label encoding. - Works even when train/test see different class sets (reverse split). XGBoost label remapping + probability expansion included.
- Train/test by
-
Safer metrics
classification_reportaligned to the full class list.- Macro OVR ROC-AUC computed robustly; returns
NaNwhen test has <2 classes.
-
Saved artifacts for plotting
- Per-model:
predictions.csv,confusion_matrix.csv,metrics.json,proba.csv. - Top-level:
summary_metrics.json,classes.txt,features.txt. - New utility:
scaml_plot.pyrenders model comparison + confusion matrices.
- Per-model:
-
Seurat v5 → AnnData export guidance
- Reliable
.h5adcreation even with Seurat v5 multi-layer assays.
- Reliable
-
adata.X— recommended: raw counts (SCAML log1p’s internally for HVGs). If you prefer, you can skip HVGs and pass an embedding via--embedding. -
adata.obs(columns expected):label_ref: string labels (e.g. Olah 2020 microglia clusters).harvest: e.g."H4"/"H8"(or any split key you choose).is_hek(optional): boolean for novelty detection.
-
adata.obsm(optional but supported):X_pca,X_umap, and optionallyX_harmony(if you ran Harmony in Seurat).- Use
--embedding X_harmony(orX_pca, etc.) to train on embeddings instead of genes.
Seurat v5’s multi-layer assays can break older converters. A reliable approach is to write the AnnData directly from R using the anndata R package:
# In R
library(Seurat)
library(anndata)
obj <- readRDS("seurat_with_labels.rds")
# Map fields SCAML expects
stopifnot("olah_label" %in% colnames(obj@meta.data))
obj$label_ref <- as.character(obj$olah_label)
obj$is_hek <- if ("mg_like" %in% colnames(obj@meta.data)) !obj$mg_like else FALSE
# Choose expression for adata.X (SCT preferred if present; otherwise RNA)
assay <- if ("SCT" %in% names(obj@assays)) "SCT" else "RNA"
DefaultAssay(obj) <- assay
# Ensure 'data' exists; recompute if needed
obj <- NormalizeData(obj, assay = assay, verbose = FALSE)
# Build AnnData
X <- GetAssayData(obj, assay = assay, slot = "data") # log-normalized
obsm <- list()
if ("pca" %in% names(obj@reductions)) obsm[["X_pca"]] <- as.matrix(Embeddings(obj, "pca"))
if ("umap" %in% names(obj@reductions)) obsm[["X_umap"]] <- as.matrix(Embeddings(obj, "umap"))
if ("harmony" %in% names(obj@reductions)) obsm[["X_harmony"]] <- as.matrix(Embeddings(obj, "harmony"))
ad <- anndata::AnnData(
X = X,
obs = obj@meta.data,
var = data.frame(gene = rownames(obj), row.names = rownames(obj)),
obsm = obsm
)
dir.create("data", showWarnings = FALSE)
ad$write_h5ad("data/ipsc_mg_all.h5ad", compression = "gzip")
Verify the file (size >> 1MB and h5ls shows groups like obs, var, X, obsm).
If you prefer HVGs on raw counts, set adata.layers["counts"] too; otherwise SCAML will still work (it log1p’s the selected genes).
⚠️ SeuratDisk sometimes creates tiny (invalid).h5adfiles under Seurat v5. If your file is ~800 bytes orh5lsshows no keys, use theanndataroute above.
# Python (aligns with CI)
conda create -y -n scaml python=3.10
conda activate scaml
# Install
pip install --upgrade pip
pip install -r SCAML/requirements.txt
python SCAML/scaml_train.py --adata data/ipsc_mg_all.h5ad \
--label-key label_ref \
--split-key harvest --split-train H4 --split-test H8 \
--features hvg --n-hvg 3000 \
--models lr rf xgb mlp \
--outdir results/baseline_H4train_H8test
python SCAML/scaml_train.py --adata data/ipsc_mg_all.h5ad \
--label-key label_ref \
--split-key harvest --split-train H8 --split-test H4 \
--features hvg --n-hvg 3000 \
--models lr rf xgb mlp \
--outdir results/baseline_H8train_H4test
python SCAML/scaml_learning_curve.py \
--adata data/ipsc_mg_all.h5ad \
--label-key label_ref --group-key culture \
--features hvg --n-hvg 3000 \
--model lr \
--outdir results/learning_curves
python SCAML/scaml_novelty.py \
--adata data/ipsc_mg_all.h5ad \
--hek-key is_hek --hek-positive true \
--features hvg --n-hvg 3000 \
--method iforest \
--outdir results/novelty
# Compare models + per-model confusion matrices
python SCAML/scaml_plot.py \
--results results/baseline_H4train_H8test \
--outdir results/baseline_H4train_H8test/plots \
--models lr rf xgb mlp
python SCAML/scaml_plot.py \
--results results/baseline_H8train_H4test \
--outdir results/baseline_H8train_H4test/plots \
--models lr rf xgb mlp
Tip: If you exported Harmony to obsm['X_harmony'], you can train on it:
python SCAML/scaml_train.py \
--adata data/ipsc_mg_all.h5ad \
--label-key label_ref \
--split-key harvest --split-train H4 --split-test H8 \
--embedding X_harmony \
--models lr rf xgb mlp \
--outdir results/with_harmony_H4train_H8test
Top-level (per run)
summary_metrics.json— per-model summary: accuracy, F1 (macro), ROC-AUC (macro OVR; may beNaNif test has one class)classes.txt— integer ↔︎ label mappingfeatures.txt— HVG list or embedding column names
Per model (e.g., lr/, rf/, xgb/, mlp/)
classification_report.txt— precision/recall/F1 by classconfusion_matrix.csv— rows = true, cols = predictedpredictions.csv— cell, true/pred labels + indicesproba.csv— per-class probabilities (expanded to the full class set)metrics.json— accuracy, F1 (macro), ROC-AUC (macro OVR)
Plots (from scaml_plot.py)
plots/model_compare.png— accuracy & F1 bars per modelplots/<model>_confusion_matrix.png— heatmaps (row-normalized)
- LR —
LogisticRegression(withMaxAbsScaler) - RF —
RandomForestClassifier - XGB —
XGBClassifier(auto-remaps labels when train/test class sets differ) - MLP —
MLPClassifier(dense path with standardization)
-
HVGs vs counts: The HVG selector is Seurat-style. For cleanest results, put raw counts in
adata.X; SCAML applieslog1pon the selected genes. If youradata.Xis already log data, consider using--embeddingto avoid HVG warnings. -
ROC-AUC showing
NaN: This is expected if your test split contains only one class. Accuracy/F1 are still valid. -
SeuratDisk tiny
.h5ad: Ifh5adis ~800 bytes or missing groups (obs,var, etc.), re-export with the Ranndataapproach above. -
Group-aware splits: Prefer
--split-key culture(or donor/patient) for fair generalization tests; avoid random cell splits.
Local:
pip3 install flake8 pytest
flake8 .
pytest
GitHub Actions (name: SCAML) runs:
- Lint:
flake8(syntax/complexity) - Tests:
pyteston Python 3.10
- Built on top of your Seurat v5 pipeline (SCT, optional Harmony, label transfer to Olah 2020).
- Uses: Scanpy/AnnData, scikit-learn, (optional) XGBoost, and R packages for export.
SCAML here is being used as a machine learning framework for transcriptomic classification — evaluating how well expression patterns from induced pluripotent stem cell (iPSC)-derived microglia at one differentiation stage (harvest) can predict or distinguish those from another.
- H4 (Harvest 4) = Early differentiation stage.
- H8 (Harvest 8) = Later, more mature differentiation stage.
- Each has three biological replicates, and models are trained on one harvest and tested on the other to measure cross-harvest generalization — i.e., whether transcriptomic signatures of maturation are reproducible and not chip-specific.
Analysis of SCAML H4-trained, H8-tested model results from summary_metrics.json:
| Algorithm | Accuracy | F1-macro | ROC-AUC (macro, OVR) | Interpretation |
|---|---|---|---|---|
| Logistic Regression (lr) | 0.764 | 0.509 | 0.898 | Strong linear baseline; balanced precision/recall; good generalization. |
| Random Forest (rf) | 0.731 | 0.363 | 0.900 | Slightly lower accuracy and poor F1 → likely class imbalance issue. |
| XGBoost (xgb) | 0.771 | 0.544 | 0.901 | Best performer overall — highest accuracy, F1, and AUC; robust ensemble learner. |
| MLP (mlp) | 0.745 | 0.389 | 0.871 | Decent but underperforms on F1 and AUC; possibly undertrained or overfitting. |
Cross-harvest generalization
-
Best Algorithm: ✅ XGBoost
- Highest accuracy (77.1%), F1-macro (0.54), and ROC-AUC (0.90) → excellent balance between precision and recall.
- Performs slightly better than logistic regression, meaning nonlinear relationships in your SCAML data are important.
-
Random Forest shows good AUC but poor F1 → indicates it's predicting the majority class too often (class imbalance sensitivity).
-
MLP (neural net) underperforms, suggesting either limited data, suboptimal architecture, or insufficient epochs.
- Confirm class distribution — the low F1 for RF/MLP hints at imbalance; consider class weighting or SMOTE.
- Inspect confusion matrices for XGB vs LR to see which cluster classes are misclassified.
- Cross-validate XGB on multiple train/test splits (not just H4→H8) to check consistency across harvests/chips.
- Model explainability: use SHAP to interpret XGB feature contributions — useful for biological relevance in transcript curation.
Conclusion:
For the SCAML dataset (H4-trained, H8-tested), XGBoost is the most effective algorithm — it achieves the best balance of accuracy, class-balanced F1, and overall discrimination (ROC-AUC ≈ 0.90).
Analysis of SCAML run where models were trained on H8 and tested on H4:
| Algorithm | Accuracy | F1-macro | ROC-AUC (macro OVR) | Interpretation |
|---|---|---|---|---|
| Logistic Regression (lr) | 0.752 | 0.423 | 0.904 | Strong baseline; high AUC; slightly reduced F1 due to imbalance. |
| Random Forest (rf) | 0.744 | 0.350 | 0.899 | Similar to before — AUC fine, but poor recall on minority classes. |
| XGBoost (xgb) | 0.779 | 0.571 | 0.905 | Best across all metrics; consistent top performance between H4↔H8 tests. |
| MLP (mlp) | 0.757 | 0.420 | 0.875 | Weaker AUC and F1 — again suggests underfitting or sensitivity to training size. |
| Direction | Best Model | Accuracy | F1-macro | ROC-AUC | Notes |
|---|---|---|---|---|---|
| H4 → H8 | XGBoost | 0.771 | 0.544 | 0.901 | Strong cross-harvest generalization |
| H8 → H4 | XGBoost | 0.779 | 0.571 | 0.905 | Slightly improved in both F1 and AUC |
✅ XGBoost again emerges as the top model — and importantly, it generalizes well in both directions (H4→H8 and H8→H4). This symmetry suggests the learned patterns are stable across harvests/chips, which supports biological rather than purely technical signal.
- Cross-harvest stability: XGBoost achieves AUC ≈ 0.90 in both directions — excellent discriminative power.
- F1-macro ≈ 0.55–0.57: indicates balanced detection of both true/false or multi-class Seurat cluster labels.
- MLP underperformance likely due to limited data or architecture mismatch.
- Logistic regression performs surprisingly well given simplicity → worth keeping as a fast baseline.
- Cross-harvest reproducibility: High AUC values (~0.90) in both directions confirm that transcriptomic maturation signatures of iPSC-microglia are robust to chip and replicate variation.
- F1-macro (~0.54–0.57) shows the model captures both early and late populations with balanced recall.
- XGBoost consistently outperforms simpler (LR) and more complex (MLP) models — suggesting that nonlinear but structured gene-expression relationships dominate differentiation dynamics.
Across both training directions (H4 → H8 and H8 → H4), XGBoost is the most effective algorithm for distinguishing early and late iPSC-microglial transcriptomic states. It achieves the highest accuracy, balanced F1, and ROC-AUC, demonstrating that the maturation-associated gene-expression signatures it learns are biologically meaningful and reproducible across independent harvests and chips.