Accurate stage classification of bladder cancer using gene expression and molecular signatures
stageClassifieR is an R package that predicts bladder cancer stage, distinguishing muscle-invasive bladder cancer (MIBC) from non-muscle-invasive bladder cancer (NMIBC). It applies a random forest classifier trained on gene expression data, molecular subtype classifications, and signature scores, and is intended to support translational research and biomarker discovery.
- Accurate Classification: Predicts MIBC vs NMIBC with high precision using machine learning
- Molecular Integration: Incorporates gene expression and molecular signatures
- LundTaxR Compatible: Seamlessly integrates with LundTaxR molecular subtyping
- Probability Scores: Returns prediction probabilities for confidence assessment
- Flexible Thresholds: Customizable classification thresholds for different use cases
You can install the development version of stageClassifieR from GitHub with:
# install.packages("devtools")
devtools::install_github("mattssca/stageClassifieR")

stageClassifieR requires the following packages:
- LundTaxR: For molecular subtype classification
- dplyr: For data manipulation
- tibble: For data frame operations
- randomForest: For the underlying classification model
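Before loading the package, it can help to confirm that these dependencies are available. The following is a minimal sketch using base R's `requireNamespace()`; the helper name `check_deps` is hypothetical and not part of stageClassifieR.

```r
# Hypothetical helper (not part of stageClassifieR): report which of the
# listed dependencies can be loaded from the current R library.
check_deps <- function(pkgs) {
  vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
}

required <- c("LundTaxR", "dplyr", "tibble", "randomForest")
installed <- check_deps(required)

# Names of packages that still need to be installed
missing_pkgs <- names(installed)[!installed]
print(missing_pkgs)
```

An empty result means every listed dependency was found.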
library(stageClassifieR)
library(LundTaxR)
# First, get molecular classifications from LundTaxR
lundtax_predictions <- classify_samples(expression_data = your_expression_matrix)
# Predict bladder cancer stage
stage_results <- predict_stage(lundtax_predictions)
# View results
head(stage_results)
#>   sample_id prediction probability
#> 1   sample1       mibc        0.85
#> 2   sample2      nmibc        0.23
#> 3   sample3       mibc        0.92
# Summary of predictions
table(stage_results$prediction)
#>  mibc nmibc
#>    45    55

# Select specific samples for prediction
selected_samples <- c("sample1", "sample5", "sample10")
stage_results <- predict_stage(
  these_predictions = lundtax_predictions,
  these_sample_ids = selected_samples
)

# Use your own expression data instead of data from LundTaxR
stage_results <- predict_stage(
  these_predictions = lundtax_predictions,
  expression_data = my_custom_expression_matrix
)

# Use stricter threshold for MIBC classification
stage_results <- predict_stage(
  these_predictions = lundtax_predictions,
  this_threshold = 0.7 # Default is 0.596
)
# More samples will be classified as NMIBC with a higher threshold

Your gene expression data should be formatted as:
- Rows: Genes (with gene symbols as rownames)
- Columns: Samples (with sample IDs as column names)
- Values: Log2-transformed expression values
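The layout described above can be checked before running any classification. This is a rough sketch in base R; `validate_expression` is a hypothetical helper, not a function exported by stageClassifieR or LundTaxR, and the toy values are illustrative only.

```r
# Hypothetical helper (not part of stageClassifieR): check that an object
# follows the genes-by-samples layout described above.
validate_expression <- function(x) {
  stopifnot(
    is.matrix(x) || is.data.frame(x),
    !is.null(rownames(x)),          # gene symbols as row names
    !is.null(colnames(x)),          # sample IDs as column names
    !anyDuplicated(colnames(x)),    # sample IDs must be unique
    all(vapply(as.data.frame(x), is.numeric, logical(1)))
  )
  invisible(TRUE)
}

# Toy matrix: three genes (rows) by two samples (columns)
toy <- matrix(
  c(12.5, 13.2, 8.9, 11.8, 13.0, 9.2),
  nrow = 3,
  dimnames = list(c("ACTB", "GAPDH", "TP53"), c("sample1", "sample2"))
)
validate_expression(toy)
```

The check covers structure only. Whether values are log2-transformed cannot be confirmed from the numbers alone, so that still needs to be verified against how the data were processed.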
# Example expression matrix format
expression_data[1:5, 1:3]
#>       sample1 sample2 sample3
#> ACTB     12.5    11.8    12.1
#> GAPDH    13.2    13.0    13.4
#> TP53      8.9     9.2     8.7
#> BRCA1     7.5     7.8     7.2
#> MYC      10.1    10.5     9.8

The predict_stage() function requires a LundTaxR prediction object containing:
- Molecular subtype classifications (5-class system)
- Signature scores (proliferation, progression, molecular grades)
- Optional: expression data
The function returns a data frame with three columns:
| Column | Type | Description |
|---|---|---|
| sample_id | Character | Sample identifier |
| prediction | Character | Stage prediction ("mibc" or "nmibc") |
| probability | Numeric | Probability of MIBC classification (0-1) |
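As one possible way to work with this output, the sketch below builds a mock data frame with the same three columns and ranks samples by how close their probability lies to the default threshold. The `margin` column is an illustrative addition, not something the package returns, and the rows are example values rather than real predictions.

```r
# Mock output mirroring the three-column structure described above;
# the values are illustrative, not real predictions.
stage_results <- data.frame(
  sample_id   = c("sample1", "sample2", "sample3"),
  prediction  = c("mibc", "nmibc", "mibc"),
  probability = c(0.85, 0.23, 0.92),
  stringsAsFactors = FALSE
)

# Distance from the default threshold as a rough indicator of how
# borderline a call is (assumption: smaller margin = less clear-cut)
default_threshold <- 0.596
stage_results$margin <- abs(stage_results$probability - default_threshold)

# Samples closest to the threshold first
stage_results[order(stage_results$margin), ]
```

Samples near the top of this ordering are the ones whose label would flip first if the threshold were moved.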
- Model Type: Random Forest classifier
- Training Features:
  - Top predictive genes from expression data
  - Molecular subtype classifications (5-class)
  - Signature scores (proliferation, progression, molecular grades)
- Default Threshold: 0.596 (optimized for balanced accuracy)
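To illustrate how a threshold turns a probability into a label, here is a minimal sketch. It assumes that a probability at or above the threshold is called MIBC; `apply_threshold` is a hypothetical stand-in, and the package's internal logic may differ.

```r
# Hypothetical re-implementation of the thresholding step, for illustration
# only; this is not the package's own code.
apply_threshold <- function(probability, threshold = 0.596) {
  ifelse(probability >= threshold, "mibc", "nmibc")
}

probs <- c(0.85, 0.23, 0.65)

default_calls <- apply_threshold(probs)                   # default threshold
strict_calls  <- apply_threshold(probs, threshold = 0.7)  # stricter threshold

print(default_calls)
print(strict_calls)
```

With these example probabilities, the sample at 0.65 is called MIBC under the default threshold and NMIBC under the stricter one, while the other two calls are unchanged.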
The classifier was trained and validated on Uroscanseq data:
- Training Set: 200 samples (100 NMIBC and 100 MIBC)
- Accuracy: 87%
- Sensitivity: 78% (MIBC detection)
- Specificity: 87% (NMIBC detection)
- Balanced Accuracy: 83%
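Balanced accuracy is conventionally the mean of sensitivity and specificity. The short calculation below applies that standard definition to the reported figures; it is a worked check, not the package's evaluation code.

```r
# Reported metrics from the list above
sensitivity <- 0.78  # MIBC detection
specificity <- 0.87  # NMIBC detection

# Balanced accuracy is the mean of sensitivity and specificity
balanced_accuracy <- (sensitivity + specificity) / 2
print(balanced_accuracy)
```

The result is 0.825, consistent with the reported 83% after rounding.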
MIBC (muscle-invasive bladder cancer):
- More aggressive cancer type
- Typically requires radical treatment
- High probability scores (≥0.596) suggest MIBC
NMIBC (non-muscle-invasive bladder cancer):
- Less aggressive cancer type
- Often managed with conservative treatment
- Low probability scores (<0.596) suggest NMIBC
Error: Missing LundTaxR predictions
# Make sure to run LundTaxR classification first
lundtax_predictions <- LundTaxR::classify_samples(your_data)
stage_results <- predict_stage(lundtax_predictions)

Error: Sample ID mismatch
# Ensure sample IDs match between expression data and predictions
colnames(expression_data) <- make.names(colnames(expression_data))

Warning: Missing features
# Missing features are automatically imputed using training data means
# This warning is informational; predictions are still returned

If you use stageClassifieR in your research, please cite:
Adam Mattsson et al. (2025). stageClassifieR: Machine Learning-Based
Bladder Cancer Stage Classification. R package version 0.1.0.
https://github.com/mattssca/stageClassifieR
- LundTaxR: Molecular subtype classification for bladder cancer
- randomForest: Random forest implementation
- dplyr: Data manipulation tools
Contributions are welcome! Please see our Contributing Guide for details.
# Clone the repository (shell)
git clone https://github.com/mattssca/stageClassifieR.git
cd stageClassifieR
# The remaining commands run in an R session started in the package directory
# Install development dependencies
devtools::install_dev_deps()
# Run tests
devtools::test()
# Check package
devtools::check()

This project is licensed under the MIT License - see the LICENSE file for details.
- Bug reports: GitHub Issues
- Questions: GitHub Discussions
- Email: adam.mattsson@med.lu.se
Developed by Adam Mattsson | 2025
