This is the machine learning pipeline and web service of the research:
If you use this pipeline, please cite :
Wang, YC., Wu, PH., Ting, WC. et al. Single-cell signaling network profiling during redox stress reveals dynamic redox regulation in immune cells. Nat Commun 16, 5600 (2025). https://doi.org/10.1038/s41467-025-60727-z
To use the pipeline, please follow the steps below.
conda create -n ros_ml_env python=3.9.5
conda activate ros_ml_envpip install numpy==1.26.3
pip install pandas==1.5.3
pip install scipy==1.11.4
pip install scikit-learn==1.2.1
pip install xgboost==1.6.2
pip install Catboost==1.1
pip install traitlets==5.9.0
pip install ipywidgets==8.0.2
pip install scikit-plot
pip install mljar-scikit-plot
pip install ipython
pip install shap
pip install tqdm
pip install seabornOur tool 'ml_classifier.py' was a command-line tool. If you want to use its web service, please install sections 2.1 and 2.2.
pip install flask==3.1.0
pip install flask-cors==5.0.0
pip install celery==5.4.0
pip install redis==5.2.1
pip install amqp==5.1.0
pip install flask-socketio==5.5.0
pip install sqlitedictFollow the commands to download data or vist here or here .
To prepare data of the type predict ,use the following commandands :
mkdir data_type_predict
cd data_type_predict
curl -L -o raw_data_norm.zip "https://www.dropbox.com/scl/fi/d4fhf4r7ilc69tt4vd92b/data_type_predict_raw_data_norm.zip?rlkey=og54o827spe8d8ik9pq2t8xya&st=iveyauxg&dl=1" && echo "Download Complete"
unzip raw_data_norm.zipor simply
chmod +x make_data_type_predict.sh./make_data_type_predict.shThe path of './data_type_predict' should be like:
./data_type_predict
├── raw_data_norm
│  ├── donor_0_norm.csv
│  ├── donor_1_norm.csv
│  ├── donor_2_norm.csv
│  ├── donor_3_norm.csv
│  ├── donor_4_norm.csv
│  ├── donor_5_norm.csv
│  ├── donor_6_norm.csv
│  ├── donor_7_norm.csv
│  ├── donor_8_norm.csv
│  └── donor_9_norm.csvthan create all combinations of 8 donors for training and 2 donors for testing from a 10 donors dataset by run :
cd source
python data_type_predict_preprocess.pythen will get:
./data_type_predict
│
├── raw_data_norm
│  ├── donor_0_norm.csv
│  ├── donor_1_norm.csv
│  ├── donor_2_norm.csv
│  ├── donor_3_norm.csv
│  ├── donor_4_norm.csv
│  ├── donor_5_norm.csv
│  ├── donor_6_norm.csv
│  ├── donor_7_norm.csv
│  ├── donor_8_norm.csv
│  └── donor_9_norm.csv
└── train_data
├── 0_1
│  ├── test_0_1.csv
│  └── train_0_1.csv
├── 0_2
...
To prepare data of the time predict ,use the following commandands :
mkdir data_time_predict
cd data_time_predict
curl -L -o raw_data_norm.zip "https://www.dropbox.com/scl/fi/1nd992lyalrxadopg3dhw/data_time_predict_raw_data_norm.zip?rlkey=l6cydbfgyvwidgj277nqfsxfk&st=71b1h5h4&dl=1" && echo "Download Complete"
unzip raw_data_norm.zip
curl -L -o test_data.zip "https://www.dropbox.com/scl/fi/vu56ntoi4dku1gzk5r4kw/data_time_predict_test_data.zip?rlkey=iqiv10s64ejw9q6rim6edvz4t&e=1&st=3mtt67rd&dl=1" && echo "Download Complete"
unzip test_data.zipor simply
chmod +x make_data_time_predict.sh./make_data_time_predict.shThe path of './data_time_predict' should be like:
./data_time_predict
├── raw_data_norm
│  ├── In_vitro_1st.csv
│  ├── In_vitro_2nd.csv
│  └── In_vitro_3rd.csv
├── test_data
│  ├── 2023_01_09_MC38_1st_norm.csv
│  ├── 2023_01_09_MC38_2nd_norm.csv
│  └── 2023_01_09_MC38_3rd_norm.csvWe need to combine 3 Invitro OT-1 CD8 T cells ROS markers dataset as training data by run :
cd source
python data_time_predict_preprocess.pythen will get:
./data_time_predict
├── raw_data_norm
│  ├── In_vitro_1st.csv
│  ├── In_vitro_2nd.csv
│  └── In_vitro_3rd.csv
├── test_data
│  ├── 2023_01_09_MC38_1st_norm.csv
│  ├── 2023_01_09_MC38_2nd_norm.csv
│  └── 2023_01_09_MC38_3rd_norm.csv
└── train_data
└── MC38_In_vitro.csv
read panel table (csv) and output selected panel genes.
usage: panel_reader.py [-h] -i INPUT_CSV -s SELECT_PANEL [SELECT_PANEL ...]
options:
-h, --help show this help message and exit
-i INPUT_CSV, --input INPUT_CSV
dataset (csv) file path.
-s SELECT_PANEL [SELECT_PANEL ...], --select_panel SELECT_PANEL [SELECT_PANEL ...]
for example
python panel_reader.py -i MC38_panel.csv -s ROS signalingwill output :
NNT KEAP1 HSP70 PCYXL PRDX4 SOD2 53bp1 GPX4 NRF2 PDI HO1 MTH1 OLR1 CD36 catalase ACOX3 AQP8 Ref_APE ERO1B QSOX1 oxPTP oxDJ1 p53 CD163 mTOR pS6 HIF1a p38MAPK pNFkB pERK c_JunThis program is a wrapper of Python's common machine learning classifier, which trains models and predicts. We used it to work on our ROS prediction task.
usage: ml_classifier.py [-h] -i INPUT_CSV [-ix INPUT_INDEX_COL] -o
OUTPUT_FOLDER -f FEATURES [FEATURES ...] [-y Y_NAME]
-mod {prediction,training} [-m MODEL_JSON]
[-s RANDOM_SEED] [-no_encode] [-p] [-c CORE_USED]
[-ts TEST_SIZE] [-fi {SHAP,Permutation}]
[-pre_m PRETRAINED_MODEL_PATH]
[-pre_out_col OUTPUT_PREDICT_COL_NAME]
optional arguments:
-h, --help show this help message and exit
-i INPUT_CSV, --input INPUT_CSV
dataset (csv) file path.
-ix INPUT_INDEX_COL, --input_index_col INPUT_INDEX_COL
input dataset (csv) index col number.
-o OUTPUT_FOLDER, --output OUTPUT_FOLDER
output folder path where results will be saved.
-f FEATURES [FEATURES ...], --features FEATURES [FEATURES ...]
A list of features for train and prediction
-y Y_NAME, --y_name Y_NAME
column to predict (defult is 'None')
-mod {prediction,training}, --run_mode {prediction,training}
run with training mode or prediction mode . defult is
training mode
-m MODEL_JSON, --model MODEL_JSON
machine learning model config (json) file path or json
array in training mode.
-s RANDOM_SEED, --random_seed RANDOM_SEED
Random seed for reproducibility in training default is
666 (use 'auto' for randomly set 'random seeds' by
system time).
-no_encode, --no_label_encoding
set this flag to training model without label-encoding
in training mode.
-p, --parallel parallel run train and evaluation mmachine models
-c CORE_USED, --core CORE_USED
parallel procress number in training mode.
-ts TEST_SIZE, --test_size TEST_SIZE
size (percentage) of test size (validation data size)
in training mode. defult=0.2
-fi {SHAP,Permutation}, --feature_importance {SHAP,Permutation}
use SHAP or Permutation to explain feature importance
-pre_m PRETRAINED_MODEL_PATH, --pretrained_model_path PRETRAINED_MODEL_PATH
load pretrained model from path only for prediction
mode.
-pre_out_col OUTPUT_PREDICT_COL_NAME, --output_predict_col_name OUTPUT_PREDICT_COL_NAME
output predict col name to save for prediction mode.
And the -m should input a JSON format file include with the following key-values paire:
| KEY | VALUES |
|---|---|
| name | classifier name |
| model | classifier class Name |
| module | classifier module Name |
| params | classifier params |
[
{
"name": <classifier name>,
"model": <classifier class Name>,
"module": <classifier module Name>,
"params": {
<classifier parms key > : <classifier parms value>,
...
}
}
,...
]example 'ml_models_catboost.json':
[
{
"name": "CatBoostClassifier",
"model": "CatBoostClassifier",
"module": "catboost",
"params": {
"loss_function": "MultiClass",
"custom_metric": ["Accuracy"],
"random_seed": 666,
"logging_level": "Verbose",
"eval_metric": "TotalF1",
"use_best_model": true,
"iterations": 1500,
"depth": 8
}
}
]for example in command line type the following commandss to train a model of iris data set :
python ml_classifier.py -i ../test/iris.csv -o ../test/output -f sepal_length sepal_width petal_length petal_width -y species -mod training -no_encode -m '[ { "name": "Catboost", "model": "CatBoostClassifier", "module": "catboost", "params": { "loss_function": "MultiClass", "custom_metric": ["Accuracy"], "random_seed": 666, "logging_level": "Verbose", "eval_metric": "TotalF1", "use_best_model": true, "iterations": 100, "depth": 10 } } ]'or for our project
python ml_classifier.py -mod training -o ../test/output -i ../data_time_predict/train_data/MC38_In_vitro.csv -y day -m '[ { "name": "Catboost", "model": "CatBoostClassifier", "module": "catboost", "params": { "loss_function": "MultiClass", "custom_metric": ["Accuracy"], "random_seed": 666, "logging_level": "Verbose", "eval_metric": "TotalF1", "use_best_model": true, "iterations": 1500, "depth": 8 } } ]' -f NNT KEAP1 HSP70 PCYXL PRDX4 SOD2 53bp1 GPX4 NRF2 PDI HO1 MTH1 OLR1 CD36 catalase ACOX3 AQP8 Ref_APE ERO1B QSOX1 oxPTP oxDJ1 p53 CD163 mTOR pS6 HIF1a p38MAPK pNFkB pERK c_Junor use JSON config file as model setting:
python ml_classifier.py -mod training -o ../test/output -i ../data_time_predict/train_data/MC38_In_vitro.csv -y day -m ml_models_catboost.json -f NNT KEAP1 HSP70 PCYXL PRDX4 SOD2 53bp1 GPX4 NRF2 PDI HO1 MTH1 OLR1 CD36 catalase ACOX3 AQP8 Ref_APE ERO1B QSOX1 oxPTP oxDJ1 p53 CD163 mTOR pS6 HIF1a p38MAPK pNFkB pERK c_JunThis script will automatic run ros day prediction procress and output to ../data_time_predict/ml_models
set the execute bit :
chmod +x run_day_predict_ml.shand run :
./run_day_predict_ml.sh This script will automatic run ros cell-type prediction procress and output to ../data_type_predict/ml_models
set the execute bit :
chmod +x run_type_predict_ml.shand run :
./run_type_predict_ml.sh python run_accu_report.py -i ../data_time_predict/ml_models -o ../data_time_predict/ml_models/ml_performance.csvor run with script:
chmod +x run_data_time_predict_run_accu_report.sh./run_data_time_predict_run_accu_report.shIn our project of time prediction , we got the best model by ued CatBoostClassifier :
(ml_performance.csv)
| 0 | 1 | 2 | 3 | 4 | 5 | accuracy | macro avg | weighted avg | |
|---|---|---|---|---|---|---|---|---|---|
| CatBoostClassifier_f1-score | 0.9805 | 0.8055 | 0.6882 | 0.6744 | 0.6961 | 0.6008 | 0.7327 | 0.7409 | 0.7320 |
| Xgboost_f1-score | 0.9773 | 0.7854 | 0.6806 | 0.6660 | 0.6922 | 0.5935 | 0.7256 | 0.7325 | 0.7248 |
| Random Forest_f1-score | 0.9666 | 0.7023 | 0.6246 | 0.6227 | 0.6752 | 0.5347 | 0.6862 | 0.6877 | 0.6842 |
| Neural Net_f1-score | 0.9602 | 0.7230 | 0.6299 | 0.5865 | 0.6454 | 0.5322 | 0.6728 | 0.6795 | 0.6704 |
| QDA_f1-score | 0.9409 | 0.6749 | 0.6139 | 0.6218 | 0.6277 | 0.5171 | 0.6668 | 0.6661 | 0.6659 |
| AdaBoost_f1-score | 0.9585 | 0.6429 | 0.6003 | 0.5496 | 0.6169 | 0.4976 | 0.6418 | 0.6443 | 0.6398 |
| Naive Bayes_f1-score | 0.9332 | 0.5693 | 0.5522 | 0.5565 | 0.5539 | 0.4901 | 0.6121 | 0.6092 | 0.6101 |
| Decision Tree_f1-score | 0.9473 | 0.5583 | 0.5783 | 0.5370 | 0.5869 | 0.4091 | 0.6136 | 0.6028 | 0.6108 |
| KNN_f1-score | 0.9562 | 0.6450 | 0.5449 | 0.5213 | 0.5409 | 0.3584 | 0.5992 | 0.5944 | 0.5929 |
To compute the accurancy statistics of the cell type prediction can run as :
python model_stat_train.pypython model_stat_test.pyor run by script:
./run_model_stat.shand will get
Train Accuracy Stats:
mean:
precision recall f1-score support
Bcell 0.6503 0.3391 0.4454 4095.5556
NKcell 0.7005 0.6465 0.6723 9104.8889
Tcell 0.8194 0.9011 0.8583 29158.4444
accuracy 0.9355 0.9355 0.9355 0.9355
basophil 0.7596 0.5848 0.6607 1516.2889
macro avg 0.8156 0.7392 0.7668 164895.5778
monocyte 0.9729 0.9717 0.9723 12660.6222
neutrophil 0.9912 0.9920 0.9916 108359.7778
weighted avg 0.9328 0.9355 0.9325 164895.5778
std:
precision recall f1-score support
Bcell 0.0103 0.0230 0.0214 407.7512
NKcell 0.0082 0.0217 0.0144 1145.2200
Tcell 0.0076 0.0092 0.0079 2027.9351
accuracy 0.0036 0.0036 0.0036 0.0036
basophil 0.0110 0.0239 0.0177 135.8217
macro avg 0.0026 0.0068 0.0052 8263.2131
monocyte 0.0017 0.0031 0.0021 1275.2812
neutrophil 0.0008 0.0005 0.0006 5168.4164
weighted avg 0.0036 0.0036 0.0036 8263.2131
Test Accuracy Stats:
mean:
precision recall f1-score support
Bcell 0.6112 0.3240 0.4185 5088.4000
NKcell 0.6592 0.6297 0.6338 11388.0000
Tcell 0.8149 0.8977 0.8528 36405.2000
accuracy 0.9326 0.9326 0.9326 0.9326
basophil 0.7162 0.5408 0.6075 1903.0000
macro avg 0.7932 0.7252 0.7452 206119.0000
monocyte 0.9675 0.9677 0.9676 15819.4000
neutrophil 0.9904 0.9912 0.9908 135515.0000
weighted avg 0.9319 0.9326 0.9297 206119.0000
std:
precision recall f1-score support
Bcell 0.0893 0.0549 0.0530 2008.1602
NKcell 0.1169 0.0610 0.0600 5690.6970
Tcell 0.0560 0.0223 0.0277 10142.7066
accuracy 0.0143 0.0143 0.0143 0.0143
basophil 0.1004 0.0913 0.0638 651.5431
macro avg 0.0315 0.0159 0.0178 41316.0854
monocyte 0.0141 0.0090 0.0101 6355.2965
neutrophil 0.0018 0.0035 0.0019 25640.8262
weighted avg 0.0150 0.0143 0.0157 41316.0854The low standard deviation showed the machine learning model's generalization ability between different donors (persons) when we used the ROS markers to train and predict cell type. And can Ignore differences between individuals.
Fig. 1i and Fig. 2i was from train_8_9_output and test_8_9_output .
To validate our prediction we use the following program to plot the distribution with ground truth labeling and the predicted labeling.
usage: plot_distribution.py [-h] -i INPUT_CSV -o OUTPUT_FOLDER -pre_out_col
OUTPUT_PREDICT_COL_NAME -g GROUP
optional arguments:
-h, --help show this help message and exit
-i INPUT_CSV, --input INPUT_CSV
dataset (csv) file with prediction.
-o OUTPUT_FOLDER, --output OUTPUT_FOLDER
output folder path.
-pre_out_col OUTPUT_PREDICT_COL_NAME, --output_predict_col_name OUTPUT_PREDICT_COL_NAME
output_predict_col_name
-g GROUP, --group GROUP
feature for group to plot , must be strWe plot the bar plot and the KDE plot of date time prediction as the following commands:
python plot_distribution.py -i ../data_time_predict/ml_predict/2023_01_09_MC38_1st_norm_add_predict.csv -o ../data_time_predict/ml_predict/2023_01_09_MC38_1st_norm -pre_out_col ros_predict_day -g group
python plot_distribution.py -i ../data_time_predict/ml_predict/2023_01_09_MC38_2nd_norm_add_predict.csv -o ../data_time_predict/ml_predict/2023_01_09_MC38_2nd_norm -pre_out_col ros_predict_day -g group
python plot_distribution.py -i ../data_time_predict/ml_predict/2023_01_09_MC38_3rd_norm_add_predict.csv -o ../data_time_predict/ml_predict/2023_01_09_MC38_3rd_norm -pre_out_col ros_predict_day -g groupor run by :
chmod +x run_data_time_predict_plot.sh./run_data_time_predict_plot.shFig. 2i was the result of combining those three results.
We also validate our prediction by plotting the mean values of the exhaustion markers from each predicted date that were from the ROS marker.
To make the plot from features , we used this program:
usage: plot_feature_mean_by_group.py [-h] -i INPUT_CSV -o OUTPUT_FOLDER -g
GROUP -f FEATURES [FEATURES ...]
optional arguments:
-h, --help show this help message and exit
-i INPUT_CSV, --input INPUT_CSV
dataset (csv) file with prediction.
-o OUTPUT_FOLDER, --output OUTPUT_FOLDER
output folder path.
-g GROUP, --group GROUP
feature for group to plot , must be str
-f FEATURES [FEATURES ...], --features FEATURES [FEATURES ...]
A list of features for observewe can plot the figure with:
python plot_feature_mean_by_group.py -i ../data_time_predict/ml_predict/2023_01_09_MC38_1st_norm_add_predict.csv -o ../data_time_predict/ml_predict/2023_01_09_MC38_1st_norm -g ros_predict_day -f TCF1_7 TOX PD1 LAG3 TIM3
python plot_feature_mean_by_group.py -i ../data_time_predict/ml_predict/2023_01_09_MC38_2nd_norm_add_predict.csv -o ../data_time_predict/ml_predict/2023_01_09_MC38_2nd_norm -g ros_predict_day -f TCF1_7 TOX PD1 LAG3 TIM3
python plot_feature_mean_by_group.py -i ../data_time_predict/ml_predict/2023_01_09_MC38_3rd_norm_add_predict.csv -o ../data_time_predict/ml_predict/2023_01_09_MC38_3rd_norm -g ros_predict_day -f TCF1_7 TOX PD1 LAG3 TIM3
or run with the script:
chmod +x run_plot_exh.sh./run_plot_exh.shFig. 2j was the result of combining those three results.
The whole training dataset and the machine learning models result can be download from here:
5.1 cell type prediction data_type_predict (7GB)
5.2 date prediction data_time_predict (220 MB)
We also created a web application as GUI for training machine learning models for classification. To use it
step1. cd to path :
cd ./web_appstep2. run the Celery queue:
celery -A tasks.celery worker --loglevel=info --concurrency=4step3. run run Flask api server :
python server.pythan will start the server at http://127.0.0.1:5000
step4. and vist index.html to use it.
The detail of web_app usage can be downlad at here
