Streamlit multipage app combining SBERT medical chatbot, biomarker disease classification, and survival prediction with SHAP explainability. Ships with datasets and pretrained models for non-diagnostic exploration.

r-siddiq/RoboDoc

RoboDoc: Clinical Chatbot & Predictive Analytics (Streamlit + scikit-learn + SBERT)

Python Streamlit scikit-learn XGBoost LightGBM CatBoost SHAP SentenceTransformers spaCy NLTK

⚠️ Disclaimer: RoboDoc is a technical demo meant for education and exploration. Not medical advice.


Overview

Patients often struggle to interpret symptoms and lab results, which leads to delays or reliance on unreliable sources. RoboDoc addresses this with a ready-to-run multipage Streamlit app; the datasets and pre-trained model artifacts it needs are included in the repository. It combines:

  • 🩺 Retrieval-augmented medical chatbot: SBERT embeddings over patient symptom descriptions with matched doctor responses to simulate a consultation.
  • 🧪 Disease prediction from blood biomarkers: classic ML (Random Forest, Logistic Regression, Naive Bayes, Decision Tree; CatBoost/XGBoost available) with robust preprocessing and explainability.
  • 🧬 Survival prediction: 5/10/15-year survival regression using RF, SVR, XGBoost, LightGBM, CatBoost, and a Stacking ensemble, evaluated via MAE/RMSE/R².
  • 📈 Analytics & explainability: correlation filtering, SHAP-based feature importance, ROC, confusion matrices, interactive visualizations (Plotly).

Data Sources & Selection

1) Medical Chatbot

  • Primary Source: Sohaibsoussi/patient_doctor_chatbot
  • Scale: ~154,150 real-world patient–doctor dialogues (symptoms, diagnoses, treatment suggestions).
  • Use: Predict diagnostic clusters from free-text patient descriptions and surface matching doctor advice.
  • Note on Size: This repo uses a trimmed subset for compute efficiency; the full dataset is referenced above.

2) Medical Disease Prediction

  • Source: Kaggle – Multiple Disease Prediction
    Files included in repo: data/Blood_samples_dataset_balanced_2.csv, data/blood_samples_dataset_test.csv
  • Use: Predict underlying disease/condition from blood markers (e.g., hemoglobin, WBC/RBC, etc.).

3) Survival Prediction

  • Curated From (references):
    • Survival and relapse in TTP (PubMed: 20032506)
    • Improved survival in diabetes 1980–2004 (Sweden) (PMCID: PMC2586621)
    • Survival in beta-thalassemia (PMCID: PMC6335498)
    • Aplastic anemia outcomes (Sweden 2000–2011) (PubMed: 28751565)
  • Files included: data/survival_data.csv (+ biomarker files above)
  • Use: Estimate 5/10/15-year survival probabilities by disease type and demographics (Gender, Age_Group).

✅ As-shipped: All required CSVs are present under data/. Pre-trained model artifacts are under models/ (and mirrored to models/models/ where referenced). The app runs out-of-the-box after pip install -r requirements.txt.


Data Processing

Text (Chatbot)

  • Cleaning: Lowercasing; removal of prefixes like "Description:" and "Q."; punctuation/stopword filtering.
  • NER: spaCy (en_core_web_sm) to extract symptom entities when available; otherwise falls back to cleaned text.
  • Embeddings: Sentence-BERT (pritamdeka/S-PubMedBERT-MS-MARCO).
  • Labeling: Diagnosis/cluster labels aligned to each description.
  • Splits: Balanced train/test (when training from scratch).
  • Retrieval: Cosine similarity over precomputed description embeddings.
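The cleaning step above can be sketched with the standard library alone. This is a minimal illustration, not the repository's code: `clean_text`, `PREFIX_RE`, and the tiny `STOPWORDS` set are hypothetical names introduced here, and the stopword set is a stand-in for the much larger NLTK list the project downloads.

```python
import re
import string

# Tiny stand-in stopword set (assumption); the project lists NLTK's
# stopword corpus, which is far larger.
STOPWORDS = {"a", "an", "and", "for", "i", "have", "of", "the", "with"}

# Leading labels such as "Description:" or "Q." (this pattern is an assumption).
PREFIX_RE = re.compile(r"^\s*(description:|q\.)\s*", re.IGNORECASE)


def clean_text(raw: str) -> str:
    """Drop a leading label, lowercase, then strip punctuation and stopwords."""
    text = PREFIX_RE.sub("", raw).lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(tok for tok in text.split() if tok not in STOPWORDS)


print(clean_text("Q. I have a severe headache, with nausea for 3 days!"))
# prints: severe headache nausea 3 days
```

The cleaned string is what would then be passed to NER and the SBERT encoder.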

Biomarkers (Disease & Survival)

  • Missing Values: Median imputation for numerical biomarkers and patient characteristics.
  • Standardization/Normalization: StandardScaler (and optional MinMax in visuals).
  • Feature Selection: SHAP-based importance + Recursive Feature Elimination (RFE) workflows available.
  • Splitting: Stratified/balanced splits for disease classification; appropriate splitting for survival regression.
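As a rough illustration of the first two steps, the sketch below applies median imputation and then standardization to a single column using only the standard library. The helper names and the sample values are made up for the example; the app itself lists scikit-learn's StandardScaler, which uses the same population standard deviation.

```python
from statistics import fmean, median, pstdev


def impute_median(column):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in column]


def standardize(column):
    """Scale to zero mean and unit variance (population standard deviation)."""
    mu = fmean(column)
    sigma = pstdev(column)
    if sigma == 0:  # constant column: every centered value is 0
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]


hemoglobin = [13.5, None, 12.0, 15.0, None, 14.5]  # made-up values
filled = impute_median(hemoglobin)
scaled = standardize(filled)
print(filled)  # [13.5, 14.0, 12.0, 15.0, 14.0, 14.5]
print([round(v, 3) for v in scaled])
```

In a real train/test workflow the median, mean, and standard deviation would be computed on the training split only and reused for the test split.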

Methods, Technologies & Tools

  • Languages & Environments: Python (Jupyter/Colab/Anaconda/Spyder supported)
  • Core Libraries: pandas, numpy, scikit-learn, joblib, plotly, matplotlib, seaborn, shap
  • NLP: sentence-transformers (SBERT), spacy, nltk
  • Gradient Boosting: xgboost, lightgbm, catboost
  • Optional Frameworks (experimentation paths): TensorFlow, PyTorch
  • App/UI: streamlit (multipage architecture)
  • Search/Retrieval: Cosine similarity over SBERT embeddings
  • Tuning & Validation: GridSearchCV, classic metrics for classification; MAE/RMSE/R² for regression

Project Structure

RoboDoc/
├── app.py                      # Medical Chatbot (main Streamlit entry)
├── pages/
│   ├── accuracy_comparison.py  # Train & compare classifiers interactively
│   ├── advanced_visuals.py     # Correlations, scaling, SHAP explainability
│   ├── diseases.py             # Marker distributions & heatmaps
│   ├── disease_prediction.py   # Predict disease from biomarkers (uses saved models)
│   ├── survival accuracy comparison.py  # Compare survival models (MAE/RMSE/R²)
│   └── survival_prediction.py  # Predict 5/10/15-year survival
├── models/
│   ├── train_models.py         # Train classifiers + scaler (optional)
│   ├── train_survival_model.py # Train survival regressors + encoder/scaler (optional)
│   ├── data/                   # Saved test splits (e.g., y_test_Survival_*.csv)
│   ├── catboost_info/          # CatBoost logs
│   ├── models/                 # Legacy mirror path for some pages
│   └── *.pkl                   # ✅ Pre-trained artifacts (RF/LR/NB/DT; Survival; encoders; scalers)
├── data/                       # ✅ Included datasets (chatbot, biomarkers, survival)
├── medschool/                  # SBERT embedding cache (e.g., train_desc_embeddings.npy)
├── requirements.txt
└── robodoc.ipynb               # Assembles app.py and can auto-launch Streamlit

Quick Start

1) Prerequisites

  • Python 3.9+

  • Internet (first run only) for model/corpus downloads:

    • spaCy en_core_web_sm
    • NLTK punkt, stopwords, wordnet
    • SentenceTransformer pritamdeka/S-PubMedBERT-MS-MARCO

2) Setup

git clone <this-repo-url>
cd RoboDoc
python -m venv .venv && source .venv/bin/activate    # Windows: .venv\Scripts\activate
pip install -r requirements.txt
python -m spacy download en_core_web_sm
python - <<'PY'
import nltk
for pkg in ("punkt", "stopwords", "wordnet"):
    nltk.download(pkg)
PY

3) Run

Option A, via Notebook (auto-writes & launches app):

  • Open robodoc.ipynb and run all cells. The notebook writes app.py with %%writefile and then launches it with !streamlit run app.py.

Option B, direct:

streamlit run app.py

The app starts with embedded data/models. No retraining needed unless you want to update artifacts.


Pages & Capabilities

🩺 Medical Chatbot (app.py)

  • Flow: Clean → NER (when available) → SBERT embed → cosine retrieve nearest description → surface doctor response + similarity score.
  • Caching: @st.cache_resource for model; @st.cache_data for embeddings and results. Embeddings persist to medschool/*.npy.
  • Inputs: Free-text symptom description.
  • Outputs: Matched case, associated doctor advice, similarity.
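The retrieval step in this flow amounts to a nearest-neighbour lookup by cosine similarity. The sketch below shows the idea with toy 3-dimensional vectors standing in for SBERT embeddings; `cosine_similarity`, `retrieve`, and the sample data are hypothetical names for illustration, and the app presumably runs a vectorized equivalent over the cached .npy array.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, corpus_vecs, responses):
    """Return (best_index, response, similarity) for the closest stored vector."""
    scores = [cosine_similarity(query_vec, vec) for vec in corpus_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, responses[best], scores[best]


# Toy vectors standing in for description embeddings (assumption).
corpus_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]
responses = ["advice A", "advice B", "advice C"]

idx, reply, score = retrieve([0.5, 0.9, 0.0], corpus_vecs, responses)
print(idx, reply, round(score, 3))  # 2 advice C 0.991
```

Returning the score alongside the match lets the UI show how close the retrieved case is, which matters when no stored description is a good fit.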

🧪 Disease Prediction (pages/disease_prediction.py)

  • Artifacts used: random_forest_model.pkl, naive_bayes_model.pkl, logistic_regression_model.pkl, decision_tree_model.pkl, scaler.pkl.
  • Inputs: Interactive biomarker fields (e.g., CBC metrics, HbA1c, CRP, etc.).
  • Outputs: Predicted disease class; uses scaler + selected pre-trained model.

📊 Accuracy Comparison (pages/accuracy_comparison.py)

  • In-app training: RF / NB / LR / DT; train_test_split + StandardScaler.
  • Metrics: Accuracy, Precision, Recall, F1; confusion matrix; sample predictions.
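For reference, the listed metrics can be computed by hand as below. This is a sketch with made-up labels, and it reports precision, recall, and F1 for one class at a time; how the page averages across classes is not stated in this README, so no averaging is assumed here.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall, and F1 treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels for illustration only.
y_true = ["Anemia", "Diabetes", "Healthy", "Anemia", "Diabetes", "Healthy"]
y_pred = ["Anemia", "Diabetes", "Anemia", "Anemia", "Healthy", "Healthy"]

acc = accuracy(y_true, y_pred)
precision, recall, f1 = per_class_metrics(y_true, y_pred, positive="Anemia")
print(round(acc, 3), round(precision, 3), round(recall, 3), round(f1, 3))
# prints: 0.667 0.667 1.0 0.8
```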

🧩 Advanced Visuals (pages/advanced_visuals.py)

  • EDA: Correlation matrix (thresholding), distribution plots.
  • Scaling: Standard vs. MinMax.
  • Explainability: SHAP summary plot for feature impact.
  • Modeling: Quick RF training on selected features.

🧬 Survival Prediction (pages/survival_prediction.py)

  • Horizons: 5, 10, 15 years.
  • Models: RF, Linear Regression, SVR, XGBoost, LightGBM, CatBoost, StackingRegressor (LR meta-learner).
  • Pipeline: Encode (survival_encoder.pkl) → scale (survival_scaler.pkl) → predict with chosen model.
  • Outputs: Predicted survival probability/time metric per horizon.

🧪 Survival Accuracy Comparison (pages/survival accuracy comparison.py)

  • Goal: Side-by-side model evaluation for a selected horizon.
  • Metrics: MAE, RMSE, R² on included test splits (models/data/y_test_Survival_*.csv).
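The three regression metrics are defined as follows. The snippet is a minimal sketch with made-up survival values; `regression_metrics` is a hypothetical helper, and the page itself may rely on scikit-learn's metric functions instead.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2); R^2 is undefined when y_true is constant."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


# Made-up 5-year survival fractions, for illustration only.
y_true = [0.80, 0.60, 0.90, 0.70]
y_pred = [0.75, 0.65, 0.85, 0.80]

mae, rmse, r2 = regression_metrics(y_true, y_pred)
print(round(mae, 4), round(rmse, 4), round(r2, 2))  # 0.0625 0.0661 0.65
```

MAE and RMSE are in the same units as the target, while R² is unitless, so comparing models across horizons is easier with R² and comparing error size within one horizon is easier with MAE or RMSE.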

📚 Diseases (pages/diseases.py)

  • Visuals: Disease-wise marker distributions (violin/box), correlation heatmaps.
  • Data: Uses Blood_samples_dataset_balanced_2.csv and blood_samples_dataset_test.csv.

Training (Optional)

Artifacts are provided. Re-run these only to refresh models.

Classification

python -m models.train_models
mkdir -p models/models && cp models/*.pkl models/models/  # if a page expects the mirrored path

Survival

python -m models.train_survival_model
# Produces *_Survival_{5Y,10Y,15Y}.pkl, survival_encoder.pkl, survival_scaler.pkl

Shared Practices

  • Median imputation for missing values
  • Standardization of numeric features
  • Feature selection: SHAP + RFE workflows
  • Tuning: GridSearchCV (RF/LR), with holdout test evaluation
  • Metrics: Classification (Accuracy/Precision/Recall/F1, ROC); Survival (MAE/RMSE/R²)

Requirements

  • streamlit, pandas, numpy, matplotlib, seaborn, plotly
  • scikit-learn, joblib, shap
  • nltk, spacy, sentence-transformers
  • xgboost, lightgbm, catboost
  • (optional) tensorflow, torch for experimentation
  • (optional) pydrive2, oauth2client, pyngrok

Install:

pip install -r requirements.txt

Responsible Use

  • No real PHI/PII in data.
  • Bias & validation: Verify on representative cohorts before any downstream or clinical use.
  • Safety: This is not a diagnostic device or a substitute for professional care.

Acknowledgments

  • Dataset: Sohaibsoussi/patient_doctor_chatbot (trimmed subset used)
  • Biomarker source: Kaggle Multiple Disease Prediction
  • Survival curation informed by cited literature (TTP, diabetes, thalassemia, aplastic anemia)
  • Embeddings: SentenceTransformers (pritamdeka/S-PubMedBERT-MS-MARCO)
  • Libraries: scikit-learn, XGBoost, LightGBM, CatBoost, SHAP, spaCy, NLTK, Streamlit
