⚠️ Disclaimer: RoboDoc is a technical demo meant for education and exploration. It is not medical advice.
Patients often struggle to interpret symptoms and lab results, which leads to delays or reliance on unreliable sources. RoboDoc addresses this with a ready-to-run multipage Streamlit app whose datasets and pre-trained model artifacts are included in the repository. It combines:
- 🩺 Retrieval-augmented medical chatbot: SBERT embeddings over patient symptom descriptions with matched doctor responses to simulate a consultation.
- 🧪 Disease prediction from blood biomarkers: classic ML (Random Forest, Logistic Regression, Naive Bayes, Decision Tree; CatBoost/XGBoost available) with robust preprocessing and explainability.
- 🧬 Survival prediction: 5/10/15-year survival regression using RF, SVR, XGBoost, LightGBM, CatBoost, and a Stacking ensemble, evaluated via MAE/RMSE/R².
- 📊 Analytics & explainability: correlation filtering, SHAP-based feature importance, ROC, confusion matrices, interactive visualizations (Plotly).
- Primary Source: Sohaibsoussi/patient_doctor_chatbot
- Scale: ~154,150 real-world patient–doctor dialogues (symptoms, diagnoses, treatment suggestions).
- Use: Predict diagnostic clusters from free-text patient descriptions and surface matching doctor advice.
- Note on Size: This repo uses a trimmed subset for compute efficiency; the full dataset is referenced above.
- Source: Kaggle, Multiple Disease Prediction
- Files included in repo: `data/Blood_samples_dataset_balanced_2.csv`, `data/blood_samples_dataset_test.csv`
- Use: Predict underlying disease/condition from blood markers (e.g., hemoglobin, WBC/RBC).
- Curated From (references):
  - Survival and relapse in TTP (PubMed: 20032506)
  - Improved survival in diabetes 1980–2004 (Sweden) (PMCID: PMC2586621)
  - Survival in beta-thalassemia (PMCID: PMC6335498)
  - Aplastic anemia outcomes (Sweden 2000–2011) (PubMed: 28751565)
- Files included: `data/survival_data.csv` (+ biomarker files above)
- Use: Estimate 5/10/15-year survival probabilities by disease type and demographics (Gender, Age_Group).

✅ As-shipped: All required CSVs are present under `data/`. Pre-trained model artifacts are under `models/` (and mirrored to `models/models/` where referenced). The app runs out of the box after `pip install -r requirements.txt`.
- Cleaning: Lowercasing; removal of prefixes like `"Description:"` and `"Q."`; punctuation/stopword filtering.
- NER: spaCy (`en_core_web_sm`) to extract symptom entities when available; otherwise falls back to cleaned text.
- Embeddings: Sentence-BERT (`pritamdeka/S-PubMedBERT-MS-MARCO`).
- Labeling: Diagnosis/cluster labels aligned to each description.
- Splits: Balanced train/test (when training from scratch).
- Retrieval: Cosine similarity over precomputed description embeddings.
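
The cleaning step listed above can be sketched with the standard library alone. This is a minimal illustration, not the repository's code: the function name `clean_description`, the prefix pattern, and the tiny stopword set are assumptions (the app itself relies on NLTK's stopword list).

```python
import re
import string

# Hypothetical, deliberately tiny stopword set; the app relies on NLTK's list.
STOPWORDS = {"i", "a", "an", "the", "and", "have", "for", "my", "is", "of"}

# Leading prefixes such as "Description:" or "Q." (matched case-insensitively).
PREFIX_RE = re.compile(r"^\s*(description:|q\.)\s*", re.IGNORECASE)


def clean_description(text: str) -> str:
    """Strip a known prefix, lowercase, then drop punctuation and stopwords."""
    text = PREFIX_RE.sub("", text).lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(tok for tok in text.split() if tok not in STOPWORDS)


print(clean_description("Q. I have a sore throat, and fever for 3 days!"))
```

The cleaned string is what would be passed on to NER and the SBERT encoder.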
- Missing Values: Median imputation for numerical biomarkers and patient characteristics.
- Standardization/Normalization: `StandardScaler` (and optional MinMax in visuals).
- Feature Selection: SHAP-based importance + Recursive Feature Elimination (RFE) workflows available.
- Splitting: Stratified/balanced splits for disease classification; appropriate splitting for survival regression.
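
To make the imputation and standardization steps concrete, here is a dependency-free sketch. It is an illustration only: the repository uses scikit-learn for this, and the helper names and the sample readings below are made up.

```python
import statistics
from typing import List, Optional


def impute_median(values: List[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def standardize(values: List[float]) -> List[float]:
    """Scale to zero mean and unit variance, using the population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mean) / std for v in values]


hemoglobin = [13.5, None, 11.0, 15.0, None, 12.5]  # made-up readings
filled = impute_median(hemoglobin)
scaled = standardize(filled)
print(filled)
print([round(v, 3) for v in scaled])
```

In a real pipeline the median, mean, and standard deviation would be computed on the training split only and then reused for the test split and for user input.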
- Languages & Environments: Python (Jupyter/Colab/Anaconda/Spyder supported)
- Core Libraries: `pandas`, `numpy`, `scikit-learn`, `joblib`, `plotly`, `matplotlib`, `seaborn`, `shap`
- NLP: `sentence-transformers` (SBERT), `spacy`, `nltk`
- Gradient Boosting: `xgboost`, `lightgbm`, `catboost`
- Optional Frameworks (experimentation paths): TensorFlow, PyTorch
- App/UI: `streamlit` (multipage architecture)
- Search/Retrieval: Cosine similarity over SBERT embeddings
- Tuning & Validation: `GridSearchCV`, classic metrics for classification; MAE/RMSE/R² for regression
RoboDoc/
├── app.py                                # Medical Chatbot (main Streamlit entry)
├── pages/
│   ├── accuracy_comparison.py            # Train & compare classifiers interactively
│   ├── advanced_visuals.py               # Correlations, scaling, SHAP explainability
│   ├── diseases.py                       # Marker distributions & heatmaps
│   ├── disease_prediction.py             # Predict disease from biomarkers (uses saved models)
│   ├── survival accuracy comparison.py   # Compare survival models (MAE/RMSE/R²)
│   └── survival_prediction.py            # Predict 5/10/15-year survival
├── models/
│   ├── train_models.py                   # Train classifiers + scaler (optional)
│   ├── train_survival_model.py           # Train survival regressors + encoder/scaler (optional)
│   ├── data/                             # Saved test splits (e.g., y_test_Survival_*.csv)
│   ├── catboost_info/                    # CatBoost logs
│   ├── models/                           # Legacy mirror path for some pages
│   └── *.pkl                             # ✅ Pre-trained artifacts (RF/LR/NB/DT; Survival; encoders; scalers)
├── data/                                 # ✅ Included datasets (chatbot, biomarkers, survival)
├── medschool/                            # SBERT embedding cache (e.g., train_desc_embeddings.npy)
├── requirements.txt
└── robodoc.ipynb                         # Assembles app.py and can auto-launch Streamlit
- Python 3.9+
- Internet (first run only) for model/corpus downloads:
  - spaCy `en_core_web_sm`
  - NLTK `punkt`, `stopwords`, `wordnet`
  - SentenceTransformer `pritamdeka/S-PubMedBERT-MS-MARCO`
git clone <this-repo-url>
cd RoboDoc
python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
python -m spacy download en_core_web_sm
python - <<'PY'
import nltk; [nltk.download(x) for x in ['punkt','stopwords','wordnet']]
PY

Option A: via notebook (auto-writes and launches the app):
- Open `robodoc.ipynb` and run all cells. It writes `app.py` with `%%writefile` and launches it with `!streamlit run app.py`.

Option B: direct:
streamlit run app.py

The app starts with embedded data/models. No retraining is needed unless you want to update artifacts.
- Flow: Clean → NER (when available) → SBERT embed → cosine retrieve nearest description → surface doctor response + similarity score.
- Caching: `@st.cache_resource` for model; `@st.cache_data` for embeddings and results. Embeddings persist to `medschool/*.npy`.
- Inputs: Free-text symptom description.
- Outputs: Matched case, associated doctor advice, similarity.
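
The retrieval step of the chatbot flow can be sketched without any dependencies. This is a toy illustration under stated assumptions: the function names are hypothetical, the three-dimensional vectors stand in for SBERT embeddings, and the app's own code may be organized differently (for example, vectorized over a NumPy array).

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(
    query_vec: Sequence[float],
    corpus_vecs: List[Sequence[float]],
    responses: List[str],
) -> Tuple[str, float]:
    """Return the response paired with the most similar corpus vector, plus its score."""
    scores = [cosine_similarity(query_vec, vec) for vec in corpus_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return responses[best], scores[best]


# Toy 3-dimensional "embeddings"; real SBERT vectors have hundreds of dimensions.
corpus_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
responses = ["advice A", "advice B", "advice C"]
answer, score = retrieve([0.9, 0.1, 0.0], corpus_vecs, responses)
print(answer, round(score, 3))
```

The returned score is the similarity value the page surfaces next to the matched doctor response.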
- Artifacts used: `random_forest_model.pkl`, `naive_bayes_model.pkl`, `logistic_regression_model.pkl`, `decision_tree_model.pkl`, `scaler.pkl`.
- Inputs: Interactive biomarker fields (e.g., CBC metrics, HbA1c, CRP).
- Outputs: Predicted disease class; uses scaler + selected pre-trained model.
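
For the disease prediction page, the scale-then-predict step might look like the sketch below. The helper names `load_artifacts` and `predict_disease` are hypothetical, and the sketch assumes the artifacts were saved with `joblib` and that inputs are passed in the same column order used at training time.

```python
from pathlib import Path
from typing import Any, Sequence, Tuple


def load_artifacts(
    model_dir: str = "models", model_file: str = "random_forest_model.pkl"
) -> Tuple[Any, Any]:
    """Load the scaler and one classifier (assumes both were saved with joblib)."""
    import joblib  # imported lazily so predict_disease works without it

    base = Path(model_dir)
    return joblib.load(base / "scaler.pkl"), joblib.load(base / model_file)


def predict_disease(scaler: Any, model: Any, biomarkers: Sequence[float]) -> Any:
    """Scale one row of biomarker values, then return the predicted class."""
    row = [list(biomarkers)]  # scikit-learn expects a 2-D array-like
    scaled = scaler.transform(row)
    return model.predict(scaled)[0]
```

A call would then be `scaler, model = load_artifacts()` followed by `predict_disease(scaler, model, values)`, where `values` holds the biomarker readings entered in the form.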
- In-app training: RF / NB / LR / DT; `train_test_split` + `StandardScaler`.
- Metrics: Accuracy, Precision, Recall, F1; confusion matrix; sample predictions.
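
The in-app comparison of the four classifiers can be approximated as follows. This sketch runs on synthetic data from `make_classification`, not on the repository's biomarker CSVs, and the hyperparameters and the choice of `GaussianNB` as the Naive Bayes variant are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the biomarker table (not the repository's data).
X, y = make_classification(n_samples=300, n_features=8, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "NB": GaussianNB(),
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    pred = model.fit(X_train_s, y_train).predict(X_test_s)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "f1_macro": f1_score(y_test, pred, average="macro"),
    }
    print(f"{name}: {results[name]}")
```

Fitting the scaler before splitting would leak test-set statistics into training, which is why the split comes first here.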
- EDA: Correlation matrix (thresholding), distribution plots.
- Scaling: Standard vs. MinMax.
- Explainability: SHAP summary plot for feature impact.
- Modeling: Quick RF training on selected features.
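
One common way to apply a correlation threshold is to drop a feature when it correlates too strongly with one already kept. The sketch below shows that idea with made-up values; the function names, the 0.9 default, and the greedy keep-first rule are assumptions, and the page's actual thresholding logic may differ.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient; 0.0 if either column is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns: Dict[str, List[float]], threshold: float = 0.9) -> List[str]:
    """Keep a column only if |r| <= threshold against every column already kept."""
    kept: List[str] = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Made-up values: "hematocrit" is exactly 3 x "hemoglobin", so it is redundant.
data = {
    "hemoglobin": [12.0, 13.5, 11.0, 15.0, 14.0],
    "hematocrit": [36.0, 40.5, 33.0, 45.0, 42.0],
    "wbc": [9.0, 4.5, 7.0, 6.0, 11.0],
}
print(drop_correlated(data))
```

Because the rule is greedy, the result depends on column order: the first feature of a correlated pair is the one that survives.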
- Horizons: 5, 10, 15 years.
- Models: RF, Linear Regression, SVR, XGBoost, LightGBM, CatBoost, StackingRegressor (LR meta-learner).
- Pipeline: Encode (`survival_encoder.pkl`) → scale (`survival_scaler.pkl`) → predict with chosen model.
- Outputs: Predicted survival probability/time metric per horizon.
- Goal: Side-by-side model evaluation for a selected horizon.
- Metrics: MAE, RMSE, R² on included test splits (`models/data/y_test_Survival_*.csv`).
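
For reference, the three regression metrics used to compare survival models can be computed as below. This is a plain-Python sketch with made-up values; the app presumably uses scikit-learn's metric functions, and the function names here are illustrative.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        return 0.0  # undefined for constant targets; this sketch returns 0.0
    return 1.0 - ss_res / ss_tot


y_true = [0.9, 0.8, 0.6, 0.4]  # made-up 5-year survival values
y_pred = [0.8, 0.8, 0.7, 0.4]
print(round(mae(y_true, y_pred), 4), round(rmse(y_true, y_pred), 4), round(r2(y_true, y_pred), 4))
```

Lower MAE and RMSE are better, while R² closer to 1 means the model explains more of the variance in the test targets.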
- Visuals: Disease-wise marker distributions (violin/box), correlation heatmaps.
- Data: Uses `Blood_samples_dataset_balanced_2.csv` and `blood_samples_dataset_test.csv`.
Artifacts are provided. Re-run these only to refresh models.
python -m models.train_models
mkdir -p models/models && cp models/*.pkl models/models/  # if a page expects the mirrored path
python -m models.train_survival_model
# Produces *_Survival_{5Y,10Y,15Y}.pkl, survival_encoder.pkl, survival_scaler.pkl

Shared Practices
- Median imputation for missing values
- Standardization of numeric features
- Feature selection: SHAP + RFE workflows
- Tuning: `GridSearchCV` (RF/LR), with holdout test evaluation
- Metrics: Classification (Accuracy/Precision/Recall/F1, ROC); Survival (MAE/RMSE/R²)
- `streamlit`, `pandas`, `numpy`, `matplotlib`, `seaborn`, `plotly`
- `scikit-learn`, `joblib`, `shap`
- `nltk`, `spacy`, `sentence-transformers`
- `xgboost`, `lightgbm`, `catboost`
- (optional) `tensorflow`, `torch` for experimentation
- (optional) `pydrive2`, `oauth2client`, `pyngrok`

Install:
pip install -r requirements.txt

- No real PHI/PII in data.
- Bias & validation: Verify on representative cohorts before any downstream or clinical use.
- Safety: This is not a diagnostic device or a substitute for professional care.
- Dataset: Sohaibsoussi/patient_doctor_chatbot (trimmed subset used)
- Biomarker source: Kaggle Multiple Disease Prediction
- Survival curation informed by cited literature (TTP, diabetes, thalassemia, aplastic anemia)
- Embeddings: SentenceTransformers (`pritamdeka/S-PubMedBERT-MS-MARCO`)
- Libraries: scikit-learn, XGBoost, LightGBM, CatBoost, SHAP, spaCy, NLTK, Streamlit