This project was completed as part of my DHC Internship. The goal was to analyze the UCI Heart Disease dataset to identify key health trends and build a machine learning model capable of predicting heart disease risk with high clinical reliability.
- Data Inspection: Used `.head()`, `.info()`, and `.describe()` to verify 303 patient records and 14 clinical features.
- Visualizations:
  - Histograms: Revealed a peak in heart disease cases for patients aged 50–60.
  - Boxplots: Identified significant outliers in Cholesterol (`chol`) and Resting Blood Pressure (`trestbps`).
- Handling Outliers: Applied `StandardScaler` to standardize each feature to zero mean and unit variance, so that features measured on larger scales (such as cholesterol) do not dominate the model's coefficients.
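
The boxplot outlier check described above can be sketched with the standard 1.5 × IQR whisker rule. This is a minimal illustration, not the notebook's actual code: the toy values and the helper `iqr_outliers` are hypothetical, and only the column names `chol` and `trestbps` come from the dataset.

```python
import pandas as pd

# Hypothetical toy values; the real notebook works on the 303-row dataset.
df = pd.DataFrame({
    "chol": [210, 230, 245, 250, 260, 270, 564],
    "trestbps": [120, 125, 130, 130, 135, 140, 200],
})


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (the boxplot whisker rule)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]


# Summary statistics, as in the data inspection step.
print(df.describe())

# Values a boxplot would draw as individual outlier points.
chol_outliers = iqr_outliers(df["chol"])
bp_outliers = iqr_outliers(df["trestbps"])
print(chol_outliers.tolist(), bp_outliers.tolist())
```

Standardization rescales these values but does not remove them, so a check like this is still useful for deciding whether any records need closer inspection.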
- Algorithm: Logistic Regression (Classification).
- Setup: 80% Training Data / 20% Testing Data.
- Optimization: Resolved convergence warnings by increasing `max_iter` and scaling the features, which lets the solver converge in fewer iterations.
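
The setup above could look roughly like the following sketch. Several details are assumptions rather than the notebook's actual code: the data is a synthetic stand-in from `make_classification`, 13 predictor columns are assumed (treating the 14th column as the target), and `max_iter=1000`, `random_state=42`, and `stratify=y` are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 303 rows, 13 predictors (assumed: 14 columns minus the target).
X, y = make_classification(n_samples=303, n_features=13, random_state=42)

# 80% training / 20% testing split; stratify keeps class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling inside the pipeline means the scaler is fit on training data only.
# max_iter=1000 is an assumed value, raised from scikit-learn's default of 100.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"train rows: {len(X_train)}, test rows: {len(X_test)}")
print(f"test accuracy: {test_accuracy:.4f}")
```

Wrapping the scaler and classifier in one pipeline avoids leaking test-set statistics into the scaling step, which would otherwise make the reported accuracy slightly optimistic.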
On the held-out test set of 61 patients, the model performed well:
- Final Accuracy: 85.25%
- F1-Score: 0.86 (indicates a strong balance between Precision and Recall).
- Confusion Matrix Performance:
  - Correctly identified 25/29 healthy cases.
  - Correctly identified 27/32 disease cases.
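
The reported accuracy and F1-score can be reproduced from the confusion matrix counts above. This sketch assumes that "disease" is the positive class and that the F1-score is reported for that class.

```python
# Counts taken from the confusion matrix; "disease" is assumed to be the positive class.
tn, fp = 25, 4   # healthy cases: 25 of 29 classified correctly
fn, tp = 5, 27   # disease cases: 27 of 32 classified correctly

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of predicted disease cases, how many were real
recall = tp / (tp + fn)      # of real disease cases, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f}")    # accuracy=0.8525
print(f"precision={precision:.3f}")  # precision=0.871
print(f"recall={recall:.3f}")        # recall=0.844
print(f"f1={f1:.2f}")                # f1=0.86
```

The 5 missed disease cases (false negatives) are the costlier error in a screening setting, so recall is worth tracking alongside accuracy.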
To run this project on your local machine:
- Clone the repository.
- Activate your virtual environment: `source venv/bin/activate`
- Install dependencies: `pip install -r requirements.txt`
- Run the notebook: `jupyter notebook Model_Evaluation.ipynb`