This project aims to identify bank customers who are likely to "churn" (leave the bank) by analyzing their demographic and financial profiles.
The notebook Churn_pipeline.ipynb follows a comprehensive data science workflow, from raw data ingestion to model performance evaluation. It utilizes the Churn_Modelling.csv dataset, which contains records of 10,000 customers.
The dataset includes several features that influence a customer's decision to stay or leave:
1- Customer Profile: Age, Gender, Geography (France, Germany, Spain).
2- Financial Metrics: Credit Score, Bank Balance, Estimated Salary.
3- Bank Relationship: Tenure (years with the bank), Number of Products, Active Membership status, and Credit Card ownership.
4- Target Variable: Exited (1 if the customer left, 0 if they stayed).
The pipeline includes rigorous cleaning steps to ensure data quality:
1- Duplicate Removal: identified and removed 2 duplicate entries.
2- Missing Values: dropped rows with null values in critical columns such as Geography and Age.
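The two cleaning steps can be sketched with pandas as follows (a tiny in-memory sample stands in for the real Churn_Modelling.csv, which the notebook loads from disk):

```python
import pandas as pd

# Hypothetical sample frame; the notebook works on the full 10,000-row dataset.
df = pd.DataFrame({
    "Geography": ["France", "France", "Germany", None, "Spain"],
    "Age": [42, 42, 35, 29, None],
    "Exited": [1, 1, 0, 0, 1],
})

df = df.drop_duplicates()                    # 1) remove exact duplicate rows
df = df.dropna(subset=["Geography", "Age"])  # 2) drop nulls in critical columns
print(len(df), "rows remain after cleaning")
```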
3- Feature Engineering:
- Categorical variables (Gender, Geography) were transformed using One-Hot Encoding.
- Numerical features (CreditScore, Age, Balance, EstimatedSalary) were standardized using StandardScaler to improve model convergence.
4- Data Split: the processed data was split into training and testing sets (a 70/30 split).
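Steps 3 and 4 can be sketched with scikit-learn's ColumnTransformer (a minimal, illustrative version: the small synthetic frame below is an assumption standing in for the cleaned dataset, and the notebook's exact column handling may differ):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the cleaned churn DataFrame.
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "CreditScore": rng.integers(350, 850, n),
    "Geography": rng.choice(["France", "Germany", "Spain"], n),
    "Gender": rng.choice(["Male", "Female"], n),
    "Age": rng.integers(18, 80, n),
    "Balance": rng.uniform(0, 200_000, n),
    "EstimatedSalary": rng.uniform(10_000, 150_000, n),
    "Exited": rng.integers(0, 2, n),
})
X, y = df.drop(columns="Exited"), df["Exited"]

# One-hot encode categoricals, standardize numericals.
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(drop="first"), ["Gender", "Geography"]),
    ("scale", StandardScaler(), ["CreditScore", "Age", "Balance", "EstimatedSalary"]),
])

# 70/30 split, stratified on the imbalanced target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Fit the transformers on the training set only, to avoid data leakage.
X_train_t = preprocess.fit_transform(X_train)
X_test_t = preprocess.transform(X_test)
print(X_train_t.shape, X_test_t.shape)
```

Fitting the encoder and scaler on the training split alone (then applying them to the test split) keeps test-set statistics out of the training process.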
Several classification algorithms were trained and compared. The notebook highlights a significant class imbalance in the target variable, which depresses metrics that are sensitive to the minority class, such as the F1-score.
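The imbalance is easy to check from the target column; the counts below are hypothetical, approximating the roughly 80/20 stay/churn split the notebook reports:

```python
from collections import Counter

# Hypothetical labels illustrating an ~80/20 class imbalance.
y = [0] * 8000 + [1] * 2000
counts = Counter(y)
churn_rate = counts[1] / len(y)
print(counts, f"churn rate = {churn_rate:.0%}")
```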
| Model                  | Accuracy | ROC-AUC Score |
|------------------------|----------|---------------|
| Random Forest          | 0.89     | 0.96          |
| K-Nearest Neighbors    | 0.85     | 0.92          |
| Decision Tree          | 0.75     | 0.83          |
| Gaussian Naive Bayes   | 0.75     | 0.82          |
| Logistic Regression    | 0.74     | 0.81          |
| Support Vector Machine | 0.74     | 0.81          |
The Random Forest Classifier is the best-performing model in this pipeline, achieving the highest accuracy and the best ability to distinguish between churners and non-churners (indicated by the 0.96 ROC-AUC). While KNN performs well after tuning, simpler linear models like Logistic Regression struggle to capture the non-linear complexities in this specific dataset.
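A table like the one above can be produced by fitting each model and scoring it on the held-out test set. A minimal sketch with two of the models, using synthetic imbalanced data in place of the preprocessed churn features (the notebook's exact hyperparameters are not shown here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: an ~80/20 imbalanced binary classification problem.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

for name, model in [
    ("Random Forest", RandomForestClassifier(random_state=42)),
    ("Logistic Regression", LogisticRegression(max_iter=1000)),
]:
    model.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, model.predict(X_te))
    # ROC-AUC is computed from predicted probabilities, not hard labels.
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: accuracy={acc:.2f}, ROC-AUC={auc:.2f}")
```

Note that ROC-AUC uses `predict_proba` scores rather than thresholded predictions, which is why it can separate models that have similar accuracy.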