# predict_customer_churn_using_Machine_Learning

This project aims to identify bank customers who are likely to "churn" (leave the bank) by analyzing their demographic and financial profiles.

## Project Overview

The notebook `Churn_pipeline.ipynb` follows a comprehensive data science workflow, from raw data ingestion to model performance evaluation. It uses the `Churn_Modelling.csv` dataset, which contains records of 10,000 bank customers.

## Dataset Description

The dataset includes several features that influence a customer's decision to stay or leave:

1. **Customer Profiles:** Age, Gender, Geography (France, Germany, Spain).
2. **Financial Metrics:** Credit Score, Bank Balance, Estimated Salary.
3. **Bank Relationship:** Tenure (years with the bank), number of products, active membership status, and credit card ownership.
4. **Target Variable:** `Exited` (1 if the customer left, 0 if they stayed).
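As a rough illustration of the schema described above, the data can be pictured as a DataFrame with these columns (the miniature rows below are invented; the real project loads the full 10,000-row `Churn_Modelling.csv`):

```python
import pandas as pd

# Invented miniature stand-in for Churn_Modelling.csv, using the
# column names described above; values are for illustration only.
df = pd.DataFrame({
    "CreditScore":     [619, 608, 502],
    "Geography":       ["France", "Spain", "Germany"],
    "Gender":          ["Female", "Female", "Male"],
    "Age":             [42, 41, 42],
    "Tenure":          [2, 1, 8],
    "Balance":         [0.0, 83807.86, 159660.80],
    "NumOfProducts":   [1, 1, 3],
    "HasCrCard":       [1, 0, 1],
    "IsActiveMember":  [1, 1, 0],
    "EstimatedSalary": [101348.88, 112542.58, 113931.57],
    "Exited":          [1, 0, 1],  # target: 1 = churned, 0 = stayed
})

# Exited is the target; every other column is a candidate feature.
X = df.drop(columns="Exited")
y = df["Exited"]
```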

## Data Cleaning & Preprocessing

The pipeline includes rigorous cleaning steps to ensure data quality:

1. **Duplicate Removal:** Identified and removed 2 duplicate entries.
2. **Missing Values:** Dropped rows with null values in critical columns such as Geography and Age.
3. **Feature Engineering:**
   - Categorical variables (Gender, Geography) were transformed using one-hot encoding.
   - Numerical features (CreditScore, Age, Balance, EstimatedSalary) were standardized using `StandardScaler` to improve model convergence.
4. **Data Split:** The processed data was split into training and testing sets (70/30).
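The cleaning and preprocessing steps above can be sketched roughly as follows. This is a minimal outline with pandas and scikit-learn, not the notebook's exact code; the tiny DataFrame here (with a deliberate duplicate and a missing Geography) stands in for the raw CSV:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Invented stand-in for the raw data (the real pipeline reads Churn_Modelling.csv).
df = pd.DataFrame({
    "CreditScore":     [619, 619, 608, 502, 699],
    "Geography":       ["France", "France", "Spain", "Germany", None],
    "Gender":          ["Female", "Female", "Female", "Male", "Female"],
    "Age":             [42, 42, 41, 42, 39],
    "Balance":         [0.0, 0.0, 83807.86, 159660.80, 0.0],
    "EstimatedSalary": [101348.88, 101348.88, 112542.58, 113931.57, 93826.63],
    "Exited":          [1, 1, 0, 1, 0],
})

# 1. Remove duplicate rows.
df = df.drop_duplicates()

# 2. Drop rows with nulls in critical columns.
df = df.dropna(subset=["Geography", "Age"])

# 3a. One-hot encode the categorical variables.
df = pd.get_dummies(df, columns=["Gender", "Geography"], drop_first=True)

# 3b. Standardize the numerical features (zero mean, unit variance).
num_cols = ["CreditScore", "Age", "Balance", "EstimatedSalary"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

# 4. 70/30 train/test split.
X = df.drop(columns="Exited")
y = df["Exited"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
```

In practice the scaler should be fit on the training split only and then applied to the test split, to avoid leaking test-set statistics into training.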

## Models and Performance

Several classification algorithms were trained and compared. The notebook highlights a significant class imbalance in the target variable, which depresses metrics such as the F1-score on the minority (churn) class.

| Model                  | Accuracy | ROC-AUC Score |
|------------------------|----------|---------------|
| Random Forest          | 0.89     | 0.96          |
| K-Nearest Neighbors    | 0.85     | 0.92          |
| Decision Tree          | 0.75     | 0.83          |
| Gaussian Naive Bayes   | 0.75     | 0.82          |
| Logistic Regression    | 0.74     | 0.81          |
| Support Vector Machine | 0.74     | 0.81          |
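A comparison like the one above can be sketched as follows. Synthetic, imbalanced data stands in for the preprocessed churn features (the real notebook trains on `Churn_Modelling.csv`), and only two of the six models are shown for brevity:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced binary data (~80% negative class) standing in
# for the preprocessed churn features.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.8], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

results = {}
for name, model in [
    ("Random Forest", RandomForestClassifier(random_state=42)),
    ("Logistic Regression", LogisticRegression(max_iter=1000)),
]:
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]  # P(churn) per customer
    results[name] = {
        "accuracy": accuracy_score(y_test, model.predict(X_test)),
        # ROC-AUC is threshold-free, so it is more informative than
        # raw accuracy under class imbalance.
        "roc_auc": roc_auc_score(y_test, proba),
    }
```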

## Conclusion

The Random Forest Classifier is the best-performing model in this pipeline, achieving the highest accuracy and the best ability to distinguish between churners and non-churners (indicated by the 0.96 ROC-AUC). While KNN performs well after tuning, simpler linear models like Logistic Regression struggle to capture the non-linear complexities in this specific dataset.
