This project predicts customer churn with a machine learning pipeline. Churn prediction helps telecom companies retain customers by proactively identifying those who are likely to leave the service.
The goal of this project is to:
- Understand customer behavior through data
- Perform feature engineering and preprocessing
- Train and evaluate multiple machine learning models
- Build a predictive system to classify whether a customer is likely to churn
- The dataset contains customer details like tenure, monthly charges, total charges, and various service subscriptions.
- The `Churn` column is the target variable (Yes/No).
- Removed irrelevant column (`CustomerID`)
- Handled missing values in `TotalCharges`
- Addressed class imbalance using SMOTE (Synthetic Minority Oversampling Technique)
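The cleaning steps above might look roughly like the following. This is a minimal sketch on a few hypothetical rows: the `MonthlyCharges` column name, the blank-string representation of missing values, and the choice to fill them with `0.0` are all assumptions, since the notebook's exact handling is not stated here.

```python
import pandas as pd

# Hypothetical sample rows; the real notebook loads the full dataset instead.
df = pd.DataFrame({
    "CustomerID": ["0001-A", "0002-B", "0003-C", "0004-D"],
    "tenure": [1, 34, 0, 45],
    "MonthlyCharges": [29.85, 56.95, 52.55, 42.30],
    "TotalCharges": ["29.85", "1889.5", " ", "1840.75"],  # assumed: blank string marks a missing value
    "Churn": ["No", "No", "Yes", "No"],
})

# Drop the identifier column, which carries no predictive signal.
df = df.drop(columns=["CustomerID"])

# Convert TotalCharges to numeric; entries that cannot be parsed become NaN.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Assumption: fill missing totals with 0.0. Dropping the rows or imputing the
# median are equally plausible strategies.
df["TotalCharges"] = df["TotalCharges"].fillna(0.0)

print(df.dtypes)
```

Coercing to numeric first makes the missing entries visible as `NaN`, so the fill strategy can be changed in one place.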
- Distribution and outliers analyzed using boxplots
- Correlation heatmap created to understand relationships
- Countplots used to understand distributions
- Label encoding applied for model compatibility
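As an illustration of the plots listed above, here is a short sketch using seaborn on randomly generated stand-in data. The column names, the synthetic values, and the choice of which column goes into which plot are assumptions for demonstration only, not the notebook's actual figures.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (assumed column names, random values).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "tenure": rng.integers(0, 72, n),
    "MonthlyCharges": rng.uniform(20, 120, n).round(2),
    "Churn": rng.choice(["No", "Yes"], size=n, p=[0.75, 0.25]),
})
df["TotalCharges"] = (df["tenure"] * df["MonthlyCharges"]).round(2)

numeric_cols = ["tenure", "MonthlyCharges", "TotalCharges"]
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Boxplot: spread and potential outliers of one numeric feature.
sns.boxplot(x=df["MonthlyCharges"], ax=axes[0])

# Heatmap: pairwise correlations between the numeric features.
corr = df[numeric_cols].corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", ax=axes[1])

# Countplot: class balance of the target column.
sns.countplot(x="Churn", data=df, ax=axes[2])

fig.tight_layout()
```

In a notebook the figure renders inline; the `Agg` line can be removed there.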
- Label Encoding of categorical features
- Train-test split for evaluation
- SMOTE applied to handle class imbalance
Multiple classification models were trained:
- Logistic Regression
- Decision Tree
- Random Forest
- XGBoost
With default parameters, Random Forest achieved the highest accuracy of the four models.
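A comparison of this kind could be set up roughly as below. The sketch uses synthetic data from `make_classification` rather than the churn dataset, so its scores say nothing about the result reported above. XGBoost is added only if the package is importable, and the fixed `random_state` values are an assumption added for reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed churn data.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.75, 0.25], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Models with default hyperparameters, apart from the seed.
models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# XGBoost is optional in this sketch so it still runs without the package.
try:
    from xgboost import XGBClassifier
    models["XGBoost"] = XGBClassifier(random_state=42)
except ImportError:
    pass

# Fit each model and record its accuracy on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Report models from highest to lowest accuracy.
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

Accuracy alone can be misleading on imbalanced data, so precision, recall, or F1 from `sklearn.metrics.classification_report` may be worth checking alongside it.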
- The best-performing model is saved using `joblib`
- A predictive system is built to classify new customer data using the saved model
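Saving and reusing a model with `joblib` might look like this. It is a minimal sketch: the model is trained on synthetic data, the file name `churn_model.joblib` is hypothetical, and the mapping of class `1` to "churn" is an assumption.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train a stand-in model on synthetic data.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X, y)

# Persist the fitted model to disk (hypothetical file name, temporary directory).
model_path = os.path.join(tempfile.mkdtemp(), "churn_model.joblib")
joblib.dump(model, model_path)

# Later, for example in a separate script: load the model and score a new customer.
loaded_model = joblib.load(model_path)
new_customer = X[:1]  # one row with the same feature layout used in training
prediction = loaded_model.predict(new_customer)[0]
probability = loaded_model.predict_proba(new_customer)[0, 1]

# Assumption: class 1 corresponds to "churn".
label = "Churn" if prediction == 1 else "No churn"
print(f"{label} (churn probability {probability:.2f})")
```

New customer data has to go through the same encoding as the training data, so it is usually worth saving the fitted encoders next to the model.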
To run this notebook, install the following:
pip install pandas numpy seaborn matplotlib scikit-learn imbalanced-learn xgboost