Customer Segmentation using Advanced Techniques

Unsupervised Machine Learning Project

The project aims to analyze and cluster customers based on various features, including numerical and categorical variables. Three different methods are explored: K-Means, K-Prototype, and a combination of Sentence Embeddings with K-Means.

Concepts applied:

Univariate and bivariate analysis
Outlier handling with the ECOD (Empirical Cumulative Distribution Functions for Outlier Detection) class of the PyOD (Python Outlier Detection) library
PCA, MCA, and t-SNE for visual evaluation of models
Silhouette plots and the elbow curve to choose the number of clusters 'K'
LightGBM to check how well the clusters are distinguished (scored with the F1 score)
SHAP values to see which features contribute most to the model
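As a rough illustration of how silhouette scores and the elbow curve can guide the choice of 'K', the sketch below scans several candidate values with scikit-learn. It is not taken from the project notebooks: the helper name `scan_k` is hypothetical, and synthetic blobs stand in for the preprocessed customer features.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def scan_k(X, k_values, random_state=0):
    """Return {k: (inertia, silhouette)} for each candidate k (hypothetical helper)."""
    results = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        # inertia_ feeds the elbow curve; silhouette_score summarizes separation.
        results[k] = (model.inertia_, silhouette_score(X, labels))
    return results


# Assumption: synthetic data in place of the real, preprocessed customer table.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)
scores = scan_k(X, range(2, 8))
best_k = max(scores, key=lambda k: scores[k][1])

for k, (inertia, sil) in scores.items():
    print(f"k={k}  inertia={inertia:12.1f}  silhouette={sil:.3f}")
print("k with the highest silhouette:", best_k)
```

Plotting inertia against k gives the elbow curve; the silhouette column offers a second opinion when the elbow is ambiguous.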

Brief insights

KMeans

Clusters with job=blue-collar show no distinct differences in their characteristics apart from the age feature. This is undesirable, since it makes the clients in each cluster hard to tell apart.
For job=management, we obtain better differentiation with respect to education and balance.

KPrototype

KMeans with Sentence Embeddings

In management, we see that the single managers are younger than the married ones, and they have a comparatively lower bank balance. This might be due to experience, but we don't have that data.
Overall, the K-Means + Sentence Embeddings model works best, since it needs fewer variables to give good predictions.

Further Line of Thought:

The model we employed is not well-suited for comparing numerical values in fields. For instance, the sentence "Salary = 10000" yields embeddings that are more similar to those of the sentence "Salary = 100000" than to the embeddings of "Salary = 11000." This limitation arises because the model excels at comparing text but treats numbers as characters rather than quantities. Consequently, only sentences related to job and marital status proved to be significant, as the model's strength lies in comparing textual information rather than numerical data.
If you want to enhance the clustering performance on numerical data, you can consider the following modifications to the above method:

Hybrid Embedding: Use a combination of sentence embedding for text features and a different embedding method for numerical features. For numerical features, consider using methods like PCA (Principal Component Analysis) or t-SNE (t-Distributed Stochastic Neighbor Embedding) to create embeddings.
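A minimal sketch of the hybrid idea, assuming the sentence embeddings have already been computed elsewhere: the numerical columns are standardized and reduced with PCA, then concatenated with the L2-normalized text embeddings. The function name `hybrid_features`, the `numeric_weight` parameter, and the random arrays standing in for real embeddings and customer data are all illustrative assumptions. PCA is used here rather than t-SNE because it can be applied to new rows after fitting.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, normalize


def hybrid_features(text_emb, numeric, n_numeric_components=2, numeric_weight=1.0):
    """Concatenate text embeddings with a PCA projection of numeric columns.

    Hypothetical helper: `numeric_weight` controls how strongly the numeric
    block influences Euclidean distances relative to the text block.
    """
    numeric_scaled = StandardScaler().fit_transform(numeric)
    numeric_pca = PCA(n_components=n_numeric_components).fit_transform(numeric_scaled)
    text_unit = normalize(text_emb)  # each row scaled to unit L2 norm
    return np.hstack([text_unit, numeric_weight * numeric_pca])


# Assumption: random arrays stand in for sentence embeddings and numeric fields.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(100, 16))
numeric = rng.normal(size=(100, 3)) * np.array([10.0, 1000.0, 5.0])

features = hybrid_features(text_emb, numeric)
print(features.shape)
```

The combined matrix can then be passed to K-Means in place of the text-only embeddings; the weight would need tuning, since the two blocks live on different scales.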

Feature Engineering for Numerical Data: Create additional text-based features from numerical fields. For example, instead of just "Salary = 10000," you could have "Salary is High," "Salary is Medium," "Salary is Low," etc. This way, numerical information is translated into text, making it compatible with the sentence embedding model.
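The conversion from numbers to phrases could look roughly like the following. This is a sketch under stated assumptions: the helper `level_phrase` is hypothetical, the salary values are made up, and tertile cut points are only one possible way to define "Low", "Medium", and "High".

```python
from statistics import quantiles


def level_phrase(name, value, low_cut, high_cut):
    """Turn a numeric value into a phrase such as 'Salary is Low' (hypothetical helper)."""
    if value < low_cut:
        level = "Low"
    elif value < high_cut:
        level = "Medium"
    else:
        level = "High"
    return f"{name} is {level}"


# Assumption: illustrative salaries, not values from the project's dataset.
salaries = [10000, 11000, 25000, 40000, 60000, 100000]

# Two cut points that split the values into three roughly equal groups.
low_cut, high_cut = quantiles(salaries, n=3)

phrases = [level_phrase("Salary", s, low_cut, high_cut) for s in salaries]
for salary, phrase in zip(salaries, phrases):
    print(salary, "->", phrase)
```

With this encoding, 10000 and 11000 map to the same phrase while 100000 maps to a different one, which is the ordering the raw "Salary = ..." sentences failed to capture.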

hrootscraft/customer-segmentation

Folders and files

Latest commit

History

Repository files navigation

Customer Segmentation using Advanced Techniques

Concepts applied:

Brief insights

Further Line of Thought:

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Languages