Unsupervised Machine Learning Project
The project analyzes and clusters customers based on a mix of numerical and categorical features. Three methods are explored: K-Means, K-Prototypes, and Sentence Embeddings combined with K-Means. The analysis covers the following steps:
- Univariate and bivariate analysis
- Handling outliers using the ECOD (Empirical-Cumulative-distribution-based Outlier Detection) class of the PyOD (Python Outlier Detection) library
- PCA, MCA, and t-SNE for visual evaluation of the models
- Silhouette plots and the elbow curve to choose the number of clusters 'K'
- LightGBM to check how well the clusters are distinguished (scored with the F1 score)
- SHAP values to see which features contribute most to the model
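To make the silhouette step concrete, here is a minimal pure-Python sketch of the mean silhouette coefficient. It is an illustration only, not the project's code: the function name, the toy points, and the labels are assumptions, and in practice a library routine such as scikit-learn's `silhouette_score` would normally be used instead.

```python
import math


def mean_silhouette(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    def mean_dist(i, members):
        # Average distance from point i to the given members, excluding i itself.
        others = [j for j in members if j != i]
        if not others:
            return 0.0
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention: singleton clusters score 0
            continue
        a = mean_dist(i, own)  # cohesion: distance to the point's own cluster
        b = min(  # separation: distance to the nearest other cluster
            mean_dist(i, members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: two tight, well-separated groups.
toy_points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good_score = mean_silhouette(toy_points, [0, 0, 1, 1])  # sensible grouping
bad_score = mean_silhouette(toy_points, [0, 1, 0, 1])   # groups mixed up
```

A score close to 1 indicates compact, well-separated clusters, while a negative score suggests points sit closer to another cluster than to their own. Comparing this score across candidate values of K is one way to pick K.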
- KMeans
  - Clusters with job=blue-collar show no distinct differences in their characteristics apart from age. This is undesirable because it makes the clients in each cluster hard to tell apart.
  - For job=management, the clusters are better differentiated with respect to education and balance.
- KPrototype
- KMeans with Sentence Embeddings
  - In management, single managers are younger than married ones and have comparatively lower bank balances. This might be due to experience, but the data does not include that.
- Overall, the KMeans + Sentence Embedding model is the best choice since it needs fewer variables to give good predictions.
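The sentence-embedding approach relies on first turning each customer record into text. The sketch below shows one way that step might look, using a "field = value" format; the helper name, the field list, and the sample values are all hypothetical and not taken from the project's code.

```python
def customer_to_sentence(row, fields):
    """Render the selected fields of one customer record as 'name = value' text."""
    return ", ".join(f"{name} = {row[name]}" for name in fields)


# Hypothetical customer record, for illustration only.
customer = {
    "age": 34,
    "job": "management",
    "marital": "single",
    "education": "tertiary",
    "balance": 1200,
}
sentence = customer_to_sentence(
    customer, ["age", "job", "marital", "education", "balance"]
)
```

Each sentence would then be passed to a sentence-embedding model, and the resulting vectors clustered with K-Means; that part is omitted here.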
The sentence-embedding model we employed is not well suited to comparing numerical values in fields. For instance, the sentence "Salary = 10000" yields an embedding that is more similar to that of "Salary = 100000" than to that of "Salary = 11000". This limitation arises because the model excels at comparing text but treats numbers as characters rather than quantities. Consequently, only the sentences related to job and marital status proved to be significant.
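The "numbers as characters" effect can be illustrated with a simple analogy. The sketch below uses the standard library's `difflib.SequenceMatcher` as a stand-in for text similarity; it is not the sentence-embedding model used in the project, and the helper name is an assumption, but it shows the same kind of ordering on the three example sentences.

```python
from difflib import SequenceMatcher


def char_similarity(a, b):
    """Character-level similarity ratio in [0, 1]; knows nothing about magnitudes."""
    return SequenceMatcher(None, a, b).ratio()


base = "Salary = 10000"
sim_tenfold = char_similarity(base, "Salary = 100000")  # ten times larger salary
sim_nearby = char_similarity(base, "Salary = 11000")    # only 10% larger salary

# Numerically, 11000 is far closer to 10000 than 100000 is,
# yet the character-level comparison ranks them the other way round.
numeric_gap_tenfold = abs(100000 - 10000)
numeric_gap_nearby = abs(11000 - 10000)
```

Here `sim_tenfold` comes out higher than `sim_nearby`, because the first pair differs by a single appended character while the second differs in the middle of the number.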
To improve clustering performance on numerical data, consider the following modifications to the above method:
- Hybrid Embedding: Combine sentence embeddings for text features with a different embedding method for numerical features. For the numerical features, consider a method such as PCA (Principal Component Analysis) or t-SNE (t-Distributed Stochastic Neighbor Embedding). Note that t-SNE is mainly a visualization technique and cannot project new data points once fitted, so PCA is usually the easier of the two to reuse.
- Feature Engineering for Numerical Data: Create additional text-based features from numerical fields. For example, instead of just "Salary = 10000", you could use "Salary is High", "Salary is Medium", or "Salary is Low". This translates the numerical information into text, making it compatible with the sentence-embedding model.
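The feature-engineering idea can be sketched as a small binning helper. The function name and both thresholds below are hypothetical, chosen only for illustration; in a real dataset the cut-offs would more likely be derived from the data, for example from quantiles.

```python
def salary_to_text(salary, low_max=30000, high_min=80000):
    """Map a numeric salary to a coarse text label (thresholds are assumptions)."""
    if salary < low_max:
        level = "Low"
    elif salary < high_min:
        level = "Medium"
    else:
        level = "High"
    return f"Salary is {level}"


# The three salaries from the limitation example above.
labels = [salary_to_text(s) for s in (10000, 11000, 100000)]
```

With these thresholds, 10000 and 11000 map to the same sentence while 100000 maps to a different one, so the text now reflects magnitude rather than digit patterns.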










