This project implements the G-Means clustering algorithm, an extension of K-Means that estimates the number of clusters automatically by repeatedly splitting clusters whose data do not appear Gaussian. The implementation uses PCA for dimensionality reduction, Anderson–Darling tests to check each cluster for Gaussianity, and silhouette scores to evaluate candidate splits.
Two versions are implemented:
- Basic G-Means with silhouette constraint
- Enhanced G-Means with centroid initialization as described in the original paper
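
The centroid initialization in the original G-Means paper (Hamerly and Elkan, "Learning the k in k-means") places the two child centers on either side of the parent center along the cluster's principal component. The block below is a minimal sketch of that idea, not this project's code: the function name `init_child_centroids` and its arguments are hypothetical.

```python
import numpy as np

def init_child_centroids(X, center=None):
    """Return two candidate child centers for the cluster X.

    Hypothetical helper: offsets the parent center by +/- m, where
    m = s * sqrt(2 * lam / pi), s is the principal eigenvector of the
    cluster covariance and lam is its eigenvalue.
    """
    X = np.asarray(X, dtype=float)
    if center is None:
        center = X.mean(axis=0)
    # Covariance of the cluster; atleast_2d handles single-feature data.
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    # eigh returns eigenvalues in ascending order, so the last one is the largest.
    eigvals, eigvecs = np.linalg.eigh(cov)
    lam = max(eigvals[-1], 0.0)
    s = eigvecs[:, -1]
    m = s * np.sqrt(2.0 * lam / np.pi)
    return center + m, center - m
```

The two returned points can then be passed as initial centers to a 2-cluster K-Means run on the same cluster, which makes the split deterministic for a given cluster.
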

The algorithm proceeds as follows:
- Reduce dimensionality using PCA
- Initialize with one cluster (global mean)
- Iteratively split clusters if they are not Gaussian-distributed
- Validate splits using silhouette score and minimum centroid distance
- Stop when no valid splits remain or max clusters reached
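
The split-and-validate steps above can be sketched for a single cluster as follows. This is an illustration under stated assumptions rather than the project's implementation: the function name `propose_split`, the thresholds `min_silhouette` and `min_center_dist`, the minimum cluster size of 8, and the choice of the 1% critical value are all hypothetical. Projecting the points onto the line joining the two child centers before testing follows the original paper.

```python
import numpy as np
from scipy.stats import anderson
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def propose_split(X, min_silhouette=0.25, min_center_dist=1.0, random_state=0):
    """Return 2-means labels if cluster X should be split, else None.

    Hypothetical helper; thresholds are illustrative defaults.
    """
    X = np.asarray(X, dtype=float)
    if len(X) < 8:
        return None  # assumption: too few points to test reliably
    km = KMeans(n_clusters=2, n_init=10, random_state=random_state).fit(X)
    labels = km.labels_
    if len(np.unique(labels)) < 2:
        return None  # degenerate split (e.g. identical points)
    c1, c2 = km.cluster_centers_
    v = c1 - c2
    dist = np.linalg.norm(v)
    if dist < min_center_dist:
        return None  # child centers are too close together
    # Project every point onto the line joining the two child centers.
    proj = X @ v / dist**2
    result = anderson(proj, dist="norm")
    # For dist="norm" the last critical value corresponds to the 1% level.
    if result.statistic < result.critical_values[-1]:
        return None  # projection looks Gaussian: keep the cluster whole
    if silhouette_score(X, labels) < min_silhouette:
        return None  # split does not separate the points well enough
    return labels
```

A full run would start from a single cluster, call a helper like this on each current cluster, and stop once every call returns `None` or the maximum number of clusters is reached.
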

The algorithm is evaluated on the following datasets:
- Iris
- Digits
- Wine
- Breast Cancer
- Synthetic Blobs, Moons, and Circles
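
All of the datasets listed above are available through `sklearn.datasets`. The sketch below shows one way to collect them together with their true cluster counts; the function name `load_benchmarks` and the synthetic-data settings (sample sizes, noise levels, four blob centers) are assumptions, not values taken from this project.

```python
import numpy as np
from sklearn import datasets

def load_benchmarks(random_state=0):
    """Return {name: (X, true_k)} for the listed datasets (hypothetical helper)."""
    loaders = {
        "Iris": datasets.load_iris,
        "Digits": datasets.load_digits,
        "Wine": datasets.load_wine,
        "Breast Cancer": datasets.load_breast_cancer,
    }
    benchmarks = {}
    for name, loader in loaders.items():
        X, y = loader(return_X_y=True)
        benchmarks[name] = (X, len(np.unique(y)))

    # Synthetic datasets; the parameters below are illustrative assumptions.
    X, y = datasets.make_blobs(n_samples=500, centers=4, random_state=random_state)
    benchmarks["Blobs"] = (X, len(np.unique(y)))
    X, y = datasets.make_moons(n_samples=500, noise=0.05, random_state=random_state)
    benchmarks["Moons"] = (X, len(np.unique(y)))
    X, y = datasets.make_circles(
        n_samples=500, noise=0.05, factor=0.5, random_state=random_state
    )
    benchmarks["Circles"] = (X, len(np.unique(y)))
    return benchmarks
```

Moons and Circles are deliberately non-Gaussian shapes, so a Gaussian-based splitting criterion may report more clusters than the two groups the generator defines.
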

Results are evaluated using:
- Similarity Score: Compares predicted vs. true cluster count
- Visualization: Bar charts comparing true vs. predicted k
- Iteration Tracking: Measures convergence speed
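
The exact similarity formula is not stated here, so the sketch below assumes one plausible definition: one minus the absolute difference between the predicted and true k, divided by the larger of the two. Both function names are hypothetical, and the plotting helper only illustrates the kind of grouped bar chart described above.

```python
import numpy as np

def cluster_count_similarity(pred_k, true_k):
    """Assumed score in [0, 1]; 1.0 means the predicted k matches the true k."""
    if pred_k < 1 or true_k < 1:
        raise ValueError("cluster counts must be positive")
    return 1.0 - abs(pred_k - true_k) / max(pred_k, true_k)

def plot_true_vs_predicted(names, true_k, pred_k):
    """Grouped bar chart of true vs. predicted k per dataset."""
    # Imported here so the score helper can be used without Matplotlib.
    import matplotlib.pyplot as plt

    x = np.arange(len(names))
    width = 0.4
    fig, ax = plt.subplots()
    ax.bar(x - width / 2, true_k, width, label="True k")
    ax.bar(x + width / 2, pred_k, width, label="Predicted k")
    ax.set_xticks(x)
    ax.set_xticklabels(names, rotation=30, ha="right")
    ax.set_ylabel("Number of clusters")
    ax.legend()
    return fig
```

Under this assumed definition, predicting 2 clusters when the true count is 4 scores 0.5, the same as predicting 10 when the true count is 5.
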

Requirements:
- Python 3
- NumPy, scikit-learn, SciPy
- Matplotlib for visualization
- Jupyter Notebook for interactive analysis