We sought to address limitations of two popular machine learning algorithms for classifying quantitative data: K-Nearest Neighbors (KNN) and K-Means. Our approach combines aspects of both, pairing the clustering step of K-Means with KNN's nearest-neighbor classification. The method creates k clusters within a dataset and assigns each cluster's centroid the majority class label of that cluster; to classify a test instance, it then runs KNN with k=1, treating each centroid as a neighbor. We call this algorithm K-Closest Clusters (KCC). Our results show that KCC achieves slightly better accuracy, precision, and recall than the existing KNN algorithm, and that KCC classifies test instances significantly faster than KNN. This suggests that KCC is more practical than KNN as a classifier, especially on large datasets.
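The two stages described above can be sketched as follows. This is a minimal NumPy illustration, not the project's actual implementation (which lives in ml_q2_project.py); the function names `fit_kcc` and `predict_kcc`, the Lloyd's-style K-Means loop, and all parameters are our own assumptions for the sketch.

```python
import numpy as np

def fit_kcc(X, y, k, n_iters=100, seed=0):
    """Stage 1 (sketch): run K-Means with k clusters, then label each
    centroid with the majority class of the points assigned to it."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        new_centroids = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Majority class label per cluster (-1 marks an empty cluster).
    labels = np.array([
        np.bincount(y[assign == j]).argmax() if np.any(assign == j) else -1
        for j in range(k)
    ])
    return centroids, labels

def predict_kcc(X, centroids, labels):
    """Stage 2 (sketch): classify each test point via KNN with k=1,
    treating the labeled centroids as the only neighbors."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return labels[dists.argmin(axis=1)]
```

Because prediction compares each test point against only k centroids rather than every training point, the classification step scales with k instead of the training-set size, which is where the efficiency gain over plain KNN comes from.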
Download and run ml_q2_project.py. This file contains our implementation of the K-Closest Clusters algorithm described in our report. Running it displays a graph of the validation accuracies, a visualization of the clusters, the test accuracy, and the classification time.