Cluster Analysis with the BigML Dashboard
2.2 G-means
Sometimes, it is hard to know in advance how many clusters can be identified in a given dataset (the K in K-means). To solve this issue, an alternative algorithm has been proposed, G-means, which uses a special technique for running K-means multiple times while adding centroids in a hierarchal fashion. G-means has the advantage of being relatively resilient to covariance in clusters and has no need to compute a global covariance (as is the case with using the Mahalanobis distance, a popular variant on classic K-means to deal with covariance).
BigML implementation differs from the original paper. First, we reuse our sample-based K-means rather than running full K-means, with the already described performance and scalability advantage. Additionally, BigML clusters choose new candidate clusters with kmeans|| rather than the PCA calculation from the paper. While this gives us better scalability, it means BigML version of G-means is no longer deterministic as in the paper. Given the same data, we may find a different number of clusters from one run to the next (if no seed is given).
Finally, BigML G-means have different stopping criteria than the original paper. We currently enforce a maximum limit of 128 clusters. Also, if BigML algorithm does not make sufficient progress in finding Gaussian clusters after multiple iterations of G-means, it stops early. Both techniques ensure that BigML algorithm returns fewer clusters.
G-means is almost parameter-free, except for one, the critical_value parameter. G-means iteratively takes existing clusters and tests whether the cluster’s neighborhood appears Gaussian. If it does not, the cluster is split into two. The critical_value sets how strict the test is when deciding whether data looks Gaussian. The current default is 5, but ranges between 1 and 20 can be reasonable depending on the dataset. A critical value of 1 means data must look very Gaussian to pass the test and can lead to more clusters being detected. Higher critical_values will tend to find fewer clusters.