Cluster Analysis with the BigML Dashboard

2.1 K-means

K-means is one of two algorithms that BigML provides for cluster analysis. K-means clustering aims to partition the data instances contained in your dataset into K clusters, such that each data instance belongs to the cluster with the nearest center.
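To make the partitioning objective concrete, the following Python sketch shows the standard K-means loop: assign each instance to its nearest center, then recompute each center as the mean of its assigned instances. It is purely illustrative and not BigML's implementation; the kmeans function, its parameters, and the toy data are hypothetical names for this example.

    # Minimal sketch of the standard K-means loop (illustrative only).
    import numpy as np

    def kmeans(data, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # Standard K-means starts from k instances chosen at random.
        centers = data[rng.choice(len(data), size=k, replace=False)].astype(float)
        labels = np.zeros(len(data), dtype=int)
        for _ in range(n_iter):
            # Assign each instance to its nearest center.
            dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Recompute each center as the mean of its assigned instances.
            new_centers = np.array([
                data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return centers, labels

    if __name__ == "__main__":
        points = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
        centers, labels = kmeans(points, k=2)
        print(centers)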

BigML's proprietary implementation of K-means is optimized for scalability, mitigating one of the major limitations of standard K-means. BigML adopted the mini-batch approach, which is known to reduce computation cost by orders of magnitude compared to the classic K-means algorithm.
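The sketch below illustrates a common mini-batch formulation of K-means, not BigML's internal code: each iteration updates the centers from a small random sample instead of the full dataset, which is where the cost savings come from. The function name, batch size, and learning-rate schedule are assumptions made for this example.

    # Illustrative mini-batch K-means update (not BigML's implementation).
    import numpy as np

    def minibatch_kmeans(data, k, batch_size=100, n_iter=200, seed=0):
        rng = np.random.default_rng(seed)
        centers = data[rng.choice(len(data), size=k, replace=False)].astype(float)
        counts = np.zeros(k)  # per-center update counts
        for _ in range(n_iter):
            # Each iteration only touches a small random sample of the data.
            batch = data[rng.choice(len(data), size=batch_size)]
            # Assign batch instances to their nearest centers.
            dists = np.linalg.norm(batch[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Move each center toward its batch instances with a decaying step size.
            for x, j in zip(batch, labels):
                counts[j] += 1
                eta = 1.0 / counts[j]
                centers[j] = (1 - eta) * centers[j] + eta * x
        return centers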

One key factor when using K-means clustering is how the algorithm is initialized, which can greatly affect the quality of the identified clusters. In standard K-means, the initial cluster centers are chosen at random. This means that the quality of the clusters identified by the algorithm usually varies a lot from run to run, so it is fair to say that standard K-means provides no guarantee of accuracy. Alternative approaches for selecting the initial centers have therefore been described in the literature, such as K-means++, which is a little too slow for BigML's purposes. Instead, BigML's preferred approach is K-means||, which is similar to K-means++ but much faster.
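K-means|| itself gathers an oversampled set of candidate centers over a few passes and then reduces them to K seeds; the simpler K-means++ seeding that it approximates is sketched below to show the underlying idea of distance-proportional seeding. This is an illustration of K-means++, not of BigML's K-means|| code, and the function name and arguments are hypothetical.

    # Sketch of K-means++-style seeding, which K-means|| approximates in a
    # handful of passes over the data (illustrative only).
    import numpy as np

    def kmeanspp_seeds(data, k, seed=0):
        rng = np.random.default_rng(seed)
        # First center: a uniformly random instance.
        centers = [data[rng.integers(len(data))]]
        for _ in range(k - 1):
            # Squared distance from each instance to its nearest chosen center.
            d2 = np.min(
                np.linalg.norm(data[:, None, :] - np.array(centers)[None, :, :], axis=2) ** 2,
                axis=1,
            )
            # Pick the next center with probability proportional to that distance,
            # so regions far from existing centers are likely to get a seed.
            probs = d2 / d2.sum()
            centers.append(data[rng.choice(len(data), p=probs)])
        return np.array(centers)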

Another dimension where BigML clusters improve on standard K-means is the way they handle categorical data. Instead of “binarizing” each category, so that a field with 40 categories becomes 40 binary fields, BigML chose a technique called k-prototypes, which modifies the distance function to be more category-friendly: each cluster picks the most common category from its neighborhood. In other words, BigML clusters use the mode instead of the mean for categorical fields.
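The following sketch conveys the k-prototypes idea under simplifying assumptions: numeric fields contribute a squared Euclidean term, categorical fields contribute a mismatch penalty weighted by a hypothetical gamma parameter, and a cluster's center takes the mean of numeric fields but the mode of categorical ones. It is not BigML's actual distance function, only an illustration of the approach.

    # Hedged sketch of a k-prototypes-style distance and center computation.
    from collections import Counter

    def mixed_distance(instance, center, gamma=1.0):
        # Numeric fields: squared Euclidean distance.
        num_part = sum((a - b) ** 2 for a, b in zip(instance["numeric"], center["numeric"]))
        # Categorical fields: count of mismatched categories (gamma is illustrative).
        cat_part = sum(a != b for a, b in zip(instance["categorical"], center["categorical"]))
        return num_part + gamma * cat_part

    def cluster_center(instances):
        numeric = [i["numeric"] for i in instances]
        categorical = [i["categorical"] for i in instances]
        return {
            # Mean per numeric field.
            "numeric": [sum(col) / len(col) for col in zip(*numeric)],
            # Mode (most common category) per categorical field.
            "categorical": [Counter(col).most_common(1)[0][0] for col in zip(*categorical)],
        }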