Cluster Analysis with the BigML Dashboard

6.1 Cluster Summary

The cluster summary gives you an overview of your cluster, organized into four parts: data distribution, cluster metrics, centroids, and intercentroid distance. (See Figure 6.2.)

  • Data distribution: for each cluster, the percentage of data instances that belong to that cluster. The “global” cluster always includes all of the data instances, i.e., it accounts for 100% of them.

  • Cluster metrics: a summary of the distances between the data instances expressed in terms of various aggregate measures:

    • total_ss: the total sum of squares of the distances between each data instance and the global centroid;

    • within_ss: the total sum of squares of the distances between each data instance and the centroid of the cluster it belongs to;

    • between_ss: the total sum of squares of the distances between each centroid and the global centroid;

    • ratio_ss: the ratio of between_ss to total_ss. This is a measure of how well your data instances can be grouped into clusters.

  • Centroids: general statistics for each of the identified clusters, including the global one.

  • Intercentroid distance: distribution of distances between centroids.
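
To make these quantities concrete, the following is a minimal sketch in plain Python that computes them for a small, made-up dataset. It is not BigML's implementation. It rests on several assumptions: distances are plain Euclidean distances on unscaled numeric fields, each centroid is the mean of its cluster, and between_ss weights each centroid's squared distance to the global centroid by the number of instances in that cluster (the usual k-means convention, under which total_ss = within_ss + between_ss). The function name `cluster_summary` and the sample data are hypothetical.

```python
import math
from collections import defaultdict


def mean_point(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_summary(points, labels):
    """Compute summary quantities for points already assigned to clusters.

    Assumption: plain Euclidean distance and mean centroids; the real
    service may scale or weight fields differently.
    """
    groups = defaultdict(list)
    for p, lab in zip(points, labels):
        groups[lab].append(p)

    global_centroid = mean_point(points)
    centroids = {lab: mean_point(g) for lab, g in groups.items()}

    # Data distribution: percentage of instances in each cluster.
    distribution = {lab: 100.0 * len(g) / len(points) for lab, g in groups.items()}

    total_ss = sum(sq_dist(p, global_centroid) for p in points)
    within_ss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    # Assumption: size-weighted, so that total_ss == within_ss + between_ss.
    between_ss = sum(
        len(groups[lab]) * sq_dist(c, global_centroid)
        for lab, c in centroids.items()
    )
    ratio_ss = between_ss / total_ss if total_ss else 0.0

    # Intercentroid distance: one entry per unordered pair of clusters.
    labs = sorted(centroids)
    intercentroid = {
        (a, b): math.sqrt(sq_dist(centroids[a], centroids[b]))
        for i, a in enumerate(labs)
        for b in labs[i + 1:]
    }

    return {
        "distribution": distribution,
        "total_ss": total_ss,
        "within_ss": within_ss,
        "between_ss": between_ss,
        "ratio_ss": ratio_ss,
        "intercentroid": intercentroid,
    }


# Made-up example: two tight, well-separated clusters of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
summary = cluster_summary(points, labels)
print(summary)
```

For this toy data, total_ss is 104, within_ss is 4, and between_ss is 100, so ratio_ss is about 0.96: nearly all of the spread in the data is explained by the separation between the two clusters rather than by the spread inside them.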

Figure 6.2 Cluster summary report