Cluster Analysis with the BigML Dashboard

1 Introduction

Some problems require separating a dataset into subsets of instances that share some similarity. Cluster analysis is a Machine Learning task that partitions a dataset so that instances in the same group, called a cluster, are more similar to each other than to instances in other groups. Cluster analysis does not require previously labeled data. For this reason, it falls under the category of unsupervised learning.

BigML clusters use proprietary learning algorithms to group instances according to a distance measure computed from the values of the fields used as input. Each cluster group is represented by its center (or centroid). All BigML field types are valid inputs for clustering, i.e., categorical, numeric, text, and items fields, although there are a few caveats. First, numeric fields are automatically scaled to ensure that their different magnitudes do not bias the distance calculation. Second, clustering does not tolerate missing values in numeric fields, so BigML provides several strategies for dealing with them (see section 4.4); if none is applied, instances with missing numeric values are excluded when computing the clusters.
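To illustrate why scaling matters for a distance measure, here is a minimal sketch in Python. It is not BigML's proprietary algorithm: the choice of standardization (zero mean, unit standard deviation), the Euclidean distance, and the `income` and `age` fields are all assumptions made for the example.

```python
import math

def standardize(columns):
    """Rescale each numeric column to zero mean and unit standard deviation.

    This is one common scaling scheme, assumed here for illustration only.
    """
    scaled = []
    for values in columns:
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        std = math.sqrt(var) or 1.0  # avoid dividing by zero for constant columns
        scaled.append([(v - mean) / std for v in values])
    return scaled

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two hypothetical fields with very different magnitudes.
income = [20000.0, 21000.0, 60000.0]
age = [25.0, 60.0, 26.0]

raw_rows = list(zip(income, age))
scaled_rows = list(zip(*standardize([income, age])))

# Unscaled, the income field dominates: a 35-year age gap barely registers.
raw_d01 = euclidean(raw_rows[0], raw_rows[1])
raw_d02 = euclidean(raw_rows[0], raw_rows[2])

# Scaled, a large difference in either field contributes comparably.
scaled_d01 = euclidean(scaled_rows[0], scaled_rows[1])
scaled_d02 = euclidean(scaled_rows[0], scaled_rows[2])
```

Before scaling, instance 0 looks far closer to instance 1 than to instance 2 simply because income is measured in larger units. After scaling, the two distances are of similar size.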

BigML clusters can be built using two different unsupervised learning algorithms:

K-means: the user needs to specify the number of clusters in advance. Learn more about k-means in section 2.1.
G-means: the algorithm automatically learns the number of clusters by iteratively taking existing cluster groups and testing whether each cluster's neighborhood appears Gaussian in its distribution. Learn more about g-means in section 2.2.
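As a rough illustration of the k-means idea, the sketch below implements the textbook assign-and-update loop in plain Python. It should not be read as BigML's implementation: the function name `kmeans`, seeding the centroids with the first k points, and the fixed iteration count are assumptions chosen to keep the example short and deterministic. G-means is not covered here; it differs in that the number of clusters is discovered rather than supplied.

```python
import math

def kmeans(points, k, iterations=20):
    """Textbook k-means sketch; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments

# Hypothetical two-dimensional data with two visibly separate groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.1), (0.9, 1.1), (9.2, 8.9)]
centroids, assignments = kmeans(points, k=2)
```

With this data, the points near (1, 1) end up in one cluster and the points near (9, 9) in the other. The user-supplied `k` plays the role of the number of clusters that k-means requires in advance.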

This chapter provides a comprehensive description of BigML clusters, including how they can be created (Chapter 3) and configured (Chapter 4). Visualizations of the clustering results give insight into the internal structure of the clusters (see Chapter 5). Besides their visual representations, clusters also provide a textual summary view of their most essential information (see Chapter 6). Clusters are actionable, since they allow you to identify the cluster that is closest to any given new data point (Chapter 7). You can even download a cluster and calculate the nearest cluster locally (see section 8.1). It is also worth noting that you can create, update, list, and delete clusters using the BigML API (see section 8.3).
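Finding the closest cluster for a new data point amounts to a nearest-centroid lookup. The following is a minimal sketch under stated assumptions: the helper `nearest_centroid`, the centroid names, and their coordinates are hypothetical, only numeric fields are considered, and the point is assumed to be expressed in the same scaled units as the centroids. It is not the code that BigML provides for local predictions.

```python
import math

def nearest_centroid(point, centroids):
    """Return (name, distance) of the centroid closest to `point`."""
    name = min(centroids, key=lambda n: math.dist(point, centroids[n]))
    return name, math.dist(point, centroids[name])

# Hypothetical centroids, already in the same scaled units as the new point.
centroids = {"Cluster 0": (0.0, 0.0), "Cluster 1": (3.0, 4.0)}
name, distance = nearest_centroid((2.5, 3.5), centroids)
```

Returning the distance along with the cluster name gives a sense of how well the new point fits the cluster it was assigned to.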

In BigML, the fourth tab of the main menu of your Dashboard allows you to list all your available clusters. In the cluster list view (Figure 1.1), you can see, for each cluster, the dataset it was created from, as well as the cluster's Name, Algorithm (either k-means or g-means), the number of cluster groups K, Age (time elapsed since it was created), Size, and the number of centroids or batch centroids that have been created using that cluster. The search menu option in the top right corner of the cluster list view allows you to search your clusters by name.

Figure 1.1: Clusters list view

When you first create an account at BigML, or every time you start a new Project, your list view of clusters will be empty (see Figure 1.2).

Figure 1.2: Empty Dashboard cluster view

Finally, in Figure 1.3 you can see the icon used to represent a cluster in BigML.