Cluster Analysis with the BigML Dashboard

4.4 Default numeric value

\includegraphics[]{images/clusters/cluster-default-numeric-value}
Figure 4.6 Cluster options: default numeric value

When training a cluster, BigML may encounter missing values in numeric fields, which can either be replaced with a default value or ignored. Clusters compute the Euclidean distance between instances, and the distance to a missing value is undefined, so by default instances containing missing numeric values are ignored.
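
To see why a missing value is a problem, recall the standard Euclidean distance between two instances $x$ and $y$ described by $p$ numeric fields. The notation here is ours, not BigML's, and any field scaling BigML may apply before computing distances is left out:

\[
d(x, y) = \sqrt{\sum_{j=1}^{p} (x_j - y_j)^2}
\]

If some $x_j$ is missing, the term $(x_j - y_j)^2$ has no value, so the whole sum, and therefore $d(x, y)$, is undefined.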

BigML, though, lets you substitute a default value for any missing numeric value by setting a default numeric value. You can replace missing numeric values with the field’s maximum, mean, median, minimum, or zero. There is one catch: if every instance contains at least one missing value, ignoring those instances would invalidate the entire training set. In this situation, BigML automatically replaces the missing numeric values with the median.
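
The behavior described above can be sketched in a few lines of plain Python. This is only an illustration of the rules in the text, not BigML's implementation: the names `STRATEGIES`, `fill_missing`, and `prepare_rows` are hypothetical, missing values are represented as `None`, and the handling of a column with no observed values at all is our own assumption.

```python
from statistics import mean, median

# Hypothetical names; these are not BigML API identifiers.
STRATEGIES = {
    "maximum": max,
    "mean": mean,
    "median": median,
    "minimum": min,
    "zero": lambda values: 0,
}


def fill_missing(rows, strategy):
    """Replace None in each numeric column with that column's default value."""
    columns = list(zip(*rows))
    defaults = []
    for column in columns:
        present = [v for v in column if v is not None]
        # Assumption: a column with no observed values falls back to 0.
        defaults.append(STRATEGIES[strategy](present) if present else 0)
    return [
        [defaults[j] if v is None else v for j, v in enumerate(row)]
        for row in rows
    ]


def prepare_rows(rows, default_numeric_value=None):
    """Return the rows that would take part in training."""
    if default_numeric_value is not None:
        return fill_missing(rows, default_numeric_value)
    # No default set: instances with any missing value are ignored.
    complete = [list(row) for row in rows if None not in row]
    if complete:
        return complete
    # Every instance has a gap: fall back to the median, as the text describes.
    return fill_missing(rows, "median")


rows = [[1.0, None], [3.0, 4.0], [None, 8.0]]
print(prepare_rows(rows))          # [[3.0, 4.0]]
print(prepare_rows(rows, "mean"))  # [[1.0, 6.0], [3.0, 4.0], [2.0, 8.0]]
print(prepare_rows([[1.0, None], [None, 8.0], [5.0, None]]))
# [[1.0, 8.0], [3.0, 8.0], [5.0, 8.0]]
```

In the last call every instance has a missing value, so dropping incomplete instances would leave nothing to train on, and the sketch falls back to the per-field median.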

Missing values for categorical fields are always considered valid categorical values, for example [red, green, blue, <missing>]. This means a cluster centroid may contain a missing value. (See Figure 4.7.)

\includegraphics[]{images/clusters/missing-cluster}
Figure 4.7 Example of a centroid with missing values for some categorical fields
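
As a rough illustration of how a missing value can end up in a centroid, the sketch below counts a missing entry (`None`) as one more category. The source does not say how BigML chooses a centroid's categorical value; picking the most frequent category is our assumption, and `MISSING` and `most_common_category` are hypothetical names.

```python
from collections import Counter

MISSING = "<missing>"  # hypothetical label for an absent category


def most_common_category(values):
    """Most frequent category, counting missing (None) as its own category."""
    labels = [MISSING if v is None else v for v in values]
    return Counter(labels).most_common(1)[0][0]


colors = ["red", None, "blue", None, None, "green"]
print(most_common_category(colors))  # <missing>
```

Because missing is treated like any other category, it wins here simply by being the most frequent value among the instances.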