Cluster Analysis with the BigML Dashboard

7.3 Configuring Centroid Predictions

BigML provides several options to configure your batch centroids, such as adjusting the automatic field mapping performed by BigML (subsection 7.3.1) and the output file settings (subsection 7.3.2).

7.3.1 Field Mapping

By default, BigML maps fields based on their names. If the field names in your cluster do not match those in the input dataset you selected for the batch centroid, you can set the correct correspondence explicitly by assigning to each field in the “Cluster fields” column its associated input field in the “Dataset fields” column. (See Figure 7.24.)

If the dataset’s and cluster’s field names do not match but their IDs do, which happens when corresponding fields appear in the same order, you can tell BigML to map the fields by field ID instead of by field name. To do so, click the green switcher shown in Figure 7.24.

If you do not want some of the fields to be considered when computing the batch centroid, you can also manually search for those fields and remove them from the “Dataset fields” column.
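
The mapping rules described above can be sketched in a few lines of Python. This is only an illustration of the logic, not BigML's implementation: the function map_fields, its parameters, and the dictionaries of field IDs and names are hypothetical and exist only for this example.

```python
def map_fields(cluster_fields, dataset_fields, by="name", exclude=()):
    """Return {cluster_field_id: dataset_field_id}.

    Both inputs are dicts of field ID -> field name (a hypothetical shape
    chosen for this sketch). `exclude` lists dataset field names that were
    removed from the "Dataset fields" column.
    """
    if by not in ("name", "id"):
        raise ValueError("by must be 'name' or 'id'")
    excluded = set(exclude)
    # Ignore dataset fields that the user removed from the mapping.
    candidates = {fid: name for fid, name in dataset_fields.items()
                  if name not in excluded}
    if by == "id":
        # Match fields that share the same ID, whatever their names are.
        return {fid: fid for fid in cluster_fields if fid in candidates}
    # Default behaviour: match fields that share the same name.
    id_by_name = {name: fid for fid, name in candidates.items()}
    return {fid: id_by_name[name] for fid, name in cluster_fields.items()
            if name in id_by_name}


cluster_fields = {"000000": "age", "000001": "income", "000002": "city"}
dataset_fields = {"000000": "Age", "000001": "income", "000002": "city"}

# By name, "age" and "Age" do not match, so only two fields are mapped.
name_map = map_fields(cluster_fields, dataset_fields)
# By ID, all three fields are mapped because they appear in the same order.
id_map = map_fields(cluster_fields, dataset_fields, by="id")
# Removing "city" from the dataset fields leaves it out of the mapping.
trimmed_map = map_fields(cluster_fields, dataset_fields, by="id",
                         exclude=["city"])
```

In this sketch, switching from name-based to ID-based mapping recovers the field whose name differs only in capitalization, which is the situation the green switcher is meant to handle.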

Figure 7.24 Field Mapping for batch centroids

Field mapping in the BigML Dashboard is limited to 200 fields. For batch centroids with a higher number of fields, use the field_map argument of the BigML API if you need to map your fields.
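
For mappings that exceed the Dashboard limit, the map can be generated programmatically. The sketch below only builds a JSON request body. The field_map argument name comes from the text above; everything else is an assumption made for illustration: the helper names, the direction of the map (cluster field ID to dataset field ID), the six-digit hexadecimal ID format, and the placeholder resource identifiers. Check the BigML API documentation for the exact request format before relying on it.

```python
import json


def field_id(index):
    # Assumed ID format: six-digit hexadecimal strings such as "00000a".
    return f"{index:06x}"


def build_field_map(n_fields, offset=0):
    """Map each cluster field ID to the dataset field ID `offset` places later.

    Hypothetical helper: a real mapping would come from comparing the
    cluster's fields with the dataset's fields.
    """
    return {field_id(i): field_id(i + offset) for i in range(n_fields)}


payload = {
    "cluster": "cluster/<cluster-id>",   # placeholder, not a real resource ID
    "dataset": "dataset/<dataset-id>",   # placeholder, not a real resource ID
    # 250 fields, more than the Dashboard can map; the dataset is assumed
    # to have one extra leading column, hence offset=1.
    "field_map": build_field_map(250, offset=1),
}

body = json.dumps(payload)
```

The resulting body string is what would be sent in the batch centroid creation request under these assumptions.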

7.3.2 Output Settings

As mentioned, batch centroids can return a CSV file containing all input instances along with the centroid BigML computed for each of them. Define the following settings to customize your output file:

  • Separator: this option allows you to choose a separator for your output file values. The default separator is the comma. You can also select the semicolon, the tab, or the space.

  • New line: this option allows you to set the new line character used as the line break in the generated CSV file: “LF” or “CRLF”.

  • Output fields: this option allows you to include any of your dataset fields in the output file, or exclude them from it, using the preview shown in Figure 7.25.

    Note: a maximum of 100 fields are displayed in the preview, but all your dataset fields are included in the output file by default unless you exclude them.

  • Headers: this option includes or excludes a first row in the output file (and in the output dataset) with the name of each column. By default, BigML includes the headers.

  • Distance: this option allows you to include an additional column in the output file with the distance between the instance and the centroid. By default, BigML does not include this column.

  • Centroid column name: this option allows you to customize the name for the distance column. By default, BigML uses “distance”.

Figure 7.25 Output settings for batch centroids
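
To make the effect of these settings concrete, the following sketch renders a small result set with Python's standard csv module. It only mimics the shape of the output file and is not how BigML produces it: the function render_output, its parameters, the sample row, and the “centroid” column name are assumptions made for this example.

```python
import csv
import io

# The two line-break choices offered by the "New line" setting.
NEW_LINES = {"LF": "\n", "CRLF": "\r\n"}


def render_output(rows, fields, separator=",", new_line="LF",
                  headers=True, distance=False, distance_name="distance"):
    """Render rows as delimited text, following the settings listed above.

    `rows` are dicts holding the input fields plus "centroid" and "distance"
    keys; `fields` plays the role of the "Output fields" selection.
    """
    columns = list(fields) + ["centroid"]  # hypothetical centroid column name
    if distance:
        columns.append(distance_name)
    buffer = io.StringIO(newline="")  # keep line terminators untranslated
    writer = csv.writer(buffer, delimiter=separator,
                        lineterminator=NEW_LINES[new_line])
    if headers:
        writer.writerow(columns)
    for row in rows:
        values = [row[name] for name in fields] + [row["centroid"]]
        if distance:
            values.append(row["distance"])
        writer.writerow(values)
    return buffer.getvalue()


rows = [{"sepal length": 5.1, "species": "setosa",
         "centroid": "Cluster 0", "distance": 0.12}]

# Semicolon separator, CRLF line breaks, headers on, distance column added.
full = render_output(rows, ["sepal length", "species"], separator=";",
                     new_line="CRLF", distance=True)
# Defaults otherwise, with one output field and the header row excluded.
bare = render_output(rows, ["species"], headers=False)
```

With the space as separator, csv.writer quotes values that themselves contain spaces, which is one reason the comma, semicolon, or tab is usually a safer choice for fields with free text.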