Datasets with the BigML Dashboard

8.2 Aggregating Instances

The aggregating instances option in BigML allows you to group the rows of a dataset by a given field. This is a very common transformation to prepare your data for Machine Learning models. For example, imagine you have customer data stored in a dataset where each purchase is a different row. If you want to use this dataset to train models to analyze customers’ purchase behaviors, you need a dataset where each row is a customer instead of a purchase. This is the case of the dataset in Figure 8.18 where we can aggregate the instances by the field “customerID” to get a row per unique customer. Apart from grouping the instances by customer, we also need to add the purchase information per customer. We can do this by defining some aggregation functions on top of the former fields per purchase. For example, in the image below you can find the total purchases per customer (“Count_customerID”), the total units purchased (“Sum_Quantity”), the first purchase date (“Min_Date”) and the average price per unit spent per customer (“Avg_UnitPrice”).

\includegraphics[]{images/aggregate1}
Figure 8.18 Aggregate instances by customer ID example

The example above can be easily executed in the BigML Dashboard by following these steps:

  • Find the Aggregate instances option in the dataset configuration menu (see Figure 8.19 ).

    \includegraphics[]{images/aggregate}
    Figure 8.19 Select the option to aggregate the instances
  • When the configuration panel has been displayed, select a field to aggregate your instances. You can select any type of field (numeric, categorical, text or datetime fields) and your instances will be grouped by the unique values of this field. In this case, we select “CustomerID” because we want a dataset with one row per customer (see Figure 8.20 ).

    \includegraphics[]{images/aggregate2}
    Figure 8.20 Select a field to aggregate the instances

    You can optionally add more aggregation fields by clicking on the option shown in Figure 8.21 . You can add up to five fields from the Dashboard, if you need to aggregate more fields you can use the API. This option is very useful when you need to aggregate fields in a nested format, e.g. you may want that each row represents a customer per day.

    \includegraphics[]{images/aggregate21}
    Figure 8.21 Add more aggregation fields
  • When you select the aggregating field, you can see that BigML automatically displays an operation which is the row count to aggregate the instances. This operation calculates the number of rows per value for the aggregating field. In the example below (see Figure 8.22 ), the count operation on top of the “CustomerID” allows us to know the total purchases per customer. You can also remove this operation if you are not insterested in it by clicking the remove icon on the right-hand side.

    \includegraphics[]{images/aggregate3}
    Figure 8.22 Row count operation by default

    At this point, we can go ahead and create a new dataset that only contains two fields, the “CustomerID” and the “row_count”. However, this new dataset with only two fields will not be very useful to train any Machine Learning model. We want to add more fields to the resulting dataset that gather as much information as we can about each customer’s purchase behavior.

  • You can add more fields to the dataset by defining additional aggregation operations. For example, imagine we want to know the total units purchased per customer, we can select Sum in the operation selector (see Figure 8.23 ):

    \includegraphics[]{images/aggregate4}
    Figure 8.23 Add more operations

    And then select the field “Quantity” as shown in Figure 8.24 :

    \includegraphics[]{images/aggregate5}
    Figure 8.24 Select the field for the chosen operation

    All operations have a prefix for fields defined. In the resulting dataset, all the fields that have a given operation applied will be renamed with the prefix before their actual names. This allows you to know the operation applied to a given field. You can edit this prefix name or remove it.

    You can select the following operations depending on the field type:

    • Count: counts the total rows per unique value of the aggregating field. It can be applied to all field types.

    • Count distinct: counts the rows that have distinct values per unique value of the aggregating field. It can be applied to all field types.

    • Count missings: counts the rows that have missing values per unique value of the aggregating field. It can be applied to all field types.

    • Sum: sums the values of the aggregated instances. Only for numeric fields.

    • Average: averages the values of the aggregated instances. Only for numeric fields.

    • Maximum: takes the maximum value of the aggregated instances. Only for numeric fields.

    • Minimum: takes the minimum value of the aggregated instances. Only for numeric fields.

    • Standard deviation: takes the standard deviation of the aggregated instances. Only for numeric fields.

    • Variance: takes the variance of the aggregated instances. Only for numeric fields.

    • Concatenate values: concatenates the values of the aggregated instances. Only for categorical, text, and items fields. You can also define the separator and the final field type (text, categorical or items field).

    • Concatenate distinct values: concatenates the distinct values of the aggregated instances. Only for categorical, text, and items fields. You can also define the separator and the final field type (text, categorical or items field).

    Note: the fields in the original dataset that do not have an operation defined will be dropped from the final dataset.

    For our example, we are defining more operations such as the total units purchased, the total price spent per customer, the average price per purchase per customer, and the concatenation of the purchased products descriptions (see Figure 8.25 ).

    \includegraphics[]{images/aggregate6}
    Figure 8.25 Define all the operations you want for the dataset fields
  • Finally, click Aggregate instances and a new dataset with the new aggregated instances and field calculations will be created.

The aggregation option in the Dashboard uses an SQL query underneath. Therefore, when the new dataset is created, you can view the SQL query by clicking the option shown in Figure 8.26 below.

\includegraphics[]{images/aggregate7}
Figure 8.26 View the SQL query of the aggregation performed