Datasets with the BigML Dashboard

8.3 Joining Datasets

It is very common to have the data scattered in two or more different datasets. BigML allows you to join several datasets to combine their fields and instances based on one or more related fields between them. For example, imagine we want to predict employee performance and we have two different sources of data: a dataset containing employees’ data (employee name, salary, age, etc.) and another dataset containing departments data (department name, budget, etc.). (See Figure 8.27 ). If we want to include the department data as additional predictor for our employees analysis, we can use a common field in both datasets (department_id) to add the department characteristics to the employee dataset.

\includegraphics[]{images/join-example}
Figure 8.27 Join example

The example above can be easily executed in the BigML Dashboard by following these steps:

  • First of all, you need to upload both Sources to BigML and create a Datasets from each source (see Chapter 3 ).

  • When the datasets are created, find the join datasets option in the employees dataset configuration menu as shown in Figure 8.28 . We use the employees dataset and not the departments dataset because our ultimate goal is to analyze employee performance, hence we need to use employee data to train a Machine Learning model.

    \includegraphics[]{images/join-dataset1}
    Figure 8.28 Join datasets
  • This option will display the join configuration panel in which you need to input the following parameters:

    • Type of join: you can perform four different types of join:

      • Left join: returns all the instances from the current (left) dataset, the employees dataset, and the matched instances from the selected (right) dataset, the departments dataset. If there are instances in the current dataset that do not have a matching instance in the selected dataset, the field values will be missing.

      • Right join: returns all the instances from the selected (right) dataset, and the matched instances from the current (left) dataset, the departments dataset. If there are instances in the selected dataset that do not have a matching instance in the current dataset, the field values will be missing.

      • Full join: returns the matched and unmatched instances in both datasets.

      • Inner join: returns the instances that have matching values in both datasets, the rest of instances will be droped.

      For our example we are performing a left join since we are interested in having all the employees data to make our predictive model and it is not so important if a given employee does not have a department assigned. In this case, the department information will be missing for that employee and the models in BigML can handle missing values afterward.

      \includegraphics[]{images/join2}
      Figure 8.29 Select the type of join
    • Select a dataset to join with: this is the dataset you want to join with the current dataset. Select a dataset that contains at least one field in common with the current dataset to perform the match between the instances.

      \includegraphics[]{images/join3}
      Figure 8.30 Select the dataset to make the join
    • Join fields (current dataset): select one or more fields from the current dataset (the employees dataset) to match the instances with the selected dataset (the departments dataset). These fields should have the same values in both datasets so the instances can be matched. Usually a field with unique values per instance such as an ID field is used here.

      \includegraphics[]{images/join4}
      Figure 8.31 Select the join field from the current dataset
    • Join fields (selected dataset): select one or more fields from the selected dataset (the departments dataset) to match the instances with the current dataset (the employees dataset). These fields should have the same values in both datasets so the instances can be matched. Usually a field with unique values per instance such as an ID field is used here.

      \includegraphics[]{images/join5}
      Figure 8.32 Select the join field from the selected dataset
    • Choose the fields from the selected dataset to be included in the final output: you can choose to include all the fields from the selected dataset or select a subset of them.

      \includegraphics[]{images/join6}
      Figure 8.33 Choose the fields from the selected dataset
  • Optionally, you can filter the current and/or the selected dataset before creating the new joined dataset. You can add up to six different filters. You can filter any type of field except full date-time fields. Please read more about filtering datasets in section 7.3 .

    \includegraphics[]{images/join7}
    Figure 8.34 Filter one or more fields from the current and/or the selected dataset
  • Click the Join datasets button.

    \includegraphics[]{images/join8}
    Figure 8.35 Join datasets
  • A new dataset with the matched instances and the new fields will be created.

    Note: if each instance has one single match in both datasets, i.e., the join fields have unique values per instance, the resulting dataset will have a maximum number of instances equal to the dataset with most instances. However, if the instances in one or more datasets have repeated values for the join fields, each instance in a given dataset will be matched as many times as it finds the same matching value in the other dataset. Therefore, the final number of instances may be much larger than the number of instances in both original datasets.

The join option in the Dashboard uses an SQL query underneath. Therefore, when the joined dataset is created, you can view the SQL query by clicking the option shown in Figure 8.36 below.

\includegraphics[]{images/join9}
Figure 8.36 View join SQL query