Datasets with the BigML Dashboard
8.4 Merging Datasets
If your instances are spread across different datasets, you can combine them into a single dataset using the merge option in BigML. This functionality is especially useful when you work with multiple sources of data. Imagine, for example, that you collect data on an hourly basis and want a dataset that aggregates everything collected over the whole day. You only need to send the new data generated each hour to BigML, create a Source and a Dataset for each batch, and then merge all the individual datasets into one at the end of the day.
For example, imagine we have employee data in two different datasets and we want to merge them into one dataset (see Figure 8.37).
You can easily merge datasets in the BigML Dashboard by following these steps:
From one of the datasets, open the Configure dataset menu (see Figure 8.38). By convention, this first dataset defines the fields of the final dataset. All datasets should have the same field names and IDs. If the first dataset has fields that are not found in the other datasets, the merge returns an error. However, if the other datasets have fields that are not found in the first dataset, you can still execute the merge, and those fields are dropped from the final dataset. For the moment, mapping fields between different datasets is only available through the merging option of the API.
Select the datasets you want to merge (see Figure 8.39).
You can select up to 32 datasets (see Figure 8.40).
You can sample each of the selected datasets (see Section 7.2 for an explanation of each sampling option).
Click the button to create a new dataset with all the merged instances.
From the resulting dataset you can click the option shown in Figure 8.42 to see the merge configuration of each dataset.
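The merge rules in the steps above can be illustrated with a small, self-contained Python sketch. It is a toy model, not BigML's implementation: the `Dataset` class, the `merge_datasets` function, and the employee rows are hypothetical names and made-up data, and sampling is assumed to be a simple random sample without replacement.

```python
from dataclasses import dataclass
import random

MAX_DATASETS = 32  # selection limit stated in the steps above


@dataclass
class Dataset:
    fields: list  # field names, in order
    rows: list    # one dict per instance, keyed by field name


def merge_datasets(datasets, sample_rates=None, seed=0):
    """Toy merge that mimics the rules described above (not BigML's code)."""
    if not 1 <= len(datasets) <= MAX_DATASETS:
        raise ValueError(f"select between 1 and {MAX_DATASETS} datasets")
    if sample_rates is None:
        sample_rates = [1.0] * len(datasets)
    if len(sample_rates) != len(datasets):
        raise ValueError("provide one sample rate per dataset")

    # The first dataset defines the fields of the final dataset.
    base_fields = list(datasets[0].fields)
    for position, dataset in enumerate(datasets[1:], start=2):
        missing = [name for name in base_fields if name not in dataset.fields]
        if missing:
            raise ValueError(f"dataset {position} lacks fields {missing}")

    rng = random.Random(seed)
    merged_rows = []
    for dataset, rate in zip(datasets, sample_rates):
        rows = dataset.rows
        if rate < 1.0:
            # Simple random sample without replacement (an assumption).
            rows = rng.sample(rows, round(len(rows) * rate))
        # Fields that the first dataset does not have are dropped here.
        merged_rows.extend(
            {name: row[name] for name in base_fields} for row in rows
        )
    return Dataset(base_fields, merged_rows)


# Made-up employee data, for illustration only.
first = Dataset(["name", "salary"], [{"name": "Ana", "salary": 50000}])
second = Dataset(
    ["name", "salary", "team"],
    [{"name": "Bo", "salary": 42000, "team": "ops"}],
)
merged = merge_datasets([first, second])
print(merged.fields)     # the extra "team" field is not part of the result
print(len(merged.rows))  # instances from both datasets are kept
```

Swapping the order, as in `merge_datasets([second, first])`, raises an error in this sketch because the first dataset would then define a `team` field that the other dataset lacks, which mirrors the rule described for the Dashboard.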
Note: the merging option is the only transformation option that does not use a SQL query behind the scenes.
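Since the steps above point to the API for field mapping, it may help to see roughly what a merge request body could look like. The sketch below only builds a JSON string and sends nothing. The key names `origin_datasets` and `sample_rates` are assumptions that should be checked against the BigML API documentation, and the helper `build_merge_body` and the dataset IDs are hypothetical placeholders.

```python
import json


def build_merge_body(dataset_ids, sample_rates=None):
    """Build a JSON body for a merge request (key names are assumptions)."""
    if not dataset_ids:
        raise ValueError("at least one dataset ID is required")
    # Assumed key: the ordered list of datasets to merge, first one first.
    body = {"origin_datasets": list(dataset_ids)}
    if sample_rates:
        unknown = sorted(set(sample_rates) - set(dataset_ids))
        if unknown:
            raise ValueError(f"sample rates given for unselected datasets: {unknown}")
        # Assumed key: optional per-dataset sampling rate.
        body["sample_rates"] = dict(sample_rates)
    return json.dumps(body, indent=2)


# Placeholder IDs; real ones come from your own account.
ids = ["dataset/first-placeholder", "dataset/second-placeholder"]
print(build_merge_body(ids, {"dataset/second-placeholder": 0.5}))
```

Keeping the dataset that should define the final fields first in the list follows the convention described for the Dashboard.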