Datasets with the BigML Dashboard

7.4 Remove Duplicates

Duplicated instances in a dataset can be problematic when training Machine Learning models. For example, if you randomly split your dataset into one subset for training and another for testing, copies of the same instance are likely to end up in both subsets. The model is then evaluated on instances it has already seen during training, which yields an unrealistically good performance estimate. By removing the duplicated instances, you ensure each dataset contains only unique instances (see Figure 7.39).

\includegraphics[]{images/remove-duplicates-example}
Figure 7.39 Remove duplicated instances example
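The leakage described above can be reproduced outside the Dashboard. The following is a minimal sketch in plain Python, not BigML code: the toy rows and the helper names `remove_duplicates`, `split`, and `leaked` are hypothetical and exist only for illustration. It counts how many test instances also appear in the training subset, before and after removing duplicates.

```python
import random


def remove_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows and cut it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def leaked(train, test):
    """Return the test rows that also occur in the training subset."""
    train_set = set(train)
    return [row for row in test if row in train_set]


# Hypothetical toy dataset: 4 distinct instances, each repeated 10 times.
rows = [(i, i % 2) for i in range(4)] * 10

train, test = split(rows)
# Each instance has more copies than the test subset has rows,
# so every test row necessarily has a twin in the training subset.
print(len(leaked(train, test)), "of", len(test), "test rows also appear in training")

deduped = remove_duplicates(rows)
train_u, test_u = split(deduped)
# With unique rows, the two subsets cannot share an instance.
print(len(leaked(train_u, test_u)), "leaked rows after removing duplicates")
```

In the first split, every test row is an exact copy of a training row, so any evaluation on it would overstate the model's quality. After deduplication the subsets are disjoint.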

With BigML you can remove the duplicated instances in your datasets by following the steps below:

  • Find the remove duplicates option in the dataset configuration menu, as shown in Figure 7.40.

    \includegraphics[]{images/remove-duplicates}
    Figure 7.40 Remove duplicates option
  • A configuration panel is displayed with a single parameter, the new dataset name. Then click the “Remove duplicates” button (see Figure 7.41).

    \includegraphics[]{images/remove-duplicates2}
    Figure 7.41 Remove duplicates
  • When the process has finished, an orange message at the top of the dataset indicates how many duplicated instances have been removed (see Figure 7.42). If your dataset contained no duplicated instances, the message says so too.

    \includegraphics[]{images/remove-duplicates3}
    Figure 7.42 Number of duplicates removed

The remove duplicates option in the Dashboard is implemented with an SQL query. Once the new dataset is created, you can view that query by clicking the option shown in Figure 7.43.

\includegraphics[]{images/remove-duplicates4}
Figure 7.43 View the SQL query of the operation performed
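This section does not reproduce the query text, so the exact SQL that BigML generates is not shown here. As an illustration only, a common way to express duplicate removal in SQL is `SELECT DISTINCT *`. The sketch below runs such a query against an in-memory SQLite table using Python's standard `sqlite3` module; the table name `A`, its columns, and its rows are assumptions made for the example, not BigML's schema.

```python
import sqlite3

# In-memory database standing in for the original dataset (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (sepal_length REAL, species TEXT)")
conn.executemany(
    "INSERT INTO A VALUES (?, ?)",
    [
        (5.1, "setosa"),
        (5.1, "setosa"),      # exact duplicate of the first row
        (6.3, "virginica"),
        (5.1, "setosa"),      # another exact duplicate
    ],
)

# DISTINCT keeps one copy of each row whose column values all match.
unique_rows = conn.execute("SELECT DISTINCT * FROM A").fetchall()
total_rows = conn.execute("SELECT COUNT(*) FROM A").fetchone()[0]
removed = total_rows - len(unique_rows)

print(sorted(unique_rows))
print("duplicated instances removed:", removed)
conn.close()
```

Note that `DISTINCT` compares entire rows: two instances count as duplicates only when every field has the same value.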