Datasets with the BigML Dashboard

7.1 Splitting Datasets in Training/Test

For most Machine Learning tasks, it is essential to evaluate your model to get an estimate of its performance. To do so, you need to split your dataset into two subsets. Use the bigger subset (called training data) to build your model, and then test the model's performance against the smaller subset (called test data). The test data must be data that the algorithm never saw when building the model. This way, you will be able to measure how your model actually performs when a new case appears. The complete evaluation process is explained in Classification and Regression with the BigML Dashboard [4].

This section explains how to split your dataset into two subsets. BigML offers two ways to do this: a 1-click action that creates a training dataset containing 80% of the data and a test dataset containing the remaining 20%, or a configuration option that lets you set those ratios yourself. The sections below cover both options.

7.1.1 Splitting Datasets with 1-Click

This option divides your dataset into two subsets: 80% of your data to train the model and the remaining 20% to test it. BigML provides two splitting options, random and linear. If you are training a Classification or Regression model, you usually use the random split, which randomly assigns instances to each subset. If you are training a time series model, you need to use the linear split, which assumes that the instances are chronologically ordered in the dataset and takes the first 80% for training and the last 20% for testing.
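The difference between the two options can be illustrated with a minimal Python sketch. It only mirrors the behavior described above and is not BigML's implementation; the function name `one_click_split` and its `linear` parameter are hypothetical, chosen for this example.

```python
import random


def one_click_split(rows, linear=False):
    """Return (training, test) lists holding 80% and 20% of rows.

    Hypothetical helper for illustration only; not part of BigML.
    """
    ordered = list(rows)
    if not linear:
        # Random split: shuffle a copy so instances are assigned at random.
        random.shuffle(ordered)
    # Linear split: the original order is kept, so the first 80% of the
    # instances go to training and the last 20% go to testing.
    cut = len(ordered) * 8 // 10
    return ordered[:cut], ordered[cut:]


rows = list(range(10))  # stand-in for chronologically ordered instances
training, test = one_click_split(rows, linear=True)
print(training)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(test)      # [8, 9]
```

With `linear=False`, the two subsets have the same sizes, but which instances land in each one changes from run to run.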

From the dataset view, select the most suitable option for your use case: 1-click random training|test or 1-click linear training|test. (See Figure 7.3.)

\includegraphics[]{images/one-click-split}
Figure 7.3 One-click training|test split

When BigML processes this request, both subsets are automatically created and displayed in your Dashboard. You can see the two separate subsets in the dataset list view. (See Figure 7.4.)

\includegraphics[]{images/train-test-split}
Figure 7.4 Training|test subsets in the dataset list view

7.1.2 Configuring Training/Test Split Options

BigML lets you choose what percentage of your data is used for training and what percentage for testing.

From the dataset view, click on the configure option menu and select Training and test set split. (See Figure 7.5.)

\includegraphics[]{images/configure-split-menu}
Figure 7.5 Access to configure the training|test split

You can configure the percentages for training and testing using the slider shown in Figure 7.6. In this example we choose 80% and 20% respectively.

You can also enter any string as the seed parameter to generate deterministic samples and get repeatable results. If you use the same seed for a given dataset, the training and test subsets will contain the same instances each time you make the split. Otherwise, the instances for each subset will be randomly selected, and you will get different training and test sets each time you split that dataset.

BigML also provides an option to make the split linear instead of random, i.e., the subsets are created according to the order of the instances in your dataset (the first subset of instances for training and the last subset for testing). Activate this option when you want to train and test a time series model, since its instances are chronologically ordered. You can also give your training and test sets custom names.

\includegraphics[]{images/split-conf}
Figure 7.6 Training|test splits configuration panel
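
The effect of the seed can be sketched in plain Python as follows. This is an illustration under stated assumptions, not BigML's sampling algorithm: the function `seeded_split` and its parameters `training_rate` and `seed` are hypothetical names, and Python's standard `random.Random` stands in for whatever generator BigML uses internally.

```python
import random


def seeded_split(rows, training_rate=0.8, seed=None):
    """Split rows into (training, test) using a configurable rate.

    Hypothetical helper for illustration only; not part of BigML.
    A string seed makes the selection repeatable; seed=None does not.
    """
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    cut = round(len(rows) * training_rate)
    training_indices = set(indices[:cut])
    # Instances keep their original relative order inside each subset.
    training = [r for i, r in enumerate(rows) if i in training_indices]
    test = [r for i, r in enumerate(rows) if i not in training_indices]
    return training, test


rows = list(range(100))
first = seeded_split(rows, training_rate=0.8, seed="my seed")
second = seeded_split(rows, training_rate=0.8, seed="my seed")
print(len(first[0]), len(first[1]))  # 80 20
print(first == second)               # True
```

Calling the function without a seed still yields an 80/20 split, but the instances selected for each subset are not repeatable across calls.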