Datasets with the BigML Dashboard

7.2 Sampling Datasets

Most of the time, you do not need all the data to generate your models. If you have very large datasets, sampling may be a good way of getting results and iterating faster. Sampling your data is straightforward with BigML. Simply open the configure option menu and select Sample Dataset. (See Figure 7.7.)

\includegraphics[]{images/access-sampling}
Figure 7.7 Access to sample your dataset

The sections below explain in detail all the configuration options that BigML offers for sampling your dataset.

7.2.1 Sampling

You can configure the sampling rate by moving the slider in the configuration panel for sampling, or by typing the percentage in the small input box, both highlighted in Figure 7.8. The rate is the proportion of instances to include in your sample. You can also give your sampled dataset a different name.

\includegraphics[]{images/sampling}
Figure 7.8 Configuration panel for sampling

7.2.2 Advanced Sampling

If you prefer to sample your dataset differently, configure the following advanced options in the configuration panel for advanced sampling. (See Figure 7.9.)

Range

Specify a subset of instances, when the instances are ordered, from which to sample. For example, choose a range from instance 100 to instance 200. The specified rate is applied only to the configured subset. This option may be useful when you have temporal data and you want to train your model with historical data and test it with the most recent data to check whether it can predict over time.
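The interaction between range and rate can be illustrated with a minimal Python sketch. It is not BigML's implementation: the function name `sample_in_range`, the inclusive 1-based range, and taking exactly `int(n * rate)` rows are all assumptions made for illustration.

```python
import random

def sample_in_range(rows, start, end, rate, seed=None):
    """Sample a fraction `rate` of the rows at 1-based positions start..end.

    Hypothetical helper: the inclusive bounds and the fixed sample size
    are assumptions, not documented BigML behavior.
    """
    subset = rows[start - 1:end]      # restrict to the range first
    k = int(len(subset) * rate)       # the rate applies to the subset only
    rng = random.Random(seed)
    return rng.sample(subset, k)      # draw without replacement

rows = list(range(1, 1001))           # 1000 ordered instances
sample = sample_in_range(rows, 100, 200, 0.5, seed=42)
```

Here the range holds 101 instances, so a 50% rate yields 50 instances, all taken from positions 100 to 200 and none from the rest of the dataset.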

Sampling

By default, BigML selects the instances for the sample by using a random number generator, which means two samples from the same dataset will likely be different even when they use the same rate and row range. The exception is a rate of 100% without replacement, which always returns every instance. If you choose deterministic sampling, the random number generator will always use the same seed, thus producing repeatable results. This lets you work with identical samples from the same dataset.
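The effect of a fixed seed can be sketched with Python's standard `random` module. This only illustrates the general idea of seeded sampling; the helper `sample_rows` and the seed value are hypothetical and say nothing about how BigML generates or stores its seed.

```python
import random

def sample_rows(rows, rate, seed=None):
    """Return a random sample of `rows`; a fixed seed makes it repeatable."""
    rng = random.Random(seed)         # seed=None uses system entropy
    k = int(len(rows) * rate)
    return rng.sample(rows, k)

rows = list(range(1000))

# Deterministic: the same seed yields the same instances in the same order.
first = sample_rows(rows, 0.3, seed=7)
second = sample_rows(rows, 0.3, seed=7)
identical = first == second

# Non-deterministic: without a seed, two runs will almost surely differ.
unseeded = sample_rows(rows, 0.3)
```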

Replacement

Sampling with replacement allows a single instance to be selected multiple times. Sampling without replacement ensures that each instance cannot be selected more than once. By default, BigML generates samples without replacement.
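As a rough illustration of the difference, the standard `random` module offers both modes: `sample` draws without replacement and `choices` draws with replacement. The variable names below are illustrative only.

```python
import random

rng = random.Random(0)
rows = list(range(10))

# Without replacement: each instance appears at most once.
without_replacement = rng.sample(rows, 5)

# With replacement: the same instance may be drawn several times.
with_replacement = rng.choices(rows, k=5)
```

With replacement, a sample can contain duplicates, so it can even be as large as the original dataset while still leaving some instances out.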

Out of Bag

Create a sample containing only the out-of-bag instances for the currently defined rate, that is, the instances that a regular sample at that rate would leave out. When replacement is false, the proportion of instances in your sample will be one minus the configured rate; for example, a rate of 80% yields an out-of-bag sample with the remaining 20% of the instances. This can be useful for splitting a dataset into training and testing subsets. This option is only selectable when the sample rate is less than 100%.
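A training and testing split built from an in-bag sample and its out-of-bag complement might look like the following sketch. It assumes both halves are derived from the same seeded draw without replacement; the function `split_in_and_out_of_bag` is hypothetical and not part of BigML.

```python
import random

def split_in_and_out_of_bag(rows, rate, seed):
    """Split rows into an in-bag sample and its out-of-bag complement."""
    rng = random.Random(seed)
    k = int(len(rows) * rate)
    # Pick row positions rather than values so duplicate rows are handled.
    picked = set(rng.sample(range(len(rows)), k))
    in_bag = [row for i, row in enumerate(rows) if i in picked]
    out_of_bag = [row for i, row in enumerate(rows) if i not in picked]
    return in_bag, out_of_bag

rows = list(range(100))
train, test = split_in_and_out_of_bag(rows, 0.8, seed=1)
```

Because both subsets come from the same draw, they do not overlap and together cover the whole dataset: 80 instances for training and the remaining 20 for testing.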

\includegraphics[]{images/advanced-sampling}
Figure 7.9 Configuration panel for advanced sampling