Association Discovery with the BigML Dashboard

4.8 Sampling Options

If you do not want to use your entire dataset to create associations, BigML lets you create them from a sample of it. You can configure the sampling options explained in the following subsections (see Figure 4.9).

\includegraphics[]{images/assoc-sampling}
Figure 4.9 Configuration panels to sample your dataset

4.8.1 Rate

The Rate option allows you to set the proportion of instances to include in your sample. It takes a value between 0% and 100% and defaults to 100%. You can change this value by moving the rate slider shown in Figure 4.9 or by typing the percentage in the input box.

4.8.2 Range

The Range option lets you specify a contiguous subset of the instances that you want to consider for your sample, e.g., from instance 100 to instance 500. Select the desired range by moving the range slider shown in Figure 4.9 or by typing the range limits in the input boxes. The rate that you set is applied only to the instances within the range you specify.
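The interaction between rate and range can be illustrated with a minimal sketch. It is not BigML's implementation: the function name and parameters are hypothetical, and it assumes that the range is 1-based and inclusive and that the sample size is the rate times the range size, truncated to an integer.

```python
import random


def sample_within_range(total_rows, first, last, rate, seed=None):
    """Return sorted row numbers sampled from rows first..last.

    Assumptions (not taken from BigML): rows are numbered from 1, the
    range is inclusive, and the sample size is truncated to an integer.
    """
    if not (1 <= first <= last <= total_rows):
        raise ValueError("range must lie inside the dataset")
    if not (0.0 <= rate <= 1.0):
        raise ValueError("rate must be between 0 and 1")
    candidates = range(first, last + 1)    # only rows inside the range
    size = int(rate * len(candidates))     # rate applies to the range, not the dataset
    rng = random.Random(seed)
    return sorted(rng.sample(candidates, size))


rows = sample_within_range(total_rows=1000, first=100, last=500, rate=0.5, seed=42)
print(len(rows))                            # 200, out of the 401 rows in the range
print(min(rows) >= 100, max(rows) <= 500)   # True True
```

In this sketch a 50% rate over rows 100 to 500 yields 200 instances, not 500, because the rate is computed over the 401 rows in the range and not over the 1,000 rows in the dataset.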

4.8.3 Sampling

By default, BigML selects the instances for your sample using a random number generator, so two samples from the same dataset will likely differ even when they use the same rate and row range. You can choose between random sampling and deterministic sampling. With deterministic sampling, the random number generator always uses the same seed, which produces repeatable results and lets you work with identical samples from the same dataset.
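The difference between the two modes can be sketched with Python's standard `random` module. This is only an illustration of seeded versus unseeded sampling; `draw_sample` and the seed string are hypothetical and do not reflect how BigML generates or stores its seeds.

```python
import random


def draw_sample(n_rows, rate, seed=None):
    """Sample row indexes without replacement.

    seed=None imitates random sampling (a fresh seed on each call);
    a fixed seed imitates deterministic sampling. Hypothetical helper,
    not BigML's implementation.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_rows), round(rate * n_rows)))


# Deterministic: the same seed yields the same sample every time.
first = draw_sample(1000, 0.8, seed="my-fixed-seed")
second = draw_sample(1000, 0.8, seed="my-fixed-seed")
print(first == second)  # True

# Random: no seed is fixed, so two draws will almost certainly differ.
print(draw_sample(1000, 0.8) == draw_sample(1000, 0.8))
```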

4.8.4 Replacement

Sampling with replacement allows a single instance to be selected multiple times. Sampling without replacement ensures that each instance is selected at most once. By default, BigML generates samples without replacement.
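As a small illustration of the two behaviors, the following sketch draws ten instances from a ten-row stand-in dataset both ways. It uses Python's standard library only; the variable names and the seed are arbitrary choices for the example.

```python
import random
from collections import Counter

rng = random.Random(7)          # fixed seed so the sketch is repeatable
instances = list(range(10))     # stand-in for ten dataset rows

# Without replacement: every drawn instance is distinct.
without = rng.sample(instances, k=10)

# With replacement: each draw is independent, so repeats are possible
# and some instances may never be drawn.
with_repl = rng.choices(instances, k=10)

print(max(Counter(without).values()))   # 1: no instance appears twice
print(Counter(with_repl))               # counts above 1 indicate repeated instances
```

Without replacement, a 100% rate simply returns every instance once. With replacement, the sample has the same size but typically repeats some instances and omits others.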

4.8.5 Out of Bag

This option creates a sample containing only the out-of-bag instances for the currently defined rate. An instance is considered out-of-bag if it is not selected as part of the sample. Thus, when sampling without replacement, the final percentage of instances in your sample will be 100% minus the configured rate. This can be useful for splitting a dataset into training and testing subsets. The option can only be selected when the sample rate is less than 100%.
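The training and testing split mentioned above can be sketched as follows. The helper is hypothetical and assumes sampling without replacement with a fixed seed, so that the in-bag and out-of-bag sets are exact complements of each other.

```python
import random


def split_by_out_of_bag(n_rows, rate, seed):
    """Return (in_bag, out_of_bag) row indexes for one seeded sample.

    Hypothetical helper: sampling is without replacement, so the two
    lists are disjoint and together cover every row.
    """
    rng = random.Random(seed)
    in_bag = set(rng.sample(range(n_rows), round(rate * n_rows)))
    out_of_bag = [i for i in range(n_rows) if i not in in_bag]
    return sorted(in_bag), out_of_bag


train, test = split_by_out_of_bag(n_rows=1000, rate=0.8, seed="split-1")
print(len(train), len(test))        # 800 200
print(set(train) & set(test))       # set(): no instance is in both subsets
```

In this sketch the two subsets are complementary only because both are derived from the same seeded draw; with a different seed for each subset, some instances could end up in both or in neither.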