Cluster Analysis with the BigML Dashboard

4.8 Sampling Options

Sometimes you do not need all the instances contained in your dataset to build your cluster. If you have a very large dataset, sampling may be a good way of getting faster results. (See Figure 4.11.)

The same sampling options described in the Datasets with the BigML Dashboard document [23] for sampling datasets are also available when building clusters. They are divided into two groups: sampling and advanced sampling options.

Figure 4.11 Sampling options for clusters

4.8.1 Rate

The sampling rate is the proportion of instances extracted from the dataset and included in your sample. A sampling rate of 100% means that all instances are included; a rate of 10% means that roughly one in ten instances is included. This option may take any value between 0% and 100%. You can configure the rate by moving the slider in the sampling configuration panel, or by typing the percentage in the input box next to it, both highlighted in Figure 4.11.

By default, BigML uses a 100% rate.
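
The effect of the rate on the sample size can be illustrated with a minimal Python sketch. It is not BigML's implementation; the function name `sample_by_rate` and its parameters are hypothetical and only model the behavior described above, with the rate expressed as a fraction between 0.0 and 1.0.

```python
import random

def sample_by_rate(instances, rate, seed=None):
    """Return about rate * len(instances) instances, drawn without replacement.

    Hypothetical helper for illustration only; not part of any BigML API.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    size = round(len(instances) * rate)
    rng = random.Random(seed)
    return rng.sample(instances, size)

rows = list(range(1, 101))                 # a toy dataset of 100 instances
subset = sample_by_rate(rows, 0.10, seed=42)
print(len(subset))                         # 10
```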

4.8.2 Range

The sampling range is the contiguous subset of the dataset instances that you want to include in the sample, e.g., from instance 5 to instance 1,000. The rate is applied to the instances within the configured range.

By default, all instances are included, i.e., the range runs from 1 to the number of rows in the dataset.
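
The interaction between range and rate can be sketched as follows. This is an illustration under stated assumptions, not BigML's code: the helper `sample_range` is hypothetical, and it assumes the range is 1-based and includes both endpoints, as the example in the text suggests.

```python
import random

def sample_range(instances, first, last, rate=1.0, seed=None):
    """Restrict to instances first..last (1-based, inclusive), then apply the rate.

    Hypothetical helper for illustration only.
    """
    if not 1 <= first <= last <= len(instances):
        raise ValueError("range must lie within the dataset")
    window = instances[first - 1:last]      # convert to 0-based slicing
    size = round(len(window) * rate)
    return random.Random(seed).sample(window, size)

rows = list(range(1, 2001))                 # toy dataset: instances 1..2000
subset = sample_range(rows, 5, 1000, rate=0.5, seed=7)
print(len(subset))                          # 498, i.e. half of the 996 instances in the range
```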

4.8.3 Sampling

The sampling option represents the type of the sampling process, which can be either random or deterministic.

When using deterministic sampling, the random-number generator always uses the same seed, so repeating the sampling with the same configuration produces the same sample.

By default, BigML uses random sampling.
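
The difference between the two modes comes down to how the random-number generator is seeded. The sketch below uses Python's standard `random` module; the constant `FIXED_SEED` and the function `draw` are hypothetical names chosen for this example and do not reflect how BigML seeds its generator internally.

```python
import random

FIXED_SEED = "example-seed"   # hypothetical value, for illustration only

def draw(instances, size, deterministic=False):
    """Draw `size` instances; reuse a fixed seed when deterministic."""
    rng = random.Random(FIXED_SEED) if deterministic else random.Random()
    return rng.sample(instances, size)

rows = list(range(1, 101))
first = draw(rows, 3, deterministic=True)
second = draw(rows, 3, deterministic=True)
print(first == second)        # True: the same seed yields the same sample
```

With `deterministic=False` the generator is seeded from system entropy, so two calls will usually return different samples.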

4.8.4 Replacement

The replacement option controls whether a single instance can be selected more than once. Sampling with replacement allows the same instance to appear several times in the sample; sampling without replacement ensures that no instance is selected more than once.

By default, BigML generates samples without replacement.
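
A short sketch makes the distinction concrete. It relies on the standard library calls `random.Random.sample` (without replacement) and `random.Random.choices` (with replacement); the wrapper `draw_sample` is a hypothetical name, not a BigML function.

```python
import random

def draw_sample(instances, size, replacement=False, seed=None):
    """Draw `size` instances with or without replacement (illustrative helper)."""
    rng = random.Random(seed)
    if replacement:
        # Each draw is independent, so an instance may be picked repeatedly.
        return rng.choices(instances, k=size)
    # Each instance can be picked at most once.
    return rng.sample(instances, size)

rows = list(range(1, 11))                        # 10 instances
unique = draw_sample(rows, 5, seed=3)
print(len(set(unique)))                          # 5: no duplicates possible

repeated = draw_sample(rows, 20, replacement=True, seed=3)
print(len(repeated), len(set(repeated)) < 20)    # 20 True: duplicates are unavoidable here
```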

4.8.5 Out of Bag

The out of bag option lets you include in your sample only those instances that were not selected in the first place, effectively inverting the sampling outcome. It is only selectable when sampling is deterministic and the sampling rate is less than 100%. The percentage of instances included in your sample will be 100% minus the rate (when replacement is not allowed). This can be useful for splitting a dataset into training and testing subsets.

By default, BigML will not use out of bag instances.
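
The train/test split mentioned above can be sketched in a few lines of Python. This is a conceptual illustration, not BigML's implementation: the function `split_sample` and the seed value are hypothetical. It shows why deterministic sampling is required, since both calls must reproduce the same selection for the two subsets to be complementary.

```python
import random

def split_sample(instances, rate, seed, out_of_bag=False):
    """Return the sampled instances, or their complement when out_of_bag is True.

    Hypothetical helper; sampling is without replacement.
    """
    rng = random.Random(seed)
    indices = list(range(len(instances)))
    chosen = set(rng.sample(indices, round(len(indices) * rate)))
    if out_of_bag:
        return [x for i, x in enumerate(instances) if i not in chosen]
    return [x for i, x in enumerate(instances) if i in chosen]

rows = list(range(1, 101))
train = split_sample(rows, 0.8, seed="split")                  # 80% in the bag
test = split_sample(rows, 0.8, seed="split", out_of_bag=True)  # remaining 20%
print(len(train), len(test))                                   # 80 20
print(set(train).isdisjoint(test))                             # True
```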