Anomaly Detection with the BigML Dashboard

4.5 Sampling Options

Sometimes you do not need all the data contained in your testing dataset to generate your anomalies. If you have a very large dataset, sampling can be a good way to get results faster (see Figure 4.6). You can configure the sampling options explained in the following sections.

4.5.1 Rate

The rate is the proportion of instances to include in your sample. Set any value between 0% and 100%. It defaults to 100%.

4.5.2 Range

The range specifies the subset of instances from which to sample, e.g., instances 1 through 200. The rate you set is applied to the configured range, not to the whole dataset.
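The interaction between rate and range can be illustrated with a minimal Python sketch. It is not BigML's implementation: the helper `sample_from_range` and its parameters are hypothetical, and it assumes the sample size is the rate multiplied by the number of instances in the range.

```python
import random

def sample_from_range(instances, rate, first, last, seed=None):
    """Hypothetical helper: sample `rate` of the instances in the
    1-based, inclusive range [first, last]."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    subset = instances[first - 1:last]   # restrict to the range first
    k = round(rate * len(subset))        # the rate applies to the range only
    return random.Random(seed).sample(subset, k)

rows = list(range(1, 1001))              # stand-in for a 1,000-instance dataset
sample = sample_from_range(rows, rate=0.5, first=1, last=200, seed=42)
print(len(sample))                       # 100: 50% of the 200 instances in the range
```

With a 50% rate and a range of 1 to 200, the sample holds 100 instances, not 500.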

4.5.3 Sampling

By default, BigML selects the instances for your sample by using a random number generator, which means two samples from the same dataset will likely be different even when they use the same rate and range. If you choose deterministic sampling, the random number generator will always use the same seed, thus producing repeatable results. This lets you work with identical samples from the same dataset.
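The effect of a fixed seed can be sketched with Python's standard `random` module. This only illustrates the general idea of seeded sampling; the function `draw_sample` is hypothetical and BigML's own generator and seed handling may differ.

```python
import random

def draw_sample(instances, rate, seed=None):
    """Hypothetical helper: draw a sample without replacement.
    With seed=None the generator is seeded from system entropy,
    so repeated calls will usually differ."""
    rng = random.Random(seed)
    k = round(rate * len(instances))
    return rng.sample(instances, k)

rows = list(range(1, 101))

# Deterministic: the same seed yields the same sample every time.
a = draw_sample(rows, 0.3, seed=2024)
b = draw_sample(rows, 0.3, seed=2024)
print(a == b)    # True

# Non-deterministic: no seed, so the two samples will very likely differ.
c = draw_sample(rows, 0.3)
d = draw_sample(rows, 0.3)
```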

4.5.4 Replacement

Sampling with replacement allows a single instance to be selected multiple times. Sampling without replacement ensures that each instance is selected at most once. By default, BigML generates samples without replacement.
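The difference between the two modes can be seen in a short Python sketch using the standard library, where `random.choices` draws with replacement and `random.sample` draws without. The data and seed are illustrative only.

```python
import random

rng = random.Random(7)
rows = list(range(1, 11))                      # ten instances

# With replacement: every draw sees all ten instances, so repeats are possible.
with_replacement = rng.choices(rows, k=10)

# Without replacement: each instance can appear at most once.
without_replacement = rng.sample(rows, k=10)

print(len(set(without_replacement)))           # 10: all distinct
print(len(set(with_replacement)) <= 10)        # True: duplicates may reduce the distinct count
```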

4.5.5 Out of bag

This argument creates a sample containing only out-of-bag instances for the currently defined rate. If an instance is not selected as part of a sample, it is considered out of bag. Thus, the final total percentage of instances for your sample will be 100% minus the rate configured for your sample (when replacement is false). This can be useful for splitting a dataset into training and testing subsets. It is only selectable when the sample rate is less than 100%.
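The training and testing split described above can be sketched in Python. The function `split_by_out_of_bag` is hypothetical, and the sketch assumes sampling without replacement and a fixed seed, so that the in-bag and out-of-bag samples are exact complements.

```python
import random

def split_by_out_of_bag(instances, rate, seed):
    """Hypothetical helper: return (in_bag, out_of_bag) for one seeded sample."""
    k = round(rate * len(instances))
    # Pick the positions that belong to the sample (without replacement).
    picked = set(random.Random(seed).sample(range(len(instances)), k))
    in_bag = [x for i, x in enumerate(instances) if i in picked]
    out_of_bag = [x for i, x in enumerate(instances) if i not in picked]
    return in_bag, out_of_bag

rows = list(range(1, 1001))
train, test = split_by_out_of_bag(rows, rate=0.8, seed=2024)
print(len(train), len(test))    # 800 200
```

With an 80% rate, the out-of-bag sample holds the remaining 20% of the instances, and the two subsets never overlap.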

\includegraphics[]{images/an-sampling}
Figure 4.6 Sampling options for anomalies