Anomaly Detection with the BigML Dashboard

2.1 Isolation Forest

BigML's anomaly detector is an optimized implementation of the Isolation Forest [29] algorithm that helps users detect anomalies in their datasets. The basic idea is that anomalous instances are more susceptible to isolation than normal instances under a decision tree approach. BigML therefore builds an ensemble that deliberately overfits each individual tree in order to isolate every instance from the rest of the data points. Each tree is built by selecting a random feature and a random split, recursively partitioning the space at random until single instances are isolated. Anomalous instances should require fewer partitions to isolate than normal data points. Figure 2.1 illustrates how anomalous instances can be isolated with fewer splits than normal instances. In other words, the closer an instance is isolated to the root of a tree, the more anomalous it is.

\includegraphics[]{images/anomalies-intro}
Figure 2.1 Graphic representation example of a normal data point (left) versus an anomalous data point (right)
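The partitioning described above can be sketched as a single isolation tree. The function below is a hypothetical, simplified illustration (not BigML's actual implementation): it repeatedly picks a random feature and a random split value, keeps only the side of the split containing the instance, and counts how many splits are needed until the instance stands alone.

```python
import random

def isolation_path_length(point, data, depth=0, rng=random):
    """Count the random splits needed to isolate `point` from `data`.
    A simplified sketch of one isolation tree, for illustration only."""
    if len(data) <= 1:
        return depth  # the instance is isolated
    # Pick a random feature and a random split value within its range.
    feature = rng.randrange(len(point))
    values = [row[feature] for row in data]
    lo, hi = min(values), max(values)
    if lo == hi:
        return depth  # identical values: no further split possible
    split = rng.uniform(lo, hi)
    # Keep only the side of the split that contains the instance.
    side = [row for row in data
            if (row[feature] < split) == (point[feature] < split)]
    if len(side) == len(data):
        return depth  # degenerate split; stop to guarantee termination
    return isolation_path_length(point, side, depth + 1, rng)
```

Averaging this path length over many random trees reproduces the key effect: an outlier far from the bulk of the data is isolated in markedly fewer splits than a point inside a dense cluster.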

When all instances have been isolated, BigML automatically calculates an anomaly score by averaging, across the trees in the ensemble, the number of splits needed to isolate each instance. A lower number of splits results in a higher score. These averages are then normalized into a final score between 0% and 100% that measures how anomalous an instance is. For example, the red data point on the left in Figure 2.1 took 10 partitions to isolate, while the one on the right took only 4, so the one on the right will have the higher anomaly score.
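BigML's exact normalization is not spelled out here, but the standard Isolation Forest formula from the original paper maps an average path length to a bounded score; the sketch below applies it and scales the result to a percentage, matching the example above (4 splits scores higher than 10).

```python
from math import log

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def c(n):
    """Expected path length of an unsuccessful binary search over n
    instances: the standard normalizing constant for isolation forests."""
    if n <= 1:
        return 0.0
    return 2.0 * (log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """Map an average path length over n instances to a 0-100% score.
    Shorter paths (easier isolation) yield higher scores. A sketch of
    the standard formula, not necessarily BigML's exact normalization."""
    return 100.0 * 2.0 ** (-avg_path_length / c(n))
```

For a dataset of 256 instances, an instance isolated in 4 splits on average scores noticeably higher than one needing 10, as in the Figure 2.1 example.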

BigML also calculates the input field importances for each anomalous instance, which can be defined as the contribution of each field to the anomaly score. Field importances are calculated by finding the per-field sums of the instances partitioned by each split during an evaluation of the Isolation Forest. BigML normalizes these sums, yielding a percentage per field, ranging from 0% to 100%, that measures its relative contribution to the anomaly score.
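The normalization step can be sketched as follows. Assume (hypothetically) that while scoring an instance we have recorded, for every split along its isolation paths, the feature used and the number of instances that split partitioned; the function then sums these counts per field and normalizes them to percentages.

```python
def field_importances(path_splits, num_fields):
    """path_splits: list of (feature_index, n_instances_partitioned)
    pairs collected while scoring one instance across the forest
    (a hypothetical intermediate, for illustration). Returns per-field
    percentages summing to 100%."""
    sums = [0.0] * num_fields
    for feature, n_partitioned in path_splits:
        sums[feature] += n_partitioned
    total = sum(sums) or 1.0  # avoid division by zero on empty input
    return [100.0 * s / total for s in sums]
```

For example, if field 0 accounts for splits partitioning 150 instances and field 1 for 50, field 0 receives 75% of the importance and field 1 the remaining 25%.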

No distance metric is needed to detect anomalous instances in BigML. Empirical comparisons between Isolation Forests and distance-based methods have demonstrated that isolation-based anomaly detection performs significantly better, especially for high-dimensional datasets.

The Isolation Forest method has several advantages:

  • It is a highly scalable method that can deal with large and high-dimensional datasets.

  • No distance metric is required, which makes anomaly detection much more efficient in terms of computational costs.

  • There is no need for data rescaling since it does not calculate distances.

  • It can handle missing data and categorical fields. (See section 2.2.)

  • It is very robust to noise, i.e., it can handle irrelevant or redundant fields, since it uses an ensemble of decision trees.

  • The contribution of each field to the anomaly can be easily computed, as opposed to a black-box model.

  • It is an almost parameter-free method, contributing to ease of use and reducing performance-tuning effort.