Anomaly Detection with the BigML Dashboard

2.3 Interpreting BigML Anomalies

BigML displays a list of the top anomalous instances, ranked by anomaly score. As a rule of thumb, an instance with a score of 60% or higher can be considered anomalous. BigML also provides the field importances for each top anomaly.
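The 60% rule of thumb can also be applied outside the Dashboard, for example to scores exported from BigML. The following is a minimal sketch, not BigML code: the row identifiers and score values are invented for illustration, the function name `flag_anomalies` is hypothetical, and scores are assumed to be expressed on a 0 to 1 scale (so 0.60 corresponds to 60%).

```python
# Rule-of-thumb cutoff from the text: 60%, assumed here to be 0.60 on a 0-1 scale.
ANOMALY_THRESHOLD = 0.60


def flag_anomalies(scores, threshold=ANOMALY_THRESHOLD):
    """Return (row_id, score) pairs at or above the threshold, highest score first."""
    flagged = [(row_id, score) for row_id, score in scores.items() if score >= threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Hypothetical scores for four instances (illustrative values only).
sample_scores = {"row_12": 0.71, "row_48": 0.58, "row_97": 0.64, "row_103": 0.41}

flagged = flag_anomalies(sample_scores)
for row_id, score in flagged:
    print(f"{row_id}: {score:.0%}")
```

Keeping the threshold as a parameter makes it easy to raise the cutoff when scores are known to be inflated.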

Figure 2.2 shows an example anomaly view created from a dataset of diabetes patient data. The first instance can be considered anomalous since its score is higher than 60%. If you mouse over the field importance histogram under the orange score bar, you can see that “Diabetes pedigree” is the field that contributes the most to the anomaly score, at more than 25%.

If you further inspect the field values for this instance in the data inspector to the right, you find that this patient has very high values for “Diabetes pedigree”, an indicator of diabetes history among family members, as well as for “Glucose” and “Pregnancies”. These three fields tend to be positively correlated with diabetes, yet for this patient “Diabetes” is false, so the algorithm rightly flags this pattern as an anomaly. Of course, this may simply be due to a data entry error, or it could be a genuine anomaly for this patient. Regardless, it is a data point unlike the majority of data points in this dataset.
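The same reading of the field importance histogram can be reproduced on exported values by ranking fields by their contribution. This is a sketch under stated assumptions: only the “Diabetes pedigree” share of more than 25% comes from the example above; every other number is invented, the fields “BMI”, “Age”, and “Insulin” are hypothetical, and `top_contributors` is an illustrative helper rather than a BigML function.

```python
def top_contributors(importances, n=3):
    """Return the n fields with the largest importance, largest first."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Illustrative field importances for one anomalous instance (shares sum to 1).
# Only "Diabetes pedigree" > 0.25 reflects the text; the rest are made up.
importances = {
    "Glucose": 0.21,
    "Diabetes pedigree": 0.27,
    "BMI": 0.14,
    "Pregnancies": 0.18,
    "Age": 0.12,
    "Insulin": 0.08,
}

for field, share in top_contributors(importances):
    print(f"{field}: {share:.0%}")
```

Once the top fields are known, checking their raw values against the rest of the dataset is the natural next step, as done above with the data inspector.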

\includegraphics[]{images/interpret-anomalies}
Figure 2.2 Anomaly example

Note: the 60% threshold is no longer valid if the Constraints parameter is enabled, since scores then tend to be inflated. (See section 4.3.)