Classification and Regression with the BigML Dashboard

1.2 Understanding BigML Models

This section describes internal details about BigML models. Besides being useful to better understand how BigML implements them and the optimizations it applies to improve performance, it will also provide the foundations to understand BigML models’ configuration options. You can safely skip this section on first read.

Models are built by splitting the data into partitions so each of them maximizes the information gain for Classification models or minimizes the mean squared error for Regression models. Each partition, i.e., split, has an associated Predicate that defines it, e.g., balance < 1,000. The process of partitioning the instances in a dataset is recursive, i.e., partitions are themselves split into smaller ones, thus forming a hierarchy of partitions with their associated predicates. At each step of this process, a number of summary statistics are also computed, such as how many instances belong to each partition, its confidence, etc.
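
For illustration only, the hierarchy described above can be pictured as nodes that each carry a predicate and a few summary statistics. The following Python sketch is a simplification for readability, not BigML’s internal representation, and every count in it is invented:

\begin{verbatim}
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Node:
    """Toy sketch of one partition in the hierarchy; not BigML's internals."""
    predicate: Optional[Callable[[dict], bool]]   # e.g. lambda row: row["balance"] < 1000
    description: str                              # human-readable predicate, e.g. "balance < 1000"
    instance_count: int = 0                       # how many instances fall into this partition
    confidence: float = 0.0                       # certainty of the partition's prediction
    children: List["Node"] = field(default_factory=list)

# A root partition split by the predicate "balance < 1000" (all numbers are made up):
root = Node(None, "root", instance_count=1000)
root.children = [
    Node(lambda r: r["balance"] < 1000, "balance < 1000", instance_count=640),
    Node(lambda r: r["balance"] >= 1000, "balance >= 1000", instance_count=360),
]
\end{verbatim}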

BigML uses a proprietary algorithm for models that produces Classification And Regression Tree (CART) style decision Trees. BigML's algorithm adapts itself to the dataset you are learning from to provide optimal performance. For datasets that have a large number of instances, BigML will use “streaming”, i.e., it will process the data instances in chunks to reduce the memory footprint it requires. In contrast, for datasets that have a large number of fields, BigML tends to load the whole dataset into memory.

The key points of BigML's streaming algorithm (stree) are listed below:

  • Support for large datasets

  • Multi-core parallelism or multi-machine distribution

  • Anytime algorithm with frequent updates

The key points of BigML's in-memory algorithm (mtree) are listed below:

  • Support for datasets with a large number of fields

  • Faster than stree, albeit at the cost of needing more memory

  • Anytime algorithm with frequent updates

1.2.1 Streaming Model

Inspired by the algorithms of Ben-Haim and Tyree, BigML models are grown breadth-first. To grow the next generation of an stree, training instances are run through the Tree and summary statistics are collected in each Leaf. The summary statistics are then used to choose new splits for each leaf.

The main advantage of growing in this fashion is that an stree does not need to keep the training data in memory. Instead, the memory footprint for an stree is dependent on the size of the tree and the number of input features in the training data.

The summaries collected by an stree over a set of training instances may be merged with summaries from a separate block of training instances. This feature allows the stree summary collection phase to be easily parallelized or distributed.
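
To see why merging works, the per-leaf summaries can be thought of as mergeable counters. The sketch below is a simplification of this idea, not stree's actual data structures: collecting summaries over separate blocks and merging them afterwards gives the same result as a single pass over all the instances.

\begin{verbatim}
from collections import Counter

def collect_summary(rows, objective_field):
    """Summary statistics for one leaf over one block of training instances."""
    summary = {"count": 0, "class_counts": Counter()}
    for row in rows:
        summary["count"] += 1
        summary["class_counts"][row[objective_field]] += 1
    return summary

def merge_summaries(a, b):
    """Merging two block summaries equals summarizing the union of the blocks."""
    return {"count": a["count"] + b["count"],
            "class_counts": a["class_counts"] + b["class_counts"]}

block_1 = [{"diabetes": "true"}, {"diabetes": "false"}]
block_2 = [{"diabetes": "true"}]
merged = merge_summaries(collect_summary(block_1, "diabetes"),
                         collect_summary(block_2, "diabetes"))
# merged == {'count': 3, 'class_counts': Counter({'true': 2, 'false': 1})}
\end{verbatim}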

Traditionally, CART trees consider every training instance that reaches a leaf when determining how to split. While an stree may be built similarly, it can also generate a split early using only a sample of the training data. stree attempts to detect when an Early split is safe by calculating when the summary statistics have “settled”. Early splitting requires that the training instances are shuffled beforehand. (See subsection 1.4.8 .)

1.2.2 In-memory Model

Unlike stree, which grows an entire new tree layer on every iteration (a breadth first expansion), mtree grows the tree one split at a time. At each iteration, an mtree chooses the split that looks to improve the tree the most. So the tree grows more like an A* search than a breadth-first search.

mtree makes it possible to cap the overall size of BigML trees, which helps prevent defining splits on relatively unimportant parts of the tree.

When choosing the best candidate split, mtree uses the reduction in squared error for regression trees. For classification trees, we use a split’s information gain multiplied by its population. These scores proved better than reusing pruning error estimates. (See also subsection 1.4.3 .)
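
The classification score can be sketched as follows. This is an illustration of “information gain multiplied by population” only, not mtree's actual code, and the example class counts are invented:

\begin{verbatim}
from math import log2

def entropy(class_counts):
    """Shannon entropy of a class distribution given as a list of counts."""
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

def classification_split_score(parent_counts, children_counts):
    """Information gain of a candidate split, multiplied by the node's population."""
    n = sum(parent_counts)
    weighted_child_entropy = sum(
        sum(child) / n * entropy(child) for child in children_counts)
    information_gain = entropy(parent_counts) - weighted_child_entropy
    return information_gain * n

# Splitting a 60/40 node into two purer children scores far higher than a weak split:
print(classification_split_score([60, 40], [[55, 5], [5, 35]]))    # ~50.5
print(classification_split_score([60, 40], [[35, 25], [25, 15]]))  # ~0.1
\end{verbatim}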

As expected, this model growth technique outperforms the breadth-first approach for similarly sized trees.

With stree, the node threshold option (see subsection 1.4.5 ) sets the minimum number of nodes in the tree, but that threshold could be exceeded. A threshold of 255 nodes could result in a tree sized anywhere between 255 and 509 nodes. However, asking for an mtree with 255 nodes will produce a tree with 255 nodes.

1.2.3 Scoring and Splitting

To choose a split for a leaf node, stree picks the best split for each input field and then selects the best of those candidates.

For categorical fields a candidate split is scored for every category, with a predicate in the form fieldX == "some_category". Along with the binary splits, stree scores an exploded split where every category is given its own child node.

Numeric splits are always binary with a predicate of the form fieldX >= 42. In traditional CART implementations, a split is considered between every two adjacent data points. Since stree does not keep the training data in memory, we use the summary Histogram [ 22 ] instead. The median between every pair of adjacent histogram bins becomes a candidate split.
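
As a rough illustration of that last step, the sketch below derives candidate split points as the midpoints between adjacent bin centers; the exact interpolation stree applies (for example, weighting by the bin counts) may differ:

\begin{verbatim}
def candidate_numeric_splits(histogram_bins):
    """Candidate split points from a summary histogram of (bin_center, count) pairs.

    Here the candidate is simply the midpoint between adjacent bin centers; the
    exact interpolation stree uses (e.g. weighting by bin counts) may differ.
    """
    bins = sorted(histogram_bins)
    return [(c1 + c2) / 2.0 for (c1, _), (c2, _) in zip(bins, bins[1:])]

# A field summarized into four bins yields three candidate predicates fieldX >= split:
print(candidate_numeric_splits([(10, 120), (25, 80), (42, 30), (90, 5)]))
# [17.5, 33.5, 66.0]
\end{verbatim}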

Early Splitting

A node in an stree can grow for one of two reasons. First, a node grows once it has collected summary information over the entire dataset, since no more information will arrive to help inform the split. Second, a node can grow after collecting summary information over just a portion of the training data. This is the early split.

The core assumption for early splitting is that a sample of the dataset can often provide enough information to pick a good split. This is commonly the case for splits near the top of the tree. stree detects when enough information has been collected for an early split.

To decide, stree keeps two separate field summary sets, each collected over a different portion of the training data, and measures how similar the splits they suggest are. To find the similarity, stree calculates split scores for each field. The scores are normalized across all fields in each field summary set. At this point stree has two normalized score distributions, one for each field summary set. The L1 distance between the distributions produces the similarity metric. A score of 0 means an exact match, while a score of 1 means complete disagreement.
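
The sketch below illustrates the metric. The normalization step and the division by two (so that the value stays within [0, 1]) are assumptions of this illustration rather than a description of stree's exact code:

\begin{verbatim}
def normalize(scores):
    """Normalize per-field split scores so that they sum to one."""
    total = sum(scores.values())
    return {f: s / total for f, s in scores.items()}

def split_similarity(scores_a, scores_b):
    """L1 distance between two normalized score distributions.

    The result is divided by two so that it stays in [0, 1]
    (0 = exact match, 1 = complete disagreement).
    """
    a, b = normalize(scores_a), normalize(scores_b)
    fields = set(a) | set(b)
    return sum(abs(a.get(f, 0.0) - b.get(f, 0.0)) for f in fields) / 2.0

# Two field summary sets that rank the fields similarly give a value close to 0:
print(split_similarity({"age": 3.0, "balance": 9.0},
                       {"age": 2.8, "balance": 9.5}))  # ~0.02
\end{verbatim}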

1.2.4 Pruning

As mentioned, models are trained by recursively partitioning the data, i.e., splitting it in two parts and then splitting each of them in two again, and so on. This process continues unless a stopping criterion is applied. If a model grows too much, it ends up generating rules on minutiæ and losing the ability to generalize. This phenomenon is called Overfitting. Pruning strategies play an essential role in avoiding it.

If pruning is enabled, then at each split BigML tries to determine whether that split increases the confidence (for classification models) or decreases the expected error (for regression models). If it does not, then it is pruned away. The pruning uses pessimistic error estimates as given by either the Wilson score interval or a standard confidence interval for regression trees.

Note: a node needs at least two instances to have a split, so models built on datasets with less than 200 instances will not be pruned.
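
To convey the flavor of the decision, the toy sketch below compares the pessimistic Wilson lower bound of the parent's prediction with the instance-weighted lower bounds of its children and keeps the split only if it improves. The weighting scheme and the example counts are assumptions of the sketch, not BigML's documented pruning rule:

\begin{verbatim}
from math import sqrt

def wilson_lower_bound(p_hat, n, z=1.96):
    """Pessimistic confidence: lower bound of the Wilson score interval."""
    if n == 0:
        return 0.0
    numerator = (p_hat + z * z / (2 * n)
                 - z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)))
    return numerator / (1 + z * z / n)

def keep_split(parent_majority, parent_n, children):
    """children: (majority_count, total_count) pairs for each child node.

    Keep the split only if the instance-weighted pessimistic confidence of the
    children beats the parent's. The weighting scheme is an assumption of this
    sketch, not BigML's documented rule.
    """
    parent_conf = wilson_lower_bound(parent_majority / parent_n, parent_n)
    children_conf = sum(n * wilson_lower_bound(c / n, n) for c, n in children) / parent_n
    return children_conf > parent_conf

# A split that separates the classes well survives; a useless one would be pruned:
print(keep_split(60, 100, [(48, 50), (38, 50)]))  # True
print(keep_split(60, 100, [(30, 50), (30, 50)]))  # False
\end{verbatim}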

1.2.5 Field importance

The field importance is the relative contribution of each field to the objective field predictions. So, a field with higher importance will have a greater impact on predictions. You can find a visual histogram containing the importance for all fields in the Model Summary Report. (See subsection 1.6.1 .)

BigML calculates the field importances by estimating the prediction error each field helps to reduce at each split in a tree (these same estimates are used when pruning). BigML then sums the error reduction of every split, grouped by field, and normalizes those sums so that they total 1. Conceptually, BigML's measure of importance is based on Breiman's popular Random Forest measure, Gini importance [ 24 ], except that BigML does not use Gini as the error metric.
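
The aggregation step can be sketched as follows; the per-split error reductions are placeholders, since the real values come from the pruning-style error estimates mentioned above:

\begin{verbatim}
from collections import defaultdict

def field_importances(split_records):
    """split_records: (field_name, error_reduction) pairs, one per split in the tree."""
    totals = defaultdict(float)
    for field_name, error_reduction in split_records:
        totals[field_name] += error_reduction
    grand_total = sum(totals.values())
    # Normalize so that the importances add up to 1.
    return {f: t / grand_total for f, t in totals.items()}

# Placeholder error reductions for three fields:
splits = [("balance", 0.30), ("age", 0.12), ("balance", 0.18), ("job", 0.04)]
print(field_importances(splits))
# {'balance': 0.75, 'age': 0.1875, 'job': 0.0625}
\end{verbatim}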

Fields with an importance of zero may still be correlated with the objective field. It just means the model preferred other fields when choosing splits.

It is worth noting that the field importance values from Random Decision Forests will often be more meaningful than they are from a single tree.

Note: The concept of field importance is also used in prediction explanation for single predictions (see Figure 1.78 ). However, the two are calculated differently. A field can be very important for the model but insignificant for a given prediction.

1.2.6 Confidence and Probability

Classification models in BigML include two different measures of the model’s certainty when predicting a class at a node: confidences and probabilities. Both metrics can take values between 0%, which means the prediction is no better than randomly selecting a class, and 100%, which indicates absolute certainty. The main difference between the two metrics is that the confidence penalizes a low number of instances of the predicted class at the node more heavily. The confidence is considered a pessimistic measure of the model’s certainty because it is usually lower than the probability, unless the number of instances at the node is very high, in which case the confidence and the probability take similar values. A detailed explanation of each one can be found in the following subsections.

Confidence

The confidence is calculated in BigML using the Wilson score formula. You can find a detailed explanation of the confidence formulation and its interpretation below.

Formulation

BigML computes confidence based on the lower bound of the Wilson score interval, which means it provides a pessimistic confidence:

\[ \frac{ \hat{p} + \frac{1}{2n}z^2 - z \sqrt{ \frac{ \hat{p}(1-\hat{p}) }{n} + \frac{z^2}{4n^2} } }{ 1 + \frac{1}{n} z^2 } \]

Where \(\hat{p}\) is the proportion of instances for the predicted class, \(n\) is the number of instances at that node and \(z=1.96\), calculated as the \(1-\frac{\alpha }{2}\) quantile of the standard normal distribution for a level of error of \(\alpha =5\% \).
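
For reference, the lower bound can be computed directly from this formula. The example below reuses the proportions of the two nodes discussed in the next subsection (93.75% of the instances in the predicted class, with 16 and 64 instances at the node); the printed values are simply what the formula yields and may be rounded differently in the Dashboard:

\begin{verbatim}
from math import sqrt

def confidence(p_hat, n, z=1.96):
    """Lower bound of the Wilson score interval for a proportion p_hat over n instances."""
    numerator = (p_hat + z * z / (2 * n)
                 - z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)))
    return numerator / (1 + z * z / n)

# Same class proportion (93.75%), different node sizes:
print(round(confidence(0.9375, 16), 4))  # ~0.7167
print(round(confidence(0.9375, 64), 4))  # ~0.85
\end{verbatim}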

Interpretation

The main goal is to balance the proportion of the predicted class with the uncertainty of a small number of instances. This formula takes into account two factors:

  • The class distribution of the node: assuming an equal number of instances, the more homogeneous the node, the higher the confidence. In other words, the node containing more instances of the predicted class will have a higher confidence.

  • The number of instances in the node: a node containing 1,000 instances deserves a higher confidence than one containing five instances with the same class distribution, since the uncertainty about the predicted class is higher when fewer instances are involved.

For example, if we are trying to predict if a person will have diabetes, Figure 1.4 shows both example nodes have the same predicted class distribution (93.75%), but the node with 64 instances has more confidence than the node with 16 instances because of the adjustment the Wilson interval formula makes to penalize the lower number of instances.

The idea is to provide a confidence measure that underestimates the performance of your model, so you can be sure that predictions will not do worse than the confidence suggests.

BigML shows the confidence in several places: in a model’s top panel, at each node, in the prediction form (see Predict ), and in the BigML API.

\includegraphics[width=0.9\textwidth ]{images/models/confidence}
Figure 1.4 Example of confidences in two different nodes of the same tree

Probability

The probability of a class at a certain node is the percentage of instances of that class at the node. BigML also applies Laplace smoothing, also known as additive smoothing, to this probability. This means the class probabilities will tend to be pulled towards their overall proportions in the dataset. If there are only a few instances in the leaf leading to the prediction, this smoothing can have a significant impact. If there are a high number of instances, the smoothing will have little effect.

For the probability calculation BigML takes into account the total number of instances for the class of interest at a given node (“\(\mathrm{node\_ class\_ count}\)”), the total number of instances in the node (“\(\mathrm{node\_ total\_ count}\)”), and the percentage of instances for that class over the total instances in the dataset (“\(\mathrm{class\_ prior}\)”):

\[ \mathrm{Probability\_ class} = \frac{\mathrm{node\_ class\_ count} + \mathrm{class\_ prior}}{\mathrm{node\_ total\_ count} +1} \]

To illustrate the probability calculation with an example, imagine a dataset with a total of 100 instances and three classes: class_a (30 instances), class_b (50 instances), and class_c (20 instances). The “\(\mathrm{class\_ prior}\)” for each of them is:

\[ \mathrm{class\_ a\_ prior}= 0.3 \]
\[ \mathrm{class\_ b\_ prior}= 0.5 \]
\[ \mathrm{class\_ c\_ prior}= 0.2 \]

If we were calculating the probabilities at a given node containing a total of 20 instances, where 10 instances belong to class_a and the other 10 to class_c, the probabilities for each class would be calculated as:

\[ \mathrm{probability\_ class\_ a}= \frac{10 + 0.3}{20 +1}= 0.49 \]
\[ \mathrm{probability\_ class\_ b}= \frac{0 + 0.5}{20 +1}= 0.024 \]
\[ \mathrm{probability\_ class\_ c}= \frac{10 + 0.2}{20 +1}= 0.486 \]
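
These values can be reproduced directly from the formula above:

\begin{verbatim}
def class_probability(node_class_count, node_total_count, class_prior):
    """Laplace-smoothed class probability at a node, following the formula above."""
    return (node_class_count + class_prior) / (node_total_count + 1)

# Node with 20 instances: 10 of class_a, none of class_b, 10 of class_c.
priors = {"class_a": 0.3, "class_b": 0.5, "class_c": 0.2}
counts = {"class_a": 10, "class_b": 0, "class_c": 10}
for name in priors:
    print(name, round(class_probability(counts[name], 20, priors[name]), 3))
# class_a 0.49, class_b 0.024, class_c 0.486
\end{verbatim}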

1.2.7 Expected Error

For regression trees, the measure of prediction certainty at a node is the Expected error.

Formulation

The formula below is the standard formula to estimate the average squared error for the prediction at a given node:

\[ \bar{e} = \frac{\sum _{i}(v_i - x)^2}{n} \]

Here, \(x\) is the predicted output at the node, \(n\) is the total number of instances, and \(v_1, \ldots , v_n\) are the objective values of the instances at that node. Note that if we use \(n-1\) instead of \(n\), this formula becomes the sample variance (\(S^2\)) of the node. To calculate the expected error for the predictions, we need to include the error that could arise from our sample variance not representing the true population variance. We address this by constructing a confidence interval for the true variance, which we will call \(\sigma ^2\):

\[ \left[\frac{S^2(n-1)}{\chi ^2_{\alpha /2,n-1}}, \frac{S^2(n-1)}{\chi ^2_{1-\alpha /2,n-1}}\right] \]

The notation \(\chi ^2_{p,f}\) indicates the critical value from the \(\chi ^2\) distribution at probability \(p\) with \(f\) degrees of freedom. Using the upper bound, we can calculate the standard error of our estimate of the mean. The estimated error using the bounds is then our upper bound on the standard deviation \(\hat{\sigma }\), plus the possible error in the mean above, squared:

\[ \hat{e} = \left( z\, \frac{\hat{\sigma }}{\sqrt{n}} + \hat{\sigma } \right)^2 = \hat{\sigma }^2 \left( \frac{z}{\sqrt{n}} + 1 \right)^2 \]

If the variance (and thus error \(\hat{e}_c\)) is zero, we use the parent error \(\hat{e}_p\) to construct the following estimate of the child error given the parent prior \(\hat{e}_{c|p}\):

\[ \hat{e}_{c|p}= \frac{\hat{e}_p + n\hat{e}_c}{n+1}= \frac{\hat{e}_p}{n+1} \]
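
Putting the formulation together, the computation can be sketched as follows. The sketch assumes SciPy for the \(\chi ^2\) quantile (using the lower-tail quantile convention), the list of objective values is invented, and it follows the formulation above rather than necessarily matching BigML's exact implementation:

\begin{verbatim}
from math import sqrt
from scipy.stats import chi2

def expected_error(values, z=1.96, alpha=0.05):
    """Pessimistic expected error for the prediction at a regression node (sketch)."""
    n = len(values)
    mean = sum(values) / n                                  # the node's predicted output
    s2 = sum((v - mean) ** 2 for v in values) / (n - 1)     # sample variance S^2
    # Upper bound of the chi-square confidence interval for the true variance
    # (lower-tail quantile convention, an assumption of this sketch):
    sigma2_upper = s2 * (n - 1) / chi2.ppf(alpha / 2, n - 1)
    sigma_hat = sqrt(sigma2_upper)                          # upper bound on sigma
    return (z * sigma_hat / sqrt(n) + sigma_hat) ** 2       # error in the mean + sigma, squared

# Invented objective values at a node:
print(expected_error([55.0, 62.5, 70.0, 71.0, 80.0, 85.0]))
\end{verbatim}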

Interpretation

The expected error is an average of the instance errors at a given node. In Figure 1.5 , the node predicts a value of 70.60 with an error of 28.59, meaning the prediction is, on average, likely to be within 28.59 of the target (between 42.01 and 99.19 in this example). This is an average, so while a single prediction may be more than 28.59 away from the true target, on average it is expected to do better than that. As in the confidence calculations for classification models, it is a pessimistic measure of your model performance, so that you can be sure your predictions will not do worse than the predicted value +/- the expected error.

\includegraphics[width=0.5\textwidth ]{images/models/expected-error}
Figure 1.5 Example of the expected error at a node

Along with the prediction and the expected error at a node, BigML also provides the histogram with the instances and their values. This allows you to find out the predicted value distributions and gives you information such as the complete possible range for the objective variable or whether the error is more likely to be skewed to one side of the prediction. If you need to further explore the data for a specific branch, you can also create a dataset from that branch (see Figure 1.26 ).

1.2.8 Models with Images

BigML models do not take images as input directly; however, they can use image features, since those fields are numeric.

BigML extracts image features at the source level. Image features are sets of numeric fields for each image. They can capture parts or patterns of an image, such as edges, colors and textures. For information about the image features, please refer to section Image Analysis of the Sources with the BigML Dashboard [ 22 ] .

\includegraphics[]{images/models/model-image-dataset-edge}
Figure 1.6 A dataset with images and image features

As shown in Figure 1.6 , the example dataset has an image field image_id. It also has image features extracted from the images referenced by image_id. Image feature fields are hidden by default to reduce clutter. To show them, click on the icon “Click to show image features”, which is next to the “Search by name” box. In Figure 1.7 , the example dataset has 234 image feature fields, called Histogram of gradients.

\includegraphics[]{images/models/model-image-dataset-edge-fields}
Figure 1.7 A dataset with image feature fields shown

From image datasets like this, models can be created and configured using the steps described in the following sections. All other operations, including prediction and evaluation, apply as well.