Classification and Regression with the BigML Dashboard

2.2 Understanding Ensembles

In this section, we are going to describe a few internal details about ensembles and how BigML implements them. Specifically, since ensembles are based on BigML models, all the information provided in section 1.2 also applies to ensembles, unless overridden here.

BigML grows ensembles in a very similar way to how it grows simple models (see section 1.2). In particular, BigML does not use dataset streaming (see subsection 1.2.1) for ensembles, thus requiring the entire dataset to be loaded into memory. This choice is motivated by the way ensembles are grown, which relies aggressively on sampling to create multiple significant models from the same dataset.

2.2.1 Decision Forests versus Boosted Trees

BigML offers three types of ensemble algorithms: Bagging, Random Decision Forests (both grouped under the same ensemble type in BigML, called Decision Forests), and Boosted Trees. The following subsections contain some technical details about the main commonalities and differences between these methods. To read about the situations in which it is better to use one method or another, please refer to subsection 2.4.3 .

All ensemble methods have in common that they are composed of several single trees whose outputs are combined to yield a final prediction. The ensemble generally performs better than any of the individual learners it is composed of.

The main difference between Decision Forests and Boosted Trees is the way single trees are grown and the way their predictions are combined to get the final ensemble prediction. See the following subsections for a detailed explanation.

Single trees

Each single tree in Decision Forests tries to predict the objective field using a certain level of randomness: by selecting a random percentage of the dataset instances (Bagging), by selecting a random subset of the input fields at each split (Random Decision Forests), or both. In Boosted Trees, by contrast, each single learner does not try to predict the objective field directly; instead, it tries to learn from the mistakes made by the previous model by fitting a gradient step towards minimizing the error of the previous classifier. For regression problems, the error is the usual squared error (the squared difference between the true objectives and the current prediction). For classification problems, a tree is trained for each class in each iteration. The scores from the trees are normalized using the softmax function to obtain a probability distribution over classes given a datapoint. The error is the difference between this distribution and the “true distribution” over classes for the datapoints, which is one for the correct class and zero for all others.
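
The per-class normalization described above can be sketched in a few lines of Python. This is an illustrative sketch only (the helper names are not part of BigML): it applies the softmax function to the raw per-class tree scores and computes the difference with respect to the one-hot “true distribution”, which is the error the next boosting iteration tries to fit.

\begin{verbatim}
import math

# Illustrative sketch only; these helpers are not part of BigML.

def softmax(scores):
    """Normalize raw per-class tree scores into a probability distribution."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classification_residuals(scores, true_class):
    """Difference between the one-hot 'true distribution' and the current
    softmax distribution; the next iteration fits one tree per class to
    these residuals."""
    probs = softmax(scores)
    return [(1.0 if i == true_class else 0.0) - p
            for i, p in enumerate(probs)]

def regression_residual(true_value, current_prediction):
    """For regression, the squared error's gradient step reduces to the
    plain difference fitted by the next tree."""
    return true_value - current_prediction

# Example: raw scores for three classes from the current boosted trees.
print(classification_residuals([0.2, 1.5, -0.3], true_class=1))
\end{verbatim}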

Another characteristic of Boosted Trees is that each tree has an associated weight that measures its importance when calculating the final ensemble prediction. The weights are chosen via a line search, where candidate weights are evaluated on a group of test points, randomly selected from the training data, to find a weight that is near-optimal. Lastly, the chosen weight is multiplied by the learning rate (see Learning rate ) to obtain the final weight.
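
As an illustration of the weight selection, the following Python sketch performs a simple grid-based line search: candidate weights for a newly added tree are evaluated on a random sample of training points and the best one is scaled by the learning rate. It is a minimal sketch under these assumptions, not BigML's actual implementation.

\begin{verbatim}
import random

def line_search_weight(current_preds, tree_preds, targets,
                       learning_rate=0.1, sample_size=128):
    """Evaluate candidate weights for the new tree on a random sample of
    training points and return the near-optimal weight scaled by the
    learning rate. Illustrative sketch only."""
    candidates = [w / 10.0 for w in range(1, 21)]  # 0.1, 0.2, ..., 2.0
    indices = random.sample(range(len(targets)),
                            min(sample_size, len(targets)))

    def squared_error(weight):
        return sum((targets[i]
                    - (current_preds[i] + weight * tree_preds[i])) ** 2
                   for i in indices)

    best = min(candidates, key=squared_error)
    return best * learning_rate
\end{verbatim}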

Ensemble predictions: confidence, probability and expected error

For Decision Forests, single-tree predictions are averaged to get a final prediction. The same quality measures obtained when building a single model are returned for Decision Forest predictions: confidence and probabilities for classification problems, and expected error for regression problems. Find the calculation details for each measure in subsection 1.2.6 . For regression ensembles, all the single-tree predictions are averaged to get a single prediction. For classification ensembles, the per-class confidences and probabilities are averaged across all the trees in the ensemble, and the class with the highest confidence or probability is returned. There is an additional technique to calculate predictions for classification problems called “votes”: the winning class is selected based on the percentage of trees in the ensemble voting for each class. See Combine single tree predictions: probability, confidence or votes for a detailed explanation of how probabilities, confidences and votes are used to calculate predictions.
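
The combination rules above can be summarized with a short sketch. The snippet below is illustrative only (the function names are assumptions, not BigML code): it averages per-class probabilities or confidences, counts votes, and averages regression predictions.

\begin{verbatim}
def average_distributions(per_tree_probs):
    """Average per-class probabilities (or confidences) over all trees and
    return the winning class together with its averaged score."""
    classes = per_tree_probs[0].keys()
    n_trees = len(per_tree_probs)
    averaged = {c: sum(p[c] for p in per_tree_probs) / n_trees
                for c in classes}
    winner = max(averaged, key=averaged.get)
    return winner, averaged[winner]

def votes(per_tree_classes):
    """'Votes' combiner: the percentage of trees predicting each class."""
    counts = {}
    for cls in per_tree_classes:
        counts[cls] = counts.get(cls, 0) + 1
    return {cls: n / len(per_tree_classes) for cls, n in counts.items()}

def regression_average(per_tree_predictions):
    """Regression forests simply average the single-tree predictions."""
    return sum(per_tree_predictions) / len(per_tree_predictions)
\end{verbatim}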

For Boosted Trees, the single-model predictions are additive rather than averaged. For Boosted Trees you only get class probabilities in the case of classification ensembles; neither the confidence nor the expected error can be calculated. For regression ensembles, the final prediction is generated by summing each tree's prediction multiplied by its boosting weight (see the explanation of boosting weights in Single trees ). Since the expected error cannot be calculated for Boosted Trees, no quality measure is returned for regression problems. Predictions for classification ensembles are similar, but separate weighted sums are computed for each objective class. The resulting vector of weighted sums is then transformed into class probabilities using the softmax function. Hence, the probability of each of the classes in the objective field is returned to measure the prediction quality for boosting.
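
In contrast with the averaging above, boosted predictions are additive. The following sketch (again illustrative only, assuming one weight per tree) sums each tree's prediction multiplied by its boosting weight for regression, and computes separate weighted sums per class followed by a softmax for classification.

\begin{verbatim}
import math

def boosted_regression_prediction(tree_preds, tree_weights):
    """Sum of each tree's prediction multiplied by its boosting weight."""
    return sum(p * w for p, w in zip(tree_preds, tree_weights))

def boosted_class_probabilities(per_class_tree_preds, tree_weights):
    """Separate weighted sums per objective class, transformed into class
    probabilities with the softmax function. per_class_tree_preds maps
    each class to the list of its trees' predictions."""
    sums = {cls: sum(p * w for p, w in zip(preds, tree_weights))
            for cls, preds in per_class_tree_preds.items()}
    exps = {cls: math.exp(s) for cls, s in sums.items()}
    total = sum(exps.values())
    return {cls: e / total for cls, e in exps.items()}
\end{verbatim}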

2.2.2 Field Importance

As with individual decision trees, the field importance for ensembles provides a measure of how important a data field is relative to the others. It is computed by taking a weighted average of how much each field reduces the predicted error of the tree at each split (more details in subsection 1.2.5). For individual trees this measure can be misleading as it assumes that the tree structure is correct, but for ensembles it is a more meaningful measure (see Figure 2.4).

\includegraphics[]{images/ensembles/field-importance}
Figure 2.4 Field importance for ensembles
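
As a rough illustration of how the per-split error reductions are aggregated, the following sketch computes field importances from a list of trees, each represented as (field, instances reached, error reduction) triples. The representation is an assumption made for the example, not BigML's internal data structure.

\begin{verbatim}
def field_importance(trees):
    """Instance-weighted error reduction per field, accumulated over all
    trees in the ensemble and normalized to sum to one."""
    totals = {}
    for splits in trees:
        for field, instances, reduction in splits:
            totals[field] = totals.get(field, 0.0) + instances * reduction
    norm = sum(totals.values())
    return {field: value / norm for field, value in totals.items()}

# Example: two tiny trees splitting on "age" and "salary".
print(field_importance([
    [("age", 100, 0.30), ("salary", 60, 0.10)],
    [("salary", 100, 0.25), ("age", 40, 0.05)],
]))
\end{verbatim}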

To visualize the marginal contribution of a field in the ensemble predictions, BigML offers Partial Dependence Plots. You can find a detailed explanation in section 2.5.

Note: The concept of field importance is also used in the prediction explanation for single predictions (see Figure 2.80), but it is calculated differently. A field can be very important for the ensemble but insignificant for a given prediction.

2.2.3 Ensembles with Images

BigML ensembles do not take images as input directly; however, they can use image features, as those fields are numeric.

BigML extracts image features at the source level. Image features are sets of numeric fields for each image. They can capture parts or patterns of an image, such as edges, colors and textures. For information about the image features, please refer to the Image Analysis section of Sources with the BigML Dashboard [22].

\includegraphics[]{images/ensembles/ensemble-image-dataset-texture}
Figure 2.5 A dataset with images and image features

As shown in Figure 2.5, the example dataset has an image field image_id. It also has image features extracted from the images referenced by image_id. Image feature fields are hidden by default to reduce clutter. To show them, click on the icon “Click to show image features”, which is next to the “Search by name” box. In Figure 2.6, the example dataset has 160 image feature fields, called Wavelet subbands.

\includegraphics[]{images/ensembles/ensemble-image-dataset-texture-fields}
Figure 2.6 A dataset with image feature fields shown

From image datasets like this, ensembles can be created and configured using the steps described in the following sections. All other operations, including predictions and evaluations, apply as well.
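
For readers who prefer to script this workflow, the sketch below uses the BigML Python bindings to go from a zip of images to an ensemble. It is a minimal example: the image_analysis configuration keys shown are assumptions, so check the Sources documentation for the exact options; predictions and evaluations are then created the same way as for any other ensemble.

\begin{verbatim}
from bigml.api import BigML

# Minimal sketch using the BigML Python bindings; the "image_analysis"
# options below are assumptions about the source configuration.
api = BigML()  # credentials read from BIGML_USERNAME / BIGML_API_KEY

# Upload a zip of labeled images and request image feature extraction.
source = api.create_source("images.zip",
                           {"image_analysis": {"enabled": True}})
api.ok(source)

# The resulting dataset exposes the extracted features as numeric fields.
dataset = api.create_dataset(source)
api.ok(dataset)

# Train an ensemble on the image dataset.
ensemble = api.create_ensemble(dataset, {"number_of_models": 31})
api.ok(ensemble)
\end{verbatim}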