Classification and Regression with the BigML Dashboard
2.1 Introduction
There are multiple Machine Learning problems that can be solved using supervised Machine Learning techniques. Some of these problems require predicting an output variable (Objective Field) given a number of input variables (Input Fields). These problems can be divided into Classification and Regression depending on whether you need to predict a category (label or class) or a continuous value (a real number), respectively. To learn more about concrete use cases for both problems, refer to section 1.1.
However, most of these problems cannot be solved well with a single model. One of the pitfalls of Machine Learning is that an algorithm can overfit your data: its performance on the training data is very good, but it does not generalize well to new data, which makes single tree models weaker predictors. Ensembles avoid this disadvantage.
An ensemble is a collection of multiple decision trees that are combined to create a stronger model with better predictive performance. An ensemble of models built on samples of the data can become a powerful predictor by averaging away the errors of each individual model. Ensembles generally perform better than a single decision tree because they are less sensitive to outliers in your training data, which helps them mitigate the risk of Overfitting and generalize better when applied to new data.
Depending on the nature of your data and the specific values of the ensemble parameters, you can significantly boost performance over a single model. For a technical explanation of why ensembles perform better, please see this tutorial paper from our Chief Scientist, Tom Dietterich [27].
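This averaging effect can be illustrated outside BigML with a few lines of plain Python. The sketch below is illustrative only (it is not BigML code and not the algorithm BigML runs): it trains a deliberately high-variance learner on bootstrap samples of a toy dataset and shows that averaging the members' predictions gives a lower test error than a single model.

```python
import random
import statistics

random.seed(0)

# Toy 1-D regression problem: y = x^2 plus noise.
def make_data(n):
    return [(x, x * x + random.gauss(0, 10))
            for x in (random.uniform(0, 10) for _ in range(n))]

train, test = make_data(200), make_data(200)

def nearest_neighbor(sample):
    """A high-variance base learner: predict the y of the closest training x."""
    def predict(x):
        return min(sample, key=lambda pair: abs(pair[0] - x))[1]
    return predict

def bootstrap(data):
    """Sample len(data) instances with replacement (a 100% rate), as Bagging does."""
    return [random.choice(data) for _ in data]

def mse(predict):
    return statistics.mean((y - predict(x)) ** 2 for x, y in test)

single = nearest_neighbor(train)
members = [nearest_neighbor(bootstrap(train)) for _ in range(30)]

def ensemble(x):
    # The ensemble prediction averages the members' predictions,
    # so their individual errors tend to cancel out.
    return statistics.mean(m(x) for m in members)

print("single model MSE   :", round(mse(single), 1))
print("bagged ensemble MSE:", round(mse(ensemble), 1))
```

The single nearest-neighbor model memorizes the noise in its training data, while the ensemble averages over many bootstrap models and so smooths that noise away, which is the variance-reduction effect described above.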
BigML currently provides three types of ensembles:
Bagging (also known as Bootstrap Aggregating): this algorithm builds each model in the ensemble from a random subset of the dataset instances. By default, the samples are taken at a rate of 100% with replacement (this is explained in subsection 2.4.9). While this is a simple strategy, it often outperforms more complex ones. Read more about Bagging.
Random Decision Forests: similar to Bagging, but it adds an additional element of randomness by choosing a random subset of the input fields at each tree split. Read more about Random Decision Forests.
Boosted Trees (also known as gradient boosted trees): this algorithm sequentially builds a set of weak learners and then combines their outputs in an additive manner to get the final prediction. In each boosting iteration, the new model tries to correct the errors made in the previous iterations by optimizing a loss function. Read more about Gradient Boosting.
See subsection 2.4.3 for a detailed explanation of each algorithm; the sketch below shows how each ensemble type can also be created programmatically.
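While this chapter focuses on the Dashboard, the same three ensemble types can be created through the BigML API. The sketch below uses the BigML Python bindings (the `bigml` package); the parameter names follow the BigML API documentation, and the dataset ID is a placeholder you would replace with one of your own datasets.

```python
from bigml.api import BigML

# Credentials are read from the BIGML_USERNAME and BIGML_API_KEY
# environment variables; they can also be passed explicitly.
api = BigML()

dataset = "dataset/62520f3f4e1727587a000000"  # placeholder: an existing dataset ID

# Bagging: each tree is built from a sample of the instances
# (100% sample rate with replacement by default).
bagging = api.create_ensemble(dataset, {"number_of_models": 10})

# Random Decision Forest: like Bagging, but each split also
# considers only a random subset of the input fields.
random_forest = api.create_ensemble(dataset, {"number_of_models": 10,
                                              "randomize": True})

# Boosted Trees: trees are built sequentially, each one fitting
# the errors left by the previous boosting iterations.
boosted = api.create_ensemble(dataset, {"boosting": {"iterations": 10}})

# api.ok waits until each resource is finished (or reports a failure).
for ensemble in (bagging, random_forest, boosted):
    api.ok(ensemble)
```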
This chapter contains a comprehensive description of BigML’s ensembles, including how they can be created with 1-click (section 2.3), all the configuration options available (section 2.4), and the visualization provided by BigML (section 2.5). Once you create an ensemble, you can get a report of each field’s importance (see subsection 2.2.2), and a heatmap chart, known as a Partial Dependence Plot (section 2.5), to visualize the impact of your input fields on predictions. See section 2.6 for an explanation of how ensembles can be used to make predictions. Moreover, you can also export your ensembles in different formats to make local predictions faster at no cost (subsection 2.7.1), move your ensembles to another project (section 2.11), or delete them permanently (section 2.13). The process to evaluate your ensemble’s predictive performance in BigML is explained in a different chapter (Chapter 7).
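To make the difference between remote and local predictions concrete, here is a minimal sketch using the BigML Python bindings. The ensemble ID and the input field names are placeholders, and the exact shape of the returned prediction may vary with the bindings version.

```python
from bigml.api import BigML
from bigml.ensemble import Ensemble

api = BigML()

# Placeholder: the ID of an ensemble created earlier.
ensemble_id = "ensemble/62520f3f4e1727587a000001"

# Remote prediction: one API call per prediction.
remote = api.create_prediction(ensemble_id,
                               {"petal length": 4.2, "sepal width": 3.0})
api.ok(remote)
print(remote["object"]["output"])

# Local prediction: the ensemble is downloaded once and predictions
# are then computed in-process, with no further API calls or cost.
local_ensemble = Ensemble(ensemble_id, api=api)
print(local_ensemble.predict({"petal length": 4.2, "sepal width": 3.0}))
```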
In BigML, the third tab of the main menu of your Dashboard allows you to list all your available ensembles. In the ensemble list view (Figure 2.1), you can see, for each ensemble, the Dataset it was created from, the ensemble’s Name, Type (classification or regression), Objective (Objective Field name), Age (time elapsed since it was created), Size, and the number of predictions, batch predictions, or evaluations that have been created using that ensemble. The search menu option in the top right corner of the ensemble list view allows you to search your ensembles by name.
By default, when you first create an account at BigML, or every time you start a new Project, your list view for ensembles will be empty. (See Figure 2.2.)
Finally, in Figure 2.3 you can see the icon used to represent an ensemble in BigML.