Classification and Regression with the BigML Dashboard

2.14 Takeaways

This chapter explains ensembles in detail. Below is a list of the key points:

  • An ensemble is a collection of decision trees that are combined to create a stronger model with better predictive performance.

  • Ensembles are one of the best performing Machine Learning algorithms, often winning Machine Learning competitions across a multitude of domains and use cases.

  • Ensembles are very fast to train and test, which significantly streamlines real-life Machine Learning projects.

  • You can use ensembles to solve Classification and Regression problems.

  • BigML provides three types of ensembles: Bagging (a.k.a. Bootstrap Aggregating), Random Decision Forests, and Boosted Trees. Bagging builds each model from a random subset of the dataset. Random Decision Forests add an additional element of randomness by choosing a random subset of features at each split. Boosted Trees iteratively build each single model trying to learn from the previous models' mistakes.
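The two sources of randomness that distinguish Bagging from Random Decision Forests can be sketched in a few lines of Python. This is a toy illustration of the general technique, not BigML's implementation, and the helper names are hypothetical:

```python
import random

def bootstrap_sample(dataset, rng=random):
    """Bagging: each component tree is trained on a sample of the
    dataset drawn with replacement (same size as the original)."""
    return [rng.choice(dataset) for _ in dataset]

def candidate_features(features, k, rng=random):
    """Random Decision Forests: in addition to bagging, each split
    considers only a random subset of k candidate features."""
    return rng.sample(features, k)
```

Because every tree sees a different sample (and, for Random Decision Forests, different candidate features at each split), the component trees make different errors, and combining them reduces variance.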

  • You can build ensembles from datasets that have been created in BigML. (See Figure 2.109.)

  • An ensemble can be an input to an evaluation, to a prediction, or to a batch prediction. (See Figure 2.109.)

  • BigML ensembles support any type of fields as input fields (categorical, numeric, text and items fields).

  • You can create an ensemble with just 1-click or configure it as you wish. Ensembles are virtually parameter free, giving excellent results with no tuning.

  • If you don’t specify any Objective Field, BigML takes the last valid field in your dataset.

  • The default number of models for your ensemble is set to 10, and the maximum allowed from the Dashboard is 1,000 for Decision Forests and 2,000 for Boosted Trees.
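When creating an ensemble through the BigML API rather than the Dashboard, the ensemble size is set with the `number_of_models` argument. A minimal sketch of the request body, assuming a pre-existing dataset (the dataset ID is a placeholder):

```json
{
  "dataset": "dataset/<your-dataset-id>",
  "number_of_models": 100
}
```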

  • In BigML, you can choose three different pruning strategies when building your ensemble: smart pruning, statistical pruning, or no statistical pruning.

  • By default, BigML ensembles don’t consider missing values when choosing splitting rules, but you can explicitly include them.

  • BigML provides three different options to assign specific weights to your instances: balance objective, objective weights, and weight field.

  • Ensembles inherit all the good qualities of individual trees, including handling of missing data and speed of prediction. However, they are not as easy to interpret as a single decision tree.

  • You can visualize your ensemble using the ensemble chart. The chart is a graphic representation of the marginal effect a subset of input fields has on the objective field (the ensemble predictions), disregarding the rest of the fields.

  • As with individual decision trees, the field importance for ensembles measures each field’s importance for the ensemble’s predictions relative to the other fields.

  • You need to evaluate your ensemble’s performance with data that the ensemble has not seen before.

  • For Decision Forests, the final prediction and its associated probability, confidence (or expected error), or votes are not known until all the component tree predictions are combined using the selected voting strategy (a.k.a. operating kinds in BigML).

  • BigML provides three different strategies to combine the component tree predictions into the final Decision Forest prediction: probability, confidence, and votes.
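As an illustration of how two of these voting strategies differ for classification, here is a toy sketch in Python. This is not BigML's implementation, and the tree outputs are hypothetical; the confidence strategy works like the probability one, but averages each tree's per-class confidences instead:

```python
from collections import Counter

def combine_probability(tree_probs):
    """Probability: average each class's probability across all
    component trees and predict the class with the highest average."""
    classes = tree_probs[0].keys()
    avg = {c: sum(p[c] for p in tree_probs) / len(tree_probs)
           for c in classes}
    return max(avg, key=avg.get)

def combine_votes(tree_predictions):
    """Votes: plurality voting - each tree casts one vote for its
    own predicted class; the class with the most votes wins."""
    return Counter(tree_predictions).most_common(1)[0][0]
```

The two strategies can disagree: a class that narrowly wins in most trees takes the votes strategy, while a class predicted with very high probability by fewer trees can take the probability strategy.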

  • Predictions for Boosted ensembles do not use combiners since the final prediction is an additive process rather than an averaged one.
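The additive nature of a boosted prediction can be sketched as follows for regression. This is a generic gradient-boosting illustration, not BigML's implementation; the function name, learning rate, and base value are hypothetical:

```python
def boosted_prediction(tree_outputs, learning_rate=0.1, base=0.0):
    """Boosted Trees: the final prediction is a base value plus the
    scaled output of every boosting iteration, where each iteration's
    tree was fitted to the previous iterations' residual errors."""
    return base + sum(learning_rate * out for out in tree_outputs)
```

For example, with a base of 10.0 and three iterations outputting 5.0, 3.0, and 2.0, the prediction is 10.0 + 0.1 * (5.0 + 3.0 + 2.0) = 11.0; no averaging across trees takes place.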

  • For classification Boosted Trees, the probability of each class is returned at prediction time, while regression Boosted Trees do not provide any accuracy measure for predictions.

  • You can predict single instances or multiple instances in batch using your ensemble.

  • BigML provides local predictions from the Dashboard for single instances, which allow you to get a real-time prediction without consuming any credits or requiring any internet connection.

  • BigML batch predictions allow you to make simultaneous predictions for multiple instances. For batch predictions, you always get a CSV file and an optional output dataset.

  • You can download an ensemble in a number of programming languages including Python, Java, Node.js, and Objective-C, among others, to use it in your local environment, and make predictions faster at no cost.

  • You can furnish your ensemble with descriptive information (name, description, tags, and category).

  • You can move an ensemble between different projects.

  • Ensembles cannot be shared, but you can share the individual component models.

  • You can stop the ensemble creation before the task has finished.

  • You can permanently delete an ensemble.

\includegraphics[]{images/ensembles/ensemble-workflows}
Figure 2.109 Model Workflows