Classification and Regression with the BigML Dashboard


2.5 Visualizing Ensembles

Being able to effectively visualize an ensemble is paramount to exploring it, interpreting it, and explaining why it produces certain outcomes. BigML provides two different visualizations, a Partial Dependence Plot (PDP) and a list of the single models:

Partial Dependence Plot: a graphic representation of the marginal effect that the combination of two fields (predictors) has on the Objective Field (ensemble predictions), keeping the rest of the field values constant.
Model list: provides a list of the single models that form the ensemble. This visualization is only available for Decision Forest ensembles (see subsection 2.4.3 ).

Note: BigML does not provide the model list for Boosted Trees because they cannot be interpreted the same way as the models for other ensemble types (see subsection 2.5.2 ).

In the top menu you will find a summary of the ensemble results: the number of models in the ensemble, the sample rate, the objective field used, whether the ensemble has been randomized, the type of ensemble used (Decision Forest or Boosted Trees), and the number of instances in the dataset.

\includegraphics[]{images/ensembles/ensemble-view2} — Figure 2.27 Ensemble top menu

For Decision Forests, below the top menu, you will find the icons for each of the views (the PDP and the model list), which let you switch from one view to the other. (See Figure 2.28.)

\includegraphics[]{images/ensembles/switch-views} — Figure 2.28 Switch from the chart view to the model list view

You can find a detailed explanation of each view in the following subsections.

2.5.1 Partial Dependence Plot (PDP)

The PDP is the main default view you will find when creating an ensemble. The main goal is to represent the marginal effect of a set of variables (input Fields) on the ensemble predictions disregarding the rest of the variables. It is a common method for visualizing and interpreting the impact of the variables on ensemble predictions, and it can be used for classification and regression ensembles.

Note: the ensemble PDP is not a representation of the dataset values; it is a representation of the ensemble results and their dependence on a set of variables used as inputs.
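The idea behind the chart can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not BigML's implementation: `toy_ensemble`, the field names, and the fixed values are all hypothetical. The sketch evaluates a prediction function over a grid of two axis fields while every other input keeps a constant value.

```python
# Minimal sketch of a two-field partial dependence grid.
# All names here (toy_ensemble, field names, values) are hypothetical.

def toy_ensemble(inputs):
    """Stand-in for an ensemble prediction: returns a probability-like score."""
    score = 0.004 * inputs["glucose"] + 0.01 * inputs["bmi"] - 0.5
    return max(0.0, min(1.0, score))

def pdp_grid(predict, x_field, x_values, y_field, y_values, fixed_inputs):
    """Predict over every (x, y) pair, keeping the other fields constant."""
    grid = []
    for y in y_values:
        row = []
        for x in x_values:
            inputs = dict(fixed_inputs)   # copy the constant field values
            inputs[x_field] = x
            inputs[y_field] = y
            row.append(predict(inputs))
        grid.append(row)
    return grid

fixed = {"age": 40, "insulin": 80}        # fields not shown on the axes
grid = pdp_grid(toy_ensemble, "glucose", [80, 120, 160],
                "bmi", [20, 30, 40], fixed)
for row in grid:
    print([round(v, 2) for v in row])
```

Each cell of `grid` corresponds to one point of the heatmap: the values depend only on the model and the chosen inputs, never on the rows of the dataset.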

To ensure responsiveness, the PDP is built using 10 models by default. For ensembles with a higher number of models, a random sample of 10 models is selected to calculate the predictions. A warning message appears at the top of the ensemble view to indicate that the chart has been built with fewer models, because this may cause slight differences between the chart predictions and the ensemble's actual predictions. Although in most cases these differences should be imperceptible, you can use the slider to increase the number of trees (up to 100 trees) and the re-sampling option to take another random sample of trees. Click on the corresponding options as shown in Figure 2.29.

\includegraphics[]{images/ensembles/resampling} — Figure 2.29 Trees slider and resampling option for ensembles PDP
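The sampling behavior described above can be pictured with the short sketch below. It is an assumption-laden illustration rather than the Dashboard's actual code: the function `sample_models` and the model identifiers are hypothetical, and the only facts taken from the text are the default of 10 models and the ability to draw a new random sample.

```python
import random

def sample_models(model_ids, n_models=10, seed=None):
    """Pick at most n_models ids at random, without replacement."""
    rng = random.Random(seed)
    k = min(n_models, len(model_ids))   # cannot sample more than exist
    return rng.sample(model_ids, k)

ensemble_models = [f"model/{i:03d}" for i in range(50)]  # hypothetical ids

first_sample = sample_models(ensemble_models, seed=1)    # default: 10 models
resampled = sample_models(ensemble_models, seed=2)       # "re-sampling" draw
larger = sample_models(ensemble_models, n_models=100)    # capped at 50 here
```

Because only a subset of the trees contributes to the chart, two different samples can produce slightly different chart predictions for the same inputs.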

You can visualize classification and regression ensembles in the heatmap chart. For classification ensembles, the different classes of the objective field are represented by different colors. The color shading within each class represents the votes in the case of Decision Forests, i.e., the percentage of trees voting for a given class in the ensemble (see Combine single tree predictions: probability, confidence or votes), and the class probabilities in the case of Boosted Trees (see subsection 2.2.1). For regression ensembles, the different prediction values are represented by differences in the color scale.

\includegraphics[]{images/ensembles/class-regress-pdp} — Figure 2.30 Classification and regression ensembles
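As a rough illustration of what "votes" means for a Decision Forest, the sketch below computes the share of trees voting for each class. The list of per-tree predictions is made up for the example, and `vote_shares` is a hypothetical helper, not part of BigML.

```python
from collections import Counter

def vote_shares(tree_predictions):
    """Return the fraction of trees voting for each class."""
    counts = Counter(tree_predictions)
    total = len(tree_predictions)
    return {label: count / total for label, count in counts.items()}

# Hypothetical predictions from the 10 trees used to build the chart.
tree_predictions = ["True", "False", "True", "True", "False",
                    "True", "True", "False", "True", "True"]

shares = vote_shares(tree_predictions)
winner = max(shares, key=shares.get)     # class with the most votes
print(winner, shares[winner])
```

In this example 7 of the 10 trees vote for "True", so that class would be drawn with the shading that corresponds to 70% of the votes.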

The chart view is always composed of three main parts: the CHART itself, the PREDICTION legend and the INPUT FIELDS form. (See Figure 2.31 .)

\includegraphics[]{images/ensembles/chart} — Figure 2.31 Ensemble chart

The CHART allows you to view the impact of two input fields on the objective field predictions. You can select any categorical or numeric field for each axis. You can also swap the axes by clicking on the option on top of the chart area. (See Figure 2.32.) In the grey area next to the axis selectors you can see the axis values. You can freeze the view by pressing Shift and release it again by pressing Escape on your keyboard. When the view is frozen, an edit icon appears and you can edit the axis values to obtain a prediction for those values. (See Figure 2.32.)

\includegraphics[]{images/ensembles/chart2}

Figure 2.32 Ensemble CHART options

The PREDICTION legend allows you to visualize the objective field classes (classification ensembles) or the predicted value (regression ensembles). For classification ensembles you will also see the votes for Decision Forests, i.e., the percentage of trees voting for a given class in the ensemble (see Combine single tree predictions: probability, confidence or votes), or the class probabilities for Boosted Trees (see subsection 2.2.1).

By default, color tones and shadings are set according to the range of values shown in the chart area. This is the default because, for some chart configurations, the predictions may vary by only a small amount relative to the global range. For example, imagine the chart is showing temperature predictions based on location, time-of-year, and time-of-day. San Diego’s daily range (13$^\circ$C to 18$^\circ$C) could be tiny compared to the Earth’s global range (-62$^\circ$C to 48$^\circ$C). You can change this behavior and see the color scales and shading according to the total range of possible predicted values by clicking on the Total icon. (See Figure 2.33.) For classification ensembles, this option shows the color shading for the total range of potential values (from 0% to 100%). For regression ensembles, it shows the color scale for the total range of predictions.

For classification ensembles you can also choose to see only one of the classes using the class selector at the bottom of the legend. (See Figure 2.33.)
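The difference between the two color scales can be made concrete with a small sketch. The function `color_intensity` is hypothetical and the linear mapping is an assumption (the Dashboard's actual color mapping is not documented here); the temperature ranges are the ones from the example above.

```python
def color_intensity(value, low, high):
    """Map value to [0, 1] within [low, high]; 0 is lightest, 1 is darkest."""
    if high == low:
        return 0.0
    ratio = (value - low) / (high - low)
    return max(0.0, min(1.0, ratio))

visible_range = (13.0, 18.0)    # range shown in the chart area
total_range = (-62.0, 48.0)     # total range of possible predictions

for temp in (13.0, 18.0):
    local = color_intensity(temp, *visible_range)
    total = color_intensity(temp, *total_range)
    print(temp, round(local, 2), round(total, 2))
```

Scaled to the visible range, the two temperatures sit at opposite ends of the color scale; scaled to the total range, they differ by less than 5% of the scale and look almost identical.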

\includegraphics[]{images/ensembles/chart3}

Figure 2.33 PREDICTION legend options

Below the chart legend, you can find the INPUT FIELDS form. (See Figure 2.34.) You can configure the values for any numeric or categorical field. Text and items fields are not yet supported. By changing their values, you can see the predictions changing in real time. You can sort the fields by their importance, and enable or disable them. If you disable an input field, it will be ignored when calculating the final prediction. The strategy used to calculate predictions when some fields are disabled is the proportional missing strategy (see Missing Strategies).

Note: disabled fields are ignored when calculating the chart predictions. This is because the original intent of the PDP is to understand the impact of the axis fields by ignoring the influence of all the other fields. So if you trained the ensemble with missing values (see Missing Splits) and they have some impact on predictions, you will not see it in the chart predictions. In that case there will be a mismatch between the chart predictions and your final predictions.
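To give an intuition for how a prediction can still be produced when a field is disabled, the sketch below shows one plausible reading of a proportional strategy on a single toy tree. It assumes that, when the field of a split is missing, the predictions of both branches are combined and weighted by the number of training instances in each; the tree structure, field names, and the function `predict_proportional` are hypothetical and are not taken from BigML's code.

```python
# Hypothetical toy tree: each node is a dict; leaves carry a prediction and
# the number of training instances that reached them.
tree = {
    "field": "glucose", "threshold": 120,
    "left": {"prediction": 0.1, "count": 60},
    "right": {
        "field": "bmi", "threshold": 30,
        "left": {"prediction": 0.4, "count": 10},
        "right": {"prediction": 0.8, "count": 30},
    },
}

def predict_proportional(node, inputs):
    """Return (prediction, count); average both branches if a field is missing."""
    if "prediction" in node:                      # leaf node
        return node["prediction"], node["count"]
    value = inputs.get(node["field"])
    if value is None:                             # field disabled or missing
        left_pred, left_n = predict_proportional(node["left"], inputs)
        right_pred, right_n = predict_proportional(node["right"], inputs)
        total = left_n + right_n
        return (left_pred * left_n + right_pred * right_n) / total, total
    branch = "left" if value <= node["threshold"] else "right"
    return predict_proportional(node[branch], inputs)

full, _ = predict_proportional(tree, {"glucose": 150, "bmi": 35})
no_bmi, _ = predict_proportional(tree, {"glucose": 150})
no_fields, _ = predict_proportional(tree, {})
```

With both fields set, the toy tree follows a single path to one leaf. With "bmi" disabled, the two leaves under that split are blended according to their instance counts, so the prediction no longer reflects any particular "bmi" value.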

\includegraphics[]{images/ensembles/chart4}

Figure 2.34 INPUT FIELDS form in ensemble chart

Export chart as an image

You can download the ensemble chart as an image in PNG format, with or without legends. To download it with legends, press Shift on your keyboard to freeze the chart view. (See Figure 2.35.)

\includegraphics[]{images/ensembles/download-chart} — Figure 2.35 Download ensemble chart in PNG format

Interpreting Partial Dependence Plots

You can easily see the impact of fields on predictions using the ensemble chart. The three situations below use an ensemble that aims to predict whether a person has diabetes based on several input fields:

Both fields impact predictions: in the image below, the combination of the selected fields, “BMI” (Body Mass Index) and “Glucose”, has a high impact on predicting diabetes since variations in both fields cause variations in predictions.

\includegraphics[]{images/ensembles/chart-fields}

Figure 2.36 Both selected fields impact predictions

Only one of the fields impacts predictions: looking at the image below, we can conclude that “Skinfold” is not a good predictor for diabetes since variations in this field don’t affect predictions. However, the level of “Glucose” has a great impact on predictions. (See Figure 2.37.)

\includegraphics[]{images/ensembles/chart-fields2}

Figure 2.37 One of the selected fields impacts predictions

Both fields have low or no impact on predictions: if you select variables with little or no influence on predictions, you can see that variations in the selected fields don’t lead to differences in predictions. In this case, any combination of “Blood pressure” and “Insulin” always returns the same value for diabetes, “False”.

\includegraphics[]{images/ensembles/chart-fields3}

Figure 2.38 None of the selected fields impact predictions

2.5.2 Model List

The model list is only available for Decision Forests. The model list is not provided for Boosted Trees because they cannot be analyzed the same way as the models for other ensemble types. Instead of predicting the objective field, each boosted tree tries to fit a gradient to correct the mistakes made by the previous single tree. Therefore the single models are hardly interpretable individually.
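The following sketch illustrates why a single boosted tree is hard to read on its own. It is a generic, simplified gradient boosting loop for squared error, where the negative gradient equals the residual; it is not BigML's implementation, and the helpers `fit_stump` and `boost`, the learning rate, and the data are assumptions made for the example.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression stump minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:       # keep both sides non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_trees=3, learning_rate=0.5):
    """Each stump fits the residuals left by the stumps before it."""
    predictions = [0.0] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)         # target is the residual, not y
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]
    return stumps, predictions

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
stumps, predictions = boost(xs, ys)
```

Only the sum of the stumps approximates the objective field. The output of any later stump is a correction to the running prediction, so inspecting it in isolation says little about the final prediction.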

The same BigML interactive visualization for models is used to show each single tree in the ensemble. The ensemble model list provides a general overview of the ensemble and a means to drill down to the single model level. The list of models comprises the following information for each model:

Model preview: this is a snapshot of the tree representation for each model.
Model link: you can access each underlying tree model by clicking on this link.
Data distribution histogram: the distribution of the target field values in the training set that was used to build the model.
Predicted distribution histogram: predictions distribution for the model. The predicted distribution should roughly match the data distribution.

\includegraphics[]{images/ensembles/ensemble-view1} — Figure 2.39 Ensemble model list view

If you click on the model preview or link, you will be taken to that model view (see Figure 2.40), where you can use all of the model visualization features described in section 1.5, such as BigML's proprietary tree and sunburst dynamic visualizations.

\includegraphics[]{images/models/tree-view} — Figure 2.40 Tree visualization