Classification and Regression with the BigML Dashboard
2.5 Visualizing Ensembles
Being able to effectively visualize an ensemble is paramount to exploring it, interpreting it, and explaining why it produces certain outcomes. BigML provides two different visualizations, a Partial Dependence Plot (PDP) and a list of the single models:
Partial Dependence Plot: a graphic representation of the marginal effect that the combination of two fields (predictors) has on the objective field (ensemble predictions), keeping the rest of the field values constant.
Model list: provides a list of the single models that form the ensemble. This visualization is only available for Decision Forest ensembles (see subsection 2.4.3 ).
Note: BigML does not provide the model list for Boosted Trees because they cannot be interpreted the same way as the models for other ensemble types (see subsection 2.5.2 ).
In the top menu you will find a summary of the ensemble results: the number of models in the ensemble, the sample rate, the objective field used, whether the ensemble has been randomized, the type of ensemble used (Decision Forest or Boosted Trees), and the number of instances in the dataset.
For Decision Forests, below the top menu, you will find the icons corresponding to each of the views (the PDP and the model list), which let you switch from one view to the other. (See Figure 2.28.)
You can find a detailed explanation of each view in the following subsections.
2.5.1 Partial Dependence Plot (PDP)
The PDP is the default view you will find when creating an ensemble. Its main goal is to represent the marginal effect of a set of variables (input fields) on the ensemble predictions, disregarding the rest of the variables. It is a common method for visualizing and interpreting the impact of the variables on ensemble predictions, and it can be used for classification and regression ensembles.
Note: the ensemble PDP is not a representation of the dataset values; it is a representation of the ensemble results and their dependence on a set of variables used as inputs.
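The idea behind a two-field PDP can be sketched in a few lines of Python. This is a minimal illustration, not BigML's implementation: the function partial_dependence_grid, the stand-in toy_predict, and the field names are all hypothetical. The sketch varies the two axis fields over a grid while every other input keeps the value given in base_inputs.

```python
from typing import Callable, Dict, List, Sequence

Inputs = Dict[str, float]


def partial_dependence_grid(
    predict: Callable[[Inputs], float],
    base_inputs: Inputs,
    x_field: str,
    x_values: Sequence[float],
    y_field: str,
    y_values: Sequence[float],
) -> List[List[float]]:
    """Return grid[i][j], the prediction for y_values[i] and x_values[j]."""
    grid = []
    for y in y_values:
        row = []
        for x in x_values:
            inputs = dict(base_inputs)  # copy so the other fields stay constant
            inputs[x_field] = x
            inputs[y_field] = y
            row.append(predict(inputs))
        grid.append(row)
    return grid


# Hypothetical stand-in for an ensemble prediction function.
def toy_predict(inputs: Inputs) -> float:
    return 0.5 * inputs["glucose"] + 2.0 * inputs["bmi"] + 0.0 * inputs["age"]


grid = partial_dependence_grid(
    toy_predict,
    {"glucose": 100.0, "bmi": 25.0, "age": 40.0},
    "glucose", [80.0, 120.0],
    "bmi", [20.0, 30.0],
)
print(grid)
```

Each cell of the grid corresponds to one colored point of the heatmap; the values depend only on the prediction function, not on the rows of the training dataset.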
In order to ensure responsiveness, the PDP is built using 10 models by default. For ensembles with more than 10 models, a random sample of 10 models is selected to calculate the predictions. A warning message appears at the top of the ensemble view to indicate that the chart has been built with fewer models, because this may cause slight differences between the chart predictions and the ensemble's actual predictions. Although in most cases these differences should be imperceptible, you can use the slider to increase the number of trees (up to 100 trees) and the re-sampling option to take another random sample of trees. Click on the corresponding options as shown in Figure 2.29.
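The sampling step can be pictured with the following sketch. It assumes only what the paragraph states (a default of 10 models drawn at random); the function name sample_models and the seed parameter are hypothetical and do not correspond to any BigML option.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_models(
    models: Sequence[T], max_models: int = 10, seed: Optional[int] = None
) -> List[T]:
    """Return all models if there are few enough, else a random subset."""
    if len(models) <= max_models:
        return list(models)
    rng = random.Random(seed)  # a new seed gives a new sample ("re-sampling")
    return rng.sample(list(models), max_models)


ensemble = [f"tree-{i}" for i in range(50)]  # hypothetical model identifiers
subset = sample_models(ensemble, max_models=10, seed=1)
print(len(subset))
```

Calling the function again with a different seed plays the role of the re-sampling option, and raising max_models plays the role of the slider.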
You can visualize classification and regression ensembles in the heatmap chart. In the case of classification ensembles, the different classes of the objective field are represented by different colors. The color shadings for each class represent the votes in the case of Decision Forests, i.e., the percentage of trees voting for a given class in the ensemble (see Combine single tree predictions: probability, confidence or votes), and the class probabilities in the case of Boosted Trees (see subsection 2.2.1). For regression ensembles, the different prediction values are represented by differences in the color scale.
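As a small illustration of the votes described above, the following sketch computes the share of trees voting for each class. The function name vote_shares and the example labels are assumptions made for the illustration.

```python
from collections import Counter
from typing import Dict, Sequence


def vote_shares(tree_predictions: Sequence[str]) -> Dict[str, float]:
    """Return the fraction of trees that voted for each class."""
    if not tree_predictions:
        raise ValueError("at least one tree prediction is required")
    counts = Counter(tree_predictions)
    total = len(tree_predictions)
    return {label: count / total for label, count in counts.items()}


# Ten hypothetical trees: seven vote "True", three vote "False".
shares = vote_shares(["True"] * 7 + ["False"] * 3)
print(shares)
```

In the chart, a class with a larger share would be drawn with a stronger shade of its color.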
The chart view is always composed of three main parts: the CHART itself, the PREDICTION legend and the INPUT FIELDS form. (See Figure 2.31.)
The CHART allows you to view the impact of the two input fields on the predictions for the objective field. You can select any categorical or numeric field for each axis. You can also switch the axes by clicking on the option on top of the chart area. (See Figure 2.32.) In the grey area next to the axis selectors you can see the axis values. You can freeze the view with a keyboard shortcut and release it again with another. When the view is frozen, an edit icon appears and you can edit the axis values to obtain a prediction for those values. (See Figure 2.32.)
The PREDICTION legend allows you to visualize the objective field classes (classification ensembles) or the predicted value (regression ensembles). In the case of classification ensembles you will also obtain the votes for Decision Forests, i.e., the percentage of trees voting for a given class in the ensemble (see Combine single tree predictions: probability, confidence or votes), or the class probabilities for Boosted Trees (see subsection 2.2.1). By default, color tones and shadings are set according to the range of values shown in the chart area. This is the default because, for some configurations of the chart, the predictions may vary by only a small amount relative to the global range. For example, imagine the chart is showing temperature predictions based on location, time-of-year, and time-of-day. San Diego’s daily range (13°C to 18°C) could be tiny compared to the Earth’s global range (-62°C to 48°C). You can change this behavior and see the color scales and shading according to the total range of possible predicted values by clicking on the corresponding icon. (See Figure 2.33.) For classification ensembles, this option allows you to see the color shading for the total range of potential values (from 0% to 100%). For regression ensembles, it allows you to see the color scale for the total range of predictions. For classification ensembles you can also choose to see only one of the classes using the class selector at the bottom of the legend. (See Figure 2.33.)
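The effect of the two color scaling modes can be illustrated with the temperature example. The sketch below is an assumption about how such a scale could be computed (a simple linear mapping to the interval from 0 to 1); the function name color_intensity is hypothetical.

```python
def color_intensity(value: float, low: float, high: float) -> float:
    """Map value linearly to [0, 1] within the range [low, high]."""
    if high == low:
        return 0.0  # degenerate range: every value gets the same shade
    t = (value - low) / (high - low)
    return min(1.0, max(0.0, t))  # clamp values outside the range


# Scaled to the range shown in the chart (San Diego, 13°C to 18°C).
local_spread = color_intensity(18, 13, 18) - color_intensity(13, 13, 18)

# Scaled to the total range of possible values (-62°C to 48°C).
global_spread = color_intensity(18, -62, 48) - color_intensity(13, -62, 48)

print(local_spread, global_spread)
```

With the local range the two temperatures sit at opposite ends of the color scale, whereas with the global range they differ by less than 5% of the scale, which is why the local range is the more readable default.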
Below the chart legend, you can find the INPUT FIELDS form. (See Figure 2.34.) You can configure the values for any numeric or categorical field. Text and items fields are not yet supported. By changing their values, you can see the predictions changing in real-time. You can sort the fields by their importance, and select or disable them. If you disable an input field, it will be ignored when calculating the final prediction. The strategy used to calculate predictions when some fields are disabled is the proportional missing strategy (see Missing Strategies).
Note: disabled fields are ignored when calculating the chart predictions. This is because the original intent of the PDP is to understand the impact of the axis fields by ignoring the influence of all the other fields. So if you trained the ensemble with missing values (see Missing Splits) and they have some impact on predictions, you will not see it in the chart predictions. In this case there will be a mismatch between the chart predictions and your final predictions.
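To give an intuition for how a prediction can be computed when a field is disabled, here is a simplified sketch of a proportional approach for a single regression tree. It rests on assumptions: the Node structure and the function predict_proportional are hypothetical, and the sketch simply follows both branches of a split on a missing field and weights them by the number of training instances in each branch. BigML's actual strategy is documented in Missing Strategies and may differ in its details.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    count: int                      # training instances that reached this node
    value: float = 0.0              # prediction stored at a leaf
    field: Optional[str] = None     # split field; None marks a leaf
    threshold: float = 0.0
    left: Optional["Node"] = None   # branch for field <= threshold
    right: Optional["Node"] = None  # branch for field > threshold


def predict_proportional(node: Node, inputs: Dict[str, Optional[float]]) -> float:
    """Predict, averaging both branches when the split field is missing."""
    if node.field is None:
        return node.value
    x = inputs.get(node.field)
    if x is None:
        # Disabled or missing field: combine both subtrees, weighted by size.
        total = node.left.count + node.right.count
        left_pred = predict_proportional(node.left, inputs)
        right_pred = predict_proportional(node.right, inputs)
        return (node.left.count * left_pred + node.right.count * right_pred) / total
    child = node.left if x <= node.threshold else node.right
    return predict_proportional(child, inputs)


# Hypothetical tree: split on glucose, then on bmi in the right branch.
tree = Node(
    count=100, field="glucose", threshold=120.0,
    left=Node(count=60, value=0.1),
    right=Node(
        count=40, field="bmi", threshold=30.0,
        left=Node(count=10, value=0.4),
        right=Node(count=30, value=0.8),
    ),
)

print(predict_proportional(tree, {"glucose": 150.0, "bmi": 35.0}))
print(predict_proportional(tree, {"glucose": None, "bmi": 35.0}))
```

In the second call the glucose field is treated as disabled, so the result blends the two branches of the root split instead of following only one of them.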
Export chart as an image
Download the ensemble chart as an image in PNG format with or without legends. To download it with legends, first freeze the chart view from your keyboard. (See Figure 2.35.)
Interpreting Partial Dependence Plots
You can easily see field impact on predictions using the ensembles chart. See below three different situations using an ensemble which aims to predict if a person has diabetes based on several input fields:
Both fields impact predictions: in the image below, the combination of the selected fields, “BMI” (Body Mass Index) and “Glucose”, has a high impact on predicting diabetes since variations in both fields cause variations in predictions.
Only one of the fields impacts predictions: looking at the image below we can conclude that “Skinfold” is not a good predictor for diabetes since variations in this field don’t affect predictions. However, the level of “Glucose” has great impact on predictions. (See Figure 2.37 .)
Both fields have low or no impact on predictions: if you select variables with little or no influence on predictions, you can see that variations in the selected fields don’t lead to differences in predictions. In this case, any combination of “Blood pressure” and “Insulin” always returns the same value for diabetes, “False”.
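The three situations above can also be told apart numerically from a grid of chart predictions. The following sketch uses a hypothetical heuristic that is not part of BigML: it measures how much the predictions change along each axis, with rows standing for values of the vertical field and columns for values of the horizontal field. The function name axis_variation and the example grids are assumptions.

```python
from typing import List, Tuple


def axis_variation(grid: List[List[float]]) -> Tuple[float, float]:
    """Return the largest prediction change along the x axis and the y axis."""
    # Change along x: compare values within each row.
    x_variation = max(max(row) - min(row) for row in grid)
    # Change along y: compare values within each column.
    columns = list(zip(*grid))
    y_variation = max(max(col) - min(col) for col in columns)
    return x_variation, y_variation


# Hypothetical diabetes probabilities: x is "Glucose", y is "Skinfold".
one_field_matters = [[0.2, 0.9], [0.2, 0.9]]
no_field_matters = [[0.1, 0.1], [0.1, 0.1]]

print(axis_variation(one_field_matters))
print(axis_variation(no_field_matters))
```

A large value on only one axis corresponds to the second situation, large values on both axes to the first, and values near zero on both axes to the third.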
2.5.2 Model List
The model list is only available for Decision Forests. The model list is not provided for Boosted Trees because they cannot be analyzed the same way as the models for other ensemble types. Instead of predicting the objective field, each boosted tree tries to fit a gradient to correct the mistakes made by the previous single tree. Therefore the single models are hardly interpretable individually.
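The following sketch illustrates why single boosted trees are hard to read in isolation. It is a generic gradient boosting loop for squared error, written under simplifying assumptions (one input field, one-split trees, a fixed learning rate); the names fit_stump, boost and predict are hypothetical and this is not BigML's implementation.

```python
from typing import List, Sequence, Tuple

Stump = Tuple[float, float, float]  # (threshold, left value, right value)


def fit_stump(xs: Sequence[float], targets: Sequence[float]) -> Stump:
    """Fit a one-split regression tree that minimizes squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:  # all x values identical: no split is possible
        mean = sum(targets) / len(targets)
        return (xs[0], mean, mean)
    return (best[1], best[2], best[3])


def boost(xs, ys, n_trees: int = 20, learning_rate: float = 0.5):
    """Each new stump is fit to the residuals left by the previous ones."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps: List[Stump] = []
    for _ in range(n_trees):
        # For squared error, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        preds = [p + learning_rate * (lv if x <= t else rv)
                 for x, p in zip(xs, preds)]
    return base, stumps


def predict(base: float, stumps: List[Stump], x: float,
            learning_rate: float = 0.5) -> float:
    return base + learning_rate * sum(lv if x <= t else rv for t, lv, rv in stumps)


xs, ys = [1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 3.0, 3.0]
base, stumps = boost(xs, ys)
print(stumps[0])                  # outputs are corrections, not objective values
print(predict(base, stumps, 1.0))
```

The first stump outputs values such as -1.0 and 1.0, which are corrections to the running prediction rather than values of the objective field, so a single tree says little without the trees that precede it.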
The same BigML interactive visualization for models is used to show each single tree composing the ensemble. The ensemble model list provides a general overview of the ensemble and a means to get down to the single model level. The list of models comprises the following information for each of them:
Model preview: this is a snapshot of the tree representation for each model.
Model link: you can access each underlying tree model by clicking on this link.
Data distribution histogram: the distribution of the target field values in the training set that was used to build the model.
Predicted distribution histogram: predictions distribution for the model. The predicted distribution should roughly match the data distribution.
If you click on the model preview or link, you will be taken to that model view (see Figure 2.40), where you can use all of the model visualization features described in section 1.5, such as BigML's proprietary tree and sunburst dynamic visualizations.