Classification and Regression with the BigML Dashboard

7.2 Understanding Evaluations

In this section, we are going to describe the technicalities behind evaluations. The main goal of evaluating your model is to measure its predictive performance. BigML provides two different ways to measure your model performance: by creating single evaluations or cross-validation evaluations.

  • Single evaluations are available for models, ensembles, logistic regressions, deepnets, and fusions. The basic idea is to make predictions for instances whose objective field values are known but which the model has never seen before. The dataset containing those instances is called the testing dataset. A set of performance measures is then calculated by comparing the predicted values against the actual values of the testing dataset. The usual way to obtain the testing dataset is to split the original dataset into two disjoint subsets: a training set and a test set. You can easily do this by using the 1-click menu option that automatically splits your dataset into a random 80% subset for training and a 20% subset for testing. (Subsection 7.1 in the Datasets with the BigML Dashboard document [ 23 ] explains how to do this.)

  • Cross-validation evaluations are available for models, ensembles, logistic regressions, and deepnets. In particular, BigML uses k-fold cross-validation. To create a cross-validation evaluation you just need a dataset as input. BigML then automatically splits your dataset into \(k\) complementary subsets. One subset is used for evaluation while the remaining \(k-1\) subsets are used to train the model. This process is performed \(k\) times, each time using different parts of the data for training and testing. Cross-validation therefore yields \(k\) different models and \(k\) evaluations, and the results of the \(k\) evaluations are averaged to obtain the final performance measures. Cross-validation evaluations usually yield more accurate results than single evaluations since they avoid the bias that can come from randomly selecting an overly optimistic testing dataset. (A minimal sketch of the procedure follows this list.)
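
The k-fold idea can be sketched outside the Dashboard in a few lines of Python. The snippet below is only an illustration, assuming scikit-learn and a decision tree as the model with made-up toy data; it is not BigML's internal implementation.

\begin{verbatim}
# Minimal sketch of k-fold cross-validation (illustrative, not BigML code).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = np.random.rand(500, 4)          # toy feature matrix
y = np.random.randint(0, 2, 500)    # toy binary objective field

k = 5
scores = []
for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print("per-fold accuracy:", scores)
print("averaged accuracy: %.3f (+/- %.3f)" % (np.mean(scores), np.std(scores)))
\end{verbatim}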

Both types of evaluations yield the same measures except for the evaluation curves, which will not be shown for the final averaged cross-validation. However, different performance measures will be obtained depending on the Objective Field type: numeric or categorical. In the following subsections you can find a detailed explanation of Classification measures and Regression measures provided by BigML. You can also find a brief subsection at the end mentioning the specificities of cross-validation evaluation measures.

7.2.1 Classification Measures

The performance of a classification model is usually represented by a confusion matrix, which contains the actual and the predicted values by the model. The evaluation measures are calculated based on the confusion matrix and the concept of positive and negative classes.

In the following subsections you can find a brief explanation of positive and negative concepts, the confusion matrix, the probability threshold , the classification measures and the evaluation curves provided by BigML.

Positive and Negative Classes

To understand the confusion matrix and the classification measures explained in Confusion Matrix and Classification Measures respectively, it’s important to have a general idea of the meanings of positive and negative classes.

In Machine Learning, by convention, the Objective Field classes are often referred to as positive and negative classes. The positive class is the class that is more important to accurately predict; e.g., if you are predicting cancer and you have two classes for the objective field, “true” and “false”, you can afford some mistakes in predicting when cancer is “false”, but it is essential to identify the “true” cases. In other words, the positive class should be the one with the greater cost of a prediction error. The rest of the classes are then considered negative classes. In binary classification problems the positive class is usually the minority class and the negative class the majority class, because it is often more interesting to predict the rare cases than the evident and common ones.

Whether a class is considered positive or negative is not related to its label or meaning, so in many cases the positive class can be the “bad” outcome and the negative class the “good” outcome. For example, in disease diagnosis, “true” indicates that the patient has the disease, and it is usually considered the positive class.

Confusion Matrix

A common method to analyze the model predictive performance is the confusion matrix. The confusion matrix is a table containing the predictions and the actual values for the objective field classes so you can visualize the correct decisions as well as the errors made by the classifier.

In BigML, the columns represent the actual values and the rows represent the predictions for each of the classes by default. You can transpose rows and columns by clicking the switcher. (See Figure 7.10 .) The intersection between actual values and predicted values yields four possible situations:

  • True Positives (TP): positive instances correctly classified

  • False Positives (FP): negative instances classified as positive

  • True Negatives (TN): negative instances correctly classified as non-positive

  • False Negatives (FN): positive instances classified as negative

A model with \(n\) classes will yield a confusion matrix with \(n\) columns and \(n\) rows (see Figure 7.10 ). However, confusion matrices for the evaluation curves (see Evaluation curves ) will always yield 2 columns and 2 rows because negative classes are aggregated as if they were one class (you can find more information here).

\includegraphics[]{images/evaluations/confusion_matrix}
Figure 7.10 Confusion Matrix example

The cell colors in the confusion matrix indicate the TP, FP, TN and FN values. In the diagonal of the table you can find the correct predictions, the TP and TN 1 . See Confidence, Probability and Vote Thresholds to find out how to select the positive class.

\includegraphics[]{images/evaluations/positive_class}
Figure 7.11 Cell colors indicate the class selected as positive
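
As a quick reference for how the four counts above are obtained, the sketch below derives them from two lists of labels, aggregating every non-positive class into a single negative class (the function name and toy labels are made up for the illustration; this is not Dashboard code).

\begin{verbatim}
from collections import Counter

def binary_counts(actual, predicted, positive):
    """TP, FP, TN, FN for the chosen positive class; every other
    class is treated as negative (see the footnote on TN)."""
    pairs = Counter(zip(actual, predicted))
    tp = sum(c for (a, p), c in pairs.items() if a == positive and p == positive)
    fp = sum(c for (a, p), c in pairs.items() if a != positive and p == positive)
    fn = sum(c for (a, p), c in pairs.items() if a == positive and p != positive)
    tn = sum(c for (a, p), c in pairs.items() if a != positive and p != positive)
    return tp, fp, tn, fn

actual    = ["a", "b", "c", "a", "b", "c", "a"]
predicted = ["a", "c", "c", "b", "b", "b", "a"]
print(binary_counts(actual, predicted, positive="a"))   # (2, 0, 4, 1)
\end{verbatim}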

Classification Measures

BigML provides different metrics to measure your model’s performance :

  • Accuracy

  • Precision

  • Recall

  • F-measure

  • Phi Coefficient

  • Macro-averages

  • Kendall’s Tau

  • Spearman’s Rho

You can find an explanation of each measure in the following subsections. All the measures except the Kendall’s Tau and the Spearman’s Rho are derived from the confusion matrices and they change according to the positive class and the threshold selected as explained in Confidence, Probability and Vote Thresholds . Some of these measures are used to display the evaluation curves explained in Evaluation curves .

Accuracy

Accuracy is calculated as the number of correctly classified instances over the total instances evaluated.

\[ \text{Accuracy}= \frac{TP + TN}{Total \; instances} \]
\includegraphics[]{images/evaluations/accuracy}
Figure 7.12 Accuracy example

Accuracy remains a popular measure of model performance since it is very easy to calculate, but for many real-life problems it is too simplistic and misleading. One of the most obvious cases is when the model has to deal with unbalanced classes. For example, suppose we get 90% accuracy in a binary classification model for which we have 900 instances of one class and 100 of the other. A 90% accuracy is reachable just by classifying all 1,000 instances as the majority class. This is why it is very important to take into account two more measures, Precision and Recall (explained below).
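
The 90% figure in the example above can be verified with a couple of lines; the class labels are made up for the illustration.

\begin{verbatim}
# Accuracy of a classifier that always predicts the majority class
# on the imbalanced example above (900 vs. 100 instances).
actual = ["majority"] * 900 + ["minority"] * 100
predicted = ["majority"] * 1000          # always predict the majority class

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(accuracy)   # 0.9, yet not a single minority instance is identified
\end{verbatim}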

Precision

Precision is the percentage of correctly predicted instances over the total instances predicted for the positive class. (See Figure 7.13 .)

\[ \text{Precision}= \frac{TP}{TP + FP} \]
\includegraphics[]{images/evaluations/precision}
Figure 7.13 Precision example

Recall

Recall is the percentage of correctly classified instances over the total actual instances for the positive class. (See Figure 7.14 .)

\[ \text{Recall}= \frac{TP}{TP + FN} \]
\includegraphics[]{images/evaluations/recall}
Figure 7.14 Recall example

F-measure

The F-measure, also called the F-score, is the balanced harmonic mean between Precision and Recall. The F-measure is often a more useful metric than accuracy since poor performance in either Precision or Recall will result in a low F-measure value. It can range between 0 and 1. Higher values indicate better performance.

\[ \text{F-measure }= \frac{2 \times Precision \times Recall}{Precision + Recall} \]
\includegraphics[]{images/evaluations/f-measure}
Figure 7.15 F-measure example
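
The three formulas above reduce to a few arithmetic operations on the TP, FP and FN counts; the following sketch is only an illustration of the formulas, with made-up counts.

\begin{verbatim}
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure for the positive class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

# Example: 40 true positives, 10 false positives, 20 false negatives.
print(precision_recall_f(40, 10, 20))   # (0.8, 0.666..., 0.727...)
\end{verbatim}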

Phi Coefficient

The Phi Coefficient, also called the Matthews Correlation Coefficient, is the correlation coefficient between the predicted and the actual values. It returns a value between -1 and 1. A coefficient of -1 indicates a negative correlation between predictions and actual values; a coefficient of 0 indicates the prediction is no better than random; and a coefficient of 1 indicates a perfect prediction.

\[ \text{Phi Coefficient} = \frac{ TP \times TN - FP \times FN }{\sqrt{(TP + FP) \times (TP + FN) \times (TN + FP) \times (TN + FN)}} \]
\includegraphics[]{images/evaluations/phi-coeff}
Figure 7.16 Phi Coefficient example

BigML also reports the maximum phi coefficient, which is the highest phi coefficient given all possible thresholds (see Confidence, Probability and Vote Thresholds ). You can find it as a mark in the threshold slider as shown in Figure 7.17 .

\includegraphics[]{images/evaluations/max-phi-coeff}
Figure 7.17 Maximum Phi Coefficient
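
Both the phi coefficient formula and the idea of sweeping thresholds to find its maximum can be sketched as follows. The helper assumes per-instance probabilities for the positive class and a uniform grid of thresholds; it is an illustration of the definitions, not BigML's computation.

\begin{verbatim}
import math

def phi(tp, fp, tn, fn):
    """Phi (Matthews) correlation coefficient from the confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def max_phi(scores, actual, positive, steps=100):
    """Highest phi coefficient over a grid of probability thresholds."""
    best = -1.0
    for i in range(steps + 1):
        t = i / steps
        tp = sum(s >= t and a == positive for s, a in zip(scores, actual))
        fp = sum(s >= t and a != positive for s, a in zip(scores, actual))
        fn = sum(s < t and a == positive for s, a in zip(scores, actual))
        tn = sum(s < t and a != positive for s, a in zip(scores, actual))
        best = max(best, phi(tp, fp, tn, fn))
    return best
\end{verbatim}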

To find out more about classification performance measures, refer to this paper [ 12 ] written by BigML’s VP of Machine Learning algorithms, Charles Parker.

Macro-averages

As explained in the previous subsections, classification measures are computed per class, except for Accuracy, which is the only measure that is always computed for the overall model.

BigML computes the average of per class measures to measure the overall model performance. Those global statistics are called the macro-averages of the measures since they are computed by giving equal weight to all classes. You can find them in the evaluation view under the names of Average Precision, Average Recall, Average F-Measure and Average Phi as shown in Figure 7.18 .

\includegraphics[]{images/evaluations/averages}
Figure 7.18 Macro-averaged measures
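
Macro-averaging simply treats each class as the positive class in turn and averages the per-class results with equal weight. A minimal sketch for precision follows (the names and toy labels are illustrative only).

\begin{verbatim}
def macro_average_precision(actual, predicted, classes):
    """Average of per-class precisions, every class weighted equally."""
    per_class = []
    for positive in classes:
        tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
        fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
        per_class.append(tp / (tp + fp) if tp + fp else 0.0)
    return sum(per_class) / len(per_class)

actual    = ["a", "b", "c", "a", "b", "c"]
predicted = ["a", "b", "b", "a", "c", "c"]
print(macro_average_precision(actual, predicted, ["a", "b", "c"]))   # 0.666...
\end{verbatim}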

Read more on macro-averaging in this paper [ 36 ] .

Kendall’s Tau

The Kendall’s tau and the Spearman’s rho coefficients (see Figure 7.19 ) are the only measures that are not derived from the confusion matrices. The Kendall’s tau is based on all possible pairs of rankings over the instances in the testing dataset. It measures the degree of correlation between the ranked instances and can take values between -1 and 1. You can find a more detailed explanation of how the Kendall’s tau coefficient is computed below.

Each instance in the testing dataset has an actual class for the objective field and a probability score resulting from the model predictions. The positive class is assigned a value of 1 and the negative class a value of 0. For two different instances \(a\) and \(b\), the pair is “concordant” if score(\(a\))>score(\(b\)) and class(\(a\))>class(\(b\)), or score(\(a\))<score(\(b\)) and class(\(a\))<class(\(b\)), because the ordering of the scores matches the ordering of the actual classes; otherwise, the pair is “discordant”.

Letting \(C\) be concordant pairs and \(D\) discordant pairs, the tau coefficient is calculated as follows:

\[ \text{Kendall's tau} = \frac{C - D}{C + D} \]

BigML specifically computes Kendall’s tau-b coefficient, which makes an adjustment in the denominator for tied pairs. A pair is tied when score(\(a\))=score(\(b\)) or class(\(a\))=class(\(b\)). In this case, the pair is neither concordant nor discordant, so the denominator needs to be modified to keep the coefficient within the range [-1, 1].

\includegraphics[]{images/evaluations/kendalls}
Figure 7.19 Kendall’s Tau-b coefficient
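
A direct, quadratic-time reading of this definition is sketched below; the scores are positive-class probabilities and the classes are encoded as 1 (positive) and 0 (negative). This is an illustration of the formula, not BigML's implementation.

\begin{verbatim}
import math

def kendalls_tau_b(scores, classes):
    """Kendall's tau-b between prediction scores and 0/1 actual classes."""
    concordant = discordant = tied_scores = tied_classes = 0
    n = len(scores)
    for i in range(n):
        for j in range(i + 1, n):
            dx = scores[i] - scores[j]
            dy = classes[i] - classes[j]
            if dx == 0 and dy == 0:
                continue                  # tied in both: ignored
            elif dx == 0:
                tied_scores += 1          # tied only in the scores
            elif dy == 0:
                tied_classes += 1         # tied only in the classes
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt((concordant + discordant + tied_scores) *
                      (concordant + discordant + tied_classes))
    return (concordant - discordant) / denom if denom else 0.0

print(kendalls_tau_b([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0]))   # about 0.408
\end{verbatim}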

Spearman’s Rho

Similar to the Kendall’s tau explained in the previous subsection, the Spearman’s rho coefficient measures the degree of correlation between the model predictions and the actual values of the testing dataset. It computes the Pearson correlation coefficient between the ranks of the instances and can take values between -1 and 1. A value of 1 indicates a perfect positive correlation; values closer to 1 mean a better performing model, while values closer to -1 indicate a worse performing model. A value of 0 indicates that the model is no better than a model making random predictions. See the formula below:

\[ \text{Spearman's rho}=\rho _{rg_X,rg_Y} = \frac{cov(rg_X,rg_Y)}{\sigma _{rg_X}\sigma _{rg_Y}} \]

The \(cov(rg_X,rg_Y)\) is the covariance of the ranked values and \(\sigma _{rg_X}\) and \(\sigma _{rg_Y}\) are the standard deviations of the ranked values.

\includegraphics[]{images/evaluations/spearman}
Figure 7.20 Spearman’s Rho coefficient
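
Since Spearman's rho is the Pearson correlation computed on ranks, a compact sketch just ranks both variables (assigning the average rank to ties) and applies the formula above; the helper names are made up for the illustration.

\begin{verbatim}
def ranks(values):
    """Rank the values, assigning the average rank to ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1    # positions i..j, 1-based
        for k in range(i, j + 1):
            r[order[k]] = average_rank
        i = j + 1
    return r

def spearman_rho(x, y):
    """Pearson correlation between the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0
\end{verbatim}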

Confidence, Probability and Vote Thresholds

Classification models in BigML always return a confidence and/or a probability for each prediction, i.e., a percentage between 0% and 100% that measures the certainty of the prediction. See Probability , Ensemble predictions: confidence, probability and expected error , and section 4.2 to learn how BigML calculates probabilities for models, ensembles and logistic regressions. Deepnets and fusions also return a probability measure.

When evaluating or predicting with your model you can set a probability or confidence threshold for a selected class, known as the positive class, so that the model only predicts the positive class if the probability or confidence is greater than the threshold set; otherwise it predicts the negative class. For example, imagine the following diabetes predictions for three different patients:

\begin{tabular}{lcc}
Patients & Diabetes False & Diabetes True \\
Patient 1 & 80\% & 20\% \\
Patient 2 & 10\% & 90\% \\
Patient 3 & 60\% & 40\% \\
\end{tabular}
Table 7.1 Example of diabetes prediction probabilities for three different patients

Without setting any threshold, just by looking at the probabilities for each predicted class, patients 1 and 3 would be predicted as “False” and patient 2 as “True”. However, if we select “True” as the positive class and we set a probability threshold of 30%, patients 2 and 3 will be predicted as “True”, and only patient 1 will have a “False” prediction.
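
The effect of the 30% threshold on Table 7.1 can be reproduced with a few lines (the probabilities below are the “True” column of the table):

\begin{verbatim}
# Probabilities of the positive class ("True") taken from Table 7.1.
probabilities = {"Patient 1": 0.20, "Patient 2": 0.90, "Patient 3": 0.40}
threshold = 0.30

for patient, p_true in probabilities.items():
    prediction = "True" if p_true >= threshold else "False"
    print(patient, "->", prediction)
# Patient 1 -> False, Patient 2 -> True, Patient 3 -> True
\end{verbatim}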

For decision forests (see subsection 2.2.1 ), you can also set a vote threshold, i.e., a threshold based on the percentage of models in the ensemble voting for the positive class. The type of threshold (confidence, probability, or vote threshold) can be configured before creating the evaluation (see subsection 7.4.2 ).

Setting a threshold is especially useful when false positives and false negatives have different costs: a higher threshold for the positive class minimizes false positives at the cost of more false negatives, while a lower threshold makes it easier to predict the minority (positive) class.

Different thresholds produce different confusion matrices, hence different metrics for the same evaluation. These different metrics according to the threshold set can be seen in the evaluation curve views provided by BigML (see Evaluation curves ).

In some cases, several thresholds may yield the same confusion matrix, especially for small testing datasets. In BigML you can select any threshold and the positive class by using the options shown in Figure 7.21 . By setting different thresholds you will see the values of the confusion matrix and the metrics change accordingly. Single points in the evaluation curves correspond to different thresholds.

\includegraphics[]{images/evaluations/threshold}
Figure 7.21 Select the probability threshold and the positive class

The greater the threshold, the fewer instances will be predicted as the positive class. By setting a threshold of 100%, all instances are predicted as the negative class (see Figure 7.22 ).

\includegraphics[]{images/evaluations/threshold-100}
Figure 7.22 Probability threshold of 100%

On the other hand, by setting a 0% threshold all instances are predicted as the positive class (see Figure 7.23 ).

\includegraphics[]{images/evaluations/threshold-0}
Figure 7.23 Probability threshold of 0%

Evaluation curves

As explained in Confidence, Probability and Vote Thresholds , setting different thresholds can result in different confusion matrices and different metrics. The best way to evaluate how your model performs for all the possible thresholds is to plot the metrics in different charts:

  • Precision-Recall curve

  • ROC curve

  • Gain curve & K-S statistic

  • Lift curve

Single points for each curve represent a probability threshold for the positive class selected.

Precision-Recall Curve

The Precision-Recall curve visually represents the trade-off between both measures for the positive class. Precision and recall are inversely related, i.e., for the same model you can increase recall using a lower threshold for the positive class, but it will usually result in a decrease in precision, and vice versa. You can find the formulas of both measures in Classification Measures .

A high precision and a high recall are represented by points near the upper right corner of the chart (1,1), thus the greater the area under the precision-recall curve the better. BigML provides two different area calculations:

  • PR AUC: the Area Under the Curve (AUC) is calculated taking into account the exact curve shape (Figure 7.24 ).

    \includegraphics[]{images/evaluations/pr-curve-auc}
    Figure 7.24 Area Under the Curve for the Precision-Recall curve
  • PR AUCH: the Area Under the Convex Hull is calculated taking into account the convex hull of the curve, i.e., the smallest convex shape such that no points of the curve lie above it. You can visualize it by clicking the option shown in Figure 7.25 .

    \includegraphics[]{images/evaluations/pr-curve-auch}
    Figure 7.25 Area Under the Convex Hull for the Precision-Recall curve

The appropriate balance between precision and recall needs to be decided on a case-by-case basis according to the costs associated with false positives and false negatives.
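
A precision-recall curve can be traced by sweeping a grid of thresholds over the positive-class probabilities and recording one (recall, precision) point per threshold. The sketch below is an illustration only: the trapezoidal rule is a rough approximation of the PR AUC, the convention of reporting a precision of 1 when nothing is predicted positive is an assumption, and the helper names are made up.

\begin{verbatim}
def pr_curve(scores, actual, positive, steps=100):
    """(recall, precision) points for thresholds from 0 to 1."""
    points = []
    for i in range(steps + 1):
        t = i / steps
        tp = sum(s >= t and a == positive for s, a in zip(scores, actual))
        fp = sum(s >= t and a != positive for s, a in zip(scores, actual))
        fn = sum(s < t and a == positive for s, a in zip(scores, actual))
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        points.append((recall, precision))
    return sorted(points)

def area_under(points):
    """Trapezoidal approximation of the area under a curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
\end{verbatim}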

ROC Curve

The ROC space graphically represents the trade-off between recall (or sensitivity) and specificity for classification problems. The recall is obtained by calculating the True Positive Rate (TPR), i.e., the ratio of positive class instances that have been correctly classified. The False Positive Rate (FPR), equivalent to 1-specificity, is the percentage of negative class instances that have been incorrectly classified as positive.

You can obtain the TPR and FPR by normalizing the confusion matrix results:

\[ \text{TPR} = \text{Recall} = \frac{TP}{TP + FN} \]
\[ \text{FPR} = \frac{FP}{TN + FP} \]
\includegraphics[]{images/evaluations/roc}
Figure 7.26 The ROC curve

The diagonal of the chart divides the space: all the points found in the upper left part of the chart (where TPR>FPR) can be considered good results, and all those found on the diagonal or below (where TPR \(\leq\) FPR) are bad results. The diagonal represents a model that performs the same as assigning a class at random to each instance.

Similar to the precision-recall curve, the Area Under the Curve (AUC) for an evaluation is the area beneath the evaluation’s ROC curve in the ROC space (see Figure 7.27 ). Higher AUC values indicate better classifier performance; however, in extreme cases, such as AUC=1, it may reflect an Overfitting problem.

\includegraphics[]{images/evaluations/roc2}
Figure 7.27 The ROC AUC

You can also visualize the Area Under the Convex Hull by clicking the highlighted option shown in Figure 7.28 .

\includegraphics[]{images/evaluations/roc3}
Figure 7.28 The ROC AUCH
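
The ROC curve comes from the same kind of threshold sweep: each threshold produces one (FPR, TPR) point, and the trapezoidal rule approximates the AUC. As before, this is an illustrative sketch that reuses the area_under() helper from the precision-recall example, not the Dashboard's computation.

\begin{verbatim}
def roc_curve(scores, actual, positive, steps=100):
    """(FPR, TPR) points for thresholds from 0 to 1."""
    points = []
    for i in range(steps + 1):
        t = i / steps
        tp = sum(s >= t and a == positive for s, a in zip(scores, actual))
        fp = sum(s >= t and a != positive for s, a in zip(scores, actual))
        fn = sum(s < t and a == positive for s, a in zip(scores, actual))
        tn = sum(s < t and a != positive for s, a in zip(scores, actual))
        tpr = tp / (tp + fn) if tp + fn else 0.0
        fpr = fp / (fp + tn) if fp + tn else 0.0
        points.append((fpr, tpr))
    return sorted(points)

# roc_auc = area_under(roc_curve(scores, actual, "true"))
\end{verbatim}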

Gain Curve & K-S statistic

The Gain curve (or Cumulative Gain curve) represents the relationship between the percentage of correct predictions for the positive class and the effort needed to achieve them, measured as the percentage of instances predicted as positive. The y-axis of the Gain curve is equivalent to the recall, and therefore to the True Positive Rate (TPR), while the x-axis is the percentage of instances predicted as positive. The formulas for these metrics are:

\[ \text{Gain} = \text{Recall} = \text{TPR} = \frac{TP}{TP + FN} \]
\[ \text{\% of Positive Instances} = \frac{TP + FP}{TP + FP + TN + FN} \]

Similar to the ROC curve, the diagonal of the chart represents the results of a random model. All the points above the diagonal can be considered good results. The closer a point is to the upper left corner (0,1), the better.

Along with the Gain curve, BigML also provides in the same chart the Negative Cumulative Response curve (represented by the black curve shown in Figure 7.29 ). The Negative Cumulative Response curve represents the percentage of negative instances incorrectly predicted as positive, so it is equivalent to the False Positive Rate (FPR) explained for the ROC curve (see Figure 7.26 ):

\[ \text{Negative Cumulative Response} = \text{FPR} = \frac{FP}{TN + FP} \]

Along with the Gain and Negative Cumulative Response curves, BigML provides the Kolmogorov-Smirnov statistic (K-S statistic). It measures the maximum difference between the TPR and the FPR over all possible thresholds:

\[ \text{K-S statistic} = \max {(TPR-FPR)} \]

The K-S statistic is an indicator of how well the model separates the positive from the negative classes. A K-S statistic of 100% indicates a perfect separation and a model that classifies everything correctly. Higher values for the K-S statistic indicate a higher quality model.

\includegraphics[]{images/evaluations/gain}
Figure 7.29 The Gain curve
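
Both the Gain curve points and the K-S statistic follow from the same per-threshold counts; the sketch below reuses the roc_curve() helper from the ROC example and is, again, only an illustration.

\begin{verbatim}
def gain_curve(scores, actual, positive, steps=100):
    """(% of instances predicted positive, gain/recall) points."""
    total = len(scores)
    actual_pos = sum(a == positive for a in actual)
    points = []
    for i in range(steps + 1):
        t = i / steps
        predicted_pos = sum(s >= t for s in scores)
        tp = sum(s >= t and a == positive for s, a in zip(scores, actual))
        gain = tp / actual_pos if actual_pos else 0.0
        points.append((predicted_pos / total, gain))
    return sorted(points)

def ks_statistic(scores, actual, positive):
    """Maximum TPR - FPR over the threshold sweep."""
    return max(tpr - fpr for fpr, tpr in roc_curve(scores, actual, positive))
\end{verbatim}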

Lift Curve

The Lift curve shows the goodness of fit of your model compared to a random class assignment. The Lift, plotted on the y-axis, is calculated as the ratio between the result predicted by your model and the result using no model, i.e., the precision divided by the proportion of positive instances in the dataset. The x-axis again represents the percentage of instances predicted as positive, as in the Gain curve. The formulas for these metrics are:

\[ \text{Lift} = \frac{Precision}{\frac{Positive Instances}{Total Instances}} = \frac{\frac{TP}{TP+FP}}{{\frac{TP + FN}{TP + FP + TN +FN}}} \]
\[ \text{\% of Positive Instances} = \frac{TP + FP}{TP + FP + TN + FN} \]

The horizontal line in the chart indicating a 100% lift (see Figure 7.30 ) represents a model that makes random predictions.

\includegraphics[]{images/evaluations/lift}
Figure 7.30 The Lift curve

For a more detailed explanation of the Gain and Lift charts, please refer to this article.
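
For a single threshold, the lift is just the precision divided by the base rate of the positive class; a small worked sketch with made-up counts:

\begin{verbatim}
def lift(tp, fp, tn, fn):
    """Lift: precision relative to the base rate of the positive class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    base_rate = (tp + fn) / (tp + fp + tn + fn)
    return precision / base_rate if base_rate else 0.0

# Precision 0.8 with a 25% positive base rate gives a lift of 3.2 (320%).
print(lift(tp=40, fp=10, tn=140, fn=10))   # 3.2
\end{verbatim}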

7.2.2 Regression Measures

When the objective field of the model, ensemble, deepnet, or fusion is numeric the resulting evaluation includes the regression measures explained below.

Mean Absolute Error

The Mean Absolute Error is the mean of the model prediction errors for each instance. It is computed as the average of the absolute values of the differences between the target variable predicted by the model (\(y'\)) vs. the actual values (\(y\)). Letting \(N\) be the total number of instances evaluated, then:

\[ \text{Mean Absolute Error} = \frac{\sum _{n} \left| y'_n - y_n \right|}{N} \]

Mean Squared Error

The Mean Squared Error is similar to the Mean Absolute Error, but the differences between predictions and actual values are squared. It is computed as the average of the squares of the differences between the target variable predicted by the model (\(y'\)) vs. the actual values (\(y\)). Letting \(N\) be the total number of instances evaluated, then:

\[ \text{Mean Square Error} = \frac{\sum _{n} (y'_n - y_n)^2 }{N} \]

R Squared

The \(R^2\), also called the coefficient of determination, measures how much better the model is than always predicting the mean value of the target variable (\(\bar{y}\)) in the test set. It can take values up to 1. Values below 0 indicate the model is worse than predicting the mean; a value of 0 means the model is not any better than predicting the mean; and 1 means the model perfectly fits the data. Although an \(R^2=1\) may not necessarily be desirable (since it can be a symptom of Overfitting), higher values for \(R^2\) usually mean better performance.

\[ R^2 = 1- \frac{\sum _{n} (y'_n - y_n)^2 }{\sum _{n} (y_n - \bar{y}_n)^2} \]
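
The three regression measures above can be computed directly from their formulas; the snippet below does so for a toy pair of prediction and actual lists (illustrative values only).

\begin{verbatim}
def regression_measures(predicted, actual):
    """Mean Absolute Error, Mean Squared Error and R squared."""
    n = len(actual)
    mae = sum(abs(p - a) for p, a in zip(predicted, actual)) / n
    mse = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n
    mean_actual = sum(actual) / n
    ss_res = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r_squared = 1 - ss_res / ss_tot if ss_tot else 0.0
    return mae, mse, r_squared

print(regression_measures([2.5, 0.0, 2.0, 8.0], [3.0, -0.5, 2.0, 7.0]))
# (0.5, 0.375, 0.948...)
\end{verbatim}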

7.2.3 Cross-Validation Measures

BigML cross-validation yields \(k\) different models and \(k\) evaluations. To get a single estimation of the model’s performance, the results of the \(k\) evaluations are averaged to obtain the final cross-validation measures (per class and overall measures). Consequently, cross-validation evaluations have the same measures as single classification and regression evaluations. (See subsection 7.2.1 and subsection 7.2.2 .)

Additionally, apart from the averages, you will also find the standard deviation for each classification and regression measure. (See subsection 7.5.4 .)
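
Averaging the \(k\) per-fold results and reporting their spread is plain summary statistics; for example, for five hypothetical fold accuracies:

\begin{verbatim}
import statistics

fold_accuracies = [0.91, 0.88, 0.90, 0.93, 0.89]   # hypothetical k = 5 folds
print(statistics.mean(fold_accuracies))            # 0.902
print(statistics.stdev(fold_accuracies))           # spread across the folds
\end{verbatim}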

Cross-validation evaluations do not include the evaluation curves, but will in a future release.

  1. Note that all negative classes are aggregated so the TN may include incorrect predictions for the negative classes. E.g., imagine three classes, a, b and c where the positive class is a while b and c are the negative classes. The instances of the b class incorrectly predicted as c as well as the instances of the c class predicted as b will be considered TN since they are “negative instances correctly classified as non-positive”.