Classification and Regression with the BigML Dashboard

4.2 Understanding Logistic Regressions

As mentioned in the introduction of this chapter, logistic regression is a supervised learning algorithm used to solve classification problems. Logistic regression works best in cases where the features are roughly linear and the problem is linearly separable. This is mainly due to the fact that logistic regression generates linear decision boundaries to separate the objective field classes. You can find a detailed explanation of this behavior in the following blog post.

The reason behind this linear behavior can be found in the logistic regression formula, which consists of a logistic function whose argument is a linear combination of the input field values. You can see the logistic regression formula below (Figure 4.5), where the dependent variable, \(p_i\), is the probability for each of the \(i\) classes of the objective field, and the independent variables, \(X_1, X_2, \ldots, X_k\), represent the \(k\) variables for the input fields in your dataset, which are multiplied by the logistic regression coefficients \(b_{0,i}, b_{1,i}, b_{2,i}, \ldots, b_{k,i}\).

\[ p_i = \frac{1}{1+ e^{-f_i(X)}} \]

where

\[ f_i(X) = b_{0,i} + b_{1,i}X_1 + b_{2,i}X_2 + \ldots + b_{k,i}X_k \]
Figure 4.5 Logistic regression formula

The logistic regression tries to learn the \(k+1\) coefficients \(b_{0,i}, b_{1,i}, b_{2,i}, \ldots, b_{k,i}\) of the linear function, \(f_i(X)\), using maximum likelihood estimation techniques. BigML logistic regression is an optimized implementation of the liblinear library, which uses the Trust-Region Newton Optimization method to estimate the coefficients.
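BigML's internal implementation is not reproduced here, but the estimation step itself is easy to illustrate. Below is a minimal sketch that uses scikit-learn's liblinear-backed solver as a stand-in for BigML (the toy data is an assumption for illustration only):

# Sketch: maximum likelihood estimation of logistic regression coefficients
# with a liblinear-based solver. This stands in for, and is NOT, BigML's
# internal implementation.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: two numeric input fields and a binary objective field.
X = np.array([[240.0, 1.0], [10.0, 0.0], [300.0, 1.0], [5.0, 0.0]])
y = np.array([1, 0, 1, 0])

model = LogisticRegression(solver="liblinear")  # the liblinear library underneath
model.fit(X, y)

print(model.intercept_)  # b_0
print(model.coef_)       # b_1, ..., b_k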

Each class of the objective field will have a different set of coefficients associated with it. For example, if the objective field has two classes, two different functions, \(p_1\) and \(p_2\), will be learned from the training data, one per class (see Figure 4.6).

\[ p_1 = \frac{1}{1+ e^{-(b_{0,1} + b_{1,1}X_1 + b_{2,1}X_2 + \ldots + b_{k,1}X_k)}} \]
\[ p_2 = \frac{1}{1+ e^{-(b_{0,2} + b_{1,2}X_1 + b_{2,2}X_2 + \ldots + b_{k,2}X_k)}} \]
Figure 4.6 Logistic regression formulas for two classes

A positive coefficient (\(b_k > 0\)) for a field indicates a positive correlation with the predicted class, while a negative coefficient (\(b_k < 0\)) indicates a negative relationship. The higher the absolute value of a field's coefficient, the greater the impact of that field on predictions. However, coefficients should not be misinterpreted as field importance, for several reasons:

  • Field importance in logistic regression can be defined as the contribution of a field to the final class probability, which depends not only on the field's coefficient but also on its interactions with the rest of the input fields. Since the model assumes independence between the different inputs, coefficients can be considered absolute measures of field importance only when all inputs are independent, which is often not the case. In many real datasets, the impact of a particular field also depends on the values of other fields.

  • Different field magnitudes make coefficients incomparable. Coefficients for fields with different magnitudes, e.g., salary and age, are not comparable, since they tend to be higher for fields with smaller scales. Changing the scale of a field changes its coefficient value (see the sketch after this list). BigML automatically scales your numeric fields. (See subsection 4.4.8.)

  • Fields can be multi-collinear. If two fields are highly correlated, they can be effectively substituted for each other during model training, so the field importance may be split between the two. The greater the number of fields in the dataset, the more likely it is that some of them are multi-collinear.
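The scale effect mentioned in the second point above is easy to reproduce. Below is a minimal sketch that again uses scikit-learn as a stand-in for BigML (the salary data is made up for illustration); it fits the same toy problem twice, once with the field in dollars and once in thousands of dollars, and the learned coefficient changes with the scale:

# Sketch: the same data fit twice, with the field rescaled the second time.
# The coefficient grows as the field's scale shrinks, which is why raw
# coefficients are not comparable across fields with different magnitudes.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
salary = rng.uniform(20_000, 120_000, size=200)  # large-scale numeric field
y = (salary > 70_000).astype(int)                # toy objective field

for scale in (1.0, 1 / 1000.0):                  # dollars vs. thousands
    X = (salary * scale).reshape(-1, 1)
    model = LogisticRegression(solver="liblinear").fit(X, y)
    print(scale, model.coef_[0][0])              # much larger for the smaller scale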

BigML provides a table containing all your logistic regression coefficients. (See subsection 4.5.2.)

Once the logistic regression has learned the coefficients, you can use the model to make predictions for new instances. The logistic regression always returns a probability per class of the objective field, and the class with the highest probability is the predicted class. Taking into account the previous formulas in Figure 4.6, for a given set of input values, \(X_1, X_2, \ldots, X_k\), you will get two probabilities, one per class, e.g., \(p_1=85\%\) and \(p_2=15\%\). In BigML, when there are more than two classes, the probabilities are normalized so that the sum of all class probabilities for each instance prediction equals 100%.
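To make the prediction and normalization steps concrete, here is a minimal sketch with hypothetical coefficient values (the numbers and class names are assumptions, not output from a real BigML model):

import math

# One set of (hypothetical) coefficients per objective field class:
# [b_0, b_1, ..., b_k].
coefficients = {
    "True":  [0.5, 0.02, -1.3],
    "False": [-0.5, -0.02, 1.3],
}

def logistic(f):
    return 1.0 / (1.0 + math.exp(-f))

def predict(inputs):
    # p_i = 1 / (1 + e^(-f_i(X))), with f_i(X) = b_0 + b_1*X_1 + ... + b_k*X_k
    raw = {c: logistic(b[0] + sum(w * x for w, x in zip(b[1:], inputs)))
           for c, b in coefficients.items()}
    total = sum(raw.values())  # normalize so the probabilities sum to 100%
    probs = {c: p / total for c, p in raw.items()}
    return max(probs, key=probs.get), probs

# The class with the highest normalized probability is the prediction.
print(predict([240.0, 1.0]))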

By definition, the input fields \(X_1, X_2, \ldots, X_k\) in the logistic regression formula need to be numeric values. However, BigML logistic regressions can support any type of field by applying a set of transformations to categorical, text, and items fields. Moreover, BigML can also handle missing values for any type of field. The following subsections detail both behaviors.

4.2.1 Input Field Transformations

Apart from numeric fields, BigML logistic regressions are optimized to support categorical, text, and items fields by applying a set of transformations that convert them into numeric values:

  • Categorical fields are one-hot encoded by default, i.e., each class is mapped to a separate 0-1 numeric variable. For a given instance, the variable corresponding to the instance's class has its value set to 1, while the other variables are set to 0. (A sketch of this encoding is shown after this list.)

    For example, imagine you are trying to predict the probability of customer \(churn=[True, False]\) given two input fields: the number of calls (numeric), \(numCalls\), and the tariff plan (categorical), \(tariffPlan=[B, N, P]\), which includes three different classes, \(B\) (basic), \(N\) (normal), and \(P\) (professional). The logistic regression will create one variable for the numeric field and another three variables for the categorical field, one per class. Letting \(i\) range over the objective field classes (\(True\), \(False\)), the logistic regression formula will be:

    \[ p_{i} = \frac{1}{1+ e^{-f_i(X)}} \]

    where

    \[ f_i(X) = b_{0,i} + b_{1,i}numCalls + b_{2,i}B + b_{3,i}N + b_{4,i}P \]

    For a new customer with values \(numCalls=240\) and \(tariffPlan=N\), this becomes:

    \[ f_i(X) = b_{0,i} + b_{1,i} \cdot 240 + b_{2,i} \cdot 0 + b_{3,i} \cdot 1 + b_{4,i} \cdot 0 \]

    BigML also provides three other types of coding, Dummy, Contrast, and Other coding, which you can configure for each of your categorical fields. See subsection 4.4.10 for a complete explanation of categorical field encodings.

  • For text fields, each term is mapped to a corresponding numeric variable, whose value is the number of occurrences of that term in the instance. Text fields without term analysis enabled are excluded from the model (read the Sources with the BigML Dashboard document to learn more about text analysis [ 22 ] ).

  • For items fields, each different item is mapped to a corresponding numeric variable, whose value is the number of occurrences of that item in the instance.
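The following is a minimal sketch of the transformations described in this list (an illustration of the idea only, not BigML's actual preprocessing code):

# Sketch of the input field transformations: one-hot encoding for a
# categorical field, and per-term occurrence counts for text/items fields.
def encode_categorical(value, classes):
    """One-hot encoding: one 0-1 variable per class."""
    return [1 if value == c else 0 for c in classes]

def encode_terms(text, vocabulary):
    """Text/items encoding: one count variable per term or item."""
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocabulary]

# tariffPlan=N with classes [B, N, P] maps to [0, 1, 0]
print(encode_categorical("N", ["B", "N", "P"]))

# occurrences of each vocabulary term in a text field value
print(encode_terms("good call good plan", ["good", "call", "plan"]))  # [2, 1, 1]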

4.2.2 Missing Values

BigML logistic regressions can handle missing values for any type of field. For categorical, text, and items fields, missing values are always included as another category, term or item by default.

For numeric fields, missing values are also included by default, but you can deactivate this option by configuring your logistic regression (see subsection 4.4.5). Alternatively, you can replace your missing numeric values with another valid value such as the field's mean, median, maximum, minimum, or zero (see subsection 4.4.4). If none of these options has been enabled when building your logistic regression, the instances containing missing values for numeric fields in your dataset will be ignored by the model.

When missing values are included, a separate variable is created for them when building the logistic regression. Once the logistic regression is created, you can find an additional coefficient for each field at the end of the coefficient table. (See Figure 4.7.) Learn more about the coefficient table in subsection 4.5.2.

If the dataset does not contain missing values for a field, the coefficient for missing values will be zero, except in the case of text fields, where it can be different from zero. This is because BigML has a limit of 1,000 terms for text fields, so there may be instances that do not contain any of the terms considered to build the model; these instances appear as missing values instead. (See Field Limits to learn more about term limits for text fields.)

\includegraphics[]{images/logisticregression/lr-missing-coeff-table}
Figure 4.7 Missing numeric coefficients at the end of logistic regression table
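A minimal sketch of the missing-value encoding described above (an illustration only, not BigML's internal code): each numeric field gets a companion 0-1 variable that flags whether the value was missing, and it is this variable's coefficient that appears at the end of the table.

# Sketch: a missing numeric value becomes a separate 0-1 indicator variable,
# so the model learns an extra "missing" coefficient for the field.
def encode_numeric(value):
    """Return [value, missing_flag]; a missing value becomes [0.0, 1]."""
    if value is None:
        return [0.0, 1]
    return [float(value), 0]

print(encode_numeric(240))   # [240.0, 0]
print(encode_numeric(None))  # [0.0, 1]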

4.2.3 Logistic Regressions with Images

BigML logistic regressions do not take images as input directly; however, they can use image features, as those fields are numeric.

BigML extracts image features at the source level. Image features are sets of numeric fields for each image. They can capture parts or patterns of an image, such as edges, colors, and textures. For information about image features, please refer to the section Image Analysis of the Sources with the BigML Dashboard document [ 22 ].

\includegraphics[]{images/logisticregression/lr-image-dataset-resnet18}
Figure 4.8 A dataset with images and image features

As shown in Figure 4.8 , the example dataset has an image field image_id. It also has image features extracted from the images referenced by image_id. Image feature fields are hidden by default to reduce clutter. To show them, click on the icon “Click to show image features”, which is next to the “Search by name” box. In Figure 4.9 , the example dataset has 512 image feature fields, extracted by a pre-trained CNN, ResNet-18.

\includegraphics[]{images/logisticregression/lr-image-dataset-resnet18-fields}
Figure 4.9 A dataset with image feature fields shown

From image datasets like this, logistic regressions can be created and configured using the steps described in the following sections. All other operations, including prediction and evaluation, apply as well.