Classification and Regression with the BigML Dashboard

3.2 Understanding Linear Regressions

As mentioned in the introduction of this chapter, linear regression is a supervised learning algorithm used to solve regression problems. It is simple to understand and highly interpretable.

Linear Regression assumes a linear relationship between the input fields, also called predictors, and the single objective field, or the output variable. More specifically, the objective field can be modeled from a linear combination of the input fields:

\[ y = \beta _{0} + \beta _{1}x_1 + \beta _{2}x_2 + \cdots + \beta _{n}x_n \]

This is the linear regression formula, also called linear equation, where \(y\) is the objective field, \((x_1, x_2, …, x_n)\) represent the n variables, also called predictors, for the input fields in the input data, and \((\beta _1, \beta _2, …, \beta _n)\) are the coefficients which are the scale factors assigned to the respective variables. The one additional coefficient \(\beta _0\) is often called the intercept or the bias coefficient.

Learning a linear regression model means estimating the values of the coefficients with the data available. A positive coefficient \((\beta _i{\gt}0)\) for an input field indicates a positive correlation between the input field and the objective field, while a negative coefficient (\(\beta _i {\lt} 0\)) indicates a negative correlation. Higher absolute coefficient values for a field result in a greater influence of that field on predictions. When a coefficient becomes zero, it effectively removes the influence of the input field on the model and hence on the predictions.

BigML Linear Regression produces an estimate for the coefficient values \(\beta _0, \beta _1, ..., \beta _n\) using a least-squares fit on the training data.
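BigML does not publish its solver internals, but the idea of a least-squares fit can be sketched with NumPy. The data below is made up, and the objective field is constructed as an exact linear function of two inputs (\(y = 1 + 2x_1 + 3x_2\)) so the fit recovers the coefficients exactly:

```python
import numpy as np

# Illustrative training data: two numeric input fields (x1, x2)
# and an objective field y = 1 + 2*x1 + 3*x2.
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0],
              [4.0, 3.0]])
y = np.array([9.0, 6.5, 10.0, 18.0])

# Prepend a column of ones so the intercept (beta_0) is fitted
# alongside the other coefficients.
X_design = np.hstack([np.ones((X.shape[0], 1)), X])

# Least-squares estimate of (beta_0, beta_1, beta_2).
coefficients, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Predict for a new instance by applying the linear equation.
new_instance = np.array([1.0, 2.5, 1.5])  # leading 1 for the intercept
prediction = new_instance @ coefficients   # 1 + 2*2.5 + 3*1.5 = 10.5
```

Since the toy objective is perfectly linear, the estimated coefficients come out as \((1, 2, 3)\); on real data, least squares instead finds the coefficients minimizing the sum of squared residuals.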

BigML provides a table containing all your linear regression coefficients. (See subsection 3.5.2 )

Once the linear regression has learned the coefficients, you can use the model to make predictions for new instances. For each prediction, the linear regression returns the predicted value of the objective field.

By definition, the input fields \((x_1, x_2, …, x_n)\) in the linear regression formula need to be numeric values. However, BigML linear regressions can support any type of fields by applying a set of transformations to categorical, text, and items fields. Moreover, BigML can also handle missing values for any type of field. The following subsections detail both behaviors.

3.2.1 Input Field Transformations

Apart from numeric fields, BigML linear regressions also support categorical, text, and items fields by applying a set of transformations that convert them to numeric values:

  • Categorical fields are Dummy encoded by default. Dummy encoding converts an n-class categorical field into a set of binary variables. One of the classes is designated as the reference or dummy class and is represented by a value of 0 in every variable; if not specified by the user, the dummy class is the first class value in lexicographic order. This yields n-1 variables (n classes minus the dummy class); one additional variable for missing values is created if the training dataset contains missing values for this field. For a given instance, the variable corresponding to the instance’s categorical value is set to 1 (unless that value is the dummy class), while the other variables are set to 0.

    BigML also provides Contrast coding and Other coding, which you can configure for each of your categorical fields. See subsection 3.4.6 for a complete explanation of categorical field encodings.

  • For text fields, each term is mapped to a corresponding numeric variable, whose value is the number of occurrences of that term in the instance. Text fields without term analysis enabled are excluded from the model (read the Sources with the BigML Dashboard document to learn more about text analysis [ 22 ] ).

  • For Items fields, each different item is mapped to a corresponding numeric field, whose value is the number of occurrences of that item in the instance.
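The transformations above can be sketched in a few lines of Python. This is a simplified illustration of the encoding rules described in this subsection, not BigML's internal code; the function names are hypothetical:

```python
from collections import Counter

def dummy_encode(value, classes, dummy_class=None):
    """Encode a categorical value as n-1 binary variables.

    The dummy (reference) class, by default the first class in
    lexicographic order, is represented by all zeros.
    """
    if dummy_class is None:
        dummy_class = sorted(classes)[0]
    non_dummy = [c for c in sorted(classes) if c != dummy_class]
    return [1 if value == c else 0 for c in non_dummy]

def text_encode(text, terms):
    """Map each term to its number of occurrences in the instance."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in terms]

# A 3-class field with "blue" as the (implicit) dummy class.
print(dummy_encode("green", ["blue", "green", "red"]))  # [1, 0]
print(dummy_encode("blue", ["blue", "green", "red"]))   # [0, 0]

# A text field restricted to two terms; items fields work the same
# way, counting item occurrences instead of term occurrences.
print(text_encode("spam spam ham", ["spam", "ham"]))    # [2, 1]
```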

3.2.2 Missing Values

BigML linear regressions can handle missing values for any type of field. For categorical, text, and items fields, missing values are always included as another category, term or item by default.

For numeric fields, missing values are always included. If a field in the training data contains missing values, then a corresponding binary-valued predictor is created that takes a value of 1 when that field is missing in a particular row, and 0 otherwise. The other predictors pertaining to that field will have a value of 0 when the value is missing. Once the linear regression is created, you can find an additional coefficient for each such field at the end of the coefficient table. (See Figure 3.4 ) Learn more about the coefficient table in subsection 3.5.2 .
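The expansion of a numeric field into a value predictor plus a missing-value indicator can be sketched as follows; the function name is hypothetical and `None` stands in for a missing value:

```python
def expand_numeric(value):
    """Expand a numeric field into (value_predictor, missing_indicator).

    When the value is missing, the missing indicator is 1 and the
    value predictor is set to 0; otherwise the indicator is 0 and
    the value is passed through.
    """
    if value is None:
        return (0.0, 1)
    return (float(value), 0)

print(expand_numeric(3.5))   # (3.5, 0)
print(expand_numeric(None))  # (0.0, 1)
```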

Alternatively, you can replace your missing numeric values with another valid value such as the field’s mean, median, maximum, minimum, or zero (see subsection 3.4.3 ).

If the input data does not contain missing values for a field, the coefficient for missing values will be zero, except for text fields, where it can be non-zero. This is because BigML limits text fields to 1,000 terms, so some instances may not contain any of the terms used to build the model and are treated as missing values instead. (See Field Limits to know more about term limits for text fields.)

\includegraphics[]{images/linearregression/lnr-missing-coeff-table}
Figure 3.4 Missing numeric coefficients at the end of linear regression table

3.2.3 Number of Predictors

Because of input field transformations and missing values, one input field may become more than one predictor in the linear equation used to fit the data. This table summarizes the number of predictors generated for each input field type.

Input field type                            No Missing Values          Missing Values
Numeric                                     1                          2
Categorical, dummy-encoded                  (number of classes) - 1    number of classes
Categorical, contrast- or other-encoded     number of classes          (number of classes) + 1
Text                                        number of terms            (number of terms) + 1
Items                                       number of items            (number of items) + 1

Table 3.1 Number of predictors per input field

The bias term, also called the intercept term and enabled by default, counts as one additional predictor.

For example, suppose the data has \(10\) numeric input fields, \(2\) of which have missing values; these generate \(8 + 2 \times 2 = 12\) predictors. It also has \(2\) categorical fields: one with \(6\) classes, dummy-encoded and without missing values, and another with \(3\) classes, contrast-encoded and with missing values; together they generate \((6 - 1) + (3 + 1) = 9\) predictors. A text field with \(15\) terms and missing values generates \(15 + 1 = 16\) predictors, and an items field with \(8\) items and no missing values generates \(8\) predictors. With the bias term enabled, this linear regression has \(46\) predictors from its \(14\) input fields.
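The tally in the example above can be checked programmatically. The helper functions below are illustrative; they simply encode the counting rules of Table 3.1:

```python
def numeric_predictors(has_missing):
    # A numeric field yields one value predictor, plus a
    # missing-value indicator if the field has missing values.
    return 2 if has_missing else 1

def categorical_predictors(n_classes, encoding, has_missing):
    # Dummy encoding drops one class (the dummy class);
    # contrast and other encodings keep all classes.
    base = n_classes - 1 if encoding == "dummy" else n_classes
    return base + (1 if has_missing else 0)

total = 0
total += 8 * numeric_predictors(False) + 2 * numeric_predictors(True)  # 12
total += categorical_predictors(6, "dummy", False)                     # +5
total += categorical_predictors(3, "contrast", True)                   # +4
total += 15 + 1  # text field: 15 terms plus a missing-value predictor
total += 8       # items field: 8 items, no missing values
total += 1       # bias (intercept) term
print(total)     # 46
```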

3.2.4 Ill-conditioned Problems

A linear regression is ill-conditioned when there is insufficient data to estimate the values of the coefficients; typically, this occurs when the number of rows is fewer than the number of predictors. In this case, the coefficients that cannot be estimated are set to 0, and in the JSON response of the model, the stats output will not contain standard_errors, z_values, confidence_intervals, and p_values. A warning is also added to the model status.

Predictions with an ill-conditioned linear regression will have confidence and prediction intervals equal to 0.
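The underlying problem can be seen with NumPy: when the design matrix has fewer rows than columns (predictors), its rank is less than the number of predictors, so the least-squares system has no unique solution. The data here is made up for illustration:

```python
import numpy as np

# 2 rows but 4 predictors (including the intercept column): the
# system is underdetermined, so the coefficients cannot all be
# uniquely estimated -- the ill-conditioned case described above.
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [1.0, 5.0, 6.0, 7.0]])

rank = np.linalg.matrix_rank(X)
print(rank, X.shape[1])   # rank 2 < 4 predictors
print(rank < X.shape[1])  # True
```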