Classification and Regression with the BigML Dashboard

3.4 Linear Regression Configuration Options

While the 1-click creation menu option (see section 3.3) provides a convenient way to create a BigML linear regression, you can gain more control by configuring a number of parameters that affect how BigML creates linear regressions. To do so, click the configure linear regression menu option in the configuration menu of your dataset view. (See Figure 3.7.)

\includegraphics[]{images/linearregression/lnr-config}
Figure 3.7 Configure linear regression

3.4.1 Objective Field

The objective field, or “target field”, is the field you want to predict. Linear regressions only support numeric fields as the Objective Field.

BigML takes the last numeric field in your dataset as the objective field by default. If you want to change the objective field, you have two options: you can select another field from the configuration panel to build the linear regression, or you can change it permanently from your dataset view.

  • Select the Objective field from the linear regression configuration panel. This option will only affect the linear regression you are building at that time. (See Figure 3.8.)

    \includegraphics[]{images/linearregression/lnr-objective-field}
    Figure 3.8 Configure the objective field to create the linear regression
  • Change the default objective field for the dataset. This option will save your objective field preference for any model you build. Click on the edit icon that appears next to the field name when you mouse over it, and a pop-up window will be displayed. Then click on the Objective field icon and click Save. (See Figure 3.9.)

    \includegraphics[]{images/linearregression/lnr-objective-field-dataset}
    Figure 3.9 Change the default objective field

3.4.2 Automatic optimization

You can turn on the Automatic optimization option so BigML will automatically tune the parameters of your linear regression (see Figure 3.10 ).

\includegraphics[]{images/linearregression/auto-lnr-optimization}
Figure 3.10 Automatic optimization

The main focus of optimization in linear regression is the bias term, also known as the intercept term. Hand-tuning it is a time-consuming process, so BigML offers first-class support for automatic linear regression parameter optimization.

Behind the scenes, BigML uses the same technology for linear regression parameter optimization as the one used for OptiML. If you want to know more about the technical details, please read Chapter 2 of the document OptiML with the BigML Dashboard [15].

When you turn on the Automatic optimization option, all the linear regression parameters will be disabled (because they will be automatically optimized), except the Default numeric value and the Weights parameters, which you can still configure manually (see Figure 3.11).

\includegraphics[]{images/linearregression/auto-lnr-optimization2}
Figure 3.11 Configure the default numeric value

Since the optimization process can take some time, BigML offers two configurable parameters to limit the time to create the optimized linear regression: a training duration (see Training duration ) and the linear regression candidates (see Linear regression candidates ).

Training duration

The training duration is a scale parameter that regulates the linear regression runtime. It is set as an integer from 1 to 10 and indicates how long you are willing to let the optimization run. The higher the number, the longer you are willing to wait for possibly better linear regression performance; the lower the number, the sooner the training will finish. The default value is 5.

The training duration is set on a relative scale rather than in time units. The actual training time depends on the dataset size, among other factors.

\includegraphics[]{images/linearregression/lnr-training-duration}
Figure 3.12 Training duration

Linear regression candidates

The maximum number of different linear regressions (i.e., linear regressions using a unique configuration) to be trained and evaluated during the optimization process. The default is 128 candidates, which is usually enough to find the best linear regression, but you can set it from 4 up to 200. Only the top-performing linear regression will be returned. If the training duration is very low (see Training duration) given the dataset size, it is possible that not all the linear regression candidates will be tried out.

\includegraphics[]{images/linearregression/lnr-candidates}
Figure 3.13 Linear regression candidates

3.4.3 Default Numeric Value

Linear regressions can include missing values as valid values for any type of field, as explained in subsection 3.2.2. However, there can be situations in which you don’t want to include them in your model. For those cases, the Default numeric value parameter is an easy way to replace missing numeric values with another valid value. You can choose to replace them with the field’s Mean, Median, Maximum, or Minimum, or with Zero. (See Figure 3.14.)

\includegraphics[]{images/linearregression/lnr-default-numeric}
Figure 3.14 Select a default numeric value to replace missing numeric values

Note: if your dataset does not contain missing values for your numeric fields, this parameter will have no impact on your linear regression.
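The replacement described above can be sketched in a few lines of Python. This is only an illustration of the idea, not BigML's implementation: the function name `fill_missing`, the strategy keys, and the use of `None` to mark a missing value are all assumptions made for this example.

```python
from statistics import mean, median

# Strategy names mirror the Dashboard options; the keys themselves are
# assumptions of this sketch, not documented API values.
STRATEGIES = {
    "mean": mean,
    "median": median,
    "maximum": max,
    "minimum": min,
    "zero": lambda values: 0,
}


def fill_missing(column, strategy):
    """Return a copy of `column` with None replaced by the chosen statistic."""
    present = [v for v in column if v is not None]
    if len(present) == len(column):
        # No missing values: the parameter has no effect on this field.
        return list(column)
    if not present:
        raise ValueError("cannot compute a statistic for an all-missing field")
    replacement = STRATEGIES[strategy](present)
    return [replacement if v is None else v for v in column]


ages = [20, None, 30, 40, None]
print(fill_missing(ages, "mean"))
print(fill_missing(ages, "zero"))
```

The statistic is computed from the non-missing values only, so a field without missing values is returned unchanged, which matches the note above.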

3.4.4 Weights

It is not unusual for a dataset to have unbalanced fields, which means there are many instances in certain ranges and few in others. For example, in datasets used to model company financials, there are many more companies with 50 to 500 employees than companies with more than 100,000 employees. So as the company size increases, there are fewer cases to fit. In that case, you may want to assign more weight to the scarce instances so they are equivalent to the abundant ones.

BigML provides an option to assign specific weights to your instances (see Figure 3.15 ).

\includegraphics[]{images/linearregression/lnr-weights}
Figure 3.15 Weight options for linear regression

Weight Field

The Weight field option allows you to assign individual weights to each instance by choosing a special weight field. Ideally, the selected field is an integer field with a minimum value of 1. It must not contain any negative or missing values. Any non-negative weight field will be accepted, however: if the minimum value is different from 1, each value in the weight field will be divided by the minimum value and rounded to the nearest integer.

If an instance has a weight of 3, it will be replicated three times in the dataset used to train the model.

The weight field will be excluded from the input fields when building the linear regression. You can select an existing field in your dataset or you may create a new one in order to assign customized weights.
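As a rough illustration of the weight field rules above, the following Python sketch normalizes the weights by the minimum value and then replicates each instance accordingly. The function names are hypothetical. Two points are assumptions of this sketch because the text does not specify them: a minimum weight of zero is rejected to avoid dividing by zero, and exact halves are rounded with Python's built-in `round`.

```python
def normalize_weights(weights):
    """Divide every weight by the minimum and round to the nearest integer."""
    if any(w is None or w < 0 for w in weights):
        raise ValueError("weights must not be negative or missing")
    smallest = min(weights)
    if smallest == 0:
        # Assumption: the text does not say how a zero minimum is handled.
        raise ValueError("this sketch requires a positive minimum weight")
    return [round(w / smallest) for w in weights]


def replicate(rows, weights):
    """Repeat each row as many times as its normalized weight."""
    expanded = []
    for row, weight in zip(rows, normalize_weights(weights)):
        expanded.extend([row] * weight)
    return expanded


# The minimum weight is 2, so the weights become 1, 2 and 3.
print(replicate(["a", "b", "c"], [2, 4, 6]))
```

When the minimum weight is already 1, the division leaves the weights unchanged and only the rounding applies.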

3.4.5 Bias

You can include or exclude the Bias, a.k.a. the intercept term of the linear regression formula, from the model. (See the formula in section 3.2.) In most cases, including the bias results in a better model. By default it is included. (See Figure 3.16.)

\includegraphics[]{images/linearregression/lnr-bias}
Figure 3.16 Bias parameter

3.4.6 Field Codings

Categorical fields must be converted to numeric values in order to train a linear regression model. By default, they are Dummy encoded, with the default dummy class being the first class in lexicographic order. BigML also allows you to configure two other types of coding for each of your categorical fields: Contrast coding and Other coding. See the following subsections for a detailed explanation of each option. (Learn more about input field transformations in subsection 3.2.1.)

Dummy Coding

The main goal of using dummy coding is to compare a class selected as the reference or control class with the rest of the classes. The control class is assigned a value of 0 for each variable. The control class is called the dummy class in BigML, and it is usually a class with a representative number of instances compared to the other classes in the dataset. Table 3.2 shows an example of a dummy coding schema for three different classes, with “Class 1” as the dummy class:

\begin{tabular}{lccc}
Classes & C0 & C1 & C2 \\
\hline
Class 1 & 0 & 0 & 0 \\
Class 2 & 1 & 0 & 0 \\
Class 3 & 0 & 1 & 0 \\
MISSING & 0 & 0 & 1 \\
\end{tabular}

Table 3.2 Dummy coding example for 3 classes
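The schema in Table 3.2 can be reproduced with a short Python sketch. The function name `dummy_code` and the use of `None` for a missing value are assumptions made for illustration. The default dummy class is taken as the first class in lexicographic order, as described above.

```python
def dummy_code(value, classes, dummy_class=None):
    """Return the 0/1 indicator vector for `value` (None means missing)."""
    if dummy_class is None:
        # Default dummy class: first class in lexicographic order.
        dummy_class = min(classes)
    # One indicator per non-dummy class, plus a last one for missing values.
    others = [c for c in classes if c != dummy_class]
    vector = [0] * (len(others) + 1)
    if value is None:
        vector[-1] = 1
    elif value != dummy_class:
        vector[others.index(value)] = 1
    return vector


classes = ["Class 1", "Class 2", "Class 3"]
for value in classes + [None]:
    print(value, dummy_code(value, classes))
```

The dummy class maps to a vector of zeros, so the coefficients of the remaining indicators can be read relative to it.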

To set Dummy coding for a field:

  1. Click on the configuration icon next to the field name. (See Figure 3.17 .)

    \includegraphics[]{images/linearregression/lnr-field-codings}
    Figure 3.17 Field coding configuration
  2. A modal window will be displayed so you can configure the field codings for that field. If the field does not have a previous configuration for field codings, it will be disabled. Enable field coding configuration by clicking on the green switcher shown in Figure 3.18 .

    \includegraphics[]{images/linearregression/lnr-enable-field-codings}
    Figure 3.18 Enable field coding configuration
  3. Select the class you want to set as the dummy class. (See Figure 3.19 .)

    \includegraphics[]{images/linearregression/lnr-dummy-configured}
    Figure 3.19 Select the dummy class
  4. Click Save . Make sure you saved your configuration by looking at the bottom message “Configured Coding: DUMMY”. (See Figure 3.20 .)

    \includegraphics[]{images/linearregression/lnr-dummy-saved}
    Figure 3.20 Field codings: dummy

    Note: you cannot select several field codings for the same field simultaneously.

  5. Close the modal window by clicking outside or by clicking Cancel .

    \includegraphics[]{images/linearregression/lnr-dummy-cancel}
    Figure 3.21 Close modal window

    Note: if the Cancel button is red, it indicates there are changes you have not saved yet so you will lose them by closing the modal window.

  6. After configuring the field codings for a field, the configuration icon will become green. (See Figure 3.22 .)

    \includegraphics[]{images/linearregression/lnr-field-codings-configured}
    Figure 3.22 Field codings configured
  7. To remove the field coding configuration for that field, click Disable from the switcher and click Save again. (See Figure 3.23 .)

    \includegraphics[]{images/linearregression/lnr-dummy-disable}
    Figure 3.23 Disable field coding configuration

After creating your linear regression, your dummy class will be identified with the dummy icon in the coefficients table view (see subsection 3.5.2 ). (See Figure 3.24 .)

\includegraphics[]{images/linearregression/lnr-dummy-table}
Figure 3.24 Dummy class in table view

Contrast Coding

Contrast coding allows you to set different values for different classes. Instead of the 0-1 values of Dummy coding, you can set any integer or float value for each of the classes, plus an additional one for missing values. The sum of all values must equal 0. The values of the classes need to be set based on a certain hypothesis, e.g., a higher value for a class assumes this class has more influence on the objective field than the others. A positive value indicates a positive relationship between the class and the objective field, while a negative value indicates a negative relationship. A coefficient of 0 will exclude the class from the model. Table 3.3 shows an example of a contrast coding schema for three different classes.

\begin{tabular}{lc}
Classes & C0 \\
\hline
Class 1 & 0.5 \\
Class 2 & -0.25 \\
Class 3 & -0.25 \\
MISSING & 0 \\
\end{tabular}

Table 3.3 Contrast coding example for 3 classes
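The sum-to-zero rule can be checked before a coding is used. The Python sketch below is a minimal illustration under stated assumptions: the function name, the dictionary layout, and the `None` key for missing values are hypothetical, and the small tolerance only guards against floating-point error.

```python
import math


def contrast_code(coding, value):
    """Return the contrast value for `value` after validating the coding."""
    total = sum(coding.values())
    if not math.isclose(total, 0.0, abs_tol=1e-9):
        raise ValueError(f"contrast values must sum to 0, got {total}")
    return coding[value]


# Values from Table 3.3; the None key stands for MISSING.
table_3_3 = {"Class 1": 0.5, "Class 2": -0.25, "Class 3": -0.25, None: 0}
print(contrast_code(table_3_3, "Class 1"))
print(contrast_code(table_3_3, None))
```

A coding whose values do not sum to 0 is rejected by this check.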

To set Contrast coding for a field, follow these steps:

  1. Click on the configuration icon next to the field name. (See Figure 3.25 .)

    \includegraphics[]{images/linearregression/lnr-field-codings2}
    Figure 3.25 Field coding configuration
  2. A modal window will be displayed so you can configure the field codings for that field. If the field does not have a previous configuration for field codings, it will be disabled. Enable field coding configuration by clicking on the green switcher shown in Figure 3.26.

    \includegraphics[]{images/linearregression/lnr-enable-field-codings2}
    Figure 3.26 Enable field coding configuration
  3. Select the Contrast coding option. (See Figure 3.27 .)

    \includegraphics[]{images/linearregression/lnr-contrast-coding}
    Figure 3.27 Field codings: contrast coding
  4. Set the values you want for your classes based on your hypothesis. All class values must sum to 0. (See Figure 3.28.) By using the BigML API, multiple contrast codings can be given for a field as long as all the codings are orthogonal, which ensures there are no co-dependent coefficients. Check the corresponding documentation.

    \includegraphics[]{images/linearregression/lnr-contrast-configured}
    Figure 3.28 Set the contrast coding values for each class

    Note: you cannot select several field codings for the same field simultaneously.

  5. Click Save . Make sure you saved your configuration by looking at the bottom message “Configured Coding: CONTRAST”. (See Figure 3.29 .)

    \includegraphics[]{images/linearregression/lnr-contrast-saved}
    Figure 3.29 Contrast coding saved
  6. Close the modal window by clicking outside or by clicking Cancel .

    \includegraphics[]{images/linearregression/lnr-contrast-cancel}
    Figure 3.30 Close modal window

    Note: if the Cancel button is red, it indicates there are changes you have not saved yet so you will lose them by closing the modal window.

  7. After configuring the field codings for a field, the configuration icon will become green. (See Figure 3.31 .)

    \includegraphics[]{images/linearregression/lnr-field-codings-configured2}
    Figure 3.31 Field codings configured
  8. To remove the field coding configuration for that field, click Disable from the switcher and click Save again. (See Figure 3.32 .)

    \includegraphics[]{images/linearregression/lnr-contrast-disable}
    Figure 3.32 Disable field coding configuration

After creating your linear regression, you will be able to see your Contrast coding values in the coefficients table view (see subsection 3.5.2 ) by clicking on the icon. (See Figure 3.33 .)

\includegraphics[]{images/linearregression/lnr-contrast-table}
Figure 3.33 Contrast icon in table view

A modal window will be displayed with your codings values and you can download them in CSV or JSON format by clicking on the corresponding icons. (See Figure 3.34 .)

\includegraphics[]{images/linearregression/lnr-contrast-modal}
Figure 3.34 Contrast modal window in table view
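Step 4 above notes that multiple contrast codings given through the API must be orthogonal. A common way to read that requirement is that every pair of coding vectors has a dot product of zero. The Python sketch below checks this under that assumption. The function name and the example vectors are hypothetical, and the text does not describe how BigML performs the check.

```python
from itertools import combinations


def are_orthogonal(codings, tol=1e-9):
    """Return True if every pair of coding vectors has a zero dot product."""
    return all(
        abs(sum(a * b for a, b in zip(u, v))) <= tol
        for u, v in combinations(codings, 2)
    )


# Values are ordered as Class 1, Class 2, Class 3, MISSING.
first = [0.5, -0.25, -0.25, 0]   # Class 1 against Classes 2 and 3
second = [0, 0.5, -0.5, 0]       # Class 2 against Class 3
third = [0.5, -0.5, 0, 0]        # overlaps with `first`

print(are_orthogonal([first, second]))
print(are_orthogonal([first, third]))
```

Each vector still sums to 0, so it remains a valid contrast coding on its own; orthogonality is an additional condition on the set of codings.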

Other Coding

Other coding allows you to set different values for different classes. It works the same way as contrast coding (see Contrast Coding), but in this case the values do not need to sum to 0. Table 3.4 shows an example of an other coding schema for three different classes.

\begin{tabular}{lc}
Classes & C0 \\
\hline
Class 1 & 2 \\
Class 2 & -0.4 \\
Class 3 & 3 \\
MISSING & 1 \\
\end{tabular}

Table 3.4 Other coding

To set Other coding for a field, follow these steps:

  1. Click on the configuration icon next to the field name. (See Figure 3.35 .)

    \includegraphics[]{images/linearregression/lnr-field-codings}
    Figure 3.35 Field coding configuration
  2. A modal window will be displayed so you can configure the field codings for that field. If the field does not have a previous configuration for field codings, it will be disabled. Enable field coding configuration by clicking on the green switcher shown in Figure 3.36.

    \includegraphics[]{images/linearregression/lnr-enable-field-codings}
    Figure 3.36 Enable field coding configuration
  3. Select the Other coding option. (See Figure 3.37 .)

    \includegraphics[]{images/linearregression/lnr-other-coding}
    Figure 3.37 Field codings: other coding
  4. Set the values you want for your classes based on your hypothesis. You can set any float or integer value. (See Figure 3.38.) By using the BigML API, multiple other codings can be given for a field. Check the corresponding documentation.

    \includegraphics[]{images/linearregression/lnr-other-configured}
    Figure 3.38 Set the other coding values for each class

    Note: you cannot select several field codings for the same field simultaneously.

  5. Click Save . Make sure you saved your configuration by looking at the bottom message “Configured Coding: OTHER”. (See Figure 3.39 .)

    \includegraphics[]{images/linearregression/lnr-other-saved}
    Figure 3.39 Other coding saved
  6. Close the modal window by clicking outside or by clicking Cancel .

    \includegraphics[]{images/linearregression/lnr-other-cancel}
    Figure 3.40 Close modal window

    Note: if the Cancel button is red, it indicates there are changes you have not saved yet so you will lose them by closing the modal window.

  7. After configuring the field codings for a field, the configuration icon will become green. (See Figure 3.41 .)

    \includegraphics[]{images/linearregression/lnr-field-codings-configured3}
    Figure 3.41 Field codings configured
  8. To remove the field coding configuration for that field, click Disable from the switcher and click Save again. (See Figure 3.42 .)

    \includegraphics[]{images/linearregression/lnr-other-disable}
    Figure 3.42 Disable field coding configuration

After creating your linear regression, you will be able to see your Other coding values in the coefficients table view (see subsection 3.5.2 ) by clicking on the icon. (See Figure 3.43 .)

\includegraphics[]{images/linearregression/lnr-other-table}
Figure 3.43 Other coding in coefficients table

A modal window will be displayed with your coding values and you can download them in CSV or JSON format by clicking on the corresponding icons. (See Figure 3.44 .)

\includegraphics[]{images/linearregression/lnr-other-modal}
Figure 3.44 Other coding modal window

3.4.7 Sampling Options

Sometimes you do not need all the data contained in your dataset to build your linear regression. If you have a very large dataset, sampling may be a good way of getting faster results. BigML allows you to sample your dataset before creating the linear regression, so you do not need to create a separate dataset first. You can find a detailed explanation of the sampling parameters available in the following subsections. (See Figure 3.45 .)

Rate

The Rate is the proportion of instances to include in your sample. Set any value between 0% and 100%. Defaults to 100%.

Range

Specifies a subset of instances from which to sample, e.g., from instance 1 to instance 200. The Rate you set will be computed over the configured Range. This option may be useful when you have temporal data and you want to train your linear regression with historical data and test it with the most recent data to check whether it can predict based on time.

Sampling

By default, BigML selects your instances for the sample by using a random number generator, which means two samples from the same dataset will likely be different even when using the same rates and row ranges. If you choose deterministic sampling, the random-number generator will always use the same seed, thus producing repeatable results. This lets you work with identical samples from the same dataset.

Replacement

Sampling with replacement allows a single instance to be selected multiple times. Sampling without replacement ensures that each instance cannot be selected more than once. By default, BigML generates samples without replacement.

Out of bag

This argument creates a sample containing only the out-of-bag instances for the currently defined rate, that is, the instances that were not selected. The final proportion of instances in your sample will therefore be one minus the configured rate (when replacement is false). This can be useful for splitting a dataset into training and testing subsets. It is only selectable when the sample rate is less than 100%.

\includegraphics[]{images/linearregression/lnr-sampling}
Figure 3.45 Sampling parameters for linear regression
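To make the interaction between the sampling parameters concrete, here is a minimal Python sketch. It is not BigML's implementation: the function name, the argument names, and the rounding of the sample size are assumptions for illustration. Rows are numbered from 1 and the range is inclusive, following the example given under Range.

```python
import random


def sample_indices(n_rows, rate=1.0, row_range=None, seed=None,
                   replacement=False, out_of_bag=False):
    """Return the sorted row numbers selected by the sampling options."""
    start, end = row_range if row_range else (1, n_rows)
    pool = list(range(start, end + 1))      # the Rate applies to the Range
    rng = random.Random(seed)               # a fixed seed gives repeatable samples
    k = round(len(pool) * rate)
    if replacement:
        chosen = [rng.choice(pool) for _ in range(k)]   # duplicates allowed
    else:
        chosen = rng.sample(pool, k)                    # each row at most once
    if out_of_bag:
        picked = set(chosen)
        return [i for i in pool if i not in picked]     # rows left out
    return sorted(chosen)


train = sample_indices(1000, rate=0.8, seed=42)
test = sample_indices(1000, rate=0.8, seed=42, out_of_bag=True)
print(len(train), len(test))
```

Because both calls use the same seed and rate, the out-of-bag sample is exactly the complement of the first one, which is what makes this option convenient for building training and testing subsets.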

3.4.8 Advanced Ordering

Ordering options are relevant to ensure that BigML can correctly determine whether it can take an Early split of your dataset to accelerate the training process. In particular, early splitting can only be safely used if the training instances have been previously shuffled.

If your instances are already shuffled, BigML allows you to choose the linear option. This will make the process of building the model much faster, since it will not be necessary to reshuffle the dataset. If you need to shuffle your instances, BigML provides two options for that purpose, deterministic shuffling and random shuffling, which are described below.

Ordering options have no influence on datasets smaller than 34 GB, since the whole dataset is used to build the model.

By default, BigML uses deterministic shuffling to ensure the same (deterministic) sample of the instances is used and the built model is thus repeatable.

Deterministic Shuffling

The deterministic shuffling option ensures that the row shuffling of a dataset is always the same, so that retraining a BigML model from the same dataset yields the same results.

By default, this option is true.

Linear Shuffling

The linear shuffling option is useful when you know that your instances are already in random order. Using linear shuffling, the BigML model will be constructed faster.

By default, this option is false.

Random Shuffling

The random shuffling option will ensure that a different shuffling will be tried each time you train your model.

By default, this option is false.

\includegraphics[]{images/linearregression/lnr-ordering-params}
Figure 3.46 Ordering argument for linear regression
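The difference between the three ordering options can be sketched as follows. This is an illustration only: the function name, the option strings, and the fixed seed value are hypothetical and do not reflect how BigML shuffles data internally.

```python
import random

FIXED_SEED = 0  # hypothetical constant used for deterministic shuffling


def order_rows(rows, ordering="deterministic"):
    """Return the rows in the order implied by the chosen ordering option."""
    rows = list(rows)
    if ordering == "linear":
        return rows                        # trust the existing order
    if ordering == "deterministic":
        rng = random.Random(FIXED_SEED)    # same shuffle on every run
    elif ordering == "random":
        rng = random.Random()              # fresh shuffle on every run
    else:
        raise ValueError(f"unknown ordering: {ordering}")
    rng.shuffle(rows)
    return rows


print(order_rows(range(10), "linear"))
print(order_rows(range(10), "deterministic"))
```

Deterministic shuffling returns the same order every time it runs on the same rows, which is why it keeps the resulting model repeatable.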

3.4.9 Creating Linear Regressions with Configured Options

After finishing the configuration of your options, you can change the default linear regression name in the editable text box. Then you can click on the Create linear regression button to create the new linear regression, or reset the configuration by clicking on the Reset button.

\includegraphics[]{images/linearregression/lnr-configuration-create-lnr}
Figure 3.47 Create linear regression after configuration

3.4.10 API Request Preview

The API Request Preview button is at the bottom center of the configuration panel, next to the Reset button (see Figure 3.47). It shows how to create the linear regression programmatically: the endpoint of the REST API call and the JSON that specifies the arguments configured in the panel. (See Figure 3.48.)

\includegraphics[]{images/linearregression/lnr-configuration-api-preview}
Figure 3.48 Linear regression API request preview

There are options on the upper right to either export the JSON or copy it to the clipboard. At the bottom there is a link to the API documentation for linear regressions, in case you need to check any of the possible values or want to learn more about using the API to automate your workflows.

Please note: when an argument is left at its default value in the configuration, it won’t appear in the generated JSON. Because the API applies default values to any missing arguments, there is no need to send them in the creation request.
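The rule about default values can be illustrated with a small Python sketch that builds a request body and drops every argument left at its default. The argument names and default values in `DEFAULTS`, the function name, and the dataset identifier are assumptions for this example; check the API documentation for linear regressions for the actual arguments.

```python
import json

# Assumed defaults, modeled on the options described in this section.
DEFAULTS = {
    "bias": True,
    "sample_rate": 1.0,
    "replacement": False,
    "out_of_bag": False,
}


def build_request_body(dataset_id, **options):
    """Keep only the arguments whose values differ from their defaults."""
    body = {"dataset": dataset_id}
    for name, value in options.items():
        if name in DEFAULTS and DEFAULTS[name] == value:
            continue                      # the API would apply this default
        body[name] = value
    return body


# Placeholder dataset identifier, not a real resource.
body = build_request_body(
    "dataset/0123456789abcdef01234567",
    bias=True,          # default, omitted from the JSON
    sample_rate=0.8,    # non-default, kept
)
print(json.dumps(body, indent=2))
```

Only the dataset and the non-default sample rate end up in the JSON, mirroring what the preview shows for a panel where only the Rate was changed.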