Classification and Regression with the BigML Dashboard

1.4 Model Configuration Options

While the 1-click creation menu option (see section 1.3) provides a convenient and easy way to create a BigML model from a dataset, there are cases when you want more control. This section focuses on the options that BigML offers to configure its internal algorithms for BigML models.

You can set a number of parameters that affect the way BigML creates models from a dataset. Such parameters can be grouped in two categories:

Parameters that are permanently associated with the dataset, such as its objective field and preferred fields. Once you provide values for a dataset’s permanent parameters, they are used as defaults when creating models from that dataset.
Parameters that only affect the model that is currently being created and that you are expected to set each time. Those include the objective field, included/excluded fields, and a number of configuration options that are described below.

Set a dataset’s permanent parameters by clicking on the edit button that is displayed when you hover on the dataset’s fields. This opens a modal dialog where you can set some of the field properties (see Figure 1.10).

Click on the preferred field button to mark that field as non-preferred.

Click on the objective field button to make that field the new objective field.

To access the configuration panel, select the configure model menu option located in the configuration menu of your dataset’s detail view (see Figure 1.11).

\includegraphics[]{images/models/configure-model} — Figure 1.11 Configure model

When the configuration panel is displayed, you can:

Select or deselect individual fields for them to be included in or excluded from the model computation.
Change the objective field used for the model to be created.
Manually configure a number of configuration options or automatically optimize these options.

Note: when the configuration panel is displayed, the edit button is not visible, so you cannot set the dataset’s permanent properties.

See below for a detailed explanation of the configuration options that are available as well as the corresponding default values.

1.4.1 Objective Field

Also known as “target field”, the Objective Field is the output variable you want to predict.

Select your objective field in BigML in either of two ways. Specify the objective field each time you create a model from the configuration panel or set a field as the default objective for all models by clicking the edit button and then the objective field button.

By default, BigML uses the last valid field in your dataset as the objective field. Fields of type text and items are the exception, since they cannot be used as objective.

1.4.2 Automatic Optimization

You can turn on the Automatic optimization option so BigML will automatically tune the parameters of your model (see Figure 1.12).

\includegraphics[]{images/models/auto-model-optimization} — Figure 1.12 Automatic optimization

The high number of possible combinations for parameter values makes it difficult to find the optimum configuration since the combinations that lead to a poor result outnumber the ones that result in a satisfying performance. Hand-tuning different configurations is a time-consuming process that requires a high level of expertise and intuition. To combat this problem, BigML offers first-class support for automatic model parameter optimization.

Behind the scenes, BigML uses the same technology for model parameter optimization as the one used for OptiML. If you want to know more about the technical details, please read Chapter 2 of the document OptiML with the BigML Dashboard [15].

When you turn on the Automatic optimization option, all the model parameters will be disabled (because they will be automatically optimized), except the Missing splits and the Weights parameters, which you can manually configure (see subsection 1.4.4 and subsection 1.4.6).

\includegraphics[]{images/models/auto-model-optimization2} — Figure 1.13 Configure the missing splits and the weights for your model

Since the optimization process can take some time, BigML offers two configurable parameters to limit the time needed to create the optimized model: a training duration (see Training duration) and a maximum number of model candidates (see Model candidates).

Training duration

A scale parameter that regulates the model runtime. It is set as an integer from 1 to 10 and indicates how long you are willing to let the optimization run. The higher the number, the longer you are willing to wait for possibly better model performance; the lower the number, the sooner the model training finishes. The default value is 5.

The training duration is set on a scale, not in time units. The actual training time depends on the dataset size, among other factors.

\includegraphics[]{images/models/model-training-duration} — Figure 1.14 Training duration

Model candidates

The maximum number of different models (i.e., models using a unique configuration) to be trained and evaluated during the optimization process. The default is 128 models, which is usually enough to find the best model, but you can set it from 4 up to 200. The top-performing model will be returned. If the training duration is very low (see Training duration) given the dataset size, it is possible that not all the model candidates will be tried out.

1.4.3 Pruning

As explained in subsection 1.2.4, pruning strategies are essential to avoid Overfitting, a phenomenon that reduces a model’s ability to generalize.

In BigML you can choose among three different pruning strategies (Figure 1.16):

Smart Pruning: considers pruning the nodes with less than 1% of the instances.
Statistical Pruning: considers every node for pruning.
No Statistical Pruning: deactivates pruning altogether.

By default, BigML uses Smart Pruning to create your models.

1.4.4 Missing Splits

When training a model, BigML may encounter missing values, which can be either considered or ignored for the definition of splitting rules.

To include missing splits in your model, enable the missing splits option (see Figure 1.16). If missing values are included in your model, you may find rules with predicates of the following kind: field x = "is missing" or field x = "y or is missing".

BigML includes missing values following the MIA approach [40].

By default, BigML does not include missing splits.
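The two predicate shapes mentioned above can be illustrated with a small sketch. It only mimics how such a rule reads once missing splits are enabled; it is not BigML's implementation of MIA, and representing a missing value as None is an assumption made for the example.

```python
def is_missing(value):
    """Predicate 'field x = is missing'. None stands in for a missing value (assumption)."""
    return value is None


def matches(value, expected, include_missing=False):
    """Predicate 'field x = y', or 'field x = y or is missing' when include_missing is True."""
    if value is None:
        # Without missing splits, a missing value never satisfies the predicate.
        return include_missing
    return value == expected


print(matches("y", "y"))                         # True
print(matches(None, "y"))                        # False
print(matches(None, "y", include_missing=True))  # True
```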

1.4.5 Node Threshold

Set the Node threshold to limit a BigML model’s growth (see Figure 1.16). A lower threshold simplifies the model and helps to avoid Overfitting. However, it may also reduce the model’s predictive power compared to deeper models. The ideal number of nodes may depend on the dataset size and the number of features: larger datasets with many important features may require more complex models. Reducing the number of nodes can also be useful to get an initial understanding of the basic data patterns; you can then grow the model from there.

By default, BigML sets a node threshold of 512. Since nodes are computed in batches, on occasion the final number of nodes can be greater than the node threshold (see subsection 1.2.2).

\includegraphics[]{images/models/model-params} — Figure 1.16 Model configuration options

1.4.6 Weight Options

It is not unusual for a dataset to have an unbalanced Objective Field, where some categories are common and others very rare. For example, in datasets used to predict fraud, usually fraudulent transactions are very scarce compared to regular ones. When this happens, models tend to predict the most frequent values simply because the overall model’s performance metric improves with that approach. However, in cases such as fraud prediction, you may be more interested in predicting rare values rather than successfully predicting frequent ones. In that case, you may want to assign more weight to the scarce instances so they are equivalent to the abundant ones.

BigML provides three different options to assign specific weight to your instances.

Balance Objective

When you set the balance objective weight, BigML automatically balances the classes of the objective field by assigning a higher weight to the less frequent classes, with the most frequent class always having a weight of 1. This option is only available for classification models. For example, take the following frequencies for each class:

[False, 2000; True, 50]

By enabling the balance objective option, BigML will automatically apply the following weights:

[False, 1; True, 40]

In this example, the class “True” is getting forty times more weight as it is forty times less frequent than the most abundant class.
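The arithmetic of this example can be reproduced with a short sketch. The formula used here (largest class count divided by each class count) is an assumption inferred from the example above; it gives the most frequent class a weight of 1, but it is not taken from BigML's implementation.

```python
def balance_objective_weights(class_counts):
    """Weight each class by (largest class count / its own count)."""
    largest = max(class_counts.values())
    return {label: largest / count for label, count in class_counts.items()}


counts = {"False": 2000, "True": 50}
weights = balance_objective_weights(counts)
print(weights)  # {'False': 1.0, 'True': 40.0}
```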

Objective Weights

The objective weights option allows you to manually set a specific weight for each class of the objective field. BigML oversamples your weighted instances, replicating them as many times as the weight establishes. If you do not list a class, it is assumed to have a weight of 1. Weights of 0 are also valid. This option is only available for classification models.

This option can be combined with the Weight field (see Weight Field). When combining it with the Weight field, both weights are multiplied. For example, if you assign a weight of 3 for the “True” class and the weight field assigns a weight of 2 for a given instance labeled as “True”, that instance will have a total weight of 6.
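As a minimal sketch of how the two weights combine, the function below multiplies the class weight by the per-instance weight and treats unlisted classes as having a weight of 1. The function and variable names are hypothetical and used only for illustration.

```python
def instance_weight(label, weight_field_value, objective_weights):
    """Total weight = objective weight of the class * weight field of the instance."""
    # Classes that are not listed are assumed to have a weight of 1.
    return objective_weights.get(label, 1) * weight_field_value


objective_weights = {"True": 3}
print(instance_weight("True", 2, objective_weights))   # 6
print(instance_weight("False", 2, objective_weights))  # 2
```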

Weight Field

The Weight Field option allows you to assign individual weights to each instance by choosing a special weight field. It can be used for both regression and classification models. The selected field must be numeric and it must not contain any missing values. The weight field will be excluded from the input fields when building the model. You can select an existing field in your dataset or you may create a new one in order to assign customized weights.

For example, below is a dataset for which we included a field called “Weight” that assigns a ten times higher weight to fraudulent transactions than to non-fraudulent ones. BigML provides a powerful tool, the BigML Flatline editor, to add new fields to your dataset, such as a weight field. As an additional example, we could also take into account the transaction “Amount” to calculate the weights, so transactions with higher amounts will have higher weights.

Trans. ID	Products	Online	Amount $	Fraud	Weight
xxxxxx098	XYZGH	yes	3,218	FALSE	1
xxxxxx345	VBHGF	no	1,200	FALSE	1
xxxxxx123	UYFHJ	yes	5,000	FALSE	1
xxxxxx567	HSNKI	no	390	FALSE	1
xxxxxx789	SHSYA	yes	500	TRUE	10
xxxxxx093	DFSTU	yes	423	FALSE	1
xxxxxx012	TYISJ	yes	60,000	FALSE	1
xxxxxx342	SJSOP	no	789	FALSE	1
xxxxxx908	IOPKJ	no	9,450	FALSE	1
xxxxxx334	HIOPN	yes	50,678	TRUE	10

Table 1.1 Weight Field example for transactional dataset
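The logic behind the “Weight” column can be sketched outside BigML as follows. This is plain Python for illustration, not Flatline. The three rows are copied from Table 1.1, while the amount-based variant and its scaling constant are purely hypothetical.

```python
# Three rows taken from Table 1.1 (amounts in dollars).
transactions = [
    {"id": "xxxxxx098", "amount": 3218, "fraud": False},
    {"id": "xxxxxx789", "amount": 500, "fraud": True},
    {"id": "xxxxxx334", "amount": 50678, "fraud": True},
]


def weight_by_fraud(row, fraud_multiplier=10):
    """Weight used in Table 1.1: 10 for fraudulent rows, 1 otherwise."""
    return fraud_multiplier if row["fraud"] else 1


def weight_by_fraud_and_amount(row, fraud_multiplier=10, amount_scale=10_000):
    """Hypothetical variant: the fraud weight grows with the transaction amount."""
    return weight_by_fraud(row, fraud_multiplier) * (1 + row["amount"] / amount_scale)


weights = [weight_by_fraud(row) for row in transactions]
print(weights)  # [1, 10, 10]
```

Whatever rule you choose, the resulting field must be numeric and must not contain missing values, as noted above.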

\includegraphics[]{images/models/weighting-params} — Figure 1.17 Weighting arguments for models

1.4.7 Sampling Options

Sometimes you do not need all the instances contained in your dataset to build your model. If you have a very large dataset, Sampling may be a good way of getting faster results (see Figure 1.18).

The same sampling options described in the Datasets with the BigML Dashboard document [23] for sampling datasets are also available when building BigML models. They are divided into two groups: sampling and advanced sampling options.

Rate

The sampling rate is the percentage of instances extracted from the dataset and included in your sample. A sampling rate of 100% means that all instances are included; a rate of 10% means 10% of the instances are included. This option may take any value between 0% and 100%. You can configure the rate by moving the slider in the configuration panel for sampling, or by typing the percentage in the small input box, both highlighted in Figure 1.18.

By default, BigML uses a 100% rate.

Range

The sampling range is the subset of the dataset instances from which to sample, e.g., from instance 5 to instance 1,000. The rate will be applied over the range configured.

By default, all instances are included, i.e., the range is (1, num. rows in dataset).

Sampling

The sampling option represents the type of the sampling process, which can be either random or deterministic.

When using deterministic sampling the random-number generator will always use the same seed, producing repeatable results.

By default, BigML uses random sampling.

Replacement

The replacement option controls whether a single instance can be selected multiple times or not. Sampling without replacement ensures that each instance cannot be selected more than once.

By default, BigML generates samples without replacement.

Out of Bag

The out of bag option allows you to include in your sample only those instances that were not selected in the first place, thus effectively inverting the sampling outcome. It is only selectable when a sample is deterministic and the sample rate is less than 100%. The total percentage of instances included in your sample will be one minus the rate (when replacement is not allowed). This can be useful for splitting a dataset into training and testing subsets.

By default, BigML will not use out of bag instances.
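The interplay of rate, range, deterministic sampling, replacement, and out of bag can be sketched as follows. This is a simplified stand-in for illustration, not BigML's sampler: the function name, its parameters, and the use of an integer seed to make the sample deterministic are all assumptions.

```python
import random


def sample_rows(rows, rate=1.0, row_range=None, seed=None,
                replacement=False, out_of_bag=False):
    """Return a sample of rows; passing a seed makes the sample repeatable."""
    if not 0 <= rate <= 1:
        raise ValueError("rate must be between 0 and 1")
    # The range is 1-based and inclusive; by default it covers every row.
    start, end = row_range if row_range else (1, len(rows))
    pool = rows[start - 1:end]
    rng = random.Random(seed)  # seed=None gives a different sample each time
    size = round(len(pool) * rate)
    indices = list(range(len(pool)))
    if replacement:
        picked = [rng.choice(indices) for _ in range(size)]
    else:
        picked = rng.sample(indices, size)
    if out_of_bag:
        # Invert the outcome: keep only the rows that were not picked.
        chosen = set(picked)
        return [pool[i] for i in indices if i not in chosen]
    return [pool[i] for i in picked]


rows = list(range(1, 101))
train = sample_rows(rows, rate=0.8, seed=42)
test = sample_rows(rows, rate=0.8, seed=42, out_of_bag=True)
print(len(train), len(test))  # 80 20
```

Because both calls use the same seed and rate, the out of bag sample is exactly the complement of the first one, which is what makes this option convenient for training and testing splits.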

\includegraphics[]{images/models/sampling-params} — Figure 1.18 Sampling arguments for models

1.4.8 Advanced Ordering

Ordering options are relevant to ensure that BigML can correctly determine whether it can take an Early split of your dataset to accelerate the training process. In particular, early splitting can only be safely used if the training instances have been previously shuffled (see Early Splitting).

If your instances are already shuffled, BigML allows you to choose the linear option. This makes the process of building the model much faster, since BigML will not need to reshuffle the dataset. If you need to shuffle your instances, BigML provides two options for that purpose, deterministic shuffling and random shuffling, which are described below.

Ordering options have no influence on datasets of less than 34 GB, since the whole dataset is used to build the model.

By default, BigML uses deterministic shuffling to ensure the same (deterministic) sample of the instances is used and the built model is thus repeatable.

Deterministic Shuffling

The deterministic shuffling option ensures that the row shuffling of a dataset is always the same, so that retraining a BigML model from the same dataset yields the same results.

By default, this option is true.

Linear Shuffling

The linear shuffling option is useful when you know that your instances are already in random order. Using linear shuffling, the BigML model will be constructed faster.

By default, this option is false.

Random Shuffling

The random shuffling option will ensure that a different shuffling will be tried each time you train your model.

By default, this option is false.
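The difference between the three ordering options can be sketched with Python's standard random module. This is only an illustration of the idea: the option names as strings and the fixed seed value are assumptions, not BigML internals.

```python
import random


def order_rows(rows, ordering="deterministic", seed=0):
    """Return rows in the order implied by the chosen ordering option."""
    if ordering == "linear":
        # Rows are assumed to be already shuffled, so they are kept as they are.
        return list(rows)
    if ordering == "deterministic":
        rng = random.Random(seed)  # fixed seed: same shuffle on every run
    elif ordering == "random":
        rng = random.Random()      # fresh seed: different shuffle on every run
    else:
        raise ValueError(f"unknown ordering: {ordering!r}")
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled


rows = list(range(10))
print(order_rows(rows, "linear") == rows)                                   # True
print(order_rows(rows, "deterministic") == order_rows(rows, "deterministic"))  # True
```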

\includegraphics[]{images/models/ordering-params} — Figure 1.19 Ordering argument for models

1.4.9 Creating Models with Configured Options

After finishing the configuration of your options, you can change the default model name in the editable text box. Then you can click on the Create model button to create the new model, or reset the configuration by clicking on the Reset button.

\includegraphics[]{images/models/model-configuration-create-model} — Figure 1.20 Create model after configuration

1.4.10 API Request Preview

The API Request Preview button is in the middle of the bottom of the configuration panel, next to the Reset button (see Figure 1.20). It shows how to create the model programmatically: the endpoint of the REST API call and the JSON that specifies the arguments configured in the panel (see Figure 1.21 below).

\includegraphics[]{images/models/model-configuration-api-preview} — Figure 1.21 Model API request preview

There are options on the upper right to either export the JSON or copy it to the clipboard. On the bottom there is a link to the API documentation for models, in case you need to check any of the possible values or want to learn more about using the API to automate your workflows.

Please note: when the chosen configuration uses the default value for an argument, that argument does not appear in the generated JSON. Because API calls fall back to default values for missing arguments, there is no need to send them in the creation request.
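The rule for omitting defaults can be sketched as follows. The default values come from this section; the argument names in the dictionary and the dataset identifier are illustrative assumptions, so check the API documentation linked from the preview for the exact names before using them.

```python
import json

# Defaults described in this section; the key names are assumptions.
DEFAULTS = {
    "node_threshold": 512,
    "missing_splits": False,
    "sample_rate": 1.0,
    "replacement": False,
    "out_of_bag": False,
}


def request_body(dataset_id, **options):
    """Build the JSON body, leaving out arguments that equal their default."""
    body = {"dataset": dataset_id}
    for key, value in options.items():
        if key not in DEFAULTS or DEFAULTS[key] != value:
            body[key] = value
    return body


body = request_body(
    "dataset/hypothetical-id",  # placeholder, not a real resource
    node_threshold=512,         # default value, so it is omitted
    missing_splits=True,
    sample_rate=0.8,
)
print(json.dumps(body, indent=2))
```

Only the dataset reference and the two non-default arguments end up in the printed JSON.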