Classification and Regression with the BigML Dashboard

2.4 Ensemble Configuration Options

While 1-click creation (see section 2.3 ) provides a convenient and easy way to create an ensemble from a dataset, there are cases when you want more control. This section will focus on the options that BigML offers to configure its internal algorithms for BigML ensembles.

You can set a number of parameters that affect the way BigML creates ensembles from a dataset. Such parameters can be grouped in two categories:

  • Parameters that are permanently associated with the dataset, such as its objective field and preferred fields. Once you provide a value for a dataset’s permanent parameters, it will be used as the default for the creation of ensembles from that dataset.

  • Parameters that only affect the ensemble that is currently being created and that you are expected to set each time, such as included/excluded fields, and a number of configuration options that are described below. The objective field can also be specified on a per-ensemble basis, if you do not want to tie it to the dataset as described above.

Set a dataset’s permanent parameters by clicking on the edit button that is displayed when you hover over the dataset’s fields. This opens a modal dialog where you can set some of the field properties (see Figure 2.9).

\includegraphics[]{images/models/models-edit-dataset-field-modal}
Figure 2.9 Configure permanent parameter modal

Click on the preferred field button to mark that field as non-preferred.

Click on the objective field button to make that field the new objective field.

To access the configuration panel, select the configure ensemble menu option located in the configuration menu of your dataset’s detail view. (See Figure 2.10 .)

\includegraphics[]{images/ensembles/configure-ensemble}
Figure 2.10 Configure ensemble

When the configuration panel is displayed, you can:

  • Select or deselect individual fields for them to be included in or excluded from the ensemble computation.

  • Change the objective field used for the ensemble to be created.

  • Manually configure a number of configuration options or automatically optimize these options.

    Note: when the configuration panel is displayed, the edit button is not visible, so you cannot set the dataset’s permanent properties.

Configuration options are the same for ensembles as for models (section 1.4) plus a few more: the type of algorithm (Decision Forests, which include Bagging and Random Decision Forests, or Boosted Trees), the number of models, the random candidates, and the Boosting parameters (number of iterations, early stopping, and learning rate). Sampling options are also important for the configuration of ensembles. (See subsection 1.4.7.)

You can find a detailed explanation of the configuration options below.

2.4.1 Objective Field

Also known as “target field”, the Objective Field is the output variable you want to predict.

You can select your objective field in BigML in either of two ways: specify the objective field each time you create an ensemble from the configuration panel, or set a field as the default objective for all ensembles by clicking the edit button and then the objective field button.

By default, BigML will use the last valid field in your dataset as the objective, with the exception of fields of type text and items, which cannot be used as objective fields.

\includegraphics[]{images/ensembles/ensemble-objective}
Figure 2.11 Ensemble objective field

2.4.2 Automatic Optimization

You can turn on the Automatic optimization option so BigML will automatically tune the parameters of your ensemble (see Figure 2.12 ).

\includegraphics[]{images/ensembles/auto-ensemble-optimization}
Figure 2.12 Automatic optimization

The high number of possible combinations for parameter values makes it difficult to find the optimum configuration since the combinations that lead to a poor result outnumber the ones that result in a satisfying performance. Hand-tuning different configurations is a time-consuming process that requires a high level of expertise and intuition. To combat this problem, BigML offers first-class support for automatic ensemble parameter optimization.

Behind the scenes, BigML uses the same technology for ensemble parameter optimization as the one used for OptiML. If you want to know more about the technical details, please read Chapter 2 of the document OptiML with the BigML Dashboard [ 15 ].

When you turn on the Automatic optimization option, all the ensemble parameters will be disabled (because they will be automatically optimized), except the Missing splits and the Weights parameters which you can manually configure (see Missing Splits and subsection 2.4.8 ).

\includegraphics[]{images/ensembles/auto-ensemble-optimization2}
Figure 2.13 Configure the missing splits and weights for your ensemble

Note: there is a maximum of 256 trees per ensemble that will be tried out during the optimization process. If you think that your ensemble needs a higher number of models, you can manually configure it.

Since the optimization process can take some time, BigML offers two configurable parameters to limit the time to create the optimized ensemble: a training duration (see Training duration ) and the ensemble candidates (see Ensemble candidates ).

Training duration

A scale parameter that regulates the ensemble runtime, set as an integer from 1 to 10. It indicates how much time you are willing to let the optimization take: the higher the number, the more time you are willing to wait for possibly better ensemble performance; the lower the number, the faster the ensemble training will finish. The default value is 5.

The training duration is set on a relative scale; the actual training time depends on the dataset size, among other factors.

\includegraphics[]{images/ensembles/ensemble-training-duration}
Figure 2.14 Training duration

Ensemble candidates

The maximum number of different ensembles (i.e., ensembles using a unique configuration) to be trained and evaluated during the optimization process. The default number is 128 ensembles, which is usually enough to find the best ensemble, but you can set it from 4 up to 200. Only the top-performing ensemble will be returned. If the training duration is very low (see Training duration ) given the dataset size, it is possible that not all the ensemble candidates will be tried out.

\includegraphics[]{images/ensembles/ensemble-candidates}
Figure 2.15 Ensemble candidates

2.4.3 Type

This option allows you to choose between two methods to build your ensemble: Decision Forests and Boosting. (See Figure 2.16.) By selecting Decision Forests you can either build a Bagging or a Random Decision Forest ensemble. By default, Decision Forests take a random subset of instances from the dataset, which creates a Bagging ensemble. You can also add an additional element of randomness by choosing random features at each split (see Randomize and Random Candidates ) so you get a Random Decision Forest ensemble. By choosing Boosted Trees, BigML builds gradient-boosted trees. Read a technical description of Decision Forests and Boosted Trees in subsection 2.2.1 .

There is no easy answer to the question of which method yields the best results; usually you will have to try and test all options. However, depending on your dataset’s characteristics, you can sometimes have an initial idea of which may perform better.

In Boosted Trees, the effect of additional trees is essentially an expansion of the hypothesis space in a way that it is not for Decision Forests. So if you expect the decision function to be very complex and you have a lot of data, boosting may work better than Decision Forests.

On the other hand, if you have a noisy domain, where Overfitting is a concern, Decision Forests may be a better option than Boosted Trees, since boosting is more vulnerable to label noise.

Finally, if you have what you suspect is an "easily learnable" function, the additional power offered by Boosted Trees, or even Random Decision Forests, may not help; a simpler method, like Bagging or even single models, may perform better than the other two options.

\includegraphics[]{images/ensembles/ensemble-type}
Figure 2.16 Ensemble type: Decision Forests or Boosted Trees

2.4.4 Number of Models

This option allows you to configure the number of decision trees to create your ensembles when you select Decision Forest as your ensemble type (see subsection 2.4.3 ). By default, the number of models is set to 10 and the maximum allowed is 1,000 trees. (See Figure 2.17 .) For Boosted Trees the number of models will be determined by the number of iterations (see subsection 2.4.5 ) with a maximum of 2,000 trees per ensemble.

Generally, increasing the number of trees will yield better results. Furthermore, there is no downside except higher computational time. The situations where more models are likely to provide the most improvements are those where the dataset is not very large (e.g., in the thousands of instances or less), the data is very noisy, and (in the case of Random Decision Forests) when there are many correlated features that are all somewhat useful.

Take into account that each additional model tends to deliver less marginal improvement, so if the difference between nine and ten models is very small, it is very unlikely that an eleventh model will make a big difference.
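As a reference, the sketch below shows how a Decision Forest with a fixed number of trees might be created programmatically using the BigML Python bindings. The dataset ID is a placeholder, and the argument name should be double-checked against the ensemble API documentation.

    # Minimal sketch, assuming the BigML Python bindings (pip install bigml).
    # The dataset ID is a placeholder; confirm the argument name in the API docs.
    from bigml.api import BigML

    api = BigML()  # reads BIGML_USERNAME and BIGML_API_KEY from the environment
    forest = api.create_ensemble("dataset/<dataset-id>", {
        "number_of_models": 50  # Decision Forests only; the Dashboard allows up to 1,000
    })
    api.ok(forest)  # wait until the ensemble has finished building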

\includegraphics[]{images/ensembles/number-models}
Figure 2.17 Number of models

2.4.5 Number of Iterations

This parameter sets the maximum number of iterations to be performed when Boosted Trees is selected. For Regression ensembles, one boosted tree will be generated for every iteration. For Classification ensembles, however, “N” trees will be generated for every iteration where “N” is the number of classes in the objective field.

By default, the number of iterations is 10 and the maximum allowed is 1,000, with a limit of 2,000 single models built. If you set 1,000 iterations using a dataset with more than two classes for the objective field, the ensemble will stop as soon as it reaches the maximum of 2,000 trees.
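To make the interplay between iterations, classes, and the 2,000-tree cap concrete, here is a small arithmetic sketch in plain Python (it ignores early stopping, which is described below):

    # Boosted Trees: regression builds one tree per iteration, classification builds
    # one tree per class per iteration, with a cap of 2,000 trees per ensemble.
    MAX_TREES = 2000

    def effective_iterations(requested_iterations, n_classes=1):
        """Return (iterations actually run, trees built), ignoring early stopping."""
        trees_per_iteration = max(n_classes, 1)  # use n_classes=1 for regression
        iterations = min(requested_iterations, MAX_TREES // trees_per_iteration)
        return iterations, iterations * trees_per_iteration

    print(effective_iterations(1000, n_classes=1))  # regression -> (1000, 1000)
    print(effective_iterations(1000, n_classes=3))  # 3 classes -> (666, 1998)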

Note: when one of the early stopping options is enabled (see Early stopping ), the final number of iterations may be lower than the number of iterations configured.

\includegraphics[]{images/ensembles/number-iterations}
Figure 2.18 Number of iterations

2.4.6 Trees

Since ensembles are composed of several decision trees, some of the parameters that can be configured for BigML models, such as Missing splits and the Node threshold, can also be applied to ensembles. You can also use the Randomize parameter to select a random subset of your fields at each split and create a Random Decision Forest. You can find a detailed explanation of these parameters in the subsections below.

Pruning

If pruning is enabled, BigML determines whether each tree split increases the confidence (for classification ensembles) or decreases the expected error (for regression ensembles). If it does not, then the split is pruned away. As explained in subsection 1.2.4 , pruning strategies are key to avoiding Overfitting, a phenomenon that reduces an ensemble’s ability to generalize. Statistical pruning is only available for Decision Forests (see subsection 2.4.3 ).

In BigML you can choose three different strategies for pruning (Figure 2.19 ):

  • Smart Pruning: considers pruning the nodes with less than 1% of the instances.

  • Statistical Pruning: considers every node for pruning.

  • No Statistical Pruning: deactivates pruning altogether.

By default, BigML uses Smart Pruning to create your decision forests.

Missing Splits

When training an ensemble, BigML may encounter missing values, which can be either considered or ignored for the definition of splitting rules.

To include missing splits in your ensemble, enable the missing splits option. (See Figure 2.19 .) If missing values are included in your ensemble, you may find rules with predicates of the following kind: field x = "is missing" or field x = "y or is missing".

BigML includes missing values following the MIA approach [ 40 ] .

By default, BigML does not include missing splits.

Node Threshold

Set the Node threshold to limit the growth of each single tree within the ensemble. (See Figure 2.19.) A lower threshold simplifies the ensemble while helping to avoid Overfitting. However, it may also reduce the ensemble’s predictive power compared to deeper ensembles. The ideal number of nodes may depend on the dataset size and the number of features; larger datasets with many important features may require more complex ensembles. Reducing the number of nodes can also be useful to get an initial understanding of the basic data patterns. Then you can start growing the ensemble from there.

By default, BigML sets a 512 node threshold. Since nodes are computed in batches, on occasion the final number of nodes can be greater than the node threshold. (See subsection 1.2.2 .)
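If you configure these tree parameters through the API instead of the Dashboard, they map to simple arguments on the ensemble resource. The snippet below is only illustrative: the argument names are assumptions to verify in the ensemble API documentation.

    # Illustrative only: tree options as they might be passed when creating an
    # ensemble through the API (e.g., as the configuration dictionary).
    # The argument names are assumptions; verify them in the ensemble API docs.
    tree_options = {
        "node_threshold": 256,   # limit each tree's growth (default 512)
        "missing_splits": True   # allow "is missing" predicates in the splitting rules
    }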

Randomize and Random Candidates

The randomize option adds another layer of randomization and works both for Decision Forests and for Boosted Trees. If randomize is enabled, a random subset of the input fields will be selected at each tree split. If you selected Decision Forests as the ensemble type (see subsection 2.4.3 ), this will create a Random Decision Forest.

When you click the randomize option, the random candidates option will be enabled so you can configure the number of input fields in the random subset to be considered at each split. BigML provides three options (Figure 2.19 ):

  • Default: the number of fields is the square root of the total number of input fields. This is a basic rule which works pretty well in most cases.

  • Number of fields: this sets a fixed number of the fields to be considered at each split.

  • Ratio of fields: sets the number of fields to be considered at each split as a percentage of the total number of input fields.

Note: for text fields each term (e.g., each word if space is the chosen separator) counts as an individual field.
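As an illustration of these three rules, here is a small plain-Python helper; it is a sketch of the options described above, not BigML’s internal code (in particular, the rounding used for the default and ratio modes is an assumption):

    import math

    def random_candidates(total_input_fields, mode="default", value=None):
        """Number of input fields considered at each split when randomize is enabled.

        mode: "default" -> square root of the number of input fields,
              "number"  -> a fixed number of fields (value),
              "ratio"   -> a percentage of the input fields (value between 0 and 1).
        """
        if mode == "default":
            return max(1, round(math.sqrt(total_input_fields)))
        if mode == "number":
            return min(value, total_input_fields)
        if mode == "ratio":
            return max(1, round(value * total_input_fields))
        raise ValueError("unknown mode: %s" % mode)

    print(random_candidates(100))                # default -> 10
    print(random_candidates(100, "number", 25))  # fixed number -> 25
    print(random_candidates(100, "ratio", 0.2))  # ratio -> 20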

\includegraphics[]{images/ensembles/trees-config-panel}
Figure 2.19 Trees parameters configuration

2.4.7 Boosting parameters

When you select Boosted Trees as the ensemble type (see subsection 2.4.3 ), the parameters in the Boosting tab will be enabled: early stopping and learning rate. See the following subsections for a detailed explanation.

Early stopping

The early stopping options try to find the optimal number of iterations by testing the single models after every iteration and stopping early if no significant improvement is made. Consequently, the total number of iterations for the Boosted Trees may be lower than the one set in the Number of iterations parameter (see subsection 2.4.5 ).

You can select one of these three options:

  • Early out of bag: this option tries to find out the optimal number of iterations by recursively building single trees and testing the out-of-bag samples after every iteration. This option will use the parameters set in the ensemble sample to build the single trees (see subsection 2.4.9 ). If no significant improvement is made, it may result in an early stop. By default this option is enabled. (See Figure 2.20 .)

  • Early holdout: this option tries to find out the optimal number of iterations by recursively building single trees and holding out a portion of the dataset for testing at the end of every iteration. If no significant improvement is made on the holdout, it may result in an early stop. The percentage of the dataset held out is set to 30% by default, but you can configure it. By default, this option is disabled.

  • None: this option deactivates the early stopping parameter so the total iterations will be the same as the ones specified in the Number of iterations parameter (see subsection 2.4.5 ). By default, this option is disabled.

Learning rate

The learning rate, also known as the gradient step, controls how aggressively the boosting algorithm fits the data. You can set values greater than 0 and smaller than 1. Smaller values fit the data less aggressively and help prevent Overfitting; values of 0.1 or lower generally work better, although they may require more iterations. By default it is set to 0.1. (See Figure 2.20.)
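For reference, boosting settings are grouped together when an ensemble is created through the API. The structure and key names below are assumptions based on the options described in this subsection; confirm them in the ensemble API documentation before use.

    # Illustrative boosting configuration; key names are assumptions to verify
    # against the BigML ensemble API documentation.
    boosting_options = {
        "boosting": {
            "iterations": 100,         # maximum number of boosting iterations
            "learning_rate": 0.05,     # gradient step, between 0 and 1 (default 0.1)
            "early_out_of_bag": True   # early stopping based on out-of-bag testing
        }
    }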

\includegraphics[]{images/ensembles/boosting-params}
Figure 2.20 Boosting parameters

2.4.8 Weight Field Options

It is not unusual for a dataset to have an unbalanced Objective Field, where some categories are common and others very rare. For example, in datasets used to predict fraud, usually fraudulent transactions are very scarce compared to regular ones. When this happens, ensembles tend to predict the most frequent values simply because the overall ensemble’s performance metric improves with that approach. However, in cases such as fraud prediction, you may be more interested in predicting rare values rather than successfully predicting frequent ones. In that case, you may want to assign more weight to the scarce instances so they are equivalent to the abundant ones.

BigML provides three different options to assign specific weight to your instances.

Balance Objective

When you set the balance objective weight (see Figure 2.21 ), BigML automatically balances the classes of the objective field by assigning a higher weight to the less frequent classes, with the most frequent class always having a weight of 1. This option is only available for classification ensembles. For example, take the following frequencies for each class:

[False, 2000; True, 50]

By enabling the Balance objective option, BigML will automatically apply the following weights:

[False, 1; True, 40]

In this example, the class “True” is getting forty times more weight as it is forty times less frequent than the most abundant class.
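The rule in this example can be reproduced with a couple of lines of plain Python: each class weight is the count of the most frequent class divided by the count of that class.

    # Reproduce the balance objective weights from the example above.
    class_counts = {"False": 2000, "True": 50}
    most_frequent = max(class_counts.values())

    weights = {label: most_frequent / count for label, count in class_counts.items()}
    print(weights)  # {'False': 1.0, 'True': 40.0}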

Objective Weights

The objective weights option allows you to manually set a specific weight for each class of the objective field. BigML oversamples your weighted instances, replicating them as many times as the weight establishes. If you do not list a class, it is assumed to have a weight of 1. Weights of 0 are also valid. This option is only available for classification ensembles. (See Figure 2.21.)

This option can be combined with the Weight field (see Weight Field ). When combining it with the Weight field, both weights are multiplied. For example, if you assign a weight of 3 to the “True” class and the weight field assigns a weight of 2 to a given instance labeled as “True”, that instance will have a total weight of 6.
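A small sketch of this combination follows. The list-of-pairs format for the objective weights is an assumption to verify in the API documentation; the multiplication simply mirrors the example above.

    # Per-class weights; the [class, weight] pair format is an assumption.
    objective_weights = [["True", 3], ["False", 1]]

    class_weight = dict((c, w) for c, w in objective_weights)["True"]  # 3
    instance_weight = 2   # value of the weight field for a given "True" instance
    print(class_weight * instance_weight)  # total weight: 6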

Weight Field

The Weight Field option allows you to assign individual weights to each instance by choosing a special weight field. (See Figure 2.21.) It can be used for both regression and classification ensembles. The selected field must be numeric and must not contain any missing values. The weight field will be excluded from the input fields when building the ensemble. You can select an existing field in your dataset or you may create a new one in order to assign customized weights.

For example, below is a dataset (Table 2.1) for which we included a field called “Weight” that assigns a weight ten times higher to fraudulent transactions than to non-fraudulent ones. BigML provides a powerful tool, the BigML Flatline editor, to add new fields to your dataset, such as a weight field. As an additional example, we could also take into account the transaction “Amount” to calculate the weights, so transactions with higher amounts will have higher weights.

Trans. ID    Products   Online   Amount $   Fraud   Weight
xxxxxx098    XYZGH      yes      3,218      FALSE   1
xxxxxx345    VBHGF      no       1,200      FALSE   1
xxxxxx123    UYFHJ      yes      5,000      FALSE   1
xxxxxx567    HSNKI      no       390        FALSE   1
xxxxxx789    SHSYA      yes      500        TRUE    10
xxxxxx093    DFSTU      yes      423        FALSE   1
xxxxxx012    TYISJ      yes      60,000     FALSE   1
xxxxxx342    SJSOP      no       789        FALSE   1
xxxxxx908    IOPKJ      no       9,450      FALSE   1
xxxxxx334    HIOPN      yes      50,678     TRUE    10

Table 2.1 Weight Field example for transactional dataset
\includegraphics[]{images/ensembles/weighting-params}
Figure 2.21 Weighting parameters
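If you prefer to prepare the weight column outside BigML before uploading the data, a sketch like the one below (using pandas) produces the same “Weight” values as Table 2.1. The document itself uses the Flatline editor for this purpose, so this is only an illustrative alternative; the file name is a placeholder and the column names follow Table 2.1.

    # Illustrative preprocessing with pandas; "transactions.csv" is a placeholder
    # and the column names follow Table 2.1.
    import pandas as pd

    df = pd.read_csv("transactions.csv")

    # Give fraudulent transactions ten times the weight of non-fraudulent ones.
    is_fraud = df["Fraud"].astype(str).str.upper() == "TRUE"
    df["Weight"] = is_fraud.map({True: 10, False: 1})

    # Variation mentioned in the text: let higher amounts produce higher weights.
    # df["Weight"] = df["Weight"] * df["Amount $"] / df["Amount $"].mean()

    df.to_csv("transactions_weighted.csv", index=False)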

2.4.9 Trees sampling

The trees sampling is a different concept from the dataset sampling (see subsection 2.4.10 ). The dataset sampling is the one present in other BigML resources, and it applies sampling just once to the input dataset before building the resource. In the case of the trees sampling, sampling is applied to the dataset as many times as the number of models that compose the ensemble. This way, a separate sample is created for each tree composing the final ensemble.

The default is to sample with a rate of 100% and with replacement, meaning that the same instance can be selected more than once. This ensures a different sampling for each tree. The out of bag option is not available for trees sampling.
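Conceptually, trees sampling draws a fresh sample for each tree rather than sampling the dataset once up front. The plain-Python sketch below illustrates the idea (it is not BigML’s implementation): with a 100% rate and replacement, each tree still sees a different multiset of rows.

    import random

    def per_tree_samples(n_instances, n_trees, rate=1.0, replacement=True, seed=None):
        """Draw one sample of row indices per tree (conceptual sketch of trees sampling)."""
        rng = random.Random(seed)
        rows = list(range(n_instances))
        sample_size = int(n_instances * rate)
        samples = []
        for _ in range(n_trees):
            if replacement:
                samples.append([rng.choice(rows) for _ in range(sample_size)])
            else:
                samples.append(rng.sample(rows, sample_size))
        return samples

    for sample in per_tree_samples(n_instances=10, n_trees=3, seed=42):
        print(sorted(sample))  # a different bootstrap sample for each tree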

How to Get Repeatable Ensembles

If you create two ensembles using the same dataset and the same configuration, you will see that the individual trees composing each ensemble may be different. This is due to the sampling “randomization” that the BigML algorithm uses. Although this does not usually affect the predictive performance of your ensembles, there may be situations in which it is desirable to produce exactly the same ensembles. You can achieve this by setting the Sampling option to deterministic (see Sampling ).

See below an explanation of each sampling option to build the single trees.

Rate

The sampling rate is the percentage of instances being extracted from the dataset and included in your per-tree sample. A sampling rate of 100% means that all instances are included; a rate of 10% means that only every tenth instance is included in each single model. This option may take any value between 0% and 100%. You can easily configure the rate by moving the slider in the configuration panel for sampling, or by typing the percentage in the tiny input box, both highlighted in Figure 2.22 .

By default, BigML uses a 100% rate combined with replacement (see Replacement ).

Sampling

The sampling option represents the type of the sampling process, which can be either random or deterministic. (See Figure 2.22 .)

When using deterministic sampling the random-number generator will always use the same seed, producing repeatable results. (See subsection 2.4.9 .)

By default, BigML uses random sampling.

Replacement

The replacement option controls whether a single instance can be selected multiple times or not. Sampling without replacement ensures that each instance cannot be selected more than once. (See Figure 2.22 .)

By default, BigML generates samples with replacement.

\includegraphics[]{images/ensembles/trees-sampling}
Figure 2.22 Trees sampling for ensembles

2.4.10 Dataset sampling

Sometimes you do not need all the instances contained in your dataset to build your ensemble. If you have a very large dataset, Sampling may be a good way of getting faster results. (See Figure 2.23.)

The same sampling options described in the Datasets with the BigML Dashboard document [ 23 ] to sample datasets are also available when building BigML ensembles. They are divided into two groups: sampling and advanced sampling options. (See Figure 2.23.)

Rate

The sampling rate is the frequency of instances being extracted from the dataset and included in your sample. A sampling rate of 100% means that all instances are included; a rate of 10% means 10% of the instances are included. This option may take any value between 0% and 100%. You can easily configure the rate by moving the slider in the configuration panel for sampling, or by typing the percentage in the tiny input box, both highlighted in Figure 2.23 .

By default, BigML uses a 100% rate.

Range

The sampling range is the subset of the dataset instances from which to sample, e.g., from instance 5 to instance 1,000. The rate will be applied over the range configured.

By default, all instances are included, i.e., the range is (1, num. rows in dataset).

Sampling

The sampling option represents the type of the sampling process, which can be either random or deterministic.

When using deterministic sampling the random-number generator will always use the same seed, producing repeatable results.

By default, BigML uses random sampling.

Replacement

The replacement option controls whether a single instance can be selected multiple times or not. Sampling without replacement ensures that each instance cannot be selected more than once.

By default, BigML generates samples without replacement.

Out of Bag

The out of bag option allows you to include in your sample only those instances that were not selected in the first place, thus effectively inverting the sampling outcome. It is only selectable when a sample is deterministic and the sample rate is less than 100%. The total percentage of instances included in your sample will be one minus the rate (when replacement is not allowed). This can be useful for splitting a dataset into training and testing subsets.

By default, BigML will not use out of bag instances.
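As a quick illustration of the complement relationship: with deterministic sampling at an 80% rate, the out of bag option selects exactly the remaining 20% of rows, which is what makes it handy for train/test splits. The sketch below mimics that behavior with a fixed seed (plain Python, not BigML’s sampler):

    import random

    def deterministic_split(n_instances, rate, seed=123):
        """Split row indices into an in-sample part and its out-of-bag complement."""
        rng = random.Random(seed)  # fixed seed -> deterministic, repeatable sampling
        rows = list(range(n_instances))
        in_sample = set(rng.sample(rows, int(n_instances * rate)))
        out_of_bag = [r for r in rows if r not in in_sample]
        return sorted(in_sample), out_of_bag

    train_rows, test_rows = deterministic_split(10, rate=0.8)
    print(train_rows)  # 80% of the rows
    print(test_rows)   # the remaining 20% (out of bag)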

\includegraphics[]{images/ensembles/dataset-sampling}
Figure 2.23 Dataset sampling arguments for ensembles

2.4.11 Advanced Ordering

Ordering options are relevant to ensure that BigML can correctly determine whether it can take an Early split of your dataset to accelerate the training process. In particular, early splitting can only be safely used if the training instances have been previously shuffled. (See Early Splitting.)

If your instances are already shuffled, BigML allows you to choose the linear option. This will make the process of building the ensemble much faster, since it will not be required to reshuffle the dataset. If you need to shuffle your instances, BigML provides two options to that aim, deterministic shuffling and random shuffling, which are described below.

Ordering options have no influence on datasets of less than 34GB, since the whole dataset is used to build the ensemble.

By default, BigML uses deterministic shuffling to ensure the same (deterministic) sample of the instances is used and the built ensemble is thus repeatable.

Deterministic Shuffling

The deterministic shuffling option ensures that the row shuffling of a dataset is always the same, so that retraining a BigML ensemble from the same dataset yields the same results. (See Figure 2.23 .)

By default, this option is true.

Linear Shuffling

The linear shuffling option is useful when you know that your instances are already in random order. Using linear shuffling, the BigML ensemble will be constructed faster. (See Figure 2.23 .)

By default, this option is false.

Random Shuffling

The random shuffling option will ensure that a different shuffling will be tried each time you train your ensemble. (See Figure 2.23 .)

By default, this option is false.

\includegraphics[]{images/ensembles/ordering-options}
Figure 2.24 Ordering options for ensembles

2.4.12 Creating Ensembles with Configured Options

After finishing the configuration of your options, you can change the default ensemble name in the editable text box. Then you can click on the Create ensemble button to create the new ensemble, or reset the configuration by clicking on the Reset button.

\includegraphics[]{images/ensembles/ensemble-configuration-create-ensemble}
Figure 2.25 Create ensemble after configuration

2.4.13 API Request Preview

The API Request Preview button is at the bottom middle of the configuration panel, next to the Reset button (see Figure 2.25). It shows how to create the ensemble programmatically: the endpoint of the REST API call and the JSON that specifies the arguments configured in the panel. Please see Figure 2.26 below:

\includegraphics[]{images/ensembles/ensemble-configuration-api-preview}
Figure 2.26 Ensemble API request preview

There are options on the upper right to either export the JSON or copy it to the clipboard. At the bottom there is a link to the API documentation for ensembles, in case you need to check any of the possible values or want to learn more about using the API to automate your workflows.

Please note: when an argument is left at its default value in the chosen configuration, it won’t appear in the generated JSON. Since the API applies default values to any arguments missing from the request, there is no need to send them when creating the ensemble.
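For reference, the JSON shown in the preview can be sent directly to the ensembles endpoint. The sketch below uses the Python requests library; credentials and the dataset ID are placeholders, and the exact endpoint and payload should be copied from the preview panel and the linked API documentation.

    # Illustrative sketch: POST the JSON from the API request preview.
    # Credentials and the dataset ID are placeholders.
    import os
    import requests

    auth = "username={0};api_key={1}".format(os.environ["BIGML_USERNAME"],
                                             os.environ["BIGML_API_KEY"])
    payload = {
        "dataset": "dataset/<dataset-id>",  # placeholder
        "number_of_models": 50              # only non-default arguments need to be sent
    }

    response = requests.post("https://bigml.io/ensemble?" + auth, json=payload)
    response.raise_for_status()
    print(response.json()["resource"])      # e.g. "ensemble/..."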