Classification and Regression with the BigML Dashboard

5.4 Deepnet Configuration Options

While the 1-click creation menu option (see section 5.3 ) provides a convenient and easy way to create a BigML deepnet, you can also have more control over the deepnet creation and configure a number of parameters that affect the way BigML creates deepnets. Click the configure deepnet menu option in the configuration menu of your dataset view. (See Figure 5.6 .)

\includegraphics[]{images/deepnet/deepnet-config}
Figure 5.6 Configure deepnet

5.4.1 Objective Field

The objective field, or “target field”, is the field you want to predict. Deepnets support categorical or numeric fields as the Objective Field.

BigML takes the last valid field in your dataset as the objective field by default. If you want to change the objective field, you have two options: you can select another field from the configuration panel to build the deepnet, or you can change it permanently from your dataset view.

  • Select the Objective field from the deepnet configuration panel. This option only affects the deepnet you are currently building. (See Figure 5.7 .)

    \includegraphics[]{images/deepnet/deepnet-objective-field}
    Figure 5.7 Configure the objective field to create the deepnet
  • Change the default objective field for the dataset. This option will save your objective field preference for any model you build. Click on the edit icon that appears next to the field name when you mouse over it; a pop-up window will be displayed. Then click on the Objective field icon and Save it. (See Figure 5.8 .)

    \includegraphics[]{images/deepnet/deepnet-objective-field-dataset}
    Figure 5.8 Change the default objective field

5.4.2 Automatic Parameter Optimization

The high number of configurable parameters for neural networks makes it difficult to find the optimum configuration to get good results. Hand-tuning different configurations is a time-consuming process in which the combinations that lead to a poor result outnumber the ones that result in a satisfying performance. To combat this problem, BigML offers first-class support for automatic parameter optimization via two different methods:

  • Automatic network search: during the deepnet creation, BigML trains and evaluates over many possible network configurations, returning the best networks found for your problem. The final deepnet returned by the search is a “compromise” between the top “n” networks found in the search. The main drawback of this optimization method is that the creation of the deepnet may be significantly slower.

    Note: the search process is not totally deterministic, so although you are using the same dataset you might get slightly different results from run to run.

  • Automatic structure suggestion: BigML offers a faster technique that can also give quality results. BigML has learned some general rules about what makes one network structure better than another for a given dataset. BigML will automatically suggest a structure and a set of parameter values that are likely to perform well for your dataset.

Read more about the automatic parameter optimization in subsection 5.2.2 . You can choose either optimization technique by selecting it in the configuration panel (see Figure 5.9 ).

\includegraphics[]{images/deepnet/auto-options}
Figure 5.9 Select an automatic optimization option

When you select an optimization strategy, the Network architecture parameters (see subsection 5.4.7 ), the Algorithm parameters (see subsection 5.4.8 ) and the Weights (see subsection 5.4.9 ) will be automatically set. You cannot manually tune any of them except the Weights, which you can still configure manually. If you want to configure the rest, you need to deactivate the automatic optimization options using the switcher, as shown in Figure 5.10 .

\includegraphics[]{images/deepnet/auto-options2}
Figure 5.10 Disable automatic optimization options to manually configure the rest of the network parameters

5.4.3 Default Numeric Value

Deepnets can include missing values as valid values for any type of field, as explained in subsection 5.2.4 . However, there can be situations in which you do not want to include them in your model. For those cases, the default numeric value parameter is an easy way to replace missing numeric values with another valid value. You can choose to replace them with the field’s Mean, Median, Maximum, Minimum, or with Zero. (See Figure 5.11 .)

\includegraphics[]{images/deepnet/deepnet-default-numeric}
Figure 5.11 Select a default numeric value to replace missing numeric values

Note: if your dataset does not contain missing values for your numeric fields, this parameter will have no impact on your deepnet. If your dataset contains missing numeric values and you neither select a default numeric value nor enable the missing numerics configuration option, instances with missing numeric values will be ignored when building the model.
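
As an illustration of what this option does (a minimal sketch, not BigML’s internal code), the following Python snippet replaces missing numeric values in a single column with the field’s mean, median, maximum, minimum, or zero:

\begin{verbatim}
from statistics import mean, median

def fill_missing(values, strategy="mean"):
    """Replace None entries with a default computed from the present values."""
    present = [v for v in values if v is not None]
    default = {
        "mean": mean(present),
        "median": median(present),
        "maximum": max(present),
        "minimum": min(present),
        "zero": 0,
    }[strategy]
    return [default if v is None else v for v in values]

print(fill_missing([3, None, 5, 10], strategy="median"))  # [3, 5, 5, 10]
\end{verbatim}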

5.4.4 Missing Numerics

By default, missing values for your numeric fields are included as valid values to build your deepnets. However, there can be cases in which you do not want them to be included in your model. The Missing numerics option allows you to choose whether to include or exclude missing numeric values when building your deepnets. (See Figure 5.12 .)

\includegraphics[]{images/deepnet/deepnet-missing-numerics}
Figure 5.12 Include missing numeric values in your deepnet

Note: missing values are always included for categorical, text, and items fields. If your dataset contains missing numeric values and you neither enable the missing numerics option nor set a default numeric value (see subsection 5.4.3 ), instances containing missing values will be ignored when building the deepnet.

5.4.5 Training Duration

The training duration is a scale parameter that regulates the deepnet runtime. It is set as an integer from 1 to 10 and indicates how much time you are willing to spend on the optimization: the higher the number, the more time you are willing to wait for possibly better deepnet performance; the lower the number, the sooner the deepnet training will finish. The default value is 5.

The training duration is expressed on a relative scale; the actual training time also depends on the dataset size, among other factors. (See Figure 5.13 .)

\includegraphics[]{images/deepnet/deepnet-training-duration}
Figure 5.13 Set the training duration for a deepnet

5.4.6 Maximum Iterations

The number of iterations in a deepnet is the number of gradient steps the algorithm takes during the optimization process. You can set the maximum number of iterations to train your deepnet by activating this option using the switcher (see Figure 5.14 ). By default, this option will be deactivated, in which case BigML will stop training the network if a certain number of iterations goes by without substantial progress or if it reaches the limit of the training duration (see subsection 5.4.5 ).

\includegraphics[]{images/deepnet/deepnet-max-iterations}
Figure 5.14 Set the maximum number of iterations for a deepnet

5.4.7 Network Architecture

The basic architecture specification for a network consists of specifying the number of hidden layers in the network, the number of nodes in each layer, the activation function, and other parameters related to how the network connections are arranged such as residual learning, batch normalization and tree embedding.

Hidden Layers

The hidden layers in a neural network are the intermediate layers between the input layer (the one containing the input field values) and the output layer (the one containing the predictions). (See section 5.2 ). In BigML you can configure up to 32 hidden layers. For each layer you can specify the following parameters:

  • The activation function: by applying an activation function, the deepnet is able to represent a non-linear mapping between the inputs and the outputs, which is necessary to solve complex problems. The activation function converts each node’s input into an output that is fed as input to the next layer in the network, and so on. If no activation function is applied, the output will simply be a linear function of the inputs (the layer equation shown after this list makes this explicit). In BigML you can select one of the following functions: “Tanh”, “Sigmoid”, “Softplus”, “Softmax”, “ReLU” or “None”. If “None” is selected, no activation function will be set for the layer, so the raw output values of each node will be used as inputs for the next layer.

    To learn more about each type of activation function, please refer to this article.

  • The number of nodes: each hidden layer in the network can have a variable number of nodes. Determining the optimal number of hidden nodes (also called hidden units or neurons) per layer is a complex task that depends on:

    • The number of nodes in the input and output layers. The number of nodes in the input layer is equal to the number of fields of the dataset used to create the deepnet. The output layer always has one node if it is a Regression problem, or as many nodes as the objective field has classes if it is a Classification problem.

    • The number of instances in the training dataset.

    • The complexity of the problem that is trying to be solved.

    • The gradient descent algorithm (see Gradient Descent Algorithm ) and the activation function used.

    The higher the number of nodes, the higher the risk of Overfitting. However, too few nodes may lead to a poor solution if the function to be learned has some complexity. In most cases, you will need to try different sizes for the layers or use one of the BigML optimization options (see subsection 5.4.2 ) to reach a number of nodes per layer that provides satisfying results. In BigML you can set up to 8,192 nodes per layer. A rule of thumb to determine the size of the layers is that the number of hidden nodes should be somewhere in between the sizes of the input and the output layers. Also, the size of the hidden layer should not exceed twice the size of the input layer, because at that point it is very likely to overfit. Read this article to learn more about the variables that should be taken into account when deciding the size of the layers.
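
For reference, each hidden layer computes a standard fully connected transformation (shown here in its usual textbook form):

\[
\mathbf{h}^{(k)} = f\!\left( \mathbf{W}^{(k)} \mathbf{h}^{(k-1)} + \mathbf{b}^{(k)} \right),
\]

where $\mathbf{h}^{(k-1)}$ is the output of the previous layer (or the input fields for the first hidden layer), $\mathbf{W}^{(k)}$ and $\mathbf{b}^{(k)}$ are the layer’s weights and biases, and $f$ is the selected activation function. If $f$ is the identity (“None”), the composition of all layers collapses to a single linear transformation of the inputs.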

The weights for each layer are initialized randomly according to Xavier’s method.
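
In its common (Glorot) formulation, Xavier initialization draws each initial weight from a distribution whose spread depends on the layer sizes, for example

\[
W_{ij} \sim \mathcal{U}\!\left(-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}},\; \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right),
\]

where $n_{\text{in}}$ and $n_{\text{out}}$ are the number of nodes feeding into and out of the layer; the exact variant used by BigML may differ.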

If one of the automatic optimization options (see subsection 5.4.2 ) is enabled, the hidden layers will be automatically set. If you disable the automatic options, you can add and remove layers, and select the activation function and the number of nodes for each layer (see Figure 5.15 ). You can add from a minimum of one layer up to a maximum of 32 layers.

\includegraphics[]{images/deepnet/hidden-layers}
Figure 5.15 Configure the hidden layers of the network

Learn Residuals

If learning residuals is enabled, it will cause alternate layers to learn a representation of the residuals for a given layer rather than the layer itself, by introducing shortcut connections. In other words, residual networks tweak the mathematical formula of the typical layer’s equation to include the inputs of a lower layer in a node of a higher layer. Residual learning has proved to be very successful for image recognition as described in this paper.
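
Schematically, a residual block can be written as

\[
\mathbf{y} = F(\mathbf{x}) + \mathbf{x},
\]

where $\mathbf{x}$ is the input carried over by the shortcut connection and $F$ is the transformation learned by the intervening layers, so those layers effectively learn the residual $F(\mathbf{x}) = \mathbf{y} - \mathbf{x}$ rather than $\mathbf{y}$ itself.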

If one of the automatic optimization options (see subsection 5.4.2 ) is enabled, the residual learning will be automatically set. If you disable the automatic options, you can choose to include or exclude it (see Figure 5.16 ).

\includegraphics[]{images/deepnet/learn-residuals}
Figure 5.16 Enable or disable the residuals learning

Batch Normalization

If batch normalization is enabled, the outputs of each layer in the network will be normalized before being passed to the activation function, as described in this paper. This introduces extra parameters in each layer (the mean, variance, and scale of the layer) and will significantly slow down training.
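
Schematically, for each value $x$ in a batch $B$ the normalization is

\[
\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma\,\hat{x} + \beta,
\]

where $\mu_B$ and $\sigma_B^2$ are the batch mean and variance and $\gamma$, $\beta$ are the learned scale and shift; these are the extra per-layer parameters mentioned above.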

If one of the automatic optimization options (see subsection 5.4.2 ) is enabled, the batch normalization will be automatically set. If you disable the automatic options, you can choose to include or exclude it (see Figure 5.17 ).

\includegraphics[]{images/deepnet/batch-normalization}
Figure 5.17 Enable or disable the batch normalization

Tree Embedding

If tree embedding is enabled, the network will learn a tree-based representation of the data as engineered features along with the raw features, essentially by learning trees over slices of the input space and a small amount of the training data. The theory is that these engineered features will linearize obvious non-linear dependencies before training begins, accelerating the learning process.

If one of the automatic optimization options (see subsection 5.4.2 ) is enabled, the tree embedding will be automatically set. If you disable the automatic options, you can choose to include or exclude it (see Figure 5.18 ).

\includegraphics[]{images/deepnet/tree-embedding}
Figure 5.18 Enable or disable the tree embedding

To learn more about residual learning, batch normalization, and tree embedding, read this blog post.

5.4.8 Algorithm

BigML deepnets allow you to select different gradient descent algorithms to optimize the network weights in order to minimize the loss function. These algorithms have some specific parameters, explained in Gradient Descent Algorithm , and also some common ones, such as the learning rate, the dropout rate, and the seed, described in Learning Rate , Dropout Rate , and Seed , respectively.

If one of the automatic optimization options (see subsection 5.4.2 ) is enabled, the algorithm parameters will be automatically set. If you disable the automatic options, you can configure them.

Gradient Descent Algorithm

The most widely used algorithm for training neural networks is gradient descent, an optimization algorithm used to minimize the loss function. Although gradient descent is the most important and popular technique used to train neural networks, it presents some associated problems, such as converging to a sub-optimal local minimum or requiring a properly set learning rate (not too small, to avoid slow convergence, and not too big, to avoid divergence). To improve on plain gradient descent, BigML offers several optimization algorithms (their core update rules are sketched after the list below):

  • Momentum: this method helps accelerate the gradient descent in the right direction while dampening the oscillations until convergence. It therefore usually leads to faster and more stable convergence by reducing unnecessary parameter updates. The configurable parameters for this algorithm include:

    • Momentum: higher values accelerate the gradient descent.

    \includegraphics[]{images/deepnet/momentum}
    Figure 5.19 Momentum algorithm

    However, this method does not solve the problem that the final performance heavily depends on the selected learning rate.

  • Adagrad: an adaptive learning method. Essentially, it adapts the learning rate to each parameter, making big updates for infrequent parameters and small updates for frequent parameters. It solves the problem of selecting a single learning rate since it can take a default rate and then adapt it for each parameter. This method works very well with sparse data. The configurable parameters for this algorithm include:

    • Initial Accumulator Value: this is the initial value for the gradient accumulator.

    \includegraphics[]{images/deepnet/adagrad}
    Figure 5.20 Adagrad algorithm

    The problem with Adagrad is that the learning rate tends to decay to a very small number, so learning stops. RMSProp tries to solve this problem.

  • RMSProp: another adaptive learning method that can be considered an extension of Adagrad and tries to solve the problem of decaying learning rates. The configurable parameters for this algorithm include:

    • Momentum: higher values accelerate the gradient descent.

    • Decay: the rate at which the moving average decays.

    • Epsilon: a parameter to avoid numeric precision problems.

    \includegraphics[]{images/deepnet/rmsprop}
    Figure 5.21 RMSProp algorithm
  • Adam: (Adaptive Moment Estimation) another adaptive method that computes adaptive learning rates for each parameter like Adagrad, and also addresses the decaying learning rate problem like RMSProp. Moreover, it keeps an exponentially decaying average of past gradients like Momentum. Adam usually works well compared to other algorithms, as it converges fast and avoids the problems that other algorithms may have. The configurable parameters for this algorithm include:

    • Beta1: decay rate for the first moment estimate (the mean).

    • Beta2: decay rate for the second moment estimate (the variance).

    • Epsilon: a parameter to avoid numeric precision problems.

    \includegraphics[]{images/deepnet/adam}
    Figure 5.22 Adam algorithm
  • FTRL: it also adapts the learning rate, slowing it down on a per-parameter basis. The configurable parameters for this algorithm include:

    • Regularization: the regularization factor to avoid Overfitting, i.e., tailoring the model to the training data at the expense of generalization. You can choose between L1 or L2 regularization.

    • Strength: the inverse of the regularization strength, so higher values indicate less regularization. It must be a positive integer. Values that are too high will make the algorithm fit the training data boundaries too closely. Values that are too low will result in vague decision boundaries that do not follow the data patterns.

    • Learning rate power: the learning rate power for the FTRL algorithm.

    • Initial Accumulator Value: this is the initial value for the gradient accumulator.

    \includegraphics[]{images/deepnet/ftrl}
    Figure 5.23 FTRL algorithm
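
As a quick reference, the textbook update rules behind these methods (FTRL’s update is more involved and is omitted here) can be written as follows for a parameter $\theta$, gradient $g_t$, and learning rate $\eta$; the exact variants used internally by BigML may differ slightly:

\[
\begin{aligned}
\text{Momentum:}\quad & v_t = \mu\, v_{t-1} + \eta\, g_t, \qquad \theta_{t+1} = \theta_t - v_t\\[2pt]
\text{Adagrad:}\quad & G_t = G_{t-1} + g_t^{2}, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon}\, g_t\\[2pt]
\text{RMSProp:}\quad & E[g^{2}]_t = \rho\, E[g^{2}]_{t-1} + (1-\rho)\, g_t^{2}, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^{2}]_t + \epsilon}}\, g_t\\[2pt]
\text{Adam:}\quad & m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2}, \qquad \theta_{t+1} = \theta_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
\]

where $\mu$ is the momentum, $\rho$ the decay, $G_0$ the initial accumulator value, and $\hat{m}_t = m_t/(1-\beta_1^{t})$, $\hat{v}_t = v_t/(1-\beta_2^{t})$ are the bias-corrected first and second moment estimates (beta1 and beta2 are their decay rates).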

In summary, if your data is sparse, some of the adaptive algorithms may perform better. Adagrad, RMSProp and Adam are quite similar and perform well for similar use cases. However, Adam usually outperforms the rest due to its bias correction.

Regarding the algorithm-specific parameters (momentum, beta1 and beta2, accumulator values, learning rate power, etc.), they all offer similar ways of controlling how much gradient descent remembers previous iterations and uses them to inform the current gradient step. Tuning these parameters has a similar impact: values that are too high will send the search zooming off in the wrong direction, while values that are too low will result in the same problems as vanilla gradient descent (overfitting and getting stuck in local minima). If these parameters are set just right, they improve the speed at which the algorithm converges and help it avoid local minima.

For all these parameters, though, the most important rule is not to hand-tune and iterate them unless you have a specific reason to do it. The best values for them depend on your data, the topology of your network, and the random conditions you start in. Hence, the best option if you are not very experienced with neural networks is to use one of the BigML optimization options (see subsection 5.4.2 ) which will find the best configuration for your network automatically.

To learn more about the optimization algorithms, please refer to this article.

Learning Rate

The learning rate, also known as the gradient step, controls how aggressively the gradient descent algorithm fits the training data. You can set values greater than 0% and smaller than 100%. Larger values will prevent Overfitting, but smaller values generally work better (usually 1% or lower), although the deepnet will usually take longer to train. As a general rule, you want to find a learning rate that is low enough for the network to converge to a satisfying solution, but high enough to reduce the training time as much as possible.
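
In the plain gradient descent step (shown here in its standard form, only to make the role of the learning rate explicit),

\[
\theta_{t+1} = \theta_t - \eta\, \nabla_{\theta} L(\theta_t),
\]

the learning rate $\eta$ scales the gradient of the loss $L$, so doubling $\eta$ doubles the size of every parameter update; this is what makes larger values fit the training data more aggressively.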

If one of the automatic optimization options (see subsection 5.4.2 ) is enabled, the learning rate will be automatically set. If you disable the automatic options, you can select a value for the learning rate (see Figure 5.24 ).

\includegraphics[]{images/deepnet/learning-rate}
Figure 5.24 Configure the learning rate

Dropout Rate

The dropout mechanism consists of randomly dropping nodes (along with their connections) from the network at training time. This prevents nodes from co-adapting, so it is an effective method to control Overfitting. The dropout rate is the proportion of nodes dropped from the network during training.
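
As a sketch of the mechanism, the snippet below uses the common “inverted dropout” formulation (an assumption made for illustration; BigML’s internal implementation may differ): each node’s output is dropped with probability equal to the dropout rate, and the surviving outputs are rescaled so that their expected value is unchanged.

\begin{verbatim}
import random

def dropout(activations, rate, training=True, seed=None):
    """Drop each activation with probability `rate`; rescale the survivors."""
    if not training or rate == 0.0:
        return list(activations)
    rng = random.Random(seed)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

# With rate=0.5, roughly half of the nodes are silenced on each training pass.
print(dropout([0.2, 1.3, -0.7, 0.9], rate=0.5, seed=42))
\end{verbatim}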

If one of the automatic optimization options (see subsection 5.4.2 ) is enabled, the dropout rate will be automatically set. If you disable the automatic options, you can select a value for the dropout rate (see Figure 5.25 ).

\includegraphics[]{images/deepnet/dropout-rate}
Figure 5.25 Configure the dropout rate

Seed

The random seed controls the ordering of the training data, the initial network weights, and the behavior of dropout during training. If the automatic network search option is not enabled, setting the same seed allows you to get repeatable deepnets from the same dataset. (See Figure 5.26 .)

\includegraphics[]{images/deepnet/seed}
Figure 5.26 Set a seed for the deepnet

5.4.9 Weights

It is not unusual for a dataset to have some categories that are common and others that are very rare. For example, in datasets used to predict fraud, fraudulent transactions are usually very scarce compared to regular ones. When this happens, models tend to predict the most frequent values simply because the model’s overall performance metrics improve with that approach. However, in cases such as fraud prediction, you may be more interested in predicting rare values rather than successfully predicting frequent ones. In that case, you may want to assign more weight to the scarce instances so they are equivalent to the abundant ones.

BigML provides three different options to assign specific weights to your instances: balance objective, objective weights, and weight field, explained in the following sections.

If one of the automatic optimization options (see subsection 5.4.2 ) is enabled, the weights will be automatically set. If you disable the automatic options, you can configure the weights of your dataset instances (see Figure 5.27 ).

\includegraphics[]{images/deepnet/deepnet-weights}
Figure 5.27 Weight options for deepnets

Balance Objective

When you set the balance objective weight, BigML automatically balances the classes of the objective field by assigning a higher weight to the less frequent classes, with the most frequent class always having a weight of 1. For example, take the following frequencies for each class:

[False, 2000; True, 50]

By enabling the Balance objective option, BigML will automatically apply the following weights:

[False, 1; True, 40]

In this example, the class “True” is getting forty times more weight as it is forty times less frequent than the most abundant class.
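
The following minimal Python sketch reproduces this weighting rule (it is only an illustration of the computation, not BigML’s internal code): each class receives the weight max_count / class_count, so the most frequent class always keeps a weight of 1.

\begin{verbatim}
# Illustrative computation of balanced class weights.
counts = {"False": 2000, "True": 50}
max_count = max(counts.values())
weights = {label: max_count / count for label, count in counts.items()}
print(weights)  # {'False': 1.0, 'True': 40.0}
\end{verbatim}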

Objective Weights

The objective weights option allows you to manually set a specific weight for each class of the objective field. BigML oversamples your weighted instances, replicating them as many times as the weight establishes. If you do not list a class, it is assumed to have a weight of 1. Weights of 0 are also valid, but if all classes have a weight of 0, the deepnet creation will produce an error.

This option can be combined with the Weight field (see Weight field ).

Weight field

The Weight Field option allows you to assign individual weights to each instance by choosing a special weight field. It can be used for both regression and classification deepnets. The selected field must be numeric and it must not contain any missing values. The weight field will be excluded from the input fields when building the deepnet. You can select an existing field in your dataset, or you may create a new one in order to assign customized weights.

For deepnets, the weight field modifies the loss function to include the instance weight. The outcome is similar to the oversampling technique.
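
Conceptually (this is only an illustrative formulation; the exact loss used internally may differ), a per-instance weight $w_i$ scales that instance’s contribution to the training loss, for example as a weighted average of the per-instance losses $\ell(y_i, \hat{y}_i)$:

\[
L_w = \frac{\sum_i w_i\, \ell(y_i, \hat{y}_i)}{\sum_i w_i},
\]

which has an effect comparable to oversampling each instance in proportion to its weight.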

5.4.10 Sampling Options

Sometimes you do not need all the data contained in your dataset to build your deepnet. If you have a very large dataset, sampling may be a good way of getting faster results. BigML allows you to sample your dataset before creating the deepnet, so you do not need to create a separate dataset first. You can find a detailed explanation of the sampling parameters available in the following subsections. (See Figure 5.28 .)

Rate

The rate is the proportion of instances to include in your sample. Set any value between 0% and 100%. Defaults to 100%.

Range

Specifies a subset of instances from which to sample, e.g., choose from instance 1 until 200. The rate you set will be computed over the configured range. This option may be useful when you have temporal data and you want to train your deepnet with historical data and test it with the most recent data, to check whether it can make predictions over time.

Sampling

By default, BigML selects your instances for the sample by using a random number generator, which means two samples from the same dataset will likely be different even when using the same rates and row ranges. If you choose deterministic sampling, the random-number generator will always use the same seed, thus producing repeatable results. This lets you work with identical samples from the same dataset.

Replacement

Sampling with replacement allows a single instance to be selected multiple times. Sampling without replacement ensures that each instance cannot be selected more than once. By default, BigML generates samples without replacement.

Out of bag

This option creates a sample containing only out-of-bag instances for the currently defined rate, so the final proportion of instances in your sample will be one minus the configured rate (when replacement is false). This can be useful for splitting a dataset into training and testing subsets. It is only selectable when the sample rate is lower than 100%.

\includegraphics[]{images/deepnet/deepnet-sampling}
Figure 5.28 Sampling parameters for deepnet
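
To make the interplay of these options concrete, here is a small conceptual sketch (not BigML’s internal sampler) of how a deterministic sample and its out-of-bag complement can serve as a training/testing split:

\begin{verbatim}
import random

def sample_rows(n_rows, rate, seed=None, out_of_bag=False):
    """Pick each row with probability `rate`; a fixed seed makes it repeatable."""
    rng = random.Random(seed)
    picked = {i for i in range(n_rows) if rng.random() < rate}
    if out_of_bag:
        # out-of-bag = the complement of the sampled rows
        return sorted(set(range(n_rows)) - picked)
    return sorted(picked)

train = sample_rows(10, rate=0.8, seed=1)
test = sample_rows(10, rate=0.8, seed=1, out_of_bag=True)  # the remaining rows
print(train, test)
\end{verbatim}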

5.4.11 Creating Deepnets with Configured Options

After finishing the configuration of your options, you can change the default deepnet name in the editable text box. Then you can click on the Create deepnet button to create the new deepnet, or reset the configuration by clicking on the Reset button.

\includegraphics[]{images/deepnet/deepnet-configuration-create-deepnet}
Figure 5.29 Create deepnet after configuration

5.4.12 API Request Preview

The API Request Preview button is located at the bottom center of the configuration panel, next to the Reset button (see Figure 5.29 ). It shows how to create the deepnet programmatically: the endpoint of the REST API call and the JSON that specifies the arguments configured in the panel. (See Figure 5.30 below.)

\includegraphics[]{images/deepnet/deepnet-configuration-api-preview}
Figure 5.30 Deepnet API request preview

There are options in the upper right to either export the JSON or copy it to the clipboard. At the bottom there is a link to the API documentation for deepnets, in case you need to check any of the possible values or want to learn more about using the API to automate your workflows.

Please note: when an argument is left at its default value in the chosen configuration, it won’t appear in the generated JSON. Since default values are applied to any missing arguments during API calls, there is no need to send them in the creation request.
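
As a rough illustration of such a call (a sketch only: copy the exact endpoint and JSON from the preview panel, since the credentials, identifiers, and extra arguments below are placeholders), creating the deepnet programmatically amounts to posting that JSON to the deepnet endpoint:

\begin{verbatim}
import requests

# Placeholder credentials and dataset id -- substitute your own values and
# the JSON generated by the API request preview panel.
BIGML_AUTH = "username=my_user;api_key=my_api_key"
url = "https://bigml.io/deepnet?" + BIGML_AUTH

payload = {
    "dataset": "dataset/000000000000000000000000",  # hypothetical dataset id
    "name": "my configured deepnet",
    # only the arguments that differ from their defaults need to be sent
}

response = requests.post(url, json=payload)
print(response.status_code, response.json().get("resource"))
\end{verbatim}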