Classification and Regression with the BigML Dashboard

5.2 Understanding Deepnets

A deepnet in BigML is a Supervised learning method to solve Classification and Regression problems. Deepnets are an optimized version of Deep Neural Networks, a class of machine learning models inspired by the neural circuitry of the human brain. In these classifiers, the input features are fed to one or several groups of “nodes”. Each group of nodes is called a “layer”. Each node is essentially a function on the input that transforms the input features into another value or collection of values (see also Hidden Layers). The entire layer thus transforms an input vector into a new “intermediate” feature vector, which is fed as input to the next layer of nodes. This process continues layer by layer until we reach the final “output”, which is also a layer of nodes. The output is the network’s prediction: an array of per-class probabilities for classification problems or a single real value for regression problems.
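
To make the layer-by-layer picture concrete, the sketch below pushes a single input vector through two hidden layers and a softmax output using plain NumPy. The layer sizes, random weights, and ReLU activation are illustrative assumptions, not the architecture BigML actually trains.

    # Minimal sketch of a feedforward pass through a small network:
    # two hidden layers and a softmax output for a 3-class problem.
    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    rng = np.random.default_rng(0)

    x = rng.normal(size=4)                           # one input row with 4 features

    # Each layer is a function: new_vector = activation(W @ previous_vector + b)
    W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)    # first hidden layer (8 nodes)
    W2, b2 = rng.normal(size=(8, 8)), np.zeros(8)    # second hidden layer (8 nodes)
    W3, b3 = rng.normal(size=(3, 8)), np.zeros(3)    # output layer (3 classes)

    h1 = relu(W1 @ x + b1)                           # intermediate feature vector 1
    h2 = relu(W2 @ h1 + b2)                          # intermediate feature vector 2
    probabilities = softmax(W3 @ h2 + b3)            # per-class probabilities

    print(probabilities)                             # sums to 1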

The “deep” in deep neural networks refers to the presence of more than one “hidden layer”; that is, more than one layer of nodes between the input and the output layers. The network architectures supported by BigML can be deep or shallow. The advantage of training deep architectures is that the hidden layers have the opportunity to learn “higher-level” representations of the data that can be used to make correct predictions in cases where a direct mapping between input and output is difficult. For example, when classifying images of numeric digits, the input layer is raw pixels, the output layer is the probability for each digit, and the intermediate layers may learn features that represent the presence of, say, a loop or a vertical stroke. Read subsection 5.2.3 below for an explanation of the use cases where deepnets perform better.

5.2.1 Convolutional Neural Network

When the dataset used to create a deepnet contains images, the deepnet created will be a convolutional neural network.

A convolutional neural network, also known as a CNN or ConvNet, is a type of deep neural network. All deepnet operations described in this section, such as 1-click creation and configuration, also apply to CNNs, as does optimization, including automatic network search and structure suggestion.

The main difference between CNN and other types of neural networks is that the hidden layers of a CNN include at least one convolutional layer.

A convolutional layer performs one or more convolution operations. Each convolution operation transforms a set of neighboring inputs into an output, which is passed to the next layer. In the case of an image, the input layer consists of two-dimensional pixels. A convolution operation converts a block of pixels (say 3x3) into a single number. It can be imagined as a filter that slides across the whole image, converting each 3x3 block of pixels into a number. The output of a convolutional layer can therefore be thought of as another image, possibly of decreased size, whose pixels hold information from multiple pixels (here, 9) of the image in the previous layer. The size of the pixel block (the filter size) can vary (e.g., 3x3, 5x5), and one convolutional layer may perform multiple convolution operations (the number of filters).
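
The following sketch shows one such convolution operation in NumPy: a single 3x3 filter slides across a small grayscale image and turns each 3x3 block of pixels into one number. The filter values here are hand-picked for illustration; in a trained CNN they are learned from the data.

    # Minimal sketch of a single convolution operation (no padding, stride 1).
    import numpy as np

    def convolve2d(image, kernel):
        kh, kw = kernel.shape
        out_h = image.shape[0] - kh + 1
        out_w = image.shape[1] - kw + 1
        output = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                block = image[i:i + kh, j:j + kw]      # a 3x3 block of pixels
                output[i, j] = np.sum(block * kernel)  # one number per block
        return output

    image = np.random.default_rng(0).random((8, 8))    # an 8x8 "image"
    kernel = np.array([[-1, 0, 1],                     # a filter that responds
                       [-1, 0, 1],                     # to vertical edges
                       [-1, 0, 1]])

    feature_map = convolve2d(image, kernel)
    print(feature_map.shape)                           # (6, 6): a smaller "image"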

There are other important layers in CNNs, such as ReLU and pooling layers. Once a CNN is fully trained, each convolutional layer, coupled with its ReLU and pooling layers, effectively captures image features. The first layers extract low-level features such as edges and colors, while the deeper layers capture high-level features unique to the objects in the images. The outputs of convolutional layers are also called feature maps.
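
As a rough illustration of those companion layers, the sketch below applies a ReLU followed by 2x2 max pooling to a feature map. The 2x2 window is a common but purely illustrative choice, not a BigML default.

    # Minimal sketch of ReLU activation and max pooling on a feature map.
    import numpy as np

    def relu(feature_map):
        # Keep positive responses, zero out the rest.
        return np.maximum(0.0, feature_map)

    def max_pool(feature_map, size=2):
        # Keep only the strongest response in each non-overlapping size x size
        # block, shrinking the feature map.
        h, w = feature_map.shape
        h, w = h - h % size, w - w % size
        trimmed = feature_map[:h, :w]
        return trimmed.reshape(h // size, size, w // size, size).max(axis=(1, 3))

    feature_map = np.random.default_rng(1).normal(size=(6, 6))
    pooled = max_pool(relu(feature_map))
    print(pooled.shape)                                # (3, 3)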

BigML can extract image features at the source level. For a composite source with images, different sets of image features can be extracted, including features based on edges, colors, and texture. There are also pre-trained CNNs that capture more sophisticated features. Image features can be used to train supervised models as well as unsupervised models. If you know which specific image features will help you achieve your machine learning goals, you don’t have to use CNNs; instead, you can configure your sources to extract those image features. For more information about image features, please refer to the Image Analysis section of Sources with the BigML Dashboard [22].
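
If you prefer the API to the Dashboard, the hedged sketch below shows how source-level image feature extraction might be requested with the BigML Python bindings. The option names (“image_analysis”, “extracted_features”) and the feature identifiers are assumptions to be checked against the Sources documentation and the current API reference; the file name is a placeholder.

    # Hedged sketch: enabling image feature extraction on a composite source.
    from bigml.api import BigML

    api = BigML()  # reads BIGML_USERNAME and BIGML_API_KEY from the environment

    source = api.create_source("images.zip", {      # placeholder file name
        "image_analysis": {
            "enabled": True,
            # Hypothetical identifiers for edge- and color-based features:
            "extracted_features": ["histogram_of_gradients", "level_histogram"]
        }
    })
    api.ok(source)  # wait until the source (and its extracted features) is ready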

Note: When training a deepnet from a dataset containing images, that is, when training a CNN, all image feature fields extracted from the images will be ignored. In other words, when a dataset contains images, its deepnet is trained from raw image pixels, not from its extracted image features.

Because of convolution operations, which transform neighboring inputs such as 3x3 blocks, CNNs excel at machine learning on spatial data, especially images. However, CNNs may not work as well for non-spatial data such as tabular data. Think of it this way: if many rows of a tabular dataset are swapped, the data is still considered the same; but if the same is done to the pixels of an image, it becomes a different image.

For a comprehensive introduction to CNNs, please refer to the Wikipedia entry.

5.2.2 Automatic Parameter Optimization

Deep neural networks are notoriously sensitive to the chosen topology (or network structure) and to the algorithm used to learn the weights for that topology. This sensitivity means that hand-tuning the topology and optimization algorithm can be difficult and time-consuming, as the choices that lead to poor networks typically vastly outnumber those that lead to good ones.

To combat this problem, BigML offers first-class support for automatic parameter optimization that allows for automated discovery of better networks via two different methods:

  • Automatic network search: during the deepnet creation, BigML trains and evaluates many possible network configurations, returning the best networks found for your problem. The final deepnet returned by the search is a “compromise” between the top “n” networks found in the search. The algorithm BigML uses for this optimization is a variant of the hyperband algorithm. Instead of selecting parameter value candidates for evaluation at random, however, BigML uses an acquisition technique based on Bayesian parameter optimization (a simplified sketch of this kind of budget-based search appears after this list). The main downside of using this optimization method is that the creation of the deepnet may be significantly slower.

    Note: the search process is not totally deterministic, so even when using the same dataset you might get slightly different results from run to run. This is because BigML trains multiple models concurrently and the order in which they finish is important. After each model finishes, the search modifies its behavior based on the performance of the one that just finished (i.e., the next trained model in the search depends on the previous ones). Although results may not be repeatable, the differences should be nearly imperceptible in most cases.

  • Automatic structure suggestion: BigML offers a faster technique that can also give quality results. The ability to quickly train and test your deepnets is especially useful when working on feature engineering. BigML has trained thousands of networks on dozens of datasets in order to understand the effectiveness of various network topologies. As such, BigML has learned some general rules about what makes one network structure better than another for a given dataset. BigML will automatically suggest a structure and set of parameter values that are likely to perform well for your dataset.
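
The sketch below conveys the flavor of the budget-based search mentioned above with a much-simplified successive-halving loop. It is not BigML’s actual algorithm: the real search trains and evaluates networks, and uses a Bayesian acquisition technique rather than purely random candidates.

    # Hyperband-inspired successive halving, reduced to a toy: sample random
    # "network configurations", keep the better half each round, and give the
    # survivors a larger training budget.
    import random

    def sample_candidate():
        return {"layers": random.randint(1, 6),
                "nodes": random.choice([16, 32, 64, 128])}

    def evaluate(candidate, budget):
        # Stand-in for "train this network with `budget` resources and score it";
        # here just a deterministic pseudo-score between 0 and 1.
        return (hash((candidate["layers"], candidate["nodes"], budget)) % 1000) / 1000.0

    def successive_halving(n_candidates=16, budget=1, rounds=4):
        candidates = [sample_candidate() for _ in range(n_candidates)]
        for _ in range(rounds):
            ranked = sorted(candidates, key=lambda c: evaluate(c, budget), reverse=True)
            candidates = ranked[: max(1, len(ranked) // 2)]  # keep the better half
            budget *= 2                                      # train survivors longer
        return candidates[0]

    print(successive_halving())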

To learn more about these optimization techniques, read this blog post.

You can choose either optimization technique by selecting it in the configuration panel (see subsection 5.4.2). Alternatively, you can manually set the parameters for your deepnet. By default, BigML uses the automatic structure suggestion strategy to create deepnets.
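
For reference, the hedged sketch below shows how either technique might be selected through the BigML Python bindings instead of the configuration panel. The argument names (“suggest_structure”, “search”) mirror the Dashboard options but should be checked against the API documentation, and the dataset ID is a placeholder.

    # Hedged sketch: choosing the optimization strategy via the API.
    from bigml.api import BigML

    api = BigML()
    dataset = "dataset/61f2a3b4c5d6e7f801234567"      # placeholder dataset ID

    # Automatic structure suggestion (the default, faster strategy):
    deepnet_suggested = api.create_deepnet(dataset, {"suggest_structure": True})

    # Automatic network search (slower, explores many configurations):
    deepnet_searched = api.create_deepnet(dataset, {"search": True})

    api.ok(deepnet_searched)                          # wait for training to finish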

5.2.3 Deepnets Use Cases

A common question when solving classification and regression problems is which algorithm should be used to get the best results: models and ensembles? Logistic regressions? Deepnets? In most cases, there is no effective way to know in advance which method will perform better, so the best strategy is to train and evaluate each of them and compare their performance. However, a general rule for deepnets is that they usually perform better with complex datasets and difficult problems: high-dimensional datasets where either only a few features are non-noise, or where the decision function is spread across many different features.

On one hand, decision trees and ensembles have spectacular representational power when the dataset has a high number of variables, because their hypothesis space grows with the data and the decision tree algorithm is able to efficiently search through that space to find a good solution in reasonable time. However, trees have trouble representing objectives that are smooth functions of many variables.

On the other hand, logistic regression optimizes a function that takes into account all of the variables at once, not just one or two at a time, and it is able to optimize this function efficiently. However, its representational power is considerably lower: it can only represent linear decision boundaries.

Deepnets try to get the best of both methods. Because their structure is very flexible, they have high representational power, and because they’re optimized via gradient descent (like logistic regression), they do fine with smooth functions of potentially all of the input variables.

The main downside of deepnets is that they lack the efficiency that both decision trees and logistic regression provide. The structure of deepnets is extremely flexible, but there is no way to search through the possible structures and parameters as quickly as can be done for trees or logistic regression. The only way to find an optimal structure is to try a lot of them. BigML tries to make this search as clever as possible (see subsection 5.2.2), but it is still significantly more time-consuming than training trees or logistic regression, with no guarantee that the result will beat them.

5.2.4 Missing Values

BigML deepnets can handle missing values for any type of field. For categorical, text, and items fields, missing values are always included by default.

For numeric fields, missing values are also included by default, but you can deactivate this option by configuring your deepnet (see subsection 5.4.4). If the missing numerics option is disabled, the instances containing missing values for numeric fields in your dataset will be ignored by the deepnet. Also, when using your deepnet to make predictions, you will not be able to include missing values for numeric fields in the input data.
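
As a hedged illustration, the sketch below disables the default handling of missing numeric values when creating a deepnet with the BigML Python bindings. The argument name (“missing_numerics”) and the dataset ID are assumptions and placeholders; see subsection 5.4.4 and the API documentation for the exact option.

    # Hedged sketch: ignoring rows with missing numeric values during training.
    from bigml.api import BigML

    api = BigML()
    deepnet = api.create_deepnet("dataset/61f2a3b4c5d6e7f801234567",  # placeholder
                                 {"missing_numerics": False})
    api.ok(deepnet)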