Datasets with the BigML Dashboard
8.1 Adding Fields to a Dataset
If you need to create new fields (i.e., Feature Engineering), BigML allows you to do so using common operations over your existing data, or by writing custom operations with Flatline formulas. The following subsections describe how to add new fields to your dataset.
To start, access the configuration option menu and select Add Fields to Dataset. (See Figure 8.2 .)
This leads you to a configuration panel for adding fields, where you can name the new fields, decide which operation you wish to apply, and select the field used to generate the new one. (See Figure 8.3.) You can add up to ten new fields manually using the BigML Dashboard, as well as write a custom formula. This is explained in the following subsections.
BigML also provides a help panel with an explanation of each operation, useful when you want to quickly look up what an operation does. Note: this is the same help panel displayed when filtering your dataset.
Finally, you can also rename your extended dataset before you click the button to create it. The following subsections define each of the operations you can apply to an existing field to create a new one.
8.1.1 Discretization
BigML offers three options to discretize your numeric fields and create new fields from them (see Figure 8.4):
Discretize by percentiles: select a discretization value and BigML will split the field values into equal population segments (categories). Discretizing by percentiles will split the field values into 100 different categories, by quartiles into 4, by terciles into 3, etc.
Discretize by groups: specify the number of groups and BigML will split the field values into equal width segments (categories), e.g., setting 3 groups for a field ranging from 0 to 6 will yield: category 1= [0,2], category 2= [2,4], category 3= [4,6].
Is within percentiles?: specify a percentile range between 0 and 1 and you will get a boolean field with True or False values for each instance depending on whether it belongs to the specified range.
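The three options above can be illustrated with a small sketch (this is illustrative Python, not BigML code; the group labels and edge handling are assumptions for the example):

```python
# Illustrative sketch of two discretization options: equal-width
# groups and an "is within percentiles?" boolean field.

def discretize_by_groups(values, n_groups):
    """Split a numeric field into n_groups equal-width categories."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_groups
    labels = []
    for v in values:
        # Clamp the maximum value into the last group.
        idx = min(int((v - lo) // width), n_groups - 1)
        labels.append(f"group {idx + 1}")
    return labels

def is_within_percentiles(values, low, high):
    """True/False per instance for a percentile range in [0, 1]."""
    ranked = sorted(values)
    lo_v = ranked[int(low * (len(ranked) - 1))]
    hi_v = ranked[int(high * (len(ranked) - 1))]
    return [lo_v <= v <= hi_v for v in values]
```

For a field ranging from 0 to 6 split into 3 groups, this reproduces the [0,2], [2,4], [4,6] segments described above.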
8.1.2 Replacing Missing Values
Create new fields out of the selected numeric field by replacing the missing values with these operations (See Figure 8.5 ):
Fixed value: all your field missing values will be replaced by the specified value. You can set a number or a string.
Maximum: missing values will be replaced by the maximum value of the selected field.
Mean: missing values will be replaced by the mean of the selected field.
Median: missing values will be replaced by the median of the selected field.
Minimum: missing values will be replaced by the minimum value of the selected field.
Population: missing values will be replaced by the total number of instances that have valid values for the selected field, e.g., for a field containing 54 instances with valid values, the missing values will be replaced by 54.
Random integer: BigML creates a new field with a random value for each instance. You can set the maximum value you want for your random value generator.
Random value: missing values will be replaced by a random value within your field range.
Random weighted value: BigML sets a random value for your missing values within your field range but weighted by the population, so the population distribution for that field is used as a probability measure for the random value generator.
The operations fixed value, random value, and random weighted value are also available for categorical fields.
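As a hypothetical sketch of what these strategies compute (not BigML internals), consider a field stored as a list where `None` marks a missing value:

```python
# Sketch of the missing-value replacement strategies described above.
from statistics import mean, median

def replace_missing(values, strategy, fixed=None):
    present = [v for v in values if v is not None]
    if strategy == "fixed":
        fill = fixed
    elif strategy == "maximum":
        fill = max(present)
    elif strategy == "mean":
        fill = mean(present)
    elif strategy == "median":
        fill = median(present)
    elif strategy == "minimum":
        fill = min(present)
    elif strategy == "population":
        # Count of instances with a valid value for this field.
        fill = len(present)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]
```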
8.1.3 Normalizing
Create new fields out of any numeric fields by normalizing them with the following operations (see Figure 8.6):
Normalize: standardizes the data distribution so your fields are comparable. Select the range to which you want to normalize your field, which should be within the field range.
Z-score: measures the distance of each value from the mean, in units of standard deviation.
Logarithmic normalization: applies the z-score function to the logarithm of the values in the given field.
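The three normalization options can be sketched as follows (illustrative Python, not BigML code; logarithmic normalization assumes positive field values):

```python
# Range scaling, z-score, and z-score of the natural logarithm.
from math import log
from statistics import mean, pstdev

def normalize(values, new_min, new_max):
    """Rescale a field linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo)
            for v in values]

def z_score(values):
    """Distance of each value from the mean, in standard deviations."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def log_normalize(values):
    """Z-score of the natural logarithm of each value."""
    return z_score([log(v) for v in values])
```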
8.1.4 Math
You can also create new fields out of any numeric fields by applying any of the following math operations (see Figure 8.7 ):
Exponentiation: raises \(e\) to the field value: \(e^x\).
Logarithm (base 2): converts fields into a logarithmic scale. This is useful for fields with a wide range of data (since it reduces the range to a more manageable scale) and to find exponential patterns in your data.
Logarithm (base 10): converts fields into a logarithmic scale.
Logarithm (natural): converts fields into a logarithmic scale.
Square: squares the value: \(x^2\).
Square root: computes the square root of the value: \(\sqrt{x}\).
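These operations map directly onto standard math functions; the following sketch applies each one to a field value `x` (assumed positive for the logarithms):

```python
import math

def math_ops(x):
    """Apply each of the six math operations to a single field value."""
    return {
        "exponentiation": math.exp(x),  # e^x
        "log2": math.log2(x),
        "log10": math.log10(x),
        "ln": math.log(x),
        "square": x ** 2,
        "sqrt": math.sqrt(x),
    }
```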
8.1.5 Sliding Windows
Creating new features using sliding windows is one of the most common feature engineering techniques in Machine Learning. It is usually applied to frame time series data, using previous data points as new input fields to predict future data points.
For example, imagine we have one year of sales data and want to predict sales. We have the daily sales (our objective field) and other information, such as holidays and shop offers (our predictors). (See Figure 8.8.) As domain experts, we know that past sales can be key predictors of today’s sales. Therefore, we can use our objective field “sales” to create additional input fields that contain past data. We could create any number of such fields: the previous day’s sales, the average of last week’s sales, the difference between last month’s and this month’s sales, etc. However, we need to be very careful not to include today’s or future sales data in these new features; otherwise, we would introduce leakage into our model. For example, in Figure 8.8 below, we create a new predictor that calculates the average sales of the last two days (see the field in green, “avgSales_L2D”). This is a sliding window that starts at -2 and ends at -1.
In BigML, you can define the following operations and parameters to create sliding windows:
Operation: select one of the below operations to be applied to the instances in the window (see Figure 8.9 ).
Sum of instances: sums consecutive instances by defining a window start and end. For example, for a sales dataset where each instance is a different day, we can get the sum of sales of the previous 5 days (including today) by defining a window that starts at -5 and ends at 0 relative to each instance in the dataset.
Mean of instances: calculates the mean of consecutive instances by defining a window start and end (negative values are previous instances and positive values next instances). For example, for a sales dataset where each instance is a different day, we can get the mean of sales of the previous 5 days (including today) by defining a window that starts at -5 and ends at 0 relative to each instance in the dataset.
Median of instances: calculates the median of consecutive instances by defining a window start and end (negative values are previous instances and positive values next instances). For example, for a sales dataset where each instance is a different day, we can get the median of sales of the previous 5 days (including today) by defining a window that starts at -5 and ends at 0 relative to each instance in the dataset.
Minimum of instances: calculates the minimum of consecutive instances by defining a window start and end (negative values are previous instances and positive values next instances). For example, for a sales dataset where each instance is a different day, we can get the minimum of sales of the previous 5 days (including today) by defining a window that starts at -5 and ends at 0 relative to each instance in the dataset.
Maximum of instances: calculates the maximum of consecutive instances by defining a window start and end (negative values are previous instances and positive values next instances). For example, for a sales dataset where each instance is a different day, we can get the maximum of sales of the previous 5 days (including today) by defining a window that starts at -5 and ends at 0 relative to each instance in the dataset.
Product of instances: calculates the product of consecutive instances by defining a window start and end (negative values are previous instances and positive values next instances). For example, for a sales dataset where each instance is a different day, we can get the product of sales of the previous 5 days (including today) by defining a window that starts at -5 and ends at 0 relative to each instance in the dataset.
Difference from first: calculates the difference between values associated with the start and end indices of the window, where the end index must be greater than the start index and the difference is calculated as end - start. For example, for a sales dataset where each instance is a different day, we can get the difference between yesterday and today’s sales \( [Sales(today) - Sales(yesterday)]\) by defining a window that starts at -1 and ends at 0.
Difference from first (%): calculates the percentage difference between values associated with the start and end indices of the window, where the end index must be greater than the start index and the difference is calculated as end - start. For example, for a sales dataset where each instance is a different day, we can get the percentage difference between yesterday and today’s sales \({[Sales(today) - Sales(yesterday)]/ Sales(yesterday)}\) by defining a window that starts at -1 and ends at 0.
Difference from last: calculates the difference between values associated with the start and end indices of the window, where the end index must be greater than the start index and the difference is calculated as start - end. For example, for a sales dataset where each instance is a different day, we can get the difference between today and tomorrow’s sales \([Sales(today) - Sales(tomorrow)]\) by defining a window that starts at 0 and ends at 1.
Difference from last (%): calculates the percentage difference between values associated with the start and end indices of the window, where the end index must be greater than the start index and the difference is calculated as start - end. For example, for a sales dataset where each instance is a different day, we can get the percentage difference between today and tomorrow’s sales \({[Sales(today) - Sales(tomorrow)]/ Sales(tomorrow)}\) by defining a window that starts at 0 and ends at 1.
Field: you can only select numeric fields to calculate sliding windows.
Window Start: defines the first instance to be considered for the calculation. Negative values refer to previous instances, positive values to next instances, and 0 to the current instance.
Window End: defines the last instance to be considered for the calculation. Negative values refer to previous instances, positive values to next instances, and 0 to the current instance.
Finally, click the button, and you will be able to see the new fields containing the sliding window calculations at the end of the new dataset.
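The window mechanics above can be sketched in a few lines (illustrative Python, not BigML code; returning `None` for instances whose window falls outside the dataset is an assumption about edge handling made for this example):

```python
# For each instance, apply an operation to the values between
# window_start and window_end (0 = current instance, negative =
# previous instances, positive = next instances).
from statistics import mean

def sliding_window(values, op, start, end):
    out = []
    for i in range(len(values)):
        lo, hi = i + start, i + end
        if lo < 0 or hi >= len(values):
            out.append(None)  # window incomplete at the dataset edges
        else:
            out.append(op(values[lo:hi + 1]))
    return out

sales = [10, 12, 11, 15, 14]
# Mean of the previous two days (window start -2, end -1),
# like the "avgSales_L2D" field in Figure 8.8.
avg_l2d = sliding_window(sales, mean, -2, -1)
# Difference from first over window [-1, 0]: today minus yesterday.
diff = sliding_window(sales, lambda w: w[-1] - w[0], -1, 0)
```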
8.1.6 Types
To create new fields from a categorical, text, or items field, use the type operations explained below (see Figure 8.12). Note: only the categorical operation is available for numeric fields:
Categorical: coerce numeric field values into categorical values, e.g., the number 10 will become a string “10”.
Integer: coerce categorical values to integer values, e.g., the string “7.5 pounds” will become 7. Boolean values are assigned 0 (false) and 1 (true).
Real: coerce categorical values to float values, e.g., the string “7.5 pounds” will become 7.5. Boolean values are assigned 0 and 1.
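A rough sketch of these coercions (illustrative Python, not BigML internals; the number-extraction regex is an assumption made to reproduce the "7.5 pounds" example):

```python
import re

def to_categorical(value):
    """Numeric -> categorical, e.g., 10 becomes the string "10"."""
    return str(value)

def to_real(value):
    """Categorical -> float; booleans map to 0.0 / 1.0."""
    if isinstance(value, bool):
        return 1.0 if value else 0.0
    match = re.search(r"-?\d+(\.\d+)?", str(value))
    return float(match.group()) if match else None

def to_integer(value):
    """Categorical -> integer, truncating any fractional part."""
    real = to_real(value)
    return int(real) if real is not None else None
```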
8.1.7 Random
Random operations are available for numeric and categorical fields, except the first operation (random integer), which is not tied to any field type:
Random integer: BigML creates a new field with a random value for each instance.
Random value within field range: BigML sets a random value but takes your field range as the reference for minimum and maximum values.
Random weighted value: BigML sets a random value within your field range, weighted by the population, so the population distribution for that field is used as a probability measure for the random generator.
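The weighted variant can be sketched as follows (a hypothetical illustration, not BigML code): sampling uniformly over the observed instances is equivalent to sampling distinct values weighted by their frequency, which is the sense in which the population distribution drives the generator.

```python
import random

def random_weighted_value(observed, rng):
    """Sample one value using the field's population as weights."""
    # Uniform choice over instances = frequency-weighted choice
    # over distinct values.
    return rng.choice(observed)

values = ["low", "low", "low", "high"]  # "low" is 3x as likely
rng = random.Random(42)
sample = [random_weighted_value(values, rng) for _ in range(5)]
```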
8.1.8 Statistics
Another option to add new fields to your dataset based on your numeric fields is by applying statistics operations (see Figure 8.14 ):
Mean: computes the field mean for all instances.
Population: computes the count of total instances for that field.
Population fraction: computes the number of instances whose values are below the specified value.
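A minimal sketch of the three statistics operations (illustrative Python, not BigML code; treating "below" as strictly less than the threshold is an assumption):

```python
from statistics import mean as stat_mean

def field_mean(values):
    """Mean of the field across all instances."""
    return stat_mean(values)

def population(values):
    """Count of instances with a value for this field."""
    return len(values)

def population_fraction(values, threshold):
    """Count of instances whose values fall below the threshold."""
    return sum(1 for v in values if v < threshold)
```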
8.1.9 Write Flatline Formula
In addition to all the operations explained in the above subsections, BigML lets you perform any kind of operation with a Flatline formula. As when filtering fields of your dataset, type the desired formula in either Lisp or JSON syntax.
Furthermore, you can use the BigML Flatline editor to create and validate your formulas before using them; Flatline is a powerful and flexible open-source lisp-like language. To access the Flatline editor, first select the syntax you want to use, then click the highlighted icon in Figure 8.15, which leads to the Flatline editor.
Note: the Flatline editor can be used to add new fields to your dataset following the same procedure as when filtering your dataset. (See subsection 7.3.7 .)
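As an illustration of the two syntaxes, a formula that adds two fields can be written as follows (the field names "price" and "tax" are hypothetical, chosen only for the example):

```lisp
;; Lisp syntax: add the "price" and "tax" fields
(+ (field "price") (field "tax"))
```

The equivalent JSON syntax expresses the same formula as nested arrays with the operator first: `["+", ["field", "price"], ["field", "tax"]]`.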
8.1.10 View and Reuse New Fields’ Formulas
When you add new fields to a dataset, you will be able to view the formulas used to create them by clicking the option shown in Figure 8.16 .
This option will display a window with the Flatline formula underlying each new field in a dataset (see Figure 8.17). You can copy or download the formula (in Lisp and JSON formats) to create the same field from other datasets.