Datasets with the BigML Dashboard
7.3 Filtering Datasets
BigML lets you transform your original dataset in several ways. This section covers how to create a new dataset by filtering instances. You can use the predefined operations available in the filters selector, or you can customize your filter using Flatline formulas.
Access this option by clicking the configure option menu and selecting Filter Dataset. (See Figure 7.10.)
This leads you to the configuration panel for filtering (Figure 7.11), where you can choose the field you want to filter and decide which operation to apply. You can add up to ten filtering conditions manually by clicking the button shown in this panel, or as many filtering conditions as you want by using Flatline formulas. For reference, see subsection 7.3.7 or the Flatline manual, which is also available from the help panel. The help panel is useful when you want to quickly find the definition of each operation. Finally, you can give your filtered dataset a different name before you click the button to create it.
You may want to filter different instances from your dataset depending on your goals. For instance, you might need to find the instances that have missing values in a certain field, or instances that contain values higher than X for another field, etc. The following subsections cover which operations are available per field type.
7.3.1 Filtering by Numeric Fields
To filter your numeric fields, choose between the following operations:
Comparison (See Figure 7.12.)
Is between: includes instances containing values within the specified range
Is less than: includes instances containing values below the specified level
Is less than or equal to: includes instances containing values equal to or below the specified level
Is greater than: includes instances containing values above the specified level
Is greater than or equal to: includes instances containing values equal to or above the specified level
Equals (See Figure 7.13.)
Is equal: includes instances containing the specified value/values
Is not equal: excludes instances containing the specified value/values
Missing values (See Figure 7.14.)
If value is missing: includes instances containing missing values for the selected field
If value isn’t missing: excludes instances containing missing values for the selected field
Statistics (See Figure 7.15.)
Is between percentiles: includes instances within the specified percentiles, e.g., a percentile range between 0 and 0.3 includes the first 30% of the instances
Is below the mean: includes instances below the mean of the selected field
Is above the mean: includes instances above the mean of the selected field
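As an illustration of what some of these numeric operations select, the following minimal Python sketch mimics three of them on a small list of values. It is not BigML's implementation: the function names, the sample `ages` data, the inclusive range bounds, and the rank-based reading of percentiles are all assumptions made for this example.

```python
from statistics import mean


def is_between(values, low, high):
    """Keep values within [low, high]; inclusive bounds are an assumption."""
    return [v for v in values if v is not None and low <= v <= high]


def is_below_mean(values):
    """Keep values strictly below the mean of the non-missing values."""
    present = [v for v in values if v is not None]
    m = mean(present)
    return [v for v in present if v < m]


def is_between_percentiles(values, low_p, high_p):
    """Keep values whose rank falls in the [low_p, high_p) share of the sorted data."""
    present = sorted(v for v in values if v is not None)
    n = len(present)
    return present[round(low_p * n):round(high_p * n)]


# Hypothetical field values used only for this illustration.
ages = [18, 22, 25, 31, 37, 40, 44, 52, 60, 71]

print(is_between(ages, 25, 40))                # values in the range 25..40
print(is_below_mean(ages))                     # values below the mean (40)
print(is_between_percentiles(ages, 0.0, 0.3))  # first 30% of the instances
```

With ten instances, the percentile range 0 to 0.3 keeps the three lowest values, which matches the "first 30% of the instances" description above.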
7.3.2 Filtering by Categorical Fields
BigML lets you decide which operation you want to apply to filter your field. The following operations apply to categorical fields as well as to every other field type supported by BigML. (See Figure 7.16 and Figure 7.17.)
Specific values
Equals: includes instances containing the specified value/values
Does not equal: excludes instances containing the specified value/values
Missing values
If value is missing: includes instances containing missing values for the selected field
If value isn’t missing: excludes instances containing missing values for the selected field
7.3.3 Filtering by Text Fields
To filter your text fields, you can choose between the following operations:
Equals (See Figure 7.18.)
Is equal: includes instances containing the specified value/values
Is not equal: excludes instances containing the specified value/values
Contains (See Figure 7.19.)
Is like (case-sensitive): matches words containing at least part of the letters specified, taking into account lower and upper cases, e.g., “great” will match a text containing the word “great” or “greatness”, but not “Great” or “Greatness”
Is like (case-insensitive): matches words containing at least part of the letters specified, not taking into account lower and upper cases, e.g., “great” will match a text containing the word “great”, “greatness”, “Great” or “Greatness”
Contains (case-sensitive): matches texts containing the exact words specified, taking into account lower and upper cases, e.g., “great” will match a text containing the word “great”, but not “Great”
Contains (case-insensitive): matches texts containing the exact words specified, not taking into account lower and upper cases, e.g., “great” will match a text containing the word “great” or “Great”
Doesn’t contain (See Figure 7.20.)
Is not like (case-sensitive): excludes instances with words containing at least part of the letters specified, taking into account lower and upper cases, e.g., “great” will exclude a text containing the word “great” or “greatness”, but not “Great” or “Greatness”
Is not like (case-insensitive): excludes instances with words containing at least part of the letters specified, not taking into account lower and upper cases, e.g., “great” will exclude a text containing the word “great”, “greatness”, “Great” or “Greatness”
Not contains (case-sensitive): excludes texts containing the exact words specified, taking into account lower and upper cases, e.g., “great” will exclude a text containing the word “great”, but not “Great”
Not contains (case-insensitive): excludes texts containing the exact words specified, not taking into account lower and upper cases, e.g., “great” will exclude a text containing the word “great” or “Great”
Missing values (See Figure 7.21.)
If value is missing: includes instances containing missing values for the selected field
If value isn’t missing: excludes instances containing missing values for the selected field
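The difference between the "Is like" and "Contains" operations can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not BigML's matching code: the function names are hypothetical, and reading "exact words" as a whole-word match is an assumption.

```python
import re


def is_like(text, term, case_sensitive=True):
    """Partial match: 'great' is also found inside 'greatness'."""
    if not case_sensitive:
        text, term = text.lower(), term.lower()
    return term in text


def contains(text, term, case_sensitive=True):
    """Whole-word match (assumed): 'great' is found as a word, not inside 'greatness'."""
    flags = 0 if case_sensitive else re.IGNORECASE
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags) is not None


print(is_like("its greatness shows", "great"))                    # partial match
print(is_like("Great job", "great"))                              # case differs
print(contains("its greatness shows", "great"))                   # not a whole word
print(contains("A Great day", "great", case_sensitive=False))     # case ignored
```

The "Is not like" and "Not contains" operations would simply exclude the instances for which these checks succeed.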
7.3.4 Filtering by Items Fields
BigML offers the following operations to filter your dataset by items fields:
Equals (See Figure 7.22.)
Is equal: includes instances containing the specified value/values
Is not equal: excludes instances containing the specified value/values
Contains (See Figure 7.23.)
Contains (case-insensitive): matches texts containing the exact words specified, not taking into account lower and upper cases, e.g., “great” will match a text containing the word “great” or “Great”
Not contains (case-sensitive): excludes texts containing the exact words specified, taking into account lower and upper cases, e.g., “great” will exclude a text containing the word “great”, but not “Great”
Missing values (See Figure 7.24.)
If value is missing: includes instances containing missing values for the selected field
If value isn’t missing: excludes instances containing missing values for the selected field
7.3.5 Filtering by Date-Time Fields
To filter your dataset by date-time fields, BigML offers the same operations as for the numeric fields (see subsection 7.3.1). The only difference is that you have to select the values in a calendar. (See Figure 7.25.)
7.3.6 Filtering using Flatline Formulas
You can also filter your dataset by writing a Flatline formula, either with Lisp syntax or with JSON syntax. You can type the formula directly (see Figure 7.26 and Figure 7.27), or use the Flatline editor to create and validate it. (See subsection 7.3.7 for more details.)
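To show how the two syntaxes relate, the following Python sketch renders a JSON-style formula (nested lists with the operator in first position) as a Lisp-style string. It is a simplified illustration, not part of BigML or its tooling: the `to_lisp` helper, the field names `age` and `income`, and the exact formula are hypothetical, and the Flatline manual remains the authority on which operators exist.

```python
import json


def to_lisp(expr, head=False):
    """Render a nested-list (JSON-style) formula as a Lisp-style string."""
    if isinstance(expr, list):
        # The first element of each list is treated as the operator name.
        parts = [to_lisp(e, head=(i == 0)) for i, e in enumerate(expr)]
        return "(" + " ".join(parts) + ")"
    if isinstance(expr, bool):
        return "true" if expr else "false"
    if isinstance(expr, str):
        # Operators are written bare; other strings become quoted literals.
        return expr if head else json.dumps(expr)
    return str(expr)


# Hypothetical filter: age below 30 and income not missing.
json_formula = ["and", ["<", ["f", "age"], 30], ["not", ["missing?", "income"]]]

print(json.dumps(json_formula))
print(to_lisp(json_formula))
```

Both printed lines describe the same condition; only the notation differs.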
7.3.7 Filtering using the Flatline Editor
The Flatline language lets you filter your dataset in many different ways, which can help you obtain higher quality predictors. Follow the steps below to edit your Lisp formula or your JSON formula. Select the desired syntax. The example in the following steps uses a Lisp Flatline formula.
Click the highlighted icon in Figure 7.28 to add a formula using the Flatline editor:
Next, with the Lisp expression option selected, type your expression in the editor panel (Figure 7.29). You can also use the help panel at any time if you have doubts about the operation to compute (Figure 7.30).
Click the button in Figure 7.30 to check whether the operation is valid. If it is valid (Figure 7.31), proceed with the following steps; if it is not valid, BigML will display a message (Figure 7.32) letting you know the error.
If you want to convert the Lisp expression into a JSON expression, simply switch to JSON expression (Figure 7.33) so you do not lose it.
After validating your expression, click the button (in Figure 7.31) to see the expression result shown in Figure 7.34. You can observe that, by default, only the fields involved in the formula are shown in the preview.
You can change this and display all the fields in the dataset by clicking the switcher shown in Figure 7.35.
Then click the button. (See Figure 7.34.) BigML will display the new Lisp expression in the same field where you can directly type the expression before opening the Flatline editor. (See Figure 7.36.) Press the button to create the filtered dataset.
Please visit the Flatline manual for a full discussion about how to use the Flatline editor.
7.3.8 View and Reuse Filters
Once you have created the filtered dataset, you can view the filters applied by clicking the option shown in Figure 7.37.
This option displays a window with the Flatline formula used to filter the dataset (see Figure 7.38). You can copy or download the formula (in Lisp and JSON formats) to apply this filter to another dataset.
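If you work outside the Dashboard, a copied formula could in principle be reused programmatically. The sketch below is a hedged illustration only: it assumes a client object exposing a `create_dataset(origin, args)` method and a `lisp_filter` creation argument, which should be checked against the BigML API documentation before use. The `reuse_filter` helper, the `RecordingApi` stand-in, the dataset identifiers, and the formula are all hypothetical, and no request is sent to BigML.

```python
def reuse_filter(api, dataset_ids, lisp_filter):
    """Create one filtered dataset per origin dataset using the same formula."""
    created = []
    for dataset_id in dataset_ids:
        # "lisp_filter" is assumed to be the argument carrying a Lisp formula.
        created.append(api.create_dataset(dataset_id, {"lisp_filter": lisp_filter}))
    return created


class RecordingApi:
    """Stand-in client that records calls instead of contacting BigML."""

    def __init__(self):
        self.calls = []

    def create_dataset(self, origin, args):
        self.calls.append((origin, args))
        return {"origin": origin, "args": args}


api = RecordingApi()
formula = '(< (f "age") 30)'  # hypothetical formula copied from the filter window
results = reuse_filter(api, ["dataset/a", "dataset/b"], formula)
print(len(results), "filtered datasets requested")
```

Passing the client in as a parameter keeps the helper easy to try with a stand-in object before pointing it at a real account.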
This section described how to transform your data by filtering a dataset. The next section (section 7.4) explains a different way of filtering your original dataset: removing duplicated instances.