Datasets with the BigML Dashboard

7.3 Filtering Datasets

BigML lets you transform your original dataset in several ways. This section covers how to create a new dataset by filtering instances. You can use the predefined operations available in the filter selector, or you can write a custom filter using Flatline formulas.

Access this option by clicking the configure option menu and selecting Filter Dataset. (See Figure 7.10.)

\includegraphics[]{images/access-filtering}
Figure 7.10 Access to filter your dataset

This opens the configuration panel for filtering (Figure 7.11), where you can choose the field you want to filter and the operation you wish to apply. You can add up to ten filtering conditions manually by clicking the Add condition button shown in this panel; with Flatline formulas, you can add as many conditions as you want. For reference, see subsection 7.3.7 or the Flatline manual, which is also available from the help panel. The help panel is also useful for quickly finding the definition of each operation. Finally, you can rename your filtered dataset before clicking the Create dataset button.

\includegraphics[]{images/filter-conf}
Figure 7.11 Configuration panel for filtering

You may want to filter different instances from your dataset depending on your goals. For instance, you might need to find the instances that have missing values in a certain field, or the instances whose values exceed a given threshold in another field. The following subsections cover the operations available for each field type.

7.3.1 Filtering by Numeric Fields

To filter your numeric fields, choose between the following operations:

  • Comparison (See Figure 7.12.)

    • Is between: includes instances containing values within the specified range

    • Is less than: includes instances containing values below the specified level

    • Is less than or equal to: includes instances containing values equal to or below the specified level

    • Is greater than: includes instances containing values above the specified level

    • Is greater than or equal to: includes instances containing values equal to or above the specified level

      \includegraphics[]{images/numeric-comparison}
      Figure 7.12 Filtering a dataset by a numeric field with comparison operations
  • Equals (See Figure 7.13.)

    • Is equal: includes instances containing the specified value/values

    • Is not equal: excludes instances containing the specified value/values

      \includegraphics[]{images/numeric-equals}
      Figure 7.13 Filtering a dataset by a numeric field with equals operations
  • Missing values (See Figure 7.14.)

    • If value is missing: includes instances containing missing values for the selected field

    • If value isn’t missing: excludes instances containing missing values for the selected field

      \includegraphics[]{images/numeric-missing}
      Figure 7.14 Filtering a dataset by a numeric field with missing values operations
  • Statistics (See Figure 7.15.)

    • Is between percentiles: includes instances within the specified percentiles. For example, the range from 0 to 0.3 includes the first 30% of the instances.

    • Is below the mean: includes instances whose values are below the mean of the selected field

    • Is above the mean: includes instances whose values are above the mean of the selected field

      \includegraphics[]{images/numeric-statistics}
      Figure 7.15 Filtering a dataset by a numeric field with statistics operations
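
The numeric operations above can be illustrated outside the Dashboard with a minimal Python sketch. It is not BigML code: the sample rows, the `age` field, and every function name are hypothetical, the inclusive bounds of `is_between` are an assumption, and the percentile filter is a simple rank-based approximation of "the first 30% of the instances".

```python
from statistics import mean

# Hypothetical sample rows; None stands for a missing value.
rows = [
    {"id": 1, "age": 23},
    {"id": 2, "age": 35},
    {"id": 3, "age": None},
    {"id": 4, "age": 51},
    {"id": 5, "age": 35},
]


def present(rows, field):
    """Rows whose value for `field` is not missing."""
    return [r for r in rows if r[field] is not None]


def is_between(rows, field, low, high):
    """Comparison: keep values in [low, high] (inclusive bounds assumed)."""
    return [r for r in present(rows, field) if low <= r[field] <= high]


def is_missing(rows, field):
    """Missing values: keep rows where the field has no value."""
    return [r for r in rows if r[field] is None]


def is_above_mean(rows, field):
    """Statistics: keep rows whose value exceeds the field mean."""
    known = present(rows, field)
    avg = mean(r[field] for r in known)
    return [r for r in known if r[field] > avg]


def is_between_percentiles(rows, field, low, high):
    """Statistics: rank-based approximation over rows sorted by value."""
    ordered = sorted(present(rows, field), key=lambda r: r[field])
    n = len(ordered)
    return ordered[int(low * n):int(high * n)]


def ids(selected):
    """Helper used only to display which rows survived a filter."""
    return [r["id"] for r in selected]


print(ids(is_between(rows, "age", 30, 40)))
print(ids(is_missing(rows, "age")))
print(ids(is_above_mean(rows, "age")))
print(ids(is_between_percentiles(rows, "age", 0, 0.3)))
```

Each function returns a new list and leaves `rows` untouched, mirroring the way a filter creates a new dataset rather than modifying the original one.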

7.3.2 Filtering by Categorical Fields

To filter your categorical fields, choose between the following operations, which are also available for every other field type supported by BigML. (See Figure 7.16 and Figure 7.17.)

  • Specific values

    • Equals: includes instances containing the specified value/values

    • Does not equal: excludes instances containing the specified value/values

      \includegraphics[]{images/all-specific}
      Figure 7.16 Filtering a dataset by all field types with specific values operations
  • Missing values

    • If value is missing: includes instances containing missing values for the selected field

    • If value isn’t missing: excludes instances containing missing values for the selected field

      \includegraphics[]{images/all-missing}
      Figure 7.17 Filtering a dataset by all field types with missing values operations

7.3.3 Filtering by Text Fields

To filter your text fields, choose between the following operations:

  • Equals (See Figure 7.18.)

    • Is equal: includes instances containing the specified value/values

    • Is not equal: excludes instances containing the specified value/values

      \includegraphics[]{images/text-equals}
      Figure 7.18 Filtering a dataset by a text field with equals operations
  • Contains (See Figure 7.19.)

    • Is like (case-sensitive): matches words containing at least part of the letters specified, taking into account lower and upper cases, e.g., “great” will match a text containing the word “great” or “greatness”, but not “Great” or “Greatness”

    • Is like (case-insensitive): matches words containing at least part of the letters specified, not taking into account lower and upper cases, e.g., “great” will match a text containing the word “great”, “greatness”, “Great” or “Greatness”

    • Contains (case-sensitive): matches texts containing the exact words specified, taking into account lower and upper cases, e.g., “great” will match a text containing the word “great”, but not “Great”

    • Contains (case-insensitive): matches texts containing the exact words specified, not taking into account lower and upper cases, e.g., “great” will match a text containing the word “great” or “Great”

      \includegraphics[]{images/text-contains}
      Figure 7.19 Filtering a dataset by a text field with contains operations
  • Doesn’t contain (See Figure 7.20.)

    • Is not like (case-sensitive): excludes instances with words containing at least part of the letters specified, taking into account lower and upper cases, e.g., “great” will exclude a text containing the word “great” or “greatness”, but not “Great” or “Greatness”

    • Is not like (case-insensitive): excludes instances with words containing at least part of the letters specified, not taking into account lower and upper cases, e.g., “great” will exclude a text containing the word “great”, “greatness”, “Great” or “Greatness”

    • Not contains (case-sensitive): excludes texts containing the exact words specified, taking into account lower and upper cases, e.g., “great” will exclude a text containing the word “great”, but not “Great”

    • Not contains (case-insensitive): excludes texts containing the exact words specified, not taking into account lower and upper cases, e.g., “great” will exclude a text containing the word “great” or “Great”

      \includegraphics[]{images/text-not-contain}
      Figure 7.20 Filtering a dataset by a text field with doesn’t contain operations
  • Missing values (See Figure 7.21.)

    • If value is missing: includes instances containing missing values for the selected field

    • If value isn’t missing: excludes instances containing missing values for the selected field

      \includegraphics[]{images/text-missing}
      Figure 7.21 Filtering a dataset by a text field with missing values operations
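
The difference between the "Is like" and "Contains" operations can be sketched in Python. This is only an illustration of the behavior described above, not BigML's implementation: the function names and sample texts are hypothetical, and treating "exact words" as a whole-word regular expression match is an assumption.

```python
import re


def is_like(text, term, case_sensitive=True):
    """'Is like': the term may match any part of a word (substring match)."""
    if not case_sensitive:
        text, term = text.lower(), term.lower()
    return term in text


def contains(text, term, case_sensitive=True):
    """'Contains': the term must match a whole word (assumed word boundaries)."""
    flags = 0 if case_sensitive else re.IGNORECASE
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags) is not None


# Hypothetical text field values.
reviews = [
    "A great product",
    "Greatness awaits",
    "its greatness shows",
    "Not good",
]

like_cs = [t for t in reviews if is_like(t, "great")]
like_ci = [t for t in reviews if is_like(t, "great", case_sensitive=False)]
contains_cs = [t for t in reviews if contains(t, "great")]
# The "Doesn't contain" operations are the negation of the matching ones.
not_like_ci = [t for t in reviews if not is_like(t, "great", case_sensitive=False)]

print(like_cs)
print(like_ci)
print(contains_cs)
print(not_like_ci)
```

Negating a matcher, as in `not_like_ci`, corresponds to the "Is not like" and "Not contains" operations, which exclude the instances that the positive form would include.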

7.3.4 Filtering by Items Fields

BigML offers the following operations to filter your dataset by items fields:

  • Equals (See Figure 7.22.)

    • Is equal: includes instances containing the specified value/values

    • Is not equal: excludes instances containing the specified value/values

      \includegraphics[]{images/items-equals}
      Figure 7.22 Filtering a dataset by an items field with equals operations
  • Contains (See Figure 7.23.)

    • Contains (case-insensitive): matches texts containing the exact words specified, not taking into account lower and upper cases, e.g., “great” will match a text containing the word “great” or “Great”

    • Not contains (case-sensitive): excludes texts containing the exact words specified, taking into account lower and upper cases, e.g., “great” will exclude a text containing the word “great”, but not “Great”

      \includegraphics[]{images/items-contains}
      Figure 7.23 Filtering a dataset by an items field with contains operations
  • Missing values (See Figure 7.24.)

    • If value is missing: includes instances containing missing values for the selected field

    • If value isn’t missing: excludes instances containing missing values for the selected field

      \includegraphics[]{images/items-missing}
      Figure 7.24 Filtering a dataset by an items field with missing values operations

7.3.5 Filtering by Date-Time Fields

To filter your dataset by date-time fields, BigML offers the same operations as for numeric fields (see subsection 7.3.1). The only difference is that you select the values from a calendar. (See Figure 7.25.)

\includegraphics[]{images/date-time-filter}
Figure 7.25 Filtering a dataset by a date-time field with comparison operations
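
Because date-time values are ordered, the numeric comparison operations carry over directly. The short Python sketch below illustrates the idea with hypothetical rows, a hypothetical `signup` field, and hypothetical function names; the inclusive range bounds are an assumption.

```python
from datetime import datetime

# Hypothetical rows; None stands for a missing date-time value.
rows = [
    {"id": 1, "signup": datetime(2023, 1, 15)},
    {"id": 2, "signup": datetime(2023, 6, 1)},
    {"id": 3, "signup": None},
    {"id": 4, "signup": datetime(2024, 2, 20)},
]


def is_between(rows, field, start, end):
    """Keep rows whose date-time lies in [start, end] (inclusive bounds assumed)."""
    return [
        r for r in rows
        if r[field] is not None and start <= r[field] <= end
    ]


def is_greater_than(rows, field, moment):
    """Keep rows whose date-time is strictly after `moment`."""
    return [r for r in rows if r[field] is not None and r[field] > moment]


in_2023 = is_between(rows, "signup", datetime(2023, 1, 1), datetime(2023, 12, 31))
after_june = is_greater_than(rows, "signup", datetime(2023, 6, 1))

print([r["id"] for r in in_2023])
print([r["id"] for r in after_june])
```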

7.3.6 Filtering using Flatline Formulas

You can also filter your dataset by writing a Flatline formula in either Lisp or JSON syntax. You can type the formula directly (see Figure 7.26 and Figure 7.27) or use the Flatline editor to create and validate it. (See subsection 7.3.7 for more details.)

\includegraphics[]{images/lisp-expression}
Figure 7.26 Filtering a dataset by a Lisp flatline formula
\includegraphics[]{images/json-expression}
Figure 7.27 Filtering a dataset by a JSON flatline formula
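
To show how the two syntaxes relate, the following Python sketch renders a JSON-style formula as a Lisp-style expression. It assumes that the JSON form is a nested array whose first element is the operator name and that `f` references a field; the `age` and `plan` fields and the `to_lisp` helper are hypothetical, and the Flatline manual remains the authority on the actual syntax.

```python
import json


def to_lisp(expr):
    """Render a JSON-style nested list as a Lisp-style s-expression string."""
    if isinstance(expr, list):
        if not expr:
            return "()"
        head, *args = expr
        # The first element is the operator, written as a bare symbol.
        symbol = head if isinstance(head, str) else to_lisp(head)
        return "(" + " ".join([symbol] + [to_lisp(a) for a in args]) + ")"
    if isinstance(expr, bool):
        return "true" if expr else "false"
    if isinstance(expr, str):
        # String literals keep their double quotes.
        return json.dumps(expr)
    return str(expr)


# Hypothetical filter: instances with age below 30 on the "free" plan.
json_formula = '["and", ["<", ["f", "age"], 30], ["=", ["f", "plan"], "free"]]'
lisp_formula = to_lisp(json.loads(json_formula))

print(lisp_formula)
```

With the assumptions above, both forms describe the same filter, so choosing one syntax over the other is a matter of convenience.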

7.3.7 Filtering using the Flatline Editor

The Flatline language lets you filter your dataset in many different ways, which can help you build higher-quality predictors. Follow the steps below to edit your formula in the Flatline editor, using either Lisp or JSON syntax. The following example uses a Lisp Flatline formula:

  1. Click the highlighted icon in Figure 7.28 to add a formula using the Flatline editor:

    \includegraphics[]{images/access-flatline}
    Figure 7.28 Access the Flatline editor
  2. With the Lisp expression option selected, type your expression in the editor panel (Figure 7.29). You can use the help panel at any time if you have doubts about an operation (Figure 7.30).

    \includegraphics[]{images/lisp-1}
    Figure 7.29 Edit your Lisp expression
    \includegraphics[]{images/lisp-2}
    Figure 7.30 Help panel to learn more about the operations you can use to filter your dataset
  3. Click the Validate button in Figure 7.30 to check whether the expression is valid. If it is valid (Figure 7.31), proceed with the following steps; if it is not, BigML displays a message describing the error (Figure 7.32).

    \includegraphics[]{images/lisp-3}
    Figure 7.31 Example of a valid expression
    \includegraphics[]{images/lisp-4}
    Figure 7.32 Example of an invalid expression

    To convert the Lisp expression into a JSON expression, simply switch to JSON expression (Figure 7.33); the formula is converted, so you do not lose it.

    \includegraphics[]{images/lisp-6}
    Figure 7.33 JSON expression
  4. After validating your expression, click the Preview button (shown in Figure 7.31) to see the expression result (Figure 7.34). By default, only the fields involved in the formula are shown in the preview.

    \includegraphics[]{images/lisp-5}
    Figure 7.34 Preview of the expression result (only fields in formula)

    You can change this and display all the fields in the dataset by clicking the switcher shown in Figure 7.35.

    \includegraphics[]{images/lisp-5-1}
    Figure 7.35 Preview of the expression result (all fields)
  5. Then click the Accept button. (See Figure 7.34.) BigML displays the new Lisp expression in the same field where you can type an expression directly without opening the Flatline editor. (See Figure 7.36.) Click the Create dataset button to create the filtered dataset.

    \includegraphics[]{images/lisp-expression-double}
    Figure 7.36 Lisp formula edited in the Flatline editor

Please visit the Flatline manual for a full discussion about how to use the Flatline editor.

7.3.8 View and Reuse Filters

Once you have created the filtered dataset, you can view the filters applied by clicking the option shown in Figure 7.37.

\includegraphics[]{images/view-filters}
Figure 7.37 View the filters applied to a dataset

This option displays a window with the Flatline formula used to filter the dataset (see Figure 7.38). You can copy or download the formula (in Lisp and JSON formats) to apply the same filter to another dataset.

\includegraphics[]{images/copy-filters}
Figure 7.38 Copy and download filters
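
The idea of reusing one saved formula on several datasets can be sketched with a toy evaluator in Python. This is not the Flatline interpreter: it supports only a handful of operators (`f`, `and`, `or`, and basic comparisons), the way it treats missing values is an assumption, and the formula, fields, and datasets are hypothetical.

```python
import json
import operator

# Comparison operators supported by this toy evaluator.
OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "=": operator.eq,
}


def evaluate(expr, row):
    """Evaluate a JSON-style formula (nested lists) against one row."""
    if not isinstance(expr, list):
        return expr  # literal number or string
    head, *args = expr
    if head == "f":
        return row.get(args[0])  # field lookup; None if missing
    if head == "and":
        return all(evaluate(a, row) for a in args)
    if head == "or":
        return any(evaluate(a, row) for a in args)
    if head in OPS:
        left, right = (evaluate(a, row) for a in args)
        if left is None or right is None:
            return False  # assumption: comparisons with missing values fail
        return OPS[head](left, right)
    raise ValueError(f"unsupported operator: {head}")


# Hypothetical formula, as it might be stored after copying it in JSON format.
saved_formula = '["and", [">", ["f", "age"], 30], ["=", ["f", "plan"], "free"]]'
expr = json.loads(saved_formula)

dataset_a = [
    {"age": 25, "plan": "free"},
    {"age": 42, "plan": "free"},
    {"age": 50, "plan": "pro"},
]
dataset_b = [
    {"age": 31, "plan": "free"},
    {"age": None, "plan": "free"},
]

# The same saved formula filters both datasets.
filtered_a = [row for row in dataset_a if evaluate(expr, row)]
filtered_b = [row for row in dataset_b if evaluate(expr, row)]

print(filtered_a)
print(filtered_b)
```

Keeping the formula as data, separate from any particular dataset, is what makes it portable: the same text can be pasted into the filter field of another dataset with matching field names.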

This section described how to transform your data by filtering a dataset. The next section (section 7.4) explains a different way of filtering your original dataset: removing duplicate instances.