Datasets with the BigML Dashboard

2.3 Text and Items Fields

Items fields are similar to categorical fields, with the key difference that items may be provided as a set (e.g., items purchased together). Text fields are similar to items fields, but optionally include text processing (such as stemming, described below). BigML computes how frequently each term appears in a field and shows the number of instances that use it. To display this, BigML offers the tag cloud, an alternative representation to the histogram. Tag clouds are available only for text and items fields, and they are accessed through a separate icon placed next to the histogram, as seen in Figure 2.4. In this example, 183 instances of this field use the term “cava”.

\includegraphics[width=0.5\textwidth ]{images/text-items-histogram}
Figure 2.4 Example of histogram for text and items fields

For more details, click on the TXT icon to discover how often a given term appears in your dataset (see Figure 2.5). The bigger the term, the more frequently it is repeated. Mouse over a term to check how many times it is repeated; e.g., “chardonnay” appears 155 times in this field. You can download the tag cloud in SVG or PNG format by clicking the SVG or PNG button.

\includegraphics[]{images/tag-cloud}
Figure 2.5 Example of a tag cloud

BigML can find up to 1,000 terms across all the text and items fields of your dataset. To find these terms, BigML parses the text according to the text analysis options configured for your source. (See the section Text Analysis of the Sources with the BigML Dashboard [22].) If you use the BigML default tokenization, terms are separated by spaces and other symbols (comma, colon, semicolon, tab, etc.). Each block of text between separators is considered a term.

  • If the stopwords option is enabled, BigML eliminates common words such as a, the, is, at, on, and which.

  • If the text field has stemming enabled, all terms with the same root are considered one single value; e.g., with stemming enabled, the words “great,” “greatly,” and “greatness” are considered one value instead of three different values. BigML calculates how often each of these terms appears in the field. If “great” appears 12 times and “greatness” appears eight times, the term count will account for 20 instances of the term “great.”

  • BigML also allows you to differentiate words by upper and lower case. When case sensitivity is enabled, “Great” and “great” count as two different words in the tag cloud; otherwise, they are treated as the same word.
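The interaction of these options can be illustrated with a minimal Python sketch. It is not BigML's implementation: the function name term_counts, the helper toy_stem, the short stopword list, and the two-suffix "stemmer" are all hypothetical stand-ins chosen for illustration, and the sketch counts term occurrences rather than instances.

```python
import re
from collections import Counter

# Illustrative stopword list only; the actual list used by BigML is not shown here.
STOPWORDS = {"a", "the", "is", "at", "on", "which"}

# Toy suffix list standing in for a real stemmer (an assumption for illustration).
SUFFIXES = ("ness", "ly")


def toy_stem(term):
    """Strip one known suffix so that related words share a root."""
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term


def term_counts(texts, case_sensitive=False, use_stopwords=True,
                stemming=True, max_terms=1000):
    """Count term occurrences across a list of text values."""
    counts = Counter()
    for text in texts:
        # Split on spaces and other separators (comma, colon, semicolon, tab, ...).
        for token in re.split(r"[\s,:;.!?]+", text):
            if not token:
                continue
            if not case_sensitive:
                token = token.lower()
            if use_stopwords and token.lower() in STOPWORDS:
                continue
            if stemming:
                token = toy_stem(token)
            counts[token] += 1
    # Keep only the most frequent terms, mirroring the 1,000-term limit.
    return dict(counts.most_common(max_terms))


texts = ["Great wine, greatly priced", "The greatness of a great cava"]
print(term_counts(texts))
print(term_counts(texts, case_sensitive=True))
```

With the default options, “Great,” “greatly,” “greatness,” and “great” are all counted under the single term “great” (four occurrences). With case_sensitive=True, “Great” is counted separately from “great.”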

If BigML incorrectly detects a numeric or categorical field as a text field, you may override the field type during source configuration. (See the section Updating Field Types of the Sources with the BigML Dashboard [22].)