Sources with the BigML Dashboard

6.8 Text Analysis

The Text Analysis switch lets you enable or disable the analysis of text fields. The options in this section apply globally to all the fields of your source, but you can also override them on a field-by-field basis by configuring individual text fields directly (see Figure 6.4).

\includegraphics[]{images/sources/fields-config-2} — Figure 6.4 Global and text fields configuration
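The relationship between global and per-field options can be pictured as a simple override. The sketch below is only an illustration of that idea: the option names (`language`, `stemming`, `case_sensitive`) and the field names are hypothetical and are not taken from the BigML API.

```python
# Hypothetical option and field names, used only to illustrate the override idea.
GLOBAL_TEXT_OPTIONS = {"language": "en", "stemming": True, "case_sensitive": False}

# Per-field settings list only the options that differ from the global ones.
FIELD_OVERRIDES = {
    "review_es": {"language": "es"},
    "product_code": {"stemming": False, "case_sensitive": True},
}

def effective_options(field_name):
    """Global options, with any field-level settings taking precedence."""
    return {**GLOBAL_TEXT_OPTIONS, **FIELD_OVERRIDES.get(field_name, {})}

print(effective_options("review_es"))
print(effective_options("title"))  # no override: falls back to the global options
```

A field without its own configuration simply inherits every global option.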

The options configured at the source level take effect when you create the dataset. To see the text analysis options configured for a given dataset, display the Details in the Info panel of the dataset view (see Figure 6.5). Since a dataset can have many text fields in different languages, the detected languages are shown in the tooltip that appears when you hover over the green text optype icon, and in the tag cloud.

\includegraphics[]{images/sources/dataset-details} — Figure 6.5 Text options configured for a given dataset

6.8.1 Language

BigML attempts basic language detection for each text field. You can choose any of the following languages at the global level or for an individual field: Arabic, Catalan, Chinese, Czech, Danish, Dutch, English, Farsi/Persian, Finnish, French, German, Hungarian, Italian, Japanese, Korean, Polish, Portuguese, Turkish, Romanian, Russian, Spanish, and Swedish.

\includegraphics[]{images/sources/lang-configuration} — Figure 6.6 Language configuration options

6.8.2 Tokenize

The tokenization strategy determines how the text of a field is split into terms. You can choose one of the following methods (the default is “All”):

Tokens only: individual words are used as terms. For example, “ML for all” becomes [“ML”, “for”, “all”].
Full terms only: the entire field is treated as a single term as long as it is shorter than 256 characters. In this case, “ML for all” stays [“ML for all”].
All: both full terms and tokenized terms are used. In this case, “ML for all” becomes [“ML”, “for”, “all”, “ML for all”].
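The three methods can be sketched in a few lines of Python. This is a minimal illustration, not BigML's tokenizer: the mode labels (`tokens_only`, `full_terms_only`, `all`) are names chosen for this example, and splitting on whitespace is a simplifying assumption.

```python
def tokenize(text, mode="all", max_full_term_length=256):
    """Return the terms produced for one field value under a given mode."""
    # Assumption: split on whitespace; a real tokenizer also handles punctuation.
    tokens = text.split()
    # The whole field counts as a term only when it is shorter than the limit.
    full_term = [text] if len(text) < max_full_term_length else []
    if mode == "tokens_only":
        return tokens
    if mode == "full_terms_only":
        return full_term
    if mode == "all":
        # Skip the full term when it duplicates a token (one-word fields).
        return tokens + [t for t in full_term if t not in tokens]
    raise ValueError(f"unknown mode: {mode}")

print(tokenize("ML for all", "tokens_only"))      # ['ML', 'for', 'all']
print(tokenize("ML for all", "full_terms_only"))  # ['ML for all']
print(tokenize("ML for all", "all"))              # ['ML', 'for', 'all', 'ML for all']
```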

\includegraphics[]{images/sources/token-configuration} — Figure 6.7 Tokenize configuration options

6.8.3 Stop words removal

The Stop words removal selector lets you exclude stop words, which are usually uninformative, from the text analysis. Some examples of stop words are: a, the, is, at, on, and which. Stop words differ according to the language chosen to process each text field, so BigML offers three options:

Yes (detected language): this option removes the stop words only for the detected language. If you have several languages mixed within the same field, the stop words of the non-detected languages will appear in your models. This is the option selected by default.
Yes (all languages): this option removes the stop words for all languages. Even if you have several languages mixed within the same field, you will not find any stop words in your models. The downside is that some stop words for some languages may be valid words for other languages.
No: this option disables stop words removal, so stop words are included in your text analysis.

Next to the Stop words removal selector, another selector lets you choose how aggressive the removal is: Light, Normal, or Aggressive. Each level removes a superset of the words removed by the previous one. By default, BigML performs Normal stop words removal.
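The three language options and the nested aggressiveness levels can be illustrated with plain Python sets. The word lists below are tiny made-up samples, not BigML's actual stop word lists, and the option labels are names chosen for this sketch.

```python
# Tiny illustrative stop word lists; NOT BigML's actual lists.
STOP_WORDS = {
    "en": {"a", "the", "is", "at", "on", "which"},
    "es": {"el", "la", "es", "en", "de"},
}

def remove_stop_words(tokens, option="detected", detected_language="en"):
    """Drop stop words according to one of the three options."""
    if option == "no":
        return list(tokens)
    if option == "detected":
        stop = STOP_WORDS[detected_language]
    elif option == "all":
        stop = set().union(*STOP_WORDS.values())
    else:
        raise ValueError(f"unknown option: {option}")
    return [t for t in tokens if t.lower() not in stop]

tokens = ["the", "house", "is", "en", "la", "playa"]
print(remove_stop_words(tokens, "detected"))  # ['house', 'en', 'la', 'playa']
print(remove_stop_words(tokens, "all"))       # ['house', 'playa']
print(remove_stop_words(tokens, "no"))        # unchanged

# Aggressiveness levels as nested sets: each level contains the previous one.
LIGHT = {"a", "the"}
NORMAL = LIGHT | {"is", "at", "on"}
AGGRESSIVE = NORMAL | {"which"}
```

With the detected language set to English, the Spanish stop words "en" and "la" survive, which mirrors the behavior described for mixed-language fields.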

\includegraphics[]{images/sources/stopwords-configuration} — Figure 6.8 Stop words configuration options

6.8.4 Max. n-grams

The Max. n-grams selector allows you to choose the maximum n-gram size to consider for your text analysis. An n-gram is a frequent sequence of n terms found in the text. For example, “market” is a unigram (n-gram of size one), “prime minister” is a bigram (n-gram of size two), “Happy New Year” is a trigram (n-gram of size three), and so on. If you choose to keep stop words, they will be considered for the n-grams. You can select from unigrams up to five-grams.
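As a rough sketch of what a maximum n-gram size means, the function below lists every contiguous sequence of up to `max_n` tokens. It only enumerates candidates; how BigML decides which sequences are frequent enough to keep is not covered here, and the function name is hypothetical.

```python
def ngrams(tokens, max_n):
    """Return all contiguous n-grams of size 1 to max_n, smallest sizes first."""
    result = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[start:start + n]))
    return result

tokens = ["happy", "new", "year"]
print(ngrams(tokens, 1))  # ['happy', 'new', 'year']
print(ngrams(tokens, 3))
# ['happy', 'new', 'year', 'happy new', 'new year', 'happy new year']
```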

\includegraphics[]{images/sources/ngrams-configuration} — Figure 6.9 n-grams configuration options

6.8.5 Stemming

BigML can either treat every distinct word as a separate value or apply stemming, so that words with the same root are considered a single value. For example, if stemming is enabled, the words great, greatly, and greatness are considered the same value instead of three different values. This option is enabled by default.
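The effect of stemming on term counts can be shown with a deliberately naive suffix stripper. This toy function is an assumption made for illustration only; it is not the stemmer BigML uses, and real stemmers apply far more elaborate, language-specific rules.

```python
from collections import Counter

# Toy suffix list, just enough for the example in the text.
SUFFIXES = ("ness", "ly")

def toy_stem(word):
    """Strip one known suffix, keeping at least three characters of the root."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

words = ["great", "greatly", "greatness"]
print(Counter(words))                         # three different values
print(Counter(toy_stem(w) for w in words))    # Counter({'great': 3})
```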

\includegraphics[]{images/sources/stemming-configuration} — Figure 6.10 Stemming configuration

6.8.6 Case sensitivity

Specify whether you want BigML to differentiate words that differ only in upper or lower case. If you enable the case sensitivity option, such terms are kept apart, e.g., “House” and “house” are considered two different terms. This option is disabled by default.
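A minimal sketch of the difference, assuming that case-insensitive analysis amounts to lowercasing terms before counting them (the function name and parameter are hypothetical):

```python
from collections import Counter

def term_counts(tokens, case_sensitive=False):
    """Count terms, folding case unless case sensitivity is requested."""
    if not case_sensitive:
        tokens = [t.lower() for t in tokens]
    return Counter(tokens)

tokens = ["House", "house"]
print(term_counts(tokens))                       # Counter({'house': 2})
print(term_counts(tokens, case_sensitive=True))  # two separate terms
```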

\includegraphics[]{images/sources/sensitivity-configuration} — Figure 6.11 Case sensitivity configuration

6.8.7 Filter terms

You can exclude certain terms from your text analysis. BigML provides the following options:

Non-dictionary words: this option excludes terms that are unusual in the provided language. For this filter, BigML uses its own custom dictionaries that are composed of different sources such as online word lists, parses of Wikipedia, movie scripts, etc. These sources may vary depending on the language. The words in our dictionaries might contain terms like slang, abbreviations, proper names, etc. depending on whether or not these words are common enough to be found in our internet sources.
Non-language characters: this option excludes terms containing uncommon characters for words in the provided language. For example, if the language is Russian, all terms containing non-Cyrillic characters will be filtered out. Numeric digits will be considered non-language characters regardless of language.
HTML keywords: this option excludes JavaScript/HTML keywords commonly seen in HTML documents.
Numeric digits: this option excludes any term that contains a numeric digit in [0-9].
Single tokens: this option excludes terms that contain only a single token, i.e., unigrams. Only bigrams, trigrams, four-grams, five-grams and/or full terms will be considered (at least one of these options needs to be selected, otherwise the single token filter will be disabled).
Specific terms: this is a free text option where you can write any term or group of terms to be excluded from your text analysis.
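Three of these filters (numeric digits, single tokens, and specific terms) are simple enough to sketch directly. The function and parameter names below are hypothetical, matching specific terms case-insensitively is an assumption, and the rule that the single token filter is disabled when no larger n-grams or full terms are selected is not modeled.

```python
import re

def has_digit(term):
    """True when the term contains any digit in [0-9]."""
    return re.search(r"[0-9]", term) is not None

def is_single_token(term):
    """True when the term is a unigram (one whitespace-separated token)."""
    return len(term.split()) == 1

def filter_terms(terms, numeric_digits=False, single_tokens=False, specific_terms=()):
    """Return the terms that survive the enabled filters."""
    excluded = {t.lower() for t in specific_terms}  # assumption: ignore case
    kept = []
    for term in terms:
        if numeric_digits and has_digit(term):
            continue
        if single_tokens and is_single_token(term):
            continue
        if term.lower() in excluded:
            continue
        kept.append(term)
    return kept

terms = ["market", "prime minister", "route 66", "b2b", "happy new year"]
print(filter_terms(terms, numeric_digits=True))
# ['market', 'prime minister', 'happy new year']
print(filter_terms(terms, single_tokens=True))
# ['prime minister', 'route 66', 'happy new year']
print(filter_terms(terms, specific_terms=["Market"]))
# ['prime minister', 'route 66', 'b2b', 'happy new year']
```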

\includegraphics[]{images/sources/filter-terms-configuration} — Figure 6.12 Filter terms