Sources with the BigML Dashboard
5.4 Text
Text fields (or string fields) hold character strings of arbitrary length. Many Machine Learning algorithms are designed to work only with numeric and categorical fields and cannot easily handle text fields. BigML takes a simple and reliable approach: it combines basic Natural Language Processing (NLP) techniques with a bag-of-words style of feature generation to include text fields within its modeling framework.
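To make the bag-of-words idea concrete, here is a minimal Python sketch. It is not BigML's implementation: the function name bag_of_words, the naive tokenizer, and the example sentences are all assumptions chosen for illustration. It only shows how free text can be turned into numeric columns, one per term, that a model can consume.

```python
import re
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, rows), where each row holds one document's term counts."""
    # Naive tokenizer (an assumption): lowercase, then keep runs of
    # letters, digits, and apostrophes.
    tokenized = [re.findall(r"[a-z0-9']+", doc.lower()) for doc in documents]

    # One feature per distinct term, in a stable (sorted) order.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for terms that do not occur in this document.
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


# Hypothetical example texts.
docs = ["The flight was late", "Late again, the worst flight"]
vocabulary, rows = bag_of_words(docs)
for term, column in zip(vocabulary, zip(*rows)):
    print(term, column)
```

Word order is discarded in this representation; only the presence and frequency of each term survive, which is what makes the approach simple and also what limits it.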
Text fields are specially processed by BigML using the configuration options explained in Chapter 6.
First, BigML performs some basic language detection. BigML recognizes texts in Arabic, Catalan, Chinese, Czech, Danish, Dutch, English, Farsi/Persian, Finnish, French, German, Hungarian, Italian, Japanese, Korean, Polish, Portuguese, Turkish, Romanian, Russian, Spanish, and Swedish. Please let the Support Team at BigML know if you want BigML to add your language.
BigML can also perform case-sensitive or case-insensitive analyses, remove stop words before processing the text, search for n-grams in the text, apply basic stemming, and apply different filters to your text fields. Finally, it can use different tokenization strategies. All these options are described in Chapter 6.
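The effect of some of these options can be sketched in a few lines of Python. This is an illustration only, not BigML's code or API: the function extract_terms, its parameter names, and the tiny stop-word list are hypothetical. Stemming, filters, and alternative tokenization strategies are left out to keep the sketch short.

```python
import re

# Tiny illustrative stop-word list (an assumption, not BigML's list).
STOP_WORDS = {"the", "a", "an", "is", "was", "to", "of", "and"}


def extract_terms(text, case_sensitive=False, remove_stop_words=True, max_n=2):
    """Return the 1-grams up to max_n-grams found in text."""
    if not case_sensitive:
        text = text.lower()

    # Split into word-like tokens.
    tokens = re.findall(r"[A-Za-z0-9']+", text)

    if remove_stop_words:
        # Compare in lowercase so stop words are dropped in either case mode.
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]

    terms = []
    for n in range(1, max_n + 1):
        # Slide a window of n consecutive tokens over the token list.
        for i in range(len(tokens) - n + 1):
            terms.append(" ".join(tokens[i:i + n]))
    return terms


print(extract_terms("The flight was delayed"))
print(extract_terms("The Flight", case_sensitive=True,
                    remove_stop_words=False, max_n=1))
```

In this sketch n-grams are built after stop words are removed, so "flight was delayed" yields the bigram "flight delayed". That ordering is a design choice of the example; a real system may form n-grams differently.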
The icon that BigML uses to refer to text fields is shown in Figure 5.7.
Figure 5.8 is an example of a CSV file with a text field. It has two fields: the first is the text of a tweet directed to an airline, and the second is a label that represents its sentiment (i.e., positive, negative, or neutral). If you create a source with that file, BigML will automatically assign the types text and categorical, respectively, as shown in Figure 5.9.
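A file with this shape can be sketched with Python's standard csv module. The rows and the header names text and sentiment below are invented for illustration and are not taken from the file in Figure 5.8. The point of the sketch is that free text often contains commas, so the text column must be quoted for the file to parse back into exactly two fields.

```python
import csv
import io

# Hypothetical rows, invented for illustration only.
rows = [
    ("@airline my flight was cancelled, and nobody told me", "negative"),
    ("@airline thanks for the quick rebooking!", "positive"),
    ("@airline what time does check-in open?", "neutral"),
]

# Write to an in-memory buffer; QUOTE_MINIMAL quotes only the fields
# that need it, such as a tweet containing a comma.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL)
writer.writerow(["text", "sentiment"])  # header names are assumptions
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading it back yields two fields per row despite the embedded comma.
parsed = list(csv.reader(io.StringIO(csv_text)))
print(parsed[1])
```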