Sources with the BigML Dashboard

5.4 Text

Text fields (or string fields) hold character strings of arbitrary length. Many Machine Learning algorithms are designed to work only with numeric and categorical fields and cannot easily handle text fields. BigML takes a simple and reliable approach: it combines basic Natural Language Processing (NLP) techniques with a bag-of-words style of feature generation to include text fields within its modeling framework.
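To make the bag-of-words idea concrete, here is a minimal sketch in plain Python. It is not BigML's implementation; the function name bag_of_words, the tokenizing regular expression, and the sample sentences are assumptions chosen for illustration. Each document becomes a vector of term counts over a vocabulary shared by all documents, which is the kind of numeric representation a model can consume.

```python
import re
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, vectors): one term-count vector per document.

    Illustrative only: lowercases the text and splits on anything that is
    not a letter, digit, or apostrophe. This is an assumed tokenizer, not
    the one BigML uses.
    """
    tokenized = [re.findall(r"[a-z0-9']+", doc.lower()) for doc in documents]
    # The vocabulary is the sorted set of all tokens seen in any document.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for terms that do not occur in this document.
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


docs = ["I love Jet Blue", "you guys lost my luggage", "love my luggage"]
vocabulary, vectors = bag_of_words(docs)
for doc, vector in zip(docs, vectors):
    print(vector, doc)
```

Word order is discarded in this representation: only the presence and frequency of each term survive, which is what keeps the approach simple.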

Text fields are specially processed by BigML using the configuration options explained in Chapter 6.

First, BigML performs some basic language detection. BigML recognizes texts in Arabic, Catalan, Chinese, Czech, Danish, Dutch, English, Farsi/Persian, Finnish, French, German, Hungarian, Italian, Japanese, Korean, Polish, Portuguese, Turkish, Romanian, Russian, Spanish, and Swedish. Please let the Support Team at BigML know if you want BigML to add your language.

BigML can also perform case-sensitive or case-insensitive analyses, remove stop words before processing the text, search for n-grams in the text, apply basic stemming, and apply different filters to your text fields. Finally, it can use different tokenization strategies. All these options are described in Chapter 6.
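The effect of some of these options can be sketched in a few lines of Python. This is an illustration under stated assumptions, not BigML's processing pipeline: the function extract_terms, its parameter names, and the tiny stop-word list are hypothetical, and stemming and filters are left out. The sketch shows case folding, stop-word removal, and n-gram extraction.

```python
import re

# A tiny illustrative stop-word list (an assumption, not BigML's list).
STOP_WORDS = {"a", "an", "the", "is", "it", "on", "for", "to", "i", "your"}


def extract_terms(text, case_sensitive=False, remove_stop_words=True, max_ngram=2):
    """Return the terms (1-grams up to max_ngram-grams) found in text."""
    if not case_sensitive:
        text = text.lower()
    # Simple tokenization: runs of letters, digits, and apostrophes.
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    if remove_stop_words:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    terms = []
    for n in range(1, max_ngram + 1):
        # Slide a window of n consecutive tokens over the token list.
        for i in range(len(tokens) - n + 1):
            terms.append(" ".join(tokens[i:i + n]))
    return terms


print(extract_terms("I appreciate your prompt response"))
print(extract_terms("I LOVE JET BLUE!", case_sensitive=True, max_ngram=1))
```

In this sketch the n-grams are built after stop words are removed, so a bigram can join two words that were not adjacent in the original text. That ordering is a design choice of the example only.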

The icon that BigML uses to refer to text fields is shown in Figure 5.7.

\includegraphics[width=2cm]{images/sources/text} — Figure 5.7 Text field icon

Figure 5.8 is an example of a CSV file with a text field. It has two fields: the first one is the text of a tweet directed to an airline, and the second one is a label that represents a sentiment (i.e., positive, negative, or neutral). If you create a source with that file, BigML will automatically assign the types text and categorical, as shown in Figure 5.9.

tweet, sentiment
@united is it on a flight now? Thanks for reply.,neutral
"@united Actually, the flight was just Cancelled Flightled!
http://t.co/Qf0Oc2HqeZ",negative
@JetBlue going to San Juan!,neutral
@united flights taking off from IAD this afternoon?,neutral
@JetBlue I LOVE JET BLUE!,positive
@JetBlue thanks. I appreciate your prompt response.,positive
"@united diverged to Burlington, Vermont. This sucks.",negative
@SouthwestAir and thx for not responding,negative
@AmericanAir  @SouthwestAir  — Y'all will like this one.
http://t.co/hF8aJZ4ffl,neutral
@USAirways you guys lost my luggage,negative

Figure 5.8 An excerpt from an example CSV file with a text field
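Before uploading a file like this one, it can help to check how a standard CSV parser reads it. The sketch below uses Python's csv module on a few rows taken from the excerpt; it is a local sanity check only and says nothing about how BigML parses files or infers field types. It shows that a quoted tweet containing a comma or a line break is still read as a single field, and that the sentiment column holds only a handful of distinct values.

```python
import csv
import io

# A few rows copied from the excerpt; quoted fields contain commas
# and, in the last row, a line break.
sample = '''tweet, sentiment
@JetBlue going to San Juan!,neutral
"@united diverged to Burlington, Vermont. This sucks.",negative
"@united Actually, the flight was just Cancelled Flightled!
http://t.co/Qf0Oc2HqeZ",negative
'''

# skipinitialspace=True drops the blank after the comma in the header row.
reader = csv.DictReader(io.StringIO(sample), skipinitialspace=True)
rows = list(reader)

for row in rows:
    print(repr(row["tweet"]), "->", row["sentiment"])

# The label column has few distinct values; the tweet column is free text.
distinct_sentiments = sorted({row["sentiment"] for row in rows})
print(distinct_sentiments)
```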

\includegraphics[]{images/sources/source-example-text-field} — Figure 5.9 An example of a source with a text field