Sources with the BigML Dashboard

1 Introduction

BigML is consumable, programmable, and scalable Machine Learning software that helps solve Classification, Regression, Cluster Analysis, Anomaly Detection, and Association Discovery problems, using a number of patent-pending technologies.

BigML helps you address these problems end-to-end. That is, you can seamlessly transform data into actionable predictive models, and later use these models (either as remote services or locally embedded into your applications) to make predictions.

To be processed by BigML, your data must first be in Machine Learning-Ready Format (see section 1.1) and stored in a data source (a source for short). A source is a collection of Instances of the Entity that you want to model, stored in tabular format in a computer file. Typically, each row of a source represents one instance and each column represents a Field of the entity (see Figure 1.6). Section 1.1 describes the structure BigML expects a source to have. BigML Sources come in different types and data formats, and BigML can process several file formats. They are all covered in Chapter 2.
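
The row-and-column layout described above can be illustrated with a short Python sketch. It is not part of BigML; the sample data, the field names, and the helper read_source are hypothetical and only show how a tabular file maps to instances and fields.

```python
import csv
import io

# Hypothetical sample data: the first row names the fields,
# every following row is one instance of the entity.
RAW = """sepal length,sepal width,species
5.1,3.5,setosa
4.9,3.0,setosa
6.3,3.3,virginica
"""


def read_source(text):
    """Return (field_names, instances) parsed from CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(text))
    instances = list(reader)  # one dict per row, keyed by field name
    return reader.fieldnames, instances


fields, instances = read_source(RAW)
print("Fields:", fields)
print("Number of instances:", len(instances))
```

Each dictionary in instances corresponds to one row, and each key corresponds to one column, which mirrors the instance and field terminology used in this document.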

Every time a new source is brought to BigML, a corresponding BigML Source is created. Section 1.2 gives you a first example of how to create a BigML source. BigML uses the icon in Figure 1.1 to represent a BigML source.

\includegraphics[width=2cm]{images/sources/source}
Figure 1.1 Source icon

The main purpose of BigML sources is to make sure that BigML parses and interprets each instance in your source correctly. Checking this early can save you time before you proceed with any modeling that involves heavier computation. BigML analyzes the initial part of each source to automatically infer the type of each field. BigML accepts fields of the following types: numeric, categorical, date-time, text, and items. These types are explained in detail in Chapter 5. The BigML Dashboard lets you update each field type individually to fix those cases in which BigML does not recognize the type of a field correctly (see section 6.11). The BigML Dashboard also allows you to configure many other settings to ensure that your sources are correctly parsed. Chapter 6 describes all the available settings.
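
To give an intuition for what inferring a field type from the initial part of a source can involve, here is a minimal Python sketch. It is an illustrative heuristic only and not BigML's actual inference logic: the function infer_type, the single date format it accepts, and the word-count threshold for text are all assumptions, and the items type is left out for brevity.

```python
from datetime import datetime


def _is_number(value):
    try:
        float(value)
        return True
    except ValueError:
        return False


def _is_date(value, fmt="%Y-%m-%d"):
    # Assumption: only ISO-style dates are recognized in this sketch.
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def infer_type(values, text_min_words=4):
    """Guess a field type from a sample of raw string values."""
    sample = [v.strip() for v in values if v.strip()]  # ignore missing values
    if not sample:
        return "categorical"
    if all(_is_number(v) for v in sample):
        return "numeric"
    if all(_is_date(v) for v in sample):
        return "date-time"
    # Assumption: long free-form values suggest text rather than categories.
    if any(len(v.split()) >= text_min_words for v in sample):
        return "text"
    return "categorical"


print(infer_type(["1", "2.5", "-3"]))
print(infer_type(["2020-01-31", "2021-12-01"]))
print(infer_type(["red", "blue", "red"]))
```

A heuristic like this can misjudge a field, for example a numeric code that is really a category, which is why being able to override the inferred type for individual fields is useful.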

BigML is able to ingest sources from three different origins:

  • Local Sources that are accessible on your local computer. (See Chapter 7.)

  • Remote Sources that can be accessed using different transfer protocols or by configuring different cloud storage providers. (See Chapter 8.)

  • Inline Sources that can be created using a simple editor provided by the BigML Dashboard. (See Chapter 9.)

The first tab of the BigML Dashboard’s main menu allows you to list all your available sources. When you first create an account at BigML, you will find a list of promotional BigML sources. In this source list view (see Figure 1.2), you can see, for each source, the Type, Name, Age (time since the BigML source was created), Size, and Number of Datasets that have been created using that BigML source.

\includegraphics[]{images/sources/source-listing}
Figure 1.2 Source list view

In the top right corner of the source list view, you can see the menu options shown in Figure 1.3.

\includegraphics[width=0.5\textwidth ]{images/sources/source-listing-menu-options}
Figure 1.3 Menu options of the source list view

These menu options perform the following operations (from right to left):

  1. Create a source from a local source opens a file dialog that helps you browse files on your local drives. (See Chapter 7.)

  2. Create a source from a URL opens a modal window where you can input the URL that BigML will use to automatically download a remote source. (See Chapter 8.)

  3. Create an inline source opens an editor where you can directly type or paste data. (See Chapter 9.)

  4. Cloud Storage Drop Down helps you browse through previously configured cloud storage providers. (See subsection 8.7.1.)

  5. Search lets you search your sources by name.

By default, every time you start a new Project, your list of sources will be empty. (See Figure 1.4.)

\includegraphics[width=\textwidth ]{images/sources/empty-listing}
Figure 1.4 Empty Dashboard sources view

BigML does not impose any limit on the number of sources you can have under an individual BigML account or Project. In addition, there are no limits on either the number of instances or the number of fields per source, though there are some limits on the total size a source can have, as explained in Chapter 10.

Each BigML source has a Name, a Description, a Category, and Tags. These allow you to provide documentation, and can also be helpful when searching through your sources. More details are in Chapter 11.

A BigML source can be associated with a specific project. You can move a source between projects. (See Chapter 13.) A source can also be deleted permanently from your account. (See Chapter 14.)

A BigML source is the first Resource that you need to create to apply Machine Learning to your own data using BigML. The only direct operation you can perform on a BigML source is creating a BigML Dataset. BigML makes a clear distinction between sources and datasets: a BigML source lets you ensure that BigML correctly transfers, parses, and interprets the content of your data, while a BigML dataset is a structured version of your data with basic statistics computed for each field. The main purpose of BigML sources is, therefore, to give you configuration options to ensure that your data is parsed correctly. For a detailed explanation of BigML datasets, read the Datasets with the BigML Dashboard document [23].