Sources with the BigML Dashboard

8.8 External Data Stores

You can create BigML Sources directly from external data stores, that is, databases or search engines. For databases, BigML supports PostgreSQL, MySQL, and SQL Server, with support for Spark SQL planned for the near future. BigML also integrates with the Elasticsearch engine, so you can create Sources directly from it as well.

To create Sources, you will first need to establish connections to external data stores by creating external connectors.

In the BigML Dashboard, under the Sources tab, you can find a database icon with a dropdown for external data stores as shown below:

\includegraphics[]{images/sources/import-source-from-datastore}
Figure 8.23 Menu options to create a source from an external data store

After choosing the desired data store from the dropdown list, you can either select an existing external connector or create a new one. You can also switch to a different data store here.

\includegraphics[]{images/sources/import-source-select-connector}
Figure 8.24 Options to configure an external connector for a datastore

8.8.1 Creating a New External Connector for Data Stores

To create a new external connector for a data store, you will need to provide the necessary connection information, which varies depending on the data store.

For databases such as PostgreSQL, MySQL, and SQL Server, you will need to provide the Host name, the Port number, the authentication credentials (Username and Password), and the name of the Database you want to import. You can also specify whether to use SSL and whether to verify certificates.

\includegraphics[]{images/sources/import-source-new-db-connector}
Figure 8.25 Form to create a new external connector for a datastore

For Elasticsearch, you will also need to provide the Host name and the authentication credentials. Instead of a Database name, you can optionally provide the Index or Alias to search, as shown here:

\includegraphics[]{images/sources/import-source-new-es-connector}
Figure 8.26 Form to create a new external connector for Elasticsearch

8.8.2 Editing an External Connector for Data Stores

You can edit an existing external connector by clicking on the pencil icon next to the connector you’ve selected.

\includegraphics[]{images/sources/import-source-click-edit-connector}
Figure 8.27 Click to edit an external connector for a data store

You then can modify the parameters of the external connector:

\includegraphics[]{images/sources/import-source-edit-connector}
Figure 8.28 Edit an external connector for a database

Here is an example of editing the parameters of an external connector for Elasticsearch:

\includegraphics[]{images/sources/import-source-edit-es-connector}
Figure 8.29 Edit an external connector for Elasticsearch

8.8.3 Creating a Source from a Data Store

After selecting an external connector from the dropdown list, either an existing one or the one you just created, you will see a Query box and a Table list:

\includegraphics[]{images/sources/import-source-connector-selected}
Figure 8.30 After selecting an external connector for a datastore

Here you have two ways to import your data.

Creating Sources by Selecting Tables

The Table list contains all tables and views (where applicable) from your data store. By clicking on the checkboxes, you can select one or more tables and views, then click the Import button to bring them into your BigML account as BigML Sources. Each table is imported as a separate Source.

\includegraphics[]{images/sources/import-source-import-table}
Figure 8.31 Selecting tables to import

If you would like to take a look at the data from a given table before importing it, you can click on the table in the Table list for a preview. The preview shows the table columns and some sample data. Here is an example of a database table preview:

\includegraphics[]{images/sources/import-source-preview-table}
Figure 8.32 Previewing a database table

Creating Sources by Constructing Queries

You can also create a Source by writing a SQL query. This is useful when a plain table import is too simplistic: with a SQL select statement, you can select exactly the data you want, even from multiple tables and views.

If you only wish to import a subset of columns from a table, the query can be as simple as

select sepal_width, petal_width, species from iris2

When you preview a whole table, you can see the corresponding select statement in the Query box (see Figure 8.32):

select * from iris2

After writing the query in the Query box, click the Search button on the right to execute it. This verifies that the query is valid in your data store, and it shows a preview of the initial results so you can confirm your intentions before creating the BigML Source.

You can take full advantage of your data store's query language. Below is a more advanced example: a select statement with both a join and a group-by clause. In the database, one table contains school district information and another contains individual teacher statistics. The select statement creates a BigML Source with information about school districts, including the average teacher salary in each district:

select d.graduation_rate, d.students district_size,
      avg(t.salary) average_teacher_salary
from district d, teacher t
where t.district_id = d.id
group by d.id, d.graduation_rate, d.students
\includegraphics[]{images/sources/import-source-advanced-query}
Figure 8.33 Importing by using an advanced query

8.8.4 Creating a Source from a Data Store via API

Creating Sources from external data stores is fully supported in the BigML API. You can use the API to programmatically create, retrieve, update, delete, and list external connectors, and then use them to import data.
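As a sketch of what creating an external connector programmatically could look like, the following curl call mirrors the source-creation examples below. The endpoint name (externalconnector) and the PostgreSQL connection values shown are illustrative assumptions; consult the BigML API documentation for the exact resource name and schema.

```shell
# Hypothetical sketch: create a reusable external connector resource.
# Endpoint name and connection values are assumptions, patterned on the
# source-creation calls shown later in this section.
curl "https://bigml.io/externalconnector?\$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"source": "postgresql",
         "connection": {
             "host": "db.example.com",
             "port": 5432,
             "database": "mydb",
             "user": "analyst",
             "password": "********"
         }}'
```

Once created, the connector can be reused across imports instead of repeating the connection details in every source-creation request.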

Just like in the Dashboard, you can import data by naming a table or by providing a custom query. Here is an example using curl that imports a “Sales” table as a BigML Source:

curl "https://bigml.io/source?\$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"external_data": {
                "source": "sqlserver",
                "connection": {
                    "host": "db.bigbox.com",
                    "port": 1433,
                    "database": "biztel",
                    "user": "autosource",
                    "password": "********"
                },
                "tables": "Sales"}}'

With the BigML API, you also have several options to control the imported data without writing a custom query. You can specify which fields to include or exclude, limit the number of records to import, and specify an offset together with an ordering that gives that offset meaning.
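A sketch of how those controls might be expressed is shown below, extending the “Sales” example above. The per-table option names used here (fields, order_by, offset, limit) are assumptions chosen to illustrate the options just described; check the BigML API documentation for the exact parameter names it accepts.

```shell
# Hypothetical sketch: import a subset of a table without a custom query.
# The option names inside the table object are assumptions, not confirmed
# BigML API parameters.
curl "https://bigml.io/source?\$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"external_data": {
                "source": "sqlserver",
                "connection": {
                    "host": "db.bigbox.com",
                    "port": 1433,
                    "database": "biztel",
                    "user": "autosource",
                    "password": "********"
                },
                "tables": [{"name": "Sales",
                            "fields": ["region", "amount", "date"],
                            "order_by": "date",
                            "offset": 100,
                            "limit": 5000}]}}'
```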

For Elasticsearch, creating a BigML Source via the API is similar, as the following example shows:

curl "https://bigml.io/source?\$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"external_data": {
                "source": "elasticsearch",
                "connection": {
                    "host": "localhost",
                    "port": 9200,
                    "user": "username",
                    "password": "********"
                },
                "tables": "kibana_sample_data_logs"}}'

For more information on importing data directly through the BigML API, please refer to the documentation.