Sources with the BigML Dashboard

1.1 Machine Learning-Ready Format

A data source is in Machine Learning-ready (ML-ready) format when a collection of instances of the Entity you want to model has been transformed into tabular format (see Figure 1.5) in order to solve a specific Machine Learning task (i.e., classification, regression, cluster analysis, anomaly detection, or association discovery).

Getting your data into ML-ready format requires:

  1. Selecting a modeling task appropriate to your needs.

  2. Denormalizing, aggregating, pivoting, and other data wrangling tasks to generate a suitable “feature space” for your selected modeling task.

  3. Using domain knowledge and Machine Learning expertise to generate additional features that help better represent the instances.

  4. Choosing the right file format to store each feature as a field and each instance as a record in a tabular structure. Each row represents one instance, and each column represents a field that describes all the instances. Each field can be numeric, categorical, text, items, or date-time. (See Chapter 5.)
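
The steps above can be illustrated with a minimal sketch using only the Python standard library. It is not part of BigML. The customer and order tables, the field names, and the helper functions `build_ml_ready_rows` and `to_csv` are all hypothetical, chosen only to show how normalized data might be aggregated into one row per instance before upload.

```python
import csv
import io
from collections import defaultdict

# Hypothetical normalized data: one record per customer (the Entity to model)
# and many order records per customer.
customers = [
    {"customer_id": "c1", "plan": "basic"},
    {"customer_id": "c2", "plan": "premium"},
]
orders = [
    {"customer_id": "c1", "amount": 20.0},
    {"customer_id": "c1", "amount": 35.0},
    {"customer_id": "c2", "amount": 120.0},
]


def build_ml_ready_rows(customers, orders):
    """Aggregate orders per customer and join the result to each customer."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for order in orders:
        totals[order["customer_id"]] += order["amount"]
        counts[order["customer_id"]] += 1

    rows = []
    for customer in customers:
        cid = customer["customer_id"]
        rows.append({
            "customer_id": cid,
            "plan": customer["plan"],     # categorical field
            "order_count": counts[cid],   # numeric field (aggregated)
            "total_spent": totals[cid],   # numeric field (aggregated)
        })
    return rows


def to_csv(rows):
    """Serialize rows as CSV text: one header line, then one line per instance."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = build_ml_ready_rows(customers, orders)
csv_text = to_csv(rows)
print(csv_text)
```

Each resulting row describes exactly one customer, and each column holds one field for all customers, which is the tabular structure described in step 4. Real datasets would typically need further feature engineering guided by domain knowledge (step 3).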

\includegraphics[width=0.5\textwidth ]{images/sources/instances-vs-fields}
Figure 1.5 Instances and fields in tabular format

By structuring your data into ML-ready format before uploading it to BigML, you will be better prepared to make the most of BigML's capabilities, discover more insightful patterns, and build better predictive models.