Association Discovery with the BigML Dashboard


2.2 How to Structure your Data

Association Discovery models require the data to be structured in a specific way. The Items section of the Sources with the BigML Dashboard document [22] introduces the items field (a field that contains an arbitrary number of items, i.e., categories or labels). This section shows some data structures that lend themselves particularly well to Association Discovery.

It is common in Association Discovery to have a large number of different values per instance, e.g., a commercial dataset containing transactions with all the products bought by each customer, or a medical dataset containing all the medicines prescribed to each patient.

See Figure 2.1 for an example of transactional data in a CSV file, where each transaction ID is associated with a set of purchased products.

trans-ID/12345, product_A, product_B, product_C, product_D
trans-ID/67890, product_A, product_E
trans-ID/98540, product_B, product_C, product_F

Figure 2.1 Example of transactional data

The transactional data from Figure 2.1 can be structured in several ways:

Binary data representation:

Trans-ID	prod_A	prod_B	prod_C	prod_D	prod_E	prod_F
12345	1	1	1	1	0	0
67890	1	0	0	0	1	0
98540	0	1	1	0	0	1

Table 2.1 Example of binary representation for transactional data
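As an illustration only, the binary representation can be derived from a list of items per transaction. The sketch below is plain Python, not BigML functionality. The `transactions` dictionary and the `to_binary_rows` helper are hypothetical names introduced for this example, and the transaction IDs follow Table 2.1.

```python
# Transactions keyed by ID; the IDs follow Table 2.1 (illustrative data).
transactions = {
    "12345": ["prod_A", "prod_B", "prod_C", "prod_D"],
    "67890": ["prod_A", "prod_E"],
    "98540": ["prod_B", "prod_C", "prod_F"],
}


def to_binary_rows(transactions):
    """Return (header, rows) with one 0/1 column per distinct product."""
    # Collect every distinct product and sort for a stable column order.
    products = sorted({p for items in transactions.values() for p in items})
    header = ["Trans-ID"] + products
    rows = []
    for trans_id, items in transactions.items():
        present = set(items)
        # 1 if the product was bought in this transaction, 0 otherwise.
        rows.append([trans_id] + [1 if p in present else 0 for p in products])
    return header, rows


header, rows = to_binary_rows(transactions)
for row in [header] + rows:
    print("\t".join(str(cell) for cell in row))
```

Note that this layout needs one column per distinct product, so it grows wide quickly when the catalog is large.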

Vertical data layout:

Trans-ID	1st_prod	2nd_prod	3rd_prod	4th_prod
12345	prod_A	prod_B	prod_C	prod_D
67890	prod_A	prod_E
98540	prod_B	prod_C	prod_F

Table 2.2 Example of vertical layout for transactional data
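A layout like the one in Table 2.2 can be produced by padding each transaction to the length of the longest one. The following is a minimal sketch in plain Python, not BigML functionality. The `ordinal` and `to_positional_rows` helpers are hypothetical, empty strings are assumed to stand for missing positions, and the transaction IDs follow Table 2.1.

```python
# Illustrative data; the IDs follow Table 2.1.
transactions = {
    "12345": ["prod_A", "prod_B", "prod_C", "prod_D"],
    "67890": ["prod_A", "prod_E"],
    "98540": ["prod_B", "prod_C", "prod_F"],
}


def ordinal(n):
    """Return '1st', '2nd', '3rd', '4th', ... for a positive integer."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def to_positional_rows(transactions):
    """Return (header, rows) with one column per item position."""
    width = max(len(items) for items in transactions.values())
    header = ["Trans-ID"] + [f"{ordinal(i)}_prod" for i in range(1, width + 1)]
    rows = []
    for trans_id, items in transactions.items():
        # Pad shorter transactions with empty strings (an assumption here).
        padding = [""] * (width - len(items))
        rows.append([trans_id] + list(items) + padding)
    return header, rows


header, rows = to_positional_rows(transactions)
```

The number of columns depends on the longest transaction, so shorter transactions carry empty cells.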

Horizontal data layout:

Trans-ID	Products
12345	product_A, product_B, product_C, product_D
67890	product_A, product_E
98540	product_B, product_C, product_F

Table 2.3 Example of horizontal layout for transactional data

The ideal way to structure your data for Association Discovery is the one shown in the horizontal data layout example. With this data structure, the field “Products” will be considered an items field, and each product will be a unique item.

Note: you need to separate your items with a single, consistent separator (e.g., the items in the example above are separated by a comma).
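As a rough sketch of how a file in the horizontal layout might be prepared, the Python example below writes one row per transaction with all items joined into a single "Products" column. It uses only the standard `csv` module and is not BigML functionality. The `write_horizontal_csv` and `read_items` helpers and the `ITEM_SEPARATOR` constant are hypothetical names, and the transaction IDs follow Table 2.1. Because the item separator here is also the CSV delimiter, the `csv` writer quotes the "Products" field so that it stays in one column.

```python
import csv
import io

ITEM_SEPARATOR = ", "  # item separator used in Table 2.3

# Illustrative data; the IDs follow Table 2.1.
transactions = {
    "12345": ["product_A", "product_B", "product_C", "product_D"],
    "67890": ["product_A", "product_E"],
    "98540": ["product_B", "product_C", "product_F"],
}


def write_horizontal_csv(transactions, separator=ITEM_SEPARATOR):
    """Return CSV text with one row per transaction and one items column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Trans-ID", "Products"])
    for trans_id, items in transactions.items():
        # Fields containing the CSV delimiter are quoted automatically.
        writer.writerow([trans_id, separator.join(items)])
    return buffer.getvalue()


def read_items(csv_text, separator=ITEM_SEPARATOR):
    """Parse the CSV text back into a mapping of ID to list of items."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["Trans-ID"]: row["Products"].split(separator) for row in reader}


csv_text = write_horizontal_csv(transactions)
round_trip = read_items(csv_text)
print(csv_text)
```

Choosing a separator that never appears inside an item name keeps the split unambiguous when the items are read back.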