Association Discovery with the BigML Dashboard
2.2 How to Structure your Data
Association Discovery models require the data to be structured in a specific way. The Items section of the Sources with the BigML Dashboard document [22] introduces the items field, which is used when a field contains an arbitrary number of items (i.e., categories or labels). This section shows some data structures that lend themselves particularly well to Association Discovery.
It is common in Association Discovery to have a large number of different values per instance, e.g., a commercial dataset containing transactions with all the products bought by each customer, or a medical dataset containing all the medicines prescribed to each patient.
See Figure 2.1 for an example of transactional data in a CSV file, where each transaction ID is associated with a set of purchased products.
The transactional data from Figure 2.1 can be structured in several ways:
Binary data representation:
Trans-ID  prod_A  prod_B  prod_C  prod_D  prod_E  prod_F
12345     1       1       1       1       0       0
67890     1       0       0       0       1       0
98540     0       1       1       0       0       1
Vertical data layout:
Trans-ID  1st_prod  2nd_prod  3rd_prod  4th_prod
12345     prod_A    prod_B    prod_C    prod_D
67890     prod_A    prod_E
98540     prod_B    prod_C    prod_F
Horizontal data layout:
Trans-ID  Products
12345     product_A, product_B, product_C, product_D
67890     product_A, product_E
98540     product_B, product_C, product_F
The ideal way to structure your data for Association Discovery is the horizontal data layout. With this structure, the “Products” field is treated as an items field, and each distinct product becomes a unique item.
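If the data is already stored in the binary or vertical form, it can be reshaped into the horizontal layout with a few lines of code. The following is a minimal sketch in plain Python, not part of BigML: the function names, the ", " separator, and the sample rows (which mirror the three transactions of the binary table) are assumptions made for illustration.

```python
# Sketch: reshape the binary and vertical layouts into the horizontal layout
# (one items field per transaction). All names here are hypothetical.

ITEM_SEPARATOR = ", "  # assumed separator, matching the horizontal example


def binary_to_horizontal(header, rows, separator=ITEM_SEPARATOR):
    """Turn 0/1 indicator columns into one separated items string per row."""
    id_column, *product_columns = header
    result = [[id_column, "Products"]]
    for trans_id, *flags in rows:
        # Keep the column name of every product flagged with 1.
        items = [name for name, flag in zip(product_columns, flags) if int(flag) == 1]
        result.append([trans_id, separator.join(items)])
    return result


def vertical_to_horizontal(header, rows, separator=ITEM_SEPARATOR):
    """Collapse the 1st_prod, 2nd_prod, ... columns, skipping empty cells."""
    result = [[header[0], "Products"]]
    for trans_id, *cells in rows:
        items = [cell for cell in cells if cell]
        result.append([trans_id, separator.join(items)])
    return result


binary_header = ["Trans-ID", "prod_A", "prod_B", "prod_C", "prod_D", "prod_E", "prod_F"]
binary_rows = [
    ["12345", 1, 1, 1, 1, 0, 0],
    ["67890", 1, 0, 0, 0, 1, 0],
    ["98540", 0, 1, 1, 0, 0, 1],
]

vertical_header = ["Trans-ID", "1st_prod", "2nd_prod", "3rd_prod", "4th_prod"]
vertical_rows = [
    ["12345", "prod_A", "prod_B", "prod_C", "prod_D"],
    ["67890", "prod_A", "prod_E", "", ""],
    ["98540", "prod_B", "prod_C", "prod_F", ""],
]

horizontal_from_binary = binary_to_horizontal(binary_header, binary_rows)
horizontal_from_vertical = vertical_to_horizontal(vertical_header, vertical_rows)

for row in horizontal_from_binary:
    print(row)
```

Both conversions yield the same rows for this sample, since the two input layouts describe the same transactions.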
Note: the items within a field must all be separated by the same separator (e.g., a comma in the horizontal data layout example).
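When the horizontal layout is written to a comma-delimited CSV file and the item separator is also a comma, the items field has to be quoted so that it stays a single column. The sketch below uses Python's standard csv module to do that, and also rejects any item that itself contains the separator. The helper name join_items and the sample data are hypothetical, and the sketch assumes a standard comma-delimited CSV file rather than any BigML-specific format.

```python
import csv
import io


def join_items(items, separator=","):
    """Join items into one field value, refusing items that contain the separator."""
    for item in items:
        if separator in item:
            raise ValueError(f"item {item!r} contains the separator {separator!r}")
    return separator.join(items)


# Hypothetical sample: the first two transactions of the horizontal example.
transactions = [
    ("12345", ["product_A", "product_B", "product_C", "product_D"]),
    ("67890", ["product_A", "product_E"]),
]

buffer = io.StringIO()
writer = csv.writer(buffer)  # default quoting wraps fields that contain a comma
writer.writerow(["Trans-ID", "Products"])
for trans_id, items in transactions:
    writer.writerow([trans_id, join_items(items)])

csv_text = buffer.getvalue()
print(csv_text)
```

Choosing a separator that cannot occur inside an item name (for example a semicolon or a pipe character when product names may contain commas) avoids the ValueError raised by this check.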