Association Discovery with the BigML Dashboard

2.2 How to Structure your Data

Association Discovery models require the data to be structured in a specific way. The Items section of the Sources with the BigML Dashboard document [22] introduces the items field (a field that contains an arbitrary number of items, i.e., categories or labels). This section shows some data structures that lend themselves particularly well to Association Discovery.

It is common in Association Discovery to have a large number of different values per instance, e.g., a commercial dataset containing transactions with all the products bought by each customer, or a medical dataset containing all the medicines prescribed to each patient.

See Figure 2.1 for an example of transactional data in a CSV file, where each transaction ID is associated with a set of purchased products.

trans-ID/12345, product_A, product_B, product_C, product_D
trans-ID/67890, product_A, product_E
trans-ID/98540, product_B, product_C, product_F
Figure 2.1 Example of transactional data
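
As a minimal sketch of how lines in the style of Figure 2.1 could be read before restructuring them, the following Python snippet parses each line into a transaction ID and its list of products. The function name `parse_transactions` and the sample text are illustrative assumptions, not part of BigML.

```python
import csv
import io

# Sample lines in the style of Figure 2.1 (illustrative data).
RAW = """\
trans-ID/12345, product_A, product_B, product_C, product_D
trans-ID/67890, product_A, product_E
trans-ID/98540, product_B, product_C, product_F
"""


def parse_transactions(text):
    """Return a list of (transaction_id, [items]) pairs, one per non-empty line."""
    transactions = []
    # skipinitialspace drops the blank that follows each comma.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for row in reader:
        if not row:
            continue
        # Keep only what follows the last "/" in the first cell.
        trans_id = row[0].rpartition("/")[2].strip()
        items = [cell.strip() for cell in row[1:] if cell.strip()]
        transactions.append((trans_id, items))
    return transactions


transactions = parse_transactions(RAW)
```

Each line may carry a different number of products, which is why the result keeps the items as a list rather than as fixed columns.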

The transactional data from Figure 2.1 can be structured in several ways:

  • Binary data representation:

    Trans-ID  prod_A  prod_B  prod_C  prod_D  prod_E  prod_F
    12345     1       1       1       1       0       0
    67890     1       0       0       0       1       0
    98540     0       1       1       0       0       1

    Table 2.1 Example of binary representation for transactional data
  • Vertical data layout:

    Trans-ID  1st_prod  2nd_prod  3rd_prod  4th_prod
    12345     prod_A    prod_B    prod_C    prod_D
    67890     prod_A    prod_E
    98540     prod_B    prod_C    prod_F

    Table 2.2 Example of vertical layout for transactional data
  • Horizontal data layout:

    Trans-ID  Products
    12345     product_A, product_B, product_C, product_D
    67890     product_A, product_E
    98540     product_B, product_C, product_F

    Table 2.3 Example of horizontal layout for transactional data
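
The layouts listed above can be derived from the same set of transactions. The sketch below, which assumes the transactions are already available as (ID, items) pairs, builds the binary representation and the horizontal layout. The helper names `to_binary` and `to_horizontal` are hypothetical and only serve the illustration.

```python
# Illustrative transactions as (ID, items) pairs.
transactions = [
    ("12345", ["product_A", "product_B", "product_C", "product_D"]),
    ("67890", ["product_A", "product_E"]),
    ("98540", ["product_B", "product_C", "product_F"]),
]


def to_binary(transactions):
    """Binary representation: one 0/1 column per distinct product."""
    columns = sorted({item for _, items in transactions for item in items})
    header = ["Trans-ID"] + columns
    rows = []
    for trans_id, items in transactions:
        present = set(items)
        rows.append([trans_id] + [1 if col in present else 0 for col in columns])
    return header, rows


def to_horizontal(transactions, separator=", "):
    """Horizontal layout: all products of a transaction joined in one field."""
    return [(trans_id, separator.join(items)) for trans_id, items in transactions]


binary_header, binary_rows = to_binary(transactions)
horizontal_rows = to_horizontal(transactions)
```

Note that the binary representation grows by one column for every distinct product, whereas the horizontal layout always needs only two fields.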

The ideal way to structure your data for Association Discovery is the horizontal data layout shown in Table 2.3. With this structure, the “Products” field will be considered an items field, and each product will be a unique item.

Note: you need to separate your items with a unique separator, i.e., a character that does not appear inside any item (e.g., the items in the example above are separated by commas).
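
The following sketch shows one way to write the horizontal layout to CSV text with Python's standard `csv` module while checking that the chosen separator never occurs inside an item. The function name `write_horizontal_csv` is hypothetical, and the quoting shown is simply what the `csv` module produces when a field contains commas; this section does not state which quoting the Dashboard expects, so treat that part as an assumption.

```python
import csv
import io


def write_horizontal_csv(transactions, separator=","):
    """Return CSV text with one 'Products' field per transaction."""
    # The separator must not occur inside any item, or the items
    # could not be told apart again later.
    for trans_id, items in transactions:
        for item in items:
            if separator in item:
                raise ValueError(
                    f"separator {separator!r} occurs in item {item!r} "
                    f"of transaction {trans_id}"
                )
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Trans-ID", "Products"])
    for trans_id, items in transactions:
        # The csv module quotes the field when it contains the CSV delimiter.
        writer.writerow([trans_id, separator.join(items)])
    return buffer.getvalue()


csv_text = write_horizontal_csv([
    ("12345", ["product_A", "product_B", "product_C", "product_D"]),
    ("67890", ["product_A", "product_E"]),
    ("98540", ["product_B", "product_C", "product_F"]),
])
```

If the items themselves may contain commas, a different separator such as `;` can be passed instead, in which case no field needs quoting.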