Association Discovery with the BigML Dashboard

2 Understanding Associations

This chapter describes the internals of BigML associations, providing the foundation needed to understand their configuration options. Association Discovery has been extensively researched over the last two decades. It is distinguished from existing statistical techniques for categorical association analysis in three respects:

  • Association Discovery techniques scale to high-dimensional data. The standard statistical approach to categorical association analysis, log-linear analysis, has complexity that is exponential in the number of variables. In contrast, Association Discovery techniques can typically handle many thousands of variables.

  • Association Discovery concentrates on discovering relationships between values rather than variables. This is a non-trivial distinction. If someone is told that there is an association between gender and some medical condition, they are likely to immediately wish to know which gender is positively associated with the condition and which is not. Association Discovery goes directly to this question of interest. Furthermore, associations between values, rather than variables, can be more powerful (i.e., discover weaker relationships) when variables have more than two values. Statistical techniques may have difficulty detecting an association when there are many values for each variable and two values are strongly associated, but there are only weak interactions among the remaining values.

  • Association Discovery focuses on finding associations that are useful for the user, whereas statistical techniques focus on controlling the risk of making false discoveries. In contexts where there are very large numbers of associations, it is critical to help users quickly identify which are the most important for their immediate applications.
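The difference between value-level and variable-level associations can be illustrated with a small sketch. The records, the `value_lifts` helper, and the use of lift as the measure are assumptions chosen for illustration; they are not taken from the BigML implementation. Lift compares how often two values occur together with how often they would co-occur if they were independent, so a lift above 1 suggests a positive association between those specific values.

```python
from collections import Counter

# Hypothetical toy records: (gender, condition) pairs.
records = (
    [("female", "yes")] * 30
    + [("female", "no")] * 20
    + [("male", "yes")] * 10
    + [("male", "no")] * 40
)


def value_lifts(rows):
    """Return the lift of every (left value, right value) pair observed in rows."""
    n = len(rows)
    pair_counts = Counter(rows)
    left_counts = Counter(left for left, _ in rows)
    right_counts = Counter(right for _, right in rows)
    return {
        (left, right): (count / n)
        / ((left_counts[left] / n) * (right_counts[right] / n))
        for (left, right), count in pair_counts.items()
    }


lifts = value_lifts(records)

# List value pairs from most to least positively associated.
for pair, lift in sorted(lifts.items(), key=lambda kv: -kv[1]):
    print(pair, round(lift, 2))
```

With these toy counts, the pair ("female", "yes") has a lift of roughly 1.5 and ("male", "yes") a lift of roughly 0.5, which answers directly which gender is positively associated with the condition rather than only reporting that the two variables are related.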

Historically, the main body of Association Discovery research has concentrated on developing efficient techniques for finding frequent itemsets, and has paid little attention to the questions of what types of associations are useful to find and how those types of associations might be found. The dominant association mining paradigm, frequent association mining, has significant limitations and often discovers so many spurious associations that it is next to impossible to identify the potentially useful ones.
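To make the frequent itemset idea concrete, the following is a minimal brute-force sketch, not an efficient miner and not BigML code. The transactions and the `frequent_itemsets` helper are hypothetical. An itemset is "frequent" when the fraction of transactions containing it (its support) reaches a minimum threshold.

```python
from itertools import combinations

# Hypothetical toy transactions (market baskets).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter", "jam"},
]


def frequent_itemsets(baskets, min_support):
    """Brute force: return {itemset: support} for itemsets meeting min_support."""
    items = sorted(set().union(*baskets))
    n = len(baskets)
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            # Support is the fraction of baskets containing every item.
            support = sum(1 for b in baskets if set(candidate) <= b) / n
            if support >= min_support:
                result[candidate] = support
    return result


print(len(frequent_itemsets(transactions, 0.6)))
print(len(frequent_itemsets(transactions, 0.2)))
```

Even on this four-item example, lowering the support threshold from 0.6 to 0.2 raises the number of frequent itemsets from 6 to 15, that is, every possible non-empty itemset. On real data with thousands of items the same effect produces far more candidates than a user can inspect, and high support alone says nothing about whether an itemset is interesting.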

The filtered-top-k association technique that underlies the BigML associations implementation was developed by Professor Geoff Webb. It focuses on finding the most useful associations for the user's specific application. This approach has been successfully used in numerous scientific applications ranging from health data mining and cancer mortality studies to controlling robots and improving e-learning.
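The general shape of a filter-then-rank selection can be sketched as follows. This is only an illustration of the idea under stated assumptions, not the algorithm used by BigML: the candidate rules and their numbers are made up, `filtered_top_k` is a hypothetical helper, and lift is assumed as the ranking measure. Candidates that fail the filters are discarded first, and only the k best of the remainder are returned.

```python
import heapq

# Hypothetical candidate rules: (antecedent, consequent, support, lift).
candidates = [
    ("bread", "butter", 0.40, 1.10),
    ("jam", "bread", 0.05, 1.80),
    ("milk", "eggs", 0.30, 1.45),
    ("tea", "sugar", 0.20, 2.10),
    ("beer", "chips", 0.25, 0.90),
]


def filtered_top_k(rules, k, min_support=0.0, min_lift=1.0):
    """Drop rules that fail the filters, then return the k with the highest lift."""
    kept = [r for r in rules if r[2] >= min_support and r[3] > min_lift]
    return heapq.nlargest(k, kept, key=lambda r: r[3])


top = filtered_top_k(candidates, k=2, min_support=0.1)
for antecedent, consequent, support, lift in top:
    print(f"{antecedent} -> {consequent} (support={support}, lift={lift})")
```

Bounding the output to k rules keeps the result small enough to review regardless of how many candidates exist, in contrast to returning every itemset above a support threshold.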