Association Discovery with the BigML Dashboard

4.6 Minimum Levels for the Association Measures

You can set minimum levels for a number of association measures (Figure 4.7) to focus on the more interesting association rules while filtering out potentially spurious ones. No single measure of interestingness is always more important than the others, and no general thresholds identify the essential rules. Analyze your results according to your main goals, which may differ depending on the problem you are trying to solve.

For example, if you are interested in very frequent associations, pay more attention to support. If you would rather find less frequent associations with a stronger relationship between the items, look for rules with higher lift. Usually it is not a single measure but the combination and coherence of all measures that makes one rule more relevant and useful than others.

The following subsections explain the meaning of each association measure.

\includegraphics[]{images/assoc-measures}
Figure 4.7 Association measures

4.6.1 Minimum Support

In Figure 4.7, support is the number of instances in the dataset that contain the rule’s antecedent and consequent together, divided by the total number of instances (N) in the dataset. It measures the prevalence of the rule in your dataset.

You can set a support threshold between 0% and 100% by moving the min. support slider or by typing the percentage in the input box. BigML will automatically discard associations below this support level. As the minimum support percentage increases, the remaining association rules are those that occur more frequently in your dataset.
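The support definition above can be sketched in a few lines of Python. This is only an illustration on a made-up dataset, not BigML's implementation; the names `instances`, `support`, and `min_support` are hypothetical.

```python
# Toy dataset (hypothetical): each instance is the set of items it contains.
instances = [
    {"onions", "potatoes", "burger"},
    {"onions", "potatoes"},
    {"potatoes", "milk"},
    {"onions", "milk"},
    {"milk", "bread"},
]


def support(antecedent, consequent, data):
    """Fraction of instances that contain antecedent and consequent together."""
    itemset = set(antecedent) | set(consequent)
    matching = sum(1 for instance in data if itemset <= instance)
    return matching / len(data)


min_support = 0.30  # hypothetical threshold, i.e. the slider set to 30%
rule_support = support({"onions"}, {"potatoes"}, instances)
keep_rule = rule_support >= min_support
print(f"support = {rule_support:.0%}, kept = {keep_rule}")
```

In this toy dataset, 2 of the 5 instances contain both onions and potatoes, so the rule has a support of 40% and survives a 30% minimum.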

4.6.2 Minimum Confidence

In Figure 4.7, confidence is the number of instances that contain the antecedent and consequent together, divided by the number of instances that contain the antecedent (whether or not they also contain the consequent), expressed as a percentage. Think of it as an estimate of the probability that the consequent occurs given that the antecedent occurs. Some publications also refer to confidence as strength.

You can set a confidence threshold between 0% and 100% by moving the min. confidence slider or by typing the percentage in the input box. Associations below this confidence will be automatically discarded.
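As a rough illustration of the confidence definition, the following Python sketch computes it on a made-up dataset. It is not BigML's implementation; `instances`, `count_containing`, `confidence`, and `min_confidence` are hypothetical names.

```python
# Toy dataset (hypothetical): each instance is the set of items it contains.
instances = [
    {"onions", "potatoes", "burger"},
    {"onions", "potatoes"},
    {"potatoes", "milk"},
    {"onions", "milk"},
    {"milk", "bread"},
]


def count_containing(itemset, data):
    """Number of instances that contain every item in itemset."""
    return sum(1 for instance in data if set(itemset) <= instance)


def confidence(antecedent, consequent, data):
    """Estimated probability of the consequent given the antecedent."""
    antecedent_count = count_containing(antecedent, data)
    if antecedent_count == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    both_count = count_containing(set(antecedent) | set(consequent), data)
    return both_count / antecedent_count


min_confidence = 0.50  # hypothetical threshold, i.e. the slider set to 50%
rule_confidence = confidence({"onions"}, {"potatoes"}, instances)
keep_rule = rule_confidence >= min_confidence
print(f"confidence = {rule_confidence:.1%}, kept = {keep_rule}")
```

Onions appear in 3 instances and 2 of those also contain potatoes, so the confidence is 2/3.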

4.6.3 Minimum Leverage

In Figure 4.7, leverage measures the difference between the observed probability of the rule (its support) and the probability that would be expected if the antecedent and consequent were statistically independent. Leverage ranges from -1 to 1. A leverage of 0 suggests there is no association between the items. Higher positive values suggest a stronger positive association between the antecedent and consequent. Negative values suggest a negative association.

You can set a leverage threshold between -100% and 100% by moving the min. leverage slider or by typing the percentage in the input box. Associations below this leverage will be discarded.
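The leverage definition can be sketched as the observed support minus the product of the individual supports. The Python below is a minimal illustration on a made-up dataset, not BigML's implementation; `instances`, `fraction_containing`, `leverage`, and `min_leverage` are hypothetical names.

```python
# Toy dataset (hypothetical): each instance is the set of items it contains.
instances = [
    {"onions", "potatoes", "burger"},
    {"onions", "potatoes"},
    {"potatoes", "milk"},
    {"onions", "milk"},
    {"milk", "bread"},
]


def fraction_containing(itemset, data):
    """Fraction of instances that contain every item in itemset."""
    return sum(1 for instance in data if set(itemset) <= instance) / len(data)


def leverage(antecedent, consequent, data):
    """Observed joint support minus the support expected under independence."""
    observed = fraction_containing(set(antecedent) | set(consequent), data)
    expected = fraction_containing(antecedent, data) * fraction_containing(consequent, data)
    return observed - expected


min_leverage = 0.0  # hypothetical threshold: keep only non-negative associations
rule_leverage = leverage({"onions"}, {"potatoes"}, instances)
keep_rule = rule_leverage >= min_leverage
print(f"leverage = {rule_leverage:.2%}, kept = {keep_rule}")
```

Here the observed support is 2/5 and the expected support under independence is 3/5 × 3/5 = 9/25, so the leverage is 1/25, or 4%.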

4.6.4 Significance Level

In Figure 4.7, significance level is the maximum risk you are willing to accept of discovering a spurious association. BigML applies statistical tests to control this risk. The lower the significance level, the less likely it is that a discovered rule is spurious, either because the antecedent and consequent are unrelated to one another, or because one or more of the values in the antecedent do not contribute to the association with the consequent. The significance level is set to 5% (or 0.05) by default, but you can change this value by moving the max. significance level slider or by typing the number you wish in the input box.
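This section does not say which statistical test BigML applies. Purely as an illustration of how a significance level can be used to reject rules, the sketch below assumes a one-sided Fisher's exact test on the counts of antecedent and consequent. The choice of test and the names `fisher_p_value` and `max_significance` are assumptions, and the sketch covers only the first case above (antecedent and consequent unrelated), not the contribution of individual antecedent values.

```python
from math import comb


def fisher_p_value(n_both, n_antecedent, n_consequent, n_total):
    """One-sided Fisher's exact test (an assumed, illustrative choice of test).

    Returns the probability of seeing at least n_both co-occurrences if the
    antecedent and consequent were independent, given their marginal counts.
    """
    upper = min(n_antecedent, n_consequent)
    tail = sum(
        comb(n_consequent, k) * comb(n_total - n_consequent, n_antecedent - k)
        for k in range(n_both, upper + 1)
    )
    return tail / comb(n_total, n_antecedent)


max_significance = 0.05  # the default level mentioned in the text

# Weak evidence: 5 instances, antecedent in 3, consequent in 3, both in 2.
weak_p = fisher_p_value(2, 3, 3, 5)
# Strong evidence: 100 instances, both items always appear together (10 times).
strong_p = fisher_p_value(10, 10, 10, 100)

for name, p in (("weak", weak_p), ("strong", strong_p)):
    print(f"{name}: p = {p:.3g}, kept = {p <= max_significance}")
```

Under these assumptions, the weak rule is discarded because its p-value (0.7) exceeds the 5% level, while the strong rule is kept.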

4.6.5 Minimum Lift

Finally, in Figure 4.7, lift represents how much more often the antecedent and consequent occur together than would be expected if they were statistically independent. For example, a lift of 5 for the rule \((onions \to potatoes)\) means that buying onions makes it 5 times more likely that the shopper will buy potatoes. Lift is always a positive real number. A lift of 1 suggests there is no association between the items. A lift between 0 and 1 indicates a negative association. Values above 1 indicate a positive association, and higher values suggest stronger relationships between the items.

You can set the lift threshold to any positive real number by typing it in the input box. Associations below this lift will be discarded.
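The lift definition can be sketched as the observed support divided by the support expected under independence. The Python below is a minimal illustration on a made-up dataset, not BigML's implementation; `instances`, `fraction_containing`, `lift`, and `min_lift` are hypothetical names.

```python
# Toy dataset (hypothetical): each instance is the set of items it contains.
instances = [
    {"onions", "potatoes", "burger"},
    {"onions", "potatoes"},
    {"potatoes", "milk"},
    {"onions", "milk"},
    {"milk", "bread"},
]


def fraction_containing(itemset, data):
    """Fraction of instances that contain every item in itemset."""
    return sum(1 for instance in data if set(itemset) <= instance) / len(data)


def lift(antecedent, consequent, data):
    """Observed joint support divided by the support expected under independence."""
    observed = fraction_containing(set(antecedent) | set(consequent), data)
    expected = fraction_containing(antecedent, data) * fraction_containing(consequent, data)
    if expected == 0:
        raise ValueError("lift is undefined when an item set never occurs")
    return observed / expected


min_lift = 1.0  # hypothetical threshold: keep only positive associations
rule_lift = lift({"onions"}, {"potatoes"}, instances)
keep_rule = rule_lift >= min_lift
print(f"lift = {rule_lift:.3f}, kept = {keep_rule}")
```

With an observed support of 2/5 and an expected support of 9/25, the lift is 10/9, slightly above 1, which indicates a weak positive association in this toy dataset.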