Association Discovery with the BigML Dashboard

4.7 Discretization

Associations do not support numeric fields, so BigML automatically converts your numeric fields into categorical fields when it creates your association. This process is called discretization. For instance, a numeric field like “Age”, with values between 1 and 50, can be discretized into five segments, or classes: 1-10, 11-20, 21-30, 31-40, and 41-50. These five segments become the classes of the new categorical field.
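As an illustration only, the “Age” example can be sketched in a few lines of Python. The function name `discretize_age` is hypothetical and is not part of BigML; the sketch simply maps each age to one of the five segments listed above.

```python
def discretize_age(age):
    """Map an integer age in 1..50 to one of five segments of width 10.

    Hypothetical helper for illustration; BigML performs this step itself.
    """
    if not 1 <= age <= 50:
        raise ValueError("age is outside the 1-50 range used in the example")
    lower = ((age - 1) // 10) * 10 + 1   # 1, 11, 21, 31 or 41
    return f"{lower}-{lower + 9}"


ages = [3, 10, 11, 27, 50]
labels = [discretize_age(a) for a in ages]
print(labels)  # ['1-10', '1-10', '11-20', '21-30', '41-50']
```

Each label then behaves like a class of a categorical field, which is the form associations need.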

BigML allows you to configure the following discretization options. If you do not configure them, BigML will apply the default values. (See Figure 4.8.)

\includegraphics[]{images/assoc-discretization}
Figure 4.8 Discretization options

4.7.1 Pretty

During discretization, the segment boundaries of numeric fields will often be decimal numbers. By enabling the Pretty discretization option (Figure 4.8), you can force segment boundaries and widths to be set in a way that is easy to read, e.g., instead of \(\text{segment} > 20.678\) you will get \(\text{segment} > 20\). If Pretty is enabled, the specified Size may act as a maximum. (See subsection 4.7.4 and subsection 4.7.2.)
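The effect of Pretty can be pictured with a rough sketch. This section does not describe how BigML chooses pretty boundaries, so the rule used here (round each boundary down to an integer and merge duplicates) is an assumption made purely for illustration, and `pretty_boundaries` is a hypothetical name.

```python
import math


def pretty_boundaries(boundaries):
    """Round each boundary down to an integer and drop consecutive duplicates.

    Assumed rule for illustration only; not BigML's actual algorithm.
    """
    pretty = []
    for b in boundaries:
        rounded = math.floor(b)
        if not pretty or rounded != pretty[-1]:
            pretty.append(rounded)
    return pretty


raw = [10.2, 10.9, 20.678, 31.4]
result = pretty_boundaries(raw)
print(result)  # [10, 20, 31]
```

In this sketch two raw boundaries collapse into one, which is one plausible way that readable boundaries could yield fewer segments than the requested Size.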

4.7.2 Size

The Size discretization option (Figure 4.8) lets you specify the number of groups (or classes) for your numeric fields. For example, if you set Size = 2 and Type = width for a field containing integer values ranging from 1 to 10, you will get two equal-width segments: from 1 to 5, and from 6 to 10. The default value is Size = 5. You can set up to 50 segments by moving the size slider or by typing the number of segments in the input box.

If the Pretty option is enabled, then this value acts as a maximum size.
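The Size = 2, Type = width example can be reproduced with a minimal sketch. The helper names are hypothetical, and placing the boundary at the midpoint of the range (5.5) is an assumption about how equal-width edges are computed, chosen because it gives the two segments described above.

```python
def equal_width_edges(low, high, size):
    """Return size + 1 edges that split [low, high] into equal-width segments."""
    width = (high - low) / size
    return [low + i * width for i in range(size + 1)]


def segment_index(value, edges):
    """Return the index of the segment containing value (last edge inclusive)."""
    for i in range(len(edges) - 2):
        if value < edges[i + 1]:
            return i
    return len(edges) - 2


edges = equal_width_edges(1, 10, 2)
groups = {}
for v in range(1, 11):
    groups.setdefault(segment_index(v, edges), []).append(v)

print(edges)   # [1.0, 5.5, 10.0]
print(groups)  # {0: [1, 2, 3, 4, 5], 1: [6, 7, 8, 9, 10]}
```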

4.7.3 Trim

The Trim discretization option (Figure 4.8) sets the portion of the overall population that may be removed from either tail of the distribution. You can set a number between 0% and 10% by moving the trim slider or by typing the percentage in the input box.

For example, 0.01 indicates that 1% of the data may be removed from either tail. A trim of 1% usually gives good results, because it tends to eliminate most of the outliers.
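To make the idea concrete, the following sketch removes the same fraction of instances from each tail of a sorted sample. It is a simplified illustration with a hypothetical helper, `trim_tails`; BigML may remove fewer instances than the limit, and its exact procedure is not described here.

```python
def trim_tails(values, trim):
    """Drop a `trim` fraction of the sorted values from each tail.

    Simplified assumption: always removes the full allowed fraction.
    """
    if not 0 <= trim <= 0.10:
        raise ValueError("trim must be between 0 and 0.10 (0% to 10%)")
    ordered = sorted(values)
    k = int(len(ordered) * trim + 1e-9)  # instances to drop per tail
    return ordered[k:len(ordered) - k]


# 98 typical values plus one extreme value on each side (100 instances)
data = list(range(1, 99)) + [5000, -5000]
trimmed = trim_tails(data, 0.01)
print(min(trimmed), max(trimmed), len(trimmed))  # 1 98 98
```

With a trim of 1%, the two extreme values no longer stretch the range that the segments have to cover.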

4.7.4 Type

Finally, the Type discretization option (Figure 4.8) lets you select whether to discretize the field using an equal-width or an equal-population strategy. With equal width, every segment spans the same range of values; with equal population, every segment contains roughly the same number of instances. The right choice depends on the distribution of your numeric field.
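The difference between the two strategies shows up clearly on skewed data. The sketch below is a generic illustration of equal-width and equal-population binning with hypothetical helper names; it is not BigML's implementation.

```python
def equal_width_counts(values, size):
    """Count how many values fall in each of `size` equal-width segments."""
    low, high = min(values), max(values)
    width = (high - low) / size
    counts = [0] * size
    for v in values:
        # The maximum value belongs to the last segment
        idx = min(int((v - low) / width), size - 1)
        counts[idx] += 1
    return counts


def equal_population_groups(values, size):
    """Split the sorted values into `size` groups of (nearly) equal length."""
    ordered = sorted(values)
    base, extra = divmod(len(ordered), size)
    groups, start = [], 0
    for i in range(size):
        end = start + base + (1 if i < extra else 0)
        groups.append(ordered[start:end])
        start = end
    return groups


skewed = [1, 2, 3, 4, 5, 6, 7, 100]  # one large value stretches the range
width_counts = equal_width_counts(skewed, 2)
pop_groups = equal_population_groups(skewed, 2)
print(width_counts)  # [7, 1]
print(pop_groups)    # [[1, 2, 3, 4], [5, 6, 7, 100]]
```

Here equal width leaves almost every instance in the first segment, while equal population balances the segments at the cost of very different value ranges.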