4.5 MINING ASSOCIATION RULES
accuracy (the same number expressed as a proportion of the number of
instances to which the rule applies). This approach is quite infeasible. (Note that,
as we mentioned in Section 3.4, what we are calling coverage is often called
support and what we are calling accuracy is often called confidence.)
Instead, we capitalize on the fact that we are only interested in association
rules with high coverage. We ignore, for the moment, the distinction between
the left- and right-hand sides of a rule and seek combinations of attribute–value
pairs that have a prespecified minimum coverage. These are called item sets: an
attribute–value pair is an item. The terminology derives from market basket
analysis, in which the items are articles in your shopping cart and the
supermarket manager is looking for associations among these purchases.
Item sets
The first column of Table 4.10 shows the individual items for the weather data
of Table 1.2, with the number of times each item appears in the dataset given
at the right. These are the one-item sets. The next step is to generate the
two-item sets by making pairs of one-item ones. Of course, there is no point in
generating a set containing two different values of the same attribute (such as
outlook = sunny and outlook = overcast), because that cannot occur in any actual
instance.
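The pairing step can be sketched in a few lines of Python. This is only an illustration: the 14 instances are typed in on the assumption that they match the weather data of Table 1.2, and the names `INSTANCES`, `coverage`, and `candidates` are ours rather than part of any library.

```python
from collections import Counter
from itertools import combinations

ATTRS = ("outlook", "temperature", "humidity", "windy", "play")
# Nominal weather data, assumed to match Table 1.2 (one instance per line).
ROWS = """sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no""".splitlines()

# Each instance becomes a set of (attribute, value) items.
INSTANCES = [frozenset(zip(ATTRS, row.split())) for row in ROWS]

# One-item sets: how often each attribute-value pair occurs.
one_item = Counter(item for inst in INSTANCES for item in inst)

def coverage(item_set):
    """Number of instances that contain every item in item_set."""
    return sum(1 for inst in INSTANCES if item_set <= inst)

# Candidate two-item sets: pairs of items that test different attributes.
candidates = [frozenset(pair)
              for pair in combinations(sorted(one_item), 2)
              if pair[0][0] != pair[1][0]]

# Keep only the candidates that cover at least two instances.
two_item = {c: coverage(c) for c in candidates if coverage(c) >= 2}

print(one_item[("outlook", "sunny")])   # 5
print(len(candidates), len(two_item))   # 57 47
```

Skipping same-attribute pairs leaves 57 candidates out of the 66 possible pairs of 12 items, and 47 of those candidates cover two or more instances.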
Assume that we seek association rules with minimum coverage 2: thus we
discard any item sets that cover fewer than two instances. This leaves 47
two-item sets, some of which are shown in the second column along with the
number of times they appear. The next step is to generate the three-item sets,
of which 39 have a coverage of 2 or greater. There are 6 four-item sets, and no
five-item sets: for this data, a five-item set with coverage 2 or greater could only
correspond to a repeated instance. The first row of the table, for example, shows
that there are five days when outlook = sunny. Two of those days have
temperature = mild (the two-item set) and two have temperature = hot; on both
of the hot days humidity = high and play = no as well (the three- and four-item
sets).
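These counts can be checked by brute force. The sketch below is not the efficient generation procedure, which is explained later; it simply enumerates every combination of items inside each instance and tallies them. The data is again assumed to match Table 1.2, and `frequent_item_sets` and `MIN_COVERAGE` are illustrative names.

```python
from collections import Counter
from itertools import combinations

ATTRS = ("outlook", "temperature", "humidity", "windy", "play")
# Nominal weather data, assumed to match Table 1.2 (one instance per line).
ROWS = """sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no""".splitlines()

# Items are kept in a fixed attribute order so equal item sets compare equal.
INSTANCES = [tuple(zip(ATTRS, row.split())) for row in ROWS]
MIN_COVERAGE = 2

def frequent_item_sets(size):
    """Item sets of the given size that occur in at least MIN_COVERAGE instances."""
    counts = Counter(combo
                     for inst in INSTANCES
                     for combo in combinations(inst, size))
    return {combo: n for combo, n in counts.items() if n >= MIN_COVERAGE}

sizes = {k: len(frequent_item_sets(k)) for k in range(1, 6)}
print(sizes)  # {1: 12, 2: 47, 3: 39, 4: 6, 5: 0}
```

Because each instance has exactly one value per attribute, combinations drawn from a single instance can never contain two values of the same attribute, so no separate check is needed here.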
Association rules
Shortly we will explain how to generate these item sets efficiently. But first let
us finish the story. Once all item sets with the required coverage have been
generated, the next step is to turn each into a rule, or set of rules, with at least the
specified minimum accuracy. Some item sets will produce more than one rule;
others will produce none. For example, there is one three-item set with a
coverage of 4 (row 38 of Table 4.10):
humidity = normal, windy = false, play = yes
This set leads to seven potential rules:
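There are seven because any of the three items may appear on either side of the rule, provided that the right-hand side is not empty (2^3 - 1 = 7). A minimal sketch of this enumeration follows, assuming the data matches Table 1.2 and taking a rule's accuracy to be the coverage of the whole item set divided by the coverage of its left-hand side; `candidate_rules` and `coverage` are illustrative names.

```python
from itertools import combinations

ATTRS = ("outlook", "temperature", "humidity", "windy", "play")
# Nominal weather data, assumed to match Table 1.2 (one instance per line).
ROWS = """sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no""".splitlines()

INSTANCES = [frozenset(zip(ATTRS, row.split())) for row in ROWS]
ITEM_SET = (("humidity", "normal"), ("windy", "false"), ("play", "yes"))

def coverage(items):
    """Number of instances containing every item (all of them if items is empty)."""
    return sum(1 for inst in INSTANCES if frozenset(items) <= inst)

def candidate_rules(item_set):
    """Yield (antecedent, consequent, (hits, applies)) for each non-empty consequent."""
    hits = coverage(item_set)
    # Largest antecedents first, ending with the empty antecedent.
    for size in range(len(item_set) - 1, -1, -1):
        for antecedent in combinations(item_set, size):
            consequent = tuple(i for i in item_set if i not in antecedent)
            yield antecedent, consequent, (hits, coverage(antecedent))

for ante, cons, (hits, applies) in candidate_rules(ITEM_SET):
    lhs = " and ".join(f"{a} = {v}" for a, v in ante) or "-"
    rhs = " and ".join(f"{a} = {v}" for a, v in cons)
    print(f"If {lhs} then {rhs}  {hits}/{applies}")
```

Each printed fraction is the rule's accuracy: the numerator is the coverage of the item set, and the denominator is the number of instances to which the left-hand side applies.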
Table 4.10  Item sets for the weather data with coverage 2 or greater.

    One-item sets           Two-item sets           Three-item sets         Four-item sets

1   outlook = sunny (5)     outlook = sunny         outlook = sunny         outlook = sunny
                            temperature = mild (2)  temperature = hot       temperature = hot
                                                    humidity = high (2)     humidity = high
                                                                            play = no (2)

2   outlook = overcast (4)  outlook = sunny         outlook = sunny         outlook = sunny
                            temperature = hot (2)   temperature = hot       humidity = high
                                                    play = no (2)           windy = false
                                                                            play = no (2)

3   outlook = rainy (5)     outlook = sunny         outlook = sunny         outlook = overcast
                            humidity = normal (2)   humidity = normal       temperature = hot
                                                    play = yes (2)          windy = false
                                                                            play = yes (2)

4   temperature = cool (4)  outlook = sunny         outlook = sunny         outlook = rainy
                            humidity = high (3)     humidity = high         temperature = mild
                                                    windy = false (2)       windy = false
                                                                            play = yes (2)

5   temperature = mild (6)  outlook = sunny         outlook = sunny         outlook = rainy
                            windy = true (2)        humidity = high         humidity = normal
                                                    play = no (3)           windy = false
                                                                            play = yes (2)

6   temperature = hot (4)   outlook = sunny         outlook = sunny         temperature = cool
                            windy = false (3)       windy = false           humidity = normal
                                                    play = no (2)           windy = false
                                                                            play = yes (2)

7   humidity = normal (7)   outlook = sunny         outlook = overcast
                            play = yes (2)          temperature = hot
                                                    windy = false (2)

8   humidity = high (7)     outlook = sunny         outlook = overcast
                            play = no (3)           temperature = hot
                                                    play = yes (2)

9   windy = true (6)        outlook = overcast      outlook = overcast
                            temperature = hot (2)   humidity = normal
                                                    play = yes (2)

10  windy = false (8)       outlook = overcast      outlook = overcast
                            humidity = normal (2)   humidity = high
                                                    play = yes (2)

11  play = yes (9)          outlook = overcast      outlook = overcast
                            humidity = high (2)     windy = true
                                                    play = yes (2)

12  play = no (5)           outlook = overcast      outlook = overcast
                            windy = true (2)        windy = false
                                                    play = yes (2)

13                          outlook = overcast      outlook = rainy
                            windy = false (2)       temperature = cool
                                                    humidity = normal (2)