4.5 Mining Association Rules
which has coverage 2. Three subsets of this item set also have coverage 2:

temperature = cool, windy = false
temperature = cool, humidity = normal, windy = false
temperature = cool, windy = false, play = yes
and these lead to rules 9, 10, and 11, all of which are 100% accurate (on the
training data).
Generating rules efficiently
We now consider in more detail an algorithm for producing association rules
with specified minimum coverage and accuracy. There are two stages: generat-
ing item sets with the specified minimum coverage, and from each item set
determining the rules that have the specified minimum accuracy.
The first stage proceeds by generating all one-item sets with the given
minimum coverage (the first column of Table 4.10) and then using this to gen-
erate the two-item sets (second column), three-item sets (third column), and so
on. Each operation involves a pass through the dataset to count the items in
each set, and after the pass the surviving item sets are stored in a hash table—
a standard data structure that allows elements stored in it to be found very
quickly. From the one-item sets, candidate two-item sets are generated, and then
a pass is made through the dataset, counting the coverage of each two-item set;
at the end the candidate sets with less than minimum coverage are removed
from the table. The candidate two-item sets are simply all of the one-item sets
taken in pairs, because a two-item set cannot have the minimum coverage unless
both its constituent one-item sets have minimum coverage, too. This applies in
general: a three-item set can only have the minimum coverage if all three of its
two-item subsets have minimum coverage as well, and similarly for four-item
sets.
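The level-wise pass just described can be sketched in Python. This is an illustrative outline under the assumptions stated in the comments, not the book's implementation; a plain dict stands in for the hash table, and transactions are assumed to be frozensets of items.

```python
from itertools import combinations

def frequent_item_sets(dataset, min_coverage):
    """Sketch of level-wise item-set generation.
    dataset: list of transactions, each a frozenset of items.
    Returns a dict mapping each surviving item set to its coverage."""
    # First pass: count one-item sets and keep those with minimum coverage.
    counts = {}
    for transaction in dataset:
        for item in transaction:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s for s, c in counts.items() if c >= min_coverage}
    all_frequent = {s: c for s, c in counts.items() if c >= min_coverage}

    k = 2
    while current:
        # Candidate k-item sets: unions of surviving (k-1)-item sets,
        # kept only if every (k-1)-subset also survived the previous pass.
        candidates = set()
        for a in current:
            for b in current:
                union = a | b
                if len(union) == k and all(
                    frozenset(sub) in current
                    for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # One pass through the dataset to count each candidate's coverage.
        counts = {c: 0 for c in candidates}
        for transaction in dataset:
            for cand in candidates:
                if cand <= transaction:
                    counts[cand] += 1
        # Remove candidates with less than minimum coverage.
        current = {s for s, c in counts.items() if c >= min_coverage}
        all_frequent.update((s, counts[s]) for s in current)
        k += 1
    return all_frequent
```

Note that this sketch recomputes candidate unions pairwise; the lexical-ordering refinement described next avoids most of that redundant work.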
An example will help to explain how candidate item sets are generated.
Suppose there are five three-item sets—(A B C), (A B D), (A C D), (A C E), and
(B C D)—where, for example, A is a feature such as outlook = sunny. The union
of the first two, (A B C D), is a candidate four-item set because its other three-
item subsets (A C D) and (B C D) have greater than minimum coverage. If the
three-item sets are sorted into lexical order, as they are in this list, then we need
only consider pairs whose first two members are the same. For example, we do
not consider (A C D) and (B C D) because (A B C D) can also be generated
from (A B C) and (A B D), and if these two are not candidate three-item sets
then (A B C D) cannot be a candidate four-item set. This leaves the pairs (A B
C) and (A B D), which we have already explained, and (A C D) and (A C E).
This second pair leads to the set (A C D E) whose three-item subsets do not all
have the minimum coverage, so it is discarded. The hash table assists with this
check: we simply remove each item from the set in turn and check that the
remaining three-item set is indeed present in the hash table. Thus in this
example there is only one candidate four-item set, (A B C D). Whether or not
it actually has minimum coverage can only be determined by checking the
instances in the dataset.
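This join-and-prune step can be sketched as follows; the names are illustrative, sorted tuples stand in for the lexically ordered item sets, and a Python set plays the role of the hash table.

```python
from itertools import combinations

def candidate_next_level(item_sets):
    """Sketch: item_sets is a list of lexically sorted k-tuples,
    e.g. ('A', 'B', 'C'). Join pairs sharing their first k-1 members,
    then prune any candidate that has a k-subset missing from the
    current level (the hash-table membership check)."""
    present = set(item_sets)          # plays the role of the hash table
    k = len(item_sets[0])
    sets_sorted = sorted(item_sets)
    candidates = []
    for i, a in enumerate(sets_sorted):
        for b in sets_sorted[i + 1:]:
            if a[:k - 1] != b[:k - 1]:
                break                 # sorted order: no later pair can match
            union = tuple(sorted(set(a) | set(b)))
            # Keep the candidate only if every k-subset is present.
            if all(sub in present for sub in combinations(union, k)):
                candidates.append(union)
    return candidates
```

Running this on the five three-item sets from the example yields the single candidate (A B C D): (A C D E) is generated by the join but pruned because its subset (A D E) is absent.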
The second stage of the procedure takes each item set and generates rules
from it, checking that they have the specified minimum accuracy. If only rules
with a single test on the right-hand side were sought, it would be simply a matter
of considering each condition in turn as the consequent of the rule, deleting it
from the item set, and dividing the coverage of the entire item set by the cov-
erage of the resulting subset—obtained from the hash table—to yield the accu-
racy of the corresponding rule. Given that we are also interested in association
rules with multiple tests in the consequent, it looks like we have to evaluate the
effect of placing each subset of the item set on the right-hand side, leaving the
remainder of the set as the antecedent.
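For the single-consequent case, the coverage-ratio computation can be sketched like this; the coverage dict stands in for the hash table, and the item encodings are hypothetical.

```python
def single_consequent_rules(item_set, coverage, min_accuracy):
    """Sketch: coverage maps frozensets of items to their counts
    (the hash table). Each condition in turn becomes the consequent;
    accuracy is the coverage of the whole item set divided by the
    coverage of the antecedent."""
    rules = []
    whole = coverage[frozenset(item_set)]
    for condition in item_set:
        antecedent = frozenset(item_set) - {condition}
        accuracy = whole / coverage[antecedent]
        if accuracy >= min_accuracy:
            rules.append((antecedent, condition, accuracy))
    return rules
```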
This brute-force method will be excessively computationally intensive unless
item sets are small, because the number of possible subsets grows exponentially
with the size of the item set. However, there is a better way. We observed when
describing association rules in Section 3.4 that if the double-consequent rule
If windy = false and play = no then outlook = sunny and humidity = high
holds with a given minimum coverage and accuracy, then both single-
consequent rules formed from the same item set must also hold:
If humidity = high and windy = false and play = no then outlook = sunny
If outlook = sunny and windy = false and play = no then humidity = high
Conversely, if one or other of the single-consequent rules does not hold, there
is no point in considering the double-consequent one. This gives a way of build-
ing up from single-consequent rules to candidate double-consequent ones, from
double-consequent rules to candidate triple-consequent ones, and so on. Of
course, each candidate rule must be checked against the hash table to see if it
really does have more than the specified minimum accuracy. But this generally
involves checking far fewer rules than the brute force method. It is interesting
that this way of building up candidate (n + 1)-consequent rules from actual
n-consequent ones is really just the same as building up candidate
(n + 1)-item sets from actual n-item sets, described earlier.
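That parallel can be made concrete with a sketch: assuming the consequents of rules that hold for a given item set are stored as frozensets, candidate larger consequents come from the same join-and-prune pattern (names are illustrative).

```python
from itertools import combinations

def candidate_larger_consequents(holding, full_set):
    """Sketch: holding is a set of frozensets, each the consequent of an
    n-consequent rule that held for the item set full_set. Union pairs of
    them; keep a candidate only if every n-subset of it is already a
    holding consequent, mirroring the item-set candidate generation."""
    n = len(next(iter(holding)))
    candidates = set()
    for a in holding:
        for b in holding:
            union = a | b
            # The consequent must leave a nonempty antecedent, hence
            # the proper-subset test against the full item set.
            if len(union) == n + 1 and union < full_set and all(
                frozenset(sub) in holding for sub in combinations(union, n)
            ):
                candidates.add(union)
    return candidates
```

Each candidate returned here must still be checked against the hash table to confirm it meets the minimum accuracy.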
Discussion
Association rules are often sought for very large datasets, and efficient algo-
rithms are highly valued. The method described previously makes one pass