4.5 Mining Association Rules
which has coverage 2. Three subsets of this item set also have coverage 2:

temperature = cool, windy = false
temperature = cool, humidity = normal, windy = false
temperature = cool, windy = false, play = yes
and these lead to rules 9, 10, and 11, all of which are 100% accurate (on the
training data).
Generating rules efficiently
We now consider in more detail an algorithm for producing association rules
with specified minimum coverage and accuracy. There are two stages: generat-
ing item sets with the specified minimum coverage, and from each item set
determining the rules that have the specified minimum accuracy.
The first stage proceeds by generating all one-item sets with the given
minimum coverage (the first column of Table 4.10) and then using this to gen-
erate the two-item sets (second column), three-item sets (third column), and so
on. Each operation involves a pass through the dataset to count the items in
each set, and after the pass the surviving item sets are stored in a hash table—
a standard data structure that allows elements stored in it to be found very
quickly. From the one-item sets, candidate two-item sets are generated, and then
a pass is made through the dataset, counting the coverage of each two-item set;
at the end the candidate sets with less than minimum coverage are removed
from the table. The candidate two-item sets are simply all of the one-item sets
taken in pairs, because a two-item set cannot have the minimum coverage unless
both its constituent one-item sets have minimum coverage, too. This applies in
general: a three-item set can only have the minimum coverage if all three of its
two-item subsets have minimum coverage as well, and similarly for four-item
sets.
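The level-wise pass just described can be sketched in Python. This is an illustrative outline under the assumptions stated in the comments, not the book's implementation; a plain dict stands in for the hash table, and transactions are assumed to be frozensets of items.

```python
from itertools import combinations

def frequent_item_sets(dataset, min_coverage):
    """Sketch of level-wise item-set generation.
    dataset: list of transactions, each a frozenset of items.
    Returns a dict mapping each surviving item set to its coverage."""
    # First pass: count one-item sets and keep those with minimum coverage.
    counts = {}
    for transaction in dataset:
        for item in transaction:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s for s, c in counts.items() if c >= min_coverage}
    all_frequent = {s: c for s, c in counts.items() if c >= min_coverage}

    k = 2
    while current:
        # Candidate k-item sets: unions of surviving (k-1)-item sets,
        # kept only if every (k-1)-subset also survived the previous pass.
        candidates = set()
        for a in current:
            for b in current:
                union = a | b
                if len(union) == k and all(
                    frozenset(sub) in current
                    for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # One pass through the dataset to count each candidate's coverage.
        counts = {c: 0 for c in candidates}
        for transaction in dataset:
            for cand in candidates:
                if cand <= transaction:
                    counts[cand] += 1
        # Remove candidates with less than minimum coverage.
        current = {s for s, c in counts.items() if c >= min_coverage}
        all_frequent.update((s, counts[s]) for s in current)
        k += 1
    return all_frequent
```

Note that this sketch recomputes candidate unions pairwise; the lexical-ordering refinement described next avoids most of that redundant work.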
An example will help to explain how candidate item sets are generated.
Suppose there are five three-item sets—(A B C), (A B D), (A C D), (A C E), and
(B C D)—where, for example, A is a feature such as outlook = sunny. The union
of the first two, (A B C D), is a candidate four-item set because its other three-
item subsets (A C D) and (B C D) have greater than minimum coverage. If the
three-item sets are sorted into lexical order, as they are in this list, then we need
only consider pairs whose first two members are the same. For example, we do
not consider (A C D) and (B C D) because (A B C D) can also be generated
from (A B C) and (A B D), and if these two are not candidate three-item sets
then (A B C D) cannot be a candidate four-item set. This leaves the pairs (A B
C) and (A B D), which we have already explained, and (A C D) and (A C E).
This second pair leads to the set (A C D E) whose three-item subsets do not all
have the minimum coverage, so it is discarded. The hash table assists with this
check: we simply remove each item from the set in turn and check that the
remaining three-item set is indeed present in the hash table. Thus in this
example there is only one candidate four-item set, (A B C D). Whether or not
it actually has minimum coverage can only be determined by checking the
instances in the dataset.
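This join-and-prune step can be sketched as follows; the names are illustrative, sorted tuples stand in for the lexically ordered item sets, and a Python set plays the role of the hash table.

```python
from itertools import combinations

def candidate_next_level(item_sets):
    """Sketch: item_sets is a list of lexically sorted k-tuples,
    e.g. ('A', 'B', 'C'). Join pairs sharing their first k-1 members,
    then prune any candidate that has a k-subset missing from the
    current level (the hash-table membership check)."""
    present = set(item_sets)          # plays the role of the hash table
    k = len(item_sets[0])
    sets_sorted = sorted(item_sets)
    candidates = []
    for i, a in enumerate(sets_sorted):
        for b in sets_sorted[i + 1:]:
            if a[:k - 1] != b[:k - 1]:
                break                 # sorted order: no later pair can match
            union = tuple(sorted(set(a) | set(b)))
            # Keep the candidate only if every k-subset is present.
            if all(sub in present for sub in combinations(union, k)):
                candidates.append(union)
    return candidates
```

Running this on the five three-item sets from the example yields the single candidate (A B C D): (A C D E) is generated by the join but pruned because its subset (A D E) is absent.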
The second stage of the procedure takes each item set and generates rules
from it, checking that they have the specified minimum accuracy. If only rules
with a single test on the right-hand side were sought, it would be simply a matter
of considering each condition in turn as the consequent of the rule, deleting it
from the item set, and dividing the coverage of the entire item set by the cov-
erage of the resulting subset—obtained from the hash table—to yield the accu-
racy of the corresponding rule. Given that we are also interested in association
rules with multiple tests in the consequent, it looks like we have to evaluate the
effect of placing each subset of the item set on the right-hand side, leaving the
remainder of the set as the antecedent.
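For the single-consequent case, the coverage-ratio computation can be sketched like this; the coverage dict stands in for the hash table, and the item encodings are hypothetical.

```python
def single_consequent_rules(item_set, coverage, min_accuracy):
    """Sketch: coverage maps frozensets of items to their counts
    (the hash table). Each condition in turn becomes the consequent;
    accuracy is the coverage of the whole item set divided by the
    coverage of the antecedent."""
    rules = []
    whole = coverage[frozenset(item_set)]
    for condition in item_set:
        antecedent = frozenset(item_set) - {condition}
        accuracy = whole / coverage[antecedent]
        if accuracy >= min_accuracy:
            rules.append((antecedent, condition, accuracy))
    return rules
```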
This brute-force method will be excessively computationally intensive unless
item sets are small, because the number of possible subsets grows exponentially
with the size of the item set. However, there is a better way. We observed when
describing association rules in Section 3.4 that if the double-consequent rule
If windy = false and play = no then outlook = sunny and humidity = high
holds with a given minimum coverage and accuracy, then both single-
consequent rules formed from the same item set must also hold:
If humidity = high and windy = false and play = no then outlook = sunny
If outlook = sunny and windy = false and play = no then humidity = high
Conversely, if one or other of the single-consequent rules does not hold, there
is no point in considering the double-consequent one. This gives a way of build-
ing up from single-consequent rules to candidate double-consequent ones, from
double-consequent rules to candidate triple-consequent ones, and so on. Of
course, each candidate rule must be checked against the hash table to see if it
really does have more than the specified minimum accuracy. But this generally
involves checking far fewer rules than the brute force method. It is interesting
that this way of building up candidate (n + 1)-consequent rules from actual
n-consequent ones is really just the same as building up candidate
(n + 1)-item sets from actual n-item sets, described earlier.
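That parallel can be made concrete with a sketch: assuming the consequents of rules that hold for a given item set are stored as frozensets, candidate larger consequents come from the same join-and-prune pattern (names are illustrative).

```python
from itertools import combinations

def candidate_larger_consequents(holding, full_set):
    """Sketch: holding is a set of frozensets, each the consequent of an
    n-consequent rule that held for the item set full_set. Union pairs of
    them; keep a candidate only if every n-subset of it is already a
    holding consequent, mirroring the item-set candidate generation."""
    n = len(next(iter(holding)))
    candidates = set()
    for a in holding:
        for b in holding:
            union = a | b
            # The consequent must leave a nonempty antecedent, hence
            # the proper-subset test against the full item set.
            if len(union) == n + 1 and union < full_set and all(
                frozenset(sub) in holding for sub in combinations(union, n)
            ):
                candidates.add(union)
    return candidates
```

Each candidate returned here must still be checked against the hash table to confirm it meets the minimum accuracy.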
Discussion
Association rules are often sought for very large datasets, and efficient algo-
rithms are highly valued. The method described previously makes one pass