Data Mining: Practical Machine Learning Tools and Techniques, Second Edition




contribute independently and equally to the final outcome. A third might have a simple logical structure, involving just a few attributes that can be captured by a decision tree. In a fourth, there may be a few independent rules that govern the assignment of instances to different classes. A fifth might exhibit dependencies among different subsets of attributes. A sixth might involve linear dependence among numeric attributes, where what matters is a weighted sum of attribute values with appropriately chosen weights. In a seventh, classifications appropriate to particular regions of instance space might be governed by the distances between the instances themselves. And in an eighth, it might be that no class values are provided: the learning is unsupervised.

In the infinite variety of possible datasets there are many different kinds of structure that can occur, and a data mining tool, no matter how capable, that is looking for one class of structure may completely miss regularities of a different kind, regardless of how rudimentary those may be. The result is a baroque and opaque classification structure of one kind instead of a simple, elegant, immediately comprehensible structure of another.

Each of the eight examples of different kinds of datasets sketched previously leads to a different machine learning method well suited to discovering it. The sections of this chapter look at each of these structures in turn.

4.1 Inferring rudimentary rules

Here’s an easy way to find very simple classification rules from a set of instances. Called 1R for 1-rule, it generates a one-level decision tree expressed in the form of a set of rules that all test one particular attribute. 1R is a simple, cheap method that often comes up with quite good rules for characterizing the structure in data. It turns out that simple rules frequently achieve surprisingly high accuracy. Perhaps this is because the structure underlying many real-world datasets is quite rudimentary, and just one attribute is sufficient to determine the class of an instance quite accurately. In any event, it is always a good plan to try the simplest things first.

The idea is this: we make rules that test a single attribute and branch accordingly. Each branch corresponds to a different value of the attribute. It is obvious what the best classification for each branch is: use the class that occurs most often in the training data. Then the error rate of the rules can easily be determined: just count the errors that occur on the training data, that is, the number of instances that do not have the majority class.

Each attribute generates a different set of rules, one rule for every value of the attribute. Evaluate the error rate for each attribute’s rule set and choose the best. It’s that simple! Figure 4.1 shows the algorithm in the form of pseudocode.

CHAPTER 4  |  ALGORITHMS: THE BASIC METHODS

To see the 1R method at work, consider the weather data of Table 1.2 (we will encounter it many times again when looking at how learning algorithms work). To classify on the final column, play, 1R considers four sets of rules, one for each attribute. These rules are shown in Table 4.1. An asterisk indicates that a random choice has been made between two equally likely outcomes. The number of errors is given for each rule, along with the total number of errors for the rule set as a whole. 1R chooses the attribute that produces rules with the smallest number of errors, that is, the first and third rule sets. Arbitrarily breaking the tie between these two rule sets gives:

outlook: sunny    → no
         overcast → yes
         rainy    → yes


For each attribute,
  For each value of that attribute, make a rule as follows:
    count how often each class appears
    find the most frequent class
    make the rule assign that class to this attribute-value.
  Calculate the error rate of the rules.
Choose the rules with the smallest error rate.

Figure 4.1 Pseudocode for 1R.
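The pseudocode in Figure 4.1 translates into a short, runnable sketch. The following is an illustrative implementation, not the book’s own code; the dataset is the 14-instance nominal weather data of Table 1.2.

```python
from collections import Counter, defaultdict

# The weather data of Table 1.2: outlook, temperature, humidity, windy, play.
ATTRS = ["outlook", "temperature", "humidity", "windy"]
DATA = [
    ("sunny",    "hot",  "high",   "false", "no"),
    ("sunny",    "hot",  "high",   "true",  "no"),
    ("overcast", "hot",  "high",   "false", "yes"),
    ("rainy",    "mild", "high",   "false", "yes"),
    ("rainy",    "cool", "normal", "false", "yes"),
    ("rainy",    "cool", "normal", "true",  "no"),
    ("overcast", "cool", "normal", "true",  "yes"),
    ("sunny",    "mild", "high",   "false", "no"),
    ("sunny",    "cool", "normal", "false", "yes"),
    ("rainy",    "mild", "normal", "false", "yes"),
    ("sunny",    "mild", "normal", "true",  "yes"),
    ("overcast", "mild", "high",   "true",  "yes"),
    ("overcast", "hot",  "normal", "false", "yes"),
    ("rainy",    "mild", "high",   "true",  "no"),
]

def one_r(data, attrs):
    """Return (best_attribute, rules, training_errors) per the 1R method."""
    best = None
    for i, attr in enumerate(attrs):
        # For each value of this attribute, count how often each class appears.
        counts = defaultdict(Counter)
        for row in data:
            counts[row[i]][row[-1]] += 1
        # Each rule assigns the most frequent class for its value;
        # every instance not in that class is a training error.
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

attr, rules, errors = one_r(DATA, ATTRS)
print(attr, errors, rules)  # outlook ties humidity at 4 errors; the first wins
```

Running it reproduces the totals of Table 4.1: outlook and humidity each make 4 errors out of 14, and the tie is broken in favor of whichever attribute comes first, here outlook, giving the rule set shown above.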

Table 4.1  Evaluating the attributes in the weather data.

    Attribute      Rules               Errors    Total errors
  1 outlook        sunny    → no       2/5       4/14
                   overcast → yes      0/4
                   rainy    → yes      2/5
  2 temperature    hot      → no*      2/4       5/14
                   mild     → yes      2/6
                   cool     → yes      1/4
  3 humidity       high     → no       3/7       4/14
                   normal   → yes      1/7
  4 windy          false    → yes      2/8       5/14
                   true     → no*      3/6

* A random choice was made between two equally likely outcomes.




We noted at the outset that the game for the weather data is unspecified. Oddly enough, it is apparently played when it is overcast or rainy but not when it is sunny. Perhaps it’s an indoor pursuit.

Missing values and numeric attributes

Although a very rudimentary learning method, 1R does accommodate both missing values and numeric attributes. It deals with these in simple but effective ways. Missing is treated as just another attribute value so that, for example, if the weather data had contained missing values for the outlook attribute, a rule set formed on outlook would have four branches, one each for sunny, overcast, and rainy, and a fourth for missing.
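To make this concrete, here is a small sketch using a hypothetical variant of the weather data in which two outlook values are unknown; the data modification and the "missing" marker are illustrative assumptions, not from the book.

```python
from collections import Counter, defaultdict

# Hypothetical (outlook, play) pairs: two outlook values replaced by "missing".
outlook_play = [
    ("sunny", "no"), ("missing", "no"), ("overcast", "yes"),
    ("rainy", "yes"), ("rainy", "yes"), ("rainy", "no"),
    ("overcast", "yes"), ("sunny", "no"), ("sunny", "yes"),
    ("rainy", "yes"), ("missing", "yes"), ("overcast", "yes"),
    ("overcast", "yes"), ("rainy", "no"),
]

# "missing" is counted like any other value, so it simply gets its own rule.
counts = defaultdict(Counter)
for value, play in outlook_play:
    counts[value][play] += 1

rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
print(rules)  # four branches: sunny, missing, overcast, rainy
```

The resulting rule set has four branches rather than three, with no special-case logic needed for the unknown values.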

We can convert numeric attributes into nominal ones using a simple discretization method. First, sort the training examples according to the values of the numeric attribute. This produces a sequence of class values. For example, sorting the numeric version of the weather data (Table 1.3) according to the values of temperature produces the sequence

64   65   68   69   70   71   72   72   75   75   80   81   83   85
yes  no   yes  yes  yes  no   no   yes  yes  yes  no   yes  yes  no


Discretization involves partitioning this sequence by placing breakpoints in it. One possibility is to place breakpoints wherever the class changes, producing eight categories:

yes | no | yes yes yes | no no | yes yes yes | no | yes yes | no

Choosing breakpoints halfway between the examples on either side places them at 64.5, 66.5, 70.5, 72, 77.5, 80.5, and 84. However, the two instances with value 72 cause a problem because they have the same value of temperature but fall into different classes. The simplest fix is to move the breakpoint at 72 up one example, to 73.5, producing a mixed partition in which no is the majority class.
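The breakpoint calculation can be sketched as follows; this is an illustrative fragment, not the book’s code, applied to the sorted temperature sequence above.

```python
# Sorted temperature values and their classes (from Table 1.3).
temps   = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
classes = ["yes", "no", "yes", "yes", "yes", "no", "no",
           "yes", "yes", "yes", "no", "yes", "yes", "no"]

# Place a breakpoint halfway between neighbors whenever the class changes.
breakpoints = []
for i in range(len(temps) - 1):
    if classes[i] != classes[i + 1]:
        breakpoints.append((temps[i] + temps[i + 1]) / 2)

print(breakpoints)  # [64.5, 66.5, 70.5, 72.0, 77.5, 80.5, 84.0]
```

The duplicate value 72 is what produces the problematic breakpoint at 72.0, which the text resolves by moving it up one example to 73.5.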

A more serious problem is that this procedure tends to form a large number of categories. The 1R method will naturally gravitate toward choosing an attribute that splits into many categories, because this will partition the dataset into many classes, making it more likely that instances will have the same class as the majority in their partition. In fact, the limiting case is an attribute that has a different value for each instance (that is, an identification code attribute that pinpoints instances uniquely), and this will yield a zero error rate on the training set because each partition contains just one instance. Of course, highly branching attributes do not usually perform well on test examples; indeed, the identification code attribute will never predict any examples outside the training set correctly. This phenomenon is known as overfitting; we have already
