Data Mining: Practical Machine Learning Tools and Techniques, Second Edition


2.3 What’s in an attribute
Table 2.5 Another relation represented as a table.

The real drawbacks of such techniques, however, are that they do not cope
well with noisy data, and they tend to be so slow as to be unusable on anything
but small artificial datasets. They are not covered in this book; see Bergadano
and Gunetti (1996) for a comprehensive treatment.

In summary, the input to a data mining scheme is generally expressed as a
table of independent instances of the concept to be learned. Because of this, it
has been suggested, disparagingly, that we should really talk of file mining rather
than database mining. Relational data is more complex than a flat file. A finite
set of finite relations can always be recast into a single table, although often at
enormous cost in space. Moreover, denormalization can generate spurious
regularities in the data, and it is essential to check the data for such artifacts
before applying a learning method. Finally, potentially infinite concepts can be
dealt with by learning rules that are recursive, although that is beyond the scope
of this book.
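The cost of recasting relations into a single table can be pictured with a small sketch. The Python below is an illustration only, not a method from the book: the PEOPLE dictionary and the function names are hypothetical, and the data is limited to the people who appear as rows in Table 2.5 (Peggy and Ray occur only as parents, so they have no row of their own). It pairs every person with every other person and labels each pair with a recursively computed ancestor-of value.

```python
# Hypothetical person relation: name -> (gender, parent1, parent2).
# Only people that appear as rows in Table 2.5 are included;
# None marks an unknown parent.
PEOPLE = {
    "Peter": ("male", None, None),
    "Steven": ("male", "Peter", "Peggy"),
    "Pam": ("female", "Peter", "Peggy"),
    "Anna": ("female", "Pam", "Ian"),
    "Nikki": ("female", "Pam", "Ian"),
    "Grace": ("female", None, None),
    "Ian": ("male", "Grace", "Ray"),
}


def is_ancestor(a, b):
    """True if a is a parent of b, or an ancestor of one of b's parents."""
    _, parent1, parent2 = PEOPLE[b]
    parents = [p for p in (parent1, parent2) if p is not None]
    if a in parents:
        return True
    # Recurse only through parents that have their own row in PEOPLE.
    return any(is_ancestor(a, p) for p in parents if p in PEOPLE)


def denormalize():
    """Build one flat row per ordered pair of distinct people."""
    rows = []
    for a, a_info in PEOPLE.items():
        for b, b_info in PEOPLE.items():
            if a != b:
                rows.append((a, *a_info, b, *b_info, is_ancestor(a, b)))
    return rows


rows = denormalize()
print(len(PEOPLE), "people ->", len(rows), "rows")
print(sum(r[-1] for r in rows), "rows labelled yes")
```

With 7 people the flat table already has 42 rows, of which 11 are labelled yes. The row count grows with the square of the number of people, which is the space cost the paragraph above refers to.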

2.3 What’s in an attribute?

Each individual, independent instance that provides the input to machine
learning is characterized by its values on a fixed, predefined set of features or
attributes. The instances are the rows of the tables that we have shown for the
weather, contact lens, iris, and CPU performance problems, and the attributes
are the columns. (The labor negotiations data was an exception: we presented
this with instances in columns and attributes in rows for space reasons.)

Table 2.5 Another relation represented as a table.

First person                       Second person                      Ancestor of?
Name   Gender  Parent1  Parent2    Name    Gender  Parent1  Parent2
Peter  male    ?        ?          Steven  male    Peter    Peggy     yes
Peter  male    ?        ?          Pam     female  Peter    Peggy     yes
Peter  male    ?        ?          Anna    female  Pam      Ian       yes
Peter  male    ?        ?          Nikki   female  Pam      Ian       yes
Pam    female  Peter    Peggy      Nikki   female  Pam      Ian       yes
Grace  female  ?        ?          Ian     male    Grace    Ray       yes
Grace  female  ?        ?          Nikki   female  Pam      Ian       yes
other examples here                                                   yes
all the rest                                                          no

The use of a fixed set of features imposes another restriction on the kinds of
problems generally considered in practical data mining. What if different
instances have different features? If the instances were transportation vehicles,
then number of wheels is a feature that applies to many vehicles but not to ships,
for example, whereas number of masts might be a feature that applies to ships
but not to land vehicles. The standard workaround is to make each possible
feature an attribute and to use a special “irrelevant value” flag to indicate that a
particular attribute is not available for a particular case. A similar situation arises
when the existence of one feature (say, spouse’s name) depends on the value of
another (married or single).
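The workaround can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the sentinel IRRELEVANT, the attribute list ATTRIBUTES, and the helper to_instance are hypothetical names chosen for this example, and the feature values (4 wheels, 3 masts) are made up.

```python
# Hypothetical sentinel marking an attribute that does not apply to an instance.
IRRELEVANT = "irrelevant"

# Every possible feature becomes an attribute of the fixed schema.
ATTRIBUTES = ["wheels", "masts"]


def to_instance(features):
    """Map a dict of applicable features onto the fixed attribute list,
    filling attributes that do not apply with the IRRELEVANT flag."""
    return [features.get(name, IRRELEVANT) for name in ATTRIBUTES]


car = to_instance({"wheels": 4})   # masts do not apply to a car
ship = to_instance({"masts": 3})   # wheels do not apply to a ship

print(car)
print(ship)
```

Every instance ends up with the same number of values in the same order, which is what a fixed set of attributes requires. The flag says that the attribute does not apply, which is a different statement from saying that its value is unknown.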

The value of an attribute for a particular instance is a measurement of the
quantity to which the attribute refers. There is a broad distinction between
quantities that are numeric and ones that are nominal. Numeric attributes, sometimes
called continuous attributes, measure numbers, either real or integer valued.
Note that the term continuous is routinely abused in this context: integer-valued
attributes are certainly not continuous in the mathematical sense. Nominal
attributes take on values in a prespecified, finite set of possibilities and are
sometimes called categorical. But there are other possibilities. Statistics texts often
introduce “levels of measurement” such as nominal, ordinal, interval, and ratio.

Nominal quantities have values that are distinct symbols. The values themselves
serve just as labels or names, hence the term nominal, which comes from
the Latin word for name. For example, in the weather data the attribute outlook
has values sunny, overcast, and rainy. No relation is implied among these
three: no ordering or distance measure. It certainly does not make sense to add
the values together, multiply them, or even compare their size. A rule using such
an attribute can only test for equality or inequality, as follows:

    outlook: sunny    → no
             overcast → yes
             rainy    → yes
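The rule on outlook can be written as a short Python sketch. The names OUTLOOK_VALUES and play_from_outlook are hypothetical; the point is only that the code uses equality and set membership, never an ordering or arithmetic on the values.

```python
# The prespecified, finite set of values for the nominal attribute outlook.
OUTLOOK_VALUES = {"sunny", "overcast", "rainy"}


def play_from_outlook(outlook):
    """Apply the rule: sunny -> no, overcast -> yes, rainy -> yes.
    Only equality and membership tests are used on the attribute value."""
    if outlook not in OUTLOOK_VALUES:
        raise ValueError(f"unknown outlook value: {outlook!r}")
    if outlook == "sunny":
        return "no"
    return "yes"


for value in ("sunny", "overcast", "rainy"):
    print(value, "->", play_from_outlook(value))
```

Python strings do support comparisons such as "rainy" < "sunny", but that is alphabetical order and carries no meaning for the attribute, so the sketch deliberately avoids it.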

Ordinal quantities are ones that make it possible to rank order the categories.
However, although there is a notion of ordering, there is no notion of distance.
For example, in the weather data the attribute temperature has values hot, mild,
and cool. These are ordered. Whether you say

    hot > mild > cool or hot < mild < cool

is a matter of convention: it does not matter which is used as long as
consistency is maintained. What is important is that mild lies between the other two.
Although it makes sense to compare two values, it does not make sense to add
or subtract them: the difference between hot and mild cannot be compared
with the difference between mild and cool. A rule using such an attribute might
involve a comparison, as follows:

    temperature = hot → no
    temperature < hot → yes
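An ordinal attribute can be sketched in Python as an enumeration that supports comparison but not arithmetic. The class Temperature and the function play_from_temperature are hypothetical names, and the convention cool < mild < hot is one of the two equally valid choices described above. A plain Enum is used instead of IntEnum so that subtraction is not defined.

```python
from enum import Enum
from functools import total_ordering


@total_ordering
class Temperature(Enum):
    """Ordinal values under the assumed convention cool < mild < hot.
    The numbers fix the rank order only; they are not distances."""
    COOL = 0
    MILD = 1
    HOT = 2

    def __lt__(self, other):
        if not isinstance(other, Temperature):
            return NotImplemented
        return self.value < other.value


def play_from_temperature(temperature):
    """Apply the rule: temperature = hot -> no, temperature < hot -> yes."""
    if temperature == Temperature.HOT:
        return "no"
    if temperature < Temperature.HOT:
        return "yes"
    raise ValueError("no rule covers this value")


for t in Temperature:
    print(t.name.lower(), "->", play_from_temperature(t))
```

Comparisons such as Temperature.COOL < Temperature.MILD work, while Temperature.HOT - Temperature.MILD raises a TypeError, which matches the idea that ordinal values can be ranked but not subtracted.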
