Data Mining: Practical Machine Learning Tools and Techniques, Second Edition




The real drawbacks of such techniques, however, are that they do not cope

well with noisy data, and they tend to be so slow as to be unusable on anything

but small artificial datasets. They are not covered in this book; see Bergadano

and Gunetti (1996) for a comprehensive treatment.

In summary, the input to a data mining scheme is generally expressed as a

table of independent instances of the concept to be learned. Because of this, it

has been suggested, disparagingly, that we should really talk of file mining rather

than database mining. Relational data is more complex than a flat file. A finite

set of finite relations can always be recast into a single table, although often at

enormous cost in space. Moreover, denormalization can generate spurious 

regularities in the data, and it is essential to check the data for such artifacts

before applying a learning method. Finally, potentially infinite concepts can be

dealt with by learning rules that are recursive, although that is beyond the scope

of this book.
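The cost and the risk of denormalization can be sketched in a few lines of Python. This is a minimal illustration rather than a method from the book: the `suppliers` and `products` relations, their attribute names, and the helper functions `denormalize` and `determines` are all hypothetical.

```python
# Two hypothetical relations: suppliers, and the products they supply.
suppliers = {
    "S1": {"supplier_city": "Hamilton"},
    "S2": {"supplier_city": "Auckland"},
}
products = [
    {"product": "bolt", "supplier": "S1"},
    {"product": "nut", "supplier": "S1"},
    {"product": "screw", "supplier": "S1"},
    {"product": "washer", "supplier": "S2"},
]


def denormalize(products, suppliers):
    """Join each product row with its supplier's attributes into one flat row."""
    return [{**row, **suppliers[row["supplier"]]} for row in products]


def determines(rows, a, b):
    """Return True if the value of attribute a fixes the value of attribute b."""
    seen = {}
    for row in rows:
        # setdefault records the first b seen for this a and returns it afterwards.
        if seen.setdefault(row[a], row[b]) != row[b]:
            return False
    return True


flat = denormalize(products, suppliers)

# The city is now stored once per product instead of once per supplier, and
# "supplier determines supplier_city" holds on every row of the flat table.
# A learner could report this as a discovered regularity, although it is only
# a side effect of the join.
print(len(flat), determines(flat, "supplier", "supplier_city"))
```

Checking the flat table for dependencies of this kind before learning is one way to spot such artifacts.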



2.3 What’s in an attribute?

Each individual, independent instance that provides the input to machine 

learning is characterized by its values on a fixed, predefined set of features or

attributes. The instances are the rows of the tables that we have shown for the

weather, contact lens, iris, and CPU performance problems, and the attributes

are the columns. (The labor negotiations data was an exception: we presented

this with instances in columns and attributes in rows for space reasons.)

The use of a fixed set of features imposes another restriction on the kinds of

problems generally considered in practical data mining. What if different


Table 2.5    Another relation represented as a table.

First person                        Second person                       Ancestor
Name    Gender  Parent1  Parent2    Name    Gender  Parent1  Parent2    of?

Peter   male    ?        ?          Steven  male    Peter    Peggy      yes
Peter   male    ?        ?          Pam     female  Peter    Peggy      yes
Peter   male    ?        ?          Anna    female  Pam      Ian        yes
Peter   male    ?        ?          Nikki   female  Pam      Ian        yes
Pam     female  Peter    Peggy      Nikki   female  Pam      Ian        yes
Grace   female  ?        ?          Ian     male    Grace    Ray        yes
Grace   female  ?        ?          Nikki   female  Pam      Ian        yes
other examples here                                                     yes
all the rest                                                            no





instances have different features? If the instances were transportation vehicles,

then number of wheels is a feature that applies to many vehicles but not to ships,

for example, whereas number of masts might be a feature that applies to ships

but not to land vehicles. The standard workaround is to make each possible

feature an attribute and to use a special “irrelevant value” flag to indicate that a

particular attribute is not available for a particular case. A similar situation arises

when the existence of one feature (say, spouse’s name) depends on the value of

another (married or single).
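The workaround can be sketched as follows. This is a minimal illustration under stated assumptions: the sentinel `IRRELEVANT`, the attribute names `wheels` and `masts`, and the helper `to_fixed_attributes` are hypothetical names chosen for the example, not part of any particular tool.

```python
# Hypothetical flag meaning "this attribute does not apply to this instance".
IRRELEVANT = "irrelevant"

# The fixed attribute set is the union of every feature any vehicle may have.
ATTRIBUTES = ["wheels", "masts"]

# Raw descriptions record only the features that apply to each vehicle.
raw = [
    {"name": "car", "wheels": 4},
    {"name": "bicycle", "wheels": 2},
    {"name": "schooner", "masts": 2},
]


def to_fixed_attributes(instance, attributes):
    """Return one value per attribute, using the flag where none applies."""
    return [instance.get(attr, IRRELEVANT) for attr in attributes]


table = [to_fixed_attributes(vehicle, ATTRIBUTES) for vehicle in raw]
for row in table:
    print(row)
```

Every row now has the same length, so the instances fit a single table, at the price of columns that are mostly filled with the flag when features are sparse.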

The value of an attribute for a particular instance is a measurement of the

quantity to which the attribute refers. There is a broad distinction between

quantities that are numeric and ones that are nominal. Numeric attributes, sometimes

called continuous attributes, measure numbers, either real or integer valued.

Note that the term continuous is routinely abused in this context: integer-valued

attributes are certainly not continuous in the mathematical sense. Nominal

attributes take on values in a prespecified, finite set of possibilities and are

sometimes called categorical. But there are other possibilities. Statistics texts often

introduce “levels of measurement” such as nominal, ordinal, interval, and ratio.

Nominal quantities have values that are distinct symbols. The values themselves

serve just as labels or names; hence the term nominal, which comes from

the Latin word for name. For example, in the weather data the attribute outlook

has values sunny, overcast, and rainy. No relation is implied among these

three—no ordering or distance measure. It certainly does not make sense to add

the values together, multiply them, or even compare their size. A rule using such

an attribute can only test for equality or inequality, as follows:

outlook: sunny    → no

         overcast → yes

         rainy    → yes
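One way to mirror this restriction in code is a plain Python `Enum`, whose members support equality tests but not ordering. The sketch below is illustrative only: the class `Outlook` and the functions `play` and `can_order` are hypothetical names, and the rule is the one shown above.

```python
from enum import Enum


class Outlook(Enum):
    """Nominal attribute: the values are labels with no order or distance."""
    SUNNY = "sunny"
    OVERCAST = "overcast"
    RAINY = "rainy"


def play(outlook):
    """Rule from the text: sunny -> no, overcast -> yes, rainy -> yes."""
    return "no" if outlook == Outlook.SUNNY else "yes"


def can_order(a, b):
    """Report whether the two values support a size comparison."""
    try:
        a < b
        return True
    except TypeError:
        # Plain Enum members define equality only, so "<" is rejected.
        return False


print(play(Outlook.SUNNY), play(Outlook.RAINY))
print(can_order(Outlook.SUNNY, Outlook.RAINY))
```

Representing nominal values this way makes an accidental size comparison fail loudly instead of silently using an arbitrary ordering such as alphabetical order.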


Ordinal quantities are ones that make it possible to rank order the categories.

However, although there is a notion of ordering, there is no notion of distance.

For example, in the weather data the attribute temperature has values hot, mild,

and cool. These are ordered. Whether you say

hot > mild > cool or hot < mild < cool

is a matter of convention: it does not matter which is used as long as

consistency is maintained. What is important is that mild lies between the other two.

Although it makes sense to compare two values, it does not make sense to add

or subtract them: the difference between hot and mild cannot be compared

with the difference between mild and cool. A rule using such an attribute might

involve a comparison, as follows:

temperature = hot → no

temperature < hot → yes
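An ordinal attribute can be sketched as an `Enum` that defines comparison but no arithmetic. The names `Temperature`, `play`, and `supports_subtraction` are hypothetical, and the integer ranks are an assumption made for the example: only their order carries meaning, following the convention cool < mild < hot.

```python
from enum import Enum


class Temperature(Enum):
    """Ordinal attribute: values can be ranked, but differences are undefined."""
    COOL = 0
    MILD = 1
    HOT = 2

    def __lt__(self, other):
        # Compare by rank; the numbers serve no purpose beyond ordering.
        if not isinstance(other, Temperature):
            return NotImplemented
        return self.value < other.value


def play(temperature):
    """Rules from the text: temperature = hot -> no; temperature < hot -> yes."""
    if temperature < Temperature.HOT:
        return "yes"
    return "no"


def supports_subtraction(a, b):
    """Report whether a difference between the two values can be computed."""
    try:
        a - b
        return True
    except TypeError:
        # No subtraction is defined, matching the absence of a distance notion.
        return False


print(play(Temperature.HOT), play(Temperature.MILD))
print(supports_subtraction(Temperature.HOT, Temperature.MILD))
```

Using `IntEnum` instead would also allow comparison, but it would permit subtraction as well, which suggests a distance between categories that an ordinal scale does not have.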
