Data Mining: Practical Machine Learning Tools and Techniques, Second Edition

Date: 08.10.2017

2.3 What’s in an attribute?

Notice that the distinction between nominal and ordinal quantities is not
always straightforward and obvious. Indeed, the very example of a nominal
quantity that we used previously, outlook, is not completely clear: you might
argue that the three values do have an ordering, with overcast being somehow
intermediate between sunny and rainy as weather turns from good to bad.

Interval quantities have values that are not only ordered but also measured
in fixed and equal units. A good example is temperature, expressed in degrees
(say, degrees Fahrenheit) rather than on the nonnumeric scale implied by cool,
mild, and hot. It makes perfect sense to talk about the difference between two
temperatures, say 46 and 48 degrees, and compare that with the difference
between another two temperatures, say 22 and 24 degrees. Another example is
dates. You can talk about the difference between the years 1939 and 1945 (6
years) or even the average of the years 1939 and 1945 (1942), but it doesn’t
make much sense to consider the sum of the years 1939 and 1945 (3884) or three
times the year 1939 (5817), because the starting point, year 0, is completely
arbitrary; indeed, it has changed many times throughout the course of history.
(Children sometimes wonder what the year 300 BC was called in 300 BC.)
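The arithmetic in the years example can be checked with a short Python sketch. It re-expresses both years on a hypothetical calendar whose year 0 falls 1000 years later (the `OFFSET` value and the function name are assumptions made purely for illustration) and tests which quantities are unaffected by the choice of zero.

```python
# Hypothetical alternative calendar: its year 0 is our year 1000.
OFFSET = 1000


def to_other_calendar(year):
    """Re-express a year relative to the shifted zero point."""
    return year - OFFSET


a, b = 1939, 1945
a2, b2 = to_other_calendar(a), to_other_calendar(b)

# Differences do not depend on where year 0 is placed.
difference_preserved = (b - a) == (b2 - a2)

# The average converts consistently: averaging then converting gives the
# same year as converting then averaging.
mean_preserved = to_other_calendar((a + b) / 2) == (a2 + b2) / 2

# Sums and multiples give different answers depending on the zero point.
sum_preserved = to_other_calendar(a + b) == a2 + b2
triple_preserved = to_other_calendar(3 * a) == 3 * a2

print(difference_preserved, mean_preserved, sum_preserved, triple_preserved)
```

Under these assumptions the difference and the average survive the change of origin, whereas the sum and the multiple do not, which is what marks years as an interval rather than a ratio quantity.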

Ratio quantities are ones for which the measurement method inherently
defines a zero point. For example, when measuring the distance from one object
to others, the distance between the object and itself forms a natural zero.
Ratio quantities are treated as real numbers: any mathematical operations are
allowed. It certainly does make sense to talk about three times the distance
and even to multiply one distance by another to get an area.

However, the question of whether there is an “inherently” defined zero point
depends on our scientific knowledge: it is culture-relative. For example,
Daniel Fahrenheit knew no lower limit to temperature, and his scale is an
interval one. Nowadays, however, we view temperature as a ratio scale based on
absolute zero. Measurement of time in years since some culturally defined zero
such as AD 0 is not a ratio scale; years since the big bang is. Even the zero
point of money, where we are usually quite happy to say that something cost
twice as much as something else, may not be quite clearly defined for those of
us who constantly max out our credit cards.
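To see why the choice of zero matters for ratios, the following Python sketch converts two Fahrenheit readings to kelvins, a scale whose zero is absolute zero. The function name and the sample readings are illustrative assumptions; the conversion formula is the standard one.

```python
def fahrenheit_to_kelvin(deg_f):
    """Convert degrees Fahrenheit to kelvins (zero at absolute zero)."""
    return (deg_f - 32.0) * 5.0 / 9.0 + 273.15


warm_f, cold_f = 48.0, 24.0

# On the interval scale the readings look like a factor of two apart...
naive_ratio = warm_f / cold_f

# ...but measured from absolute zero they differ by only about 5 percent.
kelvin_ratio = fahrenheit_to_kelvin(warm_f) / fahrenheit_to_kelvin(cold_f)

print(f"Fahrenheit ratio: {naive_ratio:.3f}")
print(f"Kelvin ratio:     {kelvin_ratio:.3f}")
```

The Fahrenheit ratio is an artifact of where that scale happens to place its zero; only the ratio taken from absolute zero reflects a physically meaningful "times as hot".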

Most practical data mining systems accommodate just two of these four levels
of measurement: nominal and ordinal. Nominal attributes are sometimes called
categorical, enumerated, or discrete. Enumerated is the standard term used in
computer science to denote a categorical data type; however, the strict
definition of the term (namely, to put into one-to-one correspondence with the
natural numbers) implies an ordering, which is specifically not implied in the
machine learning context. Discrete also has connotations of ordering because
you often discretize a continuous, numeric quantity. Ordinal attributes are
generally called numeric, or perhaps continuous, but without the implication of
mathematical continuity. A special case of the nominal scale is the dichotomy,
which has only two members, often designated as true and false, or as yes and
no in the weather data. Such attributes are sometimes called Boolean.
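One way to picture the two levels that practical systems support is as two attribute types that permit different tests. The Python sketch below is only an illustration under assumed names: `NominalAttribute`, `NumericAttribute`, their methods, and the attribute name `windy` are hypothetical and not taken from any particular system. A nominal attribute allows equality tests against a fixed set of labels, whereas a numeric one allows threshold comparisons.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class NominalAttribute:
    """An attribute whose values are unordered labels."""
    name: str
    values: Tuple[str, ...]

    def is_dichotomy(self) -> bool:
        # A dichotomy (Boolean attribute) has exactly two possible values.
        return len(self.values) == 2

    def equals(self, value: str, target: str) -> bool:
        # Only equality tests are meaningful; reject unknown labels.
        if value not in self.values or target not in self.values:
            raise ValueError(f"unknown value for {self.name}")
        return value == target


@dataclass(frozen=True)
class NumericAttribute:
    """An attribute whose values are ordered numbers."""
    name: str

    def below(self, value: float, threshold: float) -> bool:
        # Ordered values support comparisons against a threshold.
        return value < threshold


# Illustrative attributes; "windy" is an assumed name for a two-valued one.
outlook = NominalAttribute("outlook", ("sunny", "overcast", "rainy"))
windy = NominalAttribute("windy", ("true", "false"))
temperature = NumericAttribute("temperature")
```

Keeping the comparison method off the nominal type is a deliberate choice in this sketch: it prevents a learner from accidentally testing `outlook < "rainy"`, an ordering the nominal scale does not define.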

Machine learning systems can use a wide variety of other information about
attributes. For instance, dimensional considerations could be used to restrict
the search to expressions or comparisons that are dimensionally correct.
Circular ordering could affect the kinds of tests that are considered. For
example, in a temporal context, tests on a day attribute could involve next
day, previous day, next weekday, and same day next week. Partial orderings,
that is, generalization or specialization relations, frequently occur in
practical situations. Information of this kind is often referred to as
metadata, data about data. However, the kinds of practical methods used for
data mining are rarely capable of taking metadata into account, although it is
likely that these capabilities will develop rapidly in the future. (We return
to this in Chapter 8.)
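The tests on a circularly ordered day attribute can be sketched with modular arithmetic. This Python example is a minimal illustration; the day abbreviations, the function names, and the assumption that Saturday and Sunday are the non-weekdays are all choices made here rather than anything prescribed by the text.

```python
DAYS = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
WEEKEND = ("Sat", "Sun")  # assumed non-weekdays


def next_day(day):
    """Step forward one position, wrapping from Sun back to Mon."""
    return DAYS[(DAYS.index(day) + 1) % len(DAYS)]


def previous_day(day):
    """Step back one position, wrapping from Mon back to Sun."""
    return DAYS[(DAYS.index(day) - 1) % len(DAYS)]


def next_weekday(day):
    """Advance to the following day that is not in the weekend."""
    candidate = next_day(day)
    while candidate in WEEKEND:
        candidate = next_day(candidate)
    return candidate


def same_day_next_week(day):
    """Seven steps around the circle land on the same day again."""
    result = day
    for _ in range(len(DAYS)):
        result = next_day(result)
    return result
```

Because the ordering is circular, a plain numeric comparison such as "Sun is greater than Mon" would be misleading; the wrap-around in `next_day` and `previous_day` is what captures the structure.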

2.4 Preparing the input

Preparing input for a data mining investigation usually consumes the bulk of
the effort invested in the entire data mining process. Although this book is
not really about the problems of data preparation, we want to give you a
feeling for the issues involved so that you can appreciate the complexities.
Following that, we look at a particular input file format, the
attribute-relation file format (ARFF format), that is used in the Java package
described in Part II. Then we consider issues that arise when converting
datasets to such a format, because there are some simple practical points to be
aware of. Bitter experience shows that real data is often disappointingly low
in quality, and careful checking (a process that has become known as data
cleaning) pays off many times over.

Gathering the data together

When beginning work on a data mining problem, it is first necessary to bring
all the data together into a set of instances. We explained the need to
denormalize relational data when describing the family tree example. Although
it illustrates the basic issue, this self-contained and rather artificial
example does not really convey a feeling for what the process will be like in
practice. In a real business application, it will be necessary to bring data
together from different departments. For example, in a marketing study, data
will be needed from the sales department, the customer billing department, and
the customer service department.
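As a rough illustration of bringing departmental data together into a set of instances, the Python sketch below merges three small extracts into one flat record per customer. Everything in it is hypothetical: the customer IDs, the field names, the sample values, and the `gather_instances` helper are invented for the example, and it assumes that all departments already share a common customer key, which real data often does not.

```python
# Hypothetical extracts from three departments, keyed by customer ID.
sales = {"C001": {"units_bought": 12}, "C002": {"units_bought": 3}}
billing = {"C001": {"amount_due": 40.0}, "C002": {"amount_due": 0.0}}
service = {"C001": {"complaints": 2}}  # C002 has no service record


def gather_instances(*sources):
    """Merge per-department records into one flat instance per customer."""
    customer_ids = sorted(set().union(*(source.keys() for source in sources)))
    # Collect every field name so that all instances share the same attributes.
    fields = sorted(
        {field for source in sources for record in source.values() for field in record}
    )
    instances = []
    for customer_id in customer_ids:
        instance = {"customer_id": customer_id}
        # Start with explicit missing values, then fill in what is known.
        for field in fields:
            instance[field] = None
        for source in sources:
            instance.update(source.get(customer_id, {}))
        instances.append(instance)
    return instances


instances = gather_instances(sales, billing, service)
for instance in instances:
    print(instance)
```

Recording absent fields as `None` rather than dropping them keeps every instance the same shape, so a missing service record shows up as a missing value instead of silently shortening the instance.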

Integrating data from different sources usually presents many challenges:
not deep issues of principle but nasty realities of practice. Different
departments will use different styles of record keeping, different conventions,
different time periods, different degrees of data aggregation, different
primary keys, and will have different kinds of error. The data must be
assembled, integrated, and
