Data Mining: Practical Machine Learning Tools and Techniques, Second Edition



which is a more compact, and hence more satisfactory, way of saying the same

thing.

Missing values

Most datasets encountered in practice, such as the labor negotiations data in

Table 1.6, contain missing values. Missing values are frequently indicated by

out-of-range entries, perhaps a negative number (e.g., -1) in a numeric field that is

normally only positive or a 0 in a numeric field that can never normally be 0.

For nominal attributes, missing values may be indicated by blanks or dashes.

Sometimes different kinds of missing values are distinguished (e.g., unknown

vs. unrecorded vs. irrelevant values) and perhaps represented by different

negative integers (-1, -2, etc.).
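
As a minimal sketch of the encoding just described, the following Python separates sentinel codes from genuine values while keeping the reason each value is missing. The code-to-reason mapping, the column name, and the sample numbers are assumptions made for illustration; a real dataset's documentation would define its own codes.

```python
# Hypothetical sentinel codes: the text mentions -1 and -2 only as examples,
# and the meaning attached to each code here is an assumption.
MISSING_CODES = {-1: "unknown", -2: "unrecorded", -3: "irrelevant"}


def decode_missing(values, codes=MISSING_CODES):
    """Split a numeric column into cleaned values and missing-value reasons.

    Returns two parallel lists: the value (None where missing) and the
    reason label (None where a genuine value is present).
    """
    cleaned, reasons = [], []
    for v in values:
        if v in codes:
            cleaned.append(None)
            reasons.append(codes[v])
        else:
            cleaned.append(v)
            reasons.append(None)
    return cleaned, reasons


hours_per_week = [40, -1, 38, -2, 35]  # made-up sample column
cleaned, reasons = decode_missing(hours_per_week)
print(cleaned)
print(reasons)
```

Keeping the reasons in a separate list means the distinction between kinds of missing value is not lost when the sentinels are removed from the numeric column.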

You have to think carefully about the significance of missing values. They may

occur for several reasons, such as malfunctioning measurement equipment,

changes in experimental design during data collection, and collation of several

similar but not identical datasets. Respondents in a survey may refuse to answer

certain questions such as age or income. In an archaeological study, a specimen

such as a skull may be damaged so that some variables cannot be measured.

In a biological study, plants or animals may die before all variables have been

measured. What do these things mean about the example under consideration?

Might the skull damage have some significance in itself, or is it just because of

some random event? Does the plants’ early death have some bearing on the case

or not?

Most machine learning methods make the implicit assumption that there is

no particular significance in the fact that a certain instance has an attribute value

missing: the value is simply not known. However, there may be a good reason

why the attribute’s value is unknown (perhaps a decision was made, on the evidence

available, not to perform some particular test), and that might convey

some information about the instance other than the fact that the value is simply

missing. If this is the case, then it would be more appropriate to record “not tested”

as another possible value for this attribute or perhaps as another attribute in the

dataset. As the preceding examples illustrate, only someone familiar with the data

can make an informed judgment about whether a particular value being missing

has some extra significance or whether it should simply be coded as an ordinary

missing value. Of course, if there seem to be several types of missing value, that

is prima facie evidence that something is going on that needs to be investigated.
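
One way to record “not tested” as another attribute, sketched in Python under stated assumptions: the records are dictionaries, a value of None means the test was not performed, and the attribute name and patient data are invented for illustration.

```python
def add_tested_flag(records, attribute):
    """Return copies of the records with an extra '<attribute>_tested' field.

    The flag is True when the attribute holds a value and False when it is
    None (assumed here to mean that the test was never performed).
    """
    flagged = []
    for record in records:
        row = dict(record)  # copy so the original records are left untouched
        row[attribute + "_tested"] = row.get(attribute) is not None
        flagged.append(row)
    return flagged


patients = [
    {"id": 1, "cholesterol": 5.2},
    {"id": 2, "cholesterol": None},  # hypothetical case: test never ordered
]
flagged = add_tested_flag(patients, "cholesterol")
print(flagged)
```

A learning scheme can then use the new field directly, so the decision not to test is available as information in its own right.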

If missing values mean that an operator has decided not to make a particular

measurement, that may convey a great deal more than the mere fact that the

value is unknown. For example, people analyzing medical databases have

noticed that cases may, in some circumstances, be diagnosable simply from the

tests that a doctor decides to make regardless of the outcome of the tests. Then


a record of which values are “missing” is all that is needed for a complete

diagnosis—the actual values can be ignored completely!

Inaccurate values

It is important to check data mining files carefully for rogue attributes and

attribute values. The data used for mining has almost certainly not been gathered

expressly for that purpose. When originally collected, many of the fields

probably didn’t matter and were left blank or unchecked. Provided that it does

not affect the original purpose of the data, there is no incentive to correct it.

However, when the same database is used for mining, the errors and omissions

suddenly start to assume great significance. For example, banks do not really need

to know the age of their customers, so their databases may contain many missing

or incorrect values. But age may be a very significant feature in mined rules.

Typographic errors in a dataset will obviously lead to incorrect values. Often

the value of a nominal attribute is misspelled, creating an extra possible value

for that attribute. Or perhaps it is not a misspelling but different names for the

same thing, such as Pepsi and Pepsi Cola. Obviously the point of a defined

format such as ARFF is to allow data files to be checked for internal consistency.

However, errors that occur in the original data file are often preserved through

the conversion process into the file that is used for data mining; thus the list of

possible values that each attribute takes on should be examined carefully.
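
Examining the list of values an attribute takes on can be as simple as counting them. The Python sketch below uses an invented column; only Pepsi and Pepsi Cola come from the text, and the other entries are made up to show how a misspelling surfaces as a rare extra value.

```python
from collections import Counter


def value_counts(values):
    """Count how often each distinct value occurs, most frequent first."""
    return Counter(values).most_common()


# Made-up nominal column containing a variant name and a misspelling.
brand = ["Pepsi", "Coke", "Pepsi Cola", "Pepsi", "Coke", "Pespi"]
counts = value_counts(brand)
for value, count in counts:
    print(f"{value!r}: {count}")
```

Values that occur only once or twice are worth a second look, although deciding whether two names really denote the same thing still takes knowledge of the data.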

Typographic or measurement errors in numeric values generally cause outliers

that can be detected by graphing one variable at a time. Erroneous values

often deviate significantly from the pattern that is apparent in the remaining

values. Sometimes, however, inaccurate values are hard to find, particularly

without specialist domain knowledge.
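
The text detects outliers by graphing. As a numeric stand-in for that visual check, the Python sketch below applies the common 1.5 × IQR rule to one variable at a time. This rule is a substitute chosen for illustration, not a method described in the text, and the sample ages are invented.

```python
import statistics


def find_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles.

    Uses statistics.quantiles (Python 3.8+) with its default method to
    estimate the first and third quartiles of a single numeric variable.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


ages = [23, 31, 35, 38, 41, 44, 52, 350]  # made-up data; 350 looks like a typo for 35
print(find_outliers(ages))
```

A flagged value is only a candidate error: a genuine but unusual measurement would be flagged in the same way, so each one still has to be checked against the source.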

Duplicate data presents another source of error. Most machine learning tools

will produce different results if some of the instances in the data files are duplicated,

because repetition gives them more influence on the result.
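
A quick check for exact duplicates, sketched in Python. It assumes each instance is a sequence of hashable attribute values, and the sample rows are invented for illustration.

```python
from collections import Counter


def duplicated_rows(rows):
    """Return each row that occurs more than once, mapped to its count."""
    counts = Counter(tuple(row) for row in rows)
    return {row: n for row, n in counts.items() if n > 1}


# Made-up instances: (outlook, temperature, play)
rows = [
    ("sunny", 85, "no"),
    ("rainy", 70, "yes"),
    ("sunny", 85, "no"),
]
dups = duplicated_rows(rows)
print(dups)
```

Whether a repeated row is an error or two genuinely identical observations cannot be decided from the data alone, so the sketch only reports duplicates and does not remove them.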

People often make deliberate errors when entering personal data into databases.

They might make minor changes in the spelling of their street to try to

identify whether the information they have provided was sold to advertising

agencies that burden them with junk mail. They might adjust the spelling of

their name when applying for insurance if they have had insurance refused in

the past. Rigid computerized data entry systems often impose restrictions that

require imaginative workarounds. One story tells of a foreigner renting a vehicle

in the United States. Being from abroad, he had no ZIP code, yet the computer

insisted on one; in desperation the operator suggested that he use the ZIP code

of the rental agency. If this is common practice, future data mining projects may

notice a cluster of customers who apparently live in the same district as the agency!

Similarly, a supermarket checkout operator sometimes uses his own frequent
