which
is a more compact, and hence more satisfactory, way of saying the same
thing.
Missing values
Most datasets encountered in practice, such as the labor negotiations data in
Table 1.6, contain missing values. Missing values are frequently indicated by
out-of-range entries, perhaps a negative number (e.g., -1) in a numeric field that is
normally only positive or a 0 in a numeric field that can never normally be 0.
For nominal attributes, missing values may be indicated by blanks or dashes.
Sometimes different kinds of missing values are distinguished (e.g., unknown
vs. unrecorded vs. irrelevant values) and perhaps represented by different
negative integers (-1, -2, etc.).
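As a minimal sketch of how such sentinel codes might be separated from genuine readings before analysis, the following Python fragment maps each code to an explicit reason. The particular codes and their meanings (MISSING_CODES) and the function name decode are assumptions made for illustration; a real dataset defines its own conventions.

```python
from typing import Optional

# Hypothetical sentinel codes; every real dataset defines its own.
MISSING_CODES = {-1: "unknown", -2: "unrecorded", -3: "irrelevant"}


def decode(value: float) -> tuple[Optional[float], Optional[str]]:
    """Return (value, None) for a real reading, or (None, reason) for a sentinel."""
    if value in MISSING_CODES:
        return None, MISSING_CODES[value]
    return value, None


# An age field that is normally positive, with two sentinel entries.
ages = [34, -1, 52, -2, 41]
decoded = [decode(a) for a in ages]

for original, (value, reason) in zip(ages, decoded):
    print(original, "->", value, reason)
```

Keeping the reason alongside the value preserves the distinction between kinds of missing value instead of collapsing them into a single marker.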
You have to think carefully about the significance of missing values. They may
occur for several reasons, such as malfunctioning measurement equipment,
changes in experimental design during data collection, and collation of several
similar but not identical datasets. Respondents in a survey may refuse to answer
certain questions such as age or income. In an archaeological study, a specimen
such as a skull may be damaged so that some variables cannot be measured.
In a biological study, plants or animals may die before all variables have been
measured. What do these things mean for the example under consideration?
Might the skull damage have some significance in itself, or is it merely the result of
some random event? Does the plants’ early death have some bearing on the case
or not?
Most machine learning methods make the implicit assumption that there is
no particular significance in the fact that a certain instance has an attribute value
missing: the value is simply not known. However, there may be a good reason
why the attribute’s value is unknown (perhaps a decision was made, on the
evidence available, not to perform some particular test), and that might convey
some information about the instance other than the fact that the value is simply
missing. If this is the case, then it would be more appropriate to record “not tested”
as another possible value for this attribute or perhaps as another attribute in the
dataset. As the preceding examples illustrate, only someone familiar with the data
can make an informed judgment about whether a particular value being missing
has some extra significance or whether it should simply be coded as an ordinary
missing value. Of course, if there seem to be several types of missing value, that
is prima facie evidence that something is going on that needs to be investigated.
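The idea of recording “not tested” as a separate attribute can be sketched as follows. This is only an illustration: the record layout, the attribute name blood_test, the use of None for a missing value, and the helper add_tested_flag are all assumptions, not part of any particular tool.

```python
def add_tested_flag(records, attribute):
    """Return copies of the records with an extra '<attribute>_tested' attribute.

    The new attribute is "no" when the original value is missing (None)
    and "yes" otherwise; the original records are left unchanged.
    """
    flagged = []
    for record in records:
        row = dict(record)  # copy so the input is not modified
        row[attribute + "_tested"] = "no" if row.get(attribute) is None else "yes"
        flagged.append(row)
    return flagged


# Hypothetical medical records; None marks a test that was never performed.
patients = [
    {"id": 1, "blood_test": 5.4},
    {"id": 2, "blood_test": None},
]
flagged = add_tested_flag(patients, "blood_test")
print(flagged)
```

A learning method can then use the new attribute like any other, so the decision not to test becomes available as evidence rather than being hidden inside a missing value.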
If missing values mean that an operator has decided not to make a particular
measurement, that may convey a great deal more than the mere fact that the
value is unknown. For example, people analyzing medical databases have
noticed that cases may, in some circumstances, be diagnosable simply from the
tests that a doctor decides to make regardless of the outcome of the tests. Then
a record of which values are “missing” is all that is needed for a complete
diagnosis—the actual values can be ignored completely!
Inaccurate values
It is important to check data mining files carefully for rogue attributes and
attribute values. The data used for mining has almost certainly not been gathered
expressly for that purpose. When originally collected, many of the fields
probably didn’t matter and were left blank or unchecked. Provided that such gaps
do not affect the original purpose of the data, there is no incentive to correct them.
However, when the same database is used for mining, the errors and omissions
suddenly start to assume great significance. For example, banks do not really need
to know the age of their customers, so their databases may contain many missing
or incorrect values. But age may be a very significant feature in mined rules.
Typographic errors in a dataset will obviously lead to incorrect values. Often
the value of a nominal attribute is misspelled, creating an extra possible value
for that attribute. Or perhaps it is not a misspelling but different names for the
same thing, such as Pepsi and Pepsi Cola. Obviously the point of a defined
format such as ARFF is to allow data files to be checked for internal consistency.
However, errors that occur in the original data file are often preserved through
the conversion process into the file that is used for data mining; thus the list of
possible values that each attribute takes on should be examined carefully.
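One simple way to examine the values a nominal attribute takes on is to tabulate them and look at the rare ones, which is where misspellings and alternative names tend to appear. The sketch below uses made-up data and hypothetical helper names (value_counts, rare_values); deciding whether a rare value is an error or a legitimate category still requires human judgment.

```python
from collections import Counter


def value_counts(values):
    """Count how often each value of a nominal attribute occurs."""
    return Counter(v.strip() for v in values)


def rare_values(values, max_count=1):
    """Return the values that occur at most max_count times, sorted by name."""
    counts = value_counts(values)
    return sorted(v for v, n in counts.items() if n <= max_count)


# Made-up attribute column containing a misspelling and a synonym.
drinks = ["Pepsi", "Coke", "Pepsi", "Pepsi Cola", "Coke", "Pespi", "Coke", "Pepsi"]

print(value_counts(drinks))
print(rare_values(drinks))
```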
Typographic or measurement errors in numeric values generally cause outliers
that can be detected by graphing one variable at a time. Erroneous values
often deviate significantly from the pattern that is apparent in the remaining
values. Sometimes, however, inaccurate values are hard to find, particularly
without specialist domain knowledge.
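Where a graph is not convenient, a numeric stand-in for inspecting one variable at a time is to flag values that lie far outside the interquartile range. Note that this rule is a substitute for the graphical check described above, not the method the text proposes, and that the multiplier of 1.5 is a common convention rather than anything prescribed here. The function name and the data are illustrative.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k interquartile ranges outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Made-up ages; 250 is the kind of value a typing slip might produce.
ages = [23, 25, 31, 35, 38, 41, 44, 47, 52, 250]
print(iqr_outliers(ages))
```

A flagged value is only a candidate for inspection. Whether it is actually wrong remains a question for someone who knows the data.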
Duplicate data presents another source of error. Most machine learning tools
will produce different results if some of the instances in the data files are
duplicated, because repetition gives them more influence on the result.
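A minimal sketch of how duplicated instances might be found, assuming each instance is a sequence of attribute values and that two instances count as duplicates only when every value matches exactly. The function names are hypothetical. Whether duplicates should be removed depends on the data: identical rows may be genuine repeat observations rather than errors.

```python
from collections import Counter


def duplicate_instances(instances):
    """Map each instance that occurs more than once to its number of occurrences."""
    counts = Counter(tuple(inst) for inst in instances)
    return {inst: n for inst, n in counts.items() if n > 1}


def drop_duplicates(instances):
    """Keep only the first occurrence of each instance, preserving order."""
    seen = set()
    unique = []
    for inst in instances:
        key = tuple(inst)
        if key not in seen:
            seen.add(key)
            unique.append(inst)
    return unique


# Made-up instances: (outlook, temperature, play).
data = [
    ("sunny", 85, "no"),
    ("rainy", 70, "yes"),
    ("sunny", 85, "no"),
]
print(duplicate_instances(data))
print(drop_duplicates(data))
```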
People often make deliberate errors when entering personal data into databases.
They might make minor changes in the spelling of their street to try to
identify whether the information they have provided was sold to advertising
agencies that burden them with junk mail. They might adjust the spelling of
their name when applying for insurance if they have had insurance refused in
the past. Rigid computerized data entry systems often impose restrictions that
require imaginative workarounds. One story tells of a foreigner renting a vehicle
in the United States. Being from abroad, he had no ZIP code, yet the computer
insisted on one; in desperation the operator suggested that he use the ZIP code
of the rental agency. If this is common practice, future data mining projects may
notice a cluster of customers who apparently live in the same district as the agency!
Similarly, a supermarket checkout operator sometimes uses his own frequent