The real
drawbacks of such techniques, however, are that they do not cope
well with noisy data, and they tend to be so slow as to be unusable on anything
but small artificial datasets. They are not covered in this book; see Bergadano
and Gunetti (1996) for a comprehensive treatment.
In summary, the input to a data mining scheme is generally expressed as a
table of independent instances of the concept to be learned. Because of this, it
has been suggested, disparagingly, that we should really talk of file mining rather
than database mining. Relational data is more complex than a flat file. A finite
set of finite relations can always be recast into a single table, although often at
enormous cost in space. Moreover, denormalization can generate spurious
regularities in the data, and it is essential to check the data for such artifacts
before applying a learning method. Finally, potentially infinite concepts can be
dealt with by learning rules that are recursive, although that is beyond the scope
of this book.
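
As a rough illustration of the last two points, the sketch below flattens a small
parent relation into a single table with one row per ordered pair of people, and
defines the ancestor concept recursively. The parent links are the ones that
appear in Table 2.5; the names `parents`, `is_ancestor`, and `flat` are
hypothetical and chosen only for this example, which is not a method described
in the text.

```python
from itertools import permutations

# Child -> (parent1, parent2), taken from the rows of Table 2.5.
parents = {
    "Steven": ("Peter", "Peggy"),
    "Pam": ("Peter", "Peggy"),
    "Ian": ("Grace", "Ray"),
    "Anna": ("Pam", "Ian"),
    "Nikki": ("Pam", "Ian"),
}

def is_ancestor(a, b):
    """Recursive definition: a is a parent of b, or an ancestor of a parent of b."""
    return any(p == a or is_ancestor(a, p) for p in parents.get(b, ()))

# Everyone mentioned, either as a child or as a parent.
people = sorted(set(parents) | {p for pair in parents.values() for p in pair})

# The flattened table: one row for every ordered pair of distinct people.
flat = [(a, b, is_ancestor(a, b)) for a, b in permutations(people, 2)]

print(len(people), "people ->", len(flat), "rows,",
      sum(1 for _, _, yes in flat if yes), "of them positive")
```

The number of rows grows with the square of the number of people, which gives a
feel for the space cost of recasting relations as a single table, while the
recursive definition itself stays two lines long.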
2.3 What’s in an attribute?
Each individual, independent instance that provides the input to machine
learning is characterized by its values on a fixed, predefined set of features or
attributes. The instances are the rows of the tables that we have shown for the
weather, contact lens, iris, and CPU performance problems, and the attributes
are the columns. (The labor negotiations data was an exception: we presented
this with instances in columns and attributes in rows for space reasons.)
Table 2.5  Another relation represented as a table.

First person                       Second person                      Ancestor
Name   Gender  Parent1  Parent2    Name    Gender  Parent1  Parent2   of?
Peter  male    ?        ?          Steven  male    Peter    Peggy     yes
Peter  male    ?        ?          Pam     female  Peter    Peggy     yes
Peter  male    ?        ?          Anna    female  Pam      Ian       yes
Peter  male    ?        ?          Nikki   female  Pam      Ian       yes
Pam    female  Peter    Peggy      Nikki   female  Pam      Ian       yes
Grace  female  ?        ?          Ian     male    Grace    Ray       yes
Grace  female  ?        ?          Nikki   female  Pam      Ian       yes
other examples here                                                   yes
all the rest                                                          no

The use of a fixed set of features imposes another restriction on the kinds of
problems generally considered in practical data mining. What if different
instances have different features? If the instances were transportation vehicles,
then number of wheels is a feature that applies to many vehicles but not to ships,
for example, whereas number of masts might be a feature that applies to ships
but not to land vehicles. The standard workaround is to make each possible
feature an attribute and to use a special “irrelevant value” flag to indicate that a
particular attribute is not available for a particular case. A similar situation arises
when the existence of one feature (say, spouse’s name) depends on the value of
another (married or single).
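
A minimal sketch of this workaround follows, assuming a fixed attribute list of
just `wheels` and `masts`. The names `IRRELEVANT`, `ATTRIBUTES`, and
`to_instance` are hypothetical and are not taken from the text or from any
particular library.

```python
from enum import Enum

class Flag(Enum):
    """Special marker meaning 'this attribute does not apply to this instance'."""
    IRRELEVANT = "irrelevant"

IRRELEVANT = Flag.IRRELEVANT

# Every feature that applies to any kind of vehicle becomes an attribute.
ATTRIBUTES = ("wheels", "masts")

def to_instance(features):
    """Turn a dict of applicable features into a fixed-length instance."""
    return tuple(features.get(name, IRRELEVANT) for name in ATTRIBUTES)

car = to_instance({"wheels": 4})
schooner = to_instance({"masts": 2})

print(car)       # wheels is 4, masts is flagged as irrelevant
print(schooner)  # wheels is flagged as irrelevant, masts is 2
```

Using a dedicated flag rather than a number such as 0 keeps "does not apply"
distinct from a genuine measurement of zero.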
The value of an attribute for a particular instance is a measurement of the
quantity to which the attribute refers. There is a broad distinction between
quantities that are numeric and ones that are nominal. Numeric attributes,
sometimes called continuous attributes, measure numbers, either real or integer
valued. Note that the term continuous is routinely abused in this context:
integer-valued attributes are certainly not continuous in the mathematical
sense. Nominal attributes take on values in a prespecified, finite set of
possibilities and are sometimes called categorical. But there are other
possibilities. Statistics texts often introduce “levels of measurement” such as
nominal, ordinal, interval, and ratio.
Nominal quantities have values that are distinct symbols. The values themselves
serve just as labels or names; hence the term nominal, which comes from the
Latin word for name. For example, in the weather data the attribute outlook has
values sunny, overcast, and rainy. No relation is implied among these three: no
ordering or distance measure. It certainly does not make sense to add the values
together, multiply them, or even compare their size. A rule using such an
attribute can only test for equality or inequality, as follows:
outlook: sunny    → no
         overcast → yes
         rainy    → yes
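
The rule above can be sketched in code as a function that uses nothing but
equality tests on the attribute value. The function name `play_from_outlook`
and the set `OUTLOOK_VALUES` are hypothetical, introduced only for this
illustration.

```python
# The prespecified, finite set of values the nominal attribute can take.
OUTLOOK_VALUES = {"sunny", "overcast", "rainy"}

def play_from_outlook(outlook):
    """Apply the outlook rule using equality tests only."""
    if outlook not in OUTLOOK_VALUES:
        raise ValueError(f"unknown outlook value: {outlook!r}")
    if outlook == "sunny":
        return "no"
    if outlook == "overcast":
        return "yes"
    return "yes"  # the only remaining value is rainy

for value in ("sunny", "overcast", "rainy"):
    print(value, "->", play_from_outlook(value))
```

Nothing in the function relies on an ordering of the values or a distance
between them, which is all that a nominal attribute permits.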
Ordinal quantities are ones that make it possible to rank order the categories.
However, although there is a notion of ordering, there is no notion of distance.
For example, in the weather data the attribute temperature has values hot, mild,
and cool. These are ordered. Whether you say hot > mild > cool or
hot < mild < cool is a matter of convention; it does not matter which is used as
long as consistency is maintained. What is important is that mild lies between
the other two. Although it makes sense to compare two values, it does not make
sense to add or subtract them: the difference between hot and mild cannot be
compared with the difference between mild and cool. A rule using such an
attribute might involve a comparison, as follows:
temperature = hot → no
temperature < hot → yes
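
One way to sketch an ordinal attribute in code is an enumeration that supports
comparison but not arithmetic. The class `Temperature` and the function
`play_from_temperature` below are hypothetical names, and the convention
cool < mild < hot is an arbitrary choice made for the example.

```python
from enum import Enum
from functools import total_ordering

@total_ordering
class Temperature(Enum):
    """Ordinal values; the numbers fix the rank order only, not a distance."""
    COOL = 1
    MILD = 2
    HOT = 3

    def __lt__(self, other):
        if not isinstance(other, Temperature):
            return NotImplemented
        return self.value < other.value

def play_from_temperature(temperature):
    """Apply the temperature rule using equality and order comparisons."""
    if temperature == Temperature.HOT:
        return "no"
    if temperature < Temperature.HOT:
        return "yes"

for t in Temperature:
    print(t.name.lower(), "->", play_from_temperature(t))
```

Because `Temperature` derives from `Enum` rather than from an integer type, an
expression such as `Temperature.HOT - Temperature.MILD` raises a `TypeError`,
which matches the point that differences between ordinal values carry no
meaning.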