Notice that the distinction between nominal and ordinal quantities is not
always straightforward and obvious. Indeed, the very example of a nominal
quantity that we used previously, outlook, is not completely clear: you might
argue that the three values do have an ordering, with overcast being somehow
intermediate between sunny and rainy as weather turns from good to bad.
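The practical consequence of declaring an attribute nominal or ordinal can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the class names Outlook and Temperature and the numeric ranks 1 to 3 are hypothetical choices, not part of any data mining package. A nominal attribute supports equality tests only, whereas an ordinal one also supports order comparisons.

```python
from enum import Enum
from functools import total_ordering


class Outlook(Enum):
    """Nominal: values are distinct labels, so only equality tests apply."""
    SUNNY = "sunny"
    OVERCAST = "overcast"
    RAINY = "rainy"


@total_ordering
class Temperature(Enum):
    """Ordinal: values are ranked, but distances between ranks are undefined."""
    COOL = 1   # the ranks are arbitrary; only their order matters
    MILD = 2
    HOT = 3

    def __lt__(self, other):
        if self.__class__ is other.__class__:
            return self.value < other.value
        return NotImplemented


# Equality is meaningful for both kinds of attribute.
print(Outlook.SUNNY == Outlook.SUNNY)
# Order comparisons are meaningful only for the ordinal attribute.
print(Temperature.COOL < Temperature.HOT)
```

If you decided that outlook really is ordered, you would model it like Temperature instead; comparing two Outlook values with < as written here raises a TypeError.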
Interval quantities have values that are not only ordered but also measured
in fixed and equal units. A good example is temperature, expressed in degrees
(say, degrees Fahrenheit) rather than on the nonnumeric scale implied by cool,
mild, and hot. It makes perfect sense to talk about the difference between two
temperatures, say 46 and 48 degrees, and compare that with the difference
between another two temperatures, say 22 and 24 degrees. Another example is
dates. You can talk about the difference between the years 1939 and 1945 (6
years) or even the average of the years 1939 and 1945 (1942), but it doesn’t make
much sense to consider the sum of the years 1939 and 1945 (3884) or three
times the year 1939 (5817), because the starting point, year 0, is completely
arbitrary; indeed, it has changed many times throughout the course of history.
(Children sometimes wonder what the year 300 BC was called in 300 BC.)
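Python's standard date type happens to enforce the same rules, which makes it a convenient illustration. The sketch below assumes January 1 as a stand-in for each year, and try_sum is a hypothetical helper introduced only for this example. Differences and midpoints of dates are defined, but the sum of two dates is not.

```python
from datetime import date

# January 1 of each year stands in for the bare year numbers (an assumption).
start = date(1939, 1, 1)
end = date(1945, 1, 1)

# A difference between two interval values is meaningful.
gap = end - start            # a timedelta, not a date

# An average is meaningful too: a point plus half of a difference.
midpoint = start + gap / 2


def try_sum(a, b):
    """Return a + b, or None if the type refuses the operation."""
    try:
        return a + b
    except TypeError:
        return None


print(gap.days)
print(midpoint)
print(try_sum(start, end))   # the date type does not define date + date
```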
Ratio quantities are ones for which the measurement method inherently
defines a zero point. For example, when measuring the distance from one object
to others, the distance between the object and itself forms a natural zero. Ratio
quantities are treated as real numbers: any mathematical operations are allowed.
It certainly does make sense to talk about three times the distance and even to
multiply one distance by another to get an area.
However, the question of whether there is an “inherently” defined zero point
depends on our scientific knowledge—it’s culture relative. For example, Daniel
Fahrenheit knew no lower limit to temperature, and his scale is an interval one.
Nowadays, however, we view temperature as a ratio scale based on absolute zero.
Measurement of time in years since some culturally defined zero such as AD 0 is
not a ratio scale; years since the big bang is. Even the zero point of money
(where we are usually quite happy to say that something cost twice as much as
something else) may not be quite clearly defined for those of us who constantly
max out our credit cards.
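A small numeric sketch shows why ratios depend on where the zero sits. The readings of 23 and 46 degrees are made-up values chosen for illustration, and fahrenheit_to_kelvin is a helper we define here using the standard conversion formula.

```python
def fahrenheit_to_kelvin(deg_f: float) -> float:
    """Convert degrees Fahrenheit to kelvins, whose zero is absolute zero."""
    return (deg_f - 32.0) * 5.0 / 9.0 + 273.15


cold_f, warm_f = 23.0, 46.0   # illustrative readings, not measured data

# On the Fahrenheit scale the ratio merely reflects an arbitrary zero.
naive_ratio = warm_f / cold_f

# Measured from absolute zero, the same two temperatures differ only slightly.
true_ratio = fahrenheit_to_kelvin(warm_f) / fahrenheit_to_kelvin(cold_f)

print(naive_ratio)
print(round(true_ratio, 3))
```

The first ratio suggests "twice as hot"; the second, computed on a scale with an inherent zero, is close to 1.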
Most practical data mining systems accommodate just two of these four levels
of measurement: nominal and ordinal. Nominal attributes are sometimes called
categorical, enumerated, or discrete. Enumerated is the standard term used in
computer science to denote a categorical data type; however, the strict
definition of the term (namely, to put into one-to-one correspondence with the
natural numbers) implies an ordering, which is specifically not implied in the
machine learning context. Discrete also has connotations of ordering because
you often discretize a continuous, numeric quantity. Ordinal attributes are
generally called numeric, or perhaps continuous, but without the implication of
mathematical continuity. A special case of the nominal scale is the dichotomy,
which has only two members, often designated as true and false, or yes and no
in the weather data. Such attributes are sometimes called Boolean.
Machine learning systems can use a wide variety of other information about
attributes. For instance, dimensional considerations could be used to restrict the
search to expressions or comparisons that are dimensionally correct. Circular
ordering could affect the kinds of tests that are considered. For example, in a
temporal context, tests on a day attribute could involve next day, previous day,
next weekday, and same day next week. Partial orderings, that is, generalization
or specialization relations, frequently occur in practical situations. Information
of this kind is often referred to as metadata, data about data. However, the kinds
of practical methods used for data mining are rarely capable of taking metadata
into account, although it is likely that these capabilities will develop rapidly in
the future. (We return to this in Chapter 8.)
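To make the idea of circular ordering concrete, here is a minimal sketch of the four tests on a day attribute mentioned above. The DAYS list, the function names, and the choice of Saturday and Sunday as the weekend are our own assumptions for illustration; they are not taken from any particular system.

```python
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
WEEKEND = {"Sat", "Sun"}   # assumed convention; it varies between cultures


def next_day(day: str) -> str:
    """Step forward one position, wrapping from Sun back to Mon."""
    return DAYS[(DAYS.index(day) + 1) % len(DAYS)]


def previous_day(day: str) -> str:
    """Step back one position, wrapping from Mon back to Sun."""
    return DAYS[(DAYS.index(day) - 1) % len(DAYS)]


def next_weekday(day: str) -> str:
    """Step forward until the result is not a weekend day."""
    candidate = next_day(day)
    while candidate in WEEKEND:
        candidate = next_day(candidate)
    return candidate


def same_day_next_week(day: str) -> str:
    """A full turn around the circle returns to the same value."""
    return DAYS[(DAYS.index(day) + len(DAYS)) % len(DAYS)]


print(next_day("Sun"), previous_day("Mon"), next_weekday("Fri"))
```

The modular arithmetic is what distinguishes a circular attribute from an ordinary ordinal one: there is no smallest or largest value, so tests such as "less than" lose their meaning while "next" and "previous" remain well defined.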
2.4 Preparing the input
Preparing input for a data mining investigation usually consumes the bulk of
the effort invested in the entire data mining process. Although this book is not
really about the problems of data preparation, we want to give you a feeling for
the issues involved so that you can appreciate the complexities. Following that,
we look at a particular input file format, the attribute-relation file format (ARFF
format), that is used in the Java package described in Part II. Then we consider
issues that arise when converting datasets to such a format, because there are
some simple practical points to be aware of. Bitter experience shows that real
data is often disappointingly low in quality, and careful checking (a process
that has become known as data cleaning) pays off many times over.
Gathering the data together
When beginning work on a data mining problem, it is first necessary to bring
all the data together into a set of instances. We explained the need to
denormalize relational data when describing the family tree example. Although it
illustrates the basic issue, this self-contained and rather artificial example does
not really convey a feeling for what the process will be like in practice. In a real
business application, it will be necessary to bring data together from different
departments. For example, in a marketing study data will be needed from the
sales department, the customer billing department, and the customer service
department.
Integrating data from different sources usually presents many challenges—
not deep issues of principle but nasty realities of practice. Different departments
will use different styles of record keeping, different conventions, different time
periods, different degrees of data aggregation, different primary keys, and will
have different kinds of error. The data must be assembled, integrated, and