No matter how
big the shopping expedition, customers never purchase more than
a tiny portion of the items a store offers. The market basket data contains the
quantity of each item that the customer purchases, and this is zero for almost
all items in stock. The data file can be viewed as a matrix whose rows and
columns represent customers and stock items, and the matrix is “sparse”—
nearly all its elements are zero. Another example occurs in text mining, in which
the instances are documents. Here, the rows and columns represent documents
and words, and the numbers indicate how many times a particular word appears
in a particular document. Most documents have a rather small vocabulary, so
most entries are zero.
It can be impractical to represent each element of a sparse matrix explicitly,
writing each value in order, as follows:
0, 26, 0, 0, 0, 0, 63, 0, 0, 0, “class A”
0, 0, 0, 42, 0, 0, 0, 0, 0, 0, “class B”
Instead, the nonzero attributes can be explicitly identified by attribute number
and their value stated:
{1 26, 6 63, 10 “class A”}
{3 42, 10 “class B”}
Each instance is enclosed in curly braces and contains the index number of each
nonzero attribute (indexes start from 0) and its value. Sparse data files have the
same @relation and @attribute tags, followed by an @data line, but the data
section is different and contains specifications in braces such as those shown
previously. Note that the omitted values have a value of 0; they are not
“missing” values! If a value is unknown, it must be explicitly represented with
a question mark.
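For concreteness, here is a minimal sketch of a complete sparse file built
around the two instances above, assuming ten numeric attributes followed by a
class attribute; the relation and attribute names are invented for illustration:

@relation sparse-example
@attribute item1 numeric
@attribute item2 numeric
@attribute item3 numeric
@attribute item4 numeric
@attribute item5 numeric
@attribute item6 numeric
@attribute item7 numeric
@attribute item8 numeric
@attribute item9 numeric
@attribute item10 numeric
@attribute class {"class A", "class B"}
@data
{1 26, 6 63, 10 "class A"}
{3 42, 10 "class B"}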
Attribute types
ARFF files accommodate the two basic data types, nominal and numeric. String
attributes and date attributes are effectively nominal and numeric, respectively,
although, before they are used, strings are often converted into a numeric form
such as a word vector. But how the two basic types are interpreted depends on
the learning method being used. For example, most methods treat numeric
attributes as ordinal scales and only use less-than and greater-than comparisons
between the values. However, some treat them as ratio scales and use distance
calculations. You need to understand how machine learning methods work
before using them for data mining.
If a learning method treats numeric attributes as though they are measured
on ratio scales, the question of normalization arises. Attributes are often
normalized to lie in a fixed range, say, from zero to one, by dividing all values by
the maximum value encountered or by subtracting the minimum value and
dividing by the range between the maximum and the minimum values. Another
normalization technique is to calculate the statistical mean and standard
deviation of the attribute values, subtract the mean from each value, and divide
the result by the standard deviation. This process is called standardizing a
statistical variable and results in a set of values whose mean is zero and standard
deviation is one.
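As a concrete sketch of both techniques, the following Python fragment
normalizes an invented sample of attribute values to the range zero to one and
then standardizes it:

# Invented sample values for one numeric attribute.
values = [2.0, 4.0, 6.0, 8.0]

# Normalize to [0, 1]: subtract the minimum, divide by the range.
# (Assumes the attribute is not constant, i.e., hi > lo.)
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
print(normalized)    # [0.0, 0.333..., 0.666..., 1.0]

# Standardize: subtract the mean, divide by the standard deviation.
mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
standardized = [(v - mean) / std for v in values]
print(standardized)  # mean 0, standard deviation 1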
Some learning methods—for example, varieties of instance-based learning
and regression methods—deal only with ratio scales because they calculate
the “distance” between two instances based on the values of their attributes. If
the actual scale is ordinal, a numeric distance function must be defined. One
way of doing this is to use a two-level distance: one if the two values are
different and zero if they are the same. Any nominal quantity can be treated as numeric
by using this distance function. However, it is rather a crude technique and
conceals the true degree of variation between instances. Another possibility is to
generate several synthetic binary attributes for each nominal attribute: we return to
this in Section 6.5 when we look at the use of trees for numeric prediction.
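Here is a minimal sketch of the two-level distance described above, applied to
two invented instances whose attributes are all nominal:

# Two-level distance for nominal values: 0 if equal, 1 if different.
def two_level_distance(a, b):
    return 0 if a == b else 1

# The distance between two instances sums the per-attribute distances.
# The instances below are invented for illustration.
x = ["young", "no", "normal"]
y = ["presbyopic", "no", "reduced"]
print(sum(two_level_distance(a, b) for a, b in zip(x, y)))  # prints 2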
Sometimes there is a genuine mapping between nominal quantities and
numeric scales. For example, postal ZIP codes indicate areas that could be
represented by geographic coordinates; the leading digits of telephone numbers
may do so, too, depending on where you live. The first two digits of a student’s
identification number may be the year in which she first enrolled.
It is very common for practical datasets to contain nominal values that are
coded as integers. For example, an integer identifier may be used as a code for
an attribute such as part number, yet such integers are not intended for use in
less-than or greater-than comparisons. If this is the case, it is important to
specify that the attribute is nominal rather than numeric.
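In ARFF this is expressed in the attribute declaration: listing the integer
codes as nominal values, rather than declaring the attribute numeric, prevents
them from being used in less-than and greater-than comparisons. The attribute
names and codes below are invented for illustration:

% Nominal: the integers are merely labels for parts.
@attribute part-number {3721, 4902, 5513}
% Numeric: less-than and greater-than comparisons make sense.
@attribute quantity numeric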
It is quite possible to treat an ordinal quantity as though it were nominal.
Indeed, some machine learning methods only deal with nominal elements. For
example, in the contact lens problem the age attribute is treated as nominal, and
the rules generated included the following:
If age = young and astigmatic = no
   and tear production rate = normal then recommendation = soft
If age = pre-presbyopic and astigmatic = no
   and tear production rate = normal then recommendation = soft
But in fact age, specified in this way, is really an ordinal quantity for which the
following is true:
young < pre-presbyopic < presbyopic
If it were treated as ordinal, the two rules could be collapsed into one:
If age ≤ pre-presbyopic and astigmatic = no
   and tear production rate = normal then recommendation = soft