Data Mining: Practical Machine Learning Tools and Techniques, Second Edition




No matter how big the shopping expedition, customers never purchase more than a tiny portion of the items a store offers. The market basket data contains the quantity of each item that the customer purchases, and this is zero for almost all items in stock. The data file can be viewed as a matrix whose rows and columns represent customers and stock items, and the matrix is “sparse”—nearly all its elements are zero. Another example occurs in text mining, in which the instances are documents. Here, the rows and columns represent documents and words, and the numbers indicate how many times a particular word appears in a particular document. Most documents have a rather small vocabulary, so most entries are zero.

It can be impractical to represent each element of a sparse matrix explicitly, writing each value in order, as follows:

0, 26, 0,  0, 0, 0, 63, 0, 0, 0, “class A”
0,  0, 0, 42, 0, 0,  0, 0, 0, 0, “class B”

Instead, the nonzero attributes can be explicitly identified by attribute number and their value stated:

{1 26, 6 63, 10 “class A”}
{3 42, 10 “class B”}

Each instance is enclosed in curly braces and contains the index number of each nonzero attribute (indexes start from 0) and its value. Sparse data files have the same @relation and @attribute tags, followed by an @data line, but the data section is different and contains specifications in braces such as those shown previously. Note that the omitted values have a value of 0—they are not “missing” values! If a value is unknown, it must be explicitly represented with a question mark.
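
To make the format concrete, here is a minimal sketch of a complete sparse file for the two instances above; the relation and attribute names are hypothetical, chosen only for illustration:

@relation sparse-example

@attribute a0 numeric
@attribute a1 numeric
@attribute a2 numeric
@attribute a3 numeric
@attribute a4 numeric
@attribute a5 numeric
@attribute a6 numeric
@attribute a7 numeric
@attribute a8 numeric
@attribute a9 numeric
@attribute class {"class A", "class B"}

@data
{1 26, 6 63, 10 "class A"}
{3 42, 10 "class B"}

Loading this file yields the same two instances as the dense listing, but the file grows only with the number of nonzero entries rather than with the number of attributes.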

Attribute types

ARFF files accommodate the two basic data types, nominal and numeric. String attributes and date attributes are effectively nominal and numeric, respectively, although before they are used strings are often converted into a numeric form such as a word vector. But how the two basic types are interpreted depends on the learning method being used. For example, most methods treat numeric attributes as ordinal scales and only use less-than and greater-than comparisons between the values. However, some treat them as ratio scales and use distance calculations. You need to understand how machine learning methods work before using them for data mining.

If a learning method treats numeric attributes as though they are measured on ratio scales, the question of normalization arises. Attributes are often normalized to lie in a fixed range, say, from zero to one, by dividing all values by the maximum value encountered or by subtracting the minimum value and dividing by the range between the maximum and the minimum values. Another normalization technique is to calculate the statistical mean and standard deviation of the attribute values, subtract the mean from each value, and divide the result by the standard deviation. This process is called standardizing a statistical variable and results in a set of values whose mean is zero and standard deviation is one.
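
Both rescaling schemes are easy to state in code. The following is a minimal sketch in Java (the method names are ours, not part of any particular library), assuming the values of a single numeric attribute arrive as an array:

// Min-max normalization: map all values into the range [0, 1].
static double[] normalize(double[] values) {
    double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
    for (double v : values) { min = Math.min(min, v); max = Math.max(max, v); }
    double range = max - min;
    double[] result = new double[values.length];
    for (int i = 0; i < values.length; i++)
        result[i] = (range == 0) ? 0 : (values[i] - min) / range;  // guard: all values identical
    return result;
}

// Standardization: subtract the mean, divide by the standard deviation.
static double[] standardize(double[] values) {
    double mean = 0;
    for (double v : values) mean += v;
    mean /= values.length;
    double var = 0;
    for (double v : values) var += (v - mean) * (v - mean);
    double sd = Math.sqrt(var / values.length);
    double[] result = new double[values.length];
    for (int i = 0; i < values.length; i++)
        result[i] = (sd == 0) ? 0 : (values[i] - mean) / sd;  // guard: zero variance
    return result;
}

Note that the statistics involved (minimum, maximum, mean, and standard deviation) should be computed on the training data and then applied unchanged to any fresh data.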

Some learning methods—for example, varieties of instance-based learning and regression methods—deal only with ratio scales because they calculate the “distance” between two instances based on the values of their attributes. If the actual scale is ordinal, a numeric distance function must be defined. One way of doing this is to use a two-level distance: one if the two values are different and zero if they are the same. Any nominal quantity can be treated as numeric by using this distance function. However, it is rather a crude technique and conceals the true degree of variation between instances. Another possibility is to generate several synthetic binary attributes for each nominal attribute: we return to this in Section 6.5 when we look at the use of trees for numeric prediction.
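
In code, the two options look as follows; this is a minimal sketch with hypothetical helper names, not an interface from any particular toolkit:

// Two-level distance: 0 if the two nominal values match, 1 otherwise.
static double twoLevelDistance(String a, String b) {
    return a.equals(b) ? 0.0 : 1.0;
}

// Synthetic binary attributes: one 0/1 indicator per possible value.
static double[] toBinaryAttributes(String value, String[] possibleValues) {
    double[] result = new double[possibleValues.length];
    for (int i = 0; i < possibleValues.length; i++)
        result[i] = possibleValues[i].equals(value) ? 1.0 : 0.0;
    return result;
}

With the binary encoding, each nominal value becomes its own 0/1 attribute, so ordinary numeric distance calculations apply to the result directly.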

Sometimes there is a genuine mapping between nominal quantities and numeric scales. For example, postal ZIP codes indicate areas that could be represented by geographic coordinates; the leading digits of telephone numbers may do so, too, depending on where you live. The first two digits of a student’s identification number may be the year in which she first enrolled.

It is very common for practical datasets to contain nominal values that are coded as integers. For example, an integer identifier may be used as a code for an attribute such as part number, yet such integers are not intended for use in less-than or greater-than comparisons. If this is the case, it is important to specify that the attribute is nominal rather than numeric.
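
In an ARFF file this distinction is made in the attribute declaration: enumerating the codes makes the attribute nominal, whereas declaring it numeric invites meaningless less-than and greater-than comparisons. The part numbers below are hypothetical:

@attribute part-number numeric                % wrong: the codes will be ordered
@attribute part-number {3512, 3513, 4002}     % right: the codes are just labels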

It is quite possible to treat an ordinal quantity as though it were nominal. Indeed, some machine learning methods only deal with nominal elements. For example, in the contact lens problem the age attribute is treated as nominal, and the rules generated included the following:

If age = young and astigmatic = no and
    tear production rate = normal then recommendation = soft
If age = pre-presbyopic and astigmatic = no and
    tear production rate = normal then recommendation = soft

But in fact age, specified in this way, is really an ordinal quantity for which the following is true:

young < pre-presbyopic < presbyopic

If it were treated as ordinal, the two rules could be collapsed into one:

If age ≤ pre-presbyopic and astigmatic = no and
    tear production rate = normal then recommendation = soft




