Data Mining: Practical Machine Learning Tools and Techniques, Second Edition


cleaned up. The idea of company-wide database integration is known as data

warehousing. Data warehouses provide a single consistent point of access to cor-

porate or organizational data, transcending departmental divisions. They are

the place where old data is published in a way that can be used to inform busi-

ness decisions. The movement toward data warehousing is a recognition of the

fact that the fragmented information that an organization uses to support day-

to-day operations at a departmental level can have immense strategic value

when brought together. Clearly, the presence of a data warehouse is a very useful

precursor to data mining, and if it is not available, many of the steps involved

in data warehousing will have to be undertaken to prepare the data for mining.

Often even a data warehouse will not contain all the necessary data, and you

may have to reach outside the organization to bring in data relevant to the

problem at hand. For example, weather data had to be obtained in the load

forecasting example in the last chapter, and demographic data is needed for

marketing and sales applications. Sometimes called overlay data, this is not nor-

mally collected by an organization but is clearly relevant to the data mining

problem. It, too, must be cleaned up and integrated with the other data that has

been collected.

Another practical question when assembling the data is the degree of aggre-

gation that is appropriate. When a dairy farmer decides which cows to sell, the

milk production records—which an automatic milking machine records twice

a day—must be aggregated. Similarly, raw telephone call data is of little use when

telecommunications companies study their clients’ behavior: the data must be

aggregated to the customer level. But do you want usage by month or by quarter,

and for how many months or quarters in arrears? Selecting the right type and

level of aggregation is usually critical for success.
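The choice between monthly and quarterly usage can be made concrete with a small sketch. It assumes call records arrive as (customer, timestamp, minutes) tuples; the record layout, the timestamp format, and the function name aggregate_calls are illustrative assumptions, not something the text prescribes.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_calls(calls, period="month"):
    """Sum call minutes per (customer, period) from raw call records.

    `calls` is an iterable of (customer, timestamp, minutes) tuples, with
    timestamps written as "YYYY-MM-DD HH:MM" (an assumed layout).
    """
    totals = defaultdict(float)
    for customer, timestamp, minutes in calls:
        when = datetime.strptime(timestamp, "%Y-%m-%d %H:%M")
        if period == "month":
            key = (customer, f"{when.year}-{when.month:02d}")
        elif period == "quarter":
            key = (customer, f"{when.year}-Q{(when.month - 1) // 3 + 1}")
        else:
            raise ValueError(f"unknown period: {period}")
        totals[key] += minutes
    return dict(totals)


# Made-up records, purely for illustration.
calls = [
    ("alice", "2004-01-15 09:30", 12.0),
    ("alice", "2004-02-03 18:05", 3.5),
    ("alice", "2004-04-20 11:00", 7.0),
    ("bob", "2004-01-02 08:00", 1.0),
]
by_month = aggregate_calls(calls, "month")
by_quarter = aggregate_calls(calls, "quarter")
```

The same raw records yield different instances depending on the period chosen, which is why the level of aggregation has to be decided before mining begins.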

Because so many different issues are involved, you can’t expect to get it right the first time. This is why data assembly, integration, cleaning, aggregating, and general preparation take so long.

ARFF format

We now look at a standard way of representing datasets that consist of independent, unordered instances and do not involve relationships among instances, called an ARFF file.

Figure 2.2 shows an ARFF file for the weather data in Table 1.3, the version with some numeric features. Lines beginning with a % sign are comments. Following the comments at the beginning of the file are the name of the relation (weather) and a block defining the attributes (outlook, temperature, humidity, windy, play?). Nominal attributes are followed by the set of values they can take on, enclosed in curly braces. Values can include spaces; if so, they must be placed within quotation marks. Numeric attributes are followed by the keyword numeric.

2.4 PREPARING THE INPUT

CHAPTER 2 INPUT: CONCEPTS, INSTANCES, AND ATTRIBUTES

% ARFF file for the weather data with some numeric features

@relation weather

@attribute outlook { sunny, overcast, rainy }

@attribute temperature numeric

@attribute humidity numeric

@attribute windy { true, false }

@attribute play? { yes, no }

@data

% 14 instances

sunny, 85, 85, false, no

sunny, 80, 90, true, no

overcast, 83, 86, false, yes

rainy, 70, 96, false, yes

rainy, 68, 80, false, yes

rainy, 65, 70, true, no

overcast, 64, 65, true, yes

sunny, 72, 95, false, no

sunny, 69, 70, false, yes

rainy, 75, 80, false, yes

sunny, 75, 70, true, yes

overcast, 72, 90, true, yes

overcast, 81, 75, false, yes

rainy, 71, 91, true, no

Figure 2.2 ARFF file for the weather data.

Although the weather problem is to predict the class value play? from the values of the other attributes, the class attribute is not distinguished in any way in the data file. The ARFF format merely gives a dataset; it does not specify which of the attributes is the one that is supposed to be predicted. This means that the same file can be used for investigating how well each attribute can be predicted from the others, or to find association rules, or for clustering.

Following the attribute definitions is an @data line that signals the start of the instances in the dataset. Instances are written one per line, with values for each attribute in turn, separated by commas. If a value is missing it is represented by a single question mark (there are no missing values in this dataset). The attribute specifications in ARFF files allow the dataset to be checked to ensure that it contains legal values for all attributes, and programs that read ARFF files do this checking automatically.
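To make the structure of the file and the legality check concrete, here is a minimal sketch of a reader for a small subset of ARFF. It is not Weka's parser: the function name parse_arff is hypothetical, only nominal and numeric attributes are handled, and quoted values containing spaces or commas are deliberately not supported.

```python
def parse_arff(text):
    """Parse a small ARFF subset: nominal and numeric attributes only.

    Returns (relation, attributes, instances). Each attribute is a
    (name, kind) pair, where kind is "numeric" or a list of legal values.
    Missing values ("?") are returned as None.
    """
    relation, attributes, instances = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if not in_data:
            if lower.startswith("@relation"):
                relation = line.split(None, 1)[1]
            elif lower.startswith("@attribute"):
                _, name, spec = line.split(None, 2)
                if spec.startswith("{"):
                    values = [v.strip() for v in spec.strip("{}").split(",")]
                    attributes.append((name, values))
                elif spec.lower() == "numeric":
                    attributes.append((name, "numeric"))
                else:
                    raise ValueError(f"unsupported attribute type: {spec}")
            elif lower.startswith("@data"):
                in_data = True
            continue
        # Data section: one instance per line, values separated by commas.
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != len(attributes):
            raise ValueError(f"expected {len(attributes)} values: {line}")
        row = []
        for (name, kind), field in zip(attributes, fields):
            if field == "?":
                row.append(None)  # missing value
            elif kind == "numeric":
                row.append(float(field))
            elif field in kind:
                row.append(field)
            else:
                raise ValueError(f"illegal value {field!r} for {name}")
        instances.append(row)
    return relation, attributes, instances


# First lines of the weather file; the temperature in the last row has been
# replaced by "?" here purely to illustrate a missing value.
sample = """% ARFF file for the weather data with some numeric features
@relation weather
@attribute outlook { sunny, overcast, rainy }
@attribute temperature numeric
@attribute humidity numeric
@attribute windy { true, false }
@attribute play? { yes, no }
@data
sunny, 85, 85, false, no
overcast, 83, 86, false, yes
rainy, ?, 96, false, yes
"""
relation, attributes, instances = parse_arff(sample)
```

Note that nothing in the parsed result marks play? as the class; as the text says, that choice is left to whatever program consumes the data.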

In addition to nominal and numeric attributes, exemplified by the weather data, the ARFF format has two further attribute types: string attributes and date attributes. String attributes have values that are textual. Suppose you have a string attribute that you want to call description. In the block defining the attributes, it is specified as follows:

@attribute description string

Then, in the instance data, include any character string in quotation marks (to

include quotation marks in your string, use the standard convention of pre-

ceding each one by a backslash, \). Strings are stored internally in a string table

and represented by their address in that table. Thus two strings that contain the

same characters will have the same value.
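The string table idea can be sketched in a few lines. This is an illustration of the general technique only, not Weka's internal implementation; the class name StringTable and its methods are hypothetical.

```python
class StringTable:
    """Store each distinct string once and hand out its position."""

    def __init__(self):
        self._index = {}    # string -> position in the table
        self._strings = []  # position -> string

    def add(self, s):
        """Return the table position for s, adding it if it is new."""
        if s not in self._index:
            self._index[s] = len(self._strings)
            self._strings.append(s)
        return self._index[s]

    def lookup(self, position):
        """Return the string stored at the given position."""
        return self._strings[position]


table = StringTable()
a = table.add("hello world")
b = table.add("another description")
c = table.add("hello world")  # same characters, so same position as a
```

Because an instance holds only the position, comparing two string values reduces to comparing two integers.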

String attributes can have values that are very long—even a whole document.

To be able to use string attributes for text mining, it is necessary to be able to

manipulate them. For example, a string attribute might be converted into many

numeric attributes, one for each word in the string, whose value is the number

of times that word appears. These transformations are described in Section 7.3.
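As a rough illustration of the word-count conversion mentioned above (the transformations themselves are covered in Section 7.3), the following sketch turns each string into one numeric attribute per word. The tokenization rule (lowercase runs of letters, digits, and apostrophes) and the function name word_count_attributes are assumptions made for this example.

```python
import re
from collections import Counter


def word_count_attributes(documents):
    """Convert strings into numeric attributes, one per distinct word.

    Returns (vocabulary, rows): vocabulary is the sorted list of words seen
    in any document, and each row holds that document's count per word.
    """
    counts = [Counter(re.findall(r"[a-z0-9']+", d.lower())) for d in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows


documents = ["the cat sat on the mat", "the dog sat"]
vocabulary, rows = word_count_attributes(documents)
# vocabulary is ['cat', 'dog', 'mat', 'on', 'sat', 'the']
# rows[0] is [1, 0, 1, 1, 1, 2] and rows[1] is [0, 1, 0, 0, 1, 1]
```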

Date attributes are strings with a special format and are introduced like this:

@attribute today date

(for an attribute called today). Weka, the machine learning software discussed in Part II of this book, uses the ISO-8601 combined date and time format yyyy-MM-ddTHH:mm:ss with four digits for the year, two each for the month and day, then the letter T followed by the time with two digits for each of hours, minutes, and seconds.

In the data section of the file, dates are specified as the corresponding string representation of the date and time, for example, 2004-04-03T12:00:00. Although they are specified as strings, dates are converted to numeric form when the input file is read. Dates can also be converted internally to different formats, so you can have absolute timestamps in the data file and use transformations to forms such as time of day or day of the week to detect periodic behavior.
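The conversions described here can be sketched with Python's standard library. This is not how Weka does it internally; the choice of seconds since 1970 as the numeric form and the treatment of the timestamp as UTC are assumptions made for this example, as is the function name parse_arff_date.

```python
from datetime import datetime, timezone


def parse_arff_date(value):
    """Parse a date written in the combined form, e.g. 2004-04-03T12:00:00."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")


stamp = parse_arff_date("2004-04-03T12:00:00")

# Numeric form: seconds since 1970-01-01, treating the value as UTC
# (both choices are assumptions of this sketch).
numeric = stamp.replace(tzinfo=timezone.utc).timestamp()

# Derived forms that can expose periodic behavior.
hour_of_day = stamp.hour      # 12
day_of_week = stamp.weekday() # 0 is Monday, so 5 is Saturday
```

Mining on hour_of_day or day_of_week rather than on the absolute timestamp lets a learner pick up daily or weekly patterns that the raw number hides.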

Sparse data

Sometimes most attributes have a value of 0 for most of the instances. For example,

market basket data records purchases made by supermarket customers. No


Weka contains a mechanism for defining a date attribute to have a different format by including a special string in the attribute definition.
