cleaned up. The idea of company-wide database integration is known as data
warehousing. Data warehouses provide a single consistent point of access to
corporate or organizational data, transcending departmental divisions. They are
the place where old data is published in a way that can be used to inform
business decisions. The movement toward data warehousing is a recognition of the
fact that the fragmented information that an organization uses to support
day-to-day operations at a departmental level can have immense strategic value
when brought together. Clearly, the presence of a data warehouse is a very useful
precursor to data mining, and if it is not available, many of the steps involved
in data warehousing will have to be undertaken to prepare the data for mining.
Often even a data warehouse will not contain all the necessary data, and you
may have to reach outside the organization to bring in data relevant to the
problem at hand. For example, weather data had to be obtained in the load
forecasting example in the last chapter, and demographic data is needed for
marketing and sales applications. Sometimes called overlay data, this is not
normally collected by an organization but is clearly relevant to the data mining
problem. It, too, must be cleaned up and integrated with the other data that has
been collected.
Another practical question when assembling the data is the degree of
aggregation that is appropriate. When a dairy farmer decides which cows to sell,
the milk production records, which an automatic milking machine records twice
a day, must be aggregated. Similarly, raw telephone call data is of little use when
telecommunications companies study their clients’ behavior: the data must be
aggregated to the customer level. But do you want usage by month or by quarter,
and for how many months or quarters in arrears? Selecting the right type and
level of aggregation is usually critical for success.
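To make the choice of aggregation level concrete, here is a minimal sketch that rolls raw call records up to the customer level, either by month or by quarter. The record layout, the customer identifiers, and the function name aggregate_usage are all hypothetical, invented for illustration; real call data would carry many more fields.

```python
from collections import defaultdict

# Hypothetical raw call records: (customer_id, "YYYY-MM-DD", minutes).
calls = [
    ("c1", "2004-01-05", 12.0),
    ("c1", "2004-01-20", 3.5),
    ("c1", "2004-02-02", 7.0),
    ("c2", "2004-01-11", 40.0),
]

def aggregate_usage(records, level="month"):
    """Sum call minutes per customer per month or per quarter."""
    totals = defaultdict(float)
    for customer, date, minutes in records:
        year, month = int(date[:4]), int(date[5:7])
        if level == "month":
            period = f"{year}-{month:02d}"
        elif level == "quarter":
            period = f"{year}-Q{(month - 1) // 3 + 1}"
        else:
            raise ValueError(f"unknown aggregation level: {level}")
        totals[(customer, period)] += minutes
    return dict(totals)

monthly = aggregate_usage(calls, "month")
quarterly = aggregate_usage(calls, "quarter")
```

The same raw records yield three monthly rows but only two quarterly ones, which is exactly the kind of trade-off between detail and stability that has to be settled before mining begins.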
Because so many different issues are involved, you can’t expect to get it right
the first time. This is why data assembly, integration, cleaning, aggregating, and
general preparation take so long.
ARFF format
We now look at a standard way of representing datasets that consist of
independent, unordered instances and do not involve relationships among instances,
called an ARFF file.
Figure 2.2 shows an ARFF file for the weather data in Table 1.3, the version
with some numeric features. Lines beginning with a % sign are comments.
Following the comments at the beginning of the file are the name of the
relation (weather) and a block defining the attributes (outlook, temperature,
humidity, windy, play?). Nominal attributes are followed by the set of values
they can take on, enclosed in curly braces. Values can include spaces; if so,
they must be placed within quotation marks. Numeric attributes are followed by
the keyword numeric.
2.4 PREPARING THE INPUT
CHAPTER 2 | INPUT: CONCEPTS, INSTANCES, AND ATTRIBUTES
% ARFF file for the weather data with some numeric features
%
@relation weather
@attribute outlook { sunny, overcast, rainy }
@attribute temperature numeric
@attribute humidity numeric
@attribute windy { true, false }
@attribute play? { yes, no }
@data
%
% 14 instances
%
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
rainy, 70, 96, false, yes
rainy, 68, 80, false, yes
rainy, 65, 70, true, no
overcast, 64, 65, true, yes
sunny, 72, 95, false, no
sunny, 69, 70, false, yes
rainy, 75, 80, false, yes
sunny, 75, 70, true, yes
overcast, 72, 90, true, yes
overcast, 81, 75, false, yes
rainy, 71, 91, true, no
Figure 2.2 ARFF file for the weather data.
Although the weather problem is to predict the class value play? from the
values of the other attributes, the class attribute is not distinguished in
any way in the data file. The ARFF format merely gives a dataset; it does not
specify which of the attributes is the one that is supposed to be predicted.
This means that the same file can be used for investigating how well each
attribute can be predicted from the others, or to find association rules, or
for clustering.
Following the attribute definitions is an @data line that signals the start of
the instances in the dataset. Instances are written one per line, with values
for each attribute in turn, separated by commas. If a value is missing it is
represented by a single question mark (there are no
missing values in this dataset). The attribute specifications in ARFF files allow
the dataset to be checked to ensure that it contains legal values for all attributes,
and programs that read ARFF files do this checking automatically.
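The kind of checking described here can be sketched in a few lines. The reader below is only an illustration, not the parser used by Weka or any other program: the function name parse_arff is hypothetical, and it assumes unquoted names and values and supports only nominal and numeric attributes. The sample is a shortened variant of the weather data with one missing value added for demonstration.

```python
def parse_arff(text):
    """Parse a simple ARFF file; returns (relation, attributes, instances).

    Simplifying assumptions: no quoted names or values, and only nominal
    and numeric attributes. Missing values ("?") become None.
    """
    relation, attributes, instances = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # blank lines and comments carry no data
        if not in_data:
            parts = line.split(None, 1)
            keyword = parts[0].lower()
            rest = parts[1].strip() if len(parts) > 1 else ""
            if keyword == "@relation":
                relation = rest
            elif keyword == "@attribute":
                name, spec = rest.split(None, 1)
                spec = spec.strip()
                if spec.startswith("{") and spec.endswith("}"):
                    # Nominal attribute: remember the legal values.
                    legal = [v.strip() for v in spec[1:-1].split(",")]
                    attributes.append((name, legal))
                elif spec.lower() == "numeric":
                    attributes.append((name, "numeric"))
                else:
                    raise ValueError(f"unsupported attribute type: {spec}")
            elif keyword == "@data":
                in_data = True
            else:
                raise ValueError(f"unexpected header line: {line}")
            continue
        values = [v.strip() for v in line.split(",")]
        if len(values) != len(attributes):
            raise ValueError(f"wrong number of values: {line}")
        row = []
        for (name, kind), value in zip(attributes, values):
            if value == "?":
                row.append(None)
            elif kind == "numeric":
                row.append(float(value))  # ValueError if not a number
            elif value in kind:
                row.append(value)
            else:
                raise ValueError(f"illegal value {value!r} for {name}")
        instances.append(row)
    return relation, attributes, instances

sample = """% shortened weather data, with one missing value for illustration
@relation weather
@attribute outlook { sunny, overcast, rainy }
@attribute temperature numeric
@attribute windy { true, false }
@data
sunny, 85, false
rainy, ?, true
"""

relation, attributes, instances = parse_arff(sample)
```

Because the legal values are declared up front, a row such as foggy, 70, true is rejected as soon as it is read, rather than surfacing later as a mysterious extra category.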
In addition to nominal and numeric attributes, exemplified by the weather
data, the ARFF format has two further attribute types: string attributes and date
attributes. String attributes have values that are textual. Suppose you have a
string attribute that you want to call description. In the block defining the
attributes, it is specified as follows:
@attribute description string
Then, in the instance data, include any character string in quotation marks (to
include quotation marks in your string, use the standard convention of
preceding each one by a backslash, \). Strings are stored internally in a string table
and represented by their address in that table. Thus two strings that contain the
same characters will have the same value.
String attributes can have values that are very long—even a whole document.
To be able to use string attributes for text mining, it is necessary to be able to
manipulate them. For example, a string attribute might be converted into many
numeric attributes, one for each word in the string, whose value is the number
of times that word appears. These transformations are described in Section 7.3.
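As a rough illustration of the word-count conversion just mentioned, the sketch below turns each string into a row of numeric attributes, one per distinct word. It is not the transformation described in Section 7.3; the tokenization rule (lowercase runs of letters, digits, and apostrophes) and the function names are assumptions made for this example.

```python
import re
from collections import Counter

def tokenize(text):
    """Split a string into lowercase words (a deliberately crude rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def word_count_attributes(documents):
    """Return (vocabulary, rows): one numeric attribute per distinct word."""
    vocabulary = sorted({word for doc in documents for word in tokenize(doc)})
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        # A Counter returns 0 for words that do not occur in this document.
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

docs = ["the cat sat on the mat", "the dog sat"]
vocabulary, rows = word_count_attributes(docs)
```

Each word in the vocabulary becomes a numeric attribute, so two short strings already produce six attributes; with whole documents, most of the counts in any one row will be zero.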
Date attributes are strings with a special format and are introduced like this:
@attribute today date
(for an attribute called today). Weka, the machine learning software discussed
in Part II of this book, uses the ISO-8601 combined date and time format
yyyy-MM-ddTHH:mm:ss with four digits for the year, two each for the month and
day, then the letter T followed by the time with two digits for each of hours,
minutes, and seconds.[1] In the data section of the file, dates are specified as
the corresponding string representation of the date and time, for example,
2004-04-03T12:00:00. Although they are specified as strings, dates are converted
to numeric form when the input file is read. Dates can also be converted
internally to different formats, so you can have absolute timestamps in the data
file and use transformations to forms such as time of day or day of the week to
detect periodic behavior.
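The conversions mentioned above can be sketched with Python's standard library. The text does not say which numeric encoding is used, so the milliseconds-since-1970 form below is an assumption, as is treating the timestamp as UTC; the function names are hypothetical.

```python
import calendar
from datetime import datetime

# strptime pattern corresponding to the yyyy-MM-ddTHH:mm:ss layout.
ARFF_DATE_FORMAT = "%Y-%m-%dT%H:%M:%S"

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def parse_arff_date(text):
    """Parse a date string in the default combined date and time format."""
    return datetime.strptime(text, ARFF_DATE_FORMAT)

def to_millis(moment):
    """Milliseconds since 1970-01-01T00:00:00, treating the value as UTC.

    Assumption: this is one plausible numeric form, chosen for illustration.
    """
    return calendar.timegm(moment.timetuple()) * 1000

stamp = parse_arff_date("2004-04-03T12:00:00")
numeric = to_millis(stamp)                  # absolute timestamp
hour_of_day = stamp.hour                    # useful for daily cycles
day_of_week = DAY_NAMES[stamp.weekday()]    # useful for weekly cycles
```

Keeping the absolute timestamp in the data file and deriving attributes such as hour_of_day or day_of_week afterward means that one file can serve analyses at several periodicities.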
Sparse data
Sometimes most attributes have a value of 0 for most of the instances. For example,
market basket data records purchases made by supermarket customers. No
[1] Weka contains a mechanism for defining a date attribute to have a different format by
including a special string in the attribute definition.