HAN
10-ch03-083-124-9780123814791
2011/6/1
3:16
Page 110
#28
110
Chapter 3 Data Preprocessing
representative sample, especially when the data are skewed. For example, a stratified
sample may be obtained from customer data, where a stratum is created for each customer
age group. In this way, the age group having the smallest number of customers
will be sure to be represented.
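The stratified sampling idea above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name stratified_sample, the age-group labels, the customer counts, and the 1% sampling fraction are all assumptions made for the example. Each stratum contributes at least one record, so the smallest group cannot be missed.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, fraction, seed=0):
    """Draw a simple random sample without replacement from each stratum.

    At least one record is taken from every non-empty stratum, so even
    the smallest group is represented in the result.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[stratum_of(rec)].append(rec)

    sample = []
    for key in sorted(strata):
        members = strata[key]
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical, skewed customer data: (customer_id, age_group).
customers = (
    [(i, "youth") for i in range(5)]
    + [(i, "middle_aged") for i in range(5, 905)]
    + [(i, "senior") for i in range(905, 1000)]
)

sample = stratified_sample(customers, stratum_of=lambda c: c[1], fraction=0.01)
groups_in_sample = {age_group for _, age_group in sample}
```

With these assumed counts, a plain 1% random sample of the 1000 customers could easily contain no "youth" customer at all, whereas the stratified draw always includes one.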
An advantage of sampling for data reduction is that the cost of obtaining a sample
is proportional to the size of the sample, s, as opposed to N, the data set size. Hence,
sampling complexity is potentially sublinear in the size of the data. Other data reduction
techniques can require at least one complete pass through the data set, D. For a fixed sample
size, sampling complexity increases only linearly as the number of data dimensions,
n, increases, whereas the complexity of techniques using histograms, for example,
increases exponentially in n.
When applied to data reduction, sampling is most commonly used to estimate the
answer to an aggregate query. It is possible (using the central limit theorem) to determine
a sufficient sample size for estimating a given function within a specified degree
of error. This sample size, s, may be extremely small in comparison to N. Sampling is
a natural choice for the progressive refinement of a reduced data set. Such a set can be
further refined by simply increasing the sample size.
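As one illustration of the central limit theorem argument, the sketch below computes a sample size for estimating a mean (for example, an AVG aggregate) within a given margin of error. It uses the standard normal-approximation formula s = ceil((z * sigma / E)^2). The function name, the assumed standard deviation of 200, and the margin of 10 are hypothetical, and in practice sigma would itself have to be estimated, for instance from a small pilot sample.

```python
import math
from statistics import NormalDist

def sufficient_sample_size(sigma, margin, confidence=0.95):
    """Sample size so that the sample mean lies within `margin` of the true
    mean with the given confidence, under the normal approximation.

    sigma  -- assumed (or pilot-estimated) standard deviation of the attribute
    margin -- acceptable absolute error of the estimate
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z * sigma / margin) ** 2)

# Hypothetical values: standard deviation 200, tolerated error 10.
s = sufficient_sample_size(sigma=200.0, margin=10.0, confidence=0.95)
```

Note that N does not appear in the formula, which is why s can be very small relative to the data set size. Tightening the margin raises s, which matches the progressive refinement described above.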
3.4.9 Data Cube Aggregation
Imagine that you have collected the data for your analysis. These data consist of the
AllElectronics sales per quarter, for the years 2008 to 2010. You are, however, interested
in the annual sales (total per year), rather than the total per quarter. Thus, the data can
be aggregated so that the resulting data summarize the total sales per year instead of per
quarter. This aggregation is illustrated in Figure 3.10. The resulting data set is smaller in
volume, without loss of information necessary for the analysis task.
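The aggregation step can be sketched in a few lines. The quarterly figures below are the 2008 values from Figure 3.10; the row layout and the function name aggregate_by_year are assumptions made for illustration.

```python
from collections import defaultdict

# Quarterly sales for 2008, as given in Figure 3.10: (year, quarter, sales in dollars).
quarterly_sales = [
    (2008, "Q1", 224_000),
    (2008, "Q2", 408_000),
    (2008, "Q3", 350_000),
    (2008, "Q4", 586_000),
]

def aggregate_by_year(rows):
    """Sum sales over quarters, leaving one total per year."""
    totals = defaultdict(int)
    for year, _quarter, amount in rows:
        totals[year] += amount
    return dict(totals)

annual_sales = aggregate_by_year(quarterly_sales)
```

Four rows collapse into one, and the 2008 total of $1,568,000 agrees with the annual table in the figure.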
Data cubes are discussed in detail in Chapter 4 on data warehousing and Chapter 5
on data cube technology. We briefly introduce some concepts here. Data cubes store
Year 2008
Quarter   Sales              Year   Sales
Q1        $224,000           2008   $1,568,000
Q2        $408,000           2009   $2,356,000
Q3        $350,000           2010   $3,594,000
Q4        $586,000
(Quarterly tables for Year 2009 and Year 2010 appear in the figure in the same layout.)
Figure 3.10
Sales data for a given branch of AllElectronics for the years 2008 through 2010. On the left,
the sales are shown per quarter. On the right, the data are aggregated to provide the annual
sales.
[Figure: a data cube with dimensions year (2008, 2009, 2010), item_type (home entertainment, computer, phone, security), and branch (A, B, C, D). Cell values shown: 568, 750, 150, 50.]
Figure 3.11
A data cube for sales at AllElectronics.
multidimensional aggregated information. For example, Figure 3.11 shows a data cube
for multidimensional analysis of sales data with respect to annual sales per item type
for each AllElectronics branch. Each cell holds an aggregate data value, corresponding
to the data point in multidimensional space. (For readability, only some cell values are
shown.) Concept hierarchies may exist for each attribute, allowing the analysis of data
at multiple abstraction levels. For example, a hierarchy for branch could allow branches
to be grouped into regions, based on their address. Data cubes provide fast access to
precomputed, summarized data, thereby benefiting online analytical processing as well
as data mining.
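A concept hierarchy can be represented as a simple mapping from a lower level to a higher one. The sketch below rolls sales up from branch to region. The branch-to-region assignment, the region names, and the per-branch totals are hypothetical and are not taken from the figure.

```python
from collections import defaultdict

# Hypothetical concept hierarchy for the branch attribute: branch -> region.
BRANCH_TO_REGION = {"A": "east", "B": "east", "C": "west", "D": "west"}

# Hypothetical per-branch sales totals.
sales_by_branch = {"A": 568, "B": 750, "C": 150, "D": 50}

def roll_up(values, hierarchy):
    """Aggregate values from one abstraction level to the next higher one."""
    totals = defaultdict(int)
    for member, value in values.items():
        totals[hierarchy[member]] += value
    return dict(totals)

sales_by_region = roll_up(sales_by_branch, BRANCH_TO_REGION)
```

The same function could be applied again with a region-to-country mapping to reach a still higher abstraction level.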
The cube created at the lowest abstraction level is referred to as the base cuboid. The
base cuboid should correspond to an individual entity of interest such as sales or customer.
In other words, the lowest level should be usable, or useful, for the analysis. A cube
at the highest level of abstraction is the apex cuboid. For the sales data in Figure 3.11,
the apex cuboid would give one total—the total sales for all three years, for all item
types, and for all branches. Data cubes created for varying levels of abstraction are often
referred to as cuboids, so that a data cube may instead refer to a lattice of cuboids. Each
higher abstraction level further reduces the resulting data size. When replying to data
mining requests, the smallest available cuboid relevant to the given task should be used.
This issue is also addressed in Chapter 4.
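The lattice of cuboids can be sketched by aggregating the base cuboid over every subset of its dimensions. This is a minimal illustration under stated assumptions: the cell values are hypothetical, concept hierarchies within a dimension are ignored, and the helper names (build_cuboid, build_lattice, smallest_cuboid_for) are invented for the example.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("year", "item_type", "branch")

# Hypothetical base-cuboid cells: (year, item_type, branch) -> sales.
base_cuboid = {
    (2008, "computer", "A"): 750,
    (2008, "phone", "A"): 150,
    (2009, "computer", "A"): 800,
    (2009, "computer", "B"): 400,
    (2010, "security", "B"): 50,
}

def build_cuboid(base, keep):
    """Sum the base cuboid onto the dimension indexes listed in `keep`."""
    cuboid = defaultdict(int)
    for cell, value in base.items():
        cuboid[tuple(cell[i] for i in keep)] += value
    return dict(cuboid)

def build_lattice(base, n_dims):
    """Return every cuboid, keyed by the tuple of dimension indexes it keeps."""
    lattice = {}
    for k in range(n_dims, -1, -1):
        for keep in combinations(range(n_dims), k):
            lattice[keep] = build_cuboid(base, keep)
    return lattice

def smallest_cuboid_for(lattice, needed):
    """Pick the cuboid with the fewest cells that still has the needed dimensions."""
    candidates = [keep for keep in lattice if set(needed) <= set(keep)]
    return min(candidates, key=lambda keep: len(lattice[keep]))

lattice = build_lattice(base_cuboid, len(DIMENSIONS))
apex = lattice[()]            # a single cell holding the grand total
sales_by_year = lattice[(0,)] # cuboid keeping only the year dimension
best = smallest_cuboid_for(lattice, needed=(0,))
```

Three dimensions give 2^3 = 8 cuboids, from the base cuboid (all three dimensions kept) down to the apex cuboid (none kept). A query that needs only annual totals is answered from the year cuboid rather than from the larger base cuboid.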
3.5 Data Transformation and Data Discretization
This section presents methods of data transformation. In this preprocessing step, the
data are transformed or consolidated so that the resulting mining process may be more
efficient, and the patterns found may be easier to understand. Data discretization, a form
of data transformation, is also discussed.