Data Mining. Concepts and Techniques, 3rd Edition

HAN 10-ch03-083-124-9780123814791




HAN

10-ch03-083-124-9780123814791

2011/6/1

3:16

Page 110

#28

110

Chapter 3 Data Preprocessing

representative sample, especially when the data are skewed. For example, a stratified sample may be obtained from customer data, where a stratum is created for each customer age group. In this way, the age group having the smallest number of customers will be sure to be represented.
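The stratified sampling idea can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the customer records, the age-group labels, the function name `stratified_sample`, and the rule of keeping at least one record per stratum are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(records, stratum_of, fraction, seed=0):
    """Draw a simple random sample without replacement inside each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    sample = []
    for members in strata.values():
        # Assumption: keep at least one record per stratum so small groups survive.
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical, skewed customer data: (customer_id, age_group).
customers = (
    [(i, "middle_aged") for i in range(600)]
    + [(i, "youth") for i in range(600, 980)]
    + [(i, "senior") for i in range(980, 1000)]
)

sample = stratified_sample(customers, stratum_of=lambda c: c[1], fraction=0.01)
counts = Counter(group for _, group in sample)
print(counts)
```

With a 1% sampling fraction, a plain random sample of 10 records could easily miss the 20 "senior" customers altogether, whereas the per-stratum draw always includes at least one of them.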

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample, s, as opposed to N, the data set size. Hence, sampling complexity is potentially sublinear to the size of the data. Other data reduction techniques can require at least one complete pass through D. For a fixed sample size, sampling complexity increases only linearly as the number of data dimensions, n, increases, whereas techniques using histograms, for example, increase exponentially in n.

When applied to data reduction, sampling is most commonly used to estimate the answer to an aggregate query. It is possible (using the central limit theorem) to determine a sufficient sample size for estimating a given function within a specified degree of error. This sample size, s, may be extremely small in comparison to N. Sampling is a natural choice for the progressive refinement of a reduced data set. Such a set can be further refined by simply increasing the sample size.
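As one concrete instance of the central-limit-theorem argument, the sketch below computes a sample size for estimating a mean (for example, an AVG aggregate) to within a given margin of error. It uses the standard normal-approximation formula s = (z * sigma / E)^2 and assumes the standard deviation sigma is known or estimated from a pilot sample. The function name and the numbers in the usage lines are hypothetical.

```python
import math
from statistics import NormalDist

def required_sample_size(std_dev, margin_of_error, confidence=0.95):
    """Smallest s such that the sample mean lies within margin_of_error
    of the true mean with the given confidence (normal approximation)."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z * std_dev / margin_of_error) ** 2)

# Hypothetical query: average purchase amount, std. dev. about 200, error within 10.
s = required_sample_size(std_dev=200, margin_of_error=10)
print(s)

# Progressive refinement: halving the error roughly quadruples the sample size.
s_refined = required_sample_size(std_dev=200, margin_of_error=5)
print(s_refined)
```

Note that the data set size N does not appear in the formula, which is why s can be very small relative to N when the data set is large.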

3.4.9 Data Cube Aggregation

Imagine that you have collected the data for your analysis. These data consist of the AllElectronics sales per quarter, for the years 2008 to 2010. You are, however, interested in the annual sales (total per year), rather than the total per quarter. Thus, the data can be aggregated so that the resulting data summarize the total sales per year instead of per quarter. This aggregation is illustrated in Figure 3.10. The resulting data set is smaller in volume, without loss of information necessary for the analysis task.
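The quarter-to-year roll-up can be sketched as a simple group-by-and-sum. In the example below, the 2008 quarterly values and both annual totals are taken from Figure 3.10. The 2009 quarterly split is hypothetical, chosen only so that it adds up to the 2009 total given in the figure. The function name `aggregate_by_year` is likewise an assumption.

```python
from collections import defaultdict

# (year, quarter, sales in dollars).
quarterly_sales = [
    # 2008: quarterly values as listed in Figure 3.10.
    (2008, "Q1", 224_000),
    (2008, "Q2", 408_000),
    (2008, "Q3", 350_000),
    (2008, "Q4", 586_000),
    # 2009: hypothetical split; only the total (2,356,000) is from the figure.
    (2009, "Q1", 500_000),
    (2009, "Q2", 600_000),
    (2009, "Q3", 556_000),
    (2009, "Q4", 700_000),
]

def aggregate_by_year(rows):
    """Sum sales over quarters, leaving one total per year."""
    totals = defaultdict(int)
    for year, _quarter, sales in rows:
        totals[year] += sales
    return dict(totals)

annual_sales = aggregate_by_year(quarterly_sales)
print(annual_sales)
```

Eight quarterly rows collapse into two annual rows, and nothing needed for a per-year analysis is lost.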

Quarterly sales (the panels for Year 2008, Year 2009, and Year 2010 list the same values):

Quarter   Sales
Q1        $224,000
Q2        $408,000
Q3        $350,000
Q4        $586,000

Annual sales:

Year      Sales
2008      $1,568,000
2009      $2,356,000
2010      $3,594,000

Figure 3.10 Sales data for a given branch of AllElectronics for the years 2008 through 2010. On the left, the sales are shown per quarter. On the right, the data are aggregated to provide the annual sales.

Data cubes are discussed in detail in Chapter 4 on data warehousing and Chapter 5 on data cube technology. We briefly introduce some concepts here. Data cubes store multidimensional aggregated information. For example, Figure 3.11 shows a data cube for multidimensional analysis of sales data with respect to annual sales per item type for each AllElectronics branch. Each cell holds an aggregate data value, corresponding to the data point in multidimensional space. (For readability, only some cell values are shown.) Concept hierarchies may exist for each attribute, allowing the analysis of data at multiple abstraction levels. For example, a hierarchy for branch could allow branches to be grouped into regions, based on their address. Data cubes provide fast access to precomputed, summarized data, thereby benefiting online analytical processing as well as data mining.

[Figure: a data cube with dimensions year (2008, 2009, 2010), item_type (home entertainment, computer, phone, security), and branch. Visible cell values: 568, 750, 150.]

Figure 3.11 A data cube for sales at AllElectronics.

The cube created at the lowest abstraction level is referred to as the base cuboid. The base cuboid should correspond to an individual entity of interest such as sales or customer. In other words, the lowest level should be usable, or useful for the analysis. A cube at the highest level of abstraction is the apex cuboid. For the sales data in Figure 3.11, the apex cuboid would give one total: the total sales for all three years, for all item types, and for all branches. Data cubes created for varying levels of abstraction are often referred to as cuboids, so that a data cube may instead refer to a lattice of cuboids. Each higher abstraction level further reduces the resulting data size. When replying to data mining requests, the smallest available cuboid relevant to the given task should be used. This issue is also addressed in Chapter 4.
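The lattice of cuboids can be made concrete with a small sketch that aggregates a fact table over every subset of its dimensions. The dimension names follow Figure 3.11, but the fact rows are hypothetical and are not the cell values of the figure. The function name `build_cuboids` is also an assumption. Concept hierarchies (such as branch to region) are left out to keep the example short.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("year", "item_type", "branch")

# Hypothetical fact rows: (year, item_type, branch, sales).
facts = [
    (2008, "computer", "A", 750),
    (2008, "phone", "A", 150),
    (2009, "computer", "A", 800),
    (2009, "computer", "B", 300),
    (2010, "phone", "B", 200),
]

def build_cuboids(rows):
    """Map each subset of DIMENSIONS to its cuboid: {cell key: total sales}."""
    cuboids = {}
    for k in range(len(DIMENSIONS), -1, -1):
        for dims in combinations(range(len(DIMENSIONS)), k):
            cells = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in dims)
                cells[key] += row[-1]  # last field is the sales measure
            cuboids[tuple(DIMENSIONS[i] for i in dims)] = dict(cells)
    return cuboids

cuboids = build_cuboids(facts)
base_cuboid = cuboids[DIMENSIONS]   # lowest abstraction level: all three dimensions
apex_cuboid = cuboids[()]           # highest abstraction level: one grand total
by_year = cuboids[("year",)]

print(len(cuboids))
print(apex_cuboid)
print(by_year)
```

Three dimensions give 2^3 = 8 cuboids. A request for sales per year can be answered from the three-cell `by_year` cuboid instead of the larger base cuboid, which is the sense in which the smallest relevant cuboid should be used.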

3.5 Data Transformation and Data Discretization

This section presents methods of data transformation. In this preprocessing step, the data are transformed or consolidated so that the resulting mining process may be more efficient, and the patterns found may be easier to understand. Data discretization, a form of data transformation, is also discussed.
