HAN
08-ch01-001-038-9780123814791
2011/6/1
3:12
Page 11
#11
1.3 What Kinds of Data Can Be Mined?
11
Data source in Chicago
Data source in Toronto
Data source in Vancouver
Data source in New York
Data
Warehouse
Clean
Integrate
Transform
Load
Refresh
Query and
analysis tools
Client
Client
Figure 1.6
Typical framework of a data warehouse for AllElectronics.
or sum
(sales amount). A data cube provides a multidimensional view of data and allows
the precomputation and fast access of summarized data.
Example 1.3
A data cube for
AllElectronics. A data cube for summarized sales data of AllElectronics
is presented in Figure 1.7(a). The cube has three dimensions: address (with city values
Chicago, New York, Toronto, Vancouver),
time (with quarter values
Q1, Q2, Q3, Q4), and
item (with item type values
home entertainment, computer, phone, security). The aggregate
value stored in each cell of the cube is sales amount (in thousands). For example, the total
sales for the first quarter, Q1, for the items related to security systems in Vancouver is
$
400, 000, as stored in cell Vancouver, Q1, security . Additional cubes may be used to store
aggregate sums over each dimension, corresponding to the aggregate values obtained using
different SQL group-bys (e.g., the total sales amount per city and quarter, or per city and
item, or per quarter and item, or per each individual dimension).
By providing multidimensional data views and the precomputation of summarized
data, data warehouse systems can provide inherent support for OLAP. Online analyti-
cal processing operations make use of background knowledge regarding the domain of
the data being studied to allow the presentation of data at different levels of abstraction.
Such operations accommodate different user viewpoints. Examples of OLAP opera-
tions include drill-down and roll-up, which allow the user to view the data at differing
degrees of summarization, as illustrated in Figure 1.7(b). For instance, we can drill
down on sales data summarized by quarter to see data summarized by month. Sim-
ilarly, we can roll up on sales data summarized by city to view data summarized by
country.
Although data warehouse tools help support data analysis, additional tools for
data mining are often needed for in-depth analysis. Multidimensional data mining
(also called exploratory multidimensional data mining) performs data mining in
HAN
08-ch01-001-038-9780123814791
2011/6/1
3:12
Page 12
#12
12
Chapter 1 Introduction
605
825
14
400
Q1
Q2
Q3
Q4
Chicago
New York
Toronto
440
1560
395
Vancouver
time
(quarters)
address
(cities)
home
entertainment
computer
phone
item (types)
security
Q1, security>
Q1
Q2
Q3
Q4
USA
Canada
2000
1000
time
(quarters)
address
(countries)
home
entertainment
computer
phone
item (types)
security
150
100
150
Jan
Feb
March
Chicago
New York
Toronto
Vancouver
time
(months)
address
(cities)
home
entertainment
computer
phone
item (types)
security
Drill-down
on time data for Q1
Roll-up
on address
(a)
(b)
Figure 1.7
A multidimensional data cube, commonly used for data warehousing, (a) showing summa-
rized data for AllElectronics and (b) showing summarized data resulting from drill-down and
roll-up operations on the cube in (a). For improved readability, only some of the cube cell
values are shown.
HAN
08-ch01-001-038-9780123814791
2011/6/1
3:12
Page 13
#13
1.3 What Kinds of Data Can Be Mined?
13
multidimensional space in an OLAP style. That is, it allows the exploration of mul-
tiple combinations of dimensions at varying levels of granularity in data mining,
and thus has greater potential for discovering interesting patterns representing knowl-
edge. An overview of data warehouse and OLAP technology is provided in Chapter 4.
Advanced issues regarding data cube computation and multidimensional data mining
are discussed in Chapter 5.
1.3.3
Transactional Data
In general, each record in a transactional database captures a transaction, such as a
customer’s purchase, a flight booking, or a user’s clicks on a web page. A transaction typ-
ically includes a unique transaction identity number (trans ID) and a list of the items
making up the transaction, such as the items purchased in the transaction. A trans-
actional database may have additional tables, which contain other information related
to the transactions, such as item description, information about the salesperson or the
branch, and so on.
Example 1.4
A transactional database for
AllElectronics. Transactions can be stored in a table, with
one record per transaction. A fragment of a transactional database for AllElectronics is
shown in Figure 1.8. From the relational database point of view, the sales table in the
figure is a nested relation because the attribute list of item IDs contains a set of items.
Because most relational database systems do not support nested relational structures,
the transactional database is usually either stored in a flat file in a format similar to
the table in Figure 1.8 or unfolded into a standard relation in a format similar to the
items sold table in Figure 1.5.
As an analyst of AllElectronics, you may ask,“Which items sold well together?” This
kind of market basket data analysis would enable you to bundle groups of items together
as a strategy for boosting sales. For example, given the knowledge that printers are
commonly purchased together with computers, you could offer certain printers at a
steep discount (or even for free) to customers buying selected computers, in the hopes
of selling more computers (which are often more expensive than printers). A tradi-
tional database system is not able to perform market basket data analysis. Fortunately,
data mining on transactional data can do so by mining frequent itemsets, that is, sets
trans ID
list of item IDs
T100
I1, I3, I8, I16
T200
I2, I8
...
...
Figure 1.8
Fragment of a transactional database for sales at AllElectronics.