Data Mining. Concepts and Techniques, 3rd Edition

HAN 08-ch01-001-038-9780123814791


1.3 What Kinds of Data Can Be Mined?

1.3.1 Database Data

A database system, also called a database management system (DBMS), consists of a

collection of interrelated data, known as a database, and a set of software programs to

manage and access the data. The software programs provide mechanisms for defining

database structures and data storage; for specifying and managing concurrent, shared,

or distributed data access; and for ensuring consistency and security of the information

stored despite system crashes or attempts at unauthorized access.

A relational database is a collection of tables, each of which is assigned a unique

name. Each table consists of a set of attributes (columns or fields) and usually stores

a large set of tuples (records or rows). Each tuple in a relational table represents an

object identified by a unique key and described by a set of attribute values. A semantic

data model, such as an entity-relationship (ER) data model, is often constructed for

relational databases. An ER data model represents the database as a set of entities and

their relationships.

Example 1.2 A relational database for AllElectronics. The fictitious AllElectronics store is used to

illustrate concepts throughout this book. The company is described by the following

relation tables: customer, item, employee, and branch. The headers of the tables described

here are shown in Figure 1.5. (A header is also called the schema of a relation.)

The relation customer consists of a set of attributes describing the customer information,

including a unique customer identity number (cust ID), customer name,

address, age, occupation, annual income, credit information, and category.

Similarly, each of the relations item, employee, and branch consists of a set of attributes

describing the properties of these entities.

Tables can also be used to represent the relationships between or among multiple

entities. In our example, these include purchases (customer purchases items, creating

a sales transaction handled by an employee), items sold (lists items sold in a given

transaction), and works at (employee works at a branch of AllElectronics).

customer (cust ID, name, address, age, occupation, annual income, credit information, category, . . . )

item (item ID, brand, category, type, price, place made, supplier, cost, . . . )

employee (empl ID, name, category, group, salary, commission, . . . )

branch (branch ID, name, address, . . . )

purchases (trans ID, cust ID, empl ID, date, time, method paid, amount)

items sold (trans ID, item ID, qty)

works at (empl ID, branch ID)

Figure 1.5 Relational schema for a relational database, AllElectronics.
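As a rough illustration, the schema in Figure 1.5 could be declared in SQL. The sketch below uses Python's standard sqlite3 module. The column types, the underscore spelling of attribute names, and the choice of primary and foreign keys are assumptions made for this example; the figure gives attribute names only, and its elided attributes (". . .") are left out.

```python
import sqlite3

# Assumed types and keys: Figure 1.5 lists attribute names only.
# Attributes hidden behind ". . ." in the figure are not guessed at.
SCHEMA = """
CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, address TEXT,
                       age INTEGER, occupation TEXT, annual_income REAL,
                       credit_information TEXT, category TEXT);
CREATE TABLE item (item_id INTEGER PRIMARY KEY, brand TEXT, category TEXT,
                   type TEXT, price REAL, place_made TEXT, supplier TEXT,
                   cost REAL);
CREATE TABLE employee (empl_id INTEGER PRIMARY KEY, name TEXT, category TEXT,
                       "group" TEXT, salary REAL, commission REAL);
CREATE TABLE branch (branch_id INTEGER PRIMARY KEY, name TEXT, address TEXT);
-- Relationship tables refer to the entity tables through their keys.
CREATE TABLE purchases (trans_id INTEGER PRIMARY KEY,
                        cust_id INTEGER REFERENCES customer(cust_id),
                        empl_id INTEGER REFERENCES employee(empl_id),
                        date TEXT, time TEXT, method_paid TEXT, amount REAL);
CREATE TABLE items_sold (trans_id INTEGER REFERENCES purchases(trans_id),
                         item_id INTEGER REFERENCES item(item_id),
                         qty INTEGER,
                         PRIMARY KEY (trans_id, item_id));
CREATE TABLE works_at (empl_id INTEGER REFERENCES employee(empl_id),
                       branch_id INTEGER REFERENCES branch(branch_id),
                       PRIMARY KEY (empl_id, branch_id));
"""


def create_allelectronics(conn):
    """Create the seven relations of Figure 1.5 in the given connection."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_allelectronics(conn)
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```

The entity relations (customer, item, employee, branch) each get a single-attribute key, while items_sold and works_at use composite keys because they record relationships among entities.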


Relational data can be accessed by database queries written in a relational query

language (e.g., SQL) or with the assistance of graphical user interfaces. A given query is

transformed into a set of relational operations, such as join, selection, and projection,

and is then optimized for efficient processing. A query allows retrieval of specified subsets

of the data. Suppose that your job is to analyze the AllElectronics data. Through the

use of relational queries, you can ask things like, “Show me a list of all items that were

sold in the last quarter.” Relational languages also use aggregate functions such as sum, avg (average),

count, max (maximum), and min (minimum). Using aggregates allows you

to ask: “Show me the total sales of the last month, grouped by branch,” or “How many sales

transactions occurred in the month of December?” or “Which salesperson had the highest

sales?”
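The three aggregate questions above can be written as SQL. The following is a minimal sketch using Python's sqlite3 module, run over a cut-down version of the purchases, employee, and works at relations. The sample rows, employee names, dates, and the decision to take December 2010 as "the last month" are all hypothetical and serve only to make the queries runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data; only the columns the queries need are included.
conn.executescript("""
CREATE TABLE employee (empl_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE works_at (empl_id INTEGER, branch_id INTEGER);
CREATE TABLE purchases (trans_id INTEGER PRIMARY KEY, empl_id INTEGER,
                        date TEXT, amount REAL);
INSERT INTO employee VALUES (1, 'Ana'), (2, 'Ben');
INSERT INTO works_at VALUES (1, 10), (2, 20);
INSERT INTO purchases VALUES
  (100, 1, '2010-11-28', 50.0),
  (101, 1, '2010-12-03', 200.0),
  (102, 2, '2010-12-15', 120.0),
  (103, 2, '2010-12-20', 30.0);
""")

# "Show me the total sales of the last month, grouped by branch."
sales_by_branch = conn.execute("""
    SELECT w.branch_id, SUM(p.amount)
    FROM purchases AS p
    JOIN works_at AS w ON p.empl_id = w.empl_id
    WHERE p.date BETWEEN '2010-12-01' AND '2010-12-31'
    GROUP BY w.branch_id
    ORDER BY w.branch_id
""").fetchall()

# "How many sales transactions occurred in the month of December?"
december_count = conn.execute("""
    SELECT COUNT(*) FROM purchases WHERE strftime('%m', date) = '12'
""").fetchone()[0]

# "Which salesperson had the highest sales?"
top_salesperson = conn.execute("""
    SELECT e.name, SUM(p.amount) AS total
    FROM purchases AS p
    JOIN employee AS e ON p.empl_id = e.empl_id
    GROUP BY e.empl_id, e.name
    ORDER BY total DESC
    LIMIT 1
""").fetchone()

print(sales_by_branch, december_count, top_salesperson)
```

Each query combines a join or selection with one aggregate function (sum or count), which is the pattern the paragraph describes.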

When mining relational databases, we can go further by searching for trends or

data patterns. For example, data mining systems can analyze customer data to predict

the credit risk of new customers based on their income, age, and previous credit

information. Data mining systems may also detect deviations—that is, items with sales

that are far from those expected in comparison with the previous year. Such deviations

can then be further investigated. For example, data mining may discover that there has

been a change in packaging of an item or a significant increase in price.
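To make the idea of deviation detection concrete, here is a deliberately simple sketch: it flags items whose sales differ from the previous year's by more than a chosen relative threshold. The function name, the 50% threshold, and the sample figures are assumptions for illustration; actual data mining systems generally rely on more robust statistical methods.

```python
def flag_deviations(last_year, this_year, threshold=0.5):
    """Return {item: relative_change} for items whose sales moved by more
    than `threshold` (as a fraction of last year's sales)."""
    flagged = {}
    for item, expected in last_year.items():
        # Skip items with no baseline or no current figure to compare.
        if expected == 0 or item not in this_year:
            continue
        change = (this_year[item] - expected) / expected
        if abs(change) > threshold:
            flagged[item] = change
    return flagged


# Hypothetical yearly sales totals per item.
last_year = {"TV": 1000.0, "camera": 400.0, "printer": 250.0}
this_year = {"TV": 1050.0, "camera": 150.0, "printer": 260.0}

flagged = flag_deviations(last_year, this_year)
print(flagged)
```

An item flagged this way is only a candidate for investigation; the cause, such as a packaging or price change, still has to be found by looking further into the data.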

Relational databases are one of the most commonly available and richest information

repositories, and thus they are a major data form in the study of data mining.

1.3.2

Data Warehouses

Suppose that AllElectronics is a successful international company with branches around

the world. Each branch has its own set of databases. The president of AllElectronics has

asked you to provide an analysis of the company’s sales per item type per branch for the

third quarter. This is a difficult task, particularly since the relevant data are spread out

over several databases physically located at numerous sites.

If AllElectronics had a data warehouse, this task would be easy. A data warehouse

is a repository of information collected from multiple sources, stored under a unified

schema, and usually residing at a single site. Data warehouses are constructed via a

process of data cleaning, data integration, data transformation, data loading, and periodic

data refreshing. This process is discussed in Chapters 3 and 4. Figure 1.6 shows the

typical framework for construction and use of a data warehouse for AllElectronics.
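The construction steps named above (cleaning, integration, transformation, loading) can be sketched in a few lines. Everything specific here is hypothetical: the record layout, the branch names, the currency conversion rates, and the unified schema are invented for the example and are not taken from the AllElectronics description.

```python
def clean(record):
    """Cleaning: drop records with a missing amount and normalize the
    item type's spelling."""
    if record.get("amount") is None:
        return None
    return {**record, "item_type": record["item_type"].strip().lower()}


def to_unified(record, branch, rate_to_usd):
    """Transformation: map a branch-local record to the assumed unified
    schema (branch, item_type, amount_usd)."""
    return {"branch": branch,
            "item_type": record["item_type"],
            "amount_usd": round(record["amount"] * rate_to_usd, 2)}


def build_warehouse(sources):
    """Integration and loading: merge all branch sources into one list."""
    warehouse = []
    for branch, (records, rate) in sources.items():
        for rec in records:
            cleaned = clean(rec)
            if cleaned is not None:
                warehouse.append(to_unified(cleaned, branch, rate))
    return warehouse


# Hypothetical branch databases with local conventions and a missing value.
sources = {
    "Vancouver": ([{"item_type": " Computer ", "amount": 1000.0},
                   {"item_type": "phone", "amount": None}], 1.0),
    "Berlin": ([{"item_type": "Phone", "amount": 200.0}], 1.5),
}

warehouse = build_warehouse(sources)
print(warehouse)
```

Periodic refreshing would amount to rerunning such a pipeline on new source data; it is omitted here for brevity.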

To facilitate decision making, the data in a data warehouse are organized around

major subjects (e.g., customer, item, supplier, and activity). The data are stored to provide

information from a historical perspective, such as in the past 6 to 12 months, and are

typically summarized. For example, rather than storing the details of each sales transaction,

the data warehouse may store a summary of the transactions per item type for each

store or, summarized to a higher level, for each sales region.
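The summarization described here can be sketched as a grouped total. In the example below, the detail records, branch names, and regions are hypothetical; the point is only that the same transactions can be summarized per item type for each store, or rolled up to the coarser level of sales region.

```python
from collections import defaultdict

# Hypothetical detail records: (branch, region, item_type, amount).
transactions = [
    ("Vancouver", "North America", "computer", 1200.0),
    ("Vancouver", "North America", "computer", 800.0),
    ("Vancouver", "North America", "phone", 300.0),
    ("Toronto", "North America", "computer", 500.0),
    ("Berlin", "Europe", "phone", 450.0),
]


def summarize(records, key):
    """Sum the amount field over all records that share the same key."""
    totals = defaultdict(float)
    for rec in records:
        totals[key(rec)] += rec[3]
    return dict(totals)


# Summary per item type for each store (branch).
by_type_and_branch = summarize(transactions, lambda r: (r[2], r[0]))
# The same data summarized to a higher level: per item type for each region.
by_type_and_region = summarize(transactions, lambda r: (r[2], r[1]))

print(by_type_and_branch)
print(by_type_and_region)
```

Storing only such summaries trades detail for size: the regional totals cannot be broken back down into individual transactions.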

A data warehouse is usually modeled by a multidimensional data structure, called a

data cube, in which each dimension corresponds to an attribute or a set of attributes

in the schema, and each cell stores the value of some aggregate measure such as count
