HAN
08-ch01-001-038-9780123814791
2011/6/1
3:12
Page 2
#2
2
Chapter 1 Introduction
society, science and engineering, medicine, and almost every other aspect of daily life.
This explosive growth of available data volume is a result of the computerization of
our society and the fast development of powerful data collection and storage tools.
Businesses worldwide generate gigantic data sets, including sales transactions, stock
trading records, product descriptions, sales promotions, company profiles and performance,
and customer feedback. For example, large stores, such as Wal-Mart, handle
hundreds of millions of transactions per week at thousands of branches around the
world. Scientific and engineering practices continuously generate data on the order of
many petabytes from remote sensing, process measuring, scientific experiments,
system performance, engineering observations, and environment surveillance.
Global backbone telecommunication networks carry tens of petabytes of data traffic
every day. The medical and health industry generates tremendous amounts of data from
medical records, patient monitoring, and medical imaging. Search engines process
tens of petabytes of data daily in support of billions of Web searches. Communities and
social media have become increasingly important data sources, producing digital pictures
and videos, blogs, Web communities, and various kinds of social networks. The
list of sources that generate huge amounts of data is endless.
This explosively growing, widely available, and gigantic body of data makes our
time truly the data age. Powerful and versatile tools are badly needed to automatically
uncover valuable information from the tremendous amounts of data and to transform
such data into organized knowledge. This necessity has led to the birth of data mining.
The field is young, dynamic, and promising. Data mining has made and will continue to
make great strides in our journey from the data age toward the coming information age.
Example 1.1
Data mining turns a large collection of data into knowledge. A search engine (e.g.,
Google) receives hundreds of millions of queries every day. Each query can be viewed
as a transaction where the user describes her or his information need. What novel and
useful knowledge can a search engine learn from such a huge collection of queries
collected from users over time? Interestingly, some patterns found in user search queries
can disclose invaluable knowledge that cannot be obtained by reading individual data
items alone. For example, Google’s Flu Trends uses specific search terms as indicators of
flu activity. It found a close relationship between the number of people who search for
flu-related information and the number of people who actually have flu symptoms. A
pattern emerges when all of the search queries related to flu are aggregated. Using
aggregated Google search data, Flu Trends can estimate flu activity up to two weeks faster
than traditional systems can.² This example shows how data mining can turn a large
collection of data into knowledge that can help meet a current global challenge.
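The aggregation idea in this example can be illustrated with a minimal sketch. It is not Google's actual method: the query log, the weekly case counts, and the keyword list `FLU_TERMS` are all invented for illustration, and a simple Pearson correlation stands in for whatever model Flu Trends used. The sketch counts flu-related queries per week and measures how closely that series tracks the reported cases.

```python
from collections import Counter
from math import sqrt

# Hypothetical query log of (week, query text) pairs, invented for illustration.
QUERY_LOG = [
    (1, "flu symptoms"), (1, "cheap flights"), (1, "weather"),
    (2, "flu symptoms"), (2, "flu remedies"), (2, "news"),
    (3, "flu symptoms"), (3, "fever and cough"), (3, "flu shot near me"),
    (3, "recipes"),
    (4, "flu remedies"), (4, "fever and cough"), (4, "flu symptoms"),
    (4, "flu shot near me"), (4, "movie times"),
]

# Hypothetical weekly counts of reported flu cases, also invented.
REPORTED_CASES = {1: 120, 2: 210, 3: 330, 4: 400}

# Assumed keyword list; a real system would select indicator terms far more carefully.
FLU_TERMS = ("flu", "fever", "cough")


def is_flu_related(query):
    """Return True if the query contains any of the assumed flu terms."""
    text = query.lower()
    return any(term in text for term in FLU_TERMS)


def weekly_flu_query_counts(log):
    """Aggregate individual queries into one flu-related count per week."""
    counts = Counter()
    for week, query in log:
        if is_flu_related(query):
            counts[week] += 1
    return dict(counts)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


counts = weekly_flu_query_counts(QUERY_LOG)
weeks = sorted(REPORTED_CASES)
query_series = [counts.get(week, 0) for week in weeks]
case_series = [REPORTED_CASES[week] for week in weeks]
correlation = pearson(query_series, case_series)

print("flu-related queries per week:", query_series)
print("correlation with reported cases:", round(correlation, 3))
```

No single query in the log says anything about flu activity; the signal appears only in the weekly totals, which is the point the example makes about aggregated data.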
1.1.2 Data Mining as the Evolution of Information Technology
Data mining can be viewed as a result of the natural evolution of information technology.
The database and data management industry evolved in the development of
² This is reported in [GMP+09].
Data Collection and Database Creation (1960s and earlier)
    Primitive file processing

Database Management Systems (1970s to early 1980s)
    Hierarchical and network database systems
    Relational database systems
    Data modeling: entity-relationship models, etc.
    Indexing and accessing methods
    Query languages: SQL, etc.
    User interfaces, forms, and reports
    Query processing and optimization
    Transactions, concurrency control, and recovery
    Online transaction processing (OLTP)

Advanced Database Systems (mid-1980s to present)
    Advanced data models: extended-relational, object relational, deductive, etc.
    Managing complex data: spatial, temporal, multimedia, sequence and structured,
        scientific, engineering, moving objects, etc.
    Data streams and cyber-physical data systems
    Web-based databases (XML, semantic web)
    Managing uncertain data and data cleaning
    Integration of heterogeneous sources
    Text database systems and integration with information retrieval
    Extremely large data management
    Database system tuning and adaptive systems
    Advanced queries: ranking, skyline, etc.
    Cloud computing and parallel data processing
    Issues of data privacy and security

Advanced Data Analysis (late 1980s to present)
    Data warehouse and OLAP
    Data mining and knowledge discovery: classification, clustering, outlier analysis,
        association and correlation, comparative summary, discrimination analysis,
        pattern discovery, trend and deviation analysis, etc.
    Mining complex types of data: streams, sequence, text, spatial, temporal,
        multimedia, Web, networks, etc.
    Data mining applications: business, society, retail, banking, telecommunications,
        science and engineering, blogs, daily life, etc.
    Data mining and society: invisible data mining, privacy-preserving data mining,
        mining social and information networks, recommender systems, etc.

Future Generation of Information Systems (Present to future)

Figure 1.1 The evolution of database system technology.
several critical functionalities (Figure 1.1): data collection and database creation, data
management (including data storage and retrieval and database transaction processing),
and advanced data analysis (involving data warehousing and data mining). The early
development of data collection and database creation mechanisms served as a prerequisite
for the later development of effective mechanisms for data storage and retrieval,
as well as query and transaction processing. Nowadays, numerous database systems
offer query and transaction processing as common practice. Advanced data analysis has
naturally become the next step.
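The three functionalities named above can be seen side by side in a small sketch, here using Python's built-in `sqlite3` module. The `sales` table, its columns, and the rows are hypothetical and chosen only for illustration, and a single aggregate query stands in for "advanced data analysis", which in practice involves far more than one `GROUP BY`.

```python
import sqlite3

# 1. Data collection and database creation: an in-memory database with one
#    hypothetical table of sales records.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    " id INTEGER PRIMARY KEY,"
    " store TEXT NOT NULL,"
    " item TEXT NOT NULL,"
    " amount REAL NOT NULL)"
)

rows = [
    ("north", "milk", 2.5),
    ("north", "bread", 1.5),
    ("south", "milk", 2.5),
    ("south", "milk", 2.5),
    ("south", "eggs", 3.0),
]

# 2. Data management: transaction processing. Using the connection as a
#    context manager commits on success and rolls back if an exception occurs.
with conn:
    conn.executemany(
        "INSERT INTO sales (store, item, amount) VALUES (?, ?, ?)", rows
    )

# 2 (continued). Data management: storage and retrieval through a query.
north_sales = conn.execute(
    "SELECT item, amount FROM sales WHERE store = ? ORDER BY id", ("north",)
).fetchall()

# 3. A first step toward data analysis: summarize all records rather than
#    retrieve individual ones.
revenue_by_item = conn.execute(
    "SELECT item, SUM(amount), COUNT(*) FROM sales"
    " GROUP BY item ORDER BY SUM(amount) DESC"
).fetchall()

print("north store sales:", north_sales)
print("revenue by item:", revenue_by_item)
conn.close()
```

The retrieval query answers a question about specific stored records, whereas the aggregate query summarizes the whole table; the latter kind of question is where data warehousing and data mining begin.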