HAN
08-ch01-001-038-9780123814791
2011/6/1
3:12
Page 14
#14
14
Chapter 1 Introduction
of items that are frequently sold together. The mining of such frequent patterns from
transactional data is discussed in Chapters 6 and 7.
1.3.4 Other Kinds of Data
Besides relational database data, data warehouse data, and transaction data, there are
many other kinds of data that have versatile forms and structures and rather different
semantic meanings. Such kinds of data can be seen in many applications: time-related
or sequence data (e.g., historical records, stock exchange data, and time-series and
biological sequence data), data streams (e.g., video surveillance and sensor data, which are
continuously transmitted), spatial data (e.g., maps), engineering design data (e.g., the
design of buildings, system components, or integrated circuits), hypertext and
multimedia data (including text, image, video, and audio data), graph and networked data
(e.g., social and information networks), and the Web (a huge, widely distributed
information repository made available by the Internet). These applications bring about new
challenges, like how to handle data carrying special structures (e.g., sequences, trees,
graphs, and networks) and specific semantics (such as ordering, image, audio and video
contents, and connectivity), and how to mine patterns that carry rich structures and
semantics.
Various kinds of knowledge can be mined from these kinds of data. Here, we list
just a few. Regarding temporal data, for instance, we can mine banking data for
changing trends, which may aid in the scheduling of bank tellers according to the volume of
customer traffic. Stock exchange data can be mined to uncover trends that could help
you plan investment strategies (e.g., the best time to purchase AllElectronics stock). We
could mine computer network data streams to detect intrusions based on anomalies in
message flows, which may be discovered by clustering, by dynamically constructing stream
models, or by comparing the current frequent patterns with those at a previous time.
With spatial data, we may look for patterns that describe changes in metropolitan
poverty rates based on city distances from major highways. The relationships among
a set of spatial objects can be examined in order to discover which subsets of objects
are spatially autocorrelated or associated. By mining text data, such as literature on data
mining from the past ten years, we can identify the evolution of hot topics in the field. By
mining user comments on products (which are often submitted as short text messages),
we can assess customer sentiments and understand how well a product is embraced by
a market. From multimedia data, we can mine images to identify objects and classify
them by assigning semantic labels or tags. By mining video data of a hockey game, we
can detect video sequences corresponding to goals. Web mining can help us learn about
the distribution of information on the WWW in general, characterize and classify web
pages, and uncover web dynamics and the association and other relationships among
different web pages, users, communities, and web-based activities.
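
As a rough illustration of the intrusion-detection idea mentioned above (comparing the current frequent patterns with those at a previous time), the following minimal sketch reports which items became frequent or stopped being frequent between two windows of a stream. The function names, the protocol labels, and the 20% support threshold are assumptions made for this example; practical stream-mining methods are considerably more elaborate.

```python
from collections import Counter

def frequent_items(window, min_support):
    """Return the items whose relative frequency in the window is at least min_support."""
    if not window:
        return set()
    counts = Counter(window)
    total = len(window)
    return {item for item, count in counts.items() if count / total >= min_support}

def pattern_shift(previous_window, current_window, min_support=0.2):
    """Compare the frequent items of two windows (threshold is an assumed value)."""
    prev = frequent_items(previous_window, min_support)
    curr = frequent_items(current_window, min_support)
    return {"appeared": curr - prev, "disappeared": prev - curr}

# Hypothetical message types observed in two consecutive time windows.
previous = ["http", "http", "dns", "http", "dns", "smtp", "http", "dns", "http", "dns"]
current = ["http", "ssh", "ssh", "ssh", "dns", "ssh", "http", "ssh", "http", "ssh"]

shift = pattern_shift(previous, current)
print(shift)
```

In this toy data, a message type that was absent earlier dominates the current window, which is the kind of change an analyst might flag for closer inspection.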
It is important to keep in mind that, in many applications, multiple types of data
are present. For example, in web mining, there often exist text data and multimedia
data (e.g., pictures and videos) on web pages, graph data like web graphs, and map
data on some web sites. In bioinformatics, genomic sequences, biological networks, and
3-D spatial structures of genomes may coexist for certain biological objects. Mining
multiple data sources of complex data often leads to fruitful findings due to the mutual
enhancement and consolidation of such multiple sources. On the other hand, it is also
challenging because of the difficulties in data cleaning and data integration, as well as
the complex interactions among the multiple sources of such data.
While such data require sophisticated facilities for efficient storage, retrieval, and
updating, they also provide fertile ground and raise challenging research and
implementation issues for data mining. Data mining on such data is an advanced topic. The
methods involved are extensions of the basic techniques presented in this book.
1.4 What Kinds of Patterns Can Be Mined?
We have observed various types of data and information repositories on which data
mining can be performed. Let us now examine the kinds of patterns that can be mined.
There are a number of data mining functionalities. These include characterization
and discrimination (Section 1.4.1); the mining of frequent patterns, associations, and
correlations (Section 1.4.2); classification and regression (Section 1.4.3); clustering
analysis (Section 1.4.4); and outlier analysis (Section 1.4.5). Data mining functionalities are
used to specify the kinds of patterns to be found in data mining tasks. In general, such
tasks can be classified into two categories: descriptive and predictive. Descriptive
mining tasks characterize properties of the data in a target data set. Predictive mining tasks
perform induction on the current data in order to make predictions.
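
To make the distinction concrete, the sketch below applies one descriptive step and one predictive step to the same small data set. The monthly sales figures are invented for illustration, and the straight-line least-squares fit is only one simple way to perform induction; it is not a method prescribed by the text.

```python
from statistics import mean

# Hypothetical monthly sales figures (illustrative data only).
monthly_sales = [100.0, 110.0, 120.0, 130.0, 140.0]

def describe(values):
    """Descriptive task: summarize properties of the data at hand."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }

def fit_line(values):
    """Predictive task: induce a least-squares line y = slope * x + intercept."""
    xs = list(range(len(values)))
    x_bar = mean(xs)
    y_bar = mean(values)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    slope = numerator / denominator
    intercept = y_bar - slope * x_bar
    return slope, intercept

summary = describe(monthly_sales)
slope, intercept = fit_line(monthly_sales)

# Use the induced model to predict the next (unseen) month.
next_month = slope * len(monthly_sales) + intercept
print(summary, next_month)
```

The summary describes only the data already collected, whereas the fitted line is used to estimate a value that has not yet been observed.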
Data mining functionalities, and the kinds of patterns they can discover, are described
below. In addition, Section 1.4.6 looks at what makes a pattern interesting. Interesting
patterns represent knowledge.
1.4.1 Class/Concept Description: Characterization and Discrimination
Data entries can be associated with classes or concepts. For example, in the AllElectronics
store, classes of items for sale include computers and printers, and concepts of customers
include bigSpenders and budgetSpenders. It can be useful to describe individual classes
and concepts in summarized, concise, and yet precise terms. Such descriptions of a class
or a concept are called class/concept descriptions. These descriptions can be derived
using (1) data characterization, by summarizing the data of the class under study (often
called the target class) in general terms; (2) data discrimination, by comparing
the target class with one or a set of comparative classes (often called the contrasting
classes); or (3) both data characterization and discrimination.
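
The two kinds of description can be sketched on a toy customer table. The attribute names (age, purchases_per_year), the records, and the use of attribute means as the summary are all assumptions made for this example; the concept names bigSpenders and budgetSpenders come from the text.

```python
from statistics import mean

# Hypothetical customer records labeled with a concept.
customers = [
    {"concept": "bigSpenders", "age": 45, "purchases_per_year": 30},
    {"concept": "bigSpenders", "age": 50, "purchases_per_year": 40},
    {"concept": "bigSpenders", "age": 40, "purchases_per_year": 35},
    {"concept": "budgetSpenders", "age": 25, "purchases_per_year": 5},
    {"concept": "budgetSpenders", "age": 30, "purchases_per_year": 10},
    {"concept": "budgetSpenders", "age": 35, "purchases_per_year": 6},
]
ATTRIBUTES = ["age", "purchases_per_year"]

def characterize(records, concept, attributes=ATTRIBUTES):
    """Characterization: summarize the target class by its attribute means."""
    members = [r for r in records if r["concept"] == concept]
    return {a: mean(r[a] for r in members) for a in attributes}

def discriminate(records, target, contrasting, attributes=ATTRIBUTES):
    """Discrimination: compare the target class summary with a contrasting class."""
    t = characterize(records, target, attributes)
    c = characterize(records, contrasting, attributes)
    return {a: t[a] - c[a] for a in attributes}

target_summary = characterize(customers, "bigSpenders")
difference = discriminate(customers, "bigSpenders", "budgetSpenders")
print(target_summary, difference)
```

Characterization looks only at the target class, while discrimination reports how the target class differs from the contrasting class on the same attributes.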
Data characterization is a summarization of the general characteristics or features
of a target class of data. The data corresponding to the user-specified class are typically
collected by a query. For example, to study the characteristics of software products with
sales that increased by 10% in the previous year, the data related to such products can
be collected by executing an SQL query on the sales database.
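
A query of this kind might look like the following sketch, which uses Python's built-in sqlite3 module and an in-memory database. The sales table, its columns, the product names, and the figures are hypothetical, and "increased by 10%" is interpreted here as year-over-year growth of at least 10%, which is an assumption about the intended meaning.

```python
import sqlite3

# Hypothetical schema and data; the actual AllElectronics database is not specified here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (product TEXT, category TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("PhotoEdit", "software", 2009, 1000.0),
        ("PhotoEdit", "software", 2010, 1200.0),
        ("TaxHelper", "software", 2009, 800.0),
        ("TaxHelper", "software", 2010, 820.0),
        ("PrinterA", "printer", 2009, 500.0),
        ("PrinterA", "printer", 2010, 900.0),
    ],
)

# Collect the target class: software products whose sales grew by at least 10%
# relative to the preceding year (self-join on product and consecutive years).
query = """
SELECT cur.product, cur.amount AS current_sales, prev.amount AS previous_sales
FROM sales AS cur
JOIN sales AS prev
  ON prev.product = cur.product AND prev.year = cur.year - 1
WHERE cur.category = 'software'
  AND cur.year = ?
  AND cur.amount >= 1.10 * prev.amount
ORDER BY cur.product
"""
target_rows = conn.execute(query, (2010,)).fetchall()
conn.close()

print(target_rows)
```

The rows returned form the target class whose general characteristics would then be summarized.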