HAN
08-ch01-001-038-9780123814791
2011/6/1
3:12
Page 31
#31
1.7 Major Issues in Data Mining
into the knowledge discovery process. Such knowledge can be used for pattern
evaluation as well as to guide the search toward interesting patterns.
Ad hoc data mining and data mining query languages: Query languages (e.g., SQL)
have played an important role in flexible searching because they allow users to pose
ad hoc queries. Similarly, high-level data mining query languages or other high-level
flexible user interfaces will give users the freedom to define ad hoc data mining tasks.
This should facilitate specification of the relevant sets of data for analysis, the domain
knowledge, the kinds of knowledge to be mined, and the conditions and constraints
to be enforced on the discovered patterns. Optimization of the processing of such
flexible mining requests is another promising area of study.
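To make the ingredients of an ad hoc mining request concrete, the sketch below bundles them into a single specification object and applies the stated constraints to candidate patterns. It is purely illustrative: the class `MiningTask`, its field names, the `accepts` helper, and the example values are hypothetical and do not correspond to any actual data mining query language. The SQL string is carried as plain text and is never executed.

```python
from dataclasses import dataclass, field


@dataclass
class MiningTask:
    """Hypothetical container for one ad hoc data mining request."""
    relevant_data: str          # which data to analyze (here, an SQL string)
    kind_of_knowledge: str      # e.g., "association" or "classification"
    domain_knowledge: dict = field(default_factory=dict)  # e.g., concept hierarchies
    constraints: dict = field(default_factory=dict)       # minimum thresholds on patterns

    def accepts(self, pattern):
        """Return True if the pattern meets every minimum-threshold constraint."""
        return all(
            pattern.get(name, 0) >= threshold
            for name, threshold in self.constraints.items()
        )


task = MiningTask(
    relevant_data="SELECT * FROM sales WHERE year = 2010",
    kind_of_knowledge="association",
    domain_knowledge={"city": "country"},
    constraints={"support": 0.05, "confidence": 0.6},
)

# Toy patterns standing in for the output of some mining step.
patterns = [
    {"rule": "bread -> milk", "support": 0.08, "confidence": 0.7},
    {"rule": "eggs -> tea", "support": 0.01, "confidence": 0.9},
]

kept = [p["rule"] for p in patterns if task.accepts(p)]
print(kept)
```

Keeping the request declarative, as here, is what would let a system optimize how the request is processed rather than leaving that to the user.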
Presentation and visualization of data mining results: How can a data mining system
present data mining results, vividly and flexibly, so that the discovered knowledge
is easily understood and directly usable by humans? This is especially crucial
if the data mining process is interactive. It requires the system to adopt expressive
knowledge representations, user-friendly interfaces, and visualization techniques.
1.7.3 Efficiency and Scalability
Efficiency and scalability are always considered when comparing data mining algorithms.
As data amounts continue to multiply, these two factors are especially critical.
Efficiency and scalability of data mining algorithms: Data mining algorithms must be
efficient and scalable in order to effectively extract information from huge amounts
of data in many data repositories or in dynamic data streams. In other words, the
running time of a data mining algorithm must be predictable, short, and acceptable
to applications. Efficiency, scalability, performance, optimization, and the ability to
execute in real time are key criteria that drive the development of many new data
mining algorithms.
Parallel, distributed, and incremental mining algorithms: The enormous size of many
data sets, the wide distribution of data, and the computational complexity of some
data mining methods are factors that motivate the development of parallel and distributed
data-intensive mining algorithms. Such algorithms first partition the data
into “pieces.” Each piece is processed, in parallel, by searching for patterns. The parallel
processes may interact with one another. The patterns from each partition are
eventually merged.
Cloud computing and cluster computing, which use computers in a distributed
and collaborative way to tackle very large-scale computational tasks, are also active
research themes in parallel data mining. In addition, the high cost of some data mining
processes and the incremental nature of input promote incremental data mining,
which incorporates new data updates without having to mine the entire data set “from
scratch.” Such methods perform knowledge modification incrementally to amend
and strengthen what was previously discovered.
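The partition, mine, and merge scheme described above, followed by an incremental update, can be sketched as follows. This is a minimal illustration under strong assumptions: the "pattern" is simply the support count of individual items, the partitions do not interact, and the function names (`count_items`, `mine_partitioned`, `update_incremental`, `frequent`) and the toy transactions are hypothetical rather than taken from the text. Threads are used only to keep the sketch short; a real system would more likely spread the partitions across processes or machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_items(partition):
    """Mine one partition: count the transactions that contain each item."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))  # set() so an item counts once per transaction
    return counts


def mine_partitioned(transactions, n_parts=2):
    """Split the data into pieces, mine each piece in parallel, merge the results."""
    parts = [transactions[i::n_parts] for i in range(n_parts)]
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partial_counts = list(pool.map(count_items, parts))
    merged = Counter()
    for counts in partial_counts:
        merged.update(counts)  # Counter.update adds counts together
    return merged


def update_incremental(counts, new_transactions):
    """Fold new data into existing counts without re-mining the old data."""
    updated = Counter(counts)
    updated.update(count_items(new_transactions))
    return updated


def frequent(counts, min_support):
    """Return the items whose support count reaches min_support."""
    return {item for item, c in counts.items() if c >= min_support}


transactions = [
    ["milk", "bread"],
    ["milk", "butter"],
    ["bread", "butter"],
    ["milk", "bread", "butter"],
]

counts = mine_partitioned(transactions, n_parts=2)
counts = update_incremental(counts, [["milk", "eggs"]])
print(sorted(frequent(counts, min_support=3)))
```

Merging is trivial here because support counts are additive across partitions. For richer patterns, the merge step and the incremental update generally need more care, since a pattern that is infrequent in every partition's local view may still matter globally.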
1.7.4 Diversity of Database Types
The wide diversity of database types poses challenges for data mining, including
the following:
Handling complex types of data: Diverse applications generate a wide spectrum of
new data types, from structured data such as relational and data warehouse data to
semi-structured and unstructured data; from stable data repositories to dynamic data
streams; from simple data objects to temporal data, biological sequences, sensor data,
spatial data, hypertext data, multimedia data, software program code, Web data, and
social network data. It is unrealistic to expect one data mining system to mine all
kinds of data, given the diversity of data types and the different goals of data mining.
Domain- or application-dedicated data mining systems are being constructed for in-depth
mining of specific kinds of data. The construction of effective and efficient
data mining tools for diverse applications remains a challenging and active area of
research.
Mining dynamic, networked, and global data repositories: Multiple sources of data
are connected by the Internet and various kinds of networks, forming gigantic, distributed,
and heterogeneous global information systems and networks. The discovery
of knowledge from different sources of structured, semi-structured, or unstructured
yet interconnected data with diverse data semantics poses great challenges to data
mining. Mining such gigantic, interconnected information networks may help disclose
many more patterns and knowledge in heterogeneous data sets than can be discovered
from a small set of isolated data repositories. Web mining, multisource data
mining, and information network mining have become challenging and fast-evolving
data mining fields.
1.7.5 Data Mining and Society
How does data mining impact society? What steps can data mining take to preserve the
privacy of individuals? Do we use data mining in our daily lives without even knowing
that we do? These questions raise the following issues:
Social impacts of data mining: With data mining penetrating our everyday lives, it is
important to study the impact of data mining on society. How can we use data mining
technology to benefit society? How can we guard against its misuse? The improper
disclosure or use of data and the potential violation of individual privacy and data
protection rights are areas of concern that need to be addressed.
Privacy-preserving data mining: Data mining will help scientific discovery, business
management, economic recovery, and security protection (e.g., the real-time discovery
of intruders and cyberattacks). However, it poses the risk of disclosing an
individual’s personal information. Studies on privacy-preserving data publishing and
data mining are ongoing. The philosophy is to observe data sensitivity and preserve
people’s privacy while performing successful data mining.