Data Mining. Concepts and Techniques, 3rd Edition

HAN 08-ch01-001-038-9780123814791



1.5 Which Technologies Are Used?


Unsupervised learning is essentially a synonym for clustering. The learning process is unsupervised since the input examples are not class labeled. Typically, we may use clustering to discover classes within the data. For example, an unsupervised learning method can take, as input, a set of images of handwritten digits. Suppose that it finds 10 clusters of data. These clusters may correspond to the 10 distinct digits of 0 to 9, respectively. However, since the training data are not labeled, the learned model cannot tell us the semantic meaning of the clusters found.
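
As a rough illustration of clustering unlabeled data, the sketch below runs a plain k-means loop on a handful of 2-D points. The k-means algorithm, the sample points, and the names `kmeans` and `dist2` are assumptions chosen for illustration; the text does not prescribe a particular clustering method.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means; returns (centroids, cluster index of each point)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k input points as starting centroids
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                      for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))
    return centroids, assignment

# Hypothetical unlabeled data: two visibly separate groups.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.1), (7.9, 8.3), (8.2, 7.9)]
centroids, assignment = kmeans(points, k=2)
```

The result groups the first three points together and the last three together, but the cluster indices 0 and 1 are arbitrary: nothing in the output says what either group means, which is the limitation described above.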

Semi-supervised learning is a class of machine learning techniques that make use of both labeled and unlabeled examples when learning a model. In one approach, labeled examples are used to learn class models and unlabeled examples are used to refine the boundaries between classes. For a two-class problem, we can think of the set of examples belonging to one class as the positive examples and those belonging to the other class as the negative examples. In Figure 1.12, if we do not consider the unlabeled examples, the dashed line is the decision boundary that best partitions the positive examples from the negative examples. Using the unlabeled examples, we can refine the decision boundary to the solid line. Moreover, we can detect that the two positive examples at the top right corner, though labeled, are likely noise or outliers.
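
One simple way to see unlabeled examples shifting a decision boundary is self-training with a nearest-centroid classifier on 1-D data. This is a minimal sketch under stated assumptions, not the method behind Figure 1.12: the technique, the function `self_train`, and all data values are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def self_train(pos, neg, unlabeled, rounds=10):
    """Return (initial_boundary, refined_boundary) for 1-D examples."""
    # Class models learned from the labeled examples only.
    c_pos, c_neg = mean(pos), mean(neg)
    initial = (c_pos + c_neg) / 2
    for _ in range(rounds):
        # Pseudo-label each unlabeled point with the nearer class centroid.
        as_pos = [x for x in unlabeled if abs(x - c_pos) < abs(x - c_neg)]
        as_neg = [x for x in unlabeled if abs(x - c_pos) >= abs(x - c_neg)]
        # Re-estimate the class centroids from labeled plus pseudo-labeled points.
        new_pos, new_neg = mean(pos + as_pos), mean(neg + as_neg)
        if (new_pos, new_neg) == (c_pos, c_neg):
            break  # nothing changed, so the boundary has settled
        c_pos, c_neg = new_pos, new_neg
    return initial, (c_pos + c_neg) / 2

# Hypothetical data: one labeled example per class, six unlabeled ones.
initial, refined = self_train(pos=[4.0], neg=[0.0],
                              unlabeled=[-0.5, 0.5, 1.0, 5.0, 5.5, 6.0])
```

With these values the boundary starts at 2.0 (labeled examples only) and moves to 2.6875 once the unlabeled points pull each centroid toward the bulk of its class.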

Active learning is a machine learning approach that lets users play an active role in the learning process. An active learning approach can ask a user (e.g., a domain expert) to label an example, which may be from a set of unlabeled examples or synthesized by the learning program. The goal is to optimize the model quality by actively acquiring knowledge from human users, given a constraint on how many examples they can be asked to label.
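
The labeling budget can be made concrete with a small uncertainty-sampling loop. The sketch assumes a 1-D threshold concept and a hypothetical `oracle` function standing in for the domain expert; uncertainty sampling is one common query strategy, chosen here for illustration rather than taken from the text.

```python
def active_learn(pool, oracle, lo, hi, budget):
    """lo is a known negative, hi a known positive.

    Returns (estimated boundary, list of examples sent to the oracle).
    """
    queried = []
    candidates = [x for x in pool if lo < x < hi]
    while budget > 0 and candidates:
        boundary = (lo + hi) / 2
        # Uncertainty sampling: ask about the example nearest the current boundary.
        x = min(candidates, key=lambda v: abs(v - boundary))
        queried.append(x)
        budget -= 1
        if oracle(x):
            hi = x  # positive answer tightens the upper side
        else:
            lo = x  # negative answer tightens the lower side
        candidates = [v for v in pool if lo < v < hi]
    return (lo + hi) / 2, queried

# Hypothetical setup: the expert labels x as positive when x >= 6.5.
pool = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
boundary, queried = active_learn(pool, oracle=lambda x: x >= 6.5,
                                 lo=0.0, hi=10.0, budget=3)
```

Here three questions (about 5.0, 7.0, and 6.0) are enough to place the boundary at 6.5, whereas labeling the pool in arbitrary order could spend the whole budget on uninformative examples.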

Figure 1.12 Semi-supervised learning. Legend: positive example; negative example; unlabeled example; decision boundary without unlabeled examples (dashed line); decision boundary with unlabeled examples (solid line); noise/outliers.


You can see there are many similarities between data mining and machine learning. For classification and clustering tasks, machine learning research often focuses on the accuracy of the model. In addition to accuracy, data mining research places strong emphasis on the efficiency and scalability of mining methods on large data sets, as well as on ways to handle complex types of data and explore new, alternative methods.

1.5.3 Database Systems and Data Warehouses

Database systems research focuses on the creation, maintenance, and use of databases for organizations and end-users. Particularly, database systems researchers have established highly recognized principles in data models, query languages, query processing and optimization methods, data storage, and indexing and accessing methods. Database systems are often well known for their high scalability in processing very large, relatively structured data sets.

Many data mining tasks need to handle large data sets or even real-time, fast streaming data. Therefore, data mining can make good use of scalable database technologies to achieve high efficiency and scalability on large data sets. Moreover, data mining tasks can be used to extend the capability of existing database systems to satisfy advanced users’ sophisticated data analysis requirements.

Recent database systems have built systematic data analysis capabilities on database data using data warehousing and data mining facilities. A data warehouse integrates data originating from multiple sources and various timeframes. It consolidates data in multidimensional space to form partially materialized data cubes. The data cube model not only facilitates OLAP in multidimensional databases but also promotes multidimensional data mining (see Section 1.3.2).
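
To make "partially materialized data cube" more tangible, the sketch below aggregates a tiny fact table along chosen subsets of its dimensions. The dimension names, the sales rows, and the helper `cuboid` are all hypothetical; a real warehouse would do this inside the database engine rather than in application code.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (city, quarter, item, sales amount).
DIMENSIONS = ("city", "quarter", "item")
facts = [
    ("Chicago", "Q1", "phone", 100),
    ("Chicago", "Q2", "phone", 150),
    ("Toronto", "Q1", "phone", 80),
    ("Toronto", "Q1", "laptop", 200),
]

def cuboid(rows, dims):
    """Total sales grouped by the given subset of dimensions."""
    idx = [DIMENSIONS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)

# A full cube has one cuboid per subset of dimensions (2**3 = 8 here).
all_group_bys = [dims for n in range(len(DIMENSIONS) + 1)
                 for dims in combinations(DIMENSIONS, n)]

# Partial materialization: precompute and store only some of the cuboids.
chosen = [("city",), ("city", "quarter"), ()]
materialized = {dims: cuboid(facts, dims) for dims in chosen}
```

The empty tuple `()` is the apex cuboid (the grand total), and `("city",)` is a roll-up over quarter and item. Cuboids that are not stored can still be computed on demand from the base rows or from a finer stored cuboid.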

1.5.4 Information Retrieval

Information retrieval (IR) is the science of searching for documents or information in documents. Documents can be text or multimedia, and may reside on the Web. The differences between traditional information retrieval and database systems are twofold: Information retrieval assumes that (1) the data under search are unstructured; and (2) the queries are formed mainly by keywords, which do not have complex structures (unlike SQL queries in database systems).
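
A keyword query over unstructured text can be sketched with an inverted index, a standard IR data structure that the text does not itself name. The documents and the functions `build_index` and `keyword_search` below are hypothetical, and tokenization is reduced to lowercasing and whitespace splitting.

```python
from collections import defaultdict

# Hypothetical unstructured documents, keyed by document id.
docs = {
    1: "data mining discovers patterns in large data sets",
    2: "database systems process structured data efficiently",
    3: "information retrieval searches unstructured text documents",
}

def build_index(documents):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def keyword_search(index, query):
    """Return the ids of documents that contain every keyword in the query."""
    postings = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*postings) if postings else set()

index = build_index(docs)
hits = keyword_search(index, "structured data")
```

The query is just a list of words with no joins, predicates, or schema, which is the contrast with SQL drawn above.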

The typical approaches in information retrieval adopt probabilistic models. For example, a text document can be regarded as a bag of words, that is, a multiset of words appearing in the document. The document’s language model is the probability distribution over words that generates the bag of words in the document. The similarity between two documents can be measured by the similarity between their corresponding language models.
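
The bag-of-words view can be sketched as a unigram language model estimated from word counts, with two documents compared through their models. Jensen-Shannon divergence is used here as one possible similarity measure; that choice, the sample sentences, and the function names are assumptions for illustration.

```python
import math
from collections import Counter

def language_model(text):
    """Unigram model: relative frequency of each word in the bag of words."""
    bag = Counter(text.lower().split())  # multiset of words
    total = sum(bag.values())
    return {word: count / total for word, count in bag.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical models,
    1 for models with no words in common."""
    def kl(a, m):
        return sum(pa * math.log2(pa / m[w]) for w, pa in a.items() if pa > 0)
    vocab = set(p) | set(q)
    # m is the average of the two models over the joint vocabulary.
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

d1 = language_model("data mining finds patterns in data")
d2 = language_model("data mining discovers patterns in large data")
d3 = language_model("the cat sat on the mat")

near = js_divergence(d1, d2)  # related documents: small divergence
far = js_divergence(d1, d3)   # unrelated documents: large divergence
```

A similarity score can then be taken as `1 - js_divergence(p, q)`, so that documents with closer language models score higher.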

Furthermore, a topic in a set of text documents can be modeled as a probability distribution over the vocabulary, which is called a topic model. A text document, which may involve one or multiple topics, can be regarded as a mixture of multiple topic models. By integrating information retrieval models and data mining techniques, we can find
