2.4 Measuring Data Similarity and Dissimilarity
This section presents similarity and dissimilarity measures, which are referred to as
measures of proximity. Similarity and dissimilarity are related. A similarity measure for
two objects, i and j, will typically return the value 0 if the objects are unalike. The higher
the similarity value, the greater the similarity between objects. (Typically, a value of 1
indicates complete similarity, that is, the objects are identical.) A dissimilarity measure
works the opposite way. It returns a value of 0 if the objects are the same (and therefore,
far from being dissimilar). The higher the dissimilarity value, the more dissimilar the
two objects are.
In Section 2.4.1 we present two data structures that are commonly used in the
above types of applications: the data matrix (used to store the data objects) and the
dissimilarity matrix (used to store dissimilarity values for pairs of objects). We also
switch to a different notation for data objects than previously used in this chapter
since now we are dealing with objects described by more than one attribute. We then
discuss how object dissimilarity can be computed for objects described by nominal
attributes (Section 2.4.2), by binary attributes (Section 2.4.3), by numeric attributes
(Section 2.4.4), by ordinal attributes (Section 2.4.5), or by combinations of these
attribute types (Section 2.4.6). Section 2.4.7 provides similarity measures for very long
and sparse data vectors, such as term-frequency vectors representing documents in
information retrieval. Knowing how to compute dissimilarity is useful in studying
attributes and will also be referenced in later topics on clustering (Chapters 10 and 11),
outlier analysis (Chapter 12), and nearest-neighbor classification (Chapter 9).
2.4.1 Data Matrix versus Dissimilarity Matrix
In Section 2.2, we looked at ways of studying the central tendency, dispersion, and spread
of observed values for some attribute X. Our objects there were one-dimensional, that
is, described by a single attribute. In this section, we talk about objects described by multiple attributes. Therefore, we need a change in notation. Suppose that we have n objects (e.g., persons, items, or courses) described by p attributes (also called measurements or features, such as age, height, weight, or gender). The objects are $x_1 = (x_{11}, x_{12}, \ldots, x_{1p})$, $x_2 = (x_{21}, x_{22}, \ldots, x_{2p})$, and so on, where $x_{ij}$ is the value for object $x_i$ of the $j$th attribute. For brevity, we hereafter refer to object $x_i$ as object i. The objects may be tuples in a relational database, and are also referred to as data samples or feature vectors.
Main memory-based clustering and nearest-neighbor algorithms typically operate
on either of the following two data structures:
Data matrix (or object-by-attribute structure): This structure stores the n data objects in the form of a relational table, or n-by-p matrix (n objects × p attributes):

$$\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots & & \vdots & & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix} \tag{2.8}$$
Each row corresponds to an object. As part of our notation, we may use f to index
through the p attributes.
Dissimilarity matrix (or object-by-object structure): This structure stores a collection of proximities that are available for all pairs of n objects. It is often represented by an n-by-n table:

$$\begin{bmatrix}
0 & & & & \\
d(2,1) & 0 & & & \\
d(3,1) & d(3,2) & 0 & & \\
\vdots & \vdots & \vdots & & \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix} \tag{2.9}$$
where d(i, j) is the measured dissimilarity or “difference” between objects i and j. In general, d(i, j) is a non-negative number that is close to 0 when objects i and j are highly similar or “near” each other, and becomes larger the more they differ. Note that d(i, i) = 0; that is, the difference between an object and itself is 0. Furthermore, d(i, j) = d(j, i). (For readability, we do not show the d(j, i) entries; the matrix is symmetric.) Measures of dissimilarity are discussed throughout the remainder of this chapter.
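The structure of Eq. (2.9) can be sketched in code. The helper below is a minimal illustration, not from the text: it fills an n-by-n matrix from any pairwise dissimilarity function d, computing only the lower triangle and mirroring it, since d(i, j) = d(j, i). The absolute-difference measure used in the usage example is just a stand-in for the attribute-specific measures developed in Sections 2.4.2 through 2.4.6.

```python
def dissimilarity_matrix(objects, d):
    """Build the full symmetric n-by-n matrix of d(i, j) values, as in Eq. (2.9)."""
    n = len(objects)
    mat = [[0.0] * n for _ in range(n)]   # diagonal stays 0: d(i, i) = 0
    for i in range(n):
        for j in range(i):                # compute only the lower triangle...
            dij = d(objects[i], objects[j])
            mat[i][j] = dij
            mat[j][i] = dij               # ...and mirror it, since d(i, j) = d(j, i)
    return mat

# Usage with three one-dimensional objects and an illustrative distance:
objects = [1.0, 4.0, 6.0]
m = dissimilarity_matrix(objects, lambda a, b: abs(a - b))
# m[0][0] -> 0.0, m[1][0] -> 3.0, m[0][1] -> 3.0, m[2][1] -> 2.0
```

In practice only the lower triangle needs to be stored; the full matrix is materialized here only to make the symmetry visible.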
Measures of similarity can often be expressed as a function of measures of dissimilarity.
For example, for nominal data,
$$\mathrm{sim}(i, j) = 1 - d(i, j), \tag{2.10}$$

where sim(i, j) is the similarity between objects i and j. Throughout the rest of this chapter, we will also comment on measures of similarity.
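A small sketch of Eq. (2.10): when d(i, j) is scaled to [0, 1], similarity is simply its complement. The match/mismatch dissimilarity below (0 if two nominal values are equal, 1 otherwise) is only an illustrative assumption here, not the text's definition; Section 2.4.2 develops the actual measure for nominal attributes.

```python
def d_nominal(a, b):
    """Illustrative dissimilarity for two nominal values: 0 if equal, 1 otherwise."""
    return 0.0 if a == b else 1.0

def sim(a, b):
    """Similarity via Eq. (2.10): sim(i, j) = 1 - d(i, j)."""
    return 1.0 - d_nominal(a, b)

print(sim("red", "red"))    # identical values  -> 1.0
print(sim("red", "blue"))   # different values  -> 0.0
```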
A data matrix is made up of two entities or “things,” namely rows (for objects)
and columns (for attributes). Therefore, the data matrix is often called a two-mode
matrix. The dissimilarity matrix contains one kind of entity (dissimilarities) and so is
called a one-mode matrix. Many clustering and nearest-neighbor algorithms operate
on a dissimilarity matrix. Data in the form of a data matrix can be transformed into a
dissimilarity matrix before applying such algorithms.
2.4.2 Proximity Measures for Nominal Attributes
A nominal attribute can take on two or more states (Section 2.1.2). For example,
map color is a nominal attribute that may have, say, five states:
red, yellow, green, pink,
and blue.
Let the number of states of a nominal attribute be M. The states can be denoted by
letters, symbols, or a set of integers, such as 1, 2,
..., M. Notice that such integers are
used just for data handling and do not represent any specific ordering.
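As a small illustration of this convention (the particular integer codes chosen below are arbitrary, and the state names are the map-color example from the text):

```python
# Map the M states of a nominal attribute to integers 1, ..., M purely
# for data handling; the codes carry no ordering information.
states = ["red", "yellow", "green", "pink", "blue"]    # M = 5
code = {s: k for k, s in enumerate(states, start=1)}   # e.g., "red" -> 1

# Only equality between codes is meaningful: "blue" (5) is not
# "greater than" "red" (1) in any semantic sense.
print(code["red"])                    # -> 1
print(code["blue"] == code["red"])    # -> False
```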