Data Mining. Concepts and Techniques, 3rd Edition

HAN 09-ch02-039-082-9780123814791

Yüklə 7,95 Mb.

Pdf görüntüsü

səhifə	45/343
tarix	08.10.2017
ölçüsü	7,95 Mb.
	#3817

1 ... 41 42 43 44 45 46 47 48 ... 343

Figure 2.23
Example 2.20 Supremum distance.
Proximity Measures for Ordinal Attributes

HAN

09-ch02-039-082-9780123814791

2011/6/1

3:15

Page 73

#35

2.4 Measuring Data Similarity and Dissimilarity

73

Symmetry: d

(i, j) = d(j, i): Distance is a symmetric function.

Triangle inequality: d

(i, j) ≤ d(i, k) + d(k, j): Going directly from object i to object j

in space is no more than making a detour over any other object k.

A measure that satisﬁes these conditions is known as metric. Please note that the

non-negativity property is implied by the other three properties.

Example 2.19

Euclidean distance and Manhattan distance. Let x

1

= (1, 2) and x

= (3, 5) repre-

sent two objects as shown in Figure 2.23. The Euclidean distance between the two is

√

+ 3

= 3.61. The Manhattan distance between the two is 2 + 3 = 5.

Minkowski distance is a generalization of the Euclidean and Manhattan distances.

It is deﬁned as

(i, j) =

− x

|

h

+ |x

i2

− x

|

h

+ · · · + |x

ip

− x

|

h

(2.18)

where h is a real number such that h ≥ 1. (Such a distance is also called L

p

norm in

some literature, where the symbol p refers to our notation of h. We have kept p as

the number of attributes to be consistent with the rest of this chapter.) It represents

the Manhattan distance when h = 1 (i.e., L

norm) and Euclidean distance when h = 2

(i.e., L

norm).

The supremum distance (also referred to as L

max

, L

∞

norm and as the Chebyshev

distance) is a generalization of the Minkowski distance for h → ∞. To compute it, we

ﬁnd the attribute f that gives the maximum difference in values between the two objects.

This difference is the supremum distance, deﬁned more formally as:

d

(i, j) = lim

h→∞





p

f =1

− x

|

h





1

=

p

max

f

− x

(2.19)

The L

∞

norm is also known as the uniform norm.

1

x

= (3, 5)

= (1, 2)

Euclidean distance

= (2

+ 3

)

1/2

= 3.61

Manhattan distance

= 2 + 3 = 5

Supremum distance

= 5 – 2 = 3

Figure 2.23

Euclidean, Manhattan, and supremum distances between two objects.

HAN

09-ch02-039-082-9780123814791

2011/6/1

3:15

Page 74

#36

74

Chapter 2 Getting to Know Your Data

Example 2.20

Supremum distance. Let’s use the same two objects, x

1

= (1, 2) and x

= (3, 5), as in

Figure 2.23. The second attribute gives the greatest difference between values for the

objects, which is 5 − 2 = 3. This is the supremum distance between both objects.

If each attribute is assigned a weight according to its perceived importance, the

weighted Euclidean distance can be computed as

d

(i, j) = w

− x

+ w

− x

+ · · · + w

m

− x

(2.20)

Weighting can also be applied to other distance measures as well.

2.4.5

Proximity Measures for Ordinal Attributes

The values of an ordinal attribute have a meaningful order or ranking about them,

yet the magnitude between successive values is unknown (Section 2.1.4). An exam-

ple includes the sequence small, medium, large for a size attribute. Ordinal attributes

may also be obtained from the discretization of numeric attributes by splitting the value

range into a ﬁnite number of categories. These categories are organized into ranks. That

is, the range of a numeric attribute can be mapped to an ordinal attribute f having M

f

states. For example, the range of the interval-scaled attribute temperature (in Celsius)

can be organized into the following states: −30 to −10, −10 to 10, 10 to 30, repre-

senting the categories cold temperature, moderate temperature, and warm temperature,

respectively. Let M represent the number of possible states that an ordinal attribute can

have. These ordered states deﬁne the ranking 1,

..., M

f

.

“How are ordinal attributes handled?” The treatment of ordinal attributes is

quite similar to that of numeric attributes when computing dissimilarity between

objects. Suppose that f is an attribute from a set of ordinal attributes describing

n objects. The dissimilarity computation with respect to f involves the following

steps:

The value of f for the ith object is x

, and f has M

ordered states, representing the

ranking 1,

..., M

. Replace each x

by its corresponding rank, r

∈ {1, . . . , M

}.

2.

Since each ordinal attribute can have a different number of states, it is often

necessary to map the range of each attribute onto [0.0, 1.0] so that each attribute

has equal weight. We perform such data normalization by replacing the rank r

if

of the ith object in the f th attribute by

z

if

=

r

− 1

M

f

− 1

(2.21)

Dissimilarity can then be computed using any of the distance measures described

in Section 2.4.4 for numeric attributes, using z

if

to represent the f value for the ith

object.

Yüklə 7,95 Mb.

Dostları ilə paylaş:

1 ... 41 42 43 44 45 46 47 48 ... 343