HAN
09-ch02-039-082-9780123814791
2011/6/1
3:15
Page 73
#35
2.4 Measuring Data Similarity and Dissimilarity
73
Symmetry: d
(i, j) = d(j, i): Distance is a symmetric function.
Triangle inequality: d
(i, j) ≤ d(i, k) + d(k, j): Going directly from object i to object j
in space is no more than making a detour over any other object k.
A measure that satisfies these conditions is known as metric. Please note that the
non-negativity property is implied by the other three properties.
Example 2.19
Euclidean distance and Manhattan distance. Let x
1
= (1, 2) and x
2
= (3, 5) repre-
sent two objects as shown in Figure 2.23. The Euclidean distance between the two is
√
2
2
+ 3
2
= 3.61. The Manhattan distance between the two is 2 + 3 = 5.
Minkowski distance is a generalization of the Euclidean and Manhattan distances.
It is defined as
d
(i, j) =
h
|x
i1
− x
j1
|
h
+ |x
i2
− x
j2
|
h
+ · · · + |x
ip
− x
jp
|
h
,
(2.18)
where
h is a real number such that
h ≥ 1. (Such a distance is also called
L
p
norm in
some literature, where the symbol p refers to our notation of h. We have kept p as
the number of attributes to be consistent with the rest of this chapter.) It represents
the Manhattan distance when h = 1 (i.e., L
1
norm) and Euclidean distance when h = 2
(i.e.,
L
2
norm).
The
supremum distance (also referred to as
L
max
, L
∞
norm and as the Chebyshev
distance) is a generalization of the Minkowski distance for h → ∞. To compute it, we
find the attribute f that gives the maximum difference in values between the two objects.
This difference is the supremum distance, defined more formally as:
d
(i, j) = lim
h→∞
p
f =1
|x
if
− x
jf
|
h
1
h
=
p
max
f
|x
if
− x
jf
|.
(2.19)
The
L
∞
norm is also known as the uniform norm.
1
2
2
3
5
4
3
3
2
1
x
2
= (3, 5)
x
1
= (1, 2)
Euclidean
distance
= (2
2
+ 3
2
)
1/2
= 3.61
Manhattan distance
= 2 + 3 = 5
Supremum distance
= 5 – 2 = 3
Figure 2.23
Euclidean, Manhattan, and supremum distances between two objects.
HAN
09-ch02-039-082-9780123814791
2011/6/1
3:15
Page 74
#36
74
Chapter 2 Getting to Know Your Data
Example 2.20
Supremum distance. Let’s use the same two objects,
x
1
= (1, 2) and x
2
= (3, 5), as in
Figure 2.23. The second attribute gives the greatest difference between values for the
objects, which is 5 − 2 = 3. This is the supremum distance between both objects.
If each attribute is assigned a weight according to its perceived importance, the
weighted Euclidean distance can be computed as
d
(i, j) = w
1
|x
i1
− x
j1
|
2
+
w
2
|x
i2
− x
j2
|
2
+ · · · +
w
m
|x
ip
− x
jp
|
2
.
(2.20)
Weighting can also be applied to other distance measures as well.
2.4.5
Proximity Measures for Ordinal Attributes
The values of an ordinal attribute have a meaningful order or ranking about them,
yet the magnitude between successive values is unknown (Section 2.1.4). An exam-
ple includes the sequence small, medium, large for a size attribute. Ordinal attributes
may also be obtained from the discretization of numeric attributes by splitting the value
range into a finite number of categories. These categories are organized into ranks. That
is, the range of a numeric attribute can be mapped to an ordinal attribute f having M
f
states. For example, the range of the interval-scaled attribute temperature (in Celsius)
can be organized into the following states: −30 to −10, −10 to 10, 10 to 30, repre-
senting the categories cold temperature, moderate temperature, and warm temperature,
respectively. Let M represent the number of possible states that an ordinal attribute can
have. These ordered states define the ranking 1,
..., M
f
.
“How are ordinal attributes handled?” The treatment of ordinal attributes is
quite similar to that of numeric attributes when computing dissimilarity between
objects. Suppose that f is an attribute from a set of ordinal attributes describing
n objects. The dissimilarity computation with respect to
f involves the following
steps:
1.
The value of f for the ith object is x
if
, and f has M
f
ordered states, representing the
ranking 1,
..., M
f
. Replace each x
if
by its corresponding rank, r
if
∈ {1, . . . , M
f
}.
2.
Since each ordinal attribute can have a different number of states, it is often
necessary to map the range of each attribute onto [0.0, 1.0] so that each attribute
has equal weight. We perform such data normalization by replacing the rank r
if
of the ith object in the f th attribute by
z
if
=
r
if
− 1
M
f
− 1
.
(2.21)
3.
Dissimilarity can then be computed using any of the distance measures described
in Section 2.4.4 for numeric attributes, using z
if
to represent the f value for the ith
object.