HAN
09-ch02-039-082-9780123814791
2011/6/1
3:15
Page 75
#37
2.4 Measuring Data Similarity and Dissimilarity
75
Example 2.21
Dissimilarity between ordinal attributes. Suppose that we have the sample data shown
earlier in Table 2.2, except that this time only the object-identifier and the continuous
ordinal attribute, test-2, are available. There are three states for test-2: fair, good, and
excellent, that is, $M_f = 3$. For step 1, if we replace each value for test-2 by its rank, the
four objects are assigned the ranks 3, 1, 2, and 3, respectively. Step 2 normalizes the
ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0. For step 3, we can
use, say, the Euclidean distance (Eq. 2.16), which results in the following dissimilarity
matrix:
\[
\begin{bmatrix}
0 & & & \\
1.0 & 0 & & \\
0.5 & 0.5 & 0 & \\
0 & 1.0 & 0.5 & 0
\end{bmatrix}.
\]
Therefore, objects 1 and 2 are the most dissimilar, as are objects 2 and 4 (i.e., $d(2,1) = 1.0$ and $d(4,2) = 1.0$). This makes intuitive sense since objects 1 and 4 are both excellent. Object 2 is fair, which is at the opposite end of the range of values for test-2.
Similarity values for ordinal attributes can be interpreted from dissimilarity as $\mathit{sim}(i,j) = 1 - d(i,j)$.
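The three steps of Example 2.21 can be traced in a few lines of Python. This is a minimal sketch rather than a reference implementation: the function name `ordinal_dissimilarity` and its parameters are hypothetical, and the four test-2 values are inferred from the ranks 3, 1, 2, and 3 given above. For a single attribute, the Euclidean distance reduces to the absolute difference of the normalized ranks.

```python
def ordinal_dissimilarity(values, ordered_states):
    """Dissimilarity matrix for one ordinal attribute (hypothetical helper).

    ordered_states lists the M_f states from lowest to highest and must
    contain at least two states.
    """
    # Step 1: replace each value by its rank r in 1..M_f.
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    m_f = len(ordered_states)
    # Step 2: map each rank onto [0.0, 1.0] with z = (r - 1) / (M_f - 1).
    z = [(rank[v] - 1) / (m_f - 1) for v in values]
    # Step 3: Euclidean distance on a single attribute is |z_i - z_j|.
    n = len(z)
    return [[abs(z[i] - z[j]) for j in range(n)] for i in range(n)]


# test-2 values for objects 1 to 4, inferred from the ranks 3, 1, 2, 3.
test2 = ["excellent", "fair", "good", "excellent"]
d = ordinal_dissimilarity(test2, ["fair", "good", "excellent"])

# Similarity follows from dissimilarity as sim(i, j) = 1 - d(i, j).
sim = [[1.0 - d_ij for d_ij in row] for row in d]
```

The function returns the full symmetric matrix, so `d[1][0]` corresponds to $d(2,1)$ and `d[3][1]` to $d(4,2)$ in the lower-triangular matrix shown above.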
2.4.6 Dissimilarity for Attributes of Mixed Types
Sections 2.4.2 through 2.4.5 discussed how to compute the dissimilarity between objects
described by attributes of the same type, where these types may be either nominal, symmetric
binary, asymmetric binary, numeric, or ordinal. However, in many real databases,
objects are described by a mixture of attribute types. In general, a database can contain
all of these attribute types.
“So, how can we compute the dissimilarity between objects of mixed attribute types?”
One approach is to group each type of attribute together, performing separate data
mining (e.g., clustering) analysis for each type. This is feasible if these analyses derive
compatible results. However, in real applications, it is unlikely that a separate analysis
per attribute type will generate compatible results.
A preferable approach is to process all attribute types together, performing a
single analysis. One such technique combines the different attributes into a single
dissimilarity matrix, bringing all of the meaningful attributes onto a common scale of the
interval [0.0, 1.0].
Suppose that the data set contains p attributes of mixed type. The dissimilarity $d(i,j)$ between objects i and j is defined as
\[
d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}, \tag{2.22}
\]
where the indicator $\delta_{ij}^{(f)} = 0$ if either (1) $x_{if}$ or $x_{jf}$ is missing (i.e., there is no measurement of attribute f for object i or object j), or (2) $x_{if} = x_{jf} = 0$ and attribute f is asymmetric binary; otherwise, $\delta_{ij}^{(f)} = 1$. The contribution of attribute f to the dissimilarity between i and j (i.e., $d_{ij}^{(f)}$) is computed dependent on its type:
If f is numeric: $d_{ij}^{(f)} = \frac{|x_{if} - x_{jf}|}{\max_h x_{hf} - \min_h x_{hf}}$, where h runs over all nonmissing objects for attribute f.
If f is nominal or binary: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$; otherwise, $d_{ij}^{(f)} = 1$.
If f is ordinal: compute the ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, and treat $z_{if}$ as numeric.
These steps are identical to what we have already seen for each of the individual
attribute types. The only difference is for numeric attributes, where we normalize so
that the values map to the interval [0.0, 1.0]. Thus, the dissimilarity between objects
can be computed even when the attributes describing the objects are of different
types.
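Eq. (2.22) and the per-type rules can be sketched in Python as follows. The function `mixed_dissimilarity`, the type labels, and the sample objects are all hypothetical and serve only as illustration. The sketch assumes that missing values are encoded as `None`, that ordinal attributes have already been converted to $z_{if}$ and are passed as numeric, and that each numeric range $\max_h x_{hf} - \min_h x_{hf}$ is precomputed and nonzero. Returning `None` when every indicator is 0 is also an assumption, since Eq. (2.22) is undefined in that case.

```python
def mixed_dissimilarity(xi, xj, kinds, spans):
    """Dissimilarity d(i, j) of Eq. (2.22) for two objects (illustrative sketch).

    kinds[f] is one of "numeric", "nominal", "binary", "asymmetric_binary".
    spans[f] is max_h x_hf - min_h x_hf for numeric attributes; it is
    ignored for the other types.
    """
    numerator, denominator = 0.0, 0
    for a, b, kind, span in zip(xi, xj, kinds, spans):
        # Indicator is 0 when either measurement is missing.
        if a is None or b is None:
            continue
        # Indicator is 0 for a 0-0 match on an asymmetric binary attribute.
        if kind == "asymmetric_binary" and a == 0 and b == 0:
            continue
        if kind == "numeric":
            d_f = abs(a - b) / span          # normalized to [0.0, 1.0]
        else:
            d_f = 0.0 if a == b else 1.0     # nominal or binary
        numerator += d_f
        denominator += 1                     # indicator is 1 here
    # Assumption: report None when no attribute contributes.
    return numerator / denominator if denominator else None


# Hypothetical objects: (nominal, numeric, asymmetric binary).
kinds = ["nominal", "numeric", "asymmetric_binary"]
spans = [None, 42.0, None]

d_a = mixed_dissimilarity(("A", 45, 0), ("B", 22, 0), kinds, spans)
d_b = mixed_dissimilarity(("A", None, 1), ("A", 22, 0), kinds, spans)
d_c = mixed_dissimilarity((None, None, 0), ("A", 22, 0), kinds, spans)
```

In the first call the asymmetric binary attribute is a 0-0 match, so only two attributes contribute and `d_a` is $(1 + 23/42)/2$. In the second call the numeric value is missing, which leaves $(0 + 1)/2 = 0.5$. In the third call no attribute contributes, so the sketch returns `None`.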
Example 2.22
Dissimilarity between attributes of mixed type. Let’s compute a dissimilarity matrix
for the objects in Table 2.2. Now we will consider all of the attributes, which are of
different types. In Examples 2.17 and 2.21, we worked out the dissimilarity matrices
for each of the individual attributes. The procedures we followed for test-1 (which is
nominal) and test-2 (which is ordinal) are the same as outlined earlier for processing
attributes of mixed types. Therefore, we can use the dissimilarity matrices obtained for
test-1 and test-2 later when we compute Eq. (2.22). First, however, we need to compute the dissimilarity matrix for the third attribute, test-3 (which is numeric). That is, we must compute $d_{ij}^{(3)}$. Following the case for numeric attributes, we let $\max_h x_h = 64$ and $\min_h x_h = 22$. The difference between the two is used in Eq. (2.22) to normalize the values of the dissimilarity matrix. The resulting dissimilarity matrix for test-3 is
\[
\begin{bmatrix}
0 & & & \\
0.55 & 0 & & \\
0.45 & 1.00 & 0 & \\
0.40 & 0.14 & 0.86 & 0
\end{bmatrix}.
\]
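The normalization for test-3 can be checked with a short Python sketch. The four test-3 values below are an assumption: Table 2.2 is not reproduced in this section, and the values were chosen to agree with the stated maximum (64), the stated minimum (22), and the matrix above. The variable names are hypothetical.

```python
# Assumed test-3 values for objects 1 to 4 (Table 2.2 is not shown here).
test3 = [45, 22, 64, 28]

# Range used for normalization: max_h x_h - min_h x_h = 64 - 22 = 42.
span = max(test3) - min(test3)

# Normalized absolute differences, rounded to two decimals as in the text.
d3 = [[round(abs(a - b) / span, 2) for b in test3] for a in test3]
```

Here `d3[1][0]` corresponds to $d_{21}^{(3)}$, `d3[2][1]` to $d_{32}^{(3)}$, and so on down the lower triangle.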
We can now use the dissimilarity matrices for the three attributes in our computation of Eq. (2.22). The indicator $\delta_{ij}^{(f)} = 1$ for each of the three attributes, f. We get, for example, $d(3,1) = \frac{1(1) + 1(0.50) + 1(0.45)}{3} = 0.65$. The resulting dissimilarity matrix obtained for the