2.4 Measuring Data Similarity and Dissimilarity
This section presents similarity and dissimilarity measures, which are referred to as
measures of proximity. Similarity and dissimilarity are related. A similarity measure for
two objects, i and j, will typically return the value 0 if the objects are unalike. The higher
the similarity value, the greater the similarity between objects. (Typically, a value of 1
indicates complete similarity, that is, the objects are identical.) A dissimilarity measure
works the opposite way. It returns a value of 0 if the objects are the same (and therefore,
far from being dissimilar). The higher the dissimilarity value, the more dissimilar the
two objects are.
In Section 2.4.1 we present two data structures that are commonly used in the
above types of applications: the data matrix (used to store the data objects) and the
dissimilarity matrix (used to store dissimilarity values for pairs of objects). We also
switch to a different notation for data objects than previously used in this chapter
since now we are dealing with objects described by more than one attribute. We then
discuss how object dissimilarity can be computed for objects described by nominal
attributes (Section 2.4.2), by binary attributes (Section 2.4.3), by numeric attributes
(Section 2.4.4), by ordinal attributes (Section 2.4.5), or by combinations of these
attribute types (Section 2.4.6). Section 2.4.7 provides similarity measures for very long
and sparse data vectors, such as term-frequency vectors representing documents in
information retrieval. Knowing how to compute dissimilarity is useful in studying
attributes and will also be referenced in later topics on clustering (Chapters 10 and 11),
outlier analysis (Chapter 12), and nearest-neighbor classification (Chapter 9).
2.4.1 Data Matrix versus Dissimilarity Matrix
In Section 2.2, we looked at ways of studying the central tendency, dispersion, and spread
of observed values for some attribute X. Our objects there were one-dimensional, that
is, described by a single attribute. In this section, we talk about objects described by multiple attributes. Therefore, we need a change in notation. Suppose that we have n objects (e.g., persons, items, or courses) described by p attributes (also called measurements or features, such as age, height, weight, or gender). The objects are $x_1 = (x_{11}, x_{12}, \ldots, x_{1p})$, $x_2 = (x_{21}, x_{22}, \ldots, x_{2p})$, and so on, where $x_{ij}$ is the value for object $x_i$ of the $j$th attribute. For brevity, we hereafter refer to object $x_i$ as object i. The objects may be tuples in a relational database, and are also referred to as data samples or feature vectors.
Main memory-based clustering and nearest-neighbor algorithms typically operate
on either of the following two data structures:
Data matrix (or object-by-attribute structure): This structure stores the n data objects in the form of a relational table, or n-by-p matrix (n objects × p attributes):

$$\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots & & \vdots & & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix} \tag{2.8}$$
Each row corresponds to an object. As part of our notation, we may use f to index
through the p attributes.
Dissimilarity matrix (or object-by-object structure): This structure stores a collection of proximities that are available for all pairs of n objects. It is often represented by an n-by-n table:

$$\begin{bmatrix}
0 & & & & \\
d(2,1) & 0 & & & \\
d(3,1) & d(3,2) & 0 & & \\
\vdots & \vdots & \vdots & & \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix} \tag{2.9}$$
where d(i, j) is the measured dissimilarity or “difference” between objects i and j. In general, d(i, j) is a non-negative number that is close to 0 when objects i and j are highly similar or “near” each other, and becomes larger the more they differ. Note that d(i, i) = 0; that is, the difference between an object and itself is 0. Furthermore, d(i, j) = d(j, i). (For readability, we do not show the d(j, i) entries; the matrix is symmetric.) Measures of dissimilarity are discussed throughout the remainder of this chapter.
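The structure of Eq. (2.9) can be sketched in code. The helper below is a minimal illustration, not from the text: it fills an n-by-n matrix from any pairwise dissimilarity function d, computing only the lower triangle and mirroring it, since d(i, j) = d(j, i). The absolute-difference measure used in the usage example is just a stand-in for the attribute-specific measures developed in Sections 2.4.2 through 2.4.6.

```python
def dissimilarity_matrix(objects, d):
    """Build the full symmetric n-by-n matrix of d(i, j) values, as in Eq. (2.9)."""
    n = len(objects)
    mat = [[0.0] * n for _ in range(n)]   # diagonal stays 0: d(i, i) = 0
    for i in range(n):
        for j in range(i):                # compute only the lower triangle...
            dij = d(objects[i], objects[j])
            mat[i][j] = dij
            mat[j][i] = dij               # ...and mirror it, since d(i, j) = d(j, i)
    return mat

# Usage with three one-dimensional objects and an illustrative distance:
objects = [1.0, 4.0, 6.0]
m = dissimilarity_matrix(objects, lambda a, b: abs(a - b))
# m[0][0] -> 0.0, m[1][0] -> 3.0, m[0][1] -> 3.0, m[2][1] -> 2.0
```

In practice only the lower triangle needs to be stored; the full matrix is materialized here only to make the symmetry visible.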
Measures of similarity can often be expressed as a function of measures of dissimilarity.
For example, for nominal data,
$$\mathrm{sim}(i, j) = 1 - d(i, j), \tag{2.10}$$

where sim(i, j) is the similarity between objects i and j. Throughout the rest of this chapter, we will also comment on measures of similarity.
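A small sketch of Eq. (2.10): when d(i, j) is scaled to [0, 1], similarity is simply its complement. The match/mismatch dissimilarity below (0 if two nominal values are equal, 1 otherwise) is only an illustrative assumption here, not the text's definition; Section 2.4.2 develops the actual measure for nominal attributes.

```python
def d_nominal(a, b):
    """Illustrative dissimilarity for two nominal values: 0 if equal, 1 otherwise."""
    return 0.0 if a == b else 1.0

def sim(a, b):
    """Similarity via Eq. (2.10): sim(i, j) = 1 - d(i, j)."""
    return 1.0 - d_nominal(a, b)

print(sim("red", "red"))    # identical values  -> 1.0
print(sim("red", "blue"))   # different values  -> 0.0
```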
A data matrix is made up of two entities or “things,” namely rows (for objects)
and columns (for attributes). Therefore, the data matrix is often called a two-mode
matrix. The dissimilarity matrix contains one kind of entity (dissimilarities) and so is
called a one-mode matrix. Many clustering and nearest-neighbor algorithms operate
on a dissimilarity matrix. Data in the form of a data matrix can be transformed into a
dissimilarity matrix before applying such algorithms.
2.4.2 Proximity Measures for Nominal Attributes
A nominal attribute can take on two or more states (Section 2.1.2). For example,
map color is a nominal attribute that may have, say, five states:
red, yellow, green, pink,
and blue.
Let the number of states of a nominal attribute be M. The states can be denoted by
letters, symbols, or a set of integers, such as 1, 2,
..., M. Notice that such integers are
used just for data handling and do not represent any specific ordering.
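As a small illustration of this convention (the particular integer codes chosen below are arbitrary, and the state names are the map-color example from the text):

```python
# Map the M states of a nominal attribute to integers 1, ..., M purely
# for data handling; the codes carry no ordering information.
states = ["red", "yellow", "green", "pink", "blue"]    # M = 5
code = {s: k for k, s in enumerate(states, start=1)}   # e.g., "red" -> 1

# Only equality between codes is meaningful: "blue" (5) is not
# "greater than" "red" (1) in any semantic sense.
print(code["red"])                    # -> 1
print(code["blue"] == code["red"])    # -> False
```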