HAN
09-ch02-039-082-9780123814791
2011/6/1
3:15
Page 77
#39
2.4 Measuring Data Similarity and Dissimilarity
data described by the three attributes of mixed types is:
\[
\begin{bmatrix}
0    &      &      &   \\
0.85 & 0    &      &   \\
0.65 & 0.83 & 0    &   \\
0.13 & 0.71 & 0.79 & 0
\end{bmatrix}.
\]
From Table 2.2, we can intuitively guess that objects 1 and 4 are the most similar, based
on their values for test-1 and test-2. This is confirmed by the dissimilarity matrix, where
$d(4, 1)$ is the lowest value for any pair of different objects. Similarly, the matrix
indicates that objects 1 and 2 are the least similar.
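Reading the extreme entries off a dissimilarity matrix is easy to automate. The following is a minimal Python sketch, assuming the lower-triangular values given above are stored as a list of rows. The names `d` and `extreme_pairs` are illustrative and do not come from the text.

```python
# Lower triangle of the dissimilarity matrix for objects 1..4 (values from the text).
d = [
    [0],
    [0.85, 0],
    [0.65, 0.83, 0],
    [0.13, 0.71, 0.79, 0],
]


def extreme_pairs(matrix):
    """Return ((i, j), value) for the smallest and largest off-diagonal entries.

    Object identifiers are 1-based, matching the d(i, j) notation in the text.
    """
    entries = [
        ((i + 1, j + 1), matrix[i][j])
        for i in range(len(matrix))
        for j in range(i)  # j < i skips the zero diagonal
    ]
    most_similar = min(entries, key=lambda e: e[1])
    least_similar = max(entries, key=lambda e: e[1])
    return most_similar, least_similar


most_similar, least_similar = extreme_pairs(d)
print(most_similar)   # ((4, 1), 0.13)
print(least_similar)  # ((2, 1), 0.85)
```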
2.4.7 Cosine Similarity
A document can be represented by thousands of attributes, each recording the frequency
of a particular word (such as a keyword) or phrase in the document. Thus, each document
is an object represented by what is called a term-frequency vector. For example, in
Table 2.5, we see that Document1 contains five instances of the word team, while hockey
occurs three times. The word coach is absent from the entire document, as indicated by
a count value of 0. Such data can be highly asymmetric.
Term-frequency vectors are typically very long and sparse (i.e., they have many 0 values).
Applications using such structures include information retrieval, text document
clustering, biological taxonomy, and gene feature mapping. The traditional distance
measures that we have studied in this chapter do not work well for such sparse numeric
data. For example, two term-frequency vectors may have many 0 values in common,
meaning that the corresponding documents do not share many words, but this does not
make them similar. We need a measure that will focus on the words that the two documents
do have in common, and the occurrence frequency of such words. In other words,
we need a measure for numeric data that ignores zero-matches.
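The zero-match problem can be made concrete with a small Python sketch. The vocabulary size and the two documents below are hypothetical values chosen for illustration, and `sparse_vector` is an illustrative helper, not something defined in the text.

```python
VOCAB_SIZE = 1000  # hypothetical vocabulary size


def sparse_vector(counts, size=VOCAB_SIZE):
    """Build a dense term-frequency list from a {term_index: count} mapping."""
    vec = [0] * size
    for index, count in counts.items():
        vec[index] = count
    return vec


# Two hypothetical documents that use completely different words.
doc_a = sparse_vector({0: 2, 1: 1, 2: 3})
doc_b = sparse_vector({3: 1, 4: 2, 5: 1})

# Positions where both counts are 0 (zero-matches).
zero_matches = sum(1 for a, b in zip(doc_a, doc_b) if a == 0 and b == 0)
# Positions where both documents actually contain the word.
shared_terms = sum(1 for a, b in zip(doc_a, doc_b) if a > 0 and b > 0)

print(zero_matches)               # 994
print(zero_matches / VOCAB_SIZE)  # 0.994
print(shared_terms)               # 0
```

A measure that counted zero-matches as agreement would score these two documents as 99.4% alike, even though they have no word in common.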
Table 2.5 Document Vector or Term-Frequency Vector
Document   team  coach  hockey  baseball  soccer  penalty  score  win  loss  season
Document1     5      0       3         0       2        0      0    2     0       0
Document2     3      0       2         0       1        1      0    1     0       1
Document3     0      7       0         2       1        0      0    3     0       0
Document4     0      1       0         0       1        2      2    0     3       0
Cosine similarity is a measure of similarity that can be used to compare documents
or, say, give a ranking of documents with respect to a given vector of query
words. Let $\mathbf{x}$ and $\mathbf{y}$ be two vectors for comparison. Using the cosine
measure as a similarity function, we have
\[
\mathrm{sim}(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \, \|\mathbf{y}\|},
\tag{2.23}
\]
where $\|\mathbf{x}\|$ is the Euclidean norm of vector $\mathbf{x} = (x_1, x_2, \ldots, x_p)$,
defined as $\sqrt{x_1^2 + x_2^2 + \cdots + x_p^2}$. Conceptually, it is the length of the
vector. Similarly, $\|\mathbf{y}\|$ is the Euclidean norm of vector $\mathbf{y}$. The measure
computes the cosine of the angle between vectors $\mathbf{x}$ and $\mathbf{y}$. A cosine value
of 0 means that the two vectors are at 90 degrees to each other (orthogonal) and have no
match. The closer the cosine value to 1, the smaller the angle and the greater the match
between vectors. Note that because the cosine similarity measure does not obey all of the
properties of Section 2.4.4 defining metric measures, it is referred to as a nonmetric measure.
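As a rough illustration of Eq. (2.23) and of the nonmetric remark, the Python sketch below computes the cosine for three small vectors. It assumes, purely for illustration, that 1 - sim(x, y) is used as a dissimilarity; the text does not define such a distance, and the names `cosine_similarity` and `cosine_distance` are hypothetical. Under that assumption the triangle inequality, one of the standard metric properties, fails.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between x and y, following Eq. (2.23)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


def cosine_distance(x, y):
    # Assumption for illustration only: treat 1 - sim(x, y) as a dissimilarity.
    return 1.0 - cosine_similarity(x, y)


x, y, z = (1, 0), (1, 1), (0, 1)

print(cosine_similarity(x, z))            # 0.0: orthogonal vectors, no match
print(round(cosine_similarity(x, y), 4))  # 0.7071: vectors 45 degrees apart

direct = cosine_distance(x, z)
detour = cosine_distance(x, y) + cosine_distance(y, z)
print(direct > detour)  # True: going through y is "shorter" than going directly
```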
Example 2.23 Cosine similarity between two term-frequency vectors. Suppose that $\mathbf{x}$
and $\mathbf{y}$ are the first two term-frequency vectors in Table 2.5. That is,
$\mathbf{x} = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)$ and $\mathbf{y} = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)$.
How similar are $\mathbf{x}$ and $\mathbf{y}$? Using Eq. (2.23) to compute the cosine
similarity between the two vectors, we get:
\[
\begin{aligned}
\mathbf{x}^t \cdot \mathbf{y} &= 5 \times 3 + 0 \times 0 + 3 \times 2 + 0 \times 0 + 2 \times 1 + 0 \times 1 + 0 \times 0 + 2 \times 1 + 0 \times 0 + 0 \times 1 = 25 \\
\|\mathbf{x}\| &= \sqrt{5^2 + 0^2 + 3^2 + 0^2 + 2^2 + 0^2 + 0^2 + 2^2 + 0^2 + 0^2} = 6.48 \\
\|\mathbf{y}\| &= \sqrt{3^2 + 0^2 + 2^2 + 0^2 + 1^2 + 1^2 + 0^2 + 1^2 + 0^2 + 1^2} = 4.12 \\
\mathrm{sim}(\mathbf{x}, \mathbf{y}) &= \frac{25}{6.48 \times 4.12} = 0.94
\end{aligned}
\]
Therefore, if we were using the cosine similarity measure to compare these documents,
they would be considered quite similar.
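The arithmetic of Example 2.23 can be checked with a short Python sketch. The vectors are those of Table 2.5. The helper names (`dot`, `norm`, `cosine_similarity`) and the query vector are illustrative assumptions, not part of the text; the query simply shows one possible way to rank documents against a vector of query words.

```python
import math

TERMS = ["team", "coach", "hockey", "baseball", "soccer",
         "penalty", "score", "win", "loss", "season"]

# Term-frequency vectors from Table 2.5.
DOCUMENTS = {
    "Document1": [5, 0, 3, 0, 2, 0, 0, 2, 0, 0],
    "Document2": [3, 0, 2, 0, 1, 1, 0, 1, 0, 1],
    "Document3": [0, 7, 0, 2, 1, 0, 0, 3, 0, 0],
    "Document4": [0, 1, 0, 0, 1, 2, 2, 0, 3, 0],
}


def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def norm(x):
    """Euclidean norm of a vector."""
    return math.sqrt(dot(x, x))


def cosine_similarity(x, y):
    """Eq. (2.23); assumes neither vector is all zeros."""
    return dot(x, y) / (norm(x) * norm(y))


x, y = DOCUMENTS["Document1"], DOCUMENTS["Document2"]
print(dot(x, y))                          # 25
print(round(norm(x), 2))                  # 6.48
print(round(norm(y), 2))                  # 4.12
print(round(cosine_similarity(x, y), 2))  # 0.94

# Hypothetical query containing the words "team" and "hockey".
query = [1 if term in ("team", "hockey") else 0 for term in TERMS]
ranking = sorted(DOCUMENTS,
                 key=lambda name: cosine_similarity(DOCUMENTS[name], query),
                 reverse=True)
print(ranking)  # ['Document1', 'Document2', 'Document3', 'Document4']
```

Document3 and Document4 contain neither query word, so both score 0 and keep their original order in the ranking.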
When attributes are binary-valued, the cosine similarity function can be interpreted
in terms of shared features or attributes. Suppose an object $\mathbf{x}$ possesses the
$i$th attribute if $x_i = 1$. Then $\mathbf{x}^t \cdot \mathbf{y}$ is the number of
attributes possessed (i.e., shared) by both $\mathbf{x}$ and $\mathbf{y}$, and
$\|\mathbf{x}\| \, \|\mathbf{y}\|$ is the geometric mean of the number of attributes
possessed by $\mathbf{x}$ and the number possessed by $\mathbf{y}$. Thus,
$\mathrm{sim}(\mathbf{x}, \mathbf{y})$ is a measure of relative possession of common
attributes.
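The binary interpretation can be verified numerically. The sketch below uses two hypothetical 0/1 vectors, chosen so that the geometric mean comes out as a whole number; all variable names are illustrative.

```python
import math

# Hypothetical binary vectors: a 1 in position i means the object has attribute i.
x = [1, 1, 0, 0, 0, 0, 0, 0, 0]
y = [1, 1, 1, 1, 1, 1, 1, 1, 0]

shared = sum(a * b for a, b in zip(x, y))  # dot product = attributes held by both
count_x = sum(x)                           # equals x . x for 0/1 values
count_y = sum(y)                           # equals y . y for 0/1 values

geometric_mean = math.sqrt(count_x * count_y)
norm_product = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))

print(shared, count_x, count_y)                    # 2 2 8
print(geometric_mean)                              # 4.0
print(math.isclose(norm_product, geometric_mean))  # True
print(round(shared / norm_product, 2))             # 0.5
```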
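A simple variation of cosine similarity for the preceding scenario is
\[
\mathrm{sim}(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\mathbf{x} \cdot \mathbf{x} + \mathbf{y} \cdot \mathbf{y} - \mathbf{x} \cdot \mathbf{y}},
\tag{2.24}
\]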
which is the ratio of the number of attributes shared by $\mathbf{x}$ and $\mathbf{y}$ to the
number of attributes possessed by $\mathbf{x}$ or $\mathbf{y}$. This function, known as the
Tanimoto coefficient or Tanimoto distance, is frequently used in information retrieval and
biological taxonomy.
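A minimal Python sketch of Eq. (2.24) follows. The function name `tanimoto` and the binary vectors are illustrative assumptions. The last two vectors are Document1 and Document2 from Table 2.5, included only to show that the same formula can also be evaluated on count data.

```python
def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def tanimoto(x, y):
    """Eq. (2.24): x.y / (x.x + y.y - x.y); assumes x and y are not both all zeros."""
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)


# Hypothetical binary vectors: 2 attributes are shared, 8 are possessed by x or y.
x = [1, 1, 0, 0, 0, 0, 0, 0, 0]
y = [1, 1, 1, 1, 1, 1, 1, 1, 0]
print(tanimoto(x, y))  # 0.25

# Document1 and Document2 from Table 2.5: 25 / (42 + 17 - 25).
doc1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
doc2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]
print(round(tanimoto(doc1, doc2), 2))  # 0.74
```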