Data Mining. Concepts and Techniques, 3rd Edition

HAN 09-ch02-039-082-9780123814791




2.4 Measuring Data Similarity and Dissimilarity

data described by the three attributes of mixed types is:

$$
\begin{bmatrix}
0 & & & \\
0.85 & 0 & & \\
0.65 & 0.83 & 0 & \\
0.13 & 0.71 & 0.79 & 0
\end{bmatrix}
$$

From Table 2.2, we can intuitively guess that objects 1 and 4 are the most similar, based on their values for test-1 and test-2. This is confirmed by the dissimilarity matrix, where d(4, 1) is the lowest value for any pair of different objects. Similarly, the matrix indicates that objects 1 and 2 are the least similar.
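The reading of the matrix above can be checked mechanically. The following is a minimal sketch, not the book's procedure: it stores the six off-diagonal values quoted in the text as a lower-triangular list (the zero diagonal and the names `dissimilarity`, `pairs`, `most_similar`, and `least_similar` are assumptions made for illustration) and looks up the smallest and largest entries.

```python
# Lower-triangular dissimilarity matrix with the values given in the text.
# Objects are numbered 1..4; the diagonal is assumed to be 0 because an
# object has no dissimilarity to itself.
dissimilarity = [
    [0.0],
    [0.85, 0.0],
    [0.65, 0.83, 0.0],
    [0.13, 0.71, 0.79, 0.0],
]

# Collect d(i, j) for every pair of different objects (i > j),
# using 1-based object numbers as in the text.
pairs = {
    (i + 1, j + 1): row[j]
    for i, row in enumerate(dissimilarity)
    for j in range(i)
}

most_similar = min(pairs, key=pairs.get)    # lowest dissimilarity
least_similar = max(pairs, key=pairs.get)   # highest dissimilarity

print("most similar pair:", most_similar, pairs[most_similar])
print("least similar pair:", least_similar, pairs[least_similar])
```

Under these assumptions the lookup selects d(4, 1) = 0.13 as the smallest entry and d(2, 1) = 0.85 as the largest, matching the observation in the text.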

2.4.7 Cosine Similarity

A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document. Thus, each document is an object represented by what is called a term-frequency vector. For example, in Table 2.5, we see that Document1 contains five instances of the word team, while hockey occurs three times. The word coach is absent from the entire document, as indicated by a count value of 0. Such data can be highly asymmetric.

Term-frequency vectors are typically very long and sparse (i.e., they have many 0 values). Applications using such structures include information retrieval, text document clustering, biological taxonomy, and gene feature mapping. The traditional distance measures that we have studied in this chapter do not work well for such sparse numeric data. For example, two term-frequency vectors may have many 0 values in common, meaning that the corresponding documents do not share many words, but this does not make them similar. We need a measure that will focus on the words that the two documents do have in common, and the occurrence frequency of such words. In other words, we need a measure for numeric data that ignores zero-matches.
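To make the zero-match problem concrete, here is a small sketch using Euclidean distance on sparse count vectors. The vocabulary size, the helper `make_vector`, and the four toy documents are all hypothetical and chosen only for illustration; they do not come from the book.

```python
import math

VOCAB_SIZE = 1000  # hypothetical vocabulary size


def make_vector(counts):
    """Build a term-frequency vector from a {term_index: count} mapping."""
    vec = [0] * VOCAB_SIZE
    for index, count in counts.items():
        vec[index] = count
    return vec


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


# Two short documents that have no word in common.
doc_a = make_vector({0: 1})
doc_b = make_vector({1: 1})

# Two documents that use the same three words in the same proportions;
# doc_d is simply ten times longer than doc_c.
doc_c = make_vector({10: 5, 11: 3, 12: 2})
doc_d = make_vector({10: 50, 11: 30, 12: 20})

dist_disjoint = euclidean_distance(doc_a, doc_b)
dist_same_topic = euclidean_distance(doc_c, doc_d)

print("no shared words:   ", round(dist_disjoint, 2))
print("same words, longer:", round(dist_same_topic, 2))
```

In this toy setup the two documents with no shared words end up much closer than the two that use identical vocabulary, because the 998 positions where both are zero contribute nothing to the distance while the difference in document length dominates it.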

Table 2.5 Document Vector or Term-Frequency Vector

Document    team  coach  hockey  baseball  soccer  penalty  score  win  loss  season
Document1   5     0      3       0         2       0        0      2    0     0
Document2   3     0      2       0         1       1        0      1    0     1
Document3   …
Document4   …

Cosine similarity is a measure of similarity that can be used to compare documents or, say, give a ranking of documents with respect to a given vector of query words. Let x and y be two vectors for comparison. Using the cosine measure as a similarity function, we have

$$
\mathrm{sim}(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|} \qquad (2.23)
$$

where $\|x\|$ is the Euclidean norm of vector $x = (x_1, x_2, \ldots, x_p)$, defined as $\sqrt{x_1^2 + x_2^2 + \cdots + x_p^2}$. Conceptually, it is the length of the vector. Similarly, $\|y\|$ is the Euclidean norm of vector y. The measure computes the cosine of the angle between vectors x and y. A cosine value of 0 means that the two vectors are at 90 degrees to each other (orthogonal) and have no match. The closer the cosine value to 1, the smaller the angle and the greater the match between vectors. Note that because the cosine similarity measure does not obey all of the properties of Section 2.4.4 defining metric measures, it is referred to as a nonmetric measure.
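Eq. (2.23) translates directly into a few lines of code. The sketch below uses only the Python standard library; the function name `cosine_similarity` and the choice to raise an error for an all-zero vector are assumptions for illustration, since the text does not say how that case should be handled.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between x and y: (x . y) / (||x|| ||y||)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))  # Euclidean norm of x
    norm_y = math.sqrt(sum(yi * yi for yi in y))  # Euclidean norm of y
    if norm_x == 0 or norm_y == 0:
        # Assumption: the zero-vector case is not covered by the text,
        # so this sketch treats it as an error rather than returning 0.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


# Vectors at 90 degrees to each other: no match.
orthogonal = cosine_similarity([1, 0], [0, 1])

# y is a scaled copy of x, so the angle between them is zero.
same_direction = cosine_similarity([1, 2], [3, 6])

print(orthogonal, same_direction)
```

The second call also hints at why the measure behaves differently from a metric: two distinct vectors that point in the same direction receive the maximum score, so the score alone cannot tell a vector apart from its scaled copies.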

Example 2.23 Cosine similarity between two term-frequency vectors. Suppose that x and y are the first two term-frequency vectors in Table 2.5. That is, x = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0) and y = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1). How similar are x and y? Using Eq. (2.23) to compute the cosine similarity between the two vectors, we get:

$$
\begin{aligned}
x^{t} \cdot y &= 5 \times 3 + 0 \times 0 + 3 \times 2 + 0 \times 0 + 2 \times 1 + 0 \times 1 + 0 \times 0 + 2 \times 1 + 0 \times 0 + 0 \times 1 = 25 \\
\|x\| &= \sqrt{5^2 + 0^2 + 3^2 + 0^2 + 2^2 + 0^2 + 0^2 + 2^2 + 0^2 + 0^2} = 6.48 \\
\|y\| &= \sqrt{3^2 + 0^2 + 2^2 + 0^2 + 1^2 + 1^2 + 0^2 + 1^2 + 0^2 + 1^2} = 4.12 \\
\mathrm{sim}(x, y) &= 0.94
\end{aligned}
$$

Therefore, if we were using the cosine similarity measure to compare these documents, they would be considered quite similar.
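The arithmetic of Example 2.23 can be reproduced step by step. This is a plain standard-library sketch; the variable names are chosen here for illustration, and the two vectors are the ones given in the example.

```python
import math

# Term-frequency vectors for Document1 and Document2 from Example 2.23.
x = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
y = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]

dot_xy = sum(xi * yi for xi, yi in zip(x, y))   # numerator of Eq. (2.23)
norm_x = math.sqrt(sum(xi ** 2 for xi in x))    # sqrt(42)
norm_y = math.sqrt(sum(yi ** 2 for yi in y))    # sqrt(17)
similarity = dot_xy / (norm_x * norm_y)

print(dot_xy)                # 25
print(round(norm_x, 2))      # 6.48
print(round(norm_y, 2))      # 4.12
print(round(similarity, 2))  # 0.94
```

Only four of the ten positions contribute to the dot product (team, hockey, soccer, and win), which is the sense in which the measure ignores zero-matches.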

When attributes are binary-valued, the cosine similarity function can be interpreted in terms of shared features or attributes. Suppose an object x possesses the ith attribute if $x_i = 1$. Then $x^{t} \cdot y$ is the number of attributes possessed (i.e., shared) by both x and y, and $\|x\|\,\|y\|$ is the geometric mean of the number of attributes possessed by x and the number possessed by y. Thus, sim(x, y) is a measure of relative possession of common attributes.
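A short sketch can confirm this interpretation for 0/1 vectors. The two attribute vectors below are hypothetical, made up for illustration only.

```python
import math

# Hypothetical binary attribute vectors: 1 means the object has the attribute.
x = [1, 1, 0, 1, 0, 1]
y = [1, 0, 0, 1, 1, 1]

# For 0/1 data the dot product counts positions where both entries are 1.
shared = sum(xi * yi for xi, yi in zip(x, y))

# Each squared entry equals the entry itself, so ||x||^2 is just the
# number of attributes x possesses (and likewise for y).
count_x = sum(x)
count_y = sum(y)

# ||x|| * ||y|| = sqrt(count_x) * sqrt(count_y), the geometric mean
# of the two attribute counts.
geometric_mean = math.sqrt(count_x * count_y)

similarity = shared / geometric_mean
print(shared, count_x, count_y, similarity)
```

With these vectors, three attributes are shared and each object has four, so the cosine similarity is 3 divided by the geometric mean of 4 and 4.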

A simple variation of cosine similarity for the preceding scenario is

$$
\mathrm{sim}(x, y) = \frac{x \cdot y}{x \cdot x + y \cdot y - x \cdot y} \qquad (2.24)
$$

which is the ratio of the number of attributes shared by x and y to the number of attributes possessed by x or y. This function, known as the Tanimoto coefficient or Tanimoto distance, is frequently used in information retrieval and biological taxonomy.
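Eq. (2.24) can be sketched in the same style. The function name `tanimoto`, the error raised when both vectors are all zeros, and the two example vectors are assumptions for illustration. For binary data the sketch also computes the shared-to-total attribute ratio with sets, to show that it agrees with the formula.

```python
def tanimoto(x, y):
    """Tanimoto coefficient: (x . y) / (x . x + y . y - x . y)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot_xy = sum(xi * yi for xi, yi in zip(x, y))
    dot_xx = sum(xi * xi for xi in x)
    dot_yy = sum(yi * yi for yi in y)
    denominator = dot_xx + dot_yy - dot_xy
    if denominator == 0:
        # Assumption: this only happens when both vectors are all zeros,
        # a case the text does not define; the sketch treats it as an error.
        raise ValueError("Tanimoto coefficient is undefined for two zero vectors")
    return dot_xy / denominator


# Hypothetical binary attribute vectors.
x = [1, 1, 0, 1, 0, 1]
y = [1, 0, 0, 1, 1, 1]

coefficient = tanimoto(x, y)

# Same quantity computed from attribute sets: shared / possessed by x or y.
attrs_x = {i for i, value in enumerate(x) if value}
attrs_y = {i for i, value in enumerate(y) if value}
set_ratio = len(attrs_x & attrs_y) / len(attrs_x | attrs_y)

print(coefficient, set_ratio)
```

For these vectors three attributes are shared and five are possessed by at least one object, so both computations give 3/5.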
