Data Mining. Concepts and Techniques, 3rd Edition

HAN 09-ch02-039-082-9780123814791



HAN 09-ch02-039-082-9780123814791 2011/6/1 3:15 Page 64 #26

Chapter 2 Getting to Know Your Data

Figure 2.19 “Worlds-within-Worlds” (also known as n-Vision). Source: http://graphics.cs.columbia.edu/projects/AutoVisual/images/1.dipstick.5.gif.

2.3.5 Visualizing Complex Data and Relations

In the early days, visualization techniques were designed mainly for numeric data. Recently, more and more non-numeric data, such as text and social networks, have become available, and visualizing and analyzing such data attracts a lot of interest.

Many new visualization techniques are dedicated to these kinds of data. For example, many people on the Web tag various objects such as pictures, blog entries, and product reviews. A tag cloud is a visualization of statistics of user-generated tags. Often, in a tag cloud, tags are listed alphabetically or in a user-preferred order, and the importance of a tag is indicated by font size or color. Figure 2.21 shows a tag cloud for visualizing the popular tags used in a Web site.

Tag clouds are often used in two ways. First, in a tag cloud for a single item, we can use the size of a tag to represent the number of times that the tag is applied to this item by different users. Second, when visualizing the tag statistics on multiple items, we can use the size of a tag to represent the number of items that the tag has been applied to, that is, the popularity of the tag.
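The two uses of tag size can be sketched in a few lines of Python. This is a minimal illustration, not a method given in the text: the (user, item, tag) triple format, the function names, the linear size mapping, and the 10 to 36 point range are all assumptions made for the example.

```python
from collections import Counter

def tag_counts_for_item(taggings, item):
    """First use: for one item, count how many distinct users applied each tag."""
    # taggings is an iterable of (user, item, tag) triples (hypothetical format).
    pairs = {(user, tag) for user, it, tag in taggings if it == item}
    return Counter(tag for _, tag in pairs)

def tag_popularity(taggings):
    """Second use: count the number of distinct items each tag was applied to."""
    pairs = {(it, tag) for _, it, tag in taggings}
    return Counter(tag for _, tag in pairs)

def font_sizes(counts, min_pt=10, max_pt=36):
    """Map counts linearly to font sizes; tags are returned in alphabetical order."""
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    span = hi - lo
    sizes = {}
    for tag in sorted(counts):
        # If every tag has the same count, draw all of them at the largest size.
        frac = (counts[tag] - lo) / span if span else 1.0
        sizes[tag] = round(min_pt + frac * (max_pt - min_pt))
    return sizes

# Made-up example data: three users tagging four pictures.
taggings = [
    ("u1", "p1", "sunset"), ("u2", "p1", "sunset"), ("u1", "p1", "beach"),
    ("u3", "p2", "sunset"), ("u3", "p3", "beach"), ("u3", "p4", "sunset"),
]
per_item = tag_counts_for_item(taggings, "p1")
popularity = tag_popularity(taggings)
sizes = font_sizes(popularity)
```

Color could encode importance in the same way by mapping the normalized count to a color scale instead of a point size.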

In addition to complex data, complex relations among data entries also raise challenges for visualization. For example, Figure 2.22 uses a disease influence graph to visualize the correlations between diseases. The nodes in the graph are diseases, and the size of each node is proportional to the prevalence of the corresponding disease. Two nodes are linked by an edge if the corresponding diseases have a strong correlation. The width of an edge is proportional to the strength of the correlation pattern of the two corresponding diseases.
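The construction described above can be sketched as follows. The text does not say how "strong" is decided or how sizes are scaled, so the cutoff `threshold`, the drawing limits `max_node` and `max_edge`, and the function name are assumptions; the prevalence and correlation numbers are invented for illustration and are not NHANES values.

```python
def influence_graph(prevalence, correlation, threshold=0.3,
                    max_node=40.0, max_edge=8.0):
    """Compute node sizes and edge widths for a disease influence graph.

    prevalence:  dict mapping disease -> prevalence (any non-negative scale)
    correlation: dict mapping (disease_a, disease_b) -> correlation strength
    threshold:   assumed cutoff for a "strong" correlation
    """
    # Node size is proportional to prevalence, scaled so the largest is max_node.
    top_prev = max(prevalence.values())
    nodes = {d: max_node * p / top_prev for d, p in prevalence.items()}

    # Keep only strongly correlated pairs as edges.
    strong = {pair: abs(c) for pair, c in correlation.items()
              if abs(c) >= threshold}

    # Edge width is proportional to correlation strength.
    top_corr = max(strong.values(), default=1.0)
    edges = {pair: max_edge * c / top_corr for pair, c in strong.items()}
    return nodes, edges

# Invented numbers, using abbreviations in the style of Figure 2.22.
prevalence = {"Hb": 0.30, "Ov": 0.60, "Di": 0.10}
correlation = {("Hb", "Ov"): 0.5, ("Ov", "Di"): 0.4, ("Hb", "Di"): 0.1}
nodes, edges = influence_graph(prevalence, correlation)
```

The resulting dictionaries can be handed to any graph-drawing tool; here the weak Hb–Di pair falls below the assumed cutoff and gets no edge.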


Figure 2.20 Newsmap: Use of tree-maps to visualize Google news headline stories. Source: www.cs.umd.edu/class/spring2005/cmsc838s/viz4all/ss/newsmap.png.

In summary, visualization provides effective tools to explore data. We have introduced several popular methods and the essential ideas behind them; many other tools and methods exist. Moreover, visualization can be used in several aspects of data mining. In addition to visualizing data, it can be used to represent the data mining process, the patterns obtained from a mining method, and user interaction with the data. Visual data mining is an important research and development direction.

2.4 Measuring Data Similarity and Dissimilarity

Figure 2.21 Using a tag cloud to visualize popular Web site tags. Source: A snapshot of www.flickr.com/photos/tags/, January 23, 2010.

Figure 2.22 Disease influence graph of people at least 20 years old in the NHANES data set. Node labels: High blood pressure (Hb), Allergies (Al), Overweight (Ov), High cholesterol level (Hc), Arthritis (Ar), Trouble seeing (Tr), Risk of diabetes (Ri), Asthma (As), Diabetes (Di), Hayfever (Ha), Thyroid problem (Th), Heart disease (He), Cancer (Cn), Sleep disorder (Sl), Eczema (Ec), Chronic bronchitis (Ch), Osteoporosis (Os), Prostate (Pr), Cardiovascular (Ca), Glaucoma (Gl), Stroke (St), Liver condition (Li), PSA test abnormal (PS), Kidney (Ki), Endometriosis (En), Emphysema (Em).

In data mining applications, such as clustering, outlier analysis, and nearest-neighbor classification, we need ways to assess how alike or unalike objects are in comparison to one another. For example, a store may want to search for clusters of customer objects, resulting in groups of customers with similar characteristics (e.g., similar income, area of residence, and age). Such information can then be used for marketing. A cluster is a collection of data objects such that the objects within a cluster are similar to one another and dissimilar to the objects in other clusters. Outlier analysis also employs clustering-based techniques to identify potential outliers as objects that are highly dissimilar to others. Knowledge of object similarities can also be used in nearest-neighbor classification schemes where a given object (e.g., a patient) is assigned a class label (relating to, say, a diagnosis) based on its similarity toward other objects in the model.
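To make the role of a dissimilarity measure concrete, here is a small Python sketch. No specific measure has been defined at this point in the text, so Euclidean distance on numeric tuples is assumed purely for illustration. The function names and the patient data are hypothetical, and the outlier check is a simple average-distance heuristic rather than the clustering-based techniques mentioned above.

```python
import math

def dissimilarity(a, b):
    """Euclidean distance between two equal-length numeric tuples (assumed measure)."""
    return math.dist(a, b)

def nearest_neighbor_label(query, labeled_objects):
    """Assign the query the class label of its least dissimilar stored object."""
    _, label = min(labeled_objects,
                   key=lambda pair: dissimilarity(query, pair[0]))
    return label

def most_dissimilar(objects):
    """Return the index of the object with the largest average dissimilarity
    to all other objects, as a crude candidate outlier."""
    def average_distance(i):
        others = [o for j, o in enumerate(objects) if j != i]
        return sum(dissimilarity(objects[i], o) for o in others) / len(others)
    return max(range(len(objects)), key=average_distance)

# Hypothetical patients described by two numeric measurements each.
patients = [
    ((1.0, 1.0), "healthy"),
    ((1.2, 0.9), "healthy"),
    ((5.0, 5.1), "at risk"),
]
label = nearest_neighbor_label((4.8, 5.0), patients)

points = [(1.0, 1.0), (1.2, 0.9), (1.1, 1.1), (9.0, 9.0)]
outlier_index = most_dissimilar(points)
```

In practice the attributes would usually be normalized first, since an attribute with a large numeric range (such as income) would otherwise dominate the distance.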
