HAN
09-ch02-039-082-9780123814791
2011/6/1
3:15
Page 64
#26
64
Chapter 2 Getting to Know Your Data
Figure 2.19
“Worlds-within-Worlds” (also known as n-Vision). Source:
http://graphics.cs.columbia.edu/projects/AutoVisual/images/1.dipstick.5.gif.
2.3.5
Visualizing Complex Data and Relations
Early visualization techniques were designed mainly for numeric data. More recently,
a growing volume of non-numeric data, such as text and social networks, has become
available, and visualizing and analyzing such data attracts considerable interest.
There are many new visualization techniques dedicated to these kinds of data. For
example, many people on the Web tag various objects such as pictures, blog entries, and
product reviews. A tag cloud is a visualization of statistics of user-generated tags. Often,
in a tag cloud, tags are listed alphabetically or in a user-preferred order. The importance
of a tag is indicated by font size or color. Figure 2.21 shows a tag cloud for visualizing
the popular tags used on a Web site.
Tag clouds are often used in two ways. First, in a tag cloud for a single item, we can
use the size of a tag to represent the number of times that the tag is applied to this item
by different users. Second, when visualizing the tag statistics on multiple items, we can
use the size of a tag to represent the number of items that the tag has been applied to,
that is, the popularity of the tag.
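The two uses of tag size described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the `taggings` data, the function name `tag_font_sizes`, the linear scaling, and the 10 to 36 point range are all assumptions introduced here.

```python
from collections import Counter

def tag_font_sizes(tag_counts, min_pt=10, max_pt=36):
    """Map each tag's count linearly onto a font size in [min_pt, max_pt]."""
    lo, hi = min(tag_counts.values()), max(tag_counts.values())
    span = hi - lo or 1  # avoid division by zero when all counts are equal
    return {
        tag: round(min_pt + (count - lo) * (max_pt - min_pt) / span)
        for tag, count in tag_counts.items()
    }

# Hypothetical data: item -> tags applied by different users (repeats allowed).
taggings = {
    "photo1": ["sunset", "beach", "sunset", "travel"],
    "photo2": ["beach", "family"],
    "photo3": ["sunset", "travel", "travel"],
}

# Use 1: a single item. Size reflects how often each tag was applied to it.
single_item_counts = Counter(taggings["photo1"])
single_sizes = tag_font_sizes(single_item_counts)

# Use 2: multiple items. Size reflects how many items carry each tag
# (set() ensures a tag is counted once per item).
popularity_counts = Counter(tag for tags in taggings.values() for tag in set(tags))
popularity_sizes = tag_font_sizes(popularity_counts)

# List tags alphabetically, as tag clouds often do.
for tag in sorted(popularity_sizes):
    print(f"{tag}: {popularity_sizes[tag]}pt")
```

A logarithmic scale is a common alternative when a few tags dominate the counts; the linear mapping is used here only because it is the simplest to read.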
In addition to complex data, complex relations among data entries also raise challenges
for visualization. For example, Figure 2.22 uses a disease influence graph to
visualize the correlations between diseases. The nodes in the graph are diseases, and
the size of each node is proportional to the prevalence of the corresponding disease.
Two nodes are linked by an edge if the corresponding diseases have a strong correlation.
The width of an edge is proportional to the strength of the correlation between the two
corresponding diseases.
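The encoding rules of such a graph (node size from prevalence, edges only for strong correlations, edge width from correlation strength) can be sketched as follows. The numbers below are made up for illustration and are not taken from the NHANES data set; the cutoff `threshold=0.3` and the two scale factors are assumptions, since the text does not say how "strong" is defined.

```python
def build_influence_graph(prevalence, correlation, threshold=0.3,
                          node_scale=100.0, edge_scale=10.0):
    """Return (nodes, edges) with visual attributes for an influence graph.

    nodes: disease -> node size, proportional to prevalence.
    edges: (disease_a, disease_b) -> edge width, proportional to the
           correlation strength; pairs below the threshold get no edge.
    """
    nodes = {disease: node_scale * p for disease, p in prevalence.items()}
    edges = {
        pair: edge_scale * abs(r)
        for pair, r in correlation.items()
        if abs(r) >= threshold  # assumed definition of a "strong" correlation
    }
    return nodes, edges

# Hypothetical prevalence (fraction of the population) and pairwise correlations.
prevalence = {"Hb": 0.30, "Ov": 0.35, "Di": 0.10, "As": 0.08}
correlation = {
    ("Hb", "Ov"): 0.50,
    ("Hb", "Di"): 0.40,
    ("Ov", "Di"): 0.45,
    ("As", "Di"): 0.05,  # weak: no edge is drawn
    ("As", "Hb"): 0.10,  # weak: no edge is drawn
}

nodes, edges = build_influence_graph(prevalence, correlation)
for (a, b), width in sorted(edges.items()):
    print(f"{a} -- {b}: width {width:.1f}")
```

The resulting dictionaries can be handed to any graph-drawing tool; keeping the visual attributes separate from the layout step makes it easy to experiment with different thresholds.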
Figure 2.20
Newsmap: Use of tree-maps to visualize Google news headline stories. Source:
www.cs.umd.edu/class/spring2005/cmsc838s/viz4all/ss/newsmap.png.
In summary, visualization provides effective tools for exploring data. We have introduced
several popular methods and the essential ideas behind them; many other tools and
methods exist. Moreover, visualization serves data mining in several ways. In addition
to visualizing data, it can be used to represent the data mining process, the patterns
obtained from a mining method, and user interaction with the data. Visual data mining
is an important research and development direction.
2.4
Measuring Data Similarity and Dissimilarity
In data mining applications, such as clustering, outlier analysis, and nearest-neighbor
classification, we need ways to assess how alike or unalike objects are in comparison to
one another. For example, a store may want to search for clusters of customer objects,
resulting in groups of customers with similar characteristics (e.g., similar income, area
of residence, and age). Such information can then be used for marketing.
Figure 2.21
Using a tag cloud to visualize popular Web site tags. Source: A snapshot of
www.flickr.com/photos/tags/, January 23, 2010.
Figure 2.22
Disease influence graph of people at least 20 years old in the NHANES data set.
Node labels: High blood pressure (Hb), Allergies (Al), Overweight (Ov), High cholesterol
level (Hc), Arthritis (Ar), Trouble seeing (Tr), Risk of diabetes (Ri), Asthma (As),
Diabetes (Di), Hayfever (Ha), Thyroid problem (Th), Heart disease (He), Cancer (Cn),
Sleep disorder (Sl), Eczema (Ec), Chronic bronchitis (Ch), Osteoporosis (Os),
Prostate (Pr), Cardiovascular (Ca), Glaucoma (Gl), Stroke (St), Liver condition (Li),
PSA test abnormal (PS), Kidney (Ki), Endometriosis (En), Emphysema (Em).
A cluster is a collection of data objects such that the objects within a cluster are similar to one
another and dissimilar to the objects in other clusters. Outlier analysis also employs
clustering-based techniques to identify potential outliers as objects that are highly
dissimilar to others. Knowledge of object similarities can also be used in nearest-neighbor
classification schemes where a given object (e.g., a patient) is assigned a class label
(relating to, say, a diagnosis) based on its similarity to other objects in the model.
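To make the nearest-neighbor idea concrete, the sketch below assigns a new object the label of the stored object it is least dissimilar to. It is only an illustration under stated assumptions: Euclidean distance stands in as a placeholder dissimilarity measure, and the patient records (age, systolic blood pressure) and their labels are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor_label(query, labeled_objects, dissimilarity=euclidean):
    """Return the label of the stored object least dissimilar to `query`.

    labeled_objects is a list of (features, label) pairs.
    """
    _, label = min(labeled_objects,
                   key=lambda item: dissimilarity(query, item[0]))
    return label

# Hypothetical patients: (age, systolic blood pressure) with a diagnosis label.
patients = [
    ((35, 118), "healthy"),
    ((42, 125), "healthy"),
    ((63, 160), "hypertension"),
    ((58, 155), "hypertension"),
]

predicted = nearest_neighbor_label((60, 150), patients)
print(predicted)
```

Because the dissimilarity function is passed as a parameter, any other measure can be substituted without changing the classifier. In practice, attributes on different scales would usually be normalized first so that no single attribute dominates the distance.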