HAN
09-ch02-039-082-9780123814791
2011/6/1
3:15
Page 55
#17
2.2 Basic Statistical Descriptions of Data
Figure 2.6
A histogram for the Table 2.1 data set. (x-axis: Unit price ($), in bins 40–59, 60–79, 80–99, 100–119, and 120–139; y-axis: Count of items sold, 0 to 6000.)
Figure 2.7
A scatter plot for the Table 2.1 data set. (x-axis: Unit price ($), 0 to 140; y-axis: Items sold, 0 to 700.)
Figure 2.8
Scatter plots can be used to find (a) positive or (b) negative correlations between attributes.
Chapter 2 Getting to Know Your Data
Figure 2.9
Three cases where there is no observed correlation between the two plotted attributes in each
of the data sets.
from lower left to upper right, this means that the values of X increase as the values
of Y increase, suggesting a positive correlation (Figure 2.8a). If the pattern of plotted
points slopes from upper left to lower right, the values of X increase as the values of Y
decrease, suggesting a negative correlation (Figure 2.8b). A line of best fit can be drawn
to study the correlation between the variables. Statistical tests for correlation are given
in Chapter 3 on data integration (Eq. (3.3)). Figure 2.9 shows three cases in which
there is no correlation between the two attributes in each of the given data
sets. Section 2.3.2 shows how scatter plots can be extended to n attributes, resulting in a
scatter-plot matrix.
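As a rough illustration of the line of best fit and of a correlation statistic, the following Python sketch fits a least-squares line and computes Pearson's correlation coefficient for paired values of X and Y. The function names and the sample values are hypothetical, and Pearson's coefficient is only assumed here as a representative measure; the formal treatment is the one given in Chapter 3.

```python
import math

def _centered_sums(xs, ys):
    """Return (mean_x, mean_y, sxy, sxx, syy) for paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return mean_x, mean_y, sxy, sxx, syy

def best_fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept."""
    mean_x, mean_y, sxy, sxx, _ = _centered_sums(xs, ys)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def pearson_r(xs, ys):
    """Pearson's correlation coefficient, in the range [-1, 1]."""
    _, _, sxy, sxx, syy = _centered_sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up values, for illustration only (not the Table 2.1 data).
x = [1, 2, 3, 4, 5]
y_rising = [2.1, 3.9, 6.2, 8.0, 9.8]    # points slope from lower left to upper right
y_falling = [9.8, 8.0, 6.2, 3.9, 2.1]   # points slope from upper left to lower right

for label, y in (("rising", y_rising), ("falling", y_falling)):
    slope, intercept = best_fit_line(x, y)
    print(label, "slope:", slope, "intercept:", intercept, "r:", pearson_r(x, y))
```

A positive slope and a coefficient near +1 correspond to the pattern in Figure 2.8a, and a negative slope and a coefficient near -1 to the pattern in Figure 2.8b. A coefficient near 0 indicates no linear correlation, although the attributes may still be related in a nonlinear way.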
In conclusion, basic data descriptions (e.g., measures of central tendency and measures
of dispersion) and graphic statistical displays (e.g., quantile plots, histograms, and
scatter plots) provide valuable insight into the overall behavior of your data. By helping
to identify noise and outliers, they are especially useful for data cleaning.
2.3 Data Visualization
How can we convey data to users effectively? Data visualization aims to communicate
data clearly and effectively through graphical representation. Data visualization has been
used extensively in many applications—for example, at work for reporting, managing
business operations, and tracking progress of tasks. More popularly, we can take advantage
of visualization techniques to discover data relationships that are otherwise not
easily observable by looking at the raw data. Nowadays, people also use data visualization
to create fun and interesting graphics.
In this section, we briefly introduce the basic concepts of data visualization. We start
with multidimensional data such as those stored in relational databases. We discuss
several representative approaches, including pixel-oriented techniques, geometric projection
techniques, icon-based techniques, and hierarchical and graph-based techniques.
We then discuss the visualization of complex data and relations.
2.3.1 Pixel-Oriented Visualization Techniques
A simple way to visualize the value of a dimension is to use a pixel where the color of
the pixel reflects the dimension’s value. For a data set of m dimensions, pixel-oriented
techniques create m windows on the screen, one for each dimension. The m dimension
values of a record are mapped to m pixels at the corresponding positions in the windows.
The colors of the pixels reflect the corresponding values.
Inside a window, the data values are arranged in some global order shared by all
windows. The global order may be obtained by sorting all data records in a way that’s
meaningful for the task at hand.
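The mapping from records to windows can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name `pixel_windows`, the dictionary-based record format, the 0 to 255 gray scale, and the simple row-by-row fill are all hypothetical choices made here. Each dimension gets its own window, and every window uses the same global order.

```python
def pixel_windows(records, dimensions, order_by, width):
    """Build one grayscale window per dimension.

    records:    list of dicts, one per data record (assumed format)
    dimensions: names of the m dimensions to visualize
    order_by:   dimension whose ascending values define the global order
    width:      number of pixels per window row
    Returns {dimension: list of rows}, where each pixel is a gray level
    from 255 (lightest, smallest value) to 0 (darkest, largest value).
    """
    # One global order, shared by all windows.
    ordered = sorted(records, key=lambda r: r[order_by])
    windows = {}
    for dim in dimensions:
        values = [r[dim] for r in ordered]
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1  # avoid division by zero for a constant dimension
        shades = [round(255 * (1 - (v - lo) / span)) for v in values]
        # Fill the window row by row (a linear layout).
        windows[dim] = [shades[i:i + width] for i in range(0, len(shades), width)]
    return windows

# Hypothetical records, for illustration only.
customers = [
    {"income": 50, "credit_limit": 5, "age": 40},
    {"income": 20, "credit_limit": 2, "age": 63},
    {"income": 80, "credit_limit": 9, "age": 25},
    {"income": 35, "credit_limit": 3, "age": 51},
]
windows = pixel_windows(customers, ["income", "credit_limit", "age"],
                        order_by="income", width=2)
for name, rows in windows.items():
    print(name, rows)
```

Because the record at position k in the global order occupies position k in every window, comparing the same position across windows compares the attributes of one record.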
Example 2.16
Pixel-oriented visualization. AllElectronics maintains a customer information table,
which consists of four dimensions: income, credit limit, transaction volume, and age. Can
we analyze the correlation between income and the other attributes by visualization?
We can sort all customers in income-ascending order, and use this order to lay out
the customer data in the four visualization windows, as shown in Figure 2.10. The pixel
colors are chosen so that the smaller the value, the lighter the shading. Using pixel-based
visualization, we can easily observe the following: credit limit increases as income
increases; customers whose income is in the middle range are more likely to purchase
more from AllElectronics; there is no clear correlation between income and age.
In pixel-oriented techniques, data records can also be ordered in a query-dependent
way. For example, given a point query, we can sort all records in descending order of
similarity to the point query.
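A query-dependent order of this kind might be computed as follows. The sketch assumes that similarity is measured by Euclidean distance over numeric dimensions, so descending similarity is the same as ascending distance. That measure, the function name `order_by_similarity`, and the sample records are assumptions made for illustration; the text does not prescribe a particular similarity function.

```python
import math

def order_by_similarity(records, query, dimensions):
    """Sort records from most to least similar to a point query.

    Similarity is assumed to be inverse Euclidean distance, so the
    record closest to the query comes first.
    """
    def distance(record):
        return math.sqrt(sum((record[d] - query[d]) ** 2 for d in dimensions))
    return sorted(records, key=distance)

# Hypothetical records and query, for illustration only.
records = [
    {"id": "a", "income": 20, "age": 60},
    {"id": "b", "income": 52, "age": 41},
    {"id": "c", "income": 80, "age": 25},
]
query = {"income": 50, "age": 40}
ordered = order_by_similarity(records, query, ["income", "age"])
print([r["id"] for r in ordered])
```

In practice the dimensions would usually be normalized to a common scale first, so that no single attribute dominates the distance.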
Filling a window by laying out the data records in a linear way may not work well for
a wide window. The first pixel in a row is far away from the last pixel in the previous row,
though they are next to each other in the global order. Moreover, a pixel is next to the
one above it in the window, even though the two are not next to each other in the global
order. To solve this problem, we can lay out the data records in a space-filling curve
Figure 2.10
Pixel-oriented visualization of four attributes by sorting all customers in income-ascending
order: (a) income, (b) credit_limit, (c) transaction_volume, (d) age.