HAN
09-ch02-039-082-9780123814791
2011/6/1
3:15
Page 43
#5
2.1 Data Objects and Attribute Types
43
2.1.5 Numeric Attributes
A numeric attribute is quantitative; that is, it is a measurable quantity, represented in
integer or real values. Numeric attributes can be interval-scaled or ratio-scaled.
Interval-Scaled Attributes
Interval-scaled attributes are measured on a scale of equal-size units. The values of
interval-scaled attributes have order and can be positive, 0, or negative. Thus, in addition
to providing a ranking of values, such attributes allow us to compare and quantify the
difference between values.
Example 2.4
Interval-scaled attributes. A temperature attribute is interval-scaled. Suppose that we
have the outdoor temperature value for a number of different days, where each day is
an object. By ordering the values, we obtain a ranking of the objects with respect to
temperature. In addition, we can quantify the difference between values. For example, a
temperature of 20°C is five degrees higher than a temperature of 15°C. Calendar dates
are another example. For instance, the years 2002 and 2010 are eight years apart.
Temperatures in Celsius and Fahrenheit do not have a true zero-point, that is, neither
0°C nor 0°F indicates “no temperature.” (On the Celsius scale, for example, the unit of
measurement is 1/100 of the difference between the melting temperature and the boiling
temperature of water at atmospheric pressure.) Although we can compute the difference
between temperature values, we cannot talk of one temperature value as being a multiple
of another. Without a true zero, we cannot say, for instance, that 10°C is twice as warm
as 5°C. That is, we cannot speak of the values in terms of ratios. Similarly, there is no
true zero-point for calendar dates. (The year 0 does not correspond to the beginning of
time.) This brings us to ratio-scaled attributes, for which a true zero-point exists.
Because interval-scaled attributes are numeric, we can compute their mean value, in
addition to the median and mode measures of central tendency.
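The point about differences versus ratios can be checked with a small sketch. It converts the two Celsius readings discussed above to Fahrenheit and compares the results. The function name celsius_to_fahrenheit and the variable names are illustrative assumptions, not part of the text.

```python
def celsius_to_fahrenheit(c: float) -> float:
    """Convert a Celsius reading to Fahrenheit (F = C * 9/5 + 32)."""
    return c * 9.0 / 5.0 + 32.0


# The two readings discussed in the text: 10 and 5 degrees Celsius.
warm_c, cool_c = 10.0, 5.0
warm_f = celsius_to_fahrenheit(warm_c)
cool_f = celsius_to_fahrenheit(cool_c)

# Ratios change when the scale changes, because each scale places
# its zero at an arbitrary point.
ratio_c = warm_c / cool_c
ratio_f = warm_f / cool_f

# Differences stay consistent: they differ only by the unit-size
# factor 9/5 between the two scales.
diff_c = warm_c - cool_c
diff_f = warm_f - cool_f

print(f"ratio in Celsius:    {ratio_c:.3f}")
print(f"ratio in Fahrenheit: {ratio_f:.3f}")
print(f"difference: {diff_c} C corresponds to {diff_f} F")
```

The ratio is 2 in Celsius but not in Fahrenheit, so "twice as warm" is an artifact of the chosen scale, while the difference is the same physical quantity in both.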
Ratio-Scaled Attributes
A ratio-scaled attribute is a numeric attribute with an inherent zero-point. That is, if
a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio)
of another value. In addition, the values are ordered, and we can also compute the
difference between values, as well as the mean, median, and mode.
Example 2.5
Ratio-scaled attributes. Unlike temperatures in Celsius and Fahrenheit, the Kelvin (K)
temperature scale has what is considered a true zero-point (0 K = −273.15°C): It is
the point at which the particles that make up matter have zero kinetic energy. Other
examples of ratio-scaled attributes include count attributes such as years of experience
(e.g., the objects are employees) and number of words (e.g., the objects are documents).
Additional examples include attributes to measure weight, height, latitude and longitude
coordinates (e.g., when clustering houses), and monetary quantities (e.g., you are 100
times richer with $100 than with $1).
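As a rough illustration of why a true zero-point makes ratios meaningful, the sketch below converts 10°C and 5°C to Kelvin and also compares a monetary ratio in two units. The helper celsius_to_kelvin and all variable names are assumptions made for this example.

```python
def celsius_to_kelvin(c: float) -> float:
    """Convert a Celsius reading to Kelvin (K = C + 273.15)."""
    return c + 273.15


# On the Kelvin scale, 10 C is only slightly "warmer" than 5 C
# in ratio terms, not twice as warm.
warm_k = celsius_to_kelvin(10.0)
cool_k = celsius_to_kelvin(5.0)
kelvin_ratio = warm_k / cool_k

# For a ratio-scaled attribute, changing the unit leaves ratios
# unchanged: $100 versus $1 is the same ratio in dollars or cents.
ratio_dollars = 100.0 / 1.0
ratio_cents = (100.0 * 100) / (1.0 * 100)

print(f"Kelvin ratio: {kelvin_ratio:.4f}")
print(f"money ratio in dollars: {ratio_dollars}, in cents: {ratio_cents}")
```

Because zero dollars and zero kelvin both mean "none of the quantity," rescaling the unit multiplies every value by the same factor and the ratio survives.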
2.1.6 Discrete versus Continuous Attributes
In our presentation, we have organized attributes into nominal, binary, ordinal, and
numeric types. There are many ways to organize attribute types. The types are not
mutually exclusive.
Classification algorithms developed in the field of machine learning often treat
attributes as being either discrete or continuous. Each type may be processed differently.
A discrete attribute has a finite or countably infinite set of values, which may or may not
be represented as integers. The attributes hair color, smoker, medical test, and drink size
each have a finite number of values, and so are discrete. Note that discrete attributes
may have numeric values, such as 0 and 1 for binary attributes or the values 0 to 110 for
the attribute age. An attribute is countably infinite if the set of possible values is infinite
but the values can be put in a one-to-one correspondence with natural numbers. For
example, the attribute customer ID is countably infinite. The number of customers can
grow to infinity, but in reality, the actual set of values is countable (where the values can
be put in one-to-one correspondence with the set of integers). Zip codes are another
example.
If an attribute is not discrete, it is continuous. The terms numeric attribute and
continuous attribute are often used interchangeably in the literature. (This can be confusing
because, in the classic sense, continuous values are real numbers, whereas numeric
values can be either integers or real numbers.) In practice, real values are represented
using a finite number of digits. Continuous attributes are typically represented as
floating-point variables.
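One simple way to act on this distinction in code is to inspect how an attribute's values are stored. The sketch below is only a heuristic under the assumption stated in the text that continuous attributes are held as floating-point values; the function attribute_kind and the sample columns are hypothetical.

```python
from numbers import Integral, Real


def attribute_kind(values) -> str:
    """Heuristically label a column as 'discrete' or 'continuous'.

    Assumption: a column is continuous if it holds at least one
    non-integer real number (for example, a float). Strings,
    booleans, and integers are treated as discrete.
    """
    for v in values:
        if isinstance(v, Real) and not isinstance(v, Integral):
            return "continuous"
    return "discrete"


# Hypothetical sample columns, echoing the attributes named in the text.
hair_color = ["black", "brown", "blond"]
smoker = [0, 1, 1, 0]
age = [23, 45, 110]
temperature = [36.6, 37.2, 38.0]

for name, column in [("hair_color", hair_color), ("smoker", smoker),
                     ("age", age), ("temperature", temperature)]:
    print(name, "->", attribute_kind(column))
```

A storage-based check like this can mislabel data, for instance integer codes saved as floats, so in practice the attribute's meaning should take precedence over its representation.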
2.2 Basic Statistical Descriptions of Data
For data preprocessing to be successful, it is essential to have an overall picture of your
data. Basic statistical descriptions can be used to identify properties of the data and
highlight which data values should be treated as noise or outliers.
This section discusses three areas of basic statistical descriptions. We start with measures
of central tendency (Section 2.2.1), which measure the location of the middle or
center of a data distribution. Intuitively speaking, given an attribute, where do most of
its values fall? In particular, we discuss the mean, median, mode, and midrange.
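As a preview, the four measures of central tendency named above can be computed with Python's standard statistics module. This is a minimal sketch on a small made-up list; the data and variable names are illustrative assumptions, and the midrange is computed by hand as the average of the smallest and largest values.

```python
import statistics

# Hypothetical values of a single numeric attribute.
values = [2, 3, 3, 5, 7, 10]

mean_value = statistics.mean(values)      # arithmetic average
median_value = statistics.median(values)  # middle value of the sorted data
mode_value = statistics.mode(values)      # most frequent value
midrange = (min(values) + max(values)) / 2  # average of the extremes

print("mean:", mean_value)
print("median:", median_value)
print("mode:", mode_value)
print("midrange:", midrange)
```

For data with several equally frequent values, statistics.multimode returns all of them rather than a single mode.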
In addition to assessing the central tendency of our data set, we also would like to
have an idea of the dispersion of the data. That is, how are the data spread out? The most
common data dispersion measures are the range, quartiles, and interquartile range; the
five-number summary and boxplots; and the variance and standard deviation of the data.
These measures are useful for identifying outliers and are described in Section 2.2.2.
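The dispersion measures listed above can likewise be sketched with the standard statistics module. The data are made up, and two choices here are assumptions rather than anything fixed by the text: quartiles use the module's "inclusive" interpolation method, and the variance is the population form (dividing by N). Other quartile conventions and the sample variance (statistics.variance) give slightly different numbers.

```python
import statistics

# Hypothetical values of a single numeric attribute.
values = [2, 3, 3, 5, 7, 10]

# Range: difference between the largest and smallest values.
value_range = max(values) - min(values)

# Quartiles: the three cut points that split the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

# Interquartile range: spread of the middle half of the data.
iqr = q3 - q1

# Five-number summary: minimum, Q1, median, Q3, maximum.
five_number = (min(values), q1, q2, q3, max(values))

# Population variance and standard deviation (divide by N).
variance = statistics.pvariance(values)
std_dev = statistics.pstdev(values)

print("range:", value_range)
print("quartiles:", q1, q2, q3)
print("IQR:", iqr)
print("five-number summary:", five_number)
print(f"variance: {variance:.3f}, standard deviation: {std_dev:.3f}")
```

A boxplot is essentially a drawing of the five-number summary, which is why these measures are usually reported together.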
Finally, we can use many graphic displays of basic statistical descriptions to visually
inspect our data (Section 2.2.3). Most statistical or graphical data presentation software