Data Mining. Concepts and Techniques, 3rd Edition

HAN 09-ch02-039-082-9780123814791

2.1 Data Objects and Attribute Types

2.1.5 Numeric Attributes

A numeric attribute is quantitative; that is, it is a measurable quantity, represented in

integer or real values. Numeric attributes can be interval-scaled or ratio-scaled.

Interval-Scaled Attributes

Interval-scaled attributes are measured on a scale of equal-size units. The values of

interval-scaled attributes have order and can be positive, 0, or negative. Thus, in addition

to providing a ranking of values, such attributes allow us to compare and quantify the

difference between values.

Example 2.4

Interval-scaled attributes. A temperature attribute is interval-scaled. Suppose that we

have the outdoor temperature value for a number of different days, where each day is

an object. By ordering the values, we obtain a ranking of the objects with respect to

temperature. In addition, we can quantify the difference between values. For example, a

temperature of 20°C is five degrees higher than a temperature of 15°C. Calendar dates

are another example. For instance, the years 2002 and 2010 are eight years apart.

Temperatures in Celsius and Fahrenheit do not have a true zero-point; that is, neither

0°C nor 0°F indicates “no temperature.” (On the Celsius scale, for example, the unit of

measurement is 1/100 of the difference between the melting temperature and the boiling

temperature of water at standard atmospheric pressure.) Although we can compute the difference

between temperature values, we cannot talk of one temperature value as being a multiple

of another. Without a true zero, we cannot say, for instance, that 10°C is twice as warm

as 5°C. That is, we cannot speak of the values in terms of ratios. Similarly, there is no

true zero-point for calendar dates. (The year 0 does not correspond to the beginning of

time.) This brings us to ratio-scaled attributes, for which a true zero-point exists.

Because interval-scaled attributes are numeric, we can compute their mean value, in

addition to the median and mode measures of central tendency.
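
The point about differences versus ratios can be checked numerically. The following is a minimal sketch, assuming the standard Celsius-to-Fahrenheit conversion; the function and variable names are illustrative and not from the text. It shows that the ratio of two temperature values changes when the unit changes, whereas a comparison of two differences does not.

```python
def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a Celsius reading to Fahrenheit (standard conversion)."""
    return celsius * 9.0 / 5.0 + 32.0


# The two readings discussed in the text: 10 degrees C and 5 degrees C.
warm_c, cool_c = 10.0, 5.0
warm_f = celsius_to_fahrenheit(warm_c)
cool_f = celsius_to_fahrenheit(cool_c)

# The ratio of the values depends on the arbitrary zero of each scale,
# so it is not a meaningful quantity for an interval-scaled attribute.
ratio_c = warm_c / cool_c
ratio_f = warm_f / cool_f

# Differences are meaningful: the gap between 20 and 15 degrees C equals
# the gap between 10 and 5 degrees C, and this stays true in Fahrenheit.
gap_ratio_c = (20.0 - 15.0) / (warm_c - cool_c)
gap_ratio_f = (
    celsius_to_fahrenheit(20.0) - celsius_to_fahrenheit(15.0)
) / (warm_f - cool_f)

print("ratio of values (C):", ratio_c)
print("ratio of values (F):", ratio_f)
print("ratio of differences (C, F):", gap_ratio_c, gap_ratio_f)
```

The value ratio is 2 in Celsius but not in Fahrenheit, while the comparison of differences gives the same answer on both scales.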

Ratio-Scaled Attributes

A ratio-scaled attribute is a numeric attribute with an inherent zero-point. That is, if

a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio)

of another value. In addition, the values are ordered, and we can also compute the

difference between values, as well as the mean, median, and mode.

Example 2.5

Ratio-scaled attributes. Unlike temperatures in Celsius and Fahrenheit, the Kelvin (K)

temperature scale has what is considered a true zero-point (0 K = −273.15°C): It is

the point at which the particles that make up matter have zero kinetic energy. Other

examples of ratio-scaled attributes include count attributes such as years of experience

(e.g., the objects are employees) and number of words (e.g., the objects are documents).

Additional examples include attributes to measure weight, height, latitude and longitude

coordinates (e.g., when clustering houses), and monetary quantities (e.g., you are 100

times richer with $100 than with $1).
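
To illustrate why ratios are meaningful on a scale with a true zero, here is a small sketch. It uses the Rankine scale as a second absolute temperature scale; Rankine does not appear in the text and is brought in here only as an assumption for the comparison, and the function names are illustrative.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Shift Celsius so that absolute zero (-273.15 degrees C) maps to 0 K."""
    return celsius + 273.15


def kelvin_to_rankine(kelvin: float) -> float:
    """Rankine (assumed here for comparison) shares Kelvin's zero point
    but uses Fahrenheit-sized units."""
    return kelvin * 9.0 / 5.0


warm_k = celsius_to_kelvin(10.0)
cool_k = celsius_to_kelvin(5.0)

# Both scales start at the same true zero, so changing the unit only
# multiplies every value by a constant and the ratio is unchanged.
ratio_k = warm_k / cool_k
ratio_r = kelvin_to_rankine(warm_k) / kelvin_to_rankine(cool_k)

print("ratio in kelvins:", ratio_k)
print("ratio in degrees Rankine:", ratio_r)
```

Measured from absolute zero, 10°C is only about 1.8 percent warmer than 5°C, and that figure does not depend on the unit chosen.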

2.1.6 Discrete versus Continuous Attributes

In our presentation, we have organized attributes into nominal, binary, ordinal, and

numeric types. There are many ways to organize attribute types. The types are not

mutually exclusive.

Classification algorithms developed from the field of machine learning often talk of

attributes as being either discrete or continuous. Each type may be processed differently.

A discrete attribute has a finite or countably infinite set of values, which may or may not

be represented as integers. The attributes hair color, smoker, medical test, and drink size

each have a finite number of values, and so are discrete. Note that discrete attributes

may have numeric values, such as 0 and 1 for binary attributes or the values 0 to 110 for

the attribute age. An attribute is countably infinite if the set of possible values is infinite

but the values can be put in a one-to-one correspondence with the natural numbers. For

example, the attribute customer ID is countably infinite. The number of customers can

grow to infinity, but in reality, the actual set of values is countable (where the values can

be put in one-to-one correspondence with the set of integers). Zip codes are another

example.

If an attribute is not discrete, it is continuous. The terms numeric attribute and continuous

attribute are often used interchangeably in the literature. (This can be confusing

because, in the classic sense, continuous values are real numbers, whereas numeric values

can be either integers or real numbers.) In practice, real values are represented

using a finite number of digits. Continuous attributes are typically represented as

floating-point variables.
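
As a rough illustration of how software might separate the two kinds of attribute, the sketch below applies a simple heuristic based on the stored type: values that are non-integer real numbers mark an attribute as continuous, and anything else (labels, Booleans, integers) is treated as discrete. This rule, the function name attribute_kind, and the sample values are assumptions made for the example, not a method given in the text. An integer-valued measurement stored as a float would be labeled continuous by this rule.

```python
from numbers import Integral, Real


def attribute_kind(values) -> str:
    """Heuristic (an assumption of this sketch): an attribute is
    'continuous' if any value is a non-integer real number type,
    otherwise it is 'discrete'."""
    for value in values:
        if isinstance(value, Real) and not isinstance(value, Integral):
            return "continuous"
    return "discrete"


# Hypothetical sample values for attributes like those named in the text.
drink_size = ["small", "medium", "large"]   # finite set of labels
smoker = [0, 1, 1, 0]                       # binary, stored as integers
age = [23, 41, 67]                          # integers in a bounded range
body_temperature = [36.6, 37.1, 38.4]       # floating-point measurements

for name, column in [
    ("drink_size", drink_size),
    ("smoker", smoker),
    ("age", age),
    ("body_temperature", body_temperature),
]:
    print(name, "->", attribute_kind(column))

# Real values are stored with a finite number of digits, so
# floating-point arithmetic is only approximate.
sum_is_exact = (0.1 + 0.2 == 0.3)
print("0.1 + 0.2 == 0.3 ?", sum_is_exact)
```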

2.2 Basic Statistical Descriptions of Data

For data preprocessing to be successful, it is essential to have an overall picture of your

data. Basic statistical descriptions can be used to identify properties of the data and

highlight which data values should be treated as noise or outliers.

This section discusses three areas of basic statistical descriptions. We start with measures

of central tendency (Section 2.2.1), which measure the location of the middle or

center of a data distribution. Intuitively speaking, given an attribute, where do most of

its values fall? In particular, we discuss the mean, median, mode, and midrange.
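
The four measures just named can be computed with Python's standard statistics module. This is a minimal sketch; the list of values is hypothetical and chosen only for illustration, and the midrange is computed here as the average of the smallest and largest values.

```python
import statistics

# Hypothetical attribute values, used only for illustration.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean_value = statistics.mean(values)        # arithmetic average
median_value = statistics.median(values)    # middle value of the sorted data
mode_values = statistics.multimode(values)  # most frequent value(s)
midrange = (min(values) + max(values)) / 2  # average of the two extremes

print("mean:", mean_value)
print("median:", median_value)
print("mode(s):", mode_values)
print("midrange:", midrange)
```

With these values the data set has two modes, and the single large value pulls the mean and the midrange above the median.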

In addition to assessing the central tendency of our data set, we also would like to

have an idea of the dispersion of the data. That is, how are the data spread out? The most

common data dispersion measures are the range, quartiles, and interquartile range; the

five-number summary and boxplots; and the variance and standard deviation of the data.

These measures are useful for identifying outliers and are described in Section 2.2.2.
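
The dispersion measures listed above can be sketched with the standard statistics module. The data values are hypothetical. Two choices in this sketch are assumptions rather than anything stated in the text: quartiles use the default interpolation of statistics.quantiles, and outliers are flagged with the common rule of thumb of 1.5 times the interquartile range beyond the quartiles. Other quartile conventions give slightly different numbers.

```python
import math
import statistics

# Hypothetical attribute values, used only for illustration.
values = sorted([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

value_range = max(values) - min(values)

# Three cut points that split the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Five-number summary: minimum, Q1, median, Q3, maximum.
five_number_summary = (min(values), q1, q2, q3, max(values))

# Population variance and standard deviation.
variance = statistics.pvariance(values)
std_dev = statistics.pstdev(values)

# Rule of thumb (assumed here): flag values lying more than
# 1.5 * IQR below Q1 or above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print("range:", value_range)
print("five-number summary:", five_number_summary)
print("IQR:", iqr)
print("variance:", variance, "standard deviation:", std_dev)
print("suspected outliers:", outliers)
```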

Finally, we can use many graphic displays of basic statistical descriptions to visually

inspect our data (Section 2.2.3). Most statistical or graphical data presentation software
