Data Mining. Concepts and Techniques, 3rd Edition

HAN 10-ch03-083-124-9780123814791

Date: 08.10.2017

2011/6/1 3:16 Page 88 #6

Chapter 3 Data Preprocessing

3.2 Data Cleaning

Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. In this section, you will study basic methods for data cleaning. Section 3.2.1 looks at ways of handling missing values. Section 3.2.2 explains data smoothing techniques. Section 3.2.3 discusses approaches to data cleaning as a process.

3.2.1 Missing Values

Imagine that you need to analyze AllElectronics sales and customer data. You note that many tuples have no recorded value for several attributes such as customer income. How can you go about filling in the missing values for this attribute? Let’s look at the following methods.

1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification). This method is not very effective, unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably. By ignoring the tuple, we do not make use of the remaining attributes’ values in the tuple. Such data could have been useful to the task at hand.

2. Fill in the missing value manually: In general, this approach is time consuming and may not be feasible given a large data set with many missing values.

3. Use a global constant to fill in the missing value: Replace all missing attribute values with the same constant, such as the label “Unknown” or −∞. If missing values are replaced by, say, “Unknown,” then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common, namely “Unknown.” Hence, although this method is simple, it is not foolproof.

4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value: Chapter 2 discussed measures of central tendency, which indicate the “middle” value of a data distribution. For normal (symmetric) data distributions, the mean can be used, while for skewed data distributions the median should be used (Section 2.2). For example, suppose that the data distribution regarding the income of AllElectronics customers is symmetric and that the mean income is $56,000. Use this value to replace the missing value for income.

5. Use the attribute mean or median for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit risk, we may replace the missing value with the mean income value for customers in the same credit risk category as that of the given tuple. If the data distribution for a given class is skewed, the median value is a better choice.

6. Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income. Decision trees and Bayesian inference are described in detail in Chapters 8 and 9, respectively, while regression is introduced in Section 3.4.5.
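Methods 3 through 5 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the customer tuples, the attribute names `income` and `credit_risk`, and the helper functions are hypothetical and do not come from the AllElectronics data.

```python
from statistics import mean, median

# Hypothetical customer tuples; None marks a missing income value.
customers = [
    {"id": 1, "credit_risk": "low", "income": 60000},
    {"id": 2, "credit_risk": "low", "income": None},
    {"id": 3, "credit_risk": "low", "income": 52000},
    {"id": 4, "credit_risk": "high", "income": 30000},
    {"id": 5, "credit_risk": "high", "income": None},
    {"id": 6, "credit_risk": "high", "income": 34000},
]


def fill_constant(rows, attr, constant="Unknown"):
    """Method 3: replace every missing value with one global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]


def fill_central(rows, attr, measure=mean):
    """Method 4: replace missing values with the attribute mean or median."""
    fill = measure([r[attr] for r in rows if r[attr] is not None])
    return [{**r, attr: fill if r[attr] is None else r[attr]} for r in rows]


def fill_by_class(rows, attr, class_attr, measure=mean):
    """Method 5: use the mean or median of the tuples in the same class.

    Raises KeyError if a class has no observed value at all.
    """
    observed = {}
    for r in rows:
        if r[attr] is not None:
            observed.setdefault(r[class_attr], []).append(r[attr])
    fills = {label: measure(values) for label, values in observed.items()}
    return [
        {**r, attr: fills[r[class_attr]] if r[attr] is None else r[attr]}
        for r in rows
    ]


# Overall mean of the observed incomes versus the per-class median.
overall = fill_central(customers, "income", measure=mean)
per_class = fill_by_class(customers, "income", "credit_risk", measure=median)
```

In this toy data the overall mean assigns the same income to both incomplete tuples, whereas the per-class fill assigns a different value to the low-risk and the high-risk customer, which is the motivation for method 5.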

Methods 3 through 6 bias the data: the filled-in value may not be correct. Method 6, however, is a popular strategy. In comparison to the other methods, it uses the most information from the present data to predict missing values. Because it considers the other attributes’ values when estimating the missing value for income, there is a greater chance that the relationships between income and the other attributes are preserved.
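As a rough sketch of method 6, the following Python code predicts a missing income from one other attribute with simple least-squares linear regression. The predictor `age`, the sample tuples, and the function names are assumptions made for illustration; a decision tree or a Bayesian model could be used in the same role.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept.

    Assumes the xs are not all identical (otherwise the slope is undefined).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def impute_by_regression(rows, target, predictor):
    """Fill missing target values using a line fitted on the complete tuples."""
    known = [r for r in rows if r[target] is not None]
    slope, intercept = fit_line(
        [r[predictor] for r in known], [r[target] for r in known]
    )
    return [
        {**r, target: slope * r[predictor] + intercept}
        if r[target] is None
        else dict(r)
        for r in rows
    ]


# Hypothetical tuples: income rises with age in the complete rows.
rows = [
    {"age": 25, "income": 30000},
    {"age": 35, "income": 40000},
    {"age": 45, "income": 50000},
    {"age": 40, "income": None},
]
filled = impute_by_regression(rows, "income", "age")
```

Unlike a global mean, the value filled in for the 40-year-old customer follows the trend between age and income seen in the complete tuples, so that relationship is preserved in the cleaned data.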

It is important to note that, in some cases, a missing value may not imply an error in the data! For example, when applying for a credit card, candidates may be asked to supply their driver’s license number. Candidates who do not have a driver’s license may naturally leave this field blank. Forms should allow respondents to specify values such as “not applicable.” Software routines may also be used to uncover other null values (e.g., “don’t know,” “?” or “none”). Ideally, each attribute should have one or more rules regarding the null condition. The rules may specify whether or not nulls are allowed and/or how such values should be handled or transformed. Fields may also be intentionally left blank if they are to be provided in a later step of the business process. Hence, although we can try our best to clean the data after it has been collected, good database and data entry procedure design should help minimize the number of missing values or errors in the first place.
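A software routine of the kind described above might look like the following Python sketch. The token list, the attribute names, and the rule table are hypothetical; the tokens “don’t know,” “?” and “none” come from the text, while the empty string and “n/a” are added assumptions.

```python
# Tokens treated as disguised nulls (the last two entries are assumptions).
NULL_TOKENS = {"don't know", "don’t know", "?", "none", "", "n/a"}
NOT_APPLICABLE = "not applicable"

# Hypothetical per-attribute null rules: is a null value allowed?
NULL_RULES = {
    "name": {"nullable": False},
    "drivers_license": {"nullable": True},
}


def normalize(value):
    """Map disguised nulls to None; keep 'not applicable' as its own marker."""
    if value is None:
        return None
    text = str(value).strip()
    lowered = text.lower()
    if lowered == NOT_APPLICABLE:
        return NOT_APPLICABLE
    if lowered in NULL_TOKENS:
        return None
    return text


def check_record(record):
    """Return the cleaned record and the attributes that violate a null rule."""
    cleaned = {key: normalize(value) for key, value in record.items()}
    violations = [
        key
        for key, rule in NULL_RULES.items()
        if not rule["nullable"] and cleaned.get(key) is None
    ]
    return cleaned, violations


cleaned, violations = check_record(
    {"name": " ? ", "drivers_license": "Not applicable"}
)
```

Keeping “not applicable” distinct from a true null lets later steps tell a legitimately empty field (no driver’s license) apart from a value that was never recorded.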

3.2.2 Noisy Data

“What is noise?” Noise is a random error or variance in a measured variable. In Chapter 2, we saw how some basic statistical description techniques (e.g., boxplots and scatter plots) and methods of data visualization can be used to identify outliers, which may represent noise. Given a numeric attribute such as price, how can we “smooth” out the data to remove the noise? Let’s look at the following data smoothing techniques.

Binning: Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the values around it. The sorted values are distributed into a number of “buckets,” or bins. Because binning methods consult the neighborhood of values, they perform local smoothing. Figure 3.2 illustrates some binning techniques. In this example, the data for price are first sorted and then partitioned into equal-frequency bins of size 3 (i.e., each bin contains three values). In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by the value 9.
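The partitioning and smoothing by bin means described above can be sketched as follows in Python. Only the Bin 1 values 4, 8, and 15 are taken from the text; the remaining prices and the function names are assumed for illustration.

```python
def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size.

    The last bin is smaller if the number of values is not a multiple
    of bin_size.
    """
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_bin_means(bins):
    """Replace every value in a bin with the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


# Bin 1 (4, 8, 15) is from the text; the other six prices are assumed.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
smoothed = smooth_by_bin_means(bins)
```

With these values, every entry of the first bin becomes 9.0, matching the worked example in the text.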

Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value. In general, the larger the width, the
