HAN
10-ch03-083-124-9780123814791
2011/6/1
3:16
Page 88
#6
88
Chapter 3 Data Preprocessing
3.2
Data Cleaning
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data
cleansing) routines attempt to fill in missing values, smooth out noise while identi-
fying outliers, and correct inconsistencies in the data. In this section, you will study
basic methods for data cleaning. Section 3.2.1 looks at ways of handling missing values.
Section 3.2.2 explains data smoothing techniques. Section 3.2.3 discusses approaches to
data cleaning as a process.
3.2.1
Missing Values
Imagine that you need to analyze AllElectronics sales and customer data. You note that
many tuples have no recorded value for several attributes such as customer income. How
can you go about filling in the missing values for this attribute? Let’s look at the following
methods.
1.
Ignore the tuple: This is usually done when the class label is missing (assuming the
mining task involves classification). This method is not very effective, unless the tuple
contains several attributes with missing values. It is especially poor when the percent-
age of missing values per attribute varies considerably. By ignoring the tuple, we do
not make use of the remaining attributes’ values in the tuple. Such data could have
been useful to the task at hand.
2.
Fill in the missing value manually: In general, this approach is time consuming and
may not be feasible given a large data set with many missing values.
3.
Use a global constant to fill in the missing value: Replace all missing attribute values
by the same constant such as a label like “Unknown” or −∞. If missing values are
replaced by, say, “Unknown,” then the mining program may mistakenly think that
they form an interesting concept, since they all have a value in common—that of
“Unknown.” Hence, although this method is simple, it is not foolproof.
4.
Use a measure of central tendency for the attribute (e.g., the mean or median) to
fill in the missing value: Chapter 2 discussed measures of central tendency, which
indicate the “middle” value of a data distribution. For normal (symmetric) data dis-
tributions, the mean can be used, while skewed data distribution should employ
the median (Section 2.2). For example, suppose that the data distribution regard-
ing the income of AllElectronics customers is symmetric and that the mean income is
$56,000. Use this value to replace the missing value for income.
5.
Use the attribute mean or median for all samples belonging to the same class as
the given tuple: For example, if classifying customers according to
credit risk, we
may replace the missing value with the mean income value for customers in the same
credit risk category as that of the given tuple. If the data distribution for a given class
is skewed, the median value is a better choice.
6.
Use the most probable value to fill in the missing value: This may be determined
with regression, inference-based tools using a Bayesian formalism, or decision tree
HAN
10-ch03-083-124-9780123814791
2011/6/1
3:16
Page 89
#7
3.2 Data Cleaning
89
induction. For example, using the other customer attributes in your data set, you
may construct a decision tree to predict the missing values for income. Decision trees
and Bayesian inference are described in detail in Chapters 8 and 9, respectively, while
regression is introduced in Section 3.4.5.
Methods 3 through 6 bias the data—the filled-in value may not be correct. Method 6,
however, is a popular strategy. In comparison to the other methods, it uses the most
information from the present data to predict missing values. By considering the other
attributes’ values in its estimation of the missing value for income, there is a greater
chance that the relationships between income and the other attributes are preserved.
It is important to note that, in some cases, a missing value may not imply an error
in the data! For example, when applying for a credit card, candidates may be asked to
supply their driver’s license number. Candidates who do not have a driver’s license may
naturally leave this field blank. Forms should allow respondents to specify values such
as “not applicable.” Software routines may also be used to uncover other null values
(e.g., “don’t know,” “?” or “none”). Ideally, each attribute should have one or more rules
regarding the null condition. The rules may specify whether or not nulls are allowed
and/or how such values should be handled or transformed. Fields may also be inten-
tionally left blank if they are to be provided in a later step of the business process. Hence,
although we can try our best to clean the data after it is seized, good database and data
entry procedure design should help minimize the number of missing values or errors in
the first place.
3.2.2
Noisy Data
“What is noise?” Noise is a random error or variance in a measured variable. In
Chapter 2, we saw how some basic statistical description techniques (e.g., boxplots
and scatter plots), and methods of data visualization can be used to identify outliers,
which may represent noise. Given a numeric attribute such as, say, price, how can we
“smooth” out the data to remove the noise? Let’s look at the following data smoothing
techniques.
Binning: Binning methods smooth a sorted data value by consulting its “neighbor-
hood,” that is, the values around it. The sorted values are distributed into a number
of “buckets,” or bins. Because binning methods consult the neighborhood of values,
they perform local smoothing. Figure 3.2 illustrates some binning techniques. In this
example, the data for price are first sorted and then partitioned into equal-frequency
bins of size 3 (i.e., each bin contains three values). In smoothing by bin means, each
value in a bin is replaced by the mean value of the bin. For example, the mean of the
values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced
by the value 9.
Similarly, smoothing by bin medians can be employed, in which each bin value
is replaced by the bin median. In smoothing by bin boundaries, the minimum and
maximum values in a given bin are identified as the bin boundaries. Each bin value
is then replaced by the closest boundary value. In general, the larger the width, the