Data Mining. Concepts and Techniques, 3rd Edition


HAN 10-ch03-083-124-9780123814791 2011/6/1 3:16 Page 90 #8

Chapter 3 Data Preprocessing

Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29

Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34

Figure 3.2 Binning methods for data smoothing.

greater the effect of the smoothing. Alternatively, bins may be equal width, where the interval range of values in each bin is constant. Binning is also used as a discretization technique and is further discussed in Section 3.5.
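
The binning results in Figure 3.2 can be reproduced with a short Python sketch. The function names are illustrative, not taken from the text, and the sketch assumes the number of values divides evenly into the number of bins. The figure does not say how ties between the two bin boundaries are resolved, so sending them to the lower boundary is an assumption here.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size."""
    ordered = sorted(values)
    size = len(ordered) // n_bins  # assumes len(values) is divisible by n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value in a bin with the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        # Ties go to the lower boundary (an arbitrary choice in this sketch).
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```

For the price data of Figure 3.2 the three printed lists match the figure's partition, bin means, and bin boundaries.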

Regression: Data smoothing can also be done by regression, a technique that conforms data values to a function. Linear regression involves finding the “best” line to fit two attributes (or variables) so that one attribute can be used to predict the other. Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface. Regression is further described in Section 3.4.5.
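
As a rough illustration of smoothing by linear regression, the sketch below fits a least-squares line to two attributes and replaces the observed values of the second attribute with the values on the line. Taking least squares as the notion of “best” is an assumption of this sketch, and the function names and sample data are made up for the example.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)  # must be nonzero: xs not all equal
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_with_line(xs, ys):
    """Replace each y with the value predicted from x by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Hypothetical noisy measurements that roughly follow y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
print(fit_line(xs, ys))
print(smooth_with_line(xs, ys))
```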

Outlier analysis: Outliers may be detected by clustering, for example, where similar values are organized into groups, or “clusters.” Intuitively, values that fall outside of the set of clusters may be considered outliers (Figure 3.3). Chapter 12 is dedicated to the topic of outlier analysis.
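
The intuition that outliers lie outside every cluster can be sketched without a full clustering algorithm. The example below uses a simple neighbor count as a stand-in: a point with too few other points within a given radius is treated as belonging to no cluster. This is a simplification chosen for illustration, not the method of the text, and the radius, threshold, and sample coordinates are assumptions.

```python
import math


def flag_isolated_points(points, radius, min_neighbors):
    """Return the points that have fewer than min_neighbors other points
    within the given radius (a crude proxy for 'outside all clusters')."""
    outliers = []
    for i, p in enumerate(points):
        neighbors = sum(
            1
            for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= radius
        )
        if neighbors < min_neighbors:
            outliers.append(p)
    return outliers


# Hypothetical 2-D customer locations: two tight groups and one stray point.
locations = [
    (1.0, 1.0), (1.2, 0.9), (0.8, 1.1),
    (5.0, 5.0), (5.1, 4.9), (4.9, 5.2),
    (9.0, 0.0),
]
print(flag_isolated_points(locations, radius=1.0, min_neighbors=1))
```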

Many data smoothing methods are also used for data discretization (a form of data transformation) and data reduction. For example, the binning techniques described before reduce the number of distinct values per attribute. This acts as a form of data reduction for logic-based data mining methods, such as decision tree induction, which repeatedly makes value comparisons on sorted data. Concept hierarchies are a form of data discretization that can also be used for data smoothing. A concept hierarchy for price, for example, may map real price values into inexpensive, moderately priced, and expensive, thereby reducing the number of data values to be handled by the mining


3.2 Data Cleaning


Figure 3.3 A 2-D customer data plot with respect to customer locations in a city, showing three data clusters. Outliers may be detected as values that fall outside of the cluster sets.

process. Data discretization is discussed in Section 3.5. Some methods of classification (e.g., neural networks) have built-in data smoothing mechanisms. Classification is the topic of Chapters 8 and 9.
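
A concept hierarchy for price, as mentioned above, can be sketched as a mapping from numeric values to a small set of labels. The three labels come from the text; the cutoff values are purely hypothetical and would in practice be chosen by a domain expert or derived from the data.

```python
def price_concept(price, low_cutoff=50.0, high_cutoff=200.0):
    """Map a numeric price to a higher-level concept.

    The cutoffs are illustrative assumptions, not values from the text.
    """
    if price < low_cutoff:
        return "inexpensive"
    if price < high_cutoff:
        return "moderately priced"
    return "expensive"


prices = [12.99, 75.00, 149.50, 420.00]
print([price_concept(p) for p in prices])
```

After the mapping, the mining process only has to handle three distinct values instead of every distinct price.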

3.2.3 Data Cleaning as a Process

Missing values, noise, and inconsistencies contribute to inaccurate data. So far, we have looked at techniques for handling missing data and for smoothing data. “But data cleaning is a big job. What about data cleaning as a process? How exactly does one proceed in tackling this task? Are there any tools out there to help?”

The first step in data cleaning as a process is discrepancy detection. Discrepancies can be caused by several factors, including poorly designed data entry forms that have many optional fields, human error in data entry, deliberate errors (e.g., respondents not wanting to divulge information about themselves), and data decay (e.g., outdated addresses). Discrepancies may also arise from inconsistent data representations and inconsistent use of codes. Other sources of discrepancies include errors in instrumentation devices that record data and system errors. Errors can also occur when the data are (inadequately) used for purposes other than originally intended. There may also be inconsistencies due to data integration (e.g., where a given attribute can have different names in different databases). Data integration and the removal of redundant data that can result from such integration are further described in Section 3.3.


“So, how can we proceed with discrepancy detection?” As a starting point, use any knowledge you may already have regarding properties of the data. Such knowledge or “data about data” is referred to as metadata. This is where we can make use of the knowledge we gained about our data in Chapter 2. For example, what are the data type and domain of each attribute? What are the acceptable values for each attribute? The basic statistical data descriptions discussed in Section 2.2 are useful here to grasp data trends and identify anomalies. For example, find the mean, median, and mode values. Are the data symmetric or skewed? What is the range of values? Do all values fall within the expected range? What is the standard deviation of each attribute? Values that are more than two standard deviations away from the mean for a given attribute may be flagged as potential outliers. Are there any known dependencies between attributes? In this step, you may write your own scripts and/or use some of the tools that we discuss further later. From this, you may find noise, outliers, and unusual values that need investigation.
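
A script for this step might look like the following sketch, which computes the basic descriptions named above and flags values more than two standard deviations from the mean. It relies only on Python's standard `statistics` module. Using the population standard deviation (`pstdev`) rather than the sample version is an assumption, as are the function names and the sample data.

```python
import statistics


def describe(values):
    """Return basic statistical descriptions of a numeric attribute."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "min": min(values),
        "max": max(values),
        "stdev": statistics.pstdev(values),  # population standard deviation
    }


def flag_potential_outliers(values, n_std=2.0):
    """Return values lying more than n_std standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values identical, nothing to flag
    return [v for v in values if abs(v - mean) > n_std * std]


# Hypothetical attribute with one suspicious entry.
ages = [10, 12, 11, 13, 12, 11, 12, 95]
print(describe(ages))
print(flag_potential_outliers(ages))
```

A flagged value is only a candidate for investigation. Note that an extreme value inflates both the mean and the standard deviation, so this rule can miss outliers in small samples.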

As a data analyst, you should be on the lookout for the inconsistent use of codes and any inconsistent data representations (e.g., “2010/12/25” and “25/12/2010” for date). Field overloading is another error source that typically results when developers squeeze new attribute definitions into unused (bit) portions of already defined attributes (e.g., an unused bit of an attribute that has a value range that uses only, say, 31 out of 32 bits).
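
One way to look for inconsistent date representations is to classify each value by its layout and count how many layouts occur in a column. The sketch below covers only the two layouts from the example above; the pattern names and helper functions are illustrative assumptions. A layout check alone cannot tell day/month/year from month/day/year, so mixed use of those two would still need semantic checks.

```python
import re
from collections import Counter

# Layouts taken from the example in the text; the labels are illustrative.
DATE_PATTERNS = {
    "YYYY/MM/DD": re.compile(r"\d{4}/\d{2}/\d{2}"),
    "DD/MM/YYYY": re.compile(r"\d{2}/\d{2}/\d{4}"),
}


def classify_date_format(text):
    """Return the name of the first layout the string fully matches."""
    for name, pattern in DATE_PATTERNS.items():
        if pattern.fullmatch(text):
            return name
    return "unrecognized"


def format_counts(values):
    """Count how often each layout occurs in a column of date strings."""
    return Counter(classify_date_format(v) for v in values)


dates = ["2010/12/25", "25/12/2010", "2011/01/03", "Dec 25, 2010"]
counts = format_counts(dates)
print(counts)
if len(counts) > 1:
    print("Inconsistent date representations found")
```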

The data should also be examined regarding unique rules, consecutive rules, and null rules. A unique rule says that each value of the given attribute must be different from all other values for that attribute. A consecutive rule says that there can be no missing values between the lowest and highest values for the attribute, and that all values must also be unique (e.g., as in check numbers). A null rule specifies the use of blanks, question marks, special characters, or other strings that may indicate the null condition (e.g., where a value for a given attribute is not available), and how such values should be handled. As mentioned in Section 3.2.1, reasons for missing values may include (1) the person originally asked to provide a value for the attribute refuses and/or finds that the information requested is not applicable (e.g., a license number attribute left blank by nondrivers); (2) the data entry person does not know the correct value; or (3) the value is to be provided by a later step of the process. The null rule should specify how to record the null condition, for example, by storing zero for numeric attributes, a blank for character attributes, or any other conventions that may be in use (e.g., entries like “don’t know” or “?” should be transformed to blank).
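
The three kinds of rules can be checked with a few lines of code. The sketch below is one possible reading of them for integer-valued attributes such as check numbers. The set of null markers and the choice of a blank as the recorded null value follow the examples in the text where possible; the "n/a" marker and all function names are hypothetical.

```python
# Strings treated as indicating the null condition; "n/a" is a hypothetical addition.
NULL_MARKERS = {"", "?", "don't know", "n/a"}


def violates_unique_rule(values):
    """True if any value of the attribute occurs more than once."""
    return len(set(values)) != len(values)


def violates_consecutive_rule(values):
    """True if integer values are not unique or leave gaps between min and max."""
    if not values:
        return False
    expected_count = max(values) - min(values) + 1
    return violates_unique_rule(values) or len(values) != expected_count


def normalize_nulls(values, null_value=""):
    """Replace every null marker (or None) with one agreed null value."""
    return [
        null_value
        if v is None or str(v).strip().lower() in NULL_MARKERS
        else v
        for v in values
    ]


check_numbers = [101, 102, 104, 105]
print(violates_unique_rule(check_numbers))       # no duplicates
print(violates_consecutive_rule(check_numbers))  # 103 is missing
print(normalize_nulls(["Smith", "?", "Don't know", " ", None]))
```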

There are a number of different commercial tools that can aid in the discrepancy detection step. Data scrubbing tools use simple domain knowledge (e.g., knowledge of postal addresses and spell-checking) to detect errors and make corrections in the data. These tools rely on parsing and fuzzy matching techniques when cleaning data from multiple sources. Data auditing tools find discrepancies by analyzing the data to discover rules and relationships, and detecting data that violate such conditions. They are variants of data mining tools. For example, they may employ statistical analysis to find correlations, or clustering to identify outliers. They may also use the basic statistical data descriptions presented in Section 2.2.

Some data inconsistencies may be corrected manually using external references. For example, errors made at data entry may be corrected by performing a paper
