Data Mining. Concepts and Techniques, 3rd Edition

HAN 10-ch03-083-124-9780123814791

HAN

10-ch03-083-124-9780123814791

2011/6/1

3:16

Page 84

#2

84

Chapter 3 Data Preprocessing

3.1 Data Preprocessing: An Overview

This section presents an overview of data preprocessing. Section 3.1.1 illustrates the

many elements defining data quality. This provides the incentive behind data

preprocessing. Section 3.1.2 outlines the major tasks in data preprocessing.

3.1.1 Data Quality: Why Preprocess the Data?

Data have quality if they satisfy the requirements of the intended use. There are many

factors comprising data quality, including accuracy, completeness, consistency, timeliness,

believability, and interpretability.

Imagine that you are a manager at AllElectronics and have been charged with

analyzing the company’s data with respect to your branch’s sales. You immediately set out

to perform this task. You carefully inspect the company’s database and data warehouse,

identifying and selecting the attributes or dimensions (e.g., item, price, and units sold)

to be included in your analysis. Alas! You notice that several of the attributes for various

tuples have no recorded value. For your analysis, you would like to include

information as to whether each item purchased was advertised as on sale, yet you discover that

this information has not been recorded. Furthermore, users of your database system

have reported errors, unusual values, and inconsistencies in the data recorded for some

transactions. In other words, the data you wish to analyze by data mining techniques are

incomplete (lacking attribute values or certain attributes of interest, or containing only

aggregate data); inaccurate or noisy (containing errors, or values that deviate from the

expected); and inconsistent (e.g., containing discrepancies in the department codes used

to categorize items). Welcome to the real world!
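
The three problems named above can be counted with a simple audit pass. The following is a minimal sketch, not a method from the text: the records, the attribute names (item, price, units_sold, dept), the rule that a negative price is an error, and the list of valid department codes are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical sales records; None marks a value that was never recorded.
records = [
    {"item": "TV", "price": 499.0, "units_sold": 3, "dept": "ELEC"},
    {"item": "Radio", "price": None, "units_sold": 5, "dept": "elec"},
    {"item": "Phone", "price": -20.0, "units_sold": 2, "dept": "ELEC"},
    {"item": "Camera", "price": 250.0, "units_sold": None, "dept": "E-01"},
]


def audit(rows, known_depts):
    """Count rows that are incomplete, inaccurate, or inconsistent."""
    report = Counter()
    for row in rows:
        # Incomplete: at least one attribute has no recorded value.
        if any(value is None for value in row.values()):
            report["incomplete"] += 1
        # Inaccurate (assumed rule): a price cannot be negative.
        if row["price"] is not None and row["price"] < 0:
            report["inaccurate"] += 1
        # Inconsistent: department code not in the agreed code list.
        if row["dept"] not in known_depts:
            report["inconsistent"] += 1
    return report


report = audit(records, known_depts={"ELEC"})
print(dict(report))
```

A row can fall into more than one category, so the counts need not sum to the number of rows.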

This scenario illustrates three of the elements defining data quality: accuracy,

completeness, and consistency. Inaccurate, incomplete, and inconsistent data are

commonplace properties of large real-world databases and data warehouses. There are many

possible reasons for inaccurate data (i.e., having incorrect attribute values). The data

collection instruments used may be faulty. There may have been human or computer errors

occurring at data entry. Users may purposely submit incorrect data values for

mandatory fields when they do not wish to submit personal information (e.g., by choosing

the default value “January 1” displayed for birthday). This is known as disguised missing

data. Errors in data transmission can also occur. There may be technology limitations

such as limited buffer size for coordinating synchronized data transfer and

consumption. Incorrect data may also result from inconsistencies in naming conventions or data

codes, or inconsistent formats for input fields (e.g., date). Duplicate tuples also require

data cleaning.
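
Several of these causes can be checked mechanically. The sketch below is illustrative only: the customer rows, the two accepted date formats, and the helper names are assumptions. It normalizes inconsistent date formats, drops duplicate tuples, and measures how often the default birthday "January 1" appears, since an unusually high share suggests disguised missing data.

```python
from datetime import datetime

# Hypothetical customer rows: (customer_id, birthday as typed in).
rows = [
    (1, "1980-01-01"),
    (2, "01/01/1975"),
    (3, "1990-07-14"),
    (4, "2000-03-09"),
    (3, "07/14/1990"),  # same tuple as customer 3, different date format
]

FORMATS = ("%Y-%m-%d", "%m/%d/%Y")  # assumed input formats


def parse_date(text):
    """Return a date for any accepted format, or None if none matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def normalize_and_deduplicate(raw_rows):
    """Parse birthdays into one format and keep each tuple once."""
    seen = set()
    unique = []
    for cust_id, raw in raw_rows:
        key = (cust_id, parse_date(raw))
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def january_first_share(clean_rows):
    """Fraction of known birthdays that fall on January 1."""
    known = [d for _, d in clean_rows if d is not None]
    if not known:
        return 0.0
    return sum(1 for d in known if (d.month, d.day) == (1, 1)) / len(known)


clean = normalize_and_deduplicate(rows)
share = january_first_share(clean)
print(len(clean), share)
```

If birthdays were spread evenly over the year, roughly 1 in 365 would fall on January 1, so a share far above that is a hint that users accepted the default value rather than proof of it.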

Incomplete data can occur for a number of reasons. Attributes of interest may not

always be available, such as customer information for sales transaction data. Other data

may not be included simply because they were not considered important at the time

of entry. Relevant data may not be recorded due to a misunderstanding or because of

equipment malfunctions. Data that were inconsistent with other recorded data may

have been deleted. Furthermore, the recording of the data history or modifications may

have been overlooked. Missing data, particularly for tuples with missing values for some

attributes, may need to be inferred.
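
One simple way to infer a missing numeric value is to substitute the mean of the values that were recorded for that attribute. This is a minimal sketch of that single strategy, chosen here for illustration; the price list and the function name are hypothetical.

```python
from statistics import mean

# Hypothetical price attribute; None marks a missing value.
prices = [499.0, None, 250.0, None, 151.0]


def fill_with_mean(values):
    """Replace each missing value with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing to infer from; return the values unchanged.
        return list(values)
    fill = mean(observed)
    return [fill if v is None else v for v in values]


filled = fill_with_mean(prices)
print(filled)
```

Mean substitution keeps the attribute's average unchanged but reduces its variance, so it is a convenience rather than a neutral choice.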

Recall that data quality depends on the intended use of the data. Two different users

may have very different assessments of the quality of a given database. For example, a

marketing analyst may need to access the database mentioned before for a list of

customer addresses. Some of the addresses are outdated or incorrect, yet overall, 80% of

the addresses are accurate. The marketing analyst considers this to be a large customer

database for target marketing purposes and is pleased with the database’s accuracy,

although, as sales manager, you found the data inaccurate.

Timeliness also affects data quality. Suppose that you are overseeing the

distribution of monthly sales bonuses to the top sales representatives at AllElectronics. Several

sales representatives, however, fail to submit their sales records on time at the end of

the month. There are also a number of corrections and adjustments that flow in after

the month’s end. For a period of time following each month, the data stored in the

database are incomplete. However, once all of the data are received, they are correct. The fact

that the month-end data are not updated in a timely fashion has a negative impact on

the data quality.
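
The effect of late submissions can be made concrete by asking what fraction of a month's records is on hand as of a given date. The following sketch assumes hypothetical representatives, dates, and a hypothetical function name; it is one possible way to express the idea, not a procedure from the text.

```python
from datetime import date

# Hypothetical submissions: (sales rep, month end covered, date received).
submissions = [
    ("ana", date(2011, 5, 31), date(2011, 5, 31)),
    ("ben", date(2011, 5, 31), date(2011, 6, 3)),
    ("cho", date(2011, 5, 31), date(2011, 6, 10)),
]


def fraction_received(subs, as_of):
    """Fraction of the expected records that had arrived by as_of."""
    if not subs:
        return 0.0
    on_hand = sum(1 for _, _, received in subs if received <= as_of)
    return on_hand / len(subs)


at_month_end = fraction_received(submissions, date(2011, 5, 31))
a_week_later = fraction_received(submissions, date(2011, 6, 7))
fully_loaded = fraction_received(submissions, date(2011, 6, 10))
print(at_month_end, a_week_later, fully_loaded)
```

A bonus calculation run at month end would see only part of the data, even though every record is correct once it arrives.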

Two other factors affecting data quality are believability and interpretability.

Believability reflects how much the data are trusted by users, while interpretability reflects

how easily the data are understood. Suppose that a database, at one point, had several

errors, all of which have since been corrected. The past errors, however, had caused

many problems for sales department users, and so they no longer trust the data. The

data also use many accounting codes, which the sales department does not know how to

interpret. Even though the database is now accurate, complete, consistent, and timely,

sales department users may regard it as of low quality due to poor believability and

interpretability.

3.1.2 Major Tasks in Data Preprocessing

In this section, we look at the major steps involved in data preprocessing, namely, data

cleaning, data integration, data reduction, and data transformation.

Data cleaning routines work to “clean” the data by filling in missing values,

smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. If users

believe the data are dirty, they are unlikely to trust the results of any data mining that has

been applied. Furthermore, dirty data can cause confusion for the mining procedure,

resulting in unreliable output. Although most mining routines have some procedures

for dealing with incomplete or noisy data, they are not always robust. Instead, they may

concentrate on avoiding overfitting the data to the function being modeled. Therefore,

a useful preprocessing step is to run your data through some data cleaning routines.

Section 3.2 discusses methods for data cleaning.
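
As a rough illustration of two of these cleaning tasks, the sketch below first identifies outliers and then smooths the remaining values. The techniques are assumptions picked for the example: outliers are flagged with the common 1.5 x IQR rule, and smoothing replaces each value with the mean of its equal-frequency bin. The price values and function names are hypothetical.

```python
from statistics import mean, quantiles

# Hypothetical price values with one suspicious entry.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34, 200]


def split_outliers(values, k=1.5):
    """Separate values outside [Q1 - k*IQR, Q3 + k*IQR] from the rest."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    kept = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return kept, outliers


def smooth_by_bin_means(values, bin_size):
    """Sort the values, cut them into bins, and replace each by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


kept, outliers = split_outliers(prices)
smoothed = smooth_by_bin_means(kept, bin_size=3)
print(outliers)
print(smoothed)
```

Whether a flagged value is removed, corrected, or kept is a judgment that depends on the intended use of the data; the code only separates the candidates.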

Getting back to your task at AllElectronics, suppose that you would like to include

data from multiple sources in your analysis. This would involve integrating multiple

databases, data cubes, or files (i.e., data integration). Yet some attributes representing a
