Data Mining. Concepts and Techniques, 3rd Edition

HAN 10-ch03-083-124-9780123814791

trace. Most errors, however, will require data transformations. That is, once we ﬁnd

discrepancies, we typically need to deﬁne and apply (a series of) transformations to

correct them.

Commercial tools can assist in the data transformation step. Data migration tools

allow simple transformations to be specified, such as replacing the string “gender” with

“sex.” ETL (extraction/transformation/loading) tools allow users to specify transforms

through a graphical user interface (GUI). These tools typically support only a restricted

set of transforms, so we may often choose to write custom scripts for this step

of the data cleaning process.
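
A custom script for the simple renaming mentioned above might look like the following minimal sketch. It assumes records are held as Python dictionaries; the function name `rename_field` and the sample rows are hypothetical and chosen only for illustration.

```python
def rename_field(records, old, new):
    """Return new records with the key `old` renamed to `new`.

    Other keys are left untouched and the input records are not modified.
    """
    return [
        {(new if key == old else key): value for key, value in record.items()}
        for record in records
    ]


# Hypothetical sample data.
rows = [{"name": "Ann", "gender": "F"}, {"name": "Bob", "gender": "M"}]
renamed = rename_field(rows, "gender", "sex")
print(renamed)
```

Returning new records rather than editing in place keeps the original data available, which makes it easier to compare the view before and after the transformation.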

The two-step process of discrepancy detection and data transformation (to correct

discrepancies) iterates. This process, however, is error-prone and time-consuming. Some

transformations may introduce more discrepancies. Some nested discrepancies may only

be detected after others have been ﬁxed. For example, a typo such as “20010” in a year

ﬁeld may only surface once all date values have been converted to a uniform format.

Transformations are often done as a batch process while the user waits without feedback.

Only after the transformation is complete can the user go back and check that no new

anomalies have been mistakenly created. Typically, numerous iterations are required

before the user is satisﬁed. Any tuples that cannot be automatically handled by a given

transformation are typically written to a file without any explanation regarding the

reasoning behind their failure. As a result, the entire data cleaning process also suffers from

a lack of interactivity.
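
The nested discrepancy in the year example can be sketched as follows. This is an illustration under stated assumptions, not a prescribed method: the two input date formats, the plausible year range 1900 to 2100, and the function names are all hypothetical. Unlike the batch tools described above, the sketch records a reason for every value it cannot handle.

```python
import re

# Hypothetical input formats, assumed for this sketch only.
PATTERNS = [
    re.compile(r"^(?P<year>\d+)-(?P<month>\d{1,2})-(?P<day>\d{1,2})$"),  # year-month-day
    re.compile(r"^(?P<month>\d{1,2})/(?P<day>\d{1,2})/(?P<year>\d+)$"),  # month/day/year
]


def to_uniform(value):
    """Transformation step: rewrite a date string as a (year, month, day) tuple.

    Returns (result, None) on success or (None, reason) on failure.
    """
    for pattern in PATTERNS:
        match = pattern.match(value.strip())
        if match:
            parts = tuple(int(match.group(k)) for k in ("year", "month", "day"))
            return parts, None
    return None, f"unrecognized date format: {value!r}"


def find_year_discrepancies(dates, low=1900, high=2100):
    """Detection step on the transformed view: flag out-of-range years."""
    return [d for d in dates if not low <= d[0] <= high]


raw = ["2010-03-15", "04/02/20010", "15 March 2010"]
converted, rejected = [], []
for value in raw:
    result, reason = to_uniform(value)
    if result is not None:
        converted.append(result)
    else:
        rejected.append((value, reason))

# The typo "20010" is only visible once every date has the same layout.
suspicious = find_year_discrepancies(converted)
print(suspicious)
print(rejected)
```

In practice the two steps would be repeated: each rejected or suspicious value suggests a further transformation, which in turn has to be checked again.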

New approaches to data cleaning emphasize increased interactivity. Potter’s Wheel,

for example, is a publicly available data cleaning tool that integrates discrepancy detection

and transformation. Users gradually build a series of transformations by composing

and debugging individual transformations, one step at a time, on a spreadsheet-like

interface. The transformations can be speciﬁed graphically or by providing examples.

Results are shown immediately on the records that are visible on the screen. The user

can choose to undo the transformations, so that transformations that introduced additional

errors can be “erased.” The tool automatically performs discrepancy checking in

the background on the latest transformed view of the data. Users can gradually develop

and reﬁne transformations as discrepancies are found, leading to more effective and

efﬁcient data cleaning.
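
The interaction style described above (build transformations one step at a time, check for discrepancies after each step, undo a step that made things worse) can be sketched roughly as below. This is not Potter's Wheel's implementation or interface; the class `TransformSession`, the rule `is_clean`, and the sample values are hypothetical.

```python
class TransformSession:
    """Keeps the original rows plus an ordered list of transformations."""

    def __init__(self, rows, check):
        self._rows = list(rows)
        self._steps = []
        self._check = check  # discrepancy detector run on every new view

    def view(self):
        """Recompute the current view by replaying all steps on the original rows."""
        rows = self._rows
        for step in self._steps:
            rows = [step(row) for row in rows]
        return rows

    def apply(self, step):
        """Add one transformation and return the discrepancies in the new view."""
        self._steps.append(step)
        return self.discrepancies()

    def undo(self):
        """Drop the most recent transformation, if any."""
        if self._steps:
            self._steps.pop()
        return self.discrepancies()

    def discrepancies(self):
        return [row for row in self.view() if not self._check(row)]


def is_clean(value):
    # Hypothetical rule: a name is clean if it is non-empty and has no outer spaces.
    return bool(value) and value == value.strip()


session = TransformSession([" Ann ", "Bob", "  "], is_clean)
before = session.discrepancies()              # padded and blank values are flagged
after_strip = session.apply(str.strip)        # stripping exposes the empty value
after_bad = session.apply(lambda s: s + " ")  # a step that introduces new errors
after_undo = session.undo()                   # the bad step is "erased"
print(before, after_strip, after_bad, after_undo)
```

Because the original rows are never overwritten, undoing a step only means removing it from the list and replaying the rest.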

Another approach to increased interactivity in data cleaning is the development of

declarative languages for the speciﬁcation of data transformation operators. Such work

focuses on deﬁning powerful extensions to SQL and algorithms that enable users to

express data cleaning speciﬁcations efﬁciently.

As we discover more about the data, it is important to keep updating the metadata

to reﬂect this knowledge. This will help speed up data cleaning on future versions of the

same data store.

3.3 Data Integration

Data mining often requires data integration—the merging of data from multiple data

stores. Careful integration can help reduce and avoid redundancies and inconsistencies

in the resulting data set. This can help improve the accuracy and speed of the subsequent

data mining process.

The semantic heterogeneity and structure of data pose great challenges in data integration.

How can we match schema and objects from different sources? This is the

essence of the entity identiﬁcation problem, described in Section 3.3.1. Are any attributes

correlated? Section 3.3.2 presents correlation tests for numeric and nominal data. Tuple

duplication is described in Section 3.3.3. Finally, Section 3.3.4 touches on the detection

and resolution of data value conﬂicts.

3.3.1 Entity Identification Problem

It is likely that your data analysis task will involve data integration, which combines data

from multiple sources into a coherent data store, as in data warehousing. These sources

may include multiple databases, data cubes, or ﬂat ﬁles.

There are a number of issues to consider during data integration. Schema integration

and object matching can be tricky. How can equivalent real-world entities from multiple

data sources be matched up? This is referred to as the entity identiﬁcation problem.

For example, how can the data analyst or the computer be sure that customer_id in one

database and cust_number in another refer to the same attribute? Examples of metadata

for each attribute include the name, meaning, data type, and range of values permitted

for the attribute, and null rules for handling blank, zero, or null values (Section 3.2).

Such metadata can be used to help avoid errors in schema integration. The metadata

may also be used to help transform the data (e.g., where data codes for pay_type in one

database may be “H” and “S” but 1 and 2 in another). Hence, this step also relates to

data cleaning, as described earlier.
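
As a minimal sketch of how such metadata might drive the transformation, the following code renames attributes and translates pay_type codes from one source into the conventions of another. The attribute map, the pairing of 1 with "H" and 2 with "S", and the sample record are assumptions made for illustration; the text does not say which codes correspond.

```python
# Hypothetical metadata: attribute names in source B mapped to names in source A.
ATTRIBUTE_MAP = {"cust_number": "customer_id", "pay_type": "pay_type"}

# Assumed correspondence between the pay_type codes of the two sources.
PAY_TYPE_B_TO_A = {1: "H", 2: "S"}


def to_schema_a(record_b):
    """Rewrite one record from source B using source A's names and codes."""
    record_a = {}
    for name_b, value in record_b.items():
        name_a = ATTRIBUTE_MAP[name_b]
        if name_a == "pay_type":
            if value not in PAY_TYPE_B_TO_A:
                # Report why the record cannot be converted instead of guessing.
                raise ValueError(f"unknown pay_type code: {value!r}")
            value = PAY_TYPE_B_TO_A[value]
        record_a[name_a] = value
    return record_a


merged = to_schema_a({"cust_number": 1042, "pay_type": 2})
print(merged)
```

Keeping the mappings in data rather than in code means they can be updated as more is learned about the sources.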

When matching attributes from one database to another during integration, special

attention must be paid to the structure of the data. This is to ensure that any attribute

functional dependencies and referential constraints in the source system match those in

the target system. For example, in one system, a discount may be applied to the order,

whereas in another system it is applied to each individual line item within the order.

If this is not caught before integration, items in the target system may be improperly

discounted.
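
A small worked example shows how the two interpretations diverge. It assumes a fixed-amount discount and made-up prices; with these numbers, treating an order-level discount as a line-item discount takes too much off the total.

```python
def total_order_level(line_prices, discount):
    """Discount subtracted once from the whole order."""
    return sum(line_prices) - discount


def total_line_level(line_prices, discount):
    """Discount subtracted from each individual line item."""
    return sum(price - discount for price in line_prices)


# Hypothetical order with two line items and a discount of 10 (same currency unit).
prices = [100, 50]
discount = 10

order_total = total_order_level(prices, discount)  # 150 - 10
line_total = total_line_level(prices, discount)    # (100 - 10) + (50 - 10)
print(order_total, line_total)
```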

3.3.2 Redundancy and Correlation Analysis

Redundancy is another important issue in data integration. An attribute (such as annual

revenue, for instance) may be redundant if it can be “derived” from another attribute

or set of attributes. Inconsistencies in attribute or dimension naming can also cause

redundancies in the resulting data set.

Some redundancies can be detected by correlation analysis. Given two attributes,

such analysis can measure how strongly one attribute implies the other, based on the

available data. For nominal data, we use the χ² (chi-square) test. For numeric attributes,

we can use the correlation coefficient and covariance, both of which assess how one

attribute’s values vary from those of another.
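
These measures can be computed with a few lines of code. The sketch below uses the standard definitions of the Pearson chi-square statistic, the population covariance (dividing by the number of tuples), and the Pearson correlation coefficient. The sample table and value lists are made up for illustration, and the code assumes every row and column total is nonzero and that neither numeric attribute is constant.

```python
from math import sqrt


def chi_square(observed):
    """Pearson chi-square statistic for a contingency table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count if the two attributes were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (count - expected) ** 2 / expected
    return stat


def covariance(a, b):
    """Population covariance of two equal-length lists of numbers."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


def correlation(a, b):
    """Pearson correlation coefficient: covariance scaled by both standard deviations."""
    return covariance(a, b) / sqrt(covariance(a, a) * covariance(b, b))


# Hypothetical 2 x 2 contingency table for two nominal attributes.
table = [[20, 30], [30, 20]]
print(chi_square(table))

# Hypothetical numeric attributes; the second is exactly twice the first.
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
print(covariance(x, y), correlation(x, y))
```

A correlation close to 1 or -1, as for `x` and `y` here, suggests that one of the two attributes may be redundant. Whether a chi-square value indicates dependence is judged against the chi-square distribution with (rows - 1) × (columns - 1) degrees of freedom.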
