1. Stepwise forward selection: The procedure starts with an empty set of attributes as the reduced set. The best of the original attributes is determined and added to the reduced set. At each subsequent iteration or step, the best of the remaining original attributes is added to the set (a code sketch of this greedy procedure follows this discussion).
2. Stepwise backward elimination: The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination: The stepwise forward selection and backward elimination methods can be combined so that, at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
4. Decision tree induction: Decision tree algorithms (e.g., ID3, C4.5, and CART) were originally intended for classification. Decision tree induction constructs a flowchart-like structure where each internal (nonleaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction. At each node, the algorithm chooses the "best" attribute to partition the data into individual classes.

When decision tree induction is used for attribute subset selection, a tree is constructed from the given data. All attributes that do not appear in the tree are assumed to be irrelevant. The set of attributes appearing in the tree forms the reduced subset of attributes.
The stopping criteria for the methods may vary. The procedure may employ a threshold
on the measure used to determine when to stop the attribute selection process.
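To make the greedy procedures above concrete, the following is a minimal sketch (ours, not from the text) of stepwise forward selection. The scoring function evaluate_subset is a hypothetical placeholder for whatever attribute-quality measure is used (e.g., information gain or cross-validated accuracy), and the stopping threshold min_improvement is likewise illustrative:

```python
def forward_selection(attributes, evaluate_subset, min_improvement=0.0):
    """Greedy stepwise forward selection.

    attributes      : list of candidate attribute names
    evaluate_subset : hypothetical scoring function; higher is better
    min_improvement : threshold used as the stopping criterion
    """
    selected = []                          # start with an empty reduced set
    best_score = evaluate_subset(selected)
    while True:
        remaining = [a for a in attributes if a not in selected]
        if not remaining:
            break
        # Try adding each remaining attribute and keep the best one.
        scored = [(evaluate_subset(selected + [a]), a) for a in remaining]
        score, best_attr = max(scored)
        if score - best_score <= min_improvement:
            break                          # no attribute improves the measure enough
        selected.append(best_attr)
        best_score = score
    return selected
```

Stepwise backward elimination is the mirror image of this loop: it starts from the full attribute set and repeatedly drops the attribute whose removal degrades the measure the least.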
In some cases, we may want to create new attributes based on others. Such attribute construction (known in the machine learning literature as feature construction) can help improve accuracy and understanding of structure in high-dimensional data. For example, we may wish to add the attribute area based on the attributes height and width. By combining attributes, attribute construction can discover missing information about the relationships between data attributes that can be useful for knowledge discovery.
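As a small illustration (ours, not the book's), constructing area from height and width in a tabular data set might look like the following; the pandas DataFrame and the sample values are assumptions for the example:

```python
import pandas as pd

# Hypothetical data with height and width attributes.
df = pd.DataFrame({"height": [2.0, 3.5, 1.2],
                   "width":  [4.0, 2.0, 5.5]})

# Attribute construction: derive a new attribute from existing ones.
df["area"] = df["height"] * df["width"]
print(df)
```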
3.4.5 Regression and Log-Linear Models: Parametric Data Reduction
Regression and log-linear models can be used to approximate the given data. In (simple) linear regression, the data are modeled to fit a straight line. For example, a random variable, y (called a response variable), can be modeled as a linear function of another random variable, x (called a predictor variable), with the equation

y = wx + b,    (3.7)

where the variance of y is assumed to be constant. In the context of data mining, x and y are numeric database attributes. The coefficients, w and b (called regression coefficients),
specify the slope of the line and the y-intercept, respectively. These coefficients can be solved for by the method of least squares, which minimizes the error between the actual data and the estimate of the line. Multiple linear regression is an extension of (simple) linear regression, which allows a response variable, y, to be modeled as a linear function of two or more predictor variables.
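A minimal sketch (ours) of fitting Equation (3.7) by least squares; the example values and the use of NumPy's polyfit are assumptions for illustration:

```python
import numpy as np

# Hypothetical numeric attributes: predictor x and response y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares estimates of the regression coefficients in y = wx + b.
w, b = np.polyfit(x, y, deg=1)      # slope and y-intercept
y_hat = w * x + b                   # the reduced (parametric) representation
print(f"w = {w:.3f}, b = {b:.3f}")
```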
Log-linear models approximate discrete multidimensional probability distributions. Given a set of tuples in n dimensions (e.g., described by n attributes), we can consider each tuple as a point in an n-dimensional space. Log-linear models can be used to estimate the probability of each point in a multidimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations. This allows a higher-dimensional data space to be constructed from lower-dimensional spaces. Log-linear models are therefore also useful for dimensionality reduction (since the lower-dimensional points together typically occupy less space than the original data points) and data smoothing (since aggregate estimates in the lower-dimensional space are less subject to sampling variations than the estimates in the higher-dimensional space).
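The text does not give a fitting algorithm here, but one standard way to estimate such a model is iterative proportional fitting. The sketch below (function name, NumPy usage, and example table are ours) reconstructs a three-dimensional table of cell probabilities from its two-dimensional margins only, i.e., the log-linear model with all two-way interactions, assuming strictly positive margins:

```python
import numpy as np

def fit_loglinear_two_way(counts, n_iter=50):
    """Iterative proportional fitting of the log-linear model with all
    two-way interactions for a 3-way contingency table: the full table
    is approximated from its two-way margins alone."""
    xy = counts.sum(axis=2)            # two-dimensional margins
    xz = counts.sum(axis=1)
    yz = counts.sum(axis=0)
    fit = np.ones_like(counts, dtype=float)
    for _ in range(n_iter):
        fit *= (xy / fit.sum(axis=2))[:, :, None]   # match the XY margin
        fit *= (xz / fit.sum(axis=1))[:, None, :]   # match the XZ margin
        fit *= (yz / fit.sum(axis=0))[None, :, :]   # match the YZ margin
    return fit / fit.sum()             # estimated joint probabilities

# Hypothetical 2 x 2 x 2 table of counts over three discretized attributes.
table = np.array([[[10, 5], [3, 7]],
                  [[6, 4], [8, 12]]])
print(fit_loglinear_two_way(table))
```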
Regression and log-linear models can both be used on sparse data, although their
application may be limited. While both methods can handle skewed data, regression
does exceptionally well. Regression can be computationally intensive when applied to
high-dimensional data, whereas log-linear models show good scalability for up to 10 or
so dimensions.
Several software packages exist to solve regression problems. Examples include SAS
(www.sas.com), SPSS (www.spss.com), and S-Plus (www.insightful.com). Another useful
resource is the book Numerical Recipes in C, by Press, Teukolsky, Vetterling, and Flannery
[PTVF07], and its associated source code.
3.4.6 Histograms
Histograms use binning to approximate data distributions and are a popular form
of data reduction. Histograms were introduced in Section 2.2.3. A histogram for an
attribute, A, partitions the data distribution of A into disjoint subsets, referred to as
buckets or
bins. If each bucket represents only a single attribute–value/frequency pair, the
buckets are called singleton buckets. Often, buckets instead represent continuous ranges
for the given attribute.
Example 3.3 Histograms. The following data are a list of AllElectronics prices for commonly sold items (rounded to the nearest dollar). The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
Figure 3.7 shows a histogram for the data using singleton buckets. To further reduce
the data, it is common to have each bucket denote a continuous value range for
the given attribute. In Figure 3.8, each bucket represents a different $10 range for
price.
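As a quick illustration (ours, not the book's), the equal-width bucketing of Figure 3.8 can be reproduced roughly as follows; the exact bucket boundaries (1-10, 11-20, 21-30) are an assumption:

```python
import numpy as np

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
          15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
          20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
          25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

# Reduce 52 values to three equal-width ($10) buckets of counts.
counts, _ = np.histogram(prices, bins=[1, 11, 21, 31])
for (lo, hi), c in zip([(1, 10), (11, 20), (21, 30)], counts):
    print(f"${lo}-${hi}: {c} items")
```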