Input: $D$, the set of training objects, and test object $z = (\mathbf{x}', y')$
Process:
  Compute $d(\mathbf{x}', \mathbf{x})$, the distance between $z$ and every object $(\mathbf{x}, y) \in D$.
  Select $D_z \subseteq D$, the set of $k$ closest training objects to $z$.
Output: $y' = \operatorname{argmax}_v \sum_{(\mathbf{x}_i, y_i) \in D_z} I(v = y_i)$

Fig. 6 The k-nearest neighbor classification algorithm
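To make Fig. 6 concrete, here is a minimal Python sketch of the majority-vote kNN classifier. The helper names (euclidean, knn_classify) and the choice of Euclidean distance are our own illustrative assumptions, not part of the algorithm as stated.

```python
import math
from collections import Counter

def euclidean(a, b):
    # d(x', x): Euclidean distance between two attribute vectors.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(train, z, k):
    """Majority-vote kNN as in Fig. 6.

    train -- list of (x, y) pairs: attribute vector and class label
    z     -- attribute vector of the test object
    k     -- number of neighbors to consult
    """
    # Compute the distance between z and every training object,
    # then select D_z, the k closest training objects to z.
    neighbors = sorted(train, key=lambda xy: euclidean(z, xy[0]))[:k]
    # Output: argmax_v of the vote counts, i.e. sum of I(v = y_i) over D_z.
    votes = Counter(y for _, y in neighbors)
    return votes.most_common(1)[0][0]

# Example: two of the three nearest objects to z are labeled 'a'.
train = [([1.0, 2.0], 'a'), ([1.2, 1.9], 'a'), ([5.0, 5.0], 'b')]
print(knn_classify(train, z=[1.1, 2.1], k=3))  # -> 'a'
```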
8.2 Issues
There are several key issues that affect the performance of kNN. One is the choice of k. If
k is too small, then the result can be sensitive to noise points. On the other hand, if k is too
large, then the neighborhood may include too many points from other classes.
Another issue is the approach to combining the class labels. The simplest method is to take
a majority vote, but this can be a problem if the nearest neighbors vary widely in their distance
and the closer neighbors more reliably indicate the class of the object. A more sophisticated
approach, which is usually much less sensitive to the choice of k, weights each object’s vote
by its distance, where the weight factor is often taken to be the reciprocal of the squared
distance:
$w_i = 1/d(\mathbf{x}', \mathbf{x}_i)^2$. This amounts to replacing the last step of the kNN algorithm with the following:
Distance-Weighted Voting: $y' = \operatorname{argmax}_v \sum_{(\mathbf{x}_i, y_i) \in D_z} w_i \times I(v = y_i)$.  (19)
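A sketch of how Eq. (19) changes the final step of the algorithm, reusing the euclidean helper from the sketch above; the small eps term that guards against a zero distance is our own addition, not part of the formula.

```python
from collections import defaultdict

def knn_classify_weighted(train, z, k, eps=1e-12):
    # Select D_z, the k closest training objects to z.
    neighbors = sorted(train, key=lambda xy: euclidean(z, xy[0]))[:k]
    # Accumulate w_i * I(v = y_i) with w_i = 1 / d(x', x_i)^2 (Eq. 19).
    scores = defaultdict(float)
    for x_i, y_i in neighbors:
        scores[y_i] += 1.0 / (euclidean(z, x_i) ** 2 + eps)
    # Return the class v with the largest weighted vote.
    return max(scores, key=scores.get)
```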
The choice of the distance measure is another important consideration. Although various
measures can be used to compute the distance between two points, the most desirable distance
measure is one for which a smaller distance between two objects implies a greater likelihood
of having the same class. Thus, for example, if kNN is being applied to classify documents,
then it may be better to use the cosine measure rather than Euclidean distance. Some distance
measures can also be affected by the high dimensionality of the data. In particular, it is well
known that the Euclidean distance measure becomes less discriminating as the number of
attributes increases. Also, attributes may have to be scaled to prevent distance measures from
being dominated by one of the attributes. For example, consider a data set where the height
of a person varies from 1.5 to 1.8 m, the weight of a person varies from 90 to 300 lb, and
the income of a person varies from $10,000 to $1,000,000. If a distance measure is used
without scaling, the income attribute will dominate the computation of distance and thus, the
assignment of class labels. A number of schemes have been developed that try to compute
the weights of each individual attribute based upon a training set [32].
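As an illustration of the scaling issue, here is a minimal sketch of min-max rescaling applied to the height/weight/income example; this is just one common normalization scheme, not one of the attribute-weighting methods of [32].

```python
def min_max_scale(rows):
    """Rescale each attribute (column) to [0, 1] so that no single
    attribute, such as income, dominates the distance computation."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)]
            for row in rows]

# Height (m), weight (lb), income ($). Unscaled, income differences
# swamp height and weight in any Euclidean distance computation.
people = [[1.5, 90, 10_000], [1.8, 300, 1_000_000], [1.6, 150, 50_000]]
print(min_max_scale(people))
```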
In addition, weights can be assigned to the training objects themselves. This can give more
weight to highly reliable training objects, while reducing the impact of unreliable objects. The
PEBLS system by Cost and Salzberg [14] is a well known example of such an approach.
kNN classifiers are lazy learners; that is, unlike eager learners (e.g., decision trees, SVM), no model is built explicitly. Thus, building the model is cheap, but classifying
unknown objects is relatively expensive since it requires the computation of the k-nearest
neighbors of the object to be labeled. This, in general, requires computing the distance of the
unlabeled object to all the objects in the labeled set, which can be expensive particularly for
large training sets. A number of techniques have been developed for efficient computation
of k-nearest neighbor distances that make use of the structure in the data to avoid having to compute the distance to all objects in the training set. These techniques, which are particularly
applicable for low dimensional data, can help reduce the computational cost without affecting
classification accuracy.
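One widely used example of such a structure is the k-d tree. The sketch below assumes SciPy is available and uses its cKDTree to retrieve the k nearest training objects without a linear scan over the whole training set.

```python
import numpy as np
from scipy.spatial import cKDTree

# Low-dimensional training data, where k-d trees are most effective.
X = np.array([[1.0, 2.0], [1.2, 1.9], [5.0, 5.0], [4.8, 5.1]])
y = np.array(['a', 'a', 'b', 'b'])

tree = cKDTree(X)                        # built once, queried many times
dist, idx = tree.query([1.1, 2.1], k=3)  # k nearest neighbors of the test object
print(y[idx])                            # labels of the 3 closest objects
```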
8.3 Impact
kNN classification is an easy-to-understand and easy-to-implement technique.
Despite its simplicity, it can perform well in many situations. In particular, a well known
result by Cover and Hart [15] shows that the error of the nearest neighbor rule is bounded above by twice the Bayes error under certain reasonable assumptions. Also, the error of the general kNN method asymptotically approaches the Bayes error and can be used to approximate it.
kNN is particularly well suited for multi-modal classes as well as applications in which an object can have many class labels. For example, for the assignment of functions to genes based on expression profiles, some researchers found that kNN outperformed SVM, which is a much more sophisticated classification scheme [48].
8.4 Current and future research
Although the basic kNN algorithm and some of its variations, such as weighted kNN and assigning weights to objects, are relatively well known, some of the more advanced techniques for kNN are much less known. For example, it is typically possible to eliminate many of the stored data objects, but still retain the classification accuracy of the kNN classifier. This is known as 'condensing' and can greatly speed up the classification of new objects [35]. In addition, data objects can be removed to improve classification accuracy, a process known as "editing" [88]. There has also been a considerable amount of work on the application of proximity graphs (nearest neighbor graphs, minimum spanning trees, relative neighborhood graphs, Delaunay triangulations, and Gabriel graphs) to the kNN problem. Recent papers by Toussaint [79,80], which emphasize a proximity graph viewpoint, provide an overview of work addressing these three areas and indicate some remaining open problems. Other important resources include the collection of papers by Dasarathy [16] and the book by Devroye et al. [18]. Finally, a fuzzy approach to kNN can be found in the work of Bezdek [4].
9 Naive Bayes
9.1 Introduction
Given a set of objects, each of which belongs to a known class, and each of which has a
known vector of variables, our aim is to construct a rule which will allow us to assign future
objects to a class, given only the vectors of variables describing the future objects. Problems
of this kind, called problems of supervised classification, are ubiquitous, and many methods
for constructing such rules have been developed. One very important method is the naive Bayes
method—also called idiot’s Bayes, simple Bayes, and independence Bayes. This method
is important for several reasons. It is very easy to construct, not needing any complicated
iterative parameter estimation schemes. This means it may be readily applied to huge data
sets. It is easy to interpret, so users unskilled in classifier technology can understand why it
is making the classification it makes. And finally, it often does surprisingly well: it may not