Input: $D$, the set of training objects, and test object $z = (\mathbf{x}', y')$
Process:
  Compute $d(\mathbf{x}', \mathbf{x})$, the distance between $z$ and every object $(\mathbf{x}, y) \in D$.
  Select $D_z \subseteq D$, the set of $k$ closest training objects to $z$.
Output: $y' = \operatorname{argmax}_v \sum_{(\mathbf{x}_i, y_i) \in D_z} I(v = y_i)$

Fig. 6 The k-nearest neighbor classification algorithm
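To make Fig. 6 concrete, here is a minimal Python sketch of the majority-vote kNN classifier. The helper names (euclidean, knn_classify) and the choice of Euclidean distance are our own illustrative assumptions, not part of the algorithm as stated.

```python
import math
from collections import Counter

def euclidean(a, b):
    # d(x', x): Euclidean distance between two attribute vectors.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(train, z, k):
    """Majority-vote kNN as in Fig. 6.

    train -- list of (x, y) pairs: attribute vector and class label
    z     -- attribute vector of the test object
    k     -- number of neighbors to consult
    """
    # Compute the distance between z and every training object,
    # then select D_z, the k closest training objects to z.
    neighbors = sorted(train, key=lambda xy: euclidean(z, xy[0]))[:k]
    # Output: argmax_v of the vote counts, i.e. sum of I(v = y_i) over D_z.
    votes = Counter(y for _, y in neighbors)
    return votes.most_common(1)[0][0]

# Example: two of the three nearest objects to z are labeled 'a'.
train = [([1.0, 2.0], 'a'), ([1.2, 1.9], 'a'), ([5.0, 5.0], 'b')]
print(knn_classify(train, z=[1.1, 2.1], k=3))  # -> 'a'
```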
8.2 Issues
There are several key issues that affect the performance of kNN. One is the choice of k. If
k is too small, then the result can be sensitive to noise points. On the other hand, if k is too
large, then the neighborhood may include too many points from other classes.
Another issue is the approach to combining the class labels. The simplest method is to take
a majority vote, but this can be a problem if the nearest neighbors vary widely in their distance
and the closer neighbors more reliably indicate the class of the object. A more sophisticated
approach, which is usually much less sensitive to the choice of k, weights each object’s vote
by its distance, where the weight factor is often taken to be the reciprocal of the squared
distance:
$w_i = 1/d(\mathbf{x}', \mathbf{x}_i)^2$. This amounts to replacing the last step of the kNN algorithm with the following:
Distance-Weighted Voting: $y' = \operatorname{argmax}_v \sum_{(\mathbf{x}_i, y_i) \in D_z} w_i \times I(v = y_i)$.  (19)
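A sketch of how Eq. (19) changes the final step of the algorithm, reusing the euclidean helper from the sketch above; the small eps term that guards against a zero distance is our own addition, not part of the formula.

```python
from collections import defaultdict

def knn_classify_weighted(train, z, k, eps=1e-12):
    # Select D_z, the k closest training objects to z.
    neighbors = sorted(train, key=lambda xy: euclidean(z, xy[0]))[:k]
    # Accumulate w_i * I(v = y_i) with w_i = 1 / d(x', x_i)^2 (Eq. 19).
    scores = defaultdict(float)
    for x_i, y_i in neighbors:
        scores[y_i] += 1.0 / (euclidean(z, x_i) ** 2 + eps)
    # Return the class v with the largest weighted vote.
    return max(scores, key=scores.get)
```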
The choice of the distance measure is another important consideration. Although various
measures can be used to compute the distance between two points, the most desirable distance
measure is one for which a smaller distance between two objects implies a greater likelihood
of having the same class. Thus, for example, if kNN is being applied to classify documents,
then it may be better to use the cosine measure rather than Euclidean distance. Some distance
measures can also be affected by the high dimensionality of the data. In particular, it is well
known that the Euclidean distance measure becomes less discriminating as the number of
attributes increases. Also, attributes may have to be scaled to prevent distance measures from
being dominated by one of the attributes. For example, consider a data set where the height
of a person varies from 1.5 to 1.8 m, the weight of a person varies from 90 to 300 lb, and
the income of a person varies from $10,000 to $1,000,000. If a distance measure is used
without scaling, the income attribute will dominate the computation of distance and thus, the
assignment of class labels. A number of schemes have been developed that try to compute
the weights of each individual attribute based upon a training set [32].
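As an illustration of the scaling issue, here is a minimal sketch of min-max rescaling applied to the height/weight/income example; this is just one common normalization scheme, not one of the attribute-weighting methods of [32].

```python
def min_max_scale(rows):
    """Rescale each attribute (column) to [0, 1] so that no single
    attribute, such as income, dominates the distance computation."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)]
            for row in rows]

# Height (m), weight (lb), income ($). Unscaled, income differences
# swamp height and weight in any Euclidean distance computation.
people = [[1.5, 90, 10_000], [1.8, 300, 1_000_000], [1.6, 150, 50_000]]
print(min_max_scale(people))
```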
In addition, weights can be assigned to the training objects themselves. This can give more
weight to highly reliable training objects, while reducing the impact of unreliable objects. The
PEBLS system by Cost and Salzberg [14] is a well known example of such an approach.
kNN classifiers are lazy learners; that is, unlike eager learners (e.g., decision trees, SVM), no model is built explicitly. Thus, building the model is cheap, but classifying
unknown objects is relatively expensive since it requires the computation of the k-nearest
neighbors of the object to be labeled. This, in general, requires computing the distance of the
unlabeled object to all the objects in the labeled set, which can be expensive particularly for
large training sets. A number of techniques have been developed for efficient computation
of k-nearest neighbor distances that make use of the structure in the data to avoid having to compute the distance to all objects in the training set. These techniques, which are particularly
applicable for low dimensional data, can help reduce the computational cost without affecting
classification accuracy.
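One widely used example of such a structure is the k-d tree. The sketch below assumes SciPy is available and uses its cKDTree to retrieve the k nearest training objects without a linear scan over the whole training set.

```python
import numpy as np
from scipy.spatial import cKDTree

# Low-dimensional training data, where k-d trees are most effective.
X = np.array([[1.0, 2.0], [1.2, 1.9], [5.0, 5.0], [4.8, 5.1]])
y = np.array(['a', 'a', 'b', 'b'])

tree = cKDTree(X)                        # built once, queried many times
dist, idx = tree.query([1.1, 2.1], k=3)  # k nearest neighbors of the test object
print(y[idx])                            # labels of the 3 closest objects
```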
8.3 Impact
kNN classification is an easy-to-understand and easy-to-implement technique.
Despite its simplicity, it can perform well in many situations. In particular, a well known
result by Cover and Hart [15] shows that the error of the nearest neighbor rule is bounded above by twice the Bayes error under certain reasonable assumptions. Also, the error of the general kNN method asymptotically approaches the Bayes error and can be used to approximate it.
kNN is particularly well suited for multi-modal classes as well as applications in which an object can have many class labels. For example, for the assignment of functions to genes based on expression profiles, some researchers found that kNN outperformed SVM, which is a much more sophisticated classification scheme [48].
8.4 Current and future research
Although the basic kNN algorithm and some of its variations, such as weighted kNN and assigning weights to objects, are relatively well known, some of the more advanced techniques for kNN are much less known. For example, it is typically possible to eliminate many of the stored data objects, but still retain the classification accuracy of the kNN classifier. This is known as 'condensing' and can greatly speed up the classification of new objects [35]. In addition, data objects can be removed to improve classification accuracy, a process known as "editing" [88]. There has also been a considerable amount of work on the application of proximity graphs (nearest neighbor graphs, minimum spanning trees, relative neighborhood graphs, Delaunay triangulations, and Gabriel graphs) to the kNN problem. Recent papers by Toussaint [79,80], which emphasize a proximity graph viewpoint, provide an overview of work addressing these three areas and indicate some remaining open problems. Other important resources include the collection of papers by Dasarathy [16] and the book by Devroye et al. [18]. Finally, a fuzzy approach to kNN can be found in the work of Bezdek [4].
9 Naive Bayes
9.1 Introduction
Given a set of objects, each of which belongs to a known class, and each of which has a
known vector of variables, our aim is to construct a rule which will allow us to assign future
objects to a class, given only the vectors of variables describing the future objects. Problems
of this kind, called problems of supervised classification, are ubiquitous, and many methods
for constructing such rules have been developed. One very important method is the naive Bayes
method—also called idiot’s Bayes, simple Bayes, and independence Bayes. This method
is important for several reasons. It is very easy to construct, not needing any complicated
iterative parameter estimation schemes. This means it may be readily applied to huge data
sets. It is easy to interpret, so users unskilled in classifier technology can understand why it
is making the classification it makes. And finally, it often does surprisingly well: it may not