Top 10 algorithms in data mining
Fig. 5 The AdaBoost algorithm
AdaBoost and its variants have been applied to diverse domains with great success. For
example, Viola and Jones [84] combined AdaBoost with a cascade process for face detection.
They regarded rectangular features as weak learners, and by using AdaBoost to weight the
weak learners, they obtained very intuitive features for face detection. To achieve high accuracy
as well as high efficiency, they used a cascade process (which is beyond the scope of this
article). As a result, they reported a very strong face detector: on a 466 MHz machine,
face detection on a 384 × 288 image cost only 0.067 seconds, about 15 times faster than
state-of-the-art face detectors at that time, with comparable accuracy. This face detector
has been recognized as one of the most exciting breakthroughs in computer vision (in
particular, face detection) during the past decade. It is no surprise that "boosting" has become
a buzzword in computer vision and many other application areas.
7.3 Further research
Many interesting topics are worth further study. Here we discuss only one theoretical topic
and one applied topic.
Many empirical studies show that AdaBoost often does not overfit, i.e., the test error of
AdaBoost often continues to decrease even after the training error reaches zero. Many researchers have
studied this, and several theoretical explanations have been given, e.g. [38]. Schapire et al.
[68] presented a margin-based explanation. They argued that AdaBoost is able to increase the
margins even after the training error is zero, and thus it does not overfit even after a large
number of rounds. However, Breiman [8] indicated that a larger margin does not necessarily mean
better generalization, which seriously challenged the margin-based explanation. Recently,
Reyzin and Schapire [65] found that Breiman considered the minimum margin instead of the
average or median margin, which suggests that the margin-based explanation still has a chance
to survive. If this explanation succeeds, a strong connection between AdaBoost and SVM could
be established. This topic is clearly well worth studying.
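To make the quantity under discussion concrete, the following minimal sketch computes the (normalized) margin of an example under a weighted voting ensemble, i.e., y · Σ_t α_t h_t(x) / Σ_t α_t, which lies in [−1, 1]; larger values mean a more confident correct vote. The function name, the toy stumps, and the weights are all illustrative assumptions, not from the original paper.

```python
def adaboost_margin(x, y, weak_learners, alphas):
    """Normalized margin of example (x, y) under a weighted ensemble:
    y * sum_t alpha_t * h_t(x) / sum_t alpha_t, a value in [-1, 1].
    Larger margins mean a more confident correct classification."""
    weighted_vote = sum(a * h(x) for h, a in zip(weak_learners, alphas))
    return y * weighted_vote / sum(alphas)

# Toy ensemble of threshold stumps on a 1-D input; labels are in {-1, +1}
h1 = lambda x: 1 if x > 0.3 else -1
h2 = lambda x: 1 if x > 0.6 else -1
h3 = lambda x: 1 if x > 0.1 else -1
learners, alphas = [h1, h2, h3], [0.8, 0.5, 0.3]

print(adaboost_margin(0.9, +1, learners, alphas))  # all stumps agree -> 1.0
print(adaboost_margin(0.5, +1, learners, alphas))  # mixed votes -> 0.375
```

Schapire et al.'s argument concerns the distribution of such margins over the training set; Breiman's critique looked only at the minimum of this distribution.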
Many real-world applications are born with high dimensionality, i.e., with a large number
of input features. There are two paradigms that can help us deal with such data:
dimension reduction and feature selection. Dimension reduction methods are usually based
on mathematical projections, which attempt to transform the original features into an
appropriate feature space. After dimension reduction, the original meaning of the features is usually
lost. Feature selection methods directly select some of the original features to use, and therefore
they preserve the original meaning of the features, which is very desirable in many
applications. However, feature selection methods are usually based on heuristics and lack a solid
theoretical foundation. Inspired by Viola and Jones's work [84], we think AdaBoost could
be very useful in feature selection, especially considering that it has a solid theoretical
foundation. Current research focuses mainly on images, yet we think general AdaBoost-based
feature selection techniques are well worth studying.
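The idea can be sketched as follows: when the weak learners are decision stumps that each read a single feature (as Viola and Jones's rectangular features effectively do), the features used by the stumps AdaBoost selects form a feature subset. The code below is a minimal, illustrative sketch under that assumption; the function names and the toy data are hypothetical, not from the paper.

```python
import math

def best_stump(X, y, w):
    """Pick the decision stump (feature, threshold, polarity) with
    minimum weighted error; each stump reads exactly one feature."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for pol in (+1, -1):
                pred = [pol if x[j] > thr else -pol for x in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best

def adaboost_select(X, y, rounds=3):
    """Run AdaBoost with one-feature stumps and return the set of
    feature indices the chosen stumps use (a feature-selection sketch)."""
    n = len(X)
    w = [1.0 / n] * n
    selected = set()
    for _ in range(rounds):
        err, j, thr, pol = best_stump(X, y, w)
        selected.add(j)
        err = max(err, 1e-10)                      # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        # Reweight: misclassified examples get heavier, correct ones lighter
        pred = [pol if x[j] > thr else -pol for x in X]
        w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, pred)]
        s = sum(w)
        w = [wi / s for wi in w]
    return selected

# Toy data: only feature 0 is informative; feature 1 is noise
X = [(0.1, 0.9), (0.2, 0.1), (0.8, 0.5), (0.9, 0.2)]
y = [-1, -1, +1, +1]
print(adaboost_select(X, y))  # -> {0}: only the informative feature is picked
```

On this toy data the boosting rounds repeatedly choose stumps on the informative feature, so the "selected" set contains only feature 0; on real data one would rank features by how often (or with what total weight α) they are chosen.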
8 kNN: k-nearest neighbor classification
8.1 Description of the algorithm
One of the simplest, and rather trivial classifiers is the Rote classifier, which memorizes the
entire training data and performs classification only if the attributes of the test object match
one of the training examples exactly. An obvious drawback of this approach is that many test
records will not be classified because they do not exactly match any of the training records. A
more sophisticated approach, k-nearest neighbor (kNN) classification [23,75], finds a group
of k objects in the training set that are closest to the test object, and bases the assignment of
a label on the predominance of a particular class in this neighborhood. There are three key
elements of this approach: a set of labeled objects, e.g., a set of stored records, a distance
or similarity metric to compute distance between objects, and the value of k, the number of
nearest neighbors. To classify an unlabeled object, the distance of this object to the labeled
objects is computed, its k-nearest neighbors are identified, and the class labels of these nearest
neighbors are then used to determine the class label of the object.
Figure 6 provides a high-level summary of the nearest-neighbor classification method.
Given a training set D and a test object z = (x′, y′), the algorithm computes the distance (or
similarity) between z and all the training objects (x, y) ∈ D to determine its nearest-neighbor
list, D_z. (x is the data of a training object, while y is its class. Likewise, x′ is the data of the
test object and y′ is its class.)
Once the nearest-neighbor list is obtained, the test object is classified based on the majority
class of its nearest neighbors:

Majority Voting:  y' = \arg\max_{v} \sum_{(x_i, y_i) \in D_z} I(v = y_i),   (18)

where v is a class label, y_i is the class label of the ith nearest neighbor, and I(·) is an
indicator function that returns the value 1 if its argument is true and 0 otherwise.
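The procedure above can be sketched in a few lines: compute distances from the test object to every training object, keep the k closest to form D_z, and take the majority vote of Eq. (18). The sketch below assumes Euclidean distance and Python tuples for the data; the function name and toy dataset are illustrative.

```python
import math
from collections import Counter

def knn_classify(D, x_test, k=3):
    """Classify x_test by majority vote among its k nearest
    neighbors in the labeled training set D, as in Eq. (18)."""
    # Euclidean distance between the test object and a training object
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    # Nearest-neighbor list D_z: the k training objects closest to x_test
    D_z = sorted(D, key=lambda xy: dist(xy[0], x_test))[:k]
    # y' = argmax_v sum over (x_i, y_i) in D_z of I(v = y_i)
    votes = Counter(y_i for _, y_i in D_z)
    return votes.most_common(1)[0][0]

# Toy training set: two well-separated clusters with labels 'a' and 'b'
D = [((0.0, 0.0), 'a'), ((0.1, 0.2), 'a'), ((0.2, 0.1), 'a'),
     ((5.0, 5.0), 'b'), ((5.1, 4.9), 'b'), ((4.9, 5.2), 'b')]
print(knn_classify(D, (0.3, 0.3), k=3))  # -> 'a'
print(knn_classify(D, (5.0, 5.1), k=3))  # -> 'b'
```

Note that the three key elements named above appear directly in the code: the labeled set D, the distance function, and the parameter k.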