Data Mining: Practical Machine Learning Tools and Techniques, Second Edition




To find the nearest neighbor of a given test instance, the obvious approach is to calculate its distance to each member of the training set and select the smallest. This procedure is linear in the number of training instances: in other words, the time it takes to make a single prediction is proportional to the number of training instances. Processing an entire test set takes time proportional to the product of the number of instances in the training and test sets.
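To make the cost concrete, here is a minimal sketch of this linear scan; it is illustrative rather than Weka's implementation, and it assumes instances are stored as plain double[] attribute vectors, compared by squared Euclidean distance.

```java
// Brute-force nearest-neighbor search: a minimal sketch, not Weka's code.
// Each prediction scans the whole training set once, so its cost grows
// linearly with the number of training instances.
public class BruteForceNN {

    // Squared Euclidean distance between two attribute vectors of equal length.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Index of the training instance closest to the target instance.
    static int nearest(double[][] training, double[] target) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < training.length; i++) {
            double d = squaredDistance(training[i], target);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }
}
```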

Nearest neighbors can be found more efficiently by representing the training set as a tree, although it is not quite obvious how. One suitable structure is a kD-tree. This is a binary tree that divides the input space with a hyperplane and then splits each partition again, recursively. All splits are made parallel to one of the axes, either vertically or horizontally, in the two-dimensional case. The data structure is called a kD-tree because it stores a set of points in k-dimensional space, k being the number of attributes.

Figure 4.12(a) gives a small example with k = 2, and Figure 4.12(b) shows the four training instances it represents, along with the hyperplanes that constitute the tree. Note that these hyperplanes are not decision boundaries: decisions are made on a nearest-neighbor basis as explained later. The first split is horizontal (h), through the point (7,4)—this is the tree's root. The left branch is not split further: it contains the single point (2,2), which is a leaf of the tree. The right branch is split vertically (v) at the point (6,7). Its left child is empty, and its right child contains the point (3,8). As this example illustrates, each region contains just one point—or, perhaps, no points. Sibling branches of the tree—for example, the two daughters of the root in Figure 4.12(a)—are not necessarily developed to the same depth. Every point in the training set corresponds to a single node, and up to half are leaf nodes.


Figure 4.12 A kD-tree for four training instances: (a) the tree and (b) instances and splits.
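The structure just described can be captured in a small node class. The sketch below is illustrative only (the name KDNode and the double[] representation of instances are assumptions, not Weka's API); the construction, search, and update sketches later in this section build on it.

```java
// A minimal kD-tree node, following the description above: each node stores
// one training point together with the axis (attribute index) to which its
// splitting hyperplane is perpendicular. A leaf has two null children.
// Illustrative only; not Weka's implementation.
class KDNode {
    double[] point;      // the training instance stored at this node
    int splitAxis;       // attribute whose axis the splitting hyperplane cuts
    KDNode left, right;  // subtrees for the two halves of this node's region

    KDNode(double[] point, int splitAxis) {
        this.point = point;
        this.splitAxis = splitAxis;
    }
}
```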



How do you build a kD-tree from a dataset? Can it be updated efficiently as new training examples are added? And how does it speed up nearest-neighbor calculations? We tackle the last question first.

To locate the nearest neighbor of a given target point, follow the tree down from its root to locate the region containing the target. Figure 4.13 shows a space like that of Figure 4.12(b) but with a few more instances and an extra boundary. The target, which is not one of the instances in the tree, is marked by a star. The leaf node of the region containing the target is colored black. This is not necessarily the target's closest neighbor, as this example illustrates, but it is a good first approximation. In particular, any nearer neighbor must lie closer—within the dashed circle in Figure 4.13. To determine whether one exists, first check whether it is possible for a closer neighbor to lie within the node's sibling. The black node's sibling is shaded in Figure 4.13, and the circle does not intersect it, so the sibling cannot contain a closer neighbor. Then back up to the parent node and check its sibling—which here covers everything above the horizontal line. In this case it must be explored, because the area it covers intersects with the best circle so far. To explore it, find its daughters (the original point's two aunts), check whether they intersect the circle (the left one does not, but the right one does), and descend to see whether it contains a closer point (it does).

Figure 4.13 Using a kD-tree to find the nearest neighbor of the star.





In a typical case, this algorithm is far faster than examining all points to find the nearest neighbor. The work involved in finding the initial approximate nearest neighbor—the black point in Figure 4.13—depends on the depth of the tree, given by the logarithm of the number of nodes, log₂n. The amount of work involved in backtracking to check whether this really is the nearest neighbor depends a bit on the tree, and on how good the initial approximation is. But for a well-constructed tree whose nodes are approximately square, rather than long skinny rectangles, it can also be shown to be logarithmic in the number of nodes.
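The descend-then-backtrack procedure just described can be sketched as follows, using the illustrative KDNode class from earlier. It is a sketch under the assumption that a node's left child holds the half-space with smaller values on the split axis; it is not Weka's implementation.

```java
// Nearest-neighbor search in a kD-tree: descend to the region containing the
// target to get a first approximation, then backtrack, exploring a sibling
// subtree only if its half-space could intersect the circle around the target
// whose radius is the best distance found so far. Minimal sketch only.
public class KDTreeSearch {

    private double[] bestPoint;  // closest training point found so far
    private double bestDist;     // its squared distance to the target

    public double[] nearest(KDNode root, double[] target) {
        bestPoint = null;
        bestDist = Double.POSITIVE_INFINITY;
        search(root, target);
        return bestPoint;
    }

    private void search(KDNode node, double[] target) {
        if (node == null) return;

        // The point stored at this node is itself a candidate.
        double d = squaredDistance(node.point, target);
        if (d < bestDist) {
            bestDist = d;
            bestPoint = node.point;
        }

        // Descend first into the child whose half-space contains the target.
        int axis = node.splitAxis;
        double diff = target[axis] - node.point[axis];
        KDNode near = diff < 0 ? node.left : node.right;
        KDNode far  = diff < 0 ? node.right : node.left;
        search(near, target);

        // Backtracking step: the sibling subtree needs to be examined only if
        // the splitting hyperplane is closer to the target than the best
        // distance found so far, i.e. only if it cuts the current best circle.
        if (diff * diff < bestDist) {
            search(far, target);
        }
    }

    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double delta = a[i] - b[i];
            sum += delta * delta;
        }
        return sum;
    }
}
```

Squared distances are used throughout because they preserve ordering while avoiding square roots; the pruning test compares the squared perpendicular distance to the hyperplane with the squared best distance.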

How do you build a good tree for a set of training examples? The problem boils down to selecting the first training instance to split at and the direction of the split. Once you can do that, apply the same method recursively to each child of the initial split to construct the entire tree.

To find a good direction for the split, calculate the variance of the data points along each axis individually, select the axis with the greatest variance, and create a splitting hyperplane perpendicular to it. To find a good place for the hyperplane, locate the median value along that axis and select the corresponding point. This makes the split perpendicular to the direction of greatest spread, with half the points lying on either side. This produces a well-balanced tree. To avoid long skinny regions it is best for successive splits to be along different axes, which is likely because the dimension of greatest variance is chosen at each stage. However, if the distribution of points is badly skewed, choosing the median value may generate several successive splits in the same direction, yielding long, skinny hyperrectangles. A better strategy is to calculate the mean rather than the median and use the point closest to that. The tree will not be perfectly balanced, but its regions will tend to be squarish because there is a greater chance that different directions will be chosen for successive splits.
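A recursive construction along these lines might look like the following sketch, again using the illustrative KDNode class and double[] points. It splits at the median point on the axis of greatest variance; the mean-based refinement just described would change only which point is chosen as the pivot.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Recursive kD-tree construction: at each step, choose the attribute with the
// greatest variance, split at the point whose value on that attribute is the
// median, and build the two subtrees from the points on either side.
// Minimal sketch, not Weka's implementation.
public class KDTreeBuilder {

    public static KDNode build(List<double[]> points) {
        if (points.isEmpty()) return null;
        int k = points.get(0).length;

        // Direction of the split: the axis along which the points vary most.
        int axis = 0;
        double bestVariance = -1.0;
        for (int a = 0; a < k; a++) {
            double v = variance(points, a);
            if (v > bestVariance) {
                bestVariance = v;
                axis = a;
            }
        }

        // Place of the split: the median point along that axis.
        List<double[]> sorted = new ArrayList<>(points);
        final int ax = axis;
        sorted.sort(Comparator.comparingDouble(p -> p[ax]));
        int median = sorted.size() / 2;

        KDNode node = new KDNode(sorted.get(median), axis);
        node.left = build(sorted.subList(0, median));
        node.right = build(sorted.subList(median + 1, sorted.size()));
        return node;
    }

    // Variance of the points' values along a single attribute.
    private static double variance(List<double[]> points, int axis) {
        double mean = 0.0;
        for (double[] p : points) mean += p[axis];
        mean /= points.size();
        double sum = 0.0;
        for (double[] p : points) {
            double d = p[axis] - mean;
            sum += d * d;
        }
        return sum / points.size();
    }
}
```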

An advantage of instance-based learning over most other machine learning methods is that new examples can be added to the training set at any time. To retain this advantage when using a kD-tree, we need to be able to update it incrementally with new data points. To do this, determine which leaf node contains the new point and find its hyperrectangle. If the hyperrectangle is empty, simply place the new point there; otherwise, split it along its longest dimension to preserve squareness. This simple heuristic does not guarantee that adding a series of points will preserve the tree's balance, nor that the hyperrectangles will be well shaped for nearest-neighbor search. It is a good idea to rebuild the tree from scratch occasionally—for example, when its depth grows to twice the best possible depth.
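The update heuristic can be sketched like this, once more with the illustrative KDNode class and double[] points. The caller passes in the bounds of the attribute space (lo and hi) so that each region's hyperrectangle is known when it has to be split; tracking the tree's depth and triggering the occasional rebuild is left out.

```java
// Incremental insertion into a kD-tree: descend to the leaf region that
// contains the new point; if that region already holds a point, split its
// hyperrectangle along its longest dimension (to keep regions squarish) and
// hang the new point off the appropriate side. Minimal sketch only.
public class KDTreeInsert {

    // lo[a]..hi[a] give the current hyperrectangle along each attribute a.
    public static KDNode insert(KDNode node, double[] p, double[] lo, double[] hi) {
        if (node == null) {
            // Empty region: the new point simply becomes a leaf. Its split axis
            // is provisional; it is reassigned when this leaf is split in turn.
            return new KDNode(p, 0);
        }
        if (node.left == null && node.right == null) {
            // Occupied leaf region: split it through the stored point, along
            // the longest dimension of the region's hyperrectangle.
            node.splitAxis = longestDimension(lo, hi);
        }

        // Descend into the side of the split that contains the new point,
        // shrinking the hyperrectangle accordingly.
        int axis = node.splitAxis;
        if (p[axis] < node.point[axis]) {
            double[] newHi = hi.clone();
            newHi[axis] = node.point[axis];
            node.left = insert(node.left, p, lo, newHi);
        } else {
            double[] newLo = lo.clone();
            newLo[axis] = node.point[axis];
            node.right = insert(node.right, p, newLo, hi);
        }
        return node;
    }

    // The attribute along which the hyperrectangle [lo, hi] is widest.
    private static int longestDimension(double[] lo, double[] hi) {
        int best = 0;
        for (int a = 1; a < lo.length; a++) {
            if (hi[a] - lo[a] > hi[best] - lo[best]) best = a;
        }
        return best;
    }
}
```

A call such as insert(root, newPoint, lowerBounds, upperBounds), with the bounds of the whole attribute space, inserts one new instance; counting the depth reached during the descent would provide the trigger for the occasional rebuild mentioned above.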

As we have seen, kD-trees are good data structures for finding nearest neighbors efficiently. However, they are not perfect. Skewed datasets present a basic conflict between the desire for the tree to be perfectly balanced and the desire for regions to be squarish. More importantly, rectangles—even squares—are not the best shape to use anyway, because of their corners. If the dashed circle in


