Real-Time Continuous Pose Recovery of Human Hands Using Convolutional Networks
Fig. 2. Decision forest data: learned labels closely match target.
work is that it shows that ConvNets can perform high-level reason-
ing from depth image features.
3. BINARY CLASSIFICATION
For the task of hand-background depth image segmentation we
trained an RDF classifier to perform per-pixel binary segmentation
on a single image. The output of this stage is shown in Figure 2.
Decision forests are well suited for discrete classification of body
parts [Shotton et al. 2011]. Furthermore, since decision forest clas-
sification is trivially parallelizable, it is well suited to real-time
processing in multicore environments.
After Shotton et al., our RDF is designed to classify each pixel
in a depth image as belonging to a hand or background. Each tree
in the RDF consists of a set of sequential deterministic decisions,
called weak learners (or nodes), that compare the relative depth of
the current pixel to a nearby pixel located at a learned offset. The
particular sequence of decisions a pixel satisfies induces a tentative
classification into hand or background. Averaging the classification
from all trees in the forest induces a final probability distribution for
each pixel. As our implementation differs only slightly from that
of Shotton et al., we refer interested readers to their past work and
focus on the innovations particular to our implementation.
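The per-pixel forest evaluation described above can be sketched as follows. This is a minimal illustration only: the dict-based tree layout and field names (e.g., `p_hand`) are ours, not the paper's, and the depth-normalized offset follows the Shotton-style feature discussed later in this section.

```python
import numpy as np

def classify_pixel(trees, depth, u, v):
    """Average the hand-probability estimates of all trees for one pixel.

    Each tree is a hypothetical nested dict: internal nodes hold a weak
    learner (du, dv, dt) and 'left'/'right' children; leaves hold a
    scalar P(hand) learned during training.
    """
    probs = []
    for node in trees:
        while 'p_hand' not in node:          # descend until a leaf
            d = depth[v, u]
            # depth-normalized offsets, as in Shotton-style features
            uu = int(np.clip(u + node['du'] / d, 0, depth.shape[1] - 1))
            vv = int(np.clip(v + node['dv'] / d, 0, depth.shape[0] - 1))
            node = node['right'] if depth[vv, uu] - d >= node['dt'] else node['left']
        probs.append(node['p_hand'])
    # averaging over trees yields the forest's posterior P(hand)
    return float(np.mean(probs))
```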
While full-body motion capture datasets are readily avail-
able [Allen et al. 2003], these datasets either lack articulation data
for hands or else do not adequately cover the wide variety of poses
that were planned for this system. Therefore, it was necessary to
create a custom database of full-body depth images with binary
hand labeling for RDF training (Figure 2). We had one user paint
his hands bright red with body paint and used a simple HSV-based
distance metric to estimate a coarse hand labeling on the RGB im-
age. The coarse labeling was then filtered using a median filter to
remove outliers. Since commodity RGB+Depth (RGBD) cameras
typically exhibit imperfect alignment between depth and RGB, we
used a combination of graph cut and depth-sensitive flood fill to
further clean up the depth image labeling [Boykov et al. 2001].
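The coarse labeling step can be sketched as below, assuming the RGB image has already been converted to HSV with channels scaled to [0, 1]. The hue tolerance and saturation cutoff are illustrative assumptions; the paper only states that a simple HSV-based distance metric was used, followed by a median filter.

```python
import numpy as np

def coarse_hand_mask(hsv, target_hue=0.0, hue_tol=0.05, min_sat=0.5):
    """Coarse hand labeling from an HSV image whose hands are painted red.

    target_hue, hue_tol, and min_sat are hypothetical values chosen for
    illustration (red sits at hue 0 on the circular hue axis).
    """
    h, s = hsv[..., 0], hsv[..., 1]
    # hue is circular, so measure the wrap-around distance to the target
    hue_dist = np.minimum(np.abs(h - target_hue), 1.0 - np.abs(h - target_hue))
    return (hue_dist < hue_tol) & (s > min_sat)

def median_filter_3x3(mask):
    """Remove isolated outliers with a 3x3 median (majority) vote."""
    padded = np.pad(mask.astype(np.uint8), 1)
    votes = sum(padded[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
                for dy in range(3) for dx in range(3))
    return votes >= 5  # median of 9 binary samples
```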
In order to train the RDF we randomly sample weak learners
from a family of functions, similar to Shotton et al. [2011]. At a
given pixel (u, v) on the depth image I, each node in the decision
tree evaluates

$$I\!\left(u + \frac{\Delta u}{I(u,v)},\; v + \frac{\Delta v}{I(u,v)}\right) - I(u,v) \ge d_t, \qquad (1)$$
where I(u, v) is the depth pixel value in image I, $\Delta u$ and $\Delta v$ are
learned pixel offsets, and $d_t$ is a learned depth threshold. Experimentally,
we found that (1) requires a large dynamic range of pixel
offsets during training to achieve good classification performance.
Fig. 3. Linear-blend-skinning (LBS) model [Šarić 2011] with 42 DOF.
We suspect that this is because a given decision path needs to use
both global and local geometry information to perform efficient
hand-background segmentation. Since training time is limited, we
define a discrete set of weak learners that use offset and threshold
values that are linear in log space and then we randomly sample
weak learners from this space during training.
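The log-space sampling of weak-learner parameters can be sketched as follows. The offset and threshold ranges here are assumptions for illustration, not the values used in the paper; the point is that log-uniform magnitudes cover both small (local) and large (global) offsets.

```python
import numpy as np

def sample_weak_learners(n, rng, off_min=1.0, off_max=512.0,
                         thr_min=0.01, thr_max=1.0):
    """Sample n weak learners (du, dv, dt) whose magnitudes are
    distributed linearly in log space, so that small and large pixel
    offsets are equally well represented.  Range endpoints are
    illustrative, not the paper's values.
    """
    def log_uniform(lo, hi, k):
        # uniform in log space -> log-uniform magnitudes in [lo, hi]
        return np.exp(rng.uniform(np.log(lo), np.log(hi), k))

    sign = lambda k: rng.choice([-1.0, 1.0], k)
    du = log_uniform(off_min, off_max, n) * sign(n)
    dv = log_uniform(off_min, off_max, n) * sign(n)
    dt = log_uniform(thr_min, thr_max, n)   # depth threshold is positive
    return np.stack([du, dv, dt], axis=1)
```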
4. DATASET CREATION
The goal of this stage is to create a database of RGBD sensor images
representing a broad range of hand gestures with accurate ground-
truth estimates (i.e., labels) of joint parameters in each image that
may be used to train a ConvNet. The desired ground-truth label
consists of a 42-dimensional vector describing the full degree-of-
freedom pose for the hand in that frame. The DOF of each hand
joint is shown in Figure 3. After the hand has been segmented
from the background using the RDF-based binary classification
just described, we use a direct search method to deduce the pose
parameters based on the approach of Oikonomidis et al. [2011].
An important insight of our work is that we can capture the power
of their direct search method in an offline fashion and then use it
to train ConvNets (or similar algorithms that are better suited to
fast computation). One advantage of this decoupling is that, during
offline training, we are not penalized for using more complicated
models that are more expensive to render yet better explain the
incoming range data. A second advantage is that we can utilize
multiple sensors for training, thereby mitigating problems of self-
occlusion during real-time interaction with a single sensor.
The algorithm proposed by Oikonomidis et al. [2011] and adopted
with modifications for this work is as follows: starting with an
approximate hand pose, a synthetic depth image is rendered and
compared to the captured sensor depth image using a scalar objective function. This
depth image is rendered in an OpenGL-based framework, where
the only render output is the distance from the camera origin and
we use a camera with the same properties (e.g., focal length) as
the PrimeSense™ IR sensor. In practice the hand pose is estimated
using the previous frame’s pose when fitting a sequence of recorded
frames. The pose is manually estimated using a simple UI for the first
frame in a sequence. This results in a single scalar value representing
the quality of the fit given the estimated pose coefficients. The
particle swarm optimization with partial randomization (PrPSO)
direct search method [Yasuda et al. 2010] is used to adjust the
pose coefficient values to find the best-fit pose that minimizes this
objective function value. An overview of this algorithm is shown in
Figure 4.
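The fitting loop can be sketched as below. This is a minimal PSO-with-partial-randomization illustration, not the paper's implementation: the inertia and acceleration constants, restart fraction, and initialization spread are all assumptions, and the objective shown in the test is a stand-in for the depth-image fit error.

```python
import numpy as np

def prpso(objective, x0, n_particles=32, n_iters=40, p_rand=0.1,
          sigma=0.1, rng=None):
    """Minimize a scalar objective with PSO plus partial randomization.

    objective maps a pose-coefficient vector (e.g. 42-D) to the scalar
    depth-image fit error; x0 is the previous frame's pose.  All
    constants here are illustrative, not the paper's values.
    """
    rng = rng or np.random.default_rng(0)
    dim = len(x0)
    x = x0 + sigma * rng.standard_normal((n_particles, dim))
    v = np.zeros_like(x)
    pbest = x.copy()
    pbest_f = np.array([objective(p) for p in x])
    for _ in range(n_iters):
        g = pbest[np.argmin(pbest_f)]        # swarm-best pose so far
        r1, r2 = rng.random((2, n_particles, 1))
        v = 0.7 * v + 1.5 * r1 * (pbest - x) + 1.5 * r2 * (g - x)
        x = x + v
        # partial randomization: restart a fraction of the particles near
        # the swarm best to counter slow late-stage convergence
        redo = rng.random(n_particles) < p_rand
        x[redo] = g + sigma * rng.standard_normal((redo.sum(), dim))
        v[redo] = 0.0
        f = np.array([objective(p) for p in x])
        better = f < pbest_f
        pbest[better], pbest_f[better] = x[better], f[better]
    return pbest[np.argmin(pbest_f)]
```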
Since PSO convergence is slow once the swarm positions are
close to the final solution (which is exacerbated when partial
ACM Transactions on Graphics, Vol. 33, No. 5, Article 169, Publication date: August 2014.