Real-Time Continuous Pose Recovery of Human Hands Using Convolutional Networks
Fig. 2. Decision forest data: learned labels closely match target.
work is that it shows that ConvNets can perform high-level reason-
ing from depth image features.
3. BINARY CLASSIFICATION
For the task of hand-background depth image segmentation we
trained an RDF classifier to perform per-pixel binary segmentation
on a single image. The output of this stage is shown in Figure 2.
Decision forests are well suited for discrete classification of body
parts [Shotton et al. 2011]. Furthermore, since decision forest clas-
sification is trivially parallelizable, it is well suited to real-time
processing in multicore environments.
After Shotton et al., our RDF is designed to classify each pixel
in a depth image as belonging to a hand or background. Each tree
in the RDF consists of a set of sequential deterministic decisions,
called weak learners (or nodes), that compare the relative depth of
the current pixel to a nearby pixel located at a learned offset. The
particular sequence of decisions a pixel satisfies induces a tentative
classification into hand or background. Averaging the classification
from all trees in the forest induces a final probability distribution for
each pixel. As our implementation differs only slightly from that
of Shotton et al., we refer interested readers to their past work and
focus on the innovations particular to our implementation.
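The per-pixel forest evaluation described above can be sketched as follows. This is a minimal illustration only: the dict-based tree layout and field names (e.g., `p_hand`) are ours, not the paper's, and the depth-normalized offset follows the Shotton-style feature discussed later in this section.

```python
import numpy as np

def classify_pixel(trees, depth, u, v):
    """Average the hand-probability estimates of all trees for one pixel.

    Each tree is a hypothetical nested dict: internal nodes hold a weak
    learner (du, dv, dt) and 'left'/'right' children; leaves hold a
    scalar P(hand) learned during training.
    """
    probs = []
    for node in trees:
        while 'p_hand' not in node:          # descend until a leaf
            d = depth[v, u]
            # depth-normalized offsets, as in Shotton-style features
            uu = int(np.clip(u + node['du'] / d, 0, depth.shape[1] - 1))
            vv = int(np.clip(v + node['dv'] / d, 0, depth.shape[0] - 1))
            node = node['right'] if depth[vv, uu] - d >= node['dt'] else node['left']
        probs.append(node['p_hand'])
    # averaging over trees yields the forest's posterior P(hand)
    return float(np.mean(probs))
```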
While full-body motion capture datasets are readily avail-
able [Allen et al. 2003], these datasets either lack articulation data
for hands or else do not adequately cover the wide variety of poses
that were planned for this system. Therefore, it was necessary to
create a custom database of full-body depth images with binary
hand labeling for RDF training (Figure 2). We had one user paint
his hands bright red with body paint and used a simple HSV-based
distance metric to estimate a coarse hand labeling on the RGB im-
age. The coarse labeling was then filtered using a median filter to
remove outliers. Since commodity RGB+Depth (RGBD) cameras
typically exhibit imperfect alignment between depth and RGB, we
used a combination of graph cut and depth-sensitive flood fill to
further clean up the depth image labeling [Boykov et al. 2001].
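The coarse labeling step can be sketched as below, assuming the RGB image has already been converted to HSV with channels scaled to [0, 1]. The hue tolerance and saturation cutoff are illustrative assumptions; the paper only states that a simple HSV-based distance metric was used, followed by a median filter.

```python
import numpy as np

def coarse_hand_mask(hsv, target_hue=0.0, hue_tol=0.05, min_sat=0.5):
    """Coarse hand labeling from an HSV image whose hands are painted red.

    target_hue, hue_tol, and min_sat are hypothetical values chosen for
    illustration (red sits at hue 0 on the circular hue axis).
    """
    h, s = hsv[..., 0], hsv[..., 1]
    # hue is circular, so measure the wrap-around distance to the target
    hue_dist = np.minimum(np.abs(h - target_hue), 1.0 - np.abs(h - target_hue))
    return (hue_dist < hue_tol) & (s > min_sat)

def median_filter_3x3(mask):
    """Remove isolated outliers with a 3x3 median (majority) vote."""
    padded = np.pad(mask.astype(np.uint8), 1)
    votes = sum(padded[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
                for dy in range(3) for dx in range(3))
    return votes >= 5  # median of 9 binary samples
```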
In order to train the RDF we randomly sample weak learners
from a family of functions, similar to Shotton et al. [2011]. At a
given pixel (u, v) on the depth image I, each node in the decision
tree evaluates

$$I\!\left(u + \frac{\Delta u}{I(u,v)},\; v + \frac{\Delta v}{I(u,v)}\right) - I(u,v) \ge d_t, \qquad (1)$$
where I(u, v) is the depth pixel value in image I, $\Delta u$ and $\Delta v$ are
learned pixel offsets, and $d_t$ is a learned depth threshold. Experimentally,
we found that (1) requires a large dynamic range of pixel
offsets during training to achieve good classification performance.
Fig. 3. Linear-blend-skinning (LBS) model [Šarić 2011] with 42 DOF.
We suspect that this is because a given decision path needs to use
both global and local geometry information to perform efficient
hand-background segmentation. Since training time is limited, we
define a discrete set of weak learners that use offset and threshold
values that are linear in log space and then we randomly sample
weak learners from this space during training.
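The log-space sampling of weak-learner parameters can be sketched as follows. The offset and threshold ranges here are assumptions for illustration, not the values used in the paper; the point is that log-uniform magnitudes cover both small (local) and large (global) offsets.

```python
import numpy as np

def sample_weak_learners(n, rng, off_min=1.0, off_max=512.0,
                         thr_min=0.01, thr_max=1.0):
    """Sample n weak learners (du, dv, dt) whose magnitudes are
    distributed linearly in log space, so that small and large pixel
    offsets are equally well represented.  Range endpoints are
    illustrative, not the paper's values.
    """
    def log_uniform(lo, hi, k):
        # uniform in log space -> log-uniform magnitudes in [lo, hi]
        return np.exp(rng.uniform(np.log(lo), np.log(hi), k))

    sign = lambda k: rng.choice([-1.0, 1.0], k)
    du = log_uniform(off_min, off_max, n) * sign(n)
    dv = log_uniform(off_min, off_max, n) * sign(n)
    dt = log_uniform(thr_min, thr_max, n)   # depth threshold is positive
    return np.stack([du, dv, dt], axis=1)
```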
4. DATASET CREATION
The goal of this stage is to create a database of RGBD sensor images
representing a broad range of hand gestures with accurate ground-
truth estimates (i.e., labels) of joint parameters in each image that
may be used to train a ConvNet. The desired ground-truth label
consists of a 42-dimensional vector describing the full degree-of-
freedom pose for the hand in that frame. The DOF of each hand
joint is shown in Figure 3. After the hand has been segmented
from the background using the RDF-based binary classification
just described, we use a direct search method to deduce the pose
parameters based on the approach of Oikonomidis et al. [2011].
An important insight of our work is that we can capture the power
of their direct search method in an offline fashion and then use it
to train ConvNets (or similar algorithms that are better suited to
fast computation). One advantage of this decoupling is that, during
offline training, we are not penalized for using more complicated
models that are more expensive to render yet better explain the
incoming range data. A second advantage is that we can utilize
multiple sensors for training, thereby mitigating problems of self-
occlusion during real-time interaction with a single sensor.
The algorithm proposed by Oikonomidis et al. [2011] and adopted
with modifications for this work is as follows: starting with an
approximate hand pose, a synthetic depth image is rendered and
compared to the captured sensor depth image using a scalar objective function. This
depth image is rendered in an OpenGL-based framework, where
the only render output is the distance from the camera origin and
we use a camera with the same properties (e.g., focal length) as
the PrimeSense™ IR sensor. In practice the hand pose is estimated
using the previous frame’s pose when fitting a sequence of recorded
frames. The pose is manually estimated using a simple UI for the first
frame in a sequence. This results in a single scalar value representing
the quality of the fit given the estimated pose coefficients. The
particle swarm optimization with partial randomization (PrPSO)
direct search method [Yasuda et al. 2010] is used to adjust the
pose coefficient values to find the best-fit pose that minimizes this
objective function value. An overview of this algorithm is shown in
Figure 4.
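The fitting loop can be sketched as below. This is a minimal PSO-with-partial-randomization illustration, not the paper's implementation: the inertia and acceleration constants, restart fraction, and initialization spread are all assumptions, and the objective shown in the test is a stand-in for the depth-image fit error.

```python
import numpy as np

def prpso(objective, x0, n_particles=32, n_iters=40, p_rand=0.1,
          sigma=0.1, rng=None):
    """Minimize a scalar objective with PSO plus partial randomization.

    objective maps a pose-coefficient vector (e.g. 42-D) to the scalar
    depth-image fit error; x0 is the previous frame's pose.  All
    constants here are illustrative, not the paper's values.
    """
    rng = rng or np.random.default_rng(0)
    dim = len(x0)
    x = x0 + sigma * rng.standard_normal((n_particles, dim))
    v = np.zeros_like(x)
    pbest = x.copy()
    pbest_f = np.array([objective(p) for p in x])
    for _ in range(n_iters):
        g = pbest[np.argmin(pbest_f)]        # swarm-best pose so far
        r1, r2 = rng.random((2, n_particles, 1))
        v = 0.7 * v + 1.5 * r1 * (pbest - x) + 1.5 * r2 * (g - x)
        x = x + v
        # partial randomization: restart a fraction of the particles near
        # the swarm best to counter slow late-stage convergence
        redo = rng.random(n_particles) < p_rand
        x[redo] = g + sigma * rng.standard_normal((redo.sum(), dim))
        v[redo] = 0.0
        f = np.array([objective(p) for p in x])
        better = f < pbest_f
        pbest[better], pbest_f[better] = x[better], f[better]
    return pbest[np.argmin(pbest_f)]
```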
Since PSO convergence is slow once the swarm positions are
close to the final solution (which is exacerbated when partial
ACM Transactions on Graphics, Vol. 33, No. 5, Article 169, Publication date: August 2014.