169:2 • J. Tompson et al.
Fig. 1. Pose recovery pipeline overview.
suggest a simple and robust inverse kinematics (IK) algorithm for
real-time, high-degree-of-freedom pose inference from the ConvNet
output. The system can accommodate multiple commodity depth
cameras for generating training data, but requires only a single
depth camera for real-time tracking. We believe the key technical
contribution of this work is the creation of a novel pipeline for fast
pose inference applicable to a wide variety of articulable objects.
An overview of this pipeline is shown in Figure 1.
As a single example, training our system on an open-source
linear-blend skinning model of a hand with 42 degrees of freedom
takes less than 10 minutes of human effort (18,000 frames at 30fps),
followed by two days of autonomous computation time. Tracking
and pose inference for a person’s hand can then be performed in
real time using a single depth camera. Throughout our experiments,
the camera is situated in front of the user at approximately eye-level
height. The trained system can be readily used to puppeteer related
objects such as alien hands, or real robot linkages, and as an input
to 3D user interfaces [Stein et al. 2012].
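The IK stage mentioned above converts the ConvNet's predicted feature positions into joint parameters of the skeleton. As a hedged sketch of one common way such a solver can be built (not necessarily the exact formulation used here), the loop below fits joint angles by damped least squares with a numerical Jacobian; the two-link `fk` is a hypothetical stand-in for the full 42-DOF skinned hand skeleton:

```python
import numpy as np

def fk(theta):
    # Toy forward kinematics: a 2-link planar arm (link lengths 1.0 and 0.8)
    # embedded in 3D. In the real pipeline this would be the skinned hand
    # skeleton's feature-point positions; this stand-in is purely illustrative.
    a, b = theta
    elbow = np.array([np.cos(a), np.sin(a), 0.0])
    tip = elbow + 0.8 * np.array([np.cos(a + b), np.sin(a + b), 0.0])
    return np.concatenate([elbow, tip])

def ik_dls(target, theta, iters=100, damping=1e-2, h=1e-5):
    """Damped-least-squares IK using a central-difference numerical Jacobian."""
    theta = theta.astype(float).copy()
    for _ in range(iters):
        err = target - fk(theta)
        J = np.zeros((target.size, theta.size))
        for j in range(theta.size):
            d = np.zeros_like(theta)
            d[j] = h
            J[:, j] = (fk(theta + d) - fk(theta - d)) / (2 * h)
        # Levenberg-style damping keeps the update stable near singularities.
        dtheta = np.linalg.solve(J.T @ J + damping * np.eye(theta.size), J.T @ err)
        theta += dtheta
    return theta

target = fk(np.array([0.7, -0.4]))           # a reachable feature configuration
theta = ik_dls(target, np.array([0.0, 0.0]))
print(np.linalg.norm(target - fk(theta)))    # residual should be near zero
```

The same loop applies unchanged to a higher-degree-of-freedom skeleton; only `fk` and the parameter vector grow, which is why a simple solver of this kind can remain real-time.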
2. RELATED WORK
A large body of literature is devoted to real-time recovery of pose
for markerless articulable objects, such as human bodies, clothes,
and man-made objects. As the primary contribution of our work is
a fast pipeline for recovery of the pose of human hands in 3D, we
will limit our discussion to the most relevant prior work.
Many groups have created their own dataset of ground-truth labels
and images to enable real-time pose recovery of the human body.
For example, Wang et al. [2011] use the CyberGlove II Motion
Capture system to construct a dataset of labeled hand poses from
users that are rerendered as a colored glove with known texture. A
similar colored glove is worn by the user at runtime, while the pose
is inferred in real time by matching the imaged glove in RGB to their
database of templates [Wang and Popović 2009]. In later work, the
CyberGlove data was repurposed for pose inference using template
matching on depth images, without a colored glove. Wang et al.
have recently commercialized their hand-tracking system, now
proprietary and managed by 3Gear Systems [3Gear 2014], which
uses a PrimeSense™ depth camera oriented above the table to
recognize a large range of possible poses. This work differs
from 3Gear’s in a number of ways: (1) we attempt to perform
continuous pose estimation rather than recognition by matching into
a static and discrete database, and (2) we orient the camera facing
the user, so our system is optimized for a different set of hand
gestures.
Also relevant to our work is that of Shotton et al. [2011], who used
randomized decision forests to recover the pose of multiple bodies
from a single frame by learning a per-pixel classification of the
depth image into 38 different body parts. Their training examples
were synthesized from combinations of known poses and body
shapes. In similar work, Keskin et al. [2011] created a randomized
decision forest classifier specialized for human hands. Lacking a
dataset based on human motion capture, they synthesized a dataset
from known poses in American Sign Language and expanded the
dataset by interpolating between poses. Given their stated goal of
recognizing sign-language signs, this approach proved useful, but it
would not be feasible in our case, as we require unrestricted hand
poses to be recovered. In follow-up work, Keskin et al. [2012]
presented a novel shape-classification forest architecture to perform
per-pixel part classification.
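The per-pixel classification used by Shotton et al. [2011] and by Keskin et al. hinges on cheap depth-difference features whose pixel offsets are normalized by the depth at the probe pixel, making the response approximately invariant to the subject's distance from the camera. A minimal sketch of such a feature follows; the synthetic image, offsets, and sentinel value are all illustrative:

```python
import numpy as np

def depth_feature(depth, x, y, u, v, big=1e6):
    """Depth-difference feature in the style of Shotton et al. [2011].

    Offsets u and v (in pixel-metres) are divided by the depth at (x, y),
    so the probed pixels track the same physical locations on the body
    regardless of distance. Probes falling outside the image return a
    large sentinel value.
    """
    d = depth[y, x]
    def probe(off):
        px = int(x + off[0] / d)
        py = int(y + off[1] / d)
        if 0 <= py < depth.shape[0] and 0 <= px < depth.shape[1]:
            return depth[py, px]
        return big
    return probe(u) - probe(v)

# Synthetic 2 m background with a nearer (1 m) square region.
depth = np.full((64, 64), 2.0)
depth[20:40, 20:40] = 1.0

# One feature probing 10 px-m above vs. below the center pixel. In a
# forest, each split node thresholds one such feature, and the leaves
# vote for a body or hand part label.
f = depth_feature(depth, 30, 30, (0.0, -10.0), (0.0, 10.0))
print(f)  # -1.0: the upper probe lands on the near region, the lower on background
```

Because each feature is two image lookups, evaluating an entire forest per pixel remains cheap enough for real-time use, which is what made this family of methods attractive.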
Several other groups have used domain knowledge and temporal
coherence to construct methods that do not require any dataset for
tracking the pose of complicated objects. For example, Weise et al.
[2009] devise a real-time facial animation system for range sensors
using salient points to deduce transformations on an underlying
face model by framing it as energy minimization. In related work,
Li et al. [2013] showed how to extend this technique to enable
adaptation to the user’s own facial expressions in an online fashion.
Melax et al. [2013] demonstrate a real-time system for tracking the
full pose of a human hand by fitting convex polyhedra directly to
range data using an approach inspired by constraint-based physics
systems. Ballan et al. [2012] show how to fit high-polygon hand
models to multiple camera views of a pair of hands interacting
with a small sphere, using a combination of feature-based tracking
and energy minimization. In contrast to our method, their approach
relies upon inter-frame correspondences to provide optical flow and
good starting poses for energy minimization.
Early work by Rehg and Kanade [1994] demonstrated a model-
based tracking system that fits a high-degree-of-freedom articulated
hand model to greyscale image data using hand-designed 2D fea-
tures. Zhao et al. [2012] use a combination of IR markers and
RGBD capture to infer offline (at one frame per second) the pose
of an articulated hand model. Similar to this work, Oikonomidis
et al. [2011] demonstrate the utility of Particle Swarm Optimization
(PSO) for tracking single and interacting hands by searching for
parameters of an explicit 3D model that reduce the reconstruction
error of a z-buffer rendered model compared to an incoming depth
image. Their work relies heavily on temporal coherence assumptions
to keep PSO inference efficient, since the optimizer's radius of
convergence is finite. Unfortunately, temporal coherence cannot be
relied upon for robust real-time tracking, since dropped frames and
fast-moving objects typically violate it. In contrast to their work,
which used PSO directly for interactive tracking on the GPU at
4–15 fps, our work
shows that, with relaxed temporal coherence assumptions in an of-
fline setting, PSO is an invaluable offline tool for generating labeled
data.
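The model-fitting objective in Oikonomidis et al. [2011], like the one in our offline labeling stage, scores a candidate pose by the discrepancy between a rendered depth image of the model and the observed depth frame, then searches that objective with PSO. The following is a minimal global-best PSO sketch; the `render` function is a toy stand-in for the z-buffer render of the articulated model, and all names and constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def render(params):
    # Stand-in for the z-buffer render of the articulated model: a fixed
    # nonlinear map from pose parameters to a flattened "depth image".
    return np.sin(np.outer(params, np.arange(1, 6))).ravel()

def pso(objective, dim, n_particles=40, iters=200, w=0.7, c1=1.5, c2=1.5):
    """Minimal global-best Particle Swarm Optimization."""
    x = rng.uniform(-np.pi, np.pi, (n_particles, dim))
    v = np.zeros_like(x)
    pbest = x.copy()
    pcost = np.array([objective(p) for p in x])
    g = pbest[pcost.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        # Blend inertia, attraction to each particle's best, and to the swarm best.
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = x + v
        cost = np.array([objective(p) for p in x])
        better = cost < pcost
        pbest[better], pcost[better] = x[better], cost[better]
        g = pbest[pcost.argmin()].copy()
    return g, pcost.min()

true_pose = np.array([0.3, -1.1, 0.8])
observed = render(true_pose)
obj = lambda p: np.sum((render(p) - observed) ** 2)
best, err = pso(obj, dim=3)
print(err)  # reconstruction error should be near zero
```

Because the swarm explores many pose hypotheses in parallel, this style of search tolerates a poor or absent initial guess, which is exactly why it suits offline label generation where temporal coherence need not hold.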
To our knowledge, there is no published prior work on using
ConvNets to recover continuous 3D pose of human hands from
depth data. However, several groups have shown that ConvNets can
recover the pose of rigid and nonrigid 3D objects such as plastic toys,
faces, and even human bodies. For example, LeCun et al. [2004]
used ConvNets to deduce the 6DOF pose of 3D plastic toys by
finding a low-dimensional embedding that maps RGB images to a
six-dimensional space. Osadchy et al. [2005] use a similar formula-
tion to perform pose detection of faces via a nonlinear mapping to a
low-dimensional manifold. Taylor et al. [2011] use crowd-sourcing
to build a database of similar human poses from different subjects,
and then use ConvNets to perform dimensionality reduction to a
low-dimensional manifold in which similarity between training
examples is preserved. Lastly, Jiu et al. [2013] use ConvNets to
perform per-pixel classifications of depth images (whose output is
similar to Shotton et al. [2011]) in order to infer human body pose,
but they do not evaluate the performance of their approach on hand
pose recognition.
Couprie et al. [2013] use ConvNets to perform image segmenta-
tion of indoor scenes using RGB-D data. The significance of their
ACM Transactions on Graphics, Vol. 33, No. 5, Article 169, Publication date: August 2014.