Real-Time Continuous Pose Recovery of Human Hands Using
Convolutional Networks
JONATHAN TOMPSON, MURPHY STEIN, YANN LECUN, and KEN PERLIN
New York University
We present a novel method for real-time continuous pose recovery of mark-
erless complex articulable objects from a single depth image. Our method
consists of the following stages: a randomized decision forest classifier for
image segmentation, a robust method for labeled dataset generation, a con-
volutional network for dense feature extraction, and finally an inverse kine-
matics stage for stable real-time pose recovery. As one possible application
of this pipeline, we show state-of-the-art results for real-time puppeteering
of a skinned hand-model.
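To make the pipeline's data flow concrete, here is a minimal per-frame sketch; the component interfaces (rdf.segment, convnet.predict, ik_solver.solve) are hypothetical placeholders standing in for the trained stages, not the actual implementation:

def recover_pose(depth_frame, rdf, convnet, ik_solver, prev_pose):
    # Illustrative per-frame driver for the four-stage pipeline;
    # every interface here is an assumption, not the real system.
    # Stage 1: randomized decision forest labels hand pixels in the
    # single depth image.
    hand_mask = rdf.segment(depth_frame)          # HxW boolean mask

    # Stage 2 is offline: the labeled dataset used to train rdf and
    # convnet is generated beforehand, so it does not appear here.

    # Stage 3: convolutional network extracts dense features (e.g.,
    # positions of key hand-model points) from the masked depth image.
    features = convnet.predict(depth_frame * hand_mask)

    # Stage 4: inverse kinematics fits the skinned hand model to the
    # features, warm-started from the previous frame for stability.
    return ik_solver.solve(features, init=prev_pose)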
Categories and Subject Descriptors: I.3.6 [Computer Graphics]: Methodol-
ogy and Techniques—Interaction Techniques; I.3.7 [Computer Graphics]:
Three-Dimensional Graphics and Realism—Animation
General Terms: Algorithms, Human Factors
Additional Key Words and Phrases: Hand tracking, neural networks,
markerless motion capture, analysis-by-synthesis
ACM Reference Format:
Jonathan Tompson, Murphy Stein, Yann LeCun, and Ken Perlin. 2014.
Real-time continuous pose recovery of human hands using convolutional
networks. ACM Trans. Graph. 33, 5, Article 169 (August 2014), 10 pages.
DOI: http://dx.doi.org/10.1145/2629500
1. INTRODUCTION
Inferring the pose of articulable objects from depth video is a
difficult problem in markerless motion capture. Requiring low-latency
inference for real-time applications makes it even harder. The
difficulty arises because articulable objects typically have many
degrees of freedom (DOF), constrained parameter spaces, and
self-similar parts, and suffer from self-occlusion. All these factors
make fitting a model directly to the depth data hard, and even
undesirable in practice, unless the fitting process is able to account
for such missing data.
One common approach to “fill in” missing data is to combine
multiple simultaneous video streams, but this is a costly demand on
the end-user and may prohibit widespread use of otherwise good
solutions. A second common approach, called “supervised learning”
in computer vision and machine learning, is to train a model on
ground-truth data that combines the full pose of the object in the
frame with the depth image. The trained model can then use a priori
information from known poses to make informed guesses about the
likely pose in the current frame.
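As a minimal illustration of this supervised formulation, the sketch below regresses a pose vector directly from a depth image; the random-forest regressor is a generic stand-in learner and the data are synthetic placeholders, both our assumptions rather than the system described later:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Ground-truth dataset: N depth images (flattened) paired with the
# full pose of the object in each frame (e.g., 42 joint parameters).
N, H, W, DOF = 1000, 96, 96, 42
depth_images = np.random.rand(N, H * W)   # placeholder images
poses = np.random.rand(N, DOF)            # placeholder pose labels

# Train a model on (depth image, pose) pairs ...
model = RandomForestRegressor(n_estimators=50)
model.fit(depth_images, poses)

# ... then use it to make an informed guess about the likely pose
# in the current frame.
pose_estimate = model.predict(depth_images[:1])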
Large ground-truth datasets have been constructed for impor-
tant articulable objects such as human bodies, and robust real-time
pose inference systems have been trained on them using super-
vised learning. Unfortunately, most articulable objects, even com-
mon ones such as human hands, do not have publicly available
datasets, or these datasets do not adequately cover the vast range
of possible poses. Perhaps more importantly, it may be desirable
to infer the real-time continuous pose of objects that do not yet
have such datasets. The vast majority of objects seen in the world
fall into this category. A general method for dataset acquisition of
articulable objects is an important contribution of this work.
The main difficulty with using supervised learning for training
models to perform real-time pose inference of a human hand is in
obtaining ground-truth data for the hand pose. Typical models of
the human hand have 25–50 degrees of freedom [Erol et al. 2007]
and exclude important information such as joint angle constraints.
Since real hands exhibit joint angle constraints that are pose depen-
dent, faithfully expressing such limits is still difficult in practice.
Unfortunately, without such constraints, most models are capable
of poses that are anatomically incorrect. This means that sampling
the space of possible parameters using a real hand is more desirable
than exploring it with a model. With the advent of commodity depth
sensors, it is possible to economically capture continuous traversals
of this constrained low-dimensional parameter space in video and
then to robustly fit hand models to the data to recover the pose
parameters [Oikonomidis et al. 2011].
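A toy sketch makes the constraint problem concrete: fixed per-joint limits are easy to enforce, but the pose-dependent coupling between joints is exactly what such a static model misses. All limit values and joint indices below are illustrative assumptions, not anatomical measurements:

import numpy as np

# Fixed (min, max) flexion limits in radians, one row per DOF.
STATIC_LIMITS = np.radians([[-10.0, 90.0]] * 20)  # placeholder values

def clamp_static(pose):
    # Clamp each joint angle independently to its fixed range.
    return np.clip(pose, STATIC_LIMITS[:, 0], STATIC_LIMITS[:, 1])

def clamp_coupled(pose):
    # Toy pose-dependent rule: a distal joint cannot flex far beyond
    # its proximal neighbor (a crude stand-in for the anatomical
    # coupling that fixed per-joint limits cannot express).
    pose = clamp_static(pose)
    for proximal, distal in [(0, 1), (3, 4)]:   # hypothetical indices
        pose[distal] = min(pose[distal], pose[proximal] + np.radians(20))
    return pose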
In this work, we present a solution to the difficult problem of
inferring the continuous pose of a human hand by first constructing
an accurate database of labeled ground-truth data in an automatic
process and then training a system capable of real-time inference.
Since the human hand represents a particularly difficult kind of
articulable object to track, we believe our solution is applicable to
a wide range of articulable objects. Our method has a small latency
equal to one frame of video, is robust to self-occlusion, requires
no special markers, and can handle objects with self-similar parts
such as fingers. To allow a broad range of applications, our method
works when the hand is smaller than 2% of the 640×480 = 307 kpx
image area.
Our method can be generalized to track any articulable object
that satisfies three requirements: (1) the object to be tracked can be
modeled as a 3D boned mesh, (2) a binary classifier can be made
to label the pixels in the image belonging to the object, and (3)
the mapping from the pose space of the bones to the rendered 2D
depth image is approximately one-to-one. The model is used to
automatically label depth video captured live from a user. This data
is used to train a Randomized Decision Forest (RDF) architecture for
image segmentation, as well as a Convolutional Network (ConvNet)
to infer the position of key model features in real time. We also