Real-Time Continuous Pose Recovery of Human Hands Using
Convolutional Networks
JONATHAN TOMPSON, MURPHY STEIN, YANN LECUN, and KEN PERLIN
New York University
We present a novel method for real-time continuous pose recovery of mark-
erless complex articulable objects from a single depth image. Our method
consists of the following stages: a randomized decision forest classifier for
image segmentation, a robust method for labeled dataset generation, a con-
volutional network for dense feature extraction, and finally an inverse kine-
matics stage for stable real-time pose recovery. As one possible application
of this pipeline, we show state-of-the-art results for real-time puppeteering
of a skinned hand-model.
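To make the pipeline's data flow concrete, here is a minimal per-frame sketch; the component interfaces (rdf.segment, convnet.predict, ik_solver.solve) are hypothetical placeholders standing in for the trained stages, not the actual implementation:

def recover_pose(depth_frame, rdf, convnet, ik_solver, prev_pose):
    # Illustrative per-frame driver for the four-stage pipeline;
    # every interface here is an assumption, not the real system.
    # Stage 1: randomized decision forest labels hand pixels in the
    # single depth image.
    hand_mask = rdf.segment(depth_frame)          # HxW boolean mask

    # Stage 2 is offline: the labeled dataset used to train rdf and
    # convnet is generated beforehand, so it does not appear here.

    # Stage 3: convolutional network extracts dense features (e.g.,
    # positions of key hand-model points) from the masked depth image.
    features = convnet.predict(depth_frame * hand_mask)

    # Stage 4: inverse kinematics fits the skinned hand model to the
    # features, warm-started from the previous frame for stability.
    return ik_solver.solve(features, init=prev_pose)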
Categories and Subject Descriptors: I.3.6 [Computer Graphics]: Methodol-
ogy and Techniques—Interaction Techniques; I.3.7 [Computer Graphics]:
Three-Dimensional Graphics and Realism—Animation
General Terms: Algorithms, Human Factors
Additional Key Words and Phrases: Hand tracking, neural networks,
markerless motion capture, analysis-by-synthesis
ACM Reference Format:
Jonathan Tompson, Murphy Stein, Yann LeCun, and Ken Perlin. 2014.
Real-time continuous pose recovery of human hands using convolutional
networks. ACM Trans. Graph. 33, 5, Article 169 (August 2014), 10 pages.
DOI: http://dx.doi.org/10.1145/2629500
1. INTRODUCTION
Inferring the pose of articulable objects from depth video is a
difficult problem in markerless motion capture. Requiring low-latency
inference for real-time applications makes it even harder. The
difficulty arises because articulable objects typically have many
degrees of freedom (DOF), constrained parameter spaces, and
self-similar parts, and suffer from self-occlusion. All these factors
make fitting a model directly to the depth data hard, and even
undesirable in practice, unless the fitting process is able to account
for such missing data.
One common approach to “fill in” missing data is to combine
multiple simultaneous video streams, but this is a costly demand on
the end-user and may prohibit widespread use of otherwise good
solutions. A second common approach, called “supervised learning”
in computer vision and machine learning, is to train a model on
ground-truth data that combines the full pose of the object in the
frame with the depth image. The trained model can then use a priori
information from known poses to make informed guesses about the
likely pose in the current frame.
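As a minimal illustration of this supervised formulation, the sketch below regresses a pose vector directly from a depth image; the random-forest regressor is a generic stand-in learner and the data are synthetic placeholders, both our assumptions rather than the system described later:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Ground-truth dataset: N depth images (flattened) paired with the
# full pose of the object in each frame (e.g., 42 joint parameters).
N, H, W, DOF = 1000, 96, 96, 42
depth_images = np.random.rand(N, H * W)   # placeholder images
poses = np.random.rand(N, DOF)            # placeholder pose labels

# Train a model on (depth image, pose) pairs ...
model = RandomForestRegressor(n_estimators=50)
model.fit(depth_images, poses)

# ... then use it to make an informed guess about the likely pose
# in the current frame.
pose_estimate = model.predict(depth_images[:1])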
Large ground-truth datasets have been constructed for impor-
tant articulable objects such as human bodies, and robust real-time
pose inference systems have been trained on them using super-
vised learning. Unfortunately, most articulable objects, even com-
mon ones such as human hands, do not have publicly available
datasets, or these datasets do not adequately cover the vast range
of possible poses. Perhaps more importantly, it may be desirable
to infer the real-time continuous pose of objects that do not yet
have such datasets. The vast majority of objects seen in the world
fall into this category. A general method for dataset acquisition of
articulable objects is an important contribution of this work.
The main difficulty with using supervised learning for training
models to perform real-time pose inference of a human hand is in
obtaining ground-truth data for the hand pose. Typical models of
the human hand have 25–50 degrees of freedom [Erol et al. 2007]
and exclude important information such as joint angle constraints.
Since real hands exhibit joint angle constraints that are pose depen-
dent, faithfully expressing such limits is still difficult in practice.
Unfortunately, without such constraints, most models are capable
of poses that are anatomically incorrect. This means that sampling
the space of possible parameters using a real hand is more desirable
than exploring it with a model. With the advent of commodity depth
sensors, it is possible to economically capture continuous traversals
of this constrained low-dimensional parameter space in video and
then to robustly fit hand models to the data to recover the pose
parameters [Oikonomidis et al. 2011].
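A toy sketch makes the constraint problem concrete: fixed per-joint limits are easy to enforce, but the pose-dependent coupling between joints is exactly what such a static model misses. All limit values and joint indices below are illustrative assumptions, not anatomical measurements:

import numpy as np

# Fixed (min, max) flexion limits in radians, one row per DOF.
STATIC_LIMITS = np.radians([[-10.0, 90.0]] * 20)  # placeholder values

def clamp_static(pose):
    # Clamp each joint angle independently to its fixed range.
    return np.clip(pose, STATIC_LIMITS[:, 0], STATIC_LIMITS[:, 1])

def clamp_coupled(pose):
    # Toy pose-dependent rule: a distal joint cannot flex far beyond
    # its proximal neighbor (a crude stand-in for the anatomical
    # coupling that fixed per-joint limits cannot express).
    pose = clamp_static(pose)
    for proximal, distal in [(0, 1), (3, 4)]:   # hypothetical indices
        pose[distal] = min(pose[distal], pose[proximal] + np.radians(20))
    return pose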
In this work, we present a solution to the difficult problem of
inferring the continuous pose of a human hand by first constructing
an accurate database of labeled ground-truth data in an automatic
process and then training a system capable of real-time inference.
Since the human hand represents a particularly difficult kind of
articulable object to track, we believe our solution is applicable to
a wide range of articulable objects. Our method has a small latency
equal to one frame of video, is robust to self-occlusion, requires
no special markers, and can handle objects with self-similar parts
such as fingers. To allow a broad range of applications, our method
works when the hand is smaller than 2% of the 640×480 = 307 kpx
image area.
Our method can be generalized to track any articulable object
that satisfies three requirements: (1) the object to be tracked can be
modeled as a 3D boned mesh, (2) a binary classifier can be made
to label the pixels in the image belonging to the object, and (3)
the mapping from the pose space of the bones to the rendered 2D
depth image is approximately one-to-one. The model is used to
automatically label depth video captured live from a user. This data
is used to train a Randomized Decision Forest (RDF) architecture for
image segmentation, as well as a Convolutional Network (ConvNet)
to infer the position of key model features in real time. We also