169:2 • J. Tompson et al.
Fig. 1. Pose recovery pipeline overview.
suggest a simple and robust inverse kinematics (IK) algorithm for
real-time, high-degree-of-freedom pose inference from the ConvNet
output. The system can accommodate multiple commodity depth
cameras for generating training data, but requires only a single
depth camera for real-time tracking. We believe the key technical
contribution of this work is the creation of a novel pipeline for fast
pose inference applicable to a wide variety of articulable objects.
An overview of this pipeline is shown in Figure 1.
As a single example, training our system on an open-source
linear-blend skinning model of a hand with 42 degrees of freedom
takes less than 10 minutes of human effort (18,000 frames at 30fps),
followed by two days of autonomous computation time. Tracking
and pose inference for a person’s hand can then be performed in
real time using a single depth camera. Throughout our experiments,
the camera is situated in front of the user at approximately eye-level
height. The trained system can be readily used to puppeteer related
objects such as alien hands, or real robot linkages, and as an input
to 3D user interfaces [Stein et al. 2012].
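The IK stage mentioned above converts the ConvNet's predicted feature positions into joint parameters of the skeleton. As a hedged sketch of one common way such a solver can be built (not necessarily the exact formulation used here), the loop below fits joint angles by damped least squares with a numerical Jacobian; the two-link `fk` is a hypothetical stand-in for the full 42-DOF skinned hand skeleton:

```python
import numpy as np

def fk(theta):
    # Toy forward kinematics: a 2-link planar arm (link lengths 1.0 and 0.8)
    # embedded in 3D. In the real pipeline this would be the skinned hand
    # skeleton's feature-point positions; this stand-in is purely illustrative.
    a, b = theta
    elbow = np.array([np.cos(a), np.sin(a), 0.0])
    tip = elbow + 0.8 * np.array([np.cos(a + b), np.sin(a + b), 0.0])
    return np.concatenate([elbow, tip])

def ik_dls(target, theta, iters=100, damping=1e-2, h=1e-5):
    """Damped-least-squares IK using a central-difference numerical Jacobian."""
    theta = theta.astype(float).copy()
    for _ in range(iters):
        err = target - fk(theta)
        J = np.zeros((target.size, theta.size))
        for j in range(theta.size):
            d = np.zeros_like(theta)
            d[j] = h
            J[:, j] = (fk(theta + d) - fk(theta - d)) / (2 * h)
        # Levenberg-style damping keeps the update stable near singularities.
        dtheta = np.linalg.solve(J.T @ J + damping * np.eye(theta.size), J.T @ err)
        theta += dtheta
    return theta

target = fk(np.array([0.7, -0.4]))           # a reachable feature configuration
theta = ik_dls(target, np.array([0.0, 0.0]))
print(np.linalg.norm(target - fk(theta)))    # residual should be near zero
```

The same loop applies unchanged to a higher-degree-of-freedom skeleton; only `fk` and the parameter vector grow, which is why a simple solver of this kind can remain real-time.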
2. RELATED WORK
A large body of literature is devoted to real-time recovery of pose
for markerless articulable objects, such as human bodies, clothes,
and man-made objects. As the primary contribution of our work is
a fast pipeline for recovery of the pose of human hands in 3D, we
will limit our discussion to the most relevant prior work.
Many groups have created their own dataset of ground-truth labels
and images to enable real-time pose recovery of the human body.
For example, Wang et al. [2011] use the CyberGlove II Motion
Capture system to construct a dataset of labeled hand poses from
users that are rerendered as a colored glove with known texture. A
similar colored glove is worn by the user at runtime, while the pose
is inferred in real time by matching the imaged glove in RGB to their
database of templates [Wang and Popović 2009]. In later work, the
CyberGlove data was repurposed for pose inference using template
matching on depth images, without a colored glove. Wang et al.
have recently commercialized their hand-tracking system, now
proprietary and managed by 3Gear Systems [3Gear 2014], which
uses a PrimeSense™ depth camera oriented above the table to
recognize a large range of possible poses. This work differs
from 3Gear’s in a number of ways: (1) we attempt to perform
continuous pose estimation rather than recognition by matching into
a static and discrete database, and (2) we orient the camera facing
the user, so our system is optimized for a different set of hand
gestures.
Also relevant to our work is that of Shotton et al. [2011], who used
randomized decision forests to recover the pose of multiple bodies
from a single frame by learning a per-pixel classification of the
depth image into 38 different body parts. Their training examples
were synthesized from combinations of known poses and body
shapes. In similar work, Keskin et al. [2011] created a randomized
decision forest classifier specialized for human hands. Lacking a
dataset based on human motion capture, they synthesized a dataset
from known poses in American Sign Language and expanded the
dataset by interpolating between poses. Given their stated goal of
recognizing sign-language signs, this approach proved useful, but it
would not be feasible in our case, as we require unrestricted hand
poses to be recovered. In follow-up work, Keskin et al. [2012]
presented a novel shape-classification forest architecture to perform
per-pixel part classification.
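The per-pixel classification used by Shotton et al. [2011] and by Keskin et al. hinges on cheap depth-difference features whose pixel offsets are normalized by the depth at the probe pixel, making the response approximately invariant to the subject's distance from the camera. A minimal sketch of such a feature follows; the synthetic image, offsets, and sentinel value are all illustrative:

```python
import numpy as np

def depth_feature(depth, x, y, u, v, big=1e6):
    """Depth-difference feature in the style of Shotton et al. [2011].

    Offsets u and v (in pixel-metres) are divided by the depth at (x, y),
    so the probed pixels track the same physical locations on the body
    regardless of distance. Probes falling outside the image return a
    large sentinel value.
    """
    d = depth[y, x]
    def probe(off):
        px = int(x + off[0] / d)
        py = int(y + off[1] / d)
        if 0 <= py < depth.shape[0] and 0 <= px < depth.shape[1]:
            return depth[py, px]
        return big
    return probe(u) - probe(v)

# Synthetic 2 m background with a nearer (1 m) square region.
depth = np.full((64, 64), 2.0)
depth[20:40, 20:40] = 1.0

# One feature probing 10 px-m above vs. below the center pixel. In a
# forest, each split node thresholds one such feature, and the leaves
# vote for a body or hand part label.
f = depth_feature(depth, 30, 30, (0.0, -10.0), (0.0, 10.0))
print(f)  # -1.0: the upper probe lands on the near region, the lower on background
```

Because each feature is two image lookups, evaluating an entire forest per pixel remains cheap enough for real-time use, which is what made this family of methods attractive.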
Several other groups have used domain knowledge and temporal
coherence to construct methods that do not require any dataset for
tracking the pose of complicated objects. For example, Weise et al.
[2009] devise a real-time facial animation system for range sensors
using salient points to deduce transformations on an underlying
face model by framing it as energy minimization. In related work,
Li et al. [2013] showed how to extend this technique to enable
adaptation to the user’s own facial expressions in an online fashion.
Melax et al. [2013] demonstrate a real-time system for tracking the
full pose of a human hand by fitting convex polyhedra directly to
range data using an approach inspired by constraint-based physics
systems. Ballan et al. [2012] show how to fit high-polygon hand
models to multiple camera views of a pair of hands interacting
with a small sphere, using a combination of feature-based tracking
and energy minimization. In contrast to our method, their approach
relies upon inter-frame correspondences to provide optical flow and
good starting poses for energy minimization.
Early work by Rehg and Kanade [1994] demonstrated a model-
based tracking system that fits a high-degree-of-freedom articulated
hand model to greyscale image data using hand-designed 2D fea-
tures. Zhao et al. [2012] use a combination of IR markers and
RGBD capture to infer offline (at one frame per second) the pose
of an articulated hand model. Similar to this work, Oikonomidis
et al. [2011] demonstrate the utility of Particle Swarm Optimization
(PSO) for tracking single and interacting hands by searching for
parameters of an explicit 3D model that reduce the reconstruction
error of a z-buffer rendered model compared to an incoming depth
image. Their work relies heavily on temporal coherence assumptions
to keep PSO inference efficient, since the optimizer's radius of
convergence is finite. Unfortunately, temporal coherence cannot be
relied upon for robust real-time tracking, since dropped frames and
fast-moving objects typically violate it. In contrast to their work,
which used PSO directly for interactive tracking on the GPU at
4–15 fps, our work
shows that, with relaxed temporal coherence assumptions in an of-
fline setting, PSO is an invaluable offline tool for generating labeled
data.
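The model-fitting objective in Oikonomidis et al. [2011], like the one in our offline labeling stage, scores a candidate pose by the discrepancy between a rendered depth image of the model and the observed depth frame, then searches that objective with PSO. The following is a minimal global-best PSO sketch; the `render` function is a toy stand-in for the z-buffer render of the articulated model, and all names and constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def render(params):
    # Stand-in for the z-buffer render of the articulated model: a fixed
    # nonlinear map from pose parameters to a flattened "depth image".
    return np.sin(np.outer(params, np.arange(1, 6))).ravel()

def pso(objective, dim, n_particles=40, iters=200, w=0.7, c1=1.5, c2=1.5):
    """Minimal global-best Particle Swarm Optimization."""
    x = rng.uniform(-np.pi, np.pi, (n_particles, dim))
    v = np.zeros_like(x)
    pbest = x.copy()
    pcost = np.array([objective(p) for p in x])
    g = pbest[pcost.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        # Blend inertia, attraction to each particle's best, and to the swarm best.
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = x + v
        cost = np.array([objective(p) for p in x])
        better = cost < pcost
        pbest[better], pcost[better] = x[better], cost[better]
        g = pbest[pcost.argmin()].copy()
    return g, pcost.min()

true_pose = np.array([0.3, -1.1, 0.8])
observed = render(true_pose)
obj = lambda p: np.sum((render(p) - observed) ** 2)
best, err = pso(obj, dim=3)
print(err)  # reconstruction error should be near zero
```

Because the swarm explores many pose hypotheses in parallel, this style of search tolerates a poor or absent initial guess, which is exactly why it suits offline label generation where temporal coherence need not hold.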
To our knowledge, there is no published prior work on using
ConvNets to recover continuous 3D pose of human hands from
depth data. However, several groups have shown that ConvNets can
recover the pose of rigid and nonrigid 3D objects such as plastic toys,
faces, and even human bodies. For example, LeCun et al. [2004]
used ConvNets to deduce the 6DOF pose of 3D plastic toys by
finding a low-dimensional embedding that maps RGB images to a
six-dimensional space. Osadchy et al. [2005] use a similar formula-
tion to perform pose detection of faces via a nonlinear mapping to a
low-dimensional manifold. Taylor et al. [2011] use crowd-sourcing
to build a database of similar human poses from different subjects,
and then use ConvNets to perform dimensionality reduction to a
low-dimensional manifold in which similarity between training
examples is preserved. Lastly, Jiu et al. [2013] use ConvNets to
perform per-pixel classifications of depth images (whose output is
similar to Shotton et al. [2011]) in order to infer human body pose,
but they do not evaluate the performance of their approach on hand
pose recognition.
Couprie et al. [2013] use ConvNets to perform image segmenta-
tion of indoor scenes using RGB-D data. The significance of their
ACM Transactions on Graphics, Vol. 33, No. 5, Article 169, Publication date: August 2014.