Real-Time Continuous Pose Recovery of Human Hands Using Convolutional Networks
Fig. 10. RDF Error.
Fig. 11. Dataset creation: objective function data (with Libhand model [Šarić 2011]).
approximately 12 hours. For each node in the tree, 10,000 weak
learners were sampled. The error ratio of the number of incorrect
pixel labels to total number of hand pixels in the dataset for varying
tree counts and tree heights is shown in Figure 10.
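For concreteness, the error metric of Figure 10 can be expressed in a few lines. This is an illustrative sketch (with hypothetical label arrays), not the evaluation code used in this work:

```python
import numpy as np

def pixel_error_ratio(pred_labels, true_labels):
    """Error ratio from Figure 10: incorrect pixel labels (false
    positives plus false negatives) divided by the total number of
    ground-truth hand pixels. Labels: 1 = hand, 0 = background."""
    incorrect = np.sum(pred_labels != true_labels)
    return incorrect / np.sum(true_labels == 1)
```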
We found that four trees with a height of 25 offered a good trade-off
between classification accuracy and speed. The validation-set classifi-
cation error for four trees of depth 25 was 4.1%. Of the classification
errors, 76.3% were false positives and 23.7% were false negatives.
We found that, in practice, small clusters of false positive pixel
labels can be easily removed using median filtering and blob detec-
tion. The most common classification failures occur when the hand
is occluded by another body part (causing false positives), or when
the elbow is much closer to the camera than the hand (causing false
positives on the elbow). We believe this inaccuracy results from the
training set not containing any frames with these poses. A more
comprehensive dataset containing examples of these poses should
improve performance in the future.
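A minimal sketch of this cleanup step, using SciPy's median filter and connected-component (blob) labeling; the filter size and blob threshold below are assumptions, not the values used in this work:

```python
import numpy as np
from scipy import ndimage

def clean_hand_mask(mask, filter_size=5, min_blob_px=500):
    """Suppress small clusters of false-positive hand labels in a
    binary mask via median filtering followed by blob detection."""
    # Median filter removes isolated mislabeled pixels.
    filtered = ndimage.median_filter(mask.astype(np.uint8), size=filter_size)
    # Connected-component labeling finds the remaining blobs.
    labels, n = ndimage.label(filtered)
    sizes = ndimage.sum(filtered, labels, index=range(1, n + 1))
    # Keep only blobs large enough to plausibly be a hand.
    big = 1 + np.flatnonzero(sizes >= min_blob_px)
    return np.isin(labels, big).astype(np.uint8)
```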
Since we do not have a ground-truth measure for the 42DOF
hand model fitting, quantitative evaluation of this stage is difficult.
Qualitatively, the fitting accuracy was visually consistent with the
underlying point cloud. An example of a fitted frame is shown in
Figure 11. Only a very small number of poses failed to fit correctly;
for these difficult poses, manual intervention was required.
One limitation of this system was that the frame rate of the
PrimeSense™ camera (30fps) was not sufficient to ensure enough
temporal coherence for correct convergence of the PSO optimizer.
To overcome this, we had each user move her hands slowly during
training data capture.
Fig. 12. Sample ConvNet test-set images.
Fig. 13. ConvNet Learning Curve.
Using a workstation with an Nvidia GTX
580 GPU and 4-core Intel processor, fitting each frame required 3
to 6 seconds. The final database consisted of 76,712 training-set
images, 2,421 validation-set images, and 2,000 test-set images with
their corresponding heat-maps, collected from multiple participants.
A small sample of the test-set images is shown in Figure 12.
The ConvNet training took approximately 24 hours; early
stopping was performed after 350 epochs to prevent overfit-
ting. ConvNet hyperparameters, such as learning rate, momentum,
L2-regularization, and architectural parameters (e.g., max-pooling
window size or number of stages) were chosen by coarse meta-
optimization to minimize a validation-set error. Two stages of con-
volution (at each resolution level) and two fully connected neural
network stages were chosen as a trade-off between numerous perfor-
mance characteristics: generalization performance, evaluation time,
and model complexity (or ability to infer complex poses). Figure 13
shows the Mean Squared Error (MSE) after each epoch. The MSE
was calculated by taking the mean of sum-of-squared differences
between the calculated 14 feature maps and the corresponding target
feature maps.
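A framework-agnostic sketch of this loss and of the early-stopping criterion follows; the `train_epoch`/`predict` callables and the patience value are hypothetical, and this is not the training code used in this work:

```python
import numpy as np

def heatmap_mse(pred, target):
    """Mean over examples of the sum-of-squared differences between
    the 14 predicted heat-maps and their targets (Figure 13)."""
    # pred, target: arrays of shape (num_examples, 14, height, width).
    return np.mean(np.sum((pred - target) ** 2, axis=(1, 2, 3)))

def train(train_epoch, predict, val_targets, max_epochs=350, patience=25):
    """Run up to max_epochs epochs, tracking the lowest validation MSE
    and stopping early once it has stalled for `patience` epochs."""
    best_mse, best_epoch = np.inf, 0
    for epoch in range(max_epochs):
        train_epoch()                              # one pass over the training set
        mse = heatmap_mse(predict(), val_targets)  # validation heat-maps
        if mse < best_mse:
            best_mse, best_epoch = mse, epoch      # checkpoint would be saved here
        elif epoch - best_epoch >= patience:
            break                                  # early stopping
    return best_mse
```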
The mean UV error of the ConvNet heat-map output on the test-
set data was 0.41px (with standard deviation of 0.35px) on the
18×18-resolution heat-map image¹. After each heat-map feature
was translated to the 640×480 depth image, the mean UV error
was 5.8px (with a standard deviation of 4.9px). Since the heat-map
downsampling ratio is depth dependent, the UV error improves as
the hand approaches the sensor. For applications that require greater
spatial accuracy, the heat-map resolution can be increased at the
cost of increased latency and reduced throughput.
¹To calculate this error, we used the technique described in Section 6 to
compute the heat-map UV feature location and then measured the error
distance between the target and ConvNet output locations.
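To make the error computation concrete, the sketch below substitutes a simple argmax for the sub-pixel technique of Section 6 and assumes a square, depth-dependent hand crop described by a hypothetical `(u0, v0, side)` bounding box:

```python
import numpy as np

def heatmap_uv(heatmap):
    """Coarse UV feature location as the heat-map argmax (Section 6
    refines this to sub-pixel precision; argmax is a simplification)."""
    v, u = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return np.array([u, v], dtype=np.float64)

def uv_to_depth_image(uv, bbox, heatmap_res=18):
    """Map a heat-map UV back into the 640x480 depth image. Because
    the crop side length grows as the hand approaches the sensor, the
    effective upsampling ratio (and hence the UV error measured in
    depth-image pixels) is depth dependent."""
    u0, v0, side = bbox                # square hand crop in depth-image pixels
    scale = side / float(heatmap_res)
    return np.array([u0, v0]) + (uv + 0.5) * scale

def uv_error(pred_uv, target_uv):
    """Euclidean error distance between target and output locations."""
    return float(np.linalg.norm(pred_uv - target_uv))
```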
Table I. Heat-Map UV Error by Feature Type

Feature Type              Mean (px)   STD (px)
Palm                        0.33        0.30
Thumb Base & Knuckle        0.33        0.43
Thumb Tip                   0.39        0.55
Finger Knuckle              0.38        0.27
Finger Tip                  0.54        0.33
Fig. 14. Real-time tracking results: (a) typical hardware setup; (b) depth
with heat-map features; (c) ConvNet input and pose output.
Table I shows the UV accuracy for each feature type. Unsurpris-
ingly, we found that the ConvNet architecture had the most difficulty
learning fingertip positions, whose mean error is 61% higher than
that of the palm features. The likely cause for this inaccu-
racy is twofold. First, the fingertip positions undergo a large range
of motion between various hand poses and therefore the ConvNet
must learn a more difficult mapping between local features and
fingertip positions. Second, the PrimeSense™ Carmine 1.09 depth
camera cannot always recover the depth of small surfaces such as
fingertips. The ConvNet is able to learn this noise behavior and is
actually able to approximate fingertip location in the presence of
missing data. However, the accuracy for these poses is low.
The computation time of the entire pipeline is 24.9ms, which is
within our 30fps performance target. Within this period: decision
forest evaluation takes 3.4ms, depth image preprocessing takes
4.7ms, ConvNet evaluation takes 5.6ms, and pose estimation takes
11.2ms. The entire pipeline introduces approximately one frame
of latency. For an example of the entire pipeline running in real
time as well as puppeteering of the LBS hand model, please refer
to the supplementary video (screenshots from this video are shown
in Figure 14).
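A per-stage latency breakdown like the one above can be gathered with a simple harness; the following sketch (with hypothetical stage callables, not this work's API) illustrates the idea:

```python
import time

def time_pipeline(frame, stages):
    """Run `stages` (an ordered mapping of name -> callable) on a
    frame, recording per-stage latency in milliseconds."""
    timings, data = {}, frame
    for name, fn in stages.items():
        start = time.perf_counter()
        data = fn(data)
        timings[name] = (time.perf_counter() - start) * 1e3
    return data, timings

# Hypothetical usage (stage functions are stand-ins):
# pose, ms = time_pipeline(depth_frame, {
#     "rdf": run_decision_forest, "preprocess": preprocess_depth,
#     "convnet": run_convnet, "pose": estimate_pose})
```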
Figure 15 shows three typical failure cases of our system. In
Figure 15(a), the finite spatial precision of the ConvNet heat-map
results in fingertip positions that are not quite touching. In Figure 15(b), no
similar pose exists in the database used to train the ConvNet, and
for this example the network generalization performance was poor.
In Figure 15(c), the PrimeSense™ depth camera fails to detect the
ring finger (the surface area of the fingertip presented to the camera
is too small and the angle of incidence in the camera plane is too
shallow); thus the ConvNet has difficulty inferring the fingertip po-
sition without adequate support in the depth image, resulting in an
incorrect inferred position.
Figure 16 shows that the ConvNet output is tolerant of hand
shapes and sizes that are not well represented in the ConvNet train-
ing set. The ConvNet and RDF training sets did not include any
images for user (b) and user (c) (only user (a)). We have only
evaluated the system’s performance on adult subjects. We found
Fig. 15. Failure cases: RGB ground truth (top row); inferred model [Šarić
2011] pose (bottom row).
Fig. 16. Hand shape/size tolerance: RGB ground truth (top row); depth
with annotated ConvNet output positions (bottom row).
that adding a single per-user scale parameter to approximately ad-
just the size of the LBS model to a user’s hand helped the real-time
IK stage to better fit the ConvNet output.
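A sketch of this per-user adjustment, assuming the single scale parameter is applied uniformly to the LBS rest-pose vertices about their centroid (the centroid choice is an assumption):

```python
import numpy as np

def scale_rest_pose(rest_vertices, user_scale):
    """Uniformly scale the LBS model's rest vertices about their
    centroid so the model approximately matches a user's hand size
    before the real-time IK stage fits the ConvNet output."""
    center = rest_vertices.mean(axis=0)
    return center + user_scale * (rest_vertices - center)
```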
Comparison of the relative real-time performance of this work
with relevant prior art, such as that of 3Gear [2014] and Melax
et al. [2013], is difficult for a number of reasons. First, Melax et al.
[2013] use a different capture device that prevents fair comparison,
as it is impossible (without degrading sensor performance by using
mirrors) for multiple devices to simultaneously see the hand from
the same viewpoint. Second, no third-party ground-truth database
of poses with depth frames exists for human hands, so comparing
the quantitative accuracy of numerous methods against a known
baseline is not possible. More importantly, the technique utilized
by 3Gear [2014] is optimized for an entirely different use case,
so a fair comparison with their work is very
difficult. 3Gear [2014] utilizes a vertically mounted camera, can
track multiple hands simultaneously, and is computationally less
expensive than the method presented in our work.
Figure 17 compares the performance of this work with that of the
proprietary system of 3Gear [2014] (using the fixed-database version of
the library) for four poses chosen to highlight the relative difference
between the two techniques (images used with permission from
3Gear). We captured this data by streaming the output of both sys-
tems simultaneously (using the same RGBD camera). We mounted
the camera vertically, as is required by 3Gear [2014]; however,
our training set did not include any poses from this orientation.
Fig. 17. Comparison with a state-of-the-art commercial system: RGB
ground truth (top row); this work's inferred model [Šarić 2011] pose (middle
row); 3Gear [2014] inferred model pose (bottom row) (images used with
permission from 3Gear).
Therefore, we expect our system to perform suboptimally for this
very different use case.
8. FUTURE WORK
As indicated in Figure 16, we have found qualitatively that the
ConvNet's generalization to varying hand shapes is acceptable but
could be improved. We are confident that adding data from users
with different hand sizes to the training set will improve performance.
For this work, only the ConvNet forward-propagation stage was
implemented on the GPU. We are currently working on imple-
menting the entire pipeline on the GPU, which should significantly
improve the performance of the other pipeline stages. For example,
the GPU ConvNet implementation requires 5.6ms, while the same
network executed on the CPU (using optimized multithreaded C++
code) requires 139ms.
The current implementation of our system can track two hands
only if they are not interacting. While we have determined that
the dataset generation system can fit multiple strongly interacting
hand poses with sufficient accuracy, evaluating the neural network's
recognition performance on these poses remains future work. Likewise,
we hope to evaluate the recognition performance on hand poses
involving interactions with non-hand objects (such as pens and
other man-made devices).
While the pose recovery implementation presented in this work
is fast, we hope to augment this stage by including a model-based
fitting step that trades convergence radius for fit quality. Specifically,
we suspect that replacing our final IK stage with an energy-based
local optimization method, inspired by the work of Li et al. [2008],
could allow our method to recover second-order surface effects
such as skin folding and skin-muscle coupling from very limited
data and still maintain low latency. In addition to inference, such a
localized energy-minimizing stage would enable improvements to
the underlying model itself. Since these localized methods typically
require good registration, our method, which gives correspondence
from a single image, could advance the state-of-the-art in nonrigid
model capture.
Finally, we hope to augment our final IK stage with some form of
temporal pose prior to reduce jitter, for instance, using an extended
Kalman filter as a postprocessing step to clean up the ConvNet
feature output.
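As a starting point, even a linear Kalman filter with a constant-velocity motion model (a simplification of the extended Kalman filter suggested above; the noise levels below are assumptions) can smooth a single 2D feature trajectory:

```python
import numpy as np

class ConstantVelocityKF:
    """Linear Kalman filter with a constant-velocity motion model for
    one 2D feature position -- a simple stand-in for the proposed EKF
    postprocessing of the ConvNet feature output."""

    def __init__(self, dt=1 / 30.0, q=1.0, r=4.0):
        self.x = np.zeros(4)                        # state [u, v, du, dv]
        self.P = np.eye(4) * 100.0                  # state covariance
        self.F = np.eye(4)                          # constant-velocity transition
        self.F[0, 2] = self.F[1, 3] = dt
        self.H = np.zeros((2, 4))                   # we observe position only
        self.H[0, 0] = self.H[1, 1] = 1.0
        self.Q = np.eye(4) * q                      # process noise (assumed)
        self.R = np.eye(2) * r                      # measurement noise, px^2 (assumed)

    def update(self, z):
        """Predict one frame ahead, then correct with the measured
        ConvNet feature position z = [u, v]; returns the smoothed UV."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)    # Kalman gain
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]
```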
9. CONCLUSION
We have presented a novel pipeline for tracking the instantaneous
pose of articulable objects from a single depth image. As an applica-
tion of this pipeline, we showed state-of-the-art results for tracking
human hands in real time using commodity hardware. This pipeline
leverages the accuracy of offline model-based dataset generation
routines in support of a robust real-time convolutional network ar-
chitecture for feature extraction. We showed that it is possible to
use intermediate heat-map features to extract accurate and reli-
able 3D pose information at interactive frame rates using inverse
kinematics.
REFERENCES
3Gear. 2014. 3Gear Systems hand-tracking development platform. http://
www.threegear.com/.
B. Allen, B. Curless, and Z. Popović. 2003. The space of human body
shapes: Reconstruction and parameterization from range scans. ACM
Trans. Graph. 22, 3, 587–594.
L. Ballan, A. Taneja, J. Gall, L. Van Gool, and M. Pollefeys. 2012. Mo-
tion capture of hands in action using discriminative salient points. In
Proceedings of the 12th European Conference on Computer Vision. 640–653.
Y. Boykov, O. Veksler, and R. Zabih. 2001. Fast approximate energy mini-
mization via graph cuts. IEEE Trans. Pattern Anal. Mach. Intell. 23, 11,
1222–1239.
D. A. Butler, S. Izadi, O. Hilliges, D. Molyneaux, S. Hodges, and D. Kim.
2012. Shake’n’sense: Reducing interference for overlapping structured
light depth cameras. In Proceedings of the ACM Annual Conference on
Human Factors in Computing Systems. 1933–1936.
R. Collobert, K. Kavukcuoglu, and C. Farabet. 2011. Torch7: A matlab-like
environment for machine learning. http://ronan.collobert.com/pub/matos/
2011_torch7_nipsw.pdf.
C. Couprie, C. Farabet, L. Najman, and Y. Lecun. 2013. Indoor semantic
segmentation using depth information. In Proceedings of the International
Conference on Learning Representations.
A. Erol, G. Bebis, M. Nicolescu, R. D. Boyle, and X. Twombly. 2007. Vision-
based hand pose estimation: A review. Comput. Vis. Image Understand.
108, 1–2, 52–73.
C. Farabet, C. Couprie, L. Najman, and Y. Lecun. 2013. Learning hierarchi-
cal features for scene labeling. IEEE Trans. Pattern Anal. Mach. Intell.
35, 8, 1915–1929.
B. K. P. Horn. 1987. Closed-form solution of absolute orientation using unit
quaternions. J. Opt. Soc. Amer. 4, 4, 629–642.
K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. Lecun. 2009. What is the
best multi-stage architecture for object recognition? In Proceedings of the
12th IEEE International Conference on Computer Vision. 2146–2153.
M. Jiu, C. Wolf, G. W. Taylor, and A. Baskurt. 2013. Human body part
estimation from depth images via spatially-constrained deep learning.
Pattern Recogn. Lett. (to appear).
C. Keskin, F. Kirac, Y. Kara, and L. Akarun. 2011. Real time hand pose
estimation using depth sensors. In Proceedings of the IEEE International
Computer Vision Workshops. 1228–1234.
C. Keskin, F. Kirac, Y. E. Kara, and L. Akarun. 2012. Hand pose estimation
and hand shape classification using multi-layered randomized decision
forests. In Proceedings of the 12th European Conference on Computer
Vision. Vol. 6. Springer, 852–863.
A. Krizhevsky, I. Sutskever, and G. Hinton. 2012. Imagenet classification
with deep convolutional neural networks. In Proceedings of the Neu-
ral Information Processing Systems Conference. P. Bartlett, F. Pereira,
C. Burges, L. Bottou, and K. Weinberger, Eds., 1106–1114.
Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-based learn-
ing applied to document recognition. Proc. IEEE 86, 11, 2278–2324.
Y. Lecun, F. J. Huang, and L. Bottou. 2004. Learning methods for generic
object recognition with invariance to pose and lighting. In Proceedings of
the IEEE Computer Society Conference on Computer Vision and Pattern
Recognition. Vol. 2. 97–104.
H. Li, R. W. Sumner, and M. Pauly. 2008. Global correspondence optimiza-
tion for non-rigid registration of depth scans. Comput. Graph. Forum 27,
5, 1421–1430.
H. Li, J. Yu, Y. Ye, and C. Bregler. 2013. Realtime facial animation with
on-the-fly correctives. ACM Trans. Graph. 32, 4.
S. Melax, L. Keselman, and S. Orsten. 2013. Dynamics based 3D skeletal
hand tracking. In Proceedings of the ACM Symposium on Interactive 3D
Graphics and Games.
J. Nagi, F. Ducatelle, G. Di Caro, D. Ciresan, U. Meier, A. Giusti, F. Nagi,
J. Schmidhuber, and L. Gambardella. 2011. Max-pooling convolutional
neural networks for vision-based hand gesture recognition. In Proceedings
of the IEEE International Conference on Signal and Image Processing
Applications. 342–347.
S. J. Nowlan and J. C. Platt. 1995. A convolutional neural network hand
tracker. In Proceedings of the Neural Information Processing Systems
Conference. 901–908.
I. Oikonomidis, N. Kyriazis, and A. Argyros. 2011. Efficient model-based
3D tracking of hand articulations using Kinect. In Proceedings of the
British Machine Vision Conference.
M. Osadchy, Y. Lecun, M. L. Miller, and P. Perona. 2005. Synergistic face
detection and pose estimation with energy-based model. In Proceedings
of the Neural Information Processing Systems Conference. 1017–1024.
J. M. Rehg and T. Kanade. 1994. Visual tracking of high dof articulated
structures: An application to human hand tracking. In Proceedings of the
3rd European Conference on Computer Vision. 35–46.
M. Šarić. 2011. Libhand: A library for hand articulation. http://www.libhand.org/.
J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore,
A. Kipman, and A. Blake. 2011. Real-time human pose recognition in
parts from single depth images. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition. 1297–1304.
M. Stein, J. Tompson, X. Xiao, C. Hendee, H. Ishii, and K. Perlin. 2012.
Arcade: A system for augmenting gesture-based computer graphic pre-
sentations. In Proceedings of the ACM SIGGRAPH Computer Animation
Festival. 77–77.
G. W. Taylor, I. Spiro, C. Bregler, and R. Fergus. 2011. Learning in-
variance through imitation. In Proceedings of the IEEE Computer So-
ciety Conference on Computer Vision and Pattern Recognition. 2729–
2736.
P. Tseng. 1995. Fortified-descent simplicial search method: A general ap-
proach. SIAM J. Optim. 10, 1, 269–288.
R. Wang, S. Paris, and J. Popović. 2011. 6D hands: Markerless tracking for
computer aided design. In Proceedings of the 24th Annual ACM Sympo-
sium on User Interface Software and Technology. 549–558.
R. Y. Wang and J. Popović. 2009. Real-time hand-tracking with a color
glove. ACM Trans. Graph. 28, 3.
T. Weise, H. Li, L. Van Gool, and M. Pauly. 2009. Face/off: Live facial pup-
petry. In Proceedings of the ACM SIGGRAPH/Eurographics Symposium
on Computer Animation.
T. Yasuda, K. Ohkura, and Y. Matsumura. 2010. Extended PSO with partial
randomization for large scale multimodal problems. In Proceedings of the
World Automation Congress. 1–6.
W. Zhao, J. Chai, and Y.-Q. Xu. 2012. Combining marker-based MOCAP
and RGB-D camera for acquiring high-fidelity hand motion data. In Pro-
ceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer
Animation. Eurographics Association, 33–42.
Received August 2013; revised January 2014; accepted March 2014