Real-Time Continuous Pose Recovery of Human Hands Using Convolutional Networks
Fig. 10. RDF Error.
Fig. 11. Dataset creation: objective function data (with Libhand model [Šarić 2011]).
approximately 12 hours. For each node in the tree, 10,000 weak
learners were sampled. The error ratio of the number of incorrect
pixel labels to total number of hand pixels in the dataset for varying
tree counts and tree heights is shown in Figure 10.
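For concreteness, the error metric of Figure 10 can be expressed in a few lines. This is an illustrative sketch (with hypothetical label arrays), not the evaluation code used in this work:

```python
import numpy as np

def pixel_error_ratio(pred_labels, true_labels):
    """Error ratio from Figure 10: incorrect pixel labels (false
    positives plus false negatives) divided by the total number of
    ground-truth hand pixels. Labels: 1 = hand, 0 = background."""
    incorrect = np.sum(pred_labels != true_labels)
    return incorrect / np.sum(true_labels == 1)
```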
We found that four trees with a height of 25 offered a good trade-off
between classification accuracy and speed. The validation-set classifi-
cation error for four trees of depth 25 was 4.1%. Of the classification
errors, 76.3% were false positives and 23.7% were false negatives.
We found that, in practice, small clusters of false positive pixel
labels can be easily removed using median filtering and blob detec-
tion. The most common classification failures occur when the hand
is occluded by another body part (causing false positives), or when
the elbow is much closer to the camera than the hand (causing false
positives on the elbow). We believe this inaccuracy results from the
training set not containing any frames with these poses. A more
comprehensive dataset containing examples of these poses should
improve performance in the future.
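A minimal sketch of this cleanup step, using SciPy's median filter and connected-component (blob) labeling; the filter size and blob threshold below are assumptions, not the values used in this work:

```python
import numpy as np
from scipy import ndimage

def clean_hand_mask(mask, filter_size=5, min_blob_px=500):
    """Suppress small clusters of false-positive hand labels in a
    binary mask via median filtering followed by blob detection."""
    # Median filter removes isolated mislabeled pixels.
    filtered = ndimage.median_filter(mask.astype(np.uint8), size=filter_size)
    # Connected-component labeling finds the remaining blobs.
    labels, n = ndimage.label(filtered)
    sizes = ndimage.sum(filtered, labels, index=range(1, n + 1))
    # Keep only blobs large enough to plausibly be a hand.
    big = 1 + np.flatnonzero(sizes >= min_blob_px)
    return np.isin(labels, big).astype(np.uint8)
```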
Since we do not have a ground-truth measure for the 42DOF
hand model fitting, quantitative evaluation of this stage is difficult.
Qualitatively, the fitting accuracy was visually consistent with the
underlying point cloud. An example of a fitted frame is shown in
Figure 11. Only a very small number of poses failed to fit correctly;
for these difficult poses, manual intervention was required.
One limitation of this system was that the frame rate of the
PrimeSense™ camera (30fps) was not sufficient to ensure enough
temporal coherence for correct convergence of the PSO optimizer.
To overcome this, we had each user move her hands slowly during
training data capture.
Fig. 12. Sample ConvNet test-set images.
Fig. 13. ConvNet Learning Curve.
Using a workstation with an Nvidia GTX
580 GPU and 4-core Intel processor, fitting each frame required 3
to 6 seconds. The final database consisted of 76,712 training-set
images, 2,421 validation-set images, and 2,000 test-set images with
their corresponding heat-maps, collected from multiple participants.
A small sample of the test-set images is shown in Figure 12.
The ConvNet training took approximately 24 hours; early
stopping was performed after 350 epochs to prevent overfit-
ting. ConvNet hyperparameters, such as learning rate, momentum,
L2-regularization, and architectural parameters (e.g., max-pooling
window size or number of stages) were chosen by coarse meta-
optimization to minimize a validation-set error. Two stages of con-
volution (at each resolution level) and two fully connected neural
network stages were chosen as a trade-off between numerous perfor-
mance characteristics: generalization performance, evaluation time,
and model complexity (or ability to infer complex poses). Figure 13
shows the Mean Squared Error (MSE) after each epoch. The MSE
was calculated by taking the mean of sum-of-squared differences
between the calculated 14 feature maps and the corresponding target
feature maps.
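A framework-agnostic sketch of this loss and of the early-stopping criterion follows; the `train_epoch`/`predict` callables and the patience value are hypothetical, and this is not the training code used in this work:

```python
import numpy as np

def heatmap_mse(pred, target):
    """Mean over examples of the sum-of-squared differences between
    the 14 predicted heat-maps and their targets (Figure 13)."""
    # pred, target: arrays of shape (num_examples, 14, height, width).
    return np.mean(np.sum((pred - target) ** 2, axis=(1, 2, 3)))

def train(train_epoch, predict, val_targets, max_epochs=350, patience=25):
    """Run up to max_epochs epochs, tracking the lowest validation MSE
    and stopping early once it has stalled for `patience` epochs."""
    best_mse, best_epoch = np.inf, 0
    for epoch in range(max_epochs):
        train_epoch()                              # one pass over the training set
        mse = heatmap_mse(predict(), val_targets)  # validation heat-maps
        if mse < best_mse:
            best_mse, best_epoch = mse, epoch      # checkpoint would be saved here
        elif epoch - best_epoch >= patience:
            break                                  # early stopping
    return best_mse
```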
The mean UV error of the ConvNet heat-map output on the test-
set data was 0.41px (with standard deviation of 0.35px) on the
18×18-resolution heat-map image¹. After each heat-map feature
was translated to the 640×480 depth image, the mean UV error
was 5.8px (with a standard deviation of 4.9px). Since the heat-map
downsampling ratio is depth dependent, the UV error improves as
the hand approaches the sensor. For applications that require greater
spatial accuracy, the heat-map resolution can be increased at the
cost of increased latency and reduced throughput.
¹To calculate this error, we used the technique described in Section 6 to
compute the heat-map UV feature location and then measured the error
distance between the target and ConvNet output locations.
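To make the error computation concrete, the sketch below substitutes a simple argmax for the sub-pixel technique of Section 6 and assumes a square, depth-dependent hand crop described by a hypothetical `(u0, v0, side)` bounding box:

```python
import numpy as np

def heatmap_uv(heatmap):
    """Coarse UV feature location as the heat-map argmax (Section 6
    refines this to sub-pixel precision; argmax is a simplification)."""
    v, u = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return np.array([u, v], dtype=np.float64)

def uv_to_depth_image(uv, bbox, heatmap_res=18):
    """Map a heat-map UV back into the 640x480 depth image. Because
    the crop side length grows as the hand approaches the sensor, the
    effective upsampling ratio (and hence the UV error measured in
    depth-image pixels) is depth dependent."""
    u0, v0, side = bbox                # square hand crop in depth-image pixels
    scale = side / float(heatmap_res)
    return np.array([u0, v0]) + (uv + 0.5) * scale

def uv_error(pred_uv, target_uv):
    """Euclidean error distance between target and output locations."""
    return float(np.linalg.norm(pred_uv - target_uv))
```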
Table I. Heat-Map UV Error by Feature Type

Feature Type              Mean (px)   STD (px)
Palm                        0.33        0.30
Thumb Base & Knuckle        0.33        0.43
Thumb Tip                   0.39        0.55
Finger Knuckle              0.38        0.27
Finger Tip                  0.54        0.33
Fig. 14. Real-time tracking results: (a) typical hardware setup; (b) depth
with heat-map features; (c) ConvNet input and pose output.
Table I shows the UV accuracy for each feature type. Unsurpris-
ingly, we found that the ConvNet architecture had the most difficulty
learning fingertip positions, whose mean error is 61% higher than
that of the palm features. The likely cause for this inaccu-
racy is twofold. First, the fingertip positions undergo a large range
of motion between various hand poses and therefore the ConvNet
must learn a more difficult mapping between local features and
fingertip positions. Second, the PrimeSense™ Carmine 1.09 depth
camera cannot always recover the depth of small surfaces such as
fingertips. The ConvNet is able to learn this noise behavior and is
actually able to approximate fingertip location in the presence of
missing data. However, the accuracy for these poses is low.
The computation time of the entire pipeline is 24.9ms, which is
within our 30fps performance target. Within this period: decision
forest evaluation takes 3.4ms, depth image preprocessing takes
4.7ms, ConvNet evaluation takes 5.6ms, and pose estimation takes
11.2ms. The entire pipeline introduces approximately one frame
of latency. For an example of the entire pipeline running in real
time as well as puppeteering of the LBS hand model, please refer
to the supplementary video (screenshots from this video are shown
in Figure 14).
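A per-stage latency breakdown like the one above can be gathered with a simple harness; the following sketch (with hypothetical stage callables, not this work's API) illustrates the idea:

```python
import time

def time_pipeline(frame, stages):
    """Run `stages` (an ordered mapping of name -> callable) on a
    frame, recording per-stage latency in milliseconds."""
    timings, data = {}, frame
    for name, fn in stages.items():
        start = time.perf_counter()
        data = fn(data)
        timings[name] = (time.perf_counter() - start) * 1e3
    return data, timings

# Hypothetical usage (stage functions are stand-ins):
# pose, ms = time_pipeline(depth_frame, {
#     "rdf": run_decision_forest, "preprocess": preprocess_depth,
#     "convnet": run_convnet, "pose": estimate_pose})
```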
Figure 15 shows three typical failure cases of our system. In
Figure 15(a), the finite spatial precision of the ConvNet heat-map
results in fingertip positions that are not quite touching. In Figure 15(b), no
similar pose exists in the database used to train the ConvNet, and
for this example the network generalization performance was poor.
In Figure 15(c), the PrimeSense™ depth camera fails to detect the
ring finger (the surface area of the fingertip presented to the camera
is too small and the angle of incidence in the camera plane is too
shallow); thus the ConvNet has difficulty inferring the fingertip po-
sition without adequate support in the depth image, resulting in an
incorrect inferred position.
Figure 16 shows that the ConvNet output is tolerant of hand
shapes and sizes that are not well represented in the ConvNet train-
ing set. The ConvNet and RDF training sets did not include any
images for user (b) and user (c) (only user (a)). We have only
evaluated the system’s performance on adult subjects. We found
Fig. 15. Failure cases: RGB ground truth (top row); inferred model [Šarić
2011] pose (bottom row).
Fig. 16. Hand shape/size tolerance: RGB ground truth (top row); depth
with annotated ConvNet output positions (bottom row).
that adding a single per-user scale parameter to approximately ad-
just the size of the LBS model to a user’s hand helped the real-time
IK stage to better fit the ConvNet output.
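A sketch of this per-user adjustment, assuming the single scale parameter is applied uniformly to the LBS rest-pose vertices about their centroid (the centroid choice is an assumption):

```python
import numpy as np

def scale_rest_pose(rest_vertices, user_scale):
    """Uniformly scale the LBS model's rest vertices about their
    centroid so the model approximately matches a user's hand size
    before the real-time IK stage fits the ConvNet output."""
    center = rest_vertices.mean(axis=0)
    return center + user_scale * (rest_vertices - center)
```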
Comparison of the relative real-time performance of this work
with relevant prior art, such as that of 3Gear [2014] and Melax
et al. [2013], is difficult for a number of reasons. First, Melax et al.
[2013] use a different capture device that prevents fair comparison,
as it is impossible (without degrading sensor performance by using
mirrors) for multiple devices to simultaneously see the hand from
the same viewpoint. Second, no third-party ground-truth database
of poses with depth frames exists for human hands, so comparing
the quantitative accuracy of numerous methods against a known
baseline is not possible. More importantly, the technique utilized
by 3Gear [2014] is optimized for an entirely different use case,
so a fair comparison with their work is very
difficult. 3Gear [2014] utilizes a vertically mounted camera, can
track multiple hands simultaneously, and is computationally less
expensive than the method presented in our work.
Figure 17 compares the performance of this work with that of the
proprietary system of 3Gear [2014] (using the fixed-database version of
the library) for four poses chosen to highlight the relative difference
between the two techniques (images used with permission from
3Gear). We captured this data by streaming the output of both sys-
tems simultaneously (using the same RGBD camera). We mounted
the camera vertically, as is required by 3Gear [2014]; however,
our training set did not include any poses from this orientation.
Fig. 17. Comparison with a state-of-the-art commercial system: RGB
ground truth (top row); this work's inferred model [Šarić 2011] pose (middle
row); 3Gear [2014] inferred model pose (bottom row) (images used with
permission from 3Gear).
Therefore, we expect our system to perform suboptimally for this
very different use case.
8. FUTURE WORK
As indicated in Figure 16, we have found qualitatively that the
ConvNet's generalization to varying hand shapes is acceptable but
could be improved. We are confident that adding data from users
with different hand sizes to the training set will improve performance.
For this work, only the ConvNet forward-propagation stage was
implemented on the GPU. We are currently working on imple-
menting the entire pipeline on the GPU, which should significantly
improve the performance of the other pipeline stages. For example,
the GPU ConvNet implementation requires 5.6ms, while the same
network executed on the CPU (using optimized multithreaded C++
code) requires 139ms.
The current implementation of our system can track two hands
only if they are not interacting. While we have determined that
the dataset generation system can fit multiple strongly interacting
hand poses with sufficient accuracy, evaluating the neural network's
recognition performance on these poses remains future work. Likewise,
we hope to evaluate the recognition performance on hand poses
involving interactions with non-hand objects (such as pens and
other man-made devices).
While the pose recovery implementation presented in this work
is fast, we hope to augment this stage by including a model-based
fitting step that trades convergence radius for fit quality. Specifically,
we suspect that replacing our final IK stage with an energy-based
local optimization method, inspired by the work of Li et al. [2008],
could allow our method to recover second-order surface effects
such as skin folding and skin-muscle coupling from very limited
data and still maintain low latency. In addition to inference, such a
localized energy-minimizing stage would enable improvements to
the underlying model itself. Since these localized methods typically
require good registration, our method, which gives correspondence
from a single image, could advance the state-of-the-art in nonrigid
model capture.
Finally, we hope to augment our final IK stage with some form of
temporal pose prior to reduce jitter, for instance, using an extended
Kalman filter as a postprocessing step to clean up the ConvNet
feature output.
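As a starting point, even a linear Kalman filter with a constant-velocity motion model (a simplification of the extended Kalman filter suggested above; the noise levels below are assumptions) can smooth a single 2D feature trajectory:

```python
import numpy as np

class ConstantVelocityKF:
    """Linear Kalman filter with a constant-velocity motion model for
    one 2D feature position -- a simple stand-in for the proposed EKF
    postprocessing of the ConvNet feature output."""

    def __init__(self, dt=1 / 30.0, q=1.0, r=4.0):
        self.x = np.zeros(4)                        # state [u, v, du, dv]
        self.P = np.eye(4) * 100.0                  # state covariance
        self.F = np.eye(4)                          # constant-velocity transition
        self.F[0, 2] = self.F[1, 3] = dt
        self.H = np.zeros((2, 4))                   # we observe position only
        self.H[0, 0] = self.H[1, 1] = 1.0
        self.Q = np.eye(4) * q                      # process noise (assumed)
        self.R = np.eye(2) * r                      # measurement noise, px^2 (assumed)

    def update(self, z):
        """Predict one frame ahead, then correct with the measured
        ConvNet feature position z = [u, v]; returns the smoothed UV."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)    # Kalman gain
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]
```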
9. CONCLUSION
We have presented a novel pipeline for tracking the instantaneous
pose of articulable objects from a single depth image. As an applica-
tion of this pipeline, we showed state-of-the-art results for tracking
human hands in real time using commodity hardware. This pipeline
leverages the accuracy of offline model-based dataset generation
routines in support of a robust real-time convolutional network ar-
chitecture for feature extraction. We showed that it is possible to
use intermediate heat-map features to extract accurate and reli-
able 3D pose information at interactive frame rates using inverse
kinematics.
REFERENCES
3Gear. 2014. 3Gear Systems hand-tracking development platform. http://
www.threegear.com/.
B. Allen, B. Curless, and Z. Popović. 2003. The space of human body
shapes: Reconstruction and parameterization from range scans. ACM
Trans. Graph. 22, 3, 587–594.
L. Ballan, A. Taneja, J. Gall, L. Van Gool, and M. Pollefeys. 2012. Mo-
tion capture of hands in action using discriminative salient points. In
Proceedings of the 12th European Conference on Computer Vision. 640–653.
Y. Boykov, O. Veksler, and R. Zabih. 2001. Fast approximate energy mini-
mization via graph cuts. IEEE Trans. Pattern Anal. Mach. Intell. 23, 11,
1222–1239.
D. A. Butler, S. Izadi, O. Hilliges, D. Molyneaux, S. Hodges, and D. Kim.
2012. Shake’n’sense: Reducing interference for overlapping structured
light depth cameras. In Proceedings of the ACM Annual Conference on
Human Factors in Computing Systems. 1933–1936.
R. Collobert, K. Kavukcuoglu, and C. Farabet. 2011. Torch7: A matlab-like
environment for machine learning. http://ronan.collobert.com/pub/matos/
2011_torch7_nipsw.pdf.
C. Couprie, C. Farabet, L. Najman, and Y. Lecun. 2013. Indoor semantic
segmentation using depth information. In Proceedings of the International
Conference on Learning Representations.
A. Erol, G. Bebis, M. Nicolescu, R. D. Boyle, and X. Twombly. 2007. Vision-
based hand pose estimation: A review. Comput. Vis. Image Understand.
108, 1–2, 52–73.
C. Farabet, C. Couprie, L. Najman, and Y. Lecun. 2013. Learning hierarchi-
cal features for scene labeling. IEEE Trans. Pattern Anal. Mach. Intell.
35, 8, 1915–1929.
B. K. P. Horn. 1987. Closed-form solution of absolute orientation using unit
quaternions. J. Opt. Soc. Amer. 4, 4, 629–642.
K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. Lecun. 2009. What is the
best multi-stage architecture for object recognition? In Proceedings of the
12th IEEE International Conference on Computer Vision. 2146–2153.
M. Jiu, C. Wolf, G. W. Taylor, and A. Baskurt. 2013. Human body part
estimation from depth images via spatially-constrained deep learning.
Pattern Recogn. Lett. (to appear).
C. Keskin, F. Kirac, Y. Kara, and L. Akarun. 2011. Real time hand pose
estimation using depth sensors. In Proceedings of the IEEE International
Computer Vision Workshops. 1228–1234.
C. Keskin, F. Kirac, Y. E. Kara, and L. Akarun. 2012. Hand pose estimation
and hand shape classification using multi-layered randomized decision
forests. In Proceedings of the 12th European Conference on Computer
Vision. Vol. 6. Springer, 852–863.
A. Krizhevsky, I. Sutskever, and G. Hinton. 2012. Imagenet classification
with deep convolutional neural networks. In Proceedings of the Neu-
ral Information Processing Systems Conference. P. Bartlett, F. Pereira,
C. Burges, L. Bottou, and K. Weinberger, Eds., 1106–1114.
Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-based learn-
ing applied to document recognition. Proc. IEEE 86, 11, 2278–2324.
Y. Lecun, F. J. Huang, and L. Bottou. 2004. Learning methods for generic
object recognition with invariance to pose and lighting. In Proceedings of
the IEEE Computer Society Conference on Computer Vision and Pattern
Recognition. Vol. 2. 97–104.
H. Li, R. W. Sumner, and M. Pauly. 2008. Global correspondence optimiza-
tion for non-rigid registration of depth scans. Comput. Graph. Forum 27,
5, 1421–1430.
H. Li, J. Yu, Y. Ye, and C. Bregler. 2013. Realtime facial animation with
on-the-fly correctives. ACM Trans. Graph. 32, 4.
S. Melax, L. Keselman, and S. Orsten. 2013. Dynamics based 3D skeletal
hand tracking. In Proceedings of the ACM Symposium on Interactive 3D
Graphics and Games.
J. Nagi, F. Ducatelle, G. Di Caro, D. Ciresan, U. Meier, A. Giusti, F. Nagi,
J. Schmidhuber, and L. Gambardella. 2011. Max-pooling convolutional
neural networks for vision-based hand gesture recognition. In Proceedings
of the IEEE International Conference on Signal and Image Processing
Applications. 342–347.
S. J. Nowlan and J. C. Platt. 1995. A convolutional neural network hand
tracker. In Proceedings of the Neural Information Processing Systems
Conference. 901–908.
I. Oikonomidis, N. Kyriazis, and A. Argyros. 2011. Efficient model-based
3D tracking of hand articulations using Kinect. In Proceedings of the
British Machine Vision Conference.
M. Osadchy, Y. Lecun, M. L. Miller, and P. Perona. 2005. Synergistic face
detection and pose estimation with energy-based model. In Proceedings
of the Neural Information Processing Systems Conference. 1017–1024.
J. M. Rehg and T. Kanade. 1994. Visual tracking of high dof articulated
structures: An application to human hand tracking. In Proceedings of the
3rd European Conference on Computer Vision. 35–46.
M. Šarić. 2011. Libhand: A library for hand articulation. http://www.libhand.org/.
J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore,
A. Kipman, and A. Blake. 2011. Real-time human pose recognition in
parts from single depth images. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition. 1297–1304.
M. Stein, J. Tompson, X. Xiao, C. Hendee, H. Ishii, and K. Perlin. 2012.
Arcade: A system for augmenting gesture-based computer graphic pre-
sentations. In Proceedings of the ACM SIGGRAPH Computer Animation
Festival. 77–77.
G. W. Taylor, I. Spiro, C. Bregler, and R. Fergus. 2011. Learning in-
variance through imitation. In Proceedings of the IEEE Computer So-
ciety Conference on Computer Vision and Pattern Recognition. 2729–
2736.
P. Tseng. 1995. Fortified-descent simplicial search method: A general ap-
proach. SIAM J. Optim. 10, 1, 269–288.
R. Wang, S. Paris, and J. Popović. 2011. 6D hands: Markerless tracking for
computer aided design. In Proceedings of the 24th Annual ACM Sympo-
sium on User Interface Software and Technology. 549–558.
R. Y. Wang and J. Popović. 2009. Real-time hand-tracking with a color
glove. ACM Trans. Graph. 28, 3.
T. Weise, H. Li, L. Van Gool, and M. Pauly. 2009. Face/off: Live facial pup-
petry. In Proceedings of the ACM SIGGRAPH/Eurographics Symposium
on Computer Animation.
T. Yasuda, K. Ohkura, and Y. Matsumura. 2010. Extended PSO with partial
randomization for large scale multimodal problems. In Proceedings of the
World Automation Congress. 1–6.
W. Zhao, J. Chai, and Y.-Q. Xu. 2012. Combining marker-based MOCAP
and RGB-D camera for acquiring high-fidelity hand motion data. In Pro-
ceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer
Animation. Eurographics Association, 33–42.
Received August 2013; revised January 2014; accepted March 2014