Journal ArticleDOI

Automatic and Efficient Human Pose Estimation for Sign Language Videos

TLDR
A fully automatic arm and hand tracker that detects joint positions over continuous sign language video sequences of more than an hour in length; it outperforms the state-of-the-art long-term tracker of Buehler et al. and achieves superior joint localisation to the pose estimation method of Yang and Ramanan.
Abstract
We present a fully automatic arm and hand tracker that detects joint positions over continuous sign language video sequences of more than an hour in length. To achieve this, we make contributions in four areas: (i) we show that the overlaid signer can be separated from the background TV broadcast using co-segmentation over all frames with a layered model; (ii) we show that joint positions (shoulders, elbows, wrists) can be predicted per frame using a random forest regressor given only this segmentation and a colour model; (iii) we show that the random forest can be trained from an existing semi-automatic, but computationally expensive, tracker; and (iv) we introduce an evaluator to assess whether the predicted joint positions are correct for each frame. The method is applied to 20 sign language videos with changing backgrounds, challenging imaging conditions, and different signers. Our framework outperforms the state-of-the-art long-term tracker by Buehler et al. (International Journal of Computer Vision 95:180–197, 2011), does not require the manual annotation of that work, and, after automatic initialisation, performs tracking in real time. We also achieve superior joint localisation results to those obtained using the pose estimation method of Yang and Ramanan (Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011).
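The per-frame regression step (ii) can be illustrated with a short, hedged sketch: a random forest regressor maps a signer segmentation to joint coordinates, trained on frames labelled by the slower semi-automatic tracker (step iii). The feature choice below (a downsampled binary mask), the data, and all shapes are simplifications introduced for this example, not the descriptors used in the paper.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def mask_to_features(mask, size=(32, 32)):
    """Downsample a binary signer segmentation into a fixed-length feature vector."""
    h, w = mask.shape
    ys = np.linspace(0, h - 1, size[0]).astype(int)
    xs = np.linspace(0, w - 1, size[1]).astype(int)
    return mask[np.ix_(ys, xs)].astype(np.float32).ravel()

# Illustrative training data: segmentation masks with joint labels coming from
# the computationally expensive semi-automatic tracker.
rng = np.random.default_rng(0)
masks = rng.integers(0, 2, size=(500, 120, 90))            # 500 training frames
joints = rng.uniform(0.0, 1.0, size=(500, 12))             # 6 joints x (x, y), normalised

X = np.stack([mask_to_features(m) for m in masks])
forest = RandomForestRegressor(n_estimators=50, n_jobs=-1).fit(X, joints)

# At test time each frame needs only a single forest evaluation, which is why
# per-frame prediction can run in real time after automatic initialisation.
test_mask = rng.integers(0, 2, size=(120, 90))
pred = forest.predict(mask_to_features(test_mask)[None])   # shape (1, 12)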



Citations
Proceedings ArticleDOI

Flowing ConvNets for Human Pose Estimation in Videos

TL;DR: This work proposes a ConvNet architecture that benefits from temporal context by combining information across multiple frames using optical flow; it outperforms a number of alternatives, including one that uses optical flow only at the input layers, one that regresses joint coordinates directly, and one that predicts heatmaps without spatial fusion.
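The temporal-fusion idea can be sketched outside the network: warp a neighbouring frame's joint heatmap into the current frame with dense optical flow, then pool the aligned heatmaps. In this hedged example, OpenCV's Farneback flow and a plain average stand in for the flow and the learned spatial fusion used in the paper; the function names are illustrative.

import cv2
import numpy as np

def warp_heatmap(heatmap, flow):
    """Backward-warp an (H, W) heatmap with a dense (H, W, 2) flow field."""
    h, w = heatmap.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return cv2.remap(heatmap.astype(np.float32), map_x, map_y, cv2.INTER_LINEAR)

def fuse_heatmaps(cur_gray, nbr_gray, cur_heatmap, nbr_heatmap):
    # cur_gray / nbr_gray are 8-bit grayscale frames. The dense flow from the
    # current frame to the neighbour is used to pull the neighbour's heatmap
    # back into the current frame's coordinates.
    flow = cv2.calcOpticalFlowFarneback(cur_gray, nbr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    aligned = warp_heatmap(nbr_heatmap, flow)
    return 0.5 * (cur_heatmap + aligned)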

An affine invariant salient region detector

TL;DR: A novel technique for detecting salient regions in an image is described; it is a generalization of the method introduced by Kadir and Brady to affine invariance.
Proceedings ArticleDOI

Neural Sign Language Translation

TL;DR: This work formalizes SLT in the framework of Neural Machine Translation (NMT) for both end-to-end and pretrained settings (using expert knowledge), and jointly learns the spatial representations, the underlying language model, and the mapping between sign and spoken language.
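As a hedged illustration of the NMT framing, the sketch below wires a recurrent encoder over per-frame visual features to a recurrent decoder over spoken-language tokens. The published model additionally uses attention and learned spatial embeddings; the class name, feature dimension, and vocabulary size here are placeholders.

import torch
import torch.nn as nn

class Sign2TextSeq2Seq(nn.Module):
    def __init__(self, feat_dim=1024, vocab_size=3000, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frame_feats, target_tokens):
        # frame_feats: (batch, n_frames, feat_dim); target_tokens: (batch, n_words)
        _, h = self.encoder(frame_feats)                          # summarise the sign video
        dec_out, _ = self.decoder(self.embed(target_tokens), h)   # teacher forcing
        return self.out(dec_out)                                  # per-step logits over spoken words

model = Sign2TextSeq2Seq()
logits = model(torch.randn(2, 40, 1024), torch.randint(0, 3000, (2, 12)))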
Journal ArticleDOI

Monocular human pose estimation: A survey of deep learning-based methods

TL;DR: This survey extensively reviews the recent deep learning-based 2D and 3D human pose estimation methods published since 2014, summarizes the challenges, main frameworks, benchmark datasets, evaluation metrics, and performance comparisons, and discusses some promising future research directions.
Book ChapterDOI

Deep Convolutional Neural Networks for Efficient Pose Estimation in Gesture Videos

TL;DR: This work is, to the authors' knowledge, the first to use ConvNets for estimating human pose in videos; it introduces a new network that exploits temporal information from multiple frames, leading to better performance.
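One simple way to give a single ConvNet temporal context, sketched here under assumptions not taken from the paper, is to stack n consecutive RGB frames along the channel axis before the first convolution. Layer sizes, joint count, and the class name are illustrative.

import torch
import torch.nn as nn

class MultiFramePoseNet(nn.Module):
    def __init__(self, n_frames=3, n_joints=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3 * n_frames, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.regress = nn.Linear(64, 2 * n_joints)   # (x, y) per joint

    def forward(self, clip):
        # clip: (batch, n_frames, 3, H, W) -> stack frames as input channels
        b, t, c, h, w = clip.shape
        x = clip.reshape(b, t * c, h, w)
        return self.regress(self.features(x).flatten(1))

clip = torch.randn(4, 3, 3, 64, 64)      # batch of 4 clips, 3 frames each
coords = MultiFramePoseNet()(clip)       # (4, 14) predicted joint coordinates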
References
Journal ArticleDOI

Random Forests

TL;DR: Internal estimates monitor error, strength, and correlation; these are used to show the response to increasing the number of features used in the forest, and they are also applicable to regression.
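The "internal estimates" are the out-of-bag (OOB) statistics: because each tree is fit on a bootstrap sample, the samples it never saw give an error estimate without a separate validation set. The hedged sketch below uses scikit-learn's random forest regressor on synthetic data to show how the OOB score responds as the number of features considered per split is increased; the data and parameter values are illustrative.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=20, noise=0.5, random_state=0)

for n_features in (2, 5, 10, 20):
    forest = RandomForestRegressor(n_estimators=200, max_features=n_features,
                                   bootstrap=True, oob_score=True, random_state=0)
    forest.fit(X, y)
    # oob_score_ is R^2 on the out-of-bag samples: an internal generalisation estimate.
    print(f"max_features={n_features:2d}  OOB R^2={forest.oob_score_:.3f}")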
Proceedings ArticleDOI

Histograms of oriented gradients for human detection

TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
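As a hedged, self-contained illustration of HOG-based human detection, the sketch below uses OpenCV's built-in HOG descriptor together with its pretrained linear SVM for pedestrians; the image path is a placeholder, and the detector parameters are typical defaults rather than values from the paper.

import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

image = cv2.imread("people.jpg")                       # placeholder input image
rects, weights = hog.detectMultiScale(image, winStride=(8, 8), scale=1.05)

# Draw one bounding box per detected person.
for (x, y, w, h) in rects:
    cv2.rectangle(image, (int(x), int(y)), (int(x + w), int(y + h)), (0, 255, 0), 2)
cv2.imwrite("people_detected.jpg", image)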
Journal ArticleDOI

"GrabCut": interactive foreground extraction using iterated graph cuts

TL;DR: A more powerful, iterative version of the graph-cut optimisation is developed, and the power of the iterative algorithm is used to substantially simplify the user interaction needed for a given quality of result.
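A hedged usage sketch of the iterative graph-cut idea via OpenCV's GrabCut implementation: a single user-supplied rectangle initialises the foreground/background colour models, and a few optimisation iterations refine the segmentation. The image path and rectangle are placeholders.

import cv2
import numpy as np

image = cv2.imread("frame.jpg")                        # placeholder input image
mask = np.zeros(image.shape[:2], np.uint8)
bgd_model = np.zeros((1, 65), np.float64)              # internal GMM state
fgd_model = np.zeros((1, 65), np.float64)

rect = (50, 50, 200, 300)                              # (x, y, w, h) around the object
cv2.grabCut(image, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# Keep pixels labelled definite or probable foreground.
fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)
cv2.imwrite("foreground.png", image * fg[:, :, None])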
Proceedings ArticleDOI

Real-time human pose recognition in parts from single depth images

TL;DR: This work takes an object recognition approach, designing an intermediate body parts representation that maps the difficult pose estimation problem into a simpler per-pixel classification problem, and generates confidence-scored 3D proposals of several body joints by reprojecting the classification result and finding local modes.
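A hedged sketch of the two-stage idea: classify every pixel into a body part, then turn the per-part probability mass into a joint proposal. Here raw depth samples replace the paper's depth-difference features, a probability-weighted 2D centroid replaces mean-shift mode finding on reprojected 3D points, and the data are synthetic placeholders.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pixel_features(depth, y, x, offsets):
    """Depth sampled at fixed offsets around (y, x); a stand-in for depth-difference features."""
    h, w = depth.shape
    ys = np.clip(y + offsets[:, 0], 0, h - 1)
    xs = np.clip(x + offsets[:, 1], 0, w - 1)
    return depth[ys, xs]

rng = np.random.default_rng(0)
offsets = rng.integers(-10, 11, size=(32, 2))

# Illustrative training frame: a depth image with per-pixel body-part labels.
depth = rng.uniform(0.5, 4.0, size=(60, 80))
labels = rng.integers(0, 4, size=(60, 80))                  # 4 body parts

ys, xs = np.meshgrid(np.arange(60), np.arange(80), indexing="ij")
X = np.stack([pixel_features(depth, y, x, offsets)
              for y, x in zip(ys.ravel(), xs.ravel())])
clf = RandomForestClassifier(n_estimators=20, n_jobs=-1).fit(X, labels.ravel())

# Stage 2: propose a joint for one part as the probability-weighted centroid of its pixels.
proba = clf.predict_proba(X)[:, 1].reshape(60, 80)          # probability map for part 1
joint = (np.average(xs, weights=proba + 1e-8),
         np.average(ys, weights=proba + 1e-8))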
Proceedings ArticleDOI

A discriminatively trained, multiscale, deformable part model

TL;DR: A discriminatively trained, multiscale, deformable part model for object detection, which achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge and outperforms the best results in the 2007 challenge in ten out of twenty categories.