Journal ArticleDOI

Automatic and Efficient Human Pose Estimation for Sign Language Videos

TLDR
A fully automatic arm and hand tracker that detects joint positions over continuous sign language video sequences of more than an hour in length; it outperforms the state-of-the-art long-term tracker of Buehler et al. and achieves superior joint localisation to the pose estimation method of Yang and Ramanan.
Abstract
We present a fully automatic arm and hand tracker that detects joint positions over continuous sign language video sequences of more than an hour in length. To achieve this, we make contributions in four areas: (i) we show that the overlaid signer can be separated from the background TV broadcast using co-segmentation over all frames with a layered model; (ii) we show that joint positions (shoulders, elbows, wrists) can be predicted per frame using a random forest regressor given only this segmentation and a colour model; (iii) we show that the random forest can be trained from an existing semi-automatic, but computationally expensive, tracker; and (iv) we introduce an evaluator to assess whether the predicted joint positions are correct for each frame. The method is applied to 20 sign language videos with changing backgrounds, challenging imaging conditions, and different signers. Our framework outperforms the state-of-the-art long-term tracker by Buehler et al. (International Journal of Computer Vision 95:180–197, 2011), does not require the manual annotation of that work, and, after automatic initialisation, performs tracking in real time. We also achieve superior joint localisation results to those obtained using the pose estimation method of Yang and Ramanan (Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011).
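The per-frame regression step (ii) can be illustrated with a short, hedged sketch: a random forest regressor maps a signer segmentation to joint coordinates, trained on frames labelled by the slower semi-automatic tracker (step iii). The feature choice below (a downsampled binary mask), the data, and all shapes are simplifications introduced for this example, not the descriptors used in the paper.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def mask_to_features(mask, size=(32, 32)):
    """Downsample a binary signer segmentation into a fixed-length feature vector."""
    h, w = mask.shape
    ys = np.linspace(0, h - 1, size[0]).astype(int)
    xs = np.linspace(0, w - 1, size[1]).astype(int)
    return mask[np.ix_(ys, xs)].astype(np.float32).ravel()

# Illustrative training data: segmentation masks with joint labels coming from
# the computationally expensive semi-automatic tracker.
rng = np.random.default_rng(0)
masks = rng.integers(0, 2, size=(500, 120, 90))            # 500 training frames
joints = rng.uniform(0.0, 1.0, size=(500, 12))             # 6 joints x (x, y), normalised

X = np.stack([mask_to_features(m) for m in masks])
forest = RandomForestRegressor(n_estimators=50, n_jobs=-1).fit(X, joints)

# At test time each frame needs only a single forest evaluation, which is why
# per-frame prediction can run in real time after automatic initialisation.
test_mask = rng.integers(0, 2, size=(120, 90))
pred = forest.predict(mask_to_features(test_mask)[None])   # shape (1, 12)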



Citations
Proceedings ArticleDOI

Flowing ConvNets for Human Pose Estimation in Videos

TL;DR: This work proposes a ConvNet architecture that benefits from temporal context by combining information across multiple frames using optical flow; it outperforms a number of alternatives, including one that uses optical flow only at the input layers, one that regresses joint coordinates directly, and one that predicts heatmaps without spatial fusion.
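The temporal-fusion idea can be sketched outside the network: warp a neighbouring frame's joint heatmap into the current frame with dense optical flow, then pool the aligned heatmaps. In this hedged example, OpenCV's Farneback flow and a plain average stand in for the flow and the learned spatial fusion used in the paper; the function names are illustrative.

import cv2
import numpy as np

def warp_heatmap(heatmap, flow):
    """Backward-warp an (H, W) heatmap with a dense (H, W, 2) flow field."""
    h, w = heatmap.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return cv2.remap(heatmap.astype(np.float32), map_x, map_y, cv2.INTER_LINEAR)

def fuse_heatmaps(cur_gray, nbr_gray, cur_heatmap, nbr_heatmap):
    # cur_gray / nbr_gray are 8-bit grayscale frames. The dense flow from the
    # current frame to the neighbour is used to pull the neighbour's heatmap
    # back into the current frame's coordinates.
    flow = cv2.calcOpticalFlowFarneback(cur_gray, nbr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    aligned = warp_heatmap(nbr_heatmap, flow)
    return 0.5 * (cur_heatmap + aligned)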

An affine invariant salient region detector

TL;DR: A novel technique for detecting salient regions in an image is described; it is a generalization of the method introduced by Kadir and Brady to affine invariance.
Proceedings ArticleDOI

Neural Sign Language Translation

TL;DR: This work formalizes SLT in the framework of Neural Machine Translation (NMT) for both end-to-end and pretrained settings (using expert knowledge), and jointly learns the spatial representations, the underlying language model, and the mapping between sign and spoken language.
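As a hedged illustration of the NMT framing, the sketch below wires a recurrent encoder over per-frame visual features to a recurrent decoder over spoken-language tokens. The published model additionally uses attention and learned spatial embeddings; the class name, feature dimension, and vocabulary size here are placeholders.

import torch
import torch.nn as nn

class Sign2TextSeq2Seq(nn.Module):
    def __init__(self, feat_dim=1024, vocab_size=3000, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frame_feats, target_tokens):
        # frame_feats: (batch, n_frames, feat_dim); target_tokens: (batch, n_words)
        _, h = self.encoder(frame_feats)                          # summarise the sign video
        dec_out, _ = self.decoder(self.embed(target_tokens), h)   # teacher forcing
        return self.out(dec_out)                                  # per-step logits over spoken words

model = Sign2TextSeq2Seq()
logits = model(torch.randn(2, 40, 1024), torch.randint(0, 3000, (2, 12)))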
Journal ArticleDOI

Monocular human pose estimation: A survey of deep learning-based methods

TL;DR: This survey extensively reviews the recent deep learning-based 2D and 3D human pose estimation methods published since 2014, summarizes the challenges, main frameworks, benchmark datasets, evaluation metrics, and performance comparisons, and discusses some promising future research directions.
Book ChapterDOI

Deep Convolutional Neural Networks for Efficient Pose Estimation in Gesture Videos

TL;DR: This work is, to the authors' knowledge, the first to use ConvNets for estimating human pose in videos; it introduces a new network that exploits temporal information from multiple frames, leading to better performance.
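One simple way to give a single ConvNet temporal context, sketched here under assumptions not taken from the paper, is to stack n consecutive RGB frames along the channel axis before the first convolution. Layer sizes, joint count, and the class name are illustrative.

import torch
import torch.nn as nn

class MultiFramePoseNet(nn.Module):
    def __init__(self, n_frames=3, n_joints=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3 * n_frames, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.regress = nn.Linear(64, 2 * n_joints)   # (x, y) per joint

    def forward(self, clip):
        # clip: (batch, n_frames, 3, H, W) -> stack frames as input channels
        b, t, c, h, w = clip.shape
        x = clip.reshape(b, t * c, h, w)
        return self.regress(self.features(x).flatten(1))

clip = torch.randn(4, 3, 3, 64, 64)      # batch of 4 clips, 3 frames each
coords = MultiFramePoseNet()(clip)       # (4, 14) predicted joint coordinates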
References
Journal ArticleDOI

Random Forests

TL;DR: Internal estimates monitor error, strength, and correlation; these are used to show the response to increasing the number of features used in the forest, and they are also applicable to regression.
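The "internal estimates" are the out-of-bag (OOB) statistics: because each tree is fit on a bootstrap sample, the samples it never saw give an error estimate without a separate validation set. The hedged sketch below uses scikit-learn's random forest regressor on synthetic data to show how the OOB score responds as the number of features considered per split is increased; the data and parameter values are illustrative.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=20, noise=0.5, random_state=0)

for n_features in (2, 5, 10, 20):
    forest = RandomForestRegressor(n_estimators=200, max_features=n_features,
                                   bootstrap=True, oob_score=True, random_state=0)
    forest.fit(X, y)
    # oob_score_ is R^2 on the out-of-bag samples: an internal generalisation estimate.
    print(f"max_features={n_features:2d}  OOB R^2={forest.oob_score_:.3f}")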
Proceedings ArticleDOI

Histograms of oriented gradients for human detection

TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
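As a hedged, self-contained illustration of HOG-based human detection, the sketch below uses OpenCV's built-in HOG descriptor together with its pretrained linear SVM for pedestrians; the image path is a placeholder, and the detector parameters are typical defaults rather than values from the paper.

import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

image = cv2.imread("people.jpg")                       # placeholder input image
rects, weights = hog.detectMultiScale(image, winStride=(8, 8), scale=1.05)

# Draw one bounding box per detected person.
for (x, y, w, h) in rects:
    cv2.rectangle(image, (int(x), int(y)), (int(x + w), int(y + h)), (0, 255, 0), 2)
cv2.imwrite("people_detected.jpg", image)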
Journal ArticleDOI

"GrabCut": interactive foreground extraction using iterated graph cuts

TL;DR: A more powerful, iterative version of the graph-cut optimisation is developed, and the power of the iterative algorithm is used to substantially simplify the user interaction needed for a given quality of result.
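A hedged usage sketch of the iterative graph-cut idea via OpenCV's GrabCut implementation: a single user-supplied rectangle initialises the foreground/background colour models, and a few optimisation iterations refine the segmentation. The image path and rectangle are placeholders.

import cv2
import numpy as np

image = cv2.imread("frame.jpg")                        # placeholder input image
mask = np.zeros(image.shape[:2], np.uint8)
bgd_model = np.zeros((1, 65), np.float64)              # internal GMM state
fgd_model = np.zeros((1, 65), np.float64)

rect = (50, 50, 200, 300)                              # (x, y, w, h) around the object
cv2.grabCut(image, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# Keep pixels labelled definite or probable foreground.
fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)
cv2.imwrite("foreground.png", image * fg[:, :, None])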
Proceedings ArticleDOI

Real-time human pose recognition in parts from single depth images

TL;DR: This work takes an object recognition approach, designing an intermediate body parts representation that maps the difficult pose estimation problem into a simpler per-pixel classification problem, and generates confidence-scored 3D proposals of several body joints by reprojecting the classification result and finding local modes.
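A hedged sketch of the two-stage idea: classify every pixel into a body part, then turn the per-part probability mass into a joint proposal. Here raw depth samples replace the paper's depth-difference features, a probability-weighted 2D centroid replaces mean-shift mode finding on reprojected 3D points, and the data are synthetic placeholders.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pixel_features(depth, y, x, offsets):
    """Depth sampled at fixed offsets around (y, x); a stand-in for depth-difference features."""
    h, w = depth.shape
    ys = np.clip(y + offsets[:, 0], 0, h - 1)
    xs = np.clip(x + offsets[:, 1], 0, w - 1)
    return depth[ys, xs]

rng = np.random.default_rng(0)
offsets = rng.integers(-10, 11, size=(32, 2))

# Illustrative training frame: a depth image with per-pixel body-part labels.
depth = rng.uniform(0.5, 4.0, size=(60, 80))
labels = rng.integers(0, 4, size=(60, 80))                  # 4 body parts

ys, xs = np.meshgrid(np.arange(60), np.arange(80), indexing="ij")
X = np.stack([pixel_features(depth, y, x, offsets)
              for y, x in zip(ys.ravel(), xs.ravel())])
clf = RandomForestClassifier(n_estimators=20, n_jobs=-1).fit(X, labels.ravel())

# Stage 2: propose a joint for one part as the probability-weighted centroid of its pixels.
proba = clf.predict_proba(X)[:, 1].reshape(60, 80)          # probability map for part 1
joint = (np.average(xs, weights=proba + 1e-8),
         np.average(ys, weights=proba + 1e-8))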
Proceedings ArticleDOI

A discriminatively trained, multiscale, deformable part model

TL;DR: A discriminatively trained, multiscale, deformable part model for object detection, which achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge and outperforms the best results in the 2007 challenge in ten out of twenty categories.