
Showing papers on "Pose published in 2014"


Book ChapterDOI
06 Sep 2014
TL;DR: A novel direct tracking method which operates on \(\mathfrak{sim}(3)\), thereby explicitly detecting scale-drift, and an elegant probabilistic solution to incorporate the effect of noisy depth values into tracking are introduced.
Abstract: We propose a direct (feature-less) monocular SLAM algorithm which, in contrast to current state-of-the-art direct methods, allows building large-scale, consistent maps of the environment. Along with highly accurate pose estimation based on direct image alignment, the 3D environment is reconstructed in real time as a pose graph of keyframes with associated semi-dense depth maps. These are obtained by filtering over a large number of pixelwise small-baseline stereo comparisons. The explicitly scale-drift-aware formulation allows the approach to operate on challenging sequences including large variations in scene scale. Major enablers are two key novelties: (1) a novel direct tracking method which operates on \(\mathfrak{sim}(3)\), thereby explicitly detecting scale-drift, and (2) an elegant probabilistic solution to incorporate the effect of noisy depth values into tracking. The resulting direct monocular SLAM system runs in real time on a CPU.

3,273 citations
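To make the direct-alignment idea above concrete, here is a minimal sketch (not the authors' implementation) of the photometric residual such methods minimize: a reference pixel with known inverse depth is warped through a candidate pose and compared against the new image. The intrinsics `K`, a rigid pose `(R, t)` and nearest-neighbour sampling are assumptions made for brevity; LSD-SLAM additionally tracks on sim(3) and down-weights each residual by its depth variance, which is omitted here.

```python
import numpy as np

def photometric_residual(I_ref, I_new, u, v, inv_depth, K, R, t):
    """Photometric residual for one pixel (u, v) of the reference keyframe."""
    K_inv = np.linalg.inv(K)
    p_ref = K_inv @ np.array([u, v, 1.0]) / inv_depth   # back-project to 3D
    p_new = R @ p_ref + t                               # rigid-body transform
    q = K @ p_new
    u2, v2 = q[0] / q[2], q[1] / q[2]                   # project into new image
    u2i, v2i = int(round(u2)), int(round(v2))           # nearest-neighbour lookup
    if not (0 <= u2i < I_new.shape[1] and 0 <= v2i < I_new.shape[0]):
        return None                                     # warped out of view
    return float(I_ref[v, u]) - float(I_new[v2i, u2i])
```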


Proceedings ArticleDOI
23 Jun 2014
TL;DR: Pose estimation is formulated as a DNN-based regression problem towards body joints, and a cascade of such DNN regressors yields high-precision pose estimates.
Abstract: We propose a method for human pose estimation based on Deep Neural Networks (DNNs). The pose estimation is formulated as a DNN-based regression problem towards body joints. We present a cascade of such DNN regressors which results in high precision pose estimates. The approach has the advantage of reasoning about pose in a holistic fashion and has a simple yet powerful formulation which capitalizes on recent advances in Deep Learning. We present a detailed empirical analysis with state-of-the-art or better performance on four academic benchmarks of diverse real-world images.

2,612 citations
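The cascade described above can be sketched as follows. This is a hedged toy version: the paper uses AlexNet-style ConvNets operating on crops around the current joint estimates, while the `Stage` MLPs over a fixed-size image here are placeholders chosen only to keep the sketch self-contained and runnable.

```python
import torch
import torch.nn as nn

class Stage(nn.Module):
    def __init__(self, in_dim, n_joints):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 2 * n_joints),   # (x, y) per joint
        )
    def forward(self, x):
        return self.net(x)

def cascade_forward(stages, image):
    pose = stages[0](image)                 # initial holistic estimate
    for stage in stages[1:]:                # each later stage predicts a correction
        pose = pose + stage(image)
    return pose

stages = nn.ModuleList([Stage(3 * 64 * 64, 14) for _ in range(3)])
pose = cascade_forward(stages, torch.randn(1, 3, 64, 64))  # -> shape (1, 28)
```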


Proceedings ArticleDOI
23 Jun 2014
TL;DR: A novel benchmark "MPII Human Pose" is introduced that makes a significant advance in terms of diversity and difficulty, a contribution that is required for future developments in human body models.
Abstract: Human pose estimation has made significant progress during the last years. However, current datasets are limited in their coverage of the overall pose estimation challenges, yet they still serve as the common sources to evaluate, train and compare different models on. In this paper we introduce a novel benchmark, "MPII Human Pose", that makes a significant advance in terms of diversity and difficulty, a contribution that we feel is required for future developments in human body models. This comprehensive dataset was collected using an established taxonomy of over 800 human activities [1]. The collected images cover a wider variety of human activities than previous datasets, including various recreational, occupational and household activities, and capture people from a wider range of viewpoints. We provide a rich set of labels including positions of body joints, full 3D torso and head orientation, occlusion labels for joints and body parts, and activity labels. For each image we provide adjacent video frames to facilitate the use of motion information. Given these rich annotations we perform a detailed analysis of leading human pose estimation approaches and gain insights into the successes and failures of these methods.

2,372 citations


Journal ArticleDOI
TL;DR: A new dataset, Human3.6M, of 3.6 million accurate 3D human poses, acquired by recording the performance of 5 female and 6 male subjects under 4 different viewpoints, is introduced for training realistic human sensing systems and for evaluating the next generation of human pose estimation models and algorithms.
Abstract: We introduce a new dataset, Human3.6M, of 3.6 million accurate 3D human poses, acquired by recording the performance of 5 female and 6 male subjects, under 4 different viewpoints, for training realistic human sensing systems and for evaluating the next generation of human pose estimation models and algorithms. Besides increasing the size of the datasets in the current state-of-the-art by several orders of magnitude, we also aim to complement such datasets with a diverse set of motions and poses encountered as part of typical human activities (taking photos, talking on the phone, posing, greeting, eating, etc.), with additional synchronized image, human motion capture, and time of flight (depth) data, and with accurate 3D body scans of all the subject actors involved. We also provide controlled mixed reality evaluation scenarios where 3D human models are animated using motion capture and inserted using correct 3D geometry, in complex real environments, viewed with moving cameras, and under occlusion. Finally, we provide a set of large-scale statistical models and detailed evaluation baselines for the dataset illustrating its diversity and the scope for improvement by future work in the research community. Our experiments show that our best large-scale model can leverage our full training set to obtain a 20% improvement in performance compared to a training set of the scale of the largest existing public dataset for this problem. Yet the potential for improvement by leveraging higher-capacity, more complex models with our large dataset is substantially greater and should stimulate future research. The dataset together with code for the associated large-scale learning models, features, visualization tools, as well as the evaluation server, is available online at http://vision.imar.ro/human3.6m .

2,209 citations
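For readers using the dataset, the metric most commonly reported on Human3.6M is the mean per-joint position error (MPJPE). A minimal sketch, assuming root-relative alignment and inputs in millimetres:

```python
import numpy as np

def mpjpe(pred, gt, root=0):
    """pred, gt: (n_frames, n_joints, 3) arrays of 3D joints in millimetres."""
    pred = pred - pred[:, root:root + 1]   # root-relative alignment
    gt = gt - gt[:, root:root + 1]
    return np.linalg.norm(pred - gt, axis=-1).mean()
```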


Journal ArticleDOI
TL;DR: A fiducial marker system particularly suited for camera pose estimation in applications such as augmented reality and robot localization is presented, and an algorithm for generating configurable marker dictionaries following a criterion of maximizing the inter-marker distance and the number of bit transitions is proposed.

1,758 citations
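A toy sketch of the dictionary-generation criterion mentioned in the TL;DR: greedily accept random binary markers whose Hamming distance to every already-accepted marker (over all four rotations) exceeds a threshold, preferring codes with many bit transitions. The marker size, thresholds and rejection strategy below are illustrative assumptions, not ArUco's actual parameters.

```python
import numpy as np

def rotations(m):
    return [np.rot90(m, k) for k in range(4)]

def min_distance(candidate, accepted):
    # smallest Hamming distance over all relative rotations
    return min(int((r != a).sum()) for a in accepted for r in rotations(candidate))

def transitions(m):
    # count 0/1 changes along rows and columns (distinctive texture)
    return int((m[:, 1:] != m[:, :-1]).sum() + (m[1:] != m[:-1]).sum())

def generate(n_bits=6, size=10, tau=8, min_trans=10, rng=np.random.default_rng(0)):
    markers = []
    while len(markers) < size:
        m = rng.integers(0, 2, (n_bits, n_bits))
        if transitions(m) < min_trans:
            continue                       # too few bit transitions
        if not markers or min_distance(m, markers) >= tau:
            markers.append(m)
    return markers
```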


Book ChapterDOI
06 Sep 2014
TL;DR: A novel tasks-constrained deep model with task-wise early stopping is formulated to facilitate learning convergence and drastically reduce model complexity compared to the state-of-the-art method based on a cascaded deep model.
Abstract: Facial landmark detection has long been impeded by the problems of occlusion and pose variation. Instead of treating the detection task as a single and independent problem, we investigate the possibility of improving detection robustness through multi-task learning. Specifically, we wish to optimize facial landmark detection together with heterogeneous but subtly correlated tasks, e.g. head pose estimation and facial attribute inference. This is non-trivial since different tasks have different learning difficulties and convergence rates. To address this problem, we formulate a novel tasks-constrained deep model, with task-wise early stopping to facilitate learning convergence. Extensive evaluations show that the proposed task-constrained learning (i) outperforms existing methods, especially in dealing with faces with severe occlusion and pose variation, and (ii) reduces model complexity drastically compared to the state-of-the-art method based on cascaded deep model [21].

1,457 citations
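Task-wise early stopping, as used above, can be sketched roughly as follows: auxiliary tasks (head pose, attributes) share a trunk with the main landmark task, and an auxiliary task's loss is dropped from the objective once its validation loss stops improving. The network shape, task heads and stopping rule below are simplified assumptions, not the paper's exact criterion.

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, in_dim=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())
        self.heads = nn.ModuleDict({
            "landmarks": nn.Linear(64, 10),   # 5 landmarks * (x, y)
            "pose": nn.Linear(64, 3),         # yaw / pitch / roll
            "smiling": nn.Linear(64, 1),      # binary attribute
        })
    def forward(self, x):
        h = self.trunk(x)
        return {name: head(h) for name, head in self.heads.items()}

active = {"pose": True, "smiling": True}      # auxiliary tasks still training
best_val = {name: float("inf") for name in active}

def total_loss(outputs, targets):
    # the main landmark task always contributes; auxiliary tasks only while active
    loss = nn.functional.mse_loss(outputs["landmarks"], targets["landmarks"])
    if active["pose"]:
        loss = loss + nn.functional.mse_loss(outputs["pose"], targets["pose"])
    if active["smiling"]:
        loss = loss + nn.functional.binary_cross_entropy_with_logits(
            outputs["smiling"], targets["smiling"])
    return loss

def update_early_stopping(name, val_loss, patience_exhausted):
    # freeze an auxiliary task once its validation loss has stopped improving
    if val_loss < best_val[name]:
        best_val[name] = val_loss
    elif patience_exhausted:
        active[name] = False
```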


Posted Content
TL;DR: This paper proposes a new hybrid architecture that consists of a deep Convolutional Network and a Markov Random Field and shows how this architecture is successfully applied to the challenging problem of articulated human pose estimation in monocular images.
Abstract: This paper proposes a new hybrid architecture that consists of a deep Convolutional Network and a Markov Random Field. We show how this architecture is successfully applied to the challenging problem of articulated human pose estimation in monocular images. The architecture can exploit structural domain constraints such as geometric relationships between body joint locations. We show that joint training of these two model paradigms improves performance and allows us to significantly outperform existing state-of-the-art techniques.

1,278 citations


Posted Content
TL;DR: A novel architecture which includes an efficient 'position refinement' model that is trained to estimate the joint offset location within a small region of the image to achieve improved accuracy in human joint location estimation is introduced.
Abstract: Recent state-of-the-art performance on human-body pose estimation has been achieved with Deep Convolutional Networks (ConvNets). Traditional ConvNet architectures include pooling and sub-sampling layers which reduce computational requirements, introduce invariance and prevent over-training. These benefits of pooling come at the cost of reduced localization accuracy. We introduce a novel architecture which includes an efficient 'position refinement' model that is trained to estimate the joint offset location within a small region of the image. This refinement model is jointly trained in cascade with a state-of-the-art ConvNet model to achieve improved accuracy in human joint location estimation. We show that the variance of our detector approaches the variance of human annotations on the FLIC dataset and outperforms all existing approaches on the MPII-human-pose dataset.

877 citations


Journal ArticleDOI
TL;DR: The work on data-driven grasp synthesis and the methodologies for sampling and ranking candidate grasps are reviewed, providing an overview of the different methodologies and drawing a parallel to the classical approaches that rely on analytic formulations.
Abstract: We review the work on data-driven grasp synthesis and the methodologies for sampling and ranking candidate grasps. We divide the approaches into three groups based on whether they synthesize grasps for known, familiar, or unknown objects. This structure allows us to identify common object representations and perceptual processes that facilitate the employed data-driven grasp synthesis technique. In the case of known objects, we concentrate on the approaches that are based on object recognition and pose estimation. In the case of familiar objects, the techniques use some form of a similarity matching to a set of previously encountered objects. Finally, for the approaches dealing with unknown objects, the core part is the extraction of specific features that are indicative of good grasps. Our survey provides an overview of the different methodologies and discusses open problems in the area of robot grasping. We also draw a parallel to the classical approaches that rely on analytic formulations.

859 citations


Proceedings ArticleDOI
24 Mar 2014
TL;DR: The PASCAL3D+ dataset, a novel and challenging benchmark for 3D object detection and pose estimation, is contributed; on average it contains more than 3,000 object instances per category.
Abstract: 3D object detection and pose estimation methods have become popular in recent years since they can handle ambiguities in 2D images and also provide a richer description for objects compared to 2D object detectors. However, most of the datasets for 3D recognition are limited to a small amount of images per category or are captured in controlled environments. In this paper, we contribute PASCAL3D+ dataset, which is a novel and challenging dataset for 3D object detection and pose estimation. PASCAL3D+ augments 12 rigid categories of the PASCAL VOC 2012 [4] with 3D annotations. Furthermore, more images are added for each category from ImageNet [3]. PASCAL3D+ images exhibit much more variability compared to the existing 3D datasets, and on average there are more than 3,000 object instances per category. We believe this dataset will provide a rich testbed to study 3D detection and pose estimation and will help to significantly push forward research in this area. We provide the results of variations of DPM [6] on our new dataset for object detection and viewpoint estimation in different scenarios, which can be used as baselines for the community. Our benchmark is available online at http://cvgl.stanford.edu/projects/pascal3d

853 citations


Book ChapterDOI
06 Sep 2014
TL;DR: This work addresses the problem of estimating the 6D pose of specific objects from a single RGB-D image by presenting a learned, intermediate representation in the form of a dense 3D object coordinate labelling paired with a dense class labelling.
Abstract: This work addresses the problem of estimating the 6D pose of specific objects from a single RGB-D image. We present a flexible approach that can deal with generic objects, both textured and texture-less. The key new concept is a learned, intermediate representation in the form of a dense 3D object coordinate labelling paired with a dense class labelling. We are able to show that for a common dataset with texture-less objects, where template-based techniques are suitable and state of the art, our approach is slightly superior in terms of accuracy. We also demonstrate the benefits of our approach, compared to template-based techniques, in terms of robustness with respect to varying lighting conditions. Towards this end, we contribute a new ground truth dataset with 10k images of 20 objects captured each under three different lighting conditions. We demonstrate that our approach scales well with the number of objects and is capable of running fast.
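One ingredient behind such object-coordinate approaches is worth making concrete: once pixels are labelled with predicted 3D object coordinates, a pose hypothesis from a few correspondences follows from the Kabsch/Procrustes algorithm. A minimal sketch is below; the paper embeds such estimates in an energy-based, RANSAC-style hypothesize-and-verify loop over the dense predictions, which is omitted here.

```python
import numpy as np

def kabsch(obj_pts, cam_pts):
    """Rigid (R, t) minimizing ||R @ obj + t - cam||^2; inputs are (n, 3) arrays."""
    mu_o, mu_c = obj_pts.mean(0), cam_pts.mean(0)
    H = (obj_pts - mu_o).T @ (cam_pts - mu_c)      # cross-covariance of centered sets
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_c - R @ mu_o
    return R, t
```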

Proceedings Article
08 Dec 2014
TL;DR: In this article, a hybrid architecture that consists of a deep Convolutional Network and a Markov Random Field (MRF) was proposed for articulated human pose estimation in monocular images.
Abstract: This paper proposes a new hybrid architecture that consists of a deep Convolutional Network and a Markov Random Field. We show how this architecture is successfully applied to the challenging problem of articulated human pose estimation in monocular images. The architecture can exploit structural domain constraints such as geometric relationships between body joint locations. We show that joint training of these two model paradigms improves performance and allows us to significantly outperform existing state-of-the-art techniques.

Journal ArticleDOI
TL;DR: A simple yet effective and efficient tracking algorithm with an appearance model based on features extracted from a multiscale image feature space with a data-independent basis, which performs favorably against state-of-the-art methods on challenging sequences in terms of efficiency, accuracy and robustness.
Abstract: It is a challenging task to develop effective and efficient appearance models for robust object tracking due to factors such as pose variation, illumination change, occlusion, and motion blur. Existing online tracking algorithms often update models with samples from observations in recent frames. Although much success has been demonstrated, numerous issues remain to be addressed. First, while these adaptive appearance models are data-dependent, there does not exist a sufficient amount of data for online algorithms to learn at the outset. Second, online tracking algorithms often encounter drift problems. As a result of self-taught learning, misaligned samples are likely to be added and degrade the appearance models. In this paper, we propose a simple yet effective and efficient tracking algorithm with an appearance model based on features extracted from a multiscale image feature space with a data-independent basis. The proposed appearance model employs non-adaptive random projections that preserve the structure of the image feature space of objects. A very sparse measurement matrix is constructed to efficiently extract the features for the appearance model. We compress sample images of the foreground target and the background using the same sparse measurement matrix. The tracking task is formulated as a binary classification via a naive Bayes classifier with online update in the compressed domain. A coarse-to-fine search strategy is adopted to further reduce the computational complexity in the detection procedure. The proposed compressive tracking algorithm runs in real-time and performs favorably against state-of-the-art methods on challenging sequences in terms of efficiency, accuracy and robustness.
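The "very sparse measurement matrix" mentioned above can be sketched as follows, assuming the common construction with entries ±√s each with probability 1/(2s) and zero otherwise (the choice s = n/4 here is illustrative); the online naive Bayes classifier that runs in the compressed domain is not shown.

```python
import numpy as np

def sparse_measurement_matrix(n_compressed, n_features, rng=np.random.default_rng(0)):
    """Very sparse random projection: most entries zero, rest +/- sqrt(s)."""
    s = n_features / 4.0
    probs = [1 / (2 * s), 1 - 1 / s, 1 / (2 * s)]
    return rng.choice([np.sqrt(s), 0.0, -np.sqrt(s)],
                      size=(n_compressed, n_features), p=probs)

R = sparse_measurement_matrix(50, 10000)
v = R @ np.ones(10000)   # compress a 10000-dim feature vector to 50 dims
```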

Posted Content
TL;DR: This work specifies a graphical model for human pose which exploits the fact that local image measurements can be used both to detect parts (or joints) and also to predict the spatial relationships between them (Image Dependent Pairwise Relations).
Abstract: We present a method for estimating articulated human pose from a single static image based on a graphical model with novel pairwise relations that make adaptive use of local image measurements. More precisely, we specify a graphical model for human pose which exploits the fact that local image measurements can be used both to detect parts (or joints) and also to predict the spatial relationships between them (Image Dependent Pairwise Relations). These spatial relationships are represented by a mixture model. We use Deep Convolutional Neural Networks (DCNNs) to learn conditional probabilities for the presence of parts and their spatial relationships within image patches. Hence our model combines the representational flexibility of graphical models with the efficiency and statistical power of DCNNs. Our method significantly outperforms the state of the art methods on the LSP and FLIC datasets and also performs very well on the Buffy dataset without any training.

Book ChapterDOI
01 Nov 2014
TL;DR: A deep convolutional neural network for 3D human pose estimation from monocular images is proposed, and it is empirically shown that the network has disentangled the dependencies among different body parts and learned their correlations.
Abstract: In this paper, we propose a deep convolutional neural network for 3D human pose estimation from monocular images. We train the network using two strategies: (1) a multi-task framework that jointly trains pose regression and body part detectors; (2) a pre-training strategy where the pose regressor is initialized using a network trained for body part detection. We compare our network on a large data set and achieve significant improvement over baseline methods. Human pose estimation is a structured prediction problem, i.e., the locations of each body part are highly correlated. Although we do not add constraints about the correlations between body parts to the network, we empirically show that the network has disentangled the dependencies among different body parts, and learned their correlations.

Proceedings ArticleDOI
23 Jun 2014
TL;DR: The Latent Regression Forest, a novel framework for real-time 3D hand pose estimation from a single depth image, is presented and shown to outperform state-of-the-art methods in both accuracy and efficiency.
Abstract: In this paper we present the Latent Regression Forest (LRF), a novel framework for real-time, 3D hand pose estimation from a single depth image. In contrast to prior forest-based methods, which take dense pixels as input, classify them independently and then estimate joint positions afterwards, our method can be considered as a structured coarse-to-fine search, starting from the centre of mass of a point cloud until locating all the skeletal joints. The searching process is guided by a learnt Latent Tree Model which reflects the hierarchical topology of the hand. Our main contributions can be summarised as follows: (i) Learning the topology of the hand in an unsupervised, data-driven manner. (ii) A new forest-based, discriminative framework for structured search in images, as well as an error regression step to avoid error accumulation. (iii) A new multi-view hand pose dataset containing 180K annotated images from 10 different subjects. Our experiments show that the LRF outperforms state-of-the-art methods in both accuracy and efficiency.

01 Dec 2014
TL;DR: This paper presents a new simultaneous localization and mapping (SLAM) system capable of producing high-quality globally consistent surface reconstructions over hundreds of meters in real time with only a low-cost commodity RGB-D sensor and shows that the system performs strongly in terms of trajectory estimation, map quality and computational performance in comparison to other state-of-the-art systems.
Abstract: We present a new simultaneous localization and mapping (SLAM) system capable of producing high-quality globally consistent surface reconstructions over hundreds of meters in real time with only a low-cost commodity RGB-D sensor. By using a fused volumetric surface reconstruction we achieve a much higher quality map over what would be achieved using raw RGB-D point clouds. In this paper we highlight three key techniques associated with applying a volumetric fusion-based mapping system to the SLAM problem in real time. First, the use of a GPU-based 3D cyclical buffer trick to efficiently extend dense every-frame volumetric fusion of depth maps to function over an unbounded spatial region. Second, overcoming camera pose estimation limitations in a wide variety of environments by combining both dense geometric and photometric camera pose constraints. Third, efficiently updating the dense map according to place recognition and subsequent loop closure constraints by the use of an 'as-rigid-as-possible' space deformation. We present results on a wide variety of aspects of the system and show through evaluation on de facto standard RGB-D benchmarks that our system performs strongly in terms of trajectory estimation, map quality and computational performance in comparison to other state-of-the-art systems.

Book ChapterDOI
06 Sep 2014
TL;DR: In this article, the latent class distributions at the leaf nodes are treated as latent variables, and during the inference process they iteratively update these distributions, providing accurate estimation of background clutter and foreground occlusions and thus a better detection rate.
Abstract: In this paper we propose a novel framework, Latent-Class Hough Forests, for 3D object detection and pose estimation in heavily cluttered and occluded scenes. Firstly, we adapt the state-of-the-art template matching feature, LINEMOD [14], into a scale-invariant patch descriptor and integrate it into a regression forest using a novel template-based split function. In training, rather than explicitly collecting representative negative samples, our method is trained on positive samples only and we treat the class distributions at the leaf nodes as latent variables. During the inference process we iteratively update these distributions, providing accurate estimation of background clutter and foreground occlusions and thus a better detection rate. Furthermore, as a by-product, the latent class distributions can provide accurate occlusion aware segmentation masks, even in the multi-instance scenario. In addition to an existing public dataset, which contains only single-instance sequences with large amounts of clutter, we have collected a new, more challenging, dataset for multiple-instance detection containing heavy 2D and 3D clutter as well as foreground occlusions. We evaluate the Latent-Class Hough Forest on both of these datasets where we outperform state-of-the art methods.

Journal ArticleDOI
TL;DR: This paper provides specific and practical approaches to associate uncertainty with 4 × 4 transformation matrices, which are a common representation for pose variables in 3-D space, and shows constraint-sensitive means of perturbing transformation matrices using their associated exponential-map generators.
Abstract: In this paper, we provide specific and practical approaches to associate uncertainty with 4 × 4 transformation matrices, which is a common representation for pose variables in 3-D space. We show constraint-sensitive means of perturbing transformation matrices using their associated exponential-map generators and demonstrate these tools on three simple-yet-important estimation problems: 1) propagating uncertainty through a compound pose change, 2) fusing multiple measurements of a pose (e.g., for use in pose-graph relaxation), and 3) propagating uncertainty on poses (and landmarks) through a nonlinear camera model. The contribution of the paper is the presentation of the theoretical tools, which can be applied in the analysis of many problems involving 3-D pose and point variables.
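A minimal sketch of the first of the three estimation problems, propagating uncertainty through a compound pose change, assuming left perturbations on SE(3) and dropping the higher-order correction terms the paper also derives:

```python
import numpy as np
from scipy.linalg import expm

def hat(xi):
    """se(3) hat operator; xi = [rho (translation), phi (rotation)]."""
    rho, phi = xi[:3], xi[3:]
    Phi = np.array([[0, -phi[2], phi[1]],
                    [phi[2], 0, -phi[0]],
                    [-phi[1], phi[0], 0]])
    H = np.zeros((4, 4))
    H[:3, :3], H[:3, 3] = Phi, rho
    return H

def adjoint(T):
    """6x6 adjoint of a 4x4 transformation matrix, for xi = [rho, phi]."""
    R, t = T[:3, :3], T[:3, 3]
    tx = hat(np.concatenate([np.zeros(3), t]))[:3, :3]   # skew(t)
    Ad = np.zeros((6, 6))
    Ad[:3, :3], Ad[3:, 3:], Ad[:3, 3:] = R, R, tx @ R
    return Ad

def compound_covariance(T1, Sigma1, Sigma2):
    """First-order covariance of T = T1 @ T2 under left perturbations."""
    Ad1 = adjoint(T1)
    return Sigma1 + Ad1 @ Sigma2 @ Ad1.T

# perturbing a mean pose with a sampled xi ~ N(0, Sigma)
T_bar = np.eye(4)
xi = np.array([0.01, 0.0, 0.0, 0.0, 0.0, 0.02])
T = expm(hat(xi)) @ T_bar
```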

Proceedings ArticleDOI
23 Jun 2014
TL;DR: By extracting the non-linear representation from multiple information sources, the deep model outperforms the state of the art by up to 8.6 percent on three public benchmark datasets.
Abstract: Visual appearance score, appearance mixture type and deformation are three important information sources for human pose estimation. This paper proposes to build a multi-source deep model in order to extract non-linear representation from these different aspects of information sources. With the deep model, the global, high-order human body articulation patterns in these information sources are extracted for pose estimation. The task for estimating body locations and the task for human detection are jointly learned using a unified deep model. The proposed approach can be viewed as a post-processing of pose estimation results and can flexibly integrate with existing methods by taking their information sources as input. By extracting the non-linear representation from multiple information sources, the deep model outperforms state-of-the-art by up to 8.6 percent on three public benchmark datasets.

Book ChapterDOI
06 Sep 2014
TL;DR: This paper builds upon the inference machine framework and presents a method for articulated human pose estimation that incorporates rich spatial interactions among multiple parts and information across parts of different scales, outperforming the state of the art on two challenging benchmarks.
Abstract: State-of-the-art approaches for articulated human pose estimation are rooted in parts-based graphical models. These models are often restricted to tree-structured representations and simple parametric potentials in order to enable tractable inference. However, these simple dependencies fail to capture all the interactions between body parts. While models with more complex interactions can be defined, learning the parameters of these models remains challenging with intractable or approximate inference. In this paper, instead of performing inference on a learned graphical model, we build upon the inference machine framework and present a method for articulated human pose estimation. Our approach incorporates rich spatial interactions among multiple parts and information across parts of different scales. Additionally, the modular framework of our approach enables both ease of implementation without specialized optimization solvers, and efficient inference. We analyze our approach on two challenging datasets with large pose variation and outperform the state-of-the-art on these benchmarks.

Journal ArticleDOI
TL;DR: The proposed HD-MSL effectively combines varied features into a unified representation, integrates the labeling information based on a probabilistic framework, and automatically learns a combination coefficient for each view, which plays an important role in utilizing the complementary information of multiview data.
Abstract: How do we find all images in a larger set of images which have a specific content? Or estimate the position of a specific object relative to the camera? Image classification methods, like support vector machine (supervised) and transductive support vector machine (semi-supervised), are invaluable tools for the applications of content-based image retrieval, pose estimation, and optical character recognition. However, these methods can only handle images represented by a single feature. In many cases, different features (or multiview data) can be obtained, and how to efficiently utilize them is a challenge. It is inappropriate for the traditional concatenation scheme to link features of different views into a long vector. The reason is that each view has its own statistical properties and physical interpretation. In this paper, we propose a high-order distance-based multiview stochastic learning (HD-MSL) method for image classification. HD-MSL effectively combines varied features into a unified representation and integrates the labeling information based on a probabilistic framework. In comparison with the existing strategies, our approach adopts the high-order distance obtained from the hypergraph to replace pairwise distance in estimating the probability matrix of data distribution. In addition, the proposed approach can automatically learn a combination coefficient for each view, which plays an important role in utilizing the complementary information of multiview data. An alternating optimization is designed to solve the objective functions of HD-MSL and obtain the view-combination coefficients and classification scores simultaneously. Experiments on two real world datasets demonstrate the effectiveness of HD-MSL in image classification.

Proceedings ArticleDOI
23 Jun 2014
TL;DR: A method is proposed to learn pose-robust features by modeling the complex non-linear transform from non-frontal face images to frontal ones through a deep network in a progressive way, termed stacked progressive auto-encoders (SPAE).
Abstract: Identifying subjects with variations caused by poses is one of the most challenging tasks in face recognition, since the difference in appearances caused by poses may be even larger than the difference due to identity. Inspired by the observation that pose variations change non-linearly but smoothly, we propose to learn pose-robust features by modeling the complex non-linear transform from the non-frontal face images to frontal ones through a deep network in a progressive way, termed as stacked progressive auto-encoders (SPAE). Specifically, each shallow progressive auto-encoder of the stacked network is designed to map the face images at large poses to a virtual view at smaller ones, and meanwhile keep those images already at smaller poses unchanged. Then, stacking multiple such shallow auto-encoders can convert non-frontal face images to frontal ones progressively, which means the pose variations are narrowed down to zero step by step. As a result, the outputs of the topmost hidden layers of the stacked network contain very small pose variations, which can be used as the pose-robust features for face recognition. An additional attraction of the proposed method is that no pose estimation is needed for the test images. The proposed method is evaluated on two datasets with pose variations, i.e., the MultiPIE and FERET datasets, and the experimental results demonstrate the superiority of our method over existing works, especially the 2D ones.

Proceedings ArticleDOI
23 Jun 2014
TL;DR: A novel 3D pictorial structures (3DPS) model is introduced that infers 3D human body configurations from the authors' reduced state space and is generic and applicable to both single and multiple human pose estimation.
Abstract: In this work, we address the problem of 3D pose estimation of multiple humans from multiple views. This is a more challenging problem than single human 3D pose estimation due to the much larger state space, partial occlusions as well as across-view ambiguities when not knowing the identity of the humans in advance. To address these problems, we first create a reduced state space by triangulation of corresponding body joints obtained from part detectors in pairs of camera views. In order to resolve the ambiguities of wrong and mixed body parts of multiple humans after triangulation and also those coming from false positive body part detections, we introduce a novel 3D pictorial structures (3DPS) model. Our model infers 3D human body configurations from our reduced state space. The 3DPS model is generic and applicable to both single and multiple human pose estimation. In order to compare to the state-of-the-art, we first evaluate our method on single human 3D pose estimation on the HumanEva-I [22] and KTH Multiview Football Dataset II [8] datasets. Then, we introduce and evaluate our method on two datasets for multiple human 3D pose estimation.
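The state-space reduction step above rests on standard two-view triangulation. A minimal sketch using the linear (DLT) method, assuming calibrated 3 × 4 projection matrices:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """DLT triangulation; x1, x2: 2D detections (u, v) of one joint in two views."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)      # null-space of A gives the homogeneous point
    X = Vt[-1]
    return X[:3] / X[3]              # homogeneous -> Euclidean 3D point
```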

Proceedings Article
08 Dec 2014
TL;DR: In this paper, a graphical model for human pose estimation from a single static image is proposed, which exploits the fact that local image measurements can be used both to detect parts and also to predict the spatial relationships between them.
Abstract: We present a method for estimating articulated human pose from a single static image based on a graphical model with novel pairwise relations that make adaptive use of local image measurements. More precisely, we specify a graphical model for human pose which exploits the fact that local image measurements can be used both to detect parts (or joints) and also to predict the spatial relationships between them (Image Dependent Pairwise Relations). These spatial relationships are represented by a mixture model. We use Deep Convolutional Neural Networks (DCNNs) to learn conditional probabilities for the presence of parts and their spatial relationships within image patches. Hence our model combines the representational flexibility of graphical models with the efficiency and statistical power of DCNNs. Our method significantly outperforms the state of the art methods on the LSP and FLIC datasets and also performs very well on the Buffy dataset without any training.

Proceedings ArticleDOI
29 Sep 2014
TL;DR: This work presents a novel approach for detecting objects and estimating their 3D pose in single images of cluttered scenes using a deformable parts-based model and demonstrates successful grasps using the detection and pose estimate with a PR2 robot.
Abstract: We present a novel approach for detecting objects and estimating their 3D pose in single images of cluttered scenes. Objects are given in terms of 3D models without accompanying texture cues. A deformable parts-based model is trained on clusters of silhouettes of similar poses and produces hypotheses about possible object locations at test time. Objects are simultaneously segmented and verified inside each hypothesis bounding region by selecting the set of superpixels whose collective shape matches the model silhouette. A final iteration on the 6-DOF object pose minimizes the distance between the selected image contours and the actual projection of the 3D model. We demonstrate successful grasps using our detection and pose estimate with a PR2 robot. Extensive evaluation with a novel ground truth dataset shows the considerable benefit of using shape-driven cues for detecting objects in heavily cluttered scenes.

Journal ArticleDOI
TL;DR: This paper proposes a simple, yet effective method to compensate for large delays in the control loop using an accurate model of the quadrocopter’s flight dynamics, and presents a novel, closed-form method to estimate the scale of a monocular SLAM system from additional metric sensors.
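As a simplified illustration of the scale-estimation problem (not the paper's exact estimator, which also models noise on both measurements), the least-squares scale between paired SLAM and metric displacement vectors has a one-line closed form:

```python
import numpy as np

def estimate_scale(x, y):
    """x, y: (n, 3) arrays of corresponding displacement vectors.

    x comes from monocular SLAM (arbitrary scale), y from a metric sensor;
    returns the lambda minimizing sum ||lambda * x_i - y_i||^2.
    """
    return float((x * y).sum() / (x * x).sum())
```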

Proceedings ArticleDOI
Cha Zhang1, Zhengyou Zhang1
24 Mar 2014
TL;DR: A deep convolutional neural network is built that can simultaneously learn the face/nonface decision, the face pose estimation problem, and the facial landmark localization problem and it is shown that such a multi-task learning scheme can further improve the classifier's accuracy.
Abstract: Multiview face detection is a challenging problem due to dramatic appearance changes under various pose, illumination and expression conditions. In this paper, we present a multi-task deep learning scheme to enhance the detection performance. More specifically, we build a deep convolutional neural network that can simultaneously learn the face/nonface decision, the face pose estimation problem, and the facial landmark localization problem. We show that such a multi-task learning scheme can further improve the classifier's accuracy. On the challenging FDDB data set, our detector achieves over 3% improvement in detection rate at the same false positive rate compared with other state-of-the-art methods.

Proceedings ArticleDOI
23 Jun 2014
TL;DR: In this paper, a linear combination of a sparse set of bases learned from 3D human skeletons is used to estimate the 3D pose by minimizing the 1-norm error between the projection of the 3D pose and the corresponding 2D detection.
Abstract: Human pose estimation is a key step to action recognition. We propose a method of estimating 3D human poses from a single image, which works in conjunction with an existing 2D pose/joint detector. 3D pose estimation is challenging because multiple 3D poses may correspond to the same 2D pose after projection due to the lack of depth information. Moreover, current 2D pose estimators are usually inaccurate, which may cause errors in the 3D estimation. We address the challenges in three ways: (i) We represent a 3D pose as a linear combination of a sparse set of bases learned from 3D human skeletons. (ii) We enforce limb length constraints to eliminate anthropomorphically implausible skeletons. (iii) We estimate a 3D pose by minimizing the 1-norm error between the projection of the 3D pose and the corresponding 2D detection. The 1-norm loss term is robust to inaccurate 2D joint estimations. We use the alternating direction method (ADM) to solve the optimization problem efficiently. Our approach outperforms the state of the art on three benchmark datasets.
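A toy sketch of the sparse-basis formulation in step (i) and the 1-norm objective in step (iii): the random basis, the orthographic camera, and the generic Powell solver below are illustrative assumptions; the paper learns the basis from 3D skeleton data and optimizes with the alternating direction method.

```python
import numpy as np
from scipy.optimize import minimize

n_joints, n_bases = 15, 10
rng = np.random.default_rng(0)
B = rng.standard_normal((3 * n_joints, n_bases))       # stand-in basis of 3D skeletons
x2d = rng.standard_normal((n_joints, 2))               # stand-in 2D joint detections

def objective(w, lam=0.1):
    X = (B @ w).reshape(n_joints, 3)                   # 3D pose as basis combination
    reproj = X[:, :2]                                  # orthographic projection
    return np.abs(reproj - x2d).sum() + lam * np.abs(w).sum()   # L1 data + sparsity

res = minimize(objective, np.zeros(n_bases), method="Powell")   # handles non-smooth loss
X3d = (B @ res.x).reshape(n_joints, 3)
```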

Proceedings Article
01 Jan 2014
TL;DR: In this article, a multi-layer convolutional network architecture and a modified learning technique that learns low-level features and higher-level weak spatial models are used for human pose estimation.
Abstract: This paper introduces a new architecture for human pose estimation using a multi-layer convolutional network architecture and a modified learning technique that learns low-level features and higher-level weak spatial models. Unconstrained human pose estimation is one of the hardest problems in computer vision, and our new architecture and learning schema shows significant improvement over the current state-of-the-art results. The main contribution of this paper is showing, for the first time, that a specific variation of deep learning is able to outperform all existing traditional architectures on this task. The paper also discusses several lessons learned while researching alternatives, most notably, that it is possible to learn strong low-level feature detectors on features that might even just cover a few pixels in the image. Higher-level spatial models improve the overall result somewhat, but to a much lesser extent than expected. Many researchers previously argued that the kinematic structure and top-down information are crucial for this domain, but with our purely bottom-up, weak spatial model, we could outperform other more complicated architectures that currently produce the best results. This mirrors what many other researchers, such as those in speech recognition, object recognition, and other domains, have experienced.