scispace - formally typeset
Search or ask a question
Posted Content

Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields

TL;DR: This work presents an approach to efficiently detect the 2D pose of multiple people in an image using a nonparametric representation, which it refers to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image.
Abstract: We present an approach to efficiently detect the 2D pose of multiple people in an image. The approach uses a nonparametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image. The architecture encodes global context, allowing a greedy bottom-up parsing step that maintains high accuracy while achieving realtime performance, irrespective of the number of people in the image. The architecture is designed to jointly learn part locations and their association via two branches of the same sequential prediction process. Our method placed first in the inaugural COCO 2016 keypoints challenge, and significantly exceeds the previous state-of-the-art result on the MPII Multi-Person benchmark, both in performance and efficiency.
Citations
More filters
Proceedings ArticleDOI
20 Mar 2017
TL;DR: This work presents a conceptually simple, flexible, and general framework for object instance segmentation, which extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition.
Abstract: We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without tricks, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition. Code will be made available.

14,299 citations


Cites background or methods from "Realtime Multi-Person 2D Pose Estim..."

  • ...9 points higher than the COCO 2016 keypoint detection winner [4] that uses a multi-stage processing pipeline (see caption of Table 4)....

    [...]

  • ...CMUPose+++ [4] is the 2016 competition winner that uses multi-scale testing, post-processing with CPM [33], and filtering with an object detector, adding a cumulative ∼5 points (clarified in personal communication)....

    [...]

  • ...APkp AP 50 AP kp 75 AP kp M AP kp L CMU-Pose+++ [4] 61....

    [...]

  • ..., modeling structures [4]) will be complementary to our simple approach, but it is beyond the scope of this paper....

    [...]

Proceedings Article
20 Mar 2017
TL;DR: This work presents a conceptually simple, flexible, and general framework for object instance segmentation that outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners.
Abstract: We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without bells and whistles, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition. Code has been made available at: this https URL

11,343 citations


Cites background or methods from "Realtime Multi-Person 2D Pose Estim..."

  • ...9 points higher than the COCO 2016 keypoint detection winner [6] that uses a multi-stage processing pipeline (see caption of Table 4)....

    [...]

  • ...CMU-Pose+++ [6] is the 2016 competition winner that uses multi-scale testing, post-processing with CPM [44], and filtering with an object detector, adding a cumulative ∼5 points (clarified in personal communication)....

    [...]

  • ...APkp AP 50 AP kp 75 AP kp M AP kp L CMU-Pose+++ [6] 61....

    [...]

  • ..., modeling structures [6]) will be complementary to our simple approach....

    [...]

Journal ArticleDOI
TL;DR: In this article, a review of deep learning-based object detection frameworks is provided, focusing on typical generic object detection architectures along with some modifications and useful tricks to improve detection performance further.
Abstract: Due to object detection’s close relationship with video analysis and image understanding, it has attracted much research attention in recent years. Traditional object detection methods are built on handcrafted features and shallow trainable architectures. Their performance easily stagnates by constructing complex ensembles that combine multiple low-level image features with high-level context from object detectors and scene classifiers. With the rapid development in deep learning, more powerful tools, which are able to learn semantic, high-level, deeper features, are introduced to address the problems existing in traditional architectures. These models behave differently in network architecture, training strategy, and optimization function. In this paper, we provide a review of deep learning-based object detection frameworks. Our review begins with a brief introduction on the history of deep learning and its representative tool, namely, the convolutional neural network. Then, we focus on typical generic object detection architectures along with some modifications and useful tricks to improve detection performance further. As distinct specific detection tasks exhibit different characteristics, we also briefly survey several specific tasks, including salient object detection, face detection, and pedestrian detection. Experimental analyses are also provided to compare various methods and draw some meaningful conclusions. Finally, several promising directions and tasks are provided to serve as guidelines for future work in both object detection and relevant neural network-based learning systems.

3,097 citations

Proceedings ArticleDOI
25 Feb 2019
TL;DR: This paper proposes a network that maintains high-resolution representations through the whole process of human pose estimation and empirically demonstrates the effectiveness of the network through the superior pose estimation results over two benchmark datasets: the COCO keypoint detection dataset and the MPII Human Pose dataset.
Abstract: In this paper, we are interested in the human pose estimation problem with a focus on learning reliable high-resolution representations. Most existing methods recover high-resolution representations from low-resolution representations produced by a high-to-low resolution network. Instead, our proposed network maintains high-resolution representations through the whole process. We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the mutli-resolution subnetworks in parallel. We conduct repeated multi-scale fusions such that each of the high-to-low resolution representations receives information from other parallel representations over and over, leading to rich high-resolution representations. As a result, the predicted keypoint heatmap is potentially more accurate and spatially more precise. We empirically demonstrate the effectiveness of our network through the superior pose estimation results over two benchmark datasets: the COCO keypoint detection dataset and the MPII Human Pose dataset. In addition, we show the superiority of our network in pose tracking on the PoseTrack dataset. The code and models have been publicly available at https://github.com/leoxiaobin/deep-high-resolution-net.pytorch.

2,979 citations


Cites background from "Realtime Multi-Person 2D Pose Estim..."

  • ...This paper is interested in single-person pose estimation, which is the basis of other related problems, such as multiperson pose estimation [6, 26, 32, 38, 46, 55, 40, 45, 17, 69],...

    [...]

Proceedings Article
27 Apr 2018
TL;DR: Wang et al. as discussed by the authors proposed a novel model of dynamic skeletons called Spatial-Temporal Graph Convolutional Networks (ST-GCN), which moves beyond the limitations of previous methods by automatically learning both the spatial and temporal patterns from data.
Abstract: Dynamics of human body skeletons convey significant information for human action recognition. Conventional approaches for modeling skeletons usually rely on hand-crafted parts or traversal rules, thus resulting in limited expressive power and difficulties of generalization. In this work, we propose a novel model of dynamic skeletons called Spatial-Temporal Graph Convolutional Networks (ST-GCN), which moves beyond the limitations of previous methods by automatically learning both the spatial and temporal patterns from data. This formulation not only leads to greater expressive power but also stronger generalization capability. On two large datasets, Kinetics and NTU-RGBD, it achieves substantial improvements over mainstream methods.

2,681 citations

References
More filters
Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions1, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

123,388 citations

Proceedings Article
04 Sep 2014
TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

55,235 citations


"Realtime Multi-Person 2D Pose Estim..." refers methods in this paper

  • ...The image is analyzed by a CNN (initialized by the first 10 layers of VGG-19 [53] and fine-tuned), generating a set of feature maps F that is input to the first stage....

    [...]

  • ...During training, we initialize the 10 layers from the VGG-19 [32]....

    [...]

Proceedings Article
01 Jan 2015
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

49,914 citations

Book ChapterDOI
06 Sep 2014
TL;DR: A new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding by gathering images of complex everyday scenes containing common objects in their natural context.
Abstract: We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 objects types that would be easily recognizable by a 4 year old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.

30,462 citations


"Realtime Multi-Person 2D Pose Estim..." refers background or methods in this paper

  • ...knees, hips, shoulders, elbows, wrists, necks, torsos, and head tops, while COCO [67] also includes some facial keypoints....

    [...]

  • ...The MSCOCO evaluation defines the object keypoint similarity (OKS) and uses the mean average precision (AP) over 10 OKS thresholds as main competition metric [1]....

    [...]

  • ...dataset [66], which consists of 3844 training and 1758 testing groups of multiple interacting individuals in highly articulated poses with 14 body parts; (2) COCO keypoint challenge dataset [67], which requires simultaneously detecting people and localizing 17 keypoints (body parts) in each person (including 12 human body parts and 5 facial keypoints); (3) our foot dataset, which is a subset of 15K annotations out of the COCO keypoint dataset....

    [...]

  • ...Existing human pose datasets ([66], [67]) contain limited body part types....

    [...]

  • ...We report results on the inaugural MSCOCO 2016 challenge where our approach sets the state-of-the-art, and in the MPII multiperson benchmark, where we significantly exceed the previous state-of-the-art result....

    [...]

Journal ArticleDOI

28,888 citations


"Realtime Multi-Person 2D Pose Estim..." refers methods in this paper

  • ...The library also includes 3D keypoint pose detection, by performing 3D triangulation with non-linear Levenberg-Marquardt refinement [64] over the results of multiple synchronized camera views....

    [...]