Showing papers by "Jia Deng published in 2016"

PDF

Open Access

Book Chapter•DOI•

Stacked Hourglass Networks for Human Pose Estimation

[...]

Alejandro Newell¹, Kaiyu Yang¹, Jia Deng¹•Institutions (1)

08 Oct 2016

TL;DR: This work introduces a novel convolutional network architecture for the task of human pose estimation that is described as a “stacked hourglass” network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions.

...read moreread less

Abstract: This work introduces a novel convolutional network architecture for the task of human pose estimation. Features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body. We show how repeated bottom-up, top-down processing used in conjunction with intermediate supervision is critical to improving the performance of the network. We refer to the architecture as a “stacked hourglass” network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions. State-of-the-art results are achieved on the FLIC and MPII benchmarks outcompeting all recent methods.

...read moreread less

3,865 citations

Posted Content•

Stacked Hourglass Networks for Human Pose Estimation

[...]

Alejandro Newell¹, Kaiyu Yang¹, Jia Deng¹•Institutions (1)

University of Michigan¹

22 Mar 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: Stacked hourglass networks as mentioned in this paper were proposed for human pose estimation, where features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body, and repeated bottom-up, top-down processing with intermediate supervision is critical to improving the performance of the network.

...read moreread less

Abstract: This work introduces a novel convolutional network architecture for the task of human pose estimation. Features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body. We show how repeated bottom-up, top-down processing used in conjunction with intermediate supervision is critical to improving the performance of the network. We refer to the architecture as a "stacked hourglass" network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions. State-of-the-art results are achieved on the FLIC and MPII benchmarks outcompeting all recent methods.

...read moreread less

2,369 citations

Posted Content•

Single-Image Depth Perception in the Wild

[...]

Weifeng Chen¹, Zhao Fu¹, Dawei Yang¹, Jia Deng²•Institutions (2)

University of Michigan¹, Carnegie Mellon University²

13 Apr 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: Experiments show that the proposed algorithm, combined with existing RGB-D data and the new relative depth annotations, significantly improves single-image depth perception in the wild.

...read moreread less

Abstract: This paper studies single-image depth perception in the wild, i.e., recovering depth from a single image taken in unconstrained settings. We introduce a new dataset "Depth in the Wild" consisting of images in the wild annotated with relative depth between pairs of random points. We also propose a new algorithm that learns to estimate metric depth using annotations of relative depth. Compared to the state of the art, our algorithm is simpler and performs better. Experiments show that our algorithm, combined with existing RGB-D data and our new relative depth annotations, significantly improves single-image depth perception in the wild.

...read moreread less

338 citations

Posted Content•

Associative Embedding:End-to-End Learning for Joint Detection and Grouping

[...]

Alejandro Newell¹, Zhiao Huang², Jia Deng¹•Institutions (2)

University of Michigan¹, Tsinghua University²

16 Nov 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: Associative embedding is introduced, a novel method for supervising convolutional neural networks for the task of detection and grouping for multi-person pose estimation and state-of-the-art performance on the MPII and MS-COCO datasets is reported.

...read moreread less

Abstract: We introduce associative embedding, a novel method for supervising convolutional neural networks for the task of detection and grouping. A number of computer vision problems can be framed in this manner including multi-person pose estimation, instance segmentation, and multi-object tracking. Usually the grouping of detections is achieved with multi-stage pipelines, instead we propose an approach that teaches a network to simultaneously output detections and group assignments. This technique can be easily integrated into any state-of-the-art network architecture that produces pixel-wise predictions. We show how to apply this method to both multi-person pose estimation and instance segmentation and report state-of-the-art performance for multi-person pose on the MPII and MS-COCO datasets.

...read moreread less

222 citations

Proceedings Article•

Single-Image Depth Perception in the Wild

[...]

Weifeng Chen¹, Zhao Fu¹, Dawei Yang¹, Jia Deng²•Institutions (2)

University of Michigan¹, Carnegie Mellon University²

01 Jan 2016

TL;DR: In this paper, the authors proposed a new dataset Depth in the Wild (DIW) consisting of images in the wild annotated with relative depth between pairs of random points, which significantly improves single-image depth perception.

...read moreread less

Abstract: This paper studies single-image depth perception in the wild, i.e., recovering depth from a single image taken in unconstrained settings. We introduce a new dataset “Depth in the Wild” consisting of images in the wild annotated with relative depth between pairs of random points. We also propose a new algorithm that learns to estimate metric depth using annotations of relative depth. Compared to the state of the art, our algorithm is simpler and performs better. Experiments show that our algorithm, combined with existing RGB-D data and our new relative depth annotations, significantly improves single-image depth perception in the wild.

...read moreread less

168 citations

Book Chapter•DOI•

Structured Matching for Phrase Localization

[...]

Mingzhe Wang¹, Mahmoud Azab¹, Noriyuki Kojima¹, Rada Mihalcea¹, Jia Deng¹ - Show less +1 more•Institutions (1)

University of Michigan¹

08 Oct 2016

TL;DR: A structured matching of phrases and regions that encourages the semantic relations between phrases to agree with the visual relations between regions that is formulated as a discrete optimization problem and relaxed to a linear program.

...read moreread less

Abstract: In this paper we introduce a new approach to phrase localization: grounding phrases in sentences to image regions. We propose a structured matching of phrases and regions that encourages the semantic relations between phrases to agree with the visual relations between regions. We formulate structured matching as a discrete optimization problem and relax it to a linear program. We use neural networks to embed regions and phrases into vectors, which then define the similarities (matching weights) between regions and phrases. We integrate structured matching with neural networks to enable end-to-end training. Experiments on Flickr30K Entities demonstrate the empirical effectiveness of our approach.

...read moreread less

98 citations

Journal Article•DOI•

Leveraging the Wisdom of the Crowd for Fine-Grained Recognition

[...]

Jia Deng¹, Jonathan Krause², Michael Stark, Li Fei-Fei²•Institutions (2)

University of Michigan¹, Stanford University²

01 Apr 2016-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: This work introduces a novel online game called “Bubbles” that reveals discriminative features humans use and proposes the "BubbleBank” representation that uses the human selected bubbles to improve machine recognition performance.

...read moreread less

Abstract: Fine-grained recognition concerns categorization at sub-ordinate levels, where the distinction between object classes is highly local. Compared to basic level recognition, fine-grained categorization can be more challenging as there are in general less data and fewer discriminative features. This necessitates the use of a stronger prior for feature selection. In this work, we include humans in the loop to help computers select discriminative features. We introduce a novel online game called “Bubbles” that reveals discriminative features humans use. The player's goal is to identify the category of a heavily blurred image. During the game, the player can choose to reveal full details of circular regions (“bubbles”), with a certain penalty. With proper setup the game generates discriminative bubbles with assured quality. We next propose the “BubbleBank” representation that uses the human selected bubbles to improve machine recognition performance. Finally, we demonstrate how to extend BubbleBank to a view-invariant 3D representation. Experiments demonstrate that our approach yields large improvements over the previous state of the art on challenging benchmarks.

...read moreread less

53 citations

Journal Article•DOI•

Learning to name objects

[...]

Vicente Ordonez¹, Wei Liu², Jia Deng³, Yejin Choi⁴, Alexander C. Berg², Tamara L. Berg² - Show less +2 more•Institutions (4)

Allen Institute for Artificial Intelligence¹, University of North Carolina at Chapel Hill², University of Michigan³, University of Washington⁴

25 Feb 2016-Communications of The ACM

TL;DR: This paper looks at the problem of predicting category labels that mimic how human observers would name objects, related to the concept of entry-level categories first introduced by psychologists in the 1970s and 1980s.

...read moreread less

Abstract: We have seen remarkable recent progress in computational visual recognition, producing systems that can classify objects into thousands of different categories with increasing accuracy. However, one question that has received relatively less attention is "what labels should recognition systems output?" This paper looks at the problem of predicting category labels that mimic how human observers would name objects. This goal is related to the concept of entry-level categories first introduced by psychologists in the 1970s and 1980s. We extend these seminal ideas to study human naming at large scale and to learn computational models for predicting entry-level categories. Practical applications of this work include improving human-focused computer vision applications such as automatically generating a natural language description for an image or text-based image search.

...read moreread less

9 citations