Illuminating Pedestrians via Simultaneous Detection and Segmentation

doi:10.1109/ICCV.2017.530

Home
/
Papers
/
Illuminating Pedestrians via Simultaneous Detection and Segmentation

Proceedings Article•DOI•

Illuminating Pedestrians via Simultaneous Detection and Segmentation

Garrick Brazil¹, Xi Yin¹, Xiaoming Liu•Institutions (1)

Michigan State University¹

01 Oct 2017-pp 4960-4969

TL;DR: This work proposes a segmentation infusion network to enable joint supervision on semantic segmentation and pedestrian detection, and provides an in-depth analysis to demonstrate how shared layers are shaped by the segmentation supervision.

read less

Abstract: Pedestrian detection is a critical problem in computer vision with significant impact on safety in urban autonomous driving. In this work, we explore how semantic segmentation can be used to boost pedestrian detection accuracy while having little to no impact on network efficiency. We propose a segmentation infusion network to enable joint supervision on semantic segmentation and pedestrian detection. When placed properly, the additional supervision helps guide features in shared layers to become more sophisticated and helpful for the downstream pedestrian detector. Using this approach, we find weakly annotated boxes to be sufficient for considerable performance gains. We provide an in-depth analysis to demonstrate how shared layers are shaped by the segmentation supervision. In doing so, we show that the resulting feature maps become more semantically meaningful and robust to shape and occlusion. Overall, our simultaneous detection and segmentation framework achieves a considerable gain over the state-of-the-art on the Caltech pedestrian dataset, competitive performance on KITTI, and executes 2 × faster than competitive methods.

...read moreread less

Citations

PDF

Open Access

More filters

The PASCAL Visual Object Classes Challenge

[...]

Jianguo Zhang

01 Jan 2006

3,012 citations

Journal Article•DOI•

Recent Advances in Deep Learning for Object Detection

[...]

Xiongwei Wu¹, Doyen Sahoo², Steven C. H. Hoi¹, Steven C. H. Hoi²•Institutions (2)

Singapore Management University¹, Salesforce.com²

05 Jul 2020-Neurocomputing

TL;DR: A comprehensive survey of recent advances in visual object detection with deep learning can be found in this article, where the authors systematically analyze the existing object detection frameworks and organize the survey into three major parts: detection components, learning strategies, and applications and benchmarks.

...read moreread less

420 citations

Proceedings Article•DOI•

M3D-RPN: Monocular 3D Region Proposal Network for Object Detection

[...]

Garrick Brazil¹, Xiaoming Liu¹•Institutions (1)

Michigan State University¹

01 Oct 2019

TL;DR: M3D-RPN is able to significantly improve the performance of both monocular 3D Object Detection and Bird's Eye View tasks within the KITTI urban autonomous driving dataset, while efficiently using a shared multi-class model.

...read moreread less

Abstract: Understanding the world in 3D is a critical component of urban autonomous driving. Generally, the combination of expensive LiDAR sensors and stereo RGB imaging has been paramount for successful 3D object detection algorithms, whereas monocular image-only methods experience drastically reduced performance. We propose to reduce the gap by reformulating the monocular 3D detection problem as a standalone 3D region proposal network. We leverage the geometric relationship of 2D and 3D perspectives, allowing 3D boxes to utilize well-known and powerful convolutional features generated in the image-space. To help address the strenuous 3D parameter estimations, we further design depth-aware convolutional layers which enable location specific feature development and in consequence improved 3D scene understanding. Compared to prior work in monocular 3D detection, our method consists of only the proposed 3D region proposal network rather than relying on external networks, data, or multiple stages. M3D-RPN is able to significantly improve the performance of both monocular 3D Object Detection and Bird's Eye View tasks within the KITTI urban autonomous driving dataset, while efficiently using a shared multi-class model.

...read moreread less

386 citations

Cites methods from "Illuminating Pedestrians via Simult..."

...We therefore ablate using bins of [4, 8, 16, 32] as described in Tab....
[...]

Book Chapter•DOI•

Occlusion-aware R-CNN: Detecting Pedestrians in a Crowd

[...]

Shifeng Zhang¹, Longyin Wen², Xiao Bian², Zhen Lei¹, Stan Z. Li¹ - Show less +1 more•Institutions (2)

Chinese Academy of Sciences¹, General Electric²

08 Sep 2018

TL;DR: In this article, a new occlusion-aware R-CNN (OR-CNN) was proposed to improve the detection accuracy in the crowd by introducing a new aggregation loss to enforce proposals to be close and locate compactly to the corresponding objects.

...read moreread less

Abstract: Pedestrian detection in crowded scenes is a challenging problem since the pedestrians often gather together and occlude each other In this paper, we propose a new occlusion-aware R-CNN (OR-CNN) to improve the detection accuracy in the crowd Specifically, we design a new aggregation loss to enforce proposals to be close and locate compactly to the corresponding objects Meanwhile, we use a new part occlusion-aware region of interest (PORoI) pooling unit to replace the RoI pooling layer in order to integrate the prior structure information of human body with visibility prediction into the network to handle occlusion Our detector is trained in an end-to-end fashion, which achieves state-of-the-art results on three pedestrian detection datasets, ie, CityPersons, ETH, and INRIA, and performs on-pair with the state-of-the-arts on Caltech

...read moreread less

286 citations

Posted Content•

Recent Advances in Deep Learning for Object Detection

[...]

Xiongwei Wu¹, Doyen Sahoo², Steven C. H. Hoi¹, Steven C. H. Hoi²•Institutions (2)

Singapore Management University¹, Salesforce.com²

10 Aug 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: A comprehensive survey of recent advances in visual object detection with deep learning by reviewing a large body of recent related work in literature and covering a variety of factors affecting the detection performance in detail.

...read moreread less

Abstract: Object detection is a fundamental visual recognition problem in computer vision and has been widely studied in the past decades. Visual object detection aims to find objects of certain target classes with precise localization in a given image and assign each object instance a corresponding class label. Due to the tremendous successes of deep learning based image classification, object detection techniques using deep learning have been actively studied in recent years. In this paper, we give a comprehensive survey of recent advances in visual object detection with deep learning. By reviewing a large body of recent related work in literature, we systematically analyze the existing object detection frameworks and organize the survey into three major parts: (i) detection components, (ii) learning strategies, and (iii) applications & benchmarks. In the survey, we cover a variety of factors affecting the detection performance in detail, such as detector architectures, feature learning, proposal generation, sampling strategies, etc. Finally, we discuss several future directions to facilitate and spur future research for visual object detection with deep learning. Keywords: Object Detection, Deep Learning, Deep Convolutional Neural Networks

...read moreread less

244 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52

Collapse

References

PDF

Open Access

More filters

Proceedings Article•DOI•

Deep Residual Learning for Image Recognition

[...]

Kaiming He¹, Xiangyu Zhang¹, Shaoqing Ren¹, Jian Sun¹•Institutions (1)

Microsoft¹

27 Jun 2016

TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.

...read moreread less

Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions1, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

...read moreread less

123,388 citations

Proceedings Article•

Very Deep Convolutional Networks for Large-Scale Image Recognition

[...]

Karen Simonyan¹, Andrew Zisserman¹•Institutions (1)

University of Oxford¹

04 Sep 2014

TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.

...read moreread less

Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

...read moreread less

55,235 citations

"Illuminating Pedestrians via Simult..." refers methods in this paper

...Thus, we choose to construct a separate network using VGG-16....
[...]
...For a modest improvement to efficiency, we remove the pool5 layer from the VGG-16 architecture then adjust the input size to 112 × 112 to keep the fully-connected layers intact....
[...]
...The RPN architecture uses conv1-5 from VGG-16 [24] as the backbone....
[...]
...We initialize conv1-5 from a VGG-16 model pretrained on ImageNet [7], and all remaining layers randomly....
[...]

Proceedings Article•

Very Deep Convolutional Networks for Large-Scale Image Recognition

[...]

Karen Simonyan¹, Andrew Zisserman¹•Institutions (1)

University of Oxford¹

01 Jan 2015

TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.

...read moreread less

49,914 citations

Proceedings Article•DOI•

ImageNet: A large-scale hierarchical image database

[...]

Jia Deng¹, Wei Dong¹, Richard Socher¹, Li-Jia Li¹, Kai Li¹, Li Fei-Fei¹ - Show less +2 more•Institutions (1)

Princeton University¹

20 Jun 2009

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.

...read moreread less

Abstract: The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.

...read moreread less

49,639 citations

"Illuminating Pedestrians via Simult..." refers methods in this paper

...We initialize conv1-5 from a VGG-16 model pretrained on ImageNet [7], and all remaining layers randomly....
[...]

Proceedings Article•DOI•

Going deeper with convolutions

[...]

Christian Szegedy¹, Wei Liu², Yangqing Jia¹, Pierre Sermanet¹, Scott Reed³, Dragomir Anguelov¹, Dumitru Erhan¹, Vincent Vanhoucke¹, Andrew Rabinovich - Show less +5 more•Institutions (3)

Google¹, University of North Carolina at Chapel Hill², University of Michigan³

07 Jun 2015

TL;DR: Inception as mentioned in this paper is a deep convolutional neural network architecture that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).

...read moreread less

Abstract: We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.

...read moreread less

40,257 citations