Posted Content

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

TL;DR: Faster R-CNN as discussed by the authors proposes a Region Proposal Network (RPN) to generate high-quality region proposals, which are used by Fast R-CNN for detection.
Abstract: State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features---using the recently popular terminology of neural networks with 'attention' mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available.
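As context for the abstract's description of the RPN, below is a minimal PyTorch sketch (not the authors' released code) of an RPN head: a small network slid over the shared convolutional feature map, with two sibling 1×1 layers emitting per-anchor objectness scores and box-regression offsets. The channel counts follow the paper's VGG-16 setting and k = 9 anchors; the single-score logistic objectness is the variant the paper mentions as an alternative to its two-way softmax.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Minimal Region Proposal Network head (illustrative sketch).

    Slides a 3x3 conv over the shared backbone feature map, then two
    sibling 1x1 convs predict, for each of `num_anchors` anchors at
    every spatial position, an objectness score and 4 box offsets.
    """
    def __init__(self, in_channels=512, mid_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        # Logistic (one score per anchor) objectness; the paper's default
        # is an equivalent two-class softmax with 2k outputs.
        self.cls_logits = nn.Conv2d(mid_channels, num_anchors, kernel_size=1)
        self.bbox_deltas = nn.Conv2d(mid_channels, num_anchors * 4, kernel_size=1)

    def forward(self, features):
        x = torch.relu(self.conv(features))
        return self.cls_logits(x), self.bbox_deltas(x)

# Example: a VGG-16 conv5 feature map for a ~600x800 image is roughly 512x37x50.
head = RPNHead()
scores, deltas = head(torch.randn(1, 512, 37, 50))
print(scores.shape, deltas.shape)  # (1, 9, 37, 50) and (1, 36, 37, 50)
```

Sharing this head's input features with the detection network is what makes the proposals nearly cost-free at test time.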
Citations
11 Apr 2018
TL;DR: The potential of the You Only Look Once (YOLO) architecture for automatic detection of lymphocytes in gigapixel histopathology whole-slide images stained with immunohistochemistry is boosted by tailoring the YOLO architecture to lymphocyte detection in WSI.
Abstract: Understanding the role of immune cells is at the core of cancer research. In this paper, we boost the potential of the You Only Look Once (YOLO) architecture applied to automatic detection of lymphocytes in gigapixel histopathology whole-slide images (WSI) stained with immunohistochemistry by (1) tailoring the YOLO architecture to lymphocyte detection in WSI; (2) guiding training data sampling by exploiting prior knowledge on hard negative samples; (3) pairing the proposed sampling strategy with the focal loss technique. The combination of the proposed improvements increases the F1-score of YOLO by 3% with a speed-up of 4.3X.
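The focal loss the abstract pairs with hard-negative sampling is a standard, self-contained technique (Lin et al., 2017). Below is a minimal sketch of its binary form, which down-weights easy examples so training concentrates on hard ones; the alpha and gamma defaults are the commonly used values, as the cited work's exact settings are not given in the abstract.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: scales cross-entropy by (1 - p_t)**gamma,
    so well-classified (easy) examples contribute little and training
    focuses on hard samples such as hard negatives."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balance weight
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```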

12 citations

Proceedings ArticleDOI
01 Jan 2018
TL;DR: This work shows that deep learning is indeed a feasible alternative for robust, high-average-precision coconut tree detection in aerial imagery, while paying attention to known issues with the selected architectures.
Abstract: Object detection using a boosted cascade of weak classifiers is a principle that has been used in a variety of applications, ranging from pedestrian detection to fruit counting in orchards, with high average precision. In this work we show that both the boosted cascade approach suggested by Viola & Jones and the adapted approach based on integral or aggregate channels by Dollár yield promising results on coconut tree detection in aerial images. However, with the rise of robust deep learning architectures for both detection and classification, and the significant drop in hardware costs, we ask whether it is feasible to apply deep learning to the task of fast and robust coconut tree detection and classification in aerial imagery. We examine both classification- and detection-based architectures for this task. In doing so we show that deep learning is indeed a feasible alternative for robust coconut tree detection with high average precision in aerial imagery, while paying attention to known issues with the selected architectures.

12 citations

Book ChapterDOI
18 Jun 2018
TL;DR: An end-to-end neural network solution to scene understanding for robot soccer is presented, along with RoboDNN, a C++ neural network library designed for fast inference on the Nao robots.
Abstract: Convolutional neural networks (CNNs) are the state-of-the-art method for most computer vision tasks. However, deploying CNNs on mobile or embedded platforms is challenging because of their excessive computational requirements. We present an end-to-end neural network solution to scene understanding for robot soccer. We compose two key neural networks: one to perform semantic segmentation on an image, and another to propagate class labels between consecutive frames. We trained our networks on synthetic datasets and fine-tuned them on a set consisting of real images from a Nao robot. Furthermore, we investigate and evaluate several practical methods for increasing the efficiency and performance of our networks. Finally, we present RoboDNN, a C++ neural network library designed for fast inference on the Nao robots.

12 citations

Journal ArticleDOI
TL;DR: The idea of generating increasingly diverse indistinguishable samples during training is proposed to improve the detection accuracy of Libra R-CNN, and is verified by experiments.
Abstract: With the development of science and technology, artificial intelligence has been widely used in the transportation field, and research on the symmetry of artificial intelligence has become increasingly in-depth. Traffic sign detection based on deep learning faces the problems of varying target shapes and high variability in the number of targets across labels. To solve these problems from the standpoint of symmetry, the idea of applying the concepts of balanced data and a deformable positioning region to a target recognition network is proposed. The research is based on improving the Libra R-CNN. Because hard-to-distinguish targets have a high impact on detection, the idea of generating increasingly diverse indistinguishable samples during training is proposed to improve detection accuracy, which is verified by experiments. The experiments are carried out on the MS COCO 2017 and traffic sign datasets. The improved Libra R-CNN exceeds the unimproved Libra R-CNN's mean Average Precision (mAP) by 3 percentage points. A large number of comparative experimental results show that the improved network is effective.

12 citations

Journal ArticleDOI
TL;DR: A first attempt to deal with multiple facial organs simultaneously is made by developing an end-to-end hybrid network with context aggregation (named TCMINet) to achieve face parsing for Traditional Chinese Medicine Inspection (TCMI).
Abstract: Facial medical analysis, including the inspection of the face and inner facial components, has always been a primary part of the diagnostic method in Traditional Chinese Medicine (TCM). The existing literature merely focuses on detecting or segmenting single facial organs such as the tongue, eyes, or lips. In this paper, we make the first attempt to deal with multiple organs simultaneously and develop an end-to-end hybrid network with context aggregation (named TCMINet) to achieve face parsing for Traditional Chinese Medicine Inspection (TCMI). Additionally, we construct a new dataset named TCMID to overcome the lack of accurately annotated data. In order to verify the generalization ability of TCMINet, we manually relabel images in two popular face parsing datasets, referred to as LFW-PL* and HELEN*, for testing. The extensive ablation evaluations and experimental comparisons demonstrate that the proposed TCMINet outperforms state-of-the-art methods under various evaluation metrics. It runs at 267 ms per face (512×512 image) on an Nvidia Titan Xp GPU, making it feasible to integrate into engineering solutions.

12 citations

References
Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to the ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
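The core idea is that each block learns a residual F(x) with respect to its input and adds it back, y = F(x) + x. A minimal PyTorch sketch of the basic two-layer block follows (identity-shortcut case only; the projection shortcuts used when dimensions change are omitted):

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Basic two-layer residual block: output = F(x) + x.

    The block learns the residual F(x) = H(x) - x rather than the
    unreferenced mapping H(x), which eases optimization of very deep nets.
    """
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)  # identity shortcut: gradients flow through the addition
```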

123,388 citations

Proceedings Article
03 Dec 2012
TL;DR: A large, deep convolutional neural network, consisting of five convolutional layers (some followed by max-pooling layers) and three fully-connected layers with a final 1000-way softmax, achieved state-of-the-art performance on ImageNet classification, as discussed by the authors.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
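Two of the training choices named in the abstract, non-saturating (ReLU) neurons and dropout in the fully-connected layers, are easy to illustrate. A sketch of an AlexNet-style classifier tail is below; the 256×6×6 input size follows the common torchvision reproduction rather than any detail stated in the abstract.

```python
import torch.nn as nn

# AlexNet-style classifier tail: dropout regularizes the two large
# fully-connected layers, reducing the overfitting described above.
classifier = nn.Sequential(
    nn.Dropout(p=0.5),
    nn.Linear(256 * 6 * 6, 4096),
    nn.ReLU(inplace=True),   # non-saturating activation speeds up training
    nn.Dropout(p=0.5),
    nn.Linear(4096, 4096),
    nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),   # final 1000-way classifier; softmax is applied in the loss
)
```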

73,978 citations

Proceedings Article
01 Jan 2015
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
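The efficiency argument for small filters is concrete: two stacked 3×3 convolutions on C channels cover a 5×5 receptive field with 2·9·C² = 18C² weights versus 25C² for a single 5×5 layer, while adding an extra nonlinearity. A minimal sketch of such a block follows; the layer counts 2, 2, 3, 3, 3 reproduce VGG-16's convolutional trunk.

```python
import torch.nn as nn

def vgg_block(in_ch, out_ch, num_convs):
    """A VGG-style block: `num_convs` stacked 3x3 convs with ReLU,
    followed by a 2x2 max-pool that halves the spatial resolution."""
    layers = []
    for i in range(num_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2, 2))
    return nn.Sequential(*layers)

# VGG-16's 13 conv layers as five blocks (plus three FC layers = 16 weight layers):
trunk = nn.Sequential(
    vgg_block(3, 64, 2), vgg_block(64, 128, 2), vgg_block(128, 256, 3),
    vgg_block(256, 512, 3), vgg_block(512, 512, 3),
)
```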

49,914 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: Inception as mentioned in this paper is a deep convolutional neural network architecture that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Abstract: We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.
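To make the multi-scale idea concrete, here is a minimal sketch of the dimension-reduced Inception module: four parallel branches at different filter sizes whose outputs are concatenated along the channel axis, with 1×1 convolutions keeping the computational budget in check. The branch widths in the usage line are those of GoogLeNet's inception (3a) stage; the class itself is illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Dimension-reduced Inception module: 1x1, 1x1->3x3, 1x1->5x5,
    and pool->1x1 branches capture multi-scale structure; their
    outputs are concatenated along the channel dimension."""
    def __init__(self, in_ch, c1, c3r, c3, c5r, c5, cp):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, c1, 1)
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, c3r, 1), nn.ReLU(True),
                                nn.Conv2d(c3r, c3, 3, padding=1))
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c5r, 1), nn.ReLU(True),
                                nn.Conv2d(c5r, c5, 5, padding=2))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, cp, 1))

    def forward(self, x):
        return torch.cat([torch.relu(b(x)) for b in (self.b1, self.b2, self.b3, self.b4)], dim=1)

# GoogLeNet's inception (3a): 192 input channels -> 64 + 128 + 32 + 32 = 256 output channels.
m = InceptionModule(192, 64, 96, 128, 16, 32, 32)
out = m(torch.randn(1, 192, 28, 28))
print(out.shape)  # (1, 256, 28, 28)
```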

40,257 citations

Journal ArticleDOI
TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) as mentioned in this paper is a benchmark in object category classification and detection on hundreds of object categories and millions of images, which has been run annually from 2010 to present, attracting participation from more than fifty institutions.
Abstract: The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the 5 years of the challenge, and propose future directions and improvements.

30,811 citations