Posted Content

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

TL;DR: Faster R-CNN as discussed by the authors proposes a Region Proposal Network (RPN) to generate high-quality region proposals, which are used by Fast R-CNN for detection.
Abstract: State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features---using the recently popular terminology of neural networks with 'attention' mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available.
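As context for the abstract's description of the RPN, below is a minimal PyTorch sketch (not the authors' released code) of an RPN head: a small network slid over the shared convolutional feature map, with two sibling 1×1 layers emitting per-anchor objectness scores and box-regression offsets. The channel counts follow the paper's VGG-16 setting and k = 9 anchors; the single-score logistic objectness is the variant the paper mentions as an alternative to its two-way softmax.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Minimal Region Proposal Network head (illustrative sketch).

    Slides a 3x3 conv over the shared backbone feature map, then two
    sibling 1x1 convs predict, for each of `num_anchors` anchors at
    every spatial position, an objectness score and 4 box offsets.
    """
    def __init__(self, in_channels=512, mid_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        # Logistic (one score per anchor) objectness; the paper's default
        # is an equivalent two-class softmax with 2k outputs.
        self.cls_logits = nn.Conv2d(mid_channels, num_anchors, kernel_size=1)
        self.bbox_deltas = nn.Conv2d(mid_channels, num_anchors * 4, kernel_size=1)

    def forward(self, features):
        x = torch.relu(self.conv(features))
        return self.cls_logits(x), self.bbox_deltas(x)

# Example: a VGG-16 conv5 feature map for a ~600x800 image is roughly 512x37x50.
head = RPNHead()
scores, deltas = head(torch.randn(1, 512, 37, 50))
print(scores.shape, deltas.shape)  # (1, 9, 37, 50) and (1, 36, 37, 50)
```

Sharing this head's input features with the detection network is what makes the proposals nearly cost-free at test time.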
Citations
11 Apr 2018
TL;DR: The potential of the You Only Look Once (YOLO) architecture for automatic detection of lymphocytes in gigapixel histopathology whole-slide images stained with immunohistochemistry is boosted by tailoring the YOLO architecture to lymphocyte detection in WSI.
Abstract: Understanding the role of immune cells is at the core of cancer research. In this paper, we boost the potential of the You Only Look Once (YOLO) architecture applied to automatic detection of lymphocytes in gigapixel histopathology whole-slide images (WSI) stained with immunohistochemistry by (1) tailoring the YOLO architecture to lymphocyte detection in WSI; (2) guiding training data sampling by exploiting prior knowledge on hard negative samples; (3) pairing the proposed sampling strategy with the focal loss technique. The combination of the proposed improvements increases the F1-score of YOLO by 3% with a speed-up of 4.3X.
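The focal loss the abstract pairs with hard-negative sampling is a standard, self-contained technique (Lin et al., 2017). Below is a minimal sketch of its binary form, which down-weights easy examples so training concentrates on hard ones; the alpha and gamma defaults are the commonly used values, as the cited work's exact settings are not given in the abstract.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: scales cross-entropy by (1 - p_t)**gamma,
    so well-classified (easy) examples contribute little and training
    focuses on hard samples such as hard negatives."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balance weight
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```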

12 citations

Proceedings ArticleDOI
01 Jan 2018
TL;DR: This work shows that deep learning is indeed a feasible alternative for robust, high-average-precision coconut tree detection in aerial imagery, while paying attention to known issues with the selected architectures.
Abstract: Object detection using a boosted cascade of weak classifiers is a principle that has been used in a variety of applications, ranging from pedestrian detection to fruit counting in orchards, with high average precision. In this work we show that both the boosted cascade approach suggested by Viola & Jones and the adapted approach based on integral or aggregate channels by Dollár yield promising results on coconut tree detection in aerial images. However, with the rise of robust deep learning architectures for both detection and classification, and the significant drop in hardware costs, we ask whether it is feasible to apply deep learning to the task of fast and robust coconut tree detection and classification in aerial imagery. We examine both classification- and detection-based architectures for this task. In doing so we show that deep learning is indeed a feasible alternative for robust coconut tree detection with high average precision in aerial imagery, while paying attention to known issues with the selected architectures.

12 citations

Book ChapterDOI
18 Jun 2018
TL;DR: An end-to-end neural network solution to scene understanding for robot soccer is presented, along with RoboDNN, a C++ neural network library designed for fast inference on the Nao robots.
Abstract: Convolutional neural networks (CNNs) are the state-of-the-art method for most computer vision tasks. However, deploying CNNs on mobile or embedded platforms is challenging because of their excessive computational requirements. We present an end-to-end neural network solution to scene understanding for robot soccer. We compose two key neural networks: one to perform semantic segmentation on an image, and another to propagate class labels between consecutive frames. We trained our networks on synthetic datasets and fine-tuned them on a set consisting of real images from a Nao robot. Furthermore, we investigate and evaluate several practical methods for increasing the efficiency and performance of our networks. Finally, we present RoboDNN, a C++ neural network library designed for fast inference on the Nao robots.

12 citations

Journal ArticleDOI
TL;DR: The idea of generating increasingly diverse indistinguishable samples during training is proposed to improve the detection accuracy of Libra R-CNN, and is verified by experiments.
Abstract: With the development of science and technology, artificial intelligence has been widely used in the transportation field, and research on the symmetry of artificial intelligence has become increasingly in-depth. Traffic sign detection based on deep learning faces the problems of varying target shapes and high variability in the number of targets across labels. To solve these problems from the standpoint of symmetry, the idea of applying the concepts of balanced data and a deformable positioning region to a target recognition network is proposed. The research is based on improving the Libra R-CNN. Because hard-to-distinguish targets have a high impact on detection, the idea of generating increasingly diverse indistinguishable samples during training is proposed to improve detection accuracy, which is verified by experiments. The experiments are carried out on the MS COCO 2017 and traffic sign datasets. The improved Libra R-CNN exceeds the unimproved Libra R-CNN's mean Average Precision (mAP) by 3 percentage points. A large number of comparative experimental results show that the improved network is effective.

12 citations

Journal ArticleDOI
TL;DR: A first attempt to deal with multiple facial organs simultaneously is made by developing an end-to-end hybrid network with context aggregation (named TCMINet) to achieve face parsing for Traditional Chinese Medicine Inspection (TCMI).
Abstract: Facial medical analysis, including the inspection of the face and inner facial components, has always been a primary part of the diagnostic method in Traditional Chinese Medicine (TCM). The existing literature merely focuses on detecting or segmenting single facial organs such as the tongue, eyes, or lips. In this paper, we make the first attempt to deal with multiple organs simultaneously and develop an end-to-end hybrid network with context aggregation (named TCMINet) to achieve face parsing for Traditional Chinese Medicine Inspection (TCMI). Additionally, we construct a new dataset named TCMID to overcome the lack of accurately annotated data. In order to verify the generalization ability of TCMINet, we manually relabel images in two popular face parsing datasets, referred to as LFW-PL* and HELEN*, for testing. The extensive ablation evaluations and experimental comparisons demonstrate that the proposed TCMINet outperforms state-of-the-art methods under various evaluation metrics. It runs at 267 ms per face (512×512 image) on an Nvidia Titan Xp GPU, making it feasible to integrate into engineering solutions.

12 citations

References
Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to the ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
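The core idea is that each block learns a residual F(x) with respect to its input and adds it back, y = F(x) + x. A minimal PyTorch sketch of the basic two-layer block follows (identity-shortcut case only; the projection shortcuts used when dimensions change are omitted):

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Basic two-layer residual block: output = F(x) + x.

    The block learns the residual F(x) = H(x) - x rather than the
    unreferenced mapping H(x), which eases optimization of very deep nets.
    """
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)  # identity shortcut: gradients flow through the addition
```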

123,388 citations

Proceedings Article
03 Dec 2012
TL;DR: A large, deep convolutional neural network, consisting of five convolutional layers (some followed by max-pooling layers) and three fully-connected layers with a final 1000-way softmax, achieved state-of-the-art performance on ImageNet classification, as discussed by the authors.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
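Two of the training choices named in the abstract, non-saturating (ReLU) neurons and dropout in the fully-connected layers, are easy to illustrate. A sketch of an AlexNet-style classifier tail is below; the 256×6×6 input size follows the common torchvision reproduction rather than any detail stated in the abstract.

```python
import torch.nn as nn

# AlexNet-style classifier tail: dropout regularizes the two large
# fully-connected layers, reducing the overfitting described above.
classifier = nn.Sequential(
    nn.Dropout(p=0.5),
    nn.Linear(256 * 6 * 6, 4096),
    nn.ReLU(inplace=True),   # non-saturating activation speeds up training
    nn.Dropout(p=0.5),
    nn.Linear(4096, 4096),
    nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),   # final 1000-way classifier; softmax is applied in the loss
)
```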

73,978 citations

Proceedings Article
01 Jan 2015
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
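The efficiency argument for small filters is concrete: two stacked 3×3 convolutions on C channels cover a 5×5 receptive field with 2·9·C² = 18C² weights versus 25C² for a single 5×5 layer, while adding an extra nonlinearity. A minimal sketch of such a block follows; the layer counts 2, 2, 3, 3, 3 reproduce VGG-16's convolutional trunk.

```python
import torch.nn as nn

def vgg_block(in_ch, out_ch, num_convs):
    """A VGG-style block: `num_convs` stacked 3x3 convs with ReLU,
    followed by a 2x2 max-pool that halves the spatial resolution."""
    layers = []
    for i in range(num_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2, 2))
    return nn.Sequential(*layers)

# VGG-16's 13 conv layers as five blocks (plus three FC layers = 16 weight layers):
trunk = nn.Sequential(
    vgg_block(3, 64, 2), vgg_block(64, 128, 2), vgg_block(128, 256, 3),
    vgg_block(256, 512, 3), vgg_block(512, 512, 3),
)
```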

49,914 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: Inception as mentioned in this paper is a deep convolutional neural network architecture that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Abstract: We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.
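To make the multi-scale idea concrete, here is a minimal sketch of the dimension-reduced Inception module: four parallel branches at different filter sizes whose outputs are concatenated along the channel axis, with 1×1 convolutions keeping the computational budget in check. The branch widths in the usage line are those of GoogLeNet's inception (3a) stage; the class itself is illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Dimension-reduced Inception module: 1x1, 1x1->3x3, 1x1->5x5,
    and pool->1x1 branches capture multi-scale structure; their
    outputs are concatenated along the channel dimension."""
    def __init__(self, in_ch, c1, c3r, c3, c5r, c5, cp):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, c1, 1)
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, c3r, 1), nn.ReLU(True),
                                nn.Conv2d(c3r, c3, 3, padding=1))
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c5r, 1), nn.ReLU(True),
                                nn.Conv2d(c5r, c5, 5, padding=2))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, cp, 1))

    def forward(self, x):
        return torch.cat([torch.relu(b(x)) for b in (self.b1, self.b2, self.b3, self.b4)], dim=1)

# GoogLeNet's inception (3a): 192 input channels -> 64 + 128 + 32 + 32 = 256 output channels.
m = InceptionModule(192, 64, 96, 128, 16, 32, 32)
out = m(torch.randn(1, 192, 28, 28))
print(out.shape)  # (1, 256, 28, 28)
```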

40,257 citations

Journal ArticleDOI
TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) as mentioned in this paper is a benchmark in object category classification and detection on hundreds of object categories and millions of images, which has been run annually from 2010 to present, attracting participation from more than fifty institutions.
Abstract: The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the 5 years of the challenge, and propose future directions and improvements.

30,811 citations