Proceedings Article

Faster R-CNN: towards real-time object detection with region proposal networks

07 Dec 2015 - Vol. 28, pp. 91-99
TL;DR: Ren et al. propose a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals.
Abstract: State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [7] and Fast R-CNN [5] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model [19], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. Code is available at https://github.com/ShaoqingRen/faster_rcnn.
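To make the RPN described above concrete, here is a minimal PyTorch sketch (not the authors' released code) of an RPN head: a 3x3 convolution slid over the shared feature map, with two sibling 1x1 layers predicting, at each position, an objectness score and four box-regression offsets for each of k anchors. The layer widths and the single-logit objectness head (the paper uses a two-way softmax per anchor) are illustrative choices.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Sketch of a Region Proposal Network head: a small network slid
    over the shared convolutional feature map, predicting k anchors'
    objectness scores and box offsets at every spatial position."""

    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        # 3x3 "sliding window" over the shared feature map.
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        # Two sibling 1x1 layers: objectness and box regression.
        self.cls_logits = nn.Conv2d(512, num_anchors, kernel_size=1)
        self.bbox_deltas = nn.Conv2d(512, num_anchors * 4, kernel_size=1)

    def forward(self, features):
        x = torch.relu(self.conv(features))
        return self.cls_logits(x), self.bbox_deltas(x)

# Example: a VGG-16 conv5-like feature map (stride 16) for a ~600x800 image.
feat = torch.randn(1, 512, 37, 50)
scores, deltas = RPNHead()(feat)
print(scores.shape, deltas.shape)  # (1, 9, 37, 50) and (1, 36, 37, 50)
```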


Citations
Proceedings ArticleDOI
Tooba Shams, Pascal Desbarats
09 Nov 2020
TL;DR: This work compares two state-of-the-art neural networks (YOLO and Mask-RCNN) for detecting Asian hornet nests in drone imagery, using a method based on the complementary advantages of visible-spectrum and FLIR images.
Abstract: Asian hornets are considered a pest because of their dangerousness and their impact on the ecosystem. Detecting nests of this species is a difficult task, as they are found in trees, hidden in the leaves. Our goal is to carry out this detection from images acquired by a drone. We propose a new method based on the advantages of visible-spectrum and FLIR images, and compare two state-of-the-art neural networks (YOLO and Mask-RCNN) for this task. Results are presented for the two separate image sets, and then for the combined network responses. To this end, a third dataset (for the ensemble model) was built by simulating a FLIR acquisition simultaneous with the acquisition in the visible spectrum. Preliminary results show that the best strategy is to use Mask-RCNN on the ensemble model (detection rate of 93%). We also discuss the relevant information present in the images and how the networks take this information into account.

2 citations

Book ChapterDOI
07 Apr 2021
TL;DR: In this paper, the authors extract eye-region features to detect eye state using light-weight convolutional neural networks with two stages, eye detection and classification; the method suits simple drowsiness warning systems and performs well on an Intel Core i7-4770 CPU @ 3.40 GHz and on a quad-core ARM Cortex-A57 CPU (Jetson Nano device), at 19.04 FPS and 17.20 FPS, respectively.
Abstract: The eye is a very important organ in the human body, and the eye area and eyes contain a great deal of useful information about human interaction with the environment. Many studies have relied on eye-region analysis to build medical care, surveillance, interaction, security, and warning systems. This paper focuses on extracting eye-region features to detect eye state using light-weight convolutional neural networks with two stages: eye detection and classification. The method can be applied to a simple drowsiness warning system and performs well on an Intel Core i7-4770 CPU @ 3.40 GHz (Personal Computer - PC) and on a quad-core ARM Cortex-A57 CPU (Jetson Nano device), at 19.04 FPS and 17.20 FPS (frames per second), respectively.
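As a hedged illustration of the two-stage pipeline described above (the paper's actual networks are not specified here, so the classifier design, crop size, and box format below are all assumptions), a PyTorch sketch in which a detector supplies eye boxes and a tiny CNN classifies each crop as open or closed:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEyeStateNet(nn.Module):
    """Illustrative light-weight classifier for 24x24 grayscale eye crops."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Linear(16 * 6 * 6, 2)  # two states: closed / open

    def forward(self, x):  # x: (N, 1, 24, 24)
        return self.head(self.features(x).flatten(1))

def classify_eyes(frame, boxes, model):
    """Stage 2: crop each eye box found by stage 1 and classify its state."""
    crops = [F.interpolate(frame[:, :, y1:y2, x1:x2], size=(24, 24))
             for (x1, y1, x2, y2) in boxes]
    return model(torch.cat(crops)).argmax(dim=1)  # 0 = closed, 1 = open

frame = torch.rand(1, 1, 480, 640)                    # grayscale video frame
boxes = [(100, 200, 148, 236), (300, 200, 348, 236)]  # from an eye detector
print(classify_eyes(frame, boxes, TinyEyeStateNet().eval()))
```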

2 citations

Proceedings ArticleDOI
01 Oct 2017
TL;DR: A visual attention inspired computational model is proposed to address the problem of panoramic photo recognition; it mimics human perceptual and cognitive mechanisms through a focus model and a scale model, and its effectiveness in measuring scenery quality is verified.
Abstract: Travel photos record tourists' experiences and attentions when visiting a place. We ask whether they embed any untapped indices, subconsciously created by the tourists, for measuring scenery quality. By analyzing thousands of such photos, and inspired by the psychological theory of "broaden-and-build", our study reveals a strong inclination toward taking panoramic photos at highly rated outdoor tourist spots. This preference can thus serve as a supplementary measure for indexing scenery quality. However, the task of recognizing panoramic photos is nontrivial. In this paper, we propose a visual attention inspired computational model to address this issue, which mimics human perceptual and cognitive mechanisms through a focus model and a scale model. Experiments on a newly created dataset demonstrate the strong performance of our proposal; its effectiveness in measuring scenery quality is also verified on 10 highly rated outdoor spots and 2 lower-rated ones from across the world.

2 citations

Proceedings ArticleDOI
01 Jan 2019
TL;DR: A deep neural network called Attentive Bilinear Convolutional Neural Networks (AB-CNN) is proposed that learns appropriate representations for metadata verification, i.e., verifying the authenticity of the metadata associated with an image, using a deep representation learning approach.
Abstract: Verifying the authenticity of a given image is an emerging topic in media forensics research. Many current works focus on content manipulation detection, which aims to detect possible alteration of the image content. However, tampering might occur not only in the image content itself, but also in the metadata associated with the image, such as the timestamp, geo-tag, and captions. We address metadata verification, aiming to verify the authenticity of the metadata associated with the image, using a deep representation learning approach. We propose a deep neural network called Attentive Bilinear Convolutional Neural Networks (AB-CNN) that learns an appropriate representation for metadata verification. AB-CNN addresses several common challenges in verifying a specific type of metadata, the event (i.e., time and place): lack of training data, fine-grained differences between distinct events, and diverse visual content within the same event. Experimental results on three different datasets show that the proposed model provides a substantial improvement over the baseline method.
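The paper's AB-CNN is not reproduced here, but the "bilinear" part of the name refers to a standard operation worth sketching: the outer product of two feature maps, pooled over spatial locations, which captures fine-grained pairwise feature interactions. A minimal PyTorch version follows; the attention component is omitted, and the signed-sqrt/L2 normalization follows common bilinear-CNN practice rather than this particular paper.

```python
import torch
import torch.nn.functional as F

def bilinear_pool(feat_a, feat_b):
    """Bilinear pooling: the outer product of two feature maps, averaged
    over spatial locations, then signed-sqrt and L2 normalized."""
    n, c1, h, w = feat_a.shape
    c2 = feat_b.shape[1]
    a = feat_a.reshape(n, c1, h * w)
    b = feat_b.reshape(n, c2, h * w)
    x = torch.bmm(a, b.transpose(1, 2)) / (h * w)  # (n, c1, c2) interactions
    x = x.flatten(1)
    x = torch.sign(x) * torch.sqrt(torch.abs(x) + 1e-8)
    return F.normalize(x, dim=1)

# Two feature maps (e.g., from two CNN streams) over the same image grid.
fa, fb = torch.randn(2, 64, 7, 7), torch.randn(2, 32, 7, 7)
print(bilinear_pool(fa, fb).shape)  # torch.Size([2, 2048])
```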

2 citations

Posted Content
Abstract: Objective: To assess the ability of imaging-based deep learning to predict radiographic patellofemoral osteoarthritis (PFOA) from knee lateral view radiographs. Design: Knee lateral view radiographs were extracted from The Multicenter Osteoarthritis Study (MOST) (n = 18,436 knees). The patellar region-of-interest (ROI) was first automatically detected using a deep-learning-based object detection method, and subsequently, end-to-end deep convolutional neural networks (CNNs) were trained and validated to detect the status of patellofemoral OA. Manual PFOA status assessment provided in the MOST dataset was used as the classification outcome for the CNNs. Performance of the prediction models was assessed using the area under the receiver operating characteristic curve (ROC AUC) and the average precision (AP) obtained from the precision-recall (PR) curve in a stratified 5-fold cross-validation setting. Results: Of the 18,436 knees, 3,425 (19%) had PFOA. AUC and AP for the reference model including age, sex, body mass index (BMI), the total Western Ontario and McMaster Universities Arthritis Index (WOMAC) score, and tibiofemoral Kellgren-Lawrence (KL) grade to predict PFOA were 0.806 and 0.478, respectively. The CNN model that used only image data significantly improved the prediction of PFOA status (ROC AUC = 0.958, AP = 0.862). Conclusion: We present the first machine-learning-based automatic PFOA detection method. Furthermore, our deep-learning-based model, trained on the patellar region of knee lateral view radiographs, predicts PFOA better than models based on patient characteristics and clinical assessments.
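The evaluation protocol described above (ROC AUC and average precision from stratified 5-fold cross-validation) is easy to sketch with scikit-learn. The data and the logistic-regression stand-in below are synthetic placeholders, not the MOST cohort or the paper's models:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-ins for the five reference covariates
# (age, sex, BMI, WOMAC, KL grade) and the binary PFOA label.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

aucs, aps = [], []
for train_idx, test_idx in StratifiedKFold(
        n_splits=5, shuffle=True, random_state=0).split(X, y):
    clf = LogisticRegression().fit(X[train_idx], y[train_idx])
    p = clf.predict_proba(X[test_idx])[:, 1]
    aucs.append(roc_auc_score(y[test_idx], p))           # ROC AUC per fold
    aps.append(average_precision_score(y[test_idx], p))  # AP from PR curve

print(f"mean ROC AUC: {np.mean(aucs):.3f}, mean AP: {np.mean(aps):.3f}")
```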

2 citations

References
Proceedings Article
03 Dec 2012
TL;DR: The authors trained a large, deep convolutional neural network, consisting of five convolutional layers (some followed by max-pooling layers) and three fully-connected layers with a final 1000-way softmax, that achieved state-of-the-art performance on ImageNet.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
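As a rough PyTorch sketch of the architecture enumerated above, assuming the channel widths reported in the paper and omitting details such as local response normalization and the original two-GPU split:

```python
import torch
import torch.nn as nn

# Five conv layers, max-pooling after some of them, dropout in the
# fully-connected layers, and a final 1000-way classifier (~60M parameters).
alexnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),  # 1000-way softmax is applied inside the loss
)

# 227x227 crop: the size at which the published strides work out exactly.
print(alexnet(torch.randn(1, 3, 227, 227)).shape)  # torch.Size([1, 1000])
```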

73,978 citations

Proceedings Article
01 Jan 2015
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
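A minimal PyTorch sketch of the design principle described above: depth is built from stacks of very small 3x3 convolutions, each stack followed by 2x2 max-pooling. The VGG-16-style trunk below (13 conv layers; the 3 fully-connected layers that bring it to 16 weight layers are omitted) is illustrative, not the released model:

```python
import torch
import torch.nn as nn

def vgg_block(in_ch, out_ch, num_convs):
    """A stack of 3x3 convolutions followed by 2x2 max-pooling; stacking
    small filters is how depth is pushed to 16-19 weight layers."""
    layers = []
    for i in range(num_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, padding=1),
                   nn.ReLU()]
    layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

# VGG-16 convolutional trunk: 2 + 2 + 3 + 3 + 3 = 13 conv layers.
trunk = nn.Sequential(
    vgg_block(3, 64, 2), vgg_block(64, 128, 2), vgg_block(128, 256, 3),
    vgg_block(256, 512, 3), vgg_block(512, 512, 3),
)
print(trunk(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 512, 7, 7])
```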

49,914 citations

Journal ArticleDOI
TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark in object category classification and detection spanning hundreds of object categories and millions of images; it has been run annually from 2010 to the present, attracting participation from more than fifty institutions.
Abstract: The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the 5 years of the challenge, and propose future directions and improvements.

30,811 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: The key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning.
Abstract: Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet [20], the VGG net [31], and GoogLeNet [32]) into fully convolutional networks and transfer their learned representations by fine-tuning [3] to the segmentation task. We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes less than one fifth of a second for a typical image.
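A minimal sketch (in PyTorch, not the authors' code) of the core "fully convolutional" move: the classifier's fully-connected layers become 1x1 convolutions, so the network accepts inputs of arbitrary size and emits a coarse spatial map of class scores that is then upsampled to input resolution. The paper learns the upsampling as deconvolution and adds skip connections; the fixed bilinear upsampling here is a simplification:

```python
import torch
import torch.nn as nn

num_classes = 21  # e.g., PASCAL VOC: 20 object classes + background

# A classifier's FC layer re-cast as a 1x1 convolution over conv features.
score_head = nn.Conv2d(512, num_classes, kernel_size=1)
# Fixed bilinear upsampling back to input resolution (stride-32 features);
# the paper instead learns this step as a deconvolution layer.
upsample = nn.Upsample(scale_factor=32, mode="bilinear", align_corners=False)

feat = torch.randn(1, 512, 10, 12)   # coarse features for a 320x384 input
scores = upsample(score_head(feat))  # dense per-pixel class scores
print(scores.shape)                  # torch.Size([1, 21, 320, 384])
```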

28,225 citations

Proceedings ArticleDOI
23 Jun 2014
TL;DR: R-CNN combines CNNs with bottom-up region proposals to localize and segment objects; when labeled training data is scarce, supervised pre-training for an auxiliary task followed by domain-specific fine-tuning yields a significant performance boost.
Abstract: Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also present experiments that provide insight into what the network learns, revealing a rich hierarchy of image features. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.
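A minimal PyTorch sketch, not the authors' pipeline, of R-CNN's per-region step: each proposal is warped to a fixed size and pushed through the CNN independently. It is exactly this per-region cost that proposal-sharing approaches like SPPnet, Fast R-CNN, and the Faster R-CNN paper above amortize. The 224x224 warp size and the toy `cnn` in the example are assumptions.

```python
import torch
import torch.nn.functional as F

def rcnn_features(image, proposals, cnn):
    """Warp each region proposal to a fixed size and run the CNN on each
    crop independently (one forward pass per proposal)."""
    crops = [F.interpolate(image[:, :, y1:y2, x1:x2], size=(224, 224),
                           mode="bilinear", align_corners=False)
             for (x1, y1, x2, y2) in proposals]
    return cnn(torch.cat(crops))

# Toy CNN and two hand-picked boxes, just to show the shapes involved.
cnn = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 7, stride=4),
                          torch.nn.ReLU(), torch.nn.Flatten())
img = torch.rand(1, 3, 480, 640)
print(rcnn_features(img, [(0, 0, 100, 100), (50, 80, 200, 300)], cnn).shape)
```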

21,729 citations