scispace - formally typeset
Search or ask a question
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

01 Jan 2015-
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
Citations
More filters
Journal ArticleDOI
23 Jan 2018
TL;DR: This comprehensive review summarizes state of the art, challenges, and prospects of the neuro-inspired computing with emerging nonvolatile memory devices and presents a device-circuit-algorithm codesign methodology to evaluate the impact of nonideal device effects on the system-level performance.
Abstract: This comprehensive review summarizes state of the art, challenges, and prospects of the neuro-inspired computing with emerging nonvolatile memory devices. First, we discuss the demand for developing neuro-inspired architecture beyond today’s von-Neumann architecture. Second, we summarize the various approaches to designing the neuromorphic hardware (digital versus analog, spiking versus nonspiking, online training versus offline training) and discuss why emerging nonvolatile memory is attractive for implementing the synapses in the neural network. Then, we discuss the desired device characteristics of the synaptic devices (e.g., multilevel states, weight update nonlinearity/asymmetry, variation/noise), and survey a few representative material systems and device prototypes reported in the literature that show the analog conductance tuning. These candidates include phase change memory, resistive memory, ferroelectric memory, floating-gate transistors, etc. Next, we introduce the crossbar array architecture to accelerate the weighted sum and weight update operations that are commonly used in the neuro-inspired machine learning algorithms, and review the recent progresses of array-level experimental demonstrations for pattern recognition tasks. In addition, we discuss the peripheral neuron circuit design issues and present a device-circuit-algorithm codesign methodology to evaluate the impact of nonideal device effects on the system-level performance (e.g., learning accuracy). Finally, we give an outlook on the customization of the learning algorithms for efficient hardware implementation.

730 citations

Journal ArticleDOI
03 Aug 2016-Sensors
TL;DR: A novel multi-modal Faster R-CNN model, which achieves state-of-the-art results compared to prior work with the F1 score, which takes into account both precision and recall performances improving from 0.807 to 0.838 for the detection of sweet pepper.
Abstract: This paper presents a novel approach to fruit detection using deep convolutional neural networks. The aim is to build an accurate, fast and reliable fruit detection system, which is a vital element of an autonomous agricultural robotic platform; it is a key element for fruit yield estimation and automated harvesting. Recent work in deep neural networks has led to the development of a state-of-the-art object detector termed Faster Region-based CNN (Faster R-CNN). We adapt this model, through transfer learning, for the task of fruit detection using imagery obtained from two modalities: colour (RGB) and Near-Infrared (NIR). Early and late fusion methods are explored for combining the multi-modal (RGB and NIR) information. This leads to a novel multi-modal Faster R-CNN model, which achieves state-of-the-art results compared to prior work with the F1 score, which takes into account both precision and recall performances improving from 0 . 807 to 0 . 838 for the detection of sweet pepper. In addition to improved accuracy, this approach is also much quicker to deploy for new fruits, as it requires bounding box annotation rather than pixel-level annotation (annotating bounding boxes is approximately an order of magnitude quicker to perform). The model is retrained to perform the detection of seven fruits, with the entire process taking four hours to annotate and train the new model per fruit.

729 citations

Book ChapterDOI
08 Oct 2016
TL;DR: This work proposes a novel approach for instance-level image retrieval that produces a global and compact fixed-length representation for each image by aggregating many region-wise descriptors by leveraging a ranking framework and projection weights to build the region features.
Abstract: We propose a novel approach for instance-level image retrieval. It produces a global and compact fixed-length representation for each image by aggregating many region-wise descriptors. In contrast to previous works employing pre-trained deep networks as a black box to produce features, our method leverages a deep architecture trained for the specific task of image retrieval. Our contribution is twofold: (i) we leverage a ranking framework to learn convolution and projection weights that are used to build the region features; and (ii) we employ a region proposal network to learn which regions should be pooled to form the final global descriptor. We show that using clean training data is key to the success of our approach. To that aim, we use a large scale but noisy landmark dataset and develop an automatic cleaning approach. The proposed architecture produces a global image representation in a single forward pass. Our approach significantly outperforms previous approaches based on global descriptors on standard datasets. It even surpasses most prior works based on costly local descriptor indexing and spatial verification. Additional material is available at www.xrce.xerox.com/Deep-Image-Retrieval.

727 citations

Journal ArticleDOI
TL;DR: This paper shows conversion of popular CNN architectures, including VGG-16 and Inception-v3, into SNNs that produce the best results reported to date on MNIST, CIFAR-10 and the challenging ImageNet dataset.
Abstract: Spiking neural networks (SNNs) can potentially offer an efficient way of doing inference because the neurons in the networks are sparsely activated and computations are event-driven. Previous work showed that simple continuous-valued deep Convolutional Neural Networks (CNNs) can be converted into accurate spiking equivalents. These networks did not include certain common operations such as max-pooling, softmax, batch-normalization and Inception-modules. This paper presents spiking equivalents of these operations therefore allowing conversion of nearly arbitrary CNN architectures. We show conversion of popular CNN architectures, including VGG-16 and Inception-v3, into SNNs that produce the best results reported to date on MNIST, CIFAR-10 and the challenging ImageNet dataset. SNNs can trade off classification error rate against the number of available operations whereas deep continuous-valued neural networks require a fixed number of operations to achieve their classification error rate. From the examples of LeNet for MNIST and BinaryNet for CIFAR-10, we show that with an increase in error rate of a few percentage points, the SNNs can achieve more than 2x reductions in operations compared to the original CNNs. This highlights the potential of SNNs in particular when deployed on power-efficient neuromorphic spiking neuron chips, for use in embedded applications.

725 citations

Journal ArticleDOI
TL;DR: A deep-learning enhanced algorithm for the automated detection of DR, achieves significantly better performance than a previously reported, otherwise essentially identical, algorithm that does not employ deep learning.
Abstract: Purpose To compare performance of a deep-learning enhanced algorithm for automated detection of diabetic retinopathy (DR), to the previously published performance of that algorithm, the Iowa Detection Program (IDP)-without deep learning components-on the same publicly available set of fundus images and previously reported consensus reference standard set, by three US Board certified retinal specialists. Methods We used the previously reported consensus reference standard of referable DR (rDR), defined as International Clinical Classification of Diabetic Retinopathy moderate, severe nonproliferative (NPDR), proliferative DR, and/or macular edema (ME). Neither Messidor-2 images, nor the three retinal specialists setting the Messidor-2 reference standard were used for training IDx-DR version X2.1. Sensitivity, specificity, negative predictive value, area under the curve (AUC), and their confidence intervals (CIs) were calculated. Results Sensitivity was 96.8% (95% CI: 93.3%-98.8%), specificity was 87.0% (95% CI: 84.2%-89.4%), with 6/874 false negatives, resulting in a negative predictive value of 99.0% (95% CI: 97.8%-99.6%). No cases of severe NPDR, PDR, or ME were missed. The AUC was 0.980 (95% CI: 0.968-0.992). Sensitivity was not statistically different from published IDP sensitivity, which had a CI of 94.4% to 99.3%, but specificity was significantly better than the published IDP specificity CI of 55.7% to 63.0%. Conclusions A deep-learning enhanced algorithm for the automated detection of DR, achieves significantly better performance than a previously reported, otherwise essentially identical, algorithm that does not employ deep learning. Deep learning enhanced algorithms have the potential to improve the efficiency of DR screening, and thereby to prevent visual loss and blindness from this devastating disease.

725 citations

References
More filters
Book ChapterDOI

[...]

01 Jan 2012

139,059 citations

Proceedings Article
03 Dec 2012
TL;DR: The state-of-the-art performance of CNNs was achieved by Deep Convolutional Neural Networks (DCNNs) as discussed by the authors, which consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overriding in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

73,978 citations

Proceedings ArticleDOI
Jia Deng1, Wei Dong1, Richard Socher1, Li-Jia Li1, Kai Li1, Li Fei-Fei1 
20 Jun 2009
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Abstract: The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.

49,639 citations

Journal ArticleDOI

40,330 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: Inception as mentioned in this paper is a deep convolutional neural network architecture that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Abstract: We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.

40,257 citations