scispace - formally typeset
Search or ask a question
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

01 Jan 2015-
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
Citations
More filters
Journal ArticleDOI
TL;DR: This paper provides a deep learning-based strategy for reconstruction of CS-MRI, and bridges a substantial gap between conventional non-learning methods working only on data from a single image, and prior knowledge from large training data sets.
Abstract: Compressed sensing magnetic resonance imaging (CS-MRI) enables fast acquisition, which is highly desirable for numerous clinical applications. This can not only reduce the scanning cost and ease patient burden, but also potentially reduce motion artefacts and the effect of contrast washout, thus yielding better image quality. Different from parallel imaging-based fast MRI, which utilizes multiple coils to simultaneously receive MR signals, CS-MRI breaks the Nyquist–Shannon sampling barrier to reconstruct MRI images with much less required raw data. This paper provides a deep learning-based strategy for reconstruction of CS-MRI, and bridges a substantial gap between conventional non-learning methods working only on data from a single image, and prior knowledge from large training data sets. In particular, a novel conditional Generative Adversarial Networks-based model (DAGAN)-based model is proposed to reconstruct CS-MRI. In our DAGAN architecture, we have designed a refinement learning method to stabilize our U-Net based generator, which provides an end-to-end network to reduce aliasing artefacts. To better preserve texture and edges in the reconstruction, we have coupled the adversarial loss with an innovative content loss. In addition, we incorporate frequency-domain information to enforce similarity in both the image and frequency domains. We have performed comprehensive comparison studies with both conventional CS-MRI reconstruction methods and newly investigated deep learning approaches. Compared with these methods, our DAGAN method provides superior reconstruction with preserved perceptual image details. Furthermore, each image is reconstructed in about 5 ms, which is suitable for real-time processing.

835 citations

Journal ArticleDOI
04 Sep 2017-Sensors
TL;DR: A deep-learning-based approach to detect diseases and pests in tomato plants using images captured in-place by camera devices with various resolutions, and combines each of these meta-architectures with “deep feature extractors” such as VGG net and Residual Network.
Abstract: Plant Diseases and Pests are a major challenge in the agriculture sector. An accurate and a faster detection of diseases and pests in plants could help to develop an early treatment technique while substantially reducing economic losses. Recent developments in Deep Neural Networks have allowed researchers to drastically improve the accuracy of object detection and recognition systems. In this paper, we present a deep-learning-based approach to detect diseases and pests in tomato plants using images captured in-place by camera devices with various resolutions. Our goal is to find the more suitable deep-learning architecture for our task. Therefore, we consider three main families of detectors: Faster Region-based Convolutional Neural Network (Faster R-CNN), Region-based Fully Convolutional Network (R-FCN), and Single Shot Multibox Detector (SSD), which for the purpose of this work are called "deep learning meta-architectures". We combine each of these meta-architectures with "deep feature extractors" such as VGG net and Residual Network (ResNet). We demonstrate the performance of deep meta-architectures and feature extractors, and additionally propose a method for local and global class annotation and data augmentation to increase the accuracy and reduce the number of false positives during training. We train and test our systems end-to-end on our large Tomato Diseases and Pests Dataset, which contains challenging images with diseases and pests, including several inter- and extra-class variations, such as infection status and location in the plant. Experimental results show that our proposed system can effectively recognize nine different types of diseases and pests, with the ability to deal with complex scenarios from a plant's surrounding area.

832 citations

Posted Content
TL;DR: Different novel methods based on deep learning for brain abnormality detection, recognition, and segmentation for analyzing medical images using deep learning algorithm are explored.
Abstract: This report describes my research activities in the Hasso Plattner Institute and summarizes my Ph.D. plan and several novels, end-to-end trainable approaches for analyzing medical images using deep learning algorithm. In this report, as an example, we explore different novel methods based on deep learning for brain abnormality detection, recognition, and segmentation. This report prepared for the doctoral consortium in the AIME-2017 conference.

830 citations

Posted Content
TL;DR: In this article, a fully convolutional architecture, encompassing residual learning, is proposed to model the ambiguous mapping between monocular images and depth maps, which can be trained end-to-end and does not rely on post-processing techniques such as CRFs or other additional refinement steps.
Abstract: This paper addresses the problem of estimating the depth map of a scene given a single RGB image. We propose a fully convolutional architecture, encompassing residual learning, to model the ambiguous mapping between monocular images and depth maps. In order to improve the output resolution, we present a novel way to efficiently learn feature map up-sampling within the network. For optimization, we introduce the reverse Huber loss that is particularly suited for the task at hand and driven by the value distributions commonly present in depth maps. Our model is composed of a single architecture that is trained end-to-end and does not rely on post-processing techniques, such as CRFs or other additional refinement steps. As a result, it runs in real-time on images or videos. In the evaluation, we show that the proposed model contains fewer parameters and requires fewer training data than the current state of the art, while outperforming all approaches on depth estimation. Code and models are publicly available.

827 citations

Proceedings ArticleDOI
Qifeng Chen1, Vladlen Koltun1
28 Jul 2017
TL;DR: In this article, a single feed-forward network with appropriate structure is trained end-to-end with a direct regression objective to synthesize photographic images conditioned on semantic layouts, which can produce images with photographic appearance that conforms to the input layout.
Abstract: We present an approach to synthesizing photographic images conditioned on semantic layouts. Given a semantic label map, our approach produces an image with photographic appearance that conforms to the input layout. The approach thus functions as a rendering engine that takes a two-dimensional semantic specification of the scene and produces a corresponding photographic image. Unlike recent and contemporaneous work, our approach does not rely on adversarial training. We show that photographic images can be synthesized from semantic layouts by a single feedforward network with appropriate structure, trained end-to-end with a direct regression objective. The presented approach scales seamlessly to high resolutions; we demonstrate this by synthesizing photographic images at 2-megapixel resolution, the full resolution of our training data. Extensive perceptual experiments on datasets of outdoor and indoor scenes demonstrate that images synthesized by the presented approach are considerably more realistic than alternative approaches.

825 citations

References
More filters
Book ChapterDOI

[...]

01 Jan 2012

139,059 citations

Proceedings Article
03 Dec 2012
TL;DR: The state-of-the-art performance of CNNs was achieved by Deep Convolutional Neural Networks (DCNNs) as discussed by the authors, which consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overriding in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

73,978 citations

Proceedings ArticleDOI
Jia Deng1, Wei Dong1, Richard Socher1, Li-Jia Li1, Kai Li1, Li Fei-Fei1 
20 Jun 2009
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Abstract: The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.

49,639 citations

Journal ArticleDOI

40,330 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: Inception as mentioned in this paper is a deep convolutional neural network architecture that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Abstract: We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.

40,257 citations