Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

04 Sep 2014
TL;DR: This work investigates the effect of convolutional network depth on accuracy in the large-scale image recognition setting, using an architecture with very small (3x3) convolution filters, and shows that pushing the depth to 16-19 weight layers yields a significant improvement over prior-art configurations.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
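As a concrete illustration of the design the abstract describes, here is a minimal PyTorch sketch of a VGG-style block built from stacked 3x3 convolutions; the channel widths and the two-conv stack are illustrative assumptions, not the exact VGG-16/19 configuration.

```python
import torch
import torch.nn as nn

def vgg_block(in_channels, out_channels, num_convs):
    """Stack of 3x3 convolutions (stride 1, padding 1) followed by 2x2 max-pooling,
    the basic building pattern described in the abstract."""
    layers = []
    for _ in range(num_convs):
        layers.append(nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU(inplace=True))
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# Two stacked 3x3 convolutions cover the same receptive field as one 5x5 convolution,
# but with fewer parameters and an extra non-linearity.
block = vgg_block(3, 64, num_convs=2)
x = torch.randn(1, 3, 224, 224)
print(block(x).shape)  # torch.Size([1, 64, 112, 112])
```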
Citations
Journal ArticleDOI
Lei Bi, Jinman Kim, Euijoon Ahn, Ashnil Kumar, Michael J. Fulham, Dagan Feng
TL;DR: This work proposes to leverage fully convolutional networks (FCNs) to automatically segment the skin lesions and introduces a new parallel integration method to combine the complementary information derived from individual segmentation stages to achieve a final segmentation result that has accurate localization and well-defined lesion boundaries, even for the most challenging skin lesions.
Abstract: Objective: Segmentation of skin lesions is an important step in the automated computer aided diagnosis of melanoma. However, existing segmentation methods have a tendency to over- or under-segment the lesions and perform poorly when the lesions have fuzzy boundaries, low contrast with the background, inhomogeneous textures, or contain artifacts. Furthermore, the performance of these methods is heavily reliant on the appropriate tuning of a large number of parameters as well as the use of effective preprocessing techniques, such as illumination correction and hair removal. Methods: We propose to leverage fully convolutional networks (FCNs) to automatically segment the skin lesions. FCNs are a neural network architecture that achieves object detection by hierarchically combining low-level appearance information with high-level semantic information. We address the issue of FCNs producing coarse segmentation boundaries for challenging skin lesions (e.g., those with fuzzy boundaries and/or low difference in the textures between the foreground and the background) through a multistage segmentation approach in which multiple FCNs learn complementary visual characteristics of different skin lesions; early-stage FCNs learn coarse appearance and localization information while late-stage FCNs learn the subtle characteristics of the lesion boundaries. We also introduce a new parallel integration method to combine the complementary information derived from individual segmentation stages to achieve a final segmentation result that has accurate localization and well-defined lesion boundaries, even for the most challenging skin lesions. Results: We achieved an average Dice coefficient of 91.18% on the ISBI 2016 Skin Lesion Challenge dataset and 90.66% on the PH2 dataset. Conclusion and Significance: Our extensive experimental results on two well-established public benchmark datasets demonstrate that our method is more effective than other state-of-the-art methods for skin lesion segmentation.
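The abstract above only names the multi-stage design and the parallel integration step; as a rough, hedged sketch of the data flow (simple averaging stands in for the paper's actual integration method, and all tensor names are hypothetical):

```python
import torch

def fuse_stage_outputs(stage_probs):
    """Illustrative fusion of per-stage lesion probability maps of shape (N, 1, H, W).
    The paper's parallel integration method is more involved; simple averaging is
    shown here only to make the multi-stage data flow concrete."""
    return torch.stack(stage_probs, dim=0).mean(dim=0)

# Hypothetical outputs of an early-stage and a late-stage FCN on one image.
early = torch.rand(1, 1, 256, 256)   # coarse localisation
late = torch.rand(1, 1, 256, 256)    # boundary-refined prediction
fused = fuse_stage_outputs([early, late])
mask = (fused > 0.5).float()          # final binary lesion mask
```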

264 citations

Journal ArticleDOI
Manhua Liu, Fan Li, Hao Yan, Kundong Wang, Yixin Ma, Li Shen, Mingqing Xu
TL;DR: A multi-model deep learning framework based on convolutional neural networks for joint automatic hippocampal segmentation and AD classification using structural MRI data is proposed; it outperforms single-model methods and several other competing methods.

263 citations


Cites methods from "Very Deep Convolutional Networks fo..."

  • ...We could see that our DenseNet outperformed LeNet and VGGNet....

  • ...As for the classification, Table 10 shows that our proposed DenseNet model performs better than the other popular deep networks such as the LeNet and VGGNet....

  • ...The first experiment was to test the classification performances of DenseNet when compared to other deep networks such as the popular LeNet (LeCun et al., 1998) and VGGNet (Simonyan and Zisserman, 2014)....

  • ...Recently, deep learning networks, including convolutional neural networks (CNNs), have been widely used in image classification and computer vision (Chen et al., 2016; Liu et al., 2014, 2017, 2018b; Ng et al., 2015; Simonyan and Zisserman, 2014; Wang et al., 2017)....

  • ...The LeNet network consists of 2 convolutional layers followed by 2 fully connected layers, while VGGNet consists of 13 convolutional layers and 3 fully connected layers....

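The 13 + 3 layer split quoted above is easy to check against a reference implementation; a small sketch using torchvision's VGG-16 (built here without downloading any weights):

```python
import torch.nn as nn
import torchvision

# Count the weight layers in torchvision's randomly initialised VGG-16.
model = torchvision.models.vgg16()
n_conv = sum(isinstance(m, nn.Conv2d) for m in model.modules())
n_fc = sum(isinstance(m, nn.Linear) for m in model.modules())
print(n_conv, n_fc)  # 13 3 -> 13 + 3 = 16 weight layers
```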

Proceedings ArticleDOI
01 Mar 2020
TL;DR: This paper introduces a large-scale Word-Level American Sign Language (WLASL) video dataset containing more than 2000 words performed by over 100 signers, and benchmarks appearance-based and pose-based deep models for word-level sign recognition on it.
Abstract: Vision-based sign language recognition aims at helping deaf people to communicate with others. However, most existing sign language datasets are limited to a small number of words. Due to the limited vocabulary size, models learned from those datasets cannot be applied in practice. In this paper, we introduce a new large-scale Word-Level American Sign Language (WLASL) video dataset, containing more than 2000 words performed by over 100 signers. This dataset will be made publicly available to the research community. To our knowledge, it is by far the largest public ASL dataset to facilitate word-level sign recognition research. Based on this new large-scale dataset, we are able to experiment with several deep learning methods for word-level sign recognition and evaluate their performance in large-scale scenarios. Specifically, we implement and compare two different models: (i) a holistic visual appearance-based approach, and (ii) a 2D human pose-based approach. Both models are valuable baselines that will benefit the community for method benchmarking. Moreover, we also propose a novel pose-based temporal graph convolution network (Pose-TGCN) that models spatial and temporal dependencies in human pose trajectories simultaneously, which further boosts the performance of the pose-based method. Our results show that pose-based and appearance-based models achieve comparable performance of up to 62.63% top-10 accuracy on 2,000 words/glosses, demonstrating both the validity and the challenges of our dataset. Our dataset and baseline deep models are available at https://dxli94.github.io/WLASL/.
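The abstract does not spell out the Pose-TGCN layer; as a hedged sketch of the general idea, one generic spatial-temporal graph convolution over pose keypoints might look like the following (an identity adjacency matrix stands in for the real skeleton graph; this is not the authors' exact Pose-TGCN):

```python
import torch
import torch.nn as nn

class TemporalGraphConv(nn.Module):
    """Illustrative spatial-temporal graph convolution over 2D pose keypoints.
    The adjacency matrix A (joints x joints) encodes the skeleton topology."""

    def __init__(self, in_channels, out_channels, A, temporal_kernel=9):
        super().__init__()
        self.register_buffer("A", A)                       # normalised adjacency
        self.spatial = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        pad = (temporal_kernel - 1) // 2
        self.temporal = nn.Conv2d(out_channels, out_channels,
                                  kernel_size=(temporal_kernel, 1),
                                  padding=(pad, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                                   # x: (N, C, T, V)
        x = torch.einsum("nctv,vw->nctw", x, self.A)        # aggregate neighbouring joints
        x = self.relu(self.spatial(x))                      # mix channels per joint
        return self.relu(self.temporal(x))                  # model temporal dependencies

# Hypothetical usage: 2D coordinates of 27 joints over 64 frames, batch of 8.
A = torch.eye(27)                                           # placeholder skeleton graph
layer = TemporalGraphConv(2, 64, A)
out = layer(torch.randn(8, 2, 64, 27))                      # -> (8, 64, 64, 27)
```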

263 citations

Journal ArticleDOI
TL;DR: A novel segmentation approach based on Hough voting that leverages the abstraction capabilities of convolutional neural networks (CNNs) is proposed; it is robust, multi-region, flexible, and can be easily adapted to different modalities.

263 citations


Cites background from "Very Deep Convolutional Networks fo..."

  • ...Although several networks architectures were analysed in [27], [28], we have found only one study on “very deep CNN” [40], in which the number of convolutional layers was varied systematically (8-16) while keeping kernel sizes fixed....

Posted Content
TL;DR: A novel Minimax Entropy (MME) approach that adversarially optimizes an adaptive few-shot model for the semi-supervised domain adaptation (SSDA) setting, establishing a new state of the art for SSDA.
Abstract: Contemporary domain adaptation methods are very effective at aligning feature distributions of source and target domains without any target supervision. However, we show that these techniques perform poorly when even a few labeled examples are available in the target. To address this semi-supervised domain adaptation (SSDA) setting, we propose a novel Minimax Entropy (MME) approach that adversarially optimizes an adaptive few-shot model. Our base model consists of a feature encoding network, followed by a classification layer that computes the features' similarity to estimated prototypes (representatives of each class). Adaptation is achieved by alternately maximizing the conditional entropy of unlabeled target data with respect to the classifier and minimizing it with respect to the feature encoder. We empirically demonstrate the superiority of our method over many baselines, including conventional feature alignment and few-shot methods, setting a new state of the art for SSDA.
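The minimax alternation described above is simple to write down; the sketch below shows one explicit two-step update on an unlabeled target batch (the released method folds both updates into a single pass via a gradient-reversal layer, and the function and optimizer names here are assumptions):

```python
import torch
import torch.nn.functional as F

def conditional_entropy(logits):
    """Mean entropy of the predicted class distribution over a batch."""
    p = F.softmax(logits, dim=1)
    return -(p * torch.log(p + 1e-8)).sum(dim=1).mean()

def mme_step(encoder, classifier, unlabeled_x, opt_enc, opt_cls, lam=0.1):
    # Maximize entropy w.r.t. the classifier (moves class prototypes toward target features).
    opt_cls.zero_grad()
    h = conditional_entropy(classifier(encoder(unlabeled_x).detach()))
    (-lam * h).backward()
    opt_cls.step()

    # Minimize entropy w.r.t. the feature encoder (clusters target features around prototypes).
    opt_enc.zero_grad()
    h = conditional_entropy(classifier(encoder(unlabeled_x)))
    (lam * h).backward()
    opt_enc.step()
```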

263 citations


Cites methods from "Very Deep Convolutional Networks fo..."

  • ...We employ AlexNet [16] and VGG16 [34] pre-trained on ImageNet....

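The excerpt above refers to ImageNet-pretrained backbones; a minimal sketch of loading one with torchvision and swapping the final layer for a new task (the class count is a placeholder, and older torchvision versions use pretrained=True instead of the weights argument):

```python
import torch.nn as nn
import torchvision

# ImageNet-pretrained VGG-16 as a starting point for fine-tuning.
backbone = torchvision.models.vgg16(weights="IMAGENET1K_V1")
num_classes = 126                                   # placeholder target class count
backbone.classifier[6] = nn.Linear(4096, num_classes)
```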

References
Proceedings ArticleDOI
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei
20 Jun 2009
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Abstract: The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.

49,639 citations

Proceedings ArticleDOI
23 Jun 2014
TL;DR: R-CNN combines high-capacity CNNs with bottom-up region proposals to localize and segment objects; when labeled training data is scarce, supervised pre-training for an auxiliary task followed by domain-specific fine-tuning yields a significant performance boost.
Abstract: Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also present experiments that provide insight into what the network learns, revealing a rich hierarchy of image features. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.
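A schematic sketch of the region-proposal-plus-CNN pipeline summarised above, with placeholders for the proposal generator and the per-class classifiers (the original system uses selective-search proposals and per-class SVMs):

```python
import torch
import torchvision

def rcnn_like(image, proposals, cnn, classify):
    """image: (C, H, W) tensor; proposals: list of (x1, y1, x2, y2) pixel boxes.
    Each proposal is cropped, warped to a fixed size, and scored by the CNN."""
    scores = []
    for (x1, y1, x2, y2) in proposals:
        crop = image[:, y1:y2, x1:x2].unsqueeze(0)             # (1, C, h, w)
        warped = torch.nn.functional.interpolate(
            crop, size=(224, 224), mode="bilinear", align_corners=False)
        features = cnn(warped)                                  # high-capacity CNN features
        scores.append(classify(features))                       # per-class scores for the region
    return scores

# Hypothetical usage: an AlexNet backbone as the feature extractor and a softmax
# stand-in for the per-class SVMs of the original paper.
cnn = torchvision.models.alexnet()
img = torch.rand(3, 480, 640)
out = rcnn_like(img, [(10, 20, 200, 240)], cnn, classify=lambda f: f.softmax(dim=1))
```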

21,729 citations

Posted Content
TL;DR: It is shown that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, improve on the previous best result in semantic segmentation.
Abstract: Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes one third of a second for a typical image.
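A minimal sketch in the spirit of the abstract: keep the convolutional part of a classification network (here VGG-16), score every spatial location with a 1x1 convolution, and upsample the coarse map back to input resolution; the class count is a placeholder and the skip connections of the full architecture are omitted.

```python
import torch
import torch.nn as nn
import torchvision

num_classes = 21                                    # e.g. PASCAL VOC; assumed here
features = torchvision.models.vgg16().features      # convolutional layers only
score = nn.Conv2d(512, num_classes, kernel_size=1)  # per-location class scores

x = torch.randn(1, 3, 224, 224)
coarse = score(features(x))                         # (1, 21, 7, 7) coarse predictions
full = nn.functional.interpolate(coarse, size=x.shape[2:], mode="bilinear",
                                 align_corners=False)
print(full.shape)                                   # torch.Size([1, 21, 224, 224])
```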

9,803 citations

Journal ArticleDOI
TL;DR: This paper demonstrates how constraints from the task domain can be integrated into a backpropagation network through the architecture of the network, an approach successfully applied to the recognition of handwritten zip code digits provided by the U.S. Postal Service.
Abstract: The ability of learning networks to generalize can be greatly enhanced by providing constraints from the task domain. This paper demonstrates how such constraints can be integrated into a backpropagation network through the architecture of the network. This approach has been successfully applied to the recognition of handwritten zip code digits provided by the U.S. Postal Service. A single network learns the entire recognition operation, going from the normalized image of the character to the final classification.
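The architectural constraints described above (local connections and shared weights) are exactly what a convolutional layer encodes; a small LeNet-style digit classifier as an illustration, not the exact 1989 network:

```python
import torch
import torch.nn as nn

# Convolution and pooling layers hard-wire locality and weight sharing into the network.
net = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),
    nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 4 * 4, 10),                     # 10 digit classes
)
print(net(torch.randn(1, 1, 28, 28)).shape)        # torch.Size([1, 10])
```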

9,775 citations

Journal ArticleDOI
TL;DR: A review of the Pascal Visual Object Classes challenge from 2008-2012 and an appraisal of the aspects of the challenge that worked well, and those that could be improved in future challenges.
Abstract: The Pascal Visual Object Classes (VOC) challenge consists of two components: (i) a publicly available dataset of images together with ground truth annotation and standardised evaluation software; and (ii) an annual competition and workshop. There are five challenges: classification, detection, segmentation, action classification, and person layout. In this paper we provide a review of the challenge from 2008-2012. The paper is intended for two audiences: algorithm designers, researchers who want to see what the state of the art is, as measured by performance on the VOC datasets, along with the limitations and weak points of the current generation of algorithms; and challenge designers, who want to see what we as organisers have learnt from the process and our recommendations for the organisation of future challenges. To analyse the performance of submitted algorithms on the VOC datasets we introduce a number of novel evaluation methods: a bootstrapping method for determining whether differences in the performance of two algorithms are significant or not; a normalised average precision so that performance can be compared across classes with different proportions of positive instances; a clustering method for visualising the performance across multiple algorithms so that the hard and easy images can be identified; and the use of a joint classifier over the submitted algorithms in order to measure their complementarity and combined performance. We also analyse the community's progress through time using the methods of Hoiem et al. (Proceedings of European Conference on Computer Vision, 2012) to identify the types of occurring errors. We conclude the paper with an appraisal of the aspects of the challenge that worked well, and those that could be improved in future challenges.

6,061 citations