Showing papers on "Convolutional neural network" published in 2015

Proceedings Article•DOI•

Going deeper with convolutions

Christian Szegedy¹, Wei Liu², Yangqing Jia¹, Pierre Sermanet¹, Scott Reed³, Dragomir Anguelov¹, Dumitru Erhan¹, Vincent Vanhoucke¹, Andrew Rabinovich•Institutions (3)

Google¹, University of North Carolina at Chapel Hill², University of Michigan³

07 Jun 2015

TL;DR: Inception is a deep convolutional neural network architecture that set a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).

Abstract: We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.

40,257 citations

Proceedings Article•

Spatial transformer networks

Max Jaderberg¹, Karen Simonyan¹, Andrew Zisserman¹, Koray Kavukcuoglu¹•Institutions (1)

Google¹

07 Dec 2015

TL;DR: This work introduces a new learnable module, the Spatial Transformer, which explicitly allows the spatial manipulation of data within the network, and can be inserted into existing convolutional architectures, giving neural networks the ability to actively spatially transform feature maps.

Abstract: Convolutional Neural Networks define an exceptionally powerful class of models, but are still limited by the lack of ability to be spatially invariant to the input data in a computationally and parameter efficient manner. In this work we introduce a new learnable module, the Spatial Transformer, which explicitly allows the spatial manipulation of data within the network. This differentiable module can be inserted into existing convolutional architectures, giving neural networks the ability to actively spatially transform feature maps, conditional on the feature map itself, without any extra training supervision or modification to the optimisation process. We show that the use of spatial transformers results in models which learn invariance to translation, scale, rotation and more generic warping, resulting in state-of-the-art performance on several benchmarks, and for a number of classes of transformations.
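The sampling step of such a module can be sketched in plain Python. This is a minimal illustration, assuming a hand-supplied 2x3 affine matrix `theta` (in the paper it would be predicted by a learned localisation network), coordinates normalized to [-1, 1], and bilinear interpolation; the function names `affine_grid` and `bilinear_sample` are hypothetical.

```python
import math


def affine_grid(theta, out_h, out_w):
    """Map every output pixel to a source coordinate in [-1, 1] x [-1, 1].

    theta is a 2x3 affine matrix [[a, b, tx], [c, d, ty]] applied to the
    normalized target coordinates (x_t, y_t, 1).
    """
    grid = []
    for i in range(out_h):
        y_t = -1.0 + 2.0 * i / (out_h - 1) if out_h > 1 else 0.0
        row = []
        for j in range(out_w):
            x_t = -1.0 + 2.0 * j / (out_w - 1) if out_w > 1 else 0.0
            x_s = theta[0][0] * x_t + theta[0][1] * y_t + theta[0][2]
            y_s = theta[1][0] * x_t + theta[1][1] * y_t + theta[1][2]
            row.append((x_s, y_s))
        grid.append(row)
    return grid


def bilinear_sample(image, grid):
    """Sample a single-channel image at the grid points (zero outside)."""
    h, w = len(image), len(image[0])
    out = []
    for row in grid:
        out_row = []
        for x_s, y_s in row:
            # Convert normalized coordinates back to pixel coordinates.
            x = (x_s + 1.0) * (w - 1) / 2.0
            y = (y_s + 1.0) * (h - 1) / 2.0
            x0, y0 = math.floor(x), math.floor(y)
            value = 0.0
            # Weighted sum over the four neighbouring pixels.
            for yy in (y0, y0 + 1):
                for xx in (x0, x0 + 1):
                    if 0 <= yy < h and 0 <= xx < w:
                        weight = max(0.0, 1 - abs(x - xx)) * max(0.0, 1 - abs(y - yy))
                        value += weight * image[yy][xx]
            out_row.append(value)
        out.append(out_row)
    return out
```

With the identity matrix `[[1, 0, 0], [0, 1, 0]]` the feature map is reproduced unchanged; negating the first entry mirrors it horizontally. Because the interpolation weights are piecewise linear in the sampling coordinates, gradients can flow back to `theta`, which is what makes the module trainable without extra supervision.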

6,150 citations

Journal Article•DOI•

Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

Kaiming He¹, Xiangyu Zhang², Shaoqing Ren³, Jian Sun¹•Institutions (3)

Microsoft¹, Xi'an Jiaotong University², University of Science and Technology of China³

01 Sep 2015-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: This work equips CNNs with a "spatial pyramid pooling" strategy that removes the fixed-size input requirement; the resulting network, SPP-net, generates a fixed-length representation regardless of image size/scale.

Abstract: Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g., 224 × 224) input image. This requirement is “artificial” and may reduce the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with another pooling strategy, “spatial pyramid pooling”, to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. Pyramid pooling is also robust to object deformations. With these advantages, SPP-net should in general improve all CNN-based image classification methods. On the ImageNet 2012 dataset, we demonstrate that SPP-net boosts the accuracy of a variety of CNN architectures despite their different designs. On the Pascal VOC 2007 and Caltech101 datasets, SPP-net achieves state-of-the-art classification results using a single full-image representation and no fine-tuning. The power of SPP-net is also significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This method avoids repeatedly computing the convolutional features. In processing test images, our method is 24-102× faster than the R-CNN method, while achieving better or comparable accuracy on Pascal VOC 2007. In ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014, our methods rank #2 in object detection and #3 in image classification among all 38 teams. This manuscript also introduces the improvement made for this competition.
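The fixed-length property can be illustrated with a small sketch. It assumes a single-channel feature map stored as a list of lists, max pooling inside each bin, and pyramid levels of 1, 2 and 4 bins per side; the function name and the bin-boundary rule (floor/ceil) are illustrative choices, not taken from the abstract.

```python
import math


def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a 2D feature map into n x n bins for each pyramid level.

    The output length is sum(n * n for n in levels), whatever the
    height and width of the input map.
    """
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Bin boundaries adapt to the input size (bins may overlap).
            r0 = math.floor(i * h / n)
            r1 = math.ceil((i + 1) * h / n)
            for j in range(n):
                c0 = math.floor(j * w / n)
                c1 = math.ceil((j + 1) * w / n)
                pooled.append(
                    max(feature_map[r][c]
                        for r in range(r0, r1)
                        for c in range(c0, c1))
                )
    return pooled
```

A 5 × 7 map and a 13 × 9 map both yield 1 + 4 + 16 = 21 values per channel, so the fully connected layers that follow can keep a fixed input size.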

5,919 citations

Proceedings Article•DOI•

Deep face recognition

Omkar M. Parkhi¹, Andrea Vedaldi¹, Andrew Zisserman¹•Institutions (1)

University of Oxford¹

01 Jan 2015

TL;DR: It is shown how a very large scale dataset can be assembled by a combination of automation and human in the loop, and the trade off between data purity and time is discussed.

Abstract: The goal of this paper is face recognition – from either a single photograph or from a set of faces tracked in a video. Recent progress in this area has been due to two factors: (i) end to end learning for the task using a convolutional neural network (CNN), and (ii) the availability of very large scale training datasets. We make two contributions: first, we show how a very large scale dataset (2.6M images, over 2.6K people) can be assembled by a combination of automation and human in the loop, and discuss the trade off between data purity and time; second, we traverse through the complexities of deep network training and face recognition to present methods and procedures to achieve comparable state of the art results on the standard LFW and YTF face benchmarks.

5,308 citations

Posted Content•

Learning Deep Features for Discriminative Localization

Bolei Zhou¹, Aditya Khosla¹, Agata Lapedriza¹, Aude Oliva¹, Antonio Torralba¹•Institutions (1)

Massachusetts Institute of Technology¹

14 Dec 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, the authors revisited the global average pooling layer and shed light on how it explicitly enables the convolutional neural network to have remarkable localization ability despite being trained on image-level labels.

Abstract: In this work, we revisit the global average pooling layer proposed in [13], and shed light on how it explicitly enables the convolutional neural network to have remarkable localization ability despite being trained on image-level labels. While this technique was previously proposed as a means for regularizing training, we find that it actually builds a generic localizable deep representation that can be applied to a variety of tasks. Despite the apparent simplicity of global average pooling, we are able to achieve 37.1% top-5 error for object localization on ILSVRC 2014, which is remarkably close to the 34.2% top-5 error achieved by a fully supervised CNN approach. We demonstrate that our network is able to localize the discriminative image regions on a variety of tasks despite not being trained for them.
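One way to see why global average pooling keeps localization information is the following sketch. It assumes the last convolutional layer outputs K feature maps, that the classifier is a single linear layer on top of their spatial averages, and that a per-class localization map is the same weighted sum applied before averaging. The abstract does not spell out this formula, so treat it and the function names as assumptions.

```python
def global_average_pool(feature_maps):
    """Average each 2D feature map down to a single number."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]


def class_score(feature_maps, class_weights):
    """Linear classifier applied to the pooled features."""
    pooled = global_average_pool(feature_maps)
    return sum(w * p for w, p in zip(class_weights, pooled))


def class_activation_map(feature_maps, class_weights):
    """Apply the same class weights at every spatial position."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [
            sum(wk * fmap[y][x] for wk, fmap in zip(class_weights, feature_maps))
            for x in range(w)
        ]
        for y in range(h)
    ]
```

Under these assumptions the class score equals the spatial mean of the activation map, so positions with large map values are exactly the ones that drive the image-level prediction.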

5,065 citations

Proceedings Article•DOI•

Long-term recurrent convolutional networks for visual recognition and description

Jeff Donahue¹, Lisa Anne Hendricks¹, Sergio Guadarrama¹, Marcus Rohrbach¹, Subhashini Venugopalan², Trevor Darrell¹, Kate Saenko³•Institutions (3)

University of California, Berkeley¹, University of Texas at Austin², University of Massachusetts Lowell³

07 Jun 2015

TL;DR: A novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and shows such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.

Abstract: Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent, or “temporally deep”, are effective for tasks involving sequences, visual and otherwise. We develop a novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and demonstrate the value of these models on benchmark video recognition tasks, image description and retrieval problems, and video narration challenges. In contrast to current models which assume a fixed spatio-temporal receptive field or simple temporal averaging for sequential processing, recurrent convolutional models are “doubly deep” in that they can be compositional in spatial and temporal “layers”. Such models may have advantages when target concepts are complex and/or training data are limited. Learning long-term dependencies is possible when nonlinearities are incorporated into the network state updates. Long-term RNN models are appealing in that they directly can map variable-length inputs (e.g., video frames) to variable length outputs (e.g., natural language text) and can model complex temporal dynamics; yet they can be optimized with backpropagation. Our recurrent long-term models are directly connected to modern visual convnet models and can be jointly trained to simultaneously learn temporal dynamics and convolutional perceptual representations. Our results show such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.

4,206 citations

Proceedings Article•DOI•

Deep visual-semantic alignments for generating image descriptions

Andrej Karpathy¹, Li Fei-Fei¹•Institutions (1)

Stanford University¹

07 Jun 2015

TL;DR: A model that generates natural language descriptions of images and their regions based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding is presented.

Abstract: We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate that our alignment model produces state of the art results in retrieval experiments on Flickr8K, Flickr30K and MSCOCO datasets. We then show that the generated descriptions significantly outperform retrieval baselines on both full images and on a new dataset of region-level annotations.

3,996 citations

Proceedings Article•DOI•

FlowNet: Learning Optical Flow with Convolutional Networks

Alexey Dosovitskiy¹, Philipp Fischer, Eddy Ilg¹, Philip Häusser², Caner Hazirbas², Vladimir Golkov², Patrick van der Smagt², Daniel Cremers², Thomas Brox¹•Institutions (2)

University of Freiburg¹, Technische Universität München²

07 Dec 2015

TL;DR: In this paper, the authors propose and compare two optical flow architectures: a generic architecture and another one including a layer that correlates feature vectors at different image locations, and show that networks trained on the synthetic Flying Chairs dataset still generalize very well to existing datasets such as Sintel and KITTI.

Abstract: Convolutional neural networks (CNNs) have recently been very successful in a variety of computer vision tasks, especially on those linked to recognition. Optical flow estimation has not been among the tasks CNNs succeeded at. In this paper we construct CNNs which are capable of solving the optical flow estimation problem as a supervised learning task. We propose and compare two architectures: a generic architecture and another one including a layer that correlates feature vectors at different image locations. Since existing ground truth data sets are not sufficiently large to train a CNN, we generate a large synthetic Flying Chairs dataset. We show that networks trained on this unrealistic data still generalize very well to existing datasets such as Sintel and KITTI, achieving competitive accuracy at frame rates of 5 to 10 fps.

3,833 citations

Proceedings Article•

Striving for Simplicity: The All Convolutional Net

Jost Tobias Springenberg¹, Alexey Dosovitskiy¹, Thomas Brox¹, Martin Riedmiller¹•Institutions (1)

University of Freiburg¹

01 Jan 2015

TL;DR: It is found that max-pooling can simply be replaced by a convolutional layer with increased stride without loss in accuracy on several image recognition benchmarks.

Abstract: Most modern convolutional neural networks (CNNs) used for object recognition are built using the same principles: Alternating convolution and max-pooling layers followed by a small number of fully connected layers. We re-evaluate the state of the art for object recognition from small images with convolutional networks, questioning the necessity of different components in the pipeline. We find that max-pooling can simply be replaced by a convolutional layer with increased stride without loss in accuracy on several image recognition benchmarks. Following this finding -- and building on other recent work for finding simple network structures -- we propose a new architecture that consists solely of convolutional layers and yields competitive or state of the art performance on several object recognition datasets (CIFAR-10, CIFAR-100, ImageNet). To analyze the network we introduce a new variant of the "deconvolution approach" for visualizing features learned by CNNs, which can be applied to a broader range of network structures than existing approaches.
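The replacement of max-pooling by a strided convolution can be sketched in one dimension. This is only an illustration of the downsampling behaviour, assuming no padding and a hand-picked kernel; in the paper the kernel weights would be learned, and the function names are hypothetical.

```python
def max_pool_1d(signal, size=2, stride=2):
    """Fixed downsampling: keep the maximum of each window."""
    return [
        max(signal[i:i + size])
        for i in range(0, len(signal) - size + 1, stride)
    ]


def strided_conv_1d(signal, kernel, stride=2):
    """Downsampling with a weighted sum per window (weights are trainable
    in a real network; here they are passed in by hand)."""
    k = len(kernel)
    return [
        sum(kernel[t] * signal[i + t] for t in range(k))
        for i in range(0, len(signal) - k + 1, stride)
    ]
```

For the signal `[1, 3, 2, 8, 5, 4]`, both operations return three values, so a strided convolution can take the place of the pooling layer without changing the shapes seen by later layers; the difference is that its window weights are parameters rather than a fixed maximum.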

3,601 citations

Posted Content•

Learning Transferable Features with Deep Adaptation Networks

Mingsheng Long¹,², Yue Cao¹, Jianmin Wang¹, Michael I. Jordan²•Institutions (2)

Tsinghua University¹, University of California, Berkeley²

10 Feb 2015-arXiv: Learning

TL;DR: A new Deep Adaptation Network (DAN) architecture is proposed, which generalizes deep convolutional neural network to the domain adaptation scenario and can learn transferable features with statistical guarantees, and can scale linearly by unbiased estimate of kernel embedding.

Abstract: Recent studies reveal that a deep neural network can learn transferable features which generalize well to novel tasks for domain adaptation. However, as deep features eventually transition from general to specific along the network, the feature transferability drops significantly in higher layers with increasing domain discrepancy. Hence, it is important to formally reduce the dataset bias and enhance the transferability in task-specific layers. In this paper, we propose a new Deep Adaptation Network (DAN) architecture, which generalizes deep convolutional neural network to the domain adaptation scenario. In DAN, hidden representations of all task-specific layers are embedded in a reproducing kernel Hilbert space where the mean embeddings of different domain distributions can be explicitly matched. The domain discrepancy is further reduced using an optimal multi-kernel selection method for mean embedding matching. DAN can learn transferable features with statistical guarantees, and can scale linearly by unbiased estimate of kernel embedding. Extensive empirical evidence shows that the proposed architecture yields state-of-the-art image classification error rates on standard domain adaptation benchmarks.
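The linear-scaling, unbiased estimate mentioned above can be sketched as a linear-time estimator of the squared maximum mean discrepancy (MMD) between source and target features. This is a minimal sketch under assumptions: Gaussian kernels with fixed bandwidths, uniform kernel weights (the paper selects the weights by optimization), and hypothetical function names.

```python
import math


def gaussian_kernel(a, b, bandwidth):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def multi_kernel(a, b, bandwidths, weights):
    """Convex combination of Gaussian kernels."""
    return sum(w * gaussian_kernel(a, b, s) for w, s in zip(weights, bandwidths))


def linear_time_mmd2(source, target, bandwidths=(0.5, 1.0, 2.0), weights=None):
    """Unbiased MMD^2 estimate that touches each sample once (O(n) cost).

    Samples are consumed in disjoint pairs; each quadruple
    (x1, x2, y1, y2) contributes k(x1,x2) + k(y1,y2) - k(x1,y2) - k(x2,y1).
    """
    if weights is None:
        weights = [1.0 / len(bandwidths)] * len(bandwidths)  # assumption: uniform
    n_pairs = min(len(source), len(target)) // 2
    if n_pairs == 0:
        raise ValueError("need at least two samples per domain")
    total = 0.0
    for i in range(n_pairs):
        x1, x2 = source[2 * i], source[2 * i + 1]
        y1, y2 = target[2 * i], target[2 * i + 1]
        total += (
            multi_kernel(x1, x2, bandwidths, weights)
            + multi_kernel(y1, y2, bandwidths, weights)
            - multi_kernel(x1, y2, bandwidths, weights)
            - multi_kernel(x2, y1, bandwidths, weights)
        )
    return total / n_pairs
```

When both domains produce the same features the estimate is zero, and it grows as the two feature distributions move apart, which is the quantity a domain adaptation loss would penalize in the task-specific layers.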

3,351 citations

Proceedings Article•DOI•

VoxNet: A 3D Convolutional Neural Network for real-time object recognition

Daniel Maturana¹, Sebastian Scherer¹•Institutions (1)

Carnegie Mellon University¹

01 Sep 2015

TL;DR: VoxNet is proposed, an architecture to tackle the problem of robust object recognition by integrating a volumetric Occupancy Grid representation with a supervised 3D Convolutional Neural Network (3D CNN).

Abstract: Robust object recognition is a crucial skill for robots operating autonomously in real world environments. Range sensors such as LiDAR and RGBD cameras are increasingly found in modern robotic systems, providing a rich source of 3D information that can aid in this task. However, many current systems do not fully utilize this information and have trouble efficiently dealing with large amounts of point cloud data. In this paper, we propose VoxNet, an architecture to tackle this problem by integrating a volumetric Occupancy Grid representation with a supervised 3D Convolutional Neural Network (3D CNN). We evaluate our approach on publicly available benchmarks using LiDAR, RGBD, and CAD data. VoxNet achieves accuracy beyond the state of the art while labeling hundreds of instances per second.
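A volumetric occupancy grid of the kind fed to a 3D CNN can be sketched as follows. The sketch assumes a binary grid (a cell is 1 if any point falls inside it), a cubic bounding volume, and a small grid size; the function name and defaults are hypothetical, and the paper's own occupancy models may differ.

```python
import math


def occupancy_grid(points, grid_size=4, bounds=(0.0, 1.0)):
    """Voxelize 3D points into a grid_size^3 binary occupancy grid.

    Points outside the half-open cube [lo, hi)^3 are ignored.
    """
    lo, hi = bounds
    cell = (hi - lo) / grid_size
    grid = [
        [[0] * grid_size for _ in range(grid_size)]
        for _ in range(grid_size)
    ]
    for x, y, z in points:
        # floor (not int) so that coordinates just below lo are rejected.
        idx = [math.floor((v - lo) / cell) for v in (x, y, z)]
        if all(0 <= i < grid_size for i in idx):
            grid[idx[0]][idx[1]][idx[2]] = 1
    return grid
```

The resulting dense grid has a fixed shape regardless of how many points the sensor returned, which is what allows 3D convolutions to be applied to LiDAR, RGBD, or CAD-derived point clouds alike.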

Proceedings Article•DOI•

Deep neural networks are easily fooled: High confidence predictions for unrecognizable images

Anh Nguyen¹, Jason Yosinski², Jeff Clune¹•Institutions (2)

University of Wyoming¹, Cornell University²

07 Jun 2015

TL;DR: In this article, the authors show that it is possible to produce images that are completely unrecognizable to humans, but that state-of-the-art DNNs believe to be recognizable objects with 99.99% confidence.

Abstract: Deep neural networks (DNNs) have recently been achieving state-of-the-art performance on a variety of pattern-recognition tasks, most notably visual classification problems. Given that DNNs are now able to classify objects in images with near-human-level performance, questions naturally arise as to what differences remain between computer and human vision. A recent study [30] revealed that changing an image (e.g. of a lion) in a way imperceptible to humans can cause a DNN to label the image as something else entirely (e.g. mislabeling a lion a library). Here we show a related result: it is easy to produce images that are completely unrecognizable to humans, but that state-of-the-art DNNs believe to be recognizable objects with 99.99% confidence (e.g. labeling with certainty that white noise static is a lion). Specifically, we take convolutional neural networks trained to perform well on either the ImageNet or MNIST datasets and then find images with evolutionary algorithms or gradient ascent that DNNs label with high confidence as belonging to each dataset class. It is possible to produce images totally unrecognizable to human eyes that DNNs believe with near certainty are familiar objects, which we call “fooling images” (more generally, fooling examples). Our results shed light on interesting differences between human vision and current DNNs, and raise questions about the generality of DNN computer vision.

Proceedings Article•DOI•

MatConvNet: Convolutional Neural Networks for MATLAB

Andrea Vedaldi¹, Karel Lenc¹•Institutions (1)

University of Oxford¹

13 Oct 2015

TL;DR: MatConvNet exposes the building blocks of CNNs as easy-to-use MATLAB functions, providing routines for computing convolutions with filter banks, feature pooling, normalisation, and much more.

Abstract: MatConvNet is an open source implementation of Convolutional Neural Networks (CNNs) with a deep integration in the MATLAB environment. The toolbox is designed with an emphasis on simplicity and flexibility. It exposes the building blocks of CNNs as easy-to-use MATLAB functions, providing routines for computing convolutions with filter banks, feature pooling, normalisation, and much more. MatConvNet can be easily extended, often using only MATLAB code, allowing fast prototyping of new CNN architectures. At the same time, it supports efficient computation on CPU and GPU, allowing to train complex models on large datasets such as ImageNet ILSVRC containing millions of training examples.

Proceedings Article•

Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs

Liang-Chieh Chen¹, George Papandreou², Iasonas Kokkinos³, Kevin Murphy², Alan L. Yuille¹•Institutions (3)

University of California, Los Angeles¹, Google², CentraleSupélec³

07 May 2015

TL;DR: DeepLab combines the responses at the final DCNN layer with a fully connected CRF to localize segment boundaries more accurately than previous methods, reaching 71.6% IOU accuracy on the PASCAL VOC-2012 test set.

Abstract: Deep Convolutional Neural Networks (DCNNs) have recently shown state of the art performance in high level vision tasks, such as image classification and object detection. This work brings together methods from DCNNs and probabilistic graphical models for addressing the task of pixel-level classification (also called "semantic image segmentation"). We show that responses at the final layer of DCNNs are not sufficiently localized for accurate object segmentation. This is due to the very invariance properties that make DCNNs good for high level tasks. We overcome this poor localization property of deep networks by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF). Qualitatively, our "DeepLab" system is able to localize segment boundaries at a level of accuracy which is beyond previous methods. Quantitatively, our method sets the new state-of-art at the PASCAL VOC-2012 semantic image segmentation task, reaching 71.6% IOU accuracy in the test set. We show how these results can be obtained efficiently: Careful network re-purposing and a novel application of the 'hole' algorithm from the wavelet community allow dense computation of neural net responses at 8 frames per second on a modern GPU.
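The 'hole' algorithm can be illustrated in one dimension as a convolution whose kernel taps are spaced `rate` samples apart. This is a sketch under assumptions: no padding, stride 1, a hand-picked kernel, and a hypothetical function name; it is not the DeepLab implementation.

```python
def atrous_conv_1d(signal, kernel, rate=1):
    """1D convolution with holes: kernel taps are `rate` samples apart.

    rate=1 is an ordinary convolution (no kernel flip, as in most
    deep learning libraries); larger rates widen the receptive field
    without adding weights or reducing the output resolution.
    """
    span = (len(kernel) - 1) * rate  # distance between first and last tap
    return [
        sum(kernel[t] * signal[i + t * rate] for t in range(len(kernel)))
        for i in range(len(signal) - span)
    ]
```

With a 3-tap kernel, `rate=2` covers 5 input samples per output while still producing one output per stride-1 position, which is how dense responses can be computed from a network originally trained with downsampling layers.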

Posted Content•

Empirical Evaluation of Rectified Activations in Convolutional Network

Bing Xu, Naiyan Wang, Tianqi Chen, Mu Li

05 May 2015-arXiv: Learning

TL;DR: The experiments suggest that incorporating a non-zero slope for negative part in rectified activation units could consistently improve the results, and are negative on the common belief that sparsity is the key of good performance in ReLU.

Abstract: In this paper we investigate the performance of different types of rectified activation functions in convolutional neural network: standard rectified linear unit (ReLU), leaky rectified linear unit (Leaky ReLU), parametric rectified linear unit (PReLU) and a new randomized leaky rectified linear units (RReLU). We evaluate these activation function on standard image classification task. Our experiments suggest that incorporating a non-zero slope for negative part in rectified activation units could consistently improve the results. Thus our findings are negative on the common belief that sparsity is the key of good performance in ReLU. Moreover, on small scale dataset, using deterministic negative slope or learning it are both prone to overfitting. They are not as effective as using their randomized counterpart. By using RReLU, we achieved 75.68% accuracy on CIFAR-100 test set without multiple test or ensemble.
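The four activations compared above differ only in how they treat negative inputs, which a scalar sketch makes explicit. The slope values and the RReLU convention used here (slope drawn uniformly from [1/8, 1/3] during training, fixed to the midpoint at test time) are assumptions for illustration and may not match the paper's exact parameterization.

```python
import random


def relu(x):
    """Standard ReLU: negative inputs are zeroed."""
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    """Leaky ReLU: fixed small slope on the negative side."""
    return x if x >= 0 else slope * x


def prelu(x, learned_slope):
    """PReLU: same form as Leaky ReLU, but the slope is a trained parameter."""
    return x if x >= 0 else learned_slope * x


def rrelu(x, lower=1 / 8, upper=1 / 3, training=True, rng=random):
    """RReLU: random negative slope while training, its mean at test time."""
    if x >= 0:
        return x
    slope = rng.uniform(lower, upper) if training else (lower + upper) / 2.0
    return slope * x
```

All variants agree on positive inputs; only `relu` produces exact zeros for negative ones, which is the sparsity the abstract argues is not essential.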

Proceedings Article•DOI•

Multi-view Convolutional Neural Networks for 3D Shape Recognition

Hang Su¹, Subhransu Maji¹, Evangelos Kalogerakis¹, Erik Learned-Miller¹•Institutions (1)

University of Massachusetts Amherst¹

07 Dec 2015

TL;DR: In this article, a CNN architecture is proposed to combine information from multiple views of a 3D shape into a single and compact shape descriptor, which can be applied to accurately recognize human hand-drawn sketches of shapes.

Abstract: A longstanding question in computer vision concerns the representation of 3D shapes for recognition: should 3D shapes be represented with descriptors operating on their native 3D formats, such as voxel grid or polygon mesh, or can they be effectively represented with view-based descriptors? We address this question in the context of learning to recognize 3D shapes from a collection of their rendered views on 2D images. We first present a standard CNN architecture trained to recognize the shapes' rendered views independently of each other, and show that a 3D shape can be recognized even from a single view at an accuracy far higher than using state-of-the-art 3D shape descriptors. Recognition rates further increase when multiple views of the shapes are provided. In addition, we present a novel CNN architecture that combines information from multiple views of a 3D shape into a single and compact shape descriptor offering even better recognition performance. The same architecture can be applied to accurately recognize human hand-drawn sketches of shapes. We conclude that a collection of 2D views can be highly informative for 3D shape recognition and is amenable to emerging CNN architectures and their derivatives.
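Combining per-view features into one descriptor can be sketched as an element-wise reduction across views. The choice of element-wise max is an assumption here (the abstract does not name the operation), and `view_pool` is a hypothetical name.

```python
def view_pool(view_features):
    """Merge per-view feature vectors into one descriptor.

    view_features: list of equal-length feature vectors, one per rendered
    view. The element-wise max makes the result independent of view order.
    """
    if not view_features:
        raise ValueError("at least one view is required")
    return [max(values) for values in zip(*view_features)]
```

The descriptor has the same length as a single view's feature vector no matter how many views are rendered, so the layers after the pooling step do not depend on the number of views.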

Proceedings Article•DOI•

Holistically-Nested Edge Detection

Saining Xie¹, Zhuowen Tu¹•Institutions (1)

University of California, San Diego¹

07 Dec 2015

TL;DR: HED turns pixel-wise edge classification into image-to-image prediction by means of a deep learning model that leverages fully convolutional neural networks and deeply-supervised nets to approach the human ability to resolve the challenging ambiguity in edge and object boundary detection.

Abstract: We develop a new edge detection algorithm that addresses two critical issues in this long-standing vision problem: (1) holistic image training, and (2) multi-scale feature learning. Our proposed method, holistically-nested edge detection (HED), turns pixel-wise edge classification into image-to-image prediction by means of a deep learning model that leverages fully convolutional neural networks and deeply-supervised nets. HED automatically learns rich hierarchical representations (guided by deep supervision on side responses) that are crucially important in order to approach the human ability to resolve the challenging ambiguity in edge and object boundary detection. We significantly advance the state-of-the-art on the BSD500 dataset (ODS F-score of 0.782) and the NYU Depth dataset (ODS F-score of 0.746), and do so with an improved speed (0.4 second per image) that is orders of magnitude faster than recent CNN-based edge detection algorithms.

Proceedings Article•DOI•

Beyond short snippets: Deep networks for video classification

Joe Yue-Hei Ng¹, Matthew Hausknecht², Sudheendra Vijayanarasimhan³, Oriol Vinyals³, Rajat Monga³, George Toderici³•Institutions (3)

University of Maryland, College Park¹, University of Texas at Austin², Google³

07 Jun 2015

TL;DR: In this article, a recurrent neural network that uses Long Short-Term Memory (LSTM) cells which are connected to the output of the underlying CNN was proposed to model the video as an ordered sequence of frames.

Abstract: Convolutional neural networks (CNNs) have been extensively applied for image recognition problems giving state-of-the-art results on recognition, detection, segmentation and retrieval. In this work we propose and evaluate several deep neural network architectures to combine image information across a video over longer time periods than previously attempted. We propose two methods capable of handling full length videos. The first method explores various convolutional temporal feature pooling architectures, examining the various design choices which need to be made when adapting a CNN for this task. The second proposed method explicitly models the video as an ordered sequence of frames. For this purpose we employ a recurrent neural network that uses Long Short-Term Memory (LSTM) cells which are connected to the output of the underlying CNN. Our best networks exhibit significant performance improvements over previously published results on the Sports 1 million dataset (73.1% vs. 60.9%) and the UCF-101 datasets with (88.6% vs. 88.0%) and without additional optical flow information (82.6% vs. 73.0%).

Proceedings Article•

Recurrent convolutional neural networks for text classification

Siwei Lai¹, Liheng Xu¹, Kang Liu¹, Jun Zhao¹•Institutions (1)

Chinese Academy of Sciences¹

25 Jan 2015

TL;DR: A recurrent convolutional neural network is introduced for text classification without human-designed features to capture contextual information as far as possible when learning word representations, which may introduce considerably less noise compared to traditional window-based neural networks.

Abstract: Text classification is a foundational task in many NLP applications. Traditional text classifiers often rely on many human-designed features, such as dictionaries, knowledge bases and special tree kernels. In contrast to traditional methods, we introduce a recurrent convolutional neural network for text classification without human-designed features. In our model, we apply a recurrent structure to capture contextual information as far as possible when learning word representations, which may introduce considerably less noise compared to traditional window-based neural networks. We also employ a max-pooling layer that automatically judges which words play key roles in text classification to capture the key components in texts. We conduct experiments on four commonly used datasets. The experimental results show that the proposed method outperforms the state-of-the-art methods on several datasets, particularly on document-level datasets.

Proceedings Article•DOI•

Conditional Random Fields as Recurrent Neural Networks

Shuai Zheng¹, Sadeep Jayasumana¹, Bernardino Romera-Paredes¹, Vibhav Vineet², Zhizhong Su, Dalong Du, Chang Huang³, Philip H. S. Torr¹•Institutions (3)

University of Oxford¹, Stanford University², Baidu³

07 Dec 2015

TL;DR: In this article, a new form of convolutional neural network that combines the strengths of Convolutional Neural Networks (CNNs) and Conditional Random Fields (CRFs)-based probabilistic graphical modelling is introduced.

Abstract: Pixel-level labelling tasks, such as semantic segmentation, play a central role in image understanding. Recent approaches have attempted to harness the capabilities of deep learning techniques for image recognition to tackle pixel-level labelling tasks. One central issue in this methodology is the limited capacity of deep learning techniques to delineate visual objects. To solve this problem, we introduce a new form of convolutional neural network that combines the strengths of Convolutional Neural Networks (CNNs) and Conditional Random Fields (CRFs)-based probabilistic graphical modelling. To this end, we formulate Conditional Random Fields with Gaussian pairwise potentials and mean-field approximate inference as Recurrent Neural Networks. This network, called CRF-RNN, is then plugged in as a part of a CNN to obtain a deep network that has desirable properties of both CNNs and CRFs. Importantly, our system fully integrates CRF modelling with CNNs, making it possible to train the whole deep network end-to-end with the usual back-propagation algorithm, avoiding offline post-processing methods for object delineation. We apply the proposed method to the problem of semantic image segmentation, obtaining top results on the challenging Pascal VOC 2012 segmentation benchmark.

Proceedings Article•DOI•

Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks

Chen Zhang¹, Peng Li², Guangyu Sun¹, Yijin Guan¹, Bingjun Xiao², Jason Cong²•Institutions (2)

Peking University¹, University of California, Los Angeles²

22 Feb 2015

TL;DR: This work implements a CNN accelerator on a VC707 FPGA board and compares it to previous approaches, achieving a peak performance of 61.62 GFLOPS at a 100MHz working frequency and significantly outperforming previous approaches.

Abstract: Convolutional neural network (CNN) has been widely employed for image recognition because it can achieve high accuracy by emulating behavior of optic nerves in living creatures. Recently, rapid growth of modern applications based on deep learning algorithms has further improved research and implementations. Especially, various accelerators for deep CNN have been proposed based on FPGA platform because it has advantages of high performance, reconfigurability, and fast development round, etc. Although current FPGA accelerators have demonstrated better performance over generic processors, the accelerator design space has not been well exploited. One critical problem is that the computation throughput may not well match the memory bandwidth provided by an FPGA platform. Consequently, existing approaches cannot achieve best performance due to under-utilization of either logic resource or memory bandwidth. At the same time, the increasing complexity and scalability of deep learning applications aggravate this problem. In order to overcome this problem, we propose an analytical design scheme using the roofline model. For any solution of a CNN design, we quantitatively analyze its computing throughput and required memory bandwidth using various optimization techniques, such as loop tiling and transformation. Then, with the help of the roofline model, we can identify the solution with best performance and lowest FPGA resource requirement. As a case study, we implement a CNN accelerator on a VC707 FPGA board and compare it to previous approaches. Our implementation achieves a peak performance of 61.62 GFLOPS under 100MHz working frequency, which outperforms previous approaches significantly.
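
The roofline bound that the abstract relies on can be written down as a short sketch. In the standard roofline model, attainable performance is the smaller of the compute ceiling and the product of operational intensity and memory bandwidth. The function names, the tuple layout for candidate designs, and all numbers below are hypothetical and serve only to illustrate the selection rule "best performance, then lowest resource requirement".

```python
def attainable_gflops(compute_roof_gflops, ops_per_byte, bandwidth_gbs):
    """Roofline bound: min(compute ceiling, intensity * bandwidth).

    ops_per_byte is the number of floating-point operations performed per
    byte of off-chip memory traffic; GFLOP/byte * GB/s gives GFLOPS.
    """
    return min(compute_roof_gflops, ops_per_byte * bandwidth_gbs)


def pick_best(designs, bandwidth_gbs):
    """Pick the design with the highest attainable performance,
    breaking ties in favour of the lowest resource cost.

    designs: list of (name, compute_roof_gflops, ops_per_byte, resource_cost).
    """
    return max(
        designs,
        key=lambda d: (attainable_gflops(d[1], d[2], bandwidth_gbs), -d[3]),
    )


# Made-up candidates, e.g. different loop-tiling choices.
candidates = [
    ("A", 80.0, 5.0, 100),   # high compute roof but bandwidth-bound
    ("B", 60.0, 20.0, 80),   # compute-bound, cheaper
    ("C", 60.0, 30.0, 120),  # compute-bound, more expensive
]
best = pick_best(candidates, bandwidth_gbs=4.0)
```

With these numbers, design A is limited to 20 GFLOPS by memory bandwidth, while B and C both reach their 60 GFLOPS compute ceiling, so the cheaper design B is selected.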

Proceedings Article•DOI•

Conditional Random Fields as Recurrent Neural Networks

Shuai Zheng¹, Sadeep Jayasumana¹, Bernardino Romera-Paredes¹, Vibhav Vineet², Zhizhong Su, Dalong Du, Chang Huang³, Philip H. S. Torr¹•Institutions (3)

University of Oxford¹, Stanford University², Baidu³

11 Feb 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: A new form of convolutional neural network that combines the strengths of Convolutional Neural Networks (CNNs) and Conditional Random Fields (CRFs)-based probabilistic graphical modelling is introduced, and top results are obtained on the challenging Pascal VOC 2012 segmentation benchmark.

Abstract: Pixel-level labelling tasks, such as semantic segmentation, play a central role in image understanding. Recent approaches have attempted to harness the capabilities of deep learning techniques for image recognition to tackle pixel-level labelling tasks. One central issue in this methodology is the limited capacity of deep learning techniques to delineate visual objects. To solve this problem, we introduce a new form of convolutional neural network that combines the strengths of Convolutional Neural Networks (CNNs) and Conditional Random Fields (CRFs)-based probabilistic graphical modelling. To this end, we formulate mean-field approximate inference for the Conditional Random Fields with Gaussian pairwise potentials as Recurrent Neural Networks. This network, called CRF-RNN, is then plugged in as a part of a CNN to obtain a deep network that has desirable properties of both CNNs and CRFs. Importantly, our system fully integrates CRF modelling with CNNs, making it possible to train the whole deep network end-to-end with the usual back-propagation algorithm, avoiding offline post-processing methods for object delineation. We apply the proposed method to the problem of semantic image segmentation, obtaining top results on the challenging Pascal VOC 2012 segmentation benchmark.

Proceedings Article•

Convolutional networks on graphs for learning molecular fingerprints

David Duvenaud¹, Dougal Maclaurin¹, Jorge Aguilera-Iparraguirre¹, Rafael Gómez-Bombarelli¹, Timothy D. Hirzel¹, Alán Aspuru-Guzik¹, Ryan P. Adams¹•Institutions (1)

Harvard University¹

07 Dec 2015

TL;DR: In this paper, a convolutional neural network that operates directly on graphs is proposed, allowing end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape.

Abstract: We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.

Posted Content•

Learning Multi-Domain Convolutional Neural Networks for Visual Tracking

Hyeonseob Nam¹, Bohyung Han¹•Institutions (1)

Pohang University of Science and Technology¹

27 Oct 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work proposes a novel visual tracking algorithm based on the representations from a discriminatively trained Convolutional Neural Network (CNN), which is pretrained on a large set of videos with tracking ground-truths to obtain a generic target representation.

Abstract: We propose a novel visual tracking algorithm based on the representations from a discriminatively trained Convolutional Neural Network (CNN). Our algorithm pretrains a CNN using a large set of videos with tracking ground-truths to obtain a generic target representation. Our network is composed of shared layers and multiple branches of domain-specific layers, where domains correspond to individual training sequences and each branch is responsible for binary classification to identify the target in each domain. We train the network with respect to each domain iteratively to obtain generic target representations in the shared layers. When tracking a target in a new sequence, we construct a new network by combining the shared layers in the pretrained CNN with a new binary classification layer, which is updated online. Online tracking is performed by evaluating the candidate windows randomly sampled around the previous target state. The proposed algorithm illustrates outstanding performance compared with state-of-the-art methods in existing tracking benchmarks.

Proceedings Article•DOI•

Hierarchical Convolutional Features for Visual Tracking

Chao Ma¹, Jia-Bin Huang², Xiaokang Yang¹, Ming-Hsuan Yang³•Institutions (3)

Shanghai Jiao Tong University¹, University of Illinois at Urbana–Champaign², University of California, Merced³

07 Dec 2015

TL;DR: This paper adaptively learns correlation filters on each convolutional layer to encode the target appearance and hierarchically infers the maximum response of each layer to locate targets.

Abstract: Visual object tracking is challenging as target objects often undergo significant appearance changes caused by deformation, abrupt motion, background clutter and occlusion. In this paper, we exploit features extracted from deep convolutional neural networks trained on object recognition datasets to improve tracking accuracy and robustness. The outputs of the last convolutional layers encode the semantic information of targets and such representations are robust to significant appearance variations. However, their spatial resolution is too coarse to precisely localize targets. In contrast, earlier convolutional layers provide more precise localization but are less invariant to appearance changes. We interpret the hierarchies of convolutional layers as a nonlinear counterpart of an image pyramid representation and exploit these multiple levels of abstraction for visual tracking. Specifically, we adaptively learn correlation filters on each convolutional layer to encode the target appearance. We hierarchically infer the maximum response of each layer to locate targets. Extensive experimental results on a large-scale benchmark dataset show that the proposed algorithm performs favorably against state-of-the-art methods.

Posted Content•

Understanding Neural Networks Through Deep Visualization

Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas J. Fuchs, Hod Lipson

22 Jun 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work introduces several new regularization methods that combine to produce qualitatively clearer, more interpretable visualizations of convolutional neural networks.

Abstract: Recent years have produced great advances in training large, deep neural networks (DNNs), including notable successes in training convolutional neural networks (convnets) to recognize natural images. However, our understanding of how these models work, especially what computations they perform at intermediate layers, has lagged behind. Progress in the field will be further accelerated by the development of better tools for visualizing and interpreting neural nets. We introduce two such tools here. The first is a tool that visualizes the activations produced on each layer of a trained convnet as it processes an image or video (e.g. a live webcam stream). We have found that looking at live activations that change in response to user input helps build valuable intuitions about how convnets work. The second tool enables visualizing features at each layer of a DNN via regularized optimization in image space. Because previous versions of this idea produced less recognizable images, here we introduce several new regularization methods that combine to produce qualitatively clearer, more interpretable visualizations. Both tools are open source and work on a pre-trained convnet with minimal setup.

Proceedings Article•DOI•

PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization

Alex Kendall¹, Matthew Grimes, Roberto Cipolla¹•Institutions (1)

University of Cambridge¹

07 Dec 2015

TL;DR: PoseNet trains a CNN to regress the 6-DOF camera pose from a single RGB image in an end-to-end manner with no need of additional engineering or graph optimisation.

Abstract: We present a robust and real-time monocular six degree of freedom relocalization system. Our system trains a convolutional neural network to regress the 6-DOF camera pose from a single RGB image in an end-to-end manner with no need of additional engineering or graph optimisation. The algorithm can operate indoors and outdoors in real time, taking 5ms per frame to compute. It obtains approximately 2m and 3 degrees accuracy for large scale outdoor scenes and 0.5m and 5 degrees accuracy indoors. This is achieved using an efficient 23 layer deep convnet, demonstrating that convnets can be used to solve complicated out of image plane regression problems. This was made possible by leveraging transfer learning from large scale classification data. We show that the PoseNet localizes from high level features and is robust to difficult lighting, motion blur and different camera intrinsics where point based SIFT registration fails. Furthermore we show how the pose feature that is produced generalizes to other scenes allowing us to regress pose with only a few dozen training examples.

Proceedings Article•DOI•

Understanding deep image representations by inverting them

Aravindh Mahendran¹, Andrea Vedaldi¹•Institutions (1)

University of Oxford¹

07 Jun 2015

TL;DR: In this article, a general framework is proposed to invert image representations, that is, to reconstruct an image from its encoding; it inverts representations such as HOG more accurately than recent alternatives and is applicable to CNNs too.

Abstract: Image representations, from SIFT and Bag of Visual Words to Convolutional Neural Networks (CNNs), are a crucial component of almost any image understanding system. Nevertheless, our understanding of them remains limited. In this paper we conduct a direct analysis of the visual information contained in representations by asking the following question: given an encoding of an image, to which extent is it possible to reconstruct the image itself? To answer this question we contribute a general framework to invert representations. We show that this method can invert representations such as HOG more accurately than recent alternatives while being applicable to CNNs too. We then use this technique to study the inverse of recent state-of-the-art CNN image representations for the first time. Among our findings, we show that several layers in CNNs retain photographically accurate information about the image, with different degrees of geometric and photometric invariance.

Posted Content•

Multi-view Convolutional Neural Networks for 3D Shape Recognition

Hang Su¹, Subhransu Maji¹, Evangelos Kalogerakis¹, Erik Learned-Miller¹•Institutions (1)

University of Massachusetts Amherst¹

05 May 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work presents a standard CNN architecture trained to recognize the shapes' rendered views independently of each other, and shows that a 3D shape can be recognized even from a single view at an accuracy far higher than using state-of-the-art 3D shape descriptors.

Abstract: A longstanding question in computer vision concerns the representation of 3D shapes for recognition: should 3D shapes be represented with descriptors operating on their native 3D formats, such as voxel grid or polygon mesh, or can they be effectively represented with view-based descriptors? We address this question in the context of learning to recognize 3D shapes from a collection of their rendered views on 2D images. We first present a standard CNN architecture trained to recognize the shapes' rendered views independently of each other, and show that a 3D shape can be recognized even from a single view at an accuracy far higher than using state-of-the-art 3D shape descriptors. Recognition rates further increase when multiple views of the shapes are provided. In addition, we present a novel CNN architecture that combines information from multiple views of a 3D shape into a single and compact shape descriptor offering even better recognition performance. The same architecture can be applied to accurately recognize human hand-drawn sketches of shapes. We conclude that a collection of 2D views can be highly informative for 3D shape recognition and is amenable to emerging CNN architectures and their derivatives.

Proceedings Article•DOI•

Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks

Tara N. Sainath¹, Oriol Vinyals¹, Andrew W. Senior¹, Hasim Sak¹•Institutions (1)

Google¹

08 Sep 2015

TL;DR: This paper takes advantage of the complementarity of CNNs, LSTMs and DNNs by combining them into one unified architecture, and finds that the CLDNN provides a 4-6% relative improvement in WER over an LSTM, the strongest of the three individual models.

Abstract: Both Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) have shown improvements over Deep Neural Networks (DNNs) across a wide variety of speech recognition tasks. CNNs, LSTMs and DNNs are complementary in their modeling capabilities, as CNNs are good at reducing frequency variations, LSTMs are good at temporal modeling, and DNNs are appropriate for mapping features to a more separable space. In this paper, we take advantage of the complementarity of CNNs, LSTMs and DNNs by combining them into one unified architecture. We explore the proposed architecture, which we call CLDNN, on a variety of large vocabulary tasks, varying from 200 to 2,000 hours. We find that the CLDNN provides a 4–6% relative improvement in WER over an LSTM, the strongest of the three individual models.
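
Since the reported gain is a relative improvement in word error rate (WER), a short sketch of the arithmetic may help when reading the 4–6% figure. The function name and the WER values below are hypothetical and are not results from the paper.

```python
def relative_wer_improvement(baseline_wer, new_wer):
    """Relative WER reduction in percent, measured against the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Made-up numbers: a baseline at 20.0% WER and a new model at 19.0% WER.
gain = relative_wer_improvement(20.0, 19.0)
```

Here a drop of one absolute point from a 20.0% baseline is a 5% relative improvement, which shows why relative and absolute WER changes should not be confused.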
