
Showing papers on "Convolutional neural network published in 2012"


Proceedings Article
03 Dec 2012
TL;DR: A large, deep convolutional neural network, consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax, achieves state-of-the-art ImageNet classification performance, as discussed by the authors.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
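Dropout, the regularization method named in the abstract, randomly silences hidden units during training so they cannot co-adapt. Below is a minimal numpy sketch of the inverted variant, which rescales at training time (the paper instead halves the outputs at test time; the two forms are equivalent in expectation). Function name and shapes are illustrative:

```python
import numpy as np

def dropout(h, p_drop=0.5, training=True, rng=None):
    """Inverted dropout: zero each hidden unit with probability p_drop
    and rescale the survivors so the expected activation is unchanged."""
    if not training or p_drop == 0.0:
        return h                      # full network used at test time
    rng = rng or np.random.default_rng()
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)

h = np.ones((4, 8))                   # a batch of 4 samples, 8 hidden units
print(dropout(h, p_drop=0.5))         # roughly half the units are zeroed
```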

73,978 citations


Proceedings Article
03 Dec 2012
TL;DR: This work describes new algorithms that take into account the variable cost of learning algorithm experiments and that can leverage the presence of multiple cores for parallel experimentation and shows that these proposed algorithms improve on previous automatic procedures and can reach or surpass human expert-level optimization for many algorithms.
Abstract: The use of machine learning algorithms frequently involves careful tuning of learning parameters and model hyperparameters. Unfortunately, this tuning is often a "black art" requiring expert experience, rules of thumb, or sometimes brute-force search. There is therefore great appeal for automatic approaches that can optimize the performance of any given learning algorithm to the problem at hand. In this work, we consider this problem through the framework of Bayesian optimization, in which a learning algorithm's generalization performance is modeled as a sample from a Gaussian process (GP). We show that certain choices for the nature of the GP, such as the type of kernel and the treatment of its hyperparameters, can play a crucial role in obtaining a good optimizer that can achieve expert-level performance. We describe new algorithms that take into account the variable cost (duration) of learning algorithm experiments and that can leverage the presence of multiple cores for parallel experimentation. We show that these proposed algorithms improve on previous automatic procedures and can reach or surpass human expert-level optimization for many algorithms including latent Dirichlet allocation, structured SVMs and convolutional neural networks.
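As a rough illustration of the Bayesian optimization loop described above, the sketch below models a toy objective with a Gaussian process (a Matérn 5/2 kernel, one of the kernel choices the paper examines) and picks the next trial by expected improvement. The random candidate pool, the toy objective, and all constants are simplifications, not the paper's procedure:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(X_cand, gp, y_best, xi=0.01):
    """EI acquisition for minimization: expected amount by which each
    candidate improves on the best observed value."""
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (y_best - mu - xi) / sigma
    return (y_best - mu - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# Toy 1-D objective standing in for a validation-error surface.
f = lambda x: np.sin(3 * x[:, 0]) + 0.1 * x[:, 0] ** 2

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(3, 1))               # initial random trials
y = f(X)
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(10):                               # sequential BO loop
    gp.fit(X, y)
    cand = rng.uniform(-2, 2, size=(256, 1))      # random candidate pool
    x_next = cand[np.argmax(expected_improvement(cand, gp, y.min()))]
    X, y = np.vstack([X, x_next]), np.append(y, f(x_next[None, :]))
print("best hyperparameter found:", X[np.argmin(y)], "loss:", y.min())
```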

5,654 citations


Proceedings ArticleDOI
16 Jun 2012
TL;DR: In this paper, biologically plausible, wide and deep artificial neural network architectures are proposed that match human performance on tasks such as the recognition of handwritten digits or traffic signs.
Abstract: Traditional methods of computer vision and machine learning cannot match human performance on tasks such as the recognition of handwritten digits or traffic signs. Our biologically plausible, wide and deep artificial neural network architectures can. Small (often minimal) receptive fields of convolutional winner-take-all neurons yield large network depth, resulting in roughly as many sparsely connected neural layers as found in mammals between retina and visual cortex. Only winner neurons are trained. Several deep neural columns become experts on inputs preprocessed in different ways; their predictions are averaged. Graphics cards allow for fast training. On the very competitive MNIST handwriting benchmark, our method is the first to achieve near-human performance. On a traffic sign recognition benchmark it outperforms humans by a factor of two. We also improve the state-of-the-art on a plethora of common image classification benchmarks.
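The committee idea in this abstract, several deep "columns" trained on differently preprocessed inputs with averaged predictions, reduces to the sketch below. The stand-in columns are random linear-softmax maps and the "preprocessings" are toy rescalings; in the paper each column is a deep CNN and the preprocessing is, e.g., different normalizations of the input image:

```python
import numpy as np

def committee_predict(columns, preprocessors, image):
    """Average the softmax outputs of several independently trained
    'columns', each fed a differently preprocessed copy of the input."""
    probs = [net(prep(image)) for net, prep in zip(columns, preprocessors)]
    return np.mean(probs, axis=0)

rng = np.random.default_rng(1)
def make_column():
    W = rng.normal(size=(64, 10))
    def net(x):
        logits = x.ravel() @ W
        e = np.exp(logits - logits.max())
        return e / e.sum()
    return net

columns = [make_column() for _ in range(5)]
preps = [lambda im, s=s: s * (im - im.mean()) / (im.std() + 1e-8)
         for s in (0.8, 0.9, 1.0, 1.1, 1.2)]      # toy "preprocessings"
image = rng.normal(size=(8, 8))
print(committee_predict(columns, preps, image))   # averaged class posterior
```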

3,717 citations


Journal ArticleDOI
TL;DR: A publicly available traffic sign dataset with more than 50,000 images of German road signs in 43 classes is presented; convolutional neural networks showed particularly high classification accuracies in the accompanying competition, outperforming the human test persons.

1,138 citations


Posted Content
TL;DR: In this paper, a learning algorithm's generalization performance is modeled as a sample from a Gaussian process and the tractable posterior distribution induced by the GP leads to efficient use of the information gathered by previous experiments, enabling optimal choices about what parameters to try next.
Abstract: Machine learning algorithms frequently require careful tuning of model hyperparameters, regularization terms, and optimization parameters. Unfortunately, this tuning is often a "black art" that requires expert experience, unwritten rules of thumb, or sometimes brute-force search. Much more appealing is the idea of developing automatic approaches which can optimize the performance of a given learning algorithm to the task at hand. In this work, we consider the automatic tuning problem within the framework of Bayesian optimization, in which a learning algorithm's generalization performance is modeled as a sample from a Gaussian process (GP). The tractable posterior distribution induced by the GP leads to efficient use of the information gathered by previous experiments, enabling optimal choices about what parameters to try next. Here we show how the effects of the Gaussian process prior and the associated inference procedure can have a large impact on the success or failure of Bayesian optimization. We show that thoughtful choices can lead to results that exceed expert-level performance in tuning machine learning algorithms. We also describe new algorithms that take into account the variable cost (duration) of learning experiments and that can leverage the presence of multiple cores for parallel experimentation. We show that these proposed algorithms improve on previous automatic procedures and can reach or surpass human expert-level optimization on a diverse set of contemporary algorithms including latent Dirichlet allocation, structured SVMs and convolutional neural networks.

1,110 citations



Proceedings ArticleDOI
25 Mar 2012
TL;DR: CNNs are applied to speech recognition within the framework of a hybrid NN-HMM model, using local filtering and max-pooling in the frequency domain to normalize speaker variance and achieve higher multi-speaker speech recognition performance.
Abstract: Convolutional Neural Networks (CNN) have shown success in achieving translation invariance for many image processing tasks. The success is largely attributed to the use of local filtering and max-pooling in the CNN architecture. In this paper, we propose to apply CNN to speech recognition within the framework of a hybrid NN-HMM model. We propose to use local filtering and max-pooling in the frequency domain to normalize speaker variance and achieve higher multi-speaker speech recognition performance. In our method, a pair of local filtering and max-pooling layers is added at the lowest end of the neural network (NN) to normalize spectral variations of speech signals. In our experiments, the proposed CNN architecture is evaluated in a speaker-independent speech recognition task using the standard TIMIT data sets. Experimental results show that the proposed CNN method can achieve over 10% relative error reduction in the core TIMIT test sets compared with a regular NN using the same number of hidden layers and weights. Our results also show that the best result of the proposed CNN model is better than previously published results on the same TIMIT test sets that use a pre-trained deep NN model.
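A hedged sketch of the frequency-domain filtering-plus-pooling idea described above: filters span a few adjacent mel bands, and max-pooling along the frequency axis gives tolerance to small spectral shifts such as speaker variation. Filter sizes and shapes are illustrative, not the paper's:

```python
import numpy as np

def freq_conv_maxpool(spectrogram, filters, pool=3):
    """Local filtering along the frequency axis followed by max-pooling
    in frequency. spectrogram: (n_bands, n_frames); filters:
    (n_filters, filter_len). Pooling discards small shifts along
    frequency, e.g. formant shifts between speakers."""
    n_bands, n_frames = spectrogram.shape
    n_filters, flen = filters.shape
    out_bands = n_bands - flen + 1
    conv = np.empty((n_filters, out_bands, n_frames))
    for k in range(n_filters):
        for b in range(out_bands):
            # each unit sees only a small band of adjacent mel channels
            conv[k, b] = filters[k] @ spectrogram[b:b + flen]
    conv = np.maximum(conv, 0.0)                       # rectification
    trimmed = conv[:, :out_bands // pool * pool]
    return trimmed.reshape(n_filters, -1, pool, n_frames).max(axis=2)

rng = np.random.default_rng(0)
mel = rng.random((40, 100))                # 40 mel bands, 100 frames
filt = rng.normal(size=(8, 8))             # 8 filters spanning 8 bands each
print(freq_conv_maxpool(mel, filt).shape)  # (8, 11, 100)
```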

901 citations


Proceedings Article
01 Nov 2012
TL;DR: This paper combines the representational power of large, multilayer neural networks together with recent developments in unsupervised feature learning, which allows them to use a common framework to train highly-accurate text detector and character recognizer modules.
Abstract: Full end-to-end text recognition in natural images is a challenging problem that has received much attention recently. Traditional systems in this area have relied on elaborate models incorporating carefully hand-engineered features or large amounts of prior knowledge. In this paper, we take a different route and combine the representational power of large, multilayer neural networks together with recent developments in unsupervised feature learning, which allows us to use a common framework to train highly-accurate text detector and character recognizer modules. Then, using only simple off-the-shelf methods, we integrate these two modules into a full end-to-end, lexicon-driven, scene text recognition system that achieves state-of-the-art performance on standard benchmarks, namely Street View Text and ICDAR 2003.

900 citations


Journal ArticleDOI
TL;DR: A hybrid model is presented that integrates the synergy of two superior classifiers, Convolutional Neural Network (CNN) and Support Vector Machine (SVM), both of which have proven results in recognizing different types of patterns.

585 citations


Proceedings ArticleDOI
10 Jun 2012
TL;DR: Not only does the proposed Max-Pooling Convolutional Neural Network approach obtain much better results, it also works directly on raw pixel intensities of detected and segmented steel defects, avoiding further time-consuming and hard-to-optimize ad-hoc preprocessing.
Abstract: We present a Max-Pooling Convolutional Neural Network approach for supervised steel defect classification. On a classification task with 7 defects, collected from a real production line, an error rate of 7% is obtained. Compared to SVM classifiers trained on commonly used feature descriptors our best net performs at least two times better. Not only do we obtain much better results, but the proposed method also works directly on raw pixel intensities of detected and segmented steel defects, avoiding further time-consuming and hard-to-optimize ad-hoc preprocessing.
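Max-pooling, the core operation in the title, keeps only the strongest response in each local region of a feature map. A minimal sketch in plain numpy, with illustrative sizes:

```python
import numpy as np

def max_pool(fmap, k=2):
    """Non-overlapping k x k max-pooling: keep only the strongest
    response in each region of a feature map."""
    h, w = fmap.shape
    fmap = fmap[:h // k * k, :w // k * k]      # trim to multiples of k
    return fmap.reshape(h // k, k, w // k, k).max(axis=(1, 3))

patch = np.arange(36, dtype=float).reshape(6, 6)   # toy "defect patch"
print(max_pool(patch))                             # 3x3 map of local maxima
```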

262 citations


Book ChapterDOI
07 Oct 2012
TL;DR: It is concluded that convolutional neural networks are suitable for learning 3D scene layout from noisy labels, providing a relative improvement of 7% over the baseline, and that combining color planes yields a statistical description of road areas that exhibits maximal uniformity.
Abstract: Road scene segmentation is important in computer vision for different applications such as autonomous driving and pedestrian detection. Recovering the 3D structure of road scenes provides relevant contextual information to improve their understanding. In this paper, we use a convolutional neural network based algorithm to learn features from noisy labels to recover the 3D scene layout of a road image. The novelty of the algorithm lies in generating training labels by applying an algorithm trained on a general image dataset to classify on-board images. Further, we propose a novel texture descriptor based on a learned color plane fusion to obtain maximal uniformity in road areas. Finally, acquired (off-line) and current (on-line) information are combined to detect road areas in single images. From quantitative and qualitative experiments, conducted on publicly available datasets, it is concluded that convolutional neural networks are suitable for learning 3D scene layout from noisy labels and provide a relative improvement of 7% compared to the baseline. Furthermore, combining color planes provides a statistical description of road areas that exhibits maximal uniformity and provides a relative improvement of 8% compared to the baseline. Finally, the improvement is even larger when acquired and current information from a single image are combined.
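The "learned color plane fusion" can be approximated as finding the linear combination of color planes whose response is most uniform over road pixels. The sketch below takes the smallest-eigenvalue direction of the road-pixel covariance as that combination; the paper's actual descriptor is richer, so treat this as an assumption-laden illustration of the uniformity-maximizing idea:

```python
import numpy as np

def fuse_color_planes(planes, road_mask):
    """Find the linear combination of color planes with the least
    variance (most uniformity) over road pixels: the eigenvector of
    the road-pixel covariance with the smallest eigenvalue.
    planes: (n_planes, h, w); road_mask: boolean (h, w)."""
    X = planes[:, road_mask]                  # (n_planes, n_road_pixels)
    vals, vecs = np.linalg.eigh(np.cov(X))
    w = vecs[:, 0]                            # least-variance direction
    return np.tensordot(w, planes, axes=1), w

rng = np.random.default_rng(0)
planes = rng.random((3, 48, 64))              # stand-in R, G, B planes
mask = np.zeros((48, 64), bool)
mask[24:, :] = True                           # lower half = "road"
fused, w = fuse_color_planes(planes, mask)
print("weights:", w)
print("fused variance:", fused[mask].var(), "vs R:", planes[0][mask].var())
```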

Proceedings Article
18 Apr 2012
TL;DR: The traditional ConvNet architecture is augmented by learning multi-stage features and by using Lp pooling, establishing a new state-of-the-art of 95.10% accuracy on the SVHN dataset (48% error improvement).
Abstract: We classify digits of real-world house numbers using convolutional neural networks (ConvNets). ConvNets are hierarchical feature learning neural networks whose structure is biologically inspired. Unlike many popular vision approaches that are hand-designed, ConvNets can automatically learn a unique set of features optimized for a given task. We augmented the traditional ConvNet architecture by learning multi-stage features and by using Lp pooling, establishing a new state-of-the-art of 95.10% accuracy on the SVHN dataset (48% error improvement). Furthermore, we analyze the benefits of different pooling methods and multi-stage features in ConvNets. The source code and a tutorial are available at eblearn.sf.net.
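Lp pooling, mentioned in the abstract, interpolates between average pooling (p = 1) and max pooling (p → ∞). The paper weights each pooling region with a Gaussian window; this uniform-weight numpy sketch keeps only the core operation:

```python
import numpy as np

def lp_pool(fmap, k=2, p=4.0):
    """Lp pooling over non-overlapping k x k regions:
    (mean of |x|^p) ** (1/p). p = 1 recovers average pooling and
    p -> infinity approaches max pooling."""
    h, w = fmap.shape
    x = np.abs(fmap[:h // k * k, :w // k * k]) ** p
    return x.reshape(h // k, k, w // k, k).mean(axis=(1, 3)) ** (1.0 / p)

x = np.array([[1., 2.], [3., 4.]])
for p in (1.0, 4.0, 64.0):        # sweeps from the mean toward the max (4)
    print(p, lp_pool(x, k=2, p=p))
```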

Posted Content
TL;DR: In this paper, a new state-of-the-art of 94.85% accuracy on the SVHN dataset (45.2% error improvement) was achieved by using Lp pooling.
Abstract: We classify digits of real-world house numbers using convolutional neural networks (ConvNets). ConvNets are hierarchical feature learning neural networks whose structure is biologically inspired. Unlike many popular vision approaches that are hand-designed, ConvNets can automatically learn a unique set of features optimized for a given task. We augmented the traditional ConvNet architecture by learning multi-stage features and by using Lp pooling, establishing a new state-of-the-art of 94.85% accuracy on the SVHN dataset (45.2% error improvement). Furthermore, we analyze the benefits of different pooling methods and multi-stage features in ConvNets. The source code and a tutorial are available at eblearn.sf.net.

Book ChapterDOI
11 Sep 2012
TL;DR: This paper proposes different strategies for simplifying the filters, used as feature extractors, learnt in convolutional neural networks (ConvNets), in order to modify the hypothesis space and to speed up learning and processing times.
Abstract: In this paper, we propose different strategies for simplifying filters, used as feature extractors, to be learnt in convolutional neural networks (ConvNets) in order to modify the hypothesis space and to speed up learning and processing times. We study two kinds of filters that are known to be computationally efficient in feed-forward processing: fused convolution/sub-sampling filters, and separable filters. We compare the complexity of the back-propagation algorithm on ConvNets based on these different kinds of filters. We show that using these filters allows us to reach the same level of recognition performance as with classical ConvNets for handwritten digit recognition, up to 3.3 times faster.
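To see why separable filters cut cost: a 2D convolution with a rank-1 kernel factors into two 1D passes, so k*k multiplies per output pixel become 2*k. A short check with hypothetical kernels:

```python
import numpy as np
from scipy.signal import convolve2d

# A separable 2D kernel is the outer product of two 1D kernels, so the
# 2D convolution factorizes into a row pass followed by a column pass.
rng = np.random.default_rng(0)
image = rng.random((32, 32))
row = np.array([1.0, 2.0, 1.0])           # hypothetical 1D kernels
col = np.array([1.0, 0.0, -1.0])
kernel2d = np.outer(col, row)             # the equivalent full 3x3 filter

full = convolve2d(image, kernel2d, mode='valid')
separable = convolve2d(convolve2d(image, row[None, :], mode='valid'),
                       col[:, None], mode='valid')
print(np.allclose(full, separable))       # True: identical output
# Cost per output pixel: k*k multiplies for the full kernel, 2*k for
# the separable form -- the source of the reported speed-ups.
```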

Proceedings ArticleDOI
12 Dec 2012
TL;DR: Abandoning typical conventions, this work adopts a different perspective of the problem, in which several seconds of pitch spectra are classified directly by a convolutional neural network, achieving state-of-the-art chord recognition performance in this initial effort.
Abstract: Despite early success in automatic chord recognition, recent efforts are yielding diminishing returns while basically iterating over the same fundamental approach. Here, we abandon typical conventions and adopt a different perspective of the problem, where several seconds of pitch spectra are classified directly by a convolutional neural network. Using labeled data to train the system in a supervised manner, we achieve state-of-the-art performance through this initial effort in an otherwise unexplored area. Subsequent error analysis provides insight into potential areas of improvement, and this approach to chord recognition shows promise for future harmonic analysis systems.

Journal ArticleDOI
TL;DR: An Event-Driven Convolution Module for computing 2D convolutions on such event streams is presented; it has multi-kernel capability, meaning it selects the convolution kernel depending on the origin of the event.
Abstract: Event-Driven vision sensing is a new way of sensing visual reality in a frame-free manner. That is, the vision sensor (camera) is not capturing a sequence of still frames, as in conventional video and computer vision systems. In Event-Driven sensors each pixel autonomously and asynchronously decides when to send its address out. This way, the sensor output is a continuous stream of address events representing reality dynamically and continuously, without being constrained to frames. In this paper we present an Event-Driven Convolution Module for computing 2D convolutions on such event streams. The Convolution Module has been designed so that many of them can be assembled to build modular and hierarchical Convolutional Neural Networks for robust shape and pose invariant object recognition. The Convolution Module has multi-kernel capability. That is, it will select the convolution kernel depending on the origin of the event. A proof-of-concept test prototype has been fabricated in a 0.35 μm CMOS process and extensive experimental results are provided. The Convolution Processor has also been combined with an Event-Driven Dynamic Vision Sensor (DVS) for high-speed recognition examples. The chip can discriminate propellers rotating at 2,000 revolutions per second, detect symbols on a 52-card deck when browsing all cards in 410 ms, or detect and follow the center of a phosphor oscilloscope trace rotating at 5 kHz.
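A software caricature of the event-driven convolution the module implements in hardware: each incoming address event stamps the kernel onto an array of integrators centered on its address, and any integrator that crosses threshold emits an output event and resets. Threshold, sizes, and the reset rule here are assumptions for illustration, not the chip's parameters:

```python
import numpy as np

def process_events(events, kernel, threshold=10.0, shape=(64, 64)):
    """Event-driven 2D convolution: each address event adds the kernel
    to the integrator array around its (x, y) address; integrators
    crossing threshold emit an output event and reset."""
    state = np.zeros(shape)
    kh, kw = kernel.shape
    out = []
    for t, x, y in events:                       # events arrive one by one
        x0, y0 = x - kh // 2, y - kw // 2
        xs = slice(max(x0, 0), min(x0 + kh, shape[0]))
        ys = slice(max(y0, 0), min(y0 + kw, shape[1]))
        state[xs, ys] += kernel[xs.start - x0:xs.stop - x0,
                                ys.start - y0:ys.stop - y0]
        for fx, fy in np.argwhere(state > threshold):
            out.append((t, fx, fy))              # forward an output event
            state[fx, fy] = 0.0                  # reset the fired pixel
    return out

rng = np.random.default_rng(0)
events = [(t, rng.integers(0, 64), rng.integers(0, 64)) for t in range(500)]
kernel = np.ones((5, 5))                         # hypothetical blob kernel
print(len(process_events(events, kernel)), "output events")
```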

Proceedings ArticleDOI
27 Mar 2012
TL;DR: This paper applies Convolutional Neural Networks to offline handwritten English character recognition using a modified LeNet-5 CNN model, with special settings for the number of neurons in each layer and the connection scheme between certain layers.
Abstract: This paper applies Convolutional Neural Networks (CNNs) to offline handwritten English character recognition. We use a modified LeNet-5 CNN model, with special settings for the number of neurons in each layer and the connection scheme between certain layers. Outputs of the CNN are encoded with error-correcting codes, giving the CNN the ability to reject recognition results. For training of the CNN, an error-samples-based reinforcement learning strategy is developed. Experiments are evaluated on UNIPEN lowercase and uppercase datasets, with recognition rates of 93.7% for uppercase and 90.2% for lowercase, respectively.
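The error-correcting output scheme can be sketched as follows: each class gets a binary codeword, the thresholded network outputs are matched to the nearest codeword by Hamming distance, and an output too far from every codeword is rejected rather than forced into a class. The codebook below is hypothetical:

```python
import numpy as np

# Hypothetical 6-bit codewords for three classes; real codebooks are
# chosen so codewords are far apart in Hamming distance.
codebook = np.array([[0, 0, 1, 1, 0, 1],
                     [1, 0, 0, 1, 1, 0],
                     [0, 1, 0, 0, 1, 1]])

def decode(outputs, max_hamming=1):
    """Match thresholded network outputs to the nearest codeword;
    reject (return None) if every codeword is too far away."""
    bits = (outputs > 0.5).astype(int)
    dists = np.abs(codebook - bits).sum(axis=1)      # Hamming distances
    best = int(np.argmin(dists))
    return best if dists[best] <= max_hamming else None

print(decode(np.array([0.1, 0.2, 0.9, 0.8, 0.1, 0.7])))  # -> 0
print(decode(np.array([0.9, 0.9, 0.9, 0.9, 0.9, 0.9])))  # -> None (reject)
```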

Journal ArticleDOI
TL;DR: A comparison study of Frame-Based and Frame-Free Spiking ConvNet convolution processors, two neuro-inspired solutions for real-time visual processing.
Abstract: Most scene segmentation and categorization architectures for the extraction of features in images and patches make exhaustive use of 2D convolution operations for template matching, template search and denoising. Convolutional Neural Networks (ConvNets) are one example of such architectures that can implement general-purpose bio-inspired vision systems. In standard digital computers 2D convolutions are usually expensive in terms of resource consumption and impose severe limitations for efficient real-time applications. Nevertheless, neuro-cortex inspired solutions, like dedicated Frame-Based or Frame-Free Spiking ConvNet Convolution Processors, are advancing real-time visual processing. These two approaches share the neural inspiration, but each of them solves the problem in different ways. Frame-Based ConvNets process video information frame by frame in a very robust and fast way that requires using and sharing the available hardware resources (such as multipliers and adders). Hardware resources are fixed and time-multiplexed by fetching data in and out. Thus memory bandwidth and size are important for good performance. On the other hand, spike-based convolution processors are a frame-free alternative that is able to perform convolution of a spike-based source of visual information with very low latency, which makes them ideal for very high speed applications. However, hardware resources need to be available all the time and cannot be time-multiplexed. Thus, hardware should be modular, reconfigurable and expandable. Hardware implementations in both VLSI custom integrated circuits (digital and analog) and FPGAs have already been used to demonstrate the performance of these systems. In this paper we present a comparison study of these two neuro-inspired solutions. A brief description of both systems is presented, along with a discussion of their differences, pros and cons.

Book ChapterDOI
07 Oct 2012
TL;DR: An algorithm based on convolutional neural networks is proposed to learn local features from training data at different scales and resolutions, combined using a weighted linear combination; its performance is similar to state-of-the-art methods that use other sources of information such as depth, motion or stereo.
Abstract: Semantic segmentation refers to the process of assigning an object label (e.g., building, road, sidewalk, car, pedestrian) to every pixel in an image. Common approaches formulate the task as a random field labeling problem modeling the interactions between labels by combining local and contextual features such as color, depth, edges, SIFT or HoG. These models are trained to maximize the likelihood of the correct classification given a training set. However, these approaches rely on hand-designed features (e.g., texture, SIFT or HoG) and require higher computational time in the inference process. Therefore, in this paper, we focus on estimating the unary potentials of a conditional random field via ensembles of learned features. We propose an algorithm based on convolutional neural networks to learn local features from training data at different scales and resolutions. Then, diversification between these features is exploited using a weighted linear combination. Experiments on a publicly available database show the effectiveness of the proposed method to perform semantic road scene segmentation in still images. The algorithm outperforms appearance based methods and its performance is comparable to state-of-the-art methods using other sources of information such as depth, motion or stereo.

Proceedings ArticleDOI
12 Dec 2012
TL;DR: Experimental results indicate that the CNSVM can be successfully applied to visual learning and recognition of hand gestures as well as to measure learning progress.
Abstract: We introduce Convolutional Neural Support Vector Machines (CNSVMs), a combination of two heterogeneous supervised classification techniques, Convolutional Neural Networks (CNNs) and Support Vector Machines (SVMs). CNSVMs are trained using a Stochastic Gradient Descent approach, that provides the computational capability of online incremental learning and is robust for typical learning scenarios in which training samples arrive in mini-batches. This is the case for visual learning and recognition in multi-robot systems, where each robot acquires a different image of the same sample. The experimental results indicate that the CNSVM can be successfully applied to visual learning and recognition of hand gestures as well as to measure learning progress.
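The SVM head of a CNSVM can be trained with plain stochastic subgradient descent on the hinge loss, one mini-batch at a time, which is what makes online incremental learning natural. A sketch with stand-in CNN features; all names and constants are illustrative, not the paper's implementation:

```python
import numpy as np

def svm_sgd_step(W, b, feats, labels, lr=0.01, C=1.0):
    """One mini-batch subgradient step on the L2-regularized hinge loss,
    treating `feats` as the CNN's top-layer features; labels in {-1,+1}."""
    margins = labels * (feats @ W + b)
    active = margins < 1.0                 # samples violating the margin
    if active.any():
        grad_W = W - C * (labels[active][:, None] * feats[active]).mean(axis=0)
        grad_b = -C * labels[active].mean()
    else:
        grad_W, grad_b = W, 0.0
    return W - lr * grad_W, b - lr * grad_b

rng = np.random.default_rng(0)
W, b = np.zeros(16), 0.0
for _ in range(200):                       # mini-batches arriving online
    feats = rng.normal(size=(8, 16))       # stand-in CNN features
    labels = np.where(feats[:, 0] + 0.1 * rng.normal(size=8) > 0, 1.0, -1.0)
    W, b = svm_sgd_step(W, b, feats, labels)
print("weight on the informative feature:", W[0])
```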

Book ChapterDOI
05 Nov 2012
TL;DR: The proposed approach gives better classification rates than classical state-of-the-art methods, allowing safer Computer-Aided Diagnosis of pleural cancer.
Abstract: We present a Multiscale Convolutional Neural Network (MCNN) approach for vision-based classification of cells. Based on several deep Convolutional Neural Networks (CNN) acting at different resolutions, the proposed architecture avoids the classical handcrafted feature extraction step, by performing feature extraction and classification as a whole. The proposed approach gives better classification rates than classical state-of-the-art methods, allowing safer Computer-Aided Diagnosis of pleural cancer.

Proceedings Article
25 Apr 2012
TL;DR: A convolutional network architecture that includes innovative elements, such as multiple output maps, suitable loss functions, supervised pretraining, multiscale inputs, reused outputs, and pairwise class location filters is proposed.
Abstract: After successes at image classification, segmentation is the next step towards image understanding for neural networks. We propose a convolutional network architecture that includes innovative elements, such as multiple output maps, suitable loss functions, supervised pretraining, multiscale inputs, reused outputs, and pairwise class location filters. Experiments on three data sets show that our method performs on par with current computer vision methods with regard to accuracy and exceeds them in speed.

Journal Article
TL;DR: In this paper, an offline signature verification scheme based on Convolutional Neural Network (CNN) is proposed and the simulation results reveal the efficiency of the suggested algorithm.
Abstract: The style of people’s handwritten signature is a biometric feature used in person authentication. In this paper, an offline signature verification scheme based on a Convolutional Neural Network (CNN) is proposed. The CNN focuses on the problem of feature extraction without prior knowledge of the data. The classification task is performed by a Multilayer Perceptron (MLP) network. This method is not only capable of extracting features relevant to a given signature, but is also robust with regard to signature location changes and scale variations when compared to classical methods. The proposed method is evaluated on a dataset of Persian signatures originally gathered from 22 people. The simulation results reveal the efficiency of the suggested algorithm.

Proceedings ArticleDOI
Brian Cheung
12 Dec 2012
TL;DR: A convolutional neural network was trained to distinguish images of human faces from computer-generated avatars as part of the ICMLA 2012 Face Recognition Challenge, achieving a classification accuracy of 99% on the Avatar CAPTCHA dataset.
Abstract: Convolutional neural network models have covered a broad scope of computer vision applications, achieving competitive performance with minimal domain knowledge. In this work, we apply such a model to a task designed to deter automated systems. We trained a convolutional neural network to distinguish images of human faces from computer-generated avatars as part of the ICMLA 2012 Face Recognition Challenge. The network achieved a classification accuracy of 99% on the Avatar CAPTCHA dataset. Furthermore, we demonstrated the potential of utilizing support vector machines on the same problem and achieved equally competitive performance.

Proceedings ArticleDOI
27 Mar 2012
TL;DR: A novel method to recognize scene text that avoids the conventional character segmentation step, applying a robust recognition model based on a neural classification approach to every window in order to recognize valid characters and identify invalid ones.
Abstract: Understanding text captured in real-world scenes is a challenging problem in the field of visual pattern recognition and continues to generate significant interest in the OCR (Optical Character Recognition) community. This paper proposes a novel method to recognize scene text that avoids the conventional character segmentation step. The idea is to scan the text image with multi-scale windows and apply a robust recognition model, relying on a neural classification approach, to every window in order to recognize valid characters and identify invalid ones. Recognition results are represented as a graph model in order to determine the best sequence of characters. Some linguistic knowledge is also incorporated to remove errors due to recognition confusions. The designed method is evaluated on the ICDAR 2003 database of scene text images and outperforms state-of-the-art approaches.

Proceedings ArticleDOI
09 Sep 2012
TL;DR: This paper focuses on appropriate feature extraction and proper classification by integrating the features using a convolutional neural network, calculating gray level co-occurrence matrix (GLCM) descriptors with different offsets from a short-term mel spectrogram.
Abstract: A map-based approach, which treats 2-dimensional acoustic features like image processing, has recently attracted attention in music genre classification. While this is successful at extracting local music patterns compared with other timbral-feature-based methods, the extracted features are not sufficient for music genre classification. In this paper, we focus on appropriate feature extraction and proper classification by integrating the features. For the musical feature extraction, we calculate gray level co-occurrence matrix (GLCM) descriptors with different offsets from a short-term mel spectrogram. These feature maps are then integratively classified using a convolutional neural network (CNN). In our experiments, we obtained a large improvement of more than 10 percentage points in classification accuracy, compared with conventional map-based methods. Index Terms: music genre classification, music information retrieval, music feature extraction, convolutional neural network
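A GLCM counts how often pairs of quantized gray levels co-occur at a fixed offset; computed over a mel spectrogram with several offsets, it yields the texture-like feature maps the abstract describes. A small numpy sketch, with quantization levels and offsets chosen for illustration:

```python
import numpy as np

def glcm(img, dx, dy, levels=8):
    """Gray level co-occurrence matrix: how often gray level i occurs
    at offset (dx, dy) from gray level j, after quantizing the input
    to `levels` gray levels."""
    q = np.floor(img / img.max() * (levels - 1e-9)).astype(int)
    h, w = q.shape
    a = q[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    b = q[max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)]
    m = np.zeros((levels, levels))
    np.add.at(m, (a.ravel(), b.ravel()), 1)   # accumulate co-occurrences
    return m / m.sum()

rng = np.random.default_rng(0)
mel_spec = rng.random((40, 128))              # stand-in mel spectrogram
for offset in [(1, 0), (0, 1), (2, 2)]:       # one feature map per offset
    print(offset, glcm(mel_spec, *offset).shape)
```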

Proceedings ArticleDOI
Aiquan Yuan, Gang Bai, Po Yang, Yanni Guo, Xinting Zhao
18 Sep 2012
TL;DR: A novel segmentation-based and lexicon-driven handwritten English recognition system that applies a modified rule-based online segmentation method and uses convolutional neural networks for offline character recognition.
Abstract: This paper presents a novel segmentation-based and lexicon-driven handwritten English recognition system. For the segmentation, a modified rule-based online segmentation method is applied. Then, convolutional neural networks are introduced for offline character recognition. Experiments are evaluated on UNIPEN lowercase data sets, with a word recognition rate of 92.20%.

Journal ArticleDOI
TL;DR: This paper introduces a new sensitivity-based approach capable of picking the right image features from a pre-trained SOM-like feature detector for handwritten digit recognition, and shows that pruned network architectures yield a transparent representation of the features actually present in the data while improving network robustness.

Proceedings ArticleDOI
24 May 2012
TL;DR: A new algorithm that applies first the GSA and then the BP, in order to ensure performance improvements by avoiding traps in local minima, is presented for a six-layer CNN dedicated to OCR applications.
Abstract: This paper presents aspects concerning embedding Gravitational Search Algorithms (GSAs) in Convolutional Neural Networks (CNNs) for Optical Character Recognition (OCR) systems. The GSAs are used in combination with the Back Propagation (BP) algorithm as optimization algorithms in the training process of a specific CNN architecture for OCR applications. The new algorithm consists of applying first the GSA and then the BP in order to ensure performance improvements by avoiding traps in local minima. A performance analysis for a given benchmark application shows the advantages of our algorithm over the classical BP algorithm for a six-layer CNN dedicated to OCR applications.
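The paper runs a GSA phase before refining with back-propagation. The sketch below covers a simplified single GSA iteration (it omits refinements such as the elite Kbest subset), applied to a toy quadratic loss standing in for the CNN training error; all constants are assumptions:

```python
import numpy as np

def gsa_step(pos, vel, fitness, t, G0=1.0, decay=0.1, rng=None):
    """One GSA iteration (minimization): fitter agents get larger mass
    and gravitationally attract the others; gravity decays over time.
    In the paper's setting, an agent's position encodes CNN weights."""
    rng = rng or np.random.default_rng()
    worst, best = fitness.max(), fitness.min()
    m = (worst - fitness + 1e-12) / (worst - best + 1e-12)
    M = m / m.sum()                              # normalized masses
    G = G0 * np.exp(-decay * t)                  # fading gravity
    acc = np.zeros_like(pos)
    for i in range(len(pos)):
        diff = pos - pos[i]
        dist = np.linalg.norm(diff, axis=1) + 1e-12
        acc[i] = (G * M[:, None] * rng.random((len(pos), 1)) *
                  diff / dist[:, None]).sum(axis=0)
    vel = rng.random(pos.shape) * vel + acc      # stochastic inertia
    return pos + vel, vel

rng = np.random.default_rng(0)
pos, vel = rng.normal(size=(20, 4)), np.zeros((20, 4))
for t in range(50):                              # toy quadratic "loss"
    pos, vel = gsa_step(pos, vel, (pos ** 2).sum(axis=1), t, rng=rng)
print("best loss after GSA phase:", (pos ** 2).sum(axis=1).min())
```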

01 Jan 2012
TL;DR: This work proposes a method that exploits pose information in order to improve object classification, investigating both Multi-layer Perceptron and Convolutional Neural Network architectures and achieving state-of-the-art results on the challenging NORB dataset.
Abstract: We propose a method that exploits pose information in order to improve object classification. A lot of research has focused on other strategies, such as engineering feature extractors, trying different classifiers and even using transfer learning. Here, we use neural network architectures in a multi-task setup, whose outputs predict both the class and the camera azimuth. We investigate both Multi-layer Perceptron and Convolutional Neural Network architectures, and achieve state-of-the-art results on the challenging NORB dataset.
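The multi-task setup amounts to one shared feature vector feeding two heads, with a loss that adds a classification term and an azimuth term. A minimal sketch; the weighting alpha and all shapes are assumptions, not the paper's configuration:

```python
import numpy as np

def multitask_loss(feats, W_cls, w_az, y_class, y_azimuth, alpha=0.5):
    """Joint objective over one shared feature vector: cross-entropy on
    the object class plus squared error on the camera azimuth."""
    logits = feats @ W_cls
    e = np.exp(logits - logits.max())
    probs = e / e.sum()
    ce = -np.log(probs[y_class] + 1e-12)     # classification term
    se = (feats @ w_az - y_azimuth) ** 2     # pose-regression term
    return ce + alpha * se

rng = np.random.default_rng(0)
feats = rng.normal(size=32)                  # stand-in shared features
W_cls, w_az = rng.normal(size=(32, 5)), rng.normal(size=32)
print(multitask_loss(feats, W_cls, w_az, y_class=2, y_azimuth=0.7))
```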