scispace - formally typeset
Search or ask a question

Showing papers on "Artificial neural network published in 2013"


Journal ArticleDOI
TL;DR: Recent work in the area of unsupervised feature learning and deep learning is reviewed, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks.
Abstract: The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks. This motivates longer term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation, and manifold learning.

11,201 citations


Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper developed a novel 3D CNN model for action recognition, which extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames.
Abstract: We consider the automated recognition of human actions in surveillance videos. Most current methods build classifiers based on complex handcrafted features computed from the raw inputs. Convolutional neural networks (CNNs) are a type of deep model that can act directly on the raw inputs. However, such models are currently limited to handling 2D inputs. In this paper, we develop a novel 3D CNN model for action recognition. This model extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames. The developed model generates multiple channels of information from the input frames, and the final feature representation combines information from all channels. To further boost the performance, we propose regularizing the outputs with high-level features and combining the predictions of a variety of different models. We apply the developed models to recognize human actions in the real-world environment of airport surveillance videos, and they achieve superior performance in comparison to baseline methods.

4,545 citations


Posted Content
TL;DR: With enhanced local modeling via the micro network, the proposed deep network structure NIN is able to utilize global average pooling over feature maps in the classification layer, which is easier to interpret and less prone to overfitting than traditional fully connected layers.
Abstract: We propose a novel deep network structure called "Network In Network" (NIN) to enhance model discriminability for local patches within the receptive field. The conventional convolutional layer uses linear filters followed by a nonlinear activation function to scan the input. Instead, we build micro neural networks with more complex structures to abstract the data within the receptive field. We instantiate the micro neural network with a multilayer perceptron, which is a potent function approximator. The feature maps are obtained by sliding the micro networks over the input in a similar manner as CNN; they are then fed into the next layer. Deep NIN can be implemented by stacking mutiple of the above described structure. With enhanced local modeling via the micro network, we are able to utilize global average pooling over feature maps in the classification layer, which is easier to interpret and less prone to overfitting than traditional fully connected layers. We demonstrated the state-of-the-art classification performances with NIN on CIFAR-10 and CIFAR-100, and reasonable performances on SVHN and MNIST datasets.

3,905 citations


Posted Content
TL;DR: This work considers a small-scale version of {\em conditional computation}, where sparse stochastic units form a distributed representation of gaters that can turn off in combinatorially many ways large chunks of the computation performed in the rest of the neural network.
Abstract: Stochastic neurons and hard non-linearities can be useful for a number of reasons in deep learning models, but in many cases they pose a challenging problem: how to estimate the gradient of a loss function with respect to the input of such stochastic or non-smooth neurons? I.e., can we "back-propagate" through these stochastic neurons? We examine this question, existing approaches, and compare four families of solutions, applicable in different settings. One of them is the minimum variance unbiased gradient estimator for stochatic binary neurons (a special case of the REINFORCE algorithm). A second approach, introduced here, decomposes the operation of a binary stochastic neuron into a stochastic binary part and a smooth differentiable part, which approximates the expected effect of the pure stochatic binary neuron to first order. A third approach involves the injection of additive or multiplicative noise in a computational graph that is otherwise differentiable. A fourth approach heuristically copies the gradient with respect to the stochastic output directly as an estimator of the gradient with respect to the sigmoid argument (we call this the straight-through estimator). To explore a context where these estimators are useful, we consider a small-scale version of {\em conditional computation}, where sparse stochastic units form a distributed representation of gaters that can turn off in combinatorially many ways large chunks of the computation performed in the rest of the neural network. In this case, it is important that the gating units produce an actual 0 most of the time. The resulting sparsity can be potentially be exploited to greatly reduce the computational cost of large deep networks for which conditional computation would be useful.

2,178 citations


Proceedings ArticleDOI
26 May 2013
TL;DR: Modelling deep neural networks with rectified linear unit (ReLU) non-linearities with minimal human hyper-parameter tuning on a 50-hour English Broadcast News task shows an 4.2% relative improvement over a DNN trained with sigmoid units, and a 14.4% relative improved over a strong GMM/HMM system.
Abstract: Recently, pre-trained deep neural networks (DNNs) have outperformed traditional acoustic models based on Gaussian mixture models (GMMs) on a variety of large vocabulary speech recognition benchmarks. Deep neural nets have also achieved excellent results on various computer vision tasks using a random “dropout” procedure that drastically improves generalization error by randomly omitting a fraction of the hidden units in all layers. Since dropout helps avoid over-fitting, it has also been successful on a small-scale phone recognition task using larger neural nets. However, training deep neural net acoustic models for large vocabulary speech recognition takes a very long time and dropout is likely to only increase training time. Neural networks with rectified linear unit (ReLU) non-linearities have been highly successful for computer vision tasks and proved faster to train than standard sigmoid units, sometimes also improving discriminative performance. In this work, we show on a 50-hour English Broadcast News task that modified deep neural networks using ReLUs trained with dropout during frame level training provide an 4.2% relative improvement over a DNN trained with sigmoid units, and a 14.4% relative improvement over a strong GMM/HMM system. We were able to obtain our results with minimal human hyper-parameter tuning using publicly available Bayesian optimization code.

1,342 citations


Posted Content
TL;DR: This article showed that deep neural networks learn input-output mappings that are fairly discontinuous to a significant extend, which suggests that it is the space, rather than individual units, that contains of the semantic information in the high layers of neural networks.
Abstract: Deep neural networks are highly expressive models that have recently achieved state of the art performance on speech and visual recognition tasks. While their expressiveness is the reason they succeed, it also causes them to learn uninterpretable solutions that could have counter-intuitive properties. In this paper we report two such properties. First, we find that there is no distinction between individual high level units and random linear combinations of high level units, according to various methods of unit analysis. It suggests that it is the space, rather than the individual units, that contains of the semantic information in the high layers of neural networks. Second, we find that deep neural networks learn input-output mappings that are fairly discontinuous to a significant extend. We can cause the network to misclassify an image by applying a certain imperceptible perturbation, which is found by maximizing the network's prediction error. In addition, the specific nature of these perturbations is not a random artifact of learning: the same perturbation can cause a different network, that was trained on a different subset of the dataset, to misclassify the same input.

1,313 citations


Proceedings ArticleDOI
26 May 2013
TL;DR: An overview of the invited and contributed papers presented at the special session at ICASSP-2013, entitled “New Types of Deep Neural Network Learning for Speech Recognition and Related Applications,” as organized by the authors is provided.
Abstract: In this paper, we provide an overview of the invited and contributed papers presented at the special session at ICASSP-2013, entitled “New Types of Deep Neural Network Learning for Speech Recognition and Related Applications,” as organized by the authors. We also describe the historical context in which acoustic models based on deep neural networks have been developed. The technical overview of the papers presented in our special session is organized into five ways of improving deep learning methods: (1) better optimization; (2) better types of neural activation function and better network architectures; (3) better ways to determine the myriad hyper-parameters of deep neural networks; (4) more appropriate ways to preprocess speech for deep neural networks; and (5) ways of leveraging multiple languages or dialects that are more easily achieved with deep neural networks than with Gaussian mixture models.

1,098 citations


Proceedings Article
05 Dec 2013
TL;DR: Comparison with the state-of-the-art trackers on some challenging benchmark video sequences shows that the deep learning tracker is more accurate while maintaining low computational cost with real-time performance when the MATLAB implementation of the tracker is used with a modest graphics processing unit (GPU).
Abstract: In this paper, we study the challenging problem of tracking the trajectory of a moving object in a video with possibly very complex background. In contrast to most existing trackers which only learn the appearance of the tracked object online, we take a different approach, inspired by recent advances in deep learning architectures, by putting more emphasis on the (unsupervised) feature learning problem. Specifically, by using auxiliary natural images, we train a stacked de-noising autoencoder offline to learn generic image features that are more robust against variations. This is then followed by knowledge transfer from offline training to the online tracking process. Online tracking involves a classification neural network which is constructed from the encoder part of the trained autoencoder as a feature extractor and an additional classification layer. Both the feature extractor and the classifier can be further tuned to adapt to appearance changes of the moving object. Comparison with the state-of-the-art trackers on some challenging benchmark video sequences shows that our deep learning tracker is more accurate while maintaining low computational cost with real-time performance when our MATLAB implementation of the tracker is used with a modest graphics processing unit (GPU).

926 citations


Proceedings ArticleDOI
26 May 2013
TL;DR: This paper examines an alternative scheme that is based on a deep neural network (DNN), the relationship between input texts and their acoustic realizations is modeled by a DNN, and experimental results show that the DNN- based systems outperformed the HMM-based systems with similar numbers of parameters.
Abstract: Conventional approaches to statistical parametric speech synthesis typically use decision tree-clustered context-dependent hidden Markov models (HMMs) to represent probability densities of speech parameters given texts. Speech parameters are generated from the probability densities to maximize their output probabilities, then a speech waveform is reconstructed from the generated parameters. This approach is reasonably effective but has a couple of limitations, e.g. decision trees are inefficient to model complex context dependencies. This paper examines an alternative scheme that is based on a deep neural network (DNN). The relationship between input texts and their acoustic realizations is modeled by a DNN. The use of the DNN can address some limitations of the conventional approach. Experimental results show that the DNN-based systems outperformed the HMM-based systems with similar numbers of parameters.

880 citations


Proceedings Article
16 Jun 2013
TL;DR: This work presents a guided policy search algorithm that uses trajectory optimization to direct policy learning and avoid poor local optima, and shows how differential dynamic programming can be used to generate suitable guiding samples, and describes a regularized importance sampled policy optimization that incorporates these samples into the policy search.
Abstract: Direct policy search can effectively scale to high-dimensional systems, but complex policies with hundreds of parameters often present a challenge for such methods, requiring numerous samples and often falling into poor local optima. We present a guided policy search algorithm that uses trajectory optimization to direct policy learning and avoid poor local optima. We show how differential dynamic programming can be used to generate suitable guiding samples, and describe a regularized importance sampled policy optimization that incorporates these samples into the policy search. We evaluate the method by learning neural network controllers for planar swimming, hopping, and walking, as well as simulated 3D humanoid running.

773 citations


Posted Content
TL;DR: The results using L2-SVMs show that by simply replacing softmax with linear SVMs gives significant gains on popular deep learning datasets MNIST, CIFAR-10, and the ICML 2013 Representation Learning Workshop's face expression recognition challenge.
Abstract: Recently, fully-connected and convolutional neural networks have been trained to achieve state-of-the-art performance on a wide variety of tasks such as speech recognition, image classification, natural language processing, and bioinformatics. For classification tasks, most of these "deep learning" models employ the softmax activation function for prediction and minimize cross-entropy loss. In this paper, we demonstrate a small but consistent advantage of replacing the softmax layer with a linear support vector machine. Learning minimizes a margin-based loss instead of the cross-entropy loss. While there have been various combinations of neural nets and SVMs in prior art, our results using L2-SVMs show that by simply replacing softmax with linear SVMs gives significant gains on popular deep learning datasets MNIST, CIFAR-10, and the ICML 2013 Representation Learning Workshop's face expression recognition challenge.

Posted Content
TL;DR: In this article, the authors investigate the extent to which the catastrophic forgetting problem occurs for modern neural networks, comparing both established and recent gradient-based training algorithms and activation functions and find that the dropout algorithm is consistently best at adapting to the new task, remembering the old task and has the best tradeoff curve between these two extremes.
Abstract: Catastrophic forgetting is a problem faced by many machine learning models and algorithms. When trained on one task, then trained on a second task, many machine learning models "forget" how to perform the first task. This is widely believed to be a serious problem for neural networks. Here, we investigate the extent to which the catastrophic forgetting problem occurs for modern neural networks, comparing both established and recent gradient-based training algorithms and activation functions. We also examine the effect of the relationship between the first task and the second task on catastrophic forgetting. We find that it is always best to train using the dropout algorithm--the dropout algorithm is consistently best at adapting to the new task, remembering the old task, and has the best tradeoff curve between these two extremes. We find that different tasks and relationships between tasks result in very different rankings of activation function performance. This suggests the choice of activation function should always be cross-validated.

Journal ArticleDOI
TL;DR: The experimental results show that with minimum errors the proposed approach can be used for wind speed prediction in renewable energy systems and the perfect design of the neural network based on the selection criteria is substantiated using convergence theorem.
Abstract: This paper reviews methods to fix a number of hidden neurons in neural networks for the past 20 years. And it also proposes a new method to fix the hidden neurons in Elman networks for wind speed prediction in renewable energy systems. The random selection of a number of hidden neurons might cause either overfitting or underfitting problems. This paper proposes the solution of these problems. To fix hidden neurons, 101 various criteria are tested based on the statistical errors. The results show that proposed model improves the accuracy and minimal error. The perfect design of the neural network based on the selection criteria is substantiated using convergence theorem. To verify the effectiveness of the model, simulations were conducted on real-time wind data. The experimental results show that with minimum errors the proposed approach can be used for wind speed prediction. The survey has been made for the fixation of hidden neurons in neural networks. The proposed model is simple, with minimal error, and efficient for fixation of hidden neurons in Elman networks.

Proceedings ArticleDOI
01 Aug 2013
TL;DR: Different sequence-discriminative criteria are shown to lower word error rates by 7-9% relative, on a standard 300 hour American conversational telephone speech task.
Abstract: Sequence-discriminative training of deep neural networks (DNNs) is investigated on a standard 300 hour American En- glish conversational telephone speech task. Different sequence- discriminative criteria — maximum mutual information (MMI), minimum phone error (MPE), state-level minimum Bayes risk (sMBR), and boosted MMI — are compared. Two different heuristics are investigated to improve the performance of the DNNs trained using sequence-based criteria — lattices are re- generated after the first iteration of training; and, for MMI and BMMI, the frames where the numerator and denominator hy- potheses are disjoint are removed from the gradient compu- tation. Starting from a competitive DNN baseline trained us- ing cross-entropy, different sequence-discriminative criteria are shown to lower word error rates by 7-9% relative, on aver- age. Little difference is noticed between the different sequence- based criteria that are investigated. The experiments are done using the open-source Kaldi toolkit, which makes it possible for the wider community to reproduce these results. Index Terms: speech recognition, deep learning, sequence- criterion training, neural networks, reproducible research

Journal ArticleDOI
TL;DR: The potential of a simple photonic architecture to process information at unprecedented data rates is demonstrated, implementing a learning-based approach and all digits with very low classification errors are identified and chaotic time-series prediction with 10% error is performed.
Abstract: Inspired by neural networks, reservoir computing uses nonlinear transient states to perform computations, offering faster parallel information processing Brunner et al show a photonic approach to reservoir computing capable of simultaneous spoken digit and speaker recognition at high data rates

Posted Content
TL;DR: In this paper, the authors show that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions.
Abstract: Despite the widespread practical success of deep learning methods, our theoretical understanding of the dynamics of learning in deep neural networks remains quite sparse. We attempt to bridge the gap between the theory and practice of deep learning by systematically analyzing learning dynamics for the restricted case of deep linear neural networks. Despite the linearity of their input-output map, such networks have nonlinear gradient descent dynamics on weights that change with the addition of each new hidden layer. We show that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions. We provide an analytical description of these phenomena by finding new exact solutions to the nonlinear dynamics of deep learning. Our theoretical analysis also reveals the surprising finding that as the depth of a network approaches infinity, learning speed can nevertheless remain finite: for a special class of initial conditions on the weights, very deep networks incur only a finite, depth independent, delay in learning speed relative to shallow networks. We show that, under certain conditions on the training data, unsupervised pretraining can find this special class of initial conditions, while scaled random Gaussian initializations cannot. We further exhibit a new class of random orthogonal initial conditions on weights that, like unsupervised pre-training, enjoys depth independent learning times. We further show that these initial conditions also lead to faithful propagation of gradients even in deep nonlinear networks, as long as they operate in a special regime known as the edge of chaos.

Proceedings ArticleDOI
26 May 2013
TL;DR: The noise robustness of DNN-based acoustic models can match state-of-the-art performance on the Aurora 4 task without any explicit noise compensation and can be further improved by incorporating information about the environment into DNN training using a new method called noise-aware training.
Abstract: Recently, a new acoustic model based on deep neural networks (DNN) has been introduced. While the DNN has generated significant improvements over GMM-based systems on several tasks, there has been no evaluation of the robustness of such systems to environmental distortion. In this paper, we investigate the noise robustness of DNN-based acoustic models and find that they can match state-of-the-art performance on the Aurora 4 task without any explicit noise compensation. This performance can be further improved by incorporating information about the environment into DNN training using a new method called noise-aware training. When combined with the recently proposed dropout training technique, a 7.5% relative improvement over the previously best published result on this task is achieved using only a single decoding pass and no additional decoding complexity compared to a standard DNN.

Journal ArticleDOI
TL;DR: The philosophy, capabilities, and limitations of artificial neural networks in medical diagnosis through selected examples are reviewed and discussed.

Journal ArticleDOI
TL;DR: Huang et al. as mentioned in this paper proposed ELM-AE, a special case of ELM, where the input is equal to output, and the randomly generated weights are chosen to be orthogonal.
Abstract: Geoffrey Hinton and Pascal Vincent showed that a restricted Boltzmann machine (RBM) and auto-encoders (AE) could be used for feature engineering. These engineered features then could be used to train multiple-layer neural networks, or deep networks. Two types of deep networks based on RBM exist: the deep belief network (DBN)1 and the deep Boltzmann machine (DBM). Guang-Bin Huang and colleagues introduced the extreme learning machine (ELM) as an single-layer feed-forward neural networks (SLFN) with a fast learning speed and good generalization capability. The ELM for SLFNs shows that hidden nodes can be randomly generated. ELM-AE output weights can be determined analytically, unlike RBMs and traditional auto-encoders, which require iterative algorithms. ELM-AE can be seen as a special case of ELM, where the input is equal to output, and the randomly generated weights are chosen to be orthogonal.

Proceedings ArticleDOI
Tara N. Sainath1, Brian Kingsbury1, Vikas Sindhwani1, Ebru Arisoy1, Bhuvana Ramabhadran1 
26 Jun 2013
TL;DR: A low-rank matrix factorization of the final weight layer is proposed and applied to DNNs for both acoustic modeling and language modeling, showing an equivalent reduction in training time and a significant loss in final recognition accuracy compared to a full-rank representation.
Abstract: While Deep Neural Networks (DNNs) have achieved tremendous success for large vocabulary continuous speech recognition (LVCSR) tasks, training of these networks is slow. One reason is that DNNs are trained with a large number of training parameters (i.e., 10-50 million). Because networks are trained with a large number of output targets to achieve good performance, the majority of these parameters are in the final weight layer. In this paper, we propose a low-rank matrix factorization of the final weight layer. We apply this low-rank technique to DNNs for both acoustic modeling and language modeling. We show on three different LVCSR tasks ranging between 50-400 hrs, that a low-rank factorization reduces the number of parameters of the network by 30-50%. This results in roughly an equivalent reduction in training time, without a significant loss in final recognition accuracy, compared to a full-rank representation.

Book ChapterDOI
22 Sep 2013
TL;DR: A novel system for voxel classification integrating three 2D CNNs, which have a one-to-one association with the xy, yz and zx planes of 3D image, respectively, which performs better than a state-of-the-art method using 3D multi-scale features.
Abstract: Segmentation of anatomical structures in medical images is often based on a voxel/pixel classification approach. Deep learning systems, such as convolutional neural networks (CNNs), can infer a hierarchical representation of images that fosters categorization. We propose a novel system for voxel classification integrating three 2D CNNs, which have a one-to-one association with the xy, yz and zx planes of 3D image, respectively. We applied our method to the segmentation of tibial cartilage in low field knee MRI scans and tested it on 114 unseen scans. Although our method uses only 2D features at a single scale, it performs better than a state-of-the-art method using 3D multi-scale features. In the latter approach, the features and the classifier have been carefully adapted to the problem at hand. That we were able to get better results by a deep learning architecture that autonomously learns the features from the images is the main insight of this study.

Journal ArticleDOI
TL;DR: Two delay-dependent criteria are derived to ensure the stochastic stability of the error systems, and thus, the master systems stochastically synchronize with the slave systems.
Abstract: In this paper, the problem of sampled-data synchronization for Markovian jump neural networks with time-varying delay and variable samplings is considered. In the framework of the input delay approach and the linear matrix inequality technique, two delay-dependent criteria are derived to ensure the stochastic stability of the error systems, and thus, the master systems stochastically synchronize with the slave systems. The desired mode-independent controller is designed, which depends upon the maximum sampling interval. The effectiveness and potential of the obtained results is verified by two simulation examples.

Proceedings ArticleDOI
26 May 2013
TL;DR: The proposed feature enhancement algorithm estimates a smoothed ideal ratio mask (IRM) in the Mel frequency domain using deep neural networks and a set of time-frequency unit level features that has previously been used to estimate the ideal binary mask.
Abstract: We propose a feature enhancement algorithm to improve robust automatic speech recognition (ASR). The algorithm estimates a smoothed ideal ratio mask (IRM) in the Mel frequency domain using deep neural networks and a set of time-frequency unit level features that has previously been used to estimate the ideal binary mask. The estimated IRM is used to filter out noise from a noisy Mel spectrogram before performing cepstral feature extraction for ASR. On the noisy subset of the Aurora-4 robust ASR corpus, the proposed enhancement obtains a relative improvement of over 38% in terms of word error rates using ASR models trained in clean conditions, and an improvement of over 14% when the models are trained using the multi-condition training data. In terms of instantaneous SNR estimation performance, the proposed system obtains a mean absolute error of less than 4 dB in most frequency channels.

Proceedings ArticleDOI
26 May 2013
TL;DR: This work shows that it can improve generalization and make training of deep networks faster and simpler by substituting the logistic units with rectified linear units.
Abstract: Deep neural networks have recently become the gold standard for acoustic modeling in speech recognition systems The key computational unit of a deep network is a linear projection followed by a point-wise non-linearity, which is typically a logistic function In this work, we show that we can improve generalization and make training of deep networks faster and simpler by substituting the logistic units with rectified linear units These units are linear when their input is positive and zero otherwise In a supervised setting, we can successfully train very deep nets from random initialization on a large vocabulary speech recognition task achieving lower word error rates than using a logistic network with the same topology Similarly in an unsupervised setting, we show how we can learn sparse features that can be useful for discriminative tasks All our experiments are executed in a distributed environment using several hundred machines and several hundred hours of speech data

Book
16 Jun 2013
TL;DR: A study of Adaptive Neural Network Control System based on Differential Evolution Algorithm.
Abstract: A Study of Adaptive Neural Network Control System. Zhong, Heng Design of Fuzzy Logic Controller Based on Differential Evolution Algorithm. Shuai, Li (et al.). Neural Networks, Fuzzy Logic and Genetic Algorithms: Synthesis. Fuzzy Logic and Neural Networks: Basic Concepts and Applications. logic genetic by rajasekaran ebook. srajasekaran and ga vijayalakshmi pai neural networks. MODERN MAGNETIC MATERIALS PRINCIPLES AND APPLICATIONS PDF FREE NETWORKS FUZZY LOGIC AND GENETIC ALGORITHMS SYNTHESIS.

Journal ArticleDOI
TL;DR: In this article, a single-layer perceptron network implemented with a memrisitive crossbar circuit and trained using the perceptron learning rule by ex situ and in situ methods is presented.
Abstract: Memristors are memory resistors that promise the efficient implementation of synaptic weights in artificial neural networks. Whereas demonstrations of the synaptic operation of memristors already exist, the implementation of even simple networks is more challenging and has yet to be reported. Here we demonstrate pattern classification using a single-layer perceptron network implemented with a memrisitive crossbar circuit and trained using the perceptron learning rule by ex situ and in situ methods. In the first case, synaptic weights, which are realized as conductances of titanium dioxide memristors, are calculated on a precursor software-based network and then imported sequentially into the crossbar circuit. In the second case, training is implemented in situ, so the weights are adjusted in parallel. Both methods work satisfactorily despite significant variations in the switching behaviour of the memristors. These results give hope for the anticipated efficient implementation of artificial neuromorphic networks and pave the way for dense, high-performance information processing systems.

Proceedings ArticleDOI
25 Aug 2013
TL;DR: The results show that on this task, both types of recurrent networks outperform the CRF baseline substantially, and a bi-directional Jordantype network that takes into account both past and future dependencies among slots works best, outperforming a CRFbased baseline by 14% in relative error reduction.
Abstract: One of the key problems in spoken language understanding (SLU) is the task of slot filling. In light of the recent success of applying deep neural network technologies in domain detection and intent identification, we carried out an in-depth investigation on the use of recurrent neural networks for the more difficult task of slot filling involving sequence discrimination. In this work, we implemented and compared several important recurrent-neural-network architectures, including the Elman-type and Jordan-type recurrent networks and their variants. To make the results easy to reproduce and compare, we implemented these networks on the common Theano neural network toolkit, and evaluated them on the ATIS benchmark. We also compared our results to a conditional random fields (CRF) baseline. Our results show that on this task, both types of recurrent networks outperform the CRF baseline substantially, and a bi-directional Jordantype network that takes into account both past and future dependencies among slots works best, outperforming a CRFbased baseline by 14% in relative error reduction.

Journal ArticleDOI
TL;DR: In this article, a deep multi-task artificial neural network is used to predict multiple electronic ground and excited-state properties, such as atomization energy, polarizability, frontier orbital eigenvalues, ionization potential, electron affinity and excitation energies.
Abstract: The combination of modern scientific computing with electronic structure theory can lead to an unprecedented amount of data amenable to intelligent data analysis for the identification of meaningful, novel and predictive structure?property relationships. Such relationships enable high-throughput screening for relevant properties in an exponentially growing pool of virtual compounds that are synthetically accessible. Here, we present a machine learning model, trained on a database of ab initio calculation results for thousands of organic molecules, that simultaneously predicts multiple electronic ground- and excited-state properties. The properties include atomization energy, polarizability, frontier orbital eigenvalues, ionization potential, electron affinity and excitation energies. The machine learning model is based on a deep multi-task artificial neural network, exploiting the underlying correlations between various molecular properties. The input is identical to ab initio methods, i.e.?nuclear charges and Cartesian coordinates of all atoms. For small organic molecules, the accuracy of such a ?quantum machine? is similar, and sometimes superior, to modern quantum-chemical methods?at negligible computational cost.

Proceedings ArticleDOI
Dong Yu1, Kaisheng Yao1, Hang Su1, Gang Li1, Frank Seide1 
26 May 2013
TL;DR: Experiments demonstrate that the proposed adaptation technique can provide 2%-30% relative error reduction against the already very strong speaker independent CD-DNN-HMM systems using different adaptation sets under both supervised and unsupervised adaptation setups.
Abstract: We propose a novel regularized adaptation technique for context dependent deep neural network hidden Markov models (CD-DNN-HMMs). The CD-DNN-HMM has a large output layer and many large hidden layers, each with thousands of neurons. The huge number of parameters in the CD-DNN-HMM makes adaptation a challenging task, esp. when the adaptation set is small. The technique developed in this paper adapts the model conservatively by forcing the senone distribution estimated from the adapted model to be close to that from the unadapted model. This constraint is realized by adding Kullback-Leibler divergence (KLD) regularization to the adaptation criterion. We show that applying this regularization is equivalent to changing the target distribution in the conventional backpropagation algorithm. Experiments on Xbox voice search, short message dictation, and Switchboard and lecture speech transcription tasks demonstrate that the proposed adaptation technique can provide 2%-30% relative error reduction against the already very strong speaker independent CD-DNN-HMM systems using different adaptation sets under both supervised and unsupervised adaptation setups.

Proceedings Article
05 Dec 2013
TL;DR: A general formalism for studying dropout on either units or connections, with arbitrary probability values, is introduced and used to analyze the averaging and regularizing properties of dropout in both linear and non-linear networks.
Abstract: Dropout is a relatively new algorithm for training neural networks which relies on stochastically "dropping out" neurons during training in order to avoid the co-adaptation of feature detectors. We introduce a general formalism for studying dropout on either units or connections, with arbitrary probability values, and use it to analyze the averaging and regularizing properties of dropout in both linear and non-linear networks. For deep neural networks, the averaging properties of dropout are characterized by three recursive equations, including the approximation of expectations by normalized weighted geometric means. We provide estimates and bounds for these approximations and corroborate the results with simulations. Among other results, we also show how dropout performs stochastic gradient descent on a regularized error function.