
Showing papers by "Geoffrey E. Hinton published in 2011"


Proceedings Article
28 Jun 2011
TL;DR: Demonstrates the power of RNNs trained with the new Hessian-Free optimizer by applying them to character-level language modeling tasks, and introduces a new RNN variant that uses multiplicative connections which allow the current input character to determine the transition matrix from one hidden state vector to the next.
Abstract: Recurrent Neural Networks (RNNs) are very powerful sequence models that do not enjoy widespread use because it is extremely difficult to train them properly. Fortunately, recent advances in Hessian-free optimization have been able to overcome the difficulties associated with training RNNs, making it possible to apply them successfully to challenging sequence problems. In this paper we demonstrate the power of RNNs trained with the new Hessian-Free optimizer (HF) by applying them to character-level language modeling tasks. The standard RNN architecture, while effective, is not ideally suited for such tasks, so we introduce a new RNN variant that uses multiplicative (or "gated") connections which allow the current input character to determine the transition matrix from one hidden state vector to the next. After training the multiplicative RNN with the HF optimizer for five days on 8 high-end Graphics Processing Units, we were able to surpass the performance of the best previous single method for character-level language modeling – a hierarchical non-parametric sequence model. To our knowledge this represents the largest recurrent neural network application to date.
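
As an illustration of the multiplicative architecture, here is a minimal numpy sketch of the factored, input-gated hidden-state update the abstract describes; the vocabulary size, layer sizes and random initialization below are illustrative assumptions, not the paper's settings.

    import numpy as np

    rng = np.random.default_rng(0)
    n_vocab, n_hid, n_fac = 86, 512, 512   # assumed sizes, for illustration only

    W_fx = rng.normal(0, 0.01, (n_fac, n_vocab))  # input character -> factors
    W_fh = rng.normal(0, 0.01, (n_fac, n_hid))    # previous hidden state -> factors
    W_hf = rng.normal(0, 0.01, (n_hid, n_fac))    # factors -> new hidden state
    W_hx = rng.normal(0, 0.01, (n_hid, n_vocab))  # ordinary input connections
    b_h = np.zeros(n_hid)

    def mrnn_step(x_onehot, h_prev):
        # The current character gates the recurrent transition: the elementwise
        # product makes the effective hidden-to-hidden matrix depend on the input.
        f = (W_fx @ x_onehot) * (W_fh @ h_prev)
        return np.tanh(W_hf @ f + W_hx @ x_onehot + b_h)

    x = np.zeros(n_vocab); x[3] = 1.0   # a one-hot character
    h = mrnn_step(x, np.zeros(n_hid))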

1,382 citations


Book ChapterDOI
14 Jun 2011
TL;DR: Argues that neural networks can be used to learn features that output a whole vector of instantiation parameters, and that this is a much more promising way of dealing with variations in position, orientation, scale and lighting than the methods currently employed in the neural networks community.
Abstract: The artificial neural networks that are used to recognize shapes typically use one or more layers of learned feature detectors that produce scalar outputs. By contrast, the computer vision community uses complicated, hand-engineered features, like SIFT [6], that produce a whole vector of outputs including an explicit representation of the pose of the feature. We show how neural networks can be used to learn features that output a whole vector of instantiation parameters and we argue that this is a much more promising way of dealing with variations in position, orientation, scale and lighting than the methods currently employed in the neural networks community. It is also more promising than the hand-engineered features currently used in computer vision because it provides an efficient way of adapting the features to the domain.
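
The idea of a feature that outputs instantiation parameters can be made concrete with a toy unit that reports both a presence probability and an explicit pose vector. Everything below (the sizes, the two pose dimensions, the random weights) is a hypothetical illustration, not the chapter's actual model.

    import numpy as np

    rng = np.random.default_rng(1)
    n_pix, n_rec = 100, 20   # assumed: a 10x10 patch and 20 recognition units

    W_rec = rng.normal(0, 0.1, (n_rec, n_pix))
    W_pose = rng.normal(0, 0.1, (2, n_rec))  # outputs e.g. an (x, y) position
    w_gate = rng.normal(0, 0.1, n_rec)       # outputs presence of the feature

    def feature_unit(patch):
        r = np.tanh(W_rec @ patch)
        pose = W_pose @ r                           # vector of instantiation parameters
        prob = 1.0 / (1.0 + np.exp(-(w_gate @ r)))  # scalar presence probability
        return prob, pose

    prob, pose = feature_unit(rng.normal(size=n_pix))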

1,120 citations


Proceedings Article
01 Jan 2011
TL;DR: This work shows how to learn many layers of features on color images and how these features are used to initialize deep autoencoders, which are then used to map images to short binary codes.
Abstract: We show how to learn many layers of features on color images and we use these features to initialize deep autoencoders. We then use the autoencoders to map images to short binary codes. Using semantic hashing [6], 28-bit codes can be used to retrieve images that are similar to a query image in a time that is independent of the size of the database. This extremely fast retrieval makes it possible to search using multiple different transformations of the query image. 256-bit binary codes allow much more accurate matching and can be used to prune the set of images found using the 28-bit codes.
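
A minimal sketch of the retrieval scheme the abstract outlines, assuming the codes have already been produced by the deep autoencoders (the codes below are random stand-ins): the 28-bit code is used as a hash-table address, so lookup time does not grow with the database, and the 256-bit codes re-rank the small candidate set.

    import numpy as np

    rng = np.random.default_rng(2)
    n_db = 10_000
    short_codes = rng.integers(0, 2**28, n_db, dtype=np.uint32)    # 28-bit codes
    long_codes = rng.integers(0, 2, (n_db, 256)).astype(np.uint8)  # 256-bit codes

    # Hash table from 28-bit code to image ids: lookup cost is independent
    # of the database size.
    table = {}
    for i, c in enumerate(short_codes):
        table.setdefault(int(c), []).append(i)

    def retrieve(query_short, query_long):
        # Probe the query's address plus all addresses within Hamming distance 1.
        candidates = list(table.get(int(query_short), []))
        for bit in range(28):
            candidates += table.get(int(query_short) ^ (1 << bit), [])
        # Prune and re-rank the candidates with the more accurate 256-bit codes.
        candidates.sort(key=lambda i: int(np.count_nonzero(long_codes[i] != query_long)))
        return candidates

    hits = retrieve(short_codes[0], long_codes[0])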

406 citations


Proceedings ArticleDOI
22 May 2011
TL;DR: Deep Belief Networks work even better when their inputs are speaker-adaptive, discriminative features; on the standard TIMIT corpus, they give phone error rates of 19.6% using monophone HMMs and a bigram language model.
Abstract: Deep Belief Networks (DBNs) are multi-layer generative models. They can be trained to model windows of coefficients extracted from speech and they discover multiple layers of features that capture the higher-order statistical structure of the data. These features can be used to initialize the hidden units of a feed-forward neural network that is then trained to predict the HMM state for the central frame of the window. Initializing with features that are good at generating speech makes the neural network perform much better than initializing with random weights. DBNs have already been used successfully for phone recognition with input coefficients that are MFCCs or filterbank outputs [1, 2]. In this paper, we demonstrate that they work even better when their inputs are speaker-adaptive, discriminative features. On the standard TIMIT corpus, they give phone error rates of 19.6% using monophone HMMs and a bigram language model and 19.4% using monophone HMMs and a trigram language model.
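
A compressed sketch of the pretraining pipeline the abstract describes: greedy layerwise RBM training with one-step contrastive divergence, after which the stacked weights would initialize a feed-forward net fine-tuned to predict HMM states. The layer sizes, learning rate and data are illustrative assumptions, and biases are omitted for brevity.

    import numpy as np

    rng = np.random.default_rng(3)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    def cd1_update(W, v0, lr=0.01):
        # One contrastive-divergence step for a binary RBM (biases omitted).
        h0 = sigmoid(v0 @ W)
        v1 = sigmoid(h0 @ W.T)
        h1 = sigmoid(v1 @ W)
        return W + lr * (v0.T @ h0 - v1.T @ h1) / len(v0)

    layer_sizes = [429, 1024, 1024]   # assumed: an 11-frame window of coefficients
    data = rng.random((256, layer_sizes[0]))
    weights = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        W = rng.normal(0, 0.01, (n_in, n_out))
        for _ in range(10):
            W = cd1_update(W, data)
        weights.append(W)
        data = sigmoid(data @ W)      # activations become the next RBM's data

    # The stacked weights then initialize a feed-forward network whose softmax
    # output is trained with backpropagation to predict the HMM state of the
    # central frame (fine-tuning not shown).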

321 citations


Proceedings ArticleDOI
20 Jun 2011
TL;DR: This work uses one of the best pixel-level generative models of natural images, a gated MRF, as the lowest level of a deep belief network that has several hidden layers, and shows that the resulting DBN is very good at coping with occlusion when predicting expression categories from face images.
Abstract: The most popular way to use probabilistic models in vision is first to extract some descriptors of small image patches or object parts using well-engineered features, and then to use statistical learning tools to model the dependencies among these features and eventual labels. Learning probabilistic models directly on the raw pixel values has proved to be much more difficult and is typically only used for regularizing discriminative methods. In this work, we use one of the best pixel-level generative models of natural images, a gated MRF, as the lowest level of a deep belief network (DBN) that has several hidden layers. We show that the resulting DBN is very good at coping with occlusion when predicting expression categories from face images, and it can produce features that perform comparably to SIFT descriptors for discriminating different types of scene. The generative ability of the model also makes it easy to see what information is captured and what is lost at each level of representation.

245 citations


Proceedings ArticleDOI
22 May 2011
TL;DR: Presents a novel approach for modeling speech sound waves using a Restricted Boltzmann machine (RBM) with a new type of hidden variable, and reports initial results demonstrating phoneme recognition performance better than the current state of the art for methods based on Mel cepstrum coefficients.
Abstract: State of the art speech recognition systems rely on preprocessed speech features such as Mel cepstrum or linear predictive coding coefficients that collapse high dimensional speech sound waves into low dimensional encodings. While these have been successfully applied in speech recognition systems, such low dimensional encodings may lose some relevant information and express other information in a way that makes it difficult to use for discrimination. Higher dimensional encodings could both improve performance in recognition tasks, and also be applied to speech synthesis by better modeling the statistical structure of the sound waves. In this paper we present a novel approach for modeling speech sound waves using a Restricted Boltzmann machine (RBM) with a novel type of hidden variable and we report initial results demonstrating phoneme recognition performance better than the current state-of-the-art for methods based on Mel cepstrum coefficients.
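
The abstract does not spell out the paper's new hidden-variable type, so as a generic stand-in, here is a one-step contrastive-divergence update for an RBM with real-valued (Gaussian, unit-variance) visible units over a window of raw waveform samples. The window length, layer size and learning rate are assumptions for illustration only.

    import numpy as np

    rng = np.random.default_rng(4)
    n_vis, n_hid = 160, 300   # assumed: ~10 ms of 16 kHz waveform per window
    W = rng.normal(0, 0.01, (n_vis, n_hid))
    b_v, b_h = np.zeros(n_vis), np.zeros(n_hid)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    def cd1_gaussian(v0, lr=1e-3):
        # CD-1 for Gaussian visibles: the reconstruction is a mean rather than
        # a sigmoid, so raw real-valued waveform samples can be modeled directly.
        global W, b_v, b_h
        h0 = sigmoid(v0 @ W + b_h)
        v1 = h0 @ W.T + b_v
        h1 = sigmoid(v1 @ W + b_h)
        W = W + lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
        b_v = b_v + lr * (v0 - v1).mean(axis=0)
        b_h = b_h + lr * (h0 - h1).mean(axis=0)

    cd1_gaussian(rng.normal(size=(64, n_vis)))   # a batch of waveform windows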

223 citations


Proceedings ArticleDOI
22 May 2011
TL;DR: A DBN-based model gives a call-routing classification accuracy that is equal to the best of the other models even though it currently uses an impoverished representation of the input.
Abstract: This paper considers application of Deep Belief Nets (DBNs) to natural language call routing. DBNs have been successfully applied to a number of tasks, including image, audio and speech classification, thanks to the recent discovery of an efficient learning technique. DBNs learn a multi-layer generative model from unlabeled data and the features discovered by this model are then used to initialize a feed-forward neural network which is fine-tuned with backpropagation. We compare a DBN-initialized neural network to three widely used text classification algorithms: Support Vector Machines (SVMs), Boosting and Maximum Entropy (MaxEnt). The DBN-based model gives a call-routing classification accuracy that is equal to the best of the other models even though it currently uses an impoverished representation of the input.

144 citations


Journal ArticleDOI
TL;DR: Proposes a model based on the restricted Boltzmann machine (RBM) that uses an undirected model with binary latent variables and real-valued "visible" variables, and extends it with multiplicative three-way interactions so that a single set of parameters can capture diverse styles of motion.
Abstract: In this paper we develop a class of nonlinear generative models for high-dimensional time series. We first propose a model based on the restricted Boltzmann machine (RBM) that uses an undirected model with binary latent variables and real-valued "visible" variables. The latent and visible variables at each time step receive directed connections from the visible variables at the last few time-steps. This "conditional" RBM (CRBM) makes on-line inference efficient and allows us to use a simple approximate learning procedure. We demonstrate the power of our approach by synthesizing various sequences from a model trained on motion capture data and by performing on-line filling in of data lost during capture. We extend the CRBM in a way that preserves its most important computational properties and introduces multiplicative three-way interactions that allow the effective interaction weight between two variables to be modulated by the dynamic state of a third variable. We introduce a factoring of the implied three-way weight tensor to permit a more compact parameterization. The resulting model can capture diverse styles of motion with a single set of parameters, and the three-way interactions greatly improve its ability to blend motion styles or to transition smoothly among them. Videos and source code can be found at http://www.cs.nyu.edu/~gwtaylor/publications/jmlr2011.
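
A minimal sketch of the conditional structure the abstract describes: the last few visible frames enter the model only through directed connections that act as dynamic biases, so one-step inference over the hidden units stays as cheap as in an ordinary RBM. All sizes and parameters below are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(5)
    n_vis, n_hid, order = 60, 100, 3   # assumed sizes; 'order' past time-steps

    W = rng.normal(0, 0.01, (n_vis, n_hid))          # undirected RBM weights
    A = rng.normal(0, 0.01, (order * n_vis, n_vis))  # directed past -> visible
    B = rng.normal(0, 0.01, (order * n_vis, n_hid))  # directed past -> hidden
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    def hidden_posterior(v_t, history):
        # The history contributes only a dynamic bias, so inference is exact.
        return sigmoid(v_t @ W + history.ravel() @ B)

    def visible_mean(h, history):
        # Autoregressive directed connections shift the real-valued visibles.
        return h @ W.T + history.ravel() @ A

    history = rng.normal(size=(order, n_vis))   # e.g. recent motion-capture frames
    h = hidden_posterior(rng.normal(size=n_vis), history)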

130 citations


Journal ArticleDOI
TL;DR: A deep generative model in which the lowest layer represents the word-count vector of a document and the top layer represents a learned binary code for that document is described, which allows more accurate and much faster retrieval than latent semantic analysis.
Abstract: We describe a deep generative model in which the lowest layer represents the word-count vector of a document and the top layer represents a learned binary code for that document. The top two layers of the generative model form an undirected associative memory and the remaining layers form a belief net with directed, top-down connections. We present efficient learning and inference procedures for this type of generative model and show that it allows more accurate and much faster retrieval than latent semantic analysis. By using our method as a filter for a much slower method called TF-IDF we achieve higher accuracy than TF-IDF alone and save several orders of magnitude in retrieval time. By using short binary codes as addresses, we can perform retrieval on very large document sets in a time that is independent of the size of the document set using only one word of memory to describe each document.
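
A sketch of the two-stage retrieval described above, with random stand-ins for the learned binary codes and the TF-IDF vectors: the cheap Hamming-distance filter produces a shortlist that the slower TF-IDF cosine score then ranks. Sizes and cut-offs are assumptions.

    import numpy as np

    rng = np.random.default_rng(6)
    n_docs, n_terms, n_bits = 5000, 2000, 32       # assumed sizes
    tfidf = rng.random((n_docs, n_terms))          # stand-in TF-IDF vectors
    codes = rng.integers(0, 2, (n_docs, n_bits))   # stand-in learned binary codes

    def search(query_code, query_tfidf, k_filter=100, k_final=10):
        # Stage 1: fast filter by Hamming distance on the learned binary codes.
        ham = np.count_nonzero(codes != query_code, axis=1)
        shortlist = np.argsort(ham)[:k_filter]
        # Stage 2: rank only the shortlist with the slower TF-IDF cosine score.
        sub = tfidf[shortlist]
        sims = sub @ query_tfidf / (
            np.linalg.norm(sub, axis=1) * np.linalg.norm(query_tfidf) + 1e-12)
        return shortlist[np.argsort(-sims)[:k_final]]

    top = search(codes[0], tfidf[0])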

119 citations


Proceedings Article
14 Jul 2011
TL;DR: In this article, the authors argue that Contrastive Divergence-based learning may not be suitable for training conditional restricted Boltzmann machines (CRBMs) for structured output prediction.
Abstract: Conditional Restricted Boltzmann Machines (CRBMs) are rich probabilistic models that have recently been applied to a wide range of problems, including collaborative filtering, classification, and modeling motion capture data. While much progress has been made in training non-conditional RBMs, these algorithms are not applicable to conditional models and there has been almost no work on training and generating predictions from conditional RBMs for structured output problems. We first argue that standard Contrastive Divergence-based learning may not be suitable for training CRBMs. We then identify two distinct types of structured output prediction problems and propose an improved learning algorithm for each. The first problem type is one where the output space has arbitrary structure but the set of likely output configurations is relatively small, such as in multi-label classification. The second problem is one where the output space is arbitrarily structured but where the output space variability is much greater, such as in image denoising or pixel labeling. We show that the new learning algorithms can work much better than Contrastive Divergence on both types of problems.

114 citations


Proceedings ArticleDOI
20 Jun 2011
TL;DR: The model is defined as a factored three-way Boltzmann machine, in which hidden variables collaborate to define the joint correlation matrix for image pairs, which makes it possible to efficiently match images that are the same according to a learned measure of similarity.
Abstract: We describe a generative model of the relationship between two images. The model is defined as a factored three-way Boltzmann machine, in which hidden variables collaborate to define the joint correlation matrix for image pairs. Modeling the joint distribution over pairs makes it possible to efficiently match images that are the same according to a learned measure of similarity. We apply the model to several face matching tasks, and show that it learns to represent the input images using task-specific basis functions. Matching performance is superior to previous similar generative models, including recent conditional models of transformations. We also show that the model can be used as a plug-in matching score to perform invariant classification.
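
One way to read "hidden variables collaborate to define the joint correlation matrix" is through a factored three-way energy, sketched below with a matching score obtained by summing out binary hidden units. The particular factorization, the sizes and the random weights are illustrative assumptions rather than the paper's model.

    import numpy as np

    rng = np.random.default_rng(7)
    n_x = n_y = 64            # assumed: two 8x8 image patches
    n_hid, n_fac = 50, 100
    Wx = rng.normal(0, 0.01, (n_x, n_fac))
    Wy = rng.normal(0, 0.01, (n_y, n_fac))
    Wh = rng.normal(0, 0.01, (n_hid, n_fac))
    b_h = np.zeros(n_hid)

    def match_score(x, y):
        # Each factor multiplies one projection of x, one of y and one of the
        # hiddens, so the hiddens jointly gate the x-y correlation structure.
        h_in = ((x @ Wx) * (y @ Wy)) @ Wh.T + b_h
        # Negative free energy after summing out binary hiddens: higher means
        # the pair is a better match under the (here randomly initialized) model.
        return np.logaddexp(0.0, h_in).sum()

    score = match_score(rng.normal(size=n_x), rng.normal(size=n_y))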

Journal ArticleDOI
TL;DR: Key strands of modern machine learning grew out of attempts to understand how large numbers of interconnected, more or less neuron-like elements could learn to achieve behaviourally meaningful computations and to extract useful features from images or sound waves.
Abstract: What is machine learning? Machine learning is a type of statistics that places particular emphasis on the use of advanced computational algorithms. As computers become more powerful, and modern experimental methods in areas such as imaging generate vast bodies of data, machine learning is becoming ever more important for extracting reliable and meaningful relationships and for making accurate predictions.

Key strands of modern machine learning grew out of attempts to understand how large numbers of interconnected, more or less neuron-like elements could learn to achieve behaviourally meaningful computations and to extract useful features from images or sound waves. By the 1990s, key approaches had converged on an elegant framework called 'graphical models', explained in Koller and Friedman, in which the nodes of a graph represent variables such as edges and corners in an image, or phonemes and words in speech. The probabilistic relationships between nodes are represented by conditional probability tables or simple functions whose parameters are learned from the data.

There are three main problems in fitting graphical models to data: inference, parameter learning and structure learning. The inference problem is how to infer the probable values of unobserved variables when the values of a subset of the variables have been observed, and is a problem that perceptual systems need to solve if they are to infer the hidden causes of their sensory input. The parameter-learning problem is how to adjust the parameters governing the way in which one variable influences another, so that the graphical model is a better fit to some observed data. In the brain, this is presumably done by changing synapse strengths. The structure-learning problem is how to decide which unobserved variables are needed and how they must be connected to model the correlations between observed variables. In the brain, evolution and early pruning of connections presumably have a large role to play in determining the structure.

Could you provide a brief description of the methods of machine learning? Machine learning can be divided into three parts: 1) in supervised learning, the aim is to predict a class label or a real value from an input (classifying objects in images or predicting the future value of a stock are examples of this type of learning); 2) in unsupervised learning, the aim is to discover good features for representing the input data; and 3) in reinforcement learning, the aim is to discover what action should be performed next in order to maximize the eventual payoff.

Journal ArticleDOI
TL;DR: It has been shown that multiple layers of feature detectors can be learned greedily, one layer at a time, by using unsupervised learning that does not require labeled data.
Abstract: A typical machine learning program uses weighted combinations of features to discriminate between classes or to predict real-valued outcomes. The art of machine learning is in constructing the features, and a radically new method of creating features constitutes a major advance. In the 1980s, the new method was backpropagation, which uses the chain rule to backpropagate error derivatives through a multilayer, feed-forward, neural network and adjusts the weights between layers by following the gradient of the backpropagated error. This worked well for recognizing simple shapes, such as handwritten digits, especially in convolutional neural networks that use local feature detectors replicated across the image [5]. For many tasks, however, it proved extremely difficult to optimize deep neural nets with many layers of non-linear features, and a huge number of labeled training cases was required for large neural networks to generalize well to test data.

In the 1990s, Support Vector Machines (SVMs) [8] introduced a very different way of creating features: the user defines a kernel function that computes the similarity between two input vectors, then a judiciously chosen subset of the training examples is used to create "landmark" features that measure how similar a test case is to each training case. SVMs have a clever way of choosing which training cases to use as landmarks and deciding how to weight them. They work remarkably well on many machine learning tasks even though the selected features are non-adaptive. The success of SVMs dampened the earlier enthusiasm for neural networks.

More recently, however, it has been shown that multiple layers of feature detectors can be learned greedily, one layer at a time, by using unsupervised learning that does not require labeled data. The features in each layer are designed to model the statistical structure of the patterns of feature activations in the previous layer. After learning several layers of features this way without paying any attention to the final goal, many of the high-level features will be irrelevant for any particular task, but others will be highly relevant because high-order correlations are the signature of the data's true underlying causes and the labels are more directly related to these causes than to the raw inputs. A subsequent stage of fine-tuning using backpropagation then yields neural networks that work much better than those trained by backpropagation alone and better than SVMs for important tasks such as object or speech recognition. The neural …