
Showing papers by "Geoffrey E. Hinton" published in 2002


Journal ArticleDOI
TL;DR: A product of experts (PoE) is an interesting candidate for a perceptual system in which rapid inference is vital and generation is unnecessary; because the renormalization term makes maximum-likelihood learning hard even to approximate, a PoE is instead trained with the "contrastive divergence" objective, whose derivatives can be approximated accurately and efficiently.
Abstract: It is possible to combine multiple latent-variable models of the same data by multiplying their probability distributions together and then renormalizing. This way of combining individual "expert" models makes it hard to generate samples from the combined model but easy to infer the values of the latent variables of each expert, because the combination rule ensures that the latent variables of different experts are conditionally independent when given the data. A product of experts (PoE) is therefore an interesting candidate for a perceptual system in which rapid inference is vital and generation is unnecessary. Training a PoE by maximizing the likelihood of the data is difficult because it is hard even to approximate the derivatives of the renormalization term in the combination rule. Fortunately, a PoE can be trained using a different objective function called "contrastive divergence" whose derivatives with regard to the parameters can be approximated accurately and efficiently. Examples are presented of contrastive divergence learning using several types of expert on several types of data.
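
The abstract does not commit to a particular expert type; the simplest concrete instance is a restricted Boltzmann machine, in which each hidden unit acts as one expert over binary data. Below is a minimal sketch of a one-step contrastive divergence (CD-1) update under that assumption; the variable names, sizes, and learning rate are illustrative rather than taken from the paper.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v_data, W, b_vis, b_hid, lr=0.01):
    """One CD-1 step for a binary RBM, a simple product-of-experts instance."""
    # Positive phase: hidden probabilities given the data.
    h_prob = sigmoid(v_data @ W + b_hid)
    h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
    # Negative phase: one-step reconstruction of the data.
    v_recon = sigmoid(h_sample @ W.T + b_vis)
    h_recon = sigmoid(v_recon @ W + b_hid)
    # Approximate gradient of the contrastive divergence objective.
    n = v_data.shape[0]
    W = W + lr * (v_data.T @ h_prob - v_recon.T @ h_recon) / n
    b_vis = b_vis + lr * (v_data - v_recon).mean(axis=0)
    b_hid = b_hid + lr * (h_prob - h_recon).mean(axis=0)
    return W, b_vis, b_hid

# Toy usage on random binary vectors.
n_vis, n_hid = 16, 8
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)
data = (rng.random((32, n_vis)) < 0.5).astype(float)
W, b_vis, b_hid = cd1_update(data, W, b_vis, b_hid)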

5,150 citations


Proceedings Article
01 Jan 2002
TL;DR: This probabilistic framework makes it easy to represent each object by a mixture of widely separated low-dimensional images, which allows ambiguous objects, like the document count vector for the word "bank", to have versions close to the images of both "river" and "finance" without forcing the images of outdoor concepts to be located close to those of corporate concepts.
Abstract: We describe a probabilistic approach to the task of placing objects, described by high-dimensional vectors or by pairwise dissimilarities, in a low-dimensional space in a way that preserves neighbor identities. A Gaussian is centered on each object in the high-dimensional space and the densities under this Gaussian (or the given dissimilarities) are used to define a probability distribution over all the potential neighbors of the object. The aim of the embedding is to approximate this distribution as well as possible when the same operation is performed on the low-dimensional "images" of the objects. A natural cost function is a sum of Kullback-Leibler divergences, one per object, which leads to a simple gradient for adjusting the positions of the low-dimensional images. Unlike other dimensionality reduction methods, this probabilistic framework makes it easy to represent each object by a mixture of widely separated low-dimensional images. This allows ambiguous objects, like the document count vector for the word "bank", to have versions close to the images of both "river" and "finance" without forcing the images of outdoor concepts to be located close to those of corporate concepts.
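
A minimal sketch of the embedding procedure described above, assuming Gaussian neighborhoods with a single global variance and a single low-dimensional image per object (the mixture-of-images extension is omitted); the step size and iteration count are illustrative.

import numpy as np

def neighbor_probs(X, sigma2=1.0):
    """Row-wise P(j|i) under a Gaussian of variance sigma2 centered on each point."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    logits = -d2 / (2.0 * sigma2)
    np.fill_diagonal(logits, -np.inf)          # an object is not its own neighbor
    P = np.exp(logits - logits.max(axis=1, keepdims=True))
    return P / P.sum(axis=1, keepdims=True)

def sne_fit(X, n_dims=2, n_iters=500, lr=0.5, seed=0):
    """Gradient descent on the sum of per-object KL divergences KL(P_i || Q_i)."""
    rng = np.random.default_rng(seed)
    P = neighbor_probs(X)
    Y = 1e-2 * rng.standard_normal((X.shape[0], n_dims))
    for _ in range(n_iters):
        Q = neighbor_probs(Y, sigma2=0.5)
        # dC/dy_i = 2 * sum_j (p_ij - q_ij + p_ji - q_ji) * (y_i - y_j)
        coeff = (P - Q) + (P - Q).T
        Y -= lr * 2.0 * (coeff[:, :, None] * (Y[:, None, :] - Y[None, :, :])).sum(axis=1)
    return Y

# Toy usage: embed 20 random 10-dimensional points into 2-D.
X = np.random.default_rng(1).standard_normal((20, 10))
Y = sne_fit(X)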

1,593 citations


Journal ArticleDOI
TL;DR: The procedures used in conventional data analysis are formulated in terms of hierarchical linear models, and a connection between classical inference and parametric empirical Bayes (PEB) is established through covariance component estimation.

647 citations


Proceedings Article
01 Jan 2002
TL;DR: A model for natural images is proposed in which the probability of an image is proportional to the product of the probabilities of some filter outputs; once learned, it is used as a prior to derive the "iterated Wiener filter" for denoising images.
Abstract: We propose a model for natural images in which the probability of an image is proportional to the product of the probabilities of some filter outputs. We encourage the system to find sparse features by using a Student-t distribution to model each filter output. If the t-distribution is used to model the combined outputs of sets of neurally adjacent filters, the system learns a topographic map in which the orientation, spatial frequency and location of the filters change smoothly across the map. Even though maximum likelihood learning is intractable in our model, the product form allows a relatively efficient learning procedure that works well even for highly overcomplete sets of filters. Once the model has been learned it can be used as a prior to derive the "iterated Wiener filter" for the purpose of denoising images.
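
The density described above can be written down directly, up to its intractable normalizer: each expert is a Student-t distribution over one linear filter output, and the log probability of an image is the sum of the experts' log densities. A minimal sketch of that unnormalized log probability, assuming the common (1 + y^2/2)^(-alpha) parameterization; the filters and alpha values below are placeholders, not learned ones.

import numpy as np

def unnormalized_log_prob(x, W, alpha):
    """Unnormalized log p(x) for a product of Student-t experts on filter outputs W @ x."""
    y = W @ x                                   # one output per linear filter
    return -(alpha * np.log1p(0.5 * y ** 2)).sum()

# Toy usage: an overcomplete set of random filters over a flattened 8x8 patch.
rng = np.random.default_rng(0)
W = rng.standard_normal((100, 64))              # 100 filters, 64 pixels
alpha = np.full(100, 1.5)
patch = rng.standard_normal(64)
print(unnormalized_log_prob(patch, W, alpha))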

155 citations


Journal Article
TL;DR: In this article, a contrastive divergence optimization criterion was proposed to maximize the divergence between one-step reconstructions of the data and the equilibrium distribution, which eliminates the need to estimate equilibrium statistics, and does not need to approximate the multimodal probability distribution with the unimodal mean field distribution.
Abstract: We present a new learning algorithm for Mean Field Boltzmann Machines based on the contrastive divergence optimization criterion. In addition to minimizing the divergence between the data distribution and the equilibrium distribution, we maximize the divergence between one-step reconstructions of the data and the equilibrium distribution. This eliminates the need to estimate equilibrium statistics, so we do not need to approximate the multimodal probability distribution of the free network with the unimodal mean field distribution. We test the learning algorithm on the classification of digits.

134 citations


Journal ArticleDOI
TL;DR: On the MNIST database, the system is comparable with current state-of-the-art discriminative methods, demonstrating that the product of experts learning procedure can produce effective generative models of high-dimensional data.
Abstract: The product of experts learning procedure can discover a set of stochastic binary features that constitute a nonlinear generative model of handwritten images of digits. The quality of generative models learned in this way can be assessed by learning a separate model for each class of digit and then comparing the unnormalized probabilities of test images under the 10 different class-specific models. To improve discriminative performance, a hierarchy of separate models can be learned for each digit class. Each model in the hierarchy learns a layer of binary feature detectors that model the probability distribution of vectors of activity of feature detectors in the layer below. The models in the hierarchy are trained sequentially and each model uses a layer of binary feature detectors to learn a generative model of the patterns of feature activities in the preceding layer. After training, each layer of feature detectors produces a separate, unnormalized log probability score. With three layers of feature detectors for each of the 10 digit classes, a test image produces 30 scores which can be used as inputs to a supervised, logistic classification network that is trained on separate data.
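
The final stage described above amounts to multinomial logistic regression on the vector of unnormalized log probability scores. A minimal sketch under that reading, with fabricated scores standing in for the real three-layer, ten-class outputs; the optimizer and its settings are illustrative.

import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_logistic(scores, labels, n_classes=10, n_iters=200, lr=0.1):
    """Multinomial logistic regression on per-class log probability scores."""
    n, d = scores.shape
    W, b = np.zeros((d, n_classes)), np.zeros(n_classes)
    Y = np.eye(n_classes)[labels]                # one-hot targets
    for _ in range(n_iters):
        P = softmax(scores @ W + b)
        W -= lr * scores.T @ (P - Y) / n
        b -= lr * (P - Y).mean(axis=0)
    return W, b

# Toy usage: 30 scores per image (3 layers x 10 digit classes), fabricated here.
rng = np.random.default_rng(0)
fake_scores = rng.standard_normal((500, 30))
fake_labels = rng.integers(0, 10, size=500)
W, b = train_logistic(fake_scores, fake_labels)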

81 citations


Proceedings Article
01 Jan 2002
TL;DR: A novel input device and interface for interactively controlling the animation of a graphical human character from a desktop environment; a layered kinematic motion recording strategy accesses subsets of the total degrees of freedom of the character.
Abstract: We present a novel input device and interface for interactively controlling the animation of a graphical human character from a desktop environment. The trackers are embedded in a new physical design that is simple yet provides significant benefits, and establishes a tangible interface with coordinate frames inherent to the character. A layered kinematic motion recording strategy accesses subsets of the total degrees of freedom of the character. We present the experiences of three novice users with the system, and those of a long-term user who has prior experience with other complex continuous interfaces.

71 citations


Proceedings Article
01 Jan 2002
TL;DR: A sequential approach is proposed for adding features to a random field model by training them to improve classification performance between the data and an equal-sized sample of "negative examples" generated from the model's current estimate of the data density.
Abstract: Boosting algorithms and successful applications thereof abound for classification and regression learning problems, but not for unsupervised learning. We propose a sequential approach to adding features to a random field model by training them to improve classification performance between the data and an equal-sized sample of "negative examples" generated from the model's current estimate of the data density. Training in each boosting round proceeds in three stages: first we sample negative examples from the model's current Boltzmann distribution. Next, a feature is trained to improve classification performance between data and negative examples. Finally, a coefficient is learned which determines the importance of this feature relative to ones already in the pool. Negative examples only need to be generated once to learn each new feature. The validity of the approach is demonstrated on binary digits and continuous synthetic data.
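
One round of the three-stage procedure described above can be sketched as follows. The sketch simplifies heavily: the negative examples are fabricated here rather than drawn by Gibbs sampling from the model's current Boltzmann distribution, the new feature is a plain logistic unit, and its coefficient is chosen by a crude line search on the same classification loss; none of these choices are claimed to match the paper's exact feature form or weighting rule.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_feature(data, negatives, n_iters=200, lr=0.1):
    """Stage 2: train one new feature (a logistic unit) to separate data from negatives."""
    X = np.vstack([data, negatives])
    t = np.concatenate([np.ones(len(data)), np.zeros(len(negatives))])
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        w -= lr * X.T @ (sigmoid(X @ w) - t) / len(t)
    return w

def fit_coefficient(feature, data, negatives):
    """Stage 3: pick the feature's weight by a line search on the classification loss."""
    X = np.vstack([data, negatives])
    t = np.concatenate([np.ones(len(data)), np.zeros(len(negatives))])
    def loss(c):
        p = np.clip(sigmoid(c * (X @ feature)), 1e-9, 1 - 1e-9)
        return -(t * np.log(p) + (1 - t) * np.log(1 - p)).mean()
    return min(np.linspace(0.0, 2.0, 41), key=loss)

# Stage 1 placeholder: in the paper the negatives come from Gibbs sampling the
# model's current Boltzmann distribution; random bits are used here instead.
rng = np.random.default_rng(0)
data = (rng.random((200, 16)) < 0.8).astype(float)
negatives = (rng.random((200, 16)) < 0.5).astype(float)
f = fit_feature(data, negatives)
c = fit_coefficient(f, data, negatives)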

57 citations


01 Jan 2002
TL;DR: This thesis presents novel algorithms for learning the dynamics, learning the value function, and selecting good actions for Markov decision processes; simulation results show that these new methods can be used to solve very large problems.
Abstract: Learning to act optimally in a complex, dynamic and noisy environment is a hard problem. Various threads of research from reinforcement learning, animal conditioning, operations research, machine learning, statistics and optimal control are beginning to come together to offer solutions to this problem. I present a thesis in which novel algorithms are presented for learning the dynamics, learning the value function, and selecting good actions for Markov decision processes. The problems considered have high-dimensional factored state and action spaces, and are either fully or partially observable. The approach I take is to recognize similarities between the problems being solved in the reinforcement learning and graphical models literature, and to use and combine techniques from the two fields in novel ways. In particular I present two new algorithms. First, the DBN algorithm learns a compact representation of the core process of a partially observable MDP. Because inference in the DBN is intractable, I use approximate inference to maintain the belief state. A belief state action-value function is learned using reinforcement learning. I show that this DBN algorithm can solve POMDPs with very large state spaces and useful hidden state. Second, the PoE algorithm learns an approximation to value functions over large factored state-action spaces. The algorithm approximates values as (negative) free energies in a product of experts model. The model parameters can be learned efficiently because inference is tractable in a product of experts. I show that good actions can be found even in large factored action spaces by the use of brief Gibbs sampling. These two new algorithms take techniques from the machine learning community and apply them in new ways to reinforcement learning problems. Simulation results show that these new methods can be used to solve very large problems. The DBN method is used to solve a POMDP with a hidden state space and an observation space of size greater than 2^180. The DBN model of the core process has 2^32 states represented as 32 binary variables. The PoE method is used to find actions in action spaces of size 2^40.
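
A minimal sketch of the PoE value-function idea, under the assumption that the product of experts is a restricted Boltzmann machine over binary state and action variables: the value of a state-action pair is taken to be the negative free energy, and an action is found by brief Gibbs sampling over the action bits with the state bits clamped. All sizes, parameters, and the number of sweeps are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def free_energy(v, W, b_vis, b_hid):
    """RBM free energy; the approximate value of (state, action) is its negative."""
    return -(v @ b_vis) - np.logaddexp(0.0, v @ W + b_hid).sum()

def pick_action(state, n_action, W, b_vis, b_hid, n_sweeps=10):
    """Brief Gibbs sampling over the action bits with the state bits clamped."""
    n_state = len(state)
    a = (rng.random(n_action) < 0.5).astype(float)
    for _ in range(n_sweeps):
        v = np.concatenate([state, a])
        h = (rng.random(b_hid.shape) < sigmoid(v @ W + b_hid)).astype(float)
        p_a = sigmoid(h @ W.T[:, n_state:] + b_vis[n_state:])   # resample actions only
        a = (rng.random(n_action) < p_a).astype(float)
    return a, -free_energy(np.concatenate([state, a]), W, b_vis, b_hid)

# Toy usage with random parameters.
n_state, n_action, n_hid = 8, 6, 12
W = 0.1 * rng.standard_normal((n_state + n_action, n_hid))
b_vis, b_hid = np.zeros(n_state + n_action), np.zeros(n_hid)
state = (rng.random(n_state) < 0.5).astype(float)
action, value = pick_action(state, n_action, W, b_vis, b_hid)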

37 citations


Journal ArticleDOI
TL;DR: Although local physical models are a quite simple approximation to real physical behaviour, it is shown that they are extremely useful for interactive character control, and contribute positively to the expressiveness of the character's motion.
Abstract: Our goal is to design and build a tool for the creation of expressive character animation. Virtual puppetry, also known as performance animation, is a technique in which the user interactively controls a character’s motion. In this paper we introduce local physical models for performance animation and describe how they can augment an existing kinematic method to achieve very effective animation control. These models approximate specific physically-generated aspects of a character’s motion. They automate certain behaviours, while still letting the user override such motion via a PD-controller if he so desires. Furthermore, they can be tuned to ignore certain undesirable effects, such as the risk of having a character fall over, by ignoring corresponding components of the force. Although local physical models are a quite simple approximation to real physical behaviour, we show that they are extremely useful for interactive character control, and contribute positively to the expressiveness of the character’s motion. In this paper, we develop such models at the knees and ankles of an interactively-animated 3D anthropomorphic character, and demonstrate a resulting animation. This approach can be applied in a straightforward way to other joints.
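
The PD-controller override mentioned above is the standard proportional-derivative control law. A minimal single-joint sketch, assuming unit moment of inertia; the gains, target angle, and time step are illustrative values, not taken from the paper.

def pd_torque(theta, theta_dot, theta_target, kp=120.0, kd=8.0):
    """Proportional-derivative torque driving a joint angle toward a target."""
    return kp * (theta_target - theta) - kd * theta_dot

# Toy usage: integrate one joint with unit inertia under the PD torque.
theta, theta_dot, dt = 0.0, 0.0, 0.01
for _ in range(300):
    tau = pd_torque(theta, theta_dot, theta_target=0.6)
    theta_dot += tau * dt          # unit moment of inertia assumed
    theta += theta_dot * dt
print(round(theta, 3))             # settles near the 0.6 rad target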

36 citations


Book ChapterDOI
28 Aug 2002
TL;DR: A new learning algorithm for Mean Field Boltzmann Machines based on the contrastive divergence optimization criterion that eliminates the need to estimate equilibrium statistics, so it does not need to approximate the multimodal probability distribution of the free network with the unimodal mean field distribution.
Abstract: We present a new learning algorithm for Mean Field Boltzmann Machines based on the contrastive divergence optimization criterion. In addition to minimizing the divergence between the data distribution and the equilibrium distribution, we maximize the divergence between one-step reconstructions of the data and the equilibrium distribution. This eliminates the need to estimate equilibrium statistics, so we do not need to approximate the multimodal probability distribution of the free network with the unimodal mean field distribution. We test the learning algorithm on the classification of digits.
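
The unimodal mean field distribution referred to above is defined by the unit means that solve the usual fixed-point equations. A minimal sketch of damped fixed-point iteration for a fully connected Boltzmann machine; the weights, biases, and damping factor below are placeholders rather than learned parameters.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_field(W, b, n_iters=50, damping=0.5):
    """Iterate m_i = sigmoid(sum_j W_ij m_j + b_i) with damping until (near) convergence."""
    m = np.full(b.shape, 0.5)
    for _ in range(n_iters):
        m = damping * m + (1.0 - damping) * sigmoid(W @ m + b)
    return m

# Toy usage with a small random symmetric network (zero self-connections).
rng = np.random.default_rng(0)
n = 10
W = 0.2 * rng.standard_normal((n, n))
W = (W + W.T) / 2.0
np.fill_diagonal(W, 0.0)
m = mean_field(W, rng.standard_normal(n))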

Proceedings Article
07 Aug 2002
TL;DR: In this article, the authors presented the Undercomplete Product of Experts (UPoE) model, where each expert models a one dimensional projection of the data, and the UPoE may be interpreted as a parametric probabilistic model for projection pursuit.
Abstract: Product models of low dimensional experts are a powerful way to avoid the curse of dimensionality. We present the "undercomplete product of experts" (UPoE), where each expert models a one dimensional projection of the data. The UPoE may be interpreted as a parametric probabilistic model for projection pursuit. Its ML learning rules are identical to the approximate learning rules proposed before for under-complete ICA. We also derive an efficient sequential learning algorithm and discuss its relationship to projection pursuit density estimation and feature induction algorithms for additive random field models.

Book ChapterDOI
01 Jan 2002
TL;DR: Non-Linear Relational Embedding (NLRE) is introduced and shown to represent relations that LRE cannot; the learned distributed representations can easily be modified to incorporate new objects and relations.
Abstract: This thesis introduces new methods for solving the problem of generalizing from relational data. I consider a situation in which we have a set of concepts and a set of relations among these concepts, and the data consists of few instances of these relations that hold among the concepts; the aim is to infer other instances of these relations. My approach is to learn from the data a representation of the objects in terms of the features which are relevant for the set of relations at hand, together with the rules of how these features interact. I then use these distributed representations to infer how objects relate. Linear Relational Embedding (LRE) is a new method for learning distributed representations for objects and relations. It finds a mapping from the objects into a feature-space by imposing the constraint that relations in this feature-space are modelled by linear operations. Having learned such distributed representations, it becomes possible to learn a probabilistic model which can be used to infer both positive and negative instances of the relations. LRE shows excellent generalization performance. On a classical problem results are far better than those obtained by any previously published method. I also present results on other problems, which show that the generalization performance of LRE is excellent. Moreover, after learning a distributed representation for a set of objects and relations, LRE can easily modify these representations to incorporate new objects and relations. Learning is fast and LRE rarely converges to solutions with poor generalization. Due to its linearity LRE cannot represent some relations of arity greater than two. I therefore introduce Non-Linear Relational Embedding (NLRE), and show that it can represent relations that LRE cannot. A probabilistic model, which can be used to infer both positive and negative instances of the relations, can also be learned for NLRE. Hierarchical LRE and Hierarchical NLRE are modifications of the above methods for learning a distributed representation of variable-sized recursive data structures. Results show that these methods are able to extract semantic features from trees and use them to generalize correctly to novel trees.
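
The core constraint of Linear Relational Embedding can be sketched directly: each concept becomes a vector, each relation a matrix, and for a known triple (a, r, b) the product R_r v_a should land near v_b. The gradient step below minimizes a plain squared error for brevity; the thesis actually uses a discriminative, softmax-style objective over all candidate objects, which among other things rules out the trivial all-zero solution that the squared-error version admits. All sizes and the learning rate are illustrative.

import numpy as np

def lre_step(triples, V, R, lr=0.05):
    """One gradient step pushing R[r] @ V[a] toward V[b] for each triple (a, r, b)."""
    dV, dR = np.zeros_like(V), np.zeros_like(R)
    for a, r, b in triples:
        err = R[r] @ V[a] - V[b]       # residual of the linear-relation constraint
        dR[r] += np.outer(err, V[a])
        dV[a] += R[r].T @ err
        dV[b] -= err
    return V - lr * dV, R - lr * dR

# Toy usage: 5 concepts embedded in 3-D, 2 relations, a few known triples.
rng = np.random.default_rng(0)
V = 0.1 * rng.standard_normal((5, 3))
R = np.stack([np.eye(3) + 0.1 * rng.standard_normal((3, 3)) for _ in range(2)])
triples = [(0, 0, 1), (1, 0, 2), (3, 1, 4)]
for _ in range(200):
    V, R = lre_step(triples, V, R)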


Journal ArticleDOI
TL;DR: Ray left a legacy of groundbreaking, deep insights that have changed the course of AI.
Abstract: Ray dedicated his life to his research with the wonder of a child, the fearlessness of an explorer, the precision of a mathematician, and the tirelessness of a researcher who found shallowness and confusion intolerable. He leaves a legacy of groundbreaking, deep insights that have changed the course of AI.