
Showing papers by "Geoffrey E. Hinton" published in 2003


Journal ArticleDOI
TL;DR: A new way of extending independent components analysis (ICA) to overcomplete representations that defines features as deterministic (linear) functions of the inputs and assigns energies to the features, yielding a distribution over inputs through the Boltzmann distribution.
Abstract: We present a new way of extending independent components analysis (ICA) to overcomplete representations. In contrast to the causal generative extensions of ICA, which maintain marginal independence of sources, we define features as deterministic (linear) functions of the inputs. This assumption results in marginal dependencies among the features, but conditional independence of the features given the inputs. By assigning energies to the features, a probability distribution over the input states is defined through the Boltzmann distribution. Free parameters of this model are trained using the contrastive divergence objective (Hinton, 2002). When the number of features is equal to the number of input dimensions this energy-based model reduces to noiseless ICA, and we show experimentally that the proposed learning algorithm is able to perform blind source separation on speech data. In additional experiments we train overcomplete energy-based models to extract features from various standard datasets containing speech, natural images, hand-written digits and faces.
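
As a rough illustration of the training recipe (not the paper's code): the sketch below assumes Student-t style energy terms and a single uncorrected Langevin step for the negative phase of contrastive divergence; all names and constants are hypothetical.

import numpy as np

rng = np.random.default_rng(0)

def energy(W, X):
    # Features are deterministic linear functions of the inputs: S = X W^T.
    # Each feature contributes a heavy-tailed (Student-t style) energy term,
    # and the probability of an input is p(x) proportional to exp(-E(x)).
    S = X @ W.T
    return np.log1p(S ** 2).sum(axis=1)

def dE_dW(W, X):
    # Batch-averaged gradient of the total energy with respect to the filters.
    S = X @ W.T
    return (2.0 * S / (1.0 + S ** 2)).T @ X / len(X)

def cd1_update(W, X, lr=0.01, eps=0.1):
    # One contrastive divergence step (Hinton, 2002): start the Markov chain
    # at the data, take a single Langevin step to get "negative" samples, and
    # contrast the energy gradients on the data and on the negative samples.
    S = X @ W.T
    dE_dX = (2.0 * S / (1.0 + S ** 2)) @ W
    X_neg = X - 0.5 * eps ** 2 * dE_dX + eps * rng.standard_normal(X.shape)
    return W - lr * (dE_dW(W, X) - dE_dW(W, X_neg))

With as many filter rows in W as input dimensions this corresponds to the square, noiseless-ICA case described in the abstract; extra rows give an overcomplete model.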

194 citations


01 Jan 2003
TL;DR: This paper shows that advance knowledge of the location of these modes can be incorporated into the MCMC sampler by introducing mode-hopping moves that satisfy detailed balance.
Abstract: One of the main shortcomings of Markov chain Monte Carlo samplers is their inability to mix between modes of the target distribution. In this paper we show that advance knowledge of the location of these modes can be incorporated into the MCMC sampler by introducing mode-hopping moves that satisfy detailed balance. The proposed sampling algorithm explores local mode structure through local MCMC moves (e.g. diffusion or Hybrid Monte Carlo) but in addition also represents the relative strengths of the different modes correctly using a set of global moves. This “mode-hopping” MCMC sampler can be viewed as a generalization of the darting method [1].
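
One simple variant of such a move, sketched here for illustration: assume the mode locations are known and well separated, so that translating a point's offset from its nearest mode onto another mode gives a symmetric proposal and the ordinary Metropolis test preserves detailed balance.

import numpy as np

rng = np.random.default_rng(1)

def mode_hop(x, log_p, modes):
    # Translate the offset from the nearest known mode onto a uniformly
    # chosen other mode. The reverse move picks the original mode with the
    # same probability, so the proposal is symmetric and the Metropolis
    # acceptance test below preserves detailed balance (assuming the
    # translation maps nearest-mode regions onto each other).
    i = min(range(len(modes)), key=lambda j: np.linalg.norm(x - modes[j]))
    k = rng.choice([j for j in range(len(modes)) if j != i])
    x_new = x + (modes[k] - modes[i])
    if np.log(rng.uniform()) < log_p(x_new) - log_p(x):
        return x_new
    return x

# Example target: an equal mixture of two unit-variance Gaussians at -10 and +10.
modes = [np.array([-10.0]), np.array([10.0])]
log_p = lambda x: np.logaddexp(-0.5 * np.sum((x - modes[0]) ** 2),
                               -0.5 * np.sum((x - modes[1]) ** 2))
x = np.array([-10.3])
for _ in range(100):
    x = mode_hop(x, log_p, modes)  # interleave with local MCMC moves in practice

This is only the simplest symmetric variant; the paper's global moves generalize the darting method [1].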

25 citations


Journal ArticleDOI
TL;DR: Within the neural network community, the "Hebbian" approach of using the product of pre- and postsynaptic activities to drive learning was seen as inferior to error-driven methods that use the product of the presynaptic activity and the postsynaptic activity derivative - the rate at which the objective function changes as the postsynaptic activity is changed.
Abstract: Modelers have come up with many different learning rules for neural networks. When a teacher specifies the correct output, error-driven rules work better than pure Hebb rules in which the changes in synapse strength depend on the correlation between pre- and postsynaptic activities. But for unsupervised learning, Hebb rules can be very effective if they are combined with suitable normalization or "unlearning" terms to prevent the synapses growing without bound. Hebb rules that use rates of change of activity instead of activity itself are useful for discovering perceptual invariants and may also provide a way of implementing error-driven learning.

It would be truly wonderful if randomly connected neural networks could turn themselves into useful computing devices by using some simple rule to modify the strengths of synapses. This was the hope that lay behind the original Hebb learning rule and it is the vision that has driven neural network modelers for half a century. Initially, researchers tried simulating various rules to see what would happen. After a decade or two of messing around, researchers realized that there was a much better way to explore the space of possible learning rules: First write down an objective function (a quantitative definition of how well the network is performing) and then use elementary calculus to derive a learning rule that will improve the objective function. For the last few decades, the big theoretical advances in learning rules for neural networks have been associated with new optimization methods and new ideas about what objective function should be optimized.

If we think of a neural network as a device for converting input vectors into output vectors, it is obvious that one sensible objective is to minimize some measure of the difference between the output the network actually produces and the output it ought to produce. This approach led to effective "error-driven" learning rules such as the Widrow-Hoff rule (Widrow & Hoff, 1960) and the perceptron convergence procedure (Rosenblatt, 1961), and it was later generalized to multilayer networks by using backpropagation of the errors to get training signals for intermediate "hidden" layers (Rumelhart, Hinton, & Williams, 1986). Within the neural network community, the "Hebbian" approach of using the product of pre- and postsynaptic activities to drive learning was seen as inferior to error-driven methods that use the product of the presynaptic activity and the postsynaptic activity derivative - the rate at which the objective function changes as the postsynaptic activity is changed. Even when the task was merely to associate random input vectors with random output vectors, it was shown that an error-driven rule worked much better than a Hebbian rule.

Unfortunately, error-driven learning has some serious drawbacks. It requires a teacher to specify the right answer and it is hard to see how neurons could implement the backpropagation required by multilayer versions. It is possible to get the teaching signal from the data itself by trying to predict the next term in a temporal sequence (Elman, 1991) or by trying to reconstruct the input data at the output (Hinton, 1989), but it is also possible to use quite different objective functions for learning. Some of these alternative objective functions lead to learning rules that are far more Hebbian in flavour.
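
For a single linear unit, the contrast drawn above can be written in a few lines; this is a generic illustration, not code from the article:

import numpy as np

def hebb_update(w, x, lr=0.01):
    # Pure Hebb rule: the weight change is the product of pre- and
    # postsynaptic activities. Without a normalization or "unlearning"
    # term, the weights grow without bound.
    y = w @ x                            # postsynaptic activity
    return w + lr * y * x

def delta_update(w, x, target, lr=0.01):
    # Error-driven (Widrow-Hoff) rule: the weight change is the presynaptic
    # activity times the derivative of the squared error with respect to the
    # postsynaptic activity, which for a linear unit is (target - y).
    y = w @ x
    return w + lr * (target - y) * x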
A common objective in processing high-dimensional data is to reduce the dimensionality without losing the ability to reconstruct the raw data from the reduced representation. If we measure the accuracy of the reconstruction by the squared error, the optimal strategy is to extract the principal components - the dominant directions of variation in the data. Oja (1982) showed how to extract the first principal component using Hebbian learning to maximize the squared output of a neuron combined with normalization of the synapse strengths to prevent them growing without bound. …
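
Oja's rule adds exactly such a normalizing term to the Hebbian product. A minimal sketch (illustrative; assumes zero-mean inputs and a small learning rate):

import numpy as np

def oja_update(w, x, lr=0.01):
    # Hebbian growth (y * x) plus a decay term (y^2 * w) that keeps the weight
    # vector bounded; on zero-mean data, w converges to the direction of the
    # first principal component.
    y = w @ x
    return w + lr * y * (x - y * w)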

23 citations


Proceedings Article
09 Dec 2003
TL;DR: This work shows how to improve brief MCMC by allowing long-range moves that are suggested by the data distribution; if the model is approximately correct, these long-range moves have a reasonable acceptance rate.
Abstract: In models that define probabilities via energies, maximum likelihood learning typically involves using Markov chain Monte Carlo to sample from the model's distribution. If the Markov chain is started at the data distribution, learning often works well even if the chain is only run for a few time steps [3]. But if the data distribution contains modes separated by regions of very low density, brief MCMC will not ensure that different modes have the correct relative energies because it cannot move particles from one mode to another. We show how to improve brief MCMC by allowing long-range moves that are suggested by the data distribution. If the model is approximately correct, these long-range moves have a reasonable acceptance rate.
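
A crude sketch of such a long-range move, assuming for illustration (this is not necessarily the paper's construction) that a Gaussian kernel density estimate of the training set serves as the proposal:

import numpy as np

rng = np.random.default_rng(2)

def kde_log_q(x, data, h=0.5):
    # Log-density of a Gaussian kernel estimate of the data distribution;
    # the normalizing factor common to all points is dropped because it
    # cancels in the acceptance ratio below.
    d2 = np.sum((data - x) ** 2, axis=1)
    return np.logaddexp.reduce(-0.5 * d2 / h ** 2) - np.log(len(data))

def long_range_move(x, log_p, data, h=0.5):
    # Independence proposal: jump to a training point blurred by kernel noise,
    # with the Metropolis-Hastings correction so the chain still targets p.
    # If the model is roughly right, p is large wherever the data sit, so
    # these jumps are accepted often enough to carry particles between modes.
    x_new = data[rng.integers(len(data))] + h * rng.standard_normal(x.shape)
    log_a = (log_p(x_new) - log_p(x)) + (kde_log_q(x, data, h) - kde_log_q(x_new, data, h))
    return x_new if np.log(rng.uniform()) < log_a else x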

16 citations


01 Jan 2003
TL;DR: This thesis develops RBMrate, a model for discretized continuous-valued data; describes sparse and over-complete representations of data for which inference is trivial because the features are computed by a feed-forward network; and contributes a theory relating belief propagation and iterative scaling to the Bethe free energy approximations.
Abstract: As the machine learning community tackles more complex and harder problems, the graphical models needed to solve such problems become larger and more complicated. As a result, performing inference and learning exactly for such graphical models becomes ever more expensive, and approximate inference and learning techniques become ever more prominent. There are a variety of techniques for approximate inference and learning in the literature. This thesis contributes some new ideas in the products of experts (PoEs) class of models (Hinton, 2002), and the Bethe free energy approximations (Yedidia et al., 2001). For PoEs, our contribution is in developing new PoE models for continuous-valued domains. We developed RBMrate, a model for discretized continuous-valued data. We applied it to face recognition to demonstrate its abilities. We also developed energy-based models (EBMs) - flexible probabilistic models in which the building blocks are energy terms computed using a feed-forward network. We show that standard square noiseless independent components analysis (ICA) (Bell and Sejnowski, 1995) can be viewed as a restricted form of EBMs. Extending this relationship with ICA, we describe sparse and over-complete representations of data for which the inference process is trivial, since the features are computed by the EBM's feed-forward network. For the Bethe free energy approximations, our contribution is a theory relating belief propagation and iterative scaling. We show that both belief propagation and iterative scaling updates can be derived as fixed point equations for constrained minimization of the Bethe free energy. This allows us to develop a new algorithm to directly minimize the Bethe free energy, and to apply the Bethe free energy to learning in addition to inference. We also describe improvements to the efficiency of standard learning algorithms for undirected graphical models (Jirousek and Preucil, 1995).
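
For reference, the Bethe free energy for a pairwise model (standard form, after Yedidia et al., 2001), whose constrained stationary points coincide with fixed points of belief propagation:

F_{\text{Bethe}}(b) = \sum_{(ij)} \sum_{x_i, x_j} b_{ij}(x_i, x_j)\,\bigl[ E_{ij}(x_i, x_j) + \ln b_{ij}(x_i, x_j) \bigr]
                    + \sum_i \sum_{x_i} b_i(x_i)\, E_i(x_i)
                    - \sum_i (d_i - 1) \sum_{x_i} b_i(x_i) \ln b_i(x_i)

It is minimized subject to normalization and the marginalization constraints \sum_{x_j} b_{ij}(x_i, x_j) = b_i(x_i), where d_i is the degree of node i.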

8 citations