
Showing papers by "Geoffrey E. Hinton" published in 1993


Proceedings Article
29 Nov 1993
TL;DR: It is shown that the recognition weights of an autoencoder can be used to compute an approximation to the Boltzmann distribution and that this approximation gives an upper bound on the description length.
Abstract: An autoencoder network uses a set of recognition weights to convert an input vector into a code vector. It then uses a set of generative weights to convert the code vector into an approximate reconstruction of the input vector. We derive an objective function for training autoencoders based on the Minimum Description Length (MDL) principle. The aim is to minimize the information required to describe both the code vector and the reconstruction error. We show that this information is minimized by choosing code vectors stochastically according to a Boltzmann distribution, where the generative weights define the energy of each possible code vector given the input vector. Unfortunately, if the code vectors use distributed representations, it is exponentially expensive to compute this Boltzmann distribution because it involves all possible code vectors. We show that the recognition weights of an autoencoder can be used to compute an approximation to the Boltzmann distribution and that this approximation gives an upper bound on the description length. Even when this bound is poor, it can be used as a Lyapunov function for learning both the generative and the recognition weights. We demonstrate that this approach can be used to learn factorial codes.
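
The bound in this abstract is easy to check numerically. Below is a minimal sketch (not the paper's code; the toy energies, the random factorial recognition distribution, and all variable names are illustrative assumptions): for a code small enough to enumerate, the exact Boltzmann free energy can be compared directly with the description length under the recognition approximation, which always comes out as an upper bound.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_code = 4                                   # small enough to enumerate all 2**4 codes
energies = rng.normal(size=2 ** n_code)      # E(c | input): one energy per code vector

# Exact description length: the Boltzmann free energy -log sum_c exp(-E(c)).
F_exact = -np.log(np.sum(np.exp(-energies)))

# The recognition weights would define a factorial distribution q over code bits;
# here q is random (hypothetical) rather than learned.
p_on = rng.uniform(0.1, 0.9, size=n_code)    # q(bit_i = 1)
codes = np.array(list(itertools.product([0, 1], repeat=n_code)))
q = np.prod(np.where(codes == 1, p_on, 1 - p_on), axis=1)

# Description length under q: expected energy minus entropy. The gap from
# F_exact is KL(q || Boltzmann), which is non-negative, so this is a bound.
F_bound = np.sum(q * energies) + np.sum(q * np.log(q))
print(f"exact free energy {F_exact:.3f} <= bound {F_bound:.3f}")
assert F_bound >= F_exact
```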

1,114 citations


Proceedings ArticleDOI
01 Aug 1993
TL;DR: A method of computing the derivatives of the expected squared error and of the amount of information in the noisy weights in a network that contains a layer of non-linear hidden units without time-consuming Monte Carlo simulations is described.
Abstract: Supervised neural networks generalize well if there is much less information in the weights than there is in the output vectors of the training cases. So during learning, it is important to keep the weights simple by penalizing the amount of information they contain. The amount of information in a weight can be controlled by adding Gaussian noise and the noise level can be adapted during learning to optimize the trade-off between the expected squared error of the network and the amount of information in the weights. We describe a method of computing the derivatives of the expected squared error and of the amount of information in the noisy weights in a network that contains a layer of non-linear hidden units. Provided the output units are linear, the exact derivatives can be computed efficiently without time-consuming Monte Carlo simulations. The idea of minimizing the amount of information that is required to communicate the weights of a neural network leads to a number of interesting schemes for encoding the weights.
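
For the linear-output case the abstract mentions, the expected squared error has a closed form, which is what makes Monte Carlo unnecessary. A minimal sketch follows (assuming a single linear output unit with independent Gaussian noise on its incoming weights and a zero-mean Gaussian prior; all names and values are illustrative): it checks the closed form against sampling, and measures the information in each noisy weight as a KL divergence, one natural choice under the MDL framing.

```python
import numpy as np

rng = np.random.default_rng(1)
h = rng.normal(size=8)                    # hidden activities for one training case
mu = rng.normal(size=8)                   # weight means
sigma = rng.uniform(0.05, 0.3, size=8)    # per-weight Gaussian noise levels
target = 1.0

# Closed form: E[(w.h - t)^2] = (mu.h - t)^2 + sum_i sigma_i^2 h_i^2,
# because the output is linear in the independently noisy weights.
exact = (mu @ h - target) ** 2 + np.sum(sigma ** 2 * h ** 2)

# Monte Carlo estimate of the same quantity, for comparison.
w = mu + sigma * rng.normal(size=(200_000, 8))
mc = np.mean((w @ h - target) ** 2)
print(f"closed form {exact:.4f}  vs  Monte Carlo {mc:.4f}")

# Information in each noisy weight, taken here as the KL divergence from the
# posterior N(mu_i, sigma_i^2) to a zero-mean Gaussian prior N(0, s_p^2).
s_p = 1.0
nats = 0.5 * (np.log(s_p ** 2 / sigma ** 2) + (sigma ** 2 + mu ** 2) / s_p ** 2 - 1)
print("nats per weight:", np.round(nats, 2))
```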

1,092 citations


Journal ArticleDOI
TL;DR: To illustrate the potential of multilayer neural networks for adaptive interfaces, a VPL Data-Glove connected to a DECtalk speech synthesizer via five neural networks was used to implement a hand-gesture to speech system, demonstrating that neural networks can be used to develop the complex mappings required in a high bandwidth interface that adapts to the individual user.
Abstract: To illustrate the potential of multilayer neural networks for adaptive interfaces, a VPL Data-Glove connected to a DECtalk speech synthesizer via five neural networks was used to implement a hand-gesture to speech system. Using minor variations of the standard backpropagation learning procedure, the complex mapping of hand movements to speech is learned using data obtained from a single 'speaker' in a simple training phase. With a 203 gesture-to-word vocabulary, the wrong word is produced less than 1% of the time, and no word is produced about 5% of the time. Adaptive control of the speaking rate and word stress is also available. The training times and final performance speed are improved by using small, separate networks for each naturally defined subtask. The system demonstrates that neural networks can be used to develop the complex mappings required in a high bandwidth interface that adapts to the individual user.

394 citations


Journal ArticleDOI
TL;DR: In 1944 a young soldier survived the war with a strange disability: although he could read and comprehend some words with ease, many others gave him trouble; for example, he saw wise and said "wisdom."
Abstract: In 1944 a young soldier suffered a bullet wound to the head. He survived the war with a strange disability: although he could read and comprehend some words with ease, many others gave him trouble. He read the word antique as "vase" and uncle as "nephew." The injury was devastating to the patient, G.R., but it provided invaluable information to researchers investigating the mechanisms by which the brain comprehends written language. A properly functioning system for converting letters on a page to spoken sounds reveals little of its inner structure, but when that system is disrupted, the peculiar pattern of the resulting dysfunction may offer essential clues to the original, undamaged architecture. During the past few years, computer simulations of brain function have advanced to the point where they can be used to model information-processing pathways. We have found that deliberate damage to artificial systems can mimic the symptoms displayed by people who have sustained brain injury. Indeed, building a model that makes the same errors as brain-injured people do gives us confidence that we are on the right track in trying to understand how the brain works. We have yet to make computer models that exhibit even a tiny fraction of the capabilities of the human brain. Nevertheless, our results so far have produced unexpected insights into the way the brain transforms a string of letter shapes into the meaning of a word. When John C. Marshall and Freda Newcombe of the University of Oxford analyzed G.R.'s residual problems in 1966, they found a highly idiosyncratic pattern of reading deficits. In addition to his many semantic errors, G.R. made visual ones, reading stock as "shock" and crowd as "crown." Many of his misreadings resembled the correct word in both form and meaning; for example, he saw wise and said "wisdom." Detailed testing showed that G.R. could read concrete words, such as table, much more easily than abstract words, such as truth. He was fair at reading nouns (46 percent correct), worse at adjectives (16 percent), still worse at verbs (6 percent) and worst of all at function words.

85 citations


Journal ArticleDOI
TL;DR: The decision-directed least-mean-square algorithm is shown to be an approximation to maximizing the likelihood that the equalizer outputs come from such an independently and identically distributed source.
Abstract: An adaptation algorithm for equalizers operating on very distorted channels is presented. The algorithm is based on the idea of adjusting the equalizer tap gains to maximize the likelihood that the equalizer outputs would be generated by a mixture of two Gaussians with known means. The decision-directed least-mean-square algorithm is shown to be an approximation to maximizing the likelihood that the equalizer outputs come from such an independently and identically distributed source. The algorithm is developed in the context of a binary pulse-amplitude-modulation channel, and simulations demonstrate that the algorithm converges in channels for which the decision-directed LMS algorithm does not converge.
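
The relationship to decision-directed LMS can be made concrete. Below is a minimal sketch (a toy channel, noise level, and step size chosen for illustration, not the paper's experimental setup): gradient ascent on the log-likelihood of a two-Gaussian mixture with means +/-1 yields an update in which a soft tanh decision replaces the hard sign decision of decision-directed LMS.

```python
import numpy as np

rng = np.random.default_rng(2)
symbols = rng.choice([-1.0, 1.0], size=5000)          # binary PAM source
channel = np.array([0.3, 1.0, 0.3])                   # toy distorted channel
received = np.convolve(symbols, channel, mode="same")
received += 0.1 * rng.normal(size=received.size)

n_taps, var, lr = 7, 0.25, 0.01
w = np.zeros(n_taps)
w[n_taps // 2] = 1.0                                  # center-spike initialization

half = n_taps // 2
for k in range(half, received.size - half):
    x = received[k - half : k + half + 1][::-1]       # tap-delay-line contents
    y = w @ x                                         # equalizer output
    # For a mixture of two Gaussians with means +/-1 and variance var, the
    # gradient of log p(y) is (tanh(y/var) - y)/var: the soft decision
    # tanh(y/var) replaces the hard sign(y) of decision-directed LMS.
    w += lr * (np.tanh(y / var) - y) * x

print("final equalizer taps:", np.round(w, 3))
```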

72 citations


Journal ArticleDOI
29 Nov 1993
TL;DR: In this paper, the minimum description length principle is used to train the hidden units of a neural network to extract a representation that is cheap to describe but nonetheless allows the input to be reconstructed accurately.
Abstract: The Minimum Description Length principle (MDL) can be used to train the hidden units of a neural network to extract a representation that is cheap to describe but nonetheless allows the input to be reconstructed accurately. We show how MDL can be used to develop highly redundant population codes. Each hidden unit has a location in a low-dimensional implicit space. If the hidden unit activities form a bump of a standard shape in this space, they can be cheaply encoded by the center of this bump. So the weights from the input units to the hidden units in an autoencoder are trained to make the activities form a standard bump. The coordinates of the hidden units in the implicit space are also learned, thus allowing flexibility, as the network develops a discontinuous topography when presented with different input classes. Population coding in a space other than the input enables a network to extract nonlinear higher-order properties of the inputs.
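
The encoding idea is easy to illustrate: if the hidden activities form a standard-shaped bump over the units' implicit coordinates, the whole population can be summarized by a single number, the bump's center. A minimal sketch follows (a 1-D implicit space, a Gaussian bump, and a grid search stand in for the learned machinery; all names and values are assumptions).

```python
import numpy as np

rng = np.random.default_rng(3)
locations = np.linspace(0.0, 1.0, 20)    # hidden units' coordinates in implicit space
width = 0.1                              # width of the "standard" bump

def bump(center):
    return np.exp(-((locations - center) ** 2) / (2 * width ** 2))

true_center = 0.63
activities = bump(true_center) + 0.02 * rng.normal(size=locations.size)

# The cheap code is the single number that best explains the population activity.
grid = np.linspace(0.0, 1.0, 1001)
costs = [np.sum((activities - bump(c)) ** 2) for c in grid]
decoded = grid[int(np.argmin(costs))]
print(f"true center {true_center:.3f}, decoded center {decoded:.3f}")
```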

59 citations


Journal ArticleDOI
TL;DR: Two new models that handle surfaces with discontinuities are proposed; they develop a mixture of expert interpolators and invoke specialized, asymmetric interpolators that do not cross the discontinuities.
Abstract: We have previously described an unsupervised learning procedure that discovers spatially coherent properties of the world by maximizing the information that parameters extracted from different parts of the sensory input convey about some common underlying cause. When given random dot stereograms of curved surfaces, this procedure learns to extract surface depth because that is the property that is coherent across space. It also learns how to interpolate the depth at one location from the depths at nearby locations (Becker and Hinton 1992b). In this paper, we propose two new models that handle surfaces with discontinuities. The first model attempts to detect cases of discontinuities and reject them. The second model develops a mixture of expert interpolators. It learns to detect the locations of discontinuities and to invoke specialized, asymmetric interpolators that do not cross the discontinuities.
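
Why asymmetric interpolators help at a discontinuity can be shown with a toy example. The sketch below (hand-built 1-D depths and experts, with a softmax on prediction error standing in for the learned discontinuity detector) shows a symmetric interpolator failing at a step while one-sided experts that do not cross it succeed.

```python
import numpy as np

depths = np.array([1.0, 1.0, 1.0, 3.0, 3.0])   # step discontinuity after location 2
center = 2                                      # location whose depth we predict

predictions = {
    "symmetric": 0.5 * (depths[center - 1] + depths[center + 1]),
    "left-only": 2 * depths[center - 1] - depths[center - 2],    # extrapolate from the left
    "right-only": 2 * depths[center + 1] - depths[center + 2],   # extrapolate from the right
}

# Softmax of negative squared error stands in for the learned gating network.
errors = np.array([(p - depths[center]) ** 2 for p in predictions.values()])
gates = np.exp(-errors) / np.sum(np.exp(-errors))
for (name, pred), g in zip(predictions.items(), gates):
    print(f"{name:10s} prediction {pred:4.1f}  responsibility {g:.2f}")
```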

58 citations


Book ChapterDOI
13 Sep 1993
TL;DR: A method of computing the derivatives of the expected squared error and of the amount of information in the noisy weights in a network that contains a layer of non-linear hidden units without time-consuming Monte Carlo simulations is described.
Abstract: Supervised neural networks generalize well if there is much less information in the weights than there is in the output vectors of the training cases. So during learning, it is important to keep the weights simple by penalizing the amount of information they contain. The amount of information in a weight can be controlled by adding Gaussian noise and the noise level can be adapted during learning to optimize the trade-off between the expected squared error and the information in the weights. We describe a method of computing the derivatives of the expected squared error and of the amount of information in the noisy weights in a network that contains a layer of non-linear hidden units. Provided the output units are linear, the exact derivatives can be computed efficiently without time-consuming Monte Carlo simulations.

44 citations


01 Jan 1993
TL;DR: It is shown how the Minimum Description Length principle (MDL) can be used to train the hidden units of a neural network to develop a population code for the instantiation parameters of an object in an image.
Abstract: An efficient and useful representation for an object viewed from different positions is in terms of its instantiation parameters. We show how the Minimum Description Length principle (MDL) can be used to train the hidden units of a neural network to develop a population code for the instantiation parameters of an object in an image. Each hidden unit has a location in a low-dimensional implicit space. If the hidden unit activities form a standard shape (a bump) in this space, they can be cheaply encoded by the center of this bump. So the weights from the input units to the hidden units in a self-supervised network are trained to make the activities form a bump. The coordinates of the hidden units in the implicit space are also learned, thus allowing flexibility, as the network develops separate population codes when presented with different objects.