
Showing papers by "Geoffrey E. Hinton" published in 1997


Dissertation
01 Jan 1997
TL;DR: It is shown that a Bayesian approach to learning in multi-layer perceptron neural networks achieves better performance than the commonly used early stopping procedure, even for reasonably short amounts of computation time.
Abstract: This thesis develops two Bayesian learning methods relying on Gaussian processes and a rigorous statistical approach for evaluating such methods. In these experimental designs, the sources of uncertainty in the estimated generalisation performance due to variation in both training and test sets are accounted for. The framework allows for estimation of generalisation performance as well as statistical tests of significance for pairwise comparisons. Two experimental designs are recommended and supported by the DELVE software environment. Two new non-parametric Bayesian learning methods relying on Gaussian process priors over functions are developed. These priors are controlled by hyperparameters which set the characteristic length scale for each input dimension. In the simplest method, these parameters are fit from the data using optimization. In the second, fully Bayesian method, a Markov chain Monte Carlo technique is used to integrate over the hyperparameters. One advantage of these Gaussian process methods is that the priors and hyperparameters of the trained models are easy to interpret. The Gaussian process methods are benchmarked against several other methods, on regression tasks using both real data and data generated from realistic simulations. The experiments show that small datasets are unsuitable for benchmarking purposes because the uncertainties in performance measurements are large. A second set of experiments provides strong evidence that the bagging procedure is advantageous for the Multivariate Adaptive Regression Splines (MARS) method. The simulated datasets have controlled characteristics which make them useful for understanding the relationship between properties of the dataset and the performance of different methods. The dependency of the performance on available computation time is also investigated. It is shown that a Bayesian approach to learning in multi-layer perceptron neural networks achieves better performance than the commonly used early stopping procedure, even for reasonably short amounts of computation time. The Gaussian process methods are shown to consistently outperform the more conventional methods.
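
As a rough illustration of the simpler of the two Gaussian process methods (one characteristic length scale per input dimension, fit by optimizing the marginal likelihood), here is a minimal numpy/scipy sketch. The toy data, variable names and optimizer choice are illustrative assumptions, not the thesis's DELVE implementation.

```python
# Sketch: GP regression with a length scale per input dimension (ARD),
# hyperparameters fit by maximizing the log marginal likelihood.
import numpy as np
from scipy.optimize import minimize

def ard_kernel(X1, X2, log_ell, log_sf):
    # Squared-exponential kernel with a characteristic length scale per dimension.
    ell = np.exp(log_ell)                       # shape (D,)
    sf2 = np.exp(2.0 * log_sf)                  # signal variance
    diff = (X1[:, None, :] - X2[None, :, :]) / ell
    return sf2 * np.exp(-0.5 * np.sum(diff ** 2, axis=-1))

def neg_log_marginal_likelihood(params, X, y):
    D = X.shape[1]
    log_ell, log_sf, log_sn = params[:D], params[D], params[D + 1]
    K = ard_kernel(X, X, log_ell, log_sf) + np.exp(2.0 * log_sn) * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(y) * np.log(2 * np.pi)

def fit_and_predict(X, y, X_test):
    D = X.shape[1]
    init = np.zeros(D + 2)                      # log length scales, log signal std, log noise std
    res = minimize(neg_log_marginal_likelihood, init, args=(X, y), method="L-BFGS-B")
    log_ell, log_sf, log_sn = res.x[:D], res.x[D], res.x[D + 1]
    K = ard_kernel(X, X, log_ell, log_sf) + np.exp(2.0 * log_sn) * np.eye(len(X))
    Ks = ard_kernel(X_test, X, log_ell, log_sf)
    return Ks @ np.linalg.solve(K, y), np.exp(log_ell)   # predictive mean, fitted length scales

# Toy usage: the second input dimension is irrelevant, so its fitted
# length scale should come out large (easy to interpret, as noted above).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=60)
mean, ell = fit_and_predict(X, y, X[:5])
```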

467 citations


Journal ArticleDOI
TL;DR: Two new methods for modeling the manifolds of digitized images of handwritten digits are described, one grounded in principal components analysis and the other in factor analysis, both based on locally linear low-dimensional approximations to the underlying data manifold.
Abstract: This paper describes two new methods for modeling the manifolds of digitized images of handwritten digits. The models allow a priori information about the structure of the manifolds to be combined with empirical data. Accurate modeling of the manifolds allows digits to be discriminated using the relative probability densities under the alternative models. One of the methods is grounded in principal components analysis, the other in factor analysis. Both methods are based on locally linear low-dimensional approximations to the underlying data manifold. Links with other methods that model the manifold are discussed.
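
The classification rule described above (discriminate digits by their relative probability densities under per-class models) can be sketched as follows. This simplified version fits one global probabilistic PCA density per digit class, whereas the paper's models are mixtures of locally linear approximations, so the sketch only illustrates the discrimination step.

```python
# Sketch: label a test image with the class whose density model assigns it
# the highest log-density. Global PPCA stands in for the paper's locally
# linear manifold models.
import numpy as np
from sklearn.decomposition import PCA

def fit_class_models(images_by_class, n_components=10):
    # images_by_class: dict mapping digit label -> array of shape (n_images, n_pixels)
    models = {}
    for label, images in images_by_class.items():
        model = PCA(n_components=n_components)   # probabilistic PCA density
        model.fit(images)
        models[label] = model
    return models

def classify(models, test_images):
    # score_samples returns the log-likelihood of each image under the fitted PPCA model
    log_density = {label: m.score_samples(test_images) for label, m in models.items()}
    labels = sorted(log_density)
    stacked = np.stack([log_density[l] for l in labels], axis=1)
    return np.array(labels)[np.argmax(stacked, axis=1)]
```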

388 citations


Journal ArticleDOI
TL;DR: A hierarchical, generative model is described that can be viewed as a nonlinear generalization of factor analysis and can be implemented in a neural network that learns to extract sparse, distributed, hierarchical representations.
Abstract: We describe a hierarchical, generative model that can be viewed as a nonlinear generalization of factor analysis and can be implemented in a neural network. The model uses bottom-up, top-down and lateral connections to perform Bayesian perceptual inference correctly. Once perceptual inference has been performed the connection strengths can be updated using a very simple learning rule that only requires locally available information. We demonstrate that the network learns to extract sparse, distributed, hierarchical representations.
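
To make "nonlinear generalization of factor analysis" concrete, here is a hypothetical top-down ancestral-sampling sketch in which each layer is a nonlinear function of a factor-analysis-style linear combination of the layer above plus Gaussian noise. The paper's actual contribution, Bayesian perceptual inference and learning with bottom-up, top-down and lateral connections, is not reproduced here; the layer sizes and nonlinearity are illustrative assumptions.

```python
# Sketch: sampling top-down from a hierarchical, nonlinear, factor-analysis-like
# generative model.
import numpy as np

rng = np.random.default_rng(1)

def sample_top_down(layer_sizes, noise_std=0.1):
    # layer_sizes: from the top hidden layer down to the visible layer, e.g. [5, 20, 64]
    weights = [rng.normal(scale=0.5, size=(m, n))
               for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
    h = rng.normal(size=layer_sizes[0])            # top-level factors
    for W in weights:
        pre = h @ W + noise_std * rng.normal(size=W.shape[1])
        h = np.maximum(pre, 0.0)                   # nonlinearity makes the model non-Gaussian
    return h                                       # a sampled "sensory" vector

sample = sample_top_down([5, 20, 64])
```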

227 citations


Journal ArticleDOI
TL;DR: Circumstances are shown under which applying the RPP is guaranteed to increase the mean return, even though it can make large changes in the values of the parameters.
Abstract: We discuss Hinton's (1989) relative payoff procedure (RPP), a static reinforcement learning algorithm whose foundation is not stochastic gradient ascent. We show circumstances under which applying the RPP is guaranteed to increase the mean return, even though it can make large changes in the values of the parameters. The proof is based on a mapping between the RPP and a form of the expectation-maximization procedure of Dempster, Laird, and Rubin (1977).
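
A toy sketch of the RPP for a set of independent Bernoulli action units follows. The reward function and dimensions are made up, but the update itself, which moves each firing probability to the reward-weighted average of its past actions and can therefore change the parameters by large amounts in a single batch, is the batch form of the procedure the abstract refers to.

```python
# Sketch: relative payoff procedure (RPP) on a toy bandit-style problem.
import numpy as np

rng = np.random.default_rng(0)
target = np.array([1, 0, 1, 1, 0])          # hypothetical "good" action pattern

def reward(actions):
    # Mean return is higher the closer the binary action vector is to the target.
    return np.mean(actions == target)

p = np.full(5, 0.5)                          # Bernoulli firing probabilities
for batch in range(50):
    actions = (rng.random((200, 5)) < p).astype(float)    # sample 200 trials
    r = np.array([reward(a) for a in actions])
    # RPP update: reward-weighted average action, normalized by the average reward.
    p = (r @ actions) / (r.sum() + 1e-12)
print(p.round(2))                            # probabilities move toward the target pattern
```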

201 citations


Journal ArticleDOI
TL;DR: A neural network can be used to allow a mobile robot to derive an accurate estimate of its location from noisy sonar sensors and noisy motion information, and the robot can learn the relationship between location and sonar readings without requiring an external supervision signal.
Abstract: We show how a neural network can be used to allow a mobile robot to derive an accurate estimate of its location from noisy sonar sensors and noisy motion information. The robot's model of its locat...

75 citations


Proceedings ArticleDOI
07 Jul 1997
TL;DR: This paper shows how the GTM algorithm can be extended to model time series by incorporating it as the emission density in a hidden Markov model and illustrates the performance of GTM through time using flight recorder data from a helicopter.
Abstract: The standard GTM (generative topographic mapping) algorithm assumes that the data on which it is trained consists of independent, identically distributed (i.i.d.) vectors. For time series, however, the i.i.d. assumption is a poor approximation. In this paper we show how the GTM algorithm can be extended to model time series by incorporating it as the emission density in a hidden Markov model. Since GTM has discrete hidden states we are able to find a tractable EM algorithm, based on the forward-backward algorithm, to train the model. We illustrate the performance of GTM through time using flight recorder data from a helicopter.
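
The "tractable EM based on the forward-backward algorithm" step can be sketched independently of the GTM emission model: given per-timestep densities of each observation under each latent grid point, the scaled forward-backward recursions below compute the posteriors needed for the E-step. The GTM emission density itself (a Gaussian centred on a nonlinear mapping of each grid point) is assumed to be supplied by the caller; this is an illustrative sketch, not the authors' code.

```python
# Sketch: scaled forward-backward recursions for an HMM whose hidden states
# are GTM's discrete grid of latent points.
import numpy as np

def forward_backward(emission, A, pi):
    # emission: (T, K) with p(x_t | state k); A: (K, K) transitions; pi: (K,) initial dist.
    T, K = emission.shape
    alpha = np.zeros((T, K)); beta = np.zeros((T, K)); scale = np.zeros(T)
    alpha[0] = pi * emission[0]
    scale[0] = alpha[0].sum(); alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * emission[t]
        scale[t] = alpha[t].sum(); alpha[t] /= scale[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (emission[t + 1] * beta[t + 1])) / scale[t + 1]
    gamma = alpha * beta                      # posterior over latent grid points per timestep
    log_lik = np.log(scale).sum()             # data log-likelihood, for monitoring EM
    return gamma, log_lik
```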

64 citations


Dissertation
01 Jan 1997
TL;DR: The Bayesian network framework exposes the similarities between these codes and leads the way to a new class of "trellis-constraint codes" which also operate close to Shannon's limit.
Abstract: Pattern classification, data compression, and channel coding are tasks that usually must deal with complex but structured natural or artificial systems. Patterns that we wish to classify are a consequence of a causal physical process. Images that we wish to compress are also a consequence of a causal physical process. Noisy outputs from a telephone line are corrupted versions of a signal produced by a structured man-made telephone modem. Not only are these tasks characterized by complex structure, but they also contain random elements. Graphical models such as Bayesian networks provide a way to describe the relationships between random variables in a stochastic system. In this thesis, I use Bayesian networks as an overarching framework to describe and solve problems in the areas of pattern classification, data compression, and channel coding. Results on the classification of handwritten digits show that Bayesian network pattern classifiers outperform other standard methods, such as the k-nearest neighbor method. When Bayesian networks are used as source models for data compression, an exponentially large number of codewords are associated with each input pattern. It turns out that the code can still be used efficiently, if a new technique called "bits-back coding" is used. Several new error-correcting decoding algorithms are instances of "probability propagation" in various Bayesian networks. These new schemes are rapidly closing the gap between the performances of practical channel coding systems and Shannon's 50-year-old channel coding limit. The Bayesian network framework exposes the similarities between these codes and leads the way to a new class of "trellis-constraint codes" which also operate close to Shannon's limit.

39 citations


Journal ArticleDOI
TL;DR: It turns out that a commonly used technique for determining parameters, maximum-likelihood estimation, actually minimizes the bits-back coding cost when codewords are chosen according to the Boltzmann distribution.
Abstract: In this paper, we introduce a new algorithm called "bits-back coding" that makes stochastic source codes efficient. For a given one-to-many source code, we show that this algorithm can actually be more efficient than the algorithm that always picks the shortest codeword. Optimal efficiency is achieved when codewords are chosen according to the Boltzmann distribution based on the codeword lengths. It turns out that a commonly used technique for determining parameters, maximum-likelihood estimation, actually minimizes the bits-back coding cost when codewords are chosen according to the Boltzmann distribution. A tractable approximation to maximum-likelihood estimation, the generalized expectation-maximization algorithm, minimizes the bits-back coding cost. After presenting a binary Bayesian network model that assigns exponentially many codewords to each symbol, we show how a tractable approximation to the Boltzmann distribution can be used for bits-back coding. We illustrate the performance of bits-back coding using non-synthetic data with a binary Bayesian network source model that produces 2^60 possible codewords for each input symbol. The rate for bits-back coding is nearly one half of that obtained by picking the shortest codeword.
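
The core bits-back accounting can be checked numerically with made-up codeword lengths: the net rate is the expected codeword length minus the entropy of the codeword-choice distribution (those entropy bits are recovered by the decoder), and the Boltzmann choice q proportional to 2^(-L) attains the minimum -log2(sum 2^(-L)), which is below the length of the shortest codeword.

```python
# Sketch: bits-back cost E_q[L] - H(q) for one symbol with several codewords.
# The lengths below are made up for illustration.
import numpy as np

L = np.array([3.0, 4.0, 4.0, 6.0])            # codeword lengths in bits for one symbol

def bits_back_cost(q, L):
    nz = q > 0
    entropy = -np.sum(q[nz] * np.log2(q[nz]))
    return np.sum(q * L) - entropy

shortest_only = np.array([1.0, 0.0, 0.0, 0.0])        # always pick the shortest codeword
boltzmann = 2.0 ** (-L) / np.sum(2.0 ** (-L))          # optimal stochastic choice

print(bits_back_cost(shortest_only, L))                # 3.0 bits
print(bits_back_cost(boltzmann, L))                    # about 1.91 bits, below the shortest length
print(-np.log2(np.sum(2.0 ** (-L))))                   # matches the optimal cost
```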

26 citations


Proceedings Article
01 Dec 1997
TL;DR: In this article, a hierarchical, generative model is described that can be viewed as a non-linear generalisation of factor analysis and can be implemented in a neural network; it performs perceptual inference in a probabilistically consistent manner by using top-down, bottom-up and lateral connections.
Abstract: We first describe a hierarchical, generative model that can be viewed as a non-linear generalisation of factor analysis and can be implemented in a neural network. The model performs perceptual inference in a probabilistically consistent manner by using top-down, bottom-up and lateral connections. These connections can be learned using simple rules that require only locally available information. We then show how to incorporate lateral connections into the generative model. The model extracts a sparse, distributed, hierarchical representation of depth from simplified random-dot stereograms and the localised disparity detectors in the first hidden layer form a topographic map. When presented with image patches from natural scenes, the model develops topographically organised local feature detectors.

21 citations


Book
01 Sep 1997
TL;DR: The "wake-sleep" algorithm that allows a multilayer, unsupervised, neural network to build a hierarchy of representations of sensory input is described; in the "sleep" phase the network is driven top-down by the generative connections to produce a fantasized representation and a fantasized sensory input.
Abstract: We describe the "wake-sleep" algorithm that allows a multilayer, unsupervised, neural network to build a hierarchy of representations of sensory input. The network has bottom-up "recognition" connections that are used to convert sensory input into underlying representations. Unlike most artificial neural networks, it also has top-down "generative" connections that can be used to reconstruct the sensory input from the representations. In the "wake" phase of the learning algorithm, the network is driven by the bottom-up recognition connections and the top-down generative connections are trained to be better at reconstructing the sensory input from the representation chosen by the recognition process. In the "sleep" phase, the network is driven top-down by the generative connections to produce a fantasized representation and a fantasized sensory input. The recognition connections are then trained to be better at recovering the fantasized representation from the fantasized sensory input. In both phases, the synaptic learning rule is simple and local. The combined effect of the two phases is to create representations of the sensory input that are efficient in the following sense: on average, it takes more bits to describe each sensory input vector directly than to first describe the representation of the sensory input chosen by the recognition process and then describe the difference between the sensory input and its reconstruction from the chosen representation.
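
A condensed, single-hidden-layer sketch of the two phases is given below. The network described in the abstract is multilayer and the data here are toy binary vectors, but the simple, local delta-rule updates follow the standard wake-sleep formulation; dimensions and learning rate are illustrative assumptions.

```python
# Sketch: wake-sleep learning for one layer of stochastic binary hidden units.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
n_vis, n_hid, lr = 16, 8, 0.05

R = np.zeros((n_vis, n_hid))     # bottom-up "recognition" weights
G = np.zeros((n_hid, n_vis))     # top-down "generative" weights
g_bias = np.zeros(n_hid)         # generative bias on the hidden units

data = (rng.random((500, n_vis)) < 0.3).astype(float)   # toy binary "sensory" vectors

for epoch in range(20):
    for x in data:
        # Wake phase: recognize, then make the generative weights better at
        # reconstructing the input from the chosen representation.
        h = (rng.random(n_hid) < sigmoid(R.T @ x)).astype(float)
        G += lr * np.outer(h, x - sigmoid(G.T @ h))
        g_bias += lr * (h - sigmoid(g_bias))
        # Sleep phase: fantasize top-down, then make the recognition weights
        # better at recovering the fantasized representation from the fantasy.
        h_f = (rng.random(n_hid) < sigmoid(g_bias)).astype(float)
        x_f = (rng.random(n_vis) < sigmoid(G.T @ h_f)).astype(float)
        R += lr * np.outer(x_f, h_f - sigmoid(R.T @ x_f))
```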

19 citations


Journal ArticleDOI
TL;DR: By training a neural network to predict how a deformable model should be instantiated from an input image, improved starting points can be obtained and the search time can be significantly reduced without compromising recognition performance.

Dissertation
01 Jan 1997
TL;DR: This thesis presents PSP as a combinatorial optimization problem, and presents and analyzes a fast procedure, based on multinomial sampling and a novel coding scheme, that avoids the exhaustive search, prior limits on the order k, and exponentially large parameter space of other methods.
Abstract: The protein structure prediction problem (PSP) is one of the central problems in molecular and structural biology. A computational method that could produce a correct detailed three-dimensional structural model for a protein, given its linear sequence of amino acids, would greatly accelerate progress in the biomedical sciences and industries. This thesis presents PSP as a combinatorial optimization problem, the most straightforward formulations of which require search of an exponentially-large conformation space and are known to be NP-hard. This otherwise intractable search can in practice be reduced or eliminated through the discovery and use of motifs. Motifs are abstractions of observed patterns that encode structurally important relationships among constituent parts of a complex object like a protein tertiary structure. Motif discovery is accomplished by particular combinatorial search and statistical estimation methods. This thesis explores in detail two particular motif discovery subproblems, and discusses how their solutions can be applied to the overall structure prediction problem: (1) For a complex multi-stage prediction task, what makes a good intermediate representation language? We address this question by presenting and analyzing methods for the discovery of protein secondary structure classes that are more predictable from amino acid sequence than the standard classes of $\alpha$-helix, $\beta$-sheet, and "random coil". (2) Given a database of M objects, each characterized by values $a_{ij} \in \mathcal{A}_j$ for each of N discrete variables $\{c_j\}_{j=1}^{N}$, return the list of "most interesting" higher-order features $\gamma_l$, i.e., sets of $k_l$ variables with highest estimated correlation, for any $2 \le k_l \le N$. In the PSP context, the problem is the detection of correlations between amino acid residues in an aligned set of evolutionarily-related protein sequences. We present and analyze a fast procedure, based on multinomial sampling and a novel coding scheme, that avoids the exhaustive search, prior limits on the order k, and exponentially large parameter space of other methods. The focus of this thesis is PSP, but the techniques and analysis are also aimed at wider application to other hard, multi-stage prediction problems.
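
As a point of reference for subproblem (2), a brute-force version of "score a set of discrete columns by how correlated they are" can be written with a total-correlation (multi-information) score: the bits per row saved by coding the columns jointly rather than independently. The thesis's fast multinomial-sampling procedure and coding scheme, which avoid this exhaustive enumeration, are not reproduced here; the data and function names are illustrative assumptions.

```python
# Sketch: exhaustive scoring of k-column subsets by total correlation.
import numpy as np
from collections import Counter
from itertools import combinations

def entropy(counts, n):
    p = np.array(list(counts.values())) / n
    return -np.sum(p * np.log2(p))

def total_correlation(data, cols):
    # data: list of rows of discrete symbols; cols: tuple of column indices
    n = len(data)
    joint = Counter(tuple(row[c] for c in cols) for row in data)
    marginals = [Counter(row[c] for row in data) for c in cols]
    return sum(entropy(m, n) for m in marginals) - entropy(joint, n)

def most_interesting(data, k, top=5):
    n_cols = len(data[0])
    scored = [(total_correlation(data, cols), cols)
              for cols in combinations(range(n_cols), k)]   # exhaustive; feasible for small k only
    return sorted(scored, reverse=True)[:top]

# Toy usage: columns 0 and 2 are copies of each other, so the pair (0, 2)
# should receive the highest score.
rng = np.random.default_rng(0)
rows = [[a, rng.integers(4), a, rng.integers(4)] for a in rng.integers(4, size=300)]
print(most_interesting(rows, k=2))
```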