Showing papers by "Michael I. Jordan published in 1996"


Journal ArticleDOI
TL;DR: This article reviews how optimal data selection techniques have been used with feedforward neural networks, and shows how the same principles may be used to select data for two alternative, statistically-based learning architectures: mixtures of Gaussians and locally weighted regression.
Abstract: For many types of machine learning algorithms, one can compute the statistically "optimal" way to select training data. In this paper, we review how optimal data selection techniques have been used with feedforward neural networks. We then show how the same principles may be used to select data for two alternative, statistically-based learning architectures: mixtures of Gaussians and locally weighted regression. While the techniques for neural networks are computationally expensive and approximate, the techniques for mixtures of Gaussians and locally weighted regression are both efficient and accurate. Empirically, we observe that the optimality criterion sharply decreases the number of training examples the learner needs in order to achieve good performance.
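As a hedged sketch of the general idea (not the paper's exact criterion or architectures), the code below selects, from a pool of unlabeled candidates, the query whose answer is expected to most reduce the average predictive variance of a simple linear least-squares learner; the function and variable names are illustrative assumptions.

```python
import numpy as np

def select_query(X_train, candidate_pool, reference_points, sigma_noise=1.0, ridge=1e-6):
    """Pick the candidate input whose label, once observed, would most reduce
    the learner's average predictive variance over a set of reference points
    (a variance-based optimality criterion for a linear least-squares learner)."""
    d = X_train.shape[1]
    A = X_train.T @ X_train + ridge * np.eye(d)      # current information matrix
    best_x, best_score = None, np.inf
    for x in candidate_pool:
        A_new = A + np.outer(x, x)                   # information matrix after adding x
        cov = sigma_noise ** 2 * np.linalg.inv(A_new)
        # expected predictive variance, averaged over the reference points
        score = np.mean([r @ cov @ r for r in reference_points])
        if score < best_score:
            best_score, best_x = score, x
    return best_x, best_score

# usage sketch with synthetic data
rng = np.random.default_rng(0)
X_train = rng.normal(size=(10, 3))
pool = rng.normal(size=(50, 3))
refs = rng.normal(size=(200, 3))
x_star, expected_var = select_query(X_train, pool, refs)
```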

2,122 citations


Journal ArticleDOI
TL;DR: The mathematical connection between the Expectation-Maximization (EM) algorithm and gradient-based approaches for maximum likelihood learning of finite gaussian mixtures is established, and an explicit expression is provided for the projection matrix that relates the EM step to the gradient.
Abstract: We build up the mathematical connection between the “Expectation-Maximization” (EM) algorithm and gradient-based approaches for maximum likelihood learning of finite gaussian mixtures. We show that the EM step in parameter space is obtained from the gradient via a projection matrix P, and we provide an explicit expression for the matrix. We then analyze the convergence of EM in terms of special properties of P and provide new results analyzing the effect that P has on the likelihood surface. Based on these mathematical results, we present a comparative discussion of the advantages and disadvantages of EM and other algorithms for the learning of gaussian mixture models.
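In schematic form (with $\theta$ denoting the mixture parameters and $\ell(\theta)$ the log-likelihood; the explicit block structure of $P$ is what the paper derives), the stated connection can be written as

$$\theta^{(k+1)} \;=\; \theta^{(k)} \;+\; P\!\left(\theta^{(k)}\right)\,\left.\frac{\partial \ell}{\partial \theta}\right|_{\theta = \theta^{(k)}},$$

so each EM step moves along a rescaled gradient direction, and the special properties of $P$ analyzed in the paper govern how EM behaves on the likelihood surface.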

849 citations


Journal ArticleDOI
TL;DR: The utility of a mean field theory for sigmoid belief networks based on ideas from statistical mechanics is demonstrated on a benchmark problem in statistical pattern recognition: the classification of handwritten digits.
Abstract: We develop a mean field theory for sigmoid belief networks based on ideas from statistical mechanics. Our mean field theory provides a tractable approximation to the true probability distribution in these networks; it also yields a lower bound on the likelihood of evidence. We demonstrate the utility of this framework on a benchmark problem in statistical pattern recognition: the classification of handwritten digits.
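As a sketch of the kind of bound involved (notation assumed here: $E$ for the observed evidence, $H$ for the hidden units, and $Q$ a factorized mean-field distribution), Jensen's inequality gives the standard lower bound that a mean field approximation of this type optimizes:

$$\log P(E) \;\ge\; \sum_{H} Q(H)\,\log P(H, E) \;-\; \sum_{H} Q(H)\,\log Q(H), \qquad Q(H) = \prod_i q_i(h_i).$$

Maximizing the right-hand side over the factors $q_i$ tightens the bound while keeping the computation tractable.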

428 citations


Posted Content
TL;DR: This work shows how the same principles may be used to select data for two alternative, statistically-based learning architectures: mixtures of Gaussians and locally weighted regression.
Abstract: For many types of machine learning algorithms, one can compute the statistically "optimal" way to select training data. In this paper, we review how optimal data selection techniques have been used with feedforward neural networks. We then show how the same principles may be used to select data for two alternative, statistically-based learning architectures: mixtures of Gaussians and locally weighted regression. While the techniques for neural networks are computationally expensive and approximate, the techniques for mixtures of Gaussians and locally weighted regression are both efficient and accurate. Empirically, we observe that the optimality criterion sharply decreases the number of training examples the learner needs in order to achieve good performance.

274 citations


Journal ArticleDOI
TL;DR: A simple model, in which the transformation is computed via the population activity of a set of units with large sensory receptive fields, is shown to capture the observed pattern.
Abstract: During visually guided movement, visual representations of target location must be transformed into coordinates appropriate for movement. To investigate the representation and plasticity of the visuomotor coordinate transformation, we examined the changes in pointing behavior after local visuomotor remappings. The visual feedback of finger position was limited to one or two locations in the workspace, at which a discrepancy was introduced between the actual and visually perceived finger position. These remappings induced changes in pointing, which were largest near the locus of remapping and decreased away from it. This pattern of spatial generalization highly constrains models of the computation of the visuomotor transformation in the CNS. A simple model, in which the transformation is computed via the population activity of a set of units with large sensory receptive fields, is shown to capture the observed pattern.
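A minimal sketch of the kind of model described (assumed specifics: Gaussian sensory receptive fields tiling the workspace and a linear motor readout trained by gradient descent; the paper's actual model may differ in detail):

```python
import numpy as np

class PopulationRemapModel:
    """Visuomotor map computed from the population activity of units with large
    Gaussian sensory receptive fields. Retraining at one remapped location
    generalizes to nearby targets because they share much of the same activity."""

    def __init__(self, centers, width=4.0):
        self.centers = np.asarray(centers)         # receptive-field centers in visual space
        self.width = width                         # large width -> broad spatial generalization
        self.W = np.zeros((2, len(self.centers)))  # linear readout to a 2-D motor command

    def activity(self, target):
        d2 = np.sum((self.centers - target) ** 2, axis=1)
        return np.exp(-d2 / (2 * self.width ** 2))

    def predict(self, target):
        return self.W @ self.activity(target)

    def train(self, target, desired, lr=0.05, steps=200):
        for _ in range(steps):
            a = self.activity(target)
            self.W += lr * np.outer(desired - self.W @ a, a)

# usage sketch: retrain at a single location, then probe how the change
# in pointing decays with distance from the locus of remapping
grid = [(x, y) for x in range(0, 21, 2) for y in range(0, 21, 2)]
model = PopulationRemapModel(grid)
model.train(target=np.array([10.0, 10.0]), desired=np.array([12.0, 10.0]))
shift_near = model.predict(np.array([11.0, 10.0]))
shift_far = model.predict(np.array([2.0, 2.0]))   # much smaller change far from the remapping
```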

204 citations


Proceedings Article
03 Dec 1996
TL;DR: A time series model that can be viewed as a decision tree with Markov temporal structure is studied; because exact calculations are intractable, variational approximations are used, including one in which a Viterbi-like assumption picks out a single most likely state sequence.
Abstract: We study a time series model that can be viewed as a decision tree with Markov temporal structure. The model is intractable for exact calculations, thus we utilize variational approximations. We consider three different distributions for the approximation: one in which the Markov calculations are performed exactly and the layers of the decision tree are decoupled, one in which the decision tree calculations are performed exactly and the time steps of the Markov chain are decoupled, and one in which a Viterbi-like assumption is made to pick out a single most likely state sequence. We present simulation results for artificial data and the Bach chorales.

109 citations


Journal ArticleDOI
TL;DR: A structure composed of local linear perceptrons for approximating global class discriminants is investigated; it is concluded that, even on a high-dimensional problem such as handwritten digit recognition, such local models are promising: they generalize much better than RBFs and use much less memory.
Abstract: A structure composed of local linear perceptrons for approximating global class discriminants is investigated. Such local linear models may be combined in a cooperative or competitive way. In the cooperative model, a weighted sum of the outputs of the local perceptrons is computed, where the weight is a function of the distance between the input and the position of the local perceptron. In the competitive model, the cost function dictates a mixture model in which only one of the local perceptrons gives the output. Learning of the local models' positions and of the linear mappings they implement is coupled, and both are supervised. We show that this is preferable to the uncoupled case, where the positions are trained in an unsupervised manner before the separate, supervised training of the mappings. We use goodness criteria based on the cross-entropy and give learning equations for both the cooperative and competitive cases. The coupled and uncoupled versions of the cooperative and competitive approaches are compared among themselves and with multilayer perceptrons of sigmoidal hidden units and radial basis functions (RBFs) of Gaussian units on the recognition of handwritten digits. The criteria of comparison are generalization accuracy, learning time, and the number of free parameters. We conclude that even on such a high-dimensional problem, such local models are promising. They generalize much better than RBFs and use much less memory. When compared with multilayer perceptrons, we note that local models learn much faster, generalize as well, and sometimes do better with a comparable number of parameters.
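A minimal sketch of the two combination schemes described above (the softmax-of-negative-squared-distance weighting is an illustrative assumption; the paper's exact weighting and cost functions may differ):

```python
import numpy as np

def cooperative_output(x, positions, weights, biases, beta=1.0):
    """Weighted sum of local linear perceptron outputs, with weights that decay
    with the distance between the input and each perceptron's position."""
    d2 = np.sum((positions - x) ** 2, axis=1)   # squared distance to each local unit
    g = np.exp(-beta * d2)
    g = g / g.sum()                             # normalized distance-based weights
    local = weights @ x + biases                # one scalar output per local linear unit
    return g @ local                            # cooperative (soft) combination

def competitive_output(x, positions, weights, biases):
    """Competitive variant (hard selection shown for simplicity): only the
    local perceptron closest to the input produces the output."""
    k = np.argmin(np.sum((positions - x) ** 2, axis=1))
    return weights[k] @ x + biases[k]

# usage sketch: three local units in a 2-D input space
positions = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
weights = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
biases = np.array([0.0, 0.5, -0.5])
x = np.array([0.9, 0.8])
y_coop = cooperative_output(x, positions, weights, biases)
y_comp = competitive_output(x, positions, weights, biases)
```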

82 citations


Proceedings Article
01 Aug 1996
TL;DR: In this paper, the authors present deterministic techniques for computing upper and lower bounds on marginal probabilities in sigmoid and noisy-OR networks and illustrate the tightness of the bounds by numerical experiments.
Abstract: We present deterministic techniques for computing upper and lower bounds on marginal probabilities in sigmoid and noisy-OR networks. These techniques become useful when the size of the network (or clique size) precludes exact computations. We illustrate the tightness of the bounds by numerical experiments.
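For reference, the two conditional-probability models these networks use take the standard forms (with $x_1, \dots, x_n$ the binary parent states, $q_i$ the noisy-OR activation probabilities, and $\theta_i$ the sigmoid weights; a leak term is often added to the noisy-OR):

$$P_{\text{noisy-OR}}(X = 1 \mid x_1, \dots, x_n) \;=\; 1 - \prod_{i=1}^{n} (1 - q_i)^{x_i}, \qquad P_{\text{sigmoid}}(X = 1 \mid x_1, \dots, x_n) \;=\; \sigma\!\Big(\sum_{i=1}^{n} \theta_i x_i + \theta_0\Big),$$

where $\sigma(z) = 1/(1 + e^{-z})$. The bounds in the paper apply to marginal probabilities in networks built from such units.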

74 citations


Book ChapterDOI
01 Jan 1996
TL;DR: This chapter reviews various computational issues that arise in the study of motor control and motor learning, and develops some of the basic ideas in the control of dynamical systems, distinguishing between feedback control and feedforward control.
Abstract: This chapter reviews various computational issues that arise in the study of motor control and motor learning. It describes feedback control, feedforward control, the problem of delay, observers, learning algorithms, motor learning, and reference models. It focuses on basic theoretical issues with broad applicability. The chapter develops some of the basic ideas in the control of dynamical systems, distinguishing between feedback control and feedforward control. In general, controlling a system involves finding an input to the system that will cause a desired behavior at its output. Feedback control and feedforward control can both be understood as techniques for inverting a dynamical system. The chapter discusses some mathematical representations for dynamical systems.
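A minimal sketch of the distinction, using an assumed one-dimensional point-mass plant (mass m, viscous damping b): feedback control computes the input from the observed tracking error, while feedforward control inverts a model of the plant to compute the input directly from the desired trajectory.

```python
import numpy as np

m, b, dt = 1.0, 0.5, 0.01           # assumed plant: m * dv/dt = u - b * v

def plant_step(x, v, u):
    """Integrate the point-mass plant one time step under input u."""
    a = (u - b * v) / m
    return x + dt * v, v + dt * a

def feedback_control(x, v, x_des, v_des, kp=50.0, kd=10.0):
    """PD feedback: correct based on the observed position and velocity errors."""
    return kp * (x_des - x) + kd * (v_des - v)

def feedforward_control(v_des, a_des):
    """Feedforward: invert the plant model to get the input that would produce
    the desired acceleration along the reference trajectory."""
    return m * a_des + b * v_des

# usage sketch: track x_des(t) = sin(t) with pure feedback
x, v = 0.0, 0.0
for k in range(1000):
    t = k * dt
    u = feedback_control(x, v, np.sin(t), np.cos(t))
    x, v = plant_step(x, v, u)
```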

51 citations


Proceedings Article
03 Dec 1996
TL;DR: A recursive node-elimination formalism for efficiently approximating large probabilistic networks is developed, and it is shown that Boltzmann machines, sigmoid belief networks, or any combination (i.e., chain graphs) can be handled within the same framework.
Abstract: We develop a recursive node-elimination formalism for efficiently approximating large probabilistic networks. No constraints are set on the network topologies. Yet the formalism can be straightforwardly integrated with exact methods whenever they are/become applicable. The approximations we use are controlled: they maintain consistently upper and lower bounds on the desired quantities at all times. We show that Boltzmann machines, sigmoid belief networks, or any combination (i.e., chain graphs) can be handled within the same framework. The accuracy of the methods is verified experimentally.

40 citations


Posted Content
TL;DR: In this article, a mean field theory for sigmoid belief networks based on ideas from statistical mechanics was developed, which provides a tractable approximation to the true probability distribution in these networks; it also yields a lower bound on the likelihood of evidence.
Abstract: We develop a mean field theory for sigmoid belief networks based on ideas from statistical mechanics. Our mean field theory provides a tractable approximation to the true probability distribution in these networks; it also yields a lower bound on the likelihood of evidence. We demonstrate the utility of this framework on a benchmark problem in statistical pattern recognition: the classification of handwritten digits.

Journal ArticleDOI
TL;DR: An overview of current research on artificial neural networks is presented, emphasizing a statistical perspective that views neural networks as parameterized graphs making probabilistic assumptions about data, and learning algorithms as methods for finding parameter values that look probable in the light of the data.
Abstract: We present an overview of current research on artificial neural networks, emphasizing a statistical perspective. We view neural networks as parameterized graphs that make probabilistic assumptions about data, and view learning algorithms as methods for finding parameter values that look probable in the light of the data. We discuss basic issues in representation and learning, and treat some of the practical issues that arise in fitting networks to data. We also discuss links between neural networks and the general formalism of graphical models.
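As a small illustration of that statistical reading (a sketch, not the article's own example), fitting a single sigmoid output unit by gradient ascent on the Bernoulli log-likelihood of the data:

```python
import numpy as np

def fit_sigmoid_unit(X, y, lr=0.1, steps=2000):
    """Fit a one-unit 'network' P(y=1|x) = sigmoid(w.x) by maximizing the
    Bernoulli log-likelihood, i.e., finding parameter values that make the
    observed data look probable."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w += lr * X.T @ (y - p) / len(y)    # gradient of the average log-likelihood
    return w

# usage sketch with synthetic labels
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.5, -2.0, 0.5]) + 0.3 * rng.normal(size=200) > 0).astype(float)
w_hat = fit_sigmoid_unit(X, y)
```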

Proceedings Article
03 Dec 1996
TL;DR: From this path functional, the Euler-Lagrange equations for extremal motion are derived and it is shown that this interpolation can be done efficiently, in high dimensions, for Gaussian, Dirichlet, and mixture models.
Abstract: Given a multidimensional data set and a model of its density, we consider how to define the optimal interpolation between two points. This is done by assigning a cost to each path through space, based on two competing goals-one to interpolate through regions of high density, the other to minimize arc length. From this path functional, we derive the Euler-Lagrange equations for extremal motion; given two points, the desired interpolation is found by solving a boundary value problem. We show that this interpolation can be done efficiently, in high dimensions, for Gaussian, Dirichlet, and mixture models.
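One illustrative form of such a path cost (an assumption for concreteness; the paper's exact functional is not reproduced here) weights arc length by a penalty on low model density $p$, with $\lambda$ trading off the two goals:

$$E[\gamma] \;=\; \int_0^1 p\big(\gamma(t)\big)^{-\lambda}\, \big\|\dot\gamma(t)\big\|\; dt .$$

Paths through high-density regions are cheap under such a metric, while long detours raise the cost; the extremal path between two fixed endpoints then follows from the corresponding Euler-Lagrange equations as a boundary value problem.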

Proceedings Article
03 Dec 1996
TL;DR: Two ways of embedding the triangulation problem into a continuous domain are presented and shown to perform well compared to the best known heuristic.
Abstract: When triangulating a belief network we aim to obtain a junction tree of minimum state space. According to (Rose, 1970), searching for the optimal triangulation can be cast as a search over all the permutations of the graph's vertices. Our approach is to embed the discrete set of permutations in a convex continuous domain D. By suitably extending the cost function over D and solving the continuous nonlinear optimization task, we hope to obtain a good triangulation with respect to the aforementioned cost. This paper presents two ways of embedding the triangulation problem into a continuous domain and shows that they perform well compared to the best known heuristic.
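A minimal sketch of the discrete objective being relaxed (assumed representation: the moralized graph as adjacency sets plus per-vertex state-space sizes; the cost of an elimination ordering is taken here as the total state space of the cliques it induces, a standard proxy for junction tree size):

```python
from itertools import permutations

def triangulation_cost(adj, card, order):
    """Total state space of the cliques induced by eliminating vertices in
    `order`; triangulating for minimum state space means minimizing this."""
    adj = {v: set(nb) for v, nb in adj.items()}      # work on a copy
    cost = 0
    for v in order:
        clique = adj[v] | {v}
        size = 1
        for u in clique:
            size *= card[u]                          # state space of the induced clique
        cost += size
        for u in adj[v]:                             # add fill-in edges among v's neighbors
            adj[u] |= adj[v] - {u}
            adj[u].discard(v)
        del adj[v]                                   # eliminate v
    return cost

# exhaustive search over vertex permutations is only feasible for tiny graphs,
# which is what motivates heuristics and continuous relaxations of the search
adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}
card = {1: 2, 2: 3, 3: 2, 4: 3}
best_order = min(permutations(adj), key=lambda o: triangulation_cost(adj, card, o))
```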