Top 21 papers published by Ruslan Salakhutdinov from Carnegie Mellon University in 2012

Posted Content•

Improving neural networks by preventing co-adaptation of feature detectors

[...]

Geoffrey E. Hinton¹, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov¹•Institutions (1)

University of Toronto¹

03 Jul 2012-arXiv: Neural and Evolutionary Computing

TL;DR: The authors randomly omit half of the feature detectors on each training case to prevent complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors.

Abstract: When a large feedforward neural network is trained on a small training set, it typically performs poorly on held-out test data. This "overfitting" is greatly reduced by randomly omitting half of the feature detectors on each training case. This prevents complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors. Instead, each neuron learns to detect a feature that is generally helpful for producing the correct answer given the combinatorially large variety of internal contexts in which it must operate. Random "dropout" gives big improvements on many benchmark tasks and sets new records for speech and object recognition.

6,899 citations
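
The random omission described in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `dropout_forward` and its parameters are hypothetical, and the test-time scaling by the keep probability is an assumption that the abstract itself does not state.

```python
import random


def dropout_forward(activations, rng, drop_prob=0.5, training=True):
    """Apply dropout to a list of hidden-unit activations.

    Training: each unit is zeroed independently with probability drop_prob.
    Test: every unit is kept and scaled by (1 - drop_prob) so its expected
    contribution matches training (an assumption, not stated in the abstract).
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if training:
        return [0.0 if rng.random() < drop_prob else a for a in activations]
    return [a * (1.0 - drop_prob) for a in activations]


rng = random.Random(0)
hidden = [1.0] * 10000

train_out = dropout_forward(hidden, rng)                 # roughly half zeroed
test_out = dropout_forward(hidden, rng, training=False)  # all kept, halved

kept_fraction = sum(1 for a in train_out if a != 0.0) / len(hidden)
print(f"fraction of units kept during training: {kept_fraction:.3f}")
```

A fresh mask is drawn on every call, so each training case sees a different random subset of feature detectors.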

Proceedings Article•

Multimodal Learning with Deep Boltzmann Machines

[...]

Nitish Srivastava¹, Ruslan Salakhutdinov¹•Institutions (1)

University of Toronto¹

03 Dec 2012

TL;DR: In this paper, a Deep Boltzmann Machine (DBM) is proposed for learning a generative model of data that consists of multiple and diverse input modalities, which can be used to extract a unified representation that fuses modalities together.

Abstract: A Deep Boltzmann Machine is described for learning a generative model of data that consists of multiple and diverse input modalities. The model can be used to extract a unified representation that fuses modalities together. We find that this representation is useful for classification and information retrieval tasks. The model works by learning a probability density over the space of multimodal inputs. It uses states of latent variables as representations of the input. The model can extract this representation even when some modalities are absent by sampling from the conditional distribution over them and filling them in. Our experimental results on bi-modal data consisting of images and text show that the Multimodal DBM can learn a good generative model of the joint space of image and text inputs that is useful for information retrieval from both unimodal and multimodal queries. We further demonstrate that this model significantly outperforms SVMs and LDA on discriminative tasks. Finally, we compare our model to other deep learning methods, including autoencoders and deep belief networks, and show that it achieves noticeable gains.

1,002 citations

Proceedings Article•

Hamming Distance Metric Learning

[...]

Mohammad Norouzi¹, David J. Fleet¹, Ruslan Salakhutdinov¹•Institutions (1)

University of Toronto¹

03 Dec 2012

TL;DR: A new loss-augmented inference algorithm that is quadratic in the code length and inspired by latent structural SVMs is developed, showing strong retrieval performance on CIFAR-10 and MNIST, with promising classification results using no more than kNN on the binary codes.

Abstract: Motivated by large-scale multimedia applications we propose to learn mappings from high-dimensional data to binary codes that preserve semantic similarity. Binary codes are well suited to large-scale applications as they are storage efficient and permit exact sub-linear kNN search. The framework is applicable to broad families of mappings, and uses a flexible form of triplet ranking loss. We overcome discontinuous optimization of the discrete mappings by minimizing a piecewise-smooth upper bound on empirical loss, inspired by latent structural SVMs. We develop a new loss-augmented inference algorithm that is quadratic in the code length. We show strong retrieval performance on CIFAR-10 and MNIST, with promising classification results using no more than kNN on the binary codes.

562 citations
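
The abstract above mentions kNN classification on binary codes. A minimal sketch of that final step, assuming codes are stored as Python integers: the helper names `hamming` and `knn_classify` and the toy 8-bit codes are hypothetical, and the linear scan shown here is the simplest option rather than the exact sub-linear search the abstract refers to. Learning the codes themselves is not shown.

```python
from collections import Counter


def hamming(a, b):
    """Number of bit positions in which two integer-encoded codes differ."""
    return bin(a ^ b).count("1")


def knn_classify(query, codes, labels, k=3):
    """Majority vote among the k stored codes closest to query in Hamming distance."""
    nearest = sorted(range(len(codes)), key=lambda i: hamming(query, codes[i]))[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy 8-bit codes: class "a" clusters in the low bits, class "b" in the high bits.
codes = [0b00001111, 0b00001110, 0b00011111, 0b11110000, 0b11100000, 0b11110001]
labels = ["a", "a", "a", "b", "b", "b"]

prediction = knn_classify(0b00001101, codes, labels, k=3)
print(prediction)
```

Because the distance is an XOR followed by a bit count, comparing two codes costs a handful of machine instructions, which is part of why binary codes suit large-scale retrieval.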

Journal Article•DOI•

An efficient learning procedure for deep boltzmann machines

[...]

Ruslan Salakhutdinov¹, Geoffrey E. Hinton¹•Institutions (1)

University of Toronto¹

01 Aug 2012-Neural Computation

TL;DR: A new learning algorithm for Boltzmann machines that contain many layers of hidden variables is presented, with results on the MNIST and NORB data sets showing that deep Boltzmann machines learn very good generative models of handwritten digits and 3D objects.

Abstract: We present a new learning algorithm for Boltzmann machines that contain many layers of hidden variables. Data-dependent statistics are estimated using a variational approximation that tends to focus on a single mode, and data-independent statistics are estimated using persistent Markov chains. The use of two quite different techniques for estimating the two types of statistic that enter into the gradient of the log likelihood makes it practical to learn Boltzmann machines with multiple hidden layers and millions of parameters. The learning can be made more efficient by using a layer-by-layer pretraining phase that initializes the weights sensibly. The pretraining also allows the variational inference to be initialized sensibly with a single bottom-up pass. We present results on the MNIST and NORB data sets showing that deep Boltzmann machines learn very good generative models of handwritten digits and 3D objects. We also show that the features discovered by deep Boltzmann machines are a very effective way to initialize the hidden layers of feedforward neural nets, which are then discriminatively fine-tuned.

463 citations

Learning Representations for Multimodal Data with Deep Belief Nets

[...]

Nitish Srivastava¹, Ruslan Salakhutdinov•Institutions (1)

University of Toronto¹

01 Jan 2012

TL;DR: The experimental results on bi-modal data consisting of images and text show that the Multimodal DBN can learn a good generative model of the joint space of image and text inputs that is useful for filling in missing data so it can be used both for image annotation and image retrieval.

Abstract: We propose a Deep Belief Network architecture for learning a joint representation of multimodal data. The model defines a probability distribution over the space of multimodal inputs and allows sampling from the conditional distributions over each data modality. This makes it possible for the model to create a multimodal representation even when some data modalities are missing. Our experimental results on bi-modal data consisting of images and text show that the Multimodal DBN can learn a good generative model of the joint space of image and text inputs that is useful for filling in missing data so it can be used both for image annotation and image retrieval. We further demonstrate that using the representation discovered by the Multimodal DBN our model can significantly outperform SVMs and LDA on discriminative tasks.

200 citations

Proceedings Article•DOI•

Robust Boltzmann Machines for recognition and denoising

[...]

Yichuan Tang¹, Ruslan Salakhutdinov¹, Geoffrey E. Hinton¹•Institutions (1)

University of Toronto¹

16 Jun 2012

TL;DR: This paper introduces a novel model, the Robust Boltzmann Machine (RoBM), which allows Boltzmann Machines to be robust to corruptions and is significantly better at recognition and denoising on several face databases.

Abstract: While Boltzmann Machines have been successful at unsupervised learning and density modeling of images and speech data, they can be very sensitive to noise in the data. In this paper, we introduce a novel model, the Robust Boltzmann Machine (RoBM), which allows Boltzmann Machines to be robust to corruptions. In the domain of visual recognition, the RoBM is able to accurately deal with occlusions and noise by using multiplicative gating to induce a scale mixture of Gaussians over pixels. Image denoising and in-painting correspond to posterior inference in the RoBM. Our model is trained in an unsupervised fashion with unlabeled noisy data and can learn the spatial structure of the occluders. Compared to standard algorithms, the RoBM is significantly better at recognition and denoising on several face databases.

181 citations

Proceedings Article•

A Better Way to Pretrain Deep Boltzmann Machines

[...]

Geoffrey E. Hinton¹, Ruslan Salakhutdinov¹•Institutions (1)

University of Toronto¹

03 Dec 2012

TL;DR: A different method of pretraining DBMs is developed that distributes the modelling work more evenly over the hidden layers and demonstrates that the new pretraining algorithm allows us to learn better generative models.

Abstract: We describe how the pretraining algorithm for Deep Boltzmann Machines (DBMs) is related to the pretraining algorithm for Deep Belief Networks and we show that under certain conditions, the pretraining procedure improves the variational lower bound of a two-hidden-layer DBM. Based on this analysis, we develop a different method of pretraining DBMs that distributes the modelling work more evenly over the hidden layers. Our results on the MNIST and NORB datasets demonstrate that the new pretraining algorithm allows us to learn better generative models.

129 citations

Proceedings Article•

One-Shot Learning with a Hierarchical Nonparametric Bayesian Model

[...]

Ruslan Salakhutdinov¹, Josh Tenenbaum², Antonio Torralba²•Institutions (2)

University of Toronto¹, Massachusetts Institute of Technology²

27 Jun 2012

TL;DR: A hierarchical Bayesian model is developed that learns categories from single training examples and transfers acquired knowledge from previously learned categories to a novel category, in the form of a prior over category means and variances.

Abstract: We develop a hierarchical Bayesian model that learns categories from single training examples. The model transfers acquired knowledge from previously learned categories to a novel category, in the form of a prior over category means and variances. The model discovers how to group categories into meaningful super-categories that express different priors for new classes. Given a single example of a novel category, we can efficiently infer which super-category the novel category belongs to, and thereby estimate not only the new category's mean but also an appropriate similarity metric based on parameters inherited from the super-category. On MNIST and MSR Cambridge image datasets the model learns useful representations of novel categories based on just a single training example, and performs significantly better than simpler hierarchical Bayesian approaches. It can also discover new categories in a completely unsupervised fashion, given just one or a few examples.

118 citations

Posted Content•

Exploiting compositionality to explore a large space of model structures

[...]

Roger Grosse¹, Ruslan Salakhutdinov², William T. Freeman¹, Joshua B. Tenenbaum¹•Institutions (2)

Massachusetts Institute of Technology¹, University of Toronto²

16 Oct 2012-arXiv: Learning

TL;DR: This work organizes a space of matrix decomposition models into a context-free grammar which generates a wide variety of structures through the compositional application of a few simple rules and automatically chooses the decomposition structure from raw data by evaluating only a small fraction of all models.

Abstract: The recent proliferation of richly structured probabilistic models raises the question of how to automatically determine an appropriate model for a dataset. We investigate this question for a space of matrix decomposition models which can express a variety of widely used models from unsupervised learning. To enable model selection, we organize these models into a context-free grammar which generates a wide variety of structures through the compositional application of a few simple rules. We use our grammar to generically and efficiently infer latent components and estimate predictive likelihood for nearly 2500 structures using a small toolbox of reusable algorithms. Using a greedy search over our grammar, we automatically choose the decomposition structure from raw data by evaluating only a small fraction of all models. The proposed method typically finds the correct structure for synthetic data and backs off gracefully to simpler models under heavy noise. It learns sensible structures for datasets as diverse as image patches, motion capture, 20 Questions, and U.S. Senate votes, all using exactly the same code.

96 citations

Proceedings Article•

Deep Lambertian Networks

[...]

Yichuan Tang¹, Geoffrey E. Hinton¹, Ruslan Salakhutdinov¹•Institutions (1)

University of Toronto¹

26 Jun 2012

TL;DR: A multilayer generative model where the latent variables include the albedo, surface normals, and the light source is introduced, and it is demonstrated that this model is able to generalize as well as improve over standard baselines in one-shot face recognition.

Abstract: Visual perception is a challenging problem in part due to illumination variations. A possible solution is to first estimate an illumination invariant representation before using it for recognition. The object albedo and surface normals are examples of such representations. In this paper, we introduce a multilayer generative model where the latent variables include the albedo, surface normals, and the light source. Combining Deep Belief Nets with the Lambertian reflectance assumption, our model can learn good priors over the albedo from 2D images. Illumination variations can be explained by changing only the lighting latent variable in our model. By transferring learned knowledge from similar objects, albedo and surface normals estimation from a single image is possible in our model. Experiments demonstrate that our model is able to generalize as well as improve over standard baselines in one-shot face recognition.

78 citations
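
The Lambertian reflectance assumption named in the abstract above ties the three latent quantities together: a pixel's intensity is its albedo times the cosine between the surface normal and the light direction. The sketch below shows only that standard textbook relation; the function names are hypothetical, and clamping negative shading to zero is an assumption about surfaces facing away from the light, not a detail taken from the paper.

```python
import math


def normalize(v):
    """Scale a 3-vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def lambertian_intensity(albedo, normal, light):
    """Pixel intensity under Lambertian reflectance.

    albedo: scalar reflectivity of the surface point
    normal: surface normal (normalized here)
    light:  light vector; its length encodes the light's strength
    """
    n = normalize(normal)
    shading = sum(ni * li for ni, li in zip(n, light))
    return albedo * max(0.0, shading)  # no light reaches back-facing surfaces


facing = [0.0, 0.0, 1.0]
tilt = math.radians(60)

frontal = lambertian_intensity(0.8, facing, [0.0, 0.0, 1.0])
oblique = lambertian_intensity(0.8, facing, [0.0, math.sin(tilt), math.cos(tilt)])
behind = lambertian_intensity(0.8, facing, [0.0, 0.0, -1.0])
print(frontal, oblique, behind)
```

Holding the albedo and normal fixed while changing only `light` reproduces the idea that illumination variation is explained by a single latent variable.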

Proceedings Article•DOI•

Resource configurable spoken query detection using Deep Boltzmann Machines

[...]

Yaodong Zhang¹, Ruslan Salakhutdinov², Hung-An Chang¹, James Glass¹•Institutions (2)

Massachusetts Institute of Technology¹, University of Toronto²

25 Mar 2012

TL;DR: A spoken query detection method based on posteriorgrams generated from Deep Boltzmann Machines (DBMs) that can be deployed in both semi-supervised and unsupervised training scenarios.

Abstract: In this paper we present a spoken query detection method based on posteriorgrams generated from Deep Boltzmann Machines (DBMs). The proposed method can be deployed in both semi-supervised and unsupervised training scenarios. The DBM-based posteriorgrams were evaluated on a series of keyword spotting tasks using the TIMIT speech corpus. In unsupervised training conditions, the DBM-approach improved upon our previous best unsupervised keyword detection performance using Gaussian mixture model-based posteriorgrams by over 10%. When limited amounts of labeled data were incorporated into training, the DBM-approach required less than one third of the annotated data in order to achieve a comparable performance of a system that used all of the annotated data for training.

Posted Content•

Deep Mixtures of Factor Analysers

[...]

Yichuan Tang¹, Ruslan Salakhutdinov¹, Geoffrey E. Hinton¹•Institutions (1)

University of Toronto¹

18 Jun 2012-arXiv: Learning

TL;DR: In this paper, a greedy layer-wise learning algorithm for Deep Mixtures of Factor Analysers (DMFAs) is presented; a DMFA can be converted to an equivalent shallow MFA by multiplying together the factor loading matrices at different levels.

Abstract: An efficient way to learn deep density models that have many layers of latent variables is to learn one layer at a time using a model that has only one layer of latent variables. After learning each layer, samples from the posterior distributions for that layer are used as training data for learning the next layer. This approach is commonly used with Restricted Boltzmann Machines, which are undirected graphical models with a single hidden layer, but it can also be used with Mixtures of Factor Analysers (MFAs) which are directed graphical models. In this paper, we present a greedy layer-wise learning algorithm for Deep Mixtures of Factor Analysers (DMFAs). Even though a DMFA can be converted to an equivalent shallow MFA by multiplying together the factor loading matrices at different levels, learning and inference are much more efficient in a DMFA and the sharing of each lower-level factor loading matrix by many different higher level MFAs prevents overfitting. We demonstrate empirically that DMFAs learn better density models than both MFAs and two types of Restricted Boltzmann Machine on a wide variety of datasets.
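
The collapse of a deep model into a shallow one mentioned in the abstract above can be illustrated for a single path through two layers. This is a sketch under simplifying assumptions, not the paper's notation: one factor analyser per layer, with hypothetical symbols $W_\ell$ (loading matrix), $\mu_\ell$ (mean) and $\Psi_\ell$ (noise covariance) for layer $\ell$.

```latex
\begin{align*}
x   &= W_1 z_1 + \mu_1 + \epsilon_1, & \epsilon_1 &\sim \mathcal{N}(0, \Psi_1) \\
z_1 &= W_2 z_2 + \mu_2 + \epsilon_2, & \epsilon_2 &\sim \mathcal{N}(0, \Psi_2) \\
z_2 &\sim \mathcal{N}(0, I) \\[4pt]
\Rightarrow\quad
x   &= (W_1 W_2)\, z_2 + (W_1 \mu_2 + \mu_1) + (W_1 \epsilon_2 + \epsilon_1) \\
\operatorname{Cov}(x) &= W_1 \left( W_2 W_2^{\top} + \Psi_2 \right) W_1^{\top} + \Psi_1
\end{align*}
```

The product $W_1 W_2$ plays the role of the shallow model's loading matrix. In the deep form, $W_1$ is stored once and shared by every second-layer component built on top of it, which is the parameter sharing the abstract credits with preventing overfitting.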

Posted Content•

On the Convergence of Bound Optimization Algorithms

[...]

Ruslan Salakhutdinov¹, Sam T. Roweis¹, Zoubin Ghahramani²•Institutions (2)

University of Toronto¹, University College London²

19 Oct 2012-arXiv: Learning

TL;DR: This paper derives a general relationship between the updates performed by bound optimization methods and those of gradient and second-order methods and identifies analytic conditions under which bound optimization algorithms exhibit quasi-Newton behavior, and under which they possess poor, first-order convergence.

Abstract: Many practitioners who use the EM algorithm complain that it is sometimes slow. When does this happen, and what can be done about it? In this paper, we study the general class of bound optimization algorithms - including Expectation-Maximization, Iterative Scaling and CCCP - and their relationship to direct optimization algorithms such as gradient-based methods for parameter learning. We derive a general relationship between the updates performed by bound optimization methods and those of gradient and second-order methods and identify analytic conditions under which bound optimization algorithms exhibit quasi-Newton behavior, and conditions under which they possess poor, first-order convergence. Based on this analysis, we consider several specific algorithms, interpret and analyze their convergence properties and provide some recipes for preprocessing input to these algorithms to yield faster convergence behavior. We report empirical results supporting our analysis and showing that simple data preprocessing can result in dramatically improved performance of bound optimizers in practice.

Posted Content•

Deep Lambertian Networks

[...]

Yichuan Tang¹, Ruslan Salakhutdinov¹, Geoffrey E. Hinton¹•Institutions (1)

University of Toronto¹

27 Jun 2012-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, a multilayer generative model is proposed to estimate the surface normals and albedo from a single image using deep belief networks and the Lambertian reflectance assumption.

Abstract: Visual perception is a challenging problem in part due to illumination variations. A possible solution is to first estimate an illumination invariant representation before using it for recognition. The object albedo and surface normals are examples of such representations. In this paper, we introduce a multilayer generative model where the latent variables include the albedo, surface normals, and the light source. Combining Deep Belief Nets with the Lambertian reflectance assumption, our model can learn good priors over the albedo from 2D images. Illumination variations can be explained by changing only the lighting latent variable in our model. By transferring learned knowledge from similar objects, albedo and surface normals estimation from a single image is possible in our model. Experiments demonstrate that our model is able to generalize as well as improve over standard baselines in one-shot face recognition.

Proceedings Article•

Cardinality Restricted Boltzmann Machines

[...]

Kevin Swersky¹, Ilya Sutskever¹, Daniel Tarlow¹, Richard S. Zemel¹, Ruslan Salakhutdinov¹, Ryan P. Adams²•Institutions (2)

University of Toronto¹, Harvard University²

03 Dec 2012

TL;DR: It is shown that a dynamic programming algorithm can be used to implement exact sparsity in the RBM's hidden units and how to pass derivatives through the resulting posterior marginals, which makes it possible to fine-tune a pre-trained neural network with sparse hidden layers.

Abstract: The Restricted Boltzmann Machine (RBM) is a popular density model that is also good for extracting features. A main source of tractability in RBM models is that, given an input, the posterior distribution over hidden variables is factorizable and can be easily computed and sampled from. Sparsity and competition in the hidden representation is beneficial, and while an RBM with competition among its hidden units would acquire some of the attractive properties of sparse coding, such constraints are typically not added, as the resulting posterior over the hidden units seemingly becomes intractable. In this paper we show that a dynamic programming algorithm can be used to implement exact sparsity in the RBM's hidden units. We also show how to pass derivatives through the resulting posterior marginals, which makes it possible to fine-tune a pre-trained neural network with sparse hidden layers.

Proceedings Article•

Exploiting compositionality to explore a large space of model structures

[...]

Roger Grosse¹, Ruslan Salakhutdinov², William T. Freeman¹, Joshua B. Tenenbaum¹•Institutions (2)

Massachusetts Institute of Technology¹, University of Toronto²

14 Aug 2012

TL;DR: In this paper, a context-free grammar for matrix decomposition models is proposed to automatically choose the decomposition structure from raw data by evaluating only a small fraction of all models.

Abstract: The recent proliferation of richly structured probabilistic models raises the question of how to automatically determine an appropriate model for a dataset. We investigate this question for a space of matrix decomposition models which can express a variety of widely used models from unsupervised learning. To enable model selection, we organize these models into a context-free grammar which generates a wide variety of structures through the compositional application of a few simple rules. We use our grammar to generically and efficiently infer latent components and estimate predictive likelihood for nearly 2500 structures using a small toolbox of reusable algorithms. Using a greedy search over our grammar, we automatically choose the decomposition structure from raw data by evaluating only a small fraction of all models. The proposed method typically finds the correct structure for synthetic data and backs off gracefully to simpler models under heavy noise. It learns sensible structures for datasets as diverse as image patches, motion capture, 20 Questions, and U.S. Senate votes, all using exactly the same code.

Proceedings Article•

Deep Mixtures of Factor Analysers

[...]

Yichuan Tang¹, Geoffrey E. Hinton¹, Ruslan Salakhutdinov¹•Institutions (1)

University of Toronto¹

26 Jun 2012

TL;DR: This paper presents a greedy layer-wise learning algorithm for Deep Mixtures of Factor Analysers (DMFAs) and demonstrates empirically that DMFAs learn better density models than both MFAs and two types of Restricted Boltzmann Machine on a wide variety of datasets.

Abstract: An efficient way to learn deep density models that have many layers of latent variables is to learn one layer at a time using a model that has only one layer of latent variables. After learning each layer, samples from the posterior distributions for that layer are used as training data for learning the next layer. This approach is commonly used with Restricted Boltzmann Machines, which are undirected graphical models with a single hidden layer, but it can also be used with Mixtures of Factor Analysers (MFAs) which are directed graphical models. In this paper, we present a greedy layer-wise learning algorithm for Deep Mixtures of Factor Analysers (DMFAs). Even though a DMFA can be converted to an equivalent shallow MFA by multiplying together the factor loading matrices at different levels, learning and inference are much more efficient in a DMFA and the sharing of each lower-level factor loading matrix by many different higher level MFAs prevents overfitting. We demonstrate empirically that DMFAs learn better density models than both MFAs and two types of Restricted Boltzmann Machine on a wide variety of datasets.

Journal Article•

Concept learning as motor program induction: A large-scale empirical study

[...]

Brenden M. Lake, Ruslan Salakhutdinov, Joshua B. Tenenbaum

01 Jan 2012-Cognitive Science

TL;DR: A large-scale empirical study of one-shot concept learning as motor program induction is presented, suggesting that rich generative knowledge in the form of a motor program can be induced from just a single example of a novel concept.

Proceedings Article•

Matrix reconstruction with the local max norm

[...]

Rina Foygel¹, Nathan Srebro², Ruslan Salakhutdinov³•Institutions (3)

Stanford University¹, Toyota Technological Institute at Chicago², University of Toronto³

03 Dec 2012

TL;DR: In this paper, a new family of matrix norms, the "local max" norms, is introduced, which generalizes existing methods such as the max norm, the nuclear norm, and the weighted or smoothed weighted trace norms.

Abstract: We introduce a new family of matrix norms, the "local max" norms, generalizing existing methods such as the max norm, the trace norm (nuclear norm), and the weighted or smoothed weighted trace norms, which have been extensively used in the literature as regularizers for matrix reconstruction problems. We show that this new family can be used to interpolate between the (weighted or unweighted) trace norm and the more conservative max norm. We test this interpolation on simulated data and on the large-scale Netflix and MovieLens ratings data, and find improved accuracy relative to the existing matrix norms. We also provide theoretical results showing learning guarantees for some of the new norms.

Posted Content•

Matrix reconstruction with the local max norm

[...]

Rina Foygel¹, Nathan Srebro², Ruslan Salakhutdinov³•Institutions (3)

Stanford University¹, Toyota Technological Institute at Chicago², University of Toronto³

18 Oct 2012-arXiv: Machine Learning

TL;DR: A new family of matrix norms, the "local max" norms, is introduced, generalizing existing methods such as the max norm, the trace norm, and the weighted or smoothed weighted trace norms, which have been extensively used in the literature as regularizers for matrix reconstruction problems.

Abstract: We introduce a new family of matrix norms, the "local max" norms, generalizing existing methods such as the max norm, the trace norm (nuclear norm), and the weighted or smoothed weighted trace norms, which have been extensively used in the literature as regularizers for matrix reconstruction problems. We show that this new family can be used to interpolate between the (weighted or unweighted) trace norm and the more conservative max norm. We test this interpolation on simulated data and on the large-scale Netflix and MovieLens ratings data, and find improved accuracy relative to the existing matrix norms. We also provide theoretical results showing learning guarantees for some of the new norms.

Proceedings Article•

Domain Adaptation: A Small Sample Statistical Approach

[...]

Ruslan Salakhutdinov¹, Sham M. Kakade², Dean P. Foster²•Institutions (2)

University of Toronto¹, University of Pennsylvania²

21 Mar 2012

TL;DR: The theoretical analysis shows that one can select many more features than domains while avoiding overfitting by utilizing data-dependent variance properties, and presents a greedy feature selection algorithm based on using T-statistics.

Abstract: We study the prevalent problem when a test distribution differs from the training distribution. We consider a setting where our training set consists of a small number of sample domains, but where we have many samples in each domain. Our goal is to generalize to a new domain. For example, we may want to learn a similarity function using only certain classes of objects, but we desire that this similarity function be applicable to object classes not present in our training sample (e.g. we might seek to learn that “dogs are similar to dogs” even though images of dogs were absent from our training set). Our theoretical analysis shows that we can select many more features than domains while avoiding overfitting by utilizing data-dependent variance properties. We present a greedy feature selection algorithm based on using T-statistics. Our experiments validate this theory showing that our T-statistic-based greedy feature selection is more robust at avoiding overfitting than the classical greedy procedure.
