Top 30 papers published by Ruslan Salakhutdinov from Carnegie Mellon University in 2016

Proceedings Article•

Revisiting semi-supervised learning with graph embeddings

[...]

Zhilin Yang¹, William W. Cohen¹, Ruslan Salakhutdinov¹•Institutions (1)

19 Jun 2016

TL;DR: In this article, a semi-supervised learning framework based on graph embeddings is proposed, where given a graph between instances, an embedding for each instance is trained to jointly predict the class label and the neighborhood context in the graph.

...read moreread less

Abstract: We present a semi-supervised learning framework based on graph embeddings. Given a graph between instances, we train an embedding for each instance to jointly predict the class label and the neighborhood context in the graph. We develop both transductive and inductive variants of our method. In the transductive variant of our method, the class labels are determined by both the learned embeddings and input feature vectors, while in the inductive variant, the embeddings are defined as a parametric function of the feature vectors, so predictions can be made on instances not seen during training. On a large and diverse set of benchmark tasks, including text classification, distantly supervised entity extraction, and entity classification, we show improved performance over many of the existing models.

...read moreread less

1,012 citations

Proceedings Article•

Importance Weighted Autoencoders

[...]

Yuri Burda¹, Roger Grosse¹, Ruslan Salakhutdinov¹•Institutions (1)

University of Toronto¹

01 Jan 2016

TL;DR: The importance weighted autoencoder (IWAE) as mentioned in this paper uses a strictly tighter log-likelihood lower bound derived from importance weighting to model complex posteriors which do not fit the VAE modeling assumptions.

...read moreread less

Abstract: The variational autoencoder (VAE; Kingma, Welling (2014)) is a recently proposed generative model pairing a top-down generative network with a bottom-up recognition network which approximates posterior inference. It typically makes strong assumptions about posterior inference, for instance that the posterior distribution is approximately factorial, and that its parameters can be approximated with nonlinear regression from the observations. As we show empirically, the VAE objective can lead to overly simplified representations which fail to use the network's entire modeling capacity. We present the importance weighted autoencoder (IWAE), a generative model with the same architecture as the VAE, but which uses a strictly tighter log-likelihood lower bound derived from importance weighting. In the IWAE, the recognition network uses multiple samples to approximate the posterior, giving it increased flexibility to model complex posteriors which do not fit the VAE modeling assumptions. We show empirically that IWAEs learn richer latent space representations than VAEs, leading to improved test log-likelihood on density estimation benchmarks.

...read moreread less

547 citations

Proceedings Article•

Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning

[...]

Emilio Parisotto¹, Jimmy Ba¹, Ruslan Salakhutdinov¹•Institutions (1)

University of Toronto¹

01 Jan 2016

TL;DR: This work defines a novel method of multitask and transfer learning that enables an autonomous agent to learn how to behave in multiple tasks simultaneously, and then generalize its knowledge to new domains, and uses Atari games as a testing environment to demonstrate these methods.

...read moreread less

Abstract: The ability to act in multiple environments and transfer previous knowledge to new situations can be considered a critical aspect of any intelligent agent. Towards this goal, we define a novel method of multitask and transfer learning that enables an autonomous agent to learn how to behave in multiple tasks simultaneously, and then generalize its knowledge to new domains. This method, termed "Actor-Mimic", exploits the use of deep reinforcement learning and model compression techniques to train a single policy network that learns how to act in a set of distinct tasks by using the guidance of several expert teachers. We then show that the representations learnt by the deep policy network are capable of generalizing to new tasks with no prior expert guidance, speeding up learning in novel environments. Although our method can in general be applied to a wide range of problems, we use Atari games as a testing environment to demonstrate these methods.

...read moreread less

445 citations

Proceedings Article•

Transfer Learning for Sequence Tagging with Hierarchical Recurrent Networks

[...]

Zhilin Yang¹, Ruslan Salakhutdinov¹, William W. Cohen¹•Institutions (1)

Carnegie Mellon University¹

04 Nov 2016

TL;DR: The effects of transfer learning for deep hierarchical recurrent networks across domains, applications, and languages are examined, and it is shown that significant improvement can often be obtained.

...read moreread less

Abstract: Recent papers have shown that neural networks obtain state-of-the-art performance on several different sequence tagging tasks. One appealing property of such systems is their generality, as excellent performance can be achieved with a unified architecture and without task-specific feature engineering. However, it is unclear if such systems can be used for tasks without large amounts of training data. In this paper we explore the problem of transfer learning for neural sequence taggers, where a source task with plentiful annotations (e.g., POS tagging on Penn Treebank) is used to improve performance on a target task with fewer available annotations (e.g., POS tagging for microblogs). We examine the effects of transfer learning for deep hierarchical recurrent networks across domains, applications, and languages, and show that significant improvement can often be obtained. These improvements lead to improvements over the current state-of-the-art on several well-studied tasks.

...read moreread less

333 citations

Proceedings Article•

Deep Kernel Learning

[...]

Andrew Gordon Wilson¹, Zhiting Hu², Ruslan Salakhutdinov³, Eric P. Xing²•Institutions (3)

Pennsylvania State University¹, Carnegie Mellon University², University of Toronto³

02 May 2016

TL;DR: In this article, the authors introduce scalable deep kernels, which combine the structural properties of deep learning architectures with the non-parametric flexibility of kernel methods, and jointly learn the properties of these kernels through the marginal likelihood of a Gaussian process.

...read moreread less

Abstract: We introduce scalable deep kernels, which combine the structural properties of deep learning architectures with the non-parametric flexibility of kernel methods. Specifically, we transform the inputs of a spectral mixture base kernel with a deep architecture, using local kernel interpolation, inducing points, and structure exploiting (Kronecker and Toeplitz) algebra for a scalable kernel representation. These closed-form kernels can be used as drop-in replacements for standard kernels, with benefits in expressive power and scalability. We jointly learn the properties of these kernels through the marginal likelihood of a Gaussian process. Inference and learning cost O(n) for n training points, and predictions cost O(1) per test point. On a large and diverse collection of applications, including a dataset with 2 million examples, we show improved performance over scalable Gaussian processes with flexible kernel learning models, and stand-alone deep architectures.

...read moreread less

325 citations

Proceedings Article•

Generating Images from Captions with Attention

[...]

Elman Mansimov¹, Emilio Parisotto¹, Jimmy Ba¹, Ruslan Salakhutdinov¹•Institutions (1)

University of Toronto¹

01 Jan 2016

TL;DR: In this paper, a model that generates images from natural language descriptions is proposed, which iteratively draws patches on a canvas, while attending to the relevant words in the description, and demonstrates that their model produces higher quality samples than other approaches and generates images with novel scene compositions corresponding to previously unseen captions in the dataset.

...read moreread less

Abstract: Motivated by the recent progress in generative models, we introduce a model that generates images from natural language descriptions. The proposed model iteratively draws patches on a canvas, while attending to the relevant words in the description. After training on Microsoft COCO, we compare our model with several baseline generative models on image generation and retrieval tasks. We demonstrate that our model produces higher quality samples than other approaches and generates images with novel scene compositions corresponding to previously unseen captions in the dataset.

...read moreread less

274 citations

Posted Content•

Multi-Task Cross-Lingual Sequence Tagging from Scratch

[...]

Zhilin Yang, Ruslan Salakhutdinov, William W. Cohen

20 Mar 2016-arXiv: Computation and Language

TL;DR: A deep hierarchical recurrent neural network for sequence tagging that employs deep gated recurrent units on both character and word levels to encode morphology and context information, and applies a conditional random field layer to predict the tags.

...read moreread less

Abstract: We present a deep hierarchical recurrent neural network for sequence tagging. Given a sequence of words, our model employs deep gated recurrent units on both character and word levels to encode morphology and context information, and applies a conditional random field layer to predict the tags. Our model is task independent, language independent, and feature engineering free. We further extend our model to multi-task and cross-lingual joint training by sharing the architecture and parameters. Our model achieves state-of-the-art results in multiple languages on several benchmark tasks including POS tagging, chunking, and NER. We also demonstrate that multi-task and cross-lingual joint training can improve the performance in various cases.

...read moreread less

235 citations

Proceedings Article•

Review Networks for Caption Generation

[...]

Zhilin Yang¹, Ye Yuan, Yuexin Wu², William W. Cohen¹, Ruslan Salakhutdinov³ - Show less +1 more•Institutions (3)

Carnegie Mellon University¹, Tsinghua University², University of Toronto³

05 Dec 2016

Abstract: We propose a novel extension of the encoder-decoder framework, called a review network. The review network is generic and can enhance any existing encoder- decoder model: in this paper, we consider RNN decoders with both CNN and RNN encoders. The review network performs a number of review steps with attention mechanism on the encoder hidden states, and outputs a thought vector after each review step; the thought vectors are used as the input of the attention mechanism in the decoder. We show that conventional encoder-decoders are a special case of our framework. Empirically, we show that our framework improves over state-of- the-art encoder-decoder systems on the tasks of image captioning and source code captioning.

...read moreread less

219 citations

Posted Content•

Spatially Adaptive Computation Time for Residual Networks

[...]

Michael Figurnov¹, Maxwell D. Collins², Yukun Zhu², Li Zhang², Jonathan Huang², Dmitry Vetrov¹, Ruslan Salakhutdinov³ - Show less +3 more•Institutions (3)

National Research University – Higher School of Economics¹, Google², Carnegie Mellon University³

07 Dec 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, a deep learning architecture based on Residual Network that dynamically adjusts the number of executed layers for the regions of the image is proposed, which is end-to-end trainable, deterministic and problem-agnostic.

...read moreread less

Abstract: This paper proposes a deep learning architecture based on Residual Network that dynamically adjusts the number of executed layers for the regions of the image. This architecture is end-to-end trainable, deterministic and problem-agnostic. It is therefore applicable without any modifications to a wide range of computer vision problems such as image classification, object detection and image segmentation. We present experimental results showing that this model improves the computational efficiency of Residual Networks on the challenging ImageNet classification and COCO object detection datasets. Additionally, we evaluate the computation time maps on the visual saliency dataset cat2000 and find that they correlate surprisingly well with human eye fixation positions.

...read moreread less

214 citations

Posted Content•

The More You Know: Using Knowledge Graphs for Image Classification

[...]

Kenneth Marino¹, Ruslan Salakhutdinov¹, Abhinav Gupta¹•Institutions (1)

Carnegie Mellon University¹

14 Dec 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: Graph Search Neural Networks (GSNNs) as discussed by the authors use structured prior knowledge in the form of knowledge graphs and show that using this knowledge improves performance on image classification, which is similar to our work.

...read moreread less

Abstract: One characteristic that sets humans apart from modern learning-based computer vision algorithms is the ability to acquire knowledge about the world and use that knowledge to reason about the visual world. Humans can learn about the characteristics of objects and the relationships that occur between them to learn a large variety of visual concepts, often with few examples. This paper investigates the use of structured prior knowledge in the form of knowledge graphs and shows that using this knowledge improves performance on image classification. We build on recent work on end-to-end learning on graphs, introducing the Graph Search Neural Network as a way of efficiently incorporating large knowledge graphs into a vision classification pipeline. We show in a number of experiments that our method outperforms standard neural network baselines for multi-label classification.

...read moreread less

167 citations

Proceedings Article•

On the Quantitative Analysis of Decoder-Based Generative Models

[...]

Yuhuai Wu¹, Yuri Burda¹, Ruslan Salakhutdinov², Roger Grosse¹•Institutions (2)

University of Toronto¹, Carnegie Mellon University²

04 Nov 2016

TL;DR: In this article, the authors use Annealed Importance Sampling (AIS) for evaluating log-likelihoods for decoder-based models and validate its accuracy using bidirectional Monte Carlo.

...read moreread less

Abstract: The past several years have seen remarkable progress in generative models which produce convincing samples of images and other modalities. A shared component of many powerful generative models is a decoder network, a parametric deep neural net that defines a generative distribution. Examples include variational autoencoders, generative adversarial networks, and generative moment matching networks. Unfortunately, it can be difficult to quantify the performance of these models because of the intractability of log-likelihood estimation, and inspecting samples can be misleading. We propose to use Annealed Importance Sampling for evaluating log-likelihoods for decoder-based models and validate its accuracy using bidirectional Monte Carlo. The evaluation code is provided at this https URL. Using this technique, we analyze the performance of decoder-based models, the effectiveness of existing log-likelihood estimators, the degree of overfitting, and the degree to which these models miss important modes of the data distribution.

...read moreread less

Posted Content•

Stochastic Variational Deep Kernel Learning

[...]

Andrew Gordon Wilson¹, Zhiting Hu¹, Ruslan Salakhutdinov¹, Eric P. Xing¹•Institutions (1)

Carnegie Mellon University¹

01 Nov 2016-arXiv: Machine Learning

TL;DR: An efficient form of stochastic variational inference is derived which leverages local kernel interpolation, inducing points, and structure exploiting algebra within this framework to enable classification, multi-task learning, additive covariance structures, and Stochastic gradient training.

...read moreread less

Abstract: Deep kernel learning combines the non-parametric flexibility of kernel methods with the inductive biases of deep learning architectures. We propose a novel deep kernel learning model and stochastic variational inference procedure which generalizes deep kernel learning approaches to enable classification, multi-task learning, additive covariance structures, and stochastic gradient training. Specifically, we apply additive base kernels to subsets of output features from deep neural architectures, and jointly learn the parameters of the base kernels and deep network through a Gaussian process marginal likelihood objective. Within this framework, we derive an efficient form of stochastic variational inference which leverages local kernel interpolation, inducing points, and structure exploiting algebra. We show improved performance over stand alone deep networks, SVMs, and state of the art scalable Gaussian processes on several classification benchmarks, including an airline delay dataset containing 6 million training points, CIFAR, and ImageNet.

...read moreread less

Posted Content•

On the Quantitative Analysis of Decoder-Based Generative Models

[...]

Yuhuai Wu¹, Yuri Burda¹, Ruslan Salakhutdinov², Roger Grosse¹•Institutions (2)

University of Toronto¹, Carnegie Mellon University²

14 Nov 2016-arXiv: Learning

TL;DR: This work proposes to use Annealed Importance Sampling for evaluating log-likelihoods for decoder-based models and validate its accuracy using bidirectional Monte Carlo, and analyzes the performance of decoded models, the effectiveness of existing log- likelihood estimators, the degree of overfitting, and the degree to which these models miss important modes of the data distribution.

...read moreread less

Abstract: The past several years have seen remarkable progress in generative models which produce convincing samples of images and other modalities. A shared component of many powerful generative models is a decoder network, a parametric deep neural net that defines a generative distribution. Examples include variational autoencoders, generative adversarial networks, and generative moment matching networks. Unfortunately, it can be difficult to quantify the performance of these models because of the intractability of log-likelihood estimation, and inspecting samples can be misleading. We propose to use Annealed Importance Sampling for evaluating log-likelihoods for decoder-based models and validate its accuracy using bidirectional Monte Carlo. The evaluation code is provided at this https URL. Using this technique, we analyze the performance of decoder-based models, the effectiveness of existing log-likelihood estimators, the degree of overfitting, and the degree to which these models miss important modes of the data distribution.

...read moreread less

Posted Content•

Review Networks for Caption Generation

[...]

Zhilin Yang, Ye Yuan, Yuexin Wu, Ruslan Salakhutdinov, William W. Cohen - Show less +1 more

25 May 2016-arXiv: Learning

TL;DR: The review network performs a number of review steps with attention mechanism on the encoder hidden states, and outputs a thought vector after each review step; the thought vectors are used as the input of the attention mechanism in the decoder.

...read moreread less

Abstract: We propose a novel extension of the encoder-decoder framework, called a review network The review network is generic and can enhance any existing encoder- decoder model: in this paper, we consider RNN decoders with both CNN and RNN encoders The review network performs a number of review steps with attention mechanism on the encoder hidden states, and outputs a thought vector after each review step; the thought vectors are used as the input of the attention mechanism in the decoder We show that conventional encoder-decoders are a special case of our framework Empirically, we show that our framework improves over state-of- the-art encoder-decoder systems on the tasks of image captioning and source code captioning

...read moreread less

Posted Content•

On Multiplicative Integration with Recurrent Neural Networks

[...]

Yuhuai Wu¹, Saizheng Zhang², Ying Zhang², Yoshua Bengio², Ruslan Salakhutdinov¹ - Show less +1 more•Institutions (2)

University of Toronto¹, Université de Montréal²

21 Jun 2016-arXiv: Learning

TL;DR: This work introduces a general and simple structural design called Multiplicative Integration, which changes the way in which information from difference sources flows and is integrated in the computational building block of an RNN, while introducing almost no extra parameters.

...read moreread less

Abstract: We introduce a general and simple structural design called Multiplicative Integration (MI) to improve recurrent neural networks (RNNs). MI changes the way in which information from difference sources flows and is integrated in the computational building block of an RNN, while introducing almost no extra parameters. The new structure can be easily embedded into many popular RNN models, including LSTMs and GRUs. We empirically analyze its learning behaviour and conduct evaluations on several tasks using different RNN models. Our experimental results demonstrate that Multiplicative Integration can provide a substantial performance boost over many of the existing RNN models.

...read moreread less

Posted Content•

Revisiting Semi-Supervised Learning with Graph Embeddings

[...]

Zhilin Yang¹, William W. Cohen¹, Ruslan Salakhutdinov¹•Institutions (1)

Carnegie Mellon University¹

29 Mar 2016-arXiv: Learning

TL;DR: On a large and diverse set of benchmark tasks, including text classification, distantly supervised entity extraction, and entity classification, the proposed semi-supervised learning framework shows improved performance over many of the existing models.

...read moreread less

Abstract: We present a semi-supervised learning framework based on graph embeddings. Given a graph between instances, we train an embedding for each instance to jointly predict the class label and the neighborhood context in the graph. We develop both transductive and inductive variants of our method. In the transductive variant of our method, the class labels are determined by both the learned embeddings and input feature vectors, while in the inductive variant, the embeddings are defined as a parametric function of the feature vectors, so predictions can be made on instances not seen during training. On a large and diverse set of benchmark tasks, including text classification, distantly supervised entity extraction, and entity classification, we show improved performance over many of the existing models.

...read moreread less

Proceedings Article•

Stochastic variational deep kernel learning

[...]

Andrew Gordon Wilson¹, Zhiting Hu¹, Ruslan Salakhutdinov¹, Eric P. Xing¹•Institutions (1)

Carnegie Mellon University¹

01 Jan 2016

TL;DR: Deep Kernel Learning as mentioned in this paper applies additive base kernels to subsets of output features from deep neural architectures, and jointly learns the parameters of the base kernels and deep network through a Gaussian process marginal likelihood objective.

...read moreread less

Abstract: Deep kernel learning combines the non-parametric flexibility of kernel methods with the inductive biases of deep learning architectures. We propose a novel deep kernel learning model and stochastic variational inference procedure which generalizes deep kernel learning approaches to enable classification, multi-task learning, additive covariance structures, and stochastic gradient training. Specifically, we apply additive base kernels to subsets of output features from deep neural architectures, and jointly learn the parameters of the base kernels and deep network through a Gaussian process marginal likelihood objective. Within this framework, we derive an efficient form of stochastic variational inference which leverages local kernel interpolation, inducing points, and structure exploiting algebra. We show improved performance over stand alone deep networks, SVMs, and state of the art scalable Gaussian processes on several classification benchmarks, including an airline delay dataset containing 6 million training points, CIFAR, and ImageNet.

...read moreread less

Posted Content•

Gated-Attention Readers for Text Comprehension

[...]

Bhuwan Dhingra¹, Hanxiao Liu², Zhilin Yang², William W. Cohen², Ruslan Salakhutdinov² - Show less +1 more•Institutions (2)

Microsoft¹, Carnegie Mellon University²

05 Jun 2016-arXiv: Computation and Language

TL;DR: The model, the Gated-Attention (GA) Reader, integrates a multi-hop architecture with a novel attention mechanism, which is based on multiplicative interactions between the query embedding and the intermediate states of a recurrent neural network document reader, which enables the reader to build query-specific representations of tokens in the document for accurate answer selection.

...read moreread less

Abstract: In this paper we study the problem of answering cloze-style questions over documents. Our model, the Gated-Attention (GA) Reader, integrates a multi-hop architecture with a novel attention mechanism, which is based on multiplicative interactions between the query embedding and the intermediate states of a recurrent neural network document reader. This enables the reader to build query-specific representations of tokens in the document for accurate answer selection. The GA Reader obtains state-of-the-art results on three benchmarks for this task--the CNN \& Daily Mail news stories and the Who Did What dataset. The effectiveness of multiplicative interaction is demonstrated by an ablation study, and by comparing to alternative compositional operators for implementing the gated-attention. The code is available at this https URL.

...read moreread less

Posted Content•

Encode, Review, and Decode: Reviewer Module for Caption Generation.

[...]

Zhilin Yang, Ye Yuan, Yuexin Wu, Ruslan Salakhutdinov, William W. Cohen - Show less +1 more

25 May 2016-arXiv: Learning

TL;DR: The reviewer module performs a number of review steps with attention mechanism on the encoder hidden states, and outputs a fact vector after each review step; the fact vectors are used as the input of the attention mechanism in the decoder.

...read moreread less

Abstract: We propose a novel module, the reviewer module, to improve the encoder-decoder learning framework. The reviewer module is generic, and can be plugged into an existing encoder-decoder model. The reviewer module performs a number of review steps with attention mechanism on the encoder hidden states, and outputs a fact vector after each review step; the fact vectors are used as the input of the attention mechanism in the decoder. We show that the conventional encoder-decoders are a special case of our framework. Empirically, we show that our framework can improve over state-of-the-art encoder-decoder systems on the tasks of image captioning and source code captioning.

...read moreread less

Proceedings Article•DOI•

Deep Neural Networks with Massive Learned Knowledge

[...]

Zhiting Hu¹, Zichao Yang¹, Ruslan Salakhutdinov¹, Eric P. Xing¹•Institutions (1)

Carnegie Mellon University¹

01 Nov 2016

TL;DR: A general framework is developed that enables learning knowledge and its confidence jointly with the DNNs, so that the vast amount of fuzzy knowledge can be incorporated and automatically optimized with little manual efforts.

...read moreread less

Abstract: Regulating deep neural networks (DNNs) with human structured knowledge has shown to be of great benefit for improved accuracy and interpretability. We develop a general framework that enables learning knowledge and its confidence jointly with the DNNs, so that the vast amount of fuzzy knowledge can be incorporated and automatically optimized with little manual efforts. We apply the framework to sentence sentiment analysis, augmenting a DNN with massive linguistic constraints on discourse and polarity structures. Our model substantially enhances the performance using less training data, and shows improved interpretability. The principled framework can also be applied to posterior regularization for regulating other statistical models.

...read moreread less

Posted Content•

Words or Characters? Fine-grained Gating for Reading Comprehension

[...]

Zhilin Yang¹, Bhuwan Dhingra², Ye Yuan, Junjie Hu³, William W. Cohen¹, Ruslan Salakhutdinov¹ - Show less +2 more•Institutions (3)

Carnegie Mellon University¹, Microsoft², The Chinese University of Hong Kong³

06 Nov 2016-arXiv: Computation and Language

TL;DR: A fine-grained gating mechanism to dynamically combine word-level and character-level representations based on properties of the words is presented, which can improve the performance on reading comprehension tasks and show improved results on a social media tag prediction task.

...read moreread less

Abstract: Previous work combines word-level and character-level representations using concatenation or scalar weighting, which is suboptimal for high-level tasks like reading comprehension. We present a fine-grained gating mechanism to dynamically combine word-level and character-level representations based on properties of the words. We also extend the idea of fine-grained gating to modeling the interaction between questions and paragraphs for reading comprehension. Experiments show that our approach can improve the performance on reading comprehension tasks, achieving new state-of-the-art results on the Children's Book Test dataset. To demonstrate the generality of our gating mechanism, we also show improved results on a social media tag prediction task.

...read moreread less

Posted Content•

Architectural Complexity Measures of Recurrent Neural Networks

[...]

Saizheng Zhang¹, Yuhuai Wu², Tong Che¹, Zhouhan Lin¹, Roland Memisevic¹, Ruslan Salakhutdinov², Yoshua Bengio¹ - Show less +3 more•Institutions (2)

Université de Montréal¹, University of Toronto²

26 Feb 2016-arXiv: Learning

TL;DR: In this article, a graph-theoretic framework is presented to analyze the connecting architectures of RNNs and three architecture complexity measures are proposed: the recurrent depth, the feedforward depth and the recurrent skip coefficient.

...read moreread less

Abstract: In this paper, we systematically analyze the connecting architectures of recurrent neural networks (RNNs). Our main contribution is twofold: first, we present a rigorous graph-theoretic framework describing the connecting architectures of RNNs in general. Second, we propose three architecture complexity measures of RNNs: (a) the recurrent depth, which captures the RNN's over-time nonlinear complexity, (b) the feedforward depth, which captures the local input-output nonlinearity (similar to the "depth" in feedforward neural networks (FNNs)), and (c) the recurrent skip coefficient which captures how rapidly the information propagates over time. We rigorously prove each measure's existence and computability. Our experimental results show that RNNs might benefit from larger recurrent depth and feedforward depth. We further demonstrate that increasing recurrent skip coefficient offers performance boosts on long term dependency problems.

...read moreread less

Proceedings Article•

On Multiplicative Integration with Recurrent Neural Networks

[...]

Yuhuai Wu¹, Saizheng Zhang², Ying Zhang², Yoshua Bengio², Ruslan Salakhutdinov¹ - Show less +1 more•Institutions (2)

University of Toronto¹, Université de Montréal²

21 Jun 2016

TL;DR: In this paper, a general simple structural design called multiplicative integration (MI) is introduced to improve recurrent neural networks (RNNs) by changing the way of how the information flow gets integrated in the computational building block of an RNN, while introducing almost no extra parameters.

...read moreread less

Abstract: We introduce a general simple structural design called “Multiplicative Integration” (MI) to improve recurrent neural networks (RNNs). MI changes the way of how the information flow gets integrated in the computational building block of an RNN, while introducing almost no extra parameters. The new structure can be easily embedded into many popular RNN models, including LSTMs and GRUs. We empirically analyze its learning behaviour and conduct evaluations on several tasks using different RNN models. Our experimental results demonstrate that Multiplicative Integration can provide a substantial performance boost over many of the existing RNN models.

...read moreread less

Proceedings Article•

Architectural Complexity Measures of Recurrent Neural Networks

[...]

Saizheng Zhang¹, Yuhuai Wu², Tong Che¹, Zhouhan Lin¹, Roland Memisevic¹, Ruslan Salakhutdinov², Yoshua Bengio¹ - Show less +3 more•Institutions (2)

Université de Montréal¹, University of Toronto²

26 Feb 2016

TL;DR: In this article, a graph-theoretic framework is presented to analyze the connecting architectures of RNNs and three architecture complexity measures are proposed: the recurrent depth, the feedforward depth, and the recurrent skip coefficient.

...read moreread less

Abstract: In this paper, we systematically analyze the connecting architectures of recurrent neural networks (RNNs). Our main contribution is twofold: first, we present a rigorous graph-theoretic framework describing the connecting architectures of RNNs in general. Second, we propose three architecture complexity measures of RNNs: (a) the recurrent depth, which captures the RNN’s over-time nonlinear complexity, (b) the feedforward depth, which captures the local input-output nonlinearity (similar to the “depth” in feedforward neural networks (FNNs)), and (c) the recurrent skip coefficient which captures how rapidly the information propagates over time. We rigorously prove each measure’s existence and computability. Our experimental results show that RNNs might benefit from larger recurrent depth and feedforward depth. We further demonstrate that increasing recurrent skip coefficient offers performance boosts on long term dependency problems.

...read moreread less

Patent•

Method and apparatus for face recognition

[...]

Jung-Bae Kim¹, Ruslan Salakhutdinov², Jae-Joon Han¹, ByungIn Yoo¹•Institutions (2)

Samsung¹, University of Toronto²

22 Jun 2016

TL;DR: In this article, a training method of training an illumination compensation model includes extracting, from a training image, an albedo image of a face area, a surface normal image of the face area and an illumination feature, the extracting being based on an illumination model; generating an illumination restoration image based on the albedos image, the surface normal images, and the illumination feature.

...read moreread less

Abstract: A training method of training an illumination compensation model includes extracting, from a training image, an albedo image of a face area, a surface normal image of the face area, and an illumination feature, the extracting being based on an illumination compensation model; generating an illumination restoration image based on the albedo image, the surface normal image, and the illumination feature; and training the illumination compensation model based on the training image and the illumination restoration image.

...read moreread less

Proceedings Article•

Iterative Refinement of the Approximate Posterior for Directed Belief Networks

[...]

R Devon Hjelm¹, Ruslan Salakhutdinov², Kyunghyun Cho³, Nebojsa Jojic⁴, Vince D. Calhoun¹, Junyoung Chung⁵ - Show less +2 more•Institutions (5)

University of New Mexico¹, Carnegie Mellon University², New York University³, Microsoft⁴, Université de Montréal⁵

29 Oct 2016

TL;DR: This article proposed an iterative refinement procedure for improving the approximate posterior of the recognition network and showed that training with the refined posterior is competitive with state-of-the-art methods.

...read moreread less

Abstract: Variational methods that rely on a recognition network to approximate the posterior of directed graphical models offer better inference and learning than previous methods. Recent advances that exploit the capacity and flexibility in this approach have expanded what kinds of models can be trained. However, as a proposal for the posterior, the capacity of the recognition network is limited, which can constrain the representational power of the generative model and increase the variance of Monte Carlo estimates. To address these issues, we introduce an iterative refinement procedure for improving the approximate posterior of the recognition network and show that training with the refined posterior is competitive with state-of-the-art methods. The advantages of refinement are further evident in an increased effective sample size, which implies a lower variance of gradient estimates.

...read moreread less

Proceedings Article•

Path-Normalized Optimization of Recurrent Neural Networks with ReLU Activations

[...]

Behnam Neyshabur¹, Yuhuai Wu², Ruslan Salakhutdinov³, Nathan Srebro⁴•Institutions (4)

Toyota Technological Institute at Chicago¹, University of Toronto², Carnegie Mellon University³, University of Chicago⁴

01 Jan 2016

TL;DR: In this article, the authors investigate the parameter space geometry of recurrent neural networks and develop an adaptation of path-SGD optimization method, attuned to this geometry, that can learn plain RNNs with ReLU activations.

...read moreread less

Abstract: We investigate the parameter-space geometry of recurrent neural networks (RNNs), and develop an adaptation of path-SGD optimization method, attuned to this geometry, that can learn plain RNNs with ReLU activations. On several datasets that require capturing long-term dependency structure, we show that path-SGD can significantly improve trainability of ReLU RNNs compared to RNNs trained with SGD, even with various recently suggested initialization schemes.

...read moreread less

Proceedings Article•

Data-Dependent Path Normalization in Neural Networks

[...]

Behnam Neyshabur¹, Ryota Tomioka², Ruslan Salakhutdinov³, Nathan Srebro¹•Institutions (3)

Toyota Technological Institute at Chicago¹, Microsoft², University of Toronto³

30 Nov 2016

TL;DR: The authors propose a unified framework for neural net normalization, regularization and optimization, which includes Path-SGD and Batch-Normalization and interpolates between them across two different dimensions.

...read moreread less

Abstract: We propose a unified framework for neural net normalization, regularization and optimization, which includes Path-SGD and Batch-Normalization and interpolates between them across two different dimensions. Through this framework we investigate issue of invariance of the optimization, data dependence and the connection with natural gradients.

...read moreread less

Posted Content•

Path-Normalized Optimization of Recurrent Neural Networks with ReLU Activations

[...]

Behnam Neyshabur¹, Yuhuai Wu², Ruslan Salakhutdinov³, Nathan Srebro⁴•Institutions (4)

Toyota Technological Institute at Chicago¹, University of Toronto², Carnegie Mellon University³, University of Chicago⁴

23 May 2016-arXiv: Learning

TL;DR: On several datasets that require capturing long-term dependency structure, it is shown that path-SGD can significantly improve trainability of ReLU RNNs compared to RNN's trained with SGD, even with various recently suggested initialization schemes.

...read moreread less

Abstract: We investigate the parameter-space geometry of recurrent neural networks (RNNs), and develop an adaptation of path-SGD optimization method, attuned to this geometry, that can learn plain RNNs with ReLU activations. On several datasets that require capturing long-term dependency structure, we show that path-SGD can significantly improve trainability of ReLU RNNs compared to RNNs trained with SGD, even with various recently suggested initialization schemes.

...read moreread less

Proceedings Article•

Words or Characters? Fine-grained Gating for Reading Comprehension

[...]

Zhilin Yang¹, Bhuwan Dhingra², Ye Yuan, Junjie Hu³, William W. Cohen¹, Ruslan Salakhutdinov¹ - Show less +2 more•Institutions (3)

Carnegie Mellon University¹, Microsoft², The Chinese University of Hong Kong³

04 Nov 2016

TL;DR: The authors proposed a fine-grained gating mechanism to dynamically combine word-level and character-level representations based on properties of the words for reading comprehension, achieving state-of-the-art results on the Children's Book Test dataset.

...read moreread less

Abstract: Previous work combines word-level and character-level representations using concatenation or scalar weighting, which is suboptimal for high-level tasks like reading comprehension. We present a fine-grained gating mechanism to dynamically combine word-level and character-level representations based on properties of the words. We also extend the idea of fine-grained gating to modeling the interaction between questions and paragraphs for reading comprehension. Experiments show that our approach can improve the performance on reading comprehension tasks, achieving new state-of-the-art results on the Children's Book Test dataset. To demonstrate the generality of our gating mechanism, we also show improved results on a social media tag prediction task.

...read moreread less

Showing papers by "Ruslan Salakhutdinov published in 2016"