Top 37 papers published by Ruslan Salakhutdinov from Carnegie Mellon University in 2017

Posted Content•

Deep Sets

Manzil Zaheer¹, Satwik Kottur¹, Siamak Ravanbakhsh¹, Barnabás Póczos², Ruslan Salakhutdinov³, Alexander J. Smola⁴•Institutions (4)

Carnegie Mellon University¹, University of California, Riverside², University of Toronto³, Google⁴

10 Mar 2017-arXiv: Learning

TL;DR: The main theorem characterizes the permutation invariant objective functions and provides a family of functions to which any permutation invariant objective function must belong. This enables the design of a deep network architecture that can operate on sets and be deployed in a variety of scenarios, including both unsupervised and supervised learning tasks.

Abstract: We study the problem of designing models for machine learning tasks defined on sets. In contrast to the traditional approach of operating on fixed dimensional vectors, we consider objective functions defined on sets that are invariant to permutations. Such problems are widespread, ranging from estimation of population statistics, to anomaly detection in piezometer data of embankment dams, to cosmology. Our main theorem characterizes the permutation invariant functions and provides a family of functions to which any permutation invariant objective function must belong. This family of functions has a special structure which enables us to design a deep network architecture that can operate on sets and which can be deployed on a variety of scenarios including both unsupervised and supervised learning tasks. We also derive the necessary and sufficient conditions for permutation equivariance in deep models. We demonstrate the applicability of our method on population statistic estimation, point cloud classification, set expansion, and outlier detection.
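
The permutation equivariance mentioned in this abstract can be illustrated with a minimal sketch. It assumes the commonly cited layer form in which each output combines the element itself with a pooled summary of the whole set. The function name, the scalar weights `lam` and `gamma`, and the `tanh` nonlinearity are illustrative assumptions, not details taken from this listing.

```python
import math


def equivariant_layer(xs, lam, gamma):
    """Map a list of scalars to a list of the same length.

    Each output is tanh(lam * x_i + gamma * sum(xs)): one weight acts on
    the element itself, one shared weight acts on a pooled summary of the
    set. Both weights are hypothetical values chosen for illustration.
    """
    pooled = sum(xs)
    return [math.tanh(lam * x + gamma * pooled) for x in xs]


xs = [0.5, -1.0, 2.0, 0.25]
perm = [2, 0, 3, 1]  # an arbitrary reordering of the indices

out = equivariant_layer(xs, lam=0.7, gamma=-0.2)
out_of_permuted = equivariant_layer([xs[i] for i in perm], lam=0.7, gamma=-0.2)
permuted_out = [out[i] for i in perm]

# Equivariance: permuting the input permutes the output in the same way.
assert all(math.isclose(a, b) for a, b in zip(out_of_permuted, permuted_out))
```

Because the pooled term is a sum, it does not depend on element order, so reordering the input only reorders the per-element outputs.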

1,329 citations

Proceedings Article•

Toward Controlled Generation of Text

Zhiting Hu¹, Zichao Yang¹, Xiaodan Liang¹, Ruslan Salakhutdinov¹, Eric P. Xing¹•Institutions (1)

Carnegie Mellon University¹

06 Aug 2017

TL;DR: A new neural generative model is proposed which combines variational auto-encoders and holistic attribute discriminators for effective imposition of semantic structures in generic generation and manipulation of text.

Abstract: Generic generation and manipulation of text is challenging and has limited success compared to recent deep generative modeling in visual domain. This paper aims at generating plausible text sentences, whose attributes are controlled by learning disentangled latent representations with designated semantics. We propose a new neural generative model which combines variational auto-encoders (VAEs) and holistic attribute discriminators for effective imposition of semantic structures. The model can alternatively be seen as enhancing VAEs with the wake-sleep algorithm for leveraging fake samples as extra training data. With differentiable approximation to discrete text samples, explicit constraints on independent attribute controls, and efficient collaborative learning of generator and discriminators, our model learns interpretable representations from even only word annotations, and produces short sentences with desired attributes of sentiment and tenses. Quantitative experiments using trained classifiers as evaluators validate the accuracy of sentence and attribute generation.

735 citations

Posted Content•

Toward Controlled Generation of Text

Zhiting Hu¹, Zichao Yang¹, Xiaodan Liang¹, Ruslan Salakhutdinov¹, Eric P. Xing¹•Institutions (1)

Carnegie Mellon University¹

02 Mar 2017-arXiv: Learning

TL;DR: This article proposes a new neural generative model that combines variational auto-encoders and holistic attribute discriminators for effective imposition of semantic structures. The model learns highly interpretable representations from even only word annotations and produces realistic sentences with desired attributes.

Abstract: Generic generation and manipulation of text is challenging and has limited success compared to recent deep generative modeling in visual domain. This paper aims at generating plausible natural language sentences, whose attributes are dynamically controlled by learning disentangled latent representations with designated semantics. We propose a new neural generative model which combines variational auto-encoders and holistic attribute discriminators for effective imposition of semantic structures. With differentiable approximation to discrete text samples, explicit constraints on independent attribute controls, and efficient collaborative learning of generator and discriminators, our model learns highly interpretable representations from even only word annotations, and produces realistic sentences with desired attributes. Quantitative evaluation validates the accuracy of sentence and attribute generation.

536 citations

Proceedings Article•

Deep Sets

Manzil Zaheer¹, Satwik Kottur¹, Siamak Ravanbakhsh¹, Barnabás Póczos², Ruslan Salakhutdinov³, Alexander J. Smola⁴•Institutions (4)

Carnegie Mellon University¹, University of California, Riverside², University of Toronto³, Google⁴

01 Jan 2017

TL;DR: In this paper, the authors study the problem of designing models for machine learning tasks defined on sets and provide a family of functions to which any permutation invariant objective function must belong.

Abstract: We study the problem of designing models for machine learning tasks defined on sets. In contrast to the traditional approach of operating on fixed dimensional vectors, we consider objective functions defined on sets that are invariant to permutations. Such problems are widespread, ranging from the estimation of population statistics, to anomaly detection in piezometer data of embankment dams, to cosmology. Our main theorem characterizes the permutation invariant objective functions and provides a family of functions to which any permutation invariant objective function must belong. This family of functions has a special structure which enables us to design a deep network architecture that can operate on sets and which can be deployed on a variety of scenarios including both unsupervised and supervised learning tasks. We demonstrate the applicability of our method on population statistic estimation, point cloud classification, set expansion, and outlier detection.
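
As a rough illustration of the family of permutation invariant functions described in this abstract, the sketch below assumes the widely quoted sum-decomposition form, in which a set function is written as an outer function applied to a sum of per-element features. The names `phi`, `rho`, and `set_function` are hypothetical, and the variance example is chosen only because population statistics are mentioned as an application.

```python
import math


def phi(x):
    # Hypothetical per-element feature map: value, squared value, count.
    return (x, x * x, 1.0)


def rho(pooled):
    # Hypothetical outer function: turns pooled sums into a variance.
    total, total_sq, count = pooled
    mean = total / count
    return total_sq / count - mean * mean


def set_function(xs):
    # Sum the per-element features coordinate-wise, then apply rho.
    pooled = [sum(column) for column in zip(*(phi(x) for x in xs))]
    return rho(pooled)


values = [1.0, 2.0, 3.0, 4.0]
shuffled = [3.0, 1.0, 4.0, 2.0]

# The result depends only on the contents of the set, not on the order.
assert math.isclose(set_function(values), set_function(shuffled))
```

In a learned model, `phi` and `rho` would be neural networks rather than hand-written formulas; the summation in the middle is what makes the result independent of ordering.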

370 citations

Proceedings Article•

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model

Zhilin Yang¹, Zihang Dai¹, Ruslan Salakhutdinov², William W. Cohen³•Institutions (3)

Carnegie Mellon University¹, Apple Inc.², Google³

10 Nov 2017

TL;DR: The authors formulate language modeling as a matrix factorization problem, and show that the expressiveness of softmax-based models is limited by a Softmax bottleneck, which further implies that in practice Softmax with distributed word embeddings does not have enough capacity to model natural language.

Abstract: We formulate language modeling as a matrix factorization problem, and show that the expressiveness of Softmax-based models (including the majority of neural language models) is limited by a Softmax bottleneck. Given that natural language is highly context-dependent, this further implies that in practice Softmax with distributed word embeddings does not have enough capacity to model natural language. We propose a simple and effective method to address this issue, and improve the state-of-the-art perplexities on Penn Treebank and WikiText-2 to 47.69 and 40.68 respectively. The proposed method also excels on the large-scale 1B Word dataset, outperforming the baseline by over 5.6 points in perplexity.
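
The matrix factorization view in this abstract can be made concrete with a small sketch. It shows only the basic linear-algebra fact behind the bottleneck: when logits are dot products between d-dimensional context vectors and d-dimensional word embeddings, the logit matrix has rank at most d, however many contexts and words there are. The toy matrices and helper names are assumptions for illustration, and the sketch does not reproduce the method the paper proposes.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank by Gaussian elimination, using exact rational arithmetic."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def logits_matrix(contexts, embeddings):
    # Entry (i, j) is the dot product of context i and word embedding j.
    return [[sum(h * w for h, w in zip(ctx, emb)) for emb in embeddings]
            for ctx in contexts]


d = 2  # embedding dimension (toy value)
contexts = [[1, 0], [0, 1], [1, 1], [2, -1]]            # 4 contexts
embeddings = [[1, 2], [3, 1], [0, 1], [2, 2], [-1, 1]]  # 5 words

logits = logits_matrix(contexts, embeddings)

# A 4 x 5 logit matrix still cannot exceed the embedding dimension in rank.
assert matrix_rank(logits) <= d
```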

326 citations

Proceedings Article•

Good Semi-supervised Learning That Requires a Bad GAN

Zihang Dai¹, Zhilin Yang², Fan Yang², William W. Cohen², Ruslan Salakhutdinov²•Institutions (2)

Baidu¹, Carnegie Mellon University²

27 May 2017

TL;DR: Theoretically, it is shown that, given the discriminator objective, good semi-supervised learning indeed requires a bad generator. A novel formulation based on this analysis is derived that substantially improves over feature matching GANs, obtaining state-of-the-art results on multiple benchmark datasets.

Abstract: Semi-supervised learning methods based on generative adversarial networks (GANs) obtained strong empirical results, but it is not clear 1) how the discriminator benefits from joint training with a generator, and 2) why good semi-supervised classification performance and a good generator cannot be obtained at the same time. Theoretically we show that given the discriminator objective, good semi-supervised learning indeed requires a bad generator, and propose the definition of a preferred generator. Empirically, we derive a novel formulation based on our analysis that substantially improves over feature matching GANs, obtaining state-of-the-art results on multiple benchmark datasets.

291 citations

Proceedings Article•DOI•

The More You Know: Using Knowledge Graphs for Image Classification

Kenneth Marino¹, Ruslan Salakhutdinov¹, Abhinav Gupta¹•Institutions (1)

Carnegie Mellon University¹

21 Jul 2017

TL;DR: This paper investigates the use of structured prior knowledge in the form of knowledge graphs and shows that using this knowledge improves performance on image classification, and introduces the Graph Search Neural Network as a way of efficiently incorporating large knowledge graphs into a vision classification pipeline.

Abstract: One characteristic that sets humans apart from modern learning-based computer vision algorithms is the ability to acquire knowledge about the world and use that knowledge to reason about the visual world. Humans can learn about the characteristics of objects and the relationships that occur between them to learn a large variety of visual concepts, often with few examples. This paper investigates the use of structured prior knowledge in the form of knowledge graphs and shows that using this knowledge improves performance on image classification. We build on recent work on end-to-end learning on graphs, introducing the Graph Search Neural Network as a way of efficiently incorporating large knowledge graphs into a vision classification pipeline. We show in a number of experiments that our method outperforms standard neural network baselines for multi-label classification.

278 citations

Posted Content•

Good Semi-supervised Learning that Requires a Bad GAN

Zihang Dai¹, Zhilin Yang², Fan Yang², William W. Cohen², Ruslan Salakhutdinov²•Institutions (2)

Baidu¹, Carnegie Mellon University²

27 May 2017-arXiv: Learning

TL;DR: This paper shows that good semi-supervised classification performance and a good generator cannot be obtained at the same time, proposes the definition of a preferred generator, and derives a formulation that substantially improves over feature matching GANs.

Abstract: Semi-supervised learning methods based on generative adversarial networks (GANs) obtained strong empirical results, but it is not clear 1) how the discriminator benefits from joint training with a generator, and 2) why good semi-supervised classification performance and a good generator cannot be obtained at the same time. Theoretically, we show that given the discriminator objective, good semi-supervised learning indeed requires a bad generator, and propose the definition of a preferred generator. Empirically, we derive a novel formulation based on our analysis that substantially improves over feature matching GANs, obtaining state-of-the-art results on multiple benchmark datasets.

265 citations

Proceedings Article•

Improved Variational Autoencoders for Text Modeling using Dilated Convolutions

Zichao Yang¹, Zhiting Hu¹, Ruslan Salakhutdinov¹, Taylor Berg-Kirkpatrick¹•Institutions (1)

Carnegie Mellon University¹

06 Aug 2017

TL;DR: The authors show that there is a trade-off between the contextual capacity of the decoder and effective use of encoding information, and that, when this trade-off is carefully managed, variational autoencoders can outperform LSTM language models.

Abstract: Recent work on generative text modeling has found that variational autoencoders (VAE) with LSTM decoders perform worse than simpler LSTM language models (Bowman et al., 2015). This negative result is so far poorly understood, but has been attributed to the propensity of LSTM decoders to ignore conditioning information from the encoder. In this paper, we experiment with a new type of decoder for VAE: a dilated CNN. By changing the decoder's dilation architecture, we control the size of context from previously generated words. In experiments, we find that there is a trade-off between contextual capacity of the decoder and effective use of encoding information. We show that when carefully managed, VAEs can outperform LSTM language models. We demonstrate perplexity gains on two datasets, representing the first positive language modeling result with VAE. Further, we conduct an in-depth investigation of the use of VAE (with our new decoding architecture) for semi-supervised and unsupervised labeling tasks, demonstrating gains over several strong baselines.
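
To make the link between dilation and context size concrete, here is a minimal sketch using the standard receptive-field formula for a stack of stride-1 dilated convolutions. The kernel size and dilation schedules below are illustrative assumptions, not the configurations used in the paper.

```python
def receptive_field(kernel_size, dilations):
    """Positions visible to one output of stacked stride-1 dilated convs.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    For a causal decoder this count covers the current position plus the
    previously generated positions it can condition on.
    """
    return 1 + (kernel_size - 1) * sum(dilations)


# Hypothetical decoder depths: deeper stacks see more previous words.
small = receptive_field(kernel_size=3, dilations=[1, 2, 4])
large = receptive_field(kernel_size=3, dilations=[1, 2, 4, 8])

assert small < large
```

Adding one more layer with a doubled dilation roughly doubles the visible context, which is the kind of knob the abstract describes for trading decoder capacity against reliance on the encoder.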

261 citations

Proceedings Article•DOI•

Spatially Adaptive Computation Time for Residual Networks

Michael Figurnov¹, Maxwell D. Collins², Yukun Zhu², Li Zhang², Jonathan Huang², Dmitry Vetrov¹, Ruslan Salakhutdinov³•Institutions (3)

National Research University – Higher School of Economics¹, Google², Carnegie Mellon University³

01 Jul 2017

TL;DR: Experimental results are presented showing that this model improves the computational efficiency of Residual Networks on the challenging ImageNet classification and COCO object detection datasets and the computation time maps on the visual saliency dataset cat2000 correlate surprisingly well with human eye fixation positions.

Abstract: This paper proposes a deep learning architecture based on Residual Network that dynamically adjusts the number of executed layers for the regions of the image. This architecture is end-to-end trainable, deterministic and problem-agnostic. It is therefore applicable without any modifications to a wide range of computer vision problems such as image classification, object detection and image segmentation. We present experimental results showing that this model improves the computational efficiency of Residual Networks on the challenging ImageNet classification and COCO object detection datasets. Additionally, we evaluate the computation time maps on the visual saliency dataset cat2000 and find that they correlate surprisingly well with human eye fixation positions.

249 citations

Proceedings Article•DOI•

Gated-Attention Readers for Text Comprehension

Bhuwan Dhingra¹, Hanxiao Liu², Zhilin Yang², William W. Cohen², Ruslan Salakhutdinov²•Institutions (2)

Microsoft¹, Carnegie Mellon University²

01 Jul 2017

TL;DR: The Gated-Attention (GA) Reader integrates a multi-hop architecture with a novel attention mechanism based on multiplicative interactions between the query embedding and the intermediate states of a recurrent neural network document reader.

Abstract: In this paper we study the problem of answering cloze-style questions over documents. Our model, the Gated-Attention (GA) Reader, integrates a multi-hop architecture with a novel attention mechanism, which is based on multiplicative interactions between the query embedding and the intermediate states of a recurrent neural network document reader. This enables the reader to build query-specific representations of tokens in the document for accurate answer selection. The GA Reader obtains state-of-the-art results on three benchmarks for this task: the CNN & Daily Mail news stories and the Who Did What dataset. The effectiveness of multiplicative interaction is demonstrated by an ablation study, and by comparing to alternative compositional operators for implementing the gated-attention.
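
The multiplicative interaction described in this abstract can be sketched as follows. This is a simplified single-hop illustration under stated assumptions: each document token attends over the query tokens with a dot-product softmax, and the resulting query summary gates the token state element-wise. The function names and toy vectors are hypothetical, and the recurrent layers of the full reader are omitted.

```python
import math


def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_attention(doc_states, query_states):
    """Gate each document token state with a token-specific query summary."""
    gated = []
    for d in doc_states:
        # Attention of this document token over every query token.
        scores = [sum(q_k * d_k for q_k, d_k in zip(q, d)) for q in query_states]
        alpha = softmax(scores)
        # Query summary for this token: attention-weighted sum of query states.
        q_tilde = [sum(a * q[k] for a, q in zip(alpha, query_states))
                   for k in range(len(d))]
        # Multiplicative (element-wise) interaction.
        gated.append([d_k * q_k for d_k, q_k in zip(d, q_tilde)])
    return gated


doc = [[1.0, 2.0], [3.0, -1.0]]
query = [[0.5, 2.0], [1.0, 0.0]]
gated_doc = gated_attention(doc, query)

# One gated vector per document token, same dimensionality as the input.
assert len(gated_doc) == len(doc)
assert all(len(g) == len(d) for g, d in zip(gated_doc, doc))
```

The abstract notes that alternative compositional operators were compared against multiplication; swapping the element-wise product in the last step for addition or concatenation is where such a variant would go in this sketch.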

Proceedings Article•DOI•

Learning Robust Visual-Semantic Embeddings

Yao-Hung Hubert Tsai¹, Liang-Kang Huang, Ruslan Salakhutdinov¹•Institutions (1)

Carnegie Mellon University¹

01 Oct 2017

TL;DR: An end-to-end learning framework that is able to extract more robust multi-modal representations across domains and a novel technique of unsupervised-data adaptation inference is introduced to construct more comprehensive embeddings for both labeled and unlabeled data.

Abstract: Many of the existing methods for learning joint embedding of images and text use only supervised information from paired images and their textual attributes. Taking advantage of the recent success of unsupervised learning in deep neural networks, we propose an end-to-end learning framework that is able to extract more robust multi-modal representations across domains. The proposed method combines representation learning models (i.e., auto-encoders) together with cross-domain learning criteria (i.e., Maximum Mean Discrepancy loss) to learn joint embeddings for semantic and visual features. A novel technique of unsupervised-data adaptation inference is introduced to construct more comprehensive embeddings for both labeled and unlabeled data. We evaluate our method on the Animals with Attributes and Caltech-UCSD Birds 200-2011 datasets with a wide range of applications, including zero and few-shot image recognition and retrieval, from inductive to transductive settings. Empirically, we show that our framework improves over the current state of the art on many of the considered tasks.
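
For readers unfamiliar with the Maximum Mean Discrepancy criterion named in this abstract, the following is a minimal sketch of the standard biased estimate of squared MMD between two samples. The Gaussian (RBF) kernel, the bandwidth value, and the toy samples are assumptions made for illustration; the listing does not say which kernel the paper uses.

```python
import math


def rbf(x, y, bandwidth=1.0):
    # Gaussian kernel on the squared Euclidean distance.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""
    def mean_kernel(a, b):
        return sum(rbf(u, v, bandwidth) for u in a for v in b) / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical embeddings from two domains (e.g. visual and semantic).
visual = [[0.0, 0.1], [0.2, -0.1], [0.1, 0.0]]
semantic_near = [[0.1, 0.1], [0.0, -0.1], [0.2, 0.0]]
semantic_far = [[3.0, 3.1], [3.2, 2.9], [3.1, 3.0]]

# Distributions that overlap yield a smaller discrepancy than distant ones.
assert mmd_squared(visual, semantic_near) < mmd_squared(visual, semantic_far)
```

Used as a loss, this quantity is minimized so that embeddings from the two domains become hard to tell apart at the distribution level.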

Posted Content•

Neural Map: Structured Memory for Deep Reinforcement Learning

Emilio Parisotto¹, Ruslan Salakhutdinov¹•Institutions (1)

University of Toronto¹

27 Feb 2017-arXiv: Learning

TL;DR: This paper develops a memory system with an adaptable write operator that is customized to the sorts of 3D environments that DRL agents typically interact with and demonstrates empirically that the Neural Map surpasses previous DRL memories on a set of challenging 2D and 3D maze environments.

Abstract: A critical component to enabling intelligent reasoning in partially observable environments is memory. Despite this importance, Deep Reinforcement Learning (DRL) agents have so far used relatively simple memory architectures, with the main methods to overcome partial observability being either a temporal convolution over the past k frames or an LSTM layer. More recent work (Oh et al., 2016) has gone beyond these architectures by using memory networks which can allow more sophisticated addressing schemes over the past k frames. But even these architectures are unsatisfactory because they are limited to only remembering information from the last k frames. In this paper, we develop a memory system with an adaptable write operator that is customized to the sorts of 3D environments that DRL agents typically interact with. This architecture, called the Neural Map, uses a spatially structured 2D memory image to learn to store arbitrary information about the environment over long time lags. We demonstrate empirically that the Neural Map surpasses previous DRL memories on a set of challenging 2D and 3D maze environments and show that it is capable of generalizing to environments that were not seen during training.

Posted Content•

Gated-Attention Architectures for Task-Oriented Language Grounding

Devendra Singh Chaplot¹, Kanthashree Mysore Sathyendra¹, Rama Kumar Pasumarthi¹, Dheeraj Rajagopal¹, Ruslan Salakhutdinov¹•Institutions (1)

Carnegie Mellon University¹

22 Jun 2017-arXiv: Learning

TL;DR: This paper proposes an end-to-end trainable neural architecture for task-oriented language grounding in 3D environments which assumes no prior linguistic or perceptual knowledge and requires only raw pixels from the environment and the natural language instruction as input.

Abstract: To perform tasks specified by natural language instructions, autonomous agents need to extract semantically meaningful representations of language and map it to visual elements and actions in the environment. This problem is called task-oriented language grounding. We propose an end-to-end trainable neural architecture for task-oriented language grounding in 3D environments which assumes no prior linguistic or perceptual knowledge and requires only raw pixels from the environment and the natural language instruction as input. The proposed model combines the image and text representations using a Gated-Attention mechanism and learns a policy to execute the natural language instruction using standard reinforcement and imitation learning methods. We show the effectiveness of the proposed model on unseen instructions as well as unseen maps, both quantitatively and qualitatively. We also introduce a novel environment based on a 3D game engine to simulate the challenges of task-oriented language grounding over a rich set of instructions and environment states.

Posted Content•

Geometry of Optimization and Implicit Regularization in Deep Learning

Behnam Neyshabur, Ryota Tomioka, Ruslan Salakhutdinov, Nathan Srebro

08 May 2017-arXiv: Learning

TL;DR: This work argues that the optimization plays a crucial role in generalization of deep learning models through implicit regularization, and demonstrates how changing the empirical optimization procedure can improve generalization, even if actual optimization quality is not affected.

Abstract: We argue that the optimization plays a crucial role in generalization of deep learning models through implicit regularization. We do this by demonstrating that generalization ability is not controlled by network size but rather by some other implicit control. We then demonstrate how changing the empirical optimization procedure can improve generalization, even if actual optimization quality is not affected. We do so by studying the geometry of the parameter space of deep networks, and devising an optimization algorithm attuned to this geometry.

Proceedings Article•

Gated-Attention Architectures for Task-Oriented Language Grounding

Devendra Singh Chaplot¹, Kanthashree Mysore Sathyendra¹, Rama Kumar Pasumarthi¹, Dheeraj Rajagopal¹, Ruslan Salakhutdinov¹•Institutions (1)

Carnegie Mellon University¹

22 Jun 2017

TL;DR: An end-to-end trainable neural architecture for task-oriented language grounding in 3D environments which assumes no prior linguistic or perceptual knowledge and requires only raw pixels from the environment and the natural language instruction as input.

Abstract: To perform tasks specified by natural language instructions, autonomous agents need to extract semantically meaningful representations of language and map it to visual elements and actions in the environment. This problem is called task-oriented language grounding. We propose an end-to-end trainable neural architecture for task-oriented language grounding in 3D environments which assumes no prior linguistic or perceptual knowledge and requires only raw pixels from the environment and the natural language instruction as input. The proposed model combines the image and text representations using a Gated-Attention mechanism and learns a policy to execute the natural language instruction using standard reinforcement and imitation learning methods. We show the effectiveness of the proposed model on unseen instructions as well as unseen maps, both quantitatively and qualitatively. We also introduce a novel environment based on a 3D game engine to simulate the challenges of task-oriented language grounding over a rich set of instructions and environment states.

Proceedings Article•DOI•

Semi-Supervised QA with Generative Domain-Adaptive Nets

Zhilin Yang¹, Junjie Hu¹, Ruslan Salakhutdinov², William W. Cohen¹•Institutions (2)

Carnegie Mellon University¹, University of Toronto²

07 Feb 2017

TL;DR: In this paper, a generative model is trained to generate questions based on the unlabeled text, the model-generated questions are combined with human-generated questions for training question answering models, and a novel domain adaptation algorithm is developed to alleviate the discrepancy between the model-generated data distribution and the human-generated data distribution.

Abstract: We study the problem of semi-supervised question answering—utilizing unlabeled text to boost the performance of question answering models. We propose a novel training framework, the Generative Domain-Adaptive Nets. In this framework, we train a generative model to generate questions based on the unlabeled text, and combine model-generated questions with human-generated questions for training question answering models. We develop novel domain adaptation algorithms, based on reinforcement learning, to alleviate the discrepancy between the model-generated data distribution and the human-generated data distribution. Experiments show that our proposed framework obtains substantial improvement from unlabeled text.

Posted Content•

Controllable Text Generation

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, Eric P. Xing

02 Mar 2017

TL;DR: A new neural generative model is proposed which combines variational auto-encoders and holistic attribute discriminators for effective imposition of semantic structures in generic generation and manipulation of text.

Abstract: Generic generation and manipulation of text is challenging and has limited success compared to recent deep generative modeling in visual domain. This paper aims at generating plausible natural language sentences, whose attributes are dynamically controlled by learning disentangled latent representations with designated semantics. We propose a new neural generative model which combines variational auto-encoders and holistic attribute discriminators for effective imposition of semantic structures. With differentiable approximation to discrete text samples, explicit constraints on independent attribute controls, and efficient collaborative learning of generator and discriminators, our model learns highly interpretable representations from even only word annotations, and produces realistic sentences with desired attributes. Quantitative evaluation validates the accuracy of sentence and attribute generation.

Posted Content•

On Unifying Deep Generative Models

Zhiting Hu¹, Zichao Yang¹, Ruslan Salakhutdinov¹, Eric P. Xing¹•Institutions (1)

Carnegie Mellon University¹

02 Jun 2017-arXiv: Learning

TL;DR: It is shown that GANs and VAEs involve minimizing KL divergences of respective posterior and inference distributions with opposite directions, extending the two learning phases of classic wake-sleep algorithm, respectively.

Abstract: Deep generative models have achieved impressive success in recent years. Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), as emerging families for generative model learning, have largely been considered as two distinct paradigms and received extensive independent studies respectively. This paper aims to establish formal connections between GANs and VAEs through a new formulation of them. We interpret sample generation in GANs as performing posterior inference, and show that GANs and VAEs involve minimizing KL divergences of respective posterior and inference distributions with opposite directions, extending the two learning phases of classic wake-sleep algorithm, respectively. The unified view provides a powerful tool to analyze a diverse set of existing model variants, and enables the transfer of techniques across research lines in a principled way. For example, we apply the importance weighting method from the VAE literature for improved GAN learning, and enhance VAEs with an adversarial mechanism that leverages generated samples. Experiments show generality and effectiveness of the transferred techniques.

Posted Content•

Transfer Learning for Sequence Tagging with Hierarchical Recurrent Networks

Zhilin Yang¹, Ruslan Salakhutdinov¹, William W. Cohen¹•Institutions (1)

Carnegie Mellon University¹

18 Mar 2017-arXiv: Computation and Language

TL;DR: In this paper, the authors explore the problem of transfer learning for neural sequence taggers, where a source task with plentiful annotations (e.g., POS tagging on Penn Treebank) is used to improve performance on a target task with fewer available annotations.

Abstract: Recent papers have shown that neural networks obtain state-of-the-art performance on several different sequence tagging tasks. One appealing property of such systems is their generality, as excellent performance can be achieved with a unified architecture and without task-specific feature engineering. However, it is unclear if such systems can be used for tasks without large amounts of training data. In this paper we explore the problem of transfer learning for neural sequence taggers, where a source task with plentiful annotations (e.g., POS tagging on Penn Treebank) is used to improve performance on a target task with fewer available annotations (e.g., POS tagging for microblogs). We examine the effects of transfer learning for deep hierarchical recurrent networks across domains, applications, and languages, and show that significant improvement can often be obtained. These improvements lead to improvements over the current state-of-the-art on several well-studied tasks.

Proceedings Article•

Neural Map: Structured Memory for Deep Reinforcement Learning

Emilio Parisotto¹, Ruslan Salakhutdinov¹•Institutions (1)

University of Toronto¹

27 Feb 2017

TL;DR: The Neural Map uses a spatially structured 2D memory image to learn to store arbitrary information about the environment over long time lags, and it is capable of generalizing to environments that were not seen during training.

Abstract: A critical component to enabling intelligent reasoning in partially observable environments is memory. Despite this importance, Deep Reinforcement Learning (DRL) agents have so far used relatively simple memory architectures, with the main methods to overcome partial observability being either a temporal convolution over the past k frames or an LSTM layer. More recent work (Oh et al., 2016) has gone beyond these architectures by using memory networks which can allow more sophisticated addressing schemes over the past k frames. But even these architectures are unsatisfactory because they are limited to only remembering information from the last k frames. In this paper, we develop a memory system with an adaptable write operator that is customized to the sorts of 3D environments that DRL agents typically interact with. This architecture, called the Neural Map, uses a spatially structured 2D memory image to learn to store arbitrary information about the environment over long time lags. We demonstrate empirically that the Neural Map surpasses previous DRL memories on a set of challenging 2D and 3D maze environments and show that it is capable of generalizing to environments that were not seen during training.

Posted Content•

Improved Variational Autoencoders for Text Modeling using Dilated Convolutions

[...]

Zichao Yang¹, Zhiting Hu¹, Ruslan Salakhutdinov¹, Taylor Berg-Kirkpatrick¹•Institutions (1)

Carnegie Mellon University¹

27 Feb 2017-arXiv: Neural and Evolutionary Computing

TL;DR: It is shown that with the right decoder, VAE can outperform LSTM language models, and perplexity gains are demonstrated on two datasets, representing the first positive experimental result on the use of VAE for generative modeling of text.

Abstract: Recent work on generative modeling of text has found that variational auto-encoders (VAE) incorporating LSTM decoders perform worse than simpler LSTM language models (Bowman et al., 2015). This negative result is so far poorly understood, but has been attributed to the propensity of LSTM decoders to ignore conditioning information from the encoder. In this paper, we experiment with a new type of decoder for VAE: a dilated CNN. By changing the decoder's dilation architecture, we control the effective context from previously generated words. In experiments, we find that there is a trade-off between the contextual capacity of the decoder and the amount of encoding information used. We show that with the right decoder, VAE can outperform LSTM language models. We demonstrate perplexity gains on two datasets, representing the first positive experimental result on the use of VAE for generative modeling of text. Further, we conduct an in-depth investigation of the use of VAE (with our new decoding architecture) for semi-supervised and unsupervised labeling tasks, demonstrating gains over several strong baselines.
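
As a rough illustration of how dilation controls a decoder's effective context, the sketch below computes how many positions the top layer of a stack of causal dilated 1-D convolutions can see. The function name, the kernel size of 3, and the dilation schedules are assumptions chosen for illustration; they are not the configurations reported in the paper.

```python
def effective_context(kernel_size, dilations):
    """Number of positions (current one included) visible to the top layer
    of a stack of causal dilated 1-D convolutions, one layer per dilation."""
    # Each layer with dilation d widens the window by (kernel_size - 1) * d.
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical dilation schedules, from a small to a large context.
small = effective_context(3, [1, 2, 4])          # 1 + 2 * 7  = 15
large = effective_context(3, [1, 2, 4, 8, 16])   # 1 + 2 * 31 = 63
print(small, large)
```

Under these assumptions, adding layers with larger dilations grows the visible context quickly, which is the knob the abstract describes trading against the amount of information the decoder takes from the encoder.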

Posted Content•

Linguistic Knowledge as Memory for Recurrent Neural Networks

[...]

Bhuwan Dhingra, Zhilin Yang, William W. Cohen, Ruslan Salakhutdinov

07 Mar 2017-arXiv: Computation and Language

TL;DR: This work introduces a model that encodes graphs as explicit memory in recurrent neural networks, and uses it to model coreference relations in text and achieve new state-of-the-art results on all considered benchmarks, including CNN, bAbI, and LAMBADA.

Abstract: Training recurrent neural networks to model long-term dependencies is difficult. Hence, we propose to use external linguistic knowledge as an explicit signal to inform the model which memories it should utilize. Specifically, external knowledge is used to augment a sequence with typed edges between arbitrarily distant elements, and the resulting graph is decomposed into directed acyclic subgraphs. We introduce a model that encodes such graphs as explicit memory in recurrent neural networks, and use it to model coreference relations in text. We apply our model to several text comprehension tasks and achieve new state-of-the-art results on all considered benchmarks, including CNN, bAbI, and LAMBADA. On the bAbI QA tasks, our model solves 15 out of the 20 tasks with only 1000 training examples per task. Analysis of the learned representations further demonstrates the ability of our model to encode fine-grained entity information across a document.

Posted Content•

Improving One-Shot Learning through Fusing Side Information

[...]

Yao-Hung Hubert Tsai, Ruslan Salakhutdinov

23 Oct 2017-arXiv: Learning

TL;DR: This paper introduces two statistical approaches for fusing side information into data representation learning to improve one-shot learning and introduces an attention mechanism to efficiently treat examples belonging to the 'lots-of-examples' classes as quasi-samples (additional training samples) for 'one-example' classes.

Abstract: Deep Neural Networks (DNNs) often struggle with one-shot learning, where we have only one or a few labeled training examples per category. In this paper, we argue that by using side information, we may compensate for the missing information across classes. We introduce two statistical approaches for fusing side information into data representation learning to improve one-shot learning. First, we propose to enforce the statistical dependency between data representations and multiple types of side information. Second, we introduce an attention mechanism to efficiently treat examples belonging to the 'lots-of-examples' classes as quasi-samples (additional training samples) for 'one-example' classes. We empirically show that our learning architecture improves over traditional softmax regression networks as well as state-of-the-art attentional regression networks on one-shot recognition tasks.

Posted Content•

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model

[...]

Zhilin Yang¹, Zihang Dai¹, Ruslan Salakhutdinov², William W. Cohen³•Institutions (3)

Carnegie Mellon University¹, Apple Inc.², Google³

10 Nov 2017-arXiv: Computation and Language

TL;DR: It is shown that the expressiveness of Softmax-based models (including the majority of neural language models) is limited by a Softmax bottleneck, and a simple and effective method is proposed to address this issue.

Abstract: We formulate language modeling as a matrix factorization problem, and show that the expressiveness of Softmax-based models (including the majority of neural language models) is limited by a Softmax bottleneck. Given that natural language is highly context-dependent, this further implies that in practice Softmax with distributed word embeddings does not have enough capacity to model natural language. We propose a simple and effective method to address this issue, and improve the state-of-the-art perplexities on Penn Treebank and WikiText-2 to 47.69 and 40.68 respectively. The proposed method also excels on the large-scale 1B Word dataset, outperforming the baseline by over 5.6 points in perplexity.
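
The abstract does not name the proposed method. One way to obtain a higher-rank output layer, which the paper itself calls a Mixture of Softmaxes, is to average several softmax distributions using context-dependent weights; the logarithm of such a mixture is no longer a linear function of a single hidden state. The sketch below is a minimal pure-Python illustration with made-up logits. The function names and all numbers are hypothetical, and in a real model the logits would come from learned projections of the RNN state.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_of_softmaxes(component_logits, mixture_logits):
    """Weighted average of K softmax distributions over the same vocabulary.

    component_logits: K lists of vocabulary logits, one list per component.
    mixture_logits:   K unnormalized mixture weights for the current context.
    """
    weights = softmax(mixture_logits)
    components = [softmax(logits) for logits in component_logits]
    vocab_size = len(components[0])
    return [
        sum(w * comp[j] for w, comp in zip(weights, components))
        for j in range(vocab_size)
    ]


# Made-up example: 2 components over a 3-word vocabulary, equal weights.
probs = mixture_of_softmaxes([[2.0, 0.0, -1.0], [-1.0, 0.5, 3.0]], [0.0, 0.0])
print(probs, sum(probs))
```

Because each component is a valid distribution and the weights sum to one, the result is again a valid distribution over the vocabulary.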

Posted Content•

A Generic Approach for Escaping Saddle points

[...]

Sashank J. Reddi¹, Manzil Zaheer¹, Suvrit Sra², Barnabás Póczos¹, Francis Bach, Ruslan Salakhutdinov¹, Alexander J. Smola³•Institutions (3)

Carnegie Mellon University¹, Massachusetts Institute of Technology², Amazon.com³

05 Sep 2017-arXiv: Learning

TL;DR: In this article, the authors propose a framework that alternates between a first-order and a second-order subroutine, using the latter only close to saddle points, and yields convergence results competitive with the state of the art.

Abstract: A central challenge to using first-order methods for optimizing nonconvex problems is the presence of saddle points. First-order methods often get stuck at saddle points, greatly deteriorating their performance. Typically, to escape from saddles one has to use second-order methods. However, most works on second-order methods rely extensively on expensive Hessian-based computations, making them impractical in large-scale settings. To tackle this challenge, we introduce a generic framework that minimizes Hessian-based computations while at the same time provably converging to second-order critical points. Our framework carefully alternates between a first-order and a second-order subroutine, using the latter only close to saddle points, and yields convergence results competitive with the state of the art. Empirical results suggest that our strategy also performs well in practice.
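
The alternation described in the abstract can be pictured with a toy example. The sketch below is an illustration only, not the paper's subroutines: it assumes the hand-picked function f(x, y) = 0.5 x^2 + 0.25 y^4 - 0.5 y^2, which has a saddle at the origin and minima at y = ±1, uses plain gradient descent as the first-order step, and consults the (diagonal) Hessian only when the gradient is nearly zero. All names, the step size, and the size of the escape step are hypothetical.

```python
import math


def grad(p):
    """Gradient of the toy function f(x, y) = 0.5*x^2 + 0.25*y^4 - 0.5*y^2."""
    x, y = p
    return (x, y ** 3 - y)


def hess_diag(p):
    """The Hessian of this toy function is diagonal: diag(1, 3*y^2 - 1)."""
    _, y = p
    return (1.0, 3.0 * y * y - 1.0)


def escape_saddle(p, lr=0.1, tol=1e-6, kick=0.1, max_iter=10000):
    p = tuple(p)
    for _ in range(max_iter):
        g = grad(p)
        if math.hypot(*g) > tol:
            # First-order subroutine: an ordinary gradient step.
            p = (p[0] - lr * g[0], p[1] - lr * g[1])
            continue
        # Gradient is tiny: only now pay for second-order information.
        h = hess_diag(p)
        i = 0 if h[0] < h[1] else 1  # axis with the smallest curvature
        if h[i] >= 0.0:
            return p  # no negative curvature: second-order critical point
        # Second-order subroutine: move along the negative-curvature axis.
        step = [0.0, 0.0]
        step[i] = kick
        p = (p[0] + step[0], p[1] + step[1])
    return p


# Started on the x-axis, plain gradient descent would stall at the saddle (0, 0).
final = escape_saddle((1.0, 0.0))
print(final)
```

In this toy run the Hessian is inspected only at the two points where the gradient vanishes (the saddle and the final minimum), which mirrors the idea of using the second-order subroutine only close to saddle points.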

Proceedings Article•

A Generic Approach for Escaping Saddle points

[...]

Sashank J. Reddi¹, Manzil Zaheer¹, Suvrit Sra², Barnabás Póczos¹, Francis Bach, Ruslan Salakhutdinov¹, Alexander J. Smola³•Institutions (3)

Carnegie Mellon University¹, Massachusetts Institute of Technology², Amazon.com³

05 Sep 2017

TL;DR: A generic framework is introduced that minimizes Hessian-based computations while at the same time provably converging to second-order critical points, and yields convergence results competitive with the state of the art.

Abstract: A central challenge to using first-order methods for optimizing nonconvex problems is the presence of saddle points. First-order methods often get stuck at saddle points, greatly deteriorating their performance. Typically, to escape from saddles one has to use second-order methods. However, most works on second-order methods rely extensively on expensive Hessian-based computations, making them impractical in large-scale settings. To tackle this challenge, we introduce a generic framework that minimizes Hessian-based computations while at the same time provably converging to second-order critical points. Our framework carefully alternates between a first-order and a second-order subroutine, using the latter only close to saddle points, and yields convergence results competitive with the state of the art. Empirical results suggest that our strategy also performs well in practice.

Posted Content•

A Comparative Study of Word Embeddings for Reading Comprehension

[...]

Bhuwan Dhingra, Hanxiao Liu, Ruslan Salakhutdinov, William W. Cohen

02 Mar 2017-arXiv: Computation and Language

TL;DR: It is shown that seemingly minor choices made on the use of pre-trained word embeddings, and the representation of out-of-vocabulary tokens at test time, can turn out to have a larger impact than architectural choices on the final performance.

Abstract: The focus of past machine learning research for Reading Comprehension tasks has been primarily on the design of novel deep learning architectures. Here we show that seemingly minor choices made on (1) the use of pre-trained word embeddings, and (2) the representation of out-of-vocabulary tokens at test time, can turn out to have a larger impact than architectural choices on the final performance. We systematically explore several options for these choices, and provide recommendations to researchers working in this area.

Proceedings Article•

On Unifying Deep Generative Models

[...]

Zhiting Hu¹, Zichao Yang¹, Ruslan Salakhutdinov¹, Eric P. Xing¹•Institutions (1)

Carnegie Mellon University¹

02 Jun 2017

TL;DR: In this paper, a unified view of generative adversarial networks (GANs) and variational autoencoders (VAEs) is presented, which enables techniques to be transferred across research lines in a principled way.

Abstract: Deep generative models have achieved impressive success in recent years. Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), as emerging families for generative model learning, have largely been considered two distinct paradigms and have each been studied extensively and independently. This paper aims to establish formal connections between GANs and VAEs through a new formulation of them. We interpret sample generation in GANs as performing posterior inference, and show that GANs and VAEs involve minimizing KL divergences of respective posterior and inference distributions with opposite directions, extending the two learning phases of the classic wake-sleep algorithm, respectively. The unified view provides a powerful tool to analyze a diverse set of existing model variants, and enables techniques to be transferred across research lines in a principled way. For example, we apply the importance weighting method from the VAE literature for improved GAN learning, and enhance VAEs with an adversarial mechanism that leverages generated samples. Experiments show the generality and effectiveness of the transferred techniques.

Posted Content•

Normalized Gradient with Adaptive Stepsize Method for Deep Neural Network Training

[...]

Adams Wei Yu, Qihang Lin, Ruslan Salakhutdinov, Jaime G. Carbonell

16 Jul 2017-arXiv: Learning

TL;DR: The results indicate that the normalized gradient with adaptive step size can help accelerate the training of neural networks, and significant speedup can be observed if the networks are deep or the dependencies are long.

Abstract: In this paper, we propose a generic and simple algorithmic framework for first-order optimization. The framework essentially contains two consecutive steps in each iteration: 1) computing and normalizing the mini-batch stochastic gradient; 2) selecting an adaptive step size to update the decision variable (parameter) towards the negative of the normalized gradient. We show that the proposed approach, when customized to popular adaptive step size methods such as AdaGrad, can enjoy a sublinear convergence rate if the objective is convex. We also conduct extensive empirical studies on various non-convex neural network optimization problems, including multilayer perceptrons, convolutional neural networks, and recurrent neural networks. The results indicate that the normalized gradient with adaptive step size can help accelerate the training of neural networks. In particular, significant speedup can be observed if the networks are deep or the dependencies are long.
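
The two steps in the abstract can be sketched on a toy problem. The code below is a minimal illustration under stated assumptions, not the authors' implementation: it normalizes the whole gradient by its L2 norm (the paper's exact normalization scheme is not given in the abstract), uses a full deterministic gradient in place of a mini-batch estimate, and applies an AdaGrad-style step size to the normalized gradient. The function names, learning rate, and quadratic objective are hypothetical.

```python
import math


def normalized_adagrad(grad_fn, x0, lr=0.5, steps=200, eps=1e-8):
    """AdaGrad-style updates applied to the L2-normalized gradient."""
    x = list(x0)
    accum = [0.0] * len(x)  # running sum of squared normalized gradients
    for _ in range(steps):
        g = grad_fn(x)
        norm = math.sqrt(sum(gi * gi for gi in g))
        if norm < eps:  # gradient is numerically zero: nothing to normalize
            break
        g_hat = [gi / norm for gi in g]  # step 1: normalize the gradient
        for i, gi in enumerate(g_hat):   # step 2: adaptive step size
            accum[i] += gi * gi
            x[i] -= lr * gi / (math.sqrt(accum[i]) + eps)
    return x


# Toy convex objective f(x) = 0.5 * (x0^2 + 10 * x1^2), chosen for illustration.
def quad_loss(x):
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)


def quad_grad(x):
    return [x[0], 10.0 * x[1]]


start = [3.0, -2.0]
x_final = normalized_adagrad(quad_grad, start)
print(quad_loss(start), quad_loss(x_final))
```

Because the gradient is rescaled to unit length before the step size is chosen, the update magnitude in this sketch depends on the gradient's direction and the accumulated history rather than on its raw scale.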

Showing papers by "Ruslan Salakhutdinov" published in 2017