Showing papers by "Zhilin Yang published in 2017"

PDF

Open Access

Proceedings Article•

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model

[...]

Zhilin Yang¹, Zihang Dai¹, Ruslan Salakhutdinov², William W. Cohen³•Institutions (3)

Carnegie Mellon University¹, Apple Inc.², Google³

10 Nov 2017

TL;DR: The authors formulate language modeling as a matrix factorization problem, and show that the expressiveness of softmax-based models is limited by a Softmax bottleneck, which further implies that in practice Softmax with distributed word embeddings does not have enough capacity to model natural language.

...read moreread less

Abstract: We formulate language modeling as a matrix factorization problem, and show that the expressiveness of Softmax-based models (including the majority of neural language models) is limited by a Softmax bottleneck. Given that natural language is highly context-dependent, this further implies that in practice Softmax with distributed word embeddings does not have enough capacity to model natural language. We propose a simple and effective method to address this issue, and improve the state-of-the-art perplexities on Penn Treebank and WikiText-2 to 47.69 and 40.68 respectively. The proposed method also excels on the large-scale 1B Word dataset, outperforming the baseline by over 5.6 points in perplexity.

...read moreread less

326 citations

Proceedings Article•

Good Semi-supervised Learning That Requires a Bad GAN

[...]

Zihang Dai¹, Zhilin Yang², Fan Yang², William W. Cohen², Ruslan Salakhutdinov² - Show less +1 more•Institutions (2)

Baidu¹, Carnegie Mellon University²

27 May 2017

TL;DR: Theoretically, it is shown that given the discriminator objective, good semisupervised learning indeed requires a bad generator, and a novel formulation based on the analysis that substantially improves over feature matching GANs is derived, obtaining state-of-the-art results on multiple benchmark datasets.

...read moreread less

Abstract: Semi-supervised learning methods based on generative adversarial networks (GANs) obtained strong empirical results, but it is not clear 1) how the discriminator benefits from joint training with a generator, and 2) why good semi-supervised classification performance and a good generator cannot be obtained at the same time. Theoretically we show that given the discriminator objective, good semi-supervised learning indeed requires a bad generator, and propose the definition of a preferred generator. Empirically, we derive a novel formulation based on our analysis that substantially improves over feature matching GANs, obtaining state-of-the-art results on multiple benchmark datasets.

...read moreread less

291 citations

Posted Content•

Good Semi-supervised Learning that Requires a Bad GAN

[...]

Zihang Dai¹, Zhilin Yang², Fan Yang², William W. Cohen², Ruslan Salakhutdinov² - Show less +1 more•Institutions (2)

Baidu¹, Carnegie Mellon University²

27 May 2017-arXiv: Learning

TL;DR: This paper showed that good semi-supervised classification performance and a good generator cannot be obtained at the same time, and proposed the definition of a preferred generator, which substantially improves over feature matching GANs.

...read moreread less

Abstract: Semi-supervised learning methods based on generative adversarial networks (GANs) obtained strong empirical results, but it is not clear 1) how the discriminator benefits from joint training with a generator, and 2) why good semi-supervised classification performance and a good generator cannot be obtained at the same time Theoretically, we show that given the discriminator objective, good semisupervised learning indeed requires a bad generator, and propose the definition of a preferred generator Empirically, we derive a novel formulation based on our analysis that substantially improves over feature matching GANs, obtaining state-of-the-art results on multiple benchmark datasets

...read moreread less

265 citations

Proceedings Article•

Differentiable learning of logical rules for knowledge base reasoning

[...]

Fan Yang¹, Zhilin Yang¹, William W. Cohen¹•Institutions (1)

Carnegie Mellon University¹

04 Dec 2017

TL;DR: A framework, Neural Logic Programming, is proposed that combines the parameter and structure learning of first-order logical rules in an end-to-end differentiable model and outperforms prior work on multiple knowledge base benchmark datasets, including Freebase and WikiMovies.

...read moreread less

Abstract: We study the problem of learning probabilistic first-order logical rules for knowledge base reasoning. This learning problem is difficult because it requires learning the parameters in a continuous space as well as the structure in a discrete space. We propose a framework, Neural Logic Programming, that combines the parameter and structure learning of first-order logical rules in an end-to-end differentiable model. This approach is inspired by a recently-developed differentiable logic called TensorLog [5], where inference tasks can be compiled into sequences of differentiable operations. We design a neural controller system that learns to compose these operations. Empirically, our method outperforms prior work on multiple knowledge base benchmark datasets, including Freebase and WikiMovies.

...read moreread less

255 citations

Proceedings Article•DOI•

Gated-Attention Readers for Text Comprehension

[...]

Bhuwan Dhingra¹, Hanxiao Liu², Zhilin Yang², William W. Cohen², Ruslan Salakhutdinov² - Show less +1 more•Institutions (2)

Microsoft¹, Carnegie Mellon University²

01 Jul 2017

TL;DR: Gated-Attention (GA) Reader as mentioned in this paper integrates a multi-hop architecture with a novel attention mechanism, which is based on multiplicative interactions between the query embedding and the intermediate states of a recurrent neural network document reader.

...read moreread less

Abstract: In this paper we study the problem of answering cloze-style questions over documents. Our model, the Gated-Attention (GA) Reader, integrates a multi-hop architecture with a novel attention mechanism, which is based on multiplicative interactions between the query embedding and the intermediate states of a recurrent neural network document reader. This enables the reader to build query-specific representations of tokens in the document for accurate answer selection. The GA Reader obtains state-of-the-art results on three benchmarks for this task–the CNN & Daily Mail news stories and the Who Did What dataset. The effectiveness of multiplicative interaction is demonstrated by an ablation study, and by comparing to alternative compositional operators for implementing the gated-attention.

...read moreread less

231 citations

Proceedings Article•DOI•

Semi-Supervised QA with Generative Domain-Adaptive Nets

[...]

Zhilin Yang¹, Junjie Hu¹, Ruslan Salakhutdinov², William W. Cohen¹•Institutions (2)

Carnegie Mellon University¹, University of Toronto²

07 Feb 2017

TL;DR: In this paper, a generative model is trained to generate questions based on the unlabeled text and combine model-generated questions with human generated questions for training question answering models, and a novel domain adaptation algorithm is developed to alleviate the discrepancy between the model generated data distribution and the human-generated data distribution.

...read moreread less

Abstract: We study the problem of semi-supervised question answering—utilizing unlabeled text to boost the performance of question answering models. We propose a novel training framework, the Generative Domain-Adaptive Nets. In this framework, we train a generative model to generate questions based on the unlabeled text, and combine model-generated questions with human-generated questions for training question answering models. We develop novel domain adaptation algorithms, based on reinforcement learning, to alleviate the discrepancy between the model-generated data distribution and the human-generated data distribution. Experiments show that our proposed framework obtains substantial improvement from unlabeled text.

...read moreread less

115 citations

Posted Content•

Differentiable Learning of Logical Rules for Knowledge Base Reasoning

[...]

Fan Yang¹, Zhilin Yang¹, William W. Cohen¹•Institutions (1)

Carnegie Mellon University¹

27 Feb 2017-arXiv: Artificial Intelligence

TL;DR: This paper propose a neural logic programming framework that combines the parameter and structure learning of first-order logical rules in an end-to-end differentiable model, inspired by a recently-developed differentiable logic called TensorLog.

...read moreread less

Abstract: We study the problem of learning probabilistic first-order logical rules for knowledge base reasoning. This learning problem is difficult because it requires learning the parameters in a continuous space as well as the structure in a discrete space. We propose a framework, Neural Logic Programming, that combines the parameter and structure learning of first-order logical rules in an end-to-end differentiable model. This approach is inspired by a recently-developed differentiable logic called TensorLog, where inference tasks can be compiled into sequences of differentiable operations. We design a neural controller system that learns to compose these operations. Empirically, our method outperforms prior work on multiple knowledge base benchmark datasets, including Freebase and WikiMovies.

...read moreread less

111 citations

Posted Content•

Transfer Learning for Sequence Tagging with Hierarchical Recurrent Networks

[...]

Zhilin Yang¹, Ruslan Salakhutdinov¹, William W. Cohen¹•Institutions (1)

Carnegie Mellon University¹

18 Mar 2017-arXiv: Computation and Language

TL;DR: In this paper, the authors explore the problem of transfer learning for neural sequence taggers, where a source task with plentiful annotations (e.g., POS tagging on Penn Treebank) is used to improve performance on a target task with fewer available annotations.

...read moreread less

Abstract: Recent papers have shown that neural networks obtain state-of-the-art performance on several different sequence tagging tasks. One appealing property of such systems is their generality, as excellent performance can be achieved with a unified architecture and without task-specific feature engineering. However, it is unclear if such systems can be used for tasks without large amounts of training data. In this paper we explore the problem of transfer learning for neural sequence taggers, where a source task with plentiful annotations (e.g., POS tagging on Penn Treebank) is used to improve performance on a target task with fewer available annotations (e.g., POS tagging for microblogs). We examine the effects of transfer learning for deep hierarchical recurrent networks across domains, applications, and languages, and show that significant improvement can often be obtained. These improvements lead to improvements over the current state-of-the-art on several well-studied tasks.

...read moreread less

88 citations

Posted Content•

Linguistic Knowledge as Memory for Recurrent Neural Networks

[...]

Bhuwan Dhingra, Zhilin Yang, William W. Cohen, Ruslan Salakhutdinov

07 Mar 2017-arXiv: Computation and Language

TL;DR: This work introduces a model that encodes graphs as explicit memory in recurrent neural networks, and uses it to model coreference relations in text and achieve new state-of-the-art results on all considered benchmarks, including CNN, bAbi, and LAMBADA.

...read moreread less

Abstract: Training recurrent neural networks to model long term dependencies is difficult. Hence, we propose to use external linguistic knowledge as an explicit signal to inform the model which memories it should utilize. Specifically, external knowledge is used to augment a sequence with typed edges between arbitrarily distant elements, and the resulting graph is decomposed into directed acyclic subgraphs. We introduce a model that encodes such graphs as explicit memory in recurrent neural networks, and use it to model coreference relations in text. We apply our model to several text comprehension tasks and achieve new state-of-the-art results on all considered benchmarks, including CNN, bAbi, and LAMBADA. On the bAbi QA tasks, our model solves 15 out of the 20 tasks with only 1000 training examples per task. Analysis of the learned representations further demonstrates the ability of our model to encode fine-grained entity information across a document.

...read moreread less

44 citations

Posted Content•

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model

[...]

Zhilin Yang¹, Zihang Dai¹, Ruslan Salakhutdinov², William W. Cohen³•Institutions (3)

Carnegie Mellon University¹, Apple Inc.², Google³

10 Nov 2017-arXiv: Computation and Language

TL;DR: It is shown that the expressiveness of Softmax-based models (including the majority of neural language models) is limited by a Softmax bottleneck, and a simple and effective method is proposed to address this issue.

...read moreread less

38 citations

Posted Content•

Differentiable Learning of Logical Rules for Knowledge Base Completion.

[...]

Fan Yang, Zhilin Yang, William W. Cohen

27 Feb 2017

TL;DR: A neural controller system which learns how to sequentially compose the these primitive differentiable operations to solve reasoning tasks, and in particular, to perform knowledge base completion is described.

...read moreread less

Abstract: Learned models composed of probabilistic logical rules are useful for many tasks, such as knowledge base completion. Unfortunately this learning problem is difficult, since determining the structure of the theory normally requires solving a discrete optimization problem. In this paper, we propose an alternative approach: a completely differentiable model for learning sets of first-order rules. The approach is inspired by a recently-developed differentiable logic, i.e. a subset of first-order logic for which inference tasks can be compiled into sequences of differentiable operations. Here we describe a neural controller system which learns how to sequentially compose the these primitive differentiable operations to solve reasoning tasks, and in particular, to perform knowledge base completion. The long-term goal of this work is to develop integrated, end-to-end systems that can learn to perform high-level logical reasoning as well as lower-level perceptual tasks.

...read moreread less

Posted Content•

Semi-Supervised QA with Generative Domain-Adaptive Nets

[...]

Zhilin Yang¹, Junjie Hu¹, Ruslan Salakhutdinov², William W. Cohen¹•Institutions (2)

Carnegie Mellon University¹, University of Toronto²

07 Feb 2017-arXiv: Computation and Language

TL;DR: A novel training framework for semi-supervised question answering is proposed, the Generative Domain-Adaptive Nets, which combines a generative model to generate questions based on the unlabeled text, and combines model- generated questions with human-generated questions for training question answering models.

...read moreread less

Abstract: We study the problem of semi-supervised question answering----utilizing unlabeled text to boost the performance of question answering models. We propose a novel training framework, the Generative Domain-Adaptive Nets. In this framework, we train a generative model to generate questions based on the unlabeled text, and combine model-generated questions with human-generated questions for training question answering models. We develop novel domain adaptation algorithms, based on reinforcement learning, to alleviate the discrepancy between the model-generated data distribution and the human-generated data distribution. Experiments show that our proposed framework obtains substantial improvement from unlabeled text.

...read moreread less

Posted Content•

A Probabilistic Framework for Location Inference from Social Media.

[...]

Yujie Qian¹, Jie Tang², Zhilin Yang³, Binxuan Huang³, Wei Wei³, Kathleen M. Carley³ - Show less +2 more•Institutions (3)

Massachusetts Institute of Technology¹, Tsinghua University², Carnegie Mellon University³

23 Feb 2017-arXiv: Artificial Intelligence

TL;DR: A novel probabilistic model based on factor graphs for location inference that offers several unique advantages for this task is presented and can substantially improve the inference accuracy over that of several state-of-the-art methods.

...read moreread less

Abstract: We study the extent to which we can infer users' geographical locations from social media. Location inference from social media can benefit many applications, such as disaster management, targeted advertising, and news content tailoring. The challenges, however, lie in the limited amount of labeled data and the large scale of social networks. In this paper, we formalize the problem of inferring location from social media into a semi-supervised factor graph model (SSFGM). The model provides a probabilistic framework in which various sources of information (e.g., content and social network) can be combined together. We design a two-layer neural network to learn feature representations, and incorporate the learned latent features into SSFGM. To deal with the large-scale problem, we propose a Two-Chain Sampling (TCS) algorithm to learn SSFGM. The algorithm achieves a good trade-off between accuracy and efficiency. Experiments on Twitter and Weibo show that the proposed TCS algorithm for SSFGM can substantially improve the inference accuracy over several state-of-the-art methods. More importantly, TCS achieves over 100x speedup comparing with traditional propagation-based methods (e.g., loopy belief propagation).

...read moreread less

Posted Content•

Mastering the Dungeon: Grounded Language Learning by Mechanical Turker Descent

[...]

Zhilin Yang¹, Saizheng Zhang², Jack Urbanek³, Will Feng, Alexander H. Miller³, Arthur Szlam³, Douwe Kiela³, Jason Weston⁴ - Show less +4 more•Institutions (4)

Carnegie Mellon University¹, Université de Montréal², Facebook³, New York University⁴

21 Nov 2017-arXiv: Computation and Language

TL;DR: In this paper, the authors propose an interactive learning procedure called Mechanical Turker Descent (MTD) and use it to train agents to execute natural language commands grounded in a fantasy text adventure game.

...read moreread less

Abstract: Contrary to most natural language processing research, which makes use of static datasets, humans learn language interactively, grounded in an environment. In this work we propose an interactive learning procedure called Mechanical Turker Descent (MTD) and use it to train agents to execute natural language commands grounded in a fantasy text adventure game. In MTD, Turkers compete to train better agents in the short term, and collaborate by sharing their agents' skills in the long term. This results in a gamified, engaging experience for the Turkers and a better quality teaching signal for the agents compared to static datasets, as the Turkers naturally adapt the training data to the agent's abilities.

...read moreread less