Showing papers on "Natural language published in 2019"

Posted Content•

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Jiasen Lu¹, Dhruv Batra², Devi Parikh², Stefan Lee²•Institutions (2)

Salesforce.com¹, Georgia Institute of Technology²

06 Aug 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper presents ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. It extends the popular BERT architecture to a multi-modal two-stream model, processing visual and textual inputs in separate streams that interact through co-attentional transformer layers.

Abstract: We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks -- visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval -- by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific models -- achieving state-of-the-art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability.

1,241 citations

Proceedings Article•

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Jiasen Lu¹, Dhruv Batra², Devi Parikh², Stefan Lee²•Institutions (2)

Salesforce.com¹, Georgia Institute of Technology²

06 Aug 2019

TL;DR: This paper presents ViLBERT, which extends the BERT architecture to a multi-modal two-stream model, processing visual and textual inputs in separate streams that interact through co-attentional transformer layers.

Abstract: We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks -- visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval -- by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific models -- achieving state-of-the-art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability.

1,069 citations

Posted Content•

CodeSearchNet Challenge: Evaluating the State of Semantic Code Search.

Hamel Husain, Ho-Hsiang Wu, Tiferet Ahavah Gazit, Miltiadis Allamanis¹, Marc Brockschmidt¹•Institutions (1)

Microsoft¹

20 Sep 2019-arXiv: Learning

TL;DR: This paper describes the methodology used to obtain the CodeSearchNet corpus and expert labels, as well as a number of simple baseline solutions for the semantic code search task.

Abstract: Semantic code search is the task of retrieving relevant code given a natural language query. While related to other information retrieval tasks, it requires bridging the gap between the language used in code (often abbreviated and highly technical) and natural language more suitable to describe vague concepts and ideas. To enable evaluation of progress on code search, we are releasing the CodeSearchNet Corpus and are presenting the CodeSearchNet Challenge, which consists of 99 natural language queries with about 4k expert relevance annotations of likely results from CodeSearchNet Corpus. The corpus contains about 6 million functions from open-source code spanning six programming languages (Go, Java, JavaScript, PHP, Python, and Ruby). The CodeSearchNet Corpus also contains automatically generated query-like natural language for 2 million functions, obtained from mechanically scraping and preprocessing associated function documentation. In this article, we describe the methodology used to obtain the corpus and expert labels, as well as a number of simple baseline solutions for the task. We hope that CodeSearchNet Challenge encourages researchers and practitioners to study this interesting task further and will host a competition and leaderboard to track the progress on the challenge. We are also keen on extending CodeSearchNet Challenge to more queries and programming languages in the future.

492 citations

Proceedings Article•DOI•

Multimodal Transformer for Unaligned Multimodal Language Sequences

Yao-Hung Hubert Tsai¹, Shaojie Bai¹, Paul Pu Liang¹, J. Zico Kolter¹,², Louis-Philippe Morency¹, Ruslan Salakhutdinov¹•Institutions (2)

Carnegie Mellon University¹, Bosch²

01 Jun 2019

TL;DR: This paper proposes a directional pairwise cross-modal attention mechanism that attends to interactions between multimodal sequences across distinct time steps and latently adapts streams from one modality to another.

Abstract: Human language is often multimodal, which comprehends a mixture of natural language, facial gestures, and acoustic behaviors. However, two major challenges in modeling such multimodal human language time-series data exist: 1) inherent data non-alignment due to variable sampling rates for the sequences from each modality; and 2) long-range dependencies between elements across modalities. In this paper, we introduce the Multimodal Transformer (MulT) to generically address the above issues in an end-to-end manner without explicitly aligning the data. At the heart of our model is the directional pairwise cross-modal attention, which attends to interactions between multimodal sequences across distinct time steps and latently adapt streams from one modality to another. Comprehensive experiments on both aligned and non-aligned multimodal time-series show that our model outperforms state-of-the-art methods by a large margin. In addition, empirical analysis suggests that correlated crossmodal signals are able to be captured by the proposed crossmodal attention mechanism in MulT.
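
The crossmodal attention idea described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain scaled dot-product attention, omits the learned query/key/value projections, multiple heads, and stacking that a full transformer would use, and the toy `language` and `audio` sequences are hypothetical. It only shows why no explicit alignment is needed: each step of the target sequence attends over every step of a source sequence of a different length.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def crossmodal_attention(target_seq, source_seq):
    """For each target step, return a softmax-weighted mix of all source steps.

    Assumption: both sequences share the same feature dimension, and no
    learned projections are applied (a real model would learn them).
    """
    dim = len(target_seq[0])
    adapted = []
    for query in target_seq:
        # Similarity of this target step to every source step.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in source_seq
        ]
        weights = softmax(scores)
        # Weighted average of the source features (source acts as values).
        mixed = [
            sum(w * value[d] for w, value in zip(weights, source_seq))
            for d in range(dim)
        ]
        adapted.append(mixed)
    return adapted


# Hypothetical toy data: 3 language steps and 5 audio steps (unaligned lengths).
language = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5, 0.1], [0.2, 0.9], [0.7, 0.7], [0.0, 0.3], [0.9, 0.0]]

adapted = crossmodal_attention(language, audio)
print(len(adapted), len(adapted[0]))
```

The output has one vector per target (language) step regardless of how many source (audio) steps there are, which is the property that lets sequences sampled at different rates interact without prior alignment.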

395 citations

Proceedings Article•DOI•

Explain Yourself! Leveraging Language Models for Commonsense Reasoning

Nazneen Fatema Rajani¹, Bryan McCann¹, Caiming Xiong¹, Richard Socher¹•Institutions (1)

Salesforce.com¹

06 Jun 2019

TL;DR: This work collects human explanations for commonsense reasoning in the form of natural language sequences and highlighted annotations in a new dataset called Common Sense Explanations to train language models to automatically generate explanations that can be used during training and inference in a novel Commonsense Auto-Generated Explanation framework.

Abstract: Deep learning models perform poorly on tasks that require commonsense reasoning, which often necessitates some form of world-knowledge or reasoning over information not immediately present in the input. We collect human explanations for commonsense reasoning in the form of natural language sequences and highlighted annotations in a new dataset called Common Sense Explanations (CoS-E). We use CoS-E to train language models to automatically generate explanations that can be used during training and inference in a novel Commonsense Auto-Generated Explanation (CAGE) framework. CAGE improves the state-of-the-art by 10% on the challenging CommonsenseQA task. We further study commonsense reasoning in DNNs using both human and auto-generated explanations including transfer to out-of-domain tasks. Empirical results indicate that we can effectively leverage language models for commonsense reasoning.

379 citations

Posted Content•

Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment

Di Jin¹, Zhijing Jin², Joey Tianyi Zhou³, Peter Szolovits¹•Institutions (3)

Massachusetts Institute of Technology¹, University of Hong Kong², Agency for Science, Technology and Research³

27 Jul 2019-arXiv: Computation and Language

TL;DR: This paper presents TextFooler, a simple but strong baseline for generating adversarial text. It outperforms previous attacks in success rate and perturbation rate, is utility-preserving, and is efficient, generating adversarial text with computational complexity linear in the text length.

Abstract: Machine learning algorithms are often vulnerable to adversarial examples that have imperceptible alterations from the original counterparts but can fool the state-of-the-art models. It is helpful to evaluate or even improve the robustness of these models by exposing the maliciously crafted adversarial examples. In this paper, we present TextFooler, a simple but strong baseline to generate natural adversarial text. By applying it to two fundamental natural language tasks, text classification and textual entailment, we successfully attacked three target models, including the powerful pre-trained BERT, and the widely used convolutional and recurrent neural networks. We demonstrate the advantages of this framework in three ways: (1) effective: it outperforms state-of-the-art attacks in terms of success rate and perturbation rate, (2) utility-preserving: it preserves semantic content and grammaticality, and remains correctly classified by humans, and (3) efficient: it generates adversarial text with computational complexity linear to the text length. The code, pre-trained target models, and test examples are available at this https URL.

370 citations

Posted Content•

Multimodal Transformer for Unaligned Multimodal Language Sequences

Yao-Hung Hubert Tsai¹, Shaojie Bai¹, Paul Pu Liang¹, J. Zico Kolter¹,², Louis-Philippe Morency¹, Ruslan Salakhutdinov¹•Institutions (2)

Carnegie Mellon University¹, Bosch²

01 Jun 2019-arXiv: Computation and Language

TL;DR: Comprehensive experiments on both aligned and non-aligned multimodal time-series show that the MulT model outperforms state-of-the-art methods by a large margin, and empirical analysis suggests that the proposed crossmodal attention mechanism in MulT can capture correlated crossmodal signals.

Abstract: Human language is often multimodal, which comprehends a mixture of natural language, facial gestures, and acoustic behaviors. However, two major challenges in modeling such multimodal human language time-series data exist: 1) inherent data non-alignment due to variable sampling rates for the sequences from each modality; and 2) long-range dependencies between elements across modalities. In this paper, we introduce the Multimodal Transformer (MulT) to generically address the above issues in an end-to-end manner without explicitly aligning the data. At the heart of our model is the directional pairwise crossmodal attention, which attends to interactions between multimodal sequences across distinct time steps and latently adapt streams from one modality to another. Comprehensive experiments on both aligned and non-aligned multimodal time-series show that our model outperforms state-of-the-art methods by a large margin. In addition, empirical analysis suggests that correlated crossmodal signals are able to be captured by the proposed crossmodal attention mechanism in MulT.

361 citations

Proceedings Article•DOI•

The Woman Worked as a Babysitter: On Biases in Language Generation

Emily Sheng¹, Kai-Wei Chang², Premkumar Natarajan¹, Nanyun Peng¹•Institutions (2)

Information Sciences Institute¹, University of California, Los Angeles²

01 Nov 2019

TL;DR: The notion of the regard towards a demographic is introduced, the varying levels of regard towards different demographics are used as a defining metric for bias in NLG, and the extent to which sentiment scores are a relevant proxy metric for regard is analyzed.

Abstract: We present a systematic study of biases in natural language generation (NLG) by analyzing text generated from prompts that contain mentions of different demographic groups. In this work, we introduce the notion of the regard towards a demographic, use the varying levels of regard towards different demographics as a defining metric for bias in NLG, and analyze the extent to which sentiment scores are a relevant proxy metric for regard. To this end, we collect strategically-generated text from language models and manually annotate the text with both sentiment and regard scores. Additionally, we build an automatic regard classifier through transfer learning, so that we can analyze biases in unseen text. Together, these methods reveal the extent of the biased nature of language model generations. Our analysis provides a study of biases in NLG, bias metrics and correlated human judgments, and empirical evidence on the usefulness of our annotated dataset.

355 citations

Posted Content•

Fine-Tuning Language Models from Human Preferences.

Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul F. Christiano, Geoffrey Irving

18 Sep 2019-arXiv: Computation and Language

TL;DR: This paper builds on advances in generative pretraining of language models to apply reward learning to four natural language tasks: continuing text with positive sentiment or physically descriptive language, and summarization tasks on the TL;DR and CNN/Daily Mail datasets.

Abstract: Reward learning enables the application of reinforcement learning (RL) to tasks where reward is defined by human judgment, building a model of reward by asking humans questions. Most work on reward learning has used simulated environments, but complex information about values is often expressed in natural language, and we believe reward learning for language is a key to making RL practical and safe for real-world tasks. In this paper, we build on advances in generative pretraining of language models to apply reward learning to four natural language tasks: continuing text with positive sentiment or physically descriptive language, and summarization tasks on the TL;DR and CNN/Daily Mail datasets. For stylistic continuation we achieve good results with only 5,000 comparisons evaluated by humans. For summarization, models trained with 60,000 comparisons copy whole sentences from the input but skip irrelevant preamble; this leads to reasonable ROUGE scores and very good performance according to our human labelers, but may be exploiting the fact that labelers rely on simple heuristics.
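
To make "building a model of reward by asking humans questions" concrete, here is a minimal sketch of fitting a reward model from comparisons. It is an illustration under stated assumptions, not the paper's method: it assumes pairwise comparisons, a linear reward over hand-made features, and a logistic (Bradley-Terry style) loss, whereas the paper works with pretrained language models. The feature vectors and the `fit_reward` helper are hypothetical.

```python
import math


def reward(weights, features):
    """Linear reward: dot product of weights and features (an assumption)."""
    return sum(w * x for w, x in zip(weights, features))


def fit_reward(comparisons, dim, lr=0.5, epochs=200):
    """Fit weights so that preferred samples score higher than rejected ones.

    Each comparison is (preferred_features, rejected_features).
    Loss per pair: -log(sigmoid(r(preferred) - r(rejected))).
    """
    weights = [0.0] * dim
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # 1 - sigmoid(margin): how strongly the model still disagrees
            # with the human label for this pair.
            disagreement = 1.0 / (1.0 + math.exp(margin))
            for d in range(dim):
                weights[d] += lr * disagreement * (preferred[d] - rejected[d])
    return weights


# Hypothetical features per text sample: [positivity, length].
comparisons = [
    ([0.9, 0.2], [0.1, 0.3]),
    ([0.8, 0.7], [0.3, 0.6]),
    ([0.7, 0.1], [0.2, 0.9]),
]

weights = fit_reward(comparisons, dim=2)
print(weights)
```

In a reinforcement learning setup the learned reward would then score new samples from the policy in place of a human; that step is omitted here.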

353 citations

Proceedings Article•DOI•

Knowledge Graph Embedding Based Question Answering

Xiao Huang¹, Jingyuan Zhang¹, Dingcheng Li¹, Ping Li¹•Institutions (1)

Baidu¹

30 Jan 2019

TL;DR: An effective Knowledge Embedding based Question Answering (KEQA) framework that focuses on answering the most common types of questions, i.e., simple questions, in which each question could be answered by the machine straightforwardly if its single head entity and single predicate are correctly identified.

Abstract: Question answering over knowledge graph (QA-KG) aims to use facts in the knowledge graph (KG) to answer natural language questions. It helps end users more efficiently and more easily access the substantial and valuable knowledge in the KG, without knowing its data structures. QA-KG is a nontrivial problem since capturing the semantic meaning of natural language is difficult for a machine. Meanwhile, many knowledge graph embedding methods have been proposed. The key idea is to represent each predicate/entity as a low-dimensional vector, such that the relation information in the KG could be preserved. The learned vectors could benefit various applications such as KG completion and recommender systems. In this paper, we explore to use them to handle the QA-KG problem. However, this remains a challenging task since a predicate could be expressed in different ways in natural language questions. Also, the ambiguity of entity names and partial names makes the number of possible answers large. To bridge the gap, we propose an effective Knowledge Embedding based Question Answering (KEQA) framework. We focus on answering the most common types of questions, i.e., simple questions, in which each question could be answered by the machine straightforwardly if its single head entity and single predicate are correctly identified. To answer a simple question, instead of inferring its head entity and predicate directly, KEQA targets at jointly recovering the question's head entity, predicate, and tail entity representations in the KG embedding spaces. Based on a carefully-designed joint distance metric, the three learned vectors' closest fact in the KG is returned as the answer. Experiments on a widely-adopted benchmark demonstrate that the proposed KEQA outperforms the state-of-the-art QA-KG methods.
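
The final retrieval step described in this abstract (returning the fact closest to the three predicted vectors) can be sketched as follows. The abstract does not specify the joint distance metric, so this sketch assumes a plain sum of Euclidean distances; the toy embeddings, the `closest_fact` helper, and the predicted vectors are all hypothetical stand-ins for what a trained model would produce.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_fact(pred_head, pred_rel, pred_tail, facts, entity_vecs, relation_vecs):
    """Return the (head, predicate, tail) fact nearest to the predicted vectors.

    Assumption: the joint distance is the unweighted sum of three L2 distances.
    """
    def joint_distance(fact):
        head, rel, tail = fact
        return (
            l2(pred_head, entity_vecs[head])
            + l2(pred_rel, relation_vecs[rel])
            + l2(pred_tail, entity_vecs[tail])
        )

    return min(facts, key=joint_distance)


# Hypothetical 2-d knowledge graph embeddings.
entity_vecs = {
    "Paris": [1.0, 0.0],
    "France": [1.0, 1.0],
    "Berlin": [0.0, 1.0],
    "Germany": [0.0, 2.0],
}
relation_vecs = {"capital_of": [0.0, 1.0], "located_in": [0.5, 0.5]}
facts = [
    ("Paris", "capital_of", "France"),
    ("Berlin", "capital_of", "Germany"),
    ("Paris", "located_in", "France"),
]

# Vectors a question encoder might predict for "What is Paris the capital of?".
best = closest_fact([0.9, 0.1], [0.1, 0.9], [1.1, 0.9], facts, entity_vecs, relation_vecs)
print(best)
```

The tail entity of the returned fact is the answer to the simple question.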

348 citations

Proceedings Article•DOI•

Deeper Text Understanding for IR with Contextual Neural Language Modeling

Zhuyun Dai¹, Jamie Callan¹•Institutions (1)

Carnegie Mellon University¹

18 Jul 2019

TL;DR: This paper studies leveraging a contextual neural language model, BERT, to provide deeper text understanding for IR. Compared to bag-of-words retrieval models, the contextual language model can better leverage language structures, bringing large improvements on queries written in natural language, and combining the text understanding ability with search knowledge leads to an enhanced pre-trained BERT model.

Abstract: Neural networks provide new possibilities to automatically learn complex language patterns and query-document relations. Neural IR models have achieved promising results in learning query-document relevance patterns, but few explorations have been done on understanding the text content of a query or a document. This paper studies leveraging a recently-proposed contextual neural language model, BERT, to provide deeper text understanding for IR. Experimental results demonstrate that the contextual text representations from BERT are more effective than traditional word embeddings. Compared to bag-of-words retrieval models, the contextual language model can better leverage language structures, bringing large improvements on queries written in natural languages. Combining the text understanding ability with search knowledge leads to an enhanced pre-trained BERT model that can benefit related search tasks where training data are limited.

Proceedings Article•DOI•

TOUCHDOWN: Natural Language Navigation and Spatial Reasoning in Visual Street Environments

Howard Chen¹, Alane Suhr¹, Dipendra Misra¹, Noah Snavely¹, Yoav Artzi¹•Institutions (1)

Cornell University¹

01 Jun 2019

TL;DR: This work introduces the Touchdown task and dataset, where an agent must first follow navigation instructions in a Street View environment to a goal position, and then guess a location in its observed environment described in natural language to find a hidden object.

Abstract: We study the problem of jointly reasoning about language and vision through a navigation and spatial reasoning task. We introduce the Touchdown task and dataset, where an agent must first follow navigation instructions in a Street View environment to a goal position, and then guess a location in its observed environment described in natural language to find a hidden object. The data contains 9326 examples of English instructions and spatial descriptions paired with demonstrations. We perform qualitative linguistic analysis, and show that the data displays a rich use of spatial reasoning. Empirical analysis shows the data presents an open challenge to existing methods.

Journal Article•DOI•

A Survey Of Cross-lingual Word Embedding Models

Sebastian Ruder, Ivan Vulić, Anders Søgaard

01 May 2019-Journal of Artificial Intelligence Research

TL;DR: Cross-lingual representations of words enable us to reason about word meaning in multilingual contexts and are a key facilitator of cross-lingual transfer when developing natural language processin...

Abstract: Cross-lingual representations of words enable us to reason about word meaning in multilingual contexts and are a key facilitator of cross-lingual transfer when developing natural language processin...

Proceedings Article•DOI•

Cross-Modal Self-Attention Network for Referring Image Segmentation

Linwei Ye¹, Mrigank Rochan¹, Zhi Liu², Yang Wang¹•Institutions (2)

University of Manitoba¹, Shanghai University²

01 Jun 2019

TL;DR: This paper proposes a cross-modal self-attention (CMSA) module that effectively captures the long-range dependencies between linguistic and visual features, and a gated multi-level fusion module to selectively integrate self-attentive cross-modal features corresponding to different levels in the image.

Abstract: We consider the problem of referring image segmentation. Given an input image and a natural language expression, the goal is to segment the object referred by the language expression in the image. Existing works in this area treat the language expression and the input image separately in their representations. They do not sufficiently capture long-range correlations between these two modalities. In this paper, we propose a cross-modal self-attention (CMSA) module that effectively captures the long-range dependencies between linguistic and visual features. Our model can adaptively focus on informative words in the referring expression and important regions in the input image. In addition, we propose a gated multi-level fusion module to selectively integrate self-attentive cross-modal features corresponding to different levels in the image. This module controls the information flow of features at different levels. We validate the proposed approach on four evaluation datasets. Our proposed approach consistently outperforms existing state-of-the-art methods.

Proceedings Article•DOI•

VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research

Xin Wang¹, Jiawei Wu¹, Junkun Chen², Lei Li, Yuan-Fang Wang¹, William Yang Wang¹•Institutions (2)

University of California, Santa Barbara¹, Fudan University²

01 Oct 2019

TL;DR: This work presents a new large-scale multilingual video description dataset, VATEX, which contains over 41,250 videos and 825,000 captions in both English and Chinese and demonstrates that the spatiotemporal video context can be effectively utilized to align source and target languages and thus assist machine translation.

Abstract: We present a new large-scale multilingual video description dataset, VATEX, which contains over 41,250 videos and 825,000 captions in both English and Chinese. Among the captions, there are over 206,000 English-Chinese parallel translation pairs. Compared to the widely-used MSR-VTT dataset, VATEX is multilingual, larger, linguistically complex, and more diverse in terms of both video and natural language descriptions. We also introduce two tasks for video-and-language research based on VATEX: (1) Multilingual Video Captioning, aimed at describing a video in various languages with a compact unified captioning model, and (2) Video-guided Machine Translation, to translate a source language description into the target language using the video information as additional spatiotemporal context. Extensive experiments on the VATEX dataset show that, first, the unified multilingual model can not only produce both English and Chinese descriptions for a video more efficiently, but also offer improved performance over the monolingual models. Furthermore, we demonstrate that the spatiotemporal video context can be effectively utilized to align source and target languages and thus assist machine translation. In the end, we discuss the potentials of using VATEX for other video-and-language research.

Proceedings Article•DOI•

A neural model for generating natural language summaries of program subroutines

Alexander LeClair¹, Siyuan Jiang², Collin McMillan¹•Institutions (2)

University of Notre Dame¹, Eastern Michigan University²

25 May 2019

TL;DR: In this article, a neural model that combines words from code with code structure from an AST is presented, which allows the model to learn code structure independent of the text in code.

Abstract: Source code summarization -- creating natural language descriptions of source code behavior -- is a rapidly-growing research topic with applications to automatic documentation generation, program comprehension, and software maintenance. Traditional techniques relied on heuristics and templates built manually by human experts. Recently, data-driven approaches based on neural machine translation have largely overtaken template-based systems. But nearly all of these techniques rely almost entirely on programs having good internal documentation; without clear identifier names, the models fail to create good summaries. In this paper, we present a neural model that combines words from code with code structure from an AST. Unlike previous approaches, our model processes each data source as a separate input, which allows the model to learn code structure independent of the text in code. This process helps our approach provide coherent summaries in many cases even when zero internal documentation is provided. We evaluate our technique with a dataset we created from 2.1m Java methods. We find improvement over two baseline techniques from SE literature and one from NLP literature.

Posted Content•

ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks

Mohit Shridhar¹, Jesse Thomason¹, Daniel Gordon¹, Yonatan Bisk², Winson Han², Roozbeh Mottaghi², Luke Zettlemoyer¹, Dieter Fox³•Institutions (3)

University of Washington¹, Allen Institute for Artificial Intelligence², Nvidia³

03 Dec 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: It is shown that a baseline model based on recent embodied vision-and-language tasks performs poorly on ALFRED, suggesting that there is significant room for developing innovative grounded visual language understanding models with this benchmark.

Abstract: We present ALFRED (Action Learning From Realistic Environments and Directives), a benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks. ALFRED includes long, compositional tasks with non-reversible state changes to shrink the gap between research benchmarks and real-world applications. ALFRED consists of expert demonstrations in interactive visual environments for 25k natural language directives. These directives contain both high-level goals like "Rinse off a mug and place it in the coffee maker." and low-level language instructions like "Walk to the coffee maker on the right." ALFRED tasks are more complex in terms of sequence length, action space, and language than existing vision-and-language task datasets. We show that a baseline model based on recent embodied vision-and-language tasks performs poorly on ALFRED, suggesting that there is significant room for developing innovative grounded visual language understanding models with this benchmark.

Proceedings Article•DOI•

SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization

Haoming Jiang¹, Pengcheng He², Weizhu Chen², Xiaodong Liu², Jianfeng Gao², Tuo Zhao¹•Institutions (2)

Georgia Institute of Technology¹, Microsoft²

08 Nov 2019-arXiv: Computation and Language

TL;DR: This paper proposes a new learning framework for robust and efficient fine-tuning of pre-trained models that attains better generalization performance and outperforms the state-of-the-art T5 model, which is the largest pre-trained model containing 11 billion parameters, on GLUE.

Abstract: Transfer learning has fundamentally changed the landscape of natural language processing (NLP) research. Many existing state-of-the-art models are first pre-trained on a large text corpus and then fine-tuned on downstream tasks. However, due to limited data resources from downstream tasks and the extremely large capacity of pre-trained models, aggressive fine-tuning often causes the adapted model to overfit the data of downstream tasks and forget the knowledge of the pre-trained model. To address the above issue in a more principled manner, we propose a new computational framework for robust and efficient fine-tuning for pre-trained language models. Specifically, our proposed framework contains two important ingredients: 1. Smoothness-inducing regularization, which effectively manages the capacity of the model; 2. Bregman proximal point optimization, which is a class of trust-region methods and can prevent knowledge forgetting. Our experiments demonstrate that our proposed method achieves the state-of-the-art performance on multiple NLP benchmarks.

Posted Content•

Abductive Commonsense Reasoning

Chandra Bhagavatula¹, Ronan Le Bras¹, Chaitanya Malaviya¹, Keisuke Sakaguchi¹, Ari Holtzman¹, Hannah Rashkin¹, Doug Downey¹, Scott Wen-tau Yih², Yejin Choi³•Institutions (3)

Allen Institute for Artificial Intelligence¹, University of Washington², Facebook³

15 Aug 2019-arXiv: Computation and Language

TL;DR: This study introduces a challenge dataset, ART, that consists of over 20k commonsense narrative contexts and 200k explanations, and conceptualizes two new tasks: Abductive NLI, a multiple-choice question answering task for choosing the more likely explanation, and Abductive NLG, a conditional generation task for explaining given observations in natural language.

Abstract: Abductive reasoning is inference to the most plausible explanation. For example, if Jenny finds her house in a mess when she returns from work, and remembers that she left a window open, she can hypothesize that a thief broke into her house and caused the mess, as the most plausible explanation. While abduction has long been considered to be at the core of how people interpret and read between the lines in natural language (Hobbs et al., 1988), there has been relatively little research in support of abductive natural language inference and generation. We present the first study that investigates the viability of language-based abductive reasoning. We introduce a challenge dataset, ART, that consists of over 20k commonsense narrative contexts and 200k explanations. Based on this dataset, we conceptualize two new tasks -- (i) Abductive NLI: a multiple-choice question answering task for choosing the more likely explanation, and (ii) Abductive NLG: a conditional generation task for explaining given observations in natural language. On Abductive NLI, the best model achieves 68.9% accuracy, well below human performance of 91.4%. On Abductive NLG, the current best language generators struggle even more, as they lack reasoning capabilities that are trivial for humans. Our analysis leads to new insights into the types of reasoning that deep pre-trained language models fail to perform--despite their strong performance on the related but more narrowly defined task of entailment NLI--pointing to interesting avenues for future research.

Proceedings Article•DOI•

Neighbourhood Watch: Referring Expression Comprehension via Language-Guided Graph Attention Networks

Peng Wang¹, Qi Wu¹, Jiewei Cao¹, Chunhua Shen¹, Lianli Gao², Anton van den Hengel¹•Institutions (2)

University of Adelaide¹, University of Electronic Science and Technology of China²

15 Jun 2019

TL;DR: A graph-based, language-guided attention mechanism is proposed that represents inter-object relationships and properties with a flexibility and power impossible with competing approaches, and that makes the comprehension decision visualizable and explainable.

Abstract: The task in referring expression comprehension is to localize the object instance in an image described by a referring expression phrased in natural language. As a language-to-vision matching task, the key to this problem is to learn a discriminative object feature that can adapt to the expression used. To avoid ambiguity, the expression normally tends to describe not only the properties of the referent itself, but also its relationships to its neighbourhood. To capture and exploit this important information we propose a graph-based, language-guided attention mechanism. Composed of a node attention component and an edge attention component, the proposed graph attention mechanism explicitly represents inter-object relationships and properties with a flexibility and power impossible with competing approaches. Furthermore, the proposed graph attention mechanism enables the comprehension decision to be visualizable and explainable. Experiments on three referring expression comprehension datasets show the advantage of the proposed approach.

Proceedings Article•DOI•

A Corpus for Reasoning about Natural Language Grounded in Photographs

Alane Suhr¹, Stephanie Zhou, Ally Zhang, Iris Zhang², Huajun Bai, Yoav Artzi¹•Institutions (2)

Cornell University¹, Facebook²

01 Jul 2019

TL;DR: This work introduces a new dataset for joint reasoning about natural language and images, with a focus on semantic diversity, compositionality, and visual reasoning challenges; evaluation using state-of-the-art visual reasoning methods shows that the data presents a strong challenge.

Abstract: We introduce a new dataset for joint reasoning about natural language and images, with a focus on semantic diversity, compositionality, and visual reasoning challenges. The data contains 107,292 examples of English sentences paired with web photographs. The task is to determine whether a natural language caption is true about a pair of photographs. We crowdsource the data using sets of visually rich images and a compare-and-contrast task to elicit linguistically diverse language. Qualitative analysis shows the data requires compositional joint reasoning, including about quantities, comparisons, and relations. Evaluation using state-of-the-art visual reasoning methods shows the data presents a strong challenge.

Proceedings Article•

Use What You Have: Video retrieval using representations from collaborative experts.

Yang Liu¹, Samuel Albanie², Arsha Nagrani², Andrew Zisserman²•Institutions (2)

University of Cambridge¹, University of Oxford²

01 Jan 2019

TL;DR: A collaborative experts model is proposed to aggregate information from different pre-trained experts, and the approach is assessed empirically on five retrieval benchmarks: MSR-VTT, LSMDC, MSVD, DiDeMo, and ActivityNet.

Abstract: The rapid growth of video on the internet has made searching for video content using natural language queries a significant challenge. Human-generated queries for video datasets 'in the wild' vary a lot in terms of degree of specificity, with some queries describing specific details such as the names of famous identities, content from speech, or text available on the screen. Our goal is to condense the multi-modal, extremely high dimensional information from videos into a single, compact video representation for the task of video retrieval using free-form text queries, where the degree of specificity is open-ended. For this we exploit existing knowledge in the form of pre-trained semantic embeddings which include 'general' features such as motion, appearance, and scene features from visual content. We also explore the use of more 'specific' cues from ASR and OCR which are intermittently available for videos and find that these signals remain challenging to use effectively for retrieval. We propose a collaborative experts model to aggregate information from these different pre-trained experts and assess our approach empirically on five retrieval benchmarks: MSR-VTT, LSMDC, MSVD, DiDeMo, and ActivityNet. Code and data can be found at this http URL. This paper contains a correction to results reported in the previous version.

Posted Content•

Towards Scalable Multi-domain Conversational Agents: The Schema-Guided Dialogue Dataset

Abhinav Rastogi¹, Xiaoxue Zang¹, Srinivas Sunkara¹, Raghav Gupta¹, Pranav Khaitan¹•Institutions (1)

Google¹

12 Sep 2019-arXiv: Computation and Language

TL;DR: This work introduces the Schema-Guided Dialogue (SGD) dataset, a large-scale task-oriented dialogue dataset containing over 16k multi-domain conversations spanning 16 domains.

Abstract: Virtual assistants such as Google Assistant, Alexa and Siri provide a conversational interface to a large number of services and APIs spanning multiple domains. Such systems need to support an ever-increasing number of services with possibly overlapping functionality. Furthermore, some of these services have little to no training data available. Existing public datasets for task-oriented dialogue do not sufficiently capture these challenges since they cover few domains and assume a single static ontology per domain. In this work, we introduce the Schema-Guided Dialogue (SGD) dataset, containing over 16k multi-domain conversations spanning 16 domains. Our dataset exceeds the existing task-oriented dialogue corpora in scale, while also highlighting the challenges associated with building large-scale virtual assistants. It provides a challenging testbed for a number of tasks including language understanding, slot filling, dialogue state tracking and response generation. Along the same lines, we present a schema-guided paradigm for task-oriented dialogue, in which predictions are made over a dynamic set of intents and slots, provided as input, using their natural language descriptions. This allows a single dialogue system to easily support a large number of services and facilitates simple integration of new services without requiring additional training data. Building upon the proposed paradigm, we release a model for dialogue state tracking capable of zero-shot generalization to new APIs, while remaining competitive in the regular setting.

Posted Content•

Is BERT Really Robust? Natural Language Attack on Text Classification and Entailment

Di Jin, Zhijing Jin, Joey Tianyi Zhou, Peter Szolovits

27 Jul 2019

TL;DR: TextFooler, a general attack framework for generating natural adversarial texts, is presented; it outperforms state-of-the-art attacks in terms of success rate and perturbation rate.

Proceedings Article•DOI•

MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment

Da Zhang¹, Xiyang Dai², Xin Wang¹, Yuan-Fang Wang¹, Larry S. Davis³•Institutions (3)

University of California, Santa Barbara¹, Microsoft², University of Maryland, College Park³

01 Jun 2019

TL;DR: A Moment Alignment Network (MAN) is proposed that explicitly models moment-wise temporal relations as a structured graph and uses an iterative graph adjustment network to jointly learn the best structure in an end-to-end manner.

Abstract: This research strives for natural language moment retrieval in long, untrimmed video streams. The problem is not trivial, especially when a video contains multiple moments of interest and the language describes complex temporal dependencies, which often happens in real scenarios. We identify two crucial challenges: semantic misalignment and structural misalignment. However, existing approaches treat different moments separately and do not explicitly model complex moment-wise temporal relations. In this paper, we present the Moment Alignment Network (MAN), a novel framework that unifies the candidate moment encoding and temporal structural reasoning in a single-shot feed-forward network. MAN naturally assigns candidate moment representations aligned with language semantics over different temporal locations and scales. Most importantly, we propose to explicitly model moment-wise temporal relations as a structured graph and devise an iterative graph adjustment network to jointly learn the best structure in an end-to-end manner. We evaluate the proposed approach on two challenging public benchmarks, DiDeMo and Charades-STA, where our MAN outperforms the state of the art by a large margin.

Journal Article•DOI•

How Efficiency Shapes Human Language.

Edward Gibson¹, Richard Futrell², Steven Piantadosi³, Isabelle Dautriche⁴, Kyle Mahowald⁵, Leon Bergen⁵, Roger Levy¹•Institutions (5)

Massachusetts Institute of Technology¹, University of California, Irvine², University of California, Berkeley³, University of Edinburgh⁴, University of California, San Diego⁵

01 May 2019-Trends in Cognitive Sciences

TL;DR: These studies show how a pervasive pressure for efficiency guides the forms of natural language and indicate that a rich future for language research lies in connecting linguistics to cognitive psychology and mathematical theories of communication and inference.

Proceedings Article•

Controllable Text-to-Image Generation

Bowen Li¹, Xiaojuan Qi², Thomas Lukasiewicz¹, Philip H. S. Torr¹•Institutions (2)

University of Oxford¹, The Chinese University of Hong Kong²

01 Jan 2019

TL;DR: A novel controllable text-to-image generative adversarial network (ControlGAN) is proposed, which can effectively synthesise high-quality images and also control parts of the image generation according to natural language descriptions.

Abstract: In this paper, we propose a novel controllable text-to-image generative adversarial network (ControlGAN), which can effectively synthesise high-quality images and also control parts of the image generation according to natural language descriptions. To achieve this, we introduce a word-level spatial and channel-wise attention-driven generator that can disentangle different visual attributes, and allow the model to focus on generating and manipulating subregions corresponding to the most relevant words. Also, a word-level discriminator is proposed to provide fine-grained supervisory feedback by correlating words with image regions, facilitating training an effective generator which is able to manipulate specific visual attributes without affecting the generation of other content. Furthermore, perceptual loss is adopted to reduce the randomness involved in the image generation, and to encourage the generator to manipulate specific attributes required in the modified text. Extensive experiments on benchmark datasets demonstrate that our method outperforms existing state of the art, and is able to effectively manipulate synthetic images using natural language descriptions.

Journal Article•DOI•

Inherent Disagreements in Human Textual Inferences

Ellie Pavlick¹, Tom Kwiatkowski²•Institutions (2)

Brown University¹, Google²

14 Nov 2019-Transactions of the Association for Computational Linguistics

TL;DR: The authors argue for a refined evaluation objective that requires models to explicitly capture the full distribution of plausible human judgments, so that model uncertainty reflects the type of uncertainty present in human disagreements.

Abstract: We analyze humans' disagreements about the validity of natural language inferences. We show that, very often, disagreements are not dismissible as annotation “noise”, but rather persist as we collect more ratings and as we vary the amount of context provided to raters. We further show that the type of uncertainty captured by current state-of-the-art models for natural language inference is not reflective of the type of uncertainty present in human disagreements. We discuss implications of our results in relation to the recognizing textual entailment (RTE)/natural language inference (NLI) task. We argue for a refined evaluation objective which requires models to explicitly capture the full distribution of plausible human judgments.

Proceedings Article•DOI•

When deep learning met code search

José Pablo Cambronero¹, Hongyu Li², Seohyun Kim², Koushik Sen³, Satish Chandra²•Institutions (3)

Massachusetts Institute of Technology¹, Facebook², University of California, Berkeley³

12 Aug 2019

TL;DR: In this paper, the authors evaluate the performance of supervised techniques for code search using natural language and show that adding supervision to an existing unsupervised technique can improve performance, though not necessarily by much.

Abstract: There have been multiple recent proposals on using deep neural networks for code search using natural language. Common across these proposals is the idea of embedding code and natural language queries into real vectors and then using vector distance to approximate semantic correlation between code and the query. Multiple approaches exist for learning these embeddings, including unsupervised techniques, which rely only on a corpus of code examples, and supervised techniques, which use an aligned corpus of paired code and natural language descriptions. The goal of this supervision is to produce embeddings that are more similar for a query and the corresponding desired code snippet. Clearly, there are choices in whether to use supervised techniques at all, and if one does, what sort of network and training to use for supervision. This paper is the first to evaluate these choices systematically. To this end, we assembled implementations of state-of-the-art techniques to run on a common platform, training and evaluation corpora. To explore the design space in network complexity, we also introduced a new design point that is a minimal supervision extension to an existing unsupervised technique. Our evaluation shows that: 1. adding supervision to an existing unsupervised technique can improve performance, though not necessarily by much; 2. simple networks for supervision can be more effective than more sophisticated sequence-based networks for code search; 3. while it is common to use docstrings to carry out supervision, there is a sizeable gap between the effectiveness of docstrings and a more query-appropriate supervision corpus.
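
The shared idea described in this abstract, ranking code by vector distance to a query embedding, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the snippet names and three-dimensional vectors are toy values rather than outputs of any model from the paper, and cosine similarity is only one possible choice of distance, since the abstract does not name a specific metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_snippets(query_vec, snippet_vecs):
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in snippet_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings: in a real system these would come from a trained
# code encoder and query encoder, not from hand-written numbers.
snippets = {
    "read_file": [0.9, 0.1, 0.0],
    "sort_list": [0.0, 0.2, 0.9],
    "parse_json": [0.6, 0.7, 0.1],
}
query = [1.0, 0.2, 0.0]  # stands in for an embedded natural language query

ranking = rank_snippets(query, snippets)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy setup the snippet whose vector points in nearly the same direction as the query is ranked first; supervised and unsupervised approaches differ in how the embeddings are learned, not in this retrieval step.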

Posted Content•

TabFact: A Large-scale Dataset for Table-based Fact Verification

Wenhu Chen¹, Hongmin Wang¹, Jianshu Chen¹, Yunkai Zhang, Hong Wang¹, Shiyang Li¹, Xiyou Zhou¹, William Yang Wang¹•Institutions (1)

University of California, Santa Barbara¹

05 Sep 2019-arXiv: Computation and Language

TL;DR: A large-scale dataset, TabFact, is constructed with 16k Wikipedia tables as the evidence for 118k human-annotated natural language statements labeled as either ENTAILED or REFUTED, and two different models are designed: Table-BERT and Latent Program Algorithm (LPA).

Abstract: The problem of verifying whether a textual hypothesis holds based on the given evidence, also known as fact verification, plays an important role in the study of natural language understanding and semantic representation. However, existing studies are mainly restricted to dealing with unstructured evidence (e.g., natural language sentences and documents, news, etc.), while verification under structured evidence, such as tables, graphs, and databases, remains under-explored. This paper specifically aims to study fact verification given semi-structured data as evidence. To this end, we construct a large-scale dataset called TabFact with 16k Wikipedia tables as the evidence for 118k human-annotated natural language statements, which are labeled as either ENTAILED or REFUTED. TabFact is challenging since it involves both soft linguistic reasoning and hard symbolic reasoning. To address these reasoning challenges, we design two different models: Table-BERT and Latent Program Algorithm (LPA). Table-BERT leverages the state-of-the-art pre-trained language model to encode the linearized tables and statements into continuous vectors for verification. LPA parses statements into programs and executes them against the tables to obtain the returned binary value for verification. Both methods achieve similar accuracy but still lag far behind human performance. We also perform a comprehensive analysis to demonstrate great future opportunities. The data and code of the dataset are provided in this https URL.
