
Showing papers on "Natural language" published in 2020


Posted Content
TL;DR: This work develops CodeBERT with Transformer-based neural architecture, and trains it with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators.
Abstract: We present CodeBERT, a bimodal pre-trained model for programming language (PL) and natural language (NL). CodeBERT learns general-purpose representations that support downstream NL-PL applications such as natural language code search, code documentation generation, etc. We develop CodeBERT with Transformer-based neural architecture, and train it with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators. This enables us to utilize both bimodal data of NL-PL pairs and unimodal data, where the former provides input tokens for model training while the latter helps to learn better generators. We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters. Results show that CodeBERT achieves state-of-the-art performance on both natural language code search and code documentation generation tasks. Furthermore, to investigate what type of knowledge is learned in CodeBERT, we construct a dataset for NL-PL probing, and evaluate in a zero-shot setting where parameters of pre-trained models are fixed. Results show that CodeBERT performs better than previous pre-trained models on NL-PL probing.

867 citations
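The replaced-token-detection objective mentioned in the abstract above can be illustrated with a short sketch. This is a minimal illustration, not the authors' implementation: `sample_from_generator` is a hypothetical stand-in for the small masked language model that proposes plausible alternatives.

```python
import random

def make_rtd_example(tokens, sample_from_generator, mask_rate=0.15):
    """Build a replaced-token-detection training example.

    A fraction of positions is corrupted with plausible alternatives drawn
    from a generator; the discriminator's targets say, per token, whether it
    is original (0) or replaced (1).
    """
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            # The generator proposes a plausible substitute for this position.
            alt = sample_from_generator(tok)
            corrupted.append(alt)
            labels.append(0 if alt == tok else 1)
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels

# Toy usage with a trivial "generator" that swaps in a random vocabulary word.
vocab = ["return", "x", "y", "sum", "value", "result"]
tokens = "def add ( x , y ) : return x + y".split()
corrupted, labels = make_rtd_example(tokens, lambda _tok: random.choice(vocab))
print(list(zip(corrupted, labels)))
```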


Proceedings Article
30 Apr 2020
TL;DR: By sampling text from the dynamic nucleus of the probability distribution, which allows for diversity while effectively truncating the less reliable tail of the distribution, the resulting text better demonstrates the quality of human text, yielding enhanced diversity without sacrificing fluency and coherence.
Abstract: Despite considerable advances in neural language modeling, it remains an open question what the best decoding strategy is for text generation from a language model (e.g. to generate a story). The counter-intuitive empirical observation is that even though the use of likelihood as training objective leads to high quality models for a broad range of language understanding tasks, maximization-based decoding methods such as beam search lead to degeneration — output text that is bland, incoherent, or gets stuck in repetitive loops. To address this we propose Nucleus Sampling, a simple but effective method to draw considerably higher quality text out of neural language models. Our approach avoids text degeneration by truncating the unreliable tail of the probability distribution, sampling from the dynamic nucleus of tokens containing the vast majority of the probability mass. To properly examine current maximization-based and stochastic decoding methods, we compare generations from each of these methods to the distribution of human text along several axes such as likelihood, diversity, and repetition. Our results show that (1) maximization is an inappropriate decoding objective for open-ended text generation, (2) the probability distributions of the best current language models have an unreliable tail which needs to be truncated during generation and (3) Nucleus Sampling is the best decoding strategy for generating long-form text that is both high-quality — as measured by human evaluation — and as diverse as human-written text.

682 citations
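A minimal sketch of the nucleus (top-p) sampling step described above, in NumPy; it assumes the next-token probability vector has already been produced by some language model, and the cutoff rule shown here is one straightforward reading of "smallest set of tokens containing at least probability mass p".

```python
import numpy as np

def nucleus_sample(probs, p=0.9, rng=np.random.default_rng()):
    """Sample a token id from the smallest set of most-probable tokens whose
    cumulative probability reaches p (the "nucleus"), after renormalizing."""
    order = np.argsort(probs)[::-1]          # most probable first
    sorted_probs = probs[order]
    cumulative = np.cumsum(sorted_probs)
    # Keep tokens up to and including the first index where the mass reaches p.
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    nucleus_ids = order[:cutoff]
    nucleus_probs = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    return int(rng.choice(nucleus_ids, p=nucleus_probs))

# Toy usage: a distribution with a long, unreliable tail.
probs = np.array([0.45, 0.30, 0.15, 0.04, 0.03, 0.02, 0.01])
print(nucleus_sample(probs, p=0.9))  # only ids 0-2 can ever be drawn
```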


Posted Content
TL;DR: This work introduces Pattern-Exploiting Training (PET), a semi-supervised training procedure that reformulates input examples as cloze-style phrases to help language models understand a given task.
Abstract: Some NLP tasks can be solved in a fully unsupervised fashion by providing a pretrained language model with "task descriptions" in natural language (e.g., Radford et al., 2019). While this approach underperforms its supervised counterpart, we show in this work that the two ideas can be combined: We introduce Pattern-Exploiting Training (PET), a semi-supervised training procedure that reformulates input examples as cloze-style phrases to help language models understand a given task. These phrases are then used to assign soft labels to a large set of unlabeled examples. Finally, standard supervised training is performed on the resulting training set. For several tasks and languages, PET outperforms supervised training and strong semi-supervised approaches in low-resource settings by a large margin.

675 citations
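A rough sketch of the pattern/verbalizer idea behind PET for a binary sentiment task: an input is rewritten as a cloze phrase, and the masked-position probabilities of a few verbalizer words stand in for label probabilities. The pattern, the verbalizer, and the `mask_word_probs` hook below are hypothetical illustrations, not the paper's released code.

```python
def pet_soft_label(text, mask_word_probs):
    """Assign a soft label to `text` by reformulating it as a cloze phrase.

    `mask_word_probs(cloze, candidates)` is assumed to return the masked-LM
    probability of each candidate word at the [MASK] position.
    """
    pattern = f"{text} It was [MASK]."                     # one possible pattern
    verbalizer = {"great": "positive", "terrible": "negative"}
    probs = mask_word_probs(pattern, list(verbalizer))
    total = sum(probs.values())
    return {verbalizer[w]: probs[w] / total for w in verbalizer}

# Toy usage with a fake masked LM that always prefers "great".
fake_mlm = lambda cloze, cands: {w: (0.8 if w == "great" else 0.2) for w in cands}
print(pet_soft_label("The movie was a delight.", fake_mlm))
```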


Book ChapterDOI
23 Aug 2020
TL;DR: A multi-modal transformer to jointly encode the different modalities in video, which allows each of them to attend to the others, and a novel framework to establish state-of-the-art results for video retrieval on three datasets.
Abstract: The task of retrieving video content relevant to natural language queries plays a critical role in effectively handling internet-scale datasets. Most of the existing methods for this caption-to-video retrieval problem do not fully exploit cross-modal cues present in video. Furthermore, they aggregate per-frame visual features with limited or no temporal information. In this paper, we present a multi-modal transformer to jointly encode the different modalities in video, which allows each of them to attend to the others. The transformer architecture is also leveraged to encode and model the temporal information. On the natural language side, we investigate the best practices to jointly optimize the language embedding together with the multi-modal transformer. This novel framework allows us to establish state-of-the-art results for video retrieval on three datasets. More details are available at http://thoth.inrialpes.fr/research/MMT.

389 citations


Posted Content
TL;DR: It is shown that this approach scales surprisingly well with model size and outperforms models that explicitly look up knowledge on the open-domain variants of Natural Questions and WebQuestions.
Abstract: It has recently been observed that neural language models trained on unstructured text can implicitly store and retrieve knowledge using natural language queries. In this short paper, we measure the practical utility of this approach by fine-tuning pre-trained models to answer questions without access to any external context or knowledge. We show that this approach scales with model size and performs competitively with open-domain systems that explicitly retrieve answers from an external knowledge source when answering questions. To facilitate reproducibility and future work, we release our code and trained models at https://goo.gle/t5-cbqa.

365 citations
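As a rough illustration of closed-book question answering with a text-to-text model, the sketch below queries a generic T5 checkpoint through the Hugging Face transformers library. The model name is only a stand-in; the dedicated checkpoints at https://goo.gle/t5-cbqa (and their exact prompt format) are the ones evaluated in the paper.

```python
# Assumes: pip install transformers sentencepiece torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "t5-small"  # stand-in; the paper releases dedicated closed-book checkpoints
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Closed-book: no passage or retrieved context is provided, only the question,
# so any answer must come from knowledge stored in the model parameters.
question = "who wrote the novel Moby-Dick?"
input_ids = tokenizer(question, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_new_tokens=16)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```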


Proceedings ArticleDOI
01 Jul 2020
TL;DR: TaPas is presented, an approach to question answering over tables without generating logical forms that outperforms or rivals semantic parsing models by improving state-of-the-art accuracy on SQA and performing on par with the state of the art on WikiSQL and WikiTQ, but with a simpler model architecture.
Abstract: Answering natural language questions over tables is usually seen as a semantic parsing task. To alleviate the collection cost of full logical forms, one popular approach focuses on weak supervision consisting of denotations instead of logical forms. However, training semantic parsers from weak supervision poses difficulties, and in addition, the generated logical forms are only used as an intermediate step prior to retrieving the denotation. In this paper, we present TaPas, an approach to question answering over tables without generating logical forms. TaPas trains from weak supervision, and predicts the denotation by selecting table cells and optionally applying a corresponding aggregation operator to such selection. TaPas extends BERT’s architecture to encode tables as input, initializes from an effective joint pre-training of text segments and tables crawled from Wikipedia, and is trained end-to-end. We experiment with three different semantic parsing datasets, and find that TaPas outperforms or rivals semantic parsing models by improving state-of-the-art accuracy on SQA from 55.1 to 67.2 and performing on par with the state-of-the-art on WikiSQL and WikiTQ, but with a simpler model architecture. We additionally find that transfer learning, which is trivial in our setting, from WikiSQL to WikiTQ, yields 48.7 accuracy, 4.2 points above the state-of-the-art.

358 citations
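The denotation-prediction step that replaces logical forms in TaPas can be caricatured in a few lines: the model scores each table cell and each aggregation operator, and the answer is the operator applied to the selected cells. The scores below are made up; this is a schematic sketch, not the TaPas code.

```python
import numpy as np

def predict_denotation(cell_values, cell_probs, agg_logits, threshold=0.5):
    """Select table cells whose probability exceeds `threshold` and apply the
    highest-scoring aggregation operator (NONE / COUNT / SUM / AVERAGE)."""
    operators = ["NONE", "COUNT", "SUM", "AVERAGE"]
    selected = [v for v, p in zip(cell_values, cell_probs) if p > threshold]
    op = operators[int(np.argmax(agg_logits))]
    if op == "NONE":
        return selected                       # the answer is the cells themselves
    if op == "COUNT":
        return len(selected)
    if op == "SUM":
        return sum(selected)
    return sum(selected) / len(selected) if selected else 0.0

# Toy example: "How many medals did the top two rows win in total?"
cells = [12, 9, 4]                    # a numeric column of the table
cell_probs = [0.92, 0.81, 0.10]       # hypothetical per-cell selection probabilities
agg_logits = [0.1, 0.2, 2.3, 0.4]     # hypothetical operator scores (SUM wins)
print(predict_denotation(cells, cell_probs, agg_logits))   # -> 21
```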


Journal ArticleDOI
03 Apr 2020
TL;DR: TextFooler as discussed by the authors is a baseline to generate adversarial text for text classification and textual entailment tasks, and it outperforms previous attacks by success rate and perturbation rate, preserving semantic content, grammaticality, and correct types.
Abstract: Machine learning algorithms are often vulnerable to adversarial examples that have imperceptible alterations from the original counterparts but can fool the state-of-the-art models. It is helpful to evaluate or even improve the robustness of these models by exposing the maliciously crafted adversarial examples. In this paper, we present TextFooler, a simple but strong baseline to generate adversarial text. By applying it to two fundamental natural language tasks, text classification and textual entailment, we successfully attacked three target models, including the powerful pre-trained BERT, and the widely used convolutional and recurrent neural networks. We demonstrate three advantages of this framework: (1) effective—it outperforms previous attacks by success rate and perturbation rate, (2) utility-preserving—it preserves semantic content, grammaticality, and correct types classified by humans, and (3) efficient—it generates adversarial text with computational complexity linear to the text length.1

335 citations
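A heavily simplified sketch of the greedy word-substitution loop behind attacks of this kind. `predict_proba` and `synonyms` are hypothetical hooks for the victim classifier and for a synonym source such as counter-fitted word embeddings; the semantic-similarity and part-of-speech checks used in the paper are omitted.

```python
def greedy_word_attack(words, true_label, predict_proba, synonyms):
    """Replace words one at a time, most important first, until the
    classifier's prediction flips away from `true_label`."""
    base = predict_proba(words)[true_label]

    # Importance of a word = drop in true-class probability when it is removed.
    importance = []
    for i in range(len(words)):
        reduced = words[:i] + words[i + 1:]
        importance.append((base - predict_proba(reduced)[true_label], i))

    adv = list(words)
    for _, i in sorted(importance, reverse=True):
        candidates = synonyms(adv[i])
        if not candidates:
            continue
        # Pick the substitute that lowers the true-class probability the most.
        best = min(candidates,
                   key=lambda w: predict_proba(adv[:i] + [w] + adv[i + 1:])[true_label])
        adv[i] = best
        probs = predict_proba(adv)
        if max(range(len(probs)), key=probs.__getitem__) != true_label:
            return adv               # attack succeeded
    return None                      # no adversarial example found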


Proceedings ArticleDOI
01 Jul 2020
TL;DR: TaBERT is a pretrained LM that jointly learns representations for NL sentences and (semi-)structured tables that achieves new best results on the challenging weakly-supervised semantic parsing benchmark WikiTableQuestions, while performing competitively on the text-to-SQL dataset Spider.
Abstract: Recent years have witnessed the burgeoning of pretrained language models (LMs) for text-based natural language (NL) understanding tasks. Such models are typically trained on free-form NL text, hence may not be suitable for tasks like semantic parsing over structured data, which require reasoning over both free-form NL questions and structured tabular data (e.g., database tables). In this paper we present TaBERT, a pretrained LM that jointly learns representations for NL sentences and (semi-)structured tables. TaBERT is trained on a large corpus of 26 million tables and their English contexts. In experiments, neural semantic parsers using TaBERT as feature representation layers achieve new best results on the challenging weakly-supervised semantic parsing benchmark WikiTableQuestions, while performing competitively on the text-to-SQL dataset Spider.

310 citations


Proceedings ArticleDOI
19 Feb 2020
TL;DR: CodeBERT as mentioned in this paper is a pre-trained model for natural language code search and code documentation generation with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators.
Abstract: We present CodeBERT, a bimodal pre-trained model for programming language (PL) and natural language (NL). CodeBERT learns general-purpose representations that support downstream NL-PL applications such as natural language code search, code documentation generation, etc. We develop CodeBERT with Transformer-based neural architecture, and train it with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators. This enables us to utilize both “bimodal” data of NL-PL pairs and “unimodal” data, where the former provides input tokens for model training while the latter helps to learn better generators. We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters. Results show that CodeBERT achieves state-of-the-art performance on both natural language code search and code documentation generation. Furthermore, to investigate what type of knowledge is learned in CodeBERT, we construct a dataset for NL-PL probing, and evaluate in a zero-shot setting where parameters of pre-trained models are fixed. Results show that CodeBERT performs better than previous pre-trained models on NL-PL probing.

307 citations


Posted Content
TL;DR: New pretrained contextualized representations of words and entities based on the bidirectional transformer, and an entity-aware self-attention mechanism that considers the types of tokens (words or entities) when computing attention scores are proposed.
Abstract: Entity representations are useful in natural language tasks involving entities. In this paper, we propose new pretrained contextualized representations of words and entities based on the bidirectional transformer. The proposed model treats words and entities in a given text as independent tokens, and outputs contextualized representations of them. Our model is trained using a new pretraining task based on the masked language model of BERT. The task involves predicting randomly masked words and entities in a large entity-annotated corpus retrieved from Wikipedia. We also propose an entity-aware self-attention mechanism that is an extension of the self-attention mechanism of the transformer, and considers the types of tokens (words or entities) when computing attention scores. The proposed model achieves impressive empirical performance on a wide range of entity-related tasks. In particular, it obtains state-of-the-art results on five well-known datasets: Open Entity (entity typing), TACRED (relation classification), CoNLL-2003 (named entity recognition), ReCoRD (cloze-style question answering), and SQuAD 1.1 (extractive question answering). Our source code and pretrained representations are available at this https URL.

288 citations
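The entity-aware self-attention described above can be sketched as scaled dot-product attention in which the query projection depends on whether the attending and attended tokens are words or entities (the paper uses separate query matrices for the four word/entity combinations). The NumPy rendering below is schematic and not the released implementation.

```python
import numpy as np

def entity_aware_attention(x, token_types, Wq, Wk, Wv):
    """Single-head attention where the query matrix is chosen per pair of
    token types: Wq[(type_i, type_j)] with types in {"word", "entity"}."""
    n, d = x.shape
    K, V = x @ Wk, x @ Wv
    scores = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            q_ij = x[i] @ Wq[(token_types[i], token_types[j])]
            scores[i, j] = q_ij @ K[j] / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy usage: three tokens, the last one an entity mention.
d = 8
rng = np.random.default_rng(0)
x = rng.normal(size=(3, d))
types = ["word", "word", "entity"]
Wq = {(a, b): rng.normal(size=(d, d)) for a in ("word", "entity") for b in ("word", "entity")}
Wk, Wv = rng.normal(size=(d, d)), rng.normal(size=(d, d))
print(entity_aware_attention(x, types, Wq, Wk, Wv).shape)   # (3, 8)
```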


Posted Content
TL;DR: Pixel-BERT, which aligns semantics at the pixel and text level, removes the limitation of task-specific visual representations for vision-and-language tasks, relieves the cost of bounding-box annotations, and overcomes the imbalance between visual semantic labels and language semantics.
Abstract: We propose Pixel-BERT to align image pixels with text by deep multi-modal transformers that jointly learn visual and language embedding in a unified end-to-end framework. We aim to build a more accurate and thorough connection between image pixels and language semantics directly from image and sentence pairs, instead of using region-based image features as most recent vision-and-language methods do. Our Pixel-BERT, which aligns semantics at the pixel and text level, removes the limitation of task-specific visual representations for vision and language tasks. It also relieves the cost of bounding box annotations and overcomes the imbalance between semantic labels in visual tasks and language semantics. To provide a better representation for down-stream tasks, we pre-train a universal end-to-end model with image and sentence pairs from the Visual Genome dataset and the MS-COCO dataset. We propose to use a random pixel sampling mechanism to enhance the robustness of visual representation and to apply the Masked Language Model and Image-Text Matching as pre-training tasks. Extensive experiments on downstream tasks with our pre-trained model show that our approach achieves state-of-the-art results in downstream tasks, including Visual Question Answering (VQA), image-text retrieval, and Natural Language for Visual Reasoning for Real (NLVR). In particular, we boost the performance of a single model on the VQA task by 2.17 points compared with the SOTA under fair comparison.

Journal ArticleDOI
Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, Pranav Khaitan
03 Apr 2020
TL;DR: This work introduces the Schema-Guided Dialogue (SGD) dataset, containing over 16k multi-domain conversations spanning 16 domains, and presents a schema-guided paradigm for task-oriented dialogue, in which predictions are made over a dynamic set of intents and slots provided as input.
Abstract: Virtual assistants such as Google Assistant, Alexa and Siri provide a conversational interface to a large number of services and APIs spanning multiple domains. Such systems need to support an ever-increasing number of services with possibly overlapping functionality. Furthermore, some of these services have little to no training data available. Existing public datasets for task-oriented dialogue do not sufficiently capture these challenges since they cover few domains and assume a single static ontology per domain. In this work, we introduce the Schema-Guided Dialogue (SGD) dataset, containing over 16k multi-domain conversations spanning 16 domains. Our dataset exceeds the existing task-oriented dialogue corpora in scale, while also highlighting the challenges associated with building large-scale virtual assistants. It provides a challenging testbed for a number of tasks including language understanding, slot filling, dialogue state tracking and response generation. Along the same lines, we present a schema-guided paradigm for task-oriented dialogue, in which predictions are made over a dynamic set of intents and slots, provided as input, using their natural language descriptions. This allows a single dialogue system to easily support a large number of services and facilitates simple integration of new services without requiring additional training data. Building upon the proposed paradigm, we release a model for dialogue state tracking capable of zero-shot generalization to new APIs, while remaining competitive in the regular setting.
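To make the schema-guided paradigm concrete, here is a small, hand-written example of the kind of schema a model conditions on. The service name, slots, and descriptions below are illustrative and not taken verbatim from the SGD release.

```python
# A hypothetical service schema in the spirit of the SGD dataset: the slot and
# intent *descriptions* are what the model conditions on, so previously unseen
# services can be supported without retraining.
restaurant_schema = {
    "service_name": "Restaurants_1",
    "description": "A service for finding and reserving restaurants.",
    "slots": [
        {"name": "city", "description": "City where the restaurant is located"},
        {"name": "party_size", "description": "Number of people for the reservation"},
        {"name": "time", "description": "Time of the reservation"},
    ],
    "intents": [
        {"name": "FindRestaurants", "description": "Search for restaurants by city"},
        {"name": "ReserveRestaurant", "description": "Book a table at a restaurant"},
    ],
}

def candidate_labels(schema):
    """A schema-guided model predicts over whatever slots and intents are
    supplied at input time rather than over a fixed ontology."""
    return ([s["description"] for s in schema["slots"]],
            [i["description"] for i in schema["intents"]])

print(candidate_labels(restaurant_schema))
```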

Journal ArticleDOI
TL;DR: The techniques that establish the foundations for argument mining are explored, a review of recent advances in argument mining techniques is provided, and the challenges faced in automatically extracting a deeper understanding of reasoning expressed in language in general are discussed.
Abstract: Argument mining is the automatic identification and extraction of the structure of inference and reasoning expressed as arguments presented in natural language. Understanding argumentative structur...

Proceedings ArticleDOI
10 Feb 2020
TL;DR: The authors fine-tuned pre-trained models to answer questions without access to any external context or knowledge, which scales with model size and performs competitively with open-domain systems that explicitly retrieve answers from an external knowledge source when answering questions.
Abstract: It has recently been observed that neural language models trained on unstructured text can implicitly store and retrieve knowledge using natural language queries. In this short paper, we measure the practical utility of this approach by fine-tuning pre-trained models to answer questions without access to any external context or knowledge. We show that this approach scales with model size and performs competitively with open-domain systems that explicitly retrieve answers from an external knowledge source when answering questions. To facilitate reproducibility and future work, we release our code and trained models.

Book
23 Jul 2020
TL;DR: A comprehensive treatment of the topic, ranging from introduction to neural networks, computation graphs, description of the currently dominant attentional sequence-to-sequence model, recent refinements, alternative architectures and challenges.
Abstract: Deep learning is revolutionizing how machine translation systems are built today. This book introduces the challenge of machine translation and evaluation, including historical, linguistic, and applied context, then develops the core deep learning methods used for natural language applications. Code examples in Python give readers a hands-on blueprint for understanding and implementing their own machine translation systems. The book also provides extensive coverage of machine learning tricks, issues involved in handling various forms of data, model enhancements, and current challenges and methods for analysis and visualization. Summaries of the current research in the field make this a state-of-the-art textbook for undergraduate and graduate classes, as well as an essential reference for researchers and developers interested in other applications of neural methods in the broader field of human language processing.

Journal ArticleDOI
TL;DR: A hierarchical LSTM with adaptive attention (hLSTMat) approach for image and video captioning that utilizes the spatial or temporal attention for selecting specific regions or frames to predict the related words, while the adaptive attention is for deciding whether to depend on the visual information or the language context information.
Abstract: Recent progress has been made in using attention-based encoder-decoder frameworks for image and video captioning. Most existing decoders apply the attention mechanism to every generated word, including both visual words (e.g., “gun” and “shooting”) and non-visual words (e.g., “the”, “a”). However, these non-visual words can be easily predicted using a natural language model without considering visual signals or attention. Imposing the attention mechanism on non-visual words could mislead and decrease the overall performance of visual captioning. Furthermore, the hierarchy of LSTMs enables more complex representation of visual data, capturing information at different scales. Considering these issues, we propose a hierarchical LSTM with adaptive attention (hLSTMat) approach for image and video captioning. Specifically, the proposed framework utilizes spatial or temporal attention for selecting specific regions or frames to predict the related words, while the adaptive attention decides whether to depend on the visual information or the language context information. Also, a hierarchical LSTM is designed to simultaneously consider both low-level visual information and high-level language context information to support caption generation. We design the hLSTMat model as a general framework, and we first instantiate it for the task of video captioning. Then, we further refine it and apply it to the image captioning task. To demonstrate the effectiveness of our proposed framework, we test our method on both video and image captioning tasks. Experimental results show that our approach achieves state-of-the-art performance for most of the evaluation metrics on both tasks. The effect of important components is also well exploited in the ablation study.

Proceedings ArticleDOI
14 Jun 2020
TL;DR: Action Learning From Realistic Environments and Directives (ALFRED) as mentioned in this paper is a benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks.
Abstract: We present ALFRED (Action Learning From Realistic Environments and Directives), a benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks. ALFRED includes long, compositional tasks with non-reversible state changes to shrink the gap between research benchmarks and real-world applications. ALFRED consists of expert demonstrations in interactive visual environments for 25k natural language directives. These directives contain both high-level goals like “Rinse off a mug and place it in the coffee maker.” and low-level language instructions like “Walk to the coffee maker on the right.” ALFRED tasks are more complex in terms of sequence length, action space, and language than existing vision- and-language task datasets. We show that a baseline model based on recent embodied vision-and-language tasks performs poorly on ALFRED, suggesting that there is significant room for developing innovative grounded visual language understanding models with this benchmark.

Book ChapterDOI
01 Jan 2020
TL;DR: This chapter presents the challenges of NLP, the progress made so far in this field, NLP applications, the components of NLP, and the grammar of the English language in the form a machine requires.
Abstract: The connected world holds an abundant volume of natural language text that carries a great deal of knowledge, but it is becoming increasingly difficult for a human to digest it and discover the knowledge/wisdom within it, especially within any given time limits. Automated NLP aims to do this job effectively and accurately, as a human does for a limited amount of text. This chapter presents the challenges of NLP, the progress made so far in this field, NLP applications, the components of NLP, and the grammar of the English language in the form a machine requires. In addition, it covers specific areas such as probabilistic parsing, ambiguities and their resolution, information extraction, discourse analysis, NL question-answering, commonsense interfaces, commonsense thinking and reasoning, causal-diversity, and various tools for NLP. Finally, a chapter summary and a set of relevant exercises are presented.

Journal ArticleDOI
TL;DR: It is argued that natural stimuli offer many advantages over simplified, controlled stimuli for studying how language is processed by the brain and the downsides of using natural language stimuli can be mitigated using modern statistical and computational techniques.
Abstract: Humans have a unique ability to produce and consume rich, complex, and varied language in order to communicate ideas to one another. Still, outside of natural reading, the most common methods for studying how our brains process speech or understand language use only isolated words or simple sentences. Recent studies have upset this status quo by employing complex natural stimuli and measuring how the brain responds to language as it is used. In this article we argue that natural stimuli offer many advantages over simplified, controlled stimuli for studying how language is processed by the brain. Furthermore, the downsides of using natural language stimuli can be mitigated using modern statistical and computational techniques.

Journal ArticleDOI
TL;DR: A technical review of available models and learning methods for multimodal intelligence, focusing on the combination of vision and natural language modalities, which has become an important topic in both the computer vision andnatural language processing research communities.
Abstract: Deep learning methods have revolutionized speech recognition, image recognition, and natural language processing since 2010. Each of these tasks involves a single modality in their input signals. However, many applications in the artificial intelligence field involve multiple modalities. Therefore, it is of broad interest to study the more difficult and complex problem of modeling and learning across multiple modalities. In this paper, we provide a technical review of available models and learning methods for multimodal intelligence. The main focus of this review is the combination of vision and natural language modalities, which has become an important topic in both the computer vision and natural language processing research communities. This review provides a comprehensive analysis of recent works on multimodal deep learning from three perspectives: learning multimodal representations, fusing multimodal signals at various levels, and multimodal applications. Regarding multimodal representation learning, we review the key concepts of embedding, which unify multimodal signals into a single vector space and thereby enable cross-modality signal processing. We also review the properties of many types of embeddings that are constructed and learned for general downstream tasks. Regarding multimodal fusion, this review focuses on special architectures for the integration of representations of unimodal signals for a particular task. Regarding applications, selected areas of broad interest in the current literature are covered, including image-to-text caption generation, text-to-image generation, and visual question answering. We believe that this review will facilitate future studies in the emerging field of multimodal intelligence for related communities.

Journal ArticleDOI
TL;DR: The authors identify three outcomes of deviation in the heritage grammar: avoidance of ambiguity, a resistance to irregularity, and a shrinking of structure, and highlight two key triggers for deviation from the relevant baseline: the quantity and quality of the input from which the heritage language is acquired, and the economy of online resources when operating in a less dominant language.
Abstract: With a growing interest in heritage languages from researchers of bilingualism and linguistic theory, the field of heritage-language studies has begun to build on its empirical foundations, moving toward a deeper understanding of the nature of language competence under unbalanced bilingualism. In furtherance of this trend, the current work synthesizes pertinent empirical observations and theoretical claims about vulnerable and robust areas of heritage language competence into early steps toward a model of heritage-language grammar. We highlight two key triggers for deviation from the relevant baseline: the quantity and quality of the input from which the heritage grammar is acquired, and the economy of online resources when operating in a less dominant language. In response to these triggers, we identify three outcomes of deviation in the heritage grammar: an avoidance of ambiguity, a resistance to irregularity, and a shrinking of structure. While we are still a ways away from a level of understanding that allows us to predict those aspects of heritage grammar that will be robust and those that will deviate from the relevant baselines, our hope is that the current work will spur the continued development of a predictive model of heritage language competence.

Proceedings ArticleDOI
14 Jun 2020
TL;DR: AuxRN is introduced, a framework with four self-supervised auxiliary reasoning tasks that exploit additional training signals derived from this semantic information, helping the agent to acquire knowledge of semantic representations in order to reason about its activities and build a thorough perception of environments.
Abstract: Vision-Language Navigation (VLN) is a task where an agent learns to navigate following a natural language instruction. The key to this task is to perceive both the visual scene and natural language sequentially. Conventional approaches fully exploit vision and language features in cross-modal grounding. However, the VLN task remains challenging, since previous works have implicitly neglected the rich semantic information contained in environments (such as navigation graphs or sub-trajectory semantics). In this paper, we introduce Auxiliary Reasoning Navigation (AuxRN), a framework with four self-supervised auxiliary reasoning tasks to exploit the additional training signals derived from this semantic information. The auxiliary tasks have four reasoning objectives: explaining the previous actions, evaluating the trajectory consistency, estimating the progress, and predicting the next direction. As a result, these additional training signals help the agent to acquire knowledge of semantic representations in order to reason about its activities and build a thorough perception of environments. Our experiments demonstrate that auxiliary reasoning tasks improve both the performance of the main task and the model generalizability by a large margin. We further demonstrate empirically that an agent trained with self-supervised auxiliary reasoning tasks substantially outperforms the previous state-of-the-art method, being the best existing approach on the standard benchmark.

Proceedings ArticleDOI
14 Jun 2020
TL;DR: A dataset of varied and complex robot tasks, described in natural language, in terms of objects visible in a large set of real images, and a novel Interactive Navigator-Pointer model is proposed that provides a strong baseline on the task.
Abstract: One of the long-term challenges of robotics is to enable robots to interact with humans in the visual world via natural language, as humans are visual animals that communicate through language. Overcoming this challenge requires the ability to perform a wide variety of complex tasks in response to multifarious instructions from humans. In the hope that it might drive progress towards more flexible and powerful human interactions with robots, we propose a dataset of varied and complex robot tasks, described in natural language, in terms of objects visible in a large set of real images. Given an instruction, success requires navigating through a previously-unseen environment to identify an object. This represents a practical challenge, but one that closely reflects one of the core visual problems in robotics. Several state-of-the-art vision-and-language navigation, and referring-expression models are tested to verify the difficulty of this new task, but none of them show promising results because there are many fundamental differences between our task and previous ones. A novel Interactive Navigator-Pointer model is also proposed that provides a strong baseline on the task. The proposed model especially achieves the best performance on the unseen test split, but still leaves substantial room for improvement compared to the human performance. Repository: https://github.com/YuankaiQi/REVERIE.

Proceedings ArticleDOI
01 Jul 2020
TL;DR: An information-theoretic operationalization of probing as estimating mutual information that contradicts received wisdom: one should always select the highest performing probe one can, even if it is more complex, since it will result in a tighter estimate, and thus reveal more of the linguistic information inherent in the representation.
Abstract: The success of neural networks on a diverse set of NLP tasks has led researchers to question how much these networks actually "know" about natural language. Probes are a natural way of assessing this. When probing, a researcher chooses a linguistic task and trains a supervised model to predict annotations in that linguistic task from the network's learned representations. If the probe does well, the researcher may conclude that the representations encode knowledge related to the task. A commonly held belief is that using simpler models as probes is better; the logic is that simpler models will identify linguistic structure, but not learn the task itself. We propose an information-theoretic operationalization of probing as estimating mutual information that contradicts this received wisdom: one should always select the highest performing probe one can, even if it is more complex, since it will result in a tighter estimate, and thus reveal more of the linguistic information inherent in the representation. The experimental portion of our paper focuses on empirically estimating the mutual information between a linguistic property and BERT, comparing these estimates to several baselines. We evaluate on a set of ten typologically diverse languages often underrepresented in NLP research, plus English, totalling eleven languages. Our implementation is available in https://github.com/rycolab/info-theoretic-probing.
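The operationalization described above can be made concrete: since I(R; Y) = H(Y) - H(Y|R) and the cross-entropy of any probe upper-bounds the conditional entropy H(Y|R), a better (possibly more complex) probe yields a larger, tighter lower bound on the mutual information. A small numeric sketch under these assumptions:

```python
import numpy as np

def entropy(label_counts):
    """Empirical entropy H(Y) in nats, computed from label counts."""
    p = np.asarray(label_counts, dtype=float)
    p /= p.sum()
    return float(-(p * np.log(p)).sum())

def mi_lower_bound(label_counts, probe_cross_entropy):
    """I(R; Y) >= H(Y) - CE(probe), because the probe's cross entropy
    upper-bounds the conditional entropy H(Y | R)."""
    return entropy(label_counts) - probe_cross_entropy

# Toy numbers: a 3-way tag distribution and two probes of different capacity.
counts = [500, 300, 200]
print(mi_lower_bound(counts, probe_cross_entropy=0.95))  # weaker probe, looser bound
print(mi_lower_bound(counts, probe_cross_entropy=0.60))  # stronger probe, tighter bound
```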

Proceedings ArticleDOI
01 Nov 2020
TL;DR: In experiments with Transformers and LSTMs, it is found that in-distribution accuracy on the COGS test set was near-perfect, but generalization accuracy was substantially lower, and the dataset showed high sensitivity to random seed.
Abstract: Natural language is characterized by compositionality: the meaning of a complex expression is constructed from the meanings of its constituent parts. To facilitate the evaluation of the compositional abilities of language processing architectures, we introduce COGS, a semantic parsing dataset based on a fragment of English. The evaluation portion of COGS contains multiple systematic gaps that can only be addressed by compositional generalization; these include new combinations of familiar syntactic structures, or new combinations of familiar words and familiar structures. In experiments with Transformers and LSTMs, we found that in-distribution accuracy on the COGS test set was near-perfect (96-99%), but generalization accuracy was substantially lower (16-35%) and showed high sensitivity to random seed (±6-8%). These findings indicate that contemporary standard NLP models are limited in their compositional generalization capacity, and position COGS as a good way to measure progress.

Proceedings ArticleDOI
27 Jun 2020
TL;DR: In this article, the authors present an open vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work, and show that such models outperform the state of the art on three distinct code corpora (Java, C, Python).
Abstract: Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale. In this paper, we address this issue by: 1) studying how various modelling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; 2) presenting an open vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work; and 3) showing that such models outperform the state of the art on three distinct code corpora (Java, C, Python). To our knowledge, these are the largest NLMs for code that have been reported. All datasets, code, and trained models used in this work are publicly available.
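One common route to the open vocabulary this paper argues for is subword segmentation (e.g., byte-pair encoding), so that rare identifiers decompose into known units instead of becoming out-of-vocabulary tokens. The merge table below is made up for illustration; the paper's released code is the reference for its actual vocabulary pipeline.

```python
def apply_bpe(word, merges):
    """Greedily apply learned merge operations to split an identifier into
    subword units, so unseen identifiers never fall out of vocabulary."""
    symbols = list(word)
    for a, b in merges:                       # merges are applied in learned order
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# Hypothetical merge table learned from a code corpus.
merges = [("g", "e"), ("ge", "t"), ("U", "s"), ("Us", "e"), ("Use", "r"),
          ("get", "User")]
print(apply_bpe("getUserName", merges))   # ['getUser', 'N', 'a', 'm', 'e']
```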

Proceedings ArticleDOI
Baolin Peng, Chenguang Zhu, Chunyuan Li, Xiujun Li, Jinchao Li, Michael Zeng, Jianfeng Gao
27 Feb 2020
TL;DR: FewshotWOZ is presented, the first NLG benchmark to simulate the few-shot learning setting in task-oriented dialog systems, and the proposed SC-GPT model significantly outperforms existing methods, measured by various automatic metrics and human evaluations.
Abstract: As a crucial component in task-oriented dialog systems, the Natural Language Generation (NLG) module converts a dialog act represented in a semantic form into a response in natural language. The success of traditional template-based or statistical models typically relies on heavily annotated data, which is infeasible for new domains. Therefore, it is pivotal for an NLG system to generalize well with limited labelled data in real applications. To this end, we present FewshotWOZ, the first NLG benchmark to simulate the few-shot learning setting in task-oriented dialog systems. Further, we develop the SC-GPT model. It is pre-trained on a large set of annotated NLG corpus to acquire the controllable generation ability, and fine-tuned with only a few domain-specific labels to adapt to new domains. Experiments on FewshotWOZ and the large Multi-Domain-WOZ datasets show that the proposed SC-GPT significantly outperforms existing methods, measured by various automatic metrics and human evaluations.
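The NLG input in this setting is a dialog act, which SC-GPT-style models consume as a linearized string. Below is one plausible linearization, shown only as a sketch; the exact format used in the released FewshotWOZ code may differ.

```python
def linearize_dialog_act(intent, slots):
    """Flatten a semantic dialog act into a string that a language-model-based
    NLG module can condition on, e.g. 'inform ( name = Blue Spice ; area = centre )'."""
    slot_str = " ; ".join(f"{k} = {v}" for k, v in slots.items())
    return f"{intent} ( {slot_str} )"

act = linearize_dialog_act("inform", {"name": "Blue Spice", "area": "centre"})
print(act)
# A fine-tuned model would then generate a response such as:
# "Blue Spice is a nice place in the centre of town."
```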

Journal ArticleDOI
TL;DR: The authors introduce a Question Decomposition Meaning Representation (QDMR) for questions, which constitutes the ordered list of steps, expressed through natural language, that are necessary for answering a question.
Abstract: Understanding natural language questions entails the ability to break down a question into the requisite steps for computing its answer. In this work, we introduce a Question Decomposition Meaning Representation (QDMR) for questions. QDMR constitutes the ordered list of steps, expressed through natural language, that are necessary for answering a question. We develop a crowdsourcing pipeline, showing that quality QDMRs can be annotated at scale, and release the Break dataset, containing over 83K pairs of questions and their QDMRs. We demonstrate the utility of QDMR by showing that (a) it can be used to improve open-domain question answering on the HotpotQA dataset, (b) it can be deterministically converted to a pseudo-SQL formal language, which can alleviate annotation in semantic parsing applications. Last, we use Break to train a sequence-to-sequence model with copying that parses questions into QDMR structures, and show that it substantially outperforms several natural baselines.
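To make the representation concrete, here is an illustrative, hand-written QDMR-style decomposition (not taken from the Break dataset): each step is a natural-language operation that may refer back to earlier steps via #k references.

```python
import re

question = "Which European country has the most airports?"
qdmr_steps = [
    "return countries in Europe",
    "return airports of #1",
    "return number of #2 for each #1",
    "return #1 where #3 is highest",
]

def step_dependencies(steps):
    """Which earlier steps does each step refer to (via #k references)?"""
    return {i + 1: sorted({int(m) for m in re.findall(r"#(\d+)", s)})
            for i, s in enumerate(steps)}

print(step_dependencies(qdmr_steps))
# {1: [], 2: [1], 3: [1, 2], 4: [1, 3]}
```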

Proceedings ArticleDOI
13 Jul 2020
TL;DR: This paper presents an approach that uses a graph-based neural architecture that better matches the default structure of the AST to generate these summaries, and shows improvement over four baseline techniques.
Abstract: Automatic source code summarization is the task of generating natural language descriptions for source code. Automatic code summarization is a rapidly expanding research area, especially as the community has taken greater advantage of advances in neural network and AI technologies. In general, source code summarization techniques use the source code as input and output a natural language description. Yet a strong consensus is developing that using structural information as input leads to improved performance. The first approaches to use structural information flattened the AST into a sequence. Recently, more complex approaches based on random AST paths or graph neural networks have improved on the models using flattened ASTs. However, the literature does not yet describe using a graph neural network together with the source code sequence as separate inputs to a model. Therefore, in this paper, we present an approach that uses a graph-based neural architecture that better matches the default structure of the AST to generate these summaries. We evaluate our technique using a data set of 2.1 million Java method-comment pairs and show improvement over four baseline techniques, two from the software engineering literature, and two from the machine learning literature.

Posted Content
TL;DR: This work trains transformers to reason (or emulate reasoning) over natural language sentences using synthetically generated data, thus bypassing a formal representation and suggesting a new role for transformers, namely as limited "soft theorem provers" operating over explicit theories in language.
Abstract: Beginning with McCarthy's Advice Taker (1959), AI has pursued the goal of providing a system with explicit, general knowledge and having the system reason over that knowledge. However, expressing the knowledge in a formal (logical or probabilistic) representation has been a major obstacle to this research. This paper investigates a modern approach to this problem where the facts and rules are provided as natural language sentences, thus bypassing a formal representation. We train transformers to reason (or emulate reasoning) over these sentences using synthetically generated data. Our models, which we call RuleTakers, provide the first empirical demonstration that this kind of soft reasoning over language is learnable, can achieve high (99%) accuracy, and generalizes to test data requiring substantially deeper chaining than seen during training (95%+ scores). We also demonstrate that the models transfer well to two hand-authored rulebases, and to rulebases paraphrased into more natural language. These findings are significant as they suggest a new role for transformers, namely as limited "soft theorem provers" operating over explicit theories in language. This in turn suggests new possibilities for explainability, correctability, and counterfactual reasoning in question-answering.