Top 25 papers published by Douwe Kiela from Facebook in 2020

Posted Content•

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

[...]

Patrick S. H. Lewis¹, Ethan Perez², Aleksandra Piktus¹, Fabio Petroni¹, Vladimir Karpukhin¹, Naman Goyal¹, Heinrich Küttler¹, Michael Lewis¹, Wen-tau Yih¹, Tim Rocktäschel¹, Sebastian Riedel¹, Douwe Kiela¹ - Show less +8 more•Institutions (2)

Facebook¹, New York University²

22 May 2020-arXiv: Computation and Language

TL;DR: A general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) -- models which combine pre-trained parametric and non-parametric memory for language generation, and finds that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.

...read moreread less

Abstract: Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. Pre-trained models with a differentiable access mechanism to explicit non-parametric memory can overcome this issue, but have so far been only investigated for extractive downstream tasks. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) -- models which combine pre-trained parametric and non-parametric memory for language generation. We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. We compare two RAG formulations, one which conditions on the same retrieved passages across the whole generated sequence, the other can use different passages per token. We fine-tune and evaluate our models on a wide range of knowledge-intensive NLP tasks and set the state-of-the-art on three open domain QA tasks, outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures. For language generation tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.

...read moreread less

632 citations

Proceedings Article•DOI•

Adversarial NLI: A New Benchmark for Natural Language Understanding

[...]

Yixin Nie¹, Adina Williams², Emily Dinan², Mohit Bansal¹, Jason Weston², Douwe Kiela² - Show less +2 more•Institutions (2)

University of North Carolina at Chapel Hill¹, Facebook²

01 Jul 2020

TL;DR: This work introduces a new large-scale NLI benchmark dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure, and shows that non-expert annotators are successful at finding their weaknesses.

...read moreread less

Abstract: We introduce a new large-scale NLI benchmark dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure. We show that training models on this new dataset leads to state-of-the-art performance on a variety of popular NLI benchmarks, while posing a more difficult challenge with its new test set. Our analysis sheds light on the shortcomings of current state-of-the-art models, and shows that non-expert annotators are successful at finding their weaknesses. The data collection method can be applied in a never-ending learning scenario, becoming a moving target for NLU, rather than a static benchmark that will quickly saturate.

...read moreread less

606 citations

Proceedings Article•

The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes

[...]

Douwe Kiela¹, Hamed Firooz¹, Aravind Mohan¹, Vedanuj Goswami¹, Amanpreet Singh¹, Pratik Ringshia¹, Davide Testuggine¹ - Show less +3 more•Institutions (1)

Facebook¹

10 May 2020

TL;DR: This work proposes a new challenge set for multimodal classification, focusing on detecting hate speech in multi-modal memes, constructed such that unimodal models struggle and only multimodAL models can succeed.

...read moreread less

Abstract: This work proposes a new challenge set for multimodal classification, focusing on detecting hate speech in multimodal memes. It is constructed such that unimodal models struggle and only multimodal models can succeed: difficult examples ("benign confounders") are added to the dataset to make it hard to rely on unimodal signals. The task requires subtle reasoning, yet is straightforward to evaluate as a binary classification problem. We provide baseline performance numbers for unimodal models, as well as for multimodal models with various degrees of sophistication. We find that state-of-the-art methods perform poorly compared to humans (64.73% vs. 84.7% accuracy), illustrating the difficulty of the task and highlighting the challenge that this important problem poses to the community.

...read moreread less

196 citations

Proceedings Article•DOI•

Queens are Powerful too: Mitigating Gender Bias in Dialogue Generation

[...]

Emily Dinan¹, Angela Fan¹, Adina Williams¹, Jack Urbanek¹, Douwe Kiela¹, Jason Weston¹ - Show less +2 more•Institutions (1)

Facebook¹

01 Nov 2020

TL;DR: This work measures gender bias in dialogue data, and examines how this bias is actually amplified in subsequent generative chit-chat dialogue models, and considers three techniques to mitigate gender bias: counterfactual data augmentation, targeted data collection, and bias controlled training.

...read moreread less

Abstract: Social biases present in data are often directly reflected in the predictions of models trained on that data. We analyze gender bias in dialogue data, and examine how this bias is not only replicated, but is also amplified in subsequent generative chit-chat dialogue models. We measure gender bias in six existing dialogue datasets before selecting the most biased one, the multi-player text-based fantasy adventure dataset LIGHT, as a testbed for bias mitigation techniques. We consider three techniques to mitigate gender bias: counterfactual data augmentation, targeted data collection, and bias controlled training. We show that our proposed techniques mitigate gender bias by balancing the genderedness of generated dialogue utterances, and find that they are particularly effective in combination. We evaluate model performance with a variety of quantitative methods---including the quantity of gendered words, a dialogue safety classifier, and human assessments---all of which show that our models generate less gendered, but equally engaging chit-chat responses.

...read moreread less

144 citations

Posted Content•

Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval

[...]

Wenhan Xiong, Xiang Lorraine Li, Srini Iyer, Jingfei Du, Patrick S. H. Lewis, William Yang Wang, Yashar Mehdad, Wen-tau Yih, Sebastian Riedel, Douwe Kiela, Barlas Oguz - Show less +7 more

27 Sep 2020-arXiv: Computation and Language

TL;DR: This work proposes a simple and efficient multi-hop dense retrieval approach for answering complex open-domain questions, which achieves state-of-the-art performance on twoMulti-hop datasets, HotpotQA and multi-evidence FEVER, and can be applied to any unstructured text corpus.

...read moreread less

Abstract: We propose a simple and efficient multi-hop dense retrieval approach for answering complex open-domain questions, which achieves state-of-the-art performance on two multi-hop datasets, HotpotQA and multi-evidence FEVER. Contrary to previous work, our method does not require access to any corpus-specific information, such as inter-document hyperlinks or human-annotated entity markers, and can be applied to any unstructured text corpus. Our system also yields a much better efficiency-accuracy trade-off, matching the best published accuracy on HotpotQA while being 10 times faster at inference time.

...read moreread less

86 citations

Posted Content•

Unsupervised Question Decomposition for Question Answering

[...]

Ethan Perez¹, Patrick S. H. Lewis², Wen-tau Yih², Kyunghyun Cho¹, Douwe Kiela² - Show less +1 more•Institutions (2)

New York University¹, Facebook²

22 Feb 2020-arXiv: Computation and Language

TL;DR: One-to-N Unsupervised Sequence Transduction (ONUS) as mentioned in this paper decomposes hard questions into simpler sub-questions that existing QA systems are capable of answering.

...read moreread less

Abstract: We aim to improve question answering (QA) by decomposing hard questions into simpler sub-questions that existing QA systems are capable of answering. Since labeling questions with decompositions is cumbersome, we take an unsupervised approach to produce sub-questions, also enabling us to leverage millions of questions from the internet. Specifically, we propose an algorithm for One-to-N Unsupervised Sequence transduction (ONUS) that learns to map one hard, multi-hop question to many simpler, single-hop sub-questions. We answer sub-questions with an off-the-shelf QA model and give the resulting answers to a recomposition model that combines them into a final answer. We show large QA improvements on HotpotQA over a strong baseline on the original, out-of-domain, and multi-hop dev sets. ONUS automatically learns to decompose different kinds of questions, while matching the utility of supervised and heuristic decomposition methods for QA and exceeding those methods in fluency. Qualitatively, we find that using sub-questions is promising for shedding light on why a QA system makes a prediction.

...read moreread less

82 citations

Proceedings Article•

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

[...]

Patrick S. H. Lewis¹, Ethan Perez², Aleksandra Piktus¹, Fabio Petroni¹, Vladimir Karpukhin¹, Naman Goyal¹, Heinrich Küttler¹, Michael Lewis¹, Wen-tau Yih¹, Tim Rocktäschel¹, Sebastian Riedel¹, Douwe Kiela¹ - Show less +8 more•Institutions (2)

Facebook¹, New York University²

22 May 2020

TL;DR: This paper proposed a retrieval-augmented generation (RAG) model, which combines pre-trained parametric and non-parametric memory for language generation and achieved state-of-the-art results on knowledge-intensive NLP tasks.

...read moreread less

Abstract: Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. Pre-trained models with a differentiable access mechanism to explicit non-parametric memory can overcome this issue, but have so far been only investigated for extractive downstream tasks. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) -- models which combine pre-trained parametric and non-parametric memory for language generation. We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. We compare two RAG formulations, one which conditions on the same retrieved passages across the whole generated sequence, the other can use different passages per token. We fine-tune and evaluate our models on a wide range of knowledge-intensive NLP tasks and set the state-of-the-art on three open domain QA tasks, outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures. For language generation tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.

...read moreread less

74 citations

Posted Content•

The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes

[...]

Douwe Kiela¹, Hamed Firooz¹, Aravind Mohan¹, Vedanuj Goswami¹, Amanpreet Singh¹, Pratik Ringshia¹, Davide Testuggine¹ - Show less +3 more•Institutions (1)

Facebook¹

10 May 2020-arXiv: Artificial Intelligence

TL;DR: The authors proposed a new challenge set for multimodal classification, focusing on detecting hate speech in multi-modal memes, where difficult examples are added to the dataset to make it hard to rely on unimodal signals.

...read moreread less

Abstract: This work proposes a new challenge set for multimodal classification, focusing on detecting hate speech in multimodal memes. It is constructed such that unimodal models struggle and only multimodal models can succeed: difficult examples ("benign confounders") are added to the dataset to make it hard to rely on unimodal signals. The task requires subtle reasoning, yet is straightforward to evaluate as a binary classification problem. We provide baseline performance numbers for unimodal models, as well as for multimodal models with various degrees of sophistication. We find that state-of-the-art methods perform poorly compared to humans (64.73% vs. 84.7% accuracy), illustrating the difficulty of the task and highlighting the challenge that this important problem poses to the community.

...read moreread less

71 citations

Proceedings Article•DOI•

Unsupervised Question Decomposition for Question Answering

[...]

Ethan Perez¹, Patrick S. H. Lewis², Wen-tau Yih², Kyunghyun Cho¹, Douwe Kiela² - Show less +1 more•Institutions (2)

New York University¹, Facebook²

22 Feb 2020

TL;DR: An algorithm for One-to-N Unsupervised Sequence transduction (ONUS) that learns to map one hard, multi-hop question to many simpler, single-hop sub-questions, which is promising for shedding light on why a QA system makes a prediction.

...read moreread less

Abstract: We aim to improve question answering (QA) by decomposing hard questions into simpler sub-questions that existing QA systems are capable of answering. Since labeling questions with decompositions is cumbersome, we take an unsupervised approach to produce sub-questions, also enabling us to leverage millions of questions from the internet. Specifically, we propose an algorithm for One-to-N Unsupervised Sequence transduction (ONUS) that learns to map one hard, multi-hop question to many simpler, single-hop sub-questions. We answer sub-questions with an off-the-shelf QA model and give the resulting answers to a recomposition model that combines them into a final answer. We show large QA improvements on HotpotQA over a strong baseline on the original, out-of-domain, and multi-hop dev sets. ONUS automatically learns to decompose different kinds of questions, while matching the utility of supervised and heuristic decomposition methods for QA and exceeding those methods in fluency. Qualitatively, we find that using sub-questions is promising for shedding light on why a QA system makes a prediction.

...read moreread less

62 citations

Proceedings Article•DOI•

Multi-Dimensional Gender Bias Classification

[...]

Emily Dinan¹, Angela Fan¹, Ledell Wu, Jason Weston¹, Douwe Kiela¹, Adina Williams¹ - Show less +2 more•Institutions (1)

Facebook¹

01 Nov 2020

TL;DR: The authors proposed a fine-grained framework that decomposes gender bias in text along several pragmatic and semantic dimensions: bias from the gender of the person being spoken about, bias from gender of a person speaking to the speaker, and bias from a speaker's gender.

...read moreread less

Abstract: Machine learning models are trained to find patterns in data. NLP models can inadvertently learn socially undesirable patterns when training on gender biased text. In this work, we propose a novel, general framework that decomposes gender bias in text along several pragmatic and semantic dimensions: bias from the gender of the person being spoken about, bias from the gender of the person being spoken to, and bias from the gender of the speaker. Using this fine-grained framework, we automatically annotate eight large scale datasets with gender information. In addition, we collect a new, crowdsourced evaluation benchmark. Distinguishing between gender bias along multiple dimensions enables us to train better and more fine-grained gender bias classifiers. We show our classifiers are valuable for a variety of applications, like controlling for gender bias in generative models, detecting gender bias in arbitrary text, and classifying text as offensive based on its genderedness.

...read moreread less

48 citations

Posted Content•

ANLIzing the Adversarial Natural Language Inference Dataset.

[...]

Adina Williams, Tristan Thrush, Douwe Kiela¹•Institutions (1)

Facebook¹

24 Oct 2020-arXiv: Computation and Language

TL;DR: An in-depth error analysis of Adversarial NLI (ANLI), a recently introduced large-scale human-and-model-in-the-loop natural language inference dataset collected over multiple rounds, and a fine-grained annotation scheme of the different aspects of inference that are responsible for the gold classification labels is proposed.

...read moreread less

Abstract: We perform an in-depth error analysis of Adversarial NLI (ANLI), a recently introduced large-scale human-and-model-in-the-loop natural language inference dataset collected over multiple rounds. We propose a fine-grained annotation scheme of the different aspects of inference that are responsible for the gold classification labels, and use it to hand-code all three of the ANLI development sets. We use these annotations to answer a variety of interesting questions: which inference types are most common, which models have the highest performance on each reasoning type, and which types are the most challenging for state of-the-art models? We hope that our annotations will enable more fine-grained evaluation of models trained on ANLI, provide us with a deeper understanding of where models fail and succeed, and help us determine how to train better models in future.

...read moreread less

Proceedings Article•

On the interaction between supervision and self-play in emergent communication

[...]

Ryan Lowe¹, Abhinav Gupta², Jakob Foerster¹, Douwe Kiela¹, Joelle Pineau³ - Show less +1 more•Institutions (3)

Facebook¹, New York University², McGill University³

30 Apr 2020

TL;DR: In this paper, the authors investigate the relationship between two learning signals with the ultimate goal of improving sample efficiency: imitating human language data via supervised learning, and maximizing reward in a simulated multi-agent environment via self-play.

...read moreread less

Abstract: A promising approach for teaching artificial agents to use natural language involves using human-in-the-loop training. However, recent work suggests that current machine learning methods are too data inefficient to be trained in this way from scratch. In this paper, we investigate the relationship between two categories of learning signals with the ultimate goal of improving sample efficiency: imitating human language data via supervised learning, and maximizing reward in a simulated multi-agent environment via self-play (as done in emergent communication), and introduce the term \textit{supervised self-play (S2P)} for algorithms using both of these signals. We find that first training agents via supervised learning on human data followed by self-play outperforms the converse, suggesting that it is not beneficial to emerge languages from scratch. We then empirically investigate various S2P schedules that begin with supervised learning in two environments: a Lewis signaling game with symbolic inputs, and an image-based referential game with natural language descriptions. Lastly, we introduce population based approaches to S2P, which further improves the performance over single-agent methods.

...read moreread less

Posted Content•

On the interaction between supervision and self-play in emergent communication.

[...]

Ryan Lowe¹, Abhinav Gupta², Jakob Foerster¹, Douwe Kiela¹, Joelle Pineau³ - Show less +1 more•Institutions (3)

Facebook¹, New York University², McGill University³

04 Feb 2020-arXiv: Computation and Language

TL;DR: It is found that first training agents via supervised learning on human data followed by self-play outperforms the converse, suggesting that it is not beneficial to emerge languages from scratch.

...read moreread less

Abstract: A promising approach for teaching artificial agents to use natural language involves using human-in-the-loop training. However, recent work suggests that current machine learning methods are too data inefficient to be trained in this way from scratch. In this paper, we investigate the relationship between two categories of learning signals with the ultimate goal of improving sample efficiency: imitating human language data via supervised learning, and maximizing reward in a simulated multi-agent environment via self-play (as done in emergent communication), and introduce the term supervised self-play (S2P) for algorithms using both of these signals. We find that first training agents via supervised learning on human data followed by self-play outperforms the converse, suggesting that it is not beneficial to emerge languages from scratch. We then empirically investigate various S2P schedules that begin with supervised learning in two environments: a Lewis signaling game with symbolic inputs, and an image-based referential game with natural language descriptions. Lastly, we introduce population based approaches to S2P, which further improves the performance over single-agent methods.

...read moreread less

Posted Content•

Learning Optimal Representations with the Decodable Information Bottleneck

[...]

Yann Dubois¹, Douwe Kiela¹, David J. Schwab¹, Ramakrishna Vedantam¹•Institutions (1)

Facebook¹

27 Sep 2020-arXiv: Learning

TL;DR: This work proposes the Decodable Information Bottleneck (DIB), a framework that considers information retention and compression from the perspective of the desired predictive family and gives rise to representations that are optimal in terms of expected test performance and can be estimated with guarantees.

...read moreread less

Abstract: We address the question of characterizing and finding optimal representations for supervised learning. Traditionally, this question has been tackled using the Information Bottleneck, which compresses the inputs while retaining information about the targets, in a decoder-agnostic fashion. In machine learning, however, our goal is not compression but rather generalization, which is intimately linked to the predictive family or decoder of interest (e.g. linear classifier). We propose the Decodable Information Bottleneck (DIB) that considers information retention and compression from the perspective of the desired predictive family. As a result, DIB gives rise to representations that are optimal in terms of expected test performance and can be estimated with guarantees. Empirically, we show that the framework can be used to enforce a small generalization gap on downstream classifiers and to predict the generalization ability of neural networks.

...read moreread less

Posted Content•

I love your chain mail! Making knights smile in a fantasy game world: Open-domain goal-oriented dialogue agents

[...]

Shrimai Prabhumoye¹, Margaret Li, Jack Urbanek, Emily Dinan, Douwe Kiela, Jason Weston, Arthur Szlam - Show less +3 more•Institutions (1)

Carnegie Mellon University¹

07 Feb 2020-arXiv: Artificial Intelligence

TL;DR: The authors train a goal-oriented model with reinforcement learning against an imitation-learned ''chit-chat'' model with two approaches: the policy either learns to pick a topic or learn to pick an utterance given the top-K utterances from the chit-CHAT model.

...read moreread less

Abstract: Dialogue research tends to distinguish between chit-chat and goal-oriented tasks. While the former is arguably more naturalistic and has a wider use of language, the latter has clearer metrics and a straightforward learning signal. Humans effortlessly combine the two, for example engaging in chit-chat with the goal of exchanging information or eliciting a specific response. Here, we bridge the divide between these two domains in the setting of a rich multi-player text-based fantasy environment where agents and humans engage in both actions and dialogue. Specifically, we train a goal-oriented model with reinforcement learning against an imitation-learned ``chit-chat'' model with two approaches: the policy either learns to pick a topic or learns to pick an utterance given the top-K utterances from the chit-chat model. We show that both models outperform an inverse model baseline and can converse naturally with their dialogue partner in order to achieve goals.

...read moreread less

Posted Content•

Multi-Dimensional Gender Bias Classification

[...]

Emily Dinan¹, Angela Fan¹, Ledell Wu, Jason Weston¹, Douwe Kiela¹, Adina Williams¹ - Show less +2 more•Institutions (1)

Facebook¹

01 May 2020-arXiv: Computation and Language

TL;DR: A general framework that decomposes gender bias in text along several pragmatic and semantic dimensions is proposed, which proves valuable for a variety of important applications, such as controlling for gender biases in generative models, detecting Gender bias in arbitrary text, and shed light on offensive language in terms of genderedness.

...read moreread less

Abstract: Machine learning models are trained to find patterns in data. NLP models can inadvertently learn socially undesirable patterns when training on gender biased text. In this work, we propose a general framework that decomposes gender bias in text along several pragmatic and semantic dimensions: bias from the gender of the person being spoken about, bias from the gender of the person being spoken to, and bias from the gender of the speaker. Using this fine-grained framework, we automatically annotate eight large scale datasets with gender information. In addition, we collect a novel, crowdsourced evaluation benchmark of utterance-level gender rewrites. Distinguishing between gender bias along multiple dimensions is important, as it enables us to train finer-grained gender bias classifiers. We show our classifiers prove valuable for a variety of important applications, such as controlling for gender bias in generative models, detecting gender bias in arbitrary text, and shed light on offensive language in terms of genderedness.

...read moreread less

Journal Article•DOI•

Generating Interactive Worlds with Text.

[...]

Angela Fan¹, Jack Urbanek¹, Pratik Ringshia¹, Emily Dinan¹, Emma Qian¹, Siddharth Karamcheti¹, Shrimai Prabhumoye¹, Douwe Kiela¹, Tim Rocktäschel¹, Arthur Szlam¹, Jason Weston¹ - Show less +7 more•Institutions (1)

Facebook¹

03 Apr 2020

TL;DR: In this article, a neural network-based approach is proposed to compositionally arrange locations, characters, and objects into a coherent whole in a text-adventure game environment LIGHT.

...read moreread less

Abstract: Procedurally generating cohesive and interesting game environments is challenging and time-consuming. In order for the relationships between the game elements to be natural, common-sense has to be encoded into arrangement of the elements. In this work, we investigate a machine learning approach for world creation using content from the multi-player text adventure game environment LIGHT (Urbanek et al. 2019). We introduce neural network based models to compositionally arrange locations, characters, and objects into a coherent whole. In addition to creating worlds based on existing elements, our models can generate new game content. Humans can also leverage our models to interactively aid in worldbuilding. We show that the game environments created with our approach are cohesive, diverse, and preferred by human evaluators compared to other machine learning based world construction algorithms.

...read moreread less

Proceedings Article•

Learning Optimal Representations with the Decodable Information Bottleneck

[...]

Yann Dubois¹, Douwe Kiela¹, David J. Schwab², Ramakrishna Vedantam³•Institutions (3)

Facebook¹, The Graduate Center, CUNY², Georgia Institute of Technology³

01 Jan 2020

TL;DR: In this article, the authors propose the Decodable Information Bottleneck (DIB), which considers information retention and compression from the perspective of the desired predictive family and gives rise to representations that are optimal in terms of expected test performance.

...read moreread less

Abstract: We address the question of characterizing and finding optimal representations for supervised learning. Traditionally, this question has been tackled using the Information Bottleneck, which compresses the inputs while retaining information about the targets, in a decoder-agnostic fashion. In machine learning, however, our goal is not compression but rather generalization, which is intimately linked to the predictive family or decoder of interest (e.g. linear classifier). We propose the Decodable Information Bottleneck (DIB) that considers information retention and compression from the perspective of the desired predictive family. As a result, DIB gives rise to representations that are optimal in terms of expected test performance and can be estimated with guarantees. Empirically, we show that the framework can be used to enforce a small generalization gap on downstream classifiers and to predict the generalization ability of neural networks.

...read moreread less

Posted Content•

DynaSent: A Dynamic Benchmark for Sentiment Analysis

[...]

Christopher Potts¹, Zhengxuan Wu¹, Atticus Geiger¹, Douwe Kiela²•Institutions (2)

Stanford University¹, Facebook²

30 Dec 2020-arXiv: Computation and Language

TL;DR: DynaSent as discussed by the authors is a new English-language benchmark task for ternary (positive/negative/neutral) sentiment analysis, which combines naturally occurring sentences with sentences created using the open-source Dynabench Platform, which facilities human and model-in-the-loop dataset creation.

...read moreread less

Abstract: We introduce DynaSent ('Dynamic Sentiment'), a new English-language benchmark task for ternary (positive/negative/neutral) sentiment analysis. DynaSent combines naturally occurring sentences with sentences created using the open-source Dynabench Platform, which facilities human-and-model-in-the-loop dataset creation. DynaSent has a total of 121,634 sentences, each validated by five crowdworkers, and its development and test splits are designed to produce chance performance for even the best models we have been able to develop; when future models solve this task, we will use them to create DynaSent version 2, continuing the dynamic evolution of this benchmark. Here, we report on the dataset creation effort, focusing on the steps we took to increase quality and reduce artifacts. We also present evidence that DynaSent's Neutral category is more coherent than the comparable category in other benchmarks, and we motivate training models from scratch for each round over successive fine-tuning.

...read moreread less

Posted Content•

To what extent do human explanations of model behavior align with actual model behavior

[...]

Grusha Prasad, Yixin Nie, Mohit Bansal, Robin Jia, Douwe Kiela, Adina Williams - Show less +2 more

24 Dec 2020-arXiv: Computation and Language

TL;DR: The authors investigated the extent to which human-generated explanations of models' inference decisions align with how models actually make these decisions using Natural Language Inference (NLI) as a case study.

...read moreread less

Abstract: Given the increasingly prominent role NLP models (will) play in our lives, it is important to evaluate models on their alignment with human expectations of how models behave. Using Natural Language Inference (NLI) as a case study, we investigated the extent to which human-generated explanations of models' inference decisions align with how models actually make these decisions. More specifically, we defined two alignment metrics that quantify how well natural language human explanations align with model sensitivity to input words, as measured by integrated gradients. Then, we evaluated six different transformer models (the base and large versions of BERT, RoBERTa and ELECTRA), and found that the BERT-base model has the highest alignment with human-generated explanations, for both alignment metrics. Additionally, the base versions of the models we surveyed tended to have higher alignment with human-generated explanations than their larger counterparts, suggesting that increasing the number model parameters could result in worse alignment with human explanations. Finally, we find that a model's alignment with human explanations is not predicted by the model's accuracy on NLI, suggesting that accuracy and alignment are orthogonal, and both are important ways to evaluate models.

...read moreread less

Posted Content•

Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection

[...]

Bertie Vidgen¹, Tristan Thrush², Zeerak Waseem³, Douwe Kiela⁴•Institutions (4)

The Turing Institute¹, Université catholique de Louvain², University of Sheffield³, Facebook⁴

31 Dec 2020-arXiv: Computation and Language

TL;DR: In this paper, a human-and-model-in-the-loop process for dynamically generating datasets and training better performing and more robust hate detection models is presented, which includes ~15,000 challenging perturbations and each hateful entry has fine-grained labels for the type and target of hate.

...read moreread less

Abstract: We present a human-and-model-in-the-loop process for dynamically generating datasets and training better performing and more robust hate detection models. We provide a new dataset of ~40,000 entries, generated and labelled by trained annotators over four rounds of dynamic data creation. It includes ~15,000 challenging perturbations and each hateful entry has fine-grained labels for the type and target of hate. Hateful entries make up 54% of the dataset, which is substantially higher than comparable datasets. We show that model performance is substantially improved using this approach. Models trained on later rounds of data collection perform better on test sets and are harder for annotators to trick. They also perform better on HateCheck, a suite of functional tests for online hate detection. We provide the code, dataset and annotation guidelines for other researchers to use. Accepted at ACL 2021.

...read moreread less

Posted Content•

Reservoir Transformers.

[...]

Sheng Shen¹, Alexei Baevski², Ari S. Morcos², Kurt Keutzer¹, Michael Auli², Douwe Kiela² - Show less +2 more•Institutions (2)

University of California, Berkeley¹, Facebook²

30 Dec 2020-arXiv: Computation and Language

TL;DR: Inspired by old and well-established ideas in machine learning, a variety of non-linear “reservoir” layers interspersed with regular transformer layers are explored, and improvements in wall-clock compute time until convergence are shown.

...read moreread less

Abstract: We demonstrate that transformers obtain impressive performance even when some of the layers are randomly initialized and never updated. Inspired by old and well-established ideas in machine learning, we explore a variety of non-linear "reservoir" layers interspersed with regular transformer layers, and show improvements in wall-clock compute time until convergence, as well as overall performance, on various machine translation and (masked) language modelling tasks.

...read moreread less

Posted Content•

I like fish, especially dolphins: Addressing Contradictions in Dialogue Modeling

[...]

Yixin Nie¹, Mary Williamson², Mohit Bansal¹, Douwe Kiela², Jason Weston² - Show less +1 more•Institutions (2)

University of North Carolina at Chapel Hill¹, Facebook²

24 Dec 2020-arXiv: Computation and Language

TL;DR: This article proposed a structured utterance-based approach of using pre-trained Transformer models for contradiction detection with the typical unstructured approach and showed that their best contradiction detection model correlates well with human judgments and further provide evidence for its usage in automatically evaluating and improving the consistency of state-of-theart generative chatbots.

...read moreread less

Abstract: To quantify how well natural language understanding models can capture consistency in a general conversation, we introduce the DialoguE COntradiction DEtection task (DECODE) and a new conversational dataset containing both human-human and human-bot contradictory dialogues. We then compare a structured utterance-based approach of using pre-trained Transformer models for contradiction detection with the typical unstructured approach. Results reveal that: (i) our newly collected dataset is notably more effective at providing supervision for the dialogue contradiction detection task than existing NLI data including those aimed to cover the dialogue domain; (ii) the structured utterance-based approach is more robust and transferable on both analysis and out-of-distribution dialogues than its unstructured counterpart. We also show that our best contradiction detection model correlates well with human judgments and further provide evidence for its usage in both automatically evaluating and improving the consistency of state-of-the-art generative chatbots.

...read moreread less

Posted Content•

Reservoir Transformer

[...]

Sheng Shen, Alexei Baevski, Ari S. Morcos, Kurt Keutzer, Michael Auli, Douwe Kiela - Show less +2 more

30 Dec 2020

TL;DR: This article explored a variety of non-linear reservoir layers interspersed with regular transformer layers, and showed improvements in wall-clock compute time until convergence, as well as overall performance on various machine translation and (masked) language modelling tasks.

...read moreread less

Abstract: We demonstrate that transformers obtain impressive performance even when some of the layers are randomly initialized and never updated. Inspired by old and well-established ideas in machine learning, we explore a variety of non-linear "reservoir" layers interspersed with regular transformer layers, and show improvements in wall-clock compute time until convergence, as well as overall performance, on various machine translation and (masked) language modelling tasks.

...read moreread less

To what extent do human explanations of model behavior align with actual model behavior

[...]

Grusha Prasad, Yixin Nie, Mohit Bansal, Robin Jia, Douwe Kiela, Adina Williams - Show less +2 more

01 Jan 2020

TL;DR: The authors investigate the extent to which human-generated explanations of models' inference decisions align with how models actually make these decisions, and find that the BERT-base model has the highest alignment with humangenerated explanations, for all alignment metrics.

...read moreread less

Abstract: Given the increasingly prominent role NLP models (will) play in our lives, it is important for human expectations of model behavior to align with actual model behavior. Using Natural Language Inference (NLI) as a case study, we investigate the extent to which human-generated explanations of models’ inference decisions align with how models actually make these decisions. More specifically, we define three alignment metrics that quantify how well natural language explanations align with model sensitivity to input words, as measured by integrated gradients. Then, we evaluate eight different models (the base and large versions of BERT,RoBERTa and ELECTRA, as well as anRNN and bag-of-words model), and find that the BERT-base model has the highest alignment with human-generated explanations, for all alignment metrics. Focusing in on transformers, we find that the base versions tend to have higher alignment with human-generated explanations than their larger counterparts, suggesting that increasing the number of model parameters leads, in some cases, to worse alignment with human explanations. Finally, we find that a model’s alignment with human explanations is not predicted by the model’s accuracy, suggesting that accuracy and alignment are complementary ways to evaluate models.

...read moreread less

Showing papers by "Douwe Kiela published in 2020"