Author

Alexis Ross

Allen Institute for Artificial Intelligence

Bio: Alexis Ross is an academic researcher at the Allen Institute for Artificial Intelligence. The author has contributed to research on the topics of Computer science and Spurious relationship. The author has an h-index of 6 and has co-authored 11 publications receiving 159 citations.

Papers

Proceedings Article•DOI•

Probing What Different NLP Tasks Teach Machines about Function Word Comprehension

Najoung Kim¹, Roma Patel², Adam Poliak³, Patrick Xia³, Alex Wang³, Thomas H. McCoy⁴, Ian Tenney⁵, Alexis Ross⁶, Tal Linzen³, Benjamin Van Durme³, Samuel R. Bowman⁷, Ellie Pavlick⁵•Institutions (7)

KAIST¹, University of Nottingham², Johns Hopkins University³, Harvard University⁴, Google⁵, Allen Institute for Artificial Intelligence⁶, New York University⁷

01 Apr 2019

TL;DR: The results show that pretraining on CCG—the authors' most syntactic objective—performs the best on average across their probing tasks, suggesting that syntactic knowledge helps function word comprehension.

Abstract: We introduce a set of nine challenge tasks that test for the understanding of function words. These tasks are created by structurally mutating sentences from existing datasets to target the comprehension of specific types of function words (e.g., prepositions, wh-words). Using these probing tasks, we explore the effects of various pretraining objectives for sentence encoders (e.g., language modeling, CCG supertagging and natural language inference (NLI)) on the learned representations. Our results show that pretraining on CCG—our most syntactic objective—performs the best on average across our probing tasks, suggesting that syntactic knowledge helps function word comprehension. Language modeling also shows strong performance, supporting its widespread use for pretraining state-of-the-art NLP models. Overall, no pretraining objective dominates across the board, and our function word probing tasks highlight several intuitive differences between pretraining objectives, e.g., that NLI helps the comprehension of negation.

88 citations

Proceedings Article•DOI•

How well do NLI models capture verb veridicality?

Alexis Ross¹, Ellie Pavlick²•Institutions (2)

Allen Institute for Artificial Intelligence¹, Brown University²

01 Nov 2019

TL;DR: It is shown that, encouragingly, BERT’s inferences are sensitive not only to the presence of individual verb types, but also to the syntactic role of the verb, the form of the complement clause (to- vs. that-complements), and negation.

Abstract: In natural language inference (NLI), contexts are considered veridical if they allow us to infer that their underlying propositions make true claims about the real world. We investigate whether a state-of-the-art natural language inference model (BERT) learns to make correct inferences about veridicality in verb-complement constructions. We introduce an NLI dataset for veridicality evaluation consisting of 1,500 sentence pairs, covering 137 unique verbs. We find that both human and model inferences generally follow theoretical patterns, but exhibit a systematic bias towards assuming that verbs are veridical–a bias which is amplified in BERT. We further show that, encouragingly, BERT’s inferences are sensitive not only to the presence of individual verb types, but also to the syntactic role of the verb, the form of the complement clause (to- vs. that-complements), and negation.

37 citations

Posted Content•

Explaining NLP Models via Minimal Contrastive Editing (MiCE)

Alexis Ross¹, Ana Marasović¹, Matthew E. Peters¹•Institutions (1)

Allen Institute for Artificial Intelligence¹

27 Dec 2020-arXiv: Computation and Language

TL;DR: The authors presented Minimal Contrastive Editing (MiCE), a method for producing contrastive explanations of model predictions in the form of edits to inputs that change model outputs to the contrast case.

Abstract: Humans have been shown to give contrastive explanations, which explain why an observed event happened rather than some other counterfactual event (the contrast case). Despite the influential role that contrastivity plays in how humans explain, this property is largely missing from current methods for explaining NLP models. We present Minimal Contrastive Editing (MiCE), a method for producing contrastive explanations of model predictions in the form of edits to inputs that change model outputs to the contrast case. Our experiments across three tasks--binary sentiment classification, topic classification, and multiple-choice question answering--show that MiCE is able to produce edits that are not only contrastive, but also minimal and fluent, consistent with human contrastive edits. We demonstrate how MiCE edits can be used for two use cases in NLP system development--debugging incorrect model outputs and uncovering dataset artifacts--and thereby illustrate that producing contrastive explanations is a promising research direction for model interpretability.

29 citations

Posted Content•

Probing What Different NLP Tasks Teach Machines about Function Word Comprehension

Najoung Kim, Roma Patel, Adam Poliak, Alex Wang, Patrick Xia, R. Thomas McCoy, Ian Tenney, Alexis Ross, Tal Linzen, Benjamin Van Durme, Samuel R. Bowman, Ellie Pavlick

25 Apr 2019-arXiv: Computation and Language

TL;DR: The authors explore the effects of various pretraining objectives for sentence encoders (e.g., language modeling, CCG supertagging and natural language inference) on the learned representations.

Abstract: We introduce a set of nine challenge tasks that test for the understanding of function words. These tasks are created by structurally mutating sentences from existing datasets to target the comprehension of specific types of function words (e.g., prepositions, wh-words). Using these probing tasks, we explore the effects of various pretraining objectives for sentence encoders (e.g., language modeling, CCG supertagging and natural language inference (NLI)) on the learned representations. Our results show that pretraining on language modeling performs the best on average across our probing tasks, supporting its widespread use for pretraining state-of-the-art NLP models, and CCG supertagging and NLI pretraining perform comparably. Overall, no pretraining objective dominates across the board, and our function word probing tasks highlight several intuitive differences between pretraining objectives, e.g., that NLI helps the comprehension of negation.

27 citations

Proceedings Article•DOI•

Explaining NLP Models via Minimal Contrastive Editing (MiCE)

Alexis Ross¹, Ana Marasović, Matthew E. Peters²•Institutions (2)

Allen Institute for Artificial Intelligence¹, University of Washington²

01 Aug 2021

15 citations

Cited by

Proceedings Article•DOI•

Beyond accuracy: Behavioral testing of NLP models with CheckList

Marco Tulio Ribeiro¹, Tongshuang Wu², Carlos Guestrin³, Sameer Singh⁴•Institutions (4)

Microsoft¹, University of Washington², Apple Inc.³, University of California, Irvine⁴

01 Jun 2020

TL;DR: CheckList is a task-agnostic methodology for testing NLP models. It includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly.

Abstract: Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.

705 citations

Journal Article•DOI•

What BERT Is Not: Lessons from a New Suite of Psycholinguistic Diagnostics for Language Models

Allyson Ettinger¹•Institutions (1)

University of Chicago¹

01 Feb 2020-Transactions of the Association for Computational Linguistics

TL;DR: A suite of diagnostics drawn from human language experiments is introduced, allowing targeted questions about the information language models use to generate predictions in context; the diagnostics are applied to the popular BERT model.

Abstract: Pre-training by language modeling has become a popular and successful approach to NLP tasks, but we have yet to understand exactly what linguistic capacities these pre-training processes confer upon models. In this paper we introduce a suite of diagnostics drawn from human language experiments, which allow us to ask targeted questions about information used by language models for generating predictions in context. As a case study, we apply these diagnostics to the popular BERT model, finding that it can generally distinguish good from bad completions involving shared category or role reversal, albeit with less sensitivity than humans, and it robustly retrieves noun hypernyms, but it struggles with challenging inference and role-based event prediction -- and in particular, it shows clear insensitivity to the contextual impacts of negation.

314 citations

Proceedings Article•DOI•

Designing and Interpreting Probes with Control Tasks

John Hewitt¹, Percy Liang¹•Institutions (1)

Stanford University¹

08 Sep 2019

TL;DR: Control tasks, which associate word types with random outputs, are proposed to complement linguistic tasks, and it is found that dropout, commonly used to control probe complexity, is ineffective for improving selectivity of MLPs, but that other forms of regularization are effective.

Abstract: Probes, supervised models trained to predict properties (like parts-of-speech) from representations (like ELMo), have achieved high accuracy on a range of linguistic tasks. But does this mean that the representations encode linguistic structure or just that the probe has learned the linguistic task? In this paper, we propose control tasks, which associate word types with random outputs, to complement linguistic tasks. By construction, these tasks can only be learned by the probe itself. So a good probe, (one that reflects the representation), should be selective, achieving high linguistic task accuracy and low control task accuracy. The selectivity of a probe puts linguistic task accuracy in context with the probe’s capacity to memorize from word types. We construct control tasks for English part-of-speech tagging and dependency edge prediction, and show that popular probes on ELMo representations are not selective. We also find that dropout, commonly used to control probe complexity, is ineffective for improving selectivity of MLPs, but that other forms of regularization are effective. Finally, we find that while probes on the first layer of ELMo yield slightly better part-of-speech tagging accuracy than the second, probes on the second layer are substantially more selective, which raises the question of which layer better represents parts-of-speech.

305 citations

Posted Content•

What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models

Allyson Ettinger¹•Institutions (1)

University of Chicago¹

31 Jul 2019-arXiv: Computation and Language

TL;DR: This article introduced a suite of diagnostics drawn from human language experiments, which allow us to ask targeted questions about the information used by language models for generating predictions in context, and applied these diagnostics to the popular BERT model, finding that it can generally distinguish good from bad completions involving shared category or role reversal.

Abstract: Pre-training by language modeling has become a popular and successful approach to NLP tasks, but we have yet to understand exactly what linguistic capacities these pre-training processes confer upon models. In this paper we introduce a suite of diagnostics drawn from human language experiments, which allow us to ask targeted questions about the information used by language models for generating predictions in context. As a case study, we apply these diagnostics to the popular BERT model, finding that it can generally distinguish good from bad completions involving shared category or role reversal, albeit with less sensitivity than humans, and it robustly retrieves noun hypernyms, but it struggles with challenging inferences and role-based event prediction -- and in particular, it shows clear insensitivity to the contextual impacts of negation.

274 citations

Proceedings Article•DOI•

Negated and Misprimed Probes for Pretrained Language Models: Birds Can Talk, But Cannot Fly

Nora Kassner¹, Hinrich Schütze¹•Institutions (1)

Ludwig Maximilian University of Munich¹

06 Jul 2020

TL;DR: Two new probing tasks analyzing factual knowledge stored in Pretrained Language Models (PLMs) are proposed. It is found that PLMs do not distinguish between negated and non-negated cloze questions and are easily distracted by misprimes.

Abstract: Building on Petroni et al. (2019), we propose two new probing tasks analyzing factual knowledge stored in Pretrained Language Models (PLMs). (1) Negation. We find that PLMs do not distinguish between negated (“Birds cannot [MASK]”) and non-negated (“Birds can [MASK]”) cloze questions. (2) Mispriming. Inspired by priming methods in human psychology, we add “misprimes” to cloze questions (“Talk? Birds can [MASK]”). We find that PLMs are easily distracted by misprimes. These results suggest that PLMs still have a long way to go to adequately learn human-like factual knowledge.

215 citations
