Open Access Journal ArticleDOI

Replicability Analysis for Natural Language Processing: Testing Significance with Multiple Datasets

TLDR
This paper proposes a Replicability Analysis framework for a statistically sound analysis of multiple comparisons between algorithms for NLP tasks, and demonstrates its empirical value across four applications: multi-domain dependency parsing, multilingual POS tagging, cross-domain sentiment classification and word similarity prediction.
Abstract
With the ever-growing amount of textual data from a large variety of languages, domains, and genres, it has become standard to evaluate NLP algorithms on multiple datasets in order to ensure consistent performance across heterogeneous setups. However, such multiple comparisons pose significant challenges to traditional statistical analysis methods in NLP and can lead to erroneous conclusions. In this paper we propose a Replicability Analysis framework for a statistically sound analysis of multiple comparisons between algorithms for NLP tasks. We discuss the theoretical advantages of this framework over the current, statistically unjustified, practice in the NLP literature, and demonstrate its empirical value across four applications: multi-domain dependency parsing, multilingual POS tagging, cross-domain sentiment classification and word similarity prediction.
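To make the core idea concrete, here is a minimal, hypothetical Python sketch of the kind of replicability count such a framework supports: given per-dataset p-values from a valid significance test comparing two algorithms, it estimates how many datasets show a genuine improvement via Bonferroni-style partial conjunction p-values. The function name and the early-stopping simplification are illustrative assumptions, not the paper's exact procedure (which also covers a Fisher-type combination and the choice of per-dataset test).

```python
# Hypothetical sketch: lower-bound the number of datasets on which algorithm A
# truly outperforms algorithm B, given one p-value per dataset.
def count_significant_datasets(pvalues, alpha=0.05):
    n = len(pvalues)
    p_sorted = sorted(pvalues)
    k_hat = 0
    for u in range(1, n + 1):
        # Bonferroni-style partial-conjunction p-value for "at least u effects".
        p_u = min(1.0, (n - u + 1) * p_sorted[u - 1])
        if p_u <= alpha:
            k_hat = u
        else:
            break  # conservative simplification: stop at the first failure
    return k_hat

# Example: p-values from 5 domains/languages.
print(count_significant_datasets([0.001, 0.004, 0.02, 0.30, 0.60]))  # -> 2
```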



Citations
Proceedings ArticleDOI

The Hitchhiker’s Guide to Testing Statistical Significance in Natural Language Processing

TL;DR: This opinion/theoretical paper proposes a simple, practical protocol for statistical significance test selection in NLP setups and accompanies this protocol with a brief survey of the most relevant tests.
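As an illustration of the kind of test such a protocol might recommend for NLP evaluation metrics, here is a hedged sketch of a paired approximate-randomization (permutation) test over per-example scores; the input names are hypothetical and the specific test choice is not taken from the paper itself.

```python
import random

# scores_a / scores_b: per-example metric values for two systems on the same test set.
def paired_permutation_test(scores_a, scores_b, n_permutations=10000, seed=0):
    rng = random.Random(seed)
    observed = abs(sum(scores_a) - sum(scores_b)) / len(scores_a)
    count = 0
    for _ in range(n_permutations):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a, b = b, a  # randomly swap the paired outputs
            diff += a - b
        if abs(diff) / len(scores_a) >= observed:
            count += 1
    return count / n_permutations  # approximate two-sided p-value
```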
Proceedings ArticleDOI

Show Your Work: Improved Reporting of Experimental Results

TL;DR: It is demonstrated that test-set performance scores alone are insufficient for drawing accurate conclusions about which model performs best, and a novel technique is presented: expected validation performance of the best-found model as a function of computation budget.
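A minimal sketch of that reporting idea, assuming validation scores from N random hyperparameter trials and estimating the expected maximum over a budget of n draws from the empirical distribution (function name is illustrative):

```python
def expected_max_performance(val_scores, n):
    """Expected best validation score when sampling n trials i.i.d.
    from the empirical distribution of observed scores."""
    v = sorted(val_scores)
    N = len(v)
    # P(max of n draws <= v[i]) = ((i + 1) / N) ** n under the empirical CDF.
    return sum(
        v[i] * (((i + 1) / N) ** n - (i / N) ** n)
        for i in range(N)
    )

scores = [0.71, 0.74, 0.69, 0.80, 0.77, 0.73]
for budget in (1, 3, 6):
    print(budget, round(expected_max_performance(scores, budget), 3))
```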
Proceedings ArticleDOI

We need to talk about standard splits

TL;DR: It is argued that randomly generated splits should be used in system evaluation; replication and reproduction experiments with nine part-of-speech taggers published between 2000 and 2018 fail to reliably reproduce some rankings when the analysis is repeated with randomly generated training-testing splits.
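A rough sketch of evaluation over randomly generated splits rather than a single standard split; `train_and_eval_a` / `train_and_eval_b` are hypothetical placeholders for full train-and-score pipelines, and the win rate is only one of several possible summary statistics:

```python
import random

def compare_over_random_splits(sentences, train_and_eval_a, train_and_eval_b,
                               n_splits=20, test_fraction=0.1, seed=0):
    rng = random.Random(seed)
    wins_a = 0
    for _ in range(n_splits):
        data = sentences[:]
        rng.shuffle(data)
        cut = int(len(data) * (1 - test_fraction))
        train, test = data[:cut], data[cut:]
        # Each callable trains on `train` and returns an accuracy on `test`.
        if train_and_eval_a(train, test) > train_and_eval_b(train, test):
            wins_a += 1
    return wins_a / n_splits  # fraction of random splits where system A ranks first
```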
Proceedings ArticleDOI

Deep Dominance - How to Properly Compare Deep Neural Models

TL;DR: The criteria for a high-quality comparison method between DNNs are defined, and it is shown that the proposed test meets all criteria while previously proposed methods fail to do so.
Proceedings ArticleDOI

Equity Beyond Bias in Language Technologies for Education

TL;DR: Concepts from culturally relevant pedagogy and other frameworks for teaching and learning are introduced, and future work on equity in NLP is identified.
References
Journal ArticleDOI

Controlling the false discovery rate: a practical and powerful approach to multiple testing

TL;DR: In this paper, a different approach to problems of multiple significance testing is presented: controlling the expected proportion of falsely rejected hypotheses, the false discovery rate, which is equivalent to the FWER when all hypotheses are true but is smaller otherwise.
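For reference, a minimal sketch of the Benjamini-Hochberg step-up procedure that this TL;DR describes, written in plain Python (list-based, no statistics library assumed):

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return a reject/keep decision per hypothesis, controlling FDR at level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            k = rank  # largest rank whose p-value passes its step-up threshold
    rejected = set(order[:k])
    return [i in rejected for i in range(m)]

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.9]))
```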
Book

An introduction to the bootstrap

TL;DR: This article presents bootstrap methods for estimation, using simple arguments, together with Minitab macros for implementing these methods and examples of how they can be used in practice.
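A small illustrative sketch of one such method, a percentile bootstrap confidence interval for a mean, in plain Python rather than the book's Minitab macros:

```python
import random

def bootstrap_ci(sample, n_resamples=10000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `sample`."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        resample = [rng.choice(sample) for _ in sample]  # sample with replacement
        means.append(sum(resample) / len(resample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

print(bootstrap_ci([0.62, 0.71, 0.65, 0.70, 0.68, 0.74, 0.66]))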
Proceedings ArticleDOI

Glove: Global Vectors for Word Representation

TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
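For concreteness, the weighted least-squares objective this model minimizes can be written as follows (notation follows the GloVe paper; V is the vocabulary size, X_ij the word-word co-occurrence count, and f a weighting function that down-weights rare and very frequent co-occurrences):

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
```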
Proceedings Article

Distributed Representations of Words and Phrases and their Compositionality

TL;DR: This paper presents a simple method for finding phrases in text, shows that learning good vector representations for millions of phrases is possible, and describes a simple alternative to the hierarchical softmax called negative sampling.
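The negative-sampling objective mentioned here replaces the full softmax with a small number k of noise words drawn from a distribution P_n(w); per (input word w_I, output word w_O) pair, it maximizes (with v and v' the input and output vector representations, and sigma the logistic function):

```latex
\log \sigma\!\left( {v'}_{w_O}^{\top} v_{w_I} \right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\!\left( -{v'}_{w_i}^{\top} v_{w_I} \right) \right]
```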
Journal ArticleDOI

A Simple Sequentially Rejective Multiple Test Procedure

TL;DR: In this paper, a simple and widely accepted multiple test procedure of the sequentially rejective type is presented, i.e., hypotheses are rejected one at a time until no further rejections can be made.
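A minimal Python sketch of the sequentially rejective (step-down) procedure described here, controlling the family-wise error rate at level alpha; list-based and illustrative only:

```python
def holm(pvalues, alpha=0.05):
    """Holm's step-down procedure: reject hypotheses one at a time, smallest p-value first."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    rejected = [False] * m
    for step, i in enumerate(order):
        if pvalues[i] <= alpha / (m - step):
            rejected[i] = True  # reject and move to the next smallest p-value
        else:
            break  # first non-rejection: keep this and all remaining hypotheses
    return rejected

print(holm([0.001, 0.02, 0.04, 0.30]))  # -> [True, False, False, False]
```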