Open Access Journal ArticleDOI

Replicability Analysis for Natural Language Processing: Testing Significance with Multiple Datasets

TLDR
This paper proposes a Replicability Analysis framework for a statistically sound analysis of multiple comparisons between algorithms for NLP tasks, and demonstrates its empirical value across four applications: multi-domain dependency parsing, multilingual POS tagging, cross-domain sentiment classification and word similarity prediction.
Abstract
With the ever-growing amount of textual data from a large variety of languages, domains, and genres, it has become standard to evaluate NLP algorithms on multiple datasets in order to ensure consistent performance across heterogeneous setups. However, such multiple comparisons pose significant challenges to traditional statistical analysis methods in NLP and can lead to erroneous conclusions. In this paper we propose a Replicability Analysis framework for a statistically sound analysis of multiple comparisons between algorithms for NLP tasks. We discuss the theoretical advantages of this framework over the current, statistically unjustified, practice in the NLP literature, and demonstrate its empirical value across four applications: multi-domain dependency parsing, multilingual POS tagging, cross-domain sentiment classification and word similarity prediction.
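To make the core idea concrete, here is a minimal, hypothetical Python sketch of the kind of replicability count such a framework supports: given per-dataset p-values from a valid significance test comparing two algorithms, it estimates how many datasets show a genuine improvement via Bonferroni-style partial conjunction p-values. The function name and the early-stopping simplification are illustrative assumptions, not the paper's exact procedure (which also covers a Fisher-type combination and the choice of per-dataset test).

```python
# Hypothetical sketch: lower-bound the number of datasets on which algorithm A
# truly outperforms algorithm B, given one p-value per dataset.
def count_significant_datasets(pvalues, alpha=0.05):
    n = len(pvalues)
    p_sorted = sorted(pvalues)
    k_hat = 0
    for u in range(1, n + 1):
        # Bonferroni-style partial-conjunction p-value for "at least u effects".
        p_u = min(1.0, (n - u + 1) * p_sorted[u - 1])
        if p_u <= alpha:
            k_hat = u
        else:
            break  # conservative simplification: stop at the first failure
    return k_hat

# Example: p-values from 5 domains/languages.
print(count_significant_datasets([0.001, 0.004, 0.02, 0.30, 0.60]))  # -> 2
```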



Citations
Proceedings ArticleDOI

The Hitchhiker’s Guide to Testing Statistical Significance in Natural Language Processing

TL;DR: This opinion/theoretical paper proposes a simple, practical protocol for statistical significance test selection in NLP setups and accompanies this protocol with a brief survey of the most relevant tests.
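As an illustration of the kind of test such a protocol might recommend for NLP evaluation metrics, here is a hedged sketch of a paired approximate-randomization (permutation) test over per-example scores; the input names are hypothetical and the specific test choice is not taken from the paper itself.

```python
import random

# scores_a / scores_b: per-example metric values for two systems on the same test set.
def paired_permutation_test(scores_a, scores_b, n_permutations=10000, seed=0):
    rng = random.Random(seed)
    observed = abs(sum(scores_a) - sum(scores_b)) / len(scores_a)
    count = 0
    for _ in range(n_permutations):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a, b = b, a  # randomly swap the paired outputs
            diff += a - b
        if abs(diff) / len(scores_a) >= observed:
            count += 1
    return count / n_permutations  # approximate two-sided p-value
```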
Proceedings ArticleDOI

Show Your Work: Improved Reporting of Experimental Results

TL;DR: It is demonstrated that test-set performance scores alone are insufficient for drawing accurate conclusions about which model performs best, and a novel technique is presented: expected validation performance of the best-found model as a function of computation budget.
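A minimal sketch of that reporting idea, assuming validation scores from N random hyperparameter trials and estimating the expected maximum over a budget of n draws from the empirical distribution (function name is illustrative):

```python
def expected_max_performance(val_scores, n):
    """Expected best validation score when sampling n trials i.i.d.
    from the empirical distribution of observed scores."""
    v = sorted(val_scores)
    N = len(v)
    # P(max of n draws <= v[i]) = ((i + 1) / N) ** n under the empirical CDF.
    return sum(
        v[i] * (((i + 1) / N) ** n - (i / N) ** n)
        for i in range(N)
    )

scores = [0.71, 0.74, 0.69, 0.80, 0.77, 0.73]
for budget in (1, 3, 6):
    print(budget, round(expected_max_performance(scores, budget), 3))
```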
Proceedings ArticleDOI

We need to talk about standard splits

TL;DR: It is argued that randomly generated splits should be used in system evaluation; replication and reproduction experiments with nine part-of-speech taggers published between 2000 and 2018 fail to reliably reproduce some rankings when the analysis is repeated with randomly generated training-testing splits.
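A rough sketch of evaluation over randomly generated splits rather than a single standard split; `train_and_eval_a` / `train_and_eval_b` are hypothetical placeholders for full train-and-score pipelines, and the win rate is only one of several possible summary statistics:

```python
import random

def compare_over_random_splits(sentences, train_and_eval_a, train_and_eval_b,
                               n_splits=20, test_fraction=0.1, seed=0):
    rng = random.Random(seed)
    wins_a = 0
    for _ in range(n_splits):
        data = sentences[:]
        rng.shuffle(data)
        cut = int(len(data) * (1 - test_fraction))
        train, test = data[:cut], data[cut:]
        # Each callable trains on `train` and returns an accuracy on `test`.
        if train_and_eval_a(train, test) > train_and_eval_b(train, test):
            wins_a += 1
    return wins_a / n_splits  # fraction of random splits where system A ranks first
```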
Proceedings ArticleDOI

Deep Dominance - How to Properly Compare Deep Neural Models

TL;DR: The criteria for a high-quality comparison method between DNNs are defined, and it is shown that the proposed test meets all criteria while previously proposed methods fail to do so.
Proceedings ArticleDOI

Equity Beyond Bias in Language Technologies for Education

TL;DR: Concepts from culturally relevant pedagogy and other frameworks for teaching and learning are introduced, and future work on equity in NLP is identified.
References
Journal ArticleDOI

Controlling the false discovery rate: a practical and powerful approach to multiple testing

TL;DR: In this paper, a different approach to problems of multiple significance testing is presented: controlling the expected proportion of falsely rejected hypotheses, the false discovery rate, which is equivalent to the FWER when all hypotheses are true but is smaller otherwise.
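For reference, a minimal sketch of the Benjamini-Hochberg step-up procedure that this TL;DR describes, written in plain Python (list-based, no statistics library assumed):

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return a reject/keep decision per hypothesis, controlling FDR at level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            k = rank  # largest rank whose p-value passes its step-up threshold
    rejected = set(order[:k])
    return [i in rejected for i in range(m)]

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.9]))
```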
Book

An introduction to the bootstrap

TL;DR: This article presents bootstrap methods for estimation, using simple arguments, together with Minitab macros for implementing these methods and examples of how they can be used in practice.
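A small illustrative sketch of one such method, a percentile bootstrap confidence interval for a mean, in plain Python rather than the book's Minitab macros:

```python
import random

def bootstrap_ci(sample, n_resamples=10000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `sample`."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        resample = [rng.choice(sample) for _ in sample]  # sample with replacement
        means.append(sum(resample) / len(resample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

print(bootstrap_ci([0.62, 0.71, 0.65, 0.70, 0.68, 0.74, 0.66]))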
Proceedings ArticleDOI

Glove: Global Vectors for Word Representation

TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
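For concreteness, the weighted least-squares objective this model minimizes can be written as follows (notation follows the GloVe paper; V is the vocabulary size, X_ij the word-word co-occurrence count, and f a weighting function that down-weights rare and very frequent co-occurrences):

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
```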
Proceedings Article

Distributed Representations of Words and Phrases and their Compositionality

TL;DR: This paper presents a simple method for finding phrases in text, shows that learning good vector representations for millions of phrases is possible, and describes a simple alternative to the hierarchical softmax called negative sampling.
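The negative-sampling objective mentioned here replaces the full softmax with a small number k of noise words drawn from a distribution P_n(w); per (input word w_I, output word w_O) pair, it maximizes (with v and v' the input and output vector representations, and sigma the logistic function):

```latex
\log \sigma\!\left( {v'}_{w_O}^{\top} v_{w_I} \right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\!\left( -{v'}_{w_i}^{\top} v_{w_I} \right) \right]
```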
Journal ArticleDOI

A Simple Sequentially Rejective Multiple Test Procedure

TL;DR: In this paper, a simple and widely accepted multiple test procedure of the sequentially rejective type is presented, i.e., hypotheses are rejected one at a time until no further rejections can be made.
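A minimal Python sketch of the sequentially rejective (step-down) procedure described here, controlling the family-wise error rate at level alpha; list-based and illustrative only:

```python
def holm(pvalues, alpha=0.05):
    """Holm's step-down procedure: reject hypotheses one at a time, smallest p-value first."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    rejected = [False] * m
    for step, i in enumerate(order):
        if pvalues[i] <= alpha / (m - step):
            rejected[i] = True  # reject and move to the next smallest p-value
        else:
            break  # first non-rejection: keep this and all remaining hypotheses
    return rejected

print(holm([0.001, 0.02, 0.04, 0.30]))  # -> [True, False, False, False]
```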