Precise Task Formalization Matters in Winograd Schema Evaluations

Open Access, Posted Content

08 Oct 2020

TLDR

The paper finds that framing the task as multiple choice improves performance by 2-6 points, and that several additional techniques, including the reuse of a pretrained language modeling head, can mitigate the model's extreme sensitivity to hyperparameters.

Abstract:

Performance on the Winograd Schema Challenge (WSC), a respected English commonsense reasoning benchmark, recently rocketed from chance accuracy to 89% on the SuperGLUE leaderboard, with relatively little corroborating evidence of a correspondingly large improvement in reasoning ability. We hypothesize that much of this improvement comes from recent changes in task formalization (the combination of input specification, loss function, and reuse of pretrained parameters) by users of the dataset, rather than from improvements in the pretrained model's reasoning ability. We perform an ablation on two Winograd Schema datasets that interpolates between the formalizations used before and after this surge, and find that (i) framing the task as multiple choice improves performance by 2-6 points and (ii) several additional techniques, including the reuse of a pretrained language modeling head, can mitigate the model's extreme sensitivity to hyperparameters. We urge future benchmark creators to impose additional structure to minimize the impact of formalization decisions on reported results.
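To make the difference between formalizations concrete, the sketch below contrasts two loss functions over the same candidate scores: a pointwise binary loss that treats each candidate referent as an independent yes/no decision, and a multiple-choice loss that normalizes scores across candidates with a softmax. This is an illustration only, not the authors' implementation. The function names, the example sentence, and the score values are assumptions introduced here, and the scores stand in for whatever a model head would output.

```python
import math

def binary_loss(scores, labels):
    """Pointwise formalization (assumed for illustration): each candidate
    is scored independently with a sigmoid and binary cross-entropy."""
    total = 0.0
    for s, y in zip(scores, labels):
        p = 1.0 / (1.0 + math.exp(-s))  # probability the candidate is the referent
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(scores)

def multiple_choice_loss(scores, correct_index):
    """Multiple-choice formalization (assumed for illustration): candidates
    compete through a softmax, and the loss is the negative log-probability
    of the correct candidate."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[correct_index]

def predict_multiple_choice(scores):
    """Pick the highest-scoring candidate."""
    return max(range(len(scores)), key=lambda i: scores[i])

# Illustrative schema: "The trophy doesn't fit in the suitcase because it is too big."
candidates = ["the trophy", "the suitcase"]
scores = [1.2, 0.4]   # hypothetical model scores, one per candidate
labels = [1, 0]       # "the trophy" is the correct referent

print("binary loss:", binary_loss(scores, labels))
print("multiple-choice loss:", multiple_choice_loss(scores, 0))
print("prediction:", candidates[predict_multiple_choice(scores)])
```

In the pointwise version the model can assign a high probability to both candidates at once, whereas the softmax version forces them to compete for probability mass, which is one plausible reason the choice of formalization could affect reported accuracy.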
