Journal Article

The GUM corpus: creating multilayer resources in the classroom

Amir Zeldes
Vol. 51, Iss. 3, pp. 581–612
TLDR
The results of this project show that high quality, richly annotated resources can be created effectively as part of a linguistics curriculum, opening new possibilities not just for research, but also for corpora in linguistics pedagogy.
Abstract
This paper presents the methodology, design principles and detailed evaluation of a new freely available multilayer corpus, collected and edited via classroom annotation using collaborative software. After briefly discussing corpus design for open, extensible corpora, five classroom annotation projects are presented, covering structural markup in TEI XML, multiple part of speech tagging, constituent and dependency parsing, information structural and coreference annotation, and Rhetorical Structure Theory analysis. Layers are inspected for annotation quality and together they coalesce to form a richly annotated corpus that can be used to study the interactions between different levels of linguistic description. The evaluation gives an indication of the expected quality of a corpus created by students with relatively little training. A multifactorial example study on lexical NP coreference likelihood is also presented, which illustrates some applications of the corpus. The results of this project show that high quality, richly annotated resources can be created effectively as part of a linguistics curriculum, opening new possibilities not just for research, but also for corpora in linguistics pedagogy.
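To make the multifactorial example study more concrete, the sketch below shows one way such an analysis could look in code: a logistic regression predicting whether a lexical NP is coreferent from several mention-level factors. This is a minimal illustration, not the paper's actual model; the feature names and the toy data are assumptions standing in for values that would be exported from the corpus.

```python
# Minimal sketch (not from the paper): modeling lexical NP coreference
# likelihood as a multifactorial logistic regression. The feature names and
# the toy data frame below are illustrative assumptions, not GUM exports.
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical per-mention table: one row per lexical NP, with a binary
# outcome (does the NP corefer with another mention?) and three predictors.
mentions = pd.DataFrame({
    "coreferent": [1, 0, 1, 1, 0, 0, 1, 0],  # 1 = NP is part of a coreference chain
    "definite":   [1, 0, 1, 1, 0, 1, 1, 0],  # 1 = definite NP (e.g. "the report")
    "subject":    [1, 0, 1, 0, 0, 1, 1, 0],  # 1 = NP is a grammatical subject
    "length":     [2, 5, 1, 3, 6, 2, 2, 4],  # NP length in tokens
})

# Fit a logistic regression with all three factors entered jointly,
# so each effect is estimated while controlling for the others.
model = smf.logit("coreferent ~ definite + subject + length", data=mentions).fit(disp=False)
print(model.summary())
```

In a study like the one described in the abstract, the predictor columns would instead be extracted from the corpus's coreference, syntax and information-structure layers rather than entered by hand.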


Citations
Proceedings Article

Few-shot Slot Tagging with Collapsed Dependency Transfer and Label-enhanced Task-adaptive Projection Network

TL;DR: In this paper, a Label-enhanced Task-Adaptive Projection Network (L-TapNet) is proposed for slot tagging with only a few labeled support sentences (i.e., few-shot slot tagging).
Journal Article

Anaphora and Coreference Resolution: A Review

TL;DR: This review article aims to clarify the scope of the two tasks (anaphora and coreference resolution) within entity resolution, and carries out a detailed analysis of the datasets, evaluation metrics and research methods that have been adopted to tackle this NLP problem.
Posted Content

Evaluating NLP Models via Contrast Sets

TL;DR: A new annotation paradigm for NLP is proposed that helps to close systematic gaps in the test data, and it is recommended that after a dataset is constructed, the dataset authors manually perturb the test instances in small but meaningful ways that change the gold label, creating contrast sets.
Proceedings Article

An Annotated Dataset of Coreference in English Literature

TL;DR: A new dataset of coreference annotations for works of literature in English, covering 29,103 mentions in 210,532 tokens from 100 works of fiction published between 1719 and 1922, is presented.
References
Report

Building a Large Annotated Corpus of English: The Penn Treebank

TL;DR: As a result of this grant, the researchers have now published on CD-ROM a corpus of over 4 million words of running text annotated with part-of-speech (POS) tags, which includes a fully hand-parsed version of the classic Brown corpus.
Proceedings Article

The Stanford CoreNLP Natural Language Processing Toolkit

TL;DR: The design and use of the Stanford CoreNLP toolkit, an extensible pipeline that provides core natural language analysis, is described, and it is suggested that its wide adoption follows from a simple, approachable design, straightforward interfaces, the inclusion of robust and good-quality analysis components, and not requiring use of a large amount of associated baggage.
Journal Article

Rhetorical Structure Theory: Toward a Functional Theory of Text Organization

TL;DR: Rhetorical Structure Theory (RST) is presented as a descriptive theory of a major aspect of the organization of natural text and as a linguistically useful method for describing natural texts, characterizing their structure primarily in terms of relations that hold between parts of the text.
Proceedings Article

Cheap and Fast -- But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks

TL;DR: This work explores the use of Amazon's Mechanical Turk system, a significantly cheaper and faster method for collecting annotations from a broad base of paid non-expert contributors over the Web, and proposes a technique for bias correction that significantly improves annotation quality on two tasks.