Journal Article

ANNIS3: A new architecture for generic corpus query and visualization

Abstract
This article is concerned with the data structures, properties of query languages, and visualization facilities required for the generic representation of richly annotated, heterogeneous linguistic corpora. We propose that above and beyond a general graph-based data model, which is becoming increasingly popular in many complex annotation formats, a well-defined concept of multiple, potentially conflicting segmentation layers must be introduced to deal with different sources and applications of corpus data flexibly. We also propose a generic solution for specialized corpus visualizations in a Web interface using annotation-triggered style sheets, which leverage the power of modern browsers and CSS for multiple and highly customizable views of primary data. We offer an implementation and evaluation of our architecture in ANNIS3, an open-source browser-based architecture for corpus search and visualization. We present three case studies to test the coverage of the system, encompassing core linguistic and digital humanities use-cases including richly annotated newspaper treebanks, multilingual diplomatic and normalized manuscript materials edited in TEI, and analysis of multimodal recordings of spoken language.
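The idea of multiple, potentially conflicting segmentation layers can be illustrated with a minimal sketch. It assumes segments are modeled as character spans over one shared primary text; the `Segment` class, the layer names `dipl` and `norm`, and the helper functions are hypothetical and are not taken from ANNIS3 itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One unit of a segmentation layer (hypothetical model, not the ANNIS3 API)."""
    layer: str  # name of the segmentation layer, e.g. "dipl" or "norm"
    start: int  # inclusive character offset into the primary text
    end: int    # exclusive character offset into the primary text


def segment_text(primary: str, seg: Segment) -> str:
    """Return the stretch of primary text covered by a segment."""
    return primary[seg.start:seg.end]


def overlapping(seg: Segment, others: list[Segment]) -> list[Segment]:
    """Return the segments in `others` that share at least one character with `seg`."""
    return [o for o in others if o.start < seg.end and seg.start < o.end]


primary = "cannot go"

# Two segmentations of the same text that disagree on unit boundaries:
# the diplomatic layer keeps "cannot" as one unit, the normalized layer splits it.
dipl = [Segment("dipl", 0, 6), Segment("dipl", 7, 9)]
norm = [Segment("norm", 0, 3), Segment("norm", 3, 6), Segment("norm", 7, 9)]

# Align the first diplomatic unit with the normalized units it covers.
aligned = [segment_text(primary, s) for s in overlapping(dipl[0], norm)]
print(aligned)
```

Because both layers point into the same primary text, neither has to be treated as the single authoritative tokenization, and a query or visualization can choose whichever layer suits the task.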


Citations
Journal Article

The GUM corpus: creating multilayer resources in the classroom

TL;DR: The results of this project show that high quality, richly annotated resources can be created effectively as part of a linguistics curriculum, opening new possibilities not just for research, but also for corpora in linguistics pedagogy.
Proceedings Article

On Close and Distant Reading in Digital Humanities: A Survey and Future Challenges

TL;DR: A taxonomy of applied methods for close and distant reading is provided, along with approaches that combine both reading techniques to give a multifaceted view of the data.
Journal Article

Visual Text Analysis in Digital Humanities

TL;DR: An overview of the research conducted since 2005 on supporting text analysis tasks with close and distant reading visualizations in the digital humanities is presented and approaches that combine both reading techniques in order to provide a multi‐faceted view of the textual data are illustrated.
Proceedings Article

ArchiMob - A Corpus of Spoken Swiss German

TL;DR: A bootstrapping approach to automatic normalisation using different machine-translation-inspired methods is presented, and the performance of part-of-speech taggers on the authors' data is evaluated, showing that the same bootstrapping approach improves part-of-speech tagging by 10% over four rounds.
Proceedings Article

TreeAnnotator: Versatile Visual Annotation of Hierarchical Text Relations.

TL;DR: TreeAnnotator’s interoperability exceeds that of similar tools, providing a wider range of formats, while annotation work can be completed more quickly due to a revised input method for RST dependency relations.
References
Report

Building a large annotated corpus of English: the Penn Treebank

TL;DR: As a result of this grant, the researchers have now published on CD-ROM a corpus of over 4 million words of running text annotated with part-of-speech (POS) tags, which includes a fully hand-parsed version of the classic Brown corpus.
Journal Article

Rhetorical Structure Theory: Toward a Functional Theory of Text Organization

TL;DR: Rhetorical Structure Theory (RST) is a descriptive theory of a major aspect of the organization of natural text. It is a linguistically useful method for describing natural texts, characterizing their structure primarily in terms of relations that hold between parts of the text.
Book

Eclipse Modeling Framework

TL;DR: The authoritative guide to the Eclipse Modeling Framework (EMF), written by the lead EMF designers, shows how EMF unifies three important technologies: Java, XML, and UML.
Journal Article

The HCRC Map Task Corpus

TL;DR: A corpus of unscripted, task-oriented dialogues which has been designed, digitally recorded, and transcribed to support the study of spontaneous speech on many levels is described.