Open Access Proceedings Article

FoLiA: A practical XML Format for Linguistic Annotation - a descriptive and comparative study

Abstract
In this paper we present FoLiA, a Format for Linguistic Annotation, and conduct a comparative study with other annotation schemes, including the Linguistic Annotation Framework (LAF), the Text Encoding Initiative (TEI) and the Text Corpus Format (TCF). An additional point of focus is the interoperability between FoLiA and metadata standards such as the Component MetaData Infrastructure (CMDI), as well as data category registries such as ISOcat. The aim of the paper is to present a clear image of the capabilities of FoLiA and how it relates to other formats. This should open discussion and aid users in choosing a particular format. FoLiA is a practically oriented XML-based annotation format for the representation of language resources, explicitly supporting a wide variety of annotation types. It introduces a flexible and uniform paradigm and a representation independent of language or label set. It is designed to be highly expressive, generic, and formalised, whilst at the same time focussing on being as practical as possible to ease its adoption and implementation. The aspiration is to offer a generic format for storage, exchange, and machine-processing of linguistically annotated documents, sparing users and software tools from having to cope with a wide variety of formats, a situation that regularly causes conversion problems and a proliferation of ad-hoc formats in the field. FoLiA emerged from such a practical need in the context of Computational Linguistics in the Netherlands and Flanders. It has been successfully adopted by numerous projects within this community. FoLiA was developed in a bottom-up fashion, with special emphasis on software libraries and tools to handle it.
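
To make the idea of a representation "independent of language or label set" more concrete, the following is a minimal sketch rather than an excerpt from the paper. It assumes a heavily simplified FoLiA-style fragment in which each annotation carries a class label and the label set is declared once in the document metadata. The sample document, the set URL, the class labels and the helper name `extract_pos` are all hypothetical, and the fragment has not been validated against the FoLiA schema. It uses only the Python standard library, not the FoLiA tooling the abstract refers to.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only: the set URL, identifiers and class labels are
# assumptions, and the structure is simplified and not schema-validated.
SAMPLE = """\
<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="example">
  <metadata>
    <annotations>
      <pos-annotation set="https://example.org/sets/pos"/>
    </annotations>
  </metadata>
  <text xml:id="example.text">
    <s xml:id="example.s.1">
      <w xml:id="example.s.1.w.1"><t>Dogs</t><pos class="NOUN"/></w>
      <w xml:id="example.s.1.w.2"><t>bark</t><pos class="VERB"/></w>
    </s>
  </text>
</FoLiA>
"""

# Namespace prefix mapping used in the path expressions below.
NS = {"f": "http://ilk.uvt.nl/folia"}


def extract_pos(xml_text):
    """Return (token text, part-of-speech class) pairs from the fragment."""
    root = ET.fromstring(xml_text)
    pairs = []
    for word in root.iterfind(".//f:w", NS):
        text = word.findtext("f:t", default="", namespaces=NS)
        pos = word.find("f:pos", NS)
        # A token without a pos element yields None as its class.
        pairs.append((text, pos.get("class") if pos is not None else None))
    return pairs


if __name__ == "__main__":
    print(extract_pos(SAMPLE))
```

In this sketch the reading code never inspects the labels themselves, so swapping in a different tag set or language would only change the declared set and the class values, not the extraction logic.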

