International Standard for a Linguistic Annotation Framework
Nancy Ide
Dept. of Computer Science
Vassar College
Poughkeepsie, New York 12604-0520 USA
ide@cs.vassar.edu
Laurent Romary
Equipe Langue et Dialogue
LORIA/INRIA
Vandoeuvre-lès-Nancy
FRANCE
romary@loria.fr
Abstract
This paper describes the Linguistic Anno-
tation Framework under development
within ISO TC37 SC4 WG1. The Lin-
guistic Annotation Framework is intended
to serve as a basis for harmonizing exist-
ing language resources as well as devel-
oping new ones.
1 Introduction
Language resources are bodies of electronic lan-
guage data used to support research and applica-
tions in the area of natural language processing.
Typically, such data are enhanced (annotated) with
linguistic information such as morpho-syntactic
categories, syntactic or discourse structure, co-
reference information, etc.; or two or more bodies
may be aligned for correspondences (e.g., parallel
translations, speech signal and transcription).
Over the past 15-20 years, increasingly large bod-
ies of language resources have been created and
annotated by the language engineering community.
Certain fundamental representation principles have
been widely adopted, such as the use of stand-off
annotation (Ide and Priest-Dorman, 1996), use of
XML, etc., and several attempts to provide gener-
alized annotation mechanisms and formats have
been developed (e.g., XCES (Ide, et al., 2000),
annotation graphs (Bird and Liberman, 2001)).
However, it remains the case that annotation for-
mats often vary considerably from resource to re-
source, often to satisfy constraints imposed by
particular processing software. The language proc-
essing community has recognized that commonal-
ity and interoperability are increasingly imperative
to enable sharing, merging, and comparison of lan-
guage resources.
To provide an infrastructure and framework for
language resource development and use, the Inter-
national Organization for Standardization (ISO)
has formed a sub-committee (SC4) under Techni-
cal Committee 37 (TC37, Terminology and Other
Language Resources) devoted to Language Re-
source Management. The objective of ISO/TC
37/SC 4 is to prepare international standards and
guidelines for effective language resource man-
agement in applications in the multilingual infor-
mation society. To this end, the committee is
developing principles and methods for creating,
coding, processing and managing language re-
sources, such as written corpora, lexical corpora,
speech corpora, dictionary compiling and classifi-
cation schemes. The focus of the work is on data
modeling, markup, data exchange and the evalua-
tion of language resources other than terminologies
(which have already been treated in other sub-
committees of ISO/TC 37). The worldwide use of
ISO/TC 37/SC 4 standards should improve infor-
mation management within industrial, technical
and scientific environments, and increase effi-
ciency in computer-supported language communi-
cation.
At present, language professionals and standardiza-
tion experts are not sufficiently aware of the stan-
dardization efforts being undertaken by ISO/TC
37/SC 4. Promoting awareness of future activities
and emerging problems, therefore, is crucial for the
success of the committee, and will be required to
ensure widespread adoption of the standards it de-
velops. An even more critical factor for the success
of the committee's work is to involve, from the
outset, as many and as broad a range of potential
users of the standards as possible.

Within ISO/TC 37/SC 4, a working group (WG1)
has been established to develop a Linguistic Anno-
tation Framework (LAF) that can serve as a basis
for harmonizing existing language resources as
well as developing new ones. In order to ensure
that the framework is developed based on the input
and consensus of the research community, a group
of experts¹ was convened on November 21-22,
2002, at Pont-à-Mousson, France, to lay out the
overall structure of the framework. Based on the
determinations of the experts at the workshop, the
general outlines of the Linguistic Annotation
Framework have been defined. In this paper, we
describe the LAF design as it has been developed
so far, and solicit the input of other members of the
community to inform its further development.
2 Background and rationale
The standardization of principles and methods for
the collection, processing and presentation of lan-
guage resources requires a distinct type of activity.
Basic standards must be produced with wide-
ranging applications in view. In the area of lan-
guage resources, these standards should provide
various technical committees of ISO, IEC and
other standardizing bodies with the groundwork for
building more precise standards for language
resource management.²
The need for harmonization of representation
formats for different kinds of linguistic
information is critical, as resources and
information are more and more frequently merged,
compared, or otherwise utilized in common systems.
This is perhaps most obvious for processing
multimodal information, which must support the
fusion of multimodal inputs and represent the
combined and integrated contributions of different
types of input (e.g., a spoken utterance combined
with gesture and facial expression), and enable
multimodal output (see, for example, Bunt and
Romary, 2002). However, language processing
applications of any kind require the integration
of varieties of linguistic information, which, in
today’s environment, come from potentially diverse
sources. We can therefore expect use and
integration of, for example, syntactic,
morphological, discourse, etc. information for
multiple languages, as well as information
structures like domain models and ontologies.

¹ Participants: Nuria Bel (Universitat de Barcelona), David
Durand (Brown University), Henry Thompson (University of
Edinburgh), Koiti Hasida (AIST Tokyo), Eric De La Clergerie
(INRIA), Lionel Clement (INRIA), Laurent Romary (LORIA),
Nancy Ide (Vassar College), Kiyong Lee (Korea University),
Keith Suderman (Vassar College), Aswani Kumar (LORIA),
Chris Laprun (NIST), Thierry Declerck (DFKI), Jean Carletta
(University of Edinburgh), Michael Strube (European Media
Laboratory), Hamish Cunningham (University of Sheffield),
Tomaz Erjavec (Institute Jozef Stefan), Hennie Brugman
(Max-Planck-Institut für Psycholinguistik), Fabio Vitali
(Università di Bologna), Key-Sun Choi (Korterm), Jean-Michel
Borde (Digital Visual), Eric Kow (LORIA).
² This is particularly true for the two domains of Multimedia
(ISO/IEC JTC1/SC 29/WG 11) and Education (ISO/IEC
JTC1/SC 36).
We are aware that standardization is a difficult
business, and that many members of the targeted
communities are skeptical about imposing any sort
of standards at all. There are two major arguments
against the idea of standardization for language
resources. First, the diversity of theoretical ap-
proaches to, in particular, the annotation of various
linguistic phenomena suggests that standardization
is at least impractical, if not impossible. Second, it
is feared that vast amounts of existing data and
processing software, which may have taken years
of effort and considerable funding to develop, will
be rendered obsolete by the acceptance of new
standards by the community. Recognizing the va-
lidity of both of these concerns, WG1 does not
seek to establish a single, definitive annotation
scheme or format. Rather, the goal is to provide a
framework for linguistic annotation of language
resources that can serve as a reference or pivot for
different annotation schemes, and which will en-
able their merging and/or comparison. To this end,
the work of WG1 includes the following:
1. analysis of the full range of annotation types
and existing schemes, to identify the fundamental
structural principles and content categories;
2. instantiation of an abstract format capable of
capturing the structure and content of linguistic
annotations, based on the analysis in (1);
3. establishment of a mechanism for formal
definition of a set of reference content categories
which can be used “off the shelf” or serve as a
point of departure for precise definition of new
or modified categories;
4. provision of both a set of guidelines and
principles for developing new annotation schemes
and concrete mechanisms for their implementation,
for those who wish to use them.
By situating all of the standards development
squarely in the framework of XML and related
standards such as RDF, OWL, etc., we hope to en-
sure not only that the standards developed by the
committee provide for compatibility with estab-
lished and widely accepted web-based technolo-
gies, but also that transduction from legacy formats
into XML formats conformant to the new stan-
dards is feasible.
3 General requirements for a linguistic
annotation framework
3.1 Usage scenarios
Natural language processing (NLP) applications
can be applied to create annotations for linguistic
data by analyzing text, speech, and data represent-
ing other modalities to determine specific linguistic
attributes and associate them with the segments of
that data to which they apply. NLP applications
also use linguistic annotations to facilitate lan-
guage understanding and generation. Development
of a standard linguistic annotation framework must
proceed by considering both of these “views” on
linguistic annotation, and integrating the two to
ensure maximal inter-operability.
Annotation of linguistic data may involve multiple
annotation steps, for example, morpho-syntactic
tagging, syntactic analysis, entity and event recog-
nition, semantic annotation, co-reference resolu-
tion, discourse structure analysis, etc. Annotation
at higher linguistic levels typically relies on anno-
tations at lower levels—that is, information at
lower linguistic levels serves as input in the deter-
mination of higher-level annotation categories, so
that annotation can be viewed as an incremental
process. Depending on the application intended to
use the annotations, lower-level annotations may
or may not be preserved in a persistent format.
That is, the output of the annotation software may
consist solely of higher-level annotations, even
though lower-level analysis has been performed.
Note that many application programs—e.g., infor-
mation extraction software—perform the analysis
required for annotation of various linguistic fea-
tures and utilize it internally to deliver the desired
result, without preserving the annotation informa-
tion.
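The incremental process described above can be illustrated with a small sketch. It is not part of the framework: the toy lexicon, the function names, and the `keep_lower_levels` flag are all assumptions made for illustration. Each level consumes the output of the level below it, and the caller decides whether the lower-level annotations are preserved in the result.

```python
# Toy lexicon standing in for a real morpho-syntactic tagger (assumption).
LEXICON = {"the": "DET", "committee": "NOUN",
           "develops": "VERB", "standards": "NOUN"}


def tokenize(text):
    """Lowest level: split the primary data into word tokens."""
    return text.lower().split()


def tag(tokens):
    """Middle level: assign a part of speech to each token."""
    return [(token, LEXICON.get(token, "X")) for token in tokens]


def chunk(tagged):
    """Higher level: group maximal runs of DET/NOUN tokens into chunks."""
    chunks, current = [], []
    for form, pos in tagged:
        if pos in {"DET", "NOUN"}:
            current.append(form)
        elif current:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks


def annotate(text, keep_lower_levels=False):
    """Run all levels; optionally preserve the lower-level annotations."""
    tokens = tokenize(text)
    tagged = tag(tokens)
    result = {"chunks": chunk(tagged)}
    if keep_lower_levels:
        result["tokens"] = tokens
        result["pos"] = tagged
    return result


higher_only = annotate("The committee develops standards")
everything = annotate("The committee develops standards",
                      keep_lower_levels=True)
```

With the default setting only the chunk-level output survives, even though tokenization and tagging were performed, which mirrors the information extraction case mentioned above.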
The need to support annotations in the context of
the Semantic Web is one of the most important
considerations for development of the Linguistic
Annotation Framework. Annotated corpora are, at
present, primarily static entities used mainly for
training annotation software, as well as for corpus
linguistics and lexicography (which rely on anno-
tated corpora to study language use). However, the
advent of the Semantic Web and the development
of supporting technologies will significantly alter
the ways in which annotations are used and pre-
served in the future. In the context of the Semantic
Web, annotations for a variety of (at least) higher-
level linguistic and communicative features will be
preserved in web-accessible form and used by
software agents and other analytic software for
inferencing and retrieval. This demands not only
that the Linguistic Annotation Framework rely on
web technologies (e.g., RDF, OWL) for representing
annotations, but also that “layers” of annotations
for the full range of annotation types (including
named entities, time, space, and event annotation,
annotation for gesture, facial expression, etc.)
be at the same time separable (so that agents and
other analytic software can access only those
annotation types that are required for the
purpose) and mergeable (so that two or more
annotation types can be combined where necessary).
They may also need to be dynamic, in the sense
that new and/or modified information can be added
as necessary.
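A minimal sketch of what separable, mergeable, and dynamic annotation layers might look like in practice is given below. The `Annotation` class, the layer names, and the character-offset convention are assumptions chosen for illustration; the paper does not prescribe them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    layer: str   # annotation type, e.g. "named-entity" or "time"
    start: int   # offset of the first character of the annotated span
    end: int     # offset one past the last character of the span
    label: str   # category assigned within this layer


def select_layers(annotations, wanted):
    """Separability: keep only the annotation types a consumer asks for."""
    return [a for a in annotations if a.layer in wanted]


def merge_layers(*layers):
    """Mergeability: combine layers into one list ordered by position."""
    merged = [a for layer in layers for a in layer]
    return sorted(merged, key=lambda a: (a.start, a.end, a.layer))


text = "Nancy Ide visited Nancy in 2002."
entities = [Annotation("named-entity", 0, 9, "PERSON"),
            Annotation("named-entity", 18, 23, "LOCATION")]
times = [Annotation("time", 27, 31, "YEAR")]

merged = merge_layers(entities, times)
only_time = select_layers(merged, {"time"})

# Dynamic update: a new layer is added later without touching the text.
merged = merge_layers(merged, [Annotation("event", 10, 17, "VISIT")])
covered = [text[a.start:a.end] for a in merged]
```

Because every layer refers to the text only through offsets, an agent interested in temporal expressions can ignore all other layers, and a new layer can be added at any time.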
Another increasingly important concern for LAF
development is the handling of streamed data,
wherein the processor analyzes input as it is en-
countered in a linear, time-bound sequence.
Streamed data can be text, video, and audio, or
might be a stream of sensor readings, satellite
images, etc. This means that annotations attached
to the data may be (temporarily) partial,
especially where long-distance dependencies
between seen and unseen segments of the data exist.
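The following sketch illustrates a temporarily partial annotation over a token stream. The `StreamAnnotator` class and the choice of parentheses as the long-distance dependency are hypothetical and serve only to show the idea: a span is recorded as soon as its start is seen, and stays partial until its end arrives.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpanAnnotation:
    label: str
    start: int                 # index of the token that opened the span
    end: Optional[int] = None  # unknown until the closing token is seen

    @property
    def partial(self):
        return self.end is None


class StreamAnnotator:
    """Annotates tokens one at a time, in the order they arrive."""

    def __init__(self):
        self.position = 0
        self.annotations = []
        self._open = []  # annotations still waiting for their end point

    def feed(self, token):
        if token == "(":
            annotation = SpanAnnotation("parenthetical", self.position)
            self.annotations.append(annotation)
            self._open.append(annotation)
        elif token == ")" and self._open:
            self._open.pop().end = self.position
        self.position += 1


annotator = StreamAnnotator()
for token in ["the", "committee", "(", "SC4"]:
    annotator.feed(token)
partial_before = [a.partial for a in annotator.annotations]

annotator.feed(")")  # the unseen part of the stream finally arrives
partial_after = [a.partial for a in annotator.annotations]
```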
3.2 Requirements
To serve the goals of creation and use of linguistic
annotation discussed above, we identify the
following general requirements for a linguistic
annotation framework:
Expressive adequacy. The framework must provide
means to represent all varieties of linguistic infor-
mation (and possibly also other types of informa-
tion). This includes representing the full range of
information from the very general to information at
the finest level of granularity.
Media independence. The framework must handle
all potential media types, including text, audio,
video, image, etc. and should, in principle, provide
common mechanisms for handling all of them. The
framework will rely on existing or developing
standards for representing multi-media.
Semantic adequacy. Representation structures must
have a formal semantics, including definitions of
logical operations. There must exist a centralized
way of sharing descriptors and information
categories.
Incrementality. The framework must provide sup-
port for various stages of input interpretation and
output generation, both during annotation (which
may be accomplished at different times and with
different software) and use. It must also provide
for the representation of partial/under-specified
results and ambiguities, alternatives, etc. and their
merging and comparison.
Separability. As a complement to incrementality, it
must be possible for NLP applications to easily
separate or extract annotation types specific to the
task at hand.
Uniformity. Representations must utilize the same
“building blocks” and the same methods for
combining them.
Openness. The framework must not dictate repre-
sentations dependent on a single linguistic theory.
Extensibility. The framework must provide ways to
declare and interchange extensions to the central-
ized data category registry.
Human readability. Representations must be hu-
man readable, at least for creation and editing.
Processability (explicitness). Information in an
annotation scheme must be explicit—that is, the
burden of interpretation should not be left to the
processing software.
Consistency. Different mechanisms should not be
used to indicate the same type of information.
To fulfill these requirements, it is necessary to
identify a consistent underlying data model for
data and its annotations. A data model is a formal-
ized description of the data objects (in terms of
composition, attributes, class membership, appli-
cable procedures, etc.) and relations among them,
independent of their instantiation in any particular
form. A data model capable of capturing the struc-
ture and relations in diverse types of data and an-
notations is a pre-requisite for developing a
common corpus-handling environment: it impacts
the design of annotation schema, encoding formats
and data architectures, and tool architectures.
As a starting assumption, we can conceive of an
annotation as a one- or two-way link between an
annotation object and a point (or a list/set of
points) or span (or a list/set of spans) within a base
data set. Links may or may not have a semantics--
i.e., a type--associated with them. Points and spans
in the base data may themselves be objects, or sets
or lists of objects. We make several observations
concerning this assumption:
• the model assumes a fundamental linearity of
objects in the base,³ e.g., as a time line
(speech); a sequence of characters, words,
sentences, etc.; or pixel data representing images;
• the granularity of the data representation and
encoding is critical: it must be possible to
uniquely point to the smallest possible component
(e.g., character, phonetic component, pitch
signal, morpheme, word, etc.);
• an annotation scheme must be mappable to the
structures defined for annotation objects in the
model;
• the encoding scheme must be able to capture
the object structure and relations expressed in
the model, including class membership and
inheritance, therefore requiring a sophisticated
means to specify linkage within and between
documents;
• it is necessary to consider the logistics of
identifying spans by enclosing them in start and
end tags (thus enabling hierarchical grouping
of objects in the data itself), vs. explicit
addressing of start and end points, which is
required for read-only data;
• it must be possible to represent objects and
relations in some (fairly straightforward) form
that prevents information loss;
• it should be possible to represent the objects
and relations in a variety of formats suitable to
different tools and applications.

³ Note that this observation applies to the fundamental
structure of stored data. Because the targets of a relation may
be either individual objects, or sets or lists of objects,
information with more than one dimension is accommodated.
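The starting assumption above (an annotation as an optionally typed link between an annotation object and points or spans in the base data) can be sketched as a handful of data classes. All class and field names here are assumptions made for illustration and are not taken from the LAF specification.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Point:
    offset: int          # a single position in the linear base data


@dataclass(frozen=True)
class Span:
    start: int           # inclusive start offset
    end: int             # exclusive end offset


Target = Union[Point, Span]


@dataclass(frozen=True)
class Link:
    targets: Tuple[Target, ...]      # one target, or a list/set of them
    link_type: Optional[str] = None  # links may or may not be typed
    two_way: bool = False            # one- or two-way link


@dataclass
class AnnotationObject:
    features: dict
    links: list = field(default_factory=list)


def covered_text(base, link):
    """Return the base-data substrings addressed by the spans of a link."""
    return [base[t.start:t.end] for t in link.targets if isinstance(t, Span)]


base = "He gave the plan up"
verb = AnnotationObject({"category": "phrasal-verb", "lemma": "give up"})
# One annotation object linked to a list of two non-adjacent spans.
verb.links.append(Link((Span(3, 7), Span(17, 19)), link_type="realized-by"))
parts = covered_text(base, verb.links[0])
```

The example shows why a link must be able to address a list of spans: the phrasal verb is realized by two segments that are not contiguous in the linear base data.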
ISO TC37/SC 4’s goal is to develop a framework
for the design and implementation of linguistic
resource formats and processes in order to facili-
tate the exchange of information between language
processing modules. A well-defined representa-
tional framework for linguistic information will
also provide for the specification and comparison
of existing application-specific representations and
the definition of new ones, while ensuring a level
of interoperability between them. The framework
should allow for variation in annotation schemes
while at the same time enabling comparison and
evaluation, merging of different annotations, and
development of common tools for creating and
using annotated data. For this purpose we envisage
a common “pivot” format based on a data model
capable of capturing all types of information in
linguistic annotations, into and out of which site-
specific representation formats can be transduced.
This strategy is similar to that adopted in the de-
sign of languages intended to be reusable across
platforms, such as Java. The pivot format must
support the communication among all modules in
the system, and be adequate for representing not
only the end result of interpretation, but also in-
termediate results.
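The pivot strategy can be sketched as follows. The two "site-specific" formats (a slash-delimited line and a tab-separated vertical file) and the list-of-dictionaries pivot are hypothetical stand-ins; the point is only that each format needs one reader into the pivot and one writer out of it, rather than a dedicated converter for every pair of formats.

```python
def slash_to_pivot(text):
    """Read 'form/POS form/POS ...' into the pivot structure."""
    pivot = []
    for item in text.split():
        form, _, pos = item.rpartition("/")
        pivot.append({"form": form, "pos": pos})
    return pivot


def pivot_to_slash(pivot):
    return " ".join(f"{t['form']}/{t['pos']}" for t in pivot)


def vertical_to_pivot(text):
    """Read one 'form<TAB>POS' pair per line into the pivot structure."""
    return [dict(zip(("form", "pos"), line.split("\t")))
            for line in text.splitlines() if line]


def pivot_to_vertical(pivot):
    return "\n".join(f"{t['form']}\t{t['pos']}" for t in pivot)


READERS = {"slash": slash_to_pivot, "vertical": vertical_to_pivot}
WRITERS = {"slash": pivot_to_slash, "vertical": pivot_to_vertical}


def transduce(data, source, target):
    """Convert between any two registered formats via the pivot."""
    return WRITERS[target](READERS[source](data))


converted = transduce("the/DET plan/NOUN", "slash", "vertical")
round_trip = transduce(converted, "vertical", "slash")
```

Adding a third format under this scheme would require one new reader and one new writer, and it would immediately be convertible to and from every format already registered.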
4 Terms and definitions
The following terms and definitions are used in the
discussion that follows:
Annotation: The process of adding linguistic in-
formation to language data (“annotation of a cor-
pus”) or the linguistic information itself (“an
annotation”), independent of its representation. For
example, one may annotate a document for syntax
using a LISP-like representation, an XML repre-
sentation, etc.
Representation: The format in which the annota-
tion is rendered, e.g. XML, LISP, etc. independent
of its content. For example, a phrase structure syn-
tactic annotation and a dependency-based annota-
tion may both be represented using XML, even
though the annotation information itself is very
different.
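The distinction between annotation and representation can be made concrete with a short sketch in which one syntactic annotation, held as a nested tuple, is rendered in two representations. The tuple encoding and the function names are assumptions for illustration only.

```python
from xml.sax.saxutils import escape

# The annotation itself: a (label, content) tree, independent of any format.
tree = ("S", [("NP", "dogs"), ("VP", "bark")])


def to_lisp(node):
    """Render the annotation in a LISP-like bracketed representation."""
    label, content = node
    if isinstance(content, str):
        return f"({label} {content})"
    return f"({label} " + " ".join(to_lisp(child) for child in content) + ")"


def to_xml(node):
    """Render the same annotation as nested XML elements."""
    label, content = node
    if isinstance(content, str):
        body = escape(content)
    else:
        body = "".join(to_xml(child) for child in content)
    return f"<{label}>{body}</{label}>"


lisp_form = to_lisp(tree)
xml_form = to_xml(tree)
```

Both strings carry the same annotation; only the representation differs.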
Types of Annotation: We distinguish two funda-
mental types of annotation activity:
1. segmentation: delimits linguistic elements that
appear in the primary data, including
• continuous segments (which appear
contiguously in the primary data);
• super- and sub-segments, where groups of
segments comprise the parts of a larger
segment (e.g., contiguous word segments
typically comprise a sentence segment);
• discontinuous segments (linking continuous
segments);
• landmarks (e.g., time stamps) that note a
point in the primary data.
In current practice, segmental information may
or may not appear in the document containing
the primary data itself. Documents considered
to be read-only, for example, might be seg-
mented by specifying byte offsets into the pri-
mary document where a given segment begins
and ends.
2. linguistic annotation: provides linguistic
and/or communicative information about the
segments in the primary data, e.g., a morpho-
syntactic annotation in which a part of speech
and lemma are associated with each segment in
the data. Note that the identification of a seg-
ment as a word, sentence, noun phrase, etc.
also constitutes linguistic annotation.
In current practice, when it is possible to do so,
segmentation and identification of the linguis-
tic role or properties of that segment are often
combined (e.g., syntactic bracketing, or delim-
iting each word in the document with an XML
tag that identifies the segment as a word, sen-
tence, etc.).
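As a sketch of the two annotation types kept apart, the example below segments a read-only primary document by byte offsets and stores the linguistic annotation separately, keyed by segment identifier. The identifiers, the dictionary layout, and the sample sentence are assumptions made for illustration.

```python
# Read-only primary data: never modified, only pointed into.
primary = "Été arrive.".encode("utf-8")

# Segmentation: byte offsets into the primary data
# (start inclusive, end exclusive).
segments = {
    "w1": (0, 5),    # "Été" takes 5 bytes: É and é are 2 bytes each
    "w2": (6, 12),   # "arrive"
}

# Linguistic annotation: refers to segments by identifier only.
morphosyntax = {
    "w1": {"pos": "NOUN", "lemma": "été"},
    "w2": {"pos": "VERB", "lemma": "arriver"},
}


def segment_text(segment_id):
    """Recover the text of a segment from the primary data."""
    start, end = segments[segment_id]
    return primary[start:end].decode("utf-8")


def describe(segment_id):
    """Join a segment with the linguistic annotation attached to it."""
    return {"form": segment_text(segment_id), **morphosyntax[segment_id]}


first = segment_text("w1")
second = describe("w2")
```

Note that byte offsets and character offsets diverge as soon as the primary data contains multi-byte characters, so a segmentation has to state which unit it counts in.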
Stand-off annotation: Annotations layered over a
given primary document and instantiated in a

References
Bird and Liberman (2001). A formal framework for linguistic annotation.
Bunt and Romary (2002). Towards multimodal content representation.
Ide, et al. (2000). XCES: An XML-based Encoding Standard for Linguistic Corpora.
Ide, et al. Standards for Language Resources.
Frequently Asked Questions (11)
Q1. What have the authors contributed in "International standard for a linguistic annotation framework" ?

This paper describes the Linguistic Annotation Framework under development within ISO TC37 SC4 WG1. 

The authors anticipate the future development of annotation tools that provide a user-oriented interface for specifying annotation information, and which then generate annotations in the pivot format directly. Part of the work of SC4 WG1 is to provide development resources, including schemas, design patterns, and stylesheets, which will enable annotators and software developers to immediately adapt to LAF. 


Resources will be available to support the design and specification of document forms, for example:
• XML Schemas in several normal forms based on type definitions and abstract elements that can be exploited via type derivation and/or substitution group;
• XPointer design-patterns with standoff semantics;
• Schema annotations specifying mapping between document form and data model;
• Meta-stylesheet for mapping from annotated XML Schema to mapping stylesheets;
• Data-binding stylesheets with language-specific bindings (e.g. Java).


This mapping is accomplished via a rigid “dump” format, isomorphic to the data model and intended primarily for machine rather than human use. 

Note that the schema defines the classes but does not instantiate objects belonging to the class; instantiation may be accomplished directly in the annotation file, as follows (for brevity, the following examples assume appropriate namespace declarations specifying the URIs of schema and instance declarations):

<Noun rdf:about="Mydoc#W1">
  <number rdf:value="Plural"/>
</Noun>

where "Mydoc#W1" is the URI of the word being annotated as a noun.
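A minimal sketch of how such an instance might be read back with the Python standard library follows. The RDF namespace URI is the standard one; the schema namespace `http://example.org/laf-schema#` and the enclosing `rdf:RDF` wrapper are assumptions added to supply the namespace declarations that the example leaves out.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SCHEMA = "http://example.org/laf-schema#"  # hypothetical schema namespace

# The instance from the text, wrapped with the namespace declarations
# that the original example assumes but does not show.
instance = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns="{SCHEMA}">
  <Noun rdf:about="Mydoc#W1">
    <number rdf:value="Plural"/>
  </Noun>
</rdf:RDF>
""".strip()

root = ET.fromstring(instance)

# ElementTree addresses namespaced names as "{namespace-uri}local-name".
noun = root.find(f"{{{SCHEMA}}}Noun")
target = noun.get(f"{{{RDF}}}about")
number = noun.find(f"{{{SCHEMA}}}number").get(f"{{{RDF}}}value")
```

Here `target` holds the URI of the annotated word and `number` the value of its number feature. A full RDF toolkit would resolve the relative URI and build a graph; this sketch treats the instance as plain XML.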