What future work do the authors mention in the paper "International standard for a linguistic annotation framework"?

The authors anticipate the future development of annotation tools that provide a user-oriented interface for specifying annotation information, and which then generate annotations in the pivot format directly. Part of the work of SC4 WG1 is to provide development resources, including schemas, design patterns, and stylesheets, which will enable annotators and software developers to immediately adapt to LAF.

What is the purpose of the SC4 WG1?

Part of the work of SC4 WG1 is to provide development resources, including schemas, design patterns, and stylesheets, which will enable annotators and software developers to immediately adapt to LAF.

What resources will be available to support the design and specification of document forms?

Resources will be available to support the design and specification of document forms, for example:

• XML Schemas in several normal forms based on type definitions and abstract elements that can be exploited via type derivation and/or substitution group;

• XPointer design-patterns with standoff semantics;

• Schema annotations specifying mapping between document form and data model;

• Meta-stylesheet for mapping from annotated XML Schema to mapping stylesheets;

• Data-binding stylesheets with language-specific bindings (e.g. Java).

What is the purpose of the mapping?

This mapping is accomplished via a rigid “dump” format, isomorphic to the data model and intended primarily for machine rather than human use.

What is the URI of the word being annotated as a noun?

Note that the schema defines the classes but does not instantiate objects belonging to the class; instantiation may be accomplished directly in the annotation file, as follows (for brevity, the following examples assume appropriate namespace declarations specifying the URIs of schema and instance declarations): where "Mydoc#W1" is the URI of the word being annotated as a noun.

(Open Access) International standard for a linguistic annotation framework (2004) | Nancy Ide

Q: What do the authors contribute in "International standard for a linguistic annotation framework"?

This paper describes the Linguistic Annotation Framework under development within ISO TC37 SC4 WG1.

Q: What is the pre-requisite for a common corpus-handling environment?

A data model capable of capturing the structure and relations in diverse types of data and annotations is a pre-requisite for developing a common corpus-handling environment: it impacts the design of annotation schema, encoding formats and data architectures, and tool architectures.

Q: What is the importance of encoding spans?

The authors make several observations concerning this assumption:

• the model assumes a fundamental linearity of objects in the base, e.g., as a time line (speech); a sequence of characters, words, sentences, etc.; or pixel data representing images;

• the granularity of the data representation and encoding is critical: it must be possible to uniquely point to the smallest possible component (e.g., character, phonetic component, pitch signal, morpheme, word, etc.);

• an annotation scheme must be mappable to the structures defined for annotation objects in the model;

• the encoding scheme must be able to capture the object structure and relations expressed in the model, including class membership and inheritance, therefore requiring a sophisticated means to specify linkage within and between documents;

• it is necessary to consider the logistics of identifying spans by enclosing them in start and end tags (thus enabling hierarchical grouping of objects in the data itself), vs. explicit addressing of start and end points, which is required for read-only data.

International Standard for a Linguistic Annotation Framework

Nancy Ide

Dept. of Computer Science

Vassar College

Poughkeepsie, New York 12604-0520 USA

ide@cs.vassar.edu

Laurent Romary

Equipe Langue et Dialogue

LORIA/INRIA

Vandoeuvre-lès-Nancy

FRANCE

romary@loria.fr

Abstract

This paper describes the Linguistic Annotation Framework under development within ISO TC37 SC4 WG1. The Linguistic Annotation Framework is intended to serve as a basis for harmonizing existing language resources as well as developing new ones.

1 Introduction

Language resources are bodies of electronic language data used to support research and applications in the area of natural language processing. Typically, such data are enhanced (annotated) with linguistic information such as morpho-syntactic categories, syntactic or discourse structure, co-reference information, etc.; or two or more bodies may be aligned for correspondences (e.g., parallel translations, speech signal and transcription).

Over the past 15-20 years, increasingly large bodies of language resources have been created and annotated by the language engineering community. Certain fundamental representation principles have been widely adopted, such as the use of stand-off annotation (Ide and Priest-Dorman, 1996), use of XML, etc., and several attempts have been made to provide generalized annotation mechanisms and formats (e.g., XCES (Ide, et al., 2000), annotation graphs (Bird and Liberman, 2001)). However, it remains the case that annotation formats often vary considerably from resource to resource, often to satisfy constraints imposed by particular processing software. The language processing community has recognized that commonality and interoperability are increasingly imperative to enable sharing, merging, and comparison of language resources.

To provide an infrastructure and framework for language resource development and use, the International Organization for Standardization (ISO) has formed a sub-committee (SC4) under Technical Committee 37 (TC37, Terminology and Other Language Resources) devoted to Language Resource Management. The objective of ISO/TC 37/SC 4 is to prepare international standards and guidelines for effective language resource management in applications in the multilingual information society. To this end, the committee is developing principles and methods for creating, coding, processing and managing language resources, such as written corpora, lexical corpora, speech corpora, dictionary compiling and classification schemes. The focus of the work is on data modeling, markup, data exchange and the evaluation of language resources other than terminologies (which have already been treated in other sub-committees of ISO/TC 37). The worldwide use of ISO/TC 37/SC 4 standards should improve information management within industrial, technical and scientific environments, and increase efficiency in computer-supported language communication.

At present, language professionals and standardization experts are not sufficiently aware of the standardization efforts being undertaken by ISO/TC 37/SC 4. Promoting awareness of future activities and rising problems, therefore, is crucial for the success of the committee, and will be required to ensure widespread adoption of the standards it develops. An even more critical factor for the success of the committee's work is to involve, from the outset, as many and as broad a range of potential users of the standards as possible.

Within ISO/TC 37/SC 4, a working group (WG1) has been established to develop a Linguistic Annotation Framework (LAF) that can serve as a basis for harmonizing existing language resources as well as developing new ones. In order to ensure that the framework is developed based on the input and consensus of the research community, a group of experts was convened on November 21-22, 2002, at Pont-à-Mousson, France, to lay out the overall structure of the framework. Based on the determinations of the experts at the workshop, the general outlines of the Linguistic Annotation Framework have been defined. In this paper, we describe the LAF design as it has been developed so far, and solicit the input of other members of the community to inform its further development.

2 Background and rationale

The standardization of principles and methods for the collection, processing and presentation of language resources requires a distinct type of activity. Basic standards must be produced with wide-ranging applications in view. In the area of language resources, these standards should provide various technical committees of ISO, IEC and other standardizing bodies with the groundwork for building more precise standards for language resource management.

The need for harmonization of representation formats for different kinds of linguistic information is critical, as resources and information are more and more frequently merged, compared, or otherwise utilized in common systems. This is perhaps most obvious for processing multi-modal information, which must support the fusion of multimodal inputs and represent the combined and integrated contributions of different types of input (e.g., a spoken utterance combined with gesture and facial expression), and enable multimodal output (see, for example, Bunt and Romary, 2002). However, language processing applications of any kind require the integration of varieties of linguistic information, which, in today’s environment, come from potentially diverse sources. We can therefore expect use and integration of, for example, syntactic, morphological, discourse, etc. information for multiple languages, as well as information structures like domain models and ontologies.

Footnote: Participants: Nuria Bel (Universitat de Barcelona), David Durand (Brown University), Henry Thompson (University of Edinburgh), Koiti Hasida (AIST Tokyo), Eric De La Clergerie (INRIA), Lionel Clement (INRIA), Laurent Romary (LORIA), Nancy Ide (Vassar College), Kiyong Lee (Korea University), Keith Suderman (Vassar College), Aswani Kumar (LORIA), Chris Laprun (NIST), Thierry Declerck (DFKI), Jean Carletta (University of Edinburgh), Michael Strube (European Media Laboratory), Hamish Cunningham (University of Sheffield), Tomaz Erjavec (Institute Jozef Stefan), Hennie Brugman (Max-Planck-Institut für Psycholinguistik), Fabio Vitali (Università di Bologna), Key-Sun Choi (Korterm), Jean-Michel Borde (Digital Visual), Eric Kow (LORIA).

Footnote: This is particularly true for the two domains of Multimedia (ISO/IEC JTC1/SC 29/WG 11) and Education (ISO/IEC JTC1/SC 36).

We are aware that standardization is a difficult business, and that many members of the targeted communities are skeptical about imposing any sort of standards at all. There are two major arguments against the idea of standardization for language resources. First, the diversity of theoretical approaches to, in particular, the annotation of various linguistic phenomena suggests that standardization is at least impractical, if not impossible. Second, it is feared that vast amounts of existing data and processing software, which may have taken years of effort and considerable funding to develop, will be rendered obsolete by the acceptance of new standards by the community. Recognizing the validity of both of these concerns, WG1 does not seek to establish a single, definitive annotation scheme or format. Rather, the goal is to provide a framework for linguistic annotation of language resources that can serve as a reference or pivot for different annotation schemes, and which will enable their merging and/or comparison. To this end, the work of WG1 includes the following:

• analysis of the full range of annotation types and existing schemes, to identify the fundamental structural principles and content categories;

• instantiation of an abstract format capable of capturing the structure and content of linguistic annotations, based on the analysis in (1);

• establishment of a mechanism for formal definition of a set of reference content categories which can be used “off the shelf” or serve as a point of departure for precise definition of new or modified categories;

• provision of both a set of guidelines and principles for developing new annotation schemes and concrete mechanisms for their implementation, for those who wish to use them.

By situating all of the standards development squarely in the framework of XML and related standards such as RDF, OWL, etc., we hope to ensure not only that the standards developed by the committee provide for compatibility with established and widely accepted web-based technologies, but also that transduction from legacy formats into XML formats conformant to the new standards is feasible.

3 General requirements for a linguistic annotation framework

3.1 Usage scenarios

Natural language processing (NLP) applications can be applied to create annotations for linguistic data by analyzing text, speech, and data representing other modalities to determine specific linguistic attributes and associate them with the segments of that data to which they apply. NLP applications also use linguistic annotations to facilitate language understanding and generation. Development of a standard linguistic annotation framework must proceed by considering both of these “views” on linguistic annotation, and integrating the two to ensure maximal interoperability.

Annotation of linguistic data may involve multiple annotation steps, for example, morpho-syntactic tagging, syntactic analysis, entity and event recognition, semantic annotation, co-reference resolution, discourse structure analysis, etc. Annotation at higher linguistic levels typically relies on annotations at lower levels; that is, information at lower linguistic levels serves as input in the determination of higher-level annotation categories, so that annotation can be viewed as an incremental process. Depending on the application intended to use the annotations, lower-level annotations may or may not be preserved in a persistent format. That is, the output of the annotation software may consist solely of higher-level annotations, even though lower-level analysis has been performed. Note that many application programs (e.g., information extraction software) perform the analysis required for annotation of various linguistic features and utilize it internally to deliver the desired result, without preserving the annotation information.

The need to support annotations in the context of the Semantic Web is one of the most important considerations for development of the Linguistic Annotation Framework. Annotated corpora are, at present, primarily static entities used mainly for training annotation software, as well as for corpus linguistics and lexicography (which rely on annotated corpora to study language use). However, the advent of the Semantic Web and the development of supporting technologies will significantly alter the ways in which annotations are used and preserved in the future. In the context of the Semantic Web, annotations for a variety of (at least) higher-level linguistic and communicative features will be preserved in web-accessible form and used by software agents and other analytic software for inferencing and retrieval. This demands not only that the Linguistic Annotation Framework rely on web technologies (e.g., RDF, OWL) for representing annotations, but also that “layers” of annotations for the full range of annotation types (including named entities, time, space, and event annotation, annotation for gesture, facial expression, etc.) be at the same time separable (so that agents and other analytic software can access only those annotation types that are required for the purpose) and mergeable (so that two or more annotation types can be combined where necessary). They may also need to be dynamic, in the sense that new and/or modified information can be added as necessary.
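The separability and mergeability of annotation layers can be illustrated with a minimal Python sketch. It is not part of LAF: the `Annotation` class, the layer names `"pos"` and `"entity"`, the labels, and the sample sentence are all assumptions made for illustration, and character offsets stand in for whatever addressing scheme a real implementation would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    layer: str   # annotation type, e.g. "pos" or "entity" (illustrative names)
    start: int   # character offset where the annotated span begins
    end: int     # character offset just past the end of the span
    label: str   # the annotation content itself


def select_layers(annotations, wanted):
    """Separability: keep only the annotation types a consumer asks for."""
    return [a for a in annotations if a.layer in wanted]


def merge_layers(*layer_sets):
    """Mergeability: combine independently produced layers, ordered by position."""
    merged = {a for layer_set in layer_sets for a in layer_set}  # drops exact duplicates
    return sorted(merged, key=lambda a: (a.start, a.end, a.layer))


text = "Nancy Ide works at Vassar College"

# Two layers produced separately, both pointing into the same base text.
pos = [Annotation("pos", 0, 5, "NNP"), Annotation("pos", 6, 9, "NNP")]
entities = [Annotation("entity", 0, 9, "PERSON"), Annotation("entity", 19, 33, "ORG")]

combined = merge_layers(pos, entities)
only_entities = select_layers(combined, {"entity"})

for a in only_entities:
    print(a.label, repr(text[a.start:a.end]))
```

Because every layer refers to the base data by offset rather than embedding markup in it, a consumer can read one layer without parsing the others, and new layers can be added later without touching existing ones.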

Another increasingly important concern for LAF development is the handling of streamed data, wherein the processor analyzes input as it is encountered in a linear, time-bound sequence. Streamed data can be text, video, and audio, or might be a stream of sensor readings, satellite images, etc. This means that annotations attached to the data may be (temporarily) partial, especially where long-distance dependencies between seen and unseen segments of the data exist.

3.2 Requirements

To serve the goals of creation and use of linguistic annotation discussed above, we identify the following general requirements for a linguistic annotation framework:

Expressive adequacy. The framework must provide means to represent all varieties of linguistic information (and possibly also other types of information). This includes representing the full range of information from the very general to information at the finest level of granularity.

Media independence. The framework must handle all potential media types, including text, audio, video, image, etc. and should, in principle, provide common mechanisms for handling all of them. The framework will rely on existing or developing standards for representing multi-media.

Semantic adequacy. Representation structures must have a formal semantics, including definitions of logical operations. There must exist a centralized way of sharing descriptors and information categories.

Incrementality. The framework must provide support for various stages of input interpretation and output generation, both during annotation (which may be accomplished at different times and with different software) and use. It must also provide for the representation of partial/under-specified results and ambiguities, alternatives, etc. and their merging and comparison.

Separability. As a complement to incrementality, it must be possible for NLP applications to easily separate or extract annotation types specific to the task at hand.

Uniformity. Representations must utilize the same “building blocks” and the same methods for combining them.

Openness. The framework must not dictate representations dependent on a single linguistic theory.

Extensibility. The framework must provide ways to declare and interchange extensions to the centralized data category registry.

Human readability. Representations must be human readable, at least for creation and editing.

Processability (explicitness). Information in an annotation scheme must be explicit; that is, the burden of interpretation should not be left to the processing software.

Consistency. Different mechanisms should not be used to indicate the same type of information.

To fulfill these requirements, it is necessary to identify a consistent underlying data model for data and its annotations. A data model is a formalized description of the data objects (in terms of composition, attributes, class membership, applicable procedures, etc.) and relations among them, independent of their instantiation in any particular form. A data model capable of capturing the structure and relations in diverse types of data and annotations is a pre-requisite for developing a common corpus-handling environment: it impacts the design of annotation schema, encoding formats and data architectures, and tool architectures.

As a starting assumption, we can conceive of an annotation as a one- or two-way link between an annotation object and a point (or a list/set of points) or span (or a list/set of spans) within a base data set. Links may or may not have a semantics (i.e., a type) associated with them. Points and spans in the base data may themselves be objects, or sets or lists of objects. We make several observations concerning this assumption:

• the model assumes a fundamental linearity of objects in the base, e.g., as a time line (speech); a sequence of characters, words, sentences, etc.; or pixel data representing images;

• the granularity of the data representation and encoding is critical: it must be possible to uniquely point to the smallest possible component (e.g., character, phonetic component, pitch signal, morpheme, word, etc.);

• an annotation scheme must be mappable to the structures defined for annotation objects in the model;

• the encoding scheme must be able to capture the object structure and relations expressed in the model, including class membership and inheritance, therefore requiring a sophisticated means to specify linkage within and between documents;

• it is necessary to consider the logistics of identifying spans by enclosing them in start and end tags (thus enabling hierarchical grouping of objects in the data itself), vs. explicit addressing of start and end points, which is required for read-only data;

• it must be possible to represent objects and relations in some (fairly straightforward) form that prevents information loss;

• it should be possible to represent the objects and relations in a variety of formats suitable to different tools and applications.

Footnote (on the linearity of objects in the base): Note that this observation applies to the fundamental structure of stored data. Because the targets of a relation may be either individual objects, or sets or lists of objects, information with more than one dimension is accommodated.
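The starting assumption above, an annotation as a link between an annotation object and one or more points or spans in the base data, can be sketched in Python as follows. This is only one possible reading of the model: the class names `Span`, `AnnotationObject` and `Link`, the feature dictionary, and the link type `"describes"` are hypothetical and are not defined by LAF.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    start: int  # offset into the base data
    end: int    # a point can be modeled as a span with start == end


@dataclass
class AnnotationObject:
    features: dict  # e.g. {"pos": "noun"}; the content is scheme-specific


@dataclass
class Link:
    source: AnnotationObject
    targets: List[Span]              # one span, or a list of spans
    link_type: Optional[str] = None  # links may or may not carry a type


def covered_text(base: str, link: Link) -> List[str]:
    """Resolve a link's targets against the (linear) base data."""
    return [base[s.start:s.end] for s in link.targets]


base = "the cat sat"

# A typed link from a noun annotation to a single span.
noun = AnnotationObject({"pos": "noun"})
link = Link(noun, [Span(4, 7)], link_type="describes")

# An untyped link whose target is a list of spans (a discontinuous unit).
discontinuous = Link(AnnotationObject({"note": "example"}), [Span(0, 3), Span(8, 11)])

print(covered_text(base, link))
print(covered_text(base, discontinuous))
```

Addressing spans by explicit start and end points, as here, leaves the base data untouched, which is the option the observations above identify as required for read-only data.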

ISO TC37/SC 4’s goal is to develop a framework for the design and implementation of linguistic resource formats and processes in order to facilitate the exchange of information between language processing modules. A well-defined representational framework for linguistic information will also provide for the specification and comparison of existing application-specific representations and the definition of new ones, while ensuring a level of interoperability between them. The framework should allow for variation in annotation schemes while at the same time enabling comparison and evaluation, merging of different annotations, and development of common tools for creating and using annotated data. For this purpose we envisage a common “pivot” format based on a data model capable of capturing all types of information in linguistic annotations, into and out of which site-specific representation formats can be transduced. This strategy is similar to that adopted in the design of languages intended to be reusable across platforms, such as Java. The pivot format must support the communication among all modules in the system, and be adequate for representing not only the end result of interpretation, but also intermediate results.
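The pivot strategy can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the two "site-specific" formats (slash-tagged text and tab-separated lines) are toy stand-ins, and the pivot is a plain list of dictionaries rather than the actual LAF pivot format.

```python
def slash_to_pivot(text):
    """Read a toy 'token/TAG token/TAG' format into the pivot structure."""
    pivot = []
    for item in text.split():
        token, _, tag = item.rpartition("/")  # split on the last slash
        pivot.append({"token": token, "pos": tag})
    return pivot


def pivot_to_slash(pivot):
    """Write the pivot structure back out as 'token/TAG' text."""
    return " ".join(f"{entry['token']}/{entry['pos']}" for entry in pivot)


def tsv_to_pivot(text):
    """Read a toy one-token-per-line, tab-separated format into the pivot."""
    pivot = []
    for line in text.splitlines():
        token, tag = line.split("\t")
        pivot.append({"token": token, "pos": tag})
    return pivot


def pivot_to_tsv(pivot):
    """Write the pivot structure back out as tab-separated lines."""
    return "\n".join(f"{entry['token']}\t{entry['pos']}" for entry in pivot)


slash_input = "The/DT cat/NN sleeps/VBZ"

# Site format A -> pivot -> site format B, with no direct A-to-B converter.
as_tsv = pivot_to_tsv(slash_to_pivot(slash_input))
print(as_tsv)

# Going back through the pivot recovers the original representation.
round_trip = pivot_to_slash(tsv_to_pivot(as_tsv))
print(round_trip)
```

With a pivot, each format needs only one reader and one writer, so supporting n formats takes 2n converters instead of the n(n-1) needed for direct pairwise conversion.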

4 Terms and definitions

The following terms and definitions are used in the discussion that follows:

Annotation: The process of adding linguistic information to language data (“annotation of a corpus”) or the linguistic information itself (“an annotation”), independent of its representation. For example, one may annotate a document for syntax using a LISP-like representation, an XML representation, etc.

Representation: The format in which the annotation is rendered, e.g. XML, LISP, etc. independent of its content. For example, a phrase structure syntactic annotation and a dependency-based annotation may both be represented using XML, even though the annotation information itself is very different.

Types of Annotation: We distinguish two fundamental types of annotation activity:

1. segmentation: delimits linguistic elements that appear in the primary data, including

• continuous segments (appear contiguously in the primary data)

• super- and sub-segments, where groups of segments will comprise the parts of a larger segment (e.g., contiguous word segments typically comprise a sentence segment)

• discontinuous segments (linking continuous segments)

• landmarks (e.g., time stamps) that note a point in the primary data

In current practice, segmental information may or may not appear in the document containing the primary data itself. Documents considered to be read-only, for example, might be segmented by specifying byte offsets into the primary document where a given segment begins and ends.

2. linguistic annotation: provides linguistic and/or communicative information about the segments in the primary data, e.g., a morpho-syntactic annotation in which a part of speech and lemma are associated with each segment in the data. Note that the identification of a segment as a word, sentence, noun phrase, etc. also constitutes linguistic annotation.

In current practice, when it is possible to do so, segmentation and identification of the linguistic role or properties of that segment are often combined (e.g., syntactic bracketing, or delimiting each word in the document with an XML tag that identifies the segment as a word, sentence, etc.).
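The two annotation activities just defined can be kept apart in a minimal Python sketch, assuming a read-only primary document segmented by byte offsets. The segment identifiers (`w1`, `s1`), the dictionary layout, and the tag values are hypothetical; they only illustrate the distinction and are not a LAF encoding.

```python
# Read-only primary data: it is never modified by annotation.
primary = "Le chat dort.".encode("utf-8")

# Segmentation: continuous segments given as (start, end) byte offsets.
segments = {"w1": (0, 2), "w2": (3, 7), "w3": (8, 12)}

# A super-segment built from sub-segments (word segments forming a sentence).
super_segments = {"s1": ["w1", "w2", "w3"]}

# Linguistic annotation: part of speech and lemma attached to segment ids.
morpho = {
    "w1": {"pos": "DET", "lemma": "le"},
    "w2": {"pos": "NOUN", "lemma": "chat"},
    "w3": {"pos": "VERB", "lemma": "dormir"},
}


def segment_text(data, segs, seg_id):
    """Recover the text of a segment from its byte offsets."""
    start, end = segs[seg_id]
    return data[start:end].decode("utf-8")


def super_segment_words(data, segs, supers, super_id):
    """List the word forms that make up a super-segment."""
    return [segment_text(data, segs, seg_id) for seg_id in supers[super_id]]


for seg_id in super_segments["s1"]:
    print(seg_id, segment_text(primary, segments, seg_id), morpho[seg_id])
```

Keeping `segments` and `morpho` in separate structures means the same segmentation can be reused by several linguistic annotations, each of which refers to segments only by identifier.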

Stand-off annotation: Annotations layered over a

given primary document and instantiated in a


References

A formal framework for linguistic annotation

XCES: An XML-based Encoding Standard for Linguistic Corpora

Corpus Encoding Standard (CES)

Towards multimodal content representation

Standards for Language Resources
