International standard for a linguistic annotation framework
Frequently Asked Questions (11)
Q2. What future work do the authors mention in the paper "International standard for a linguistic annotation framework"?
The authors anticipate the future development of annotation tools that provide a user-oriented interface for specifying annotation information, and which then generate annotations in the pivot format directly. Part of the work of SC4 WG1 is to provide development resources, including schemas, design patterns, and stylesheets, which will enable annotators and software developers to immediately adapt to LAF.
Q3. What is the pre-requisite for a common corpus-handling environment?
A data model capable of capturing the structure and relations in diverse types of data and annotations is a pre-requisite for developing a common corpus-handling environment: it impacts the design of annotation schema, encoding formats and data architectures, and tool architectures.
Q4. What is the purpose of the SC4 WG1?
Part of the work of SC4 WG1 is to provide development resources, including schemas, design patterns, and stylesheets, which will enable annotators and software developers to immediately adapt to LAF.
Q5. What resources will be available to support the design and specification of document forms?
Resources will be available to support the design and specification of document forms, for example:
• XML Schemas in several normal forms based on type definitions and abstract elements that can be exploited via type derivation and/or substitution group;
• XPointer design patterns with standoff semantics;
• Schema annotations specifying mapping between document form and data model;
• Meta-stylesheet for mapping from annotated XML Schema to mapping stylesheets;
• Data-binding stylesheets with language-specific bindings (e.g., Java).
Q6. What is the need to support annotations in the context of the Semantic Web?
Supporting annotations in the context of the Semantic Web demands that the Linguistic Annotation Framework not only rely on web technologies (e.g., RDF, OWL) for representing annotations, but also that "layers" of annotations for the full range of annotation types (including named entities, time, space, and event annotation, annotation for gesture, facial expression, etc.) be at the same time separable (so that agents and other analytic software can access only those annotation types that are required for their purpose) and mergeable (so that two or more annotation types can be combined where necessary).
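The separable and mergeable property described above can be sketched in a few lines. This is a minimal illustration, not the LAF representation itself: the `Annotation` class, the layer names, and the two helper functions are hypothetical, and plain Python objects stand in for the RDF/OWL structures the answer mentions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One standoff annotation; all field names are illustrative assumptions."""
    layer: str   # annotation type, e.g. "named_entity" or "time"
    start: int   # character offset into the base text (inclusive)
    end: int     # character offset into the base text (exclusive)
    label: str


def select_layers(annotations, wanted):
    """Separability: keep only the annotation types a consumer asks for."""
    wanted = set(wanted)
    return [a for a in annotations if a.layer in wanted]


def merge_layers(*layers):
    """Mergeability: combine several layers, ordered by position in the base."""
    merged = [a for layer in layers for a in layer]
    return sorted(merged, key=lambda a: (a.start, a.end, a.layer))


base = "Ada met Bob on Monday."
entities = [
    Annotation("named_entity", 0, 3, "PERSON"),
    Annotation("named_entity", 8, 11, "PERSON"),
]
times = [Annotation("time", 15, 21, "DATE")]

merged = merge_layers(entities, times)
only_time = select_layers(merged, ["time"])
```

Because every layer points into the same unmodified base text, merging is a matter of concatenating and ordering, and separating is a matter of filtering by type.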
Q7. What is the importance of encoding spans?
The authors make several observations concerning this assumption:
• the model assumes a fundamental linearity of objects in the base, e.g., as a time line (speech); a sequence of characters, words, sentences, etc.; or pixel data representing images;
• the granularity of the data representation and encoding is critical: it must be possible to uniquely point to the smallest possible component (e.g., character, phonetic component, pitch signal, morpheme, word, etc.);
• an annotation scheme must be mappable to the structures defined for annotation objects in the model;
• the encoding scheme must be able to capture the object structure and relations expressed in the model, including class membership and inheritance, therefore requiring a sophisticated means to specify linkage within and between documents;
• it is necessary to consider the logistics of identifying spans by enclosing them in start and …
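The first two observations (a linear base and addressing at the smallest useful granularity) can be illustrated with character offsets. This is a rough sketch under the assumption that the base is a plain string; `Span` and `resolve` are hypothetical names and not part of any LAF schema.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """A region of a linear base, given as character offsets (assumed unit)."""
    start: int  # inclusive
    end: int    # exclusive


def resolve(base: str, span: Span) -> str:
    """Return the base material a span points to, rejecting invalid offsets."""
    if not (0 <= span.start <= span.end <= len(base)):
        raise ValueError(f"span {span} falls outside the base data")
    return base[span.start:span.end]


base = "dogs bark"

# Spans at different granularities can address the same base material.
word = Span(0, 4)     # the word "dogs"
stem = Span(0, 3)     # the morpheme "dog"
plural = Span(3, 4)   # the plural morpheme "s"
```

Character-level offsets are the finest unit here, so word-level and morpheme-level annotations can both refer to the same base without modifying it.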
Q8. What is the main requirement for the Linguistic Annotation Framework?
The framework must provide means to represent all varieties of linguistic information (and possibly also other types of information).
Q9. What is the purpose of the ISO/TC 37/SC4?
To provide an infrastructure and framework for language resource development and use, the International Organization for Standardization (ISO) has formed a sub-committee (SC4) under Technical Committee 37 (TC37, Terminology and Other Language Resources) devoted to Language Resource Management.
Q10. What is the purpose of the mapping?
This mapping is accomplished via a rigid “dump” format, isomorphic to the data model and intended primarily for machine rather than human use.
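One way to picture a dump format that is isomorphic to the data model is a lossless round trip: serializing and reloading must give back exactly the same structure. The sketch below uses JSON purely as a stand-in for the sake of a short runnable example; the record layout (`id`, `span`, `features`) and the `dump`/`load` helpers are assumptions, not the format defined by the paper.

```python
import json


def dump(annotations):
    """Serialize the in-memory model to a compact, machine-oriented string."""
    # sort_keys and fixed separators make the output deterministic.
    return json.dumps(annotations, sort_keys=True, separators=(",", ":"))


def load(text):
    """Rebuild the in-memory model from the dumped string."""
    return json.loads(text)


# A hypothetical annotation record in the in-memory data model.
model = [
    {"id": "W1", "span": [0, 4], "features": {"cat": "Noun", "number": "Plural"}},
]

pivot = dump(model)
restored = load(pivot)
```

The dumped string is not meant to be pleasant to read; what matters is that nothing in the model is lost or reinterpreted on the way through.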
Q11. What is the URI of the word being annotated as a noun?
Note that the schema defines the classes but does not instantiate objects belonging to the class; instantiation may be accomplished directly in the annotation file, as follows (for brevity, the following examples assume appropriate namespace declarations specifying the URIs of schema and instance declarations):
    <Noun rdf:about="Mydoc#W1">
        <number rdf:value="Plural"/>
    </Noun>
where "Mydoc#W1" is the URI of the word being annotated as a noun.
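To make the instance above concrete, the following sketch supplies the namespace declarations the text leaves implicit and reads the annotation back with Python's standard `xml.etree.ElementTree`. The `rdf` namespace URI is the standard one; the schema namespace `http://example.org/schema#` and the enclosing `rdf:RDF` element are assumptions added only so the fragment parses. ElementTree is a generic XML parser, not an RDF processor, so this shows the document structure rather than the RDF semantics.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SCHEMA_NS = "http://example.org/schema#"  # hypothetical schema namespace

# The Noun instance from the text, wrapped with assumed namespace declarations.
doc = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns="{SCHEMA_NS}">
  <Noun rdf:about="Mydoc#W1">
    <number rdf:value="Plural"/>
  </Noun>
</rdf:RDF>"""

root = ET.fromstring(doc)

# ElementTree addresses namespaced names as "{namespace-uri}local-name".
noun = root.find(f"{{{SCHEMA_NS}}}Noun")
subject_uri = noun.get(f"{{{RDF_NS}}}about")
number = noun.find(f"{{{SCHEMA_NS}}}number").get(f"{{{RDF_NS}}}value")
```

Here `subject_uri` holds the URI of the annotated word and `number` holds the value of its `number` property, mirroring the class instance described in the answer.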