scispace - formally typeset

Showing papers on "Annotation published in 2005"


Journal ArticleDOI
TL;DR: Blast2GO (B2G), a research tool designed with the main purpose of enabling Gene Ontology (GO) based data mining on sequence data for which no GO annotation is yet available, is presented.
Abstract: Summary: We present here Blast2GO (B2G), a research tool designed with the main purpose of enabling Gene Ontology (GO) based data mining on sequence data for which no GO annotation is yet available. B2G joins in one application GO annotation based on similarity searches with statistical analysis and highlighted visualization on directed acyclic graphs. This tool offers a suitable platform for functional genomics research in non-model species. B2G is an intuitive and interactive desktop application that allows monitoring and comprehension of the whole annotation and analysis process. Availability: Blast2GO is freely available via Java Web Start at http://www.blast2go.de Supplementary material: http://www.blast2go.de -> Evaluation Contact: [email protected]; [email protected]

10,092 citations


Journal ArticleDOI
TL;DR: The subsystem approach is described, the first release of the growing library of populated subsystems is offered, and the SEED is the first annotation environment that supports this model of annotation.
Abstract: The release of the 1000th complete microbial genome will occur in the next two to three years. In anticipation of this milestone, the Fellowship for Interpretation of Genomes (FIG) launched the Project to Annotate 1000 Genomes. The project is built around the principle that the key to improved accuracy in high-throughput annotation technology is to have experts annotate single subsystems over the complete collection of genomes, rather than having an annotation expert attempt to annotate all of the genes in a single genome. Using the subsystems approach, all of the genes implementing the subsystem are analyzed by an expert in that subsystem. An annotation environment was created where populated subsystems are curated and projected to new genomes. A portable notion of a populated subsystem was defined, and tools developed for exchanging and curating these objects. Tools were also developed to resolve conflicts between populated subsystems. The SEED is the first annotation environment that supports this model of annotation. Here, we describe the subsystem approach, and offer the first release of our growing library of populated subsystems. The initial release of data includes 180,177 distinct proteins with 2,133 distinct functional roles. This data comes from 173 subsystems and 383 different organisms.

1,896 citations


Journal ArticleDOI
01 Jan 2005
TL;DR: The manual annotation process and the results of an inter-annotator agreement study on a 10,000-sentence corpus of articles drawn from the world press are presented.
Abstract: This paper describes a corpus annotation project to study issues in the manual annotation of opinions, emotions, sentiments, speculations, evaluations and other private states in language. The resulting corpus annotation scheme is described, as well as examples of its use. In addition, the manual annotation process and the results of an inter-annotator agreement study on a 10,000-sentence corpus of articles drawn from the world press are presented.

1,818 citations


Journal ArticleDOI
TL;DR: This work has developed a combined evidence-model TE annotation pipeline, analogous to systems used for gene annotation, by integrating results from multiple homology-based and de novo TE identification methods, and annotated “TE models” in Drosophila melanogaster Release 4 genomic sequences.
Abstract: Transposable elements (TEs) are mobile, repetitive sequences that make up significant fractions of metazoan genomes. Despite their near ubiquity and importance in genome and chromosome biology, most efforts to annotate TEs in genome sequences rely on the results of a single computational program, RepeatMasker. In contrast, recent advances in gene annotation indicate that high-quality gene models can be produced from combining multiple independent sources of computational evidence. To elevate the quality of TE annotations to a level comparable to that of gene models, we have developed a combined evidence-model TE annotation pipeline, analogous to systems used for gene annotation, by integrating results from multiple homology-based and de novo TE identification methods. As proof of principle, we have annotated "TE models" in Drosophila melanogaster Release 4 genomic sequences using the combined computational evidence derived from RepeatMasker, BLASTER, TBLASTX, all-by-all BLASTN, RECON, TE-HMM and the previous Release 3.1 annotation. Our system is designed for use with the Apollo genome annotation tool, allowing automatic results to be curated manually to produce reliable annotations. The euchromatic TE fraction of D. melanogaster is now estimated at 5.3% (cf. 3.86% in Release 3.1), and we found a substantially higher number of TEs (n = 6,013) than previously identified (n = 1,572). Most of the new TEs derive from small fragments a few hundred nucleotides long and from highly abundant families not previously annotated (e.g., INE-1). We also estimated that 518 TE copies (8.6%) are inserted into at least one other TE, forming a nest of elements. The pipeline allows rapid and thorough annotation of even the most complex TE models, including highly deleted and/or nested elements such as those often found in heterochromatic sequences. Our pipeline can be easily adapted to other genome sequences, such as those of the D. melanogaster heterochromatin or other species in the genus Drosophila.

364 citations


Proceedings ArticleDOI
13 Mar 2005
TL;DR: This paper examines current Semantic Web annotation platforms that provide annotation and related services, and reviews their architecture, approaches and performance.
Abstract: The realization of the Semantic Web requires the widespread availability of semantic annotations for existing and new documents on the Web. Semantic annotations tag ontology class instance data and map it into ontology classes. The fully automatic creation of semantic annotations is an unsolved problem. Instead, current systems focus on the semi-automatic creation of annotations. The Semantic Web also requires facilities for the storage of annotations and ontologies, user interfaces, access APIs, and other features to fully support annotation usage. This paper examines current Semantic Web annotation platforms that provide annotation and related services, and reviews their architecture, approaches and performance.

361 citations


Book ChapterDOI
29 May 2005
TL;DR: This work proposes a model for the exploitation of ontology-based KBs to improve search over large document repositories, which includes an ontology-based scheme for the semi-automatic annotation of documents, and a retrieval system.
Abstract: Semantic search has been one of the motivations of the Semantic Web since it was envisioned. We propose a model for the exploitation of ontology-based KBs to improve search over large document repositories. Our approach includes an ontology-based scheme for the semi-automatic annotation of documents, and a retrieval system. The retrieval model is based on an adaptation of the classic vector-space model, including an annotation weighting algorithm, and a ranking algorithm. Semantic search is combined with keyword-based search to achieve tolerance to KB incompleteness. Our proposal is illustrated with sample experiments showing improvements with respect to keyword-based search, and providing ground for further research and discussion.
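The combination of semantic and keyword search described above can be sketched as a weighted mix of two vector-space scores; this is an illustrative reading, not the authors' code, and the cosine scorer, the linear weighting and the parameter alpha are all assumptions:

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse term->weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def combined_score(query_concepts, query_terms, doc_annotations, doc_terms, alpha=0.5):
    # Semantic evidence: cosine over weighted ontology-concept annotations.
    # Keyword evidence: cosine over plain term vectors, which provides the
    # fallback when the KB has no annotations for a document.
    return (alpha * cosine(query_concepts, doc_annotations)
            + (1 - alpha) * cosine(query_terms, doc_terms))
```

With alpha = 0.5 both evidence sources count equally; in practice the weight would be tuned, and the keyword term is what gives tolerance to KB incompleteness.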

270 citations


Patent
07 Sep 2005
TL;DR: In this article, the authors propose a method for automatically navigating a document (252) in a display (250) having at least a first portion (252) and a second portion (254): an annotation generated by a user at a first client is associated with a first indication (256) in the document, and navigation input at a second client causes the annotation to be displayed.
Abstract: The invention relates generally to shared annotation systems. More particularly, the invention provides a method for automatically navigating a document (252) in a display (250) having at least a first portion (252) and a second portion (254), the method comprising: receiving an annotation (264) related to the document (252), the annotation (264) generated by a user at a first client; associating the annotation (264) with a first indication (256) in the document (252); receiving, from a user at a second client, an input to navigate a first portion of a display at the second client, the input causing the first indication to be displayed in the first portion of the display; and in response to the input, automatically displaying the annotation in a second portion of the display at the second client.

208 citations


Journal ArticleDOI
TL;DR: A series of functional data types has been annotated for the rice genome that includes alignment with genetic markers, assignment of gene ontologies, identification of flanking sequence tags, alignment with homologs from related species, and syntenic mapping with other cereal species.
Abstract: We have developed a rice (Oryza sativa) genome annotation database (Osa1) that provides structural and functional annotation for this emerging model species. Using the sequence of O. sativa subsp. japonica cv Nipponbare from the International Rice Genome Sequencing Project, pseudomolecules, or virtual contigs, of the 12 rice chromosomes were constructed. Our most recent release, version 3, represents our third build of the pseudomolecules and is composed of 98% finished sequence. Genes were identified using a series of computational methods developed for Arabidopsis (Arabidopsis thaliana) that were modified for use with the rice genome. In release 3 of our annotation, we identified 57,915 genes, of which 14,196 are related to transposable elements. Of the remaining 43,719 non-transposable-element-related genes, 18,545 (42.4%) were annotated with a putative function, 5,777 (13.2%) were annotated as encoding an expressed protein with no known function, and the remaining 19,397 (44.4%) were annotated as encoding a hypothetical protein. Multiple splice forms (5,873) were detected for 2,538 genes, resulting in a total of 61,250 gene models in the rice genome. We incorporated experimental evidence into 18,252 gene models to improve the quality of the structural annotation. A series of functional data types has been annotated for the rice genome that includes alignment with genetic markers, assignment of gene ontologies, identification of flanking sequence tags, alignment with homologs from related species, and syntenic mapping with other cereal species. All structural and functional annotation data are available through interactive search and display windows as well as through download of flat files. To integrate the data with other genome projects, the annotation data are available through a Distributed Annotation System and a Genome Browser. All data can be obtained through the project Web pages at http://rice.tigr.org.

208 citations


Book ChapterDOI
20 Jul 2005
TL;DR: Long-term reusability of metabolomic data needs both correct metabolite annotations and consistent biological classifications; a system that combines mass spectrometric and biological metadata to achieve this goal has been developed.
Abstract: Unbiased metabolomic surveys are used for physiological, clinical and genomic studies to infer genotype-phenotype relationships. Long-term reusability of metabolomic data needs both correct metabolite annotations and consistent biological classifications. We have developed a system that combines mass spectrometric and biological metadata to achieve this goal. First, an XML-based LIMS system enables entering biological metadata for steering laboratory workflows by generating 'classes' that reflect experimental designs. After data acquisition, a relational database system (BinBase) is employed for automated metabolite annotation. It consists of a manifold filtering algorithm for matching and generating database objects by utilizing mass spectral metadata such as 'retention index', 'purity', 'signal/noise', and the biological information class. Once annotations and quantitations are complete for a specific larger experiment, this information is fed back into the LIMS system to notify supervisors and users. Eventually, qualitative and quantitative results are released to the public for downloads or complex queries.
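A minimal sketch of the kind of filtering step such an annotation system might perform, assuming each spectrum carries a retention index, purity and signal/noise value; the field names and thresholds below are invented for illustration and are not from the paper:

```python
def annotate(spectrum, bins, ri_tol=5.0, min_purity=0.5, min_sn=10.0):
    """Match one spectrum to the closest database bin by retention index.

    spectrum: dict with 'ri', 'purity', 'sn'; bins: dicts with 'ri'.
    Returns the matched bin, or None if quality filters or matching fail.
    All thresholds here are illustrative assumptions.
    """
    # Quality filters: reject low-purity or noisy spectra outright.
    if spectrum["purity"] < min_purity or spectrum["sn"] < min_sn:
        return None
    # Candidate bins within the retention-index tolerance window.
    candidates = [b for b in bins if abs(b["ri"] - spectrum["ri"]) <= ri_tol]
    # Closest candidate wins; None if the window is empty.
    return min(candidates, key=lambda b: abs(b["ri"] - spectrum["ri"]), default=None)
```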

201 citations


Journal ArticleDOI
TL;DR: This work presents AutoFACT, a fully automated and customizable annotation tool that assigns biologically informative functions to a sequence and determines the most informative functional description by combining multiple BLAST reports from several user-selected databases.
Abstract: Background: Assignment of function to new molecular sequence data is an essential step in genomics projects. The usual process involves similarity searches of a given sequence against one or more databases, an arduous process for large datasets. Results: We present AutoFACT, a fully automated and customizable annotation tool that assigns biologically informative functions to a sequence. Key features of this tool are that it (1) analyzes nucleotide and protein sequence data; (2) determines the most informative functional description by combining multiple BLAST reports from several user-selected databases; (3) assigns putative metabolic pathways, functional classes, enzyme classes, GeneOntology terms and locus names; and (4) generates output in HTML, text and GFF formats for the user's convenience. We have compared AutoFACT to four well-established annotation pipelines. The error rate of functional annotation is estimated to be only between 1–2%. Comparison of AutoFACT to the traditional top-BLAST-hit annotation method shows that our procedure increases the number of functionally informative annotations by approximately 50%. Conclusion: AutoFACT will serve as a useful annotation tool for smaller sequencing groups lacking dedicated bioinformatics staff. It is implemented in PERL and runs on LINUX/UNIX platforms. AutoFACT is available at http://megasun.bch.umontreal.ca/Software/AutoFACT.htm.
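The "most informative description" idea can be illustrated roughly as follows; this is a hedged sketch, not AutoFACT's actual algorithm, and the uninformative-phrase list and bit-score cutoff are assumptions:

```python
# Phrases treated as uninformative (an assumed, incomplete list).
UNINFORMATIVE = ("hypothetical protein", "unknown", "unnamed protein product")

def pick_annotation(hits, bit_score_cutoff=40.0):
    """Choose a description from BLAST hits across several databases.

    hits: list of (database, bit_score, description) tuples, any order.
    Walks hits from best to worst score and returns the first description
    that clears the score cutoff and is not an uninformative phrase.
    """
    for db, score, desc in sorted(hits, key=lambda h: -h[1]):
        if score < bit_score_cutoff:
            break  # remaining hits are weaker still
        if not any(p in desc.lower() for p in UNINFORMATIVE):
            return f"{desc} [{db}]"
    return "unassigned protein"
```

The point of combining reports is visible here: a weaker hit from one database can supply a functional description when the top hit is a "hypothetical protein".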

198 citations


Journal ArticleDOI
TL;DR: A biological perspective on the evaluation is given, how the GOA team annotates GO using literature is explained, and some suggestions to improve the precision of future text-retrieval and extraction techniques are offered.
Abstract: Background The Gene Ontology Annotation (GOA) database http://www.ebi.ac.uk/GOA aims to provide high-quality supplementary GO annotation to proteins in the UniProt Knowledgebase. Like many other biological databases, GOA gathers much of its content from the careful manual curation of literature. However, as both the volume of literature and of proteins requiring characterization increases, the manual processing capability can become overloaded. Consequently, semi-automated aids are often employed to expedite the curation process. Traditionally, electronic techniques in GOA depend largely on exploiting the knowledge in existing resources such as InterPro. However, in recent years, text mining has been hailed as a potentially useful tool to aid the curation process. To encourage the development of such tools, the GOA team at EBI agreed to take part in the functional annotation task of the BioCreAtIvE (Critical Assessment of Information Extraction systems in Biology) challenge. BioCreAtIvE task 2 was an experiment to test if automatically derived classification using information retrieval and extraction could assist expert biologists in the annotation of the GO vocabulary to the proteins in the UniProt Knowledgebase. GOA provided the training corpus of over 9000 manual GO annotations extracted from the literature. For the test set, we provided a corpus of 200 new Journal of Biological Chemistry articles used to annotate 286 human proteins with GO terms. A team of experts manually evaluated the results of 9 participating groups, each of which provided highlighted sentences to support their GO and protein annotation predictions. Here, we give a biological perspective on the evaluation, explain how we annotate GO using literature and offer some suggestions to improve the precision of future text-retrieval and extraction techniques. 
Finally, we provide the results of the first inter-annotator agreement study for manual GO curation, as well as an assessment of our current electronic GO annotation strategies.

Proceedings ArticleDOI
10 May 2005
TL;DR: This paper presents C-PANKOW (Context-driven PANKOW), which alleviates several shortcomings of PANKOW and uses the annotation context to distinguish the significance of a pattern match for the given annotation task.
Abstract: Without the proliferation of formal semantic annotations, the Semantic Web is certainly doomed to failure. In earlier work we presented a new paradigm to avoid this: the 'Self Annotating Web', in which globally available knowledge is used to annotate resources such as web pages. In particular, we presented a concrete method instantiating this paradigm, called PANKOW (Pattern-based ANnotation through Knowledge On the Web). In PANKOW, a named entity to be annotated is put into several linguistic patterns that convey competing semantic meanings. The patterns that are matched most often on the Web indicate the meaning of the named entity, leading to automatic or semi-automatic annotation. In this paper we present C-PANKOW (Context-driven PANKOW), which alleviates several shortcomings of PANKOW. First, by downloading abstracts and processing them off-line, we avoid the generation of a large number of linguistic patterns and a correspondingly large number of Google queries. Second, by linguistically analyzing and normalizing the downloaded abstracts, we increase the coverage of our pattern matching mechanism and overcome several limitations of the earlier pattern generation process. Third, we use the annotation context in order to distinguish the significance of a pattern match for the given annotation task. Our experiments show that C-PANKOW inherits all the advantages of PANKOW (no training required, etc.), but in addition it is far more efficient and effective.
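The core PANKOW step, instantiating patterns for a named entity and ranking candidate concepts by how often the instantiations match, can be sketched in miniature; here a small string stands in for Web hit counts, and the pattern set is a simplified stand-in for the paper's patterns:

```python
# Hearst-style patterns; {e} is the entity, {c} a candidate concept.
PATTERNS = ("{e} is a {c}", "{c}s such as {e}", "{e} and other {c}s")

def rank_concepts(entity, concepts, corpus):
    """Count pattern matches per concept and return (best concept, counts).

    In PANKOW proper the counts would come from Web search hit counts;
    a plain substring count over a corpus string is used here instead.
    """
    counts = {c: sum(corpus.count(p.format(e=entity, c=c)) for p in PATTERNS)
              for c in concepts}
    return max(counts, key=counts.get), counts
```

C-PANKOW's refinements, off-line processing of downloaded abstracts and context-sensitive weighting of matches, would replace the raw substring counting step above.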

Patent
06 Apr 2005
TL;DR: In this paper, the authors propose a method for automatically navigating a document in a display having at least a first portion and a second portion, the method comprising: receiving an annotation related to the document, the annotation generated by a user at a first client; associating the annotation with a first indication in the document; and receiving, from a user at a second client, an input to navigate the first portion of a display at the second client.
Abstract: The invention relates generally to shared annotation systems. More particularly, the invention provides a method for automatically navigating a document in a display having at least a first portion and a second portion, the method comprising: receiving an annotation related to the document, the annotation generated by a user at a first client; associating the annotation with a first indication in the document; receiving, from a user at a second client, an input to navigate a first portion of a display at the second client, the input causing the first indication to be displayed in the first portion of the display; and in response to the input, automatically displaying the annotation in a second portion of the display at the second client.

Journal ArticleDOI
TL;DR: Concepts provided by GO are currently the most extended set of terms used for annotating gene products, thus they were explored to assess how effectively text mining tools are able to extract those annotations automatically.
Abstract: Molecular Biology has accumulated substantial amounts of data concerning the functions of genes and proteins. Information relating to functional descriptions is generally extracted manually from textual data and stored in biological databases to build up annotations for large collections of gene products. Those annotation databases are crucial for the interpretation of large scale analysis approaches using bioinformatics or experimental techniques. Due to the growing accumulation of functional descriptions in biomedical literature the need for text mining tools to facilitate the extraction of such annotations is urgent. In order to make text mining tools useable in real world scenarios, for instance to assist database curators during annotation of protein function, comparisons and evaluations of different approaches on full text articles are needed. The Critical Assessment for Information Extraction in Biology (BioCreAtIvE) contest consists of a community wide competition aiming to evaluate different strategies for text mining tools, as applied to biomedical literature. We report on task two which addressed the automatic extraction and assignment of Gene Ontology (GO) annotations of human proteins, using full text articles. The predictions of task 2 are based on triplets of protein – GO term – article passage. The annotation-relevant text passages were returned by the participants and evaluated by expert curators of the GO annotation (GOA) team at the European Bioinformatics Institute (EBI). Each participant could submit up to three results for each sub-task comprising task 2. In total more than 15,000 individual results were provided by the participants. The curators evaluated, in addition to the annotation itself, whether the protein and the GO term were correctly predicted and traceable through the submitted text fragment.
Concepts provided by GO are currently the most extended set of terms used for annotating gene products, thus they were explored to assess how effectively text mining tools are able to extract those annotations automatically. Although the obtained results are promising, they are still far from reaching the required performance demanded by real world applications. Among the principal difficulties encountered to address the proposed task, were the complex nature of the GO terms and protein names (the large range of variants which are used to express proteins and especially GO terms in free text), and the lack of a standard training set. A range of very different strategies were used to tackle this task. The dataset generated in line with the BioCreative challenge is publicly available and will allow new possibilities for training information extraction methods in the domain of molecular biology.

Journal ArticleDOI
TL;DR: Over the course of three years, TIGR has completed its effort to standardize the structural and functional annotation of the Arabidopsis genome, with special emphasis on the final annotation release (version 5).
Abstract: Since the initial publication of its complete genome sequence, Arabidopsis thaliana has become more important than ever as a model for plant research. However, the initial genome annotation was submitted by multiple centers using inconsistent methods, making the data difficult to use for many applications. Over the course of three years, TIGR has completed its effort to standardize the structural and functional annotation of the Arabidopsis genome. Using both manual and automated methods, Arabidopsis gene structures were refined and gene products were renamed and assigned to Gene Ontology categories. We present an overview of the methods employed, tools developed, and protocols followed, summarizing the contents of each data release with special emphasis on our final annotation release (version 5). Over the entire period, several thousand new genes and pseudogenes were added to the annotation. Approximately one third of the originally annotated gene models were significantly refined yielding improved gene structure annotations, and every protein-coding gene was manually inspected and classified using Gene Ontology terms.


01 Jan 2005
TL;DR: An inter-annotator agreement test indicated that the writing style rather than the contents of the research abstracts is the source of the difficulty in tree annotation, and that annotation can be done stably by linguists without much knowledge of biology, given appropriate guidelines regarding linguistic phenomena particular to scientific texts.
Abstract: A linguistically annotated corpus based on texts in the biomedical domain has been constructed to tune natural language processing (NLP) tools for bio-text mining. As the focus of information extraction is shifting from "nominal" information such as named entities to "verbal" information such as functions and interactions of substances, the application of parsers has become one of the key technologies, and thus a corpus annotated for the syntactic structure of sentences is in demand. A subset of the GENIA corpus consisting of 500 MEDLINE abstracts has been annotated for syntactic structure in an XML-based format based on the Penn Treebank II (PTB) scheme. An inter-annotator agreement test indicated that the writing style rather than the contents of the research abstracts is the source of the difficulty in tree annotation, and that annotation can be done stably by linguists without much knowledge of biology, given appropriate guidelines regarding linguistic phenomena particular to scientific texts.

Patent
23 Dec 2005
TL;DR: In this article, an active learning component trains an annotation model and proposes annotations to documents based on the annotation model, and a request handler conveys annotation requests from the graphical user interface to the active learning component.
Abstract: A document annotation system includes a graphical user interface used by an annotator to annotate documents. An active learning component trains an annotation model and proposes annotations to documents based on the annotation model. A request handler conveys annotation requests from the graphical user interface to the active learning component, conveys proposed annotations from the active learning component to the graphical user interface, and selectably conveys evaluation requests from the graphical user interface to a domain expert. During annotation, at least some low probability proposed annotations are presented to the annotator by the graphical user interface. The presented low probability proposed annotations enhance training of the annotation model by the active learning component.
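The presentation of low-probability proposals to the annotator resembles uncertainty sampling in active learning; a minimal sketch of that selection step, with the threshold and batch size as assumptions:

```python
def select_for_review(proposals, threshold=0.3, k=5):
    """Pick proposed annotations worth showing to the human annotator.

    proposals: list of (annotation, model_probability) pairs.
    Returns up to k proposals below the confidence threshold, least
    confident first, since correcting these trains the model the most.
    """
    low = [p for p in proposals if p[1] < threshold]
    return sorted(low, key=lambda p: p[1])[:k]
```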

27 Jun 2005
TL;DR: This paper describes a Web-based tool for creating and sharing annotations, together with learning support mechanisms, including full and group annotation sharing, designed to promote students' motivation for annotation, and investigates the effect of its use on the learning of college students.
Abstract: Web-based learning has become an important way to enhance learning and teaching, offering many learning opportunities. A limitation of current Web-based learning is the restricted ability of students to personalize and annotate the learning materials. Providing personalized tools and analyzing some types of learning behavior, such as students' annotation, has attracted attention as a means to enhance Web-based learning. We describe a Web-based tool for creating and sharing annotations and investigate the effect on learning of its use with college students. First, an annotation tool was designed and implemented for the research. Second, learning support mechanisms, including full and group annotation sharing, were developed to promote students' motivation for annotation. Lastly, experiments with individual and shared annotation were conducted and the results show that the influence of annotation on learning performance becomes stronger with the use of sharing mechanisms. We conclude that there is value in further study of collaborative learning through shared annotation.

Journal ArticleDOI
TL;DR: Developing a reliability index for database annotations, rather than depending exclusively on efforts to correct the databases themselves, is a more realistic option for the annotation of protein function at genomic scale.

Proceedings ArticleDOI
15 Aug 2005
TL;DR: The paper proposes methods to use a hierarchy defined on the annotation words derived from a text ontology to improve automatic image annotation and retrieval and demonstrates improvements in the annotation performance of translation models.
Abstract: Automatic image annotation is the task of automatically assigning words to an image that describe the content of the image. Machine learning approaches have been explored to model the association between words and images from an annotated set of images and generate annotations for a test image. The paper proposes methods to use a hierarchy defined on the annotation words derived from a text ontology to improve automatic image annotation and retrieval. Specifically, the hierarchy is used in the context of generating a visual vocabulary for representing images and as a framework for the proposed hierarchical classification approach for automatic image annotation. The effect of using the hierarchy in generating the visual vocabulary is demonstrated by improvements in the annotation performance of translation models. In addition to performance improvements, hierarchical classification approaches yield well to constructing multimedia ontologies.
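Hierarchical classification over an annotation-word hierarchy can be sketched as a greedy root-to-leaf descent; the tree encoding and scorer below are illustrative assumptions, not the paper's method:

```python
def classify(features, tree, score):
    """Greedy descent through a word hierarchy.

    tree: nested dict mapping each node to its children ({} for leaves).
    score(features, node): plausibility of a node given image features.
    Returns the root-to-leaf path of chosen annotation words.
    """
    path, children = [], tree
    while children:
        # Pick the best-scoring child at this level, then descend into it.
        node = max(children, key=lambda n: score(features, n))
        path.append(node)
        children = children[node]
    return path
```

Classifying at each level of the hierarchy, rather than over all leaf words at once, is what lets the word hierarchy structure both the visual vocabulary and the annotation decision.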

Journal ArticleDOI
TL;DR: The core engine and expert system of the FIGENIX platform currently handle complex annotation processes of broad interest for the genomic community, and could be easily adapted to new, or more specialized pipelines, such as the annotation of miRNAs, the classification of complex multigenic families, annotation of regulatory elements and other genomic features of interest.
Abstract: Background: Two of the main objectives of the genomic and post-genomic era are to structurally and functionally annotate genomes, which consists of detecting genes' position and structure and inferring their function (as well as other features of genomes). Structural and functional annotation both require the complex chaining of numerous different software tools, algorithms and methods under the supervision of a biologist. The automation of these pipelines is necessary to manage the huge amounts of data released by sequencing projects. Several pipelines already automate some of this complex chaining but still require significant input from biologists for supervising and controlling the results at various steps. Results: Here we propose an innovative automated platform, FIGENIX, which includes an expert system capable of substituting for human expertise at several key steps. FIGENIX currently automates complex pipelines of structural and functional annotation under the supervision of the expert system (which allows it, for example, to make key decisions, check intermediate results or refine the dataset). The quality of the results produced by FIGENIX is comparable to that obtained by expert biologists, with a drastic gain in time and avoidance of errors due to human manipulation of data. Conclusion: The core engine and expert system of the FIGENIX platform currently handle complex annotation processes of broad interest for the genomic community. They could be easily adapted to new or more specialized pipelines, such as the annotation of miRNAs, the classification of complex multigenic families, and the annotation of regulatory elements and other genomic features of interest.

Journal ArticleDOI
TL;DR: The Onto-Tools suite is composed of an annotation database and six seamlessly integrated, web-accessible data mining tools: Onto-Express, Onto-Compare, Onto-Design, Onto-Translate, Onto-Miner and Pathway-Express.
Abstract: The Onto-Tools suite is composed of an annotation database and six seamlessly integrated, web-accessible data mining tools: Onto-Express, Onto-Compare, Onto-Design, Onto-Translate, Onto-Miner and Pathway-Express. The Onto-Tools database has been expanded to include various types of data from 12 new databases. Our database now integrates different types of genomic data from 19 sequence, gene, protein and annotation databases. Additionally, our database is also expanded to include complete Gene Ontology (GO) annotations. Using the enhanced database and GO annotations, Onto-Express now allows functional profiling for 24 organisms and supports 17 different types of input IDs. Onto-Translate is also enhanced to fully utilize the capabilities of the new Onto-Tools database with an ultimate goal of providing the users with a non-redundant and complete mapping from any type of identification system to any other type. Currently, Onto-Translate allows arbitrary mappings between 29 types of IDs. Pathway-Express is a new tool that helps the users find the most interesting pathways for their input list of genes. Onto-Tools are freely available at http://vortex.cs.wayne.edu/Projects.html.

Proceedings ArticleDOI
06 Nov 2005
TL;DR: The IBM Efficient Video Annotation (EVA) system is presented, a server-based tool for semantic concept annotation of large video and image collections that is optimised for collaborative annotation and includes features such as workload sharing and support in conducting inter-annotator analysis.
Abstract: Annotated collections of images and videos are a necessary basis for the successful development of multimedia retrieval systems. The underlying models of such systems rely heavily on the quality and availability of large training collections. The annotation of large collections, however, is a time-consuming and error-prone task, as it has to be performed by human annotators. In this paper we present the IBM Efficient Video Annotation (EVA) system, a server-based tool for semantic concept annotation of large video and image collections. It is optimised for collaborative annotation and includes features such as workload sharing and support for conducting inter-annotator analysis. We discuss initial results of an ongoing user evaluation of this system. The results are based on data collected during the 2005 TRECVID Annotation Forum, where more than 100 annotators used the system.
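Inter-annotator analysis of the kind EVA supports is commonly quantified with agreement statistics such as Cohen's kappa; the abstract does not name a specific measure, so the following is a generic sketch with made-up annotator judgements:

```python
# Cohen's kappa for two annotators' binary concept judgements.
# The label sequences below are invented for illustration.

def cohens_kappa(a, b):
    assert len(a) == len(b)
    n = len(a)
    labels = set(a) | set(b)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    p_e = sum((a.count(l) / n) * (b.count(l) / n)        # chance agreement
              for l in labels)
    return (p_o - p_e) / (1 - p_e)

ann1 = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
ann2 = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]
print(round(cohens_kappa(ann1, ann2), 3))   # 0.583
```

Kappa corrects raw agreement (here 8/10) for the agreement expected by chance given each annotator's label distribution, which matters when concepts are rare in a collection.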

Proceedings ArticleDOI
29 Mar 2005
TL;DR: Results obtained classifying a set of Web services show that the approach can automatically classify services into specific domains, identify key concepts inside service textual documentation, and build a lattice of relationships between service annotations.
Abstract: The need to support the classification and semantic annotation of services constitutes an important challenge for service-centric software engineering. Late binding and, in general, service matching approaches require services to be semantically annotated. Such a semantic annotation may, in turn, need to be made in agreement with a specific ontology. Also, a service description needs to properly relate to other similar services. This paper proposes an approach to i) automatically classify services into specific domains and ii) identify key concepts inside service textual documentation and build a lattice of relationships between service annotations. Support vector machines and formal concept analysis have been used to perform the two tasks. Results obtained classifying a set of Web services show that the approach can provide useful insights in both the service publication and service retrieval phases.
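The lattice-building step rests on formal concept analysis, which can be sketched in a few lines: the concept intents of a binary context are exactly the attribute rows closed under intersection. The services and concept terms below are invented for illustration, not the paper's data:

```python
# Minimal formal concept analysis sketch over a binary context
# (services x annotation concepts). Names are invented examples.

CONTEXT = {
    "ZipService":     {"address", "lookup"},
    "GeoCoder":       {"address", "lookup", "coordinates"},
    "WeatherService": {"coordinates", "forecast"},
}

def concept_intents(context):
    rows = [frozenset(v) for v in context.values()]
    intents = {frozenset().union(*rows)}   # top intent: all attributes
    frontier = set(rows)
    while frontier:                        # close under pairwise intersection
        intents |= frontier
        frontier = {a & b for a in intents for b in intents} - intents
    return intents

def extent(context, intent):
    """Services whose annotations contain every attribute of the intent."""
    return {s for s, attrs in context.items() if intent <= frozenset(attrs)}

# Each (extent, intent) pair is a formal concept; ordering intents by
# inclusion yields the concept lattice.
for intent in sorted(concept_intents(CONTEXT), key=len):
    print(sorted(intent), "->", sorted(extent(CONTEXT, intent)))
```

This tiny context yields six concepts, including the shared intent {"coordinates"} that relates GeoCoder and WeatherService even though neither's full annotation set is contained in the other's.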

Proceedings ArticleDOI
29 Jun 2005
TL;DR: Extensions to a corpus annotation scheme for the manual annotation of attributions, as well as opinions, emotions, sentiments, speculations, evaluations and other private states in language are described.
Abstract: This paper describes extensions to a corpus annotation scheme for the manual annotation of attributions, as well as opinions, emotions, sentiments, speculations, evaluations and other private states in language. It discusses the scheme with respect to the "Pie in the Sky" Check List of Desirable Semantic Information for Annotation. We believe that the scheme is a good foundation for adding private state annotations to other layers of semantic meaning.

Proceedings ArticleDOI
15 Aug 2005
TL;DR: A novel method for automatic annotation of images with keywords from a generic vocabulary of concepts or objects for the purpose of content-based image retrieval and results are presented on two image-collections | COREL and key-frames from TRECVID.
Abstract: This paper introduces a novel method for automatic annotation of images with keywords from a generic vocabulary of concepts or objects for the purpose of content-based image retrieval. An image, represented as sequence of feature-vectors characterizing low-level visual features such as color, texture or oriented-edges, is modeled as having been stochastically generated by a hidden Markov model, whose states represent concepts. The parameters of the model are estimated from a set of manually annotated (training) images. Each image in a large test collection is then automatically annotated with the a posteriori probability of concepts present in it. This annotation supports content-based search of the image-collection via keywords. Various aspects of model parameterization, parameter estimation, and image annotation are discussed. Empirical retrieval results are presented on two image-collections | COREL and key-frames from TRECVID. Comparisons are made with two other recently developed techniques on the same datasets.
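The posterior computation can be illustrated in miniature. The sketch below hand-rolls the forward-backward algorithm for a two-state HMM over quantized feature "symbols"; the states, probabilities and symbols are invented, and the paper's actual model works with continuous low-level feature vectors rather than a toy discrete alphabet:

```python
# Toy forward-backward pass: states are concepts, observations are
# quantized image features. All numbers here are made up for illustration.

states = ["sky", "grass"]
pi = [0.5, 0.5]                          # initial state distribution
A  = [[0.8, 0.2], [0.2, 0.8]]            # state transition probabilities
B  = [{"blue": 0.8, "green": 0.2},       # emission probs per quantized symbol
      {"blue": 0.1, "green": 0.9}]

def posteriors(obs):
    n, T = len(states), len(obs)
    # forward pass: alpha[t][i] = P(obs[0..t], state_t = i)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for t in range(1, T):
        alpha.append([B[j][obs[t]] * sum(alpha[t-1][i] * A[i][j] for i in range(n))
                      for j in range(n)])
    # backward pass: beta[t][i] = P(obs[t+1..] | state_t = i)
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(A[i][j] * B[j][obs[t+1]] * beta[t+1][j] for j in range(n))
    # normalize to get P(state_t = i | obs): the a posteriori concept probs
    post = []
    for t in range(T):
        z = sum(alpha[t][i] * beta[t][i] for i in range(n))
        post.append([alpha[t][i] * beta[t][i] / z for i in range(n)])
    return post

# A sequence of quantized feature vectors from image regions:
p = posteriors(["blue", "blue", "green"])
print([round(x, 3) for x in p[0]])   # concept probabilities for first region
```

Annotating each image with these posterior concept probabilities, rather than hard labels, is what lets the retrieval system rank images by keyword relevance.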

Proceedings Article
01 Jan 2005
TL;DR: An XML-based, generic stand-off architecture for multi-level linguistic annotations is proposed and an example instantiation of this architecture is presented and application scenarios that profit from this architecture are sketched out.
Abstract: This paper deals with the representation of multi-level linguistic annotations. It proposes an XML-based, generic stand-off architecture and presents an example instantiation. Application scenarios that profit from this architecture are sketched out.
In recent years, corpus linguistics has become more and more important to a broad community, including people working in theoretical, applied and computational linguistics. To many of them, speech and text corpora represent a rich source of data and phenomena, forming the basis of their research. Such data is even more useful if it is annotated with suitable information, allowing for fast and effective retrieval of relevant data. Whereas corpora of the first generation featured part-of-speech and syntactic annotations (e.g. Penn Treebank [MSM93], TIGER corpus [BDE04]), the focus has now switched to properties beyond the (morpho-)syntactic level. Recent corpora are annotated with semantic information (PropBank [KP02], FrameNet [JPB03], SALSA [EKPP03]), pragmatic information (Penn Discourse TreeBank [MPJW04], RST Discourse Treebank [CMO03], Potsdam Commentary Corpus [Ste04]), and dialogue structure (Switchboard SWBD-DAMSL [JSB97]).
Annotations often have to be carried out manually: reliable (semi-)automatic tools exist only for the annotation of part of speech and syntax, and are restricted to well-researched languages like English or German. Moreover, hand-annotated training material is a prerequisite for the development of automatic tools. As a consequence, corpora and annotations ought to be reusable so that a large community can profit from the data. To this end, various standardization efforts have been launched. Standardization of linguistic data concerns (see, e.g., [Sch05]): (i) The physical data structure: here, XML has become the widely recognized standard format. (ii) The logical data structure, i.e., the data models that are used to model the phenomena and their properties (e.g. hierarchical structures like trees or graphs for syntax annotations vs. time-aligned tiers for speech and dialogue annotations). Examples of data models are annotation graphs [BL01] and the NITE Object Model [CKO03b]. (iii) Content: in several initiatives, XML applications for specific linguistic annotations have been developed. For instance, TEI ("Text Encoding Initiative", [SB94]; http://www.tei-c.org/) defines highly detailed DTDs for encoding all kinds of bibliographic and other information; XCES ("XML-based Corpus Encoding Standard"; http://www.cs.vassar.edu/XCES/) provides DTDs for the annotation of chunks, alignment, etc. More recently, however, it has been recognized that these standardized DTDs often do not meet application-specific needs. Hence, abstract, generic XML formats have been proposed that allow for the formal integration of application-specific annotations [IR01]. For the conceptual integration of specific annotations, so-called data category repositories as well as linguistic ontologies have been developed. They define reference categories, with precise semantics and examples, that specific annotation tags ought to be mapped to (see, e.g., DOLCE, "Descriptive Ontology for Linguistic and Cognitive Engineering"; http://www.loa-cnr.it/DOLCE.html).
This paper deals with the formal integration of specific annotations. It first addresses the subject of stand-off architecture (sec. 1). We then propose an XML-based representation of linguistic annotation and present an example application (instantiation) in some detail (sec. 2). We also sketch out some application scenarios that profit from such a flexible architecture (sec. 3) and address related approaches (sec. 4).
1 Stand-off Architecture
As early as the mid-nineties, the topic of "stand-off annotation" was discussed (see, e.g., [TM97]). This term describes the situation where primary data (e.g., the source text) and annotations of this data are stored in separate files. Stand-off annotation might seem problematic, because there is no immediate connection between the text and its annotation; hence, whenever the source text is modified, extra care has to be taken to synchronize its annotation. Similarly, human inspection of the data becomes cumbersome. On the other hand, stand-off annotation has the great advantage of leaving the source text untouched. It thus allows for annotating text that cannot be modified for whatever reason, e.g., because it is a text available on the Internet. Moreover, whereas XML as such does not easily account for overlapping segments and conflicting hierarchies (different methods have been proposed to accommodate conflicting markup in XML; we come back to them below), they can be marked in a natural way in stand-off annotation: by distributing annotations over different files. That is, not only is the source text separated from its annotations, but individual annotations are separated from each other as well. This way, annotations at different levels can be created and modified independently of each other. Finally, competing, alternative annotations can even be represented, e.g. variants of part-of-speech annotations that are output by different tools. One of the first proposals for stand-off annotation of linguistic corpora is [DBD98]. An ISO working group (ISO TC37/SC4, http://www.tc37sc4.org) is currently developing the stand-off based LAF ("Linguistic Annotation Framework" [IRdlC03]). Some recent corpora like the ANC ("American National Corpus" [RI04]) are encoded in stand-off architecture. In the approach presented in this paper, we also subscribe to the principles of stand-off annotation.
2 A Generic XML Format
Our format defines generic XML elements such as <mark> (markable), <feat> (feature), and <struct> (structure), which indicate which data type the annotation conforms to. We assume that primary data is stored in a file that optionally specifies a header, followed by a tag that contains the source text. Annotations are stored in separate files; they may refer to the source text or to other annotations. These relations are encoded by means of XLinks and XPointers. We distinguish three different types of annotations: markables, structures, and features. (i) Markables: <mark> tags specify text positions or spans of text (or spans of other markables) that can be annotated with linguistic information. For instance, <mark> tags might indicate tokens by specifying ranges of the source text, cf. fig. 1. (ii) Structures: <struct> tags are special types of markables. Similar to <mark> tags, they specify objects that can then serve as anchors for annotations. Whereas <mark> tags define simple types of anchors (flat spans of text or markables), a <struct> tag represents a complex anchor involving relations between arbitrarily many markables (including <struct> elements). Relations (<rel>) can be further specified by an attribute type, e.g. as undirected or directed (= pointers). Put differently, a <struct> specifies a complete tree or graph, which consists of single tree fragments specified by the <rel> tags, cf. fig. 1. (iii) Features: <feat> tags specify information annotated to markables or structures, which are referred to by xlink attributes. The type of information (e.g., "part of speech") is encoded by an attribute type, cf. fig. 2. For instance, the information encoded by the first <feat> in fig. 2 can be paraphrased as follows: take the token that is defined by the <mark> tag with the ID attribute id="tok_1" and assign the part of speech "ART" (article) to that token.
We intend to adopt the idea of [CKO03a] by assuming that admissible feature values (such as "NN", normal/common noun, or "NE", named entity) may be complex types and are organized in a type hierarchy. For instance, "NN" and "NE" might be subtypes of the more general type "N", noun. <feat> tags then point to some type in the hierarchy (which is stored separately), thus specifying the value of the annotated property, cf. fig. 3. Type hierarchies have to be defined by the user, or they may be derived from annotation schemes that incorporate hierarchies, cf. the schemes used by the annotation tool MMAX. In case no hierarchy is defined, the features are organized in a flat list. The stand-off architecture allows the user to experiment with different hierarchies. Further examples of annotations are sketched out below. They illustrate that annotations may stem from different sources (see the attribute source) and encode various types of information:
Categorial annotation (anchored to constituents): ...
Coreference annotation, marking coreferential expressions such as pronouns (referred to by xlink:href attributes) and their antecedents (identified by target attributes): ...
Document structure (headers, paragraphs, lists, etc., anchored to markables that refer to tokens): ...
(The research reported in this paper was jointly financed by the German Research Foundation (DFG, SFB632) and the Federal Ministry of Education and Research (BMBF grant no. 03WKH22). Many thanks go to my colleagues, especially Michael Götze, for helpful discussions of the topics addressed in this paper.)
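The mechanics of stand-off annotation can be sketched in a few lines: the primary text and its annotations are separate documents, and annotations point into the text by character offsets. The tag and attribute names below (`mark`, `feat`, `start`, `end`) are illustrative stand-ins, not the paper's exact schema:

```python
# Stand-off annotation in miniature: annotations reference character
# ranges of a separately stored primary text. Element and attribute
# names are illustrative, not the paper's actual format.

import xml.etree.ElementTree as ET

primary = ET.fromstring("<text>The cat sleeps</text>")

# Stand-off markables pointing into the primary text by character offset,
# plus a feature annotating one markable by ID.
annotation = ET.fromstring("""
<annoList>
  <mark id="tok_1" start="0" end="3"/>
  <mark id="tok_2" start="4" end="7"/>
  <feat target="tok_2" type="pos" value="NN"/>
</annoList>""")

source = primary.text
marks = {m.get("id"): source[int(m.get("start")):int(m.get("end"))]
         for m in annotation.iter("mark")}
pos = {f.get("target"): f.get("value") for f in annotation.iter("feat")}

print(marks["tok_2"], pos["tok_2"])   # the token "cat" tagged as a noun
```

Because the annotation file never contains the text itself, a second, conflicting annotation layer (say, an alternative tokenization) can simply live in another file pointing at the same offsets.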

Journal Article
TL;DR: An improved version of MWSAF is presented, which replaces the schema matching technique currently used for the categorization with a Naive Bayesian Classifier, so that it can match web services with ontologies faster and with higher accuracy.
Abstract: Researchers have recognized the need for more expressive descriptions of Web services. Most approaches have suggested using ontologies to either describe the Web services or to annotate syntactical descriptions of Web services. Earlier approaches are typically manual, and the capability to support automatic or semi-automatic annotation is needed. The METEOR-S Web Service Annotation Framework (MWSAF) created at the LSDIS Lab at the University of Georgia leverages schema matching techniques for semi-automatic annotation. In this paper, we present an improved version of MWSAF. Our preliminary investigation indicates that, by replacing the schema matching technique currently used for the categorization with a Naive Bayesian Classifier, we can match web services with ontologies faster and with higher accuracy.
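The categorization step can be sketched with a hand-rolled multinomial Naive Bayes classifier. The training texts and domain labels below are invented, and the sketch is a generic illustration of the technique, not MWSAF's actual code or data:

```python
# Hand-rolled multinomial Naive Bayes with add-one smoothing, matching
# toy service descriptions to domains. All training data is invented.

import math
from collections import Counter

TRAIN = [
    ("weather", "returns forecast temperature for a given city"),
    ("weather", "current temperature and humidity by zip code"),
    ("finance", "stock quote and exchange rate lookup"),
    ("finance", "currency exchange conversion service"),
]

classes = {c for c, _ in TRAIN}
counts = {c: Counter() for c in classes}        # word counts per class
priors = Counter(c for c, _ in TRAIN)
for c, text in TRAIN:
    counts[c].update(text.split())
vocab = {w for ctr in counts.values() for w in ctr}

def classify(text):
    def log_score(c):
        total = sum(counts[c].values()) + len(vocab)   # add-one smoothing
        return (math.log(priors[c] / len(TRAIN))
                + sum(math.log((counts[c][w] + 1) / total)
                      for w in text.split() if w in vocab))
    return max(classes, key=log_score)

print(classify("temperature forecast for my city"))   # weather
```

Log probabilities are summed rather than multiplying raw probabilities, which avoids underflow on longer service descriptions.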

Proceedings Article
01 Jan 2005
TL;DR: The main focus of the seminar was on TimeML-based temporal annotation and reasoning as discussed by the authors, with three main points: determining how effectively one can use the TimeML language for consistent annotation, determining how useful such annotation is for further processing, and determining what modifications should be applied to the standard to improve its usefulness in applications such as question-answering and information retrieval.
Abstract: The main focus of the seminar was on TimeML-based temporal annotation and reasoning. We were concerned with three main points: determining how effectively one can use the TimeML language for consistent annotation, determining how useful such annotation is for further processing, and determining what modifications should be applied to the standard to improve its usefulness in applications such as question-answering and information retrieval.