Showing papers on "Annotation" published in 2009


Journal ArticleDOI
TL;DR: AmiGO is a web application that allows users to query, browse and visualize ontologies and related gene product annotation (association) data.
Abstract: AmiGO is a web application that allows users to query, browse, and visualize ontologies and related gene product annotation (association) data. AmiGO can be used online at the Gene Ontology (GO) website to access the data provided by the GO Consortium; it can also be downloaded and installed to browse local ontologies and annotations. AmiGO is free open source software developed and maintained by the GO Consortium.

1,648 citations


Journal ArticleDOI
TL;DR: Recent changes to the FlyBase GO annotation strategy that are improving the quality of the GO annotation data are described, including participation in the GO Reference Genome Annotation Project.
Abstract: FlyBase (http://flybase.org) is a database of Drosophila genetic and genomic information. Gene Ontology (GO) terms are used to describe three attributes of wild-type gene products: their molecular function, the biological processes in which they play a role, and their subcellular location. This article describes recent changes to the FlyBase GO annotation strategy that are improving the quality of the GO annotation data. Many of these changes stem from our participation in the GO Reference Genome Annotation Project—a multi-database collaboration producing comprehensive GO annotation sets for 12 diverse species.

764 citations


Journal ArticleDOI
TL;DR: The Gene Ontology Annotation (GOA) project at the EBI provides high-quality electronic and manual associations (annotations) of Gene Ontology (GO) terms to UniProt Knowledgebase (UniProtKB) entries.
Abstract: The Gene Ontology Annotation (GOA) project at the EBI (http://www.ebi.ac.uk/goa) provides high-quality electronic and manual associations (annotations) of Gene Ontology (GO) terms to UniProt Knowledgebase (UniProtKB) entries. Annotations created by the project are collated with annotations from external databases to provide an extensive, publicly available GO annotation resource. Currently covering over 160 000 taxa, with greater than 32 million annotations, GOA remains the largest and most comprehensive open-source contributor to the GO Consortium (GOC) project. Over the last five years, the group has augmented the number and coverage of their electronic pipelines and a number of new manual annotation projects and collaborations now further enhance this resource. A range of files facilitate the download of annotations for particular species, and GO term information and associated annotations can also be viewed and downloaded from the newly developed GOA QuickGO tool (http://www.ebi.ac.uk/QuickGO), which allows users to precisely tailor their annotation set.

555 citations


Proceedings ArticleDOI
20 Jun 2009
TL;DR: A new probabilistic model for jointly modeling the image, its class label, and its annotations is developed, together with approximate inference and estimation algorithms based on variational methods and efficient approximations for classifying and annotating new images.
Abstract: Image classification and annotation are important problems in computer vision, but rarely considered together. Intuitively, annotations provide evidence for the class label, and the class label provides evidence for annotations. For example, an image of class highway is more likely annotated with words “road,” “car,” and “traffic” than words “fish,” “boat,” and “scuba.” In this paper, we develop a new probabilistic model for jointly modeling the image, its class label, and its annotations. Our model treats the class label as a global description of the image, and treats annotation terms as local descriptions of parts of the image. Its underlying probabilistic assumptions naturally integrate these two sources of information. We derive approximate inference and estimation algorithms based on variational methods, as well as efficient approximations for classifying and annotating new images. We examine the performance of our model on two real-world image data sets, illustrating that a single model provides competitive annotation performance, and superior classification performance.

490 citations


Proceedings ArticleDOI
05 Jun 2009
TL;DR: An empirical study is conducted to examine the effect of noisy annotations on the performance of sentiment classification models, and evaluate the utility of annotation selection on classification accuracy and efficiency.
Abstract: Annotation acquisition is an essential step in training supervised classifiers. However, manual annotation is often time-consuming and expensive. The possibility of recruiting annotators through Internet services (e.g., Amazon Mechanical Turk) is an appealing option that allows multiple labeling tasks to be outsourced in bulk, typically with low overall costs and fast completion rates. In this paper, we consider the difficult problem of classifying sentiment in political blog snippets. Annotation data from both expert annotators in a research lab and non-expert annotators recruited from the Internet are examined. Three selection criteria are identified to select high-quality annotations: noise level, sentiment ambiguity, and lexical uncertainty. Analysis confirms the utility of these criteria for improving data quality. We conduct an empirical study to examine the effect of noisy annotations on the performance of sentiment classification models, and evaluate the utility of annotation selection on classification accuracy and efficiency.
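One of the selection criteria named above, noise level, can be illustrated with a small sketch. The code below is an illustration under stated assumptions rather than the authors' procedure: it treats the share of annotators who disagree with the majority label as an item's noise level and keeps only items at or below a threshold. The function names, the threshold value, and the toy labels are all hypothetical.

```python
from collections import Counter

def noise_level(labels):
    """Fraction of labels that disagree with the majority label."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return 1.0 - majority_count / len(labels)

def select_annotations(items, max_noise=0.34):
    """Keep items whose noise level is at most max_noise.

    Returns a list of (item_id, majority_label) pairs.
    """
    selected = []
    for item_id, labels in items.items():
        if noise_level(labels) <= max_noise:
            majority = Counter(labels).most_common(1)[0][0]
            selected.append((item_id, majority))
    return selected

# Toy data (hypothetical): three blog snippets, each labeled by three annotators.
snippets = {
    "s1": ["pos", "pos", "pos"],      # full agreement
    "s2": ["pos", "neg", "pos"],      # one dissenting annotator
    "s3": ["pos", "neg", "neutral"],  # no agreement at all
}
print(select_annotations(snippets))
```

With this threshold, the snippet on which all three annotators disagree is dropped, while the other two are kept with their majority label.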

316 citations


Journal ArticleDOI
01 Jan 2009-Database
TL;DR: This article emphasizes the essential role of expert annotation as a complement of automatic annotation in microbial genome annotation, especially for genomes initially analyzed by automatic procedures alone.
Abstract: The initial outcome of genome sequencing is the creation of long text strings written in a four letter alphabet. The role of in silico sequence analysis is to assist biologists in the act of associating biological knowledge with these sequences, allowing investigators to make inferences and predictions that can be tested experimentally. A wide variety of software is available to the scientific community, and can be used to identify genomic objects, before predicting their biological functions. However, only a limited number of biologically interesting features can be revealed from an isolated sequence. Comparative genomics tools, on the other hand, by bringing together the information contained in numerous genomes simultaneously, allow annotators to make inferences based on the idea that evolution and natural selection are central to the definition of all biological processes. We have developed the MicroScope platform in order to offer a web-based framework for the systematic and efficient revision of microbial genome annotation and comparative analysis (http://www.genoscope.cns.fr/agc/microscope). Starting with the description of the flow chart of the annotation processes implemented in the MicroScope pipeline, and the development of traditional and novel microbial annotation and comparative analysis tools, this article emphasizes the essential role of expert annotation as a complement of automatic annotation. Several examples illustrate the use of implemented tools for the review and curation of annotations of both new and publicly available microbial genomes within MicroScope’s rich integrated genome framework. The platform is used as a viewer in order to browse updated annotation information of available microbial genomes (more than 440 organisms to date), and in the context of new annotation projects (117 bacterial genomes). The human expertise gathered in the MicroScope database (about 280,000 independent annotations) contributes to improving the quality of microbial genome annotation, especially for genomes initially analyzed by automatic procedures alone. Database URLs: http://www.genoscope.cns.fr/agc/mage and http://www.genoscope.cns.fr/agc/microcyc

284 citations


01 Mar 2009
TL;DR: The Open Biomedical Annotator (OBA) is presented, an ontology-based Web service that annotates public datasets with biomedical ontology concepts based on their textual metadata (www.bioontology.org).
Abstract: The range of publicly available biomedical data is enormous and is expanding fast. This expansion means that researchers now face a hurdle to extracting the data they need from the large volume of data that is available. Biomedical researchers have turned to ontologies and terminologies to structure and annotate their data with ontology concepts for better search and retrieval. However, this annotation process cannot be easily automated and often requires expert curators. In addition, there is a lack of easy-to-use systems that facilitate the use of ontologies for annotation. This paper presents the Open Biomedical Annotator (OBA), an ontology-based Web service that annotates public datasets with biomedical ontology concepts based on their textual metadata (www.bioontology.org). The biomedical community can use the annotator service to tag datasets automatically with ontology terms (from UMLS and NCBO BioPortal ontologies). Such annotations facilitate translational discoveries by integrating annotated data.

280 citations


Journal Article
TL;DR: An Expert Review (ER) version of the Integrated Microbial Genomes (IMG) system, with the goal of supporting systematic and efficient revision of microbial genome annotations, is developed.
Abstract: IMG ER: A System for Microbial Genome Annotation Expert Review and Curation
Authors: Victor M. Markowitz (1, *), Konstantinos Mavromatis (2), Natalia N. Ivanova (2), I-Min A. Chen (1), Ken Chu (1), and Nikos C. Kyrpides (2)
Affiliations: (1) Biological Data Management and Technology Center, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA; (2) Genome Biology Program, DOE Joint Genome Institute, 2800 Mitchell Dr., Walnut Creek, CA 94598, USA. (*) To whom correspondence should be addressed.
ABSTRACT: A rapidly increasing number of microbial genomes are sequenced by organizations worldwide and are eventually included into various public genome data resources. The quality of the annotations depends largely on the original dataset providers, with erroneous or incomplete annotations often carried over into the public resources and difficult to correct. We have developed an Expert Review (ER) version of the Integrated Microbial Genomes (IMG) system, with the goal of supporting systematic and efficient revision of microbial genome annotations. IMG ER provides tools for the review and curation of annotations of both new and publicly available microbial genomes within IMG’s rich integrated genome framework. New genome datasets are included into IMG ER prior to their public release either with their native annotations or with annotations generated by IMG ER’s annotation pipeline. IMG ER tools allow addressing annotation problems detected with IMG’s comparative analysis tools, such as genes missed by gene prediction pipelines or genes without an associated function. Over the past year, IMG ER was used for improving the annotations of about 150 microbial genomes.
INTRODUCTION: A rapidly increasing number of microbial genomes are sequenced by organizations worldwide, undergo similar annotation procedures, and are eventually included into public genome data resources. First, raw (“read”) sequences of microbial genomes are assembled into longer “contigs” (contiguous sequences) in order to produce “draft” genome sequences, with draft genomes sometimes “finished” by closing gaps between contigs. Next, annotation pipelines are used for predicting genes and determining their functional roles in draft or finished genomes. Subsequently, annotated microbial genome sequences are submitted to/collected by primary archival public sequence data repositories, such as Genbank (Benson et al. 2009), which perform data validation on genome datasets in order to ensure consistency of their format and, to a certain degree, their content. Datasets in these resources have different degrees of precision and resolution due to diverse annotation methods employed by individual data providers. Secondary public resources, such as NCBI’s RefSeq (Pruitt et al. 2007), further process microbial genome data from primary resources with the dual goals of providing the most current view on microbial genome sequences and of gradually increasing the quality and completeness of their associated functional annotations via manual curation and computation. In addition to public primary and secondary resources, microbial genome datasets are incorporated into a variety of tertiary resources, such as SEED (Overbeek et al. 2005) and IMG

270 citations


Journal ArticleDOI
TL;DR: This database of interferon regulated genes integrates information from high-throughput experiments with annotation, ontology, orthologue sequences from 37 species, tissue expression patterns and gene regulatory information to enable a detailed investigation of the molecular mechanisms underlying IFN biology.
Abstract: INTERFEROME is an open access database of types I, II and III Interferon regulated genes (http://www.interferome.org) collected from analysing expression data sets of cells treated with IFNs. This database of interferon regulated genes integrates information from high-throughput experiments with annotation, ontology, orthologue sequences from 37 species, tissue expression patterns and gene regulatory information to enable a detailed investigation of the molecular mechanisms underlying IFN biology. INTERFEROME fulfils a need in infection, immunity, development and cancer research by providing computational tools to assist in identifying interferon signatures in gene lists generated by high-throughput expression technologies, and their potential molecular and biological consequences.

234 citations


Journal ArticleDOI
TL;DR: The DOE-JGI Microbial Annotation Pipeline supports gene prediction and/or functional annotation of microbial genomes towards comparative analysis with the Integrated Microbial Genome (IMG) system.
Abstract: The DOE-JGI Microbial Annotation Pipeline (DOE-JGI MAP) supports gene prediction and/or functional annotation of microbial genomes towards comparative analysis with the Integrated Microbial Genome (IMG) system. DOE-JGI MAP annotation is applied on nucleotide sequence datasets included in the IMG-ER (Expert Review) version of IMG via the IMG ER submission site. Users can submit the sequence datasets consisting of one or more contigs in a multi-fasta file. DOE-JGI MAP annotation includes prediction of protein coding and RNA genes, as well as repeats and assignment of product names to these genes.

208 citations


Journal ArticleDOI
Jing Liu, Mingjing Li, Qingshan Liu, Hanqing Lu, Songde Ma
TL;DR: A Nearest Spanning Chain (NSC) method is proposed to construct the image-based graph, whose edge-weights are derived from the chain-wise statistical information instead of the traditional pairwise similarities.

Journal ArticleDOI
TL;DR: The SNPnexus database has a user-friendly web interface, providing single or batch query options using SNP identifiers from dbSNP as well as genomic location on clones, contigs or chromosomes, and is the only database currently providing a complete set of functional annotations of SNPs in public databases and newly detected from sequencing projects.
Abstract: Motivation: Design a new computational tool allowing scientists to functionally annotate newly discovered and public domain single nucleotide polymorphisms in order to help in prioritizing targets in further disease studies and large-scale genotyping projects. Summary: The SNPnexus database provides functional annotation for both novel and public SNPs. Possible effects on the transcriptome and proteome levels are characterized and reported from five major annotation systems providing the most extensive information on alternative splicing. Additional information on HapMap genotype and allele frequency, overlaps with potential regulatory elements or structural variations, as well as related genetic diseases can also be retrieved. The SNPnexus database has a user-friendly web interface, providing single or batch query options using SNP identifiers from dbSNP as well as genomic location on clones, contigs or chromosomes. Therefore, SNPnexus is the only database currently providing a complete set of functional annotations of SNPs in public databases and newly detected from sequencing projects. Hence, we describe SNPnexus, provide details of the query options, the annotation categories as well as biological examples of use. Availability: The SNPnexus database is freely available at http://

Journal ArticleDOI
TL;DR: An experimental evaluation on an English-German parallel corpus is provided which demonstrates the feasibility of inducing high-precision German semantic role annotation both for manually and automatically annotated English data.
Abstract: This article considers the task of automatically inducing role-semantic annotations in the FrameNet paradigm for new languages. We propose a general framework that is based on annotation projection, phrased as a graph optimization problem. It is relatively inexpensive and has the potential to reduce the human effort involved in creating role-semantic resources. Within this framework, we present projection models that exploit lexical and syntactic information. We provide an experimental evaluation on an English-German parallel corpus which demonstrates the feasibility of inducing high-precision German semantic role annotation both for manually and automatically annotated English data.

Journal ArticleDOI
TL;DR: The construction of a semantically annotated corpus of clinical texts for use in the development and evaluation of systems for automatically extracting clinically significant information from the textual component of patient records is described.

Journal ArticleDOI
TL;DR: The Reference Genome project has two primary goals: to increase the depth and breadth of annotations for genes in each of the organisms in the project, and to create data sets and tools that enable other genome annotation efforts to infer GO annotations for homologous genes in their organisms.
Abstract: The Gene Ontology (GO) is a collaborative effort that provides structured vocabularies for annotating the molecular function, biological role, and cellular location of gene products in a highly systematic way and in a species-neutral manner with the aim of unifying the representation of gene function across different organisms. Each contributing member of the GO Consortium independently associates GO terms to gene products from the organism(s) they are annotating. Here we introduce the Reference Genome project, which brings together those independent efforts into a unified framework based on the evolutionary relationships between genes in these different organisms. The Reference Genome project has two primary goals: to increase the depth and breadth of annotations for genes in each of the organisms in the project, and to create data sets and tools that enable other genome annotation efforts to infer GO annotations for homologous genes in their organisms. In addition, the project has several important incidental benefits, such as increasing annotation consistency across genome databases, and providing important improvements to the GO's logical structure and biological content.

Journal ArticleDOI
TL;DR: A taxonomy of annotation is proposed, describing what constitutes an annotation and outlining different dimensions along which annotation can vary, and what styles of annotation are used in different types of applications and areas where further work needs to be done to improve annotation.

Patent
27 Feb 2009
TL;DR: An audio file format is augmented with a parallel data channel of line identifiers, or with a map associating time codes for the audio data with line numbers on the original document.
Abstract: To facilitate the use of audio files for annotation purposes, an audio file format, which includes audio data for playback purposes, is augmented with a parallel data channel of line identifiers, or with a map associating time codes for the audio data with line numbers on the original document. The line number-time code information in the audio file is used to navigate within the audio file, and also to associate bookmark links and captured audio annotation files with line numbers of the original text document. An annotation device may provide an output document wherein links to audio and/or text annotation files are embedded at corresponding line numbers. Also, a navigation index may be generated, having links to annotation files and associated document line numbers, as well as bookmark links to selected document line numbers.
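The map between time codes and line numbers described above can be sketched as a sorted lookup table. This is a minimal illustration, assuming each entry records the time in seconds at which a document line starts in the audio; the class name `LineTimeMap`, its methods, and the sample file name are hypothetical and not taken from the patent.

```python
import bisect

class LineTimeMap:
    """Maps audio time codes (seconds) to document line numbers and back."""

    def __init__(self, entries):
        # entries: iterable of (start_time_seconds, line_number) pairs
        self._entries = sorted(entries)
        self._times = [t for t, _ in self._entries]
        self._line_to_time = {line: t for t, line in self._entries}

    def line_at(self, time_code):
        """Return the line being read at time_code, or None before the first entry."""
        idx = bisect.bisect_right(self._times, time_code) - 1
        return self._entries[idx][1] if idx >= 0 else None

    def seek_time(self, line_number):
        """Return the time code at which line_number starts (for navigation)."""
        return self._line_to_time[line_number]

    def attach(self, time_code, annotation_file):
        """Associate an annotation captured at time_code with a document line."""
        return {"line": self.line_at(time_code), "annotation": annotation_file}

# Hypothetical timeline: line 1 starts at 0.0 s, line 2 at 4.2 s, line 3 at 9.8 s.
timeline = LineTimeMap([(0.0, 1), (4.2, 2), (9.8, 3)])
print(timeline.line_at(5.0))
print(timeline.seek_time(3))
print(timeline.attach(10.5, "note-001.wav"))
```

Keeping the start times in a sorted list lets a binary search resolve any playback position to a line, which is what both navigation and the linking of captured annotations to line numbers rely on.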

Proceedings ArticleDOI
04 Aug 2009
TL;DR: The Columbia Arabic Treebank (CATiB) is a database of syntactic analyses of Arabic sentences using representations and terminology inspired by traditional Arabic syntax.
Abstract: The Columbia Arabic Treebank (CATiB) is a database of syntactic analyses of Arabic sentences. CATiB contrasts with previous approaches to Arabic treebanking in its emphasis on speed with some constraints on linguistic richness. Two basic ideas inspire the CATiB approach: no annotation of redundant information and using representations and terminology inspired by traditional Arabic syntax. We describe CATiB's representation and annotation procedure, and report on inter-annotator agreement and speed.

Journal ArticleDOI
TL;DR: The utility of the current generation of webservers is assessed, and improvements for the next generation of webservers are suggested to better deliver value to medical geneticists and molecular biologists.
Abstract: Computational biology has the opportunity to play an important role in the identification of functional single nucleotide polymorphisms (SNPs) discovered in large-scale genotyping studies, ultimately yielding new drug targets and biomarkers. The medical genetics and molecular biology communities are increasingly turning to computational biology methods to prioritize interesting SNPs found in linkage and association studies. Many such methods are now available through web interfaces, but the interested user is confronted with an array of predictive results that are often in disagreement with each other. Many tools today produce results that are difficult to understand without bioinformatics expertise, are biased towards non-synonymous SNPs, and do not necessarily reflect up-to-date versions of their source bioinformatics resources, such as public SNP repositories. Here, I assess the utility of the current generation of webservers, and suggest improvements for the next generation of webservers to better deliver value to medical geneticists and molecular biologists.

Journal ArticleDOI
TL;DR: This work builds a prototype system for ontology-based annotation and indexing of biomedical data that enables ontology-based querying and integration of tissue and gene expression microarray data, as well as identification of datasets on specific diseases across both repositories.
Abstract: The volume of publicly available genomic scale data is increasing. Genomic datasets in public repositories are annotated with free-text fields describing the pathological state of the studied sample. These annotations are not mapped to concepts in any ontology, making it difficult to integrate these datasets across repositories. We have previously developed methods to map text-annotations of tissue microarrays to concepts in the NCI thesaurus and SNOMED-CT. In this work we generalize our methods to map text annotations of gene expression datasets to concepts in the UMLS. We demonstrate the utility of our methods by processing annotations of datasets in the Gene Expression Omnibus. We demonstrate that we enable ontology-based querying and integration of tissue and gene expression microarray data. We enable identification of datasets on specific diseases across both repositories. Our approach provides the basis for ontology-driven data integration for translational research on gene and protein expression data. Based on this work we have built a prototype system for ontology based annotation and indexing of biomedical data. The system processes the text metadata of diverse resource elements such as gene expression data sets, descriptions of radiology images, clinical-trial reports, and PubMed article abstracts to annotate and index them with concepts from appropriate ontologies. The key functionality of this system is to enable users to locate biomedical data resources related to particular ontology concepts.

DOI
20 Jul 2009
TL;DR: The different features of the architecture as well as actual use cases for corpus linguistic research on such diverse areas as information structure, learner language and discourse level phenomena are presented.
Abstract: ANNIS (see Dipper & Gotze 2005; Chiarcos et al. 2008) is a flexible web-based corpus architecture for search and visualization of multi-layer linguistic corpora. By multi-layer we mean that the same primary datum may be annotated independently with (i) annotations of different types (spans, DAGs with labelled edges and arbitrary pointing relations between terminals or non-terminals), and (ii) annotation structures that possibly overlap and/or conflict hierarchically. In this paper we present the different features of the architecture as well as actual use cases for corpus linguistic research on such diverse areas as information structure, learner language and discourse level phenomena. The supported search functionalities of ANNIS2 include exact and regular expression matching on word forms and annotations, as well as complex relations between individual elements, such as all forms of overlapping, contained or adjacent annotation spans, hierarchical dominance (children, ancestors, left- or rightmost child etc.) and more. Alternatively to the query language, data can be accessed using a graphical query builder. Query matches are visualized depending on annotation types: annotations referring to tokens (e.g. lemma, POS, morphology) are shown immediately in the match list. Spans (covering one or more tokens) are displayed in a grid view, trees/graphs in a tree/graph view, and pointing relations (such as anaphoric links) in a discourse view, with same-colour highlighting for coreferent elements. Full Unicode support is provided and a media player is embedded for rendering audio files linked to the data, allowing for a large variety of corpora. Corpus data is annotated with automatic tools (taggers, parsers etc.) or task-specific expert tools for manual annotation, and then mapped onto the interchange format PAULA (Dipper 2005), where stand-off annotations refer to the same primary data. Importers exist for many formats, including EXMARaLDA (Schmidt 2004), TigerXML (Brants & Plaehn 2000), MMAX2 (Muller & Strube 2006), RSTTool (O’Donnell 2000), PALinkA (Orasan 2003) and Toolbox (Stuart et al. 2007). Data is compiled into a relational DB for optimal performance. Query matches and their features can also be exported in the ARFF format and processed with the data mining tool WEKA (Witten & Frank 2005), which offers implementations of clustering and classification algorithms. ANNIS2 compares favourably with search functionalities in the above tools as well as other corpus search engines (EXAKT, http://www.exmaralda.org/exakt.html; TIGERSearch, Lezius 2002; CWB, Christ 1994) and other frameworks/architectures (NITE, Carletta et al. 2003; GATE, Cunningham 2002).

Patent
19 Feb 2009
TL;DR: A graphical annotation interface allows the creation of annotations and their association with an existing online hosted video; annotations can alter the video's appearance and/or behavior, e.g. by supplementing it with text, linking to other videos or web pages, or pausing playback.
Abstract: Systems and methods are provided for adding and displaying interactive annotations for existing online hosted videos. A graphical annotation interface allows the creation of annotations and association of the annotations with a video. Annotations may be of different types and have different functionality, such as altering the appearance and/or behavior of an existing video, e.g. by supplementing it with text, allowing linking to other videos or web pages, or pausing playback of the video. Authentication of a user desiring to perform annotation of a video may be performed in various manners, such as by checking a uniform resource locator (URL) against an existing list, checking a user identifier against an access list, and the like. As a result of authentication, a user is accorded the appropriate annotation abilities, such as full annotation, no annotation, or annotation restricted to a particular temporal or spatial portion of the video.

Proceedings ArticleDOI
01 Sep 2009
TL;DR: A novel and efficient approach, named domain adaptive semantic diffusion (DASD), to exploit semantic context while considering the domain-shift-of-context for large scale video concept annotation, which provides a means to handle domain change between training and test data.
Abstract: Learning to cope with domain change has been known as a challenging problem in many real-world applications. This paper proposes a novel and efficient approach, named domain adaptive semantic diffusion (DASD), to exploit semantic context while considering the domain-shift-of-context for large scale video concept annotation. Starting with a large set of concept detectors, the proposed DASD refines the initial annotation results using a graph diffusion technique, which preserves the consistency and smoothness of the annotation over a semantic graph. Different from the existing graph learning methods which capture relations among data samples, the semantic graph treats concepts as nodes and the concept affinities as the weights of edges. Particularly, the DASD approach is capable of simultaneously improving the annotation results and adapting the concept affinities to new test data. The adaptation provides a means to handle domain change between training and test data, which occurs very often in video annotation tasks. We conduct extensive experiments to improve annotation results of 374 concepts over 340 hours of videos from TRECVID 2005-2007 data sets. Results show consistent and significant performance gain over various baselines. In addition, the proposed approach is very efficient, completing DASD over 374 concepts within just 2 milliseconds for each video shot on a regular PC.
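The idea of refining detector scores over a graph whose nodes are concepts can be sketched as follows. This is a generic graph-smoothing iteration written under simplifying assumptions, not the DASD algorithm itself: it keeps the concept affinities fixed, whereas DASD also adapts them to the test data. The function name, the mixing weight `alpha`, and the toy affinity matrix are hypothetical.

```python
def diffuse_scores(scores, affinity, alpha=0.5, iterations=20):
    """Smooth concept scores for one video shot over a concept graph.

    scores:   initial detector scores, one per concept
    affinity: symmetric matrix of non-negative concept affinities
    alpha:    weight given to neighbouring concepts versus the initial score
    """
    n = len(scores)
    # Row-normalise affinities so each concept averages over its neighbours.
    weights = []
    has_neighbours = []
    for i in range(n):
        total = sum(affinity[i][j] for j in range(n) if j != i)
        has_neighbours.append(total > 0)
        weights.append([
            affinity[i][j] / total if j != i and total > 0 else 0.0
            for j in range(n)
        ])
    current = list(scores)
    for _ in range(iterations):
        updated = []
        for i in range(n):
            if has_neighbours[i]:
                neighbour_avg = sum(weights[i][j] * current[j] for j in range(n))
                updated.append((1 - alpha) * scores[i] + alpha * neighbour_avg)
            else:
                # Isolated concepts keep their initial score.
                updated.append(scores[i])
        current = updated
    return current

# Toy example (hypothetical): "road" and "car" are related, "fish" is unrelated.
concepts = ["road", "car", "fish"]
initial = [0.9, 0.2, 0.1]
affinity = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
refined = diffuse_scores(initial, affinity)
print(dict(zip(concepts, refined)))
```

In this toy case the weak "car" score is pulled up by the strong, related "road" score, while the unrelated "fish" score is left unchanged.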

Patent
05 May 2009
TL;DR: A graphical annotation interface allows the creation of annotations and their association with an existing online hosted video; annotations can alter the video's appearance and/or behavior, e.g. by supplementing it with text, linking to other videos or web pages, or pausing playback.
Abstract: Systems and methods are provided for adding and displaying interactive annotations for existing online hosted videos. A graphical annotation interface allows the creation of annotations and association of the annotations with a video. Annotations may be of different types and have different functionality, such as altering the appearance and/or behavior of an existing video, e.g. by supplementing it with text, allowing linking to other videos or web pages, or pausing playback of the video. Authentication of a user desiring to perform annotation of a video may be performed in various manners, such as by checking a uniform resource locator (URL) against an existing list, checking a user identifier against an access list, and the like. As a result of authentication, a user is accorded the appropriate annotation abilities, such as full annotation, no annotation, or annotation restricted to a particular temporal or spatial portion of the video.

Proceedings ArticleDOI
02 Aug 2009
TL;DR: Experiments show that adaptation from the much larger People's Daily corpus to the smaller but more popular Penn Chinese Treebank results in significant improvements in both segmentation and tagging accuracies, which in turn helps improve Chinese parsing accuracy.
Abstract: Manually annotated corpora are valuable but scarce resources, yet for many annotation tasks such as treebanking and sequence labeling there exist multiple corpora with different and incompatible annotation guidelines or standards. This seems to be a great waste of human effort, and it would be nice to automatically adapt one annotation standard to another. We present a simple yet effective strategy that transfers knowledge from a differently annotated corpus to the corpus with the desired annotation. We test the efficacy of this method in the context of Chinese word segmentation and part-of-speech tagging, where no segmentation and POS tagging standards are widely accepted due to the lack of morphology in Chinese. Experiments show that adaptation from the much larger People's Daily corpus to the smaller but more popular Penn Chinese Treebank results in significant improvements in both segmentation and tagging accuracies (with error reductions of 30.2% and 14%, respectively), which in turn helps improve Chinese parsing accuracy.

Patent
18 Jun 2009
TL;DR: A user input is received as a digital annotation and maintained as at least part of an overlay layer; if information from a program is being displayed, the digital annotation is displayed concurrently with that information.
Abstract: A user input is received as a digital annotation, and the digital annotation is maintained as at least part of an overlay layer. The digital annotation is displayed, and if information from a program is being displayed then the digital annotation is displayed concurrently with the information from the program. Interaction between the overlay layer and the application layer can also be allowed.

Journal ArticleDOI
TL;DR: GenomeGraphs is a flexible and extensible software package which can be used to visualize a multitude of genomic datasets within the statistical programming environment R.
Abstract: Biological studies involve a growing number of distinct high-throughput experiments to characterize samples of interest. There is a lack of methods to visualize these different genomic datasets in a versatile manner. In addition, genomic data analysis requires integrated visualization of experimental data along with constantly changing genomic annotation and statistical analyses. We developed GenomeGraphs, as an add-on software package for the statistical programming environment R, to facilitate integrated visualization of genomic datasets. GenomeGraphs uses the biomaRt package to perform on-line annotation queries to Ensembl and translates these to gene/transcript structures in viewports of the grid graphics package. This allows genomic annotation to be plotted together with experimental data. GenomeGraphs can also be used to plot custom annotation tracks in combination with different experimental data types together in one plot using the same genomic coordinate system. GenomeGraphs is a flexible and extensible software package which can be used to visualize a multitude of genomic datasets within the statistical programming environment R.

Journal ArticleDOI
20 Jul 2009-PLOS ONE
TL;DR: This work submitted the genome sequence of the halophilic archaeon Halorhabdus utahensis to three genome annotation services to compare the methodology and effectiveness of the annotations, as well as to explore the genes, pathways, and physiology of the previously unannotated genome.
Abstract: Genome annotations are accumulating rapidly and depend heavily on automated annotation systems. Many genome centers offer annotation systems but no one has compared their output in a systematic way to determine accuracy and inherent errors. Errors in the annotations are routinely deposited in databases such as NCBI and used to validate subsequent annotation errors. We submitted the genome sequence of halophilic archaeon Halorhabdus utahensis to be analyzed by three genome annotation services. We have examined the output from each service in a variety of ways in order to compare the methodology and effectiveness of the annotations, as well as to explore the genes, pathways, and physiology of the previously unannotated genome. The annotation services differ considerably in gene calls, features, and ease of use. We had to manually identify the origin of replication and the species-specific consensus ribosome-binding site. Additionally, we conducted laboratory experiments to test H. utahensis growth and enzyme activity. Current annotation practices need to improve in order to more accurately reflect a genome's biological potential. We make specific recommendations that could improve the quality of microbial annotation projects.

Journal ArticleDOI
01 Apr 2009
TL;DR: This paper proposes a novel method named correlative linear neighborhood propagation to improve annotation performance and demonstrates its effectiveness and efficiency on the Text REtrieval Conference VIDeo retrieval evaluation data set.
Abstract: Recently, graph-based semi-supervised learning methods have been widely applied in the multimedia research area. However, for the application of video semantic annotation in a multi-label setting, these methods neglect an important characteristic of video data: the semantic concepts appear correlatively and interact naturally with each other rather than existing in isolation. In this paper, we incorporate this semantic correlation into graph-based semi-supervised learning and propose a novel method named correlative linear neighborhood propagation to improve annotation performance. Experiments conducted on the Text REtrieval Conference VIDeo retrieval evaluation data set have demonstrated its effectiveness and efficiency.

Journal ArticleDOI
TL;DR: This work employs a constrained clustering method to partition a photo collection into event-based subcollections and uses conditional random field (CRF) models to exploit the correlation between photos based on time-location constraints.
Abstract: Most image annotation systems consider a single photo at a time and label photos individually. In this work, we focus on collections of personal photos and exploit the contextual information naturally implied by the associated GPS and time metadata. First, we employ a constrained clustering method to partition a photo collection into event-based subcollections, considering that the GPS records may be partly missing (a practical issue). We then use conditional random field (CRF) models to exploit the correlation between photos based on 1) time-location constraints and 2) the relationship between collection-level annotation (i.e., events) and image-level annotation (i.e., scenes). With the introduction of such a multilevel annotation hierarchy, our system addresses the problem of annotating consumer photo collections that requires a more hierarchical description of the customers' activities than do the simpler image annotation tasks. The efficacy of the proposed system is validated by extensive evaluation using a sizable geotagged personal photo collection database, which consists of over 100 photo collections and is manually labeled for 12 events and 12 scenes to create ground truth.
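
The first step described above, partitioning a photo collection into events when GPS records may be partly missing, can be illustrated with a minimal sketch. This is a simple gap-based heuristic swapped in for illustration, not the constrained clustering method of the paper: the function names, the dictionary layout of a photo, and both thresholds are assumptions.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def split_into_events(photos, max_gap_s=3600, max_dist_km=5.0):
    """Group photos into events by walking them in time order.

    A new event starts when the time gap to the previous photo is too large,
    or when both photos carry GPS and are too far apart. If either GPS record
    is missing, only the time gap is used. Thresholds are illustrative.
    """
    events = []
    for photo in sorted(photos, key=lambda p: p["time"]):
        if events:
            prev = events[-1][-1]
            far_in_time = photo["time"] - prev["time"] > max_gap_s
            both_gps = photo.get("gps") is not None and prev.get("gps") is not None
            far_in_space = both_gps and haversine_km(photo["gps"], prev["gps"]) > max_dist_km
            if not (far_in_time or far_in_space):
                events[-1].append(photo)
                continue
        events.append([photo])
    return events

# Hypothetical collection: "time" is seconds since an arbitrary start.
photos = [
    {"time": 0, "gps": (40.0, -75.0)},
    {"time": 600, "gps": None},            # GPS record missing
    {"time": 1200, "gps": (40.01, -75.0)},
    {"time": 20000, "gps": (40.01, -75.0)},
    {"time": 20100, "gps": (41.0, -75.0)},
]
events = split_into_events(photos)
event_sizes = [len(e) for e in events]
```

Event-level and scene-level labels would then be assigned per group and per photo; that step, handled with CRF models in the paper, is outside this sketch.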