
Showing papers on "Annotation" published in 2015


Journal ArticleDOI
TL;DR: A protocol to use the ANNOVAR (ANNOtate VARiation) software to facilitate fast and easy variant annotations, including gene-based, region-based and filter-based annotations on a variant call format (VCF) file generated from human genomes.
Abstract: This protocol describes how to annotate genomic variants using either the ANNOVAR software or the web-based wANNOVAR tool. Recent developments in sequencing techniques have enabled rapid and high-throughput generation of sequence data, democratizing the ability to compile information on large amounts of genetic variations in individual laboratories. However, there is a growing gap between the generation of raw sequencing data and the extraction of meaningful biological information. Here, we describe a protocol to use the ANNOVAR (ANNOtate VARiation) software to facilitate fast and easy variant annotations, including gene-based, region-based and filter-based annotations on a variant call format (VCF) file generated from human genomes. We further describe a protocol for gene-based annotation of a newly sequenced nonhuman species. Finally, we describe how to use a user-friendly and easily accessible web server called wANNOVAR to prioritize candidate genes for a Mendelian disease. The variant annotation protocols take 5–30 min of computer time, depending on the size of the variant file, and 5–10 min of hands-on time. In summary, through the command-line tool and the web server, these protocols provide a convenient means to analyze genetic variants generated in humans and other species.
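The annotators described above all consume variant call format (VCF) files. As background, this is a minimal sketch of reading the fixed columns of one VCF data record; it is illustrative only (the field names follow the VCF specification, but real pipelines should use a dedicated parser such as pysam):

```python
# Minimal sketch of reading the fixed columns of one VCF data line,
# the input format consumed by variant annotators such as ANNOVAR.
# Illustrative only; production code should use a VCF library.

VCF_FIELDS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def parse_vcf_line(line):
    """Split one non-header VCF data line into a dict of its fixed columns."""
    values = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_FIELDS, values[:8]))
    record["POS"] = int(record["POS"])        # 1-based genomic position
    record["ALT"] = record["ALT"].split(",")  # comma-separated alternates
    return record

record = parse_vcf_line("chr7\t55249071\t.\tC\tT\t100\tPASS\tDP=52")
print(record["CHROM"], record["POS"], record["ALT"])
```

Gene-, region- and filter-based annotation then attach information to each such record by gene model overlap, interval overlap, or database lookup, respectively.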

654 citations


Journal ArticleDOI
TL;DR: The Gene Ontology Annotation resource now provides annotations to five times the number of proteins it did 4 years ago, thanks to quality control checks that ensure that the GOA resource supplies high-quality functional information to proteins from a wide range of species.
Abstract: The Gene Ontology Annotation (GOA) resource (http://www.ebi.ac.uk/GOA) provides evidence-based Gene Ontology (GO) annotations to proteins in the UniProt Knowledgebase (UniProtKB). Manual annotations provided by UniProt curators are supplemented by manual and automatic annotations from model organism databases and specialist annotation groups. GOA currently supplies 368 million GO annotations to almost 54 million proteins in more than 480,000 taxonomic groups. The resource now provides annotations to five times the number of proteins it did 4 years ago. As a member of the GO Consortium, we adhere to the most up-to-date Consortium-agreed annotation guidelines via the use of quality control checks that ensure that the GOA resource supplies high-quality functional information to proteins from a wide range of species. Annotations from GOA are freely available and are accessible through a powerful web browser as well as a variety of annotation file formats.
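GOA distributes its annotations in the tab-separated GAF format. The sketch below extracts the core fields from one GAF line; the column positions follow the GAF 2.x specification, and the sample line is invented for illustration:

```python
# Sketch of extracting core fields from one line of a GO annotation file
# in GAF (tab-separated) format, as distributed by GOA.
# Column indices follow the GAF 2.x specification; sample data is invented.

def parse_gaf_line(line):
    cols = line.rstrip("\n").split("\t")
    return {
        "db": cols[0],          # source database, e.g. UniProtKB
        "object_id": cols[1],   # protein accession
        "symbol": cols[2],      # gene/protein symbol
        "go_id": cols[4],       # GO term identifier
        "evidence": cols[6],    # evidence code, e.g. IDA (manual), IEA (automatic)
        "aspect": cols[8],      # P (process), F (function), or C (component)
    }

line = ("UniProtKB\tP12345\tAATM\t\tGO:0005739\tGO_REF:0000052\tIDA\t\tC"
        "\t\t\tprotein\ttaxon:9986\t20150101\tUniProt")
ann = parse_gaf_line(line)
print(ann["go_id"], ann["evidence"], ann["aspect"])
```

The evidence-code column is what separates the manual curation from the automatic pipelines mentioned in the abstract.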

498 citations


Journal ArticleDOI
TL;DR: The DOE-JGI Microbial Genome Annotation Pipeline performs structural and functional annotation of microbial genomes that are further included into the Integrated Microbial Genome comparative analysis system.
Abstract: The DOE-JGI Microbial Genome Annotation Pipeline performs structural and functional annotation of microbial genomes that are further included into the Integrated Microbial Genome comparative analysis system. MGAP is applied to assembled nucleotide sequence datasets that are provided via the IMG submission site. Dataset submission for annotation first requires project and associated metadata description in GOLD. The MGAP sequence data processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNA features, as well as CRISPR elements. Structural annotation is followed by assignment of protein product names and functions.

303 citations


Journal ArticleDOI
TL;DR: Linking genes and functional information to genetic variants identified by association studies remains difficult; this work integrates many annotation datasets into a user-friendly webserver to enhance their accessibility.
Abstract: Motivation: Linking genes and functional information to genetic variants identified by association studies remains difficult. Resources containing extensive genomic annotations are available but often not fully utilized due to heterogeneous data formats. To enhance their accessibility, we integrated many annotation datasets into a user-friendly webserver. Availability and implementation: http://www.snipa.org/ Contact: g.kastenmueller@helmholtz-muenchen.de Supplementary information: Supplementary data are available at Bioinformatics online.

251 citations


Journal ArticleDOI
TL;DR: The progress of the GENCODE mouse annotation project is described, which combines manual annotation from the HAVANA group with Ensembl computational annotation, alongside experimental and in silico validation pipelines from other members of the consortium.
Abstract: Annotation of the reference genome of the C57BL/6J mouse has been an ongoing project ever since the draft genome was first published. Initially, the principal focus was on the identification of all protein-coding genes, although today the importance of describing long non-coding RNAs, small RNAs, and pseudogenes is recognized. Here, we describe the progress of the GENCODE mouse annotation project, which combines manual annotation from the HAVANA group with Ensembl computational annotation, alongside experimental and in silico validation pipelines from other members of the consortium. We discuss the more recent incorporation of next-generation sequencing datasets into this workflow, including the usage of mass-spectrometry data to potentially identify novel protein-coding genes. Finally, we will outline how the C57BL/6J genebuild can be used to gain insights into the variant sites that distinguish different mouse strains and species.

187 citations


Journal ArticleDOI
Qiongshi Lu1, Yiming Hu1, Jiehuan Sun1, Yuwei Cheng1, Kei-Hoi Cheung1, Hongyu Zhao1 
TL;DR: GenoCanyon is presented, a whole-genome annotation method that performs unsupervised statistical learning using 22 computational and experimental annotations thereby inferring the functional potential of each position in the human genome.
Abstract: Identifying functional regions in the human genome is a major goal in human genetics. Great efforts have been made to functionally annotate the human genome either through computational predictions, such as genomic conservation, or high-throughput experiments, such as the ENCODE project. These efforts have resulted in a rich collection of functional annotation data of diverse types that need to be jointly analyzed for integrated interpretation and annotation. Here we present GenoCanyon, a whole-genome annotation method that performs unsupervised statistical learning using 22 computational and experimental annotations thereby inferring the functional potential of each position in the human genome. With GenoCanyon, we are able to predict many of the known functional regions. The ability of predicting functional regions as well as its generalizable statistical framework makes GenoCanyon a unique and powerful tool for whole-genome annotation. The GenoCanyon web server is available at http://genocanyon.med.yale.edu
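GenoCanyon's actual model jointly analyzes 22 annotations; as a much smaller illustration of the underlying idea, the toy below models a single annotation signal as a two-component mixture (functional vs. non-functional) and reports the posterior probability of "functional" at a position. All parameters here are invented for the example:

```python
import math

# Toy sketch of unsupervised functional scoring: model one annotation
# signal as a two-component 1-D Gaussian mixture and report the posterior
# probability that a position is functional. GenoCanyon's real model
# integrates 22 annotations; these parameters are invented.

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def posterior_functional(x, prior=0.1, mu_f=2.0, mu_n=0.0, sigma=1.0):
    """P(functional | signal x) via Bayes' rule on the mixture."""
    lik_f = gaussian_pdf(x, mu_f, sigma)   # likelihood under "functional"
    lik_n = gaussian_pdf(x, mu_n, sigma)   # likelihood under "non-functional"
    return prior * lik_f / (prior * lik_f + (1 - prior) * lik_n)

print(round(posterior_functional(2.5), 3))  # strong signal: high posterior
print(round(posterior_functional(0.0), 3))  # no signal: low posterior
```

In the unsupervised setting, the mixture parameters themselves would be estimated from the genome-wide data (e.g., by EM) rather than fixed as above.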

150 citations


Journal ArticleDOI
TL;DR: A report on the growth of HAMAP and updates to the HAMAP system since the last report in the NAR Database Issue of 2013, including improvements to the web-based tool HAMAP-Scan that simplify the classification and annotation of sequences and the incorporation of an improved sequence-profile search algorithm.
Abstract: HAMAP (High-quality Automated and Manual Annotation of Proteins; available at http://hamap.expasy.org/) is a system for the automatic classification and annotation of protein sequences. HAMAP provides annotation of the same quality and detail as UniProtKB/Swiss-Prot, using manually curated profiles for protein sequence family classification and expert curated rules for functional annotation of family members. HAMAP data and tools are made available through our website and as part of the UniRule pipeline of UniProt, providing annotation for millions of unreviewed sequences of UniProtKB/TrEMBL. Here we report on the growth of HAMAP and updates to the HAMAP system since our last report in the NAR Database Issue of 2013. We continue to augment HAMAP with new family profiles and annotation rules as new protein families are characterized and annotated in UniProtKB/Swiss-Prot; the latest version of HAMAP (as of 3 September 2014) contains 1983 family classification profiles and 1998 annotation rules (up from 1780 and 1720). We demonstrate how the complex logic of HAMAP rules allows for precise annotation of individual functional variants within large homologous protein families. We also describe improvements to our web-based tool HAMAP-Scan which simplify the classification and annotation of sequences, and the incorporation of an improved sequence-profile search algorithm.

119 citations


Book ChapterDOI
Joakim Nivre1
14 Apr 2015
TL;DR: The motivation behind the initiative, how the basic design principles follow from its requirements, and the different components of the annotation standard, including principles for word segmentation, morphological annotation, and syntactic annotation, are discussed.
Abstract: Universal Dependencies is a recent initiative to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. In this paper, I outline the motivation behind the initiative and explain how the basic design principles follow from these requirements. I then discuss the different components of the annotation standard, including principles for word segmentation, morphological annotation, and syntactic annotation. I conclude with some thoughts on the challenges that lie ahead.
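Universal Dependencies treebanks are distributed in the CoNLL-U format, one token per line with ten tab-separated columns. The sketch below reads one token line (column names follow the CoNLL-U specification; the sample token is invented):

```python
# Sketch of reading one token line of the CoNLL-U format used by
# Universal Dependencies treebanks (10 tab-separated columns).
# Sample data is invented for illustration.

CONLLU_COLS = ["id", "form", "lemma", "upos", "xpos",
               "feats", "head", "deprel", "deps", "misc"]

def parse_conllu_token(line):
    token = dict(zip(CONLLU_COLS, line.rstrip("\n").split("\t")))
    token["head"] = int(token["head"])  # head index; 0 means syntactic root
    return token

tok = parse_conllu_token("2\tdogs\tdog\tNOUN\tNN\tNumber=Plur\t3\tnsubj\t_\t_")
print(tok["upos"], tok["head"], tok["deprel"])
```

The UPOS, FEATS, and DEPREL columns carry exactly the three annotation layers the paper discusses: the universal part-of-speech tagset, the morphological feature inventory, and the universal dependency relations.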

116 citations


Journal ArticleDOI
TL;DR: Results demonstrate that the use of the semi-automatic, as well as the automatic, modality drastically reduces the human effort while preserving the quality of the annotations.

93 citations


Proceedings ArticleDOI
01 Jun 2015
TL;DR: An annotation tool that supports the annotation of such graphs is developed and an annotation study with four annotators on 24 scientific articles from the domain of educational research is carried out.
Abstract: This paper presents the results of an annotation study focused on the fine-grained analysis of argumentation structures in scientific publications. Our new annotation scheme specifies four types of binary argumentative relations between sentences, resulting in the representation of arguments as small graph structures. We developed an annotation tool that supports the annotation of such graphs and carried out an annotation study with four annotators on 24 scientific articles from the domain of educational research. For calculating the inter-annotator agreement, we adapted existing measures and developed a novel graph-based agreement measure which reflects the semantic similarity of different annotation graphs.
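The paper's graph-based agreement measure is its own contribution; as background for the measures it adapts, this is a plain Cohen's kappa over two annotators' categorical relation labels (the label set and data below are invented):

```python
from collections import Counter

# Plain Cohen's kappa for two annotators' categorical labels: observed
# agreement corrected for chance agreement. This is standard background,
# not the paper's novel graph-based measure; the labels are invented.

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # chance agreement from each annotator's marginal label distribution
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

a = ["support", "support", "attack", "none", "support", "none"]
b = ["support", "attack", "attack", "none", "support", "none"]
print(round(cohens_kappa(a, b), 3))
```

A graph-based measure additionally has to credit partially overlapping argument graphs, which plain kappa over independent labels cannot do.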

83 citations


Journal ArticleDOI
18 Sep 2015-PLOS ONE
TL;DR: The updated Arabidopsis lyrata gene annotation corrects hundreds of incorrectly split or merged gene models in the original annotation, and as a result the identification of alternative splicing events and differential isoform usage is vastly improved.
Abstract: Gene model annotations are important community resources that ensure comparability and reproducibility of analyses and are typically the first step for functional annotation of genomic regions. Without up-to-date genome annotations, genome sequences cannot be used to maximum advantage. It is therefore essential to regularly update gene annotations by integrating the latest information to guarantee that reference annotations can remain a common basis for various types of analyses. Here, we report an improvement of the Arabidopsis lyrata gene annotation using extensive RNA-seq data. This new annotation consists of 31,132 protein coding gene models in addition to 2,089 genes with high similarity to transposable elements. Overall, ~87% of the gene models are corroborated by evidence of expression and 2,235 of these models feature multiple transcripts. Our updated gene annotation corrects hundreds of incorrectly split or merged gene models in the original annotation, and as a result the identification of alternative splicing events and differential isoform usage are vastly improved.

Proceedings ArticleDOI
01 Sep 2015
TL;DR: It is found that non-experts, with very little training, can reliably provide judgments about what events are mentioned and the extent to which the author thinks they actually happened.
Abstract: Events are communicated in natural language with varying degrees of certainty. For example, if you are “hoping for a raise,” it may be somewhat less likely than if you are “expecting” one. To study these distinctions, we present scalable, high-quality annotation schemes for event detection and fine-grained factuality assessment. We find that non-experts, with very little training, can reliably provide judgments about what events are mentioned and the extent to which the author thinks they actually happened. We also show how such data enables the development of regression models for fine-grained scalar factuality predictions that outperform strong baselines.

Posted Content
TL;DR: This paper annotates static 3D scene elements with rough bounding primitives and develops a model which transfers this information into the image domain and reveals that 3D information enables more efficient annotation while at the same time resulting in improved accuracy and time-coherent labels.
Abstract: Semantic annotations are vital for training models for object recognition, semantic segmentation or scene understanding. Unfortunately, pixelwise annotation of images at very large scale is labor-intensive and only little labeled data is available, particularly at instance level and for street scenes. In this paper, we propose to tackle this problem by lifting the semantic instance labeling task from 2D into 3D. Given reconstructions from stereo or laser data, we annotate static 3D scene elements with rough bounding primitives and develop a model which transfers this information into the image domain. We leverage our method to obtain 2D labels for a novel suburban video dataset which we have collected, resulting in 400k semantic and instance image annotations. A comparison of our method to state-of-the-art label transfer baselines reveals that 3D information enables more efficient annotation while at the same time resulting in improved accuracy and time-coherent labels.

Journal ArticleDOI
TL;DR: The creation of the IxaMed-GS gold standard composed of real electronic health records written in Spanish and manually annotated by experts in pharmacology and pharmacovigilance, and used for the automatic extraction of adverse drug reaction events using machine learning.

Journal ArticleDOI
TL;DR: A novel variant annotator, TransVar, is designed to perform three main functions supporting diverse reference genomes and transcript databases and can be used to ascertain if two protein variants have identical genomic origin, thus reducing inconsistency in annotation data.
Abstract: One DNA sequence can code for multiple different mRNAs, and therefore many different proteins. Conversely, a variant identified at the protein or transcript level may have non-unique genomic origins. For example, EGFR:p.L747S, which mediates acquired resistance of non-small cell lung cancer to tyrosine kinase inhibitors [1], can be translated from multiple genomic variants such as chr7:g.55249076_55249077delinsAG and chr7:g.55242470T>C on different isoforms defined on the human reference assembly GRCh37. One-to-many, many-to-one and many-to-many relationships among sequence variants at the genomic level and those at transcript and protein levels introduce frequent inconsistencies in current practice when vital information about the annotation process (e.g., transcript or isoform IDs) is omitted from variant identifiers. To facilitate standardization and reveal inconsistency in existing variant annotations, we have designed a novel variant annotator, TransVar, to perform three main functions supporting diverse reference genomes and transcript databases (Fig. 1a): (i) “forward annotation”, which annotates all potential effects of a genomic variant on mRNAs and proteins; (ii) “reverse annotation”, which traces an mRNA or protein variant to all potential genomic origins; and (iii) “equivalence annotation”, which, for a given protein variant, searches for alternative protein variants that have identical genomic origin but are represented based on different isoforms.
[Figure 1: Schematic overview of TransVar and comparison of TransVar with other tools.]
We annotated 964,132 unique single-nucleotide substitutions (SNS), 3,715 multi-nucleotide substitutions (MNS), 11,761 insertions (INS), 24,595 deletions (DEL) and 166 block substitutions (BLS) in the Catalogue of Somatic Mutations in Cancer (COSMIC v67) using TransVar, ANNOVAR [2], VEP [3], snpEff [4], and Oncotator [5], and asked whether the resulting protein identifiers (gene name, protein coordinates, and reference amino acid (AA)) match those in COSMIC. We observed comparable consistency in SNS and MNS but variable consistency in INS, DEL and BLS from different annotators (Fig. 1b, Supplementary Table 1 and Supplementary Notes). That finding can largely be attributed to a lack of standardization among variant annotations (codon or AA positions of variants) submitted to COSMIC and among conventions implemented in various annotators. Inconsistency in annotations blurred the lines of evidence for variant frequency estimation and led to inaccurate determination of variant function. TransVar revealed hidden inconsistency in these variant annotations by comprehensively outputting alternative annotations in all available transcripts in standard HGVS nomenclature, and thus resulted in greater consistency in this experiment. TransVar’s novel reverse annotation can be used to ascertain if two protein variants have identical genomic origin, thus reducing inconsistency in annotation data. It can also reveal whether or not a protein variant has non-unique genomic origins and requires caution in genetic and clinical interpretation. We reverse-annotated the protein level variants in COSMIC and found that even under the constraints imposed by the reference base or AA identity, a sizeable fraction (e.g., 11.9% of single-AA substitutions) were associated with multiple genomic variants (Supplementary Table 2), if transcripts were not specified.
Among the 537 variants that were cited as clinically actionable at PersonalizedCancerTherapy.org, 78 (14.5%) (e.g., CDKN2A:p.R87P and ERBB2:p.L755_T759del) could be mapped to multiple genomic locations (Supplementary Table 3). The reverse-annotation functionality also enabled systematic genomic characterization of variants directly identified from proteomic or RNA-seq data. For example, we were able to identify in just a few minutes of compute-time the putative genomic origins of 187,464 (97.69%) protein phosphorylation sites (e.g., p.Y308/p.S473 in AKT1 and p.Y1068/p.Y1172 in EGFR) in human proteins [6]. Our investigation revealed frequent inconsistencies in current databases and tools and highlighted the importance of standardization. With both forward and reverse annotation enabled in TransVar, we can reveal hidden inconsistency and improve the precision of translational and clinical genomics. The source code and detailed instructions for TransVar are available at https://bitbucket.org/wanding/transvar and a web interface is at http://www.transvar.net.
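The one-to-many relationship that reverse annotation exposes can be pictured as a simple lookup from a protein-level identifier to its genomic origins. The EGFR example is taken from the abstract; the lookup table itself is a toy stand-in for a real transcript database:

```python
# Toy illustration of "reverse annotation": one protein-level variant can
# trace back to multiple genomic variants via different transcripts.
# The EGFR entry comes from the abstract; the table is a stand-in for a
# real transcript database, not TransVar's implementation.

REVERSE_INDEX = {
    "EGFR:p.L747S": [
        "chr7:g.55249076_55249077delinsAG",
        "chr7:g.55242470T>C",
    ],
}

def reverse_annotate(protein_variant):
    """Return all known genomic origins of a protein-level variant."""
    return REVERSE_INDEX.get(protein_variant, [])

origins = reverse_annotate("EGFR:p.L747S")
print(len(origins), "genomic origins")
```

Whenever such a lookup returns more than one origin, the protein identifier alone is ambiguous, which is exactly the inconsistency the paper documents in COSMIC.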

01 Aug 2015
TL;DR: This paper quantifies the differences that can be observed when replacing gold standard labels, and its results should influence application developers that rely on cross-lingual models that are not tested in real life.
Abstract: This paper presents cross-lingual models for dependency parsing using the first release of the universal dependencies data set. We systematically compare annotation projection with monolingual baseline models and study the effect of predicted PoS labels in evaluation. Our results reveal the strong impact of tagging accuracy especially with models trained on noisy projected data sets. This paper quantifies the differences that can be observed when replacing gold standard labels and our results should influence application developers that rely on cross-lingual models that are not tested in realistic scenarios.

Proceedings ArticleDOI
01 Jul 2015
TL;DR: This work proposes an alternative approach, which takes the example trigger terms mentioned in the guidelines as seeds, and then applies an event-independent similarity-based classifier for trigger labeling, which can skip manual annotation for new event types, while requiring only minimal annotated training data for few example events at system setup.
Abstract: The task of event trigger labeling is typically addressed in the standard supervised setting: triggers for each target event type are annotated as training data, based on annotation guidelines. We propose an alternative approach, which takes the example trigger terms mentioned in the guidelines as seeds, and then applies an event-independent similarity-based classifier for trigger labeling. This way we can skip manual annotation for new event types, while requiring only minimal annotated training data for few example events at system setup. Our method is evaluated on the ACE-2005 dataset, achieving 5.7% F1 improvement over a state-of-the-art supervised system which uses the full training data.
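A similarity-based trigger classifier of this kind can be sketched as: label a candidate word as a trigger if it is close enough to any seed trigger from the guidelines. The tiny vectors below are invented for illustration; a real system would use distributional word embeddings and a tuned threshold:

```python
import math

# Sketch of a similarity-based trigger classifier: a candidate is a
# trigger if its vector is close enough to any seed trigger vector.
# The 3-d vectors and threshold are invented; real systems use learned
# word embeddings.

SEED_VECTORS = {
    "attack": [0.9, 0.1, 0.0],
    "bombing": [0.8, 0.2, 0.1],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def is_trigger(candidate_vec, threshold=0.9):
    """True if the candidate is similar enough to any seed trigger."""
    return any(cosine(candidate_vec, seed) >= threshold
               for seed in SEED_VECTORS.values())

print(is_trigger([0.85, 0.15, 0.05]))  # near the seed directions
print(is_trigger([0.0, 0.1, 0.9]))     # unrelated direction
```

Because the seeds come straight from the guidelines, adding a new event type only requires listing its example triggers, which is the annotation saving the paper reports.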

Journal ArticleDOI
TL;DR: The FunFHMMer web server is presented, which provides Gene Ontology (GO) annotations for query protein sequences based on the functional classification of the domain-based CATH-Gene3D resource, and also provides valuable information for the prediction of functional sites.
Abstract: The widening function annotation gap in protein databases and the increasing number and diversity of the proteins being sequenced presents new challenges to protein function prediction methods. Multidomain proteins complicate the protein sequence-structure-function relationship further as new combinations of domains can expand the functional repertoire, creating new proteins and functions. Here, we present the FunFHMMer web server, which provides Gene Ontology (GO) annotations for query protein sequences based on the functional classification of the domain-based CATH-Gene3D resource. Our server also provides valuable information for the prediction of functional sites. The predictive power of FunFHMMer has been validated on a set of 95 proteins where FunFHMMer performs better than BLAST, Pfam and CDD. Recent validation by an independent international competition ranks FunFHMMer as one of the top function prediction methods in predicting GO annotations for both the Biological Process and Molecular Function Ontology. The FunFHMMer web server is available at http://www.cathdb.info/search/by_funfhmmer.

Proceedings ArticleDOI
01 Jun 2015
TL;DR: This paper describes the processes and issues of annotating event nuggets based on DEFT ERE Annotation Guidelines v1.3 and TAC KBP Event Detection Annotation Guidelines 1.7 and proposes using Event Nuggets to help meet the definitions of the specific type/subtypes which are part of this project.
Abstract: This paper describes the processes and issues of annotating event nuggets based on DEFT ERE Annotation Guidelines v1.3 and TAC KBP Event Detection Annotation Guidelines 1.7. Using Brat Rapid Annotation Tool (brat), newswire and discussion forum documents were annotated. One of the challenges arising from human annotation of documents is annotators’ disagreement about the way of tagging events. We propose using Event Nuggets to help meet the definitions of the specific type/subtypes which are part of this project. We present case studies of several examples of event annotation issues, including discontinuous multi-word events representing single events. Annotation statistics and consistency analysis are provided to characterize the inter-annotator agreement, considering single term events and multi-word events which are both continuous and discontinuous. Consistency analysis is conducted using a scorer to compare first pass annotated files against adjudicated files.

Proceedings ArticleDOI
21 Sep 2015
TL;DR: The key findings of the paper demonstrate that the current dominant practice in continuous affect annotation via rating-based labeling is detrimental to advancements in the field of affective computing.
Abstract: The question of how to best annotate affect within available content has been a milestone challenge for affective computing. Appropriate methods and tools addressing that question can provide better estimations of the ground truth which, in turn, may lead to more efficient affect detection and more reliable models of affect. This paper introduces a rank-based real-time annotation tool, which we name AffectRank, and compares it against the popular rating-based real-time FeelTrace tool through a proof-of-concept video annotation experiment. Results obtained suggest that the rank-based (ordinal) annotation approach proposed yields significantly higher inter-rater reliability and, thereby, approximation of the underlying ground truth. The key findings of the paper demonstrate that the current dominant practice in continuous affect annotation via rating-based labeling is detrimental to advancements in the field of affective computing.
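For rank-based (ordinal) annotations, agreement is naturally measured over orderings rather than absolute values. A common such statistic is Kendall's tau; the sketch below is a plain tau-a over two annotators' rankings of the same clips (illustrative background, not AffectRank's exact reliability measure):

```python
from itertools import combinations

# Kendall's tau-a between two annotators' rankings of the same items:
# (concordant pairs - discordant pairs) / total pairs. Illustrative
# background for ordinal agreement, not the paper's exact measure.

def kendall_tau(ranks_a, ranks_b):
    n = len(ranks_a)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (ranks_a[i] - ranks_a[j]) * (ranks_b[i] - ranks_b[j])
        if sign > 0:
            concordant += 1   # both annotators order the pair the same way
        elif sign < 0:
            discordant += 1   # the annotators disagree on this pair
    return (concordant - discordant) / (n * (n - 1) / 2)

print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))  # identical orderings
print(kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]))  # reversed orderings
```

Because tau only looks at pairwise orderings, it is insensitive to the scale-usage differences between raters that the paper argues make rating-based labels unreliable.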

Journal ArticleDOI
TL;DR: This is the first gold-standard corpus for biomedical concept recognition in languages other than English, and the inter-annotator agreement scores provide a reference standard for gauging the performance of automatic annotation techniques.

06 May 2015
TL;DR: This work details the mapping of previously introduced annotation to the UD standard, describing specific challenges and their resolution, and presents parsing experiments comparing the performance of a state of theart parser trained on a languagespecific annotation schema to performance on the corresponding UD annotation.
Abstract: There has been substantial recent interest in annotation schemes that can be applied consistently to many languages. Building on several recent efforts to unify morphological and syntactic annotation, the Universal Dependencies (UD) project seeks to introduce a cross-linguistically applicable part-of-speech tagset, feature inventory, and set of dependency relations as well as a large number of uniformly annotated treebanks. We present Universal Dependencies for Finnish, one of the ten languages in the recent first release of UD project treebank data. We detail the mapping of previously introduced annotation to the UD standard, describing specific challenges and their resolution. We additionally present parsing experiments comparing the performance of a state-of-the-art parser trained on a language-specific annotation schema to performance on the corresponding UD annotation. The results show improvement compared to the source annotation, indicating that the conversion is accurate and supporting the feasibility of UD as a parsing target. The introduced tools and resources are available under open licenses from http://bionlp.utu.fi/ud-finnish.html.

Journal ArticleDOI
TL;DR: The current status of the FlyBase annotated gene set for Drosophila melanogaster is reported and improvements based on high-throughput data are highlighted and remaining challenges are discussed, for instance, identification of functional small polypeptides and detection of alternative translation starts.
Abstract: We report the current status of the FlyBase annotated gene set for Drosophila melanogaster and highlight improvements based on high-throughput data. The FlyBase annotated gene set consists entirely of manually annotated gene models, with the exception of some classes of small non-coding RNAs. All gene models have been reviewed using evidence from high-throughput datasets, primarily from the modENCODE project. These datasets include RNA-Seq coverage data, RNA-Seq junction data, transcription start site profiles, and translation stop-codon read-through predictions. New annotation guidelines were developed to take into account the use of the high-throughput data. We describe how this flood of new data was incorporated into thousands of new and revised annotations. FlyBase has adopted a philosophy of excluding low-confidence and low-frequency data from gene model annotations; we also do not attempt to represent all possible permutations for complex and modularly organized genes. This has allowed us to produce a high-confidence, manageable gene annotation dataset that is available at FlyBase (http://flybase.org). Interesting aspects of new annotations include new genes (coding, non-coding, and antisense), many genes with alternative transcripts with very long 3′ UTRs (up to 15–18 kb), and a stunning mismatch in the number of male-specific genes (approximately 13% of all annotated gene models) vs. female-specific genes (less than 1%). The number of identified pseudogenes and mutations in the sequenced strain also increased significantly. We discuss remaining challenges, for instance, identification of functional small polypeptides and detection of alternative translation starts.

Journal ArticleDOI
TL;DR: CAVA (Clinical Annotation of VAriants) is a fast, lightweight, freely available tool, designed for easy incorporation into NGS pipelines, that provides rapid, robust, high-throughput clinical annotation of NGS data using a standardized clinical sequencing nomenclature.
Abstract: Next-generation sequencing (NGS) offers unprecedented opportunities to expand clinical genomics. It also presents challenges with respect to integration with data from other sequencing methods and historical data. Provision of consistent, clinically applicable variant annotation of NGS data has proved difficult, particularly of indels, an important variant class in clinical genomics. Annotation in relation to a reference genome sequence, the DNA strand of coding transcripts and potential alternative variant representations has not been well addressed. Here we present tools that address these challenges to provide rapid, standardized, clinically appropriate annotation of NGS data in line with existing clinical standards. We developed a clinical sequencing nomenclature (CSN), a fixed variant annotation consistent with the principles of the Human Genome Variation Society (HGVS) guidelines, optimized for automated variant annotation of NGS data. To deliver high-throughput CSN annotation we created CAVA (Clinical Annotation of VAriants), a fast, lightweight tool designed for easy incorporation into NGS pipelines. CAVA allows transcript specification, appropriately accommodates the strand of a gene transcript and flags variants with alternative annotations to facilitate clinical interpretation and comparison with other datasets. We evaluated CAVA in exome data and a clinical BRCA1/BRCA2 gene testing pipeline. CAVA generated CSN calls for 10,313,034 variants in the ExAC database in 13.44 hours, and annotated the ICR1000 exome series in 6.5 hours. Evaluation of 731 different indels from a single individual revealed 92% had alternative representations in left-aligned and right-aligned data. Annotation of left-aligned data, as performed by many annotation tools, would thus give clinically discrepant annotation for the 339 (46%) indels in genes transcribed from the forward DNA strand. By contrast, CAVA provides the correct clinical annotation for all indels. CAVA also flagged the 370 indels with alternative representations of a different functional class, which may profoundly influence clinical interpretation. CAVA annotation of 50 BRCA1/BRCA2 gene mutations from a clinical pipeline gave 100% concordance with Sanger data; only 8/25 BRCA2 mutations were correctly clinically annotated by other tools. CAVA is a freely available tool that provides rapid, robust, high-throughput clinical annotation of NGS data, using a standardized clinical sequencing nomenclature.
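
The left/right alignment ambiguity behind these numbers can be sketched in a few lines: a deletion inside a repeat tract has several equivalent placements, and which one a tool reports depends on its normalization rule. The coordinates, sequences, and function names below are illustrative, not CAVA's actual code:

```python
def left_align_deletion(ref, pos, length):
    """Shift a deletion of ref[pos:pos+length] as far left as possible:
    keep moving while the base just before the deletion equals the last
    deleted base (the edited sequence is unchanged by each shift)."""
    while pos > 0 and ref[pos - 1] == ref[pos + length - 1]:
        pos -= 1
    return pos

def right_align_deletion(ref, pos, length):
    """Shift the same deletion as far right as possible."""
    while pos + length < len(ref) and ref[pos] == ref[pos + length]:
        pos += 1
    return pos

def apply_deletion(ref, pos, length):
    return ref[:pos] + ref[pos + length:]
```

All placements remove the same bases from the repeat, so the edited sequence is identical; only the reported representation differs. VCF convention left-aligns, while HGVS requires the most 3' position relative to the transcript, which is why forward-strand genes are the ones mis-annotated by left-aligning tools.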

Proceedings ArticleDOI
01 Jul 2015
TL;DR: DIWAN is presented, an annotation interface for Arabic dialectal texts that makes analyses from other variants available to the annotator, who then can choose to use them or not.
Abstract: This paper presents DIWAN, an annotation interface for Arabic dialectal texts. While the Arabic dialects differ in many respects from each other and from Modern Standard Arabic, they also have much in common. To facilitate annotation and to make it as efficient as possible, it is therefore not advisable to treat each Arabic dialect as a separate language, unrelated to the other variants of Arabic. Instead, we make analyses from other variants available to the annotator, who then can choose to use them or not.

Proceedings ArticleDOI
01 Jul 2015
TL;DR: This paper proposes a general framework to construct and analyze code-switching emotional posts in social media, and a multiple-classifier-based automatic detection approach to detect emotion in the code-switching corpus for evaluating the effectiveness of both Chinese and English texts.
Abstract: Previous research has focused on analyzing emotion through monolingual text, when in fact bilingual or code-switching posts are also common in social media. Despite the important implications of code-switching for emotion analysis, existing automatic emotion extraction methods fail to accommodate code-switching content. In this paper, we propose a general framework to construct and analyze code-switching emotional posts in social media. We first propose an annotation scheme to identify the emotions associated with the languages expressing them in a Chinese-English code-switching corpus. We then make some observations and generate statistics from the corpus to analyze the linguistic phenomena of code-switching texts in social media. Finally, we propose a multiple-classifier-based automatic detection approach to detect emotion in the code-switching corpus for evaluating the effectiveness of both Chinese and English texts.
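
A minimal sketch of the multiple-classifier idea: split a post by script, run one detector per language, and combine the votes. The lexicons and function names here are toy stand-ins for the paper's trained classifiers:

```python
import re

# Matches runs of CJK Unified Ideographs (the Chinese portion of a post)
CJK = re.compile(r"[\u4e00-\u9fff]+")

def split_by_language(post):
    """Separate a code-switching post into its Chinese and English parts."""
    zh = "".join(CJK.findall(post))
    en = CJK.sub(" ", post)
    return zh, en

# Toy per-language emotion lexicons, standing in for trained classifiers
ZH_HAPPY = {"开心", "高兴"}
EN_HAPPY = {"happy", "great"}

def detect_happiness(post):
    zh, en = split_by_language(post)
    zh_vote = any(word in zh for word in ZH_HAPPY)
    en_tokens = re.findall(r"[a-z]+", en.lower())
    en_vote = any(word in en_tokens for word in EN_HAPPY)
    return zh_vote or en_vote  # combine the two classifiers' decisions
```

The OR-combination means either language's classifier can surface an emotion the other misses, which is the point of handling code-switching explicitly rather than running a single monolingual model.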

Patent
06 May 2015
TL;DR: In this paper, the authors present a method which, in one example embodiment, can include reading text data corresponding to messages, creating semantic annotations to the text data to generate one or more annotated messages, and aggregating the annotated messages and storing information associated with the aggregated annotations in a message store.
Abstract: In one aspect, the present disclosure relates to a method which, in one example embodiment, can include reading text data corresponding to messages, creating semantic annotations to the text data to generate one or more annotated messages, and aggregating the annotated messages and storing information associated with the aggregated annotated messages in a message store. The method can further include performing, based on information from the message store and associated with the one or more messages, one or more global analytics functions that include: identifying an annotation error in the semantic annotations created using a trained statistical language model, updating the respective semantic annotation to correct the annotation error, and back-propagating corrected data corresponding to the updated semantic annotation into training data for further language model training.
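
The correct-and-back-propagate loop in the claim can be sketched as follows. The dictionary-based "model" and the helper names are hypothetical simplifications, not the patented implementation:

```python
def annotate(text, model):
    """Toy annotator: label each token with the model's tag, or 'O' (outside)."""
    return [(token, model.get(token, "O")) for token in text.split()]

def correct_and_backpropagate(annotated, corrections, training_data, model):
    """Apply reviewer corrections and feed each corrected datum back into
    the training set (and, naively, into the model itself)."""
    fixed = []
    for token, label in annotated:
        if token in corrections:
            label = corrections[token]
            training_data.append((token, label))  # back-propagate into training data
            model[token] = label                   # naive incremental update
        fixed.append((token, label))
    return fixed
```

The key structural point is the feedback edge: corrections do not just patch the current output, they re-enter the training data so later annotation passes improve.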

Journal ArticleDOI
TL;DR: A software application for uncovering functional themes in sets of genes based on their annotations to bio-ontologies, such as the gene ontology and the mammalian phenotype ontology, which is tightly integrated with functional and biological details about mouse genes in the Mouse Genome Informatics database.
Abstract: Experiments that employ genome scale technology platforms frequently result in lists of tens to thousands of genes with potential significance to a specific biological process or disease. Searching for biologically relevant connections among the genes or gene products in these lists is a common data analysis task. We have implemented a software application for uncovering functional themes in sets of genes based on their annotations to bio-ontologies, such as the gene ontology and the mammalian phenotype ontology. The application, called VisuaL Annotation Display (VLAD), performs a statistical analysis to test for the enrichment of ontology terms in a set of genes submitted by a researcher. The results for each analysis using VLAD include a table of ontology terms, sorted in decreasing order of significance. Each row contains the term, statistics such as the number of annotated terms, the p value, etc., and the symbols of annotated genes. An accompanying graphical display shows portions of the ontology hierarchy, where node sizes are scaled based on p values. Although numerous ontology term enrichment programs already exist, VLAD is unique in that it allows users to upload their own annotation files and ontologies for customized term enrichment analyses, supports the analysis of multiple gene sets at once, provides interfaces to customize graphical output, and is tightly integrated with functional and biological details about mouse genes in the Mouse Genome Informatics (MGI) database. VLAD is available as a web-based application from the MGI web site (http://proto.informatics.jax.org/prototypes/vlad/).
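
Term-enrichment tools of this kind conventionally score each ontology term with a hypergeometric upper-tail probability; the abstract does not name VLAD's exact statistic, so treat this as the standard sketch rather than VLAD's code:

```python
from math import comb

def enrichment_p(M, K, n, k):
    """Hypergeometric upper tail P(X >= k): the probability of seeing at
    least k term-annotated genes in a submitted list of n, when K of the
    M genes in the annotation universe carry the term."""
    total = comb(M, n)
    return sum(comb(K, i) * comb(M - K, n - i)
               for i in range(k, min(n, K) + 1)) / total
```

For example, 3 hits out of a 5-gene list, against a term annotating 5 of 20 genes, gives p ≈ 0.073, which would rank well below a Bonferroni-style significance threshold once many terms are tested.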

Journal ArticleDOI
02 Jun 2015-Life
TL;DR: A manual curation effort is described that attempts to produce high-quality genome annotations for a set of haloarchaeal genomes (Halobacterium salinarum and Hbt. hubeiense, Haloferax volcanii and Hfx. mediterranei).
Abstract: Genome annotation errors are a persistent problem that impede research in the biosciences. A manual curation effort is described that attempts to produce high-quality genome annotations for a set of haloarchaeal genomes (Halobacterium salinarum and Hbt. hubeiense, Haloferax volcanii and Hfx. mediterranei, Natronomonas pharaonis and Nmn. moolapensis, Haloquadratum walsbyi strains HBSQ001 and C23, Natrialba magadii, Haloarcula marismortui and Har. hispanica, and Halohasta litchfieldiae). Genomes are checked for missing genes, start codon misassignments, and disrupted genes. Assignments of a specific function are preferably based on experimentally characterized homologs (Gold Standard Proteins). To avoid overannotation, which is a major source of database errors, we restrict annotation to only general function assignments when support for a specific substrate assignment is insufficient. This strategy results in annotations that are resistant to the plethora of errors that compromise public databases. Annotation consistency is rigorously validated for ortholog pairs from the genomes surveyed. The annotation is regularly cross-checked against the UniProt database to further improve annotations and increase the level of standardization. Enhanced genome annotations are submitted to public databases (EMBL/GenBank, UniProt), to the benefit of the scientific community. The enhanced annotations are also publicly available via HaloLex.
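
One of the curation checks above, flagging start codon misassignments, can be sketched as a simple scan. This is a forward-strand toy model with hypothetical names, not the HaloLex pipeline:

```python
# Common prokaryotic/archaeal start codons (ATG plus the alternatives GTG, TTG)
START_CODONS = {"ATG", "GTG", "TTG"}

def flag_start_misassignments(genome, cds_starts):
    """Return ids of genes whose annotated CDS does not open with a start
    codon. cds_starts maps gene id -> 0-based forward-strand start position;
    a real check would also handle the reverse strand and contig boundaries."""
    return [gene for gene, pos in cds_starts.items()
            if genome[pos:pos + 3] not in START_CODONS]
```

Flagged genes are then candidates for manual re-examination, e.g. moving the start to the nearest in-frame upstream or downstream start codon supported by homologs.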

Proceedings ArticleDOI
07 Jun 2015
TL;DR: An unsupervised feature-independent quantification of the context of the image through tensor decomposition is presented, which incorporates the estimated context as prior knowledge in the process of automatic image annotation.
Abstract: Automatic image annotation is a highly valuable tool for image search, retrieval and archival systems. In the absence of an annotation tool, such systems have to rely on either users' input or large amount of text on the webpage of the image, to acquire its textual description. Users may provide insufficient/noisy tags and all the text on the webpage may not be a description or an explanation of the accompanying image. Therefore, it is of extreme importance to develop efficient tools for automatic annotation of images with correct and sufficient tags. The context of the image plays a significant role in this process, along with the content of the image. A suitable quantification of the context of the image may reduce the semantic gap between visual features and appropriate textual description of the image. In this paper, we present an unsupervised feature-independent quantification of the context of the image through tensor decomposition. We incorporate the estimated context as prior knowledge in the process of automatic image annotation. Evaluation of the predicted annotations provides evidence of the effectiveness of our feature-independent context estimation method.
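
The abstract does not specify which tensor decomposition is used; as an illustration only, a Tucker-style context factorization via truncated higher-order SVD might look like the following (all names here are generic, not the authors' implementation):

```python
import numpy as np

def unfold(t, mode):
    """Matricize tensor t along the given mode (mode-n unfolding)."""
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)

def mode_dot(t, mat, mode):
    """Multiply matrix mat along the given mode of tensor t."""
    res = np.tensordot(mat, t, axes=(1, mode))
    return np.moveaxis(res, 0, mode)

def hosvd(t, ranks):
    """Truncated higher-order SVD: per-mode factor matrices plus a core
    tensor; the factors give low-dimensional descriptors that could serve
    as an estimated 'context' prior."""
    factors = [np.linalg.svd(unfold(t, m), full_matrices=False)[0][:, :r]
               for m, r in enumerate(ranks)]
    core = t
    for m, u in enumerate(factors):
        core = mode_dot(core, u.T, m)
    return core, factors

def reconstruct(core, factors):
    t = core
    for m, u in enumerate(factors):
        t = mode_dot(t, u, m)
    return t
```

With full ranks the reconstruction is exact; truncating the ranks yields the compact, feature-independent summary that a context-estimation step would feed into the annotation model as a prior.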