scispace - formally typeset

Showing papers on "Annotation published in 2019"


Journal ArticleDOI
TL;DR: The results of two case studies show that CPGAVAS2 annotates better than several other servers and will likely become an indispensable tool for plastome research.
Abstract: We previously developed a web server CPGAVAS for annotation, visualization and GenBank submission of plastome sequences. Here, we upgrade the server into CPGAVAS2 to address the following challenges: (i) inaccurate annotation in the reference sequence likely causing the propagation of errors; (ii) difficulty in the annotation of small exons of genes petB, petD and rps16 and trans-splicing gene rps12; (iii) lack of annotation for other genome features and their visualization, such as repeat elements; and (iv) lack of modules for diversity analysis of plastomes. In particular, CPGAVAS2 provides two reference datasets for plastome annotation. The first dataset contains 43 plastomes whose annotations have been validated or corrected by RNA-seq data. The second one contains 2544 plastomes curated with sequence alignment. Two new algorithms are also implemented to correctly annotate small exons and trans-splicing genes. Tandem and dispersed repeats are identified, whose results are displayed on a circular map together with the annotated genes. DNA-seq and RNA-seq data can be uploaded for identification of single-nucleotide polymorphism sites and RNA-editing sites. The results of two case studies show that CPGAVAS2 annotates better than several other servers. CPGAVAS2 will likely become an indispensable tool for plastome research and can be accessed from http://www.herbalgenomics.org/cpgavas2.

467 citations


Journal ArticleDOI
TL;DR: A redesigned and significantly enhanced MapMan4 framework is presented, together with a revised version of the associated online Mercator annotation tool, providing protein annotations for all embryophytes with comparably high quality.

276 citations


Journal ArticleDOI
TL;DR: While the genome sequencing revolution has led to the sequencing and assembly of many thousands of new genomes, genome annotation still uses very nearly the same technology that has been used for the past two decades.
Abstract: While the genome sequencing revolution has led to the sequencing and assembly of many thousands of new genomes, genome annotation still uses very nearly the same technology that we have used for the past two decades. The sheer number of genomes necessitates the use of fully automated procedures for annotation, but errors in annotation are just as prevalent as they were in the past, if not more so. How are we to solve this growing problem?

196 citations


Journal ArticleDOI
TL;DR: COCO-CN as mentioned in this paper is a dataset that enriches MS-COCO with manually written Chinese sentences and tags, providing a unified and challenging platform for cross-lingual image tagging, captioning, and retrieval.
Abstract: This paper contributes to cross-lingual image annotation and retrieval in terms of data and baseline methods. We propose COCO-CN , a novel dataset enriching MS-COCO with manually written Chinese sentences and tags. For effective annotation acquisition, we develop a recommendation-assisted collective annotation system, automatically providing an annotator with several tags and sentences deemed to be relevant with respect to the pictorial content. Having 20 342 images annotated with 27 218 Chinese sentences and 70 993 tags, COCO-CN is currently the largest Chinese–English dataset that provides a unified and challenging platform for cross-lingual image tagging, captioning, and retrieval. We develop conceptually simple yet effective methods per task for learning from cross-lingual resources. Extensive experiments on the three tasks justify the viability of the proposed dataset and methods. Data and code are publicly available at https://github.com/li-xirong/coco-cn .

121 citations


Journal ArticleDOI
TL;DR: The FAANG (Functional Annotation of Animal Genomes) Consortium is producing genome-wide data sets on RNA expression, DNA methylation, and chromatin modification, as well as chromatin accessibility and interactions that will improve the precision and sensitivity of genomic selection for animal improvement.
Abstract: Functional annotation of genomes is a prerequisite for contemporary basic and applied genomic research, yet farmed animal genomics is deficient in such annotation. To address this, the FAANG (Functional Annotation of Animal Genomes) Consortium is producing genome-wide data sets on RNA expression, DNA methylation, and chromatin modification, as well as chromatin accessibility and interactions. In addition to informing our understanding of genome function, including comparative approaches to elucidate constrained sequence or epigenetic elements, these annotation maps will improve the precision and sensitivity of genomic selection for animal improvement. A scientific community-driven effort has already created a coordinated data collection and analysis enterprise crucial for the success of this global effort. Although it is early in this continuing process, functional data have already been produced and application to genetic improvement reported. The functional annotation delivered by the FAANG initiative will add value and utility to the greatly improved genome sequences being established for domesticated animal species.

111 citations


Journal ArticleDOI
TL;DR: MetaErg performs a comprehensive annotation of each gene, including taxonomic classification, enabling functional inferences despite low similarity to reference genes, as well as detection of potential assembly or binning artifacts.
Abstract: Here, we describe MetaErg, a standalone and fully automated metagenome and metaproteome annotation pipeline. Annotation of metagenomes is challenging. First, metagenomes contain sequence data of many organisms from all domains of life. Second, many of these are from understudied lineages, encoding genes with low similarity to experimentally validated reference genes. Third, assembly and binning are not perfect, sometimes resulting in artifactual hybrid contigs or genomes. To address these challenges, MetaErg provides graphical summaries of annotation outcomes, both for the complete metagenome and for individual metagenome-assembled genomes (MAGs). It performs a comprehensive annotation of each gene, including taxonomic classification, enabling functional inferences despite low similarity to reference genes, as well as detection of potential assembly or binning artifacts. When provided with metaproteome information, it visualizes gene and pathway activity using sequencing coverage and proteomic spectral counts, respectively. For visualization, MetaErg provides an HTML interface, bringing all annotation results together, and producing sortable and searchable tables, collapsible trees, and other graphic representations enabling intuitive navigation of complex data. MetaErg, implemented in Perl, HTML, and JavaScript, is a fully open source application, distributed under Academic Free License at https://github.com/xiaoli-dong/metaerg. MetaErg is also available as a docker image at https://hub.docker.com/r/xiaolidong/docker-metaerg.

85 citations


Posted ContentDOI
25 Sep 2019-bioRxiv
TL;DR: This work provides a solution in the form of an R/Bioconductor package tximeta that performs numerous annotation and metadata gathering tasks automatically on behalf of users during the import of transcript quantification files.
Abstract: Correct annotation metadata is critical for reproducible and accurate RNA-seq analysis. When files are shared publicly or among collaborators with incorrect or missing annotation metadata, it becomes difficult or impossible to reproduce bioinformatic analyses from raw data. It also makes it more difficult to locate the transcriptomic features, such as transcripts or genes, in their proper genomic context, which is necessary for overlapping expression data with other datasets. We provide a solution in the form of an R/Bioconductor package tximeta that performs numerous annotation and metadata gathering tasks automatically on behalf of users during the import of transcript quantification files. The correct reference transcriptome is identified via a hashed checksum stored in the quantification output, and key transcript databases are downloaded and cached locally. The computational paradigm of automatically adding annotation metadata based on reference sequence checksums can greatly facilitate genomic workflows, by helping to reduce overhead during bioinformatic analyses, preventing costly bioinformatic mistakes, and promoting computational reproducibility. The tximeta package is available at https://bioconductor.org/packages/tximeta.
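The checksum mechanism described above can be illustrated with a short sketch. tximeta itself is an R/Bioconductor package, so this Python version is only a toy: the digest construction and the registry are assumptions made for the example, not the package's actual implementation.

```python
import hashlib

def transcriptome_digest(sequences):
    """Toy checksum over a set of named transcript sequences.

    Sorting makes the digest independent of input order, so the same
    reference transcriptome always maps to the same identifier.
    """
    h = hashlib.sha256()
    for name in sorted(sequences):
        h.update(name.encode())
        h.update(b"\x00")  # separator to avoid name/sequence ambiguity
        h.update(sequences[name].encode())
        h.update(b"\x00")
    return h.hexdigest()

# Hypothetical local registry mapping digests to annotation metadata
REGISTRY = {}

def register_reference(sequences, metadata):
    REGISTRY[transcriptome_digest(sequences)] = metadata

def identify_reference(sequences):
    # Same sequences -> same digest -> correct metadata, no manual bookkeeping
    return REGISTRY.get(transcriptome_digest(sequences), "unknown reference")
```

In the real package the checksum is stored in the quantification output and matched against known transcriptome releases, so annotation metadata travels with the data automatically.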

81 citations


Posted ContentDOI
29 Jan 2019-bioRxiv
TL;DR: It is demonstrated that scVI and scANVI represent the integrated datasets with a single generative model that can be directly used for any probabilistic decision making task, using differential expression as a case study.
Abstract: As single-cell transcriptomics becomes a mainstream technology, the natural next step is to integrate the accumulating data in order to achieve a common ontology of cell types and states. However, owing to various nuisance factors of variation, it is not straightforward how to compare gene expression levels across data sets and how to automatically assign cell type labels in a new data set based on existing annotations. In this manuscript, we demonstrate that our previously developed method, scVI, provides an effective and fully probabilistic approach for joint representation and analysis of cohorts of single-cell RNA-seq data sets, while accounting for uncertainty caused by biological and measurement noise. We also introduce single-cell ANnotation using Variational Inference (scANVI), a semi-supervised variant of scVI designed to leverage any available cell state annotations; for instance, when only one data set in a cohort is annotated, or when only a few cells in a single data set can be labeled using marker genes. We demonstrate that scVI and scANVI compare favorably to the existing methods for data integration and cell state annotation in terms of accuracy, scalability, and adaptability to challenging settings such as a hierarchical structure of cell state labels. We further show that, unlike existing methods, scVI and scANVI represent the integrated datasets with a single generative model that can be directly used for any probabilistic decision making task, using differential expression as our case study. scVI and scANVI are available as open source software and can be readily used to facilitate cell state annotation and help ensure consistency and reproducibility across studies.

76 citations


Proceedings ArticleDOI
13 May 2019
TL;DR: This work proposes an effective framework based on reinforcement learning that explicitly encourages the code annotation model to generate annotations that can be used for the retrieval task, and shows that code annotations generated by this framework are much more detailed and more useful for code retrieval, and they can further improve the performance of existing code retrieval models significantly.
Abstract: To accelerate software development, much research has been performed to help people understand and reuse the huge amount of available code resources. Two important tasks have been widely studied: code retrieval, which aims to retrieve code snippets relevant to a given natural language query from a code base, and code annotation, where the goal is to annotate a code snippet with a natural language description. Despite their advancement in recent years, the two tasks are mostly explored separately. In this work, we investigate a novel perspective of Code annotation for Code retrieval (hence called “CoaCor”), where a code annotation model is trained to generate a natural language annotation that can represent the semantic meaning of a given code snippet and can be leveraged by a code retrieval model to better distinguish relevant code snippets from others. To this end, we propose an effective framework based on reinforcement learning, which explicitly encourages the code annotation model to generate annotations that can be used for the retrieval task. Through extensive experiments, we show that code annotations generated by our framework are much more detailed and more useful for code retrieval, and they can further improve the performance of existing code retrieval models significantly.

76 citations


Journal ArticleDOI
TL;DR: In this article, a method, OPA2Vec, is proposed to generate vector representations of biological entities in ontologies by combining formal ontology axioms and annotation axioms from the ontology meta-data.
Abstract: Motivation Ontologies are widely used in biology for data annotation, integration and analysis. In addition to formally structured axioms, ontologies contain meta-data in the form of annotation axioms which provide valuable pieces of information that characterize ontology classes. Annotation axioms commonly used in ontologies include class labels, descriptions or synonyms. Despite being a rich source of semantic information, the ontology meta-data are generally unexploited by ontology-based analysis methods such as semantic similarity measures. Results We propose a novel method, OPA2Vec, to generate vector representations of biological entities in ontologies by combining formal ontology axioms and annotation axioms from the ontology meta-data. We apply a Word2Vec model that has been pre-trained on a corpus of either abstracts or full-text articles to produce feature vectors from our collected data. We validate our method in two different ways: first, we use the obtained vector representations of proteins in a similarity measure to predict protein-protein interaction on two different datasets. Second, we evaluate our method on predicting gene-disease associations based on phenotype similarity by generating vector representations of genes and diseases using a phenotype ontology, and applying the obtained vectors to predict gene-disease associations using mouse model phenotypes. We demonstrate that OPA2Vec significantly outperforms existing methods for predicting gene-disease associations. Using evidence from mouse models, we apply OPA2Vec to identify candidate genes for several thousand rare and orphan diseases. OPA2Vec can be used to produce vector representations of any biomedical entity given any type of biomedical ontology. Availability and implementation https://github.com/bio-ontology-research-group/opa2vec. Supplementary information Supplementary data are available at Bioinformatics online.

72 citations


Journal ArticleDOI
TL;DR: RIL-Contour and the AID methodology accelerate dataset annotation and model development by facilitating rapid collaboration between analysts, radiologists, and engineers.
Abstract: Deep-learning algorithms typically fall within the domain of supervised artificial intelligence and are designed to “learn” from annotated data. Deep-learning models require large, diverse training datasets for optimal model convergence. The effort to curate these datasets is widely regarded as a barrier to the development of deep-learning systems. We developed RIL-Contour to accelerate medical image annotation for and with deep-learning. A major goal driving the development of the software was to create an environment which enables clinically oriented users to utilize deep-learning models to rapidly annotate medical imaging. RIL-Contour supports using fully automated deep-learning methods, semi-automated methods, and manual methods to annotate medical imaging with voxel and/or text annotations. To reduce annotation error, RIL-Contour promotes the standardization of image annotations across a dataset. RIL-Contour accelerates medical imaging annotation through the process of annotation by iterative deep learning (AID). The underlying concept of AID is to iteratively annotate, train, and utilize deep-learning models during the process of dataset annotation and model development. To enable this, RIL-Contour supports workflows in which multiple-image analysts annotate medical images, radiologists approve the annotations, and data scientists utilize these annotations to train deep-learning models. To automate the feedback loop between data scientists and image analysts, RIL-Contour provides mechanisms to enable data scientists to push newly trained deep-learning models to other users of the software. RIL-Contour and the AID methodology accelerate dataset annotation and model development by facilitating rapid collaboration between analysts, radiologists, and engineers.

Journal ArticleDOI
TL;DR: An end-to-end automatic image annotation method based on a deep convolutional neural network (CNN) and multi-label data augmentation and Wasserstein generative adversarial networks (ML-WGAN) is presented.
Abstract: Automatic image annotation is a key step in image retrieval and image understanding. In this paper, we present an end-to-end automatic image annotation method based on a deep convolutional neural network (CNN) and multi-label data augmentation. Different from traditional annotation models that usually perform feature extraction and annotation as two independent tasks, we propose an end-to-end automatic image annotation model based on deep CNN (E2E-DCNN). E2E-DCNN transforms the image annotation problem into a multi-label learning problem. It uses a deep CNN structure to carry out the adaptive feature learning before constructing the end-to-end annotation structure using multiple cross-entropy loss functions for training. It is difficult to train a deep CNN model using small-scale datasets or scale up multi-label datasets using traditional data augmentation methods; hence, we propose a multi-label data augmentation method based on Wasserstein generative adversarial networks (ML-WGAN). The ML-WGAN generator can approximate the data distribution of a single multi-label image. The images generated by ML-WGAN can assist in the reduction of the over-fitting problem of training a deep CNN model and enhance the generalization ability of the trained CNN model. We optimize the network structure by using deformable convolution and spatial pyramid pooling. We evaluate the proposed E2E-DCNN model, with data augmentation by the proposed ML-WGAN, on several public datasets. The experimental results demonstrate that the proposed model outperforms the state-of-the-art automatic image annotation models.

Journal ArticleDOI
TL;DR: It is shown that integrating the taxonomic position of the biological source of the analyzed samples and candidate structures enhances confidence in metabolite annotation, and the importance of considering taxonomic information in the process of specialized metabolites annotation is demonstrated.
Abstract: Mass spectrometry (MS) offers unrivalled sensitivity for the metabolite profiling of complex biological matrices encountered in natural products (NP) research. The massive and complex sets of spectral data generated by such platforms require computational approaches for their interpretation. Within such approaches, computational metabolite annotation automatically links spectral data to candidate structures via a score, which is usually established between the acquired data and experimental or theoretical spectral databases (DB). This process leads to various candidate structures for each MS feature. However, at this stage, obtaining a high annotation confidence level remains a challenge, notably due to the extensive chemodiversity of specialized metabolomes. The design of a metascore is a way to capture complementary experimental attributes and improve the annotation process. Here, we show that integrating the taxonomic position of the biological source of the analyzed samples and candidate structures enhances confidence in metabolite annotation. A script is proposed to automatically input such information at various granularity levels (species, genus, and family) and complement the score obtained between experimental spectral data and output of available computational metabolite annotation tools (ISDB-DNP, MS-Finder, Sirius). In all cases, the consideration of the taxonomic distance allowed an efficient re-ranking of the candidate structures, leading to a systematic enhancement of the recall and precision rates of the tools (1.5- to 7-fold increase in the F1 score). Our results clearly demonstrate the importance of considering taxonomic information in the process of specialized metabolite annotation. This requires access to structural data systematically documented with biological origin, both for new and previously reported NPs. In this respect, the establishment of an open structural DB of specialized metabolites and their associated metadata, particularly biological sources, is timely and critical for the NP research community.
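The taxonomically informed re-ranking can be sketched as follows. This is an illustrative toy, not the authors' actual script or metascore: the bonus weights and the (structure_id, spectral_score, taxonomy) representation are assumptions made for the example.

```python
def taxonomic_bonus(sample_tax, candidate_tax):
    # Hypothetical weights: a match at a finer taxonomic level earns more.
    # Real metascores would calibrate these against the spectral score scale.
    for level, weight in (("species", 3.0), ("genus", 2.0), ("family", 1.0)):
        if sample_tax.get(level) and sample_tax.get(level) == candidate_tax.get(level):
            return weight
    return 0.0

def rerank(candidates, sample_tax):
    """candidates: list of (structure_id, spectral_score, taxonomy_dict).

    Returns the candidates sorted by spectral score plus taxonomic bonus,
    so structures consistent with the sample's biological source rise."""
    return sorted(
        candidates,
        key=lambda c: c[1] + taxonomic_bonus(sample_tax, c[2]),
        reverse=True,
    )
```

A candidate with a slightly lower spectral score but a species-level taxonomic match will then outrank a spectrally better candidate reported only from an unrelated family.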

Proceedings ArticleDOI
01 Aug 2019
TL;DR: The first version of the Open Source International Arabic News (OSIAN) corpus is presented, which consists of about 3.5 million articles comprising more than 37 million sentences and roughly 1 billion tokens and is encoded in XML.
Abstract: The World Wide Web has become a fundamental resource for building large text corpora. Broadcasting platforms such as news websites are rich sources of data regarding diverse topics and form a valuable foundation for research. The Arabic language is extensively utilized on the Web. Still, Arabic is a relatively under-resourced language in terms of the availability of freely annotated corpora. This paper presents the first version of the Open Source International Arabic News (OSIAN) corpus. The corpus data was collected from international Arabic news websites, all being freely available on the Web. The corpus consists of about 3.5 million articles comprising more than 37 million sentences and roughly 1 billion tokens. It is encoded in XML; each article is annotated with metadata information. Moreover, each word is annotated with lemma and part-of-speech. The described corpus is processed, archived and published into the CLARIN infrastructure. This publication includes descriptive metadata via OAI-PMH, direct access to the plain text material (available under Creative Commons Attribution-Non-Commercial 4.0 International License - CC BY-NC 4.0), and integration into the WebLicht annotation platform and CLARIN’s Federated Content Search FCS.
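A minimal sketch of what such per-word XML annotation might look like, built with Python's standard library. The element and attribute names here are hypothetical, since the actual OSIAN schema is not reproduced in the abstract.

```python
import xml.etree.ElementTree as ET

# Hypothetical element/attribute names; the real OSIAN schema may differ.
article = ET.Element("article", id="a1", source="example-news-site")
meta = ET.SubElement(article, "metadata")
ET.SubElement(meta, "date").text = "2019-01-01"

sentence = ET.SubElement(article, "sentence", id="s1")
# One <word> element per token, carrying lemma and part-of-speech attributes
word = ET.SubElement(sentence, "word", lemma="كتب", pos="VERB")
word.text = "كتب"

xml_text = ET.tostring(article, encoding="unicode")
```

The point is simply that article-level metadata and token-level lemma/POS annotation can live in one nested XML document, which is what makes such a corpus directly usable by annotation platforms.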

Journal ArticleDOI
TL;DR: Examining the literature, this survey presents a uniform summary of the degree of automation of the many semantic annotation tools previously investigated, addresses the inconsistency in the terminology used in the literature, and provides more consistent terminology.
Abstract: Semantic annotation is a crucial part of achieving the vision of the Semantic Web and has long been a research topic among various communities. The most challenging problem in reaching the Semantic Web’s real potential is the gap between a large amount of unlabeled existing/new data and the limited annotation capability available. To resolve this problem, numerous works have been carried out to increase the degree of automation of semantic annotation from manual to semi-automatic to fully automatic. The richness of these works has been well-investigated by numerous surveys focusing on different aspects of the problem. However, a comprehensive survey targeting unsupervised approaches for semantic annotation is still missing and is urgently needed. To better understand the state-of-the-art of semantic annotation in the textual domain adopting unsupervised approaches, this article investigates existing literature and presents a survey to answer three research questions: (1) To what extent can semantic annotation be performed in a fully automatic manner by using an unsupervised way? (2) What kind of unsupervised approaches for semantic annotation already exist in literature? (3) What characteristics and relationships do these approaches have? In contrast to existing surveys, this article helps the reader get an insight into the state-of-the-art of semantic annotation using unsupervised approaches. While examining the literature, this article also addresses the inconsistency in the terminology used in the literature to describe the various semantic annotation tools’ degree of automation and provides more consistent terminology. Based on this, a uniform summary of the degree of automation of the many semantic annotation tools that were previously investigated can now be presented.

Book ChapterDOI
TL;DR: The Genome Sequence Annotation Server (GenSAS, https://www.gensas.org ) is a secure, web-based genome annotation platform for structural and functional annotation, as well as manual curation.
Abstract: The Genome Sequence Annotation Server (GenSAS, https://www.gensas.org ) is a secure, web-based genome annotation platform for structural and functional annotation, as well as manual curation. Requiring no installation by users, GenSAS integrates popular command line-based, annotation tools under a single, easy-to-use, online interface. GenSAS integrates JBrowse and Apollo, so users can view annotation data and manually curate gene models. Users are guided step by step through the annotation process by embedded instructions and a more in-depth GenSAS User's Guide. In addition to a genome assembly file, users can also upload organism-specific transcript, protein, and RNA-seq read evidence for use in the annotation process. The latest versions of the NCBI RefSeq transcript and protein databases and the SwissProt and TrEMBL protein databases are provided for all users. GenSAS projects can be shared with other GenSAS users enabling collaborative annotation. Once annotation is complete, GenSAS generates the final files of the annotated gene models in common file formats for use with other annotation tools, submission to a repository, and use in publications.

Journal ArticleDOI
TL;DR: The Continuously Annotated Signals of Emotion (CASE) dataset as discussed by the authors provides a solution as it focusses on real-time continuous annotation of emotions, as experienced by the participants, while watching various videos.
Abstract: From a computational viewpoint, emotions continue to be intriguingly hard to understand. In research, a direct and real-time inspection in realistic settings is not possible. Discrete, indirect, post-hoc recordings are therefore the norm. As a result, proper emotion assessment remains a problematic issue. The Continuously Annotated Signals of Emotion (CASE) dataset provides a solution as it focusses on real-time continuous annotation of emotions, as experienced by the participants, while watching various videos. For this purpose, a novel, intuitive joystick-based annotation interface was developed, which allowed for simultaneous reporting of valence and arousal, dimensions that are often annotated independently. In parallel, eight high-quality, synchronized physiological recordings (1000 Hz, 16-bit ADC) were obtained from ECG, BVP, EMG (3x), GSR (or EDA), respiration and skin temperature sensors. The dataset consists of the physiological and annotation data from 30 participants, 15 male and 15 female, who watched several validated video-stimuli. The validity of the emotion induction, as exemplified by the annotation and physiological data, is also presented. Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.9891446

Journal ArticleDOI
TL;DR: The landscape of current annotation practices among the COmputational Modeling in BIology NEtwork community is reported and a set of recommendations for building a consensus approach to semantic annotation are provided.
Abstract: Life science researchers use computational models to articulate and test hypotheses about the behavior of biological systems. Semantic annotation is a critical component for enhancing the interoperability and reusability of such models as well as for the integration of the data needed for model parameterization and validation. Encoded as machine-readable links to knowledge resource terms, semantic annotations describe the computational or biological meaning of what models and data represent. These annotations help researchers find and repurpose models, accelerate model composition and enable knowledge integration across model repositories and experimental data stores. However, realizing the potential benefits of semantic annotation requires the development of model annotation standards that adhere to a community-based annotation protocol. Without such standards, tool developers must account for a variety of annotation formats and approaches, a situation that can become prohibitively cumbersome and which can defeat the purpose of linking model elements to controlled knowledge resource terms. Currently, no consensus protocol for semantic annotation exists among the larger biological modeling community. Here, we report on the landscape of current annotation practices among the COmputational Modeling in BIology NEtwork community and provide a set of recommendations for building a consensus approach to semantic annotation.

Journal ArticleDOI
17 Jul 2019
TL;DR: The proposed representative annotation (RA) scheme uses unsupervised networks for feature extraction and selects representative image patches for annotation in the latent space of learned feature descriptors, which implicitly characterizes the underlying data while minimizing redundancy.
Abstract: Deep learning has been applied successfully to many biomedical image segmentation tasks. However, due to the diversity and complexity of biomedical image data, manual annotation for training common deep learning models is very time-consuming and labor-intensive, especially because normally only biomedical experts can annotate image data well. Human experts are often involved in a long and iterative process of annotation, as in active learning type annotation schemes. In this paper, we propose representative annotation (RA), a new deep learning framework for reducing annotation effort in biomedical image segmentation. RA uses unsupervised networks for feature extraction and selects representative image patches for annotation in the latent space of learned feature descriptors, which implicitly characterizes the underlying data while minimizing redundancy. A fully convolutional network (FCN) is then trained using the annotated selected image patches for image segmentation. Our RA scheme offers three compelling advantages: (1) It leverages the ability of deep neural networks to learn better representations of image data; (2) it performs one-shot selection for manual annotation and frees annotators from the iterative process of common active learning based annotation schemes; (3) it can be deployed to 3D images with simple extensions. We evaluate our RA approach using three datasets (two 2D and one 3D) and show that our framework yields competitive segmentation results compared with state-of-the-art methods.
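One-shot selection of representative, low-redundancy items in a latent space can be illustrated with farthest-point sampling over toy 2-D feature vectors. This is a simple stand-in, not the paper's actual selection criterion, which operates on learned feature descriptors.

```python
import math

def select_representatives(features, k):
    """Greedily pick k feature vectors that are maximally spread out,
    so the selected set covers the latent space with little redundancy."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    chosen = [0]  # arbitrary seed: start from the first item
    while len(chosen) < k:
        # Pick the point farthest from everything chosen so far
        best, best_d = None, -1.0
        for i in range(len(features)):
            if i in chosen:
                continue
            d = min(dist(features[i], features[j]) for j in chosen)
            if d > best_d:
                best, best_d = i, d
        chosen.append(best)
    return chosen
```

Annotating only the selected indices, then training a segmentation network on them, mirrors the one-shot (non-iterative) selection advantage the abstract describes.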

Journal ArticleDOI
TL;DR: It is shown that LoReAn outperforms popular annotation pipelines by integrating single-molecule cDNA-sequencing data generated from either the Pacific Biosciences or MinION sequencing platforms, correctly predicting gene structure, and capturing genes missed by other annotation pipelines.
Abstract: Single-molecule full-length complementary DNA (cDNA) sequencing can aid genome annotation by revealing transcript structure and alternative splice forms, yet current annotation pipelines do not incorporate such information. Here we present long-read annotation (LoReAn) software, an automated annotation pipeline utilizing short- and long-read cDNA sequencing, protein evidence, and ab initio prediction to generate accurate genome annotations. Based on annotations of two fungal genomes (Verticillium dahliae and Plicaturopsis crispa) and two plant genomes (Arabidopsis [Arabidopsis thaliana] and Oryza sativa), we show that LoReAn outperforms popular annotation pipelines by integrating single-molecule cDNA-sequencing data generated from either the Pacific Biosciences or MinION sequencing platforms, correctly predicting gene structure, and capturing genes missed by other annotation pipelines.

Proceedings ArticleDOI
01 Aug 2019
TL;DR: A robust English corpus and annotation schema is presented that allows us to explore the less straightforward examples of term-definition structures in free and semi-structured text.
Abstract: Definition extraction has been a popular topic in NLP research for well over a decade, but has been historically limited to well-defined, structured, and narrow conditions. In reality, natural language is messy, and messy data requires both complex solutions and data that reflects that reality. In this paper, we present a robust English corpus and annotation schema that allows us to explore the less straightforward examples of term-definition structures in free and semi-structured text.

01 Jan 2019
TL;DR: This paper presents the results of the HAHA task at IberLEF 2019, the second edition of the challenge on automatic humor recognition and analysis in Spanish, and presents a summary of their systems and the general results for both subtasks.
Abstract: This paper presents the results of the HAHA task at IberLEF 2019, the second edition of the challenge on automatic humor recognition and analysis in Spanish. The challenge consists of two subtasks related to humor in language: automatic detection and automatic rating of humor in Spanish tweets. This year we used a corpus of 30,000 annotated Spanish tweets labeled as humorous or non-humorous, with a funniness score assigned to the humorous ones. A total of 18 participants submitted their systems, obtaining good results overall; we present a summary of their systems and the general results for both subtasks.

Proceedings ArticleDOI
02 May 2019
TL;DR: This paper describes the data collection on the Zooniverse citizen science platform, comparing the efficiencies of different audio annotation strategies, and compared multiple-pass binary annotation, single-pass multi-label annotation, and a hybrid approach: hierarchical multi- pass multi- label annotation.
Abstract: Annotating rich audio data is an essential aspect of training and evaluating machine listening systems. We approach this task in the context of temporally-complex urban soundscapes, which require multiple labels to identify overlapping sound sources. Typically this work is crowdsourced, and previous studies have shown that workers can quickly label audio with binary annotation for single classes. However, this approach can be difficult to scale when multiple passes with different focus classes are required to annotate data with multiple labels. In citizen science, where tasks are often image-based, annotation efforts typically label multiple classes simultaneously in a single pass. This paper describes our data collection on the Zooniverse citizen science platform, comparing the efficiencies of different audio annotation strategies. We compared multiple-pass binary annotation, single-pass multi-label annotation, and a hybrid approach: hierarchical multi-pass multi-label annotation. We discuss our findings, which support using multi-label annotation, with reference to volunteer citizen scientists' motivations.

Journal ArticleDOI
01 Jan 2019-Database
TL;DR: GenoSurf is implemented, a multi-ontology semantic search system providing access to a consolidated collection of metadata attributes found in the most relevant genomic datasets; values of 10 attributes are semantically enriched by making use of the most suited available ontologies.
Abstract: Many valuable resources developed by world-wide research institutions and consortia describe genomic datasets that are both open and available for secondary research, but their metadata search interfaces are heterogeneous, not interoperable and sometimes with very limited capabilities. We implemented GenoSurf, a multi-ontology semantic search system providing access to a consolidated collection of metadata attributes found in the most relevant genomic datasets; values of 10 attributes are semantically enriched by making use of the most suited available ontologies. The user of GenoSurf provides as input the search terms, sets the desired level of ontological enrichment and obtains as output the identity of matching data files at the various sources. Search is facilitated by drop-down lists of matching values; aggregate counts describing resulting files are updated in real time while the search terms are progressively added. In addition to the consolidated attributes, users can perform keyword-based searches on the original (raw) metadata, which are also imported; GenoSurf supports the interplay of attribute-based and keyword-based search through well-defined interfaces. Currently, GenoSurf integrates about 40 million metadata records from several major data sources, including three providers of clinical and experimental data (TCGA, ENCODE and Roadmap Epigenomics) and two sources of annotation data (GENCODE and RefSeq); it can be used as a standalone resource for targeting the genomic datasets at their original sources (identified with their accession IDs and URLs), or as part of an integrated query answering system for performing complex queries over genomic regions and metadata.

Journal ArticleDOI
TL;DR: All annotations are combined into a single, cell type-agnostic encyclopedia that catalogs all human regulatory elements, and a measure of the importance of each genomic position, the "conservation-associated activity score," is developed.
Abstract: Semi-automated genome annotation methods such as Segway take as input a set of genome-wide measurements, such as histone modification or DNA accessibility data, and output an annotation of genomic activity in the target cell type. Here we present annotations of 164 human cell types using 1615 data sets. To produce these annotations, we automated the label interpretation step to produce a fully automated annotation strategy. Using these annotations, we developed a measure of the importance of each genomic position called the "conservation-associated activity score." We further combined all annotations into a single, cell type-agnostic encyclopedia that catalogs all human regulatory elements.

Journal ArticleDOI
TL;DR: In this paper, the authors outline a workflow for efficient manual annotation driven by a team of primarily undergraduate annotators, which can be scaled to large teams and includes quality control processes through incremental evaluation.
Abstract: High quality gene models are necessary to expand the molecular and genetic tools available for a target organism, but these are available for only a handful of model organisms that have undergone extensive curation and experimental validation over the course of many years. The majority of gene models present in biological databases today have been identified in draft genome assemblies using automated annotation pipelines that are frequently based on orthologs from distantly related model organisms and usually have minor or major errors. Manual curation is time consuming and often requires substantial expertise, but is instrumental in improving gene model structure and identification. Manual annotation may seem to be a daunting and cost-prohibitive task for small research communities but involving undergraduates in community genome annotation consortiums can be mutually beneficial for both education and improved genomic resources. We outline a workflow for efficient manual annotation driven by a team of primarily undergraduate annotators. This model can be scaled to large teams and includes quality control processes through incremental evaluation. Moreover, it gives students an opportunity to increase their understanding of genome biology and to participate in scientific research in collaboration with peers and senior researchers at multiple institutions.

Journal ArticleDOI
25 Jan 2019-Sensors
TL;DR: A Semi-Supervised Active Learning (SSAL) approach based on Self-Training (ST) for Human Activity Recognition is introduced to partially automate the annotation process, reducing the annotation effort and the required volume of annotated data needed to obtain a high-performance classifier.
Abstract: Modern smartphones and wearables often contain multiple embedded sensors which generate significant amounts of data. This information can be used for body monitoring-based areas such as healthcare, indoor location, user-adaptive recommendations and transportation. The development of Human Activity Recognition (HAR) algorithms involves the collection of a large amount of labelled data which should be annotated by an expert. However, the data annotation process on large datasets is expensive, time-consuming and difficult to obtain. The development of a HAR approach which requires low annotation effort and still maintains adequate performance is a relevant challenge. We introduce a Semi-Supervised Active Learning (SSAL) approach based on Self-Training (ST) for Human Activity Recognition to partially automate the annotation process, reducing the annotation effort and the required volume of annotated data to obtain a high-performance classifier. Our approach uses a criterion to select the most relevant samples for annotation by the expert and propagate their label to the most confident samples. We present a comprehensive study comparing supervised and unsupervised methods with our approach on two datasets composed of daily living activities. The results showed that it is possible to reduce the required annotated data by more than 89% while still maintaining accurate model performance.
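The self-training loop described above can be sketched very simply; this is a minimal illustration in the spirit of SSAL, not the paper's actual classifier or selection criterion. A 1-nearest-neighbour "classifier" scores unlabelled points, confident predictions are promoted into the labelled pool, and low-confidence samples are left for the expert (the data, labels and threshold below are all made up for the example):

```python
# Minimal self-training sketch: confident predictions join the labelled
# pool; low-confidence samples are kept aside for expert annotation.

def predict_with_confidence(labeled, point):
    """Return (label, confidence) using distance to the nearest labelled point."""
    best_label, best_d = None, float("inf")
    for (x, y), label in labeled:
        d = (x - point[0]) ** 2 + (y - point[1]) ** 2
        if d < best_d:
            best_d, best_label = d, label
    # Confidence decays with distance; 1.0 means an exact match.
    return best_label, 1.0 / (1.0 + best_d)

def self_train(labeled, unlabeled, threshold=0.5):
    """Iteratively propagate labels to sufficiently confident unlabelled samples."""
    labeled = list(labeled)
    remaining = list(unlabeled)
    changed = True
    while changed and remaining:
        changed = False
        still = []
        for p in remaining:
            label, conf = predict_with_confidence(labeled, p)
            if conf >= threshold:
                labeled.append((p, label))   # promote confident prediction
                changed = True
            else:
                still.append(p)              # leave for the expert
        remaining = still
    return labeled, remaining

# Two expert-labelled seeds and three unlabelled points: the two points
# near a seed get its label; the ambiguous midpoint stays for the expert.
seed = [((0.0, 0.0), "walk"), ((10.0, 10.0), "run")]
pool = [(0.5, 0.0), (9.5, 10.0), (5.0, 5.0)]
grown, left_for_expert = self_train(seed, pool, threshold=0.5)
```

This is how the annotation effort shrinks: the expert only labels the seeds plus whatever the confidence criterion refuses to propagate to.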

Journal ArticleDOI
TL;DR: GTCreator is introduced, a flexible annotation tool for providing image and text annotations to image-based datasets that is proven to be efficient for large image dataset annotation, as well as showing potential of use in other stages of method evaluation such as experimental setup or results analysis.
Abstract: Methodology evaluation for decision support systems for health is a time-consuming task. To assess performance of polyp detection methods in colonoscopy videos, clinicians have to deal with the annotation of thousands of images. Current existing tools could be improved in terms of flexibility and ease of use. We introduce GTCreator, a flexible annotation tool for providing image and text annotations to image-based datasets. It keeps the main basic functionalities of other similar tools while extending other capabilities, such as allowing multiple annotators to work simultaneously on the same task, enhanced dataset browsing, and easy annotation transfer aiming to speed up annotation processes in large datasets. The comparison with other similar tools shows that GTCreator makes it possible to obtain fast and precise annotation of image datasets, and is the only one which offers full annotation editing and browsing capabilities. Our proposed annotation tool has been proven to be efficient for large image dataset annotation, as well as showing potential of use in other stages of method evaluation such as experimental setup or results analysis.

Posted Content
TL;DR: Comparisons of the method with other publicly available annotation tools reveal that 3D annotations can be obtained faster and more efficiently by using the 3D BAT toolbox.
Abstract: In this paper, we focus on obtaining 2D and 3D labels, as well as track IDs for objects on the road with the help of a novel 3D Bounding Box Annotation Toolbox (3D BAT). Our open source, web-based 3D BAT incorporates several smart features to improve usability and efficiency. For instance, this annotation toolbox supports semi-automatic labeling of tracks using interpolation, which is vital for downstream tasks like tracking, motion planning and motion prediction. Moreover, annotations for all camera images are automatically obtained by projecting annotations from 3D space into the image domain. In addition to the raw image and point cloud feeds, a Masterview consisting of the top view (bird's-eye-view), side view and front views is made available to observe objects of interest from different perspectives. Comparisons of our method with other publicly available annotation tools reveal that 3D annotations can be obtained faster and more efficiently by using our toolbox.
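The step of projecting 3D annotations into the image domain amounts to applying a camera model to each box corner. As a simplified sketch (pinhole model, hypothetical camera intrinsics; real toolchains also handle distortion and the extrinsic transform from world to camera frame), a 3D point (X, Y, Z) in camera coordinates maps to pixel coordinates u = fx·X/Z + cx, v = fy·Y/Z + cy:

```python
# Sketch of projecting 3D annotations into an image with a pinhole
# camera model. Intrinsics (fx, fy, cx, cy) below are made-up values.

def project_point(point3d, fx, fy, cx, cy):
    """Project a 3D camera-frame point (X, Y, Z) to pixel (u, v); None if behind camera."""
    x, y, z = point3d
    if z <= 0:
        return None  # behind the image plane: not visible in this view
    return (fx * x / z + cx, fy * y / z + cy)

def project_box(corners3d, fx, fy, cx, cy):
    """Project each corner of a 3D bounding box, skipping corners behind the camera."""
    pts = (project_point(c, fx, fy, cx, cy) for c in corners3d)
    return [p for p in pts if p is not None]

# Toy example: a point 1 m to the right and 2 m ahead of the camera,
# with focal length 1000 px and the principal point at (640, 360).
uv = project_point((1.0, 0.0, 2.0), fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)
# → (1140.0, 360.0)
```

Running the projection over every annotated 3D box for every calibrated camera is what lets a tool like this derive 2D labels "for free" from a single 3D annotation.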