
Showing papers on "Annotation" published in 2020


Journal ArticleDOI
TL;DR: The GENCODE project annotates human and mouse genes and transcripts supported by experimental data with high accuracy, providing a foundational resource that supports genome biology and clinical genomics. The annotation process draws on primary data and bioinformatic tools to support the creation of transcript structures and the determination of their function.
Abstract: The GENCODE project annotates human and mouse genes and transcripts supported by experimental data with high accuracy, providing a foundational resource that supports genome biology and clinical genomics. GENCODE annotation processes make use of primary data and bioinformatic tools and analysis generated both within the consortium and externally to support the creation of transcript structures and the determination of their function. Here, we present improvements to our annotation infrastructure, bioinformatics tools, and analysis, and the advances they support in the annotation of the human and mouse genomes including: the completion of first pass manual annotation for the mouse reference genome; targeted improvements to the annotation of genes associated with SARS-CoV-2 infection; collaborative projects to achieve convergence across reference annotation databases for the annotation of human and mouse protein-coding genes; and the first GENCODE manually supervised automated annotation of lncRNAs. Our annotation is accessible via Ensembl, the UCSC Genome Browser and https://www.gencodegenes.org.

371 citations


Posted ContentDOI
11 Aug 2020-bioRxiv
TL;DR: In comparison with BRAKER1 supported by a large volume of transcript data, BRAKER2 could produce better gene prediction accuracy if the evolutionary distances to the reference species in the protein database were rather small.
Abstract: Full automation of gene prediction has become an important bioinformatics task since the advent of next generation sequencing. The eukaryotic genome annotation pipeline BRAKER1 had combined self-training GeneMark-ET with AUGUSTUS to generate genes' coordinates with support of transcriptomic data. Here, we introduce BRAKER2, a pipeline with GeneMark-EP+ and AUGUSTUS externally supported by cross-species protein sequences aligned to the genome. Among the challenges addressed in the development of the new pipeline was the generation of reliable hints to the locations of protein-coding exon boundaries from likely homologous but evolutionarily distant proteins. Under equal conditions, the gene prediction accuracy of BRAKER2 was shown to be higher than that of MAKER2, yet another genome annotation pipeline. Also, in comparison with BRAKER1 supported by a large volume of transcript data, BRAKER2 could produce better gene prediction accuracy if the evolutionary distances to the reference species in the protein database were rather small. Overall, our tests demonstrated that the fully automatic BRAKER2 is a fast and accurate method for structural annotation of novel eukaryotic genomes.

336 citations


Journal ArticleDOI
TL;DR: The data annotation bottleneck is identified as one of the key obstacles to machine learning approaches in clinical NLP, and future research in this field would benefit from alternatives such as data augmentation and transfer learning, or unsupervised learning, which do not require data annotation.
Abstract: Background: Clinical narratives represent the main form of communication within healthcare, providing a personalized account of patient history and assessments, offering rich information for clinical decision making. Natural language processing (NLP) has repeatedly demonstrated its feasibility to unlock evidence buried in clinical narratives. Machine learning can facilitate rapid development of NLP tools by leveraging large amounts of text data. Objective: The main aim of this study is to provide systematic evidence on the properties of text data used to train machine learning approaches to clinical NLP. We also investigate the types of NLP tasks that have been supported by machine learning and how they can be applied in clinical practice. Methods: Our methodology was based on the guidelines for performing systematic reviews. In August 2018, we used PubMed, a multi-faceted interface, to perform a literature search against MEDLINE. We identified a total of 110 relevant studies and extracted information about the text data used to support machine learning, the NLP tasks supported and their clinical applications. The data properties considered included their size, provenance, collection methods, annotation and any relevant statistics. Results: The vast majority of datasets used to train machine learning models included only hundreds or thousands of documents. Only 10 studies used tens of thousands of documents, with a handful of studies utilizing more. Relatively small datasets were utilized for training even when much larger datasets were available. The main reason for such poor data utilization is the annotation bottleneck faced by supervised machine learning algorithms. Active learning was explored to iteratively sample a subset of data for manual annotation as a strategy for minimizing the annotation effort while maximizing predictive performance of the model. Supervised learning was successfully used where clinical codes integrated with free text notes into electronic health records were utilized as class labels. Similarly, distant supervision was used to utilize an existing knowledge base to automatically annotate raw text. Where manual annotation was unavoidable, crowdsourcing was explored, but it remains unsuitable due to the sensitive nature of the data considered. Besides the small volume, training data were typically sourced from a small number of institutions, thus offering no hard evidence about the transferability of machine learning models. The vast majority of studies focused on the task of text classification. Most commonly, the classification results were used to support phenotyping, prognosis, care improvement, resource management and surveillance. Conclusions: We identified the data annotation bottleneck as one of the key obstacles to machine learning approaches in clinical NLP. Active learning and distant supervision were explored as ways of reducing the annotation effort. Future research in this field would benefit from alternatives such as data augmentation and transfer learning, or unsupervised learning, which do not require data annotation.

149 citations
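
To make the active-learning strategy discussed in this review concrete, here is a minimal, hedged sketch of pool-based uncertainty sampling; the corpus, model, and annotation budget are invented for illustration and are not taken from any of the reviewed studies.

```python
# Minimal sketch of pool-based active learning with uncertainty sampling.
# Illustrative only: the documents, labels, model and budget are made-up
# assumptions, not the setup of any study covered by the review.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["chest pain and dyspnea", "routine follow-up visit",
        "acute myocardial infarction suspected", "no acute distress noted",
        "elevated troponin levels", "annual wellness exam"]
labels = np.array([1, 0, 1, 0, 1, 0])          # hypothetical "cardiac event" labels

X = TfidfVectorizer().fit_transform(docs)
labeled = [0, 1]                                # start with two labeled documents
pool = [i for i in range(len(docs)) if i not in labeled]

for _ in range(2):                              # tiny annotation budget
    clf = LogisticRegression().fit(X[labeled], labels[labeled])
    probs = clf.predict_proba(X[pool])[:, 1]
    # Query the pool document the model is least certain about (prob closest to 0.5).
    query = pool[int(np.argmin(np.abs(probs - 0.5)))]
    labeled.append(query)                       # an annotator would label it here
    pool.remove(query)
    print("queried document:", docs[query])
```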


Journal ArticleDOI
TL;DR: This work provides a solution in the form of an R/Bioconductor package tximeta that performs numerous annotation and metadata gathering tasks automatically on behalf of users during the import of transcript quantification files.
Abstract: Correct annotation metadata is critical for reproducible and accurate RNA-seq analysis. When files are shared publicly or among collaborators with incorrect or missing annotation metadata, it becomes difficult or impossible to reproduce bioinformatic analyses from raw data. It also makes it more difficult to locate the transcriptomic features, such as transcripts or genes, in their proper genomic context, which is necessary for overlapping expression data with other datasets. We provide a solution in the form of an R/Bioconductor package tximeta that performs numerous annotation and metadata gathering tasks automatically on behalf of users during the import of transcript quantification files. The correct reference transcriptome is identified via a hashed checksum stored in the quantification output, and key transcript databases are downloaded and cached locally. The computational paradigm of automatically adding annotation metadata based on reference sequence checksums can greatly facilitate genomic workflows, by helping to reduce overhead during bioinformatic analyses, preventing costly bioinformatic mistakes, and promoting computational reproducibility. The tximeta package is available at https://bioconductor.org/packages/tximeta.

112 citations
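
The checksum paradigm described in the abstract can be sketched in a few lines. The sketch below is a Python illustration of the general idea (hashing reference sequences and looking the digest up in a metadata registry), not tximeta's actual R/Bioconductor interface; the digest scheme and registry contents are hypothetical.

```python
# Sketch of checksum-based reference identification: compute a digest of the
# reference transcriptome and attach annotation metadata automatically when
# the digest is recognized. Illustrative only; not tximeta's API.
import hashlib

def transcriptome_digest(fasta_path):
    """SHA-256 over the concatenated sequence lines of a FASTA file."""
    h = hashlib.sha256()
    with open(fasta_path) as fh:
        for line in fh:
            if not line.startswith(">"):
                h.update(line.strip().upper().encode())
    return h.hexdigest()

# Hypothetical registry mapping digests to annotation metadata.
KNOWN_TRANSCRIPTOMES = {
    "3f8a...placeholder-digest": {"source": "GENCODE", "organism": "Homo sapiens", "release": "36"},
}

def resolve_metadata(fasta_path):
    digest = transcriptome_digest(fasta_path)
    meta = KNOWN_TRANSCRIPTOMES.get(digest)
    if meta is None:
        raise ValueError(f"Unknown reference (digest {digest}); metadata cannot be attached safely.")
    return meta
```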


Journal ArticleDOI
TL;DR: The scope for data annotation has been substantially expanded to enhance biological interpretations of queried variants; this includes the addition of pathway analysis for the identification of enriched biological pathways and molecular processes.
Abstract: SNPnexus is a web-based annotation tool for the analysis and interpretation of both known and novel sequencing variations. Since its last release, SNPnexus has received continual updates to expand the range and depth of annotations provided. SNPnexus has undergone a complete overhaul of the underlying infrastructure to accommodate faster computational times. The scope for data annotation has been substantially expanded to enhance biological interpretations of queried variants. This includes the addition of pathway analysis for the identification of enriched biological pathways and molecular processes. We have further expanded the range of user-directed annotation fields available for the study of cancer sequencing data. These new additions facilitate investigations into cancer driver variants and targetable molecular alterations within input datasets. New user-directed filtering options have been coupled with the addition of interactive graphical and visualization tools. These improvements streamline the analysis of variants derived from large sequencing datasets for the identification of biologically and clinically significant subsets in the data. SNPnexus is the most comprehensive web-based application currently available, and this new set of updates ensures that it remains a state-of-the-art tool for researchers. SNPnexus is freely available at https://www.snp-nexus.org.

107 citations


Journal ArticleDOI
TL;DR: A suite of phage-oriented tools housed in open, user-friendly web-based interfaces is developed, providing a multi-purpose platform that enables researchers to easily and accurately annotate an entire phage genome.
Abstract: In the modern genomic era, scientists without extensive bioinformatic training need to apply high-power computational analyses to critical tasks like phage genome annotation. At the Center for Phage Technology (CPT), we developed a suite of phage-oriented tools housed in open, user-friendly web-based interfaces. A Galaxy platform conducts computationally intensive analyses and Apollo, a collaborative genome annotation editor, visualizes the results of these analyses. The collection includes open source applications such as the BLAST+ suite, InterProScan, and several gene callers, as well as unique tools developed at the CPT that allow maximum user flexibility. We describe in detail programs for finding Shine-Dalgarno sequences, resources used for confident identification of lysis genes such as spanins, and methods used for identifying interrupted genes that contain frameshifts or introns. At the CPT, genome annotation is separated into two robust segments that are facilitated through the automated execution of many tools chained together in an operation called a workflow. First, the structural annotation workflow results in gene and other feature calls. This is followed by a functional annotation workflow that combines sequence comparisons and conserved domain searching, which is contextualized to allow integrated evidence assessment in functional prediction. Finally, we describe a workflow used for comparative genomics. Using this multi-purpose platform enables researchers to easily and accurately annotate an entire phage genome. The portal can be accessed at https://cpt.tamu.edu/galaxy-pub with accompanying user training material.

77 citations
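
As a rough illustration of the Shine-Dalgarno finding mentioned above, the sketch below scans a fixed upstream window for common SD core motifs; it is not the CPT Galaxy tool, and the window size and motif list are assumptions.

```python
# Toy illustration of Shine-Dalgarno detection: look for an SD-like motif
# (complementary to the 3' end of the 16S rRNA) within a short window upstream
# of a candidate start codon. Window size and motif set are assumptions.
SD_CORES = ["AGGAGG", "GGAGG", "AGGAG", "GGAG"]   # common SD core variants, strongest first

def find_shine_dalgarno(genome, start_pos, window=20):
    """Return (motif, spacing to the start codon) for the best SD-like hit, or None."""
    upstream = genome[max(0, start_pos - window):start_pos].upper()
    for motif in SD_CORES:
        idx = upstream.rfind(motif)
        if idx != -1:
            spacing = len(upstream) - (idx + len(motif))
            return motif, spacing
    return None

seq = "TTTTAAGGAGGTTACAATGAAACGT"                  # toy sequence, ATG at index 16
print(find_shine_dalgarno(seq, seq.find("ATG")))  # ('AGGAGG', 5)
```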


Journal ArticleDOI
TL;DR: EnTAP (Eukaryotic Non‐Model Transcriptome Annotation Pipeline) was designed to improve the accuracy, speed, and flexibility of functional gene annotation for de novo assembled transcriptomes in non‐model eukaryotes.
Abstract: EnTAP (Eukaryotic Non-Model Transcriptome Annotation Pipeline) was designed to improve the accuracy, speed, and flexibility of functional gene annotation for de novo assembled transcriptomes in non-model eukaryotes. This software package addresses the fragmentation and related assembly issues that result in inflated transcript estimates and poor annotation rates of protein-coding transcripts. Following filters applied through assessment of true expression and frame selection, open-source tools are leveraged to functionally annotate the reduced set of translated proteins. Downstream features include fast similarity search across five repositories, protein domain assignment, orthologous gene family assessment, and Gene Ontology (GO) term assignment. The final annotation integrates across multiple databases and selects an optimal assignment from a combination of weighted metrics describing similarity search score, taxonomic relationship, and informativeness. Researchers have the option to include additional filters to identify and remove contaminants, identify associated pathways, and prepare the transcripts for enrichment analysis. This fully featured pipeline is easy to install and configure, and it runs significantly faster than comparable annotation packages. EnTAP is optimized to generate extensive functional information for the gene space of organisms with limited or poorly characterized genomic resources.

76 citations
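
The weighted-metric selection step can be illustrated with a small sketch; the fields, weights, and "uninformative" word list below are assumptions for illustration, not EnTAP's actual scoring scheme.

```python
# Sketch of weighted-metric selection of an optimal annotation: score each
# candidate hit by a weighted combination of similarity, taxonomic closeness
# and descriptor informativeness, then keep the best one. Fields, weights and
# the word list are illustrative assumptions.
UNINFORMATIVE = {"hypothetical", "uncharacterized", "predicted", "unknown"}

def informativeness(description):
    words = description.lower().split()
    return 0.0 if any(w in UNINFORMATIVE for w in words) else 1.0

def score_hit(hit, weights=(0.6, 0.25, 0.15)):
    w_sim, w_tax, w_inf = weights
    return (w_sim * hit["similarity"]          # normalized alignment score, 0..1
            + w_tax * hit["tax_closeness"]     # 1.0 = same species, 0.0 = very distant
            + w_inf * informativeness(hit["description"]))

hits = [
    {"id": "sp|P12345", "similarity": 0.92, "tax_closeness": 0.2,
     "description": "hypothetical protein"},
    {"id": "sp|Q67890", "similarity": 0.85, "tax_closeness": 0.8,
     "description": "cytochrome P450 monooxygenase"},
]
best = max(hits, key=score_hit)
print(best["id"], best["description"])   # the informative, taxonomically closer hit wins
```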



Posted Content
TL;DR: In this paper, the authors describe a dataset of more than 100,000 chest X-ray scans that were retrospectively collected from two major hospitals in Vietnam and release 18,000 images that were manually annotated by a total of 17 experienced radiologists with 22 local labels of rectangles surrounding abnormalities and 6 global labels of suspected diseases.
Abstract: Most of the existing chest X-ray datasets include labels from a list of findings without specifying their locations on the radiographs. This limits the development of machine learning algorithms for the detection and localization of chest abnormalities. In this work, we describe a dataset of more than 100,000 chest X-ray scans that were retrospectively collected from two major hospitals in Vietnam. Out of this raw data, we release 18,000 images that were manually annotated by a total of 17 experienced radiologists with 22 local labels of rectangles surrounding abnormalities and 6 global labels of suspected diseases. The released dataset is divided into a training set of 15,000 and a test set of 3,000. Each scan in the training set was independently labeled by 3 radiologists, while each scan in the test set was labeled by the consensus of 5 radiologists. We designed and built a labeling platform for DICOM images to facilitate these annotation procedures. All images are made publicly available in DICOM format together with the labels of the training set. The labels of the test set are hidden at the time of writing this paper as they will be used for benchmarking machine learning algorithms on an open platform.

54 citations
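
For intuition, a simple way to merge multi-reader image-level labels is a per-label majority vote, sketched below; note that the dataset's test-set labels were produced by radiologist consensus rather than an automatic vote, and the label names used here are only illustrative.

```python
# Illustrative merging of multi-reader image-level labels by majority vote.
# A simplification for intuition only; the dataset above was labeled through
# radiologist consensus sessions, and the label names here are illustrative.
from collections import Counter

def majority_vote(reader_labels):
    """reader_labels: list of label sets, one per radiologist."""
    n_readers = len(reader_labels)
    counts = Counter(label for labels in reader_labels for label in set(labels))
    return {label for label, c in counts.items() if c > n_readers / 2}

readers = [
    {"Cardiomegaly", "Pleural effusion"},
    {"Cardiomegaly"},
    {"Cardiomegaly", "Nodule/Mass"},
]
print(majority_vote(readers))   # {'Cardiomegaly'} -- kept by 3 of 3 readers
```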


Journal ArticleDOI
01 Mar 2020
TL;DR: The analysis of bacterial genomes from the Genome Taxonomy Database revealed that 52 and 79 % of the average bacterial proteome could be functionally annotated based on protein and domain-based homology searches, respectively, highlighting the disparity in annotation coverage.
Abstract: Although gene-finding in bacterial genomes is relatively straightforward, the automated assignment of gene function is still challenging, resulting in a vast quantity of hypothetical sequences of unknown function. But how prevalent are hypothetical sequences across bacteria, what proportion of genes in different bacterial genomes remain unannotated, and what factors affect annotation completeness? To address these questions, we surveyed over 27 000 bacterial genomes from the Genome Taxonomy Database, and measured genome annotation completeness as a function of annotation method, taxonomy, genome size, 'research bias' and publication date. Our analysis revealed that 52 and 79 % of the average bacterial proteome could be functionally annotated based on protein and domain-based homology searches, respectively. Annotation coverage using protein homology search varied significantly from as low as 14 % in some species to as high as 98 % in others. We found that taxonomy is a major factor influencing annotation completeness, with distinct trends observed across the microbial tree (e.g. the lowest level of completeness was found in the Patescibacteria lineage). Most lineages showed a significant association between genome size and annotation incompleteness, likely reflecting a greater degree of uncharacterized sequences in 'accessory' proteomes than in 'core' proteomes. Finally, research bias, as measured by publication volume, was also an important factor influencing genome annotation completeness, with early model organisms showing high completeness levels relative to other genomes in their own taxonomic lineages. Our work highlights the disparity in annotation coverage across the bacterial tree of life and emphasizes a need for more experimental characterization of accessory proteomes as well as understudied lineages.

46 citations


Posted Content
TL;DR: The conormal fan of a matroid M is a Lagrangian analog of the Bergman fan of M, and it is shown that the conormal fan satisfies Poincare duality, the hard Lefschetz theorem, and the Hodge-Riemann relations.
Abstract: We introduce the conormal fan of a matroid M, which is a Lagrangian analog of the Bergman fan of M. We use the conormal fan to give a Lagrangian interpretation of the Chern-Schwartz-MacPherson cycle of M. This allows us to express the h-vector of the broken circuit complex of M in terms of the intersection theory of the conormal fan of M. We also develop general tools for tropical Hodge theory to prove that the conormal fan satisfies Poincare duality, the hard Lefschetz theorem, and the Hodge-Riemann relations. The Lagrangian interpretation of the Chern-Schwartz-MacPherson cycle of M, when combined with the Hodge-Riemann relations for the conormal fan of M, implies Brylawski's and Dawson's conjectures that the h-vectors of the broken circuit complex and the independence complex of M are log-concave sequences.
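
For reference, the log-concavity established here for these h-vectors is the standard condition (this is the usual definition, not notation quoted from the paper): a sequence $(h_0, h_1, \dots, h_d)$ is log-concave if $h_i^2 \geq h_{i-1} h_{i+1}$ for all $0 < i < d$.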

Journal ArticleDOI
TL;DR: TeamTat is a novel tool for managing multi-user, multi-label document annotation that reflects the entire production life cycle; it provides corpus quality assessment via inter-annotator agreement statistics and a user-friendly interface for annotation review and inter-annotator disagreement resolution to improve corpus quality.
Abstract: Manually annotated data is key to developing text-mining and information-extraction algorithms. However, human annotation requires considerable time, effort and expertise. Given the rapid growth of biomedical literature, it is paramount to build tools that facilitate speed and maintain expert quality. While existing text annotation tools may provide user-friendly interfaces to domain experts, limited support is available for figure display, project management, and multi-user team annotation. In response, we developed TeamTat (https://www.teamtat.org), a web-based annotation tool (local setup available), equipped to manage team annotation projects engagingly and efficiently. TeamTat is a novel tool for managing multi-user, multi-label document annotation, reflecting the entire production life cycle. Project managers can specify annotation schema for entities and relations and select annotator(s) and distribute documents anonymously to prevent bias. Document input format can be plain text, PDF or BioC (uploaded locally or automatically retrieved from PubMed/PMC), and output format is BioC with inline annotations. TeamTat displays figures from the full text for the annotator's convenience. Multiple users can work on the same document independently in their workspaces, and the team manager can track task completion. TeamTat provides corpus quality assessment via inter-annotator agreement statistics, and a user-friendly interface convenient for annotation review and inter-annotator disagreement resolution to improve corpus quality.
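
As a minimal sketch of the inter-annotator agreement statistics mentioned above, the snippet below computes Cohen's kappa for two annotators using the textbook formula; it is not TeamTat's internal implementation, and the example labels are invented.

```python
# Cohen's kappa for two annotators: observed agreement corrected for the
# agreement expected by chance. Textbook formula; example labels are invented.
from collections import Counter

def cohens_kappa(ann_a, ann_b):
    assert len(ann_a) == len(ann_b)
    n = len(ann_a)
    observed = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    freq_a, freq_b = Counter(ann_a), Counter(ann_b)
    labels = set(ann_a) | set(ann_b)
    expected = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

a = ["Gene", "Disease", "Gene", "Chemical", "Gene", "Disease"]
b = ["Gene", "Disease", "Gene", "Gene",     "Gene", "Chemical"]
print(round(cohens_kappa(a, b), 3))   # agreement corrected for chance
```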

Book ChapterDOI
02 Nov 2020
TL;DR: Evaluations run on a novel dataset consisting of a set of high-quality manually-curated tables with non-obviously linkable cells show that ambiguity is a key problem for entity linking algorithms and encourage a promising direction for future work in the field.
Abstract: Table annotation is a key task to improve querying the Web and support the Knowledge Graph population from legacy sources (tables). Last year, the SemTab challenge was introduced to unify different efforts to evaluate table annotation algorithms by providing a common interface and several general-purpose datasets as a ground truth. The SemTab dataset is useful to have a general understanding of how these algorithms work, and the organizers of the challenge included some artificial noise to the data to make the annotation trickier. However, it is hard to analyze specific aspects in an automatic way. For example, the ambiguity of names at the entity-level can largely affect the quality of the annotation. In this paper, we propose a novel dataset to complement the datasets proposed by SemTab. The dataset consists of a set of high-quality manually-curated tables with non-obviously linkable cells, i.e., where values are ambiguous names, typos, and misspelled entity names not appearing in the current version of the SemTab dataset. These challenges are particularly relevant for the ingestion of structured legacy sources into existing knowledge graphs. Evaluations run on this dataset show that ambiguity is a key problem for entity linking algorithms and encourage a promising direction for future work in the field.

Journal ArticleDOI
28 May 2020-Sensors
TL;DR: MorphoCluster is a software tool for data-driven, fast, and accurate annotation of large image data sets; by aggregating similar images into clusters, it increases consistency, multiplies the throughput of an annotator, and allows experts to adapt the granularity of their sorting scheme to the structure in the data.
Abstract: In this work, we present MorphoCluster, a software tool for data-driven, fast, and accurate annotation of large image data sets. Marine image data have already surpassed the annotation rate of human experts, and their volume and complexity will continue to increase in the coming years. Still, this data requires interpretation. MorphoCluster augments the human ability to discover patterns and perform object classification in large amounts of data by embedding unsupervised clustering in an interactive process. By aggregating similar images into clusters, our novel approach to image annotation increases consistency, multiplies the throughput of an annotator, and allows experts to adapt the granularity of their sorting scheme to the structure in the data. A set of 1.2 M objects was sorted into 280 data-driven classes in 71 h (16 k objects per hour), with 90% of these classes having a precision of 0.889 or higher. This shows that MorphoCluster is at the same time fast, accurate, and consistent; provides a fine-grained and data-driven classification; and enables novelty detection.

Posted Content
TL;DR: The framework is designed to mitigate the issues of collecting and annotating social media data by cohesively combining machine and human effort in the data collection process, reducing the workload and problems involved in annotating data from social media platforms.
Abstract: In this paper, we present a semi-automated framework called AMUSED for gathering multi-modal annotated data from multiple social media platforms. The framework is designed to mitigate the issues of collecting and annotating social media data by cohesively combining machine and human effort in the data collection process. From a given list of articles from professional news media or blogs, AMUSED detects links to social media posts in the news articles and then downloads the contents of each post from the respective social media platform to gather details about that specific post. The framework is capable of fetching the annotated data from multiple platforms such as Twitter, YouTube, and Reddit. The framework aims to reduce the workload and problems involved in annotating data from social media platforms. AMUSED can be applied in multiple application domains; as a use case, we have implemented the framework for collecting COVID-19 misinformation data from different social media platforms.

Journal ArticleDOI
TL;DR: A deep learning-based method is presented that includes pre-annotation of the phases and steps in surgical videos and user assistance in the annotation process, significantly improving annotation duration and accuracy.
Abstract: Annotation of surgical videos is a time-consuming task which requires specific knowledge. In this paper, we present and evaluate a deep learning-based method that includes pre-annotation of the phases and steps in surgical videos and user assistance in the annotation process. We propose a classification function that automatically detects errors and infers temporal coherence in predictions made by a convolutional neural network. First, we trained three different architectures of neural networks to assess the method on two surgical procedures: cholecystectomy and cataract surgery. The proposed method was then implemented in an annotation software to test its ability to assist surgical video annotation. A user study was conducted to validate our approach, in which participants had to annotate the phases and the steps of a cataract surgery video. The annotation and the completion time were recorded. The participants who used the assistance system were 7% more accurate on the step annotation and 10 min faster than the participants who used the manual system. The results of the questionnaire showed that the assistance system did not disturb the participants and did not complicate the task. The annotation process is a difficult and time-consuming task essential to train deep learning algorithms. In this publication, we propose a method to assist the annotation of surgical workflows which was validated through a user study. The proposed assistance system significantly improved annotation duration and accuracy.
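
One simple way to impose temporal coherence on frame-wise phase predictions is a sliding-window majority filter, sketched below; the paper's classification function is more elaborate, and the phase names and window size here are assumptions.

```python
# Hedged sketch of enforcing temporal coherence on frame-wise phase predictions
# with a sliding-window majority filter. Phase names and window size are
# assumptions, not the paper's actual post-processing.
from collections import Counter

def smooth_predictions(frame_phases, window=5):
    """Replace each frame's phase by the majority phase in a centered window."""
    half = window // 2
    smoothed = []
    for i in range(len(frame_phases)):
        ctx = frame_phases[max(0, i - half): i + half + 1]
        smoothed.append(Counter(ctx).most_common(1)[0][0])
    return smoothed

raw = ["incision", "incision", "dissection", "incision", "incision",
       "dissection", "dissection", "dissection"]
print(smooth_predictions(raw))   # the isolated "dissection" at frame 2 is corrected
```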

Journal ArticleDOI
TL;DR: SnpHub is presented, a Shiny/R-based server framework for retrieving, analysing, and visualizing large-scale genomic variation data that can be easily set up on any Linux server and can be applied to any species.
Abstract: Background The cost of high-throughput sequencing is rapidly decreasing, allowing researchers to investigate genomic variations across hundreds or even thousands of samples in the post-genomic era. The management and exploration of these large-scale genomic variation data require programming skills. The public genotype querying databases of many species are usually centralized and implemented independently, making them difficult to update with new data over time. Currently, there is a lack of a widely used framework for setting up user-friendly web servers to explore new genomic variation data in diverse species. Results Here, we present SnpHub, a Shiny/R-based server framework for retrieving, analysing, and visualizing large-scale genomic variation data that can be easily set up on any Linux server. After a pre-building process based on the provided VCF files and genome annotation files, the local server allows users to interactively access single-nucleotide polymorphisms and small insertions/deletions with annotation information by locus or gene and to define sample sets through a web page. Users can freely analyse and visualize genomic variations in heatmaps, phylogenetic trees, haplotype networks, or geographical maps. Sample-specific sequences can be accessed as replaced by detected sequence variations. Conclusions SnpHub can be applied to any species, and we build up a SnpHub portal website for wheat and its progenitors based on published data in recent studies. SnpHub and its tutorial are available at http://guoweilong.github.io/SnpHub/. The wheat-SnpHub-portal website can be accessed at http://wheat.cau.edu.cn/Wheat_SnpHub_Portal/.

Proceedings ArticleDOI
01 Jul 2020
TL;DR: This work presents a novel domain-agnostic Human-In-The-Loop annotation approach that uses recommenders that suggest potential concepts and adaptive candidate ranking, thereby speeding up the overall annotation process and making it less tedious for users.
Abstract: Entity linking (EL) is concerned with disambiguating entity mentions in a text against knowledge bases (KB). It is crucial in a considerable number of fields like humanities, technical writing and biomedical sciences to enrich texts with semantics and discover more knowledge. The use of EL in such domains requires handling noisy texts, low resource settings and domain-specific KBs. Existing approaches are mostly inappropriate for this, as they depend on training data. However, in the above scenario, hardly any annotated data exists, and it needs to be created from scratch. We therefore present a novel domain-agnostic Human-In-The-Loop annotation approach: we use recommenders that suggest potential concepts and adaptive candidate ranking, thereby speeding up the overall annotation process and making it less tedious for users. We evaluate our ranking approach in a simulation on difficult texts and show that it greatly outperforms a strong baseline in ranking accuracy. In a user study, the annotation speed improves by 35% compared to annotating without interactive support; users report that they strongly prefer our system. An open-source and ready-to-use implementation based on the text annotation platform INCEpTION (https://inception-project.github.io) is made available.
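
The candidate-ranking idea can be sketched as follows: candidates from a knowledge base are scored by string similarity to the mention plus a prior from links the user has already accepted, so the ranking adapts during annotation. The scoring function and the tiny knowledge base below are illustrative assumptions, not the recommenders shipped with INCEpTION.

```python
# Rough sketch of adaptive candidate ranking for entity linking: string
# similarity to the mention plus a prior from previously accepted links.
# The scoring and the tiny KB are illustrative assumptions.
from difflib import SequenceMatcher
from collections import Counter

kb = ["aspirin", "acetylsalicylic acid", "acetaminophen", "ascorbic acid"]
accepted = Counter({"acetylsalicylic acid": 3})   # links the user confirmed so far

def rank_candidates(mention, kb, accepted, prior_weight=0.1):
    def score(entity):
        sim = SequenceMatcher(None, mention.lower(), entity.lower()).ratio()
        return sim + prior_weight * accepted[entity]
    return sorted(kb, key=score, reverse=True)

print(rank_candidates("asa (aspirin)", kb, accepted))
```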

Journal ArticleDOI
18 Sep 2020-Biology
TL;DR: A summary of both structural and functional annotation, together with the associated comparative annotation tools and pipelines, is presented.
Abstract: Next-Generation Sequencing (NGS) has made it easier to obtain genome-wide sequence data and it has shifted the research focus into genome annotation. The challenging tasks involved in annotation rely on the currently available tools and techniques to decode the information contained in nucleotide sequences. This information will improve our understanding of general aspects of life and evolution and improve our ability to diagnose genetic disorders. Here, we present a summary of both structural and functional annotations, as well as the associated comparative annotation tools and pipelines. We highlight visualization tools that immensely aid the annotation process and the contributions of the scientific community to the annotation. Further, we discuss quality-control practices and the need for re-annotation, and highlight the future of annotation.

Journal ArticleDOI
01 Dec 2020
TL;DR: This work develops a multi-platform dataset that consists purely of the text from posts gathered from seven social media platforms and shows that, despite the diversity of examples present in the dataset, good performance is possible for models trained on datasets produced in this manner.
Abstract: Recent work on cyberbullying detection relies on using machine learning models with text and metadata in small datasets, mostly drawn from single social media platforms. Such models have succeeded in predicting cyberbullying when dealing with posts containing the text and the metadata structure as found on the platform. Instead, we develop a multi-platform dataset that consists purely of the text from posts gathered from seven social media platforms. We present a multi-stage and multi-technique annotation system that initially uses crowdsourcing for post and hashtag annotation and subsequently utilizes machine-learning methods to identify additional posts for annotation. This process has the benefit of selecting posts for annotation that have a significantly greater than chance likelihood of constituting clear cases of cyberbullying without limiting the range of samples to those containing predetermined features (as is the case when hashtags alone are used to select posts for annotation). We show that, despite the diversity of examples present in the dataset, good performance is possible for models trained on datasets produced in this manner. This becomes a clear advantage compared to traditional methods of post selection and labeling because it increases the number of positive examples that can be produced using the same resources and it enhances the diversity of communication media to which the models can be applied.

Proceedings ArticleDOI
01 Jul 2020
TL;DR: An improved crowdsourcing protocol for complex semantic annotation, involving worker selection and training and a data consolidation phase, is presented; applied to QA-SRL, it yielded high-quality annotation with drastically higher coverage, producing a new gold evaluation dataset.
Abstract: Question-answer driven Semantic Role Labeling (QA-SRL) was proposed as an attractive open and natural flavour of SRL, potentially attainable from laymen. Recently, a large-scale crowdsourced QA-SRL corpus and a trained parser were released. Trying to replicate the QA-SRL annotation for new texts, we found that the resulting annotations were lacking in quality, particularly in coverage, making them insufficient for further research and evaluation. In this paper, we present an improved crowdsourcing protocol for complex semantic annotation, involving worker selection and training, and a data consolidation phase. Applying this protocol to QA-SRL yielded high-quality annotation with drastically higher coverage, producing a new gold evaluation dataset. We believe that our annotation protocol and gold standard will facilitate future replicable research of natural semantic annotations.

Journal ArticleDOI
TL;DR: The bitacora system integrates popular sequence similarity-based search tools and Perl scripts to facilitate both the curation of inaccurate gene family annotations and the identification of previously undetected gene family copies directly in genomic DNA sequences.
Abstract: Gene annotation is a critical bottleneck in genomic research, especially for the comprehensive study of very large gene families in the genomes of nonmodel organisms. Despite the recent progress in automatic methods, state-of-the-art tools used for this task often produce inaccurate annotations, such as fused, chimeric, partial or even completely absent gene models for many family copies, errors that require considerable extra efforts to be corrected. Here we present bitacora, a bioinformatics solution that integrates popular sequence similarity-based search tools and Perl scripts to facilitate both the curation of these inaccurate annotations and the identification of previously undetected gene family copies directly in genomic DNA sequences. We tested the performance of bitacora in annotating the members of two chemosensory gene families with different repertoire size in seven available genome sequences, and compared its performance with that of augustus-ppx, a tool also designed to improve automatic annotations using a sequence similarity-based approach. Despite the relatively high fragmentation of some of these drafts, bitacora was able to improve the annotation of many members of these families and detected thousands of new chemoreceptors encoded in genome sequences. The program creates general feature format (GFF) files, with both curated and newly identified gene models, and FASTA files with the predicted proteins. These outputs can be easily integrated in genomic annotation editors, greatly facilitating subsequent manual annotation and downstream evolutionary analyses.

Posted ContentDOI
19 Nov 2020-bioRxiv
TL;DR: The full-stack ChromHMM model provides a universal chromatin state annotation of the genome and a unified global view of over 1000 datasets, and is more predictive of locations of external genomic annotations.
Abstract: Genome-wide maps of chromatin marks such as histone modifications and open chromatin sites provide valuable information for annotating the non-coding genome, including identifying regulatory elements. Computational approaches such as ChromHMM have been applied to discover and annotate chromatin states defined by combinatorial and spatial patterns of chromatin marks within the same cell type. An alternative stacked modeling approach was previously suggested, where chromatin states are defined jointly from datasets of multiple cell types to produce a single universal genome annotation based on all datasets. Despite its potential benefits for applications that are not specific to one cell type, such an approach was previously applied only for small-scale specialized purposes. Large-scale applications of stacked modeling have previously posed scalability challenges. In this paper, using a version of ChromHMM enhanced for large-scale applications, we applied the stacked modeling approach to produce a universal chromatin state annotation of the human genome using over 1000 datasets from more than 100 cell types, denoted the full-stack model. The full-stack model states show distinct enrichments for external genomic annotations, which we used in characterizing each state. Compared to cell-type-specific annotations, the full-stack annotation directly differentiates constitutive from cell-type-specific activity and is more predictive of locations of external genomic annotations. Overall, the full-stack ChromHMM model provides a universal chromatin state annotation of the genome and a unified global view of over 1000 datasets. We expect this to be a useful resource that complements existing cell-type-specific annotations for studying the non-coding human genome.

Posted Content
TL;DR: The existence of a locally dense set of real polynomial automorphisms of $\mathbb{C}^2$ displaying a wandering Fatou component is proved, based on a geometric model that can be embedded into families of Hénon maps and into an open and dense set of families of surface diffeomorphisms in the Newhouse domain.
Abstract: We prove the existence of a locally dense set of real polynomial automorphisms of $\mathbb{C}^2$ displaying a wandering Fatou component; in particular this solves the problem of their existence, reported by Bedford and Smillie in 1991. These Fatou components have non-empty real trace and their statistical behavior is historical with high emergence. The proof is based on a geometric model for parameter families of surface real mappings. At a dense set of parameters, we show that the dynamics of the model displays a historical, high emergent, stable domain. We show that this model can be embedded into families of Hénon maps of explicit degree and also in an open and dense set of 5-parameter $C^r$-families of surface diffeomorphisms in the Newhouse domain, for every $2 \le r \le \infty$ and $r = \omega$. This implies a complement of the work of Kiriki and Soma (2017), and a proof of Takens' last problem in the $C^\infty$ and $C^\omega$ case. The main difficulty is that here perturbations are done only along finite-dimensional parameter families. The proof is based on the multi-renormalization introduced in [Ber18].

Posted Content
TL;DR: The study of Latin squares goes back more than 200 years to the work of Euler, and one of the most famous open problems in this area is a conjecture of Ryser-Brualdi-Stein from the 1960s which says that every Latin square of order $n$ contains a transversal of order $n-1$.
Abstract: A Latin square of order $n$ is an $n \times n$ array filled with $n$ symbols such that each symbol appears only once in every row or column, and a transversal is a collection of cells which do not share the same row, column or symbol. The study of Latin squares goes back more than 200 years to the work of Euler. One of the most famous open problems in this area is a conjecture of Ryser-Brualdi-Stein from the 1960s which says that every Latin square of order $n$ contains a transversal of order $n-1$. In this paper we prove the existence of a transversal of order $n-O(\log{n}/\log{\log{n}})$, improving the celebrated bound of $n-O(\log^2n)$ by Hatami and Shor. Our approach (different from that of Hatami-Shor) is quite general and gives several other applications as well. We obtain a new lower bound on a 40 year old conjecture of Brouwer on the maximum matching in Steiner triple systems, showing that every such system of order $n$ is guaranteed to have a matching of size $n/3-O(\log{n}/\log{\log{n}})$. This substantially improves the current best result of Alon, Kim and Spencer which has the error term of order $n^{1/2+o(1)}$. Finally, we also show that $O(n\log{n}/\log{\log{n}})$ many symbols in Latin arrays suffice to guarantee a full transversal, improving on the previously known bound of $n^{2-\varepsilon}$. The proofs combine in a novel way the semi-random method together with the robust expansion properties of edge-coloured pseudorandom graphs to show the existence of a rainbow matching covering all but $O(\log n/\log{\log{n}})$ vertices. All previous results, based on the semi-random method, left uncovered at least $\Omega(n^{\alpha})$ (for some constant $\alpha$) vertices.

Journal ArticleDOI
TL;DR: UniRule, a method of annotation based on expertly curated rules that integrates related systems (RuleBase, HAMAP, PIRSR, PIRNR) developed by the members of the UniProt consortium, provides scalable enrichment of annotation in UniProtKB.
Abstract: Motivation The number of protein records in the UniProt Knowledgebase (UniProtKB: https://www.uniprot.org) continues to grow rapidly as a result of genome sequencing and the prediction of protein-coding genes. Providing functional annotation for these proteins presents a significant and continuing challenge. Results In response to this challenge, UniProt has developed a method of annotation, known as UniRule, based on expertly curated rules, which integrates related systems (RuleBase, HAMAP, PIRSR, PIRNR) developed by the members of the UniProt consortium. UniRule uses protein family signatures from InterPro, combined with taxonomic and other constraints, to select sets of reviewed proteins which have common functional properties supported by experimental evidence. This annotation is propagated to unreviewed records in UniProtKB that meet the same selection criteria, most of which do not have (and are never likely to have) experimentally verified functional annotation. Release 2020_01 of UniProtKB contains 6496 UniRule rules which provide annotation for 53 million proteins, accounting for 30% of the 178 million records in UniProtKB. UniRule provides scalable enrichment of annotation in UniProtKB. Availability and implementation UniRule rules are integrated into UniProtKB and can be viewed at https://www.uniprot.org/unirule/. UniRule rules and the code required to run the rules, are publicly available for researchers who wish to annotate their own sequences. The implementation used to run the rules is known as UniFIRE and is available at https://gitlab.ebi.ac.uk/uniprot-public/unifire.
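
A rule of the kind described above can be sketched as a signature-plus-taxonomy condition whose annotation is propagated to matching unreviewed records; the rule content, accession, and lineage below are invented examples, not actual UniRule rules or UniProtKB entries.

```python
# Hedged sketch of rule-based annotation propagation in the spirit of UniRule:
# a rule fires when a protein carries the required family signature and
# satisfies a taxonomic constraint, and its annotation is copied to the
# unreviewed record. All values are invented examples.
RULES = [
    {
        "require_signatures": {"IPR000001"},        # hypothetical InterPro signature
        "require_taxon": "Bacteria",
        "annotation": {"protein_name": "Example kinase",
                       "ec_number": "2.7.11.1"},
    },
]

def annotate(record, rules=RULES):
    """Return propagated annotations for an unreviewed protein record."""
    applied = {}
    for rule in rules:
        if (rule["require_signatures"] <= record["signatures"]
                and rule["require_taxon"] in record["lineage"]):
            applied.update(rule["annotation"])
    return applied

record = {"accession": "A0A000XXXX",                # placeholder accession
          "signatures": {"IPR000001", "IPR999999"},
          "lineage": ["cellular organisms", "Bacteria", "Proteobacteria"]}
print(annotate(record))
```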

Posted Content
TL;DR: In this article, the authors prove a correspondence between genus zero logarithmic Gromov-Witten invariants of $X$ intersecting $D$ in a single point with maximal tangency and the consistent wall structure appearing in the dual intersection complex of the log Calabi-Yau pair $(X,D)$.
Abstract: Consider a log Calabi-Yau pair $(X,D)$ consisting of a smooth del Pezzo surface $X$ of degree $\geq 3$ and a smooth anticanonical divisor $D$. We prove a correspondence between genus zero logarithmic Gromov-Witten invariants of $X$ intersecting $D$ in a single point with maximal tangency and the consistent wall structure appearing in the dual intersection complex of $(X,D)$ from the Gross-Siebert reconstruction algorithm. More precisely, the logarithm of the product of functions attached to unbounded walls in the consistent wall structure gives a generating function for these invariants.

Journal ArticleDOI
TL;DR: An end-to-end deep learning framework for object-level multilabel annotation of RS images is proposed, which adopts the binary cross-entropy loss for classification and the triplet loss for image embedding learning.
Abstract: Multilabel remote sensing (RS) image annotation is a challenging and time-consuming task that requires a considerable amount of expert knowledge. Most existing RS image annotation methods are based on handcrafted features and require multistage processes that are not sufficiently efficient and effective. An RS image can be assigned with a single label at the scene level to depict the overall understanding of the scene and with multiple labels at the object level to represent the major components. The multiple labels can be used as supervised information for annotation, whereas the single label can be used as additional information to exploit the scene-level similarity relationships. By exploiting the dual-level semantic concepts, we propose an end-to-end deep learning framework for object-level multilabel annotation of RS images. The proposed framework consists of a shared convolutional neural network for discriminative feature learning, a classification branch for multilabel annotation and an embedding branch for preserving the scene-level similarity relationships. In the classification branch, an attention mechanism is introduced to generate attention-aware features, and skip-layer connections are incorporated to combine information from multiple layers. The philosophy of the embedding branch is that images with the same scene-level semantic concepts should have similar visual representations. The proposed method adopts the binary cross-entropy loss for classification and the triplet loss for image embedding learning. The evaluations on three multilabel RS image data sets demonstrate the effectiveness and superiority of the proposed method in comparison with the state-of-the-art methods.
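
The loss combination named in the abstract can be sketched in PyTorch as binary cross-entropy on the classification branch plus a triplet loss on the embedding branch; the tensor shapes, margin, and weighting below are placeholder assumptions, not the paper's configuration.

```python
# Minimal PyTorch sketch of combining BCE (multilabel classification branch)
# with a triplet loss (embedding branch). Shapes, margin and weighting are
# placeholder assumptions.
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()                 # multilabel classification loss
triplet = nn.TripletMarginLoss(margin=1.0)   # scene-level embedding loss

batch, n_labels, emb_dim = 8, 20, 128        # placeholder sizes
logits = torch.randn(batch, n_labels, requires_grad=True)      # classification branch output
targets = torch.randint(0, 2, (batch, n_labels)).float()       # object-level multilabels
anchor, positive, negative = (torch.randn(batch, emb_dim, requires_grad=True)
                              for _ in range(3))                # embedding branch outputs

loss = bce(logits, targets) + 0.5 * triplet(anchor, positive, negative)
loss.backward()        # in a real model, gradients flow to both branches
print(float(loss))
```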

Posted ContentDOI
11 Feb 2020-bioRxiv
TL;DR: SnpHub is presented, a Shiny/R-based server framework for retrieving, analysing and visualizing large-scale genomic variation data that can be easily set up on any Linux server and can be applied to any species.
Abstract: Background: The cost of high-throughput sequencing is rapidly decreasing, allowing researchers to investigate genomic variations across hundreds or even thousands of samples in the post-genomic era. The management and exploration of these large-scale genomic variation data require programming skills. The public genotype querying databases of many species are usually centralized and implemented independently, making them difficult to update with new data over time. Currently, there is a lack of a widely used framework for setting up user-friendly web servers for exploring new genomic variation data in diverse species. Results: Here, we present SnpHub, a Shiny/R-based server framework for retrieving, analysing and visualizing large-scale genomic variation data that can be easily set up on any Linux server. After a pre-building process based on the provided VCF files and genome annotation files, the local server allows users to interactively access SNPs/INDELs and annotation information by locus or gene and for user-defined sample sets through a webpage. Users can freely analyse and visualize genomic variations in heatmaps, phylogenetic trees, haplotype networks, or geographical maps. Sample-specific sequences can be accessed as replaced by SNPs/INDELs. Conclusions: SnpHub can be applied to any species, and we build up a SnpHub portal website for wheat and its progenitors based on published data in recent studies. SnpHub and its tutorial are available at http://guoweilong.github.io/SnpHub/.

Journal ArticleDOI
TL;DR: A novel workflow is introduced to interactively incorporate the 'human in the loop' when training classification models from annotated data; it is able to speed up the annotation process, and by providing additional visual explanations it helps annotators understand the decision-making process as well as the trustworthiness of their trained machine learning models.
Abstract: In the following article, we introduce a novel workflow, which we subsume under the term “explainable cooperative machine learning”, and show its practical application in a data annotation and model training tool called NOVA. The main idea of our approach is to interactively incorporate the 'human in the loop' when training classification models from annotated data. In particular, NOVA offers a collaborative annotation backend where multiple annotators join their workforce. A main aspect is the possibility of applying semi-supervised active learning techniques already during the annotation process by making it possible to pre-label data automatically, resulting in a drastic acceleration of the annotation process. Furthermore, the user interface implements recent eXplainable AI techniques to provide users with both a confidence value for the automatically predicted annotations and a visual explanation. We show in a use-case evaluation that our workflow is able to speed up the annotation process, and we further argue that by providing additional visual explanations annotators get to understand the decision-making process as well as the trustworthiness of their trained machine learning models.
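
The pre-labelling step can be illustrated with a small confidence-threshold sketch: a model proposes a label for each unannotated segment, and only high-confidence proposals are surfaced as pre-annotations for the annotator to confirm or correct. The model outputs, threshold, and label names below are assumptions, not NOVA's implementation.

```python
# Small sketch of confidence-based pre-labelling: keep a model's proposal only
# when its confidence exceeds a threshold; otherwise leave the segment for
# fully manual annotation. Labels, probabilities and threshold are assumptions.
import numpy as np

def pre_label(probabilities, classes, threshold=0.85):
    """probabilities: (n_segments, n_classes) array of model confidences."""
    proposals = []
    for row in probabilities:
        idx = int(np.argmax(row))
        conf = float(row[idx])
        proposals.append((classes[idx], conf) if conf >= threshold else None)
    return proposals

classes = ["laughter", "speech", "silence"]
probs = np.array([[0.92, 0.05, 0.03],
                  [0.40, 0.35, 0.25],
                  [0.10, 0.02, 0.88]])
print(pre_label(probs, classes))   # [('laughter', 0.92), None, ('silence', 0.88)]
```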