
Showing papers on "Annotation" published in 2021


Journal ArticleDOI
TL;DR: The Interactive Tree Of Life (ITOL) as mentioned in this paper is an online tool for the display, manipulation and annotation of phylogenetic and other trees, which allows users to draw shapes, labels and other features directly onto the trees.
Abstract: The Interactive Tree Of Life (https://itol.embl.de) is an online tool for the display, manipulation and annotation of phylogenetic and other trees. It is freely available and open to everyone. iTOL version 5 introduces a completely new tree display engine, together with numerous new features. For example, a new dataset type has been added (MEME motifs), while annotation options have been expanded for several existing ones. Node metadata display options have been extended and now also support non-numerical categorical values, as well as multiple values per node. Direct manual annotation is now available, providing a set of basic drawing and labeling tools, allowing users to draw shapes, labels and other features by hand directly onto the trees. Support for tree and dataset scales has been extended, providing fine control over line and label styles. Unrooted tree displays can now use the equal-daylight algorithm, providing much greater display clarity. The user account system has been streamlined and expanded with new navigation options and currently handles >1 million trees from >70 000 individual users.

2,856 citations


Journal ArticleDOI
TL;DR: A historical archive covering the past 15 years of GO data, with a consistent format and file structure for both the ontology and annotations, has been made available to support traceability and reproducibility.
Abstract: The Gene Ontology Consortium (GOC) provides the most comprehensive resource currently available for computable knowledge regarding the functions of genes and gene products. Here, we report the advances of the consortium over the past two years. The new GO-CAM annotation framework was notably improved, and we formalized the model with a computational schema to check and validate the rapidly increasing repository of 2838 GO-CAMs. In addition, we describe the impacts of several collaborations to refine GO and report a 10% increase in the number of GO annotations, a 25% increase in annotated gene products, and over 9,400 new scientific articles annotated. As the project matures, we continue our efforts to review older annotations in light of newer findings and to maintain consistency with other ontologies. As a result, 20 000 annotations derived from experimental data were reviewed, corresponding to 2.5% of experimental GO annotations. The website (http://geneontology.org) was redesigned for quick access to documentation, downloads and tools. To maintain an accurate resource and support traceability and reproducibility, we have made available a historical archive covering the past 15 years of GO data with a consistent format and file structure for both the ontology and annotations.

1,988 citations


Journal ArticleDOI
06 Jan 2021
TL;DR: The BRAKER2 pipeline as mentioned in this paper generates and integrates external protein support into the iterative process of training and gene prediction by GeneMark-EP+ and AUGUSTUS, and it is favorably compared with other pipelines, e.g. MAKER2, in terms of accuracy and performance.
Abstract: The task of eukaryotic genome annotation remains challenging. Only a few genomes could serve as standards of annotation achieved through a tremendous investment of human curation efforts. Still, the correctness of all alternative isoforms, even in the best-annotated genomes, could be a good subject for further investigation. The new BRAKER2 pipeline generates and integrates external protein support into the iterative process of training and gene prediction by GeneMark-EP+ and AUGUSTUS. BRAKER2 continues the line started by BRAKER1, where self-training GeneMark-ET and AUGUSTUS made gene predictions supported by transcriptomic data. Among the challenges addressed by the new pipeline was the generation of reliable hints to protein-coding exon boundaries from likely homologous but evolutionarily distant proteins. In comparison with other pipelines for eukaryotic genome annotation, BRAKER2 is fully automatic. It compares favorably under equal conditions with other pipelines, e.g. MAKER2, in terms of accuracy and performance. Development of BRAKER2 should facilitate solving the task of harmonization of annotation of protein-coding genes in genomes of different eukaryotic species. However, we fully understand that several more innovations are needed in transcriptomic and proteomic technologies as well as in algorithmic development to reach the goal of highly accurate annotation of eukaryotic genomes.

455 citations


Journal ArticleDOI
TL;DR: Annotation tools are applied to build training and test corpora, which are essential for the development and evaluation of new natural language processing algorithms; an evaluation shows that some tools are comprehensive and mature enough to be used on most annotation projects.
Abstract: MOTIVATION: Annotation tools are applied to build training and test corpora, which are essential for the development and evaluation of new natural language processing algorithms. Further, annotation tools are also used to extract new information for a particular use case. However, owing to the high number of existing annotation tools, finding the one that best fits particular needs is a demanding task that requires searching the scientific literature followed by installing and trying various tools. METHODS: We searched for annotation tools and selected a subset of them according to five requirements with which they should comply, such as being Web-based or supporting the definition of a schema. We installed the selected tools (when necessary), carried out hands-on experiments and evaluated them using 26 criteria that covered functional and technical aspects. We defined each criterion on three levels of matches and a score for the final evaluation of the tools. RESULTS: We evaluated 78 tools and selected the following 15 for a detailed evaluation: BioQRator, brat, Catma, Djangology, ezTag, FLAT, LightTag, MAT, MyMiner, PDFAnno, prodigy, tagtog, TextAE, WAT-SL and WebAnno. Full compliance with our 26 criteria ranged from only 9 up to 20 criteria, which demonstrated that some tools are comprehensive and mature enough to be used on most annotation projects. The highest score of 0.81 was obtained by WebAnno (of a maximum value of 1.0).

78 citations
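
The abstract summarises the scoring only at a high level; the sketch below shows one way per-criterion match levels could be aggregated into a normalised score between 0 and 1. The three level values (0, 0.5, 1) and the equal weighting are assumptions for illustration, not the paper's exact scheme.

```python
# Hypothetical aggregation of per-criterion match levels into a tool score.
# The level values and equal weighting are assumptions for illustration;
# the paper defines its own three-level scheme over 26 criteria.

LEVEL_VALUES = {"no": 0.0, "partial": 0.5, "full": 1.0}

def tool_score(criteria_matches):
    """criteria_matches: dict mapping criterion name -> 'no'/'partial'/'full'."""
    if not criteria_matches:
        return 0.0
    total = sum(LEVEL_VALUES[level] for level in criteria_matches.values())
    return total / len(criteria_matches)  # normalised to a maximum of 1.0

# Example: a tool fully complying with 20 of 26 criteria and partially with 2.
matches = {f"criterion_{i}": "full" for i in range(20)}
matches.update({f"criterion_{i}": "partial" for i in range(20, 22)})
matches.update({f"criterion_{i}": "no" for i in range(22, 26)})
print(round(tool_score(matches), 2))
```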


DOI
01 Nov 2021
TL;DR: Bakta as discussed by the authors is a command-line software tool for the robust, taxon-independent, thorough and, nonetheless, fast annotation of bacterial genomes, including the detection of small proteins taking into account replicon metadata.
Abstract: Command-line annotation software tools have continuously gained popularity compared to centralized online services due to the worldwide increase of sequenced bacterial genomes. However, results of existing command-line software pipelines heavily depend on taxon-specific databases or sufficiently well annotated reference genomes. Here, we introduce Bakta, a new command-line software tool for the robust, taxon-independent, thorough and, nonetheless, fast annotation of bacterial genomes. Bakta conducts a comprehensive annotation workflow including the detection of small proteins taking into account replicon metadata. The annotation of coding sequences is accelerated via an alignment-free sequence identification approach that in addition facilitates the precise assignment of public database cross-references. Annotation results are exported in GFF3 and International Nucleotide Sequence Database Collaboration (INSDC)-compliant flat files, as well as comprehensive JSON files, facilitating automated downstream analysis. We compared Bakta to other rapid contemporary command-line annotation software tools in both targeted and taxonomically broad benchmarks including isolates and metagenome-assembled genomes. We demonstrated that Bakta outperforms other tools in terms of functional annotations, the assignment of functional categories and database cross-references, whilst providing comparable wall-clock runtimes. Bakta is implemented in Python 3 and runs on macOS and Linux systems. It is freely available under a GPLv3 license at https://github.com/oschwengers/bakta. An accompanying web version is available at https://bakta.computational.bio.

76 citations


Journal ArticleDOI
TL;DR: This paper proposes an innovative method in which the visual features of the image are presented by the intermediate layer features of deep learning, while semantic concepts are represented by mean vectors of positive samples.
Abstract: Automatic image annotation is an effective computer operation that predicts the annotation of an unknown image by automatically learning potential relationships between the semantic concept space and the visual feature space in the annotation image dataset. Usually, automatic image labeling involves two stages: a learning stage and a labeling stage. Existing image annotation methods that employ convolutional features of deep learning methods have a number of limitations, including complex training and high space/time expenses associated with the image annotation procedure. Accordingly, this paper proposes an innovative method in which the visual features of the image are represented by the intermediate-layer features of deep learning, while semantic concepts are represented by mean vectors of positive samples. Firstly, the convolutional result is directly output in the form of low-level visual features through the mid-level of the pre-trained deep learning model, with the image being represented by sparse coding. Secondly, the positive mean vector method is used to construct visual feature vectors for each text vocabulary item, so that a visual feature vector database is created. Finally, the visual feature vector similarity between the testing image and all text vocabulary is calculated, and the vocabulary with the largest similarity is used for annotation. Experiments on the datasets demonstrate the effectiveness of the proposed method; in terms of F1 score, the proposed method’s performance on the Corel5k dataset and IAPR TC-12 dataset is superior to that of MBRM, JEC-AF, JEC-DF, and 2PKNN with end-to-end deep features.

57 citations
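
A minimal sketch of the positive-mean-vector idea described above, assuming image features (e.g., intermediate-layer CNN activations, possibly after sparse coding) have already been extracted; the function names and toy data are illustrative, not the paper's code.

```python
# Sketch: represent each vocabulary word by the mean feature vector of its
# positive training images, then annotate a test image with the words whose
# mean vectors are most similar (cosine similarity).
import numpy as np

def build_label_vectors(features, labels, vocabulary):
    """Mean feature vector of the positive training samples for each word."""
    label_vectors = {}
    for word in vocabulary:
        positives = features[[word in lab for lab in labels]]
        if len(positives):
            label_vectors[word] = positives.mean(axis=0)
    return label_vectors

def annotate(image_feature, label_vectors, top_k=5):
    """Return the top_k vocabulary words by cosine similarity."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    scores = {w: cosine(image_feature, v) for w, v in label_vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy usage with random 512-d "intermediate layer" features.
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(100, 512))
train_labels = [set(rng.choice(["sky", "sea", "tree", "car"], 2, replace=False))
                for _ in range(100)]
vectors = build_label_vectors(train_feats, train_labels, ["sky", "sea", "tree", "car"])
print(annotate(rng.normal(size=512), vectors, top_k=2))
```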


Journal ArticleDOI
TL;DR: This update presents the major improvements since the initial release of OpenProt, which enables a more comprehensive annotation of eukaryotic genomes and fosters functional proteomic discoveries.
Abstract: OpenProt (www.openprot.org) is the first proteogenomic resource supporting a polycistronic annotation model for eukaryotic genomes. It provides a deeper annotation of open reading frames (ORFs) while mining experimental data for supporting evidence using cutting-edge algorithms. This update presents the major improvements since the initial release of OpenProt. All species support recent NCBI RefSeq and Ensembl annotations, with changes in annotations being reported in OpenProt. Using the 131 ribosome profiling datasets re-analysed by OpenProt to date, non-AUG initiation starts are reported alongside a confidence score of the initiating codon. From the 177 mass spectrometry datasets re-analysed by OpenProt to date, the unicity of the detected peptides is controlled at each implementation. Furthermore, to guide the users, detectability statistics and protein relationships (isoforms) are now reported for each protein. Finally, to foster access to deeper ORF annotation independently of one's bioinformatics skills or computational resources, OpenProt now offers a data analysis platform. Users can submit their dataset for analysis and receive the results from the analysis by OpenProt. All data on OpenProt are freely available and downloadable for each species, the release-based format ensuring continuous access to the data. Thus, OpenProt enables a more comprehensive annotation of eukaryotic genomes and fosters functional proteomic discoveries.

57 citations


Journal ArticleDOI
TL;DR: The MuSe-CaR dataset as discussed by the authors is a large-scale multimodal dataset for sentiment and emotion research, which includes audio-visual and language modalities, and has been used as a testbed for the 1st Multimodal Sentiment Analysis Challenge.
Abstract: Truly real-life data presents a strong but exciting challenge for sentiment and emotion research. The high variety of possible ‘in-the-wild’ properties makes large datasets such as these indispensable with respect to building robust machine learning models. A sufficient quantity of data covering a deep variety in the challenges of each modality to force the exploratory analysis of the interplay of all modalities has not yet been made available in this context. In this contribution, we present MuSe-CaR, a first-of-its-kind multimodal dataset. The data is publicly available, as it recently served as the testing bed for the 1st Multimodal Sentiment Analysis Challenge, which focused on the tasks of emotion, emotion-target engagement, and trustworthiness recognition by means of comprehensively integrating the audio-visual and language modalities. Furthermore, we give a thorough overview of the dataset in terms of collection and annotation, including annotation tiers not used in this year's MuSe 2020. In addition, for one of the sub-challenges -- predicting the level of trustworthiness -- no participant outperformed the baseline model, and so we propose a simple but highly efficient Multi-Head-Attention network that, using multimodal fusion, exceeds the baseline by around 0.2 CCC (almost a 50% improvement).

32 citations
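
As a rough illustration of multi-head-attention fusion over audio-visual and language features, the sketch below fuses three per-segment feature sequences and regresses a single score per clip. The dimensions, fusion layout, and regression head are assumptions, not the authors' exact architecture.

```python
# Illustrative multi-head-attention fusion; feature sequences are assumed to
# be pre-extracted per modality. All sizes are placeholders.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, text, audio, visual):
        # Text queries attend over the concatenated audio-visual sequence.
        context = torch.cat([audio, visual], dim=1)
        fused, _ = self.attn(text, context, context)
        return self.head(fused.mean(dim=1)).squeeze(-1)  # one score per clip

model = AttentionFusion()
t, a, v = (torch.randn(8, 20, 64) for _ in range(3))  # batch of 8 clips
print(model(t, a, v).shape)  # torch.Size([8])
```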


Journal ArticleDOI
15 Apr 2021
TL;DR: The authors developed an annotation schema and a benchmark for automated claim detection that is more consistent across time, topics, and annotators than are previous approaches, achieving an F1 score of 0.83 with over 5% relative improvement over the state-of-the-art methods ClaimBuster and ClaimRank.
Abstract: In an effort to assist factcheckers in the process of factchecking, we tackle the claim detection task, one of the necessary stages prior to determining the veracity of a claim. It consists of identifying the set of sentences, out of a long text, deemed capable of being factchecked. This article is a collaborative work between Full Fact, an independent factchecking charity, and academic partners. Leveraging the expertise of professional factcheckers, we develop an annotation schema and a benchmark for automated claim detection that is more consistent across time, topics, and annotators than are previous approaches. Our annotation schema has been used to crowdsource the annotation of a dataset with sentences from UK political TV shows. We introduce an approach based on universal sentence representations to perform the classification, achieving an F1 score of 0.83, with over 5% relative improvement over the state-of-the-art methods ClaimBuster and ClaimRank. The system was deployed in production and received positive user feedback.

30 citations
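
As a rough illustration of claim detection as binary classification over universal sentence representations, the sketch below trains a classifier on sentence embeddings. The encoder is a stand-in stub for a real pretrained universal sentence encoder, and the example sentences are invented.

```python
# Sketch: sentence embeddings -> binary "checkable claim" classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

def encode(sentences, dim=128):
    """Placeholder embedding: a pseudo-random vector per sentence (stand-in
    for a pretrained universal sentence encoder)."""
    return np.stack([np.random.default_rng(abs(hash(s)) % 2**32).normal(size=dim)
                     for s in sentences])

train_sents = ["Unemployment fell by 3% last year.",          # checkable claim
               "What a wonderful day it has been.",            # not a claim
               "The NHS budget was cut by 2 billion pounds.",  # checkable claim
               "I really admire her dedication."]              # not a claim
train_y = [1, 0, 1, 0]

clf = LogisticRegression(max_iter=1000).fit(encode(train_sents), train_y)
test = ["Crime rose by 10% in London.", "Thanks everyone for watching."]
print(clf.predict(encode(test)))
```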


Proceedings ArticleDOI
Candice Schumann, Susanna Ricco, Utsav Prabhu, Vittorio Ferrari, Caroline Pantofaru
21 Jul 2021
TL;DR: More Inclusive Annotations for People (MIAP) as discussed by the authors is a subset of the Open Images dataset, containing bounding boxes and attributes for all of the people visible in those images.
Abstract: The Open Images Dataset contains approximately 9 million images and is a widely accepted dataset for computer vision research. As is common practice for large datasets, the annotations are not exhaustive, with bounding boxes and attribute labels for only a subset of the classes in each image. In this paper, we present a new set of annotations on a subset of the Open Images dataset called the MIAP (More Inclusive Annotations for People) subset, containing bounding boxes and attributes for all of the people visible in those images. The attributes and labeling methodology for the MIAP subset were designed to enable research into model fairness. In addition, we analyze the original annotation methodology for the person class and its subclasses, discussing the resulting patterns in order to inform future annotation efforts. By considering both the original and exhaustive annotation sets, researchers can also now study how systematic patterns in training annotations affect modeling.

26 citations


Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper proposed a deep label-specific feature (Deep-LIFT) learning model to build the explicit and exact correspondence between the label and the local visual region, which improves the effectiveness of feature learning and enhances the interpretability of the model itself.
Abstract: Image annotation aims to jointly predict multiple tags for an image. Although significant progress has been achieved, existing approaches usually overlook aligning specific labels and their corresponding regions due to the weak supervised information (i.e., "bag of labels" for regions), thus failing to explicitly exploit the discrimination from different classes. In this article, we propose the deep label-specific feature (Deep-LIFT) learning model to build the explicit and exact correspondence between the label and the local visual region, which improves the effectiveness of feature learning and enhances the interpretability of the model itself. Deep-LIFT extracts features for each label by aligning each label and its region. Specifically, Deep-LIFTs are achieved through learning multiple correlation maps between image convolutional features and label embeddings. Moreover, we construct two variant graph convolutional networks (GCNs) to further capture the interdependency among labels. Empirical studies on benchmark datasets validate that the proposed model achieves superior performance on multilabel classification over other existing state-of-the-art methods.
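
A minimal sketch of computing label-specific features from correlation maps between convolutional features and label embeddings, in the spirit of the approach described above; the shapes, softmax pooling, and names are illustrative assumptions, and the GCN over labels is omitted.

```python
# Sketch: per-label correlation maps and label-specific pooled features.
import torch

def label_specific_features(feature_map, label_embeddings):
    """feature_map: (C, H, W); label_embeddings: (L, C).
    Returns per-label correlation maps (L, H, W) and pooled features (L, C)."""
    corr = torch.einsum("lc,chw->lhw", label_embeddings, feature_map)
    weights = torch.softmax(corr.flatten(1), dim=1).view_as(corr)  # spatial attention
    pooled = torch.einsum("lhw,chw->lc", weights, feature_map)
    return corr, pooled

fmap = torch.randn(256, 14, 14)       # e.g., backbone output for one image
labels = torch.randn(20, 256)         # 20 label embeddings
corr_maps, feats = label_specific_features(fmap, labels)
print(corr_maps.shape, feats.shape)   # torch.Size([20, 14, 14]) torch.Size([20, 256])
```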

Journal ArticleDOI
TL;DR: In this article, four working groups were formed from a pool of participants that included clinicians, engineers, and data scientists, and were focused on four themes: (1) temporal models, (2) actions and tasks, (3) tissue characteristics and general anatomy, and (4) software and data structure.
Abstract: The growing interest in analysis of surgical video through machine learning has led to increased research efforts; however, common methods of annotating video data are lacking. There is a need to establish recommendations on the annotation of surgical video data to enable assessment of algorithms and multi-institutional collaboration. Four working groups were formed from a pool of participants that included clinicians, engineers, and data scientists. The working groups were focused on four themes: (1) temporal models, (2) actions and tasks, (3) tissue characteristics and general anatomy, and (4) software and data structure. A modified Delphi process was utilized to create a consensus survey based on suggested recommendations from each of the working groups. After three Delphi rounds, consensus was reached on recommendations for annotation within each of these domains. A hierarchy for annotation of temporal events in surgery was established. While additional work remains to achieve accepted standards for video annotation in surgery, the consensus recommendations on a general framework for annotation presented here lay the foundation for standardization. This type of framework is critical to enabling diverse datasets, performance benchmarks, and collaboration.

Posted Content
TL;DR: The authors proposed a multi-task based approach to resolve annotator disagreements and derive single ground truth labels from multiple annotations, where predicting each annotator's judgements is treated as a separate subtask, while sharing a common learned representation of the task.
Abstract: Majority voting and averaging are common approaches employed to resolve annotator disagreements and derive single ground truth labels from multiple annotations. However, annotators may systematically disagree with one another, often reflecting their individual biases and values, especially in the case of subjective tasks such as detecting affect, aggression, and hate speech. Annotator disagreements may capture important nuances in such tasks that are often ignored while aggregating annotations to a single ground truth. In order to address this, we investigate the efficacy of multi-annotator models. In particular, our multi-task based approach treats predicting each annotator's judgements as separate subtasks, while sharing a common learned representation of the task. We show that this approach yields the same or better performance than aggregating labels in the data prior to training across seven different binary classification tasks. Our approach also provides a way to estimate uncertainty in predictions, which we demonstrate correlates better with annotation disagreements than traditional methods. Being able to model uncertainty is especially useful in deployment scenarios where knowing when not to make a prediction is important.
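
A minimal sketch of the multi-annotator idea described above: a shared encoder with one classification head per annotator. The bag-of-embeddings encoder and all sizes are placeholders, not the authors' implementation; in training, each head would be supervised only with the labels its annotator actually provided, and the spread across heads can serve as an uncertainty estimate.

```python
# Sketch: shared representation, one binary classification head per annotator.
import torch
import torch.nn as nn

class MultiAnnotatorModel(nn.Module):
    def __init__(self, vocab_size, n_annotators, dim=64):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)           # shared representation
        self.heads = nn.ModuleList(nn.Linear(dim, 2) for _ in range(n_annotators))

    def forward(self, token_ids):
        shared = self.embed(token_ids)
        return torch.stack([head(shared) for head in self.heads], dim=1)  # (B, A, 2)

model = MultiAnnotatorModel(vocab_size=1000, n_annotators=5)
tokens = torch.randint(0, 1000, (8, 12))               # batch of 8 texts, 12 tokens each
logits = model(tokens)                                  # per-annotator predictions
disagreement = logits.softmax(-1)[..., 1].std(dim=1)    # proxy for prediction uncertainty
print(logits.shape, disagreement.shape)
```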


Journal ArticleDOI
TL;DR: Exact as discussed by the authors is an open-source online platform that enables the collaborative interdisciplinary analysis of images from different domains online and offline, including medical whole-slide images and image series with thousands of images.
Abstract: In many research areas, scientific progress is accelerated by multidisciplinary access to image data and their interdisciplinary annotation. However, keeping track of these annotations to ensure a high-quality multi-purpose data set is a challenging and labour intensive task. We developed the open-source online platform EXACT (EXpert Algorithm Collaboration Tool) that enables the collaborative interdisciplinary analysis of images from different domains online and offline. EXACT supports multi-gigapixel medical whole slide images as well as image series with thousands of images. The software utilises a flexible plugin system that can be adapted to diverse applications such as counting mitotic figures with a screening mode, finding false annotations on a novel validation view, or using the latest deep learning image analysis technologies. This is combined with a version control system which makes it possible to keep track of changes in the data sets and, for example, to link the results of deep learning experiments to specific data set versions. EXACT is freely available and has already been successfully applied to a broad range of annotation tasks, including highly diverse applications like deep learning supported cytology scoring, interdisciplinary multi-centre whole slide image tumour annotation, and highly specialised whale sound spectroscopy clustering.

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed weakly supervised ReID (SYSU-30k), which groups the images into bags in terms of time and assigns a bag-level label for each bag.
Abstract: Person reidentification (Re-ID) benefits greatly from the accurate annotations of existing data sets (e.g., CUHK03 and Market-1501), which are quite expensive because each image in these data sets has to be assigned with a proper label. In this work, we ease the annotation of Re-ID by replacing the accurate annotation with inaccurate annotation, i.e., we group the images into bags in terms of time and assign a bag-level label for each bag. This greatly reduces the annotation effort and leads to the creation of a large-scale Re-ID benchmark called SYSU-30k. The new benchmark contains 30k individuals, which is about 20 times larger than CUHK03 (1.3k individuals) and Market-1501 (1.5k individuals), and 30 times larger than ImageNet (1k categories). It sums up to 29,606,918 images. Learning a Re-ID model with bag-level annotation is called the weakly supervised Re-ID problem. To solve this problem, we introduce a differentiable graphical model to capture the dependencies from all images in a bag and generate a reliable pseudolabel for each person’s image. The pseudolabel is further used to supervise the learning of the Re-ID model. Compared with the fully supervised Re-ID models, our method achieves state-of-the-art performance on SYSU-30k and other data sets. The code, data set, and pretrained model will be available at https://github.com/wanggrun/SYSU-30k.

Journal ArticleDOI
TL;DR: Mantis as mentioned in this paper is a protein annotation tool that uses database identifiers intersection and text mining to integrate knowledge from multiple reference data sources into a single consensus-driven output, allowing for customization of reference data and execution parameters, and is reproducible across different research goals and user environments.
Abstract: BACKGROUND: The rapid development of the (meta-)omics fields has produced an unprecedented amount of high-resolution and high-fidelity data. Through the use of these datasets we can infer the role of previously functionally unannotated proteins from single organisms and consortia. In this context, protein function annotation can be described as the identification of regions of interest (i.e., domains) in protein sequences and the assignment of biological functions. Despite the existence of numerous tools, challenges remain in terms of speed, flexibility, and reproducibility. In the big data era, it is also increasingly important to cease limiting our findings to a single reference, coalescing knowledge from different data sources, and thus overcoming some limitations in overly relying on computationally generated data from single sources. RESULTS: We implemented a protein annotation tool, Mantis, which uses database identifiers intersection and text mining to integrate knowledge from multiple reference data sources into a single consensus-driven output. Mantis is flexible, allowing for the customization of reference data and execution parameters, and is reproducible across different research goals and user environments. We implemented a depth-first search algorithm for domain-specific annotation, which significantly improved annotation performance compared to sequence-wide annotation. The parallelized implementation of Mantis results in short runtimes while also outputting high coverage and high-quality protein function annotations. CONCLUSIONS: Mantis is a protein function annotation tool that produces high-quality consensus-driven protein annotations. It is easy to set up, customize, and use, scaling from single genomes to large metagenomes. Mantis is available under the MIT license at https://github.com/PedroMTQ/mantis
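
The following is an illustrative consensus step, not Mantis's actual algorithm: annotations from several reference sources are merged per protein, and the function supported by the most sources wins, with the best score as a tie-breaker. Source names and scores are invented.

```python
# Illustrative consensus over per-protein hits from multiple reference sources.
from collections import defaultdict

def consensus_annotation(hits):
    """hits: list of (source, function_label, score) tuples for one protein."""
    support, best_score = defaultdict(set), defaultdict(float)
    for source, label, score in hits:
        support[label].add(source)
        best_score[label] = max(best_score[label], score)
    return max(support, key=lambda lab: (len(support[lab]), best_score[lab]))

hits = [("pfam", "kinase domain", 0.92),
        ("kofam", "kinase domain", 0.88),
        ("eggnog", "ATP binding", 0.95)]
print(consensus_annotation(hits))  # 'kinase domain' (supported by two sources)
```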

Proceedings ArticleDOI
Alina Kuznetsova, Aakrati Talati, Yiwen Luo, Keith Simmons, Vittorio Ferrari
01 Jan 2021
TL;DR: In this article, a unified framework for generic video annotation with bounding boxes is proposed, which has both interpolating and extrapolating capabilities and a guiding mechanism that sequentially generates suggestions for what frame to annotate next, based on the annotations made previously.
Abstract: We introduce a unified framework for generic video annotation with bounding boxes. Video annotation is a longstanding problem, as it is a tedious and time-consuming process. We tackle two important challenges of video annotation: (1) automatic temporal interpolation and extrapolation of bounding boxes provided by a human annotator on a subset of all frames, and (2) automatic selection of frames to annotate manually. Our contribution is two-fold: first, we propose a model that has both interpolating and extrapolating capabilities; second, we propose a guiding mechanism that sequentially generates suggestions for what frame to annotate next, based on the annotations made previously. We extensively evaluate our approach on several challenging datasets in simulation and demonstrate a reduction in terms of the number of manual bounding boxes drawn by 60% over linear interpolation and by 35% over an off-the-shelf tracker. Moreover, we also show 10% annotation time improvement over a state-of-the-art method for video annotation with bounding boxes [25]. Finally, we run human annotation experiments and provide extensive analysis of the results, showing that our approach reduces actual measured annotation time by 50% compared to commonly used linear interpolation.
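
The linear-interpolation baseline mentioned in the abstract is easy to make concrete; the sketch below densifies keyframe bounding boxes by linear interpolation (the paper's model replaces this with learned interpolation/extrapolation plus frame suggestions). Names and the toy boxes are illustrative.

```python
# Sketch: linearly interpolate bounding boxes between manually annotated keyframes.
import numpy as np

def interpolate_boxes(keyframes):
    """keyframes: dict frame_index -> box (x1, y1, x2, y2). Returns a dense dict."""
    frames = sorted(keyframes)
    dense = {}
    for a, b in zip(frames, frames[1:]):
        box_a, box_b = np.asarray(keyframes[a], float), np.asarray(keyframes[b], float)
        for f in range(a, b + 1):
            t = (f - a) / (b - a)
            dense[f] = tuple((1 - t) * box_a + t * box_b)
    return dense

annotated = {0: (10, 10, 50, 60), 10: (30, 12, 72, 64)}
print(interpolate_boxes(annotated)[5])  # box halfway between the two keyframes
```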

Journal ArticleDOI
TL;DR: Zhang et al. as discussed by the authors proposed a semi-supervised approach based on adaptive weighted fusion for automatic image annotation that can simultaneously utilize the labeled data and unlabeled data to improve the annotation performance.
Abstract: To learn a well-performed image annotation model, a large number of labeled samples are usually required. Although the unlabeled samples are readily available and abundant, it is a difficult task for humans to annotate large numbers of images manually. In this article, we propose a novel semi-supervised approach based on adaptive weighted fusion for automatic image annotation that can simultaneously utilize the labeled data and unlabeled data to improve the annotation performance. At first, two different classifiers, constructed based on support vector machine and convolutional neural network, respectively, are trained by different features extracted from the labeled data. Therefore, these two classifiers are independently represented as different feature views. Then, the corresponding features of unlabeled images are extracted and input into these two classifiers, and the semantic annotation of images can be obtained respectively. At the same time, the confidence of corresponding image annotation can be measured by an adaptive weighted fusion strategy. After that, the images and their semantic annotations with high confidence are submitted to the classifiers for retraining until a certain stop condition is reached. As a result, we can obtain a strong classifier that can make full use of unlabeled data. Finally, we conduct experiments on four datasets, namely, Corel 5K, IAPR TC12, ESP Game, and NUS-WIDE. In addition, we measure the performance of our approach with standard criteria, including precision, recall, F-measure, N+, and mAP. The experimental results show that our approach has superior performance and outperforms many state-of-the-art approaches.
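
A compact sketch of the co-training loop described above: two classifiers on different feature views pseudo-label the unlabeled pool, and only samples with high fused confidence are added back for retraining. The second classifier stands in for the CNN view, and the fixed equal weights stand in for the adaptive weighting; data and thresholds are invented.

```python
# Sketch: two-view self-training with weighted fusion of predicted probabilities.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_lab = rng.normal(size=(60, 20))
y_lab = (X_lab[:, 0] > 0).astype(int)
X_unl = rng.normal(size=(300, 20))
w1, w2, threshold = 0.5, 0.5, 0.9             # fusion weights and confidence cut-off

for _ in range(3):                             # a few self-training rounds
    clf1 = SVC(probability=True).fit(X_lab[:, :10], y_lab)     # view 1
    clf2 = LogisticRegression().fit(X_lab[:, 10:], y_lab)      # view 2
    fused = (w1 * clf1.predict_proba(X_unl[:, :10])
             + w2 * clf2.predict_proba(X_unl[:, 10:]))
    conf, pseudo = fused.max(axis=1), fused.argmax(axis=1)
    keep = conf > threshold
    if not keep.any():
        break
    X_lab = np.vstack([X_lab, X_unl[keep]])
    y_lab = np.concatenate([y_lab, pseudo[keep]])
    X_unl = X_unl[~keep]

print(len(y_lab), "labeled samples after self-training")
```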

Journal ArticleDOI
TL;DR: The Mouse Genome Informatics (MGI) database system combines multiple expertly curated community data resources into a shared knowledge management ecosystem united by common metadata annotation standards as mentioned in this paper.
Abstract: The Mouse Genome Informatics (MGI) database system combines multiple expertly curated community data resources into a shared knowledge management ecosystem united by common metadata annotation standards. MGI’s mission is to facilitate the use of the mouse as an experimental model for understanding the genetic and genomic basis of human health and disease. MGI is the authoritative source for mouse gene, allele, and strain nomenclature and is the primary source of mouse phenotype annotations, functional annotations, developmental gene expression information, and annotations of mouse models with human diseases. MGI maintains mouse anatomy and phenotype ontologies and contributes to the development of the Gene Ontology and Disease Ontology and uses these ontologies as standard terminologies for annotation. The Mouse Genome Database (MGD) and the Gene Expression Database (GXD) are MGI’s two major knowledgebases. Here, we highlight some of the recent changes and enhancements to MGD and GXD that have been implemented in response to changing needs of the biomedical research community and to improve the efficiency of expert curation. MGI can be accessed freely at http://www.informatics.jax.org .

Journal ArticleDOI
TL;DR: In this paper, the authors provide a step-by-step guideline for the FAIR data and metadata management specific to grapevine and wine science, including meaningful information on experimental design and phenotyping, sample collection, sample preparation, chemotype analysis, data analysis and metabolite annotation.
Abstract: In the era of big and omics data, good organization, management, and description of experimental data are crucial for achieving high-quality datasets. This, in turn, is essential for the export of robust results, to publish reliable papers, make data more easily available, and unlock the huge potential of data reuse. More and more journals now require authors to share data and metadata according to the FAIR (Findable, Accessible, Interoperable, Reusable) principles. This work aims to provide a step-by-step guideline for the FAIR data and metadata management specific to grapevine and wine science. In detail, the guidelines include recommendations for the organization of data and metadata regarding (i) meaningful information on experimental design and phenotyping, (ii) sample collection, (iii) sample preparation, (iv) chemotype analysis, (v) data analysis, (vi) metabolite annotation, and (vii) basic ontologies. We hope that these guidelines will be helpful for the grapevine and wine metabolomics community and that they will help it realize the true potential of data usage in creating new knowledge.

Journal ArticleDOI
TL;DR: The authors evaluate the reliability of annotation with common subfamilies by assessing the extent to which subfamily annotation is reproducible in replicate copies created by segmental duplications in the human genome, and in homologous copies shared by human and chimpanzee.
Abstract: Transposable element (TE) sequences are classified into families based on the reconstructed history of replication, and into subfamilies based on more fine-grained features that are often intended to capture family history. We evaluate the reliability of annotation with common subfamilies by assessing the extent to which subfamily annotation is reproducible in replicate copies created by segmental duplications in the human genome, and in homologous copies shared by human and chimpanzee. We find that standard methods annotate over 10% of replicates as belonging to different subfamilies, despite the fact that they are expected to be annotated as belonging to the same subfamily. Point mutations and homologous recombination appear to be responsible for some of this discordant annotation (particularly in the young Alu family), but are unlikely to fully explain the annotation unreliability. The surprisingly high level of disagreement in subfamily annotation of homologous sequences highlights a need for further research into definition of TE subfamilies, methods for representing subfamily annotation confidence of TE instances, and approaches to better utilizing such nuanced annotation data in downstream analysis.
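
The reliability measure at the core of the study can be illustrated in a few lines: the fraction of replicate pairs whose two copies are assigned different subfamilies. The example pairs below are invented.

```python
# Sketch: discordance rate of subfamily annotations across replicate copies.
def discordance_rate(pairs):
    """pairs: iterable of (subfamily_of_copy_1, subfamily_of_copy_2)."""
    pairs = list(pairs)
    disagree = sum(1 for a, b in pairs if a != b)
    return disagree / len(pairs) if pairs else 0.0

replicate_pairs = [("AluYa5", "AluYa5"), ("AluYa5", "AluY"),
                   ("L1PA3", "L1PA3"), ("AluSx", "AluSx1")]
print(f"{discordance_rate(replicate_pairs):.0%} of replicate pairs disagree")
```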

Proceedings ArticleDOI
06 May 2021
TL;DR: EventAnchor as mentioned in this paper uses machine learning models in computer vision to help users acquire essential events from videos (e.g., serve, the ball bouncing on the court) and offers users a set of interactive tools for data annotation.
Abstract: The popularity of racket sports (e.g., tennis and table tennis) leads to high demands for data analysis, such as notational analysis, on player performance. While sports videos offer many benefits for such analysis, retrieving accurate information from sports videos could be challenging. In this paper, we propose EventAnchor, a data analysis framework to facilitate interactive annotation of racket sports video with the support of computer vision algorithms. Our approach uses machine learning models in computer vision to help users acquire essential events from videos (e.g., serve, the ball bouncing on the court) and offers users a set of interactive tools for data annotation. An evaluation study on a table tennis annotation system built on this framework shows significant improvement in user performance on simple annotation tasks on objects of interest and on complex annotation tasks requiring domain knowledge.

Journal ArticleDOI
TL;DR: CorGAT as discussed by the authors is a tool for functional annotation of SARS-CoV-2 genomic variants, which can facilitate the identification of evolutionary patterns in the genome of the SARS co-v-2.
Abstract: SUMMARY: While over 150 thousand genomic sequences are currently available through dedicated repositories, ad hoc methods for the functional annotation of SARS-CoV-2 genomes do not harness all currently available resources for the annotation of functionally relevant genomic sites. Here we present CorGAT, a novel tool for the functional annotation of SARS-CoV-2 genomic variants. By comparisons with other state of the art methods we demonstrate that, by providing a more comprehensive and rich annotation, our method can facilitate the identification of evolutionary patterns in the genome of SARS-CoV-2. AVAILABILITY: Galaxy: http://corgat.cloud.ba.infn.it/galaxy; software: https://github.com/matteo14c/CorGAT/tree/Revision_V1; docker: https://hub.docker.com/r/laniakeacloud/galaxy_corgat. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: In this paper, a method, Quantitative Annotation of Unknown Structure (QAUST), is proposed to infer protein functions, specifically Gene Ontology (GO) terms and Enzyme Commission (EC) numbers.

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed a robust deep learning based single-cell multiple reference annotator (scMRA), which aggregated multiple datasets into a meta-dataset on which annotation is conducted.
Abstract: Motivation: Single-cell RNA-seq (scRNA-seq) has been widely used to resolve cellular heterogeneity. After collecting scRNA-seq data, the natural next step is to integrate the accumulated data to achieve a common ontology of cell types and states. Thus, an effective and efficient cell-type identification method is urgently needed. Meanwhile, high quality reference data remain a necessity for precise annotation. However, such tailored reference data are always lacking in practice. To address this, we aggregated multiple datasets into a meta-dataset on which annotation is conducted. Existing supervised or semi-supervised annotation methods suffer from batch effects caused by different sequencing platforms, the effect of which increases in severity with multiple reference datasets. Results: Herein, a robust deep learning based single-cell Multiple Reference Annotator (scMRA) is introduced. In scMRA, a knowledge graph is constructed to represent the characteristics of cell types in different datasets, and a graph convolutional network (GCN) serves as a discriminator based on this graph. scMRA keeps intra-cell-type closeness and the relative position of cell types across datasets. scMRA is remarkably powerful at transferring knowledge from multiple reference datasets to the unlabeled target domain, thereby gaining an advantage over other state-of-the-art annotation methods in multi-reference data experiments. Furthermore, scMRA can remove batch effects. To the best of our knowledge, this is the first attempt to use multiple insufficient reference datasets to annotate target data, and it is, comparatively, the best annotation method for multiple scRNA-seq datasets. Availability: An implementation of scMRA is available from https://github.com/ddb-qiwang/scMRA-torch. Supplementary information: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
20 Nov 2021 - Displays
TL;DR: Li et al. as discussed by the authors adopt the teacher-student framework to generate pseudo-labels from unlabeled training data, and use a label filtering method to improve the pseudo label quality.

Posted Content
TL;DR: Wang et al. as discussed by the authors used only patch-level classification labels to achieve tissue semantic segmentation on histopathology images, finally reducing the annotation efforts, and proposed a two-step model including a classification and a segmentation phases.
Abstract: Tissue-level semantic segmentation is a vital step in computational pathology. Fully-supervised models have already achieved outstanding performance with dense pixel-level annotations. However, drawing such labels on the giga-pixel whole slide images is extremely expensive and time-consuming. In this paper, we use only patch-level classification labels to achieve tissue semantic segmentation on histopathology images, finally reducing the annotation efforts. We proposed a two-step model including a classification phase and a segmentation phase. In the classification phase, we proposed a CAM-based model to generate pseudo masks by patch-level labels. In the segmentation phase, we achieved tissue semantic segmentation by our proposed Multi-Layer Pseudo-Supervision. Several technical novelties have been proposed to reduce the information gap between pixel-level and patch-level annotations. As a part of this paper, we introduced a new weakly-supervised semantic segmentation (WSSS) dataset for lung adenocarcinoma (LUAD-HistoSeg). We conducted several experiments to evaluate our proposed model on two datasets. Our proposed model outperforms two state-of-the-art WSSS approaches. Note that we can achieve comparable quantitative and qualitative results with the fully-supervised model, with only around a 2% gap for MIoU and FwIoU. By comparing with manual labeling, our model can greatly save the annotation time from hours to minutes. The source code is available at: this https URL.
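
A minimal class-activation-map (CAM) sketch for turning a patch-level classifier into pixel-level pseudo masks, in the spirit of the classification phase described above; the tiny CNN, threshold, and names are illustrative only, not the authors' model.

```python
# Sketch: CAM from a global-average-pooling classifier, thresholded to a pseudo mask.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchClassifier(nn.Module):
    def __init__(self, n_classes=4):
        super().__init__()
        self.features = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                      nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.fc = nn.Linear(64, n_classes)            # applied after global average pooling

    def forward(self, x):
        f = self.features(x)
        logits = self.fc(f.mean(dim=(2, 3)))
        return logits, f

def cam_pseudo_mask(model, patch, threshold=0.4):
    logits, fmap = model(patch)
    cls = logits.argmax(dim=1)                         # predicted tissue class per patch
    weights = model.fc.weight[cls]                     # (B, 64) class-specific weights
    cam = torch.einsum("bc,bchw->bhw", weights, fmap)
    cam = (cam - cam.amin()) / (cam.amax() - cam.amin() + 1e-8)  # normalise to [0, 1]
    cam = F.interpolate(cam.unsqueeze(1), size=patch.shape[-2:], mode="bilinear")
    return (cam.squeeze(1) > threshold).long()         # binary pseudo mask per pixel

model = PatchClassifier()
patches = torch.randn(2, 3, 128, 128)
print(cam_pseudo_mask(model, patches).shape)           # torch.Size([2, 128, 128])
```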

Proceedings ArticleDOI
22 May 2021
TL;DR: In this paper, the authors perform a fine-grained empirical study on test annotation changes and present a taxonomy by manually inspecting and classifying a sample of 368 test annotations and documenting the motivations driving these changes.
Abstract: Since the introduction of annotations in Java 5, the majority of testing frameworks, such as JUnit, TestNG, and Mockito, have adopted annotations in their core design. This adoption affected the testing practices in every step of the test life-cycle, from fixture setup and test execution to fixture teardown. Despite the importance of test annotations, most research on test maintenance has mainly focused on test code quality and test assertions. As a result, there is little empirical evidence on the evolution and maintenance of test annotations. To fill this gap, we perform the first fine-grained empirical study on annotation changes. We developed a tool to mine 82,810 commits and detect 23,936 instances of test annotation changes from 12 open-source Java projects. Our main findings are: (1) Test annotation changes are more frequent than rename and type change refactorings. (2) We recover various migration efforts within the same testing framework or between different frameworks by analyzing common annotation replacement patterns. (3) We create a taxonomy by manually inspecting and classifying a sample of 368 test annotation changes and documenting the motivations driving these changes. Finally, we present a list of actionable implications for developers, researchers, and framework designers.
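
As a rough illustration of the kind of instances mined in the study, the sketch below diffs the Java annotations found in two versions of a test file; the regex-based extraction is a simplification of the authors' commit-level tooling, and the example snippets are invented.

```python
# Sketch: detect added/removed Java annotations between two file versions.
import re
from collections import Counter

ANNOTATION = re.compile(r"@[A-Za-z_]\w*(?:\([^)]*\))?")

def annotation_changes(old_src, new_src):
    old, new = Counter(ANNOTATION.findall(old_src)), Counter(ANNOTATION.findall(new_src))
    return {"removed": list(old - new), "added": list(new - old)}

old_version = """
    @Test(expected = IOException.class)
    public void readsFile() { }
"""
new_version = """
    @Test
    @Timeout(5)
    public void readsFile() { }
"""
print(annotation_changes(old_version, new_version))
# {'removed': ['@Test(expected = IOException.class)'], 'added': ['@Test', '@Timeout(5)']}
```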

Journal ArticleDOI
TL;DR: MultiPhATE2 as mentioned in this paper performs gene finding using multiple algorithms, compares the results of the algorithms, performs functional annotation of coding sequences, and incorporates additional search algorithms and databases to extend the search space of the original code.
Abstract: To address a need for improved tools for annotation and comparative genomics of bacteriophage genomes, we developed multiPhATE2. As an extension of multiPhATE, a functional annotation code released previously, multiPhATE2 performs gene finding using multiple algorithms, compares the results of the algorithms, performs functional annotation of coding sequences, and incorporates additional search algorithms and databases to extend the search space of the original code. MultiPhATE2 performs gene matching among sets of closely related bacteriophage genomes, and uses multiprocessing to speed computations. MultiPhATE2 can be re-started at multiple points within the workflow to allow the user to examine intermediate results and adjust the subsequent computations accordingly. In addition, multiPhATE2 accommodates custom gene calls and sequence databases, again adding flexibility. MultiPhATE2 was implemented in Python 3.7 and runs as a command-line code under Linux or MAC operating systems. Full documentation is provided as a README file and a Wiki website.