
Showing papers on "Annotation" published in 2022


Journal ArticleDOI
TL;DR: The DAVID Gene system was rebuilt to gain coverage of more organisms, which increased the taxonomy coverage from 17 399 to 55 464, and a species parameter was added for uploading a list of gene symbols to minimize the ambiguity between species, which increases the efficiency of the list upload and eliminates confusion for users.
Abstract: DAVID is a popular bioinformatics resource system including a web server and web service for functional annotation and enrichment analyses of gene lists. It consists of a comprehensive knowledgebase and a set of functional analysis tools. Here, we report all updates made in 2021. The DAVID Gene system was rebuilt to gain coverage of more organisms, which increased the taxonomy coverage from 17 399 to 55 464. All existing annotation types have been updated, if available, based on the new DAVID Gene system. Compared with the last version, the number of gene-term records for most annotation types within the updated Knowledgebase has significantly increased. Moreover, we have incorporated new annotations in the Knowledgebase including small molecule-gene interactions from PubChem, drug-gene interactions from DrugBank, tissue expression information from the Human Protein Atlas, disease information from DisGeNET, and pathways from WikiPathways and PathBank. Eight of ten subgroups split from the UniProt Keyword annotation were assigned to specific types. Finally, we added a species parameter for uploading a list of gene symbols to minimize the ambiguity between species, which increases the efficiency of the list upload and eliminates confusion for users. These current updates have significantly expanded the Knowledgebase and enhanced the discovery power of DAVID.

860 citations
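
DAVID's enrichment analysis asks whether an uploaded gene list contains more members of an annotation category than expected by chance; DAVID itself reports a modified Fisher exact statistic (the EASE score). The short sketch below runs a plain hypergeometric over-representation test on invented gene sets, purely to illustrate the kind of calculation involved, not DAVID's actual implementation.

# Hypergeometric over-representation test on invented gene sets (illustrative only).
from scipy.stats import hypergeom

background = {f"GENE{i}" for i in range(1, 20001)}       # all genes with annotation
category = {f"GENE{i}" for i in range(1, 201)}           # genes in one annotation term
user_list = {f"GENE{i}" for i in range(1, 401, 4)}       # the uploaded gene list

M = len(background)                      # population size
n = len(category & background)           # category members in the population
N = len(user_list & background)          # uploaded genes found in the population
k = len(user_list & category)            # uploaded genes hitting the category

p_value = hypergeom.sf(k - 1, M, n, N)   # P(X >= k): at least k hits by chance
fold_enrichment = (k / N) / (n / M)
print(f"hits={k}, fold enrichment={fold_enrichment:.1f}, p={p_value:.3g}")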


Journal ArticleDOI
TL;DR: The Matched Annotation from NCBI and EMBL-EBI (MANE) collaboration, as mentioned in this paper, is a joint initiative of Ensembl/GENCODE and RefSeq that defines a high-value set of transcripts and corresponding proteins annotated for each human protein-coding gene.
Abstract: Comprehensive genome annotation is essential to understand the impact of clinically relevant variants. However, the absence of a standard for clinical reporting and browser display complicates the process of consistent interpretation and reporting. To address these challenges, Ensembl/GENCODE and RefSeq launched a joint initiative, the Matched Annotation from NCBI and EMBL-EBI (MANE) collaboration, to converge on human gene and transcript annotation and to jointly define a high-value set of transcripts and corresponding proteins. Here, we describe the MANE transcript sets for use as universal standards for variant reporting and browser display. The MANE Select set identifies a representative transcript for each human protein-coding gene, whereas the MANE Plus Clinical set provides additional transcripts at loci where the Select transcripts alone are not sufficient to report all currently known clinical variants. Each MANE transcript represents an exact match between the exonic sequences of an Ensembl/GENCODE transcript and its counterpart in RefSeq such that the identifiers can be used synonymously. We have now released MANE Select transcripts for 97% of human protein-coding genes, including all American College of Medical Genetics and Genomics Secondary Findings list v3.0 (ref. 3) genes. MANE transcripts are accessible from major genome browsers and key resources. Widespread adoption of these transcript sets will increase the consistency of reporting, facilitate the exchange of data regardless of the annotation source and help to streamline clinical interpretation.

95 citations


Journal ArticleDOI
TL;DR: This article uses deep learning models to predict functional annotations for unaligned amino acid sequences across rigorous benchmark assessments built from the 17,929 families of the protein families database Pfam.
Abstract: Understanding the relationship between amino acid sequence and protein function is a long-standing challenge with far-reaching scientific and translational implications. State-of-the-art alignment-based techniques cannot predict function for one-third of microbial protein sequences, hampering our ability to exploit data from diverse organisms. Here, we train deep learning models to accurately predict functional annotations for unaligned amino acid sequences across rigorous benchmark assessments built from the 17,929 families of the protein families database Pfam. The models infer known patterns of evolutionary substitutions and learn representations that accurately cluster sequences from unseen families. Combining deep models with existing methods significantly improves remote homology detection, suggesting that the deep models learn complementary information. This approach extends the coverage of Pfam by >9.5%, exceeding additions made over the last decade, and predicts function for 360 human reference proteome proteins with no previous Pfam annotation. These results suggest that deep learning models will be a core component of future protein annotation tools. A deep learning model predicts protein functional annotations for unaligned amino acid sequences.

90 citations
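
The models described above map unaligned amino acid sequences directly to Pfam family labels. The following is a minimal PyTorch sketch of that general idea (embed residues, apply dilated 1-D convolutions, pool over positions, classify); the layer sizes, sequences and family labels are invented, and this is not the published architecture.

# Toy sequence-to-family classifier (illustrative sketch, not the paper's model).
import torch
import torch.nn as nn

AA = "ACDEFGHIKLMNPQRSTVWY"
aa_to_idx = {a: i + 1 for i, a in enumerate(AA)}          # index 0 is reserved for padding

def encode(seq, max_len=128):
    ids = [aa_to_idx.get(a, 0) for a in seq[:max_len]]
    return torch.tensor(ids + [0] * (max_len - len(ids)))

class FamilyCNN(nn.Module):
    def __init__(self, n_families, dim=64):
        super().__init__()
        self.embed = nn.Embedding(len(AA) + 1, dim, padding_idx=0)
        self.convs = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=9, padding=8, dilation=2), nn.ReLU(),
        )
        self.head = nn.Linear(dim, n_families)

    def forward(self, x):                          # x: (batch, length) of residue indices
        h = self.embed(x).transpose(1, 2)          # (batch, dim, length)
        h = self.convs(h).max(dim=2).values        # global max pool over sequence positions
        return self.head(h)                        # (batch, n_families) logits

model = FamilyCNN(n_families=10)                   # pretend there are only 10 families
batch = torch.stack([encode("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"),
                     encode("GSHMSLFDFFKNKGSAAATPA")])
logits = model(batch)
loss = nn.CrossEntropyLoss()(logits, torch.tensor([3, 7]))   # made-up family labels
print(f"toy classification loss: {loss.item():.3f}")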


Journal ArticleDOI
TL;DR: In this article, Cellpose 2.0 is introduced: an ensemble of diverse pretrained models together with a human-in-the-loop pipeline for rapid prototyping of new custom segmentation models, addressing the fact that pretrained models do not allow users to adapt the segmentation style to their specific needs and can perform suboptimally for test images that are very different from the training images.
Abstract: Pretrained neural network models for biological segmentation can provide good out-of-the-box results for many image types. However, such models do not allow users to adapt the segmentation style to their specific needs and can perform suboptimally for test images that are very different from the training images. Here we introduce Cellpose 2.0, a new package that includes an ensemble of diverse pretrained models as well as a human-in-the-loop pipeline for rapid prototyping of new custom models. We show that models pretrained on the Cellpose dataset can be fine-tuned with only 500-1,000 user-annotated regions of interest (ROI) to perform nearly as well as models trained on entire datasets with up to 200,000 ROI. A human-in-the-loop approach further reduced the required user annotation to 100-200 ROI, while maintaining high-quality segmentations. We provide software tools such as an annotation graphical user interface, a model zoo and a human-in-the-loop pipeline to facilitate the adoption of Cellpose 2.0.

88 citations


Journal ArticleDOI
TL;DR: ScType as mentioned in this paper is a computational platform that enables fully automated and ultra-fast cell-type identification based solely on given scRNA-seq data, along with a comprehensive cell marker database as background information.
Abstract: Identification of cell populations often relies on manual annotation of cell clusters using established marker genes. However, the selection of marker genes is a time-consuming process that may lead to sub-optimal annotations as the markers must be informative of both the individual cell clusters and various cell types present in the sample. Here, we developed a computational platform, ScType, which enables fully automated and ultra-fast cell-type identification based solely on the given scRNA-seq data, along with a comprehensive cell marker database as background information. Using six scRNA-seq datasets from various human and mouse tissues, we show how ScType provides unbiased and accurate cell type annotations by guaranteeing the specificity of positive and negative marker genes across cell clusters and cell types. We also demonstrate how ScType distinguishes between healthy and malignant cell populations, based on single-cell calling of single-nucleotide variants, making it a versatile tool for anticancer applications. The widely applicable method is deployed both as an interactive web-tool (https://sctype.app) and as an open-source R-package.

79 citations
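
ScType's scoring idea is to reward cluster-level expression of positive marker genes, weighted by how specific each marker is, to penalize expression of negative markers, and then to assign each cluster the best-scoring cell type. The snippet below is a simplified, made-up Python illustration of such marker-set scoring; the real tool is an R package and web tool with its own marker database and weighting scheme.

# Toy marker-set scoring for cluster annotation (illustrative only).
import numpy as np

# Mean expression of four marker genes (rows) in three clusters (columns); numbers invented.
genes = ["CD3E", "CD19", "NKG7", "LYZ"]
expr = np.array([[2.1, 0.1, 0.2],
                 [0.1, 1.8, 0.1],
                 [0.3, 0.1, 0.1],
                 [0.2, 0.2, 2.5]])

cell_types = {
    "T cell":   {"positive": ["CD3E"], "negative": ["CD19", "LYZ"]},
    "B cell":   {"positive": ["CD19"], "negative": ["CD3E", "LYZ"]},
    "Monocyte": {"positive": ["LYZ"],  "negative": ["CD3E", "CD19"]},
}

# Markers shared by many cell types are less informative, so weight by specificity.
uses = {g: sum(g in m["positive"] for m in cell_types.values()) for g in genes}
weight = {g: 1.0 / max(uses[g], 1) for g in genes}

# Standardize each gene across clusters, then score every cell type per cluster.
z = (expr - expr.mean(axis=1, keepdims=True)) / (expr.std(axis=1, keepdims=True) + 1e-9)
scores = {}
for ct, markers in cell_types.items():
    pos = sum(weight[g] * z[genes.index(g)] for g in markers["positive"])
    neg = sum(z[genes.index(g)] for g in markers["negative"])
    scores[ct] = pos - neg                     # one score per cluster

for cluster in range(expr.shape[1]):
    best = max(scores, key=lambda ct: scores[ct][cluster])
    print(f"cluster {cluster}: annotated as {best} (score {scores[best][cluster]:.2f})")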


Journal ArticleDOI
Alan Poling
01 Jun 2022
TL;DR: The SynergyFinder R package as mentioned in this paper is software for analyzing pre-clinical drug combination datasets that provides a statistical analysis of drug combination synergy and sensitivity with confidence intervals and P values.
Abstract: Combinatorial therapies have been recently proposed to improve the efficacy of anticancer treatment. The SynergyFinder R package is software used to analyze pre-clinical drug combination datasets. Here, we report the major updates to the SynergyFinder R package for improved interpretation and annotation of drug combination screening results. Unlike the existing implementations, the updated SynergyFinder R package includes five main innovations. 1) We extend the mathematical models to higher-order drug combination data analysis and implement dimension reduction techniques for visualizing the synergy landscape. 2) We provide a statistical analysis of drug combination synergy and sensitivity with confidence intervals and P values. 3) We incorporate a synergy barometer to harmonize multiple synergy scoring methods to provide a consensus metric for synergy. 4) We evaluate drug combination synergy and sensitivity to provide an unbiased interpretation of the clinical potential. 5) We enable fast annotation of drugs and cell lines, including their chemical and target information. These annotations will improve the interpretation of the mechanisms of action of drug combinations. To facilitate the use of the R package within the drug discovery community, we also provide a web server at www.synergyfinderplus.org as a user-friendly interface to enable a more flexible and versatile analysis of drug combination data.

75 citations
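
The synergy scores computed by SynergyFinder compare observed combination responses with a non-interaction reference model (ZIP, Bliss, HSA or Loewe). As a hedged illustration of the simplest reference, the snippet below computes the Bliss excess for an invented dose-response matrix; it is conceptual Python, not the package's R code.

# Bliss-excess sketch for a drug-combination matrix (invented inhibition values).
import numpy as np

# Measured fractional inhibition: rows are doses of drug A, columns are doses of
# drug B; row/column 0 are the single-agent responses.
inhibition = np.array([
    [0.00, 0.10, 0.25, 0.40],
    [0.15, 0.30, 0.45, 0.62],
    [0.35, 0.50, 0.68, 0.80],
    [0.55, 0.72, 0.85, 0.93],
])

mono_a = inhibition[:, 0][:, None]            # drug A alone at each dose
mono_b = inhibition[0, :][None, :]            # drug B alone at each dose

# Bliss independence: expected combination effect if the two drugs act independently.
expected = mono_a + mono_b - mono_a * mono_b
excess = inhibition - expected                # > 0 suggests synergy, < 0 antagonism

combo_wells = excess[1:, 1:]
print(f"mean Bliss excess over combination wells: {100 * combo_wells.mean():.1f} percentage points")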


Journal ArticleDOI
TL;DR: ProteinBERT as discussed by the authors is a deep language model specifically designed for proteins, which combines language modeling with a novel task of Gene Ontology (GO) annotation prediction and achieves near state-of-the-art performance with a far smaller and faster model than competing deep-learning methods.
Abstract: Summary: Self-supervised deep language modeling has shown unprecedented success across natural language tasks, and has recently been repurposed to biological sequences. However, existing models and pretraining methods are designed and optimized for text analysis. We introduce ProteinBERT, a deep language model specifically designed for proteins. Our pretraining scheme combines language modeling with a novel task of Gene Ontology (GO) annotation prediction. We introduce novel architectural elements that make the model highly efficient and flexible to long sequences. The architecture of ProteinBERT consists of both local and global representations, allowing end-to-end processing of these types of inputs and outputs. ProteinBERT obtains near state-of-the-art performance, and sometimes exceeds it, on multiple benchmarks covering diverse protein properties (including protein structure, post-translational modifications and biophysical attributes), despite using a far smaller and faster model than competing deep-learning methods. Overall, ProteinBERT provides an efficient framework for rapidly training protein predictors, even with limited labeled data. Availability and implementation: Code and pretrained model weights are available at https://github.com/nadavbra/protein_bert. Supplementary information: Supplementary data are available at Bioinformatics online.

75 citations


Posted ContentDOI
11 Mar 2022-bioRxiv
TL;DR: The integrated Human Lung Cell Atlas (HLCA) is presented, combining 46 datasets of the human respiratory system into a single atlas spanning over 2.2 million cells from 444 individuals across health and disease.
Abstract: Organ- and body-scale cell atlases have the potential to transform our understanding of human biology. To capture the variability present in the population, these atlases must include diverse demographics such as age and ethnicity from both healthy and diseased individuals. The growth in both size and number of single-cell datasets, combined with recent advances in computational techniques, for the first time makes it possible to generate such comprehensive large-scale atlases through integration of multiple datasets. Here, we present the integrated Human Lung Cell Atlas (HLCA) combining 46 datasets of the human respiratory system into a single atlas spanning over 2.2 million cells from 444 individuals across health and disease. The HLCA contains a consensus re-annotation of published and newly generated datasets, resolving under- or misannotation of 59% of cells in the original datasets. The HLCA enables recovery of rare cell types, provides consensus marker genes for each cell type, and uncovers gene modules associated with demographic covariates and anatomical location within the respiratory system. To facilitate the use of the HLCA as a reference for single-cell lung research and allow rapid analysis of new data, we provide an interactive web portal to project datasets onto the HLCA. Finally, we demonstrate the value of the HLCA reference for interpreting disease-associated changes. Thus, the HLCA outlines a roadmap for the development and use of organ-scale cell atlases within the Human Cell Atlas.

55 citations


Journal ArticleDOI
TL;DR: A community-led effort involving Ensembl/GENCODE, the HUGO Gene Nomenclature Committee (HGNC), UniProtKB, HUPO/HPP and PeptideAtlas to produce a standardized catalog of 7,264 human Ribo-seq ORFs is outlined, together with a path to bring protein-level evidence for Ribo-seq ORFs into reference annotation databases and a roadmap to facilitate research in the global community.

51 citations


Journal ArticleDOI
TL;DR: A spectrum of over 125 useful, complementary free and open source software tools and libraries, written and made available through the vcflib, bio-vcf, cyvcf2, hts-nim and slivar projects, is presented.
Abstract: Since its introduction in 2011 the variant call format (VCF) has been widely adopted for processing DNA and RNA variants in practically all population studies—as well as in somatic and germline mutation studies. The VCF format can represent single nucleotide variants, multi-nucleotide variants, insertions and deletions, and simple structural variants called and anchored against a reference genome. Here we present a spectrum of over 125 useful, complementary free and open source software tools and libraries that we wrote and made available through the multiple vcflib, bio-vcf, cyvcf2, hts-nim and slivar projects. These tools are applied for comparison, filtering, normalisation, smoothing and annotation of VCF, as well as output of statistics, visualisation, and transformations of variant files. These tools run every day in critical biomedical pipelines and countless shell scripts. Our tools are part of the wider bioinformatics ecosystem and we highlight best practices. We shortly discuss the design of VCF, lessons learnt, and how we can address more complex variation through pangenome graph formats, variation that cannot easily be represented by the VCF format.
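
Of the projects named above, cyvcf2 exposes VCF records to Python, which makes a small filtering example easy to sketch. The snippet below streams a VCF, keeps rare PASS variants and prints a summary; "input.vcf.gz" is a placeholder and the AF INFO field is assumed to be present in the file.

# Hedged cyvcf2 example: stream a VCF and keep rare PASS variants.
from cyvcf2 import VCF

vcf = VCF("input.vcf.gz")                    # placeholder path to a VCF/BCF file
kept = 0
for variant in vcf:
    if variant.FILTER is not None:           # cyvcf2 reports PASS (or '.') as None
        continue
    af = variant.INFO.get("AF")              # assumes the file carries an AF INFO field
    if isinstance(af, tuple):                # multi-allelic sites give one AF per ALT allele
        af = max(af)
    if af is None or af >= 0.01:             # keep only rare variants (AF < 1%)
        continue
    kept += 1
    print(variant.CHROM, variant.POS, variant.REF, ",".join(variant.ALT), af)
print(f"kept {kept} rare PASS variants")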

Journal ArticleDOI
TL;DR: The UCSC Genome Browser (http://genome.ucsc.edu) as discussed by the authors is an omics data consolidator, graphical viewer, and general bioinformatics resource that continues to serve the community as it enters its 23rd year.
Abstract: The UCSC Genome Browser (https://genome.ucsc.edu) is an omics data consolidator, graphical viewer, and general bioinformatics resource that continues to serve the community as it enters its 23rd year. This year has seen an emphasis on clinical data, with new tracks and an expanded Recommended Track Sets feature on hg38 as well as the addition of a single cell track group. SARS-CoV-2 continues to remain a focus, with regular annotation updates to the browser and continued curation of our phylogenetic sequence placing tool, hgPhyloPlace, whose tree has now reached over 12M sequences. Our GenArk resource has also grown, offering over 2500 hubs and a system for users to request any absent assemblies. We have expanded our bigBarChart display type and created new ways to visualize data via bigRmsk and dynseq display. Displaying custom annotations is now easier due to our chromAlias system which eliminates the requirement for renaming sequence names to the UCSC standard. Users involved in data generation may also be interested in our new tools and trackDb settings which facilitate the creation and display of their custom annotations.

Journal ArticleDOI
TL;DR: This paper proposes a representation based on lexicalized dependency paths (LDPs) coupled with an active learner for LDPs, which is applied to both simulated and real active learning.
Abstract: Active learning methods which present selected examples from the corpus for annotation provide more efficient learning of supervised relation extraction models, but they leave the developer in the unenviable role of a passive informant. To restore the developer’s proper role as a partner with the system, we must give the developer an ability to inspect the extraction model during development. We propose to make this possible through a representation based on lexicalized dependency paths (LDPs) coupled with an active learner for LDPs. We apply LDPs to both simulated and real active learning with ACE as evaluation and a year’s newswire for training, and show that simulated active learning greatly reduces annotation cost while maintaining a performance level similar to that of supervised learning, while real active learning yields comparable performance to the state-of-the-art using a small number of annotations.
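
The active-learning loop described above repeatedly asks the annotator to label the examples the current model is least certain about. The sketch below illustrates that selection loop with generic feature vectors and scikit-learn logistic regression; it stands in for, and is not, the paper's LDP-based learner.

# Generic uncertainty-sampling active-learning loop (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                    # stand-ins for extraction features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)     # hidden "true" labels

# Seed with a few labeled examples of each class; the rest forms the unlabeled pool.
labeled = [int(i) for i in np.where(y == 0)[0][:5]] + [int(i) for i in np.where(y == 1)[0][:5]]
pool = [i for i in range(len(X)) if i not in labeled]

for rnd in range(5):
    clf = LogisticRegression().fit(X[labeled], y[labeled])
    probs = clf.predict_proba(X[pool])[:, 1]
    uncertainty = -np.abs(probs - 0.5)                          # near 0.5 = most uncertain
    query = [pool[i] for i in np.argsort(uncertainty)[-20:]]    # ask for 20 new labels
    labeled += query                                            # y[query] plays the annotator
    pool = [i for i in pool if i not in query]
    print(f"round {rnd}: {len(labeled)} labels, "
          f"pool accuracy {clf.score(X[pool], y[pool]):.3f}")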

Journal ArticleDOI
TL;DR: This paper introduces two novel elements to learn the video object segmentation model, the scribble attention module, which captures more accurate context information and learns an effective attention map to enhance the contrast between foreground and background, and the scribble-supervised loss, which can optimize the unlabeled pixels and dynamically correct inaccurate segmented areas during the training stage.
Abstract: Recently, video object segmentation has received great attention in the computer vision community. Most of the existing methods heavily rely on the pixel-wise human annotations, which are expensive and time-consuming to obtain. To tackle this problem, we make an early attempt to achieve video object segmentation with scribble-level supervision, which can alleviate large amounts of human labor for collecting the manual annotation. However, using conventional network architectures and learning objective functions under this scenario cannot work well as the supervision information is highly sparse and incomplete. To address this issue, this paper introduces two novel elements to learn the video object segmentation model. The first one is the scribble attention module, which captures more accurate context information and learns an effective attention map to enhance the contrast between foreground and background. The other one is the scribble-supervised loss, which can optimize the unlabeled pixels and dynamically correct inaccurate segmented areas during the training stage. To evaluate the proposed method, we implement experiments on two video object segmentation benchmark datasets, YouTube-video object segmentation (VOS), and densely annotated video segmentation (DAVIS)-2017. We first generate the scribble annotations from the original per-pixel annotations. Then, we train our model and compare its test performance with the baseline models and other existing works. Extensive experiments demonstrate that the proposed method can work effectively and approach to the methods requiring the dense per-pixel annotations.
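
Scribble supervision means only a sparse subset of pixels carries labels, so the loss must ignore everything else. A common way to express that core ingredient is a partial cross-entropy restricted to scribbled pixels, sketched below in PyTorch; this is the generic idea only, not the authors' full scribble-supervised loss, which additionally corrects unlabeled regions during training.

# Partial cross-entropy over scribble-annotated pixels only (generic sketch).
import torch
import torch.nn.functional as F

B, C, H, W = 2, 2, 64, 64                        # batch, classes (bg/fg), height, width
logits = torch.randn(B, C, H, W, requires_grad=True)

# Scribble annotation: 0 = background, 1 = foreground, 255 = unlabeled pixel.
scribbles = torch.full((B, H, W), 255, dtype=torch.long)
scribbles[:, 30:34, 10:50] = 1                   # a thin foreground scribble
scribbles[:, 2:4, :] = 0                         # a background scribble

# ignore_index removes the unlabeled pixels from the loss entirely.
loss = F.cross_entropy(logits, scribbles, ignore_index=255)
loss.backward()
print(f"partial cross-entropy over scribbled pixels only: {loss.item():.3f}")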

Journal ArticleDOI
TL;DR: In this article, the authors describe a dataset of more than 100,000 chest X-ray scans that were retrospectively collected from two major hospitals in Vietnam and release 18,000 images that were manually annotated by a total of 17 experienced radiologists with 22 local labels of rectangles surrounding abnormalities and 6 global labels of suspected diseases.
Abstract: Most of the existing chest X-ray datasets include labels from a list of findings without specifying their locations on the radiographs. This limits the development of machine learning algorithms for the detection and localization of chest abnormalities. In this work, we describe a dataset of more than 100,000 chest X-ray scans that were retrospectively collected from two major hospitals in Vietnam. Out of this raw data, we release 18,000 images that were manually annotated by a total of 17 experienced radiologists with 22 local labels of rectangles surrounding abnormalities and 6 global labels of suspected diseases. The released dataset is divided into a training set of 15,000 and a test set of 3,000. Each scan in the training set was independently labeled by 3 radiologists, while each scan in the test set was labeled by the consensus of 5 radiologists. We designed and built a labeling platform for DICOM images to facilitate these annotation procedures. All images are made publicly available in DICOM format along with the labels of both the training set and the test set.

Journal ArticleDOI
TL;DR: The FIVES dataset as discussed by the authors consists of 800 high-resolution multi-disease color fundus photographs with pixelwise manual annotation, and the annotation process was standardized through crowdsourcing among medical experts.
Abstract: Retinal vasculature provides an opportunity for direct observation of vessel morphology, which is linked to multiple clinical conditions. However, objective and quantitative interpretation of the retinal vasculature relies on precise vessel segmentation, which is time consuming and labor intensive. Artificial intelligence (AI) has demonstrated great promise in retinal vessel segmentation. The development and evaluation of AI-based models require large numbers of annotated retinal images. However, the public datasets that are usable for this task are scarce. In this paper, we collected a color fundus image vessel segmentation (FIVES) dataset. The FIVES dataset consists of 800 high-resolution multi-disease color fundus photographs with pixelwise manual annotation. The annotation process was standardized through crowdsourcing among medical experts. The quality of each image was also evaluated. To the best of our knowledge, this is the largest retinal vessel segmentation dataset, and we believe this work will be beneficial to the further development of retinal vessel segmentation.

Journal ArticleDOI
01 Apr 2022-Entropy
TL;DR: This paper provides a comprehensive literature review of the top-performing SSL methods using auxiliary pretext and contrastive learning techniques, examines how self-supervised methods compare to supervised ones, and discusses further considerations and ongoing challenges faced by SSL.
Abstract: Although deep learning algorithms have achieved significant progress in a variety of domains, they require costly annotations on huge datasets. Self-supervised learning (SSL) using unlabeled data has emerged as an alternative, as it eliminates manual annotation. To do this, SSL constructs feature representations using pretext tasks that operate without manual annotation, which allows models trained in these tasks to extract useful latent representations that later improve downstream tasks such as object classification and detection. The early methods of SSL are based on auxiliary pretext tasks as a way to learn representations using pseudo-labels, or labels that were created automatically based on the dataset’s attributes. Furthermore, contrastive learning has also performed well in learning representations via SSL. To succeed, it pushes positive samples closer together, and negative ones further apart, in the latent space. This paper provides a comprehensive literature review of the top-performing SSL methods using auxiliary pretext and contrastive learning techniques. It details the motivation for this research, a general pipeline of SSL, the terminologies of the field, and provides an examination of pretext tasks and self-supervised methods. It also examines how self-supervised methods compare to supervised ones, and then discusses both further considerations and ongoing challenges faced by SSL.
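
The contrastive objective summarized above (pull two augmented views of a sample together, push other samples apart in the latent space) is commonly implemented as an InfoNCE/NT-Xent loss. A minimal PyTorch sketch of that loss, not tied to any specific paper in the review, is given below.

# Minimal NT-Xent / InfoNCE contrastive loss over a batch of paired views.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """z1, z2: (N, d) embeddings of two augmented views of the same N samples."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2N, d), unit length
    sim = z @ z.t() / temperature                             # pairwise cosine similarities
    n = z1.shape[0]
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float("-inf"))  # drop self-pairs
    # The positive for row i is its other view: i + n (first half) or i - n (second half).
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

z1 = torch.randn(8, 128)   # e.g. projections of augmentation 1 of 8 images
z2 = torch.randn(8, 128)   # projections of augmentation 2 of the same images
print(f"NT-Xent loss: {nt_xent(z1, z2).item():.3f}")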

Journal ArticleDOI
TL;DR: NLM's conserved domain database (CDD) as discussed by the authors is a collection of protein domain and protein family models constructed as multiple sequence alignments, and its main purpose is to provide annotation for protein and translated nucleotide sequences.
Abstract: NLM's conserved domain database (CDD) is a collection of protein domain and protein family models constructed as multiple sequence alignments. Its main purpose is to provide annotation for protein and translated nucleotide sequences with the location of domain footprints and associated functional sites, and to define protein domain architecture as a basis for assigning gene product names and putative/predicted function. CDD has been available publicly for over 20 years and has grown substantially during that time. Maintaining an archive of pre-computed annotation continues to be a challenge and has slowed down the cadence of CDD releases. CDD curation staff builds hierarchical classifications of large protein domain families, adds models for novel domain families via surveillance of the protein 'dark matter' that currently lacks annotation, and now spends considerable effort on providing names and attribution for conserved domain architectures. CDD can be accessed at https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml.

Journal ArticleDOI
TL;DR: In this article, an attention UW-Net is proposed to improve accuracy and produce a probabilistic map for automatic annotation from a small dataset, reducing the reliance on tedious and error-prone manual annotation of chest X-rays.

Proceedings ArticleDOI
09 Jun 2022
TL;DR: A novel framework is introduced, CrowdWorkSheets, for dataset developers to facilitate transparent documentation of key decision points at various stages of the data annotation pipeline: task formulation, selection of annotators, platform and infrastructure choices, dataset analysis and evaluation, and dataset release and maintenance.
Abstract: Human annotated data plays a crucial role in machine learning (ML) research and development. However, the ethical considerations around the processes and decisions that go into dataset annotation have not received nearly enough attention. In this paper, we survey an array of literature that provides insights into ethical considerations around crowdsourced dataset annotation. We synthesize these insights, and lay out the challenges in this space along two layers: (1) who the annotator is, and how the annotators’ lived experiences can impact their annotations, and (2) the relationship between the annotators and the crowdsourcing platforms, and what that relationship affords them. Finally, we introduce a novel framework, CrowdWorkSheets, for dataset developers to facilitate transparent documentation of key decision points at various stages of the data annotation pipeline: task formulation, selection of annotators, platform and infrastructure choices, dataset analysis and evaluation, and dataset release and maintenance.

Journal ArticleDOI
TL;DR: Li et al. propose PFmulDL, a new protein function annotation strategy integrating multiple deep learning methods, which is capable of significantly elevating the prediction performance for the rare classes without sacrificing that for the major classes.

Journal ArticleDOI
TL;DR: An ecosystem of R packages, centered around the MetaboCoreUtils, MetaboAnnotation and CompoundDb packages, together provides a modular infrastructure for the annotation of untargeted metabolomics data, allowing users to build reproducible annotation workflows tailored for and adapted to most untargeted LC-MS-based datasets.
Abstract: Liquid chromatography-mass spectrometry (LC-MS)-based untargeted metabolomics experiments have become increasingly popular because of the wide range of metabolites that can be analyzed and the possibility to measure novel compounds. LC-MS instrumentation and analysis conditions can differ substantially among laboratories and experiments, thus resulting in non-standardized datasets demanding customized annotation workflows. We present an ecosystem of R packages, centered around the MetaboCoreUtils, MetaboAnnotation and CompoundDb packages that together provide a modular infrastructure for the annotation of untargeted metabolomics data. Initial annotation can be performed based on MS1 properties such as m/z and retention times, followed by an MS2-based annotation in which experimental fragment spectra are compared against a reference library. Such reference databases can be created and managed with the CompoundDb package. The ecosystem supports data from a variety of formats, including, but not limited to, MSP, MGF, mzML, mzXML, netCDF as well as MassBank text files and SQL databases. Through its highly customizable functionality, the presented infrastructure allows users to build reproducible annotation workflows tailored for and adapted to most untargeted LC-MS-based datasets. All core functionality, which supports base R data types, is exported, also facilitating its re-use in other R packages. Finally, all packages are thoroughly unit-tested and documented and are available on GitHub and through Bioconductor.
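
The MS1 step of such annotation workflows reduces to matching measured m/z values (and optionally retention times) against a reference table within a ppm tolerance. The snippet below sketches that matching in plain Python with a tiny invented reference list; it illustrates the concept rather than the API of the R packages described above.

# Conceptual MS1 annotation: match measured m/z values within a ppm tolerance.
# Approximate [M+H]+ reference values for a few example compounds.
reference = [
    {"name": "caffeine",   "mz": 195.0877},
    {"name": "tryptophan", "mz": 205.0972},
    {"name": "glucose",    "mz": 181.0707},
]

def annotate(measured_mz, tolerance_ppm=5.0):
    hits = []
    for compound in reference:
        ppm_error = (measured_mz - compound["mz"]) / compound["mz"] * 1e6
        if abs(ppm_error) <= tolerance_ppm:
            hits.append((compound["name"], round(ppm_error, 2)))
    return hits

for mz in (195.0880, 181.0712, 300.1234):        # invented measured features
    print(mz, "->", annotate(mz) or "no match")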

Journal ArticleDOI
31 Mar 2022
TL;DR: ProtTucker as mentioned in this paper uses single protein representations from protein Language Models (pLMs) for contrastive learning, which optimizes constraints captured by hierarchical classifications of protein 3D structures.
Abstract: Experimental structures are leveraged through multiple sequence alignments, or more generally through homology-based inference (HBI), facilitating the transfer of information from a protein with known annotation to a query without any annotation. A recent alternative expands the concept of HBI from sequence-distance lookup to embedding-based annotation transfer (EAT). These embeddings are derived from protein Language Models (pLMs). Here, we introduce using single protein representations from pLMs for contrastive learning. This learning procedure creates a new set of embeddings that optimizes constraints captured by hierarchical classifications of protein 3D structures defined by the CATH resource. The approach, dubbed ProtTucker, has an improved ability to recognize distant homologous relationships compared with more traditional techniques such as threading or fold recognition. Thus, these embeddings have allowed sequence comparison to step into the ‘midnight zone’ of protein similarity, i.e. the region in which distantly related sequences have a seemingly random pairwise sequence similarity. The novelty of this work is in the particular combination of tools and sampling techniques that ascertained performance comparable to or better than that of existing state-of-the-art sequence comparison methods. Additionally, since this method does not need to generate alignments it is also orders of magnitude faster. The code is available at https://github.com/Rostlab/EAT.
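
Embedding-based annotation transfer (EAT) replaces sequence-similarity lookups with a nearest-neighbour search in embedding space: a query inherits the annotation of its closest labelled embedding. The sketch below illustrates that lookup with random vectors standing in for ProtTucker/pLM embeddings and made-up CATH-like labels.

# EAT as nearest-neighbour annotation transfer in embedding space (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
lookup_emb = rng.normal(size=(1000, 128))          # embeddings of annotated proteins
lookup_labels = rng.integers(0, 50, size=1000)     # made-up structural class labels
queries = rng.normal(size=(5, 128))                # embeddings of unannotated queries

def unit(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

similarity = unit(queries) @ unit(lookup_emb).T    # cosine similarity, queries x lookup
nearest = similarity.argmax(axis=1)                # closest annotated neighbour per query
for q, idx in enumerate(nearest):
    print(f"query {q}: transferred label {lookup_labels[idx]} "
          f"(cosine similarity {similarity[q, idx]:.2f})")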

Journal ArticleDOI
TL;DR: The Spoken British National Corpus 2014 (Spoken BNC2014) as discussed by the authors is an 11.5-million-word corpus of orthographically transcribed conversations among L1 speakers of British English from across the UK.
Abstract: This paper introduces the Spoken British National Corpus 2014, an 11.5-million-word corpus of orthographically transcribed conversations among L1 speakers of British English from across the UK, recorded in the years 2012–2016. After showing that a survey of the recent history of corpora of spoken British English justifies the compilation of this new corpus, we describe the main stages of the Spoken BNC2014’s creation: design, data and metadata collection, transcription, XML encoding, and annotation. In doing so we aim to (i) encourage users of the corpus to approach the data with sensitivity to the many methodological issues we identified and attempted to overcome while compiling the Spoken BNC2014, and (ii) inform (future) compilers of spoken corpora of the innovations we implemented to attempt to make the construction of corpora representing spontaneous speech in informal contexts more tractable, both logistically and practically, than in the past.

Journal ArticleDOI
TL;DR: A novel collaborative framework for engaging crowds of medical students and pathologists to produce quality labels for cell nuclei is described; results indicate that even noisy algorithmic suggestions do not adversely affect pathologist accuracy and can help non-experts improve annotation quality.
Abstract: Background: Deep learning enables accurate high-resolution mapping of cells and tissue structures that can serve as the foundation of interpretable machine-learning models for computational pathology. However, generating adequate labels for these structures is a critical barrier, given the time and effort required from pathologists. Results: This article describes a novel collaborative framework for engaging crowds of medical students and pathologists to produce quality labels for cell nuclei. We used this approach to produce the NuCLS dataset, containing >220,000 annotations of cell nuclei in breast cancers. This builds on prior work labeling tissue regions to produce an integrated tissue region- and cell-level annotation dataset for training that is the largest such resource for multi-scale analysis of breast cancer histology. This article presents data and analysis results for single and multi-rater annotations from both non-experts and pathologists. We present a novel workflow that uses algorithmic suggestions to collect accurate segmentation data without the need for laborious manual tracing of nuclei. Our results indicate that even noisy algorithmic suggestions do not adversely affect pathologist accuracy and can help non-experts improve annotation quality. We also present a new approach for inferring truth from multiple raters and show that non-experts can produce accurate annotations for visually distinctive classes. Conclusions: This study is the most extensive systematic exploration of the large-scale use of wisdom-of-the-crowd approaches to generate data for computational pathology applications.

Journal ArticleDOI
Nicola Zamboni
TL;DR: MSNovelist as mentioned in this paper combines fingerprint prediction with an encoder-decoder neural network to generate structures de novo solely from tandem mass spectrometry (MS2) spectra.
Abstract: Current methods for structure elucidation of small molecules rely on finding similarity with spectra of known compounds, but do not predict structures de novo for unknown compound classes. We present MSNovelist, which combines fingerprint prediction with an encoder–decoder neural network to generate structures de novo solely from tandem mass spectrometry (MS2) spectra. In an evaluation with 3,863 MS2 spectra from the Global Natural Product Social Molecular Networking site, MSNovelist predicted 25% of structures correctly on first rank, retrieved 45% of structures overall and reproduced 61% of correct database annotations, without having ever seen the structure in the training phase. Similarly, for the CASMI 2016 challenge, MSNovelist correctly predicted 26% and retrieved 57% of structures, recovering 64% of correct database annotations. Finally, we illustrate the application of MSNovelist in a bryophyte MS2 dataset, in which de novo structure prediction substantially outscored the best database candidate for seven spectra. MSNovelist is ideally suited to complement library-based annotation in the case of poorly represented analyte classes and novel compounds.

Journal ArticleDOI
TL;DR: A pseudo-data-generation algorithm that randomly replaces information in the whole domain improves the transfer learning ability of the standard extraction method for different types of tumor-related medical event extractions.
Abstract: The popularization of electronic clinical medical records makes it possible to use automated methods to extract high-value information from medical records quickly. As essential medical information, oncology medical events are composed of attributes that describe malignant tumors. In recent years, oncology medical event extraction has become a research hotspot in academia. Many academic conferences publish it as an evaluation task and provide a series of high-quality annotation data. This article addresses the characteristics of the discrete attributes of tumor-related medical events and proposes a medical event extraction method that realizes the combined extraction of the primary tumor site and primary tumor size attributes, as well as the extraction of tumor metastasis sites. In addition, given the small number and limited variety of annotated texts for tumor-related medical events, a pseudo-data-generation algorithm based on randomly replacing key information across the whole domain is proposed, which improves the transfer learning ability of the extraction method for different types of tumor-related medical event extraction. The proposed method won third place in the CCKS2020 clinical medical event extraction and evaluation task for electronic medical records. A large number of experiments on the CCKS2020 dataset verify the effectiveness of the proposed method.
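
The pseudo-data-generation idea amounts to taking annotated records and randomly swapping their annotated key spans (for example tumor sites and sizes) with values drawn from a domain-wide vocabulary, yielding additional training examples. The toy sketch below uses invented English placeholder records and illustrates the general technique only, not the authors' exact algorithm.

# Toy pseudo-data generation by random replacement of annotated key spans.
import random

random.seed(0)

# One labeled record: free text plus its annotated event attributes.
record = {"text": "Primary tumor in the left lung, size 3.2 cm, metastasis to the liver.",
          "site": "left lung", "size": "3.2 cm", "metastasis": "liver"}

# Domain-wide vocabularies collected from the annotated corpus (invented here).
sites = ["left lung", "right lung", "stomach", "colon"]
sizes = ["1.5 cm", "2.4 cm", "3.2 cm", "4.0 cm"]
metastases = ["liver", "bone", "brain", "lymph node"]

def make_pseudo(rec, n=3):
    """Create n pseudo-records by randomly replacing the annotated key spans."""
    pseudo = []
    for _ in range(n):
        new = {"site": random.choice(sites),
               "size": random.choice(sizes),
               "metastasis": random.choice(metastases)}
        text = (rec["text"]
                .replace(rec["site"], new["site"])
                .replace(rec["size"], new["size"])
                .replace(rec["metastasis"], new["metastasis"]))
        pseudo.append({"text": text, **new})
    return pseudo

for sample in make_pseudo(record):
    print(sample["text"])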

Posted ContentDOI
21 Mar 2022-bioRxiv
TL;DR: It is demonstrated that this genome-wide constraint map provides an effective approach for characterizing the non-coding genome and improving the identification and interpretation of functional human genetic variation.
Abstract: The depletion of disruptive variation caused by purifying natural selection (constraint) has been widely used to investigate protein-coding genes underlying human disorders, but attempts to assess constraint for non-protein-coding regions have proven more difficult. Here we aggregate, process, and release a dataset of 76,156 human genomes from the Genome Aggregation Database (gnomAD), the largest public open-access human genome reference dataset, and use this dataset to build a mutational constraint map for the whole genome. We present a refined mutational model that incorporates local sequence context and regional genomic features to detect depletions of variation across the genome. As expected, protein-coding sequences overall are under stronger constraint than non-coding regions. Within the non-coding genome, constrained regions are enriched for regulatory elements and variants implicated in complex human diseases and traits, facilitating the triangulation of biological annotation, disease association, and natural selection to non-coding DNA analysis. More constrained regulatory elements tend to regulate more constrained genes, while non-coding constraint captures additional functional information underrecognized by gene constraint metrics. We demonstrate that this genome-wide constraint map provides an effective approach for characterizing the non-coding genome and improving the identification and interpretation of functional human genetic variation.
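
Constraint is usually summarized by comparing the observed number of variants in a region with the number expected under the mutational model, for example as an observed/expected ratio plus a z-like score. The snippet below is a schematic calculation with invented counts and a simple Poisson assumption; the actual gnomAD model derives expectations from local sequence context and regional genomic features.

# Schematic constraint calculation: observed vs expected variant counts per region.
import math

# Invented per-region counts; the real model derives 'expected' from the data.
regions = [
    {"name": "enhancer_A",   "observed": 12, "expected": 30.0},
    {"name": "intergenic_B", "observed": 55, "expected": 52.0},
]

for r in regions:
    oe = r["observed"] / r["expected"]                              # depletion of variation if << 1
    z = (r["expected"] - r["observed"]) / math.sqrt(r["expected"])  # z-like score under a Poisson assumption
    print(f"{r['name']}: obs/exp = {oe:.2f}, depletion z = {z:.2f}")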

Proceedings ArticleDOI
12 Jan 2022
TL;DR: Five different crowdworker-based human evaluation methods are compared and it is found that different methods are best depending on the types of models compared, with no clear winner across the board.
Abstract: At the heart of improving conversational AI is the open problem of how to evaluate conversations. Issues with automatic metrics are well known (Liu et al., 2016), with human evaluations still considered the gold standard. Unfortunately, how to perform human evaluations is also an open problem: differing data collection methods have varying levels of human agreement and statistical sensitivity, resulting in differing amounts of human annotation hours and labor costs. In this work we compare five different crowdworker-based human evaluation methods and find that different methods are best depending on the types of models compared, with no clear winner across the board. While this highlights the open problems in the area, our analysis leads to advice of when to use which one, and possible future directions.