An assessment of genome annotation coverage across the bacterial tree of life

TLDR
An analysis of bacterial genomes from the Genome Taxonomy Database found that, on average, 52 % of a bacterial proteome could be functionally annotated by protein homology search and 79 % by domain-based homology search, highlighting the disparity in annotation coverage across bacteria.
Abstract
Although gene-finding in bacterial genomes is relatively straightforward, the automated assignment of gene function is still challenging, resulting in a vast quantity of hypothetical sequences of unknown function. But how prevalent are hypothetical sequences across bacteria, what proportion of genes in different bacterial genomes remain unannotated, and what factors affect annotation completeness? To address these questions, we surveyed over 27 000 bacterial genomes from the Genome Taxonomy Database, and measured genome annotation completeness as a function of annotation method, taxonomy, genome size, 'research bias' and publication date. Our analysis revealed that 52 and 79 % of the average bacterial proteome could be functionally annotated based on protein and domain-based homology searches, respectively. Annotation coverage using protein homology search varied significantly from as low as 14 % in some species to as high as 98 % in others. We found that taxonomy is a major factor influencing annotation completeness, with distinct trends observed across the microbial tree (e.g. the lowest level of completeness was found in the Patescibacteria lineage). Most lineages showed a significant association between genome size and annotation incompleteness, likely reflecting a greater degree of uncharacterized sequences in 'accessory' proteomes than in 'core' proteomes. Finally, research bias, as measured by publication volume, was also an important factor influencing genome annotation completeness, with early model organisms showing high completeness levels relative to other genomes in their own taxonomic lineages. Our work highlights the disparity in annotation coverage across the bacterial tree of life and emphasizes a need for more experimental characterization of accessory proteomes as well as understudied lineages.
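
As a rough illustration of the coverage metric discussed above, the sketch below assumes that a genome's annotation coverage is the fraction of its predicted proteins that receive at least one functional annotation, and that the "average proteome" figure is the mean of that fraction across genomes. The exact criteria used in the study are not stated in the abstract, so the function name, identifiers and toy data here are hypothetical.

```python
from statistics import mean


def annotation_coverage(protein_ids, annotated_ids):
    """Fraction of predicted proteins with at least one functional annotation.

    Assumption: a protein counts as 'annotated' if its identifier appears in
    annotated_ids (e.g. it had a homology hit); the hit criteria are not modelled.
    """
    proteins = set(protein_ids)
    if not proteins:
        raise ValueError("genome has no predicted proteins")
    # Only count annotations that belong to this genome's proteins.
    return len(proteins & set(annotated_ids)) / len(proteins)


# Hypothetical toy data: genome -> (all predicted proteins, proteins with a hit)
genomes = {
    "genome_A": (["p1", "p2", "p3", "p4"], ["p1", "p2", "p3"]),
    "genome_B": (["q1", "q2", "q3", "q4"], ["q1"]),
}

# Per-genome coverage, then the mean across genomes.
coverage = {
    name: annotation_coverage(all_proteins, hits)
    for name, (all_proteins, hits) in genomes.items()
}
average_coverage = mean(coverage.values())
```

Running the same calculation once with protein-level hits and once with domain-level hits would give the two per-method averages that the abstract compares.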
