
Showing papers on "Genome" published in 2022


Journal Article
TL;DR: The authors describe the genomic profile and early transmission dynamics of the Omicron variant, which is exceptional for carrying over 30 mutations in the spike glycoprotein that are predicted to influence antibody neutralization and spike function, and highlight its rapid spread in regions with high levels of population immunity.
Abstract: The SARS-CoV-2 epidemic in southern Africa has been characterized by three distinct waves. The first was associated with a mix of SARS-CoV-2 lineages, while the second and third waves were driven by the Beta (B.1.351) and Delta (B.1.617.2) variants, respectively (refs. 1-3). In November 2021, genomic surveillance teams in South Africa and Botswana detected a new SARS-CoV-2 variant associated with a rapid resurgence of infections in Gauteng province, South Africa. Within three days of the first genome being uploaded, it was designated a variant of concern (Omicron, B.1.1.529) by the World Health Organization and, within three weeks, had been identified in 87 countries. The Omicron variant is exceptional for carrying over 30 mutations in the spike glycoprotein, which are predicted to influence antibody neutralization and spike function (ref. 4). Here we describe the genomic profile and early transmission dynamics of Omicron, highlighting the rapid spread in regions with high levels of population immunity.

948 citations


Journal Article
01 Apr 2022-Science
TL;DR: The Telomere-to-Telomere (T2T) Consortium presents a complete 3.055 billion-base pair sequence of a human genome, T2T-CHM13, that includes gapless assemblies for all chromosomes except Y, corrects errors in the prior references, and introduces nearly 200 million base pairs of sequence containing 1956 gene predictions, 99 of which are predicted to be protein coding.
Abstract: Since its initial release in 2000, the human reference genome has covered only the euchromatic fraction of the genome, leaving important heterochromatic regions unfinished. Addressing the remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium presents a complete 3.055 billion-base pair sequence of a human genome, T2T-CHM13, that includes gapless assemblies for all chromosomes except Y, corrects errors in the prior references, and introduces nearly 200 million base pairs of sequence containing 1956 gene predictions, 99 of which are predicted to be protein coding. The completed regions include all centromeric satellite arrays, recent segmental duplications, and the short arms of all five acrocentric chromosomes, unlocking these complex regions of the genome to variational and functional studies.

717 citations


Journal Article
TL;DR: An increasing number of eukaryotic genomes have been included in KEGG for better representation of organisms in the taxonomic tree, and the Brite hierarchy viewer is used for taxonomy mapping.
Abstract: KEGG (https://www.kegg.jp) is a manually curated database resource integrating various biological objects categorized into systems, genomic, chemical and health information. Each object (database entry) is identified by the KEGG identifier (kid), which generally takes the form of a prefix followed by a five-digit number, and can be retrieved by appending /entry/kid in the URL. The KEGG pathway map viewer, the Brite hierarchy viewer and the newly released KEGG genome browser can be launched by appending /pathway/kid, /brite/kid and /genome/kid, respectively, in the URL. Together with an improved annotation procedure for KO (KEGG Orthology) assignment, an increasing number of eukaryotic genomes have been included in KEGG for better representation of organisms in the taxonomic tree. Multiple taxonomy files are generated for classification of KEGG organisms and viruses, and the Brite hierarchy viewer is used for taxonomy mapping, a variant of Brite mapping in the new KEGG Mapper suite. The taxonomy mapping enables analysis of, for example, how functional links of genes in the pathway and physical links of genes on the chromosome are conserved among organism groups.

520 citations
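
The URL scheme described in the KEGG abstract above can be sketched in a few lines of Python. This is a minimal illustration, not part of KEGG itself: the helper name `kegg_url` and the constant names are hypothetical, and the identifier `map00010` is used only as an example of the "prefix followed by a five-digit number" form.

```python
KEGG_BASE_URL = "https://www.kegg.jp"

# URL path segments named in the abstract: entry page, pathway map viewer,
# Brite hierarchy viewer, and genome browser.
KEGG_VIEWS = ("entry", "pathway", "brite", "genome")


def kegg_url(kid: str, view: str = "entry") -> str:
    """Build a KEGG URL by appending /<view>/<kid> to the base address.

    Hypothetical helper for illustration; it only formats a string and
    does not check that the identifier exists in KEGG.
    """
    if view not in KEGG_VIEWS:
        raise ValueError(f"unknown view {view!r}; expected one of {KEGG_VIEWS}")
    return f"{KEGG_BASE_URL}/{view}/{kid}"


# "map00010" is an illustrative identifier (prefix plus five digits).
print(kegg_url("map00010"))
print(kegg_url("map00010", view="pathway"))
```

Restricting `view` to the four path segments listed in the abstract keeps the sketch from producing URLs that the source does not describe.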


Journal Article
TL;DR: Shotgun metagenomics allowed the rapid reconstruction and phylogenomic characterization of the first monkeypox virus (MPXV) outbreak genome sequences, showing that this MPXV belongs to clade 3 and that the outbreak most likely has a single origin.
Abstract: The largest monkeypox virus (MPXV) outbreak described so far in non-endemic countries was identified in May 2022 (refs. 1–6). In this study, shotgun metagenomics allowed the rapid reconstruction and phylogenomic characterization of the first MPXV outbreak genome sequences, showing that this MPXV belongs to clade 3 and that the outbreak most likely has a single origin. Although 2022 MPXV (lineage B.1) clustered with 2018–2019 cases linked to an endemic country, it segregates in a divergent phylogenetic branch, likely reflecting continuous accelerated evolution. An in-depth mutational analysis suggests the action of host APOBEC3 in viral evolution as well as signs of potential MPXV human adaptation in ongoing microevolution. Our findings also indicate that genome sequencing may provide resolution to track the spread and transmission of this presumably slow-evolving double-stranded DNA virus.

386 citations


Journal Article
TL;DR: This review analyzes and summarizes data on the biological characteristics of amino acid mutations, the epidemic characteristics, immune escape, and vaccine reactivity of the Omicron variant, aiming to provide a scientific reference for monitoring, prevention, and vaccine development strategies for the Omicron variant.
Abstract: The severe acute respiratory syndrome coronavirus 2 (SARS‐CoV‐2) Omicron variant (B.1.1.529) was first identified in Botswana in November 2021. It was first reported to the World Health Organization (WHO) on November 24, 2021. On November 26, 2021, on the advice of scientists who are part of the WHO's Technical Advisory Group on SARS‐CoV‐2 Virus Evolution (TAG‐VE), the WHO defined the strain as a variant of concern (VOC) and named it Omicron. Compared to the other four VOCs (Alpha, Beta, Gamma, and Delta), the Omicron variant was the most highly mutated strain, with 50 mutations accumulated throughout the genome. The Omicron variant contains at least 32 mutations in the spike protein, twice as many as the Delta variant. Studies have shown that carrying many mutations can increase infectivity and immune escape of the Omicron variant compared with the early wild‐type strain and the other four VOCs. The Omicron variant is becoming the dominant strain in many countries worldwide and brings new challenges to preventing and controlling coronavirus disease 2019 (COVID‐19). The current review article aims to analyze and summarize data on the biological characteristics of amino acid mutations, the epidemic characteristics, immune escape, and vaccine reactivity of the Omicron variant, hoping to provide a scientific reference for monitoring, prevention, and vaccine development strategies for the Omicron variant.

250 citations


Journal Article
TL;DR: Large surveys of de novo mutations in the plant Arabidopsis thaliana show that mutations occur less often in functionally constrained regions of the genome.
Abstract: Since the first half of the twentieth century, evolutionary theory has been dominated by the idea that mutations occur randomly with respect to their consequences (ref. 1). Here we test this assumption with large surveys of de novo mutations in the plant Arabidopsis thaliana. In contrast to expectations, we find that mutations occur less often in functionally constrained regions of the genome: mutation frequency is reduced by half inside gene bodies and by two-thirds in essential genes. With independent genomic mutation datasets, including from the largest Arabidopsis mutation accumulation experiment conducted to date, we demonstrate that epigenomic and physical features explain over 90% of variance in the genome-wide pattern of mutation bias surrounding genes. Observed mutation frequencies around genes in turn accurately predict patterns of genetic polymorphisms in natural Arabidopsis accessions (r = 0.96). That mutation bias is the primary force behind patterns of sequence evolution around genes in natural accessions is supported by analyses of allele frequencies. Finally, we find that genes subject to stronger purifying selection have a lower mutation rate. We conclude that epigenome-associated mutation bias (ref. 2) reduces the occurrence of deleterious mutations in Arabidopsis, challenging the prevailing paradigm that mutation is a directionless force in evolution.

161 citations


Posted Content
23 Dec 2022
TL;DR: MitoHiFi is a tool for mitochondrial genome assembly using HiFi reads; it has been used to assemble mitochondrial genomes from a wide phylogenetic range of taxa from PacBio HiFi data.
Abstract: Background: PacBio high fidelity (HiFi) sequencing reads are both long (15-20 kb) and highly accurate (>Q20). Because of these properties, they have revolutionised genome assembly, leading to more accurate and contiguous genomes. In eukaryotes the mitochondrial genome is sequenced alongside the nuclear genome, often at very high coverage. A dedicated tool for mitochondrial genome assembly using HiFi reads is still missing. Results: MitoHiFi was developed within the Darwin Tree of Life Project to assemble mitochondrial genomes from the HiFi reads generated for target species. The input for MitoHiFi is either the raw reads or the assembled contigs, and the tool outputs a mitochondrial genome sequence fasta file along with annotation of protein and RNA genes. Variants arising from heteroplasmy are assembled independently, and nuclear insertions of mitochondrial sequences are identified and not used in organellar genome assembly. MitoHiFi has been used to assemble 374 mitochondrial genomes (369 from 12 phyla and 39 orders of Metazoa and from 6 species of Fungi) for the Darwin Tree of Life Project, the Vertebrate Genomes Project and the Aquatic Symbiosis Genome Project. Inspection of 60 mitochondrial genomes assembled with MitoHiFi for species that already have reference sequences in public databases showed the widespread presence of previously unreported repeats. Conclusions: MitoHiFi is able to assemble mitochondrial genomes from a wide phylogenetic range of taxa from PacBio HiFi data. MitoHiFi is written in Python and is freely available on GitHub (https://github.com/marcelauliano/MitoHiFi). MitoHiFi is available with its dependencies as a Singularity image on GitHub (ghcr.io/marcelauliano/mitohifi:master).

150 citations


Journal Article
01 Sep 2022-Cell
TL;DR: The authors present a high-coverage whole-genome sequencing (WGS) resource of 3,202 samples from the 1000 Genomes Project (1kGP), including 602 complete trios, sequenced to a depth of 30X using Illumina.

137 citations


Journal Article
TL;DR: STRING collects and integrates protein–protein interactions, both physical interactions and functional associations, from a number of sources: automated text mining of the scientific literature, computational interaction predictions from co-expression, conserved genomic context, databases of interaction experiments and known complexes/pathways from curated sources.
Abstract: Much of the complexity within cells arises from functional and regulatory interactions among proteins. The core of these interactions is increasingly known, but novel interactions continue to be discovered, and the information remains scattered across different database resources, experimental modalities and levels of mechanistic detail. The STRING database (https://string-db.org/) systematically collects and integrates protein–protein interactions, both physical interactions and functional associations. The data originate from a number of sources: automated text mining of the scientific literature, computational interaction predictions from co-expression, conserved genomic context, databases of interaction experiments and known complexes/pathways from curated sources. All of these interactions are critically assessed, scored, and subsequently automatically transferred to less well-studied organisms using hierarchical orthology information. The data can be accessed via the website, but also programmatically and via bulk downloads. The most recent developments in STRING (version 12.0) are: (i) it is now possible to create, browse and analyze a full interaction network for any novel genome of interest, by submitting its complement of encoded proteins, (ii) the co-expression channel now uses variational auto-encoders to predict interactions, and it covers two new sources, single-cell RNA-seq and experimental proteomics data and (iii) the confidence in each experimentally derived interaction is now estimated based on the detection method used, and communicated to the user in the web-interface. Furthermore, STRING continues to enhance its facilities for functional enrichment analysis, which are now fully available also for user-submitted genomes.

127 citations


Journal Article
24 May 2022-Science
TL;DR: PyR0, a hierarchical Bayesian multinomial logistic regression model that infers relative prevalence of all viral lineages across geographic regions, detects lineages increasing in prevalence, and identifies mutations relevant to fitness, is developed.
Abstract: Repeated emergence of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants with increased fitness underscores the value of rapid detection and characterization of new lineages. We have developed PyR0, a hierarchical Bayesian multinomial logistic regression model that infers relative prevalence of all viral lineages across geographic regions, detects lineages increasing in prevalence, and identifies mutations relevant to fitness. Applying PyR0 to all publicly available SARS-CoV-2 genomes, we identify numerous substitutions that increase fitness, including previously identified spike mutations and many nonspike mutations within the nucleocapsid and nonstructural proteins. PyR0 forecasts growth of new lineages from their mutational profile, ranks the fitness of lineages as new sequences become available, and prioritizes mutations of biological and public health concern for functional characterization.
Description: First off the COVID block. The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic has been characterized by waves of transmission initiated by new variants replacing older ones. Given this pattern of emergence, there is an obvious need for the early detection of novel variants to prevent excess deaths. Obermeyer et al. have developed a Bayesian model to compare relative transmissibility of all viral lineages. Using this model, emerging lineages can be spotted together with the mutations that contribute toward transmissibility, not only in Spike, but also in other viral proteins. The model can prioritize lineages as they emerge for public health concern. (CA)
A Bayesian hierarchical model of all SARS-CoV-2 viral genomes predicts lineage transmissibility and identifies associated mutations.

127 citations
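
To make the phrase "multinomial logistic regression" in the PyR0 abstract above concrete, the following sketch shows the general form of such a model for lineage prevalence: each lineage gets a linear score over time, and a softmax turns the scores into proportions that sum to one. This is not the PyR0 implementation. The function name, the per-lineage intercepts and growth rates, and the example values are all assumptions chosen for illustration, and the hierarchical Bayesian and mutation-level parts of the published model are omitted.

```python
import math
from typing import List


def lineage_prevalence(intercepts: List[float],
                       growth_rates: List[float],
                       t: float) -> List[float]:
    """Relative prevalence of each lineage at time t under a plain
    multinomial logistic model: softmax(intercept + growth_rate * t).

    Hypothetical helper; parameters are illustrative, not fitted values.
    """
    logits = [a + b * t for a, b in zip(intercepts, growth_rates)]
    # Subtract the maximum logit before exponentiating for numerical stability.
    shift = max(logits)
    weights = [math.exp(x - shift) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Two made-up lineages: an established one and a rarer, faster-growing one.
intercepts = [0.0, -3.0]
growth_rates = [0.0, 0.1]   # per day, relative to the first lineage

for day in (0, 30, 60):
    shares = lineage_prevalence(intercepts, growth_rates, day)
    print(day, [round(s, 3) for s in shares])
```

In this form a lineage with a larger growth rate steadily gains share relative to the others, which is the sense in which such a model can flag lineages that are increasing in prevalence.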


Journal Article
TL;DR: The Human Pangenome Reference Consortium (HPRC) aims to create a more sophisticated and complete human reference genome with a graph-based, telomere-to-telomere representation of global genomic diversity.
Abstract: The human reference genome is the most widely used resource in human genetics and is due for a major update. Its current structure is a linear composite of merged haplotypes from more than 20 people, with a single individual comprising most of the sequence. It contains biases and errors within a framework that does not represent global human genomic variation. A high-quality reference with global representation of common variants, including single-nucleotide variants, structural variants and functional elements, is needed. The Human Pangenome Reference Consortium aims to create a more sophisticated and complete human reference genome with a graph-based, telomere-to-telomere representation of global genomic diversity. Here we leverage innovations in technology, study design and global partnerships with the goal of constructing the highest-possible quality human pangenome reference. Our goal is to improve data representation and streamline analyses to enable routine assembly of complete diploid genomes. With attention to ethical frameworks, the human pangenome reference will contain a more accurate and diverse representation of global genomic variation, improve gene-disease association studies across populations, expand the scope of genomics research to the most repetitive and polymorphic regions of the genome, and serve as the ultimate genetic resource for future biomedical research and precision medicine.

Journal Article
TL;DR: The complete plastid genome of Onobrychis gaubae, endemic to Iran, was sequenced using Illumina paired-end sequencing and compared with previously known genomes of the IRLC species of legumes.
Abstract: Plastome (plastid genome) sequences provide valuable markers for surveying evolutionary relationships and population genetics of plant species. Papilionoideae (papilionoids) has different nucleotide and structural variations in plastomes, which makes it an ideal model for genome evolution studies. Therefore, by sequencing the complete chloroplast genome of Onobrychis gaubae in this study, the characteristics and evolutionary patterns of plastome variations in the IR-loss clade (IRLC) were compared. In the present study, the complete plastid genome of O. gaubae, endemic to Iran, was sequenced using Illumina paired-end sequencing and was compared with previously known genomes of the IRLC species of legumes. The O. gaubae plastid genome was 122,688 bp in length and included a large single-copy (LSC) region of 81,486 bp, a small single-copy (SSC) region of 13,805 bp and one copy of the inverted repeat (IRb) of 29,100 bp. The genome encoded 110 genes, including 76 protein-coding genes, 30 transfer RNA (tRNA) genes and four ribosomal RNA (rRNA) genes and possessed 83 simple sequence repeats (SSRs) and 50 repeated structures with the highest proportion in the LSC. Comparative analysis of the chloroplast genomes across IRLC revealed three hotspot genes (ycf1, ycf2, clpP) which could be used as DNA barcode regions. Moreover, seven hypervariable regions [trnL(UAA)-trnT(UGU), trnT(GGU)-trnE(UUC), ycf1, ycf2, ycf4, accD and clpP] were identified within Onobrychis, which could be used to distinguish the Onobrychis species. Phylogenetic analyses revealed that O. gaubae is closely related to Hedysarum. The complete O. gaubae genome is a valuable resource for investigating evolution of Onobrychis species and can be used to identify related species. Our results reveal that the plastomes of the IRLC are dynamic molecules and show multiple gene losses and inversions. The identified hypervariable regions could be used as molecular markers for resolving phylogenetic relationships and species identification and also provide new insights into plastome evolution across IRLC.

Journal Article
TL;DR: The authors performed whole-genome sequencing of 208 intestinal crypts from 56 individuals to study the landscape of somatic mutation across 16 mammalian species and found that somatic mutagenesis was dominated by seemingly endogenous mutational processes in all species, including 5-methylcytosine deamination and oxidative damage.
Abstract: The rates and patterns of somatic mutation in normal tissues are largely unknown outside of humans (refs. 1–7). Comparative analyses can shed light on the diversity of mutagenesis across species, and on long-standing hypotheses about the evolution of somatic mutation rates and their role in cancer and ageing. Here we performed whole-genome sequencing of 208 intestinal crypts from 56 individuals to study the landscape of somatic mutation across 16 mammalian species. We found that somatic mutagenesis was dominated by seemingly endogenous mutational processes in all species, including 5-methylcytosine deamination and oxidative damage. With some differences, mutational signatures in other species resembled those described in humans (ref. 8), although the relative contribution of each signature varied across species. Notably, the somatic mutation rate per year varied greatly across species and exhibited a strong inverse relationship with species lifespan, with no other life-history trait studied showing a comparable association. Despite widely different life histories among the species we examined (including variation of around 30-fold in lifespan and around 40,000-fold in body mass), the somatic mutation burden at the end of lifespan varied only by a factor of around 3. These data unveil common mutational processes across mammals, and suggest that somatic mutation rates are evolutionarily constrained and may be a contributing factor in ageing.

Journal Article
TL;DR: The authors identified diverse recombination events between two major Omicron subvariants (BA.1 and BA.2) and other variants of concern (VOCs) and variants of interest (VOIs), suggesting that co-infection and subsequent genome recombination play important roles in the ongoing evolution of SARS-CoV-2.
Abstract: The current pandemic of COVID-19 is fueled by more infectious emergent Omicron variants. Ongoing concerns of emergent variants include possible recombinants, as genome recombination is an important evolutionary mechanism for the emergence and re-emergence of human viral pathogens. In this study, we identified diverse recombination events between two Omicron major subvariants (BA.1 and BA.2) and other variants of concern (VOCs) and variants of interest (VOIs), suggesting that co-infection and subsequent genome recombination play important roles in the ongoing evolution of SARS-CoV-2. Through scanning high-quality completed Omicron spike gene sequences, 18 core mutations of BA.1 (frequency >99%) and 27 core mutations of BA.2 (nine more than BA.1) were identified, of which 15 are specific to Omicron. BA.1 subvariants share nine common amino acid mutations (three more than BA.2) in the spike protein with most VOCs, suggesting a possible recombination origin of Omicron from these VOCs. There are three more Alpha-related mutations in BA.1 than BA.2, and BA.1 is phylogenetically closer to Alpha than other variants. Revertant mutations are found in some dominant mutations (frequency >95%) in the BA.1. Most notably, multiple characteristic amino acid mutations in the Delta spike protein have been also identified in the "Deltacron"-like Omicron Variants isolated since November 11, 2021 in South Africa, which implies the recombination events occurring between the Omicron and Delta variants. Monitoring the evolving SARS-CoV-2 genomes especially for recombination is critically important for recognition of abrupt changes to viral attributes including its epitopes which may call for vaccine modifications.

Journal Article
TL;DR: The authors analyzed active enhancers across human organs using both eRNA transcription (FANTOM5 consortium data sets) and chromatin architecture (ENCODE consortium data sets) and showed that most enhancers active in a particular organ are also active in other organs.
Abstract: Enhancers are regulatory elements of genomes that determine spatio-temporal patterns of gene expression. The human genome contains a vast number of enhancers, which largely outnumber protein-coding genes. Historically, enhancers have been regarded as highly tissue-specific. However, recent evidence has demonstrated that many enhancers are pleiotropic, with activity in multiple developmental contexts. Yet, the extent and impact of pleiotropy remain largely unexplored. In this study we analyzed active enhancers across human organs based on the analysis of both eRNA transcription (FANTOM5 consortium data sets) and chromatin architecture (ENCODE consortium data sets). We show that pleiotropic enhancers are pervasive in the human genome and that most enhancers active in a particular organ are also active in other organs. In addition, our analysis suggests that the proportion of context-specific enhancers of a given organ is explained, at least in part, by the proportion of context-specific genes in that same organ. The notion that such a high proportion of human enhancers can be pleiotropic suggests that small regions of regulatory DNA contain abundant regulatory information and that these regions evolve under important evolutionary constraints.

Journal Article
01 Apr 2022-Science
TL;DR: A complete, telomere-to-telomere human genome assembly (T2T-CHM13) has enabled the comprehensive characterization of pericentromeric and centromeric repeats, which constitute 6.2% of the genome.
Abstract: Existing human genome assemblies have almost entirely excluded repetitive sequences within and near centromeres, limiting our understanding of their organization, evolution, and functions, which include facilitating proper chromosome segregation. Now, a complete, telomere-to-telomere human genome assembly (T2T-CHM13) has enabled us to comprehensively characterize pericentromeric and centromeric repeats, which constitute 6.2% of the genome (189.9 megabases). Detailed maps of these regions revealed multimegabase structural rearrangements, including in active centromeric repeat arrays. Analysis of centromere-associated sequences uncovered a strong relationship between the position of the centromere and the evolution of the surrounding DNA through layered repeat expansions. Furthermore, comparisons of chromosome X centromeres across a diverse panel of individuals illuminated high degrees of structural, epigenetic, and sequence variation in these complex and rapidly evolving regions.
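
As a quick consistency check on the figures quoted above, the stated 6.2% share and 189.9 megabases of pericentromeric and centromeric repeats can be compared with the 3.055 billion-base pair assembly size given in the T2T-CHM13 entry earlier in this listing. The variable names below are illustrative only.

```python
# Figures quoted in the two T2T-CHM13 entries of this listing.
repeat_mb = 189.9        # pericentromeric and centromeric repeats, in megabases
repeat_fraction = 0.062  # stated share of the genome (6.2%)
assembly_mb = 3055.0     # 3.055 billion base pairs, expressed in megabases

# Genome size implied by the repeat length and its stated share.
implied_genome_mb = repeat_mb / repeat_fraction

# Share recomputed directly from the stated assembly size.
recomputed_percent = 100 * repeat_mb / assembly_mb

print(f"Implied genome size: {implied_genome_mb:.0f} Mb")
print(f"Repeat share of the {assembly_mb:.0f} Mb assembly: {recomputed_percent:.1f}%")
```

The implied size of about 3,063 Mb is within rounding of the 3,055 Mb assembly, and the recomputed share rounds to the stated 6.2%.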

Journal Article
TL;DR: Using data from a genome-wide association study of 5.4 million individuals of diverse ancestries, the authors show that 12,111 independent SNPs significantly associated with height account for nearly all of the common SNP-based heritability.
Abstract: Common single-nucleotide polymorphisms (SNPs) are predicted to collectively explain 40-50% of phenotypic variation in human height, but identifying the specific variants and associated regions requires huge sample sizes (ref. 1). Here, using data from a genome-wide association study of 5.4 million individuals of diverse ancestries, we show that 12,111 independent SNPs that are significantly associated with height account for nearly all of the common SNP-based heritability. These SNPs are clustered within 7,209 non-overlapping genomic segments with a mean size of around 90 kb, covering about 21% of the genome. The density of independent associations varies across the genome and the regions of increased density are enriched for biologically relevant genes. In out-of-sample estimation and prediction, the 12,111 SNPs (or all SNPs in the HapMap 3 panel (ref. 2)) account for 40% (45%) of phenotypic variance in populations of European ancestry but only around 10-20% (14-24%) in populations of other ancestries. Effect sizes, associated regions and gene prioritization are similar across ancestries, indicating that reduced prediction accuracy is likely to be explained by linkage disequilibrium and differences in allele frequency within associated regions. Finally, we show that the relevant biological pathways are detectable with smaller sample sizes than are needed to implicate causal genes and variants. Overall, this study provides a comprehensive map of specific genomic regions that contain the vast majority of common height-associated variants. Although this map is saturated for populations of European ancestry, further research is needed to achieve equivalent saturation in other ancestries.
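
The segment figures in the abstract above can be checked with simple arithmetic: 7,209 segments with a mean size of around 90 kb should cover roughly the stated 21% of the genome. The sketch below uses only the numbers quoted in the abstract; the variable names are illustrative, and the result is approximate because the abstract gives rounded values.

```python
# Figures quoted in the abstract (both the mean size and the share are rounded).
n_segments = 7209         # non-overlapping genomic segments
mean_segment_kb = 90.0    # mean segment size, in kilobases
genome_fraction = 0.21    # stated share of the genome covered

# Total sequence covered by the segments, in megabases.
covered_mb = n_segments * mean_segment_kb / 1000

# Genome size implied by that coverage and the stated share.
implied_genome_mb = covered_mb / genome_fraction

print(f"Covered sequence: {covered_mb:.1f} Mb")
print(f"Implied genome size: {implied_genome_mb:.0f} Mb")
```

The segments add up to about 649 Mb, which implies a genome of roughly 3.1 Gb, in line with the size of the human genome.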

Journal Article
TL;DR: DefenseFinder is a tool that automatically detects known antiviral systems in prokaryotic genomes; it is intended to facilitate large-scale genomic analysis of antiviral defense systems.
Abstract: Bacteria and archaea have developed multiple antiviral mechanisms, and genomic evidence indicates that several of these antiviral systems co-occur in the same strain. Here, we introduce DefenseFinder, a tool that automatically detects known antiviral systems in prokaryotic genomes. We use DefenseFinder to analyse 21000 fully sequenced prokaryotic genomes, and find that antiviral strategies vary drastically between phyla, species and strains. Variations in composition of antiviral systems correlate with genome size, viral threat, and lifestyle traits. DefenseFinder will facilitate large-scale genomic analysis of antiviral defense systems and the study of host-virus interactions in prokaryotes.

Journal Article
TL;DR: A phylogenomic analysis of available monkeypox virus (MPXV) genomes was performed to determine their evolution and diversity. It revealed that all MPXV genomes group into three monophyletic clades: two previously characterized clades and a newly emerging clade harboring genomes from the ongoing 2022 multi-country outbreak, with 286 genomes comprising the hMPXV-1A clade and the newly classified lineages.

Journal Article
01 Jan 2022-Cell
TL;DR: The authors present a global genomic and epigenetic map of transcriptionally active and silent proviral species and evaluate their longitudinal evolution in persons receiving suppressive ART, showing that proviral transcriptional activity is associated with activating epigenetic chromatin features in linear proximity of integration sites and in their inter- and intrachromosomal contact regions.

Journal Article
TL;DR: METABOLIC is scalable software for advancing microbial ecology and biogeochemistry studies using genomes at the resolution of individual organisms and/or microbial communities.
Abstract: Advances in microbiome science are being driven in large part due to our ability to study and infer microbial ecology from genomes reconstructed from mixed microbial communities using metagenomics and single-cell genomics. Such omics-based techniques allow us to read genomic blueprints of microorganisms, decipher their functional capacities and activities, and reconstruct their roles in biogeochemical processes. Currently available tools for analyses of genomic data can annotate and depict metabolic functions to some extent; however, no standardized approaches are currently available for the comprehensive characterization of metabolic predictions, metabolite exchanges, microbial interactions, and microbial contributions to biogeochemical cycling. We present METABOLIC (METabolic And BiogeOchemistry anaLyses In miCrobes), a scalable software to advance microbial ecology and biogeochemistry studies using genomes at the resolution of individual organisms and/or microbial communities. The genome-scale workflow includes annotation of microbial genomes, motif validation of biochemically validated conserved protein residues, metabolic pathway analyses, and calculation of contributions to individual biogeochemical transformations and cycles. The community-scale workflow supplements genome-scale analyses with determination of genome abundance in the microbiome, potential microbial metabolic handoffs and metabolite exchange, reconstruction of functional networks, and determination of microbial contributions to biogeochemical cycles. METABOLIC can take input genomes from isolates, metagenome-assembled genomes, or single-cell genomes. Results are presented in the form of tables for metabolism and a variety of visualizations including biogeochemical cycling potential, representation of sequential metabolic transformations, community-scale microbial functional networks using a newly defined metric "MW-score" (metabolic weight score), and metabolic Sankey diagrams. METABOLIC takes ~3 h with 40 CPU threads to process ~100 genomes and corresponding metagenomic reads, within which the most compute-demanding part, hmmsearch, takes ~45 min, while it takes ~5 h to complete hmmsearch for ~3600 genomes. Tests of accuracy, robustness, and consistency suggest METABOLIC provides better performance compared to other software and online servers. To highlight the utility and versatility of METABOLIC, we demonstrate its capabilities on diverse metagenomic datasets from the marine subsurface, terrestrial subsurface, meadow soil, deep sea, freshwater lakes, wastewater, and the human gut. METABOLIC enables the consistent and reproducible study of microbial community ecology and biogeochemistry using a foundation of genome-informed microbial metabolism, and will advance the integration of uncultivated organisms into metabolic and biogeochemical models. METABOLIC is written in Perl and R and is freely available under GPLv3 at https://github.com/AnantharamanLab/METABOLIC.

Journal ArticleDOI
01 Jan 2022-Cell
TL;DR: Wang et al. as mentioned in this paper presented a 25.4-Gb chromosome-level assembly of Chinese pine (Pinus tabuliformis) and revealed that its genome size is mostly attributable to huge intergenic regions and long introns with high transposable element (TE) content.

Journal ArticleDOI
TL;DR: In this article, the authors show that Oxford Nanopore R10.4 can be used to generate near-finished microbial genomes from isolates or metagenomes without short-read or reference polishing.
Abstract: Long-read Oxford Nanopore sequencing has democratized microbial genome sequencing and enables the recovery of highly contiguous microbial genomes from isolates or metagenomes. However, to obtain near-finished genomes it has been necessary to include short-read polishing to correct insertions and deletions derived from homopolymer regions. Here, we show that Oxford Nanopore R10.4 can be used to generate near-finished microbial genomes from isolates or metagenomes without short-read or reference polishing.

Journal ArticleDOI
TL;DR: In this paper, the authors analyze whole-genome sequences of 150,119 individuals from the UK Biobank and characterize selection based on sequence variation within a population through a depletion rank score of windows along the genome.
Abstract: Detailed knowledge of how diversity in the sequence of the human genome affects phenotypic diversity depends on a comprehensive and reliable characterization of both sequences and phenotypic variation. Over the past decade, insights into this relationship have been obtained from whole-exome sequencing or whole-genome sequencing of large cohorts with rich phenotypic data. Here we describe the analysis of whole-genome sequencing of 150,119 individuals from the UK Biobank. This constitutes a set of high-quality variants, including 585,040,410 single-nucleotide polymorphisms, representing 7.0% of all possible human single-nucleotide polymorphisms, and 58,707,036 indels. This large set of variants allows us to characterize selection based on sequence variation within a population through a depletion rank score of windows along the genome. Depletion rank analysis shows that coding exons represent a small fraction of regions in the genome subject to strong sequence conservation. We define three cohorts within the UK Biobank: a large British Irish cohort, a smaller African cohort and a South Asian cohort. A haplotype reference panel is provided that allows reliable imputation of most variants carried by three or more sequenced individuals. We identified 895,055 structural variants and 2,536,688 microsatellites, groups of variants typically excluded from large-scale whole-genome sequencing studies. Using this formidable new resource, we provide several examples of trait associations for rare variants with large effects not found previously through studies based on whole-exome sequencing and/or imputation.


Journal ArticleDOI
01 Apr 2022-Science
TL;DR: The T2T-CHM13 reference as discussed by the authors has been shown to universally improve read mapping and variant calling for 3202 and 17 globally diverse samples sequenced with short and long reads, respectively.
Abstract: Compared to its predecessors, the Telomere-to-Telomere CHM13 genome adds nearly 200 million base pairs of sequence, corrects thousands of structural errors, and unlocks the most complex regions of the human genome for clinical and functional study. We show how this reference universally improves read mapping and variant calling for 3202 and 17 globally diverse samples sequenced with short and long reads, respectively. We identify hundreds of thousands of variants per sample in previously unresolved regions, showcasing the promise of the T2T-CHM13 reference for evolutionary and biomedical discovery. Simultaneously, this reference eliminates tens of thousands of spurious variants per sample, including reduction of false positives in 269 medically relevant genes by up to a factor of 12. Because of these improvements in variant discovery coupled with population and functional genomic resources, T2T-CHM13 is positioned to replace GRCh38 as the prevailing reference for human genetics.

Journal ArticleDOI
TL;DR: In this article, the authors describe an algorithm that combines PacBio HiFi reads and Hi-C chromatin interaction data to produce a haplotype-resolved assembly without the sequencing of parents.
Abstract: Routine haplotype-resolved genome assembly from single samples remains an unresolved problem. Here we describe an algorithm that combines PacBio HiFi reads and Hi-C chromatin interaction data to produce a haplotype-resolved assembly without the sequencing of parents. Applied to human and other vertebrate samples, our algorithm consistently outperforms existing single-sample assembly pipelines and generates assemblies of similar quality to the best pedigree-based assemblies.

Journal ArticleDOI
TL;DR: The CAMI II challenge engaged the community to assess methods on realistic and complex datasets with long- and short-read sequences, created computationally from around 1,700 new and known genomes, as well as 600 new plasmids and viruses as discussed by the authors.
Abstract: Evaluating metagenomic software is key for optimizing metagenome interpretation and focus of the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI). The CAMI II challenge engaged the community to assess methods on realistic and complex datasets with long- and short-read sequences, created computationally from around 1,700 new and known genomes, as well as 600 new plasmids and viruses. Here we analyze 5,002 results by 76 program versions. Substantial improvements were seen in assembly, some due to long-read data. Related strains still were challenging for assembly and genome recovery through binning, as was assembly quality for the latter. Profilers markedly matured, with taxon profilers and binners excelling at higher bacterial ranks, but underperforming for viruses and Archaea. Clinical pathogen detection results revealed a need to improve reproducibility. Runtime and memory usage analyses identified efficient programs, including top performers with other metrics. The results identify challenges and guide researchers in selecting methods for analyses.

Journal ArticleDOI
01 Jul 2022-Cell
TL;DR: In this paper, the authors perform genome-scale Perturb-seq targeting all expressed genes with CRISPR interference (CRISPRi) across >2.5 million human cells.

Journal ArticleDOI
Giulio Formenti, Kathrin Theissinger, Carlos Fernandes, Iliana Bista, Aureliano Bombarely, Christoph Bleidorn, Claudio Ciofi, Angelica Crottini, José Alberto Godoy Godoy, Jacob Höglund, Joanna Malukiewicz, Alice Mouton, Rebekah A. Oomen, Sadye Paez, Per J. Palsbøll, Christophe Pampoulie, Hannes Svardal, Constantina Theofanopoulou, Jan de Vries, Ann-Marie Waldvogel, Guojie Zhang, Camila J. Mazzoni, Miklós Bálint, Fedor Čiampor, J. Hoglund, María José Ruiz-López, Goujie Zhang, Erich D. Jarvis, Sargis A. Aghayan, Tyler Alioto, Isabel Almudi, Nadir Alvarez, Paulo C. Alves, Isabel R. Amorim, Agostinho Antunes, Paula Arribas, Petr Baldrian, Paul R. Berg, Giorgio Bertorelle, Astrid Böhne, Andrea Bonisoli-Alquati, Ljudevit Luka Boštjančić, Bastien Boussau, Catherine Breton, Elena Buzan, Paula F. Campos, Carlos Carreras, Luis Filipe Castro, Luis J. Chueca, Elena Conti, Robert Cook-Deegan, Daniel Croll, Mónica V. Cunha, Frédéric Delsuc, Alice B. Dennis, Dimitar Dimitrov, Rui Faria, Adrien Favre, Olivier Fedrigo, Rosa Fernández, Gentile Francesco Ficetola, Jean-François Flot, Toni Gabaldón, Dolores R. Galea Agius, Guido Roberto Gallo, Alice Maria Giani, M. Thomas P. Gilbert, Tine Grebenc, Katerina Guschanski, Romain Guyot, Bernhard Hausdorf, Oliver Hawlitschek, Peter D. Heintzman, Berthold Heinze, Michael Hiller, Martin Husemann, Alessio Iannucci, Iker Irisarri, Kjetill S. Jakobsen, Sissel Jentoft, Peter Klinga, Agnieszka Kloch, Claudius F. Kratochwil, Henrik Kusche, Kara K S Layton, Jennifer A. Leonard, Emmanuelle Lerat, Gianni Liti, Tereza Manousaki, Tomas Marques-Bonet, Pável Matos-Maraví, Michael Matschiner, Florian Maumus, Ann M Mc Cartney, Shai Meiri, José Melo-Ferreira, Ximo Mengual, Michael T. Monaghan, Matteo Montagna, Robert W. Mysłajek, Marco T. Neiber, Violaine Nicolas, Marta Novo, Petar Ozretić, Ferran Palero, Lucian Pârvulescu, Marta Pascual, Octávio S. Paulo, Martina Pavlek, Cinta Pegueroles, Loïc Pellissier, Graziano Pesole, Craig R. Primmer, Ana Riesgo, Lukas Rüber, Diego Rubolini, Daniel Salvi, Ole Seehausen, Matthias Seidel, Simona Secomandi, Bruno Studer, Spyros Theodoridis, Marco Thines, Lara Urban, Anti Vasemägi, Adriana Vella, Noel Vella, Sonja C. Vernes, Cristiano Vernesi, David R. Vieites, Robert M. Waterhouse, Christopher W. Wheat, Gert Wörheide, Yannick Wurm, Gabrielle Zammit
TL;DR: In this article, the authors discuss the large-scale generation of reference genomes representing global biodiversity and how these genomes are expected to benefit population, functional, and conservation genomics.
Abstract: Progress in genome sequencing now enables the large-scale generation of reference genomes. Various international initiatives aim to generate reference genomes representing global biodiversity. These genomes provide unique insights into genomic diversity and architecture, thereby enabling comprehensive analyses of population and functional genomics, and are expected to revolutionize conservation genomics.