Showing papers by "Richard Durbin" published in 2019


Journal Article
13 Jun 2019-Nature
TL;DR: Analysis of 34 newly recovered ancient genomes from northeastern Siberia reveals at least three major migration events in the late Pleistocene population history of the region, including an initial peopling by a previously unknown Palaeolithic population of ‘Ancient North Siberians’ and a Holocene migration of other East Asian-related peoples, which generated the mosaic genetic make-up of contemporary peoples.
Abstract: Northeastern Siberia has been inhabited by humans for more than 40,000 years but its deep population history remains poorly understood. Here we investigate the late Pleistocene population history of northeastern Siberia through analyses of 34 newly recovered ancient genomes that date to between 31,000 and 600 years ago. We document complex population dynamics during this period, including at least three major migration events: an initial peopling by a previously unknown Palaeolithic population of ‘Ancient North Siberians’ who are distantly related to early West Eurasian hunter-gatherers; the arrival of East Asian-related peoples, which gave rise to ‘Ancient Palaeo-Siberians’ who are closely related to contemporary communities from far-northeastern Siberia (such as the Koryaks), as well as Native Americans; and a Holocene migration of other East Asian-related peoples, who we name ‘Neo-Siberians’, and from whom many contemporary Siberians are descended. Each of these population expansions largely replaced the earlier inhabitants, and ultimately generated the mosaic genetic make-up of contemporary peoples who inhabit a vast area across northern Eurasia and the Americas. Analyses of 34 ancient genomes from northeastern Siberia, dating to between 31,000 and 600 years ago, reveal at least three major migration events in the late Pleistocene population history of the region.

211 citations


Journal Article
18 Jan 2019-Genes
TL;DR: A high-quality de novo genome assembly from a single Anopheles coluzzii mosquito, using a modified SMRTbell library construction protocol without DNA shearing and size selection, which puts PacBio-based assemblies in reach for small highly heterozygous organisms that comprise much of the diversity of life.
Abstract: A high-quality reference genome is a fundamental resource for functional genetics, comparative genomics, and population genomics, and is increasingly important for conservation biology. PacBio Single Molecule, Real-Time (SMRT) sequencing generates long reads with uniform coverage and high consensus accuracy, making it a powerful technology for de novo genome assembly. Improvements in throughput and concomitant reductions in cost have made PacBio an attractive core technology for many large genome initiatives, however, relatively high DNA input requirements (~5 µg for standard library protocol) have placed PacBio out of reach for many projects on small organisms that have lower DNA content, or on projects with limited input DNA for other reasons. Here we present a high-quality de novo genome assembly from a single Anopheles coluzzii mosquito. A modified SMRTbell library construction protocol without DNA shearing and size selection was used to generate a SMRTbell library from just 100 ng of starting genomic DNA. The sample was run on the Sequel System with chemistry 3.0 and software v6.0, generating, on average, 25 Gb of sequence per SMRT Cell with 20 h movies, followed by diploid de novo genome assembly with FALCON-Unzip. The resulting curated assembly had high contiguity (contig N50 3.5 Mb) and completeness (more than 98% of conserved genes were present and full-length). In addition, this single-insect assembly now places 667 (>90%) of formerly unplaced genes into their appropriate chromosomal contexts in the AgamP4 PEST reference. We were also able to resolve maternal and paternal haplotypes for over 1/3 of the genome. By sequencing and assembling material from a single diploid individual, only two haplotypes were present, simplifying the assembly process compared to samples from multiple pooled individuals. The method presented here can be applied to samples with starting DNA amounts as low as 100 ng per 1 Gb genome size. 
This new low-input approach puts PacBio-based assemblies in reach for small highly heterozygous organisms that comprise much of the diversity of life.

104 citations
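
The abstract above reports contiguity as a contig N50 of 3.5 Mb. As a reminder of what that statistic measures, here is a minimal sketch in Python: N50 is the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembly size. The function name `n50` and the toy contig lengths are illustrative assumptions, not data from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running sum first reaches at least half
    of the total assembly length.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (arbitrary units), total = 300.
contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(contigs))  # 70, because 80 + 70 = 150 is half of 300
```

A higher N50 means that more of the assembly sits in long contigs, which is why it is commonly quoted alongside completeness measures such as the conserved-gene count mentioned in the abstract.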


Journal Article
TL;DR: This work develops a scalable implementation of the graph extension of the positional Burrows–Wheeler transform, together with an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes.
Abstract: Motivation The variation graph toolkit (VG) represents genetic variation as a graph. Although each path in the graph is a potential haplotype, most paths are non-biological, unlikely recombinations of true haplotypes. Results We augment the VG model with haplotype information to identify which paths are more likely to exist in nature. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows-Wheeler transform. We demonstrate the scalability of the new implementation by building a whole-genome index of the 5008 haplotypes of the 1000 Genomes Project, and an index of all 108 070 Trans-Omics for Precision Medicine Freeze 5 chromosome 17 haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes. Availability and implementation Our software is available at https://github.com/vgteam/vg, https://github.com/jltsiren/gbwt and https://github.com/jltsiren/gcsa2. Supplementary information Supplementary data are available at Bioinformatics online.

55 citations


Posted Content
27 Jun 2019-bioRxiv
TL;DR: Analyses of high-coverage genome sequences from 54 diverse human populations reveal an excess of previously undocumented private genetic variation in southern and central Africa and in Oceania and the Americas, but an absence of fixed, private variants between major geographical regions.
Abstract: Genome sequences from diverse human groups are needed to understand the structure of genetic variation in our species and the history of, and relationships between, different populations. We present 929 high-coverage genome sequences from 54 diverse human populations, 26 of which are physically phased using linked-read sequencing. Analyses of these genomes reveal an excess of previously undocumented private genetic variation in southern and central Africa and in Oceania and the Americas, but an absence of fixed, private variants between major geographical regions. We also find deep and gradual population separations within Africa, contrasting population size histories between hunter-gatherer and agriculturalist groups in the last 10,000 years, a potentially major population growth episode after the peopling of the Americas, and a contrast between single Neanderthal but multiple Denisovan source populations contributing to present-day human populations. We also demonstrate benefits to the study of population relationships of genome sequences over ascertained array genotypes. These genome sequences are freely available as a resource with no access or analysis restrictions.

43 citations


Posted Content
18 Aug 2019-bioRxiv
TL;DR: The results reinforce the role of ancestral hybridisation in explosive diversification by demonstrating its significance in one of the largest recent vertebrate adaptive radiations, that of the cichlid fishes of East African Lake Malawi.
Abstract: The adaptive radiation of cichlid fishes in East African Lake Malawi encompasses over 500 species that are believed to have evolved within the last 800 thousand years from a common founder population. It has been proposed that hybridisation between ancestral lineages can provide the genetic raw material to fuel such exceptionally high diversification rates, and evidence for this has recently been presented for the Lake Victoria Region cichlid superflock. Here we report that Lake Malawi cichlid genomes also show evidence of hybridisation between two lineages that split 3-4 million years ago, today represented by Lake Victoria cichlids and the riverine Astatotilapia sp. ‘ruaha blue’. The two ancestries in Malawi cichlid genomes are present in large blocks of several kilobases, but there is little variation in this pattern between Malawi cichlid species, suggesting that the large-scale mosaic structure of the genomes was largely established prior to the radiation. Nevertheless, tens of thousands of polymorphic variants apparently derived from the hybridisation are interspersed in the genomes. These loci show a striking excess of differentiation across ecological subgroups in the Lake Malawi cichlid assemblage, and parental alleles sort differentially into benthic and pelagic Malawi cichlid lineages, consistent with strong differential selection on these loci during species divergence. Furthermore, these loci are enriched for genes involved in immune response and vision, including opsin genes previously identified as important for speciation. Our results reinforce the role of ancestral hybridisation in explosive diversification by demonstrating its significance in one of the largest recent vertebrate adaptive radiations.

39 citations


Journal Article
TL;DR: This study establishes a strategy for examining the genetic basis of inter-individual variability in cell behavior and identifies genes that correlate in expression with intrinsic and extrinsic PEER factors and associate outlier cell behavior with genes containing rare deleterious non-synonymous SNVs.

31 citations


Journal Article
TL;DR: On the Syndip test set, a 17-fold reduction in the quality storage portion of a CRAM file can be achieved while maintaining variant calling accuracy.
Abstract: Motivation The bulk of space taken up by NGS CRAM files consists of per-base quality values. Most of these are unnecessary for variant calling, offering an opportunity for space saving. Results On the Syndip test set, a 17-fold reduction in the quality storage portion of a CRAM file can be achieved while maintaining variant calling accuracy. The size reduction of an entire CRAM file varied from 2.2- to 7.4-fold, depending on the non-quality content of the original file (see Supplementary Material S6 for details). Availability and implementation Crumble is open source and can be obtained from https://github.com/jkbonfield/crumble. Supplementary information Supplementary data are available at Bioinformatics online.

21 citations


Posted Content
14 Jul 2019-bioRxiv
TL;DR: Souporcell is a novel method to cluster cells using only the genetic variants detected within the scRNAseq reads that achieves high accuracy on genotype clustering, doublet detection, and ambient RNA estimation as demonstrated across a wide range of challenging scenarios.
Abstract: A popular design for scRNAseq experiments is to multiplex cells from different donors, as this strategy avoids batch effects, reduces costs, and improves doublet detection. Using variants in the reads, it is possible to assign cells to genotypes. The first tool in this space, demuxlet, assigns cells based on genotypes known a priori, but more recently tools not requiring this information have become available including sc_split and vireo. However, none of these methods have been validated across a wide range of sample parameters, types, and species. Further, none of these tools model an important confounder of the data, ambient RNA caused by cell lysis prior to cell partitioning. We present souporcell, a robust method to cluster cells by their genetic variants without a genotype reference and show that it outperforms existing methods on clustering accuracy, doublet detection, and genotyping across a wide range of challenging scenarios while accurately estimating the amount of ambient RNA in the sample.

18 citations


Posted Content
14 Aug 2019-bioRxiv
TL;DR: This work presents a novel tool “purge_dups” that uses sequence similarity and read depth to automatically identify and remove both haplotigs and heterozygous overlaps, and shows that it can reduce heterozygous duplication and increase assembly continuity while maintaining completeness of the primary assembly.
Abstract: Motivation: Rapid development in long read sequencing and scaffolding technologies is enabling increased efficiency in the production of reference-quality assemblies for large eukaryotic genomes. However, haplotype divergence in regions of high heterozygosity often results in assemblers creating two copies rather than one copy of a region, leading to breaks in contiguity and compromising downstream steps such as gene annotation. Several tools have been developed to resolve this problem. However, they either only focus on removing contained duplicate regions, also known as haplotigs, or fail to use all the relevant information and hence make errors. Results: Here we present a novel tool "purge_dups" that automatically identifies haplotigs and also heterozygous overlaps between primary contigs, using both sequence similarity and read depth, and removes the duplicated regions. Through comparison with the current standard, purge_haplotigs, on three de novo assemblies, we demonstrate that purge_dups can reduce heterozygous duplication in assemblies effectively while maintaining completeness of the primary assembly. It can also benefit the scaffolding process by increasing continuity of the scaffolds. Moreover, purge_dups is fully automatic and can be easily integrated into assembly pipelines.

17 citations


Posted Content
24 Feb 2019-bioRxiv
TL;DR: This work develops a scalable implementation of the graph extension of the positional Burrows–Wheeler transform (GBWT), together with an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes.
Abstract: Motivation The variation graph toolkit (VG) represents genetic variation as a graph. Although each path in the graph is a potential haplotype, most paths are nonbiological, unlikely recombinations of true haplotypes. Results We augment the VG model with haplotype information to identify which paths are more likely to exist in nature. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows–Wheeler transform (GBWT). We demonstrate the scalability of the new implementation by building a whole-genome index of the 5,008 haplotypes of the 1000 Genomes Project, and an index of all 108,070 TOPMed Freeze 5 chromosome 17 haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes. Availability Our software is available at https://github.com/vgteam/vg, https://github.com/jltsiren/gbwt, and https://github.com/jltsiren/gcsa2. Contact jouni.siren@iki.fi Supplementary information Supplementary data are available.

8 citations


Posted Content
26 Sep 2019-bioRxiv
TL;DR: In this article, the authors used vg to align multiple previously published aDNA samples to a variation graph containing 1000 Genomes Project variants, and compared these with the same data aligned with bwa to the human linear reference genome.
Abstract: Background During the last decade, the analysis of ancient DNA (aDNA) sequence has become a powerful tool for the study of past human populations. However, the degraded nature of aDNA means that aDNA sequencing reads are short, single-ended and frequently mutated by post-mortem chemical modifications. All these features decrease read mapping accuracy and increase reference bias, in which reads containing non-reference alleles are less likely to be mapped than those containing reference alleles. Recently, alternative approaches for read mapping and genetic variation analysis have been developed that replace the linear reference by a variation graph which includes all the alternative variants at each genetic locus. Here, we evaluate the use of variation graph software vg to avoid reference bias for ancient DNA. Results We used vg to align multiple previously published aDNA samples to a variation graph containing 1000 Genomes Project variants, and compared these with the same data aligned with bwa to the human linear reference genome. We show that use of vg leads to a much more balanced allelic representation at polymorphic sites and better variant detection in comparison with bwa, especially in the presence of post-mortem changes, effectively removing reference bias. A recently published approach that filters bwa alignments using modified reads also removes bias, but has lower sensitivity than vg. Conclusions Our findings demonstrate that aligning aDNA sequences to variation graphs allows recovering a higher fraction of non-reference variation and effectively mitigates the impact of reference bias in population genetics analyses using aDNA, while retaining mapping sensitivity.


Posted Content
06 Feb 2019-bioRxiv
TL;DR: This work describes a new approach for fast and scalable generation of local tree topologies relating large numbers of haplotypes, based on a data structure which it calls tree consistent PBWT, a modification of the PBWT data structure introduced by R. Durbin (2014).
Abstract: Estimation of the relationship between DNA sequences is one of the most important problems in genomics. Understanding these relationships is central to demographic inference, correction of population structure in GWAS, identifying signals of selection, etc. The data structure containing the full information about sample genealogy is called the ancestral recombination graph (ARG). However, ARG inference is a very difficult problem, not least due to a very complex state space. In this work we describe a new approach for fast and scalable generation of local tree topologies relating large numbers of haplotypes. Our method is closely related to the estimation of ARG, and captures both local and global properties of an ARG. It is based on a data structure which we call tree consistent PBWT, a modification of the PBWT data structure introduced by R. Durbin (2014). We also explore some methods to estimate the quality of the generated tree topologies and to make inferences based on them. At the end we discuss a probabilistic model which could potentially lead to the estimation of ARG node times.
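
For readers unfamiliar with the PBWT that the abstract above builds on, the following is a minimal Python sketch of the plain positional prefix ordering only. It is not the tree consistent variant described in the abstract, whose details are not given here. The function name `pbwt_prefix_orders` and the toy haplotypes are illustrative assumptions. At each site, haplotypes are stably partitioned by their allele, which keeps them sorted by their reversed prefixes, so haplotypes sharing a long match ending at a site end up adjacent.

```python
def pbwt_prefix_orders(haplotypes):
    """Return the positional prefix orderings a_0 .. a_N.

    haplotypes: list of equal-length sequences of 0/1 alleles.
    orders[k] lists haplotype indices sorted by their first k alleles
    read in reverse (most recent site first).
    """
    n_haps = len(haplotypes)
    n_sites = len(haplotypes[0]) if n_haps else 0
    order = list(range(n_haps))  # a_0: original input order
    orders = [order]
    for k in range(n_sites):
        # Stable partition: allele-0 haplotypes first, then allele-1,
        # each group keeping the relative order it had in a_k.
        zeros = [h for h in order if haplotypes[h][k] == 0]
        ones = [h for h in order if haplotypes[h][k] == 1]
        order = zeros + ones
        orders.append(order)
    return orders


# Hypothetical toy panel: 4 haplotypes, 3 biallelic sites.
panel = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 1],
]
for k, order in enumerate(pbwt_prefix_orders(panel)):
    print(k, order)
```

Each update costs time linear in the number of haplotypes, which is the property that makes PBWT-style structures attractive for very large haplotype panels.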

Posted Content
17 Mar 2019-bioRxiv
TL;DR: A novel pedigree-graph-based approach to diploid assembly using accurate Illumina data and long-read Pacific Biosciences data from all related individuals is presented, thereby generalizing previous work on single individuals.
Abstract: Motivation Reconstructing high-quality haplotype-resolved assemblies for related individuals of various species has important applications in understanding Mendelian diseases along with evolutionary and comparative genomics. Through major genomics sequencing efforts such as the Personal Genome Project, the Vertebrate Genome Project (VGP), the Earth Biogenome Project (EBP) and the Genome in a Bottle project (GIAB), a variety of sequencing datasets from mother-father-child trios of various diploid species are becoming available. Current trio assembly approaches are not designed to incorporate long-read sequencing data from parents in a trio, and therefore require relatively high coverages of costly long-read data to produce high-quality assemblies. Thus, building a trio-aware assembler capable of producing accurate and chromosomal-scale diploid genomes in a pedigree, while being cost-effective in terms of sequencing costs, is a pressing need of the genomics community. Results We present a novel pedigree-graph-based approach to diploid assembly using accurate Illumina data and long-read Pacific Biosciences (PacBio) data from all related individuals, thereby generalizing our previous work on single individuals. We demonstrate the effectiveness of our pedigree approach on a simulated trio of pseudo-diploid yeast genomes with different heterozygosity rates, and real data from Arabidopsis thaliana. We show that we require as little as 30× coverage Illumina data and 15× PacBio data from each individual in a trio to generate chromosomal-scale phased assemblies. Additionally, we show that we can detect and phase variants from generated phased assemblies. Availability https://github.com/shilpagarg/WHdenovo Contact shilpa_garg@hms.harvard.edu, gchurch@genetics.med.harvard.edu

Journal Article
TL;DR: An efficient set of computational tools, rkmh, for analyzing complex mixed infections of related viruses based on sequence data, makes extensive use of MinHash similarity measures, and includes utilities for removing host DNA and classifying reads by type, lineage, and sublineage.
Abstract: Human papillomavirus (HPV) is a common sexually transmitted infection associated with cervical cancer that frequently occurs as a coinfection of types and subtypes. Highly similar sublineages that show over 100-fold differences in cancer risk are not distinguishable in coinfections with current typing methods. We describe an efficient set of computational tools, rkmh, for analyzing complex mixed infections of related viruses based on sequence data. rkmh makes extensive use of MinHash similarity measures, and includes utilities for removing host DNA and classifying reads by type, lineage, and sublineage. We show that rkmh is capable of assigning reads to their HPV type as well as HPV16 lineage and sublineages. Accurate read classification enables estimates of percent composition when there are multiple infecting lineages or sublineages. While we demonstrate rkmh for HPV with multiple sequencing technologies, it is also applicable to other mixtures of related sequences.
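
The abstract above says rkmh relies on MinHash similarity measures. The following Python sketch shows the general idea of a bottom-s MinHash over k-mers and is not rkmh's actual implementation or interface: the function names, the choice of BLAKE2b as the hash, and the parameters `k` and `size` are all assumptions made for illustration. Each sequence is reduced to its `size` smallest k-mer hashes, and the Jaccard similarity of the two k-mer sets is estimated from those small sketches alone.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Deterministic 64-bit hash of a k-mer (BLAKE2b is an arbitrary choice)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=4, size=16):
    """Bottom-s sketch: the `size` smallest k-mer hash values, sorted."""
    return sorted(hash64(km) for km in kmers(seq, k))[:size]


def jaccard_estimate(sketch_a, sketch_b, size=16):
    """Estimate Jaccard similarity from two bottom-s sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


# Hypothetical read and reference fragments.
read = "ACGTACGTTAGCCGATAGGCTA"
ref_same = "ACGTACGTTAGCCGATAGGCTA"
ref_other = "TTTTTTTTTTTTTTTTTTTTTT"
print(jaccard_estimate(sketch(read), sketch(ref_same)))
print(jaccard_estimate(sketch(read), sketch(ref_other)))
```

In a classifier of the kind the abstract describes, a read would be assigned to whichever reference sketch gives the highest estimate; how rkmh handles ties, thresholds, and host-read removal is not covered by this sketch.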

Journal Article
TL;DR: An open-source C++ library for GFA and a set of utilities for summarizing and manipulating the format to encourage further adoption in high-performance software.
Abstract: Summary GFA has emerged as a standard format for the exchange of genome assemblies and sequence graphs. To encourage further adoption in high-performance software we have developed an open-source C++ library for GFA and a set of utilities for summarizing and manipulating the format. Availability The gfakluge source code is freely available under the MIT license at https://github.com/edawson/gfakluge. It has been tested on both Mac OS X and Linux.
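
To make the kind of summarizing utility mentioned above concrete, here is a minimal Python sketch that reads GFA1 text and counts segments, links, and bases. It does not use gfakluge (which is a C++ library) and does not reflect its API. The function name `summarize_gfa` and the toy graph are assumptions, and only the tab-separated `S` (segment) and `L` (link) record types of GFA1 are handled.

```python
def summarize_gfa(text):
    """Count segments, links, and total bases in GFA1-formatted text.

    Only S and L records are interpreted; all other record types
    (for example H headers or P paths) are skipped.
    """
    segments = {}  # segment name -> sequence ("*" if not stored)
    links = []     # (from, from_orient, to, to_orient, overlap)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if fields[0] == "S":
            segments[fields[1]] = fields[2]
        elif fields[0] == "L":
            links.append(tuple(fields[1:6]))
    total_bases = sum(len(s) for s in segments.values() if s != "*")
    return {
        "segments": len(segments),
        "links": len(links),
        "total_bases": total_bases,
    }


# Hypothetical two-segment graph with one link.
toy_gfa = "H\tVN:Z:1.0\nS\t1\tACGT\nS\t2\tGG\nL\t1\t+\t2\t+\t0M\n"
print(summarize_gfa(toy_gfa))
# {'segments': 2, 'links': 1, 'total_bases': 6}
```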