Showing papers by "Richard Durbin" published in 2018


Journal Article
TL;DR: A perspective on the Earth BioGenome Project (EBP), a moonshot for biology that aims to sequence, catalog, and characterize the genomes of all of Earth’s eukaryotic biodiversity over a period of 10 years, is presented.
Abstract: Increasing our understanding of Earth’s biodiversity and responsibly stewarding its resources are among the most crucial scientific and social challenges of the new millennium. These challenges require fundamental new knowledge of the organization, evolution, functions, and interactions among millions of the planet’s organisms. Herein, we present a perspective on the Earth BioGenome Project (EBP), a moonshot for biology that aims to sequence, catalog, and characterize the genomes of all of Earth’s eukaryotic biodiversity over a period of 10 years. The outcomes of the EBP will inform a broad range of major issues facing humanity, such as the impact of climate change on biodiversity, the conservation of endangered species and ecosystems, and the preservation and enhancement of ecosystem services. We describe hurdles that the project faces, including data-sharing policies that ensure a permanent, freely available resource for future scientific discovery while respecting access and benefit sharing guidelines of the Nagoya Protocol. We also describe scientific and organizational challenges in executing such an ambitious project, and the structure proposed to achieve the project’s goals. The far-reaching potential benefits of creating an open digital repository of genomic information for life on Earth can be realized only by a coordinated international effort.

560 citations


Journal Article
TL;DR: vg is a toolkit of computational methods for creating, manipulating, and using variation graphs as references at the scale of the human genome. It provides an efficient approach to mapping reads onto arbitrary variation graphs using generalized compressed suffix arrays, with improved accuracy over alignment to a linear reference.
Abstract: Reference genomes guide our interpretation of DNA sequence data. However, conventional linear references represent only one version of each locus, ignoring variation in the population. Poor representation of an individual's genome sequence impacts read mapping and introduces bias. Variation graphs are bidirected DNA sequence graphs that compactly represent genetic variation across a population, including large-scale structural variation such as inversions and duplications. Previous graph genome software implementations have been limited by scalability or topological constraints. Here we present vg, a toolkit of computational methods for creating, manipulating, and using these structures as references at the scale of the human genome. vg provides an efficient approach to mapping reads onto arbitrary variation graphs using generalized compressed suffix arrays, with improved accuracy over alignment to a linear reference, and effectively removing reference bias. These capabilities make using variation graphs as references for DNA sequencing practical at a gigabase scale, or at the topological complexity of de novo assemblies.

408 citations


Journal Article
TL;DR: In this paper, the authors characterized the genomic diversity of cichlid fish in Lake Malawi by sequencing 134 individuals covering 73 species across all major lineages and found that the average sequence divergence between species pairs is only 0.1-0.25%.
Abstract: The hundreds of cichlid fish species in Lake Malawi constitute the most extensive recent vertebrate adaptive radiation. Here we characterize its genomic diversity by sequencing 134 individuals covering 73 species across all major lineages. The average sequence divergence between species pairs is only 0.1-0.25%. These divergence values overlap diversity within species, with 82% of heterozygosity shared between species. Phylogenetic analyses suggest that diversification initially proceeded by serial branching from a generalist Astatotilapia-like ancestor. However, no single species tree adequately represents all species relationships, with evidence for substantial gene flow at multiple times. Common signatures of selection on visual and oxygen transport genes shared by distantly related deep-water species point to both adaptive introgression and independent selection. These findings enhance our understanding of genomic processes underlying rapid species diversification, and provide a platform for future genetic analysis of the Malawi radiation.

311 citations


Journal Article
TL;DR: It is argued that the chronology and physical diversity of Pleistocene human fossils and the African archaeological record support an emerging view of a highly structured African prehistory that should be considered in human evolutionary inferences, prompting new interpretations, questions, and interdisciplinary research directions.
Abstract: We challenge the view that our species, Homo sapiens, evolved within a single population and/or region of Africa. The chronology and physical diversity of Pleistocene human fossils suggest that morphologically varied populations pertaining to the H. sapiens clade lived throughout Africa. Similarly, the African archaeological record demonstrates the polycentric origin and persistence of regionally distinct Pleistocene material culture in a variety of paleoecological settings. Genetic studies also indicate that present-day population structure within Africa extends to deep times, paralleling a paleoenvironmental record of shifting and fractured habitable zones. We argue that these fields support an emerging view of a highly structured African prehistory that should be considered in human evolutionary inferences, prompting new interpretations, questions, and interdisciplinary research directions.

278 citations


Journal Article
29 Jun 2018-Science
TL;DR: Analysis of ancient whole-genome sequences from across Inner Asia and Anatolia shows that the Botai people associated with the earliest horse husbandry derived from a hunter-gatherer population deeply diverged from the Yamnaya, and suggests distinct migrations bringing West Eurasian ancestry into South Asia before and after, but not at the time of, the Yamnaya culture.
Abstract: The Yamnaya expansions from the western steppe into Europe and Asia during the Early Bronze Age (~3000 BCE) are believed to have brought with them Indo-European languages and possibly horse husbandry. We analyze 74 ancient whole-genome sequences from across Inner Asia and Anatolia and show that the Botai people associated with the earliest horse husbandry derived from a hunter-gatherer population deeply diverged from the Yamnaya. Our results also suggest distinct migrations bringing West Eurasian ancestry into South Asia before and after but not at the time of Yamnaya culture. We find no evidence of steppe ancestry in Bronze Age Anatolia from when Indo-European languages are attested there. Thus, in contrast to Europe, Early Bronze Age Yamnaya-related migrations had limited direct genetic impact in Asia.

273 citations


Journal Article
TL;DR: De novo genome assemblies for 16 inbred mouse strains identified the regions of the reference genome with the greatest sequence diversity between strains, as well as a large, previously unannotated gene (Efcab3-like) encoding 5,874 amino acids; brain anomalies in mutant mice suggest a possible role for this gene in the regulation of brain development.
Abstract: We report full-length draft de novo genome assemblies for 16 widely used inbred mouse strains and find extensive strain-specific haplotype variation. We identify and characterize 2,567 regions on the current mouse reference genome exhibiting the greatest sequence diversity. These regions are enriched for genes involved in pathogen defence and immunity and exhibit enrichment of transposable elements and signatures of recent retrotransposition events. Combinations of alleles and genes unique to an individual strain are commonly observed at these loci, reflecting distinct strain phenotypes. We used these genomes to improve the mouse reference genome, resulting in the completion of 10 new gene structures. Also, 62 new coding loci were added to the reference genome annotation. These genomes identified a large, previously unannotated, gene (Efcab3-like) encoding 5,874 amino acids. Mutant Efcab3-like mice display anomalies in multiple brain regions, suggesting a possible role for this gene in the regulation of brain development.

138 citations


Journal Article
TL;DR: A new approach to detect segments of individual genomes of archaic origin without using an archaic reference genome based on a hidden Markov model that identifies genomic regions with a high density of single nucleotide variants not seen in unadmixed populations is presented.
Abstract: Human populations outside of Africa have experienced at least two bouts of introgression from archaic humans, from Neanderthals and Denisovans. In Papuans there is prior evidence of both these introgressions. Here we present a new approach to detect segments of individual genomes of archaic origin without using an archaic reference genome. The approach is based on a hidden Markov model that identifies genomic regions with a high density of single nucleotide variants (SNVs) not seen in unadmixed populations. We show using simulations that this provides a powerful approach to identifying segments of archaic introgression with a low rate of false detection, given data from a suitable outgroup population is available, without the archaic introgression but containing a majority of the variation that arose since initial separation from the archaic lineage. Furthermore, our approach is able to infer admixture proportions and the times both of admixture and of initial divergence between the human and archaic populations. We apply the model to detect archaic introgression in 89 Papuans and show how the identified segments can be assigned to likely Neanderthal or Denisovan origin. We report more Denisovan admixture than previous studies and find a shift in size distribution of fragments of Neanderthal and Denisovan origin that is compatible with a difference in admixture time. Furthermore, we identify small amounts of Denisova ancestry in South East Asians and South Asians.

68 citations
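
The hidden Markov model idea in the abstract above can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the genome is split into fixed windows, each window carries a count of SNVs not seen in the outgroup, counts are modelled as Poisson with a low rate in the non-archaic state and a high rate in the archaic state, and the rates, prior and transition probability are made-up values. The function name `viterbi_two_state` is hypothetical.

```python
import math


def poisson_logpmf(k, lam):
    """Log probability of observing count k under a Poisson with mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def viterbi_two_state(counts, rates=(0.5, 5.0), stay=0.99, prior=(0.95, 0.05)):
    """Most likely state per window: 0 = non-archaic, 1 = archaic (assumed model).

    counts: private-SNV count per window (hypothetical input).
    rates, stay, prior: illustrative parameters, not estimates from any data.
    """
    if not counts:
        return []
    log_trans = [
        [math.log(stay), math.log(1 - stay)],
        [math.log(1 - stay), math.log(stay)],
    ]
    # Initialise with the prior and the emission of the first window.
    score = [math.log(prior[s]) + poisson_logpmf(counts[0], rates[s]) for s in (0, 1)]
    back = []
    for c in counts[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            # Best predecessor state for ending in state s at this window.
            best_prev = max((0, 1), key=lambda p: score[p] + log_trans[p][s])
            pointers.append(best_prev)
            new_score.append(
                score[best_prev] + log_trans[best_prev][s] + poisson_logpmf(c, rates[s])
            )
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max((0, 1), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


window_counts = [0, 1, 0, 0, 6, 4, 7, 5, 0, 1, 0, 0]
print(viterbi_two_state(window_counts))
```

With these toy values, the run of windows with elevated counts is the only stretch where entering and leaving the archaic state is worth the two transition penalties, which is the intuition behind detecting introgressed segments from variant density alone.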


Posted Content
23 Mar 2018-bioRxiv
TL;DR: A new method for efficiently computing the expected SFS and linear functionals of it, for demographies described by general directed acyclic graphs is presented, which can scale to more populations than previously possible for complex demographic histories including admixture.
Abstract: The sample frequency spectrum (SFS), or histogram of allele counts, is an important summary statistic in evolutionary biology, and is often used to infer the history of population size changes, migrations, and other demographic events affecting a set of populations. The expected multipopulation SFS under a given demographic model can be efficiently computed when the populations in the model are related by a tree, scaling to hundreds of populations. Admixture, back-migration, and introgression are common natural processes that violate the assumption of a tree-like population history, however, and until now the expected SFS could be computed for only a handful of populations when the demographic history is not a tree. In this article, we present a new method for efficiently computing the expected SFS and linear functionals of it, for demographies described by general directed acyclic graphs. This method can scale to more populations than previously possible for complex demographic histories including admixture. We apply our method to an 8-population SFS to estimate the timing and strength of a proposed "basal Eurasian" admixture event in human history. We implement and release our method in a new open-source software package momi2.

61 citations
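
As a small illustration of the statistic defined in the abstract above (the SFS as a histogram of allele counts), the sketch below computes a single-population SFS from a toy panel. It does not use momi2 or reproduce its method; the function name `sample_frequency_spectrum` and the 0/1 encoding (1 = derived allele) are assumptions for the example.

```python
def sample_frequency_spectrum(haplotypes):
    """Return sfs where sfs[k] is the number of sites with k derived alleles.

    haplotypes: list of equal-length 0/1 lists, one row per sampled
    chromosome and one column per site (toy encoding, 1 = derived).
    """
    n = len(haplotypes)
    sfs = [0] * (n + 1)
    # zip(*rows) iterates over columns, i.e. over sites.
    for site in zip(*haplotypes):
        sfs[sum(site)] += 1
    return sfs


haps = [
    [0, 1, 0, 1, 0],
    [0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
]
print(sample_frequency_spectrum(haps))  # [1, 3, 0, 0, 1]
```

Here one site is invariant (count 0), three sites are singletons, and one site is fixed for the derived allele in the sample of four.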


Journal Article
TL;DR: A novel graph‐based approach to diploid assembly, which combines accurate Illumina data and long‐read Pacific Biosciences data, is presented and it is shown that the approach has the ability to detect and phase structural variants.
Abstract: Motivation: Constructing high-quality haplotype-resolved de novo assemblies of diploid genomes is important for revealing the full extent of structural variation and its role in health and disease. Current assembly approaches often collapse the two sequences into one haploid consensus sequence and, therefore, fail to capture the diploid nature of the organism under study. Thus, building an assembler capable of producing accurate and complete diploid assemblies, while being resource-efficient with respect to sequencing costs, is a key challenge to be addressed by the bioinformatics community. Results: We present a novel graph-based approach to diploid assembly, which combines accurate Illumina data and long-read Pacific Biosciences (PacBio) data. We demonstrate the effectiveness of our method on a pseudo-diploid yeast genome and show that we require as little as 50× coverage Illumina data and 10× PacBio data to generate accurate and complete assemblies. Additionally, we show that our approach has the ability to detect and phase structural variants. Availability and implementation: https://github.com/whatshap/whatshap. Supplementary information: Supplementary data are available at Bioinformatics online.

58 citations


Posted Content
22 Oct 2018-bioRxiv
TL;DR: 34 ancient genome sequences are analyzed, revealing that the population history of northeastern Siberia was highly dynamic throughout the Late Pleistocene and Holocene, with earlier, once widespread populations being replaced by distinct peoples.
Abstract: Far northeastern Siberia has been occupied by humans for more than 40 thousand years. Yet, owing to a scarcity of early archaeological sites and human remains, its population history and relationship to ancient and modern populations across Eurasia and the Americas are poorly understood. Here, we report 34 ancient genome sequences, including two from fragmented milk teeth found at the ~31.6 thousand-year-old (kya) Yana RHS site, the earliest and northernmost Pleistocene human remains found. These genomes reveal complex patterns of past population admixture and replacement events throughout northeastern Siberia, with evidence for at least three large-scale human migrations into the region. The first inhabitants, a previously unknown population of "Ancient North Siberians" (ANS), represented by Yana RHS, diverged ~38 kya from Western Eurasians, soon after the latter split from East Asians. Between 20 and 11 kya, the ANS population was largely replaced by peoples with ancestry from East Asia, giving rise to ancestral Native Americans and "Ancient Paleosiberians" (AP), represented by a 9.8 kya skeleton from Kolyma River. AP are closely related to the Siberian ancestors of Native Americans, and ancestral to contemporary communities such as Koryaks and Itelmen. Paleoclimatic modelling shows evidence for a refuge during the last glacial maximum (LGM) in southeastern Beringia, suggesting Beringia as a possible location for the admixture forming both ancestral Native Americans and AP. Between 11 and 4 kya, AP were in turn largely replaced by another group of peoples with ancestry from East Asia, the "Neosiberians" from which many contemporary Siberians derive. We detect additional gene flow events in both directions across the Bering Strait during this time, influencing the genetic composition of Inuit, as well as Na Dene-speaking Northern Native Americans, whose Siberian-related ancestry component is closely related to AP.
Our analyses reveal that the population history of northeastern Siberia was highly dynamic, starting in the Late Pleistocene and continuing well into the Late Holocene. The pattern observed in northeastern Siberia, with earlier, once widespread populations being replaced by distinct peoples, seems to have taken place across northern Eurasia, as far west as Scandinavia.

54 citations


Posted Content
19 Dec 2018-bioRxiv
TL;DR: A high-quality de novo genome assembly from a single Anopheles coluzzii mosquito is presented, which places 667 (>90%) of formerly unplaced genes into their appropriate chromosomal contexts in the AgamP4 PEST reference.
Abstract: A high-quality reference genome is a fundamental resource for functional genetics, comparative genomics, and population genomics, and is increasingly important for conservation biology. PacBio Single Molecule, Real-Time (SMRT) sequencing generates long reads with uniform coverage and high consensus accuracy, making it a powerful technology for de novo genome assembly. Improvements in throughput and concomitant reductions in cost have made PacBio an attractive core technology for many large genome initiatives, however, relatively high DNA input requirements (~5 μg for standard library protocol) have placed PacBio out of reach for many projects on small organisms that have lower DNA content, or on projects with limited input DNA for other reasons. Here we present a high-quality de novo genome assembly from a single Anopheles coluzzii mosquito. A modified SMRTbell library construction protocol without DNA shearing and size selection was used to generate a SMRTbell library from just 100 ng of starting genomic DNA. The sample was run on the Sequel System with chemistry 3.0 and software v6.0, generating, on average, 25 Gb of sequence per SMRT Cell with 20 hour movies, followed by diploid de novo genome assembly with FALCON-Unzip. The resulting curated assembly had high contiguity (contig N50 3.5 Mb) and completeness (more than 98% of conserved genes are present and full-length). In addition, this single-insect assembly now places 667 (>90%) of formerly unplaced genes into their appropriate chromosomal contexts in the AgamP4 PEST reference. We were also able to resolve maternal and paternal haplotypes for over 1/3 of the genome. By sequencing and assembling material from a single diploid individual, only two haplotypes are present, simplifying the assembly process compared to samples from multiple pooled individuals. The method presented here can be applied to samples with starting DNA amounts as low as 100 ng per 1 Gb genome size. 
This new low-input approach puts PacBio-based assemblies in reach for small highly heterozygous organisms that comprise much of the diversity of life.
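
The abstract above reports contiguity as a contig N50. As a reminder of what that statistic measures, the sketch below computes N50 under its usual definition: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembly size. The contig lengths are toy values, not data from the paper, and the function name `n50` is only illustrative.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the cumulative sum first covers at least half
    of the total assembly length.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("N50 is undefined for an empty list of contigs")


toy_contigs = [10, 8, 5, 4, 3]  # total 30, so half is 15
print(n50(toy_contigs))  # 8, because 10 + 8 = 18 >= 15
```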

Posted Content
12 Feb 2018-bioRxiv
TL;DR: A high-quality collection of genomes revealed a previously unannotated gene (Efcab3-like) encoding 5,874 amino acids, one of the largest known in the rodent lineage; Efcab3-like−/− mice exhibit severe size anomalies in four regions of the brain, suggesting a role for Efcab3-like in regulating brain development.
Abstract: The most commonly employed mammalian model organism is the laboratory mouse. A wide variety of genetically diverse inbred mouse strains, representing distinct physiological states, disease susceptibilities, and biological mechanisms have been developed over the last century. We report full length draft de novo genome assemblies for 16 of the most widely used inbred strains and reveal for the first time extensive strain-specific haplotype variation. We identify and characterise 2,567 regions on the current Genome Reference Consortium mouse reference genome exhibiting the greatest sequence diversity between strains. These regions are enriched for genes involved in defence and immunity, and exhibit enrichment of transposable elements and signatures of recent retrotransposition events. Combinations of alleles and genes unique to an individual strain are commonly observed at these loci, reflecting distinct strain phenotypes. Several immune related loci, some in previously identified QTLs for disease response have novel haplotypes not present in the reference that may explain the phenotype. We used these genomes to improve the mouse reference genome resulting in the completion of 10 new gene structures, and 62 new coding loci were added to the reference genome annotation. Notably this high quality collection of genomes revealed a previously unannotated gene (Efcab3-like) encoding 5,874 amino acids, one of the largest known in the rodent lineage. Interestingly, Efcab3-like-/- mice exhibit severe size anomalies in four regions of the brain suggesting a mechanism of Efcab3-like regulating brain development.

Posted Content
16 Mar 2018-bioRxiv
TL;DR: A new approach to detect segments of individual genomes of archaic origin without using an archaic reference genome based on a hidden Markov model that identifies genomic regions with a high density of single nucleotide variants not seen in unadmixed populations is presented.
Abstract: Human populations out of Africa have experienced at least two bouts of introgression from archaic humans, Neandertals and Denisovans. In Papuans there is prior evidence of both these introgressions. Here we present a new approach to detect segments of individual genomes of archaic origin without using an archaic reference genome. The approach is based on the detection of genomic regions with a high density of SNVs not seen in unadmixed populations. We show using simulations that this provides a powerful approach to identifying segments of archaic introgression with a small rate of false detection. Furthermore, our approach is able to accurately infer admixture proportions and divergence time of human and archaic populations. We apply the model to detect archaic introgression in 89 Papuans and show how the identified segments can be assigned to likely Neandertal or Denisovan origin. We report more Denisovan admixture than previous studies and directly find a shift in size distribution of fragments of Neandertal and Denisovan origin that is compatible with a difference in admixture time. Furthermore, we identify small amounts of Denisova ancestry in West Eurasians, South East Asians and South Asians.

Posted Content
21 Mar 2018-bioRxiv
TL;DR: The bulk of space taken up by NGS CRAM files consists of per-base quality values, offering an opportunity for space saving, and a 17-fold reduction in quality storage can be achieved while maintaining variant calling accuracy.
Abstract: Motivation: The bulk of space taken up by NGS CRAM files consists of per-base quality values. Most of these are unnecessary for variant calling, offering an opportunity for space saving. Results: On the CHM1+CHM13 test set, a 17-fold reduction in quality storage can be achieved while maintaining variant calling accuracy. Availability: Crumble is open source and can be obtained from https://github.com/jkbonfield/crumble.

Posted Content
TL;DR: In this paper, a scalable implementation of the graph extension of the positional Burrows-Wheeler transform is presented, along with an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes.
Abstract: The variation graph toolkit (VG) represents genetic variation as a graph. Each path in the graph is a potential haplotype, though most paths are unlikely recombinations of true haplotypes. We augment the VG model with haplotype information to identify which paths are more likely to be correct. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows-Wheeler transform. We demonstrate the scalability of the new implementation by indexing the 1000 Genomes Project haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes.
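
For orientation, the sketch below shows the core ordering step of the plain positional Burrows-Wheeler transform on a binary haplotype panel. It is not the graph extension described in the abstract above and not the VG implementation; it only illustrates the underlying idea that haplotypes are kept sorted by their reversed prefixes, so haplotypes sharing a long match ending at the current site sit next to each other. The function name `pbwt_order` and the toy panel are assumptions for the example.

```python
def pbwt_order(haplotypes):
    """Return haplotype indices sorted by reversed prefix over all sites.

    haplotypes: list of equal-length 0/1 lists (rows = haplotypes).
    At each site the current order is stably partitioned into carriers
    of allele 0 followed by carriers of allele 1, which is equivalent to
    a radix sort on the reversed prefixes.
    """
    order = list(range(len(haplotypes)))
    num_sites = len(haplotypes[0]) if haplotypes else 0
    for k in range(num_sites):
        zeros = [i for i in order if haplotypes[i][k] == 0]
        ones = [i for i in order if haplotypes[i][k] == 1]
        order = zeros + ones
    return order


panel = [
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]
print(pbwt_order(panel))  # [1, 3, 0, 2]
```

In the toy panel, haplotypes 0 and 2 are identical and end up adjacent in the final order, which is the property that makes haplotype matching and compression efficient in this family of data structures.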

Posted Content
10 May 2018-bioRxiv
TL;DR: This study identifies rare, deleterious SNVs in the coding sequence of genes involved in ECM adhesion that occurred in cell lines that were outliers for one or more phenotypes such as cell spreading and correlated with altered germ layer differentiation on micropatterned surfaces.
Abstract: Large cohorts of human induced pluripotent stem cells (iPSCs) from healthy donors are a potentially powerful tool for investigating the relationship between genetic variants and cellular phenotypes. Here we integrate high content imaging, gene expression and DNA sequence datasets from over 100 human iPSC lines to explore the genetic basis of inter-individual variability in cell behaviour. By applying a dimensionality reduction approach, Probabilistic Estimation of Expression Residuals (PEER), we extracted factors that captured the effects of intrinsic (genetic) and extrinsic (environmental) conditions. We identified genes that correlated in expression with intrinsic and extrinsic PEER factors and mapped outlier cell behaviour to expression of genes containing rare deleterious SNVs. Our study thus establishes a strategy for determining the genetic basis of inter-individual variability in cell behaviour.

Proceedings Article
01 Jan 2018
TL;DR: In this article, a scalable implementation of the graph extension of the positional Burrows-Wheeler transform is presented, along with an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes.
Abstract: The variation graph toolkit (VG) represents genetic variation as a graph. Each path in the graph is a potential haplotype, though most paths are unlikely recombinations of true haplotypes. We augment the VG model with haplotype information to identify which paths are more likely to be correct. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows-Wheeler transform. We demonstrate the scalability of the new implementation by indexing the 1000 Genomes Project haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes.

Posted Content
20 Mar 2018-bioRxiv
TL;DR: In this article, the authors integrated high content imaging, gene expression and DNA sequence datasets for over 100 human iPSC lines to identify the genetic basis of inter-individual variability in cell behaviour.
Abstract: Large cohorts of human iPSCs from healthy donors are potentially a powerful tool for investigating the relationship between genetic variants and cellular phenotypes. Here we integrate high content imaging, gene expression and DNA sequence datasets for over 100 human iPSC lines to identify the genetic basis of inter-individual variability in cell behaviour. By applying a dimensionality reduction approach, Probabilistic Estimation of Expression Residuals (PEER), we identified genes that correlated in expression with intrinsic (genetic) and extrinsic (ECM) factors. However, variation in mRNA levels could not account for outlier cell behaviour. Instead, we identified rare, deleterious SNVs in the coding sequence of genes involved in ECM adhesion that occurred in cell lines that were outliers for one or more phenotypes such as cell spreading. These also correlated with altered germ layer differentiation on micropatterned surfaces. Our study thus establishes a strategy for integrating genetic and cell biological measurements for high-throughput analysis.