Showing papers by "Richard Durbin" published in 2014


Journal ArticleDOI
TL;DR: Results from applying multiple sequentially Markovian coalescent (MSMC) to genome sequences from nine populations across the world suggest that the genetic separation of non-African ancestors from African Yoruban ancestors started long before 50,000 years ago and give information about human population history as recent as 2,000 years ago.
Abstract: The availability of complete human genome sequences from populations across the world has given rise to new population genetic inference methods that explicitly model ancestral relationships under recombination and mutation. So far, application of these methods to evolutionary history more recent than 20,000-30,000 years ago and to population separations has been limited. Here we present a new method that overcomes these shortcomings. The multiple sequentially Markovian coalescent (MSMC) analyzes the observed pattern of mutations in multiple individuals, focusing on the first coalescence between any two individuals. Results from applying MSMC to genome sequences from nine populations across the world suggest that the genetic separation of non-African ancestors from African Yoruban ancestors started long before 50,000 years ago and give information about human population history as recent as 2,000 years ago, including the bottleneck in the peopling of the Americas and separations within Africa, East Asia and Europe.

866 citations


Journal ArticleDOI
TL;DR: A theory of haplotype matching using suffix array ideas is developed, which should scale to much larger datasets than those currently handled by genotype algorithms, and also includes some proposals about how these approaches could be used for imputation and phasing.
Abstract: Motivation: Over the last few years, methods based on suffix arrays using the Burrows–Wheeler Transform have been widely used for DNA sequence read matching and assembly. These provide very fast search algorithms, linear in the search pattern size, on a highly compressible representation of the dataset being searched. Meanwhile, algorithmic development for genotype data has concentrated on statistical methods for phasing and imputation, based on probabilistic matching to hidden Markov model representations of the reference data, which while powerful are much less computationally efficient. Here a theory of haplotype matching using suffix array ideas is developed, which should scale to much larger datasets than those currently handled by genotype algorithms. Results: Given M sequences with N bi-allelic variable sites, an O(NM) algorithm to derive a representation of the data based on positional prefix arrays is given, which is termed the positional Burrows–Wheeler transform (PBWT). On large datasets this compresses with run-length encoding to more than a factor of a hundred smaller than using gzip on the raw data. Using this representation a method is given to find all maximal haplotype matches within the set in O(NM) time rather than O(NM²) as expected from naive pairwise comparison, and also a fast algorithm, empirically independent of M given sufficient memory for indexes, to find maximal matches between a new sequence and the set. The discussion includes some proposals about how these approaches could be used for imputation and phasing. Availability: http://github.com/richarddurbin/pbwt

394 citations
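
The positional prefix arrays described in the PBWT abstract above can be illustrated with a minimal sketch. It assumes haplotypes are given as equal-length lists of 0/1 alleles; the function names `positional_prefix_arrays` and `pbwt_columns` are hypothetical and are not taken from the pbwt repository. Each site is handled with one stable partition of the current ordering, so the total cost is O(NM), matching the bound quoted in the abstract. Run-length encoding and match finding are not shown.

```python
from typing import List, Sequence


def positional_prefix_arrays(haplotypes: Sequence[Sequence[int]]) -> List[List[int]]:
    """Return orderings a_0..a_N of haplotype indices (hypothetical helper).

    a_k lists the haplotypes sorted by their reversed prefixes, i.e. by the
    alleles at sites k-1, k-2, ..., 0. a_0 is the input order.
    """
    m = len(haplotypes)
    n = len(haplotypes[0]) if m else 0
    order = list(range(m))
    arrays = [order]
    for k in range(n):
        zeros: List[int] = []
        ones: List[int] = []
        # Stable partition by the allele at site k: haplotypes carrying 0
        # come first, and relative order within each group is preserved.
        for i in order:
            (ones if haplotypes[i][k] else zeros).append(i)
        order = zeros + ones
        arrays.append(order)
    return arrays


def pbwt_columns(haplotypes: Sequence[Sequence[int]]) -> List[List[int]]:
    """Return, for each site k, the alleles at site k read in a_k order."""
    arrays = positional_prefix_arrays(haplotypes)
    n = len(haplotypes[0]) if haplotypes else 0
    return [[haplotypes[i][k] for i in arrays[k]] for k in range(n)]


if __name__ == "__main__":
    # Four toy haplotypes over three bi-allelic sites (made-up data).
    haps = [[0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 1]]
    print(positional_prefix_arrays(haps))
    print(pbwt_columns(haps))
```

Haplotypes that share a long match ending at a site sit next to each other in that site's ordering, which tends to produce long runs of identical alleles in each reordered column and is why run-length encoding works well on this representation.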


Journal ArticleDOI
TL;DR: In this paper, the authors compared exome sequence data on 3,000 Finns to the same number of non-Finnish Europeans and discovered that the average Finn has more low-frequency loss-of-function variants and complete gene knockouts.
Abstract: Exome sequencing studies in complex diseases are challenged by the allelic heterogeneity, large number and modest effect sizes of associated variants on disease risk and the presence of large numbers of neutral variants, even in phenotypically relevant genes. Isolated populations with recent bottlenecks offer advantages for studying rare variants in complex diseases as they have deleterious variants that are present at higher frequencies as well as a substantial reduction in rare neutral variation. To explore the potential of the Finnish founder population for studying low-frequency (0.5-5%) variants in complex diseases, we compared exome sequence data on 3,000 Finns to the same number of non-Finnish Europeans and discovered that, despite having fewer variable sites overall, the average Finn has more low-frequency loss-of-function variants and complete gene knockouts. We then used several well-characterized Finnish population cohorts to study the phenotypic effects of 83 enriched loss-of-function variants across 60 phenotypes in 36,262 Finns. Using a deep set of quantitative traits collected on these cohorts, we show 5 associations (p<5×10⁻⁸) including splice variants in LPA that lowered plasma lipoprotein(a) levels (P = 1.5×10⁻¹¹⁷). Through accessing the national medical records of these participants, we evaluate the LPA finding via Mendelian randomization and confirm that these splice variants confer protection from cardiovascular disease (OR = 0.84, P = 3×10⁻⁴), demonstrating for the first time the correlation between very low levels of LPA in humans with potential therapeutic implications for cardiovascular diseases. More generally, this study articulates substantial advantages for studying the role of rare variation in complex phenotypes in founder populations like the Finns and by combining a unique population genetic history with data from large population cohorts and centralized research access to National Health Registers.

367 citations


Journal ArticleDOI
TL;DR: It is found that genome content variation, in the form of presence or absence as well as copy number of genetic material, is higher within S. cerevisiae than within S. paradoxus, despite genetic distances as measured in single-nucleotide polymorphisms being vastly smaller within the former species.
Abstract: The question of how genetic variation in a population influences phenotypic variation and evolution is of major importance in modern biology. Yet much is still unknown about the relative functional importance of different forms of genome variation and how they are shaped by evolutionary processes. Here we address these questions by population level sequencing of 42 strains from the budding yeast Saccharomyces cerevisiae and its closest relative S. paradoxus. We find that genome content variation, in the form of presence or absence as well as copy number of genetic material, is higher within S. cerevisiae than within S. paradoxus, despite genetic distances as measured in single-nucleotide polymorphisms being vastly smaller within the former species. This genome content variation, as well as loss-of-function variation in the form of premature stop codons and frameshifting indels, is heavily enriched in the subtelomeres, strongly reinforcing the relevance of these regions to functional evolution. Genes affected by these likely functional forms of variation are enriched for functions mediating interaction with the external environment (sugar transport and metabolism, flocculation, metal transport, and metabolism). Our results and analyses provide a comprehensive view of genomic diversity in budding yeast and expose surprising and pronounced differences between the variation within S. cerevisiae and that within S. paradoxus. We also believe that the sequence data and de novo assemblies will constitute a useful resource for further evolutionary and population genomics studies.

278 citations


Posted ContentDOI
21 May 2014-bioRxiv
TL;DR: Results from applying Multiple Sequentially Markovian Coalescent (MSMC) to genome sequences from nine populations across the world suggest that the genetic separation of non-African ancestors from African Yoruban ancestors started long before 50,000 years ago, and give information about human population history as recently as 2,000 years ago.
Abstract: The availability of complete human genome sequences from populations across the world has given rise to new population genetic inference methods that explicitly model their ancestral relationship under recombination and mutation. So far, application of these methods to evolutionary history more recent than 20-30 thousand years ago and to population separations has been limited. Here we present a new method that overcomes these shortcomings. The Multiple Sequentially Markovian Coalescent (MSMC) analyses the observed pattern of mutations in multiple individuals, focusing on the first coalescence between any two individuals. Results from applying MSMC to genome sequences from nine populations across the world suggest that the genetic separation of non-African ancestors from African Yoruban ancestors started long before 50,000 years ago, and give information about human population history as recently as 2,000 years ago, including the bottleneck in the peopling of the Americas, and separations within Africa, East Asia and Europe.

189 citations


Journal ArticleDOI
25 Apr 2014-eLife
TL;DR: This work identifies a candidate set of 508 variance associated SNPs from lymphoblastoid cell lines and shows that GxE plays a role in ∼70% of these associations, and investigates 57 epistatic interactions that replicated in a smaller dataset, explaining on average 4.3% of phenotypic variance.
Abstract: Every person has two copies of each gene: one is inherited from their mother and the other from their father. These two copies are often not identical because there can be many different variants of the same gene in the human population. Traits (such as height, body mass and risk of disease) vary from one person to the next—and for many traits this variation depends in part on the different gene variants that each person has inherited. Studies seeking to find the differences in DNA that can predict this variation have often assumed that the changes in DNA act on traits independently of the effect of environment and of other genetic variants. In contrast, studies with animals have shown that some genetic variants can interact to produce a bigger (or smaller) effect than would be expected from simply ‘adding together’ their individual effects—a phenomenon called epistasis. But how much does epistasis contribute to variation in human traits, if at all? This question has been much disputed, and is difficult to test, not least because of the sheer number of interactions to assess: tens of millions of changes in DNA have been observed in the human genome, and so there are many more than billions of possible combinations of these changes to investigate. Here, Brown et al. have examined the sequences of all the genes that were expressed in cells taken from a cohort of twins and searched for genetic variants that show these epistatic interactions. By studying gene expression, which can be greatly affected by small changes in the DNA code, Brown et al. were able to identify 508 variants that had a bigger than expected effect on the level of gene expression. This may be a sign that these variants act in combinations: if within one genome a variant increased expression and in another it decreased expression, then this would cause greater variation in gene expression. 
Further investigation of these 508 variants led to the discovery of 256 examples of epistasis, and 57 of these were replicated in samples from another cohort. Brown et al. calculated that these epistatic interactions explained up to 16% of the variation in gene expression. Furthermore, as well as being involved in epistatic interactions, about 70% of the genetic variants that had an effect on the variation in gene expression were also involved in interactions between genes and the environment. In addition to showing that epistasis contributes to variation in human traits, the work of Brown et al. could help to uncover interactions behind complex traits—beyond the expression level of a gene—that could not previously be investigated.

144 citations


Journal ArticleDOI
TL;DR: A novel method, TelSeq, is reported to measure average telomere length from whole genome or exome shotgun sequence data; its results correlate with Southern blot measurements of the mean length of terminal restriction fragments (mTRFs) and display age-dependent attrition comparably well to mTRFs.
Abstract: Telomeres play a key role in replicative ageing and undergo age-dependent attrition in vivo. Here, we report a novel method, TelSeq, to measure average telomere length from whole genome or exome shotgun sequence data. In 260 leukocyte samples, we show that TelSeq results correlate with Southern blot measurements of the mean length of terminal restriction fragments (mTRFs) and display age-dependent attrition comparably well to mTRFs.

140 citations


Journal ArticleDOI
TL;DR: It is demonstrated that while sites with low differentiation represent sampling effects rather than balancing selection, sites showing extremely high population differentiation are enriched for positive selection events and that one half may be the result of classic selective sweeps.
Abstract: Background: Population differentiation has proved to be effective for identifying loci under geographically localized positive selection, and has the potential to identify loci subject to balancing selection. We have previously investigated the pattern of genetic differentiation among human populations at 36.8 million genomic variants to identify sites in the genome showing high frequency differences. Here, we extend this dataset to include additional variants, survey sites with low levels of differentiation, and evaluate the extent to which highly differentiated sites are likely to result from selective or other processes. Results: We demonstrate that while sites with low differentiation represent sampling effects rather than balancing selection, sites showing extremely high population differentiation are enriched for positive selection events and that one half may be the result of classic selective sweeps. Among these, we rediscover known examples, where we actually identify the established functional SNP, and discover novel examples including the genes ABCA12, CALD1 and ZNF804, which we speculate may be linked to adaptations in skin, calcium metabolism and defense, respectively. Conclusions: We identify known and many novel candidate regions for geographically restricted positive selection, and suggest several directions for further research.

86 citations


Journal ArticleDOI
TL;DR: It is shown that a single predicted splice donor variant is responsible for association signals and is independent of known common variants, and an independent relationship between rs138326449 and high-density lipoprotein (HDL) levels is suggested.
Abstract: The analysis of rich catalogues of genetic variation from population-based sequencing provides an opportunity to screen for functional effects. Here we report a rare variant in APOC3 (rs138326449-A, minor allele frequency ~0.25% (UK)) associated with plasma triglyceride (TG) levels (−1.43 s.d. (s.e. = 0.27) per minor allele, P-value = 8.0 × 10⁻⁸) discovered in 3,202 individuals with low read-depth, whole-genome sequence. We replicate this in 12,831 participants from five additional samples of Northern and Southern European origin (−1.0 s.d. (s.e. = 0.173), P-value = 7.32 × 10⁻⁹). This is consistent with an effect between 0.5 and 1.5 mmol l⁻¹ dependent on population. We show that a single predicted splice donor variant is responsible for association signals and is independent of known common variants. Analyses suggest an independent relationship between rs138326449 and high-density lipoprotein (HDL) levels. This represents one of the first examples of a rare, large effect variant identified from whole-genome sequencing at a population scale.

65 citations


Journal ArticleDOI
TL;DR: The data reveal extensive genetic effects on CTCF binding, both direct and indirect, and identify a diversity of patterns of CTCF binding on the X chromosome.
Abstract: Associating genetic variation with quantitative measures of gene regulation offers a way to bridge the gap between genotype and complex phenotypes. In order to identify quantitative trait loci (QTLs) that influence the binding of a transcription factor in humans, we measured binding of the multifunctional transcription and chromatin factor CTCF in 51 HapMap cell lines. We identified thousands of QTLs in which genotype differences were associated with differences in CTCF binding strength, hundreds of them confirmed by directly observable allele-specific binding bias. The majority of QTLs were either within 1 kb of the CTCF binding motif, or in linkage disequilibrium with a variant within 1 kb of the motif. On the X chromosome we observed three classes of binding sites: a minority class bound only to the active copy of the X chromosome, the majority class bound to both the active and inactive X, and a small set of female-specific CTCF sites associated with two non-coding RNA genes. In sum, our data reveal extensive genetic effects on CTCF binding, both direct and indirect, and identify a diversity of patterns of CTCF binding on the X chromosome.

59 citations


Journal ArticleDOI
TL;DR: This paper describes the mechanism and the specific criteria, which must be fulfilled in order for a finding and participant to qualify for feedback, and could be used by future research consortia, and may also assist in the development of sound principles for dealing with CSFs.
Abstract: Recent advances in sequencing technology allow data on the human genome to be generated more quickly and in greater detail than ever before. Such detail includes findings that may be of significance to the health of the research participant involved. Although research studies generally do not feed back information on clinically significant findings (CSFs) to participants, this stance is increasingly being questioned. There may be difficulties and risks in feeding clinically significant information back to research participants; however, the UK10K consortium sought to address these by creating a detailed management pathway. This was not intended to create any obligation upon the researchers to feed back any CSFs they discovered. Instead, it provides a mechanism to ensure that any such findings can be passed on to the participant where appropriate. This paper describes this mechanism and the specific criteria that must be fulfilled in order for a finding and participant to qualify for feedback. This mechanism could be used by future research consortia, and may also assist in the development of sound principles for dealing with CSFs.

Journal ArticleDOI
01 Dec 2014-Genetics
TL;DR: This work extends classic theory to founder populations, giving the covariance between individuals due to epistasis of any order, and derives a recently proposed estimator of the narrow sense heritability as a corollary, and extends the variance decomposition to include dominance.
Abstract: Genetic association studies have explained only a small proportion of the estimated heritability of complex traits, leaving the remaining heritability "missing." Genetic interactions have been proposed as an explanation for this, because they lead to overestimates of the heritability and are hard to detect. Whether this explanation is true depends on the proportion of variance attributable to genetic interactions, which is difficult to measure in outbred populations. Founder populations exhibit a greater range of kinship than outbred populations, which helps in fitting the epistatic variance. We extend classic theory to founder populations, giving the covariance between individuals due to epistasis of any order. We recover the classic theory as a limit, and we derive a recently proposed estimator of the narrow sense heritability as a corollary. We extend the variance decomposition to include dominance. We show in simulations that it would be possible to estimate the variance from pairwise interactions with samples of a few thousand from strongly bottlenecked human founder populations, and we provide an analytical approximation of the standard error. Applying these methods to 46 traits measured in a yeast (Saccharomyces cerevisiae) cross, we estimate that pairwise interactions explain 10% of the phenotypic variance on average and that third- and higher-order interactions explain 14% of the phenotypic variance on average. We search for third-order interactions, discovering an interaction that is shared between two traits. Our methods will be relevant to future studies of epistatic variance in founder populations and crosses.

Posted Content
TL;DR: An efficient dynamic programming algorithm is proposed that can assign haplogroups by maximum likelihood and represent the uncertainty in assignment; applied to both genotype and low-coverage sequencing data, it is shown to assign haplogroups accurately and with high resolution.
Abstract: Low-coverage short-read resequencing experiments have the potential to expand our understanding of Y chromosome haplogroups. However, the uncertainty associated with these experiments means that haplogroups must be assigned probabilistically to avoid false inferences. We propose an efficient dynamic programming algorithm that can assign haplogroups by maximum likelihood, and represent the uncertainty in assignment. We apply this to both genotype and low-coverage sequencing data, and show that it can assign haplogroups accurately and with high resolution. The method is implemented as the program YFitter, which can be downloaded from this http URL


Journal ArticleDOI
TL;DR: In the version of this article initially published, in Table 1, Steven Salzberg should have been listed as the second, and not the last, of the creators of the Cufflinks software.
Abstract: Nat. Biotechnol. 31, 894–897 (2013); published online 8 October 2013; corrected after print 9 May 2014 In the version of this article initially published, in Table 1, Steven Salzberg should have been listed as the second, and not the last, of the creators of the Cufflinks software. The error has been corrected in the HTML and PDF versions of the article.

Posted ContentDOI
19 Oct 2014-bioRxiv
TL;DR: A model where ASE requires genetic variability in cis, a difference in the sequence of both alleles, but the magnitude of the ASE effect depends on trans genetic and environmental factors that interact with the cis genetic variants is proposed.
Abstract: Understanding the genetic architecture of gene expression is an intermediate step towards understanding the genetic architecture of complex diseases. RNA-seq technologies have improved the quantification of gene expression and allow the measurement of allelic specific expression (ASE)1-3. ASE is hypothesized to result from the direct effect of cis regulatory variants, but a proper estimation of the causes of ASE has not been performed to date. In this study we take advantage of a sample of twins to measure the relative contribution of genetic and environmental effects on ASE, and we find substantial effects of gene x gene (GxG) and gene x environment (GxE) interactions. We propose a model where ASE requires genetic variability in cis, a difference in the sequence of both alleles, but the magnitude of the ASE effect depends on trans genetic and environmental factors that interact with the cis genetic variants. We uncover large GxG and GxE effects on gene expression and likely complex phenotypes that currently remain elusive.