Journal ArticleDOI

Haplotype phasing: existing methods and new developments.

01 Oct 2011-Nature Reviews Genetics (Nature Publishing Group)-Vol. 12, Iss: 10, pp 703-714
TL;DR: This review assesses the available haplotype phasing methods, focusing in particular on statistical methods, discusses the practical aspects of their application, and describes recent developments that may transform the field.
Abstract: Determination of haplotype phase is becoming increasingly important as we enter the era of large-scale sequencing because many of its applications, such as imputing low-frequency variants and characterizing the relationship between genetic variation and disease susceptibility, are particularly relevant to sequence data. Haplotype phase can be generated through laboratory-based experimental methods, or it can be estimated using computational approaches. We assess the haplotype phasing methods that are available, focusing in particular on statistical methods, and we discuss the practical aspects of their application. We also describe recent developments that may transform this field, particularly the use of identity-by-descent for computational phasing.
Citations
Journal ArticleDOI
TL;DR: A new phasing algorithm, Eagle2, is introduced that attains high accuracy across a broad range of cohort sizes by efficiently leveraging information from large external reference panels (such as the Haplotype Reference Consortium; HRC) using a new data structure based on the positional Burrows-Wheeler transform.
Abstract: Po-Ru Loh, Alkes Price and colleagues present Eagle2, a reference-based phasing algorithm that allows for highly accurate and efficient phasing of genotypes across a broad range of cohort sizes. They demonstrate an approximately 10% improvement in accuracy and 20% improvement in speed compared to a competing method, SHAPEIT2.

1,246 citations

Journal ArticleDOI
TL;DR: The proposed method efficiently makes use of information from close and distant relatives for accurate genotype imputation and is fast, owing to its deterministic nature and, therefore, it can easily be used in large data sets where the use of other methods is impractical.
Abstract: Genotype imputation can help reduce genotyping costs, particularly for implementation of genomic selection. In applications entailing large populations, recovering the genotypes of untyped loci using information from reference individuals that were genotyped with a higher density panel is computationally challenging. Popular imputation methods are based upon the Hidden Markov model and have computational constraints due to an intensive sampling process. A fast, deterministic approach, which makes use of both family and population information, is presented here. All individuals are related and, therefore, share haplotypes which may differ in length and frequency based on their relationships. The method starts with family imputation if pedigree information is available, and then exploits close relationships by searching for long haplotype matches in the reference group using overlapping sliding windows. The search continues as the window size is shrunk in each chromosome sweep in order to capture more distant relationships. The proposed method gave higher or similar imputation accuracy than Beagle and Impute2 in cattle data sets when all available information was used. When close relatives of target individuals were present in the reference group, the method resulted in higher accuracy compared to the other two methods even when the pedigree was not used. Rare variants were also imputed with higher accuracy. Finally, computing requirements were considerably lower than those of Beagle and Impute2. The presented method took 28 minutes to impute from 6 k to 50 k genotypes for 2,000 individuals with a reference size of 64,429 individuals. The proposed method efficiently makes use of information from close and distant relatives for accurate genotype imputation. In addition to its high imputation accuracy, the method is fast, owing to its deterministic nature and, therefore, it can easily be used in large data sets where the use of other methods is impractical.

766 citations


Cites background from "Haplotype phasing: existing methods..."

  • ...Browning B, Browning S: A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals....

    [...]

  • ...Browning and Browning [14] also found that when parents were included in the reference group, phasing accuracy using population haplotype frequency information was substantially higher....

    [...]

  • ...Li L, Li Y, Browning SR, Browning BL, Slater AJ, Kong X, Aponte JL, Mooser VE, Chissoe SL, Whittaker JC, Nelson MR, Ehm MG: Performance of genotype imputation for rare variants identified in exons and flanking regions of genes....

    [...]

  • ...However, they can still capture close relationships between individuals by finding long shared haplotypes [14]....

    [...]

  • ...Browning SR, Browning BL: Haplotype phasing: existing methods and new developments....

    [...]

Journal ArticleDOI
01 Jun 2013-Genetics
TL;DR: Refined IBD allows for IBD reporting on a haplotype level, which facilitates determination of multi-individual IBD and allows for haplotype-based downstream analyses and is implemented in Beagle version 4.
Abstract: Segments of identity-by-descent (IBD) detected from high-density genetic data are useful for many applications, including long-range phase determination, phasing family data, imputation, IBD mapping, and heritability analysis in founder populations. We present Refined IBD, a new method for IBD segment detection. Refined IBD achieves both computational efficiency and highly accurate IBD segment reporting by searching for IBD in two steps. The first step (identification) uses the GERMLINE algorithm to find shared haplotypes exceeding a length threshold. The second step (refinement) evaluates candidate segments with a probabilistic approach to assess the evidence for IBD. Like GERMLINE, Refined IBD allows for IBD reporting on a haplotype level, which facilitates determination of multi-individual IBD and allows for haplotype-based downstream analyses. To investigate the properties of Refined IBD, we simulate SNP data from a model with recent superexponential population growth that is designed to match United Kingdom data. The simulation results show that Refined IBD achieves a better power/accuracy profile than fastIBD or GERMLINE. We find that a single run of Refined IBD achieves greater power than 10 runs of fastIBD. We also apply Refined IBD to SNP data for samples from the United Kingdom and from Northern Finland and describe the IBD sharing in these data sets. Refined IBD is powerful, highly accurate, and easy to use and is implemented in Beagle version 4.

524 citations


Cites background or methods or result from "Haplotype phasing: existing methods..."

  • ...…(Browning and Browning 2012), including long-range phase determination (Kong et al. 2008), phasing family data (S. R. Browning and B. L. Browning 2011), imputation (Jonsson et al. 2012), detecting signals of natural selection (Albrechtsen et al. 2009; Cai et al. 2011; Han and Abney…...

    [...]

  • ...Detectable IBD segments are ubiquitous in genome-wide SNP data from population samples (B. L. Browning and S. R. Browning 2011)....

    [...]

  • ...Here we found IBD at a rate of 0.0041 (probability that a randomly chosen pair of individuals has detectable IBD at a randomly chosen position), whereas the previous rate was 0.00035 (B. L. Browning and S. R. Browning 2011)....

    [...]

  • ...This dictionary approach was adopted by fastIBD (B. L. Browning and S. R. Browning 2011) and is used here to detect candidate IBD tracts for evaluation by the Refined IBD algorithm....

    [...]

  • ...Browning, B. L., and S. R. Browning, 2011 A fast, powerful method for detecting identity by descent....

    [...]

Journal ArticleDOI
TL;DR: SMC++ is presented, a new statistical tool capable of analyzing orders of magnitude more samples than existing methods while requiring only unphased genomes and employing a novel spline regularization scheme that greatly reduces estimation error.
Abstract: It has recently been demonstrated that inference methods based on genealogical processes with recombination can uncover past population history in unprecedented detail. However, these methods scale poorly with sample size, limiting resolution in the recent past, and they require phased genomes, which contain switch errors that can catastrophically distort the inferred history. Here we present SMC++, a new statistical tool capable of analyzing orders of magnitude more samples than existing methods while requiring only unphased genomes (its results are independent of phasing). SMC++ can jointly infer population size histories and split times in diverged populations, and it employs a novel spline regularization scheme that greatly reduces estimation error. We apply SMC++ to analyze sequence data from over a thousand human genomes in Africa and Eurasia, hundreds of genomes from a Drosophila melanogaster population in Africa, and tens of genomes from zebra finch and long-tailed finch populations in Australia.

513 citations

Journal ArticleDOI
TL;DR: This paper describes the first software natively capable of using paired-end sequencing to derive short contigs from de novo RAD data, and shows that the latest version of Stacks is highly accurate and outperforms other software in assembling and genotyping paired-end de novo data sets.
Abstract: For half a century population genetics studies have put type II restriction endonucleases to work. Now, coupled with massively-parallel, short-read sequencing, the family of RAD protocols that wields these enzymes has generated vast genetic knowledge from the natural world. Here, we describe the first software natively capable of using paired-end sequencing to derive short contigs from de novo RAD data. Stacks version 2 employs a de Bruijn graph assembler to build and connect contigs from forward and reverse reads for each de novo RAD locus, which it then uses as a reference for read alignments. The new architecture allows all the individuals in a metapopulation to be considered at the same time as each RAD locus is processed. This enables a Bayesian genotype caller to provide precise SNPs, and a robust algorithm to phase those SNPs into long haplotypes, generating RAD loci that are 400-800 bp in length. To prove its recall and precision, we tested the software with simulated data and compared reference-aligned and de novo analyses of three empirical data sets. Our study shows that the latest version of Stacks is highly accurate and outperforms other software in assembling and genotyping paired-end de novo data sets.

479 citations


Cites methods from "Haplotype phasing: existing methods..."

  • ...Stacks v2 implements a read‐based phasing approach (as opposed to statistical phasing; Browning & Browning, 2011) that relies on the co‐observation, in a given read (or read pair), of the alleles at several SNPs....

    [...]

  • ...This is in contrast with statistical phasing methods in which an individual's haplotypes are estimated in relation to a panel of haplotypes observed at the population level (Browning & Browning, 2011)....

    [...]
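
The read-based idea quoted above (phasing from the co-observation of alleles at several SNPs in one read or read pair) can be illustrated with a minimal sketch. This is not the Stacks v2 algorithm; the function name `phase_from_reads`, the read representation (a dict mapping heterozygous-site index to the observed allele, 0 or 1) and the rule of starting a new phase block on a tie are all assumptions made for illustration.

```python
def phase_from_reads(n_sites, reads):
    """Greedy read-based phasing of heterozygous sites (illustrative only).

    n_sites: number of heterozygous SNPs, indexed 0..n_sites-1 in genomic order.
    reads:   list of dicts {site_index: allele}, allele in {0, 1}; each dict
             holds the alleles one read (or read pair) shows at the sites it covers.
    Returns (hap, blocks): hap[i] is the allele placed on the first haplotype
    at site i; blocks[i] is the phase-block id of site i. Phase is only
    meaningful between sites that share a block id.
    """
    if n_sites == 0:
        return [], []
    hap = [0]        # the allele at the first site is an arbitrary choice
    blocks = [0]
    block_id = 0
    for i in range(1, n_sites):
        cis = trans = 0
        for r in reads:
            # Only reads that cover both adjacent sites are informative.
            if (i - 1) in r and i in r:
                if r[i - 1] == r[i]:
                    cis += 1
                else:
                    trans += 1
        if cis == trans:
            # No evidence, or conflicting evidence of equal weight:
            # start a new block (assumed rule for this sketch).
            block_id += 1
            hap.append(0)
        elif cis > trans:
            hap.append(hap[-1])
        else:
            hap.append(1 - hap[-1])
        blocks.append(block_id)
    return hap, blocks
```

The second haplotype is the complement of `hap` within each block. Unlike statistical phasing, nothing here uses a population panel, so sites never co-observed in a read stay in separate blocks.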

References
Journal ArticleDOI
Eric S. Lander, Lauren Linton, Bruce W. Birren, Chad Nusbaum, +245 more; Institutions (29)
15 Feb 2001-Nature
TL;DR: The results of an international collaboration to produce and make freely available a draft sequence of the human genome are reported and an initial analysis is presented, describing some of the insights that can be gleaned from the sequence.
Abstract: The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.

22,269 citations

Journal ArticleDOI
TL;DR: The GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
Abstract: Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS—the 1000 Genome pilot alone includes nearly five terabases—make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by the complexity of accessing and manipulating the data produced by these machines. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.

20,557 citations


"Haplotype phasing: existing methods..." refers methods in this paper

  • ...and the 'read-backed phasing' algorithm that is incorporated into the Genome Analysis Tool Kit softwar...

    [...]

Journal ArticleDOI
TL;DR: The main innovations of the new version of the Arlequin program include enhanced outputs in XML format, the possibility to embed graphics displaying computation results directly into output files, and the implementation of a new method to detect loci under selection from genome scans.
Abstract: We present here a new version of the Arlequin program available under three different forms: a Windows graphical version (Winarl35), a console version of Arlequin (arlecore), and a specific console version to compute summary statistics (arlsumstat). The command-line versions run under both Linux and Windows. The main innovations of the new version include enhanced outputs in XML format, the possibility to embed graphics displaying computation results directly into output files, and the implementation of a new method to detect loci under selection from genome scans. Command-line versions are designed to handle large series of files, and arlsumstat can be used to generate summary statistics from simulated data sets within an Approximate Bayesian Computation framework.

13,581 citations


"Haplotype phasing: existing methods..." refers methods in this paper

  • ...Many software implementations of the EM algorithm exist, including Arlequi...

    [...]
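
As an illustration of the EM approach to haplotype frequency estimation mentioned in the snippet above, here is a minimal sketch for two biallelic SNPs. It is not taken from Arlequin or any other package; the function name `em_two_snp_haplotypes`, the genotype coding (count of the alternate allele, 0, 1 or 2, at each SNP) and the default parameters are assumptions, and Hardy-Weinberg equilibrium is assumed throughout.

```python
def em_two_snp_haplotypes(genotypes, max_iter=200, tol=1e-12):
    """EM estimate of the frequencies of haplotypes 00, 01, 10, 11.

    genotypes: list of (g1, g2) pairs, each g the alternate-allele count
               (0, 1 or 2) at one of two biallelic SNPs, phase unknown.
    """
    if not genotypes:
        raise ValueError("need at least one genotype")
    haps = ("00", "01", "10", "11")
    freqs = {h: 0.25 for h in haps}               # uniform starting point
    alleles = {0: (0, 0), 1: (0, 1), 2: (1, 1)}   # genotype -> allele pair
    for _ in range(max_iter):
        counts = {h: 0.0 for h in haps}
        for g1, g2 in genotypes:
            if g1 == 1 and g2 == 1:
                # E-step: a double heterozygote is either 00/11 or 01/10;
                # split it by the current probability of each resolution.
                cis = freqs["00"] * freqs["11"]
                trans = freqs["01"] * freqs["10"]
                w = cis / (cis + trans) if cis + trans > 0 else 0.5
                counts["00"] += w
                counts["11"] += w
                counts["01"] += 1.0 - w
                counts["10"] += 1.0 - w
            else:
                # At most one heterozygous site: the phase is unambiguous.
                a, b = alleles[g1], alleles[g2]
                counts[f"{a[0]}{b[0]}"] += 1.0
                counts[f"{a[1]}{b[1]}"] += 1.0
        # M-step: new frequencies are the expected haplotype proportions.
        total = 2.0 * len(genotypes)
        new = {h: counts[h] / total for h in haps}
        delta = max(abs(new[h] - freqs[h]) for h in haps)
        freqs = new
        if delta < tol:
            break
    return freqs
```

For example, calling `em_two_snp_haplotypes([(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2)` assigns the two double heterozygotes almost entirely to the 00/11 resolution, because the unambiguous individuals carry only those haplotypes.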

Journal ArticleDOI
Paul Burton, David Clayton, Lon R. Cardon, Nicholas John Craddock, +192 more; Institutions (4)
07 Jun 2007-Nature
TL;DR: This study has demonstrated that careful use of a shared control group represents a safe and effective approach to GWA analyses of multiple disease phenotypes; generated a genome-wide genotype database for future studies of common diseases in the British population; and shown that, provided individuals with non-European ancestry are excluded, the extent of population stratification in the British population is generally modest.
Abstract: There is increasing evidence that genome-wide association (GWA) studies represent a powerful approach to the identification of genes involved in common human diseases. We describe a joint GWA study (using the Affymetrix GeneChip 500K Mapping Array Set) undertaken in the British population, which has examined ~2,000 individuals for each of 7 major diseases and a shared set of ~3,000 controls. Case-control comparisons identified 24 independent association signals at P < 5 × 10⁻⁷: 1 in bipolar disorder, 1 in coronary artery disease, 9 in Crohn's disease, 3 in rheumatoid arthritis, 7 in type 1 diabetes and 3 in type 2 diabetes. On the basis of prior findings and replication studies thus far completed, almost all of these signals reflect genuine susceptibility effects. We observed association at many previously identified loci, and found compelling evidence that some loci confer risk for more than one of the diseases studied. Across all diseases, we identified a large number of further signals (including 58 loci with single-point P values between 10⁻⁵ and 5 × 10⁻⁷) likely to yield additional susceptibility loci. The importance of appropriately large samples was confirmed by the modest effect sizes observed at most loci identified. This study thus represents a thorough validation of the GWA approach. It has also demonstrated that careful use of a shared control group represents a safe and effective approach to GWA analyses of multiple disease phenotypes; has generated a genome-wide genotype database for future studies of common diseases in the British population; and shown that, provided individuals with non-European ancestry are excluded, the extent of population stratification in the British population is generally modest. Our findings offer new avenues for exploring the pathophysiology of these important disorders. We anticipate that our data, results and software, which will be widely available to other investigators, will provide a powerful resource for human genetics research.

9,244 citations