
Journal Article

PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses

01 Sep 2007-American Journal of Human Genetics (Elsevier)-Vol. 81, Iss: 3, pp 559-575


Citations

Journal Article
TL;DR: The main innovations of the new version of the Arlequin program include enhanced outputs in XML format, the possibility to embed graphics displaying computation results directly into output files, and the implementation of a new method to detect loci under selection from genome scans.
Abstract: We present here a new version of the Arlequin program available under three different forms: a Windows graphical version (Winarl35), a console version of Arlequin (arlecore), and a specific console version to compute summary statistics (arlsumstat). The command-line versions run under both Linux and Windows. The main innovations of the new version include enhanced outputs in XML format, the possibility to embed graphics displaying computation results directly into output files, and the implementation of a new method to detect loci under selection from genome scans. Command-line versions are designed to handle large series of files, and arlsumstat can be used to generate summary statistics from simulated data sets within an Approximate Bayesian Computation framework.

11,882 citations


Cites methods from "PLINK: A Tool Set for Whole-Genome ..."

  • ...Some software packages (e.g. plink Purcell et al. 2007) have been specifically developed to both handle such huge data sets and to directly perform statistical analyses on the data....



Journal Article
TL;DR: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility, and for the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
Abstract: Background: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1’s primary data format. Findings: To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, $O(\sqrt{n})$-time/constant-space Hardy-Weinberg equilibrium and Fisher’s exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0). Conclusions: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.

4,519 citations


Cites background from "PLINK: A Tool Set for Whole-Genome ..."

  • ...9’s core functional domains are unchanged from that of its predecessor—data management, summary statistics, population stratification, association analysis, identity-by-descent estimation [1] —and it is usable as a drop-in replacement in most cases, requiring no changes to existing scripts....
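
The PLINK 1.9 abstract above mentions bit-level parallelism. As a rough illustration of the general idea only, the sketch below packs genotypes two bits each into one integer and counts alternate alleles with two masked population counts instead of a per-sample loop. The encoding (0, 1, or 2 alternate-allele copies stored as plain binary) and the function names are assumptions made for this example; they are not PLINK's file format or code.

```python
def pack_genotypes(genotypes):
    """Pack genotype codes (0, 1, or 2 alternate-allele copies), two bits each, into one int.

    Hypothetical encoding chosen for illustration; it is not PLINK's on-disk format.
    """
    word = 0
    for i, g in enumerate(genotypes):
        if g not in (0, 1, 2):
            raise ValueError("genotype codes must be 0, 1, or 2")
        word |= g << (2 * i)
    return word


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def alt_allele_count(word, n):
    """Count alternate alleles across n packed genotypes with two masked popcounts."""
    low_mask = int("01" * n, 2) if n else 0  # low bit of every 2-bit field
    high_mask = low_mask << 1                # high bit of every 2-bit field
    # A set low bit contributes one allele copy, a set high bit contributes two.
    return popcount(word & low_mask) + 2 * popcount(word & high_mask)


genotypes = [0, 1, 2, 2, 1, 0, 0, 1]
packed = pack_genotypes(genotypes)
total = alt_allele_count(packed, len(genotypes))
```

Here `total` equals `sum(genotypes)`, but it is obtained from whole-word bit operations, which is what lets compiled implementations process many samples per machine instruction.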



Journal Article
TL;DR: The GCTA software is a versatile tool to estimate and partition complex trait variation with large GWAS data sets and focuses on the function of estimating the variance explained by all the SNPs on the X chromosome and testing the hypotheses of dosage compensation.
Abstract: For most human complex diseases and traits, SNPs identified by genome-wide association studies (GWAS) explain only a small fraction of the heritability. Here we report a user-friendly software tool called genome-wide complex trait analysis (GCTA), which was developed based on a method we recently developed to address the “missing heritability” problem. GCTA estimates the variance explained by all the SNPs on a chromosome or on the whole genome for a complex trait rather than testing the association of any particular SNP to the trait. We introduce GCTA's five main functions: data management, estimation of the genetic relationships from SNPs, mixed linear model analysis of variance explained by the SNPs, estimation of the linkage disequilibrium structure, and GWAS simulation. We focus on the function of estimating the variance explained by all the SNPs on the X chromosome and testing the hypotheses of dosage compensation. The GCTA software is a versatile tool to estimate and partition complex trait variation with large GWAS data sets.

4,497 citations


Cites methods from "PLINK: A Tool Set for Whole-Genome ..."

  • ...Two estimates have been used: one based on the variance of additive genetic values (diagonal of the SNP-derived GRM) and the other based on SNP homozygosity (implemented in PLINK) (25). Let $(1 - p_i)^2 + p_i(1 - p_i)F$, $2p_i(1 - p_i)(1 - F)$, and $p_i^2 + p_i(1 - p_i)F$ be the frequencies of the three genotypes of a SNP $i$ and let $h_i = 2p_i(1 - p_i)$....
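
The genotype frequencies quoted in the excerpt above can be checked numerically. The sketch below evaluates the three expressions for an allele frequency p and inbreeding coefficient F, and recovers F from the heterozygote frequency by rearranging the middle term. The function names are hypothetical, and the rearrangement is plain algebra on the quoted formula rather than the estimator used by GCTA or PLINK.

```python
def genotype_frequencies(p, F):
    """Return the three genotype frequencies, in the order given in the excerpt.

    p is the allele frequency at the SNP and F the inbreeding coefficient.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    q = 1.0 - p
    hom_q = q * q + p * q * F        # (1 - p)^2 + p(1 - p)F
    het = 2.0 * p * q * (1.0 - F)    # 2p(1 - p)(1 - F)
    hom_p = p * p + p * q * F        # p^2 + p(1 - p)F
    return hom_q, het, hom_p


def inbreeding_from_heterozygosity(observed_het, p):
    """Solve het = h(1 - F) for F, where h = 2p(1 - p)."""
    h = 2.0 * p * (1.0 - p)
    if h == 0.0:
        raise ValueError("SNP must be polymorphic (0 < p < 1)")
    return 1.0 - observed_het / h


hom_q, het, hom_p = genotype_frequencies(0.3, 0.1)
recovered_F = inbreeding_from_heterozygosity(het, 0.3)
```

The three frequencies sum to one for any valid p and F, and setting F = 0 gives the usual Hardy-Weinberg proportions.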



Journal Article
18 Oct 2007-Nature
TL;DR: The Phase II HapMap is described, which characterizes over 3.1 million human single nucleotide polymorphisms genotyped in 270 individuals from four geographically diverse populations and includes 25–35% of common SNP variation in the populations surveyed, and increased differentiation at non-synonymous, compared to synonymous, SNPs is demonstrated.
Abstract: We describe the Phase II HapMap, which characterizes over 3.1 million human single nucleotide polymorphisms (SNPs) genotyped in 270 individuals from four geographically diverse populations and includes 25-35% of common SNP variation in the populations surveyed. The map is estimated to capture untyped common variation with an average maximum $r^2$ of between 0.9 and 0.96 depending on population. We demonstrate that the current generation of commercial genome-wide genotyping products captures common Phase II SNPs with an average maximum $r^2$ of up to 0.8 in African and up to 0.95 in non-African populations, and that potential gains in power in association studies can be obtained through imputation. These data also reveal novel aspects of the structure of linkage disequilibrium. We show that 10-30% of pairs of individuals within a population share at least one region of extended genetic identity arising from recent ancestry and that up to 1% of all common variants are untaggable, primarily because they lie within recombination hotspots. We show that recombination rates vary systematically around genes and between genes of different function. Finally, we demonstrate increased differentiation at non-synonymous, compared to synonymous, SNPs, resulting from systematic differences in the strength or efficacy of natural selection between populations.

4,408 citations


Journal Article
Shaun Purcell, Naomi R. Wray, Jennifer Stone, Peter M. Visscher, Michael Conlon O'Donovan, Patrick F. Sullivan, Pamela Sklar, Douglas M. Ruderfer, Andrew McQuillin, Derek W. Morris, Colm O'Dushlaine, Aiden Corvin, Peter Holmans, Stuart MacGregor, Hugh Gurling, Douglas Blackwood, Nicholas John Craddock, Michael Gill, Christina M. Hultman, George Kirov, Paul Lichtenstein, Walter J. Muir, Michael John Owen, Carlos N. Pato, Edward M. Scolnick, David St Clair, Nigel Williams, Lyudmila Georgieva, Ivan Nikolov, Nadine Norton, Hywel Williams, Draga Toncheva, Vihra Milanova, Emma Flordal Thelander, Patrick Sullivan, Elaine Kenny, Emma M. Quinn, Khalid Choudhury, Susmita Datta, Jonathan Pimm, Srinivasa Thirumalai, Vinay Puri, Robert Krasucki, Jacob Lawrence, Digby Quested, Nicholas Bass, Caroline Crombie, Gillian Fraser, Soh Leh Kuan, Nicholas Walker, Kevin A. McGhee, Ben S. Pickard, P. Malloy, Alan W Maclean, Margaret Van Beck, Michele T. Pato, Helena Medeiros, Frank A. Middleton, Célia Barreto Carvalho, Christopher P. Morley, Ayman H. Fanous, David V. Conti, James A. Knowles, Carlos Ferreira, António Macedo, M. Helena Azevedo, Andrew Kirby, Manuel A. R. Ferreira, Mark J. Daly, Kimberly Chambert, Finny G Kuruvilla, Stacey Gabriel, Kristin G. Ardlie, Jennifer L. Moran
06 Aug 2009-Nature
TL;DR: Using two analytic approaches, this genome-wide association study shows the extent to which common genetic variation underlies the risk of schizophrenia: it implicates the major histocompatibility complex and provides evidence for a substantial polygenic component involving thousands of common alleles of very small effect.
Abstract: Schizophrenia is a severe mental disorder with a lifetime risk of about 1%, characterized by hallucinations, delusions and cognitive deficits, with heritability estimated at up to 80% (1,2). We performed a genome-wide association study of 3,322 European individuals with schizophrenia and 3,587 controls. Here we show, using two analytic approaches, the extent to which common genetic variation underlies the risk of schizophrenia. First, we implicate the major histocompatibility complex. Second, we provide molecular genetic evidence for a substantial polygenic component to the risk of schizophrenia involving thousands of common alleles of very small effect. We show that this component also contributes to the risk of bipolar disorder, but not to several non-psychiatric diseases.

4,174 citations


References

Journal Article
Abstract: Summary: The common approach to the multiplicity problem calls for controlling the familywise error rate (FWER). This approach, though, has faults, and we point out a few. A different approach to problems of multiple significance testing is presented. It calls for controlling the expected proportion of falsely rejected hypotheses, the false discovery rate. This error rate is equivalent to the FWER when all hypotheses are true but is smaller otherwise. Therefore, in problems where the control of the false discovery rate rather than that of the FWER is desired, there is potential for a gain in power. A simple sequential Bonferroni-type procedure is proved to control the false discovery rate for independent test statistics, and a simulation study shows that the gain in power is substantial. The use of the new procedure and the appropriateness of the criterion are illustrated with examples.

71,936 citations


"PLINK: A Tool Set for Whole-Genome ..." refers methods in this paper

  • ...multiple-test corrections are also available, including those based on Bonferroni correction and false-discovery rate (31). IBD estimation....
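
The two families of correction named in the excerpt above can be sketched in a few lines. The code below is a minimal illustration, assuming a plain list of p-values: a Bonferroni adjustment, and the step-up procedure usually attributed to Benjamini and Hochberg for controlling the false discovery rate at level q. The function names and the example p-values are made up for this sketch and do not reflect PLINK's implementation or output format.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where the hypothesis is rejected at FDR level q.

    Step-up rule: find the largest rank k with p_(k) <= k * q / m,
    then reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            largest_k = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_k:
            rejected[idx] = True
    return rejected


# Made-up p-values for illustration only.
p_values = [0.01, 0.045, 0.02, 0.005, 0.5]
adjusted = bonferroni(p_values)
rejected = benjamini_hochberg(p_values, q=0.05)
```

With these inputs the step-up rule rejects the three smallest p-values, more than a Bonferroni threshold at the same nominal level would, which is the gain in power the referenced abstract describes.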



Journal Article
01 Jun 2000-Genetics
Abstract: We describe a model-based clustering method for using multilocus genotype data to infer population structure and assign individuals to populations. We assume a model in which there are K populations (where K may be unknown), each of which is characterized by a set of allele frequencies at each locus. Individuals in the sample are assigned (probabilistically) to populations, or jointly to two or more populations if their genotypes indicate that they are admixed. Our model does not assume a particular mutation process, and it can be applied to most of the commonly used genetic markers, provided that they are not closely linked. Applications of our method include demonstrating the presence of population structure, assigning individuals to populations, studying hybrid zones, and identifying migrants and admixed individuals. We show that the method can produce highly accurate assignments using modest numbers of loci, e.g., seven microsatellite loci in an example using genotype data from an endangered bird species. The software used for this article is available from http://www.stats.ox.ac.uk/~pritch/home.html.

25,033 citations


Book
01 Jan 1981
Abstract: Preface. Preface to the Second Edition. Preface to the First Edition. 1. An Introduction to Applied Probability. 2. Statistical Inference for a Single Proportion. 3. Assessing Significance in a Fourfold Table. 4. Determining Sample Sizes Needed to Detect a Difference Between Two Proportions. 5. How to Randomize. 6. Comparative Studies: Cross-Sectional, Naturalistic, or Multinomial Sampling. 7. Comparative Studies: Prospective and Retrospective Sampling. 8. Randomized Controlled Trials. 9. The Comparison of Proportions from Several Independent Samples. 10. Combining Evidence from Fourfold Tables. 11. Logistic Regression. 12. Poisson Regression. 13. Analysis of Data from Matched Samples. 14. Regression Models for Matched Samples. 15. Analysis of Correlated Binary Data. 16. Missing Data. 17. Misclassification Errors: Effects, Control, and Adjustment. 18. The Measurement of Interrater Agreement. 19. The Standardization of Rates. Appendix A. Numerical Tables. Appendix B. The Basic Theory of Maximum Likelihood Estimation. Appendix C. Answers to Selected Problems. Author Index. Subject Index.

16,098 citations


Journal Article
TL;DR: Haploview is a software package that provides computation of linkage disequilibrium statistics and population haplotype patterns from primary genotype data in a visually appealing and interactive interface.
Abstract: Summary: Research over the last few years has revealed significant haplotype structure in the human genome. The characterization of these patterns, particularly in the context of medical genetic association studies, is becoming a routine research activity. Haploview is a software package that provides computation of linkage disequilibrium statistics and population haplotype patterns from primary genotype data in a visually appealing and interactive interface. Availability: http://www.broad.mit.edu/mpg/haploview/ Contact: jcbarret@broad.mit.edu

13,185 citations
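
The Haploview entry above refers to linkage disequilibrium statistics. As a small worked example, the sketch below computes the standard pairwise measures D, |D'|, and r^2 for two biallelic loci from four haplotype counts. It assumes phased haplotype counts are already available; tools that start from unphased genotype data must first estimate haplotype frequencies, which this sketch does not attempt. The function name and the counts are hypothetical.

```python
def ld_statistics(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, |D'|, r^2) for two biallelic loci given the four haplotype counts."""
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("at least one haplotype is required")
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total   # frequency of allele A at the first locus
    p_B = (n_AB + n_aB) / total   # frequency of allele B at the second locus

    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")

    D = p_AB - p_A * p_B
    r2 = D * D / denom

    # D' rescales D by its largest attainable magnitude given the allele frequencies.
    if D >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(D) / d_max
    return D, d_prime, r2


# Made-up counts: only AB and ab haplotypes observed, i.e. complete LD.
D, d_prime, r2 = ld_statistics(50, 0, 0, 50)
```

For the counts shown, both |D'| and r^2 equal 1; equal counts of all four haplotypes would instead give D = 0 and r^2 = 0.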