FaST linear mixed models for genome-wide association studies.

doi:10.1038/NMETH.1681


Christoph Lippert¹, Jennifer Listgarten², Ying Liu², Carl M. Kadie², Robert I. Davidson², David Heckerman²

Max Planck Society¹, Microsoft²

01 Oct 2011-Nature Methods (Nature Pub. Group)-Vol. 8, Iss: 10, pp 833-835

TL;DR: This work describes factored spectrally transformed linear mixed models (FaST-LMM), an algorithm for genome-wide association studies (GWAS) that scales linearly with cohort size in both run time and memory use.


Abstract: We describe factored spectrally transformed linear mixed models (FaST-LMM), an algorithm for genome-wide association studies (GWAS) that scales linearly with cohort size in both run time and memory use. On Wellcome Trust data for 15,000 individuals, FaST-LMM ran an order of magnitude faster than current efficient algorithms. Our algorithm can analyze data for 120,000 individuals in just a few hours, whereas current algorithms fail on data for even 20,000 individuals (http://mscompbio.codeplex.com/).
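The "spectrally transformed" idea in the abstract can be illustrated with a minimal sketch. It assumes the common parametrization y ~ N(Xβ, σg²(K + δI)), which this listing does not state; the function names, `sigma_g2`, `delta`, and the toy data are hypothetical. Rotating by the eigenvectors of the similarity matrix K makes the covariance diagonal, so after one eigendecomposition each likelihood evaluation needs only elementwise work. This is not the authors' implementation.

```python
import numpy as np

def lmm_loglik_direct(y, X, beta, K, sigma_g2, delta):
    """Log-likelihood of y ~ N(X beta, sigma_g2 * (K + delta * I)), computed densely."""
    n = y.shape[0]
    cov = sigma_g2 * (K + delta * np.eye(n))
    resid = y - X @ beta
    _, logdet = np.linalg.slogdet(cov)
    quad = resid @ np.linalg.solve(cov, resid)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

def lmm_loglik_rotated(y, X, beta, eigvals, eigvecs, sigma_g2, delta):
    """Same log-likelihood after rotating the residual by the eigenvectors of K."""
    n = y.shape[0]
    resid_rot = eigvecs.T @ (y - X @ beta)
    var = sigma_g2 * (eigvals + delta)  # diagonal covariance in the rotated basis
    return -0.5 * (n * np.log(2 * np.pi) + np.sum(np.log(var)) + np.sum(resid_rot**2 / var))

# Toy data (hypothetical): 50 individuals, 200 SNP-like columns.
rng = np.random.default_rng(0)
n, s = 50, 200
W = rng.standard_normal((n, s)) / np.sqrt(s)
K = W @ W.T
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta = np.array([0.5, -1.0])
y = rng.standard_normal(n)

eigvals, eigvecs = np.linalg.eigh(K)  # computed once, reusable for every tested SNP
ll_direct = lmm_loglik_direct(y, X, beta, K, 1.3, 0.7)
ll_rotated = lmm_loglik_rotated(y, X, beta, eigvals, eigvecs, 1.3, 0.7)
```

The two values agree to numerical precision. The rotated form avoids a dense solve per evaluation, which is the kind of saving the abstract refers to.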


Citations


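Changes in var gene switching rates in Plasmodium cause antigenic variation [English] / Paul H, Robert P, Christodoulou Z, et al // Proc Natl Acad Sci U S A

Ning Beifang (宁北芳), Zhu Huaimin (朱淮民)

28 Jul 2005

TL;DR: PfEMP1 interacts with one or more receptors on infected erythrocytes, dendritic cells, and the placenta, and plays a key role in adhesion and immune evasion.

Abstract: Antigenic variation allows many pathogenic microorganisms to evade the host immune response. Plasmodium falciparum erythrocyte membrane protein 1 (PfEMP1), expressed on the surface of infected erythrocytes, interacts with one or more receptors on infected erythrocytes, endothelial cells, dendritic cells, and the placenta, and plays a key role in adhesion and immune evasion. The var gene family of each haploid genome encodes about 60 members, and initiating transcription of different var gene variants provides the molecular basis for antigenic variation.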

18,940 citations

Journal Article

LD score regression distinguishes confounding from polygenicity in genome-wide association studies

Brendan Bulik-Sullivan¹, Po-Ru Loh¹, Hilary K. Finucane¹, Stephan Ripke¹, Jian Yang², Nick Patterson³, Mark J. Daly¹, Alkes L. Price¹, Benjamin M. Neale¹

Harvard University¹, University of Queensland², Broad Institute³

02 Feb 2015-Nature Genetics

TL;DR: It is found that polygenicity accounts for the majority of the inflation in test statistics in many GWAS of large sample size, and the LD Score regression intercept can be used to estimate a more powerful and accurate correction factor than genomic control.


Abstract: Both polygenicity (many small genetic effects) and confounding biases, such as cryptic relatedness and population stratification, can yield an inflated distribution of test statistics in genome-wide association studies (GWAS). However, current methods cannot distinguish between inflation from a true polygenic signal and bias. We have developed an approach, LD Score regression, that quantifies the contribution of each by examining the relationship between test statistics and linkage disequilibrium (LD). The LD Score regression intercept can be used to estimate a more powerful and accurate correction factor than genomic control. We find strong evidence that polygenicity accounts for the majority of the inflation in test statistics in many GWAS of large sample size.
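The regression described in this abstract can be sketched as a straight-line fit of test statistics on LD score, reading off the intercept. This is a simplified illustration only. It uses unweighted least squares on toy values with small Gaussian noise rather than real χ² statistics, and `ld_score_intercept`, `true_slope`, and `true_intercept` are hypothetical names and values, not taken from the paper or its software.

```python
import numpy as np

def ld_score_intercept(chi2, ld_scores):
    """Unweighted least-squares fit of chi2 on LD score; returns (slope, intercept)."""
    design = np.column_stack([ld_scores, np.ones_like(ld_scores)])
    (slope, intercept), *_ = np.linalg.lstsq(design, chi2, rcond=None)
    return slope, intercept

# Toy data (hypothetical): statistics rise linearly with LD score (polygenic signal),
# while a constant offset above 1 stands in for confounding.
rng = np.random.default_rng(0)
ld_scores = rng.uniform(1.0, 200.0, size=5000)
true_slope, true_intercept = 0.004, 1.05
chi2 = true_intercept + true_slope * ld_scores + rng.normal(0.0, 0.1, size=ld_scores.size)

slope, intercept = ld_score_intercept(chi2, ld_scores)
```

In this toy setting the fitted intercept recovers the constant offset, while the slope absorbs the part of the inflation that grows with LD, which is the distinction the abstract draws between bias and polygenic signal.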


3,708 citations

Journal Article

10 Years of GWAS Discovery: Biology, Function, and Translation

Peter M. Visscher¹, Naomi R. Wray¹, Qian Zhang¹, Pamela Sklar², Mark I. McCarthy³, Matthew A. Brown⁴, Jian Yang¹

University of Queensland¹, Icahn School of Medicine at Mount Sinai², Wellcome Trust Centre for Human Genetics³, Queensland University of Technology⁴

06 Jul 2017-American Journal of Human Genetics

TL;DR: Reviews the remarkable range of discoveries that GWASs have facilitated in population and complex-trait genetics, the biology of diseases, and translation toward new therapeutics.

Abstract: Application of the experimental design of genome-wide association studies (GWASs) is now 10 years old (young), and here we review the remarkable range of discoveries it has facilitated in population and complex-trait genetics, the biology of diseases, and translation toward new therapeutics. We predict the likely discoveries in the next 10 years, when GWASs will be based on millions of samples with array data imputed to a large fully sequenced reference panel and on hundreds of thousands of samples with whole-genome sequencing data.


2,669 citations

Journal Article

Genome-wide efficient mixed-model analysis for association studies.

Xiang Zhou¹, Matthew Stephens¹

University of Chicago¹

01 Jul 2012-Nature Genetics

TL;DR: This method is approximately n times faster than the widely used exact method known as efficient mixed-model association (EMMA), where n is the sample size, making exact genome-wide association analysis computationally practical for large numbers of individuals.


Abstract: Linear mixed models have attracted considerable attention recently as a powerful and effective tool for accounting for population stratification and relatedness in genetic association tests. However, existing methods for exact computation of standard test statistics are computationally impractical for even moderate-sized genome-wide association studies. To address this issue, several approximate methods have been proposed. Here, we present an efficient exact method, which we refer to as genome-wide efficient mixed-model association (GEMMA), that makes approximations unnecessary in many contexts. This method is approximately n times faster than the widely used exact method known as efficient mixed-model association (EMMA), where n is the sample size, making exact genome-wide association analysis computationally practical for large numbers of individuals.


2,334 citations

Journal Article

Efficient Bayesian mixed-model analysis increases association power in large cohorts

Po-Ru Loh¹, George Tucker¹, Brendan Bulik-Sullivan¹, Bjarni J. Vilhjálmsson¹,², Hilary K. Finucane³, Rany M. Salem⁴, Daniel I. Chasman⁵, Paul M. Ridker⁵, Benjamin M. Neale¹,², Bonnie Berger³, Nick Patterson², Alkes L. Price¹

Harvard University¹, Broad Institute², Massachusetts Institute of Technology³, Boston Children's Hospital⁴, Brigham and Women's Hospital⁵

01 Mar 2015-Nature Genetics

TL;DR: BOLT-LMM is presented, which requires only a small number of O(MN) time iterations and increases power by modeling more realistic, non-infinitesimal genetic architectures via a Bayesian mixture prior on marker effect sizes.


Abstract: Linear mixed models are a powerful statistical tool for identifying genetic associations and avoiding confounding. However, existing methods are computationally intractable in large cohorts and may not optimize power. All existing methods require time cost O(MN²) (where N is the number of samples and M is the number of SNPs) and implicitly assume an infinitesimal genetic architecture in which effect sizes are normally distributed, which can limit power. Here we present a far more efficient mixed-model association method, BOLT-LMM, which requires only a small number of O(MN) time iterations and increases power by modeling more realistic, non-infinitesimal genetic architectures via a Bayesian mixture prior on marker effect sizes. We applied BOLT-LMM to 9 quantitative traits in 23,294 samples from the Women's Genome Health Study (WGHS) and observed significant increases in power, consistent with simulations. Theory and simulations show that the boost in power increases with cohort size, making BOLT-LMM appealing for genome-wide association studies in large cohorts.

1,232 citations

Cites background from "FaST linear mixed models for genome..."

...The first perspective is motivated by the observation that in human genetics applications, the denominator of the prospective statistic in equation (5), xtest'Vxtest, is nearly independent of the SNP xtest Loh et al....
[...]


References


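Changes in var gene switching rates in Plasmodium cause antigenic variation [English] / Paul H, Robert P, Christodoulou Z, et al // Proc Natl Acad Sci U S A

Ning Beifang (宁北芳), Zhu Huaimin (朱淮民)

28 Jul 2005

TL;DR: PfEMP1 interacts with one or more receptors on infected erythrocytes, dendritic cells, and the placenta, and plays a key role in adhesion and immune evasion.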

18,940 citations

Journal Article

Principal components analysis corrects for stratification in genome-wide association studies

Alkes L. Price¹,², Nick Patterson¹, Robert M. Plenge¹,³, Michael E. Weinblatt³, Nancy A. Shadick³, David Reich¹,²

Broad Institute¹, Harvard University², Brigham and Women's Hospital³

23 Jul 2006-Nature Genetics

TL;DR: This work describes a method that enables explicit detection and correction of population stratification on a genome-wide scale and uses principal components analysis to explicitly model ancestry differences between cases and controls.


Abstract: Population stratification (allele frequency differences between cases and controls due to systematic ancestry differences) can cause spurious associations in disease studies. We describe a method that enables explicit detection and correction of population stratification on a genome-wide scale. Our method uses principal components analysis to explicitly model ancestry differences between cases and controls. The resulting correction is specific to a candidate marker's variation in frequency across ancestral populations, minimizing spurious associations while maximizing power to detect true associations. Our simple, efficient approach can easily be applied to disease studies with hundreds of thousands of markers.

Population stratification can cause spurious associations in disease studies [1-8]. Because the effects of stratification vary in proportion to the number of samples [9], stratification will be an increasing problem in the large-scale association studies of the future, which will analyze thousands of samples in an effort to detect common genetic variants of weak effect. The two prevailing methods for dealing with stratification are genomic control and structured association [9-14]. Although genomic control and structured association have proven useful in a variety of contexts, they have limitations. Genomic control corrects for stratification by adjusting association statistics at each marker by a uniform overall inflation factor. However, some markers differ in their allele frequencies across ancestral populations more than others. Thus, the uniform adjustment applied by genomic control may be insufficient at markers having unusually strong differentiation across ancestral populations and may be superfluous at markers devoid of such differentiation, leading to a loss in power.

Structured association uses a program such as STRUCTURE [15] to assign the samples to discrete subpopulation clusters and then aggregates evidence of association within each cluster. If fractional membership in more than one cluster is allowed, the method cannot currently be applied to genome-wide association studies because of its intensive computational cost on large data sets. Furthermore, assignments of individuals to clusters are highly sensitive to the number of clusters, which is not well defined [14, 16].
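The principal-components correction described above can be sketched as follows. This is a minimal illustration on simulated data, not the authors' software. The helper names (`top_pcs`, `residualize`, `adjusted_assoc_stat`), the statistic (n − k − 1)·r², and all simulation parameters are assumptions made for this example, and degenerate inputs such as a constant genotype are not handled.

```python
import numpy as np

def top_pcs(G, k):
    """Top-k axes of variation across samples; G is samples x SNPs."""
    Z = G - G.mean(axis=0)
    sd = Z.std(axis=0)
    Z = Z[:, sd > 0] / sd[sd > 0]           # drop monomorphic SNPs, scale the rest
    U, _, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :k]                          # orthonormal columns

def residualize(v, pcs):
    """Remove the part of v that lies in the span of the PCs."""
    return v - pcs @ (pcs.T @ v)

def adjusted_assoc_stat(genotype, phenotype, pcs):
    """Squared correlation of ancestry-adjusted genotype and phenotype, scaled by n - k - 1."""
    g = residualize(genotype - genotype.mean(), pcs)
    p = residualize(phenotype - phenotype.mean(), pcs)
    r = (g @ p) / np.sqrt((g @ g) * (p @ p))
    n, k = pcs.shape
    return (n - k - 1) * r**2

# Simulated cohort (hypothetical): two populations with differing allele frequencies.
rng = np.random.default_rng(0)
n_per_pop, m = 200, 1000
pop = np.repeat([0, 1], n_per_pop)
freq_a = rng.uniform(0.1, 0.9, size=m)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.2, size=m), 0.05, 0.95)
freqs = np.where(pop[:, None] == 0, freq_a, freq_b)    # samples x SNPs
G = rng.binomial(2, freqs).astype(float)

# The phenotype depends on population only; the test SNP is differentiated but not causal.
phenotype = pop + rng.standard_normal(pop.size)
test_snp = rng.binomial(2, np.where(pop == 0, 0.2, 0.8)).astype(float)

pcs = top_pcs(G, k=1)
no_pcs = np.empty((pop.size, 0))                        # zero PCs: no adjustment
unadjusted = adjusted_assoc_stat(test_snp, phenotype, no_pcs)
adjusted = adjusted_assoc_stat(test_snp, phenotype, pcs)
```

Because the adjustment is applied per marker, a strongly differentiated SNP like `test_snp` loses most of its spurious signal, whereas a uniform inflation factor would shrink every marker by the same amount.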


9,387 citations

"FaST linear mixed models for genome..." refers methods in this paper

...2 Relationship between spectral decomposition and singular value decomposition for the RRM and other factored genetic similarity matrices. Before we discuss the low-rank version of FaST-LMM, it will be useful to review the relationship between spectral decomposition and singular value decomposition (SVD) for matrices for which the factorization K = WWᵀ is known, such as the RRM or the Eigenstrat covariance matrix [3]....
[...]
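The relationship mentioned in this excerpt can be checked numerically. The sketch below assumes the factorization K = WWᵀ with W an individuals × SNPs matrix (the dimensions and random data are hypothetical). The left singular vectors of W are eigenvectors of K, and the squared singular values are the nonzero eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
n, s = 8, 3                       # more individuals than SNPs, so K has rank s < n
W = rng.standard_normal((n, s))
K = W @ W.T                       # n x n similarity matrix built from the factor W

# Thin SVD of the n x s factor: no need to form or decompose the n x n matrix.
U, sing, _ = np.linalg.svd(W, full_matrices=False)

# Full eigendecomposition of K for comparison (eigh returns ascending eigenvalues).
eigvals, eigvecs = np.linalg.eigh(K)
top_vals = eigvals[::-1][:s]
top_vecs = eigvecs[:, ::-1][:, :s]

# Each singular vector matches an eigenvector up to sign.
alignment = np.abs(np.sum(U * top_vecs, axis=0))
```

Here `top_vals` equals `sing**2`, `alignment` is a vector of ones, and the remaining n − s eigenvalues of K are zero. Working from W directly costs far less than decomposing K when s is much smaller than n.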

Journal Article

Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls

Paul Burton, David Clayton, Lon R. Cardon, Nicholas John Craddock, and 192 more

07 Jun 2007-Nature

TL;DR: This study has demonstrated that careful use of a shared control group represents a safe and effective approach to GWA analyses of multiple disease phenotypes; generated a genome-wide genotype database for future studies of common diseases in the British population; and shown that, provided individuals with non-European ancestry are excluded, the extent of population stratification in the British population is generally modest.

Abstract: There is increasing evidence that genome-wide association (GWA) studies represent a powerful approach to the identification of genes involved in common human diseases. We describe a joint GWA study (using the Affymetrix GeneChip 500K Mapping Array Set) undertaken in the British population, which has examined ~2,000 individuals for each of 7 major diseases and a shared set of ~3,000 controls. Case-control comparisons identified 24 independent association signals at P < 5 × 10⁻⁷: 1 in bipolar disorder, 1 in coronary artery disease, 9 in Crohn's disease, 3 in rheumatoid arthritis, 7 in type 1 diabetes and 3 in type 2 diabetes. On the basis of prior findings and replication studies thus far completed, almost all of these signals reflect genuine susceptibility effects. We observed association at many previously identified loci, and found compelling evidence that some loci confer risk for more than one of the diseases studied. Across all diseases, we identified a large number of further signals (including 58 loci with single-point P values between 10⁻⁵ and 5 × 10⁻⁷) likely to yield additional susceptibility loci. The importance of appropriately large samples was confirmed by the modest effect sizes observed at most loci identified. This study thus represents a thorough validation of the GWA approach. It has also demonstrated that careful use of a shared control group represents a safe and effective approach to GWA analyses of multiple disease phenotypes; has generated a genome-wide genotype database for future studies of common diseases in the British population; and shown that, provided individuals with non-European ancestry are excluded, the extent of population stratification in the British population is generally modest. Our findings offer new avenues for exploring the pathophysiology of these important disorders. We anticipate that our data, results and software, which will be widely available to other investigators, will provide a powerful resource for human genetics research.

9,244 citations

Journal Article

XV.—The Correlation between Relatives on the Supposition of Mendelian Inheritance.


R. A. Fisher

01 Jan 1919-Transactions of the Royal Society of Edinburgh

TL;DR: This paper notes that the deviations of a human measurement from its mean follow very closely the Normal Law of Errors, so that variability may be measured by the standard deviation, the square root of the mean square error, and introduces the variance (the square of the standard deviation) as the measure of variability.

Abstract: Several attempts have already been made to interpret the well-established results of biometry in accordance with the Mendelian scheme of inheritance. It is here attempted to ascertain the biometrical properties of a population of a more general type than has hitherto been examined, inheritance in which follows this scheme. It is hoped that in this way it will be possible to make a more exact analysis of the causes of human variability. The great body of available statistics show us that the deviations of a human measurement from its mean follow very closely the Normal Law of Errors, and, therefore, that the variability may be uniformly measured by the standard deviation corresponding to the square root of the mean square error. When there are two independent causes of variability capable of producing in an otherwise uniform population distributions with standard deviations σ1 and σ2, it is found that the distribution, when both causes act together, has a standard deviation √(σ1² + σ2²). It is therefore desirable in analysing the causes of variability to deal with the square of the standard deviation as the measure of variability. We shall term this quantity the Variance of the normal population to which it refers, and we may now ascribe to the constituent causes fractions or percentages of the total variance which they together produce. It is desirable on the one hand that the elementary ideas at the basis of the calculus of correlations should be clearly understood, and easily expressed in ordinary language, and on the other that loose phrases about the "percentage of causation," which obscure the essential distinction between the individual and the population, should be carefully avoided.
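The additivity described in this abstract can be written out as a short derivation. The symbols X₁ and X₂, standing for the contributions of the two independent causes, are introduced here for illustration and do not appear in the abstract.

```latex
\begin{align*}
\operatorname{Var}(X_1 + X_2)
  &= \operatorname{Var}(X_1) + \operatorname{Var}(X_2) + 2\operatorname{Cov}(X_1, X_2) \\
  &= \sigma_1^2 + \sigma_2^2
  \qquad \text{since independence gives } \operatorname{Cov}(X_1, X_2) = 0, \\
\operatorname{SD}(X_1 + X_2) &= \sqrt{\sigma_1^2 + \sigma_2^2}, \\
\text{fraction of variance due to cause } i &= \frac{\sigma_i^2}{\sigma_1^2 + \sigma_2^2},
  \quad i = 1, 2.
\end{align*}
```

Standard deviations do not add in this way, which is why the abstract prefers the squared quantity when apportioning variability among causes.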


3,800 citations

"FaST linear mixed models for genome..." refers background in this paper

...,) IBD [11, 10] and the realized relationship matrix (RRM) [9, 10, 12] and have been estimated...
[...]

Journal Article

Common SNPs explain a large proportion of the heritability for human height

Jian Yang¹, Beben Benyamin¹, Brian P. McEvoy¹, Scott D. Gordon¹, Anjali K. Henders¹, Dale R. Nyholt¹, Pamela A. F. Madden², Andrew C. Heath², Nicholas G. Martin¹, Grant W. Montgomery¹, Michael E. Goddard³, Peter M. Visscher¹

QIMR Berghofer Medical Research Institute¹, Washington University in St. Louis², University of Melbourne³

01 Jul 2010-Nature Genetics

TL;DR: Evidence is provided that the remaining heritability is due to incomplete linkage disequilibrium between causal variants and genotyped SNPs, exacerbated by causal variants having lower minor allele frequency than the SNPs explored to date.


Abstract: SNPs discovered by genome-wide association studies (GWASs) account for only a small fraction of the genetic variation of complex traits in human populations. Where is the remaining heritability? We estimated the proportion of variance for human height explained by 294,831 SNPs genotyped on 3,925 unrelated individuals using a linear model analysis, and validated the estimation method with simulations based on the observed genotype data. We show that 45% of variance can be explained by considering all SNPs simultaneously. Thus, most of the heritability is not missing but has not previously been detected because the individual effects are too small to pass stringent significance tests. We provide evidence that the remaining heritability is due to incomplete linkage disequilibrium between causal variants and genotyped SNPs, exacerbated by causal variants having lower minor allele frequency than the SNPs explored to date.


3,759 citations

"FaST linear mixed models for genome..." refers background in this paper

...When the RRM is used, however, we can perform the k-spectral decomposition more efficiently by circumventing the construction of K, because the singular vectors of the data matrix are the same as the eigenvectors of the RRM constructed from that data (e.g., [15])....
[...]
...Figure 1b, which shows runtimes for this analysis, highlights the linear dependence of the computations on the number of individuals when the number of individuals exceeds the 8K SNPs used to construct the RRM....
[...]
...This case will occur when the RRM is used and the number of (linearly independent) SNPs used to estimate it, sc = k, is smaller than n....
[...]
...The RRM has this property as do other matrices....
[...]
...Such measures have been based on (e.g.,) IBD [11, 10] and the realized relationship matrix (RRM) [9, 10, 12] and have been estimated with a relatively small sample (200-2000) of markers [2, 4].... [Footnote 1: These authors contributed equally to this work.]
[...]