Methodological implementation of mixed linear models in multi-locus genome-wide association studies.

Yang-Jun Wen, Hanwen Zhang, Yuan-Li Ni, Bo Huang, Jin Zhang, Jian-Ying
Feng, Shi-Bo Wang, Jim M. Dunwell, Yuan-Ming Zhang and Rongling Wu
Corresponding authors: Yuan-Ming Zhang, College of Agriculture, Nanjing Agricultural University, Nanjing 210095, China. Tel.: +086 13505161564; Fax:
+086 25 84399091. E-mail: soyzhang@njau.edu.cn; College of Plant Science and Technology, Huazhong Agricultural University, Wuhan 430070, China. Tel.:
+086 13505161564. E-mail: soyzhang@mail.hzau.edu.cn; Rongling Wu, Center for Statistical Genetics, Pennsylvania State University, Hershey, PA 17033,
USA. Tel.: +001 717 531 2037; Fax: +001 717 531 0480. E-mail: rwu@phs.psu.edu
Abstract
The mixed linear model has been widely used in genome-wide association studies (GWAS), but its application to multi-locus
GWAS analysis has not been explored and assessed. Here, we implemented a fast multi-locus random-SNP-effect EMMA
(FASTmrEMMA) model for GWAS. The model is built on random single nucleotide polymorphism (SNP) effects and a new
algorithm. This algorithm whitens the covariance matrix of the polygenic matrix K and environmental noise, and specifies the
number of nonzero eigenvalues as one. The model first chooses all putative quantitative trait nucleotides (QTNs) with
P-values ≤ 0.005 and then includes them in a multi-locus model for true QTN detection. Owing to the multi-locus feature, the
Bonferroni correction is replaced by a less stringent selection criterion. Results from analyses of both simulated and real data
showed that FASTmrEMMA is more powerful in QTN detection and model fit, has less bias in QTN effect estimation and
requires less running time than existing single- and multi-locus methods, such as empirical Bayes, settlement of mixed
linear model under progressively exclusive relationship (SUPER), efficient mixed model association (EMMA), compressed
MLM (CMLM) and enriched CMLM (ECMLM). FASTmrEMMA provides an alternative for multi-locus GWAS.
Key words: genome-wide association study; mixed linear model; multi-locus model; random effect
Introduction
Genome-wide association studies (GWAS) have been widely used
in the genetic dissection of quantitative traits in human, animal
and plant genetics, especially in combination with the output of
genomic sequencing technologies. The most popular method for
GWAS is the mixed linear model (MLM) method [1, 2] because of its
demonstrated effectiveness in correcting the inflation from many
small genetic effects (polygenic background) and controlling the
bias of population stratification [3–7]. Since the MLM of Yu et al. [2]
Yang-Jun Wen is a PhD candidate in State Key Laboratory of Crop Genetics and Germplasm Enhancement at Nanjing Agricultural University, China.
Hanwen Zhang is a bachelor student in the Faculty of Applied Science at the University of British Columbia, Canada.
Yuan-Li Ni is a Master student in State Key Laboratory of Crop Genetics and Germplasm Enhancement at Nanjing Agricultural University, China.
Bo Huang is a Master student in State Key Laboratory of Crop Genetics and Germplasm Enhancement at Nanjing Agricultural University, China.
Jin Zhang is an associate professor in State Key Laboratory of Crop Genetics and Germplasm Enhancement at Nanjing Agricultural University, China.
Jian-Ying Feng is a lecturer in State Key Laboratory of Crop Genetics and Germplasm Enhancement at Nanjing Agricultural University, China.
Shi-Bo Wang is a postdoctoral research fellow in the College of Plant Science and Technology at Huazhong Agricultural University, China.
Jim M. Dunwell is a full professor in the School of Agriculture, Policy and Development at the University of Reading, United Kingdom.
Yuan-Ming Zhang is a full professor in State Key Laboratory of Crop Genetics and Germplasm Enhancement at Nanjing Agricultural University, Nanjing, China
and Chutian Scholar Professor of Statistical Genomics in the College of Plant Science and Technology at Huazhong Agricultural University, Wuhan, China.
Rongling Wu is Distinguished Professor of Public Health Sciences and Statistics and the Director of the Center for Statistical Genetics at The Pennsylvania
State University, USA. He founded the Center for Computational Biology at Beijing Forestry University, China.
Submitted: 24 October 2016; Received (in revised form): 15 December 2016
© The Author 2017. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
For commercial re-use, please contact journals.permissions@oup.com
Briefings in Bioinformatics, 19(4), 2018, 700–712
doi: 10.1093/bib/bbw145
Advance Access Publication Date: 1 February 2017
Software Review
was published, many MLM-based methods have been proposed.
However, most of them comprise a one-dimensional genome scan
that tests one marker at a time, which requires multiple-test
correction of the significance threshold. The widely
used Bonferroni correction is often too conservative to detect
many important loci for quantitative traits.
Most quantitative traits are controlled by a few genes with large
effects and numerous polygenes with minor effects. However, the
current one-dimensional genome scan approaches for GWAS do
not match the true genetic model for these traits. To overcome this
issue, multi-locus methodologies have been developed; for example,
Bayesian least absolute shrinkage and selection operator
(LASSO) [8], adaptive mixed LASSO [9], penalized Logistic regression
[10, 11], Elastic-Net [12], empirical Bayes (E-BAYES) [13] and
E-BAYES LASSO [14]. If the number of markers is several times larger
than sample size, all marker effects can be included in one single
model and estimated in an unbiased way. If the number of markers
is many times larger than sample size, however, these shrinkage
approaches will fail. In this situation, we should consider how to re-
duce the number of marker effects in the multi-locus genetic
model. For example, Zhou et al. [15] developed a Bayesian sparse
linear mixed model, and Moser et al. [16] proposed a Bayesian mix-
ture model. Under these models, two to four common components
in the mixture distribution were considered and only a few vari-
ance components were estimated. Although about 500 effects in
the genetic model are finally considered after several rounds of
Gibbs sampling, the computing time becomes a major concern for
these Bayesian approaches. Recently, Segura et al. [17] and Wang
et al. [7] have proposed multi-locus MLM approaches. However,
further refinement toward a fast algorithm is needed.
Zhang et al.’s [1] MLM method treated the quantitative trait
nucleotide (QTN) effect as being random, in which three compo-
nent variances owing to QTNs, polygenes and residual errors need
to be estimated. If the number of effects is large, this calculation
takes a long time. To reduce computing time and increase power
in QTN detection, a compressed MLM (CMLM) with a population
parameters previously determined (P3D) algorithm [18] and an en-
riched CMLM (ECMLM) [19] have been proposed. On the other
hand, Kang et al. [3] proposed an efficient mixed model association
(EMMA), and other authors suggested alternatives, such as EMMA
eXpedited (EMMAX) [20], FaST-LMM [21], FaST-LMM-Select [22],
genome-wide EMMA [4] and genome-wide rapid association using
mixed model and regression-Gamma (GRAMMAR-Gamma) [23].
Recently, settlement of mixed linear model under progressively
exclusive relationship (SUPER) [24] has been developed based on
FaST-LMM. Among the above fast methods, the SNP effect was
treated as being fixed. Goddard et al. [25] noted that a random-
marker model has several advantages, compared with the fixed
model [7, 26, 27]. For example, the random model approach will
shrink the estimated SNP effects toward zero. However, Goddard
et al. [25] did not provide an efficient computational algorithm to
estimate marker effects.
In this article, we describe a new method that can quickly
scan each random-effect marker throughout the genome by
constructing a fast and new matrix transformation for the three
component variances. Then, all the putative QTNs with P-values
≤ 0.005 were placed into one multi-locus genetic model and
these QTN effects were estimated by EM empirical Bayes (EMEB)
[28] for true QTN identification. This new method, called fast
multi-locus random-SNP-effect EMMA (FASTmrEMMA), was
validated by analysis of real data from Arabidopsis [29] and by a
series of simulation studies and compared with the other meth-
ods, such as E-BAYES (multi-locus model) [30], SUPER, EMMA,
ECMLM and CMLM (single-locus model).
Statistical approaches for GWAS
Fast multi-locus random-SNP-effect EMMA
FASTmrEMMA (Appendix A) is a multi-locus two-stage GWAS
approach. In the first stage, each SNP effect was treated as random
and a small fraction of SNPs was selected, based on the premise
that most SNPs have no effect on the quantitative
trait. Three techniques were implemented to save
running time. First, the original MLM was multiplied by a new
transformation matrix in order to whiten the covariance
matrix of the polygenic matrix K and environmental noise.
Second, the polygenic-to-residual variance ratio under the null
hypothesis was fixed in all the single-marker genome tests. Finally,
the number of nonzero eigenvalues was specified as one. In the
second stage, all the SNP effects selected in the first stage were
placed into one multi-locus model and then estimated by
expectation and maximization empirical Bayes (EMEB) [28] for
true QTN identification. The new method has been implemented
in R and its software can be downloaded from
https://cran.r-project.org/web/packages/mrMLM/index.html.
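
The whitening step described above can be illustrated with a minimal sketch. It is not the authors' R implementation: it assumes NumPy, an identity design matrix Z, a toy kinship matrix built from random markers, and a hypothetical value `lambda_g` for the fixed polygenic-to-residual variance ratio. The function name `whitening_matrix` is likewise an illustrative choice.

```python
import numpy as np

def whitening_matrix(K, lambda_g, tol=1e-10):
    """Return C = Q1 diag(ev^-1/2) Q1^T for B = lambda_g * K + I (Z = I assumed)."""
    n = K.shape[0]
    B = lambda_g * K + np.eye(n)
    eigvals, eigvecs = np.linalg.eigh(B)   # B is symmetric, so eigh applies
    keep = eigvals > tol                   # retain positive eigenvalues only
    Q1 = eigvecs[:, keep]
    return Q1 @ np.diag(eigvals[keep] ** -0.5) @ Q1.T

# Toy data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
n, m = 8, 20
G = rng.choice([-1.0, 1.0], size=(n, m))  # marker genotypes coded -1/1
K = G @ G.T / m                            # simple kinship estimate (assumption)
lambda_g = 0.7                             # hypothetical null estimate of the ratio

C = whitening_matrix(K, lambda_g)
B = lambda_g * K + np.eye(n)
whitened = C @ B @ C.T                     # covariance after transformation
print(np.allclose(whitened, np.eye(n)))
```

Multiplying the phenotype and marker columns by `C` leaves a model whose polygenic-plus-residual covariance is the identity, so each single-marker test only has to handle the QTN variance component.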
E-BAYES
E-BAYES is an existing multi-locus Bayesian approach imple-
mented by the SAS program [30], and was used as a gold stand-
ard for multi-locus model comparison. In this method, all the
SNP-effect variances are simultaneously estimated. Owing to the
multi-locus nature, Bonferroni correction is replaced by a less
stringent selection criterion. The critical value of P-value in the
significance test is set at 0.05 in three simulation experiments.
EMMA
EMMA is an existing single-locus genome scan method for
GWAS [3], and a fixed model version of the original MLM, in
which QTN effect is treated as a fixed effect with no prior distri-
bution assigned. The method was implemented by the R soft-
ware package EMMA (http://mouse.cs.ucla.edu/emma/).
CMLM and ECMLM
CMLM [18] and ECMLM [19] are existing single-locus genome
scan methods for GWAS. CMLM decreases the effective sample
size by clustering individuals into groups and eliminates the
need to re-compute variance components. ECMLM chooses the
best combination of three kinship algorithms and eight grouping
algorithms to increase statistical power. Both methods
are also fixed-model versions of the original MLM and are
approximate algorithms for SNP effect estimation.
SUPER
FaST-LMM [21] is a newly developed algorithm in GWAS that
can solve the computational problem, but requires that the
number of SNPs be less than the number of individuals. To
overcome this shortcoming, SUPER [24] extracts a small subset of
SNPs and uses them in FaST-LMM. SUPER not only
retains the computational advantage of FaST-LMM but also
remarkably increases statistical power.
ECMLM, CMLM and SUPER were all implemented in the R
software package GAPIT (http://zzlab.net/GAPIT).
The methodological comparison for the above approaches is
listed in Table 1.
Table 1. Comparison of six methods and their software for GWAS

| Case | FASTmrEMMA | E-BAYES | EMMA | CMLM | ECMLM | SUPER |
|---|---|---|---|---|---|---|
| Model | Multi-locus model | Multi-locus model | Single-locus model | Single-locus model | Single-locus model | Single-locus model |
| QTN effect | Random | Random | Fixed | Fixed | Fixed | Fixed |
| Polygenic background control | Yes | No | Yes | Yes | Yes | Yes |
| Population structure control | Yes | No | Yes | Yes | Yes | Yes |
| Number of variance components | Three | No. of effects | Two | Two | Two | Two |
| Polygenic-to-residual variance ratio | Fixed | NA | NA | Fixed | Fixed | NA |
| Significance critical value | LOD (logarithm of odds) = 3 | P-value = 0.05 | P-value = 0.05/p, where p is no. of markers | P-value = 0.05/p | P-value = 0.05/p | P-value = 0.05/p |
| Transformation matrix and performances | $Q_1 \Lambda_r^{-1/2} Q_1^T$, where $(Q_1 \Lambda_r^{1/2} Q_1^T)(Q_1 \Lambda_r^{1/2} Q_1^T) = \hat{\lambda}_g ZKZ^T + I_n$. The covariance matrix of the polygenic matrix K and environmental noise is whitened. The number of nonzero eigenvalues is specified as one. | Shrinkage is selective. Large effects are subject to virtually no shrinkage while small effects are shrunken to zero. | $U_R^T$, where $SHS = U_R\,\mathrm{diag}(\lambda_1 + \delta, \ldots, \lambda_{n-q} + \delta)\,U_R^T$, $H = ZKZ^T + \delta I$ and $S = I - X(X^T X)^{-1} X^T$. One-dimensional optimization by deriving the likelihood as a function of the QTN-to-residual variance ratio. | Kinship among individuals is replaced by the kinship among groups. Fits the groups as the random effect, estimates population parameters only once and then fixes them to test genetic markers. | Kinship among individuals is replaced by the kinship among groups. Chooses the best combination between kinship algorithms and grouping algorithms. | Dramatically reduces the number of markers used to define individual relationships, and uses them in FaST-LMM. |
| Running time | Fast | Depends on the number of effects | Slow | Fast | Fast | Moderate |
| Software Web site | https://cran.r-project.org/web/packages/mrMLM/index.html | http://statgen.ucr.edu/software.html | http://mouse.cs.ucla.edu/emma/ | http://zzlab.net/GAPIT | http://zzlab.net/GAPIT | http://zzlab.net/GAPIT |
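
To make the critical values in Table 1 comparable, a LOD score can be converted to an approximate P-value. The sketch below is illustrative only: it assumes the likelihood ratio statistic equals 2 ln(10) × LOD and follows a chi-square distribution with one degree of freedom, which the source does not state, and the marker count of 10 000 is hypothetical.

```python
import math

def lod_to_pvalue(lod):
    """Approximate P-value for a LOD score (assumes a 1-df chi-square reference)."""
    lrt = 2.0 * math.log(10.0) * lod
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(lrt / 2.0))

def bonferroni_threshold(n_markers, alpha=0.05):
    """Per-test threshold 0.05/p used by the single-locus methods in Table 1."""
    return alpha / n_markers

p_lod3 = lod_to_pvalue(3.0)
p_bonf = bonferroni_threshold(10000)   # hypothetical number of markers
print(f"LOD = 3 -> P ~ {p_lod3:.2e}; Bonferroni (p = 10000) -> {p_bonf:.2e}")
```

Under these assumptions the LOD = 3 criterion corresponds to a P-value of roughly 2E-4, which is far less stringent than 0.05/p for any realistic number of markers.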
Results
Fast multi-locus random-SNP-effect EMMA
Estimation of the QTN variance
FASTmrEMMA (Appendix A) is a new algorithm that approximates
the estimation of QTN variance. Thus, we need to
know whether this approximation has a significant effect on
the estimate of QTN variance. To answer this question, four
flowering time traits in Arabidopsis [29] (Appendix B) were
re-analyzed by FASTmrEMMA and by an exact method implemented
in PROC MIXED in SAS. The estimates of QTN variance are
shown in Figure 1 and Supplementary Table S1. The relative
error between the two methods ranged from 0.0% to
24.09%, with an average of 1.60%, indicating that, on average, the
approximation had little effect on the QTN variance estimate
under the conditions of this analysis.
To confirm the effectiveness of FASTmrEMMA, three Monte
Carlo simulation experiments (Appendix C) were carried out;
the simulation procedures were almost the same as those in Wang
et al. [7]. In the three experiments, various genetic backgrounds
(none, polygenic and epistatic) were simulated to conduct sensitivity
analysis. Each sample in these simulation experiments was analyzed
by six methods. Of the six methods, FASTmrEMMA is a new
multi-locus algorithm within the framework of MLM, E-BAYES
[30] is an existing multi-locus approach under the framework of
Bayesian statistics, and SUPER, EMMA, ECMLM and CMLM are
existing single-locus GWAS methods.
Statistical power for QTN detection
In the above three simulation experiments, the power for each
QTN was defined as the proportion of samples where the QTN
was detected (the P-value is smaller than the designated thresh-
old). When only six QTNs were simulated in the first experi-
ment, the power in the detection of each QTN was higher for
FASTmrEMMA than for the others (Figure 2A; Supplementary
Table S2). When a polygenic background ($h^2_{pg} = 0.092$) was added
to the first experiment, a similar trend was observed (Figure 2B;
Supplementary Table S2). When the polygenic background was
changed into an epistatic background ($h^2_{epi} = 0.15$), the results
were also similar to those in the first experiment (Figure 2C;
Supplementary Table S2). These results demonstrate the high-
est power of FASTmrEMMA across all the approaches under
various genetic backgrounds, although the other methods are
also robust under these backgrounds.
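
The definition of power used here (the proportion of samples in which a QTN is detected) can be written as a short function. The P-values and the threshold below are hypothetical and serve only to illustrate the calculation.

```python
def detection_power(p_values, threshold):
    """Proportion of simulated samples in which the QTN's P-value is below the threshold."""
    detected = sum(1 for p in p_values if p < threshold)
    return detected / len(p_values)

# Hypothetical P-values for one QTN across five simulated samples
replicate_p = [1e-6, 3e-4, 2e-5, 0.02, 4e-7]
power = detection_power(replicate_p, threshold=1e-4)
print(power)
```

In the simulation experiments the same calculation would be repeated for each QTN and each method, with the threshold set by that method's significance criterion.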
Accuracy for estimated QTN effects
We used the average, mean squared error (MSE) and mean abso-
lute deviation (MAD) to measure the accuracy of an estimated
QTN effect. We evaluated the accuracies for the estimates of all
the six simulated QTNs across all the six methods. As a result,
the estimate of each QTN effect from FASTmrEMMA was much
closer to the true value than the estimates obtained from the
other methods. For QTN numbers 1 and 4, however, the
averages from E-BAYES were closer to the true value than those
from FASTmrEMMA in the three simulation experiments
Figure 1. Comparison of the QTN-variance estimates between fast multi-locus random-SNP-effect EMMA (FASTmrEMMA) and one exact algorithm implemented by
PROC MIXED in SAS. LD: days to flowering under long days; SDV: days to flowering under short days with vernalization; 8W GH LN: leaf number at flowering with
8 weeks vernalization, greenhouse; and 8W GH FT: days to flowering, 8 weeks vernalization, greenhouse.
(Supplementary Table S2). The MSE and MAD for each QTN effect
were significantly smaller for FASTmrEMMA than for the
others, with two exceptions: for QTN number 6, the E-BAYES method
had slightly higher accuracy than FASTmrEMMA in the
first and second simulation experiments (Figure 2D–I;
Supplementary Table S2). These results indicate that a higher
accuracy for the estimate of QTN effect can be achieved using
FASTmrEMMA than using the other methods.
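
The two accuracy measures can be sketched as follows. The formulas are assumptions, since the text does not spell them out: both are taken here as averages over simulated samples of the deviation of the estimate from the true simulated effect. The estimates and the true effect are hypothetical.

```python
def mean_squared_error(estimates, true_effect):
    """Average squared deviation of the estimates from the true effect."""
    return sum((e - true_effect) ** 2 for e in estimates) / len(estimates)

def mean_absolute_deviation(estimates, true_effect):
    """Average absolute deviation of the estimates from the true effect."""
    return sum(abs(e - true_effect) for e in estimates) / len(estimates)

# Hypothetical effect estimates for one QTN across four simulated samples
estimates = [0.9, 1.1, 1.2, 0.8]
true_effect = 1.0
mse = mean_squared_error(estimates, true_effect)
mad = mean_absolute_deviation(estimates, true_effect)
print(round(mse, 6), round(mad, 6))
```

Smaller values of either measure indicate estimates that sit closer to the simulated effect; MSE penalizes large deviations more heavily than MAD.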
False-positive rate and receiver operating characteristic curve
All the false QTNs, detected by the six methods, in three simula-
tion experiments were used to calculate the empirical false-
positive rates of the six methods. These results are listed in
Supplementary Table S3. In these three simulation experiments,
the empirical false-positive rates of the six methods
were between 0.357 and 7.785 (×1E-4), and had the same order
of magnitude. ECMLM has the lowest false-positive rate fol-
lowed by CMLM, FASTmrEMMA and EMMA methods, and SUPER
has the maximum false-positive rate followed by E-BAYES
method.
A receiver operating characteristic curve is a plot of the
statistical power against the controlled type I error. This curve is
frequently used to compare the efficiencies of different methods
in the detection of significant effects; the higher the curve,
the better the method. Powers in the first simulation experiment
were calculated at 11 probability levels for significance
between 1E-8 and 1E-3. The
results are shown in Figure 3. Among the six approaches,
FASTmrEMMA is clearly the best and E-BAYES is the next
best.
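
A curve of this kind can be traced by recomputing power at each significance level. The sketch below assumes the 11 levels are evenly spaced on a log10 scale between 1E-8 and 1E-3, which the text does not specify, and uses hypothetical P-values for a single QTN.

```python
def power_at(p_values, threshold):
    """Proportion of samples with a P-value below the given significance level."""
    return sum(p < threshold for p in p_values) / len(p_values)

# 11 significance levels from 1E-8 to 1E-3 (log-spaced; spacing is an assumption)
thresholds = [10.0 ** (-8 + 0.5 * i) for i in range(11)]

# Hypothetical P-values for one QTN across seven simulated samples
replicate_p = [3e-9, 5e-7, 2e-6, 8e-5, 4e-4, 0.003, 0.2]

curve = [(t, power_at(replicate_p, t)) for t in thresholds]
for level, power in curve:
    print(f"{level:.1e}  {power:.3f}")
```

Plotting the second column against the first for each method gives curves that can be compared directly: a method whose curve lies above another's has higher power at the same type I error level.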
Computing time
In each of the three simulation experiments, computing times
for the six methods were recorded and are listed in
Supplementary Table S4. In summary, FASTmrEMMA has the
least computing time followed by ECMLM, E-BAYES, CMLM and
SUPER methods, and EMMA has the maximum computing time.
Real data analysis in Arabidopsis
To validate FASTmrEMMA, this new method along with E-
BAYES, SUPER, EMMA, ECMLM and CMLM was used to re-
analyze the Arabidopsis data [29] for days to flowering under
long days (LD), days to flowering under short days with
Figure 2. Comparison of FASTmrEMMA with the single- and multi-locus approaches under various genetic backgrounds. The single-locus model approaches include
SUPER, EMMA, ECMLM and CMLM, and the multi-locus approach is E-BAYES. The powers are presented in A–C, MSEs are shown in D–F and MADs are listed in G–I.
Six QTNs (A, D and G), six QTNs plus polygenes (B, E and H) and six QTNs plus three epistasis (C, F and I) were simulated, respectively, in the first to third simulation
experiments.
Citations
More filters
Journal ArticleDOI
15 Mar 2017-Heredity
TL;DR: Results from simulation studies showed that pLARmEB was more powerful in QTN detection and more accurate in Q TN effect estimation, had less false positive rate and required less computing time than Bayesian hierarchical generalized linear model, efficient mixed model association (EMMA) and least angle regression plus empirical Bayes.
Abstract: Multilocus genome-wide association studies (GWAS) have become the state-of-the-art procedure to identify quantitative trait nucleotides (QTNs) associated with complex traits. However, implementation of multilocus model in GWAS is still difficult. In this study, we integrated least angle regression with empirical Bayes to perform multilocus GWAS under polygenic background control. We used an algorithm of model transformation that whitened the covariance matrix of the polygenic matrix K and environmental noise. Markers on one chromosome were included simultaneously in a multilocus model and least angle regression was used to select the most potentially associated single-nucleotide polymorphisms (SNPs), whereas the markers on the other chromosomes were used to calculate kinship matrix as polygenic background control. The selected SNPs in multilocus model were further detected for their association with the trait by empirical Bayes and likelihood ratio test. We herein refer to this method as the pLARmEB (polygenic-background-control-based least angle regression plus empirical Bayes). Results from simulation studies showed that pLARmEB was more powerful in QTN detection and more accurate in QTN effect estimation, had less false positive rate and required less computing time than Bayesian hierarchical generalized linear model, efficient mixed model association (EMMA) and least angle regression plus empirical Bayes. pLARmEB, multilocus random-SNP-effect mixed linear model and fast multilocus random-SNP-effect EMMA methods had almost equal power of QTN detection in simulation experiments. However, only pLARmEB identified 48 previously reported genes for 7 flowering time-related traits in Arabidopsis thaliana.

129 citations


Cites background or methods or result from "Methodological implementation of mi..."

  • ...Although other multilocus approaches have also been proposed by Segura et al. (2012), Moser et al. (2015), Liu et al. (2016), Wang et al. (2016) and Wen et al. (2017), now further refinement and studies are still needed....

    [...]

  • ...To control polygenic background, we adopted the model transformation of Wen et al. (2017) that whitens the covariance matrix of the polygenic matrix K and residual noise....

    [...]

  • ...Wang et al. (2016) suggested mrMLM and Wen et al. (2017) proposed FASTmrEMMA....

    [...]

  • ...The AIC or BIC values of FASTmrEMMA in Wen et al. (2017) and mrRMLM in Wang et al. (2016) are different from the corresponding values in this study....

    [...]

  • ...¼ Q1K 1 2 rQT1 Q1K 1 2 rQT1 ð3Þ where QB is orthogonal, Λr is a diagonal matrix with positive eigenvalues, r=Rank(B), Q1 and Q2 are the n× r and n× (n− r) block matrices of QB, respectively, and 0 is the corresponding block zero matrix (Wen et al., 2017)....

    [...]

Journal ArticleDOI
TL;DR: Genome-wide association studies (GWAS) have developed into a powerful and ubiquitous tool for the investigation of complex traits as discussed by the authors, enabling the detection of genomic variants associated with either traditional agronomic phenotypes or biochemical and molecular phenotypes.
Abstract: Genome-wide association studies (GWAS) have developed into a powerful and ubiquitous tool for the investigation of complex traits. In large part, this was fueled by advances in genomic technology, enabling us to examine genome-wide genetic variants across diverse genetic materials. The development of the mixed model framework for GWAS dramatically reduced the number of false positives compared with naive methods. Building on this foundation, many methods have since been developed to increase computational speed or improve statistical power in GWAS. These methods have allowed the detection of genomic variants associated with either traditional agronomic phenotypes or biochemical and molecular phenotypes. In turn, these associations enable applications in gene cloning and in accelerated crop breeding through marker assisted selection or genetic engineering. Current topics of investigation include rare-variant analysis, synthetic associations, optimizing the choice of GWAS model, and utilizing GWAS results to advance knowledge of biological processes. Ongoing research in these areas will facilitate further advances in GWAS methods and their applications.

110 citations

Journal ArticleDOI
TL;DR: Since the establishment of the mixed linear model (MLM) method for genome-wide association studies (GWAS) by Zhang et al. (2005), a series of new MLM-based methods have been proposed, i.e., mrMLM (Wang et al., 2016), ISIS EMBLASSO (Tamba etAl., 2017), pLARmEB (Zhang et al, 2017), FASTmrEMMA (Wen and Tamba, 2018
Abstract: Since the establishment of the mixed linear model (MLM) method for genome-wide association studies (GWAS) by Zhang et al. (2005) and Yu et al. (2006), a series of new MLM-based methods have been proposed (Feng et al., 2016). These methods have been widely used in genetic dissection of complex and omics-related traits (Figure 1), especially in conjunction with the development of advanced genomic sequencing technologies. However, most existing methods are based on single marker association in genome-wide scans with population structure and polygenic background controls. To control false positive rate, Bonferroni correction for multiple tests is frequently adopted. This stringent correction results in the exclusion of important loci, especially for large experimental error inherent in field experiments of crop genetics. To address this issue, multilocus GWAS methodologies have been recommended, i.e., mrMLM (Wang et al., 2016), ISIS EMBLASSO (Tamba et al., 2017), pLARmEB (Zhang et al., 2017), FASTmrEMMA (Wen et al., 2018a), pKWmEB (Ren et al., 2018), and FASTmrMLM (Zhang and Tamba, 2018). Here we summarize their advantages and potential limitations for using these methods (Table 1).

104 citations

Journal ArticleDOI
TL;DR: As variation in targeted traits is essential for GWAS, metabolic diversity and its rise at both the population and individual levels is reviewed and the current knowledge on mGWAS-based multi-dimensional analysis and emerging insights into the diversity of metabolism are addressed.
Abstract: Plants have served as sources providing humans with metabolites for food and nutrition, biomaterials for living, and treatment for pain and disease. Plants produce a huge array of metabolites, with an immense diversity at both the population and individual levels. Dissection of the genetic bases for metabolic diversity has attracted increasing research attention. The concept of genome-wide association study (GWAS) was extended to studies on the diversity of plant metabolome that benefitted from the development of mass-spectrometry-based analytical systems and genome sequencing technologies. Metabolic genome-wide association study (mGWAS) is one of the most powerful tools for global identification of genetic determinants for diversity of plant metabolism. Recently, mGWAS has been performed for various species with continuous improvements, providing deeper insights into the genetic bases of metabolic diversity. In this review, we discuss fully the achievements to date and remaining challenges that are associated with both mGWAS and mGWAS-based multi-dimensional analysis. We begin with a summary of GWAS and its development based on statistical methods and populations. As variation in targeted traits is essential for GWAS, we review metabolic diversity and its rise at both the population and individual levels. Subsequently, the application of mGWAS for plants and its corresponding achievements are fully discussed. We address the current knowledge on mGWAS-based multi-dimensional analysis and emerging insights into the diversity of metabolism.

99 citations

Journal ArticleDOI
01 Mar 2018-Heredity
TL;DR: An integrated nonparametric method for multi-locus GWAS that effectively controlled false positive rate, although a less stringent significance criterion was adopted, and retained the high power of Kruskal–Wallis test, and provided QTN effect estimates.
Abstract: Although nonparametric methods in genome-wide association studies (GWAS) are robust in quantitative trait nucleotide (QTN) detection, the absence of polygenic background control in single-marker association in genome-wide scans results in a high false positive rate. To overcome this issue, we proposed an integrated nonparametric method for multi-locus GWAS. First, a new model transformation was used to whiten the covariance matrix of polygenic matrix K and environmental noise. Using the transferred model, Kruskal-Wallis test along with least angle regression was then used to select all the markers that were potentially associated with the trait. Finally, all the selected markers were placed into multi-locus model, these effects were estimated by empirical Bayes, and all the nonzero effects were further identified by a likelihood ratio test for true QTN detection. This method, named pKWmEB, was validated by a series of Monte Carlo simulation studies. As a result, pKWmEB effectively controlled false positive rate, although a less stringent significance criterion was adopted. More importantly, pKWmEB retained the high power of Kruskal-Wallis test, and provided QTN effect estimates. To further validate pKWmEB, we re-analyzed four flowering time related traits in Arabidopsis thaliana, and detected some previously reported genes that were not identified by the other methods.

97 citations


Cites background from "Methodological implementation of mi..."

  • ...It should be noted that model (7) includes QTN variation and normal residual error (Wen et al. 2017)....

    [...]

  • ...where $Q_B$ is orthogonal, $\Lambda_r$ is a diagonal matrix with positive eigenvalues, $r = \mathrm{Rank}(B)$, $Q_1$ and $Q_2$ are the $n \times r$ and $n \times (n - r)$ block matrices of $Q_B$, and 0 is the corresponding block zero matrix (Wen et al. 2017)....

    [...]

  • ...2016), FASTmrEMMA (Wen et al. 2017), ISIS EM-BLASSO (Tamba et al....

    [...]
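The eigendecomposition described in the second snippet above can be illustrated with a minimal NumPy sketch. The snippet does not say what B is, so the example assumes only that B is a symmetric positive semi-definite n × n matrix and uses a random low-rank stand-in; the function name `split_eigenspaces` and the tolerance are hypothetical choices for illustration, not part of the cited method.

```python
import numpy as np


def split_eigenspaces(B, tol=1e-10):
    """Split the eigenvectors of a symmetric PSD matrix B into the blocks
    Q1 (positive eigenvalues) and Q2 (zero eigenvalues)."""
    eigvals, eigvecs = np.linalg.eigh(B)      # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Numerical rank: eigenvalues above a relative tolerance count as positive.
    r = int(np.sum(eigvals > tol * max(eigvals[0], 1.0)))
    Q1, Q2 = eigvecs[:, :r], eigvecs[:, r:]   # n x r and n x (n - r) blocks
    Lambda_r = np.diag(eigvals[:r])           # r x r diagonal, positive entries
    return Q1, Q2, Lambda_r, r


# Stand-in for B (assumption): a random rank-2 PSD matrix of size 6 x 6.
rng = np.random.default_rng(0)
n, r_true = 6, 2
A = rng.standard_normal((n, r_true))
B = A @ A.T

Q1, Q2, Lambda_r, r = split_eigenspaces(B)
```

Under these assumptions, B is recovered as Q1 Λr Q1ᵀ, and the columns of Q2 span the null space of B.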

References
Journal ArticleDOI
TL;DR: The GCTA software is a versatile tool to estimate and partition complex trait variation with large GWAS data sets and focuses on the function of estimating the variance explained by all the SNPs on the X chromosome and testing the hypotheses of dosage compensation.
Abstract: For most human complex diseases and traits, SNPs identified by genome-wide association studies (GWAS) explain only a small fraction of the heritability. Here we report a user-friendly software tool called genome-wide complex trait analysis (GCTA), which was developed based on a method we recently developed to address the “missing heritability” problem. GCTA estimates the variance explained by all the SNPs on a chromosome or on the whole genome for a complex trait rather than testing the association of any particular SNP to the trait. We introduce GCTA's five main functions: data management, estimation of the genetic relationships from SNPs, mixed linear model analysis of variance explained by the SNPs, estimation of the linkage disequilibrium structure, and GWAS simulation. We focus on the function of estimating the variance explained by all the SNPs on the X chromosome and testing the hypotheses of dosage compensation. The GCTA software is a versatile tool to estimate and partition complex trait variation with large GWAS data sets.

5,867 citations

Journal ArticleDOI
TL;DR: It is found that polygenicity accounts for the majority of the inflation in test statistics in many GWAS of large sample size, and the LD Score regression intercept can be used to estimate a more powerful and accurate correction factor than genomic control.
Abstract: Both polygenicity (many small genetic effects) and confounding biases, such as cryptic relatedness and population stratification, can yield an inflated distribution of test statistics in genome-wide association studies (GWAS). However, current methods cannot distinguish between inflation from a true polygenic signal and bias. We have developed an approach, LD Score regression, that quantifies the contribution of each by examining the relationship between test statistics and linkage disequilibrium (LD). The LD Score regression intercept can be used to estimate a more powerful and accurate correction factor than genomic control. We find strong evidence that polygenicity accounts for the majority of the inflation in test statistics in many GWAS of large sample size.

3,708 citations

Journal ArticleDOI
TL;DR: A unified mixed-model approach to account for multiple levels of relatedness simultaneously as detected by random genetic markers is developed and provides a powerful complement to currently available methods for association mapping.
Abstract: As population structure can result in spurious associations, it has constrained the use of association studies in human and plant genetics. Association mapping, however, holds great promise if true signals of functional association can be separated from the vast number of false signals generated by population structure. We have developed a unified mixed-model approach to account for multiple levels of relatedness simultaneously as detected by random genetic markers. We applied this new approach to two samples: a family-based sample of 14 human families, for quantitative gene expression dissection, and a sample of 277 diverse maize inbred lines with complex familial relationships and population structure, for quantitative trait dissection. Our method demonstrates improved control of both type I and type II error rates over other methods. As this new method crosses the boundary between family-based and structured association samples, it provides a powerful complement to currently available methods for association mapping.

3,467 citations


"Methodological implementation of mi..." refers background or methods in this paper

  • ...The most popular method for GWAS is the mixed linear model (MLM) method [1, 2] because of its demonstrated effectiveness in correcting the inflation from many small genetic effects (polygenic background) and controlling the bias of population stratification [3–7]....

    [...]

  • ...; $w_c)$ is an $n \times c$ matrix of covariates (fixed effects) including a column vector of 1, population structure [2] or principal component [37] may be incorporated into $W$ and $\alpha$ is a $c \times 1$ vector of fixed effects including the intercept; $X$ is an $n \times 1$ vector of marker genotypes, and $\beta \sim N(0, \sigma_\beta^2)$ is random effect of putative QTN; $Z$ is an $n \times m$ design matrix, $u \sim \mathrm{MVN}_m(0, \sigma_g^2 K)$ is an $m \times 1$ vector of polygenic effects; $K$ is a known $m \times m$ relatedness matrix; and $\varepsilon \sim \mathrm{MVN}_n(0, \sigma_e^2 I_n)$ is an $n \times 1$ vector of residual errors, $\sigma_e^2$ is the variance of residual error, $I_n$ is an $n \times n$ identity matrix and MVN denotes multivariate normal distribution....

    [...]

  • ...Many methods for calculating kinship matrix $K_{m \times m}$ from a large number of markers have been proposed, such as identical-by-state approach [2, 3, 7, 26, 43]....

    [...]
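The definitions quoted in the snippets above are consistent with a mixed model of the following form. The model equation itself is truncated in the snippet, so this is a reconstruction under assumptions rather than a quotation: it uses the Greek symbols $\alpha$, $\beta$, $\sigma^2$ and $\varepsilon$ for the fixed effects, the QTN effect, the variances and the residual, and it assumes that $\beta$, $u$ and $\varepsilon$ are mutually independent.

```latex
\begin{aligned}
y &= W\alpha + X\beta + Zu + \varepsilon, \\
\beta &\sim N(0, \sigma_\beta^2), \qquad
u \sim \mathrm{MVN}_m(0, \sigma_g^2 K), \qquad
\varepsilon \sim \mathrm{MVN}_n(0, \sigma_e^2 I_n), \\
\mathrm{Var}(y) &= \sigma_\beta^2 X X^{\mathsf{T}} + \sigma_g^2 Z K Z^{\mathsf{T}} + \sigma_e^2 I_n .
\end{aligned}
```

The last line follows from the stated distributions only under the independence assumption; $y$ is taken here to be the $n \times 1$ phenotype vector, which the snippet does not state explicitly.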

Journal ArticleDOI
TL;DR: Examining the expression patterns of large gene families, it is found that they are often more similar than would be expected by chance, indicating that many gene families have been co-opted for specific developmental processes.
Abstract: Regulatory regions of plant genes tend to be more compact than those of animal genes, but the complement of transcription factors encoded in plant genomes is as large or larger than that found in those of animals. Plants therefore provide an opportunity to study how transcriptional programs control multicellular development. We analyzed global gene expression during development of the reference plant Arabidopsis thaliana in samples covering many stages, from embryogenesis to senescence, and diverse organs. Here, we provide a first analysis of this data set, which is part of the AtGenExpress expression atlas. We observed that the expression levels of transcription factor genes and signal transduction components are similar to those of metabolic genes. Examining the expression patterns of large gene families, we found that they are often more similar than would be expected by chance, indicating that many gene families have been co-opted for specific developmental processes.

2,510 citations

Journal ArticleDOI
TL;DR: This method is approximately n times faster than the widely used exact method known as efficient mixed-model association (EMMA), where n is the sample size, making exact genome-wide association analysis computationally practical for large numbers of individuals.
Abstract: Linear mixed models have attracted considerable attention recently as a powerful and effective tool for accounting for population stratification and relatedness in genetic association tests. However, existing methods for exact computation of standard test statistics are computationally impractical for even moderate-sized genome-wide association studies. To address this issue, several approximate methods have been proposed. Here, we present an efficient exact method, which we refer to as genome-wide efficient mixed-model association (GEMMA), that makes approximations unnecessary in many contexts. This method is approximately n times faster than the widely used exact method known as efficient mixed-model association (EMMA), where n is the sample size, making exact genome-wide association analysis computationally practical for large numbers of individuals.

2,334 citations


"Methodological implementation of mi..." refers methods in this paper

  • ...[3] proposed an efficient mixed model association (EMMA), and other authors suggested alternatives, such as EMMA eXpedited (EMMAX) [20], FaST-LMM [21], FaST-LMM-Select [22], genome-wide EMMA [4] and genome-wide rapid association using mixed model and regression-Gamma (GRAMMAR-Gamma) [23]....

    [...]

  • ...Therefore, the overall time complexity for the first step of FASTmrEMMA is $O(mn^2 + pn^2 + ptn)$, compared with $O(mn^2 + pmn^2 + ptn)$ for EMMA [4], where t is the number of optimization iterations required for the NR method (quadratic rate of convergence)....

    [...]

  • ...According to the descriptions for the single-locus genome scan algorithm in previous GWAS studies [3, 4, 40, 41], log-likelihood and restricted log-likelihood functions for the model (A....

    [...]

  • ...In the current methods, including EMMA [3], CMLM/P3D [18], ECMLM [19], EMMAX [20], FaST-LMM [21], FaST-LMM-Select [22], SUPER [24], GEMMA [4] and GRAMMAR-Gamma [23], $\beta$ is treated as a fixed effect, from which it is relatively easy to estimate $\sigma_g^2$ and $\sigma_e^2$....

    [...]

  • ...As described in GEMMA, Zhou and Stephens [4] first obtained the first and second derivatives for $\lambda_\beta$, and then conducted eigen (or spectral) decomposition....

    [...]
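To make the complexity comparison quoted in the snippets above concrete, the sketch below evaluates the two operation-count expressions for one set of sizes. It only plugs numbers into the stated terms mn² + pn² + ptn and mn² + pmn² + ptn; the function names are hypothetical, the example sizes are arbitrary, and reading p as the number of markers is an assumption, since the snippet defines only t.

```python
def fastmremma_first_step_cost(n, m, p, t):
    """Leading-order operation count quoted for the first step of FASTmrEMMA."""
    return m * n**2 + p * n**2 + p * t * n


def emma_cost(n, m, p, t):
    """Leading-order operation count quoted for EMMA."""
    return m * n**2 + p * m * n**2 + p * t * n


# Arbitrary illustrative sizes (assumptions): n individuals, m columns of Z,
# p markers, t Newton-Raphson iterations per marker.
n, m, p, t = 100, 100, 1000, 10

fast = fastmremma_first_step_cost(n, m, p, t)
slow = emma_cost(n, m, p, t)
ratio = slow / fast
```

The two expressions differ only in the middle term, pn² versus pmn², so the gap grows roughly in proportion to m once that term dominates. These are asymptotic counts with constants dropped, not measured running times.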