Methodological implementation of mixed linear models in multi-locus genome-wide association studies.

doi:10.1093/BIB/BBW145


Yang-Jun Wen, Hanwen Zhang, Yuan-Li Ni, Bo Huang, Jin Zhang, Jian-Ying

Feng, Shi-Bo Wang, Jim M. Dunwell, Yuan-Ming Zhang and Rongling Wu

Corresponding authors: Yuan-Ming Zhang, College of Agriculture, Nanjing Agricultural University, Nanjing 210095, China. Tel.: +086 13505161564; Fax: +086 25 84399091. E-mail: soyzhang@njau.edu.cn; College of Plant Science and Technology, Huazhong Agricultural University, Wuhan 430070, China. Tel.: +086 13505161564. E-mail: soyzhang@mail.hzau.edu.cn; Rongling Wu, Center for Statistical Genetics, Pennsylvania State University, Hershey, PA 17033, USA. Tel.: +001 717 531 2037; Fax: +001 717 531 0480. E-mail: rwu@phs.psu.edu

Abstract

The mixed linear model has been widely used in genome-wide association studies (GWAS), but its application to multi-locus

GWAS analysis has not been explored and assessed. Here, we implemented a fast multi-locus random-SNP-effect EMMA

(FASTmrEMMA) model for GWAS. The model is built on random single nucleotide polymorphism (SNP) effects and a new algorithm. This algorithm whitens the covariance matrix of the polygenic matrix K and environmental noise, and specifies the number of nonzero eigenvalues as one. The model first chooses all putative quantitative trait nucleotides (QTNs) with ≤ 0.005 P-values and then includes them in a multi-locus model for true QTN detection. Owing to the multi-locus feature, the Bonferroni correction is replaced by a less stringent selection criterion. Results from analyses of both simulated and real data showed that FASTmrEMMA is more powerful in QTN detection and model fit, has less bias in QTN effect estimation and requires less running time than existing single- and multi-locus methods, such as empirical Bayes, settlement of mixed linear model under progressively exclusive relationship (SUPER), efficient mixed model association (EMMA), compressed MLM (CMLM) and enriched CMLM (ECMLM). FASTmrEMMA provides an alternative for multi-locus GWAS.

Key words: genome-wide association study; mixed linear model; multi-locus model; random effect

Introduction

Genome-wide association studies (GWAS) have been widely used

in the genetic dissection of quantitative traits in human, animal

and plant genetics, especially in combination with the output of

genomic sequencing technologies. The most popular method for

GWAS is the mixed linear model (MLM) method [1, 2] because of its

demonstrated effectiveness in correcting the inflation from many

small genetic effects (polygenic background) and controlling the

bias of population stratification [3–7]. Since the MLM of Yu et al. [2]

Yang-Jun Wen is a PhD candidate in the State Key Laboratory of Crop Genetics and Germplasm Enhancement at Nanjing Agricultural University, China.

Hanwen Zhang is a bachelor student in the Faculty of Applied Science at the University of British Columbia, Canada.

Yuan-Li Ni is a Master student in State Key Laboratory of Crop Genetics and Germplasm Enhancement at Nanjing Agricultural University, China.

Bo Huang is a Master student in State Key Laboratory of Crop Genetics and Germplasm Enhancement at Nanjing Agricultural University, China.

Jin Zhang is an associate professor in State Key Laboratory of Crop Genetics and Germplasm Enhancement at Nanjing Agricultural University, China.

Jian-Ying Feng is a lecturer in State Key Laboratory of Crop Genetics and Germplasm Enhancement at Nanjing Agricultural University, China.

Shi-Bo Wang is a postdoctoral research fellow in the College of Plant Science and Technology at Huazhong Agricultural University, China.

Jim M. Dunwell is a full professor in the School of Agriculture, Policy and Development at the University of Reading, United Kingdom.

Yuan-Ming Zhang is a full professor in State Key Laboratory of Crop Genetics and Germplasm Enhancement at Nanjing Agricultural University, Nanjing, China

and Chutian Scholar Professor of Statistical Genomics in the College of Plant Science and Technology at Huazhong Agricultural University, Wuhan, China.

Rongling Wu is Distinguished Professor of Public Health Sciences and Statistics and the Director of the Center for Statistical Genetics at The Pennsylvania State University, USA. He founded the Center for Computational Biology at Beijing Forestry University, China.

Submitted: 24 October 2016; Received (in revised form): 15 December 2016

© The Author 2017. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/

licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited.

For commercial re-use, please contact journals.permissions@oup.com


Briefings in Bioinformatics, 19(4), 2018, 700–712

doi: 10.1093/bib/bbw145

Advance Access Publication Date: 1 February 2017

Software Review

Downloaded from https://academic.oup.com/bib/article/19/4/700/2965637 by guest on 21 August 2022

was published, many MLM-based methods have been proposed.

However, most of them comprise a one-dimensional genome scan that tests one marker at a time, which requires a multiple-test correction of the significance threshold. The widely

used Bonferroni correction is often too conservative to detect

many important loci for quantitative traits.

Most quantitative traits are controlled by a few genes with large

effects and numerous polygenes with minor effects. However, the

current one-dimensional genome scan approaches for GWAS do

not match the true genetic model for these traits. To overcome this

issue, multi-locus methodologies have been developed; for ex-

ample, Bayesian least absolute shrinkage and selection operator

(LASSO) [8], adaptive mixed LASSO [9], penalized logistic regression [10, 11], Elastic-Net [12], empirical Bayes (E-BAYES) [13] and E-BAYES LASSO [14]. If the number of markers is several times larger

than sample size, all marker effects can be included in one single

model and estimated in an unbiased way. If the number of markers

is many times larger than sample size, however, these shrinkage

approaches will fail. In this situation, we should consider how to re-

duce the number of marker effects in the multi-locus genetic

model. For example, Zhou et al. [15] developed a Bayesian sparse

linear mixed model, and Moser et al. [16] proposed a Bayesian mix-

ture model. Under these models, two to four common components

in the mixture distribution were considered and only a few vari-

ance components were estimated. Although about 500 effects in

the genetic model are finally considered after several rounds of

Gibbs sampling, the computing time becomes a major concern for

these Bayesian approaches. Recently, Segura et al. [17] and Wang et al. [7] have proposed multi-locus MLM approaches. However, further refinement toward a fast algorithm is needed.

Zhang et al.’s [1] MLM method treated the quantitative trait

nucleotide (QTN) effect as being random, in which three compo-

nent variances owing to QTNs, polygenes and residual errors need

to be estimated. If the number of effects is large, this calculation

takes a long time. To reduce computing time and increase power

in QTN detection, a compressed MLM (CMLM) with a population

parameters previously determined (P3D) algorithm [18] and an en-

riched CMLM (ECMLM) [19] have been proposed. On the other

hand, Kang et al. [3] proposed an efficient mixed model association

(EMMA), and other authors suggested alternatives, such as EMMA

eXpedited (EMMAX) [20], FaST-LMM [21], FaST-LMM-Select [22],

genome-wide EMMA [4] and genome-wide rapid association using

mixed model and regression-Gamma (GRAMMAR-Gamma) [23].

Recently, settlement of mixed linear model under progressively

exclusive relationship (SUPER) [24] has been developed based on

FaST-LMM. Among the above fast methods, the SNP effect was

treated as being fixed. Goddard et al. [25] noted that a random-

marker model has several advantages, compared with the fixed

model [7, 26, 27]. For example, the random model approach will

shrink the estimated SNP effects toward zero. However, Goddard

et al. [25] did not provide an efficient computational algorithm to

estimate marker effects.
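The shrinkage noted above can be illustrated with a deliberately simplified case, which is not the FASTmrEMMA algorithm itself: a single marker with genotype vector z, no covariates and no polygenic background. The symbols b, \sigma_b^2 and \sigma^2 are introduced here for illustration only.

```latex
\begin{align}
y &= z b + e, \qquad e \sim N(0, \sigma^2 I_n) \\
\hat{b}_{\mathrm{fixed}} &= \frac{z^T y}{z^T z} \\
b \sim N(0, \sigma_b^2): \quad
\tilde{b}_{\mathrm{random}} &= \frac{z^T y}{z^T z + \sigma^2 / \sigma_b^2}
 = \frac{z^T z}{z^T z + \sigma^2 / \sigma_b^2}\, \hat{b}_{\mathrm{fixed}}
\end{align}
```

Because the multiplier of the fixed-effect estimate lies between 0 and 1, the random-effect estimate is pulled toward zero, and more strongly so when \sigma_b^2 is small relative to \sigma^2.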

In this article, we describe a new method that can quickly

scan each random-effect marker throughout the genome by

constructing a fast and new matrix transformation for the three

component variances. Then, all the putative QTNs with ≤ 0.005

P-values were placed into one multi-locus genetic model and

these QTN effects were estimated by EM empirical Bayes (EMEB)

[28] for true QTN identification. This new method, called fast

multi-locus random-SNP-effect EMMA (FASTmrEMMA), was

validated by analysis of real data from Arabidopsis [29] and by a

series of simulation studies and compared with the other meth-

ods, such as E-BAYES (multi-locus model) [30], SUPER, EMMA,

ECMLM and CMLM (single-locus model).

Statistical approaches for GWAS

Fast multi-locus random-SNP-effect EMMA

FASTmrEMMA (Appendix A) is a multi-locus two-stage GWAS

approach. In the first stage, each SNP effect was treated as random and a small proportion of SNPs was selected, based on the premise that most SNPs have no effect on the quantitative trait. Three techniques were implemented to save running time. First, the original MLM was multiplied by a new transformation matrix, whose purpose is to whiten the covariance matrix of the polygenic matrix K and environmental noise.

Then, a polygenic-to-residual variance ratio under the null hy-

pothesis was fixed in all the single marker genome tests. Finally,

the number of nonzero eigenvalues was specified as one. In the

second stage, all the selected SNP effects in the first stage were

placed into one multi-locus model and then estimated by

expectation and maximization empirical Bayes (EMEB) [28] for

true QTN identification. The new method has been implemented in R, and the software can be downloaded from https://cran.r-project.org/web/packages/mrMLM/index.html.
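The whitening step described above can be sketched numerically. The following is a minimal illustration, not the mrMLM implementation: it assumes one observation per individual (so the design matrix Z is the identity), a polygenic-to-residual variance ratio `lambda_g` that is already known, and a toy kinship matrix built from random markers. The function name `whitening_matrix` and all data are hypothetical.

```python
import numpy as np

def whitening_matrix(K, lambda_g):
    """Return C = Q diag(d^-1/2) Q^T for B = lambda_g * K + I, so that C B C^T = I."""
    n = K.shape[0]
    B = lambda_g * K + np.eye(n)      # covariance of polygenes plus noise, up to the residual variance
    eigvals, Q = np.linalg.eigh(B)    # B is symmetric positive definite
    return Q @ np.diag(eigvals ** -0.5) @ Q.T

rng = np.random.default_rng(0)
n, m = 50, 200
G = rng.choice([-1.0, 1.0], size=(n, m))  # toy marker matrix (hypothetical data)
K = G @ G.T / m                           # simple marker-based kinship (assumption)
lambda_g = 0.8                            # assumed polygenic-to-residual variance ratio

C = whitening_matrix(K, lambda_g)
B = lambda_g * K + np.eye(n)
whitened = C @ B @ C.T                    # should be the identity matrix
print(np.allclose(whitened, np.eye(n)))
```

After multiplying the phenotype and design matrices by `C`, the combined polygenic and residual term has an identity covariance, so each marker test only has to deal with the remaining QTN variance component.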

E-BAYES

E-BAYES is an existing multi-locus Bayesian approach imple-

mented by the SAS program [30], and was used as a gold stand-

ard for multi-locus model comparison. In this method, all the

SNP-effect variances are simultaneously estimated. Owing to the

multi-locus nature, Bonferroni correction is replaced by a less

stringent selection criterion. The critical value of P-value in the

significance test is set at 0.05 in three simulation experiments.

EMMA

EMMA is an existing single-locus genome scan method for

GWAS [3], and a fixed model version of the original MLM, in

which QTN effect is treated as a fixed effect with no prior distri-

bution assigned. The method was implemented by the R soft-

ware package EMMA (http://mouse.cs.ucla.edu/emma/).

CMLM and ECMLM

CMLM [18] and ECMLM [19] are existing single-locus genome

scan methods for GWAS. CMLM decreases the effective sample

size by clustering individuals into groups and eliminates the

need to re-compute variance components. ECMLM chooses the best combination of three kinship algorithms and eight grouping algorithms to increase statistical power. Both methods are also fixed-model versions of the original MLM and use an approximation algorithm for SNP effect estimation.

SUPER

FaST-LMM [21] is a newly developed algorithm in GWAS that

can solve the computational problem, but requires that the

number of SNPs be less than the number of individuals. To over-

come this shortcoming, SUPER [24] extracts a small subset of

SNPs and uses them in FaST-LMM. SUPER thus not only retains the computational advantage of FaST-LMM but also remarkably increases statistical power.

ECMLM, CMLM and SUPER were all implemented in the R software package GAPIT (http://zzlab.net/GAPIT).

The methodological comparison for the above approaches is

listed in Table 1.


Table 1. Comparison of six methods and their software for GWAS

| Case | FASTmrEMMA | E-BAYES | EMMA | CMLM | ECMLM | SUPER |
|---|---|---|---|---|---|---|
| Model | Multi-locus model | Multi-locus model | Single-locus model | Single-locus model | Single-locus model | Single-locus model |
| QTN effect | Random | Random | Fixed | Fixed | Fixed | Fixed |
| Polygenic background control | Yes | No | Yes | Yes | Yes | Yes |
| Population structure control | Yes | No | Yes | Yes | Yes | Yes |
| Number of variance components | Three | No. of effects | Two | Two | Two | Two |
| Polygenic-to-residual variance ratio | Fixed | NA | NA | Fixed | Fixed | NA |
| Significance critical value | LOD = 3 | P-value = 0.05 | P-value = 0.05/p | P-value = 0.05/p | P-value = 0.05/p | P-value = 0.05/p |
| Running time | Fast | Depends on the number of effects | Slow | Fast | Fast | Moderate |

LOD, logarithm of odds; p, number of markers.

Transformation matrix and performances:

FASTmrEMMA: $Q_1 \Lambda_r^{-1/2} Q_1^T$, where $(Q_1 \Lambda_r^{1/2} Q_1^T)(Q_1 \Lambda_r^{1/2} Q_1^T) = \hat{\lambda}_g Z K Z^T + I_n$. The covariance matrix of the polygenic matrix K and environmental noise is whitened. The number of nonzero eigenvalues is specified as one.

E-BAYES: Shrinkage is selective. Large effects are subject to virtually no shrinkage, while small effects are shrunken to zero.

EMMA: $U_R^T$, where $SHS = U_R \,\mathrm{diag}(\lambda_1 + \delta, \ldots, \lambda_{n-q} + \delta)\, U_R^T$, $H = Z K Z^T + \delta I$ and $S = I - X (X^T X)^{-1} X^T$. One-dimensional optimization by deriving the likelihood as a function of QTN-to-residual variance ratio.

CMLM: Kinship among individuals is replaced by the kinship among groups. Fits the groups as the random effect, estimates population parameters only once and then fixes them to test genetic markers.

ECMLM: Kinship among individuals is replaced by the kinship among groups. Chooses the best combination between kinship algorithms and grouping algorithms.

SUPER: Dramatically reduces the number of markers used to define individual relationships, and uses them in FaST-LMM.

Software website:

FASTmrEMMA: https://cran.r-project.org/web/packages/mrMLM/index.html
E-BAYES: http://statgen.ucr.edu/software.html
EMMA: http://mouse.cs.ucla.edu/emma/
CMLM, ECMLM and SUPER: http://zzlab.net/GAPIT
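To compare the significance critical values in Table 1 on a common scale, the sketch below converts a LOD score into a P-value and computes a Bonferroni threshold. It assumes that the likelihood ratio statistic 2 ln(10) × LOD follows a chi-square distribution with one degree of freedom, which the source does not state; the marker count of 10 000 is hypothetical.

```python
import math

def bonferroni_threshold(alpha, n_markers):
    """Per-marker significance level under the Bonferroni correction."""
    return alpha / n_markers

def lod_to_p_value(lod):
    """P-value for a LOD score, assuming LRT = 2*ln(10)*LOD is chi-square with 1 df."""
    lrt = 2.0 * math.log(10.0) * lod
    # Survival function of a chi-square(1) variable: P(X > x) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(lrt / 2.0))

p_markers = 10_000                       # hypothetical number of markers
bonferroni = bonferroni_threshold(0.05, p_markers)
lod3_p = lod_to_p_value(3.0)
print(bonferroni, lod3_p)
```

Under these assumptions, LOD = 3 corresponds to a P-value of roughly 2E-4, which is far less stringent than 0.05/p for genome-scale marker numbers.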


Results

Fast multi-locus random-SNP-effect EMMA

Estimation of the QTN variance

FASTmrEMMA (Appendix A) is a new algorithm that can ap-

proximate the estimation of QTN variance. Thus, we need to

know whether this approximation has a significant effect on

the estimate of QTN variance. To answer this question, four

flowering time traits in Arabidopsis [29] (Appendix B) were re-

analyzed by FASTmrEMMA and an exact method implemented

by PROC MIXED in SAS. The estimates for QTN variance are shown in Figure 1 and listed in Supplementary Table S1. The relative error between the two methods ranged from 0.0% to 24.09%, with an average of 1.60%, indicating that, on average, the approximation in FASTmrEMMA has little effect on the QTN variance estimate under the conditions of this analysis.
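The relative error reported above can be computed as in the following sketch. The exact definition used by the authors is not given in this section, so the formula (absolute difference divided by the exact estimate) is an assumption, and the variance estimates are hypothetical.

```python
def relative_error_percent(approximate, exact):
    """Absolute difference relative to the exact value, in percent."""
    return abs(approximate - exact) / abs(exact) * 100.0

# Hypothetical QTN-variance estimates: (approximate method, exact method)
pairs = [(1.02, 1.00), (0.48, 0.50), (2.00, 2.00)]
errors = [relative_error_percent(a, e) for a, e in pairs]
mean_error = sum(errors) / len(errors)
print(errors, mean_error)
```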

To confirm the effectiveness of FASTmrEMMA, three Monte

Carlo simulation experiments (Appendix C) were carried out and

the simulation procedures were almost the same as those in Wang et al. [7]. In the three experiments, various genetic backgrounds (none, polygenic and epistatic) were simulated to conduct a sensitivity analysis. Each sample in these simulation experiments was analyzed by six methods. Of the six methods, FASTmrEMMA is a new

multi-locus algorithm within the framework of MLM, E-BAYES

[30] is an existing multi-locus approach under the framework of

Bayesian statistics and SUPER, EMMA, ECMLM and CMLM are the

existing single-locus GWAS methods.

Statistical power for QTN detection

In the above three simulation experiments, the power for each

QTN was defined as the proportion of samples where the QTN

was detected (the P-value is smaller than the designated thresh-

old). When only six QTNs were simulated in the first experi-

ment, the power in the detection of each QTN was higher for

FASTmrEMMA than for the others (Figure 2A; Supplementary

Table S2). When a polygenic background ($h^2_{pg} = 0.092$) was added

to the first experiment, a similar trend was observed (Figure 2B;

Supplementary Table S2). When the polygenic background was

changed into an epistatic background ($h^2_{epi} = 0.15$), the results

were also similar to those in the first experiment (Figure 2C;

Supplementary Table S2). These results demonstrate the high-

est power of FASTmrEMMA across all the approaches under

various genetic backgrounds, although the other methods are

also robust under these backgrounds.
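The power definition used above (the proportion of simulated samples in which a QTN is detected) can be written as a short sketch. The P-values and the threshold below are hypothetical.

```python
def empirical_power(p_values_by_replicate, threshold):
    """Proportion of replicates in which the QTN's P-value is below the threshold."""
    detected = sum(1 for p in p_values_by_replicate if p < threshold)
    return detected / len(p_values_by_replicate)

# Hypothetical P-values of one simulated QTN across five replicates
p_values = [1e-6, 3e-4, 2e-7, 0.02, 5e-9]
power = empirical_power(p_values, threshold=1e-5)
print(power)
```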

Accuracy for estimated QTN effects

We used the average, mean squared error (MSE) and mean abso-

lute deviation (MAD) to measure the accuracy of an estimated

QTN effect. We evaluated the accuracies for the estimates of all

the six simulated QTNs across all the six methods. As a result,

the estimate of each QTN effect from FASTmrEMMA was much

closer to the true value than the estimates obtained from the

other methods. On two occasions (QTN numbers 1 and 4), however, the averages from E-BAYES were closer to the true value than those from FASTmrEMMA in the three simulation experiments (Supplementary Table S2). The MSE and MAD for each QTN effect were significantly lower for FASTmrEMMA than for the other methods, with two exceptions: for QTN number 6, E-BAYES had slightly higher accuracy than FASTmrEMMA in the first and second simulation experiments (Figure 2D–I; Supplementary Table S2). These results indicate that a higher accuracy for the estimate of QTN effect can be achieved using FASTmrEMMA than using the other methods.

Figure 1. Comparison of the QTN-variance estimates between fast multi-locus random-SNP-effect EMMA (FASTmrEMMA) and one exact algorithm implemented by PROC MIXED in SAS. LD: days to flowering under long days; SDV: days to flowering under short days with vernalization; 8W GH LN: leaf number at flowering with 8 weeks vernalization, greenhouse; and 8W GH FT: days to flowering, 8 weeks vernalization, greenhouse.
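The three accuracy measures named in this subsection can be sketched as follows. The sketch assumes that both MSE and MAD are taken around the true simulated effect, which the section does not spell out; the estimates are hypothetical.

```python
def accuracy_summary(estimates, true_effect):
    """Average, mean squared error and mean absolute deviation of effect estimates."""
    n = len(estimates)
    average = sum(estimates) / n
    mse = sum((e - true_effect) ** 2 for e in estimates) / n
    mad = sum(abs(e - true_effect) for e in estimates) / n
    return average, mse, mad

# Hypothetical estimates of one QTN effect across four replicates
estimates = [0.9, 1.1, 1.2, 0.8]
average, mse, mad = accuracy_summary(estimates, true_effect=1.0)
print(average, mse, mad)
```

An average close to the true effect indicates low bias, while small MSE and MAD values indicate that individual estimates stay close to the truth.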

False-positive rate and receiver operating characteristic curve

All the false QTNs, detected by the six methods, in three simula-

tion experiments were used to calculate the empirical false-

positive rates of the six methods. These results are listed in

Supplementary Table S3. In these three simulation experiments, the empirical false-positive rates of the six methods were between 0.357 and 7.785 (×1E-4), and had the same order

of magnitude. ECMLM has the lowest false-positive rate fol-

lowed by CMLM, FASTmrEMMA and EMMA methods, and SUPER

has the maximum false-positive rate followed by E-BAYES

method.

A receiver operating characteristic curve is a plot of the stat-

istical power against the controlled type I error. This curve is

frequently used to compare different methods for their efficiencies in the detection of significant effects; the higher the curve, the better the method. In the first simulation experiment, powers were calculated at 11 probability levels for significance between 1E-8 and 1E-3. The

results are shown in Figure 3. Among the six approaches,

clearly, FASTmrEMMA method is the best one and the next one

is E-BAYES.
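The curve described above can be traced by recomputing power at each significance level. In this sketch the 11 levels are spaced evenly on a log10 scale from 1E-8 to 1E-3, which is an assumption because the section gives only the range and the number of levels; the P-values are hypothetical.

```python
def power_curve(p_values, levels):
    """Power (proportion of P-values below the level) at each significance level."""
    return [sum(p < level for p in p_values) / len(p_values) for level in levels]

# 11 levels from 1E-8 to 1E-3, evenly spaced on a log10 scale (assumed spacing)
levels = [10 ** (-8 + 0.5 * i) for i in range(11)]

# Hypothetical P-values of one true QTN across five replicates
p_values = [1e-9, 5e-7, 2e-5, 4e-4, 0.03]
curve = power_curve(p_values, levels)
print(list(zip(levels, curve)))
```

Plotting `curve` against `levels` for each method on the same axes gives the comparison described in the text, where a higher curve indicates a better method.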

Computing time

In each of the three simulation experiments, computing times

for the six methods were recorded and are listed in

Supplementary Table S4. In summary, FASTmrEMMA has the

least computing time followed by ECMLM, E-BAYES, CMLM and

SUPER methods, and EMMA has the maximum computing time.

Real data analysis in Arabidopsis

To validate FASTmrEMMA, this new method along with E-

BAYES, SUPER, EMMA, ECMLM and CMLM was used to re-

analyze the Arabidopsis data [29] for days to flowering under

long days (LD), days to flowering under short days with

Figure 2. Comparison of FASTmrEMMA with single- and multi-locus approaches under various genetic backgrounds. The single-locus approaches are SUPER, EMMA, ECMLM and CMLM, and the multi-locus approach is E-BAYES. Powers are presented in A–C, MSEs in D–F and MADs in G–I. Six QTNs (A, D and G), six QTNs plus polygenes (B, E and H) and six QTNs plus three epistatic effects (C, F and I) were simulated in the first, second and third simulation experiments, respectively.
