What are the contributions in this paper?

The authors present results obtained with robust 15 Mahalanobis distance, which is a new statistic for genome scans available in the 2. 0 and 16 later versions of the package. Using simulated data, the authors compare pcadapt to other software 19 for genome scans ( BayeScan, hapflk, OutFLANK, sNMF ). The authors find that the proportion of 20 false discoveries is around a nominal false discovery rate set at 10 % with the exception of 21 BayeScan that generates 40 % of false discoveries. The authors also find that the power of BayeScan 22 is severely impacted by the presence of admixed individuals whereas pcadapt is not im23 pacted.

(Open Access) pcadapt: an R package to perform genome scans for selection based on principal component analysis (2017) | Keurcien Luu

pcadapt : an R package to perform genome scans1

for selection based on principal component2

analysis.3

Keurcien Luu

& Eric Bazin

& Michael G. B. Blum

Université Grenoble Alpes, CNRS, Laboratoire TIMC-IMAG, UMR 5525, France.5

Université Grenoble Alpes, CNR S, Laboratoire d’Ecologie Alpine UMR 5553, France6

Abstract7

The R package pcadapt performs genome scans to detect genes under selection based8

on population genomic data. It assumes that candidate markers are outliers with respect9

to how they are related to population structure. Because population structure is ascer-10

tained with principal component analysis, the package is fast and works with large-scale11

data. It can handle missing data and pooled sequencing data. By contrast to population-12

based approaches, the packa g e handle admixed individuals and does not require grouping13

individuals into populations. Since its ﬁrst release, pcadapt has evolved both in terms of14

statistical approach and software implementation. We present results obtained with robust15

Mahalanobis distance, which is a new statistic for genome scans available in the 2.0 and16

later versions of the packag e. When hierarchical population structure occurs, Mahalanobis17

distance is more powerful than the communality statistic that was implemented in the18

ﬁrst version of the package. Using simulated data, we compare pcadapt to other software19

for genome scans (BayeScan, hapﬂk, OutFLANK, sNMF). We ﬁnd tha t the proportion of20

false discoveries is around a nominal false discovery rate set at 10% with the exception of21

BayeScan that generates 40% of false discoveries. We also ﬁnd that the power of BayeScan22

is severely impacted by the presence of admixed individuals whereas pcadapt is not im-23

pacted. Last, we ﬁnd that pcadapt and hapﬂk are the most powerful software in scenarios24

of population divergence and range expansion. Because pcadapt handles next-generation25

sequencing data, it is a valuable tool for data analysis in molecular ecology.26

Keywords. population genetics, R package, outlier detection, Mahalanobis distance,27

principal component analysis.28

.CC-BY-NC 4.0 International licensea

certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted July 25, 2016. ; https://doi.org/10.1101/056135doi: bioRxiv preprint

Introduction29

Looking for variants with unexpectedly large diﬀerences of allele frequencies between30

populations is a common approach to detect signals of natural selection (Lewontin and31

Krakauer, 1973). When variants confer a selective advantage in the local environment,32

allele frequency changes are triggered by natural selection leading to unexpectedly large33

diﬀerences of allele frequencies between populations. To detect variants with large diﬀe-34

rences of allele frequencies, numerous test statistics have been proposed, which are usually35

based on chi-square approximations of F

-related test statistics (François et al., 2016).36

Statistical approaches for detecting selection should address several challenges. The37

ﬁrst challenge is to account for hierarchical population structure that arises when genetic38

diﬀerentiation between populations is not identical between all pairs of populations. Sta-39

tistical tests based on F

that do not account for hierarchical structure, when it occurs,40

generate a large excess of false positive loci (Bierne et al., 2013; Excoﬃer et al., 2009).41

AsecondchallengearisesbecauseapproachesbasedonF

-related measures require42

to group individuals into populations, although deﬁning populations is a diﬃcult ta sk43

(Waples and Gaggiotti, 2006). Individual sampling may not be population-based but44

based on more continuous sampling schemes (Lotterhos and Whitlock, 2015). Additionally45

assigning an admixed individual to a single population involves some arbitrariness because46

diﬀerent regions of its genome might come from diﬀerent populations (Pritchard et al.,47

2000). Several individual-based methods of genome scans have already been proposed to48

address this challenge and they are based on related techniques of multivariate analysis49

including principal component analysis (PCA), factor models, and non-negative matrix50

factorization (Duforet-Frebourg et al., 201 4; Hao et al., 2016; Galinsky et al., 2016; Chen51

et al., 2016; Duforet-Frebourg et al., 2016; Martins et al., 2016).52

The last challenge arises from the nature of multilocus datasets generated from next53

generation sequencing platforms. Because datasets are massive with a large number of54

molecular markers, Mo nte Carlo methods usua lly implemented in B ayesian statistics may55

be prohibitively slow (Lange et al., 2014). Additionally, next generation sequencing data56

may contain a substantial proportion of missing data tha t should be accounted for (Arnold57

et al., 2013; Gautier et al., 2013).58

To address the aforementioned challenges, we have developed the software pcadapt and59

the R package pcadapt. The software pcadapt is now deprecated and the R package only is60

.CC-BY-NC 4.0 International licensea

certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted July 25, 2016. ; https://doi.org/10.1101/056135doi: bioRxiv preprint

maintained. pcadapt assumes that markers excessively related with population structure61

are candidates for local adaptation. Since its ﬁrst release, pcadapt has substantially evolved62

both in terms of statistical approach and software implementation (Table 1).63

The ﬁrst release of pcadapt was a command line C software. It implemented a Monte64

Carlo approach based on a Bayesian factor model (Duforet-Frebourg et al., 2014). The65

test statistic for outlier detection was a Bayes factor. Because Monte Carlo methods can66

be computationally prohibitive with massive NGS data, we then d eveloped an alternative67

approach based on PCA. The ﬁrst statistic based on PCA was the communality statistic,68

which measures the percentage of variation of a SNP explained by the ﬁrst K principal69

components (Duforet-Frebourg et al., 2016). It was initially implemented with a command-70

line C software (the pcadapt fast command) before being implemented in the pcadapt R71

package. We do not maintain C versions o f pcadapt anymore. The whole analysis that goes72

from reading genotype ﬁles to detecting outlier SNPs can now be performed in R (R Core73

Team, 2015).74

The 2.0 and following versions of the R package implement a more powerful statistic75

for genome scans. The test statistic is a robust Mahalanobis distance. A vector containing76

Kz-scores measures to what extent a SNP is related to the ﬁrst K principal components.77

The Mahalanobis distance is then computed for each SNP to detect outliers for which78

the vector of z-scores do not follow the distribution of the main bulk of points. The79

term robust refers to the fact that the estimators of the mean and of the cova riance80

matrix of z, which are required to compute Mahalanobis distances, are not sensitive to81

the presence of outliers in the dataset (Maronna and Zamar, 2012). In the following, we82

provide a comparison of statistical power that shows that Mahalanobis distance provides83

more powerful genome scans compared to the communality statistic and to the Bayes84

factor that were implemented in previous versions of pcadapt.85

In addition to comparing the diﬀerent test statistics that were implemented in pcadapt,86

we compare statistic performance obtained with the 3.0 version of pcadapt and with o ther87

software of genome scans. We use simulated data to compare software in terms of false88

discovery rate (FDR) and statistical power. We consider data simulated under diﬀerent89

demographic models including island model, divergence model and range expansion. To90

perform comparisons, we include software that require to group individuals into popu-91

lations : BayeScan (Foll and Gaggiotti, 2008), the F

statistic as implemented in the92

hapﬂk software (Bonhomme et al., 2010), and OutFLANK that provides a robust estima-93

.CC-BY-NC 4.0 International licensea

certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted July 25, 2016. ; https://doi.org/10.1101/056135doi: bioRxiv preprint

tion of the null distribution of a F

test statistic (Whitlock and Lotterhos, 2015). We94

additionally consider the sNMF software that implements another individual-based test95

statistic for genome scans (Frichot et al., 2014; Martins et al., 2016).96

Statistical and Computational approach97

Input data98

The R package can handle diﬀerent data formats for the genotype data matrix. In the99

version 3.0 that is currently available on CRAN, the package can handle genotype data100

ﬁles in the vcf, ped and lfmm formats. In addition, the package can also handle a pcadapt101

format, which is a text ﬁle where each line contains the allele counts of all individuals102

at a given locus. When reading a genotype data matrix with the read.pcadapt function, a103

.pcadapt ﬁle is generated, which contains the genotype data in the pcadapt format.104

Choosing the number of principal components105

In the following, we denote by n the number of individuals, by p the number of genetic106

markers, and by G the genotype matrix that is composed o f n lines and p columns.107

The genotypic information at locus j for individual i is encoded by the allele count G

,108

1  i  n and 1  j  p, which is a value in 0, 1 for haploid species and in 0, 1, 2 for109

diploid species.110

First, we normalize the genotype matrix columnwise. For diploid data, we consider the111

usual normalization in population genomics where

=(G

p

)/(2⇥p

(1p

))

1/2

,and112

denotes the minor allele frequency for locus j (Patterson et al., 2006). The normalization113

for haploid data is similar except that the denominator is given by (p

(1  p

))

1/2

.114

Then, we use the normalized genotype matrix

G to ascertain population structure115

with PCA (Patterson et al., 2006). The number of principal components to consider is116

denoted K and is a parameter that should be chosen by the user. In order to choose K,we117

recommend to consider the g ra phical approach based on the scree plot (Jackson, 1993).118

The scree plot displays the eigenvalues of the covariance matrix ⌦ in descending order.119

Up to a constant, eigenvalues are proportional to the proportion of variance explained120

by each principal component. The eigenvalues that correspond to random variation lie on121

a straight line whereas the ones corresponding to population structure depart from the122

.CC-BY-NC 4.0 International licensea

certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted July 25, 2016. ; https://doi.org/10.1101/056135doi: bioRxiv preprint

line. We recommend to use Cattell’s rule that states that components corresponding to123

eigenvalues to the left of the straight line should be kept (Cattell, 1966).124

Test statistic125

We now detail how the package computes the test statistic. We consider multiple linear126

regressions by regressing each of the p SNPs by the K principal components X

,...,X

127

k=1



+ ✏

,j=1,...,p, (1)

where 

is the regression coeﬃcient corresponding to the j-th SNP regressed by the128

k-th principal component, and ✏

is the residuals vector. To summarize the result of the129

regression analysis for the j-th SNP, we return a vector of z-scores z

=(z

,...,z

)130

where z

corresponds to the z-score obtained when regressing the j-th SNP by the k-th131

principal component.132

The next step is to look for outliers based on the vector of z-scores. We consider a133

classical approach in multivariate analysis for outlier detection. The test statistic is a134

robust Mahalanobis distance D deﬁned as135

=(z

 ¯z)

⌃

1

 ¯z), (2)

where ⌃ is the (K ⇥ K)covariancematrixofthez-scores and ¯z is the vector of the

136

Kz-score means (Maronna a nd Zamar, 2012). When K>1,thecovariancematrix⌃137

is estimated with the Orthogonalized Gnanadesikan-Kettenring method that is a robust138

estimate of the covariance able to hand le large-scale data (Maronna and Zamar, 2012)139

(covRob function of the robust R package). When K =1, the variance is estimated with140

another robust estimate (cov.rob function of the MASS R package).141

Genomic Inﬂation Factor142

To perform multiple hypothesis testing, Mahalanobis distances should be transformed143

into p-values. If the z-scores were truly multivariate Gaussian, the Mahalanobis distances144

D should be chi-square distributed with K degrees of freedom. However, as usual for145

genome scans, there are confounding factors that inﬂate values of the test statistic and146

.CC-BY-NC 4.0 International licensea

certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted July 25, 2016. ; https://doi.org/10.1101/056135doi: bioRxiv preprint

pcadapt: an R package to perform genome scans for selection based on principal component analysis

Figures

Citations

Inferring Population Structure and Admixture Proportions in Low-Depth NGS Data

Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr.

Tracing the ancestry of modern bread wheats

Guidelines for planning genomic assessment and monitoring of locally adaptive variation to inform species conservation.

Climate and Urbanization Drive Mosquito Preference for Humans

References

R: A language and environment for statistical computing.

Inference of population structure using multilocus genotype data

The scree test for the number of factors

Statistical significance for genomewide studies

Population structure and eigenanalysis

Related Papers (5)

The variant call format and VCFtools

Estimating F-statistics for the analysis of population structure.

Inference of population structure using multilocus genotype data

Fast model-based estimation of ancestry in unrelated individuals

The Sequence Alignment/Map format and SAMtools

Frequently Asked Questions (1)

Q1. What are the contributions in this paper?