scispace - formally typeset
Open AccessJournal ArticleDOI

pcadapt: an R package to perform genome scans for selection based on principal component analysis

Reads0
Chats0
TLDR
The R package pcadapt performs genome scans to detect genes under selection based on population genomic data and is compared to other computer programs for genome scans, finding that pcadapt and hapflk are the most powerful in scenarios of population divergence and range expansion.
Abstract
The R package pcadapt performs genome scans to detect genes under selection based on population genomic data. It assumes that candidate markers are outliers with respect to how they are related to population structure. Because population structure is ascertained with principal component analysis, the package is fast and works with large-scale data. It can handle missing data and pooled sequencing data. By contrast to population-based approaches, the package handle admixed individuals and does not require grouping individuals into populations. Since its first release, pcadapt has evolved in terms of both statistical approach and software implementation. We present results obtained with robust Mahalanobis distance, which is a new statistic for genome scans available in the 2.0 and later versions of the package. When hierarchical population structure occurs, Mahalanobis distance is more powerful than the communality statistic that was implemented in the first version of the package. Using simulated data, we compare pcadapt to other computer programs for genome scans (BayeScan, hapflk, OutFLANK, sNMF). We find that the proportion of false discoveries is around a nominal false discovery rate set at 10% with the exception of BayeScan that generates 40% of false discoveries. We also find that the power of BayeScan is severely impacted by the presence of admixed individuals whereas pcadapt is not impacted. Last, we find that pcadapt and hapflk are the most powerful in scenarios of population divergence and range expansion. Because pcadapt handles next-generation sequencing data, it is a valuable tool for data analysis in molecular ecology.

read more

Content maybe subject to copyright    Report

pcadapt : an R package to perform genome scans1
for selection based on principal component2
analysis.3
Keurcien Luu
1
& Eric Bazin
2
& Michael G. B. Blum
1
4
1
Université Grenoble Alpes, CNRS, Laboratoire TIMC-IMAG, UMR 5525, France.5
2
Université Grenoble Alpes, CNR S, Laboratoire d’Ecologie Alpine UMR 5553, France6
Abstract7
The R package pcadapt performs genome scans to detect genes under selection based8
on population genomic data. It assumes that candidate markers are outliers with respect9
to how they are related to population structure. Because population structure is ascer-10
tained with principal component analysis, the package is fast and works with large-scale11
data. It can handle missing data and pooled sequencing data. By contrast to population-12
based approaches, the packa g e handle admixed individuals and does not require grouping13
individuals into populations. Since its first release, pcadapt has evolved both in terms of14
statistical approach and software implementation. We present results obtained with robust15
Mahalanobis distance, which is a new statistic for genome scans available in the 2.0 and16
later versions of the packag e. When hierarchical population structure occurs, Mahalanobis17
distance is more powerful than the communality statistic that was implemented in the18
first version of the package. Using simulated data, we compare pcadapt to other software19
for genome scans (BayeScan, hapflk, OutFLANK, sNMF). We find tha t the proportion of20
false discoveries is around a nominal false discovery rate set at 10% with the exception of21
BayeScan that generates 40% of false discoveries. We also find that the power of BayeScan22
is severely impacted by the presence of admixed individuals whereas pcadapt is not im-23
pacted. Last, we find that pcadapt and hapflk are the most powerful software in scenarios24
of population divergence and range expansion. Because pcadapt handles next-generation25
sequencing data, it is a valuable tool for data analysis in molecular ecology.26
Keywords. population genetics, R package, outlier detection, Mahalanobis distance,27
principal component analysis.28
1
.CC-BY-NC 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted July 25, 2016. ; https://doi.org/10.1101/056135doi: bioRxiv preprint

Introduction29
Looking for variants with unexpectedly large dierences of allele frequencies between30
populations is a common approach to detect signals of natural selection (Lewontin and31
Krakauer, 1973). When variants confer a selective advantage in the local environment,32
allele frequency changes are triggered by natural selection leading to unexpectedly large33
dierences of allele frequencies between populations. To detect variants with large die-34
rences of allele frequencies, numerous test statistics have been proposed, which are usually35
based on chi-square approximations of F
ST
-related test statistics (François et al., 2016).36
Statistical approaches for detecting selection should address several challenges. The37
first challenge is to account for hierarchical population structure that arises when genetic38
dierentiation between populations is not identical between all pairs of populations. Sta-39
tistical tests based on F
ST
that do not account for hierarchical structure, when it occurs,40
generate a large excess of false positive loci (Bierne et al., 2013; Excoer et al., 2009).41
AsecondchallengearisesbecauseapproachesbasedonF
ST
-related measures require42
to group individuals into populations, although defining populations is a dicult ta sk43
(Waples and Gaggiotti, 2006). Individual sampling may not be population-based but44
based on more continuous sampling schemes (Lotterhos and Whitlock, 2015). Additionally45
assigning an admixed individual to a single population involves some arbitrariness because46
dierent regions of its genome might come from dierent populations (Pritchard et al.,47
2000). Several individual-based methods of genome scans have already been proposed to48
address this challenge and they are based on related techniques of multivariate analysis49
including principal component analysis (PCA), factor models, and non-negative matrix50
factorization (Duforet-Frebourg et al., 201 4; Hao et al., 2016; Galinsky et al., 2016; Chen51
et al., 2016; Duforet-Frebourg et al., 2016; Martins et al., 2016).52
The last challenge arises from the nature of multilocus datasets generated from next53
generation sequencing platforms. Because datasets are massive with a large number of54
molecular markers, Mo nte Carlo methods usua lly implemented in B ayesian statistics may55
be prohibitively slow (Lange et al., 2014). Additionally, next generation sequencing data56
may contain a substantial proportion of missing data tha t should be accounted for (Arnold57
et al., 2013; Gautier et al., 2013).58
To address the aforementioned challenges, we have developed the software pcadapt and59
the R package pcadapt. The software pcadapt is now deprecated and the R package only is60
2
.CC-BY-NC 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted July 25, 2016. ; https://doi.org/10.1101/056135doi: bioRxiv preprint

maintained. pcadapt assumes that markers excessively related with population structure61
are candidates for local adaptation. Since its first release, pcadapt has substantially evolved62
both in terms of statistical approach and software implementation (Table 1).63
The first release of pcadapt was a command line C software. It implemented a Monte64
Carlo approach based on a Bayesian factor model (Duforet-Frebourg et al., 2014). The65
test statistic for outlier detection was a Bayes factor. Because Monte Carlo methods can66
be computationally prohibitive with massive NGS data, we then d eveloped an alternative67
approach based on PCA. The first statistic based on PCA was the communality statistic,68
which measures the percentage of variation of a SNP explained by the first K principal69
components (Duforet-Frebourg et al., 2016). It was initially implemented with a command-70
line C software (the pcadapt fast command) before being implemented in the pcadapt R71
package. We do not maintain C versions o f pcadapt anymore. The whole analysis that goes72
from reading genotype files to detecting outlier SNPs can now be performed in R (R Core73
Team, 2015).74
The 2.0 and following versions of the R package implement a more powerful statistic75
for genome scans. The test statistic is a robust Mahalanobis distance. A vector containing76
Kz-scores measures to what extent a SNP is related to the first K principal components.77
The Mahalanobis distance is then computed for each SNP to detect outliers for which78
the vector of z-scores do not follow the distribution of the main bulk of points. The79
term robust refers to the fact that the estimators of the mean and of the cova riance80
matrix of z, which are required to compute Mahalanobis distances, are not sensitive to81
the presence of outliers in the dataset (Maronna and Zamar, 2012). In the following, we82
provide a comparison of statistical power that shows that Mahalanobis distance provides83
more powerful genome scans compared to the communality statistic and to the Bayes84
factor that were implemented in previous versions of pcadapt.85
In addition to comparing the dierent test statistics that were implemented in pcadapt,86
we compare statistic performance obtained with the 3.0 version of pcadapt and with o ther87
software of genome scans. We use simulated data to compare software in terms of false88
discovery rate (FDR) and statistical power. We consider data simulated under dierent89
demographic models including island model, divergence model and range expansion. To90
perform comparisons, we include software that require to group individuals into popu-91
lations : BayeScan (Foll and Gaggiotti, 2008), the F
LK
statistic as implemented in the92
hapflk software (Bonhomme et al., 2010), and OutFLANK that provides a robust estima-93
3
.CC-BY-NC 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted July 25, 2016. ; https://doi.org/10.1101/056135doi: bioRxiv preprint

tion of the null distribution of a F
ST
test statistic (Whitlock and Lotterhos, 2015). We94
additionally consider the sNMF software that implements another individual-based test95
statistic for genome scans (Frichot et al., 2014; Martins et al., 2016).96
Statistical and Computational approach97
Input data98
The R package can handle dierent data formats for the genotype data matrix. In the99
version 3.0 that is currently available on CRAN, the package can handle genotype data100
files in the vcf, ped and lfmm formats. In addition, the package can also handle a pcadapt101
format, which is a text file where each line contains the allele counts of all individuals102
at a given locus. When reading a genotype data matrix with the read.pcadapt function, a103
.pcadapt file is generated, which contains the genotype data in the pcadapt format.104
Choosing the number of principal components105
In the following, we denote by n the number of individuals, by p the number of genetic106
markers, and by G the genotype matrix that is composed o f n lines and p columns.107
The genotypic information at locus j for individual i is encoded by the allele count G
ij
,108
1 i n and 1 j p, which is a value in 0, 1 for haploid species and in 0, 1, 2 for109
diploid species.110
First, we normalize the genotype matrix columnwise. For diploid data, we consider the111
usual normalization in population genomics where
˜
G
ij
=(G
ij
p
j
)/(2p
j
(1p
j
))
1/2
,and112
p
j
denotes the minor allele frequency for locus j (Patterson et al., 2006). The normalization113
for haploid data is similar except that the denominator is given by (p
j
(1 p
j
))
1/2
.114
Then, we use the normalized genotype matrix
˜
G to ascertain population structure115
with PCA (Patterson et al., 2006). The number of principal components to consider is116
denoted K and is a parameter that should be chosen by the user. In order to choose K,we117
recommend to consider the g ra phical approach based on the scree plot (Jackson, 1993).118
The scree plot displays the eigenvalues of the covariance matrix in descending order.119
Up to a constant, eigenvalues are proportional to the proportion of variance explained120
by each principal component. The eigenvalues that correspond to random variation lie on121
a straight line whereas the ones corresponding to population structure depart from the122
4
.CC-BY-NC 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted July 25, 2016. ; https://doi.org/10.1101/056135doi: bioRxiv preprint

line. We recommend to use Cattell’s rule that states that components corresponding to123
eigenvalues to the left of the straight line should be kept (Cattell, 1966).124
Test statistic125
We now detail how the package computes the test statistic. We consider multiple linear126
regressions by regressing each of the p SNPs by the K principal components X
1
,...,X
K
127
G
j
=
K
X
k=1
jk
X
k
+
j
,j=1,...,p, (1)
where
jk
is the regression coecient corresponding to the j-th SNP regressed by the128
k-th principal component, and
j
is the residuals vector. To summarize the result of the129
regression analysis for the j-th SNP, we return a vector of z-scores z
j
=(z
j1
,...,z
jK
)130
where z
jk
corresponds to the z-score obtained when regressing the j-th SNP by the k-th131
principal component.132
The next step is to look for outliers based on the vector of z-scores. We consider a133
classical approach in multivariate analysis for outlier detection. The test statistic is a134
robust Mahalanobis distance D defined as135
D
2
j
=(z
j
¯z)
T
1
(z
j
¯z), (2)
where is the (K K)covariancematrixofthez-scores and ¯z is the vector of the
136
Kz-score means (Maronna a nd Zamar, 2012). When K>1,thecovariancematrix137
is estimated with the Orthogonalized Gnanadesikan-Kettenring method that is a robust138
estimate of the covariance able to hand le large-scale data (Maronna and Zamar, 2012)139
(covRob function of the robust R package). When K =1, the variance is estimated with140
another robust estimate (cov.rob function of the MASS R package).141
Genomic Inflation Factor142
To perform multiple hypothesis testing, Mahalanobis distances should be transformed143
into p-values. If the z-scores were truly multivariate Gaussian, the Mahalanobis distances144
D should be chi-square distributed with K degrees of freedom. However, as usual for145
genome scans, there are confounding factors that inflate values of the test statistic and146
5
.CC-BY-NC 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted July 25, 2016. ; https://doi.org/10.1101/056135doi: bioRxiv preprint

Figures
Citations
More filters
Journal ArticleDOI

Inferring Population Structure and Admixture Proportions in Low-Depth NGS Data

TL;DR: A method for inferring population structure through PCA in an iterative heuristic approach of estimating individual allele frequencies is proposed, where improved accuracy in samples with low and variable sequencing depth for both simulated and real datasets are demonstrated.
Journal ArticleDOI

Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr.

TL;DR: Two R packages, bigstatsr and bigsnpr, allowing for the analysis of large scale genomic data to be performed within R, integrate most of the tools that are commonly used and demonstrate the scalability of the R packages by analyzing a simulated genome-wide dataset including 500 000 individuals and 1 million markers on a single desktop computer.
Journal ArticleDOI

Guidelines for planning genomic assessment and monitoring of locally adaptive variation to inform species conservation.

TL;DR: An adaptive management framework is offered to help conservation biologists and managers decide when genomics is likely to be effective in detecting local adaptation, and how to plan assessment and monitoring of adaptive variation to address conservation objectives.
References
More filters
Journal Article

R: A language and environment for statistical computing.

R Core Team
- 01 Jan 2014 - 
TL;DR: Copyright (©) 1999–2012 R Foundation for Statistical Computing; permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and permission notice are preserved on all copies.
Journal ArticleDOI

Inference of population structure using multilocus genotype data

TL;DR: Pritch et al. as discussed by the authors proposed a model-based clustering method for using multilocus genotype data to infer population structure and assign individuals to populations, which can be applied to most of the commonly used genetic markers, provided that they are not closely linked.
Journal ArticleDOI

The scree test for the number of factors

TL;DR: The Scree Test for the Number Of Factors this paper was first proposed in 1966 and has been used extensively in the field of behavioral analysis since then, e.g., in this paper.
Journal ArticleDOI

Statistical significance for genomewide studies

TL;DR: This work proposes an approach to measuring statistical significance in genomewide studies based on the concept of the false discovery rate, which offers a sensible balance between the number of true and false positives that is automatically calibrated and easily interpreted.
Journal ArticleDOI

Population structure and eigenanalysis

TL;DR: An approach to studying population structure (principal components analysis) is discussed that was first applied to genetic data by Cavalli-Sforza and colleagues, and results from modern statistics are used to develop formal significance tests for population differentiation.
Related Papers (5)
Frequently Asked Questions (1)
Q1. What are the contributions in this paper?

The authors present results obtained with robust 15 Mahalanobis distance, which is a new statistic for genome scans available in the 2. 0 and 16 later versions of the package. Using simulated data, the authors compare pcadapt to other software 19 for genome scans ( BayeScan, hapflk, OutFLANK, sNMF ). The authors find that the proportion of 20 false discoveries is around a nominal false discovery rate set at 10 % with the exception of 21 BayeScan that generates 40 % of false discoveries. The authors also find that the power of BayeScan 22 is severely impacted by the presence of admixed individuals whereas pcadapt is not im23 pacted.