What are the contributions in this paper?

The goal of this study is to test the ability of the STRUCTURE algorithm to detect the true number of clusters when patterns of dispersal among populations are not homogeneous, using various dispersal scenarios from data generated with an individual-based model. The authors found that in most cases the estimated ‘log probability of data’ does not provide a correct estimation of the number of clusters, K.

Why did the authors restrict their simulations to cases of moderate to strong structure?

The authors restricted their simulations to cases of moderate to strong structure at different hierarchical levels because their goal was to test the ability of the algorithm to detect the number of groups of individuals in situations when different layers of population structure exist, as is often the case in real situations.

What is the common criterion used to identify true number of populations?

The true number of populations (K) is often identified using the maximal value of L(K) returned by structure (Zeisset & Beebee 2001; Ciofi et al. 2002; Vernesi et al. 2003; Hampton et al. 2004).

How many clusters of individuals can be detected with the AFLP?

Few nonhuman species could be genotyped with such intensity, but this study indicates that the correct number of clusters can still be detected when differentiation is weaker than in the authors' main simulations; this was confirmed by further limited simulations with FST among archipelagos as low as 3.8% (see above).

What was the effect of subsampling of individuals or loci?

Subsampling of individuals or loci reduced the height of the modal value of ∆K (Fig. 4G, H), and 10 AFLPs produced a weaker signal than one microsatellite: the average height of the modal value of ∆K was half as large for the former.

What is the expected value of FST for this model?

The expected value of FST for this model cannot be easily analytically resolved, but global FST estimated over the 10 replicates (10 times 100 microsatellite loci) is 0.33 and pairwise FST range from 0.16 to 0.43.

How many individuals did Rosenberg et al. (2002) find in the data set?

Rosenberg et al. (2002) showed empirically on a very large microsatellite data set (377 loci) encompassing 1026 individuals from the five continents that humans cluster in five groups, loosely corresponding to the five continents.

(Open Access) Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. (2005) | Guillaume Evanno

Q: What is the predictor of the real number of clusters?

The authors find that ∆K, an ad hoc quantity related to the second order rate of change of the log probability of data with respect to the number of clusters, is a good predictor of the real number of clusters.

Q: What is the expected value of FST between archipelagos?

The expected value of FST is 0.30 between archipelagos (FArchipelago-Total), 0.16 between islands within archipelagos (FIsland-Archipelago), and 0.41 overall (FIsland-Total).

Q: Why did the authors divide m(|L′′(K)|) by s[L(K)]?

The authors divided m(|L′′(K)|) by s[L(K)] because they found a clear and general trend toward an increase of the variance of L(K) between runs as K increased.

Molecular Ecology (2005) 14, 2611–2620 doi: 10.1111/j.1365-294X.2005.02553.x

Blackwell Publishing, Ltd.

Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study

G. EVANNO, S. REGNAUT and J. GOUDET

Department of Ecology and Evolution, Biology building, University of Lausanne, CH 1015 Lausanne, Switzerland

Abstract

The identification of genetically homogeneous groups of individuals is a long standing issue in population genetics. A recent Bayesian algorithm implemented in the software STRUCTURE allows the identification of such groups. However, the ability of this algorithm to detect the true number of clusters (K) in a sample of individuals when patterns of dispersal among populations are not homogeneous has not been tested. The goal of this study is to carry out such tests, using various dispersal scenarios from data generated with an individual-based model. We found that in most cases the estimated ‘log probability of data’ does not provide a correct estimation of the number of clusters, K. However, using an ad hoc statistic ∆K based on the rate of change in the log probability of data between successive K values, we found that STRUCTURE accurately detects the uppermost hierarchical level of structure for the scenarios we tested. As might be expected, the results are sensitive to the type of genetic marker used (AFLP vs. microsatellite), the number of loci scored, the number of populations sampled, and the number of individuals typed in each sample.

Keywords: AFLP, hierarchical structure, microsatellite, simulations, structure software

Received 5 October 2004; revision accepted 17 February 2005

Introduction

Population genetics deals with the variations of allele frequencies between and within populations. The most widely used measures of population structure are Wright’s F statistics (Wright 1931). To calculate these indices, one needs first to define groups of individuals and then to use their genotypes to compute variance in allele frequencies. Thus, a fundamental prerequisite of any inference on the genetic structure of populations is the definition of populations themselves. Population determination is usually based upon geographical origin of samples or phenotypes. However, the genetic structure of populations is not always reflected in the geographical proximity of individuals. Populations that are not discretely distributed can nevertheless be genetically structured, due to unidentified barriers to gene flow. In addition, groups of individuals with different geographical locations, behavioural patterns or phenotypes are not necessarily genetically differentiated (for instance, migratory bats from the same breeding roost could be sampled thousands of kilometres apart in winter, see, e.g. Petit et al. 2001).

Among the methods not assuming predefined structure, tree-based methods use genetic distance between individuals and tree construction algorithms such as UPGMA or neighbour joining to group them in clusters (e.g. Saitou & Nei 1987). Similarly, multivariate analyses such as multidimensional scaling can help in identifying clusters of individuals. However, these graphical methods are only loosely connected to statistical procedures allowing the identification of homogeneous clusters of individuals.

An alternative model-based method developed recently by Pritchard et al. (2000) and implemented in the software structure aims at delineating clusters of individuals on the basis of their genotypes at multiple loci using a Bayesian approach. The model accounts for the presence of Hardy–Weinberg or linkage disequilibrium by introducing population structure and attempts to find population groupings that (as far as possible) are not in disequilibrium (Pritchard et al. 2000). The estimated log probability of data Pr(X|K) (equation 12 in Pritchard et al. 2000) for each value of K is given, allowing the estimation of the more likely number of clusters. A quantification of how likely each individual is to belong to each group is also given, information that can be then used to assign individuals to populations.

Correspondence: Jérôme Goudet, Fax: + 41 21 692 42 65; E-mail: Jerome.goudet@unil.ch

While the authors warn that Pr(X|K) is really only an indication of the number of clusters and an ad hoc guide (p. 949 in Pritchard et al. 2000; p. 3 in Pritchard & Wen 2003), the program has been widely used to this end. More generally, it has been used for detection of genetic structure in sample populations for medical purposes (Pritchard & Donnelly 2001; Satten et al. 2001), assignment studies (Rosenberg et al. 2001), population admixture and hybridization analysis (Beaumont et al. 2001; Goossens et al. 2002; Randi & Lucchini 2002), migration and dispersal analysis (Arnaud et al. 2003; Cegelski et al. 2003; Berry et al. 2004) and also to detect, with or without success, cryptic genetic structure of natural populations (Rosenberg et al. 2002; Caizergues et al. 2003). Among the Bayesian clustering methods, structure is the most widely used. While other methods have been developed (Banks & Eichert 2000; Dawson & Belkhir 2001; Corander et al. 2003) and still other methods for the assignment of individuals to populations exist (but imply the a priori knowledge of source populations: Paetkau et al. 1995; Rannala & Mountain 1997; Cornuet et al. 1999), we will focus here exclusively on the software structure.

Tests and comparative studies using empirical data sets have been performed to assess structure’s ability to assign individuals to their known cluster of origin (Pritchard & Donnelly 2001; Rosenberg et al. 2001; Manel et al. 2002; Turakulov & Easteal 2003). Most of these studies have proven the software to be efficient in assigning individuals to their populations of origin (albeit most are based on simulations with a limited number of populations and absence of dispersal between them). However, little is known about the crucial ability of structure to detect the real number of clusters (K) that compose a data set. Pritchard et al. (2000) showed that structure easily detects two to four highly differentiated populations but studies in molecular ecology usually include many more populations and very often these populations are not evenly distributed in space. Many studies have described migration patterns departing from Wright’s island model and including several hierarchical levels and/or isolation by distance. For instance, Chapuisat et al. (1997), Giles et al. (1998), Bouzat & Johnson (2004) or Trouvé et al. (2005) have documented situations with a hierarchical pattern of population structure, as groups are themselves clusters of differentiated populations. Another pattern frequently described is a contact zone between otherwise isolated populations. This situation implies a relative genetic isolation between the two groups of populations and sometimes also a pattern of isolation by distance within each group. Such a migration scheme was found for instance by Lugon-Moulin et al. (1999) who describe two longitudinal geographical patterns of isolated shrew populations separated by a zone through which dispersal is strongly reduced.

Many of these studies have been conducted using microsatellite markers to assess polymorphism. These DNA markers are widely used because they are both codominant and highly polymorphic (Jarne & Lagoda 1996). However, their development is relatively expensive, time consuming and can be difficult. An alternative family of markers also commonly used in population studies is the amplified fragment length polymorphisms (AFLPs) (Vos et al. 1995). AFLPs generate hundreds of polymorphic bands and are easier to develop than microsatellites, but they have the potential inconvenience of being dominant (a DNA band is either present or absent). These two types of markers have different properties. For instance, Gaudeul et al. (2004) reported very different levels of population structuring inferred from AFLPs and microsatellite markers. Both AFLP and microsatellites can be used for assignment studies but their respective ability to delineate clusters of individuals has not been compared so far.

The goal of this study is to test the ability of the algorithm underlying the software structure to detect the number of clusters in situations including more than two populations. While the program is increasingly used, it is unknown whether it can efficiently detect the real number of clusters in hierarchical systems where migration between populations is uneven. We present an evaluation of the performances of the method under three models of population structure: the island model, a contact zone, and a hierarchical island model. For each model, we simulated AFLP and microsatellite genotypic data sets that were subsequently run in structure, and then we analysed the output. We find that ∆K, an ad hoc quantity related to the second order rate of change of the log probability of data with respect to the number of clusters, is a good predictor of the real number of clusters. structure identifies groups of individuals corresponding to the uppermost hierarchical level, and performs well with both dominant and codominant markers.

Materials and methods

Simulation of the three migration models

We used the software easypop (Balloux 2001) to generate genotypic data from three different models of population structure: an island model, a hierarchical island model and a contact-zone model (Fig. 1). For all simulations and models of population structure, the mutation process followed the K-allele model (equal probability of mutations to any allelic state) at a rate of = 10 − . The modelled organisms are diploid, hermaphroditic and randomly mating (excluding selfing). Each simulation was run for 10 000 generations to obtain populations at drift, migration and mutation equilibrium. For each model, we generated 10 replicates where each individual genotype was made of 100 microsatellite loci, each with 10 possible allelic states.


The parameters that were varied for the simulations are the number of populations, the number of individuals per population, and the migration rates. These parameters are summarized in Table 1. For the finite island model, five populations of 100 individuals each are exchanging migrants at a rate 0.01. The expected value of FST for these simulations is 0.15.

The hierarchical island model (Slatkin & Voelm 1991) consists of five sets of four populations, each made of 50 individuals (Fig. 1). Migration occurs at a rate 0.02 within archipelagos and 0.001 between archipelagos (Table 1). The expected value of FST is 0.30 between archipelagos (FArchipelago-Total), 0.16 between islands within archipelagos (FIsland-Archipelago), and 0.41 overall (FIsland-Total).
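The three values quoted for the hierarchical island model can be cross-checked against one another. Wright's hierarchical F-statistics combine multiplicatively, (1 − FIsland-Total) = (1 − FIsland-Archipelago)(1 − FArchipelago-Total); this identity is standard but is not stated in the text, and the function name `combine_f` is purely illustrative. A minimal Python sketch of the check:

```python
def combine_f(f_lower: float, f_upper: float) -> float:
    """Combine two nested F-statistics into the overall value.

    f_lower: differentiation among units within groups
             (e.g. islands within archipelagos)
    f_upper: differentiation among groups
             (e.g. archipelagos within the total)
    """
    return 1.0 - (1.0 - f_lower) * (1.0 - f_upper)


# Hierarchical island model: 0.16 within archipelagos, 0.30 between them
f_island_total = combine_f(0.16, 0.30)
print(round(f_island_total, 2))  # 0.41, the overall value quoted in the text
```

The same arithmetic applied to the contact zone values (0.25 within sets, 0.17 between sets) gives about 0.38, which also agrees with the overall value reported for that model.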

The contact zone model is characterized by two sets of five populations, which are organized in a one-dimensional stepping-stone scheme (Kimura & Weiss 1964). Migration between the two sets occurs through the two central populations at a rate 10 times lower than within each set (Table 1). The expected value of FST for this model cannot be easily analytically resolved, but global FST estimated over the 10 replicates (10 times 100 microsatellite loci) is 0.33 and pairwise FST range from 0.16 to 0.43. The observed value of FST is 0.17 between the two sets (FSet-Total), 0.25 between populations within sets (FPopulation-Set), and 0.38 overall (FPopulation-Total).

easypop generates codominant, microsatellite-like genotypic data. In order to simulate dominant AFLP data, the genotypes generated by easypop were recoded as biallelic loci, in a manner similar to Mariette et al. (2002): a randomly chosen half of the microsatellite alleles were coded as ‘1’ and considered dominant while the second half was coded as ‘2’ and considered recessive. Because with dominant data, one cannot distinguish between a dominant homozygote and a heterozygote, dominant phenotypes (obtained from genotypes 1–1 and 1–2/2–1) were recoded as 1–0, where 0 indicates a missing datum. Thus, AFLP data sets bear a proportion of missing data that microsatellite sets do not. This coding of alleles is different from what is recommended in the user’s manual of structure (Pritchard & Wen 2003), which suggests that dominant markers can be dealt with by coding each phenotype (absence or presence of a band) by a single allele and a missing datum (1–0 for dominant and 2–0 for recessive). We did not use this method because it implies adding a missing value also for recessive homozygotes, which seems unnecessary.
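The recoding rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: alleles are taken to be labelled 1 to 10, the random split into dominant and recessive alleles is drawn with Python's `random` module, and the helper names `pick_dominant_alleles` and `recode_genotype` are hypothetical. The text does not say whether the split was drawn once or separately for each locus.

```python
import random


def pick_dominant_alleles(n_alleles, rng):
    """Randomly choose half of the allele labels 1..n_alleles as 'dominant'."""
    return set(rng.sample(range(1, n_alleles + 1), n_alleles // 2))


def recode_genotype(genotype, dominant_alleles):
    """Recode one diploid microsatellite genotype as a dominant marker.

    Any genotype carrying at least one dominant allele (1-1, 1-2 or 2-1 in
    the paper's coding) becomes (1, 0), where 0 is the missing datum.
    A genotype with two recessive alleles stays fully observed as (2, 2).
    """
    first, second = genotype
    if first in dominant_alleles or second in dominant_alleles:
        return (1, 0)
    return (2, 2)


rng = random.Random(42)  # fixed seed only to make the sketch repeatable
dominant = pick_dominant_alleles(10, rng)
genotypes = [(3, 7), (7, 7), (1, 10)]
aflp = [recode_genotype(g, dominant) for g in genotypes]
```

Only band-present phenotypes receive a missing value here, which is the difference from the manual's 1–0 / 2–0 coding noted in the paragraph above.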

Microsatellite data sets given to structure were made of 10 loci as this is a number commonly found in molecular ecology studies. AFLP data sets were made of 100 loci, which seems conservative as AFLP-based studies often include hundreds of markers (Luikart et al. 2003). A further reason for this 1:10 ratio of microsatellite loci to AFLP bands comes from a recent simulation-based study (Mariette et al. 2002) showing that at least 10 times more AFLP than microsatellite loci are necessary to reach a similar accuracy in the estimation of genetic diversity.

Sampling scheme

To assess the effects of sampling strategies on the method’s accuracy, analyses were also carried out on partial data sets. We investigated first the effect of the number of typed loci by sampling only five microsatellites or 50 AFLP bands (Table 2). We also looked at the effect of sampling a subset of individuals from each population (Table 2). Last, for the hierarchical island model, we also looked at the effect of sampling a subset of the populations by randomly omitting one island per archipelago (Table 2). We tested whether partial sampling affected the detection of the true K by comparing results between full and partial data sets.
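A partial data set of this kind could be drawn as in the Python sketch below. It rests on assumptions that the text does not confirm: individuals and loci are drawn uniformly at random without replacement, the same loci are kept in every population, and the data layout and the function name `subsample` are invented for illustration.

```python
import random


def subsample(genotypes_by_pop, n_individuals, n_loci, rng):
    """Keep n_individuals per population and the same n_loci loci everywhere.

    genotypes_by_pop maps a population label to a list of individuals;
    each individual is a list holding one genotype per locus.
    """
    first_pop = next(iter(genotypes_by_pop.values()))
    total_loci = len(first_pop[0])
    # Draw the retained loci once so that all populations share them
    kept_loci = sorted(rng.sample(range(total_loci), n_loci))
    reduced = {}
    for pop, individuals in genotypes_by_pop.items():
        chosen = rng.sample(individuals, n_individuals)
        reduced[pop] = [[ind[i] for i in kept_loci] for ind in chosen]
    return reduced, kept_loci


# Toy data: 2 populations, 4 individuals each, 6 loci per individual
toy = {
    pop: [[(pop, ind, locus) for locus in range(6)] for ind in range(4)]
    for pop in ("A", "B")
}
reduced, kept_loci = subsample(toy, n_individuals=2, n_loci=3,
                               rng=random.Random(0))
```

With the sizes in Table 2, the call would use `n_individuals=20` and `n_loci=5` for microsatellites or `n_loci=50` for AFLPs.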

Table 1 Parameters of the three migration models

| Model | Number of populations | Number of individuals/population | Migration rate within set | Migration rate between sets |
| Island model | 5 | 100 | 10⁻² | n/a |
| Contact zone | 10, 2 sets of 5 pop. | 100 | 10⁻² | 10⁻³ |
| Hierarchical island model | 20, 5 sets of 4 pop. | 50 | 2 × 10⁻² | 10⁻³ |

Fig. 1 Schematic representation of the three migration models: (A) Island model. (B) Hierarchical island model. (C) Contact zone. Open arrows represent the migration rates between sets of populations and solid arrows the migration rates within sets (see also Table 1).


Structure runs

We set most parameters to their default values as advised in the user’s manual of structure 2.0 (Pritchard & Wen 2003). Specifically, we chose the admixture model and the option of correlated allele frequencies between populations, as this configuration is considered best by Falush et al. (2003) in cases of subtle population structure. Similarly, we let the degree of admixture alpha be inferred from the data. When alpha is close to zero, most individuals are essentially from one population or another, while alpha > 1 means that most individuals are admixed (Falush et al. 2003). Lambda, the parameter of the distribution of allelic frequencies, was set to one, as the manual advises. From a pilot study, we found that a length of the burn-in and MCMC (Markov chain Monte Carlo) of 10 000 each was sufficient. Longer burn-in or MCMC did not significantly change the results. As we found that different runs could produce different likelihood values (even with much longer chains, e.g. 1 000 000), for each data set 20 runs were carried out in order to quantify the amount of variation of the likelihood for each K. The range of possible Ks we tested was from 1 or 2 to the true number of populations plus 3.

Statistics used to select K

The model choice criterion implemented in structure to detect the true K is an estimate of the posterior probability of the data for a given K, Pr(X|K) (Pritchard et al. 2000). This value, called ‘Ln P(D)’ in structure output, is obtained by first computing the log likelihood of the data at each step of the MCMC. Then the average of these values is computed and half their variance is subtracted from the mean. This gives ‘Ln P(D)’, the model choice criterion to which we refer as L(K) afterwards. The true number of populations (K) is often identified using the maximal value of L(K) returned by structure (Zeisset & Beebee 2001; Ciofi et al. 2002; Vernesi et al. 2003; Hampton et al. 2004). However, we observed in our simulations that in most cases, once the real K is reached, L(K) at larger Ks plateaus or continues increasing slightly (a phenomenon mentioned in the structure manual, Pritchard & Wen 2003) and the variance between runs increases (Fig. 2A).
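The arithmetic behind ‘Ln P(D)’ as described above (mean of the sampled log likelihoods minus half their variance) can be written as a short Python sketch. It is not structure's own code: the function name `ln_p_d` is hypothetical, the input is assumed to be the log likelihoods recorded after burn-in, and the use of the population variance (`pvariance`) rather than the sample variance is an assumption.

```python
from statistics import fmean, pvariance


def ln_p_d(log_likelihoods):
    """Mean of the sampled log likelihoods minus half their variance."""
    return fmean(log_likelihoods) - pvariance(log_likelihoods) / 2.0


# Toy chain of four log likelihood values: mean -1000, variance 8
samples = [-1000.0, -1004.0, -996.0, -1000.0]
estimate = ln_p_d(samples)  # -1000 - 8 / 2 = -1004
```

The variance term penalises chains whose likelihood fluctuates widely, so two runs with the same mean log likelihood can still return different values of L(K).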

The distribution of L(K) did not show a clear mode for the true K, but we found that an ad hoc quantity based on the second order rate of change of the likelihood function with respect to K (∆K) did show a clear peak at the true value of K. The rationale for this ∆K is to make salient the break in slope of the distribution of L(K) at the true K. It is best explained graphically, as is shown in Fig. 2. First, we plotted the mean likelihood L(K) over 20 runs for each K (Fig. 2A). Second, we plotted the mean difference between successive likelihood values of K, L′(K) = L(K) − L(K − 1) (Fig. 2B). This difference corresponds to the rate of change of the likelihood function with respect to K, and is noted L′(K). In a third step we plotted the (absolute value of the) difference between successive values of L′(K), |L′′(K)| = |L′(K + 1) − L′(K)| (Fig. 2C). This corresponds to the second order rate of change of L(K) with respect to K. Finally, we estimated ∆K as the mean of the absolute values of L′′(K) averaged over 20 runs divided by the standard deviation of L(K), ∆K = m(|L′′(K)|)/s[L(K)], which expands to ∆K = m(|L(K + 1) − 2 L(K) + L(K − 1)|)/s[L(K)] (Fig. 2D). We divided m(|L′′(K)|) by s[L(K)] because we found a clear and general trend toward an increase of the variance of L(K) between runs as K increased. We found the modal value of the distribution of ∆K to be located at the real K. We used the height of this modal value as an indicator of the strength of the signal detected by structure.
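The four steps above reduce to the expanded formula ∆K = m(|L(K + 1) − 2 L(K) + L(K − 1)|)/s[L(K)], which the following Python sketch evaluates on made-up numbers. Several choices are assumptions rather than details given in the text: runs are paired by their position in each list when taking the per-run second difference, the sample standard deviation is used for s[L(K)], and the function name `evanno_delta_k` and the toy values are illustrative only.

```python
from statistics import fmean, stdev


def evanno_delta_k(lnp_by_k):
    """Return {K: delta K} for every K that has both neighbours K-1 and K+1.

    lnp_by_k maps each tested K to a list of L(K) values, one per run.
    """
    ks = sorted(lnp_by_k)
    delta_k = {}
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        if k - prev_k != 1 or next_k - k != 1:
            raise ValueError("K values must be consecutive integers")
        # |L''(K)| per run: |L(K+1) - 2 L(K) + L(K-1)|
        second_diffs = [
            abs(nxt - 2.0 * cur + prv)
            for prv, cur, nxt in zip(lnp_by_k[prev_k], lnp_by_k[k],
                                     lnp_by_k[next_k])
        ]
        # Fails with a division by zero if all runs at this K are identical
        delta_k[k] = fmean(second_diffs) / stdev(lnp_by_k[k])
    return delta_k


# Toy L(K) values for three runs per K, with a break in slope at K = 3
toy_runs = {
    1: [-1000.0, -1002.0, -998.0],
    2: [-800.0, -801.0, -799.0],
    3: [-600.0, -600.5, -599.5],
    4: [-590.0, -595.0, -585.0],
    5: [-585.0, -600.0, -570.0],
}
delta_k = evanno_delta_k(toy_runs)
best_k = max(delta_k, key=delta_k.get)  # K with the highest delta K
```

In this toy example L(K) keeps rising after K = 3, so its maximum would point to K = 5, while ∆K peaks at K = 3. Note that ∆K is undefined for the smallest and largest K tested, since each lacks one neighbour.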

Results

Over all simulation scenarios, we seldom found a mode of the likelihood distribution L(K) at the real K (Fig. 3). In most cases, the likelihood increased until the real K was reached, and then leveled off (often still increasing after the real K, Fig. 3). On the other hand, the distribution of ∆K almost always showed a mode at the real K (Fig. 4).

Table 2 Sampling scheme used for each model. In each situation, all the combinations (full and partial) between the numbers of individuals and loci were tested. For the hierarchical island model the number of populations was also subsampled: 15 out of 20 populations (three populations per archipelago)

| Model | Populations, full | Populations, partial | Individuals/population, full | Individuals/population, partial | Loci, full: AFLP | Loci, full: microsat | Loci, partial: AFLP | Loci, partial: microsat |
| Island model | 5 | n/a | 100 | 20 | 100 | 10 | 50 | 5 |
| Contact zone | 10 | n/a | 100 | 20 | 100 | 10 | 50 | 5 |
| Hierarchical island model | 20 | 15 | 50 | 20 | 100 | 10 | 50 | 5 |

For all three models, and both in full or partial configurations, structure identified a number of groups corresponding to the uppermost hierarchical level of genetic partitioning between populations. structure primarily highlights the between-sets of populations level for the hierarchical island model and the contact zone, and the between populations level for the island model. Importantly, these results were obtained by using the modal value of ∆K rather than the maximum value of L(K) (Fig. 2A, D). In Fig. 4, the magnitude of ∆K is plotted for each model and sampling scheme, which allows the comparison of results obtained with different parameter sets. Overall, there was some variance among likelihood values L(K) for the different replicates of the same parameter set, but for 29 out of 32 models, all replicates had the same modal value for ∆K.

Island model

For the full data set, as well as for the partial samplings, the modal value of ∆K was K = 5, the true number of populations (Fig. 4A, B). The only situations in which structure failed to detect the real K were the partial samplings of 20 individuals and five microsatellite markers as well as 20 individuals and 50 AFLP markers (Fig. 4A, B). For the case with microsatellites which failed to work, we did not see any plateau nor a clear maximum in the likelihood distribution of K for any of the 10 replicates, and the software found a maximal likelihood value at K = 5 in 2 replicates, at K = 2 twice, at K = 3 four times and at K = 4 twice. For the case where the true K was not detected by AFLPs, although most replicates had a distribution of L(K) with a break in slope at K = 5 followed by a plateau, this pattern was not strong enough to be translated into a high ∆K.

There is a stronger effect of the partial sampling of individuals and loci for microsatellites than AFLP markers (Fig. 4A, B). For the complete data sets, microsatellites seem to perform better than AFLP markers (the peak is higher) whereas for partial sampling, the results are similar for both types of marker (Fig. 4A, B).

Hierarchical island model

For this model and under exhaustive sampling, the highest likelihood was observed for K = 11 for AFLP (Fig. 3C) and K = 12 for microsatellites (Fig. 3D) but the modal value of ∆K was at K = 5, which corresponds to the number of archipelagos. Using ∆K, we observed that structure always found the modal value to be K = 5 when all populations were sampled (Fig. 4C, D). When we omitted one island in each of the archipelagos there was only one case of partial

Fig. 2 Description of the four steps for the graphical method allowing detection of the true number of groups K*. (A) Mean L(K) (± SD) over 20 runs for each K value. The model considered here is a hierarchical island model using all 100 individuals per population and 50 AFLP loci. (B) Rate of change of the likelihood distribution (mean ± SD) calculated as L′(K) = L(K) – L(K – 1). (C) Absolute values of the second order rate of change of the likelihood distribution (mean ± SD) calculated according to the formula: |L′′(K)| = |L′(K + 1) – L′(K)|. (D) ∆K calculated as ∆K = m|L′′(K)|/s[L(K)]. The modal value of this distribution is the true K(*) or the uppermost level of structure, here five clusters.
