
Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study.

Guillaume Evanno, +2 more
- 01 Jul 2005 - 
- Vol. 14, Iss: 8, pp 2611-2620
TLDR
It is found that in most cases the estimated ‘log probability of data’ does not provide a correct estimation of the number of clusters, K, and using an ad hoc statistic ΔK based on the rate of change in the log probability between successive K values, structure accurately detects the uppermost hierarchical level of structure for the scenarios the authors tested.
Abstract
The identification of genetically homogeneous groups of individuals is a long standing issue in population genetics. A recent Bayesian algorithm implemented in the software STRUCTURE allows the identification of such groups. However, the ability of this algorithm to detect the true number of clusters (K) in a sample of individuals when patterns of dispersal among populations are not homogeneous has not been tested. The goal of this study is to carry out such tests, using various dispersal scenarios from data generated with an individual-based model. We found that in most cases the estimated 'log probability of data' does not provide a correct estimation of the number of clusters, K. However, using an ad hoc statistic ΔK based on the rate of change in the log probability of data between successive K values, we found that STRUCTURE accurately detects the uppermost hierarchical level of structure for the scenarios we tested. As might be expected, the results are sensitive to the type of genetic marker used (AFLP vs. microsatellite), the number of loci scored, the number of populations sampled, and the number of individuals typed in each sample.


Molecular Ecology (2005) 14, 2611–2620
doi: 10.1111/j.1365-294X.2005.02553.x
© 2005 Blackwell Publishing Ltd
Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study
G. EVANNO, S. REGNAUT and J. GOUDET
Department of Ecology and Evolution, Biology building, University of Lausanne, CH 1015 Lausanne, Switzerland
Keywords: AFLP, hierarchical structure, microsatellite, simulations, STRUCTURE software
Received 5 October 2004; revision accepted 17 February 2005
Introduction
Population genetics deals with the variations of allele frequencies between and within populations. The most widely used measures of population structure are Wright's F statistics (Wright 1931). To calculate these indices, one needs first to define groups of individuals and then to use their genotypes to compute variance in allele frequencies. Thus, a fundamental prerequisite of any inference on the genetic structure of populations is the definition of populations themselves. Population determination is usually based upon geographical origin of samples or phenotypes. However, the genetic structure of populations is not always reflected in the geographical proximity of individuals. Populations that are not discretely distributed can nevertheless be genetically structured, due to unidentified barriers to gene flow. In addition, groups of individuals with different geographical locations, behavioural patterns or phenotypes are not necessarily genetically differentiated (for instance, migratory bats from the same breeding roost could be sampled thousands of kilometres apart in winter, see, e.g. Petit et al. 2001).
Among the methods not assuming predefined structure, tree-based methods use genetic distance between individuals and tree construction algorithms such as UPGMA or neighbour joining to group them in clusters (e.g. Saitou & Nei 1987). Similarly, multivariate analyses such as multidimensional scaling can help in identifying clusters of individuals. However, these graphical methods are only loosely connected to statistical procedures allowing the identification of homogeneous clusters of individuals.
An alternative model-based method developed recently by Pritchard et al. (2000) and implemented in the software STRUCTURE aims at delineating clusters of individuals on the basis of their genotypes at multiple loci using a Bayesian approach. The model accounts for the presence of Hardy–Weinberg or linkage disequilibrium by introducing population structure and attempts to find population groupings that (as far as possible) are not in disequilibrium (Pritchard et al. 2000). The estimated log probability of data Pr(X|K) (equation 12 in Pritchard et al. 2000) for each value of K is given, allowing the estimation of the more likely number of clusters. A quantification of how likely each individual is to belong to each group is also given, information that can be then used to assign individuals to populations. While the authors warn that Pr(X|K) is really only an indication of the number of clusters and an ad hoc guide (p. 949 in Pritchard et al. 2000; p. 3 in Pritchard & Wen 2003), the program has been widely used to this end. More generally, it has been used for detection of genetic structure in sample populations for medical purposes (Pritchard & Donnelly 2001; Satten et al. 2001), assignment studies (Rosenberg et al. 2001), population admixture and hybridization analysis (Beaumont et al. 2001; Goossens et al. 2002; Randi & Lucchini 2002), migration and dispersal analysis (Arnaud et al. 2003; Cegelski et al. 2003; Berry et al. 2004) and also to detect, with or without success, cryptic genetic structure of natural populations (Rosenberg et al. 2002; Caizergues et al. 2003). Among the Bayesian clustering methods, STRUCTURE is the most widely used. While other methods have been developed (Banks & Eichert 2000; Dawson & Belkhir 2001; Corander et al. 2003) and still other methods for the assignment of individuals to populations exist (but imply the a priori knowledge of source populations: Paetkau et al. 1995; Rannala & Mountain 1997; Cornuet et al. 1999), we will focus here exclusively on the software STRUCTURE.

Correspondence: Jérôme Goudet, Fax: +41 21 692 42 65; E-mail: Jerome.goudet@unil.ch
Tests and comparative studies using empirical data sets have been performed to assess STRUCTURE's ability in assigning individuals to their known cluster of origin (Pritchard & Donnelly 2001; Rosenberg et al. 2001; Manel et al. 2002; Turakulov & Easteal 2003). Most of these studies have proven the software to be efficient in assigning individuals to their populations of origin (albeit most are based on simulations with limited number of populations and absence of dispersal between them). However, little is known on the crucial ability of STRUCTURE to detect the real number of clusters (K) which composes a data set. Pritchard et al. (2000) showed that STRUCTURE easily detects two to four highly differentiated populations but studies in molecular ecology usually include many more populations and very often these populations are not evenly distributed in space. Many studies have described migration patterns departing from Wright's island model and including several hierarchical levels and/or isolation by distance. For instance, Chapuisat et al. (1997), Giles et al. (1998), Bouzat & Johnson (2004) or Trouvé et al. (2005) have documented situations with a hierarchical pattern of population structure, as groups are themselves clusters of differentiated populations. Another pattern frequently described is a contact zone between otherwise isolated populations. This situation implies a relative genetic isolation between the two groups of populations and sometimes also a pattern of isolation by distance within each group. Such a migration scheme was found for instance by Lugon-Moulin et al. (1999) who describe two longitudinal geographical patterns of isolated shrew populations separated by a zone through which dispersal is strongly reduced.
Many of these studies have been conducted using microsatellite markers to assess polymorphism. These DNA markers are widely used because they are both codominant and highly polymorphic (Jarne & Lagoda 1996). However, their development is relatively expensive, time consuming and can be difficult. An alternative family of markers also commonly used in population studies is amplified fragment length polymorphisms (AFLPs) (Vos et al. 1995). AFLPs generate hundreds of polymorphic bands and are easier to develop than microsatellites, but they have the potential inconvenience of being dominant (a DNA band is either present or absent). These two types of markers have different properties. For instance, Gaudeul et al. (2004) reported very different levels of population structuring inferred from AFLPs and microsatellite markers. Both AFLPs and microsatellites can be used for assignment studies but their respective ability to delineate clusters of individuals has not been compared so far.
The goal of this study is to test the ability of the algorithm underlying the software STRUCTURE to detect the number of clusters in situations including more than two populations. While the program is increasingly used, it is unknown whether it can efficiently detect the real number of clusters in hierarchical systems where migration between populations is uneven. We present an evaluation of the performance of the method under three models of population structure: the island model, a contact zone, and a hierarchical island model. For each model, we simulated AFLP and microsatellite genotypic data sets that were subsequently run in STRUCTURE, and then we analysed the output. We find that ΔK, an ad hoc quantity related to the second order rate of change of the log probability of data with respect to the number of clusters, is a good predictor of the real number of clusters. STRUCTURE identifies groups of individuals corresponding to the uppermost hierarchical level, and performs well with both dominant and codominant markers.
Materials and methods
Simulation of the three migration models
We used the software EASYPOP (Balloux 2001) to generate genotypic data from three different models of population structure: an island model, a hierarchical island model and a contact-zone model (Fig. 1). For all simulations and models of population structure, the mutation process followed the K allele model (equal probability of mutations to any allelic state) at a rate of µ = 10⁻³. The modelled organisms are diploid, hermaphroditic and randomly mating (excluding selfing). Each simulation was run for 10 000 generations to obtain populations at drift, migration and mutation equilibrium. For each model, we generated 10 replicates where each individual genotype was made of 100 microsatellite loci, each with 10 possible allelic states.
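As a rough illustration of the K allele mutation model described above, the following sketch mutates a single allele copy for one generation. It is not EASYPOP code: the function name `mutate_kam`, the labelling of allelic states as 1..10, and the choice to draw the new state only from the other states are assumptions made for this example.

```python
import random


def mutate_kam(allele, n_states=10, mu=1e-3, rng=random):
    """Return an allele after one generation under a K-allele-style model.

    Hypothetical helper: states are assumed to be labelled 1..n_states.
    """
    if rng.random() >= mu:
        return allele  # no mutation this generation
    # Assumption: the new state is drawn uniformly from the *other* states.
    others = [s for s in range(1, n_states + 1) if s != allele]
    return rng.choice(others)


# Mutate both allele copies of one diploid genotype at one locus.
rng = random.Random(2005)
genotype = (3, 7)
mutated = tuple(mutate_kam(a, rng=rng) for a in genotype)
```

With µ = 10⁻³ almost every call returns the allele unchanged, which is why long runs (10 000 generations in the text) are used to reach equilibrium.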

The parameters that were varied for the simulations are the number of populations, the number of individuals per population, and the migration rates. These parameters are summarized in Table 1. For the finite island model, five populations of 100 individuals each are exchanging migrants at a rate 0.01. The expected value of FST for these simulations is 0.15.
The hierarchical island model (Slatkin & Voelm 1991) consists of five sets of four populations, each made of 50 individuals (Fig. 1). Migration occurs at a rate 0.02 within archipelagos and 0.001 between archipelagos (Table 1). The expected value of FST is 0.30 between archipelagos (FArchipelago-Total), 0.16 between islands within archipelagos (FIsland-Archipelago), and 0.41 overall (FIsland-Total).
The contact zone model is characterized by two sets of five populations, which are organized in a one-dimensional stepping-stone scheme (Kimura & Weiss 1964). Migration between the two sets occurs through the two central populations at a rate 10 times lower than within each set (Table 1). The expected value of FST for this model cannot be easily analytically resolved, but global FST estimated over the 10 replicates (10 times 100 microsatellite loci) is 0.33 and pairwise FST ranges from 0.16 to 0.43. The observed value of FST is 0.17 between the two sets (FSet-Total), 0.25 between populations within sets (FPopulation-Set), and 0.38 overall (FPopulation-Total).
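The hierarchical F values quoted for the hierarchical island and contact zone models can be cross-checked against each other. The sketch below assumes the standard multiplicative relation between nested levels of Wright's F statistics, (1 − F_overall) = (1 − F_within)(1 − F_between); the function name `overall_f` is introduced here for illustration only and is not taken from the paper.

```python
def overall_f(f_within, f_between):
    """Combine two nested F statistics into the overall value.

    Assumes (1 - F_overall) = (1 - F_within) * (1 - F_between).
    """
    return 1.0 - (1.0 - f_within) * (1.0 - f_between)


# Hierarchical island model: islands within archipelagos, archipelagos in total.
hierarchical = overall_f(f_within=0.16, f_between=0.30)

# Contact zone: populations within sets, sets in total.
contact_zone = overall_f(f_within=0.25, f_between=0.17)

print(round(hierarchical, 2), round(contact_zone, 2))
```

Under that assumption the rounded results agree with the overall values of 0.41 and 0.38 reported in the text.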
EASYPOP generates codominant, microsatellite-like genotypic data. In order to simulate dominant AFLP data, the genotypes generated by EASYPOP were recoded as biallelic loci, in a manner similar to Mariette et al. (2002): a randomly chosen half of the microsatellite alleles were coded as '1' and considered dominant while the second half was coded as '2' and considered recessive. Because with dominant data, one cannot distinguish between a dominant homozygote and a heterozygote, dominant phenotypes (obtained from genotypes 1–1 and 1–2/2–1) were recoded as 1–0, where 0 indicates a missing datum. Thus, AFLP data sets bear a proportion of missing data that microsatellite sets do not. This coding of alleles is different from what is recommended in the user's manual of STRUCTURE (Pritchard & Wen 2003), which suggests that dominant markers can be dealt with by coding each phenotype (absence or presence of a band) by a single allele and a missing datum (1–0 for dominant and 2–0 for recessive). We did not use this method because it implies adding a missing value also for recessive homozygotes, which seems unnecessary.
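The recoding described above can be sketched as follows. This is an illustrative reading of the text, not the authors' script: the helper names, the labelling of microsatellite alleles as 1..10, and the choice of a fresh random half per call are assumptions.

```python
import random


def choose_dominant_alleles(n_alleles, rng):
    """Pick a random half of the allele labels 1..n_alleles as 'dominant'."""
    return set(rng.sample(range(1, n_alleles + 1), n_alleles // 2))


def recode_genotype(genotype, dominant):
    """Recode one diploid microsatellite genotype as an AFLP-like genotype.

    Any genotype carrying a dominant allele (1-1, 1-2 or 2-1 after biallelic
    recoding) becomes (1, 0), where 0 marks a missing datum. Recessive
    homozygotes stay fully observed as (2, 2).
    """
    a, b = genotype
    if a in dominant or b in dominant:
        return (1, 0)
    return (2, 2)


rng = random.Random(14)
dominant = choose_dominant_alleles(10, rng)
locus = [(3, 7), (1, 1), (9, 10)]  # toy genotypes at one locus
aflp_like = [recode_genotype(g, dominant) for g in locus]
```

Keeping recessive homozygotes as (2, 2) reflects the authors' stated choice not to add a missing value for them.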
Microsatellite data sets given to STRUCTURE were made of 10 loci as this is a number commonly found in molecular ecology studies. AFLP data sets were made of 100 loci, which seems conservative as AFLP-based studies often include hundreds of markers (Luikart et al. 2003). A further reason for this 1:10 ratio of microsatellite loci to AFLP bands comes from a recent simulation-based study (Mariette et al. 2002) showing that at least 10 times more AFLP than microsatellite loci are necessary to reach a similar accuracy in the estimation of genetic diversity.
Sampling scheme
To assess the effects of sampling strategies on the method's accuracy, analyses were also carried out on partial data sets. We investigated first the effect of the number of typed loci by sampling only five microsatellites or 50 AFLP bands (Table 2). We also looked at the effect of sampling a subset of individuals from each population (Table 2). Last, for the hierarchical island model, we also looked at the effect of sampling a subset of the populations by randomly omitting one island per archipelago (Table 2). We tested whether partial sampling affected the detection of the true K by comparing results between full and partial data sets.
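A minimal sketch of the partial sampling described above is given here. The data layout (a dict mapping population name to a list of individuals, each a list of per-locus genotypes), the function names, and the use of uniform random sampling without replacement for individuals and loci are all assumptions for illustration; the text only states explicitly that islands were omitted at random.

```python
import random


def subsample(data, n_individuals, n_loci, rng):
    """Keep n_individuals per population and the same n_loci loci everywhere."""
    total_loci = len(next(iter(data.values()))[0])
    loci = sorted(rng.sample(range(total_loci), n_loci))
    return {
        pop: [[ind[i] for i in loci] for ind in rng.sample(inds, n_individuals)]
        for pop, inds in data.items()
    }


def drop_one_island_per_archipelago(archipelagos, rng):
    """Randomly omit one island (population) from each archipelago."""
    kept = {}
    for name, islands in archipelagos.items():
        omitted = rng.choice(islands)
        kept[name] = [i for i in islands if i != omitted]
    return kept


rng = random.Random(42)
# Toy data: 2 populations x 30 individuals x 10 loci.
data = {p: [[(1, 2)] * 10 for _ in range(30)] for p in ("A", "B")}
partial = subsample(data, n_individuals=20, n_loci=5, rng=rng)
```

Sampling the same loci in every population keeps the partial data set rectangular, which a clustering input file would normally require.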
Table 1 Parameters of the three migration models

Model                      Number of populations   Number of individuals/population   Migration rate within set   Migration rate between sets
Island model               5                       100                                10⁻²                        –
Contact zone               10, 2 sets of 5 pop.    100                                10⁻²                        10⁻³
Hierarchical island model  20, 5 sets of 4 pop.    50                                 2 × 10⁻²                    10⁻³
Fig. 1 Schematic representation of the three migration models: (A) Island model. (B) Hierarchical island model. (C) Contact zone. Open arrows represent the migration rates between sets of populations and solid arrows the migration rates within sets (see also Table 1).

Structure runs
We set most parameters to their default values as advised in the user's manual of STRUCTURE 2.0 (Pritchard & Wen 2003). Specifically, we chose the admixture model and the option of correlated allele frequencies between populations, as this configuration is considered best by Falush et al. (2003) in cases of subtle population structure. Similarly, we let the degree of admixture alpha be inferred from the data. When alpha is close to zero, most individuals are essentially from one population or another, while alpha > 1 means that most individuals are admixed (Falush et al. 2003). Lambda, the parameter of the distribution of allelic frequencies, was set to one, as the manual advises. From a pilot study, we found that a length of the burn-in and MCMC (Markov chain Monte Carlo) of 10 000 each was sufficient. Longer burn-in or MCMC did not significantly change the results. As we found that different runs could produce different likelihood values (even with much longer chains, e.g. 1 000 000), for each data set 20 runs were carried out in order to quantify the amount of variation of the likelihood for each K. The range of possible Ks we tested was from 1 or 2 to the true number of populations plus 3.
Statistics used to select K
The model choice criterion implemented in STRUCTURE to detect the true K is an estimate of the posterior probability of the data for a given K, Pr(X|K) (Pritchard et al. 2000). This value, called 'Ln P(D)' in STRUCTURE output, is obtained by first computing the log likelihood of the data at each step of the MCMC. Then the average of these values is computed and half their variance is subtracted from the mean. This gives 'Ln P(D)', the model choice criterion to which we refer as L(K) afterwards. The true number of populations (K) is often identified using the maximal value of L(K) returned by STRUCTURE (Zeisset & Beebee 2001; Ciofi et al. 2002; Vernesi et al. 2003; Hampton et al. 2004). However, we observed in our simulations that in most cases, once the real K is reached, L(K) at larger Ks plateaus or continues increasing slightly (a phenomenon mentioned in the STRUCTURE manual, Pritchard & Wen 2003) and the variance between runs increases (Fig. 2A).
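The 'mean minus half the variance' recipe described above can be written out as a small function. This is only a sketch of the arithmetic as stated in the text, not code from STRUCTURE; the function name `ln_p_d` and the use of the population variance (rather than the sample variance) are assumptions.

```python
from statistics import fmean, pvariance


def ln_p_d(step_log_likelihoods):
    """Mean of the per-step log likelihoods minus half their variance.

    Assumption: the variance is the population variance of the MCMC values.
    """
    mean_ll = fmean(step_log_likelihoods)
    return mean_ll - pvariance(step_log_likelihoods, mu=mean_ll) / 2.0


# Toy chain of three log-likelihood values (real chains have thousands).
toy_chain = [-100.0, -102.0, -104.0]
l_of_k = ln_p_d(toy_chain)
```

For the toy chain the mean is −102 and the variance is 8/3, so the value is −102 − 4/3. A chain whose log likelihood fluctuates more is penalised more heavily.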
The distribution of L(K) did not show a clear mode for the true K, but we found that an ad hoc quantity based on the second order rate of change of the likelihood function with respect to K (ΔK) did show a clear peak at the true value of K. The rationale for this ΔK is to make salient the break in slope of the distribution of L(K) at the true K. It is best explained graphically, as is shown in Fig. 2. First, we plotted the mean likelihood L(K) over 20 runs for each K (Fig. 2A). Second, we plotted the mean difference between successive likelihood values of K, L′(K) = L(K) − L(K − 1) (Fig. 2B). This difference corresponds to the rate of change of the likelihood function with respect to K, and is noted L′(K). In a third step we plotted the (absolute value of the) difference between successive values of L′(K), |L′′(K)| = |L′(K + 1) − L′(K)| (Fig. 2C). This corresponds to the second order rate of change of L(K) with respect to K. Finally, we estimated ΔK as the mean of the absolute values of L′′(K) averaged over 20 runs divided by the standard deviation of L(K), ΔK = m(|L′′(K)|)/s[L(K)], which expands to ΔK = m(|L(K + 1) − 2L(K) + L(K − 1)|)/s[L(K)] (Fig. 2D). We divided m(|L′′(K)|) by s[L(K)] because we found a clear and general trend toward an increase of the variance of L(K) between runs as K increased. We found the modal value of the distribution of ΔK to be located at the real K. We used the height of this modal value as an indicator of the strength of the signal detected by STRUCTURE.
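The four steps above can be sketched in a few lines. This is an illustrative reading of the formula, not the authors' implementation: the function names, the pairing of runs across K by list position (runs at different K are independent, so this pairing is arbitrary), and the use of the sample standard deviation for s[L(K)] are assumptions.

```python
from statistics import fmean, stdev


def delta_k(runs_by_k):
    """Return {K: Delta K} for every K that has both neighbours K-1 and K+1.

    runs_by_k maps K to a list of L(K) values, one per replicate run.
    """
    out = {}
    for k in sorted(runs_by_k):
        if (k - 1) not in runs_by_k or (k + 1) not in runs_by_k:
            continue  # the second difference needs both neighbours
        prev_runs, cur_runs, next_runs = (
            runs_by_k[k - 1], runs_by_k[k], runs_by_k[k + 1])
        # |L''(K)| = |L(K+1) - 2 L(K) + L(K-1)|, computed per paired run.
        second = [abs(n - 2.0 * c + p)
                  for p, c, n in zip(prev_runs, cur_runs, next_runs)]
        sd = stdev(cur_runs)
        if sd > 0:  # skip K where all runs gave identical L(K)
            out[k] = fmean(second) / sd
    return out


def modal_k(runs_by_k):
    """K at which Delta K is largest."""
    dk = delta_k(runs_by_k)
    return max(dk, key=dk.get)


# Toy L(K) values (two runs per K): steep rise up to K = 3, then a plateau.
toy = {
    1: [-5000.0, -5002.0],
    2: [-4500.0, -4504.0],
    3: [-4000.0, -4002.0],
    4: [-3990.0, -3996.0],
    5: [-3985.0, -3995.0],
}
print(modal_k(toy))
```

In the toy data L(K) keeps rising slightly after K = 3, so its maximum is at K = 5, while ΔK peaks at the break in slope at K = 3. Note that ΔK is undefined for the smallest and largest K tested.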
Table 2 Sampling scheme used for each model. In each situation, all the combinations (full and partial) between the numbers of individuals and loci were tested. For the hierarchical island model the number of populations was also subsampled: 15 out of 20 populations (three populations per archipelago)

                           Number of populations   Number of individuals/population   Number of loci (full)   Number of loci (partial)
Model                      full      partial       full      partial                  AFLP     microsat       AFLP     microsat
Island model               5         –             100       20                       100      10             50       5
Contact zone               10        –             100       20                       100      10             50       5
Hierarchical island model  20        15            50        20                       100      10             50       5

Results
Over all simulation scenarios, we seldom found a mode of the likelihood distribution L(K) at the real K (Fig. 3). In most cases, the likelihood increased until the real K was reached, and then leveled off (often still increasing after the real K, Fig. 3). On the other hand, the distribution of ΔK almost always showed a mode at the real K (Fig. 4).
For all three models, and both in full or partial configurations, STRUCTURE identified a number of groups corresponding to the uppermost hierarchical level of genetic partitioning between populations. STRUCTURE primarily highlights the between-sets of populations level for the hierarchical island model and the contact zone, and the between populations level for the island model. Importantly, these results were obtained by using the modal value of ΔK rather than the maximum value of L(K) (Fig. 2A, D). In Fig. 4, the magnitude of ΔK is plotted for each model and sampling scheme, which allows the comparison of results obtained with different parameter sets. Overall, there was some variance among likelihood values L(K) for the different replicates of the same parameter set, but for 29 out of 32 models, all replicates had the same modal value for ΔK.
Island model
For the full data set, as well as for the partial samplings, the modal value of ΔK was at K = 5, the true number of populations (Fig. 4A, B). The only situations in which STRUCTURE failed to detect the real K were the partial samplings of 20 individuals and five microsatellite markers as well as 20 individuals and 50 AFLP markers (Fig. 4A, B). For the case with microsatellites which failed to work, we did not see any plateau nor a clear maximum in the likelihood distribution of K for any of the 10 replicates, and the software found a maximal likelihood value at K = 5 in 2 replicates, at K = 2 twice, at K = 3 four times and at K = 4 twice. For the case where the true K was not detected by AFLPs, although most replicates had a distribution of L(K) with a break in slope at K = 5 followed by a plateau, this pattern was not strong enough to be translated into a high ΔK. There is a stronger effect of the partial sampling of individuals and loci for microsatellites than for AFLP markers (Fig. 4A, B). For the complete data sets, microsatellites seem to perform better than AFLP markers (the peak is higher) whereas for partial sampling, the results are similar for both types of marker (Fig. 4A, B).
Hierarchical island model
For this model and under exhaustive sampling, the highest likelihood was observed for K = 11 for AFLP (Fig. 3C) and K = 12 for microsatellites (Fig. 3D) but the modal value of ΔK was at K = 5, which corresponds to the number of archipelagos. Using ΔK, we observed that STRUCTURE always found the modal value to be K = 5 when all populations were sampled (Fig. 4C, D). When we omitted one island in each of the archipelagos there was only one case of partial
Fig. 2 Description of the four steps for the graphical method allowing detection of the true number of groups K*. (A) Mean L(K) (± SD) over 20 runs for each K value. The model considered here is a hierarchical island model using all 100 individuals per population and 50 AFLP loci. (B) Rate of change of the likelihood distribution (mean ± SD) calculated as L′(K) = L(K) – L(K – 1). (C) Absolute values of the second order rate of change of the likelihood distribution (mean ± SD) calculated according to the formula: |L′′(K)| = |L′(K + 1) – L′(K)|. (D) ΔK calculated as ΔK = m|L′′(K)|/s[L(K)]. The modal value of this distribution is the true K(*) or the uppermost level of structure, here five clusters.

References

The neighbor-joining method: a new method for reconstructing phylogenetic trees.
TL;DR: The neighbor-joining method and Sattath and Tversky's method are shown to be generally better than the other methods for reconstructing phylogenetic trees from evolutionary distance data.

Inference of population structure using multilocus genotype data
TL;DR: Pritch et al. as discussed by the authors proposed a model-based clustering method for using multilocus genotype data to infer population structure and assign individuals to populations, which can be applied to most of the commonly used genetic markers, provided that they are not closely linked.

AFLP: a new technique for DNA fingerprinting.
TL;DR: The AFLP technique provides a novel and very powerful DNA fingerprinting technique for DNAs of any origin or complexity that allows the specific co-amplification of high numbers of restriction fragments.

Evolution in Mendelian Populations.
TL;DR: Page 108, last line of text, for "P/P″" read "P′/ P″."

Inference of Population Structure Using Multilocus Genotype Data: Linked Loci and Correlated Allele Frequencies
TL;DR: Extensions to the method of Pritchard et al. for inferring population structure from multilocus genotype data are described and methods that allow for linkage between loci are developed, which allows identification of subtle population subdivisions that were not detectable using the existing method.
Frequently Asked Questions (13)
Q1. What are the contributions in this paper?

The goal of this study is to carry out such tests, using various dispersal scenarios from data generated with an individual-based model. The authors found that in most cases the estimated 'log probability of data' does not provide a correct estimation of the number of clusters, K.

The authors find that ∆ K, an ad hoc quantity related to the second order rate of change of the log probability of data with respect to the number of clusters, is a good predictor of the real number of clusters. 

An alternative family of markers also commonly used in populations studies are the amplified fragment length polymorphism (AFLPs) (Vos et al. 1995). 

The expected value of FST is 0.30 between archipelagos ( FArchipelago-Total ), 0.16 between islands within archipelagos ( FIsland-Archipelago), and 0.41 overall ( FIsland-Total ). 

The goal of this study is to test the ability of the algorithm underlying the software STRUCTURE to detect the number of clusters in situations including more than two populations.

The authors divided m(|L′′(K )|) by s[L(K )] because the authors found a clear and general trend toward an increase of the variance of L(K) between runs as K increased. 

An alternative model-based method developed recently by Pritchard et al. (2000) and implemented in the software STRUCTURE aims at delineating clusters of individuals on the basis of their genotypes at multiple loci using a Bayesian approach.

The authors restricted their simulations to cases of moderate to strong structure at different hierarchical levels because their goal was to test the ability of the algorithm to detect the number of groups of individuals in situations when different layers of population structure exist, as is often the case in real situations. 

True number of populations ( K ) is often identified using the maximal value of L ( K ) returned by structure (Zeisset & Beebee 2001; Ciofi et al . 

few nonhuman species could be genotyped with such intensity, but this study indicates that detection of the correct number of clusters can still be found when differentiation is weaker than in their main simulations, and this was confirmed by further limited simulations with FST among archipelagos as low as 3.8% (see above). 

Subsampling of individuals or loci reduced the height of the modal value of ∆K (Fig. 4G, H), and 10 AFLPs produced a weaker signal than one microsatellite because the average magnitude of the height of the modal value of ∆K was twice lower for the former. 

The expected value of FST for this model cannot be easily analytically resolved, but global FST estimated over the 10 replicates (10 times 100 microsatellite loci) is 0.33 and pairwise FST range from 0.16 to 0.43. 

Rosenberg et al. (2002) showed empirically on a very large microsatellite data set (377 loci) encompassing 1026 individuals from the five continents that humans cluster in five groups, loosely corresponding to the five continents.