Journal Article

Estimating F-statistics for the analysis of population structure.

01 Nov 1984-Evolution (Wiley)-Vol. 38, Iss: 6, pp 1358-1370
TL;DR: The purpose of this discussion is to offer some unity to various estimation formulae and to point out that correlations of genes in structured populations, with which F-statistics are concerned, are expressed very conveniently with a set of parameters treated by Cockerham (1969, 1973).
Abstract: This journal frequently contains papers that report values of F-statistics estimated from genetic data collected from several populations. These parameters, FST, FIT, and FIS, were introduced by Wright (1951), and offer a convenient means of summarizing population structure. While there is some disagreement about the interpretation of the quantities, there is considerably more disagreement on the method of evaluating them. Different authors make different assumptions about sample sizes or numbers of populations and handle the difficulties of multiple alleles and unequal sample sizes in different ways. Wright himself, for example, did not consider the effects of finite sample size. The purpose of this discussion is to offer some unity to various estimation formulae and to point out that correlations of genes in structured populations, with which F-statistics are concerned, are expressed very conveniently with a set of parameters treated by Cockerham (1969, 1973). We start with the parameters and construct appropriate estimators for them, rather than beginning the discussion with various data functions. The extension of Cockerham's work to multiple alleles and loci will be made explicit, and the use of jackknife procedures for estimating variances will be advocated. All of this may be regarded as an extension of a recent treatment of estimating the coancestry coefficient to serve as a mea-
Citations
Journal Article
TL;DR: Arlequin ver 3.0 is a software package integrating several basic and advanced methods for population genetics data analysis, like the computation of standard genetic diversity indices, the estimation of allele and haplotype frequencies, tests of departure from linkage equilibrium, departure from selective neutrality and demographic equilibrium, estimation of parameters from past population expansions, and thorough analyses of population subdivision under the AMOVA framework.
Abstract: Arlequin ver 3.0 is a software package integrating several basic and advanced methods for population genetics data analysis, like the computation of standard genetic diversity indices, the estimation of allele and haplotype frequencies, tests of departure from linkage equilibrium, departure from selective neutrality and demographic equilibrium, estimation of parameters from past population expansions, and thorough analyses of population subdivision under the AMOVA framework. Arlequin 3 introduces a completely new graphical interface written in C++, a more robust semantic analysis of input files, and two new methods: a Bayesian estimation of gametic phase from multi-locus genotypes, and an estimation of the parameters of an instantaneous spatial expansion from DNA sequence polymorphism. Arlequin can handle several data types like DNA sequences, microsatellite data, or standard multi-locus genotypes. A Windows version of the software is freely available on http://cmpg.unibe.ch/software/arlequin3.

14,271 citations

Journal Article
01 Jun 1992-Genetics
TL;DR: In this article, a framework for the study of molecular variation within a single species is presented, where information on DNA haplotype divergence is incorporated into an analysis of variance format, derived from a matrix of squared-distances among all pairs of haplotypes.
Abstract: We present here a framework for the study of molecular variation within a single species. Information on DNA haplotype divergence is incorporated into an analysis of variance format, derived from a matrix of squared-distances among all pairs of haplotypes. This analysis of molecular variance (AMOVA) produces estimates of variance components and F-statistic analogs, designated here as phi-statistics, reflecting the correlation of haplotypic diversity at different levels of hierarchical subdivision. The method is flexible enough to accommodate several alternative input matrices, corresponding to different types of molecular data, as well as different types of evolutionary assumptions, without modifying the basic structure of the analysis. The significance of the variance components and phi-statistics is tested using a permutational approach, eliminating the normality assumption that is conventional for analysis of variance but inappropriate for molecular data. Application of AMOVA to human mitochondrial DNA haplotype data shows that population subdivisions are better resolved when some measure of molecular differences among haplotypes is introduced into the analysis. At the intraspecific level, however, the additional information provided by knowing the exact phylogenetic relations among haplotypes or by a nonlinear translation of restriction-site change into nucleotide diversity does not significantly modify the inferred population genetic structure. Monte Carlo studies show that site sampling does not fundamentally affect the significance of the molecular variance components. The AMOVA treatment is easily extended in several different directions and it constitutes a coherent and flexible framework for the statistical analysis of molecular data.

12,835 citations
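
The AMOVA abstract above describes deriving variance components and phi-statistics from a matrix of squared distances among haplotypes, then testing them by permutation. A minimal sketch of the simplest case (one level of subdivision, haploid data) might look like the following. The function names `phi_st` and `permutation_p_value`, the nested-list matrix layout, and the default of 999 permutations are illustrative assumptions, not taken from the paper or from any software package.

```python
import random
from itertools import combinations


def phi_st(d2, labels):
    """One-level AMOVA sketch: Phi_ST from a symmetric squared-distance matrix.

    d2 is a nested list (hypothetical layout); labels[i] is the population
    of haplotype i. Assumes at least two populations and more haplotypes
    than populations.
    """
    n_total = len(labels)
    groups = {}
    for idx, lab in enumerate(labels):
        groups.setdefault(lab, []).append(idx)
    n_groups = len(groups)

    # Sums of squared deviations expressed through pairwise squared distances.
    ss_total = sum(d2[i][j] for i, j in combinations(range(n_total), 2)) / n_total
    ss_within = sum(
        sum(d2[i][j] for i, j in combinations(members, 2)) / len(members)
        for members in groups.values()
    )
    ss_among = ss_total - ss_within

    # Variance components from the mean squares (method of moments).
    var_within = ss_within / (n_total - n_groups)
    n0 = (n_total - sum(len(m) ** 2 for m in groups.values()) / n_total) / (n_groups - 1)
    var_among = (ss_among / (n_groups - 1) - var_within) / n0

    total = var_among + var_within
    return var_among / total if total > 0 else 0.0


def permutation_p_value(d2, labels, n_perm=999, seed=0):
    """Shuffle population labels to build a null distribution for Phi_ST."""
    rng = random.Random(seed)
    observed = phi_st(d2, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if phi_st(d2, shuffled) >= observed:
            hits += 1
    # Count the observed arrangement itself as one permutation.
    return (hits + 1) / (n_perm + 1)
```

Because the null distribution comes from reshuffled labels rather than from an F distribution, no normality assumption is needed, which is the point the abstract makes about the permutational approach.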

Journal Article
TL;DR: This note summarizes developments of the genepop software since its first description in 1995, and in particular those new to version 4.0: an extended input format, several estimators of neighbourhood size under isolation by distance, new estimators and confidence intervals for null allele frequency, and less important extensions to previous options.
Abstract: This note summarizes developments of the genepop software since its first description in 1995, and in particular those new to version 4.0: an extended input format, several estimators of neighbourhood size under isolation by distance, new estimators and confidence intervals for null allele frequency, and less important extensions to previous options. genepop now runs under Linux as well as under Windows, and can be entirely controlled by batch calls.

8,171 citations


Cites background or methods from "Estimating F-statistics for the ana..."

  • ...As further detailed in the genepop documentation, while the single locus estimators are identical, these multilocus estimators differ from the ones described in Weir & Cockerham (1984) and Weir (1996)....

    [...]

  • ...…of Weir (1996) give the same weight to estimates of the Q’s for a locus typed at five individuals in each subpopulation as for a locus typed at 50 individuals in each subpopulation, while the estimators of Weir & Cockerham (1984) give less weight to the Q estimates from loci with larger samples....

    [...]

Journal Article
18 Oct 2007-Nature
TL;DR: The Phase II HapMap is described, which characterizes over 3.1 million human single nucleotide polymorphisms genotyped in 270 individuals from four geographically diverse populations and includes 25–35% of common SNP variation in the populations surveyed, and increased differentiation at non-synonymous, compared to synonymous, SNPs is demonstrated.
Abstract: We describe the Phase II HapMap, which characterizes over 3.1 million human single nucleotide polymorphisms (SNPs) genotyped in 270 individuals from four geographically diverse populations and includes 25-35% of common SNP variation in the populations surveyed. The map is estimated to capture untyped common variation with an average maximum r² of between 0.9 and 0.96 depending on population. We demonstrate that the current generation of commercial genome-wide genotyping products captures common Phase II SNPs with an average maximum r² of up to 0.8 in African and up to 0.95 in non-African populations, and that potential gains in power in association studies can be obtained through imputation. These data also reveal novel aspects of the structure of linkage disequilibrium. We show that 10-30% of pairs of individuals within a population share at least one region of extended genetic identity arising from recent ancestry and that up to 1% of all common variants are untaggable, primarily because they lie within recombination hotspots. We show that recombination rates vary systematically around genes and between genes of different function. Finally, we demonstrate increased differentiation at non-synonymous, compared to synonymous, SNPs, resulting from systematic differences in the strength or efficacy of natural selection between populations.

4,565 citations

References
Journal Article
TL;DR: The purpose of the present paper is to relate the two in a way which the author thinks is meaningful and easy to grasp and to incorporate the role that the inbreeding and coancestry of individuals play in this variance.
Abstract: Inbreeding, gene frequency variance, and their corresponding effective population numbers are now commonplace terms in population genetics. The concepts and much of the theory are classical (Wright, 1921, 1931; Fisher, 1930). More recent refinements and extensions of the theory by Crow and associates (Crow, 1954; Crow and Morton, 1955; Kimura and Crow, 1963a, b) have been primarily concerned with distinguishing between the inbreeding effect on heterozygosity and on the variance of gene frequencies which are so intimately connected in finite populations. The purpose of the present paper is to relate the two in a way which the author thinks is meaningful and easy to grasp. Further, correlational measures are made compatible with probability measures of identity by descent and a simple basis is provided for the analysis of the variance of gene frequencies in experimental or natural populations. The procedure is to work with the variance of a linear function and to incorporate the role that the inbreeding and coancestry of individuals play in this variance. First, let us develop this role. We let aij index the jth allele in the ith individual and introduce a measure of frequency xij defined by

564 citations


"Estimating F-statistics for the ana..." refers background or methods in this paper

  • ...Cockerham (1969, 1973) showed that, for all intents and purposes, these parameters are related to Wright's F-statistics as ... E(a) = p(1 - p)θ,...

    [...]

  • ...The purpose of this discussion is to offer some unity to various estimation formulae and to point out that correlations of genes in structured populations, with which F-statistics are concerned, are expressed very conveniently with a set of parameters treated by Cockerham (1969, 1973)....

    [...]

  • ...Following the approach of Cockerham (1969, 1973), we perform analyses of variance of the frequencies for the allele A under consideration....

    [...]

Journal Article
01 Aug 1973-Genetics
TL;DR: Estimable functions are elaborated demonstrating that intraclass correlations can be estimated only relative to that for the least related genes in the informational system, and small sample estimators are formulated for all of the parameters by three different methods.
Abstract: Models of variance components and their intraclass correlational equivalences are developed for genes falling into various categories of subdivisions within a population. Estimable functions are elaborated demonstrating that intraclass correlations can be estimated only relative to that for the least related genes in the informational system. The effects of different types of subdivisions—and of ignoring them—on the parameters are demonstrated. Small sample estimators are formulated for all of the parameters by three different methods, including both a weighted and an unweighted method of analysis of the variation among subpopulations. How estimators change with assumptions about the parameters is illustrated. Various tests of hypotheses are outlined in χ2 and F-test terminology. Discussed are factors which may affect the correlations and the manner in which their effects are manifest, hopefully in clarification of some of the misconceptions that have arisen in this connection.

358 citations


"Estimating F-statistics for the ana..." refers background or methods in this paper

  • ...While there are other statistics with the same expectations as a, b, c, these three quantities were obtained from a weighted analysis of variance (Cockerham, 1973)....

    [...]

  • ...It is often the case that we recognize a demic structure within populations, so that we can replace the parameter θ by θ1 for pairs of alleles between individuals within demes, and θ2 for pairs of alleles between demes within populations (Cockerham, 1973)....

    [...]

  • ...Cockerham (1969, 1973) showed that, for all intents and purposes, these parameters are related to Wright's F-statistics as ... E(a) = p(1 - p)θ,...

    [...]

  • ...Following the approach of Cockerham (1969, 1973), we perform analyses of variance of the frequencies for the allele A under consideration....

    [...]

  • ...The purpose of this discussion is to offer some unity to various estimation formulae and to point out that correlations of genes in structured populations, with which F-statistics are concerned, are expressed very conveniently with a set of parameters treated by Cockerham (1969, 1973)....

    [...]

01 Jan 1984
TL;DR: An analysis is made of the distribution of deviations from Hardy-Weinberg proportions with k alleles and of estimates of inbreeding coefficients obtained from these deviations. If f is small, the best estimate of f in large samples is shown to be 2Σ_i(T_ii/N_i)/(k - 1), which is probably close to the best for all f values.
Abstract: An analysis is made of the distribution of deviations from Hardy-Weinberg proportions with k alleles and of estimates of inbreeding coefficients (f) obtained from these deviations. If f is small, the best estimate of f in large samples is shown to be 2Σ_i(T_ii/N_i)/(k - 1), where T_ii is an unbiased measure of the excess of the ith homozygote and N_i the number of the ith allele in the sample [frequency = N_i/(2N)]. No extra information is obtained from the T_ij, where these are departures of numbers of heterozygotes from expectation. Alternatively, the best estimator can be computed from the T_ij, ignoring the T_ii. Also (1) the variance of the estimate of f equals 1/(N(k - 1)) when all individuals in the sample are unrelated, and the test for f = 0 with 1 d.f. is given by the ratio of the estimate to its standard error; (2) the variance is reduced if some alleles are rare; and (3) if the sample consists of full-sib families of size n, the variance is increased by a proportion (n - 1)/4 but is not increased by a half-sib relationship. If f is not small, the structure of the population is of critical importance. (1) If the inbreeding is due to a proportion of inbred matings in an otherwise random-breeding population, f as determined from homozygote excess is the same for all genes and expressions are given for its sampling variance. (2) If the homozygote excess is due to population admixture, f is not the same for all genes. The above estimator is probably close to the best for all f values.

239 citations

Journal Article
01 Aug 1984-Genetics
TL;DR: In this article, the distribution of deviations from Hardy-Weinberg proportions with k alleles and estimates of inbreeding coefficients (f) obtained from these deviations were analyzed, and the best estimate of f in large samples was shown to be 2Σ_i(T_ii/N_i)/(k - 1), where T_ii is an unbiased measure of the excess of the ith homozygote and N_i the number of the ith allele in the sample [frequency = N_i/(2N)].
Abstract: An analysis is made of the distribution of deviations from Hardy-Weinberg proportions with k alleles and of estimates of inbreeding coefficients (f) obtained from these deviations. If f is small, the best estimate of f in large samples is shown to be 2Σ_i(T_ii/N_i)/(k - 1), where T_ii is an unbiased measure of the excess of the ith homozygote and N_i the number of the ith allele in the sample [frequency = N_i/(2N)]. No extra information is obtained from the T_ij, where these are departures of numbers of heterozygotes from expectation. Alternatively, the best estimator can be computed from the T_ij, ignoring the T_ii. Also (1) the variance of the estimate of f equals 1/(N(k - 1)) when all individuals in the sample are unrelated, and the test for f = 0 with 1 d.f. is given by the ratio of the estimate to its standard error; (2) the variance is reduced if some alleles are rare; and (3) if the sample consists of full-sib families of size n, the variance is increased by a proportion (n - 1)/4 but is not increased by a half-sib relationship. If f is not small, the structure of the population is of critical importance. (1) If the inbreeding is due to a proportion of inbred matings in an otherwise random-breeding population, f as determined from homozygote excess is the same for all genes and expressions are given for its sampling variance. (2) If the homozygote excess is due to population admixture, f is not the same for all genes. The above estimator is probably close to the best for all f values.

232 citations
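
The abstract above states that, for small f and unrelated individuals, the variance of the estimate of f is 1/(N(k - 1)), that full-sib families of size n inflate it by a proportion (n - 1)/4, and that the test for f = 0 is the ratio of the estimate to its standard error. A short sketch of that arithmetic follows. The function names and the `sibship_size` parameter are hypothetical, and the estimate `f_hat` is taken as given because the abstract does not define T_ii fully enough to compute it here.

```python
import math


def f_variance(n_individuals, n_alleles, sibship_size=1):
    """Large-sample variance of f-hat near f = 0, as quoted in the abstract.

    Base variance is 1 / (N (k - 1)); for full-sib families of size n it is
    increased by a proportion (n - 1) / 4. sibship_size=1 means unrelated.
    """
    base = 1.0 / (n_individuals * (n_alleles - 1))
    return base * (1.0 + (sibship_size - 1) / 4.0)


def f_test_ratio(f_hat, n_individuals, n_alleles, sibship_size=1):
    """Ratio of the estimate to its standard error (the 1 d.f. test of f = 0)."""
    return f_hat / math.sqrt(f_variance(n_individuals, n_alleles, sibship_size))
```

For example, with N = 100 unrelated individuals and k = 2 alleles the standard error is 0.1, so an estimate of 0.1 gives a ratio of 1. Squaring the ratio and referring it to a chi-square distribution with 1 d.f. is one conventional way to read the result, though the abstract itself only specifies the ratio.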

Journal Article
TL;DR: The objectives of the present study were to determine the levels of genic diversity characteristic of pitch pine, and to examine the organization of genic variability within the species and the patterning of genic differentiation between populations.
Abstract: Electrophoretic studies of protein polymorphisms in plants have focused upon herbaceous species, primarily inbreeding annuals, in efforts to characterize the levels and patterns of genic variation within and between populations (Clegg and Allard, 1972; Gottlieb, 1973, 1975; Levin, 1975, 1978; Levy and Levin, 1975; Schaal, 1975; Roose and Gottlieb, 1976; Brown et al., 1978; and others). These studies have indicated that predominantly outbreeding species maintain higher levels of intrapopulation variation than predominantly inbreeding species, while inbreeders exhibit a greater degree of population differentiation than outbreeders (Brown, 1979; Hamrick et al., 1979). This relationship is by no means perfect as Levin (1978) points out, because of differences in ecological requirements, breeding systems, dispersal mechanisms, evolutionary history, and other factors which affect the genetic system (Grant, 1958, 1971; Brown, 1979; Hamrick et al., 1979). Whether long-lived perennials such as forest trees conform to the general pattern is still an open question. Allozyme studies of forest tree species have suggested that levels of genic variation are exceptionally high in natural populations (Tigerstedt, 1973; Rudin et al., 1974; Lundkvist and Rudin, 1977; Yang et al., 1977; Hamrick, 1979; Hamrick et al., 1979; Lundkvist, 1979), that certain populations appear to be moderately inbred (Rudin et al., 1974; Mejnartowicz and Bergmann, 1975; Phillips and Brown, 1977), and that populations have become differentiated over relatively short distances (Sakai and Park, 1971; Mitton et al., 1977). However, many inferences have been drawn from only one or a few loci, or only from loci known to be highly polymorphic. Valid estimates of mating system parameters may be obtained by examining only a few loci, but for estimates of heterozygosity, genic diversity, and the extent of differentiation, a large number of loci is preferred (Lewontin, 1974; Nei, 1975).
As part of a continuing study of the genetics and ecology of pitch pine (Pinus rigida Mill.), we have surveyed 21 enzymatic loci in 11 populations across the species range. Pitch pine occurs from coastal Maine and southern Quebec to northern Georgia and from the Atlantic Coast to central Ohio, but almost always on relatively infertile sites. In spite of a history of overexploitation, it appears to have retained appreciable variation (Ledig and Fryer, 1974). Pitch pine demonstrates clinal patterns of variation in cone serotiny (Ledig and Fryer, 1972), wood properties (Ledig et al., 1975), and seedling growth (Ledig et al., 1976), as well as seed and needle characters (Ledig, unpubl.). It is uncertain whether these clinal patterns are the result of gene flow among pockets of differential fitness or reflect a continuous gradient in selection pressures (Endler, 1977; Givnish, 1981). The objectives of the present study were to determine the levels of genic diversity characteristic of pitch pine, and to examine the organization of genic variability within the species and the patterning of genic differentiation between populations. Among other comparisons, we contrasted marginal vs. central populations and the

187 citations


"Estimating F-statistics for the ana..." refers methods in this paper

  • ...Papers using the weighted average of fixation indices are by Avise and Felley (1979), Baker et al. (1982), Guries and Ledig (1982), and Ryman et al. (1980)....

    [...]