A fast likelihood solution to the genetic clustering problem

doi:10.1111/2041-210X.12968


Marie-Pauline Beugin, +4 more

01 Apr 2018

Methods in Ecology and Evolution

Vol. 9, Iss. 4, pp. 1006-1016


TLDR

snapclust is a fast maximum-likelihood solution to the genetic clustering problem that combines the advantages of model-based and geometric approaches, using the Expectation-Maximisation (EM) algorithm.

Abstract:

The investigation of genetic clusters in natural populations is a ubiquitous problem in a range of fields relying on the analysis of genetic data, such as molecular ecology, conservation biology and microbiology. Typically, genetic clusters are defined as distinct panmictic populations, or parental groups in the context of hybridisation. Two types of methods have been developed for identifying such clusters: model-based methods, which are usually computationally intensive but yield results that can be interpreted in the light of an explicit population genetic model, and geometric approaches, which are less interpretable but remarkably faster.

Here, we introduce snapclust, a fast maximum-likelihood solution to the genetic clustering problem, which combines the advantages of both model-based and geometric approaches. Our method relies on maximising the likelihood of a fixed number of panmictic populations, combining a geometric approach with fast likelihood optimisation via the Expectation-Maximisation (EM) algorithm. It can be used to assign genotypes to populations and, optionally, to identify various types of hybrids between two parental populations. Several goodness-of-fit statistics can also be used to guide the choice of the retained number of clusters.

Using extensive simulations, we show that snapclust performs comparably to current gold standards for genetic clustering as well as hybrid detection, with some advantages for identifying hybrids after several backcrosses, while being orders of magnitude faster than other model-based methods. We also illustrate how snapclust can be used to identify the optimal number of clusters and subsequently assign individuals to various hybrid classes simulated from an empirical microsatellite dataset.

snapclust is implemented in the package adegenet for the free software R, and is therefore easily integrated into existing pipelines for genetic data analysis. It can be applied to any kind of co-dominant markers, and can easily be extended to more complex models including, for instance, varying ploidy levels. Given its flexibility and computational efficiency, it provides a useful complement to the existing toolbox for the study of genetic diversity in natural populations.
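The general idea of maximising the likelihood of a fixed number of panmictic populations with EM can be illustrated with a minimal Python sketch. This is not the adegenet implementation: it assumes diploid individuals, biallelic loci coded as 0, 1 or 2 copies of a reference allele, Hardy-Weinberg proportions within each cluster, equal mixing proportions, and a caller-supplied starting partition standing in for the geometric initialisation. All function names (`genotype_loglik`, `m_step`, `e_step`, `em_cluster`, `aic`) are hypothetical.

```python
import math


def genotype_loglik(g, p):
    """Log-probability of genotype g (0, 1 or 2 reference-allele copies)
    under Hardy-Weinberg proportions with allele frequency p."""
    coef = 2 if g == 1 else 1  # binomial coefficient C(2, g)
    return math.log(coef) + g * math.log(p) + (2 - g) * math.log(1 - p)


def m_step(genotypes, resp, k, eps=1e-3):
    """Update allele frequencies as membership-weighted allele counts."""
    n_loci = len(genotypes[0])
    freqs = []
    for c in range(k):
        weight = sum(r[c] for r in resp)
        row = []
        for locus in range(n_loci):
            copies = sum(r[c] * ind[locus] for r, ind in zip(resp, genotypes))
            p = copies / (2 * weight) if weight > 0 else 0.5
            row.append(min(max(p, eps), 1 - eps))  # clamp so logs stay finite
        freqs.append(row)
    return freqs


def e_step(genotypes, freqs):
    """Return membership probabilities and the mixture log-likelihood
    (equal mixing proportions are an assumption of this sketch)."""
    resp, total = [], 0.0
    for ind in genotypes:
        logs = [sum(genotype_loglik(g, p) for g, p in zip(ind, row))
                for row in freqs]
        top = max(logs)  # subtract the maximum for numerical stability
        weights = [math.exp(v - top) for v in logs]
        norm = sum(weights)
        resp.append([w / norm for w in weights])
        total += top + math.log(norm) - math.log(len(freqs))
    return resp, total


def em_cluster(genotypes, init_groups, k, max_iter=100, tol=1e-8):
    """Alternate M and E steps from an initial hard partition."""
    resp = [[1.0 if c == g else 0.0 for c in range(k)] for g in init_groups]
    previous = -math.inf
    loglik = previous
    for _ in range(max_iter):
        freqs = m_step(genotypes, resp, k)
        resp, loglik = e_step(genotypes, freqs)
        if abs(loglik - previous) < tol:
            break
        previous = loglik
    groups = [max(range(k), key=lambda c: r[c]) for r in resp]
    return groups, loglik


def aic(loglik, k, n_loci):
    """AIC with one free allele frequency per cluster and biallelic locus."""
    return 2 * k * n_loci - 2 * loglik
```

Calling `em_cluster(genotypes, init_groups, k)` for several values of `k` and comparing `aic(loglik, k, n_loci)` mirrors, in a simplified form, the use of goodness-of-fit statistics to guide the choice of the number of clusters. Hybrid classes, multi-allelic markers and varying ploidy are outside the scope of this sketch.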


Citations

RhierBAPS: An R implementation of the population clustering algorithm hierBAPS

Fast hierarchical Bayesian analysis of population structure.

Genomic biosurveillance of forest invasive alien enemies: A story written in code.

A parsimony estimator of the number of populations from a STRUCTURE-like analysis.

References

R: A language and environment for statistical computing.

Maximum likelihood from incomplete data via the EM algorithm

Estimating the dimension of a model

Inference of population structure using multilocus genotype data

Related Papers (5)

Inference of population structure using multilocus genotype data

adegenet: a R package for the multivariate analysis of genetic markers

Inference of Population Structure Using Multilocus Genotype Data: Linked Loci and Correlated Allele Frequencies

Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study.

Estimating F-statistics for the analysis of population structure.