Open Access Journal Article

A fast likelihood solution to the genetic clustering problem

Abstract
The investigation of genetic clusters in natural populations is a ubiquitous problem in fields that rely on the analysis of genetic data, such as molecular ecology, conservation biology and microbiology. Typically, genetic clusters are defined as distinct panmictic populations, or as parental groups in the context of hybridisation. Two types of methods have been developed for identifying such clusters: model-based methods, which are usually computationally intensive but yield results that can be interpreted in the light of an explicit population genetic model, and geometric approaches, which are less interpretable but much faster.

Here, we introduce snapclust, a fast maximum-likelihood solution to the genetic clustering problem, which combines the advantages of model-based and geometric approaches. Our method maximises the likelihood of a fixed number of panmictic populations by combining a geometric approach with fast likelihood optimisation using the Expectation-Maximisation (EM) algorithm. It can be used for assigning genotypes to populations and, optionally, for identifying various types of hybrids between two parental populations. Several goodness-of-fit statistics can also be used to guide the choice of the number of clusters to retain.

Using extensive simulations, we show that snapclust performs comparably to current gold standards for genetic clustering and hybrid detection, with some advantages for identifying hybrids after several backcrosses, while being orders of magnitude faster than other model-based methods. We also illustrate how snapclust can be used to identify the optimal number of clusters and subsequently to assign individuals to various hybrid classes simulated from an empirical microsatellite dataset.

snapclust is implemented in the package adegenet for the free software R, and is therefore easily integrated into existing pipelines for genetic data analysis. It can be applied to any kind of co-dominant markers, and can easily be extended to more complex models including, for instance, varying ploidy levels. Given its flexibility and computational efficiency, it provides a useful complement to the existing toolbox for the study of genetic diversity in natural populations.
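
To make the idea of likelihood-based clustering with EM more concrete, here is a minimal Python sketch. It is not the adegenet implementation. It assumes diploid individuals, biallelic co-dominant loci coded as allele counts (0, 1 or 2), Hardy-Weinberg proportions within each cluster, soft cluster memberships, and a small pseudo-count to keep frequencies away from 0 and 1. The function names `em_genetic_clusters` and `bic`, the `init_groups` argument (a stand-in for the geometric starting partition), and the parameter count used in the BIC are all illustrative assumptions.

```python
import math


def em_genetic_clusters(genotypes, init_groups, k, n_iter=100, tol=1e-8, pseudo=0.5):
    """Soft EM for k panmictic clusters (illustrative sketch, not adegenet's code).

    genotypes   : list of individuals, each a list of allele counts (0, 1 or 2) per locus
    init_groups : initial hard label per individual (assumed to come from a geometric method)
    pseudo      : pseudo-count for numerical stability (an assumption of this sketch)
    """
    n, n_loci = len(genotypes), len(genotypes[0])
    # Start from hard memberships given by the initial partition.
    resp = [[1.0 if init_groups[i] == c else 0.0 for c in range(k)] for i in range(n)]
    loglik = -math.inf
    for _ in range(n_iter):
        # M-step: update mixing proportions and per-cluster allele frequencies.
        size = [sum(resp[i][c] for i in range(n)) for c in range(k)]
        mix = [(size[c] + pseudo) / (n + k * pseudo) for c in range(k)]
        freqs = [[(sum(resp[i][c] * genotypes[i][l] for i in range(n)) + pseudo)
                  / (2.0 * size[c] + 2.0 * pseudo)
                  for l in range(n_loci)] for c in range(k)]
        # E-step: membership probabilities under Hardy-Weinberg proportions.
        new_loglik = 0.0
        for i in range(n):
            logp = []
            for c in range(k):
                lp = math.log(mix[c])
                for l in range(n_loci):
                    g, p = genotypes[i][l], freqs[c][l]
                    lp += (math.log(math.comb(2, g))
                           + g * math.log(p) + (2 - g) * math.log(1.0 - p))
                logp.append(lp)
            top = max(logp)  # log-sum-exp for a stable normalisation
            log_norm = top + math.log(sum(math.exp(v - top) for v in logp))
            resp[i] = [math.exp(v - log_norm) for v in logp]
            new_loglik += log_norm
        converged = abs(new_loglik - loglik) < tol
        loglik = new_loglik
        if converged:
            break
    # Hard assignment: each individual goes to its most probable cluster.
    labels = [max(range(k), key=lambda c: r[c]) for r in resp]
    return labels, resp, freqs, loglik


def bic(loglik, n_individuals, n_loci, k):
    """BIC with k * n_loci allele frequencies plus k - 1 mixing proportions (assumed count)."""
    n_params = k * n_loci + (k - 1)
    return -2.0 * loglik + n_params * math.log(n_individuals)


# Toy data: two strongly differentiated groups, with a noisy starting partition.
toy = [[2, 2, 2, 0, 0]] * 3 + [[0, 0, 0, 2, 2]] * 3
labels, resp, freqs, ll = em_genetic_clusters(toy, [0, 0, 1, 1, 1, 0], k=2)
```

Running the same function for several values of `k` and comparing `bic(...)` is one way a goodness-of-fit statistic can guide the choice of the number of clusters. The BIC is used here only as an example of such a statistic.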

