Journal Article

Stacks: an analysis tool set for population genomics

01 Jun 2013, Molecular Ecology (Mol Ecol), Vol. 22, Iss. 11, pp. 3124-3140
TL;DR: The expanded population genomics functions in Stacks will make it a useful tool to harness the newest generation of massively parallel genotyping data for ecological and evolutionary genetics.
Abstract: Massively parallel short-read sequencing technologies, coupled with powerful software platforms, are enabling investigators to analyse tens of thousands of genetic markers. This wealth of data is rapidly expanding and allowing biological questions to be addressed with unprecedented scope and precision. The sizes of the data sets are now posing significant data processing and analysis challenges. Here we describe an extension of the Stacks software package to efficiently use genotype-by-sequencing data for studies of populations of organisms. Stacks now produces core population genomic summary statistics and SNP-by-SNP statistical tests. These statistics can be analysed across a reference genome using a smoothed sliding window. Stacks also now provides several output formats for several commonly used downstream analysis packages. The expanded population genomics functions in Stacks will make it a useful tool to harness the newest generation of massively parallel genotyping data for ecological and evolutionary genetics.
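
The "smoothed sliding window" mentioned in the abstract can be illustrated with a small sketch. It assumes a Gaussian kernel centred on each SNP in turn and cut off at three standard deviations; the function name `kernel_smooth`, the cutoff, and the default `sigma` are illustrative choices, not details taken from the abstract or from the Stacks implementation.

```python
import math


def kernel_smooth(positions, values, sigma=150_000.0):
    """Gaussian-weighted average of per-SNP values, centred on each SNP in turn.

    positions: genomic coordinates (bp) of the SNPs on one chromosome.
    values:    one statistic per SNP (for example a per-site diversity value).
    sigma:     kernel standard deviation in bp (placeholder default, an assumption).
    """
    limit = 3 * sigma  # SNPs farther than 3 sigma from the centre are ignored
    smoothed = []
    for centre in positions:
        weighted_sum = 0.0
        weight_total = 0.0
        for pos, val in zip(positions, values):
            dist = abs(pos - centre)
            if dist > limit:
                continue
            weight = math.exp(-(dist ** 2) / (2 * sigma ** 2))
            weighted_sum += weight * val
            weight_total += weight
        # weight_total >= 1 because the centre SNP always contributes weight 1
        smoothed.append(weighted_sum / weight_total)
    return smoothed
```

Isolated SNPs keep their raw value, while SNPs with close neighbours are pulled towards the local average, which is what makes regional trends along a chromosome easier to see than in the raw per-SNP values.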


Citations
Journal Article
TL;DR: This Review provides a comprehensive discussion of RADseq methods to aid researchers in choosing among the many different approaches and avoiding erroneous scientific conclusions from RADseq data, a problem that has plagued other genetic marker types in the past.
Abstract: High-throughput techniques based on restriction site-associated DNA sequencing (RADseq) are enabling the low-cost discovery and genotyping of thousands of genetic markers for any species, including non-model organisms, which is revolutionizing ecological, evolutionary and conservation genetics. Technical differences among these methods lead to important considerations for all steps of genomics studies, from the specific scientific questions that can be addressed, and the costs of library preparation and sequencing, to the types of bias and error inherent in the resulting data. In this Review, we provide a comprehensive discussion of RADseq methods to aid researchers in choosing among the many different approaches and avoiding erroneous scientific conclusions from RADseq data, a problem that has plagued other genetic marker types in the past.

1,102 citations

Journal Article
TL;DR: It is shown through reanalysis of an empirical RADseq dataset that indels are a common feature of such data, even at shallow phylogenetic scales, and PyRAD uses parallel processing as well as an optional hierarchical clustering method, which allows it to rapidly assemble phylogenetic datasets with hundreds of sampled individuals.
Abstract: Motivation: Restriction-site associated genomic markers are a powerful tool for investigating evolutionary questions at the population level, but are limited in their utility at deeper phylogenetic scales where fewer orthologous loci are typically recovered across disparate taxa. While this limitation stems in part from mutations to restriction recognition sites that disrupt data generation, an additional source of data loss comes from the failure to identify homology during bioinformatic analyses. Clustering methods that allow for lower similarity thresholds and the inclusion of indel variation will perform better at assembling RADseq loci at the phylogenetic scale. Results: PyRAD is a pipeline to assemble de novo RADseq loci with the aim of optimizing coverage across phylogenetic data sets. It utilizes a wrapper around an alignment-clustering algorithm which allows for indel variation within and between samples, as well as for incomplete overlap among reads (e.g., paired-end). Here I compare PyRAD with the program Stacks in their performance analyzing a simulated RADseq data set that includes indel variation. Indels disrupt clustering of homologous loci in Stacks but not in PyRAD, such that the latter recovers more shared loci across disparate taxa. I show through re-analysis of an empirical RADseq data set that indels are a common feature of such data, even at shallow phylogenetic scales. PyRAD utilizes parallel processing as well as an optional hierarchical clustering method which allow it to rapidly assemble phylogenetic data sets with hundreds of sampled individuals. Availability: Software is written in Python and freely available at http://www.dereneaton.com/software/ Supplement: Scripts to completely reproduce all simulated and empirical analyses are available in the Supplementary Materials.

710 citations


Cites methods from "Stacks: an analysis tool set for po..."

  • ...Availability: Software is written in Python and freely available at http://www.dereneaton.com/software/ Supplement: Scripts to completely reproduce all simulated and empirical analyses are available in the Supplementary Materials....

    [...]

Journal Article
TL;DR: This Review demonstrates the breadth of questions that are being addressed by Pool-seq but also discusses its limitations and provides guidelines for users.
Abstract: The analysis of polymorphism data is becoming increasingly important as a complementary tool to classical genetic analyses. Nevertheless, despite plunging sequencing costs, genomic sequencing of individuals at the population scale is still restricted to a few model species. Whole-genome sequencing of pools of individuals (Pool-seq) provides a cost-effective alternative to sequencing individuals separately. With the availability of custom-tailored software tools, Pool-seq is being increasingly used for population genomic research on both model and non-model organisms. In this Review, we not only demonstrate the breadth of questions that are being addressed by Pool-seq but also discuss its limitations and provide guidelines for users.

642 citations

Journal Article
TL;DR: This special issue on ‘Genotyping-by-Sequencing in Ecological and Conservation Genomics’ represents a diverse set of empirical and theoretical studies that demonstrate both the utility and some of the challenges of GBS in ecological and conservation genomics.
Abstract: The fields of ecological and conservation genetics have developed greatly in recent decades through the use of molecular markers to investigate organisms in their natural habitat and to evaluate the effect of anthropogenic disturbances. However, many of these studies have been limited to narrow regions of the genome, allowing for limited inferences but making it difficult to generalize about the organisms and their evolutionary history. Tremendous advances in sequencing technology over the last decade (i.e. next-generation sequencing; NGS) have led to the ability to sample the genome much more densely and to observe the patterns of genetic variation that result from the full range of evolutionary processes acting across the genome (Allendorf et al. 2010; Stapley et al. 2010; Li et al. 2012). These studies are transforming molecular ecology by making many long-standing questions much more easily accessible in almost any organism. When studying the genetics of wild populations, it is desirable to sample tens, hundreds or even thousands of individuals. While it is now possible to sequence whole genomes for tens of individuals with small genome sizes, the sequencing of hundreds of individuals with large genomes remains prohibitively expensive, particularly where the genome sequence is unknown. Further, for the purpose of many studies, complete genomic sequence data for all individuals would be unnecessary and simply inflate the computational and bioinformatic costs. A major recent advance has been the development of genotyping-by-sequencing (GBS) approaches that allow a targeted fraction of the genome (a reduced representation library) to be sequenced with next-generation technology rather than the entire genome, even in species with little or no previous genomic information and large genomes.
The subset of the genome to be sequenced in these GBS approaches may be targeted using restriction enzymes or capture probes or by sequencing the transcriptome (reviewed in Davey et al. 2011). In the future, as sequencing technology and computational and bioinformatic methods develop further, whole-genome resequencing may become the predominant method for ecological and conservation genomics. Currently, reduced representation approaches offer the ability to not only discover genetic variants such as SNPs but also genotype individuals at these newly discovered loci in the same data. This special issue on ‘Genotyping-by-Sequencing in Ecological and Conservation Genomics’ represents a diverse set of empirical and theoretical studies that demonstrate both the utility and some of the challenges of GBS in ecological and conservation genomics. The empirical studies include demonstrations of the utility of GBS for population genomics and association mapping, as well as the development of genomic resources (i.e. large SNP data sets) for target species. The studies also illustrate some of the differences between GBS methods, in particular, aligning paired-end reads to achieve longer consensus sequences in contrast to single-end reads with shorter alignments, and double-digest versus sonication methods to fragment DNA. In addition, several papers describe advanced data pipelines for handling GBS-related sequence data and critically evaluate best practices for GBS methods and potential biases and novel features associated with GBS data. Overall, this compilation of papers emphasizes that GBS has been quickly adopted by the scientific community and is expected to become a common tool for studies in molecular ecology.

505 citations


Cites background or methods from "Stacks: an analysis tool set for po..."

  • ...…mammals, making novel inferences regarding selection in natural populations in addition to measuring demographic parameters using neutral markers (Catchen et al. 2013b; Corander et al. 2013; De Wit & Palumbi 2013; Hess et al. 2013; Hyma & Fay 2013; Keller et al. 2013; Reitzel et al. 2013; Roda…...

    [...]

  • ...The most comprehensive pipeline for handling GBS data is Stacks (Catchen et al. 2011), and in this issue, Catchen et al. (2013a) describe new features in Stacks to calculate population genomic statistics (such as FST and nucleotide diversity), create smoothed distributions using sliding window…...

    [...]
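
The snippet above names FST and nucleotide diversity as statistics reported by Stacks. As a rough illustration of what such per-SNP quantities look like, the sketch below computes nucleotide diversity as the probability that two allele copies drawn without replacement differ, and FST as one minus the ratio of weighted within-population diversity to pooled diversity. These are assumed textbook-style formulations chosen for the example; the snippet does not state the exact formulas, and the function names are hypothetical.

```python
from math import comb


def nucleotide_diversity(allele_counts):
    """Probability that two allele copies sampled without replacement differ.

    allele_counts: copies of each allele observed at one site, e.g. [6, 2].
    """
    n = sum(allele_counts)
    if n < 2:
        raise ValueError("need at least two allele copies")
    identical_pairs = sum(comb(c, 2) for c in allele_counts)
    return 1.0 - identical_pairs / comb(n, 2)


def fst(pop_allele_counts):
    """FST at one site from per-population allele counts (assumed formulation).

    pop_allele_counts: one list of allele counts per population, with alleles
    in the same order in every list, e.g. [[4, 0], [1, 3]].
    """
    pooled = [sum(column) for column in zip(*pop_allele_counts)]
    pi_all = nucleotide_diversity(pooled)
    if pi_all == 0.0:
        raise ValueError("site is monomorphic across populations")
    # Each population is weighted by its number of allele-copy pairs
    weights = [comb(sum(counts), 2) for counts in pop_allele_counts]
    within = sum(
        w * nucleotide_diversity(counts)
        for w, counts in zip(weights, pop_allele_counts)
    )
    return 1.0 - within / (pi_all * sum(weights))
```

With this formulation, two populations fixed for different alleles give an FST of 1, while populations with identical allele counts give a slightly negative value, a known small-sample property of pairwise-difference estimators.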

Journal Article
TL;DR: It is demonstrated that hybridization between two divergent lineages facilitated this process by providing genetic variation that subsequently became recombined and sorted into many new species, indicating rapid and extensive adaptive radiation.
Abstract: Understanding why some evolutionary lineages generate exceptionally high species diversity is an important goal in evolutionary biology. Haplochromine cichlid fishes of Africa’s Lake Victoria region encompass >700 diverse species that all evolved in the last 150,000 years. How this ‘Lake Victoria Region Superflock’ could evolve on such rapid timescales is an enduring question. Here, we demonstrate that hybridization between two divergent lineages facilitated this process by providing genetic variation that subsequently became recombined and sorted into many new species. Notably, the hybridization event generated exceptional allelic variation at an opsin gene known to be involved in adaptation and speciation. More generally, differentiation between new species is accentuated around variants that were fixed differences between the parental lineages, and that now appear in many new combinations in the radiation species. We conclude that hybridization between divergent lineages, when coincident with ecological opportunity, may facilitate rapid and extensive adaptive radiation. Cichlids underwent a rapid diversification in the Lake Victoria region, expanding to more than 700 species within 150,000 years. Here, Meier and colleagues show that an ancient hybridization between two divergent cichlid lineages generated high genetic diversity that facilitated the rapid radiation.

496 citations

References
Journal Article
TL;DR: SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant calling and alignment viewing, and thus provides universal tools for processing read alignments.
Abstract: Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments. Availability: http://samtools.sourceforge.net Contact: [email protected]

45,957 citations


"Stacks: an analysis tool set for po..." refers methods in this paper

  • ...Although reference genome aligners report reads aligned to both the positive and negative strand by the left-most genomic coordinate, pstacks will utilize the CIGAR string in the SAM file to reorient all reads such that their genomic alignment position is determined by the location of the restriction enzyme cut site....

    [...]

  • ...Stacks can handle raw sequencing data in FASTA or FASTQ format to identify loci de novo and reads aligned against a reference genome in SAM (Li et al. 2009) format....

    [...]

  • ...The pstacks program will read the CIGAR string (Li et al. 2009) from each alignment in the SAM file to determine whether the read contained an insertion, deletion or soft-masking [see Appendix S1, 1.2, Supporting information for information on CIGAR strings]....

    [...]

  • ...Furthermore, SAMtools/BCFtools and GATK can call SNPs in multiple samples and can generate allele frequencies, but there is no built-in concept of populations....

    [...]

  • ...Two of the most widely used are SAMtools/BCFtools (Li et al. 2009) and the Genome Analysis Toolkit (GATK, McKenna et al. 2010)....

    [...]
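
The snippets above describe reading the CIGAR string of each SAM alignment to detect insertions, deletions and soft-masking, and anchoring reads on the restriction cut site rather than on the left-most coordinate. The sketch below shows one way such a check could look. The CIGAR operation letters and the set of reference-consuming operations follow the SAM format; the function names and the rule of anchoring reverse-strand reads on their right-most aligned base are assumptions made for illustration, not a description of how pstacks is implemented.

```python
import re

# One CIGAR element: a length followed by an operation letter (SAM format)
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the position along the reference sequence
REF_CONSUMING = set("MDN=X")


def parse_cigar(cigar):
    """Split a CIGAR string such as '5S40M2D10M' into (length, op) pairs."""
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def summarize_alignment(pos, cigar, reverse_strand):
    """Flag indels and soft-clipping and pick an anchor coordinate.

    pos:            1-based left-most aligned reference position (SAM POS field).
    reverse_strand: True if the read aligned to the negative strand.
    """
    ops = parse_cigar(cigar)
    ref_length = sum(length for length, op in ops if op in REF_CONSUMING)
    kinds = {op for _, op in ops}
    # Assumed convention: a reverse-strand read starts at its right-most
    # aligned base, so that coordinate stands in for the cut-site position.
    anchor = pos + ref_length - 1 if reverse_strand else pos
    return {
        "has_insertion": "I" in kinds,
        "has_deletion": "D" in kinds,
        "has_soft_clip": "S" in kinds,
        "anchor": anchor,
    }
```

For example, `summarize_alignment(100, "5S40M2D10M1I4M", True)` spans 56 reference bases, so under the assumed convention the anchor is 155 and all three flags are set.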

Journal Article
TL;DR: Burrows-Wheeler Alignment tool (BWA) is implemented, a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps.
Abstract: Motivation: The enormous amount of short reads generated by the new DNA sequencing technologies call for the development of fast and accurate read alignment programs. A first generation of hash table-based methods has been developed, including MAQ, which is accurate, feature rich and fast enough to align short reads from a single individual. However, MAQ does not support gapped alignment for single-end reads, which makes it unsuitable for alignment of longer reads where indels may occur frequently. The speed of MAQ is also a concern when the alignment is scaled up to the resequencing of hundreds of individuals. Results: We implemented Burrows-Wheeler Alignment tool (BWA), a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps. BWA supports both base space reads, e.g. from Illumina sequencing machines, and color space reads from AB SOLiD machines. Evaluations on both simulated and real data suggest that BWA is ~10–20× faster than MAQ, while achieving similar accuracy. In addition, BWA outputs alignment in the new standard SAM (Sequence Alignment/Map) format. Variant calling and other downstream analyses after the alignment can be achieved with the open source SAMtools software package. Availability: http://maq.sourceforge.net Contact: [email protected]
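
The backward search with the Burrows-Wheeler Transform mentioned in this abstract can be sketched for the simplest case of exact matching. The code below is a toy illustration only: it builds the suffix array by naive sorting, stores full occurrence tables, and does not handle the mismatches and gaps that BWA supports. The function names are hypothetical.

```python
from collections import Counter


def build_index(text):
    """Build a suffix array, C table and occurrence table for text."""
    text += "$"  # sentinel, assumed to sort before every other character
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to the sentinel)
    bwt = "".join(text[i - 1] for i in suffix_array)
    # C[c]: number of characters in text that sort strictly before c
    counts = Counter(text)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return suffix_array, C, occ


def backward_search(pattern, index):
    """Return sorted 0-based start positions of exact matches of pattern."""
    suffix_array, C, occ = index
    lo, hi = 0, len(suffix_array)  # half-open range of suffix-array rows
    for ch in reversed(pattern):   # extend the match one character leftwards
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[i] for i in range(lo, hi))
```

Each pattern character narrows the range of suffix-array rows in constant time, so the search cost depends on the read length rather than on the size of the reference, which is the property that makes this family of aligners fast on large genomes.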

43,862 citations

Journal Article
01 Jun 2000, Genetics
TL;DR: A model-based clustering method is proposed for using multilocus genotype data to infer population structure and assign individuals to populations; it can be applied to most of the commonly used genetic markers, provided that they are not closely linked.
Abstract: We describe a model-based clustering method for using multilocus genotype data to infer population structure and assign individuals to populations. We assume a model in which there are K populations (where K may be unknown), each of which is characterized by a set of allele frequencies at each locus. Individuals in the sample are assigned (probabilistically) to populations, or jointly to two or more populations if their genotypes indicate that they are admixed. Our model does not assume a particular mutation process, and it can be applied to most of the commonly used genetic markers, provided that they are not closely linked. Applications of our method include demonstrating the presence of population structure, assigning individuals to populations, studying hybrid zones, and identifying migrants and admixed individuals. We show that the method can produce highly accurate assignments using modest numbers of loci (e.g., seven microsatellite loci in an example using genotype data from an endangered bird species). The software used for this article is available from http://www.stats.ox.ac.uk/~pritch/home.html.

27,454 citations

Journal Article
TL;DR: The GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
Abstract: Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS—the 1000 Genome pilot alone includes nearly five terabases—make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by the complexity of accessing and manipulating the data produced by these machines. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.

20,557 citations


"Stacks: an analysis tool set for po..." refers methods in this paper

  • ...Two of the most widely used are SAMtools/BCFtools (Li et al. 2009) and the Genome Analysis Toolkit (GATK, McKenna et al. 2010)....

    [...]

Journal Article
TL;DR: Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches, and multiple processor cores can be used simultaneously to achieve even greater alignment speeds.
Abstract: Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes. For the human genome, Burrows-Wheeler indexing allows Bowtie to align more than 25 million reads per CPU hour with a memory footprint of approximately 1.3 gigabytes. Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches. Multiple processor cores can be used simultaneously to achieve even greater alignment speeds. Bowtie is open source and available at http://bowtie.cbcb.umd.edu.

20,335 citations


"Stacks: an analysis tool set for po..." refers methods in this paper

  • ...Through the program pstacks, Stacks is able to use data from any alignment program that can produce SAM or BAM output files and has been extensively tested with Bowtie (Langmead et al. 2009), BWA (Li & Durbin 2009) and GSNAP (Wu & Nacu 2010)....

    [...]