Showing papers in "Molecular Ecology Resources in 2015"


Journal ArticleDOI
TL;DR: Clumpak, available at http://clumpak.tau.ac.il, simplifies the use of model-based analyses of population structure in population genetics and molecular ecology by automating the postprocessing of results of model‐based population structure analyses.
Abstract: The identification of the genetic structure of populations from multilocus genotype data has become a central component of modern population-genetic data analysis. Application of model-based clustering programs often entails a number of steps, in which the user considers different modelling assumptions, compares results across different predetermined values of the number of assumed clusters (a parameter typically denoted K), examines multiple independent runs for each fixed value of K, and distinguishes among runs belonging to substantially distinct clustering solutions. Here, we present CLUMPAK (Cluster Markov Packager Across K), a method that automates the postprocessing of results of model-based population structure analyses. For analysing multiple independent runs at a single K value, CLUMPAK identifies sets of highly similar runs, separating distinct groups of runs that represent distinct modes in the space of possible solutions. This procedure, which generates a consensus solution for each distinct mode, is performed by the use of a Markov clustering algorithm that relies on a similarity matrix between replicate runs, as computed by the software CLUMPP. Next, CLUMPAK identifies an optimal alignment of inferred clusters across different values of K, extending a similar approach implemented for a fixed K in CLUMPP and simplifying the comparison of clustering results across different K values. CLUMPAK incorporates additional features, such as implementations of methods for choosing K and comparing solutions obtained by different programs, models, or data subsets. CLUMPAK, available at http://clumpak.tau.ac.il, simplifies the use of model-based analyses of population structure in population genetics and molecular ecology.

2,252 citations
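
The Markov clustering step described in the abstract above can be illustrated with a small, generic sketch. This is not CLUMPAK's code: the function name, the inflation value, the convergence tolerance and the way clusters are read off at the end are all assumptions made for illustration, and the input is simply any square similarity matrix between replicate runs (such as one computed by CLUMPP).

```python
def markov_cluster(similarity, inflation=2.0, max_iter=100, tol=1e-9, eps=1e-6):
    """Group items (e.g. replicate runs) by Markov clustering.

    `similarity` is a non-empty square list of lists of non-negative numbers
    with a positive diagonal. Returns a sorted list of clusters (index lists).
    All default parameter values are illustrative assumptions.
    """
    n = len(similarity)

    def normalize_columns(m):
        # Make every column sum to 1 so it reads as transition probabilities.
        for j in range(n):
            total = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= total
        return m

    m = normalize_columns([[float(v) for v in row] for row in similarity])
    for _ in range(max_iter):
        # Expansion: square the matrix (two-step random walks).
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: raise entries to a power and renormalize, which
        # strengthens strong flows and weakens weak ones.
        inflated = normalize_columns(
            [[v ** inflation for v in row] for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break

    # Read clusters as connected components of the remaining flow.
    unvisited = set(range(n))
    clusters = []
    while unvisited:
        stack = [unvisited.pop()]
        component = []
        while stack:
            i = stack.pop()
            component.append(i)
            linked = {j for j in unvisited if m[i][j] > eps or m[j][i] > eps}
            unvisited -= linked
            stack.extend(linked)
        clusters.append(sorted(component))
    return sorted(clusters)
```

Each returned cluster would correspond to one group of highly similar runs, that is, one candidate mode for which a consensus solution could then be computed.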


Journal ArticleDOI
TL;DR: The level of replication required for accurate detection of targeted taxa in different contexts was evaluated, along with whether statistical approaches developed to estimate occupancy in the presence of observational errors can successfully estimate true prevalence, detection probability and false‐positive rates.
Abstract: Environmental DNA (eDNA) metabarcoding is increasingly used to study the present and past biodiversity. eDNA analyses often rely on amplification of very small quantities or degraded DNA. To avoid missing detection of taxa that are actually present (false negatives), multiple extractions and amplifications of the same samples are often performed. However, the level of replication needed for reliable estimates of the presence/absence patterns remains an unaddressed topic. Furthermore, degraded DNA and PCR/sequencing errors might produce false positives. We used simulations and empirical data to evaluate the level of replication required for accurate detection of targeted taxa in different contexts and to assess the performance of methods used to reduce the risk of false detections. Furthermore, we evaluated whether statistical approaches developed to estimate occupancy in the presence of observational errors can successfully estimate true prevalence, detection probability and false-positive rates. Replications reduced the rate of false negatives; the optimal level of replication was strongly dependent on the detection probability of taxa. Occupancy models successfully estimated true prevalence, detection probability and false-positive rates, but their performance increased with the number of replicates. At least eight PCR replicates should be performed if detection probability is not high, such as in ancient DNA studies. Multiple DNA extractions from the same sample yielded consistent results; in some cases, collecting multiple samples from the same locality allowed detecting more species. The optimal level of replication for accurate species detection strongly varies among studies and could be explicitly estimated to improve the reliability of results.

490 citations
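
Why the optimal replication level depends so strongly on detection probability can be seen from a back-of-the-envelope calculation. The sketch below assumes independent replicates, a constant per-replicate detection probability `p` and no false positives, which is far simpler than the occupancy models evaluated in the study; the function names and the 95% target are hypothetical.

```python
def cumulative_detection(p: float, n: int) -> float:
    """Probability of at least one detection in n independent replicates."""
    return 1.0 - (1.0 - p) ** n


def replicates_needed(p: float, target: float = 0.95) -> int:
    """Smallest number of replicates whose cumulative detection reaches target."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be in (0, 1)")
    n = 1
    while cumulative_detection(p, n) < target:
        n += 1
    return n
```

Under these assumptions, a taxon detected in half of all replicates needs 5 replicates to reach a 95% chance of at least one detection, whereas one detected in 30% of replicates needs 9.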


Journal ArticleDOI
TL;DR: During high leaf deposition periods, the presence of inhibitors resulted in no amplification for high copy number samples in the absence of an inhibition‐releasing strategy, demonstrating the necessity to carefully consider inhibition in eDNA analysis.
Abstract: Environmental DNA (eDNA) detection has emerged as a powerful tool for monitoring aquatic organisms, but much remains unknown about the dynamics of aquatic eDNA over a range of environmental conditions. DNA concentrations in streams and rivers will depend not only on the equilibrium between DNA entering the water and DNA leaving the system through degradation, but also on downstream transport. To improve understanding of the dynamics of eDNA concentration in lotic systems, we introduced caged trout into two fishless headwater streams and took eDNA samples at evenly spaced downstream intervals. This was repeated 18 times from mid-summer through autumn, over flows ranging from approximately 1-96 L/s. We used quantitative PCR to relate DNA copy number to distance from source. We found that regardless of flow, there were detectable levels of DNA at 239.5 m. The main effect of flow on eDNA counts was in opposite directions in the two streams. At the lowest flows, eDNA counts were highest close to the source and quickly trailed off over distance. At the highest flows, DNA counts were relatively low both near and far from the source. Biomass was positively related to eDNA copy number in both streams. A combination of cell settling, turbulence and dilution effects is probably responsible for our observations. Additionally, during high leaf deposition periods, the presence of inhibitors resulted in no amplification for high copy number samples in the absence of an inhibition-releasing strategy, demonstrating the necessity to carefully consider inhibition in eDNA analysis.

415 citations


Journal ArticleDOI
TL;DR: It is argued that tag jumping and contamination between libraries represents a considerable challenge for Illumina‐based metabarcoding studies, and measures to avoid false assignment of tag jumping‐derived sequences to samples are suggested.
Abstract: Metabarcoding of environmental samples on second-generation sequencing platforms has rapidly become a valuable tool for ecological studies. A fundamental assumption of this approach is the reliance on being able to track tagged amplicons back to the samples from which they originated. In this study, we address the problem of sequences in metabarcoding sequencing outputs with false combinations of used tags (tag jumps). Unless these sequences can be identified and excluded from downstream analyses, tag jumps creating sequences with false, but already used tag combinations, can cause incorrect assignment of sequences to samples and artificially inflate diversity. In this study, we document and investigate tag jumping in metabarcoding studies on Illumina sequencing platforms by amplifying mixed-template extracts obtained from bat droppings and leech gut contents with tagged generic arthropod and mammal primers, respectively. We found that an average of 2.6% and 2.1% of sequences had tag combinations, which could be explained by tag jumping in the leech and bat diet study, respectively. We suggest that tag jumping can happen during blunt-ending of pools of tagged amplicons during library build and as a consequence of chimera formation during bulk amplification of tagged amplicons during library index PCR. We argue that tag jumping and contamination between libraries represents a considerable challenge for Illumina-based metabarcoding studies, and suggest measures to avoid false assignment of tag jumping-derived sequences to samples.

409 citations
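
One simple bookkeeping step related to the problem described above is to separate reads whose tag combination was actually used from reads carrying an unused combination. The sketch below is an illustration only, not the authors' pipeline; the function names and data layout are assumptions. Note that it cannot flag tag jumps that happen to produce an already used combination, which is exactly the harder case the abstract warns about.

```python
from collections import Counter


def split_by_tag_combination(reads, used_combinations):
    """Count reads per sample and reads with unused tag combinations.

    reads: iterable of (forward_tag, reverse_tag) pairs, one per sequence.
    used_combinations: dict mapping (forward_tag, reverse_tag) -> sample name.
    """
    per_sample = Counter()
    unused = Counter()
    for pair in reads:
        sample = used_combinations.get(pair)
        if sample is None:
            unused[pair] += 1  # candidate tag jump (or contamination)
        else:
            per_sample[sample] += 1
    return per_sample, unused


def unused_fraction(per_sample, unused):
    """Fraction of all reads that carry an unused tag combination."""
    n_unused = sum(unused.values())
    total = sum(per_sample.values()) + n_unused
    return n_unused / total if total else 0.0
```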


Journal ArticleDOI
TL;DR: Individual sample replicates are used, under the expectation of identical genotypes, to quantify genotyping error in the absence of a reference genome and optimize de novo assembly parameters within the program Stacks, by minimizing error and maximizing the retrieval of informative loci.
Abstract: Restriction site-associated DNA sequencing (RADseq) provides researchers with the ability to record genetic polymorphism across thousands of loci for nonmodel organisms, potentially revolutionizing the field of molecular ecology. However, as with other genotyping methods, RADseq is prone to a number of sources of error that may have consequential effects for population genetic inferences, and these have received only limited attention in terms of the estimation and reporting of genotyping error rates. Here we use individual sample replicates, under the expectation of identical genotypes, to quantify genotyping error in the absence of a reference genome. We then use sample replicates to (i) optimize de novo assembly parameters within the program Stacks, by minimizing error and maximizing the retrieval of informative loci; and (ii) quantify error rates for loci, alleles and single-nucleotide polymorphisms. As an empirical example, we use a double-digest RAD data set of a nonmodel plant species, Berberis alpina, collected from high-altitude mountains in Mexico.

353 citations
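
The replicate-based idea can be sketched as a mismatch rate between two genotype calls of the same individual. This is a minimal illustration, not the error definitions used in the paper (which distinguishes locus, allele and SNP error rates); the function name and the dictionary layout are assumptions.

```python
def replicate_mismatch_rate(genotypes_a, genotypes_b):
    """Compare two replicate genotype sets of the same individual.

    Each argument maps a locus name to a tuple of alleles, or to None when
    the genotype is missing. Returns (mismatch_rate, loci_compared); the rate
    is NaN when no locus was called in both replicates.
    """
    compared = 0
    mismatched = 0
    for locus, ga in genotypes_a.items():
        gb = genotypes_b.get(locus)
        if ga is None or gb is None:
            continue  # only loci called in both replicates are informative
        compared += 1
        if sorted(ga) != sorted(gb):  # allele order does not matter
            mismatched += 1
    rate = mismatched / compared if compared else float("nan")
    return rate, compared
```

Repeating such a comparison across candidate assembly parameter sets would give one simple criterion for choosing among them, alongside the number of loci retained.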


Journal ArticleDOI
TL;DR: A comprehensive update to metaxa is described, introducing support for the LSU rRNA gene, a greatly improved classifier allowing classification down to genus or species level, as well as enhanced support for short‐read (100 bp) and paired‐end sequences, among other changes.
Abstract: The ribosomal rRNA genes are widely used as genetic markers for taxonomic identification of microbes. Particularly the small subunit (SSU; 16S/18S) rRNA gene is frequently used for species- or genus-level identification, but also the large subunit (LSU; 23S/28S) rRNA gene is employed in taxonomic assignment. The metaxa software tool is a popular utility for extracting partial rRNA sequences from large sequencing data sets and assigning them to an archaeal, bacterial, nuclear eukaryote, mitochondrial or chloroplast origin. This study describes a comprehensive update to metaxa – metaxa2 – that extends the capabilities of the tool, introducing support for the LSU rRNA gene, a greatly improved classifier allowing classification down to genus or species level, as well as enhanced support for short-read (100 bp) and paired-end sequences, among other changes. The performance of metaxa2 was compared to other commonly used taxonomic classifiers, showing that metaxa2 often outperforms previous methods in terms of making correct predictions while maintaining a low misclassification rate. metaxa2 is freely available from http://microbiology.se/software/metaxa2/

345 citations


Journal ArticleDOI
TL;DR: This study demonstrates amplicon sequencing with GT‐seq greatly reduces the cost of genotyping hundreds of targeted SNPs relative to existing methods by utilizing a simple library preparation method and massive efficiency of scale.
Abstract: Genotyping-in-Thousands by sequencing (GT-seq) is a method that uses next-generation sequencing of multiplexed PCR products to generate genotypes from relatively small panels (50-500) of targeted single-nucleotide polymorphisms (SNPs) for thousands of individuals in a single Illumina HiSeq lane. This method uses only unlabelled oligos and PCR master mix in two thermal cycling steps for amplification of targeted SNP loci. During this process, sequencing adapters and dual barcode sequence tags are incorporated into the amplicons enabling thousands of individuals to be pooled into a single sequencing library. Post sequencing, reads from individual samples are split into individual files using their unique combination of barcode sequences. Genotyping is performed with a simple Perl script which counts amplicon-specific sequences for each allele, and allele ratios are used to determine the genotypes. We demonstrate this technique by genotyping 2068 individual steelhead trout (Oncorhynchus mykiss) samples with a set of 192 SNP markers in a single library sequenced in a single Illumina HiSeq lane. Genotype data were 99.9% concordant to previously collected TaqMan™ genotypes at the same 192 loci, but call rates were slightly lower with GT-seq (96.4%) relative to TaqMan (99.0%). Of the 192 SNPs, 187 were genotyped in ≥90% of the individual samples and only 3 SNPs were genotyped in <70% of samples. This study demonstrates amplicon sequencing with GT-seq greatly reduces the cost of genotyping hundreds of targeted SNPs relative to existing methods by utilizing a simple library preparation method and massive efficiency of scale.

314 citations
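
Calling a genotype from allele read counts, as described in the abstract, can be sketched in a few lines. The thresholds below (minimum depth, homozygote and heterozygote bands) are illustrative assumptions and are not taken from the GT-seq script; the function name is hypothetical.

```python
def call_genotype(count_a1, count_a2, min_reads=10, hom_cutoff=0.9,
                  het_low=0.3, het_high=0.7):
    """Call a diploid genotype from read counts of two alleles.

    Returns "A1/A1", "A1/A2", "A2/A2", or None when depth is too low or the
    allele ratio falls between the assumed homozygote and heterozygote bands.
    """
    total = count_a1 + count_a2
    if total < min_reads:
        return None  # too few reads to call
    frac_a1 = count_a1 / total
    if frac_a1 >= hom_cutoff:
        return "A1/A1"
    if frac_a1 <= 1.0 - hom_cutoff:
        return "A2/A2"
    if het_low <= frac_a1 <= het_high:
        return "A1/A2"
    return None  # ambiguous ratio, left uncalled
```

Leaving ambiguous ratios uncalled trades a slightly lower call rate for fewer wrong genotypes, which is one plausible reason for a call rate below that of an assay-based method.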


Journal ArticleDOI
TL;DR: The results show that reads of all species were recovered after PCR enrichment at the authors' control conditions and high‐throughput sequencing, and show that the four factors considered biased the final proportions of the species to some degree.
Abstract: The quantification of the biological diversity in environmental samples using high-throughput DNA sequencing is hindered by the PCR bias caused by variable primer-template mismatches of the individual species. In some dietary studies, there is the added problem that samples are enriched with predator DNA, so often a predator-specific blocking oligonucleotide is used to alleviate the problem. However, specific blocking oligonucleotides could coblock nontarget species to some degree. Here, we accurately estimate the extent of the PCR biases induced by universal and blocking primers on a mock community prepared with DNA of twelve species of terrestrial arthropods. We also compare universal and blocking primer biases with those induced by variable annealing temperature and number of PCR cycles. The results show that reads of all species were recovered after PCR enrichment at our control conditions (no blocking oligonucleotide, 45 °C annealing temperature and 40 cycles) and high-throughput sequencing. They also show that the four factors considered biased the final proportions of the species to some degree. Among these factors, the number of primer-template mismatches of each species had a disproportionate effect (up to five orders of magnitude) on the amplification efficiency. In particular, the number of primer-template mismatches explained most of the variation (~3/4) in the amplification efficiency of the species. The effect of blocking oligonucleotide concentration on nontarget species relative abundance was also significant, but less important (below one order of magnitude). Considering the results reported here, the quantitative potential of the technique is limited, and only qualitative results (the species list) are reliable, at least when targeting the barcoding COI region.

313 citations


Journal ArticleDOI
TL;DR: A new R package is presented, called related, that can calculate relatedness based on seven estimators, can account for genotyping errors, missing data and inbreeding, and can estimate 95% confidence intervals.
Abstract: Analyses of pairwise relatedness represent a key component to addressing many topics in biology. However, such analyses have been limited because most available programs provide a means to estimate relatedness based on only a single estimator, making comparison across estimators difficult. Second, all programs to date have been platform specific, working only on a specific operating system. This has the undesirable outcome of making choice of relatedness estimator limited by operating system preference, rather than being based on scientific rationale. Here, we present a new R package, called related, that can calculate relatedness based on seven estimators, can account for genotyping errors, missing data and inbreeding, and can estimate 95% confidence intervals. Moreover, simulation functions are provided that allow for easy comparison of the performance of different estimators and for analyses of how much resolution to expect from a given data set. Because this package works in R, it is platform independent. Combined, this functionality should allow for more appropriate analyses and interpretation of pairwise relatedness and will also allow for the integration of relatedness data into larger R workflows.

296 citations


Journal ArticleDOI
TL;DR: This study demonstrates the successful preservation of eDNA at room temperature (20 °C) in two lysis buffers, CTAB and Longmire's, over a 2‐week period of time and suggests that for many kinds of studies recently reported on macrobial eDNA, detection probabilities could have been increased, and at a lower cost, by utilizing the Longmire’s preservation buffer with a PCI DNA extraction.
Abstract: Current research targeting filtered macrobial environmental DNA (eDNA) often relies upon cold ambient temperatures at various stages, including the transport of water samples from the field to the laboratory and the storage of water and/or filtered samples in the laboratory. This poses practical limitations for field collections in locations where refrigeration and frozen storage is difficult or where samples must be transported long distances for further processing and screening. This study demonstrates the successful preservation of eDNA at room temperature (20 °C) in two lysis buffers, CTAB and Longmire's, over a 2-week period of time. Moreover, the preserved eDNA samples were seamlessly integrated into a phenol–chloroform–isoamyl alcohol (PCI) DNA extraction protocol. The successful application of the eDNA extraction to multiple filter membrane types suggests the methods evaluated here may be broadly applied in future eDNA research. Our results also suggest that for many kinds of studies recently reported on macrobial eDNA, detection probabilities could have been increased, and at a lower cost, by utilizing the Longmire's preservation buffer with a PCI DNA extraction.

258 citations


Journal ArticleDOI
TL;DR: A large set (n = 1510) of ultraconserved elements (UCEs) shared among the insect order Hymenoptera are identified and used to reconstruct phylogenetic relationships spanning very old to very young divergences among hymenopteran lineages with complete support.
Abstract: Gaining a genomic perspective on phylogeny requires the collection of data from many putatively independent loci across the genome. Among insects, an increasingly common approach to collecting this class of data involves transcriptome sequencing, because few insects have high-quality genome sequences available; assembling new genomes remains a limiting factor; the transcribed portion of the genome is a reasonable, reduced subset of the genome to target; and the data collected from transcribed portions of the genome are similar in composition to the types of data with which biologists have traditionally worked (e.g. exons). However, molecular techniques requiring RNA as a template, including transcriptome sequencing, are limited to using very high-quality source materials, which are often unavailable from a large proportion of biologically important insect samples. Recent research suggests that DNA-based target enrichment of conserved genomic elements offers another path to collecting phylogenomic data across insect taxa, provided that conserved elements are present in and can be collected from insect genomes. Here, we identify a large set (n = 1510) of ultraconserved elements (UCEs) shared among the insect order Hymenoptera. We used in silico analyses to show that these loci accurately reconstruct relationships among genome-enabled hymenoptera, and we designed a set of RNA baits (n = 2749) for enriching these loci that researchers can use with DNA templates extracted from a variety of sources. We used our UCE bait set to enrich an average of 721 UCE loci from 30 hymenopteran taxa, and we used these UCE loci to reconstruct phylogenetic relationships spanning very old (≥220 Ma) to very young (≤1 Ma) divergences among hymenopteran lineages. In contrast to a recent study addressing hymenopteran phylogeny using transcriptome data, we found ants to be sister to all remaining aculeate lineages with complete support, although this result could be explained by factors such as taxon sampling. We discuss this approach and our results in the context of elucidating the evolutionary history of one of the most diverse and speciose animal orders.

Journal ArticleDOI
TL;DR: Evidence is provided that metabarcoding of diatoms via NGS sequencing of the V4 region (18S) has a great potential for water quality assessments and could complement and maybe even improve the identification via light microscopy.
Abstract: Diatoms are frequently used for water quality assessments; however, identification to species level is difficult, time-consuming and needs in-depth knowledge of the organisms under investigation, as nonhomoplastic species-specific morphological characters are scarce. We here investigate how identification methods based on DNA (metabarcoding using NGS platforms) perform in comparison to morphological diatom identification and propose a workflow to optimize diatom fresh water quality assessments. Diatom diversity at seven different sites along the course of the river system Odra and Lusatian Neisse from the source to the mouth is analysed with DNA and morphological methods, which are compared. The NGS technology almost always leads to a higher number of identified taxa (270 via NGS vs. 103 by light microscopy LM), whose presence could subsequently be verified by LM. The sequence-based approach allows for a much more graduated insight into the taxonomic diversity of the environmental samples. Taxa retrieval varies considerably throughout the river system, depending on species occurrences and the taxonomic depth of the reference databases. Mostly rare taxa from oligotrophic parts of the river systems are less well represented in the reference database used. A workflow for DNA-based NGS diatom identification is presented. 28 000 diatom sequences were evaluated. Our findings provide evidence that metabarcoding of diatoms via NGS sequencing of the V4 region (18S) has a great potential for water quality assessments and could complement and maybe even improve the identification via light microscopy.

Journal ArticleDOI
TL;DR: This study provides the globally largest DNA barcode reference library for Coleoptera, with a focus on Germany: 15 948 individuals belonging to 3514 well‐identified species (53% of the German fauna), with representatives from 97 of 103 families.
Abstract: Beetles are the most diverse group of animals and are crucial for ecosystem functioning. In many countries, they are well established for environmental impact assessment, but even in the well-studied Central European fauna, species identification can be very difficult. A comprehensive and taxonomically well-curated DNA barcode library could remedy this deficit and could also link hundreds of years of traditional knowledge with next generation sequencing technology. However, such a beetle library is missing to date. This study provides the globally largest DNA barcode reference library for Coleoptera for 15 948 individuals belonging to 3514 well-identified species (53% of the German fauna) with representatives from 97 of 103 families (94%). This study is the first comprehensive regional test of the efficiency of DNA barcoding for beetles with a focus on Germany. Sequences ≥500 bp were recovered from 63% of the specimens analysed (15 948 of 25 294) with short sequences from another 997 specimens. Whereas most specimens (92.2%) could be unambiguously assigned to a single known species by sequence diversity at CO1, 1089 specimens (6.8%) were assigned to more than one Barcode Index Number (BIN), creating 395 BINs which need further study to ascertain if they represent cryptic species, mitochondrial introgression, or simply regional variation in widespread species. We found 409 specimens (2.6%) that shared a BIN assignment with another species, most involving a pair of closely allied species as 43 BINs were involved. Most of these taxa were separated by barcodes although sequence divergences were low. Only 155 specimens (0.97%) show identical or overlapping clusters.

Journal ArticleDOI
TL;DR: The barcode data contributed to clarifying the status of nearly half the examined taxonomically problematic species of bees in the German fauna, and the role of DNA barcoding as a tool for current and future taxonomic work is discussed.
Abstract: This study presents DNA barcode records for 4118 specimens representing 561 species of bees belonging to the six families of Apoidea (Andrenidae, Apidae, Colletidae, Halictidae, Megachilidae and Melittidae) found in Central Europe. These records provide fully compliant barcode sequences for 503 of the 571 bee species in the German fauna and partial sequences for 43 more. The barcode results are largely congruent with traditional taxonomy as only five closely allied pairs of species could not be discriminated by barcodes. As well, 90% of the species possessed sufficiently deep sequence divergence to be assigned to a different Barcode Index Number (BIN). In fact, 56 species (11%) were assigned to two or more BINs reflecting the high levels of intraspecific divergence among their component specimens. Fifty other species (9.7%) shared the same Barcode Index Number with one or more species, but most of these species belonged to a distinct barcode cluster within a particular BIN. The barcode data contributed to clarifying the status of nearly half the examined taxonomically problematic species of bees in the German fauna. Based on these results, the role of DNA barcoding as a tool for current and future taxonomic work is discussed.

Journal ArticleDOI
TL;DR: A method for identification and quantification of airborne pollen using DNA sequencing is presented, and it is shown to provide an accurate qualitative and quantitative view of the species composition of samples of pollen grains.
Abstract: Pollen monitoring is an important and widely used tool in allergy research and creation of awareness in pollen-allergic patients. Current pollen monitoring methods are microscope-based, labour intensive and cannot identify pollen to the genus level in some relevant allergenic plant groups. Therefore, a more efficient, cost-effective and sensitive method is needed. Here, we present a method for identification and quantification of airborne pollen using DNA sequencing. Pollen is collected from ambient air using standard techniques. DNA is extracted from the collected pollen, and a fragment of the chloroplast gene trnL is amplified using PCR. The PCR product is subsequently sequenced on a next-generation sequencing platform (Ion Torrent). Amplicon molecules are sequenced individually, allowing identification of different sequences from a mixed sample. We show that this method provides an accurate qualitative and quantitative view of the species composition of samples of airborne pollen grains. We also show that it correctly identifies the individual grass genera present in a mixed sample of grass pollen, which cannot be achieved using microscopic pollen identification. We conclude that our method is more efficient and sensitive than current pollen monitoring techniques and therefore has the potential to increase the throughput of pollen monitoring.

Journal ArticleDOI
TL;DR: The development and characterization of the first high‐density single nucleotide polymorphism (SNP) genotyping array for rainbow trout is described and strong evidence for a wide distribution throughout the genome with good representation in all 29 chromosomes is provided.
Abstract: In this study, we describe the development and characterization of the first high-density single nucleotide polymorphism (SNP) genotyping array for rainbow trout. The SNP array is publicly available from a commercial vendor (Affymetrix). The SNP genotyping quality was high, and validation rate was close to 90%. This is comparable to other farm animals and is much higher than previous smaller scale SNP validation studies in rainbow trout. High quality and integrity of the genotypes are evident from sample reproducibility and from nearly 100% agreement in genotyping results from other methods. The array is very useful for rainbow trout aquaculture populations with more than 40 900 polymorphic markers per population. For wild populations that were confounded by a smaller sample size, the number of polymorphic markers was between 10 577 and 24 330. Comparison between genotypes from individual populations suggests good potential for identifying candidate markers for populations' traceability. Linkage analysis and mapping of the SNPs to the reference genome assembly provide strong evidence for a wide distribution throughout the genome with good representation in all 29 chromosomes. A total of 68% of the genome scaffolds and contigs were anchored through linkage analysis using the SNP array genotypes, including ~20% of the genome assembly that has not been previously anchored to chromosomes.

Journal ArticleDOI
TL;DR: The results support models of independent patterns of morphological and molecular evolution by showing that DNA barcodes are effective in species identification regardless of their morphological diagnosibility and that the size of the barcoding gap strongly depends on taxonomic groups and practices.
Abstract: The philosophical basis and utility of DNA barcoding have been a subject of numerous debates. While most literature embraces it, some studies continue to question its use in dipterans, butterflies and marine gastropods. Here, we explore the utility of DNA barcoding in identifying spider species that vary in taxonomic affiliation, morphological diagnosibility and geographic distribution. Our first test searched for a 'barcoding gap' by comparing intra- and interspecific means, medians and overlap in more than 75,000 computed Kimura 2-parameter (K2P) genetic distances in three families. Our second test compared K2P distances of congeneric species with high vs. low morphological distinctness in 20 genera of 11 families. Our third test explored the effect of enlarging geographical sampling area at a continental scale on genetic variability in DNA barcodes within 20 species of nine families. Our results generally point towards a high utility of DNA barcodes in identifying spider species. However, the size of the barcoding gap strongly depends on taxonomic groups and practices. It is becoming critical to define the barcoding gap statistically more consistently and to document its variation over taxonomic scales. Our results support models of independent patterns of morphological and molecular evolution by showing that DNA barcodes are effective in species identification regardless of their morphological diagnosibility. We also show that DNA barcodes represent an effective tool for identifying spider species over geographic scales, yet their variation contains useful biogeographic information.
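
For readers unfamiliar with the Kimura 2-parameter (K2P) distance used in these tests, it is computed from the proportions of transitions (P) and transversions (Q) between two aligned sequences as d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q). The sketch below is a minimal implementation of that standard formula; the function name and the choice to skip any site containing a gap or ambiguity code are assumptions, and it is not the software used in the study.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
NUCLEOTIDES = PURINES | PYRIMIDINES


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in NUCLEOTIDES or b not in NUCLEOTIDES:
            continue  # skip gaps and ambiguity codes (an assumption)
        sites += 1
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    term1 = 1.0 - 2.0 * p - q
    term2 = 1.0 - 2.0 * q
    if term1 <= 0.0 or term2 <= 0.0:
        raise ValueError("sequences too divergent for the K2P formula")
    d = -0.5 * math.log(term1) - 0.25 * math.log(term2)
    return d if d > 0.0 else 0.0
```

A barcoding gap analysis would compute this distance for all within-species and between-species pairs and then compare the two distributions.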

Journal ArticleDOI
TL;DR: The PhytoREF database is built, containing 6490 plastidial 16S rDNA reference sequences that originate from a large diversity of eukaryotes representing all known major photosynthetic lineages; it mainly focuses on marine microalgae, but sequences from land plants and freshwater taxa were also included to broaden the applicability of PhytoREF to different aquatic and terrestrial habitats.
Abstract: Photosynthetic eukaryotes have a critical role as the main producers in most ecosystems of the biosphere. The ongoing environmental metabarcoding revolution opens the perspective for holistic ecosystems biological studies of these organisms, in particular the unicellular microalgae that often lack distinctive morphological characters and have complex life cycles. To interpret environmental sequences, metabarcoding necessarily relies on taxonomically curated databases containing reference sequences of the targeted gene (or barcode) from identified organisms. To date, no such reference framework exists for photosynthetic eukaryotes. In this study, we built the PhytoREF database that contains 6490 plastidial 16S rDNA reference sequences that originate from a large diversity of eukaryotes representing all known major photosynthetic lineages. We compiled 3333 amplicon sequences available from public databases and 879 sequences extracted from plastidial genomes, and generated 411 novel sequences from cultured marine microalgal strains belonging to different eukaryotic lineages. A total of 1867 environmental Sanger 16S rDNA sequences were also included in the database. Stringent quality filtering and a phylogeny-based taxonomic classification were applied for each 16S rDNA sequence. The database mainly focuses on marine microalgae, but sequences from land plants (representing half of the PhytoREF sequences) and freshwater taxa were also included to broaden the applicability of PhytoREF to different aquatic and terrestrial habitats. PhytoREF, accessible via a web interface (http://phytoref.fr), is a new resource in molecular ecology to foster the discovery, assessment and monitoring of the diversity of photosynthetic eukaryotes using high-throughput sequencing.

Journal ArticleDOI
TL;DR: An approximate Bayesian computation (ABC)-based method is used to infer genomewide average Ne from time-sampled data; this estimate is then set as a prior for inferring per-site selection coefficients accurately and precisely.
Abstract: With novel developments in sequencing technologies, time-sampled data are becoming more available and accessible. Naturally, there have been efforts in parallel to infer population genetic parameters from these data sets. Here, we compare and analyse four recent approaches based on the Wright-Fisher model for inferring selection coefficients (s) given effective population size (Ne), with simulated temporal data sets. Furthermore, we demonstrate the advantage of a recently proposed approximate Bayesian computation (ABC)-based method that is able to correctly infer genomewide average Ne from time-serial data, which is then set as a prior for inferring per-site selection coefficients accurately and precisely. We implement this ABC method in a new software and apply it to a classical time-serial data set of the medionigra genotype in the moth Panaxia dominula. We show that a recessive lethal model is the best explanation for the observed variation in allele frequency by implementing an estimator of the dominance ratio (h).
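
As a rough illustration of the kind of forward simulation that Wright-Fisher-based inference relies on, the sketch below simulates an allele-frequency trajectory under selection and drift. It is not the ABC method described in the abstract. The fitness parameterization (1 + s, 1 + h*s, 1), the function names and the example values are assumptions made for illustration only; under this assumed parameterization a recessive lethal allele corresponds to s = -1 and h = 0.

```python
import random


def next_freq_expected(p, s, h):
    """Allele frequency after one round of selection (no drift).

    Assumed genotype fitnesses: AA = 1 + s, Aa = 1 + h*s, aa = 1.
    """
    q = 1.0 - p
    w_hom, w_het, w_ref = 1.0 + s, 1.0 + h * s, 1.0
    mean_w = p * p * w_hom + 2 * p * q * w_het + q * q * w_ref
    if mean_w == 0.0:
        # Every genotype present has zero fitness; keep p unchanged
        # (an arbitrary convention for this sketch).
        return p
    return (p * p * w_hom + p * q * w_het) / mean_w


def simulate_wright_fisher(p0, s, h, n_e, generations, rng):
    """Return the allele-frequency trajectory, starting frequency included."""
    trajectory = [p0]
    p = p0
    for _ in range(generations):
        p_sel = next_freq_expected(p, s, h)
        # Drift: binomial sampling of 2 * n_e gene copies.
        copies = sum(1 for _ in range(2 * n_e) if rng.random() < p_sel)
        p = copies / (2 * n_e)
        trajectory.append(p)
    return trajectory


if __name__ == "__main__":
    # Hypothetical values: a recessive lethal allele starting at 10%.
    traj = simulate_wright_fisher(0.1, -1.0, 0.0, 500, 30, random.Random(42))
    print([round(x, 3) for x in traj[:5]])
```

An ABC procedure would typically draw candidate parameter values from a prior, run a simulator of this kind, and keep the draws whose simulated trajectories resemble the observed time series.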

Journal ArticleDOI
TL;DR: A large‐scale meta‐analysis was carried out to compare ITS1 and ITS2 from three aspects: PCR amplification, DNA sequencing and species discrimination, and found that ITS1 represents a better DNA barcode than ITS2 for eukaryotic species.
Abstract: A DNA barcode is a short piece of DNA sequence used for species determination and discovery. The internal transcribed spacer (ITS/ITS2) region has been proposed as the standard DNA barcode for fungi and seed plants and has been widely used in DNA barcoding analyses for other biological groups, for example algae, protists and animals. The ITS region consists of both ITS1 and ITS2 regions. Here, a large-scale meta-analysis was carried out to compare ITS1 and ITS2 from three aspects: PCR amplification, DNA sequencing and species discrimination, in terms of the presence of DNA barcoding gaps, species discrimination efficiency, sequence length distribution, GC content distribution and primer universality. In total, 85 345 sequence pairs in 10 major groups of eukaryotes, including ascomycetes, basidiomycetes, liverworts, mosses, ferns, gymnosperms, monocotyledons, eudicotyledons, insects and fishes, covering 611 families, 3694 genera, and 19 060 species, were analysed. Using similarity-based methods, we calculated species discrimination efficiencies for ITS1 and ITS2 in all major groups, families and genera. Using Fisher's exact test, we found that ITS1 has significantly higher efficiencies than ITS2 in 17 of the 47 families and 20 of the 49 genera, which are sample-rich. By in silico PCR amplification evaluation, primer universality of the extensively applied ITS1 primers was found superior to that of ITS2 primers. Additionally, shorter amplification products and lower GC content were found to be two other advantages of ITS1 for sequencing. In summary, ITS1 represents a better DNA barcode than ITS2 for eukaryotic species.

Journal ArticleDOI
TL;DR: Read numbers for diet species in metagenomic and metabarcoding data were correlated, indicating that both are useful for determining relative sequence abundance; if chloroplast genomes were used as reference, the precision of identifications and species recovery would improve further.
Abstract: Faecal samples are of great value as a non-invasive means to gather information on the genetics, distribution, demography, diet and parasite infestation of endangered species. Direct shotgun sequencing of faecal DNA could give information on these simultaneously, but this approach is largely untested. Here, we used two faecal samples to characterize the diet of two red-shanked doucs langurs (Pygathrix nemaeus) that were fed known foliage, fruits, vegetables and cereals. Illumina HiSeq produced ~74 and ~67 million paired reads for these samples, of which ~10,000 (0.014%) and ~44,000 (0.066%), respectively, were of chloroplast origin. Sequences were matched against a database of available chloroplast 'barcodes' for angiosperms. The results were compared with 'metabarcoding' using PCR amplification of the P6 loop of trnL. Metagenomics identified seven and nine of the likely 16 diet plants while six and five were identified by metabarcoding. Metabarcoding produced thousands of reads consistent with the known diet, but the barcodes were too short to identify several plant species to genus. Metagenomics utilized multiple, longer barcodes that combined had greater power of identification. However, rare diet items were not recovered. Read numbers for diet species in metagenomic and metabarcoding data were correlated, indicating that both are useful for determining relative sequence abundance. Metagenomic reads were uniformly distributed across the chloroplast genomes; thus, if chloroplast genomes were used as reference, the precision of identifications and species recovery would improve further. Metagenomics also recovered the host mitochondrial genome and numerous intestinal parasite sequences in addition to generating data useful for characterizing the microbiome.

Journal ArticleDOI
TL;DR: It is concluded that starting DNA quality is an important consideration for RADSeq; however, the approach remains robust until genomic DNA is extensively degraded.
Abstract: Degraded DNA from suboptimal field sampling is common in molecular ecology. However, its impact on techniques that use restriction site associated next-generation DNA sequencing (RADSeq, GBS) is unknown. We experimentally examined the effects of in situ DNA degradation on data generation for a modified double-digest RADSeq approach (3RAD). We generated libraries using genomic DNA serially extracted from the muscle tissue of 8 individual lake whitefish (Coregonus clupeaformis) following 0-, 12-, 48- and 96-h incubation at room temperature posteuthanasia. This treatment of the tissue resulted in input DNA that ranged in quality from nearly intact to highly sheared. All samples were sequenced as a multiplexed pool on an Illumina MiSeq. Libraries created from low to moderately degraded DNA (12-48 h) performed well. In contrast, the number of RADtags per individual, number of variable sites, and percentage of identical RADtags retained were all dramatically reduced when libraries were made using highly degraded DNA (96-h group). This reduction in performance was largely due to a significant and unexpected loss of raw reads as a result of poor quality scores. Our findings remained consistent after changes in restriction enzymes, modified fold coverage values (2- to 16-fold), and additional read-length trimming. We conclude that starting DNA quality is an important consideration for RADSeq; however, the approach remains robust until genomic DNA is extensively degraded.

Journal ArticleDOI
TL;DR: New degenerate primers were developed that enabled acquisition of the COI barcode region from 100% of specimens tested, representing 23 families of digeneans and 6 orders of cestodes, and represent an improvement over existing methods.
Abstract: Digeneans and cestodes are species-rich taxa and can seriously impact human health, fisheries, aqua- and agriculture, and wildlife conservation and management. DNA barcoding using the COI Folmer region could be applied for species detection and identification, but both 'universal' and taxon-specific COI primers fail to amplify in many flatworm taxa. We found that high levels of nucleotide variation at priming sites made it unrealistic to design primers targeting all flatworms. We developed new degenerate primers that enabled acquisition of the COI barcode region from 100% of specimens tested (n = 46), representing 23 families of digeneans and 6 orders of cestodes. This high success rate represents an improvement over existing methods. Primers and methods provided here are critical pieces towards redressing the current paucity of COI barcodes for these taxa in public databases.

Journal ArticleDOI
TL;DR: cox1 and cox2 are compared to determine which is best suited as a universal oomycete barcode, in terms of PCR efficiency for 31 representative genera and for historic herbarium specimens, and in terms of sequence polymorphism and intra- and interspecific divergence.
Abstract: Oomycetes are a diverse group of eukaryotes in terrestrial, limnic and marine habitats worldwide and include several devastating plant pathogens, for example Phytophthora infestans (potato late blight). The cytochrome c oxidase subunit 2 gene (cox2) has been widely used for identification, taxonomy and phylogeny of various oomycete groups. However, recently the cox1 gene was proposed as a DNA barcode marker instead, together with ITS rDNA. The cox1 locus has been used in some studies of Pythium and Phytophthora, but has rarely been used for other oomycetes, as amplification success of cox1 varies with different lineages and sample ages. To determine which out of cox1 or cox2 is best suited as a universal oomycete barcode, we compared these two genes in terms of (i) PCR efficiency for 31 representative genera, as well as for historic herbarium specimens, and (ii) sequence polymorphism, intra- and interspecific divergence. The primer sets for cox2 successfully amplified all oomycete genera tested, while cox1 failed to amplify three genera. In addition, cox2 exhibited higher PCR efficiency for historic herbarium specimens, providing easier access to barcoding-type material. Sequence data for several historic type specimens exist for cox2, but there are none for cox1. In addition, cox2 yielded higher species identification success, with higher interspecific and lower intraspecific divergences than cox1. Therefore, cox2 is suggested as a partner DNA barcode along with ITS rDNA instead of cox1. The cox2-1 spacer could be a useful marker below species level. Improved protocols and universal primers are presented for all genes to facilitate future barcoding efforts.

Journal ArticleDOI
TL;DR: Taking the morphology, distribution range and habitat of the species into account, DNA barcoding provided additional information for species identification and delivered a preliminary assessment of biodiversity for the large genus Rhododendron in the biodiversity hotspots of the Himalaya–Hengduan Mountains.
Abstract: The Himalaya–Hengduan Mountains encompass two global biodiversity hotspots with high levels of biodiversity and endemism. This area is one of the diversification centres of the genus Rhododendron, which is recognized as one of the most taxonomically challenging plant taxa due to recent adaptive radiations and rampant hybridization. In this study, four DNA barcodes were evaluated on 531 samples representing 173 species of seven sections of four subgenera in Rhododendron, with a high sampling density from the Himalaya–Hengduan Mountains employing three analytical methods. The varied approaches (NJ, PWG and BLAST) had different species identification powers with BLAST performing best. With the PWG analysis, the discrimination rates for single barcodes varied from 12.21% to 25.19% with ITS < rbcL < matK < psbA-trnH. Combinations of ITS + psbA-trnH + matK and the four barcodes showed the highest discrimination ability (both 41.98%) among all possible combinations. As a single barcode, psbA-trnH performed best with a relatively high performance (25.19%). Overall, the three-marker combination of ITS + psbA-trnH + matK was found to be the best DNA barcode for identifying Rhododendron species. The relatively low discriminative efficiency of DNA barcoding in this genus (~42%) may possibly be attributable to too low sequence divergences as a result of a long generation time of Rhododendron and complex speciation patterns involving recent radiations and hybridizations. Taking the morphology, distribution range and habitat of the species into account, DNA barcoding provided additional information for species identification and delivered a preliminary assessment of biodiversity for the large genus Rhododendron in the biodiversity hotspots of the Himalaya–Hengduan Mountains.

Journal ArticleDOI
TL;DR: An R package that allows the evolution of DNA sequences to be simulated according to a range of clock models is presented and the ability of two Bayesian phylogenetic methods to distinguish among different relaxed‐clock models and to quantify rate variation among lineages is assessed.
Abstract: Evolutionary timescales can be estimated from genetic data using phylogenetic methods based on the molecular clock. To account for molecular rate variation among lineages, a number of relaxed-clock models have been developed. Some of these models assume that rates vary among lineages in an autocorrelated manner, so that closely related species share similar rates. In contrast, uncorrelated relaxed clocks allow all of the branch-specific rates to be drawn from a single distribution, without assuming any correlation between rates along neighbouring branches. There is uncertainty about which of these two classes of relaxed-clock models are more appropriate for biological data. We present an R package, NELSI, that allows the evolution of DNA sequences to be simulated according to a range of clock models. Using data generated by this package, we assessed the ability of two Bayesian phylogenetic methods to distinguish among different relaxed-clock models and to quantify rate variation among lineages. The results of our analyses show that rate autocorrelation is typically difficult to detect, even when there is complete taxon sampling. This provides a potential explanation for past failures to detect rate autocorrelation in a range of data sets.
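
The difference between the two classes of relaxed-clock models described above can be sketched in a few lines. This is not NELSI (an R package) and does not reproduce its interface; the function names, the lognormal choice for both models and the toy tree are assumptions made for illustration. In the uncorrelated model every branch rate is drawn independently from one distribution, while in the autocorrelated model each branch rate is drawn from a distribution centred on the rate of its parent branch.

```python
import math
import random


def uncorrelated_rates(branches, mean_log, sd_log, rng):
    """Draw every branch rate independently from one lognormal distribution."""
    return {b: rng.lognormvariate(mean_log, sd_log) for b in branches}


def autocorrelated_rates(parent_of, root_rate, sd_log, rng):
    """Draw each branch rate from a lognormal centred (in log space) on the
    rate of its parent branch, so neighbouring branches have similar rates.

    parent_of maps a branch name to its parent branch, or to None for a
    branch that descends directly from the root.
    """
    rates = {}

    def rate_for(branch):
        if branch not in rates:
            parent = parent_of[branch]
            parent_rate = root_rate if parent is None else rate_for(parent)
            rates[branch] = rng.lognormvariate(math.log(parent_rate), sd_log)
        return rates[branch]

    for branch in parent_of:
        rate_for(branch)
    return rates


if __name__ == "__main__":
    # Hypothetical four-branch tree: A and D descend from the root,
    # B descends from A, and C descends from B.
    tree = {"A": None, "B": "A", "C": "B", "D": None}
    rng = random.Random(7)
    print(uncorrelated_rates(tree, math.log(0.01), 0.3, rng))
    print(autocorrelated_rates(tree, 0.01, 0.3, rng))
```

Multiplying each rate by the corresponding branch duration would give branch lengths in expected substitutions per site, along which sequences could then be simulated.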

Journal ArticleDOI
TL;DR: Modest gains in discrimination are possible, but using complete plastid genomes or a small number of nuclear genes in DNA barcoding may not substantially raise species discriminatory power in many evolutionarily young lineages.
Abstract: Obtaining accurate phylogenies and effective species discrimination using a small standardized set of plastid genes is challenging in evolutionarily young lineages. Complete plastid genome sequencing offers an increasingly easy-to-access source of characters that helps address this. The usefulness of this approach, however, depends on the extent to which plastid haplotypes track morphological species boundaries. We have tested the power of complete plastid genomes to discriminate among multiple accessions of 11 of 13 New Caledonian Araucaria species, an evolutionarily young lineage where the standard DNA barcoding approach has so far failed and phylogenetic relationships have remained elusive. Additionally, 11 nuclear gene regions were Sanger sequenced for all accessions to ascertain the success of species discrimination using a moderate number of nuclear genes. Overall, fewer than half of the New Caledonian Araucaria species with multiple accessions were monophyletic in the plastid or nuclear trees. However, the plastid data retrieved a phylogeny with a higher resolution compared to any previously published tree of this clade and supported the monophyly of about twice as many species and nodes compared to the nuclear data set. Modest gains in discrimination thus are possible, but using complete plastid genomes or a small number of nuclear genes in DNA barcoding may not substantially raise species discriminatory power in many evolutionarily young lineages. The big challenge therefore remains to develop techniques that allow routine access to large numbers of nuclear markers scaleable to thousands of individuals from phylogenetically disparate sample sets.

Journal ArticleDOI
TL;DR: This study is the first to compare SNPs and microsatellites for analyses of parentage and relatedness in a species that lives in groups with a complex social and kin structure and should prove informative for those interested in developing SNP loci from transcriptome data when published genomes are unavailable.
Abstract: The development of genetic markers has revolutionized molecular studies within and among populations. Although poly-allelic microsatellites are the most commonly used genetic marker for within-population studies of free-living animals, biallelic single nucleotide polymorphisms, or SNPs, have also emerged as a viable option for use in nonmodel systems. We describe a robust method of SNP discovery from the transcriptome of a nonmodel organism that resulted in more than 99% of the markers working successfully during genotyping. We then compare the use of 102 novel SNPs with 15 previously developed microsatellites for studies of parentage and kinship in cooperatively breeding superb starlings (Lamprotornis superbus) that live in highly kin-structured groups. For 95% of the offspring surveyed, SNPs and microsatellites identified the same genetic father, but only when behavioural information about the likely parents at a nest was included to aid in assignment. Moreover, when such behavioural information was available, the number of SNPs necessary for successful parentage assignment was reduced by half. However, in a few cases where candidate fathers were highly related, SNPs did a better job at assigning fathers than microsatellites. Despite high variation between individual pairwise relatedness values, microsatellites and SNPs performed equally well in kinship analyses. This study is the first to compare SNPs and microsatellites for analyses of parentage and relatedness in a species that lives in groups with a complex social and kin structure. It should also prove informative for those interested in developing SNP loci from transcriptome data when published genomes are unavailable.

Journal ArticleDOI
TL;DR: LDna is a useful exploratory tool, able to give a global overview of LD associated with diverse evolutionary phenomena and identify loci potentially involved, and is applicable to any population‐genomic data set, making it especially valuable for nonmodel species.
Abstract: Recent advances in sequencing allow population-genomic data to be generated for virtually any species. However, approaches to analyse such data lag behind the ability to generate it, particularly in nonmodel species. Linkage disequilibrium (LD, the nonrandom association of alleles from different loci) is a highly sensitive indicator of many evolutionary phenomena including chromosomal inversions, local adaptation and geographical structure. Here, we present linkage disequilibrium network analysis (LDna), which accesses information on LD shared between multiple loci genomewide. In LD networks, vertices represent loci, and connections between vertices represent the LD between them. We analysed such networks in two test cases: a new restriction-site-associated DNA sequence (RAD-seq) data set for Anopheles baimaii, a Southeast Asian malaria vector; and a well-characterized single nucleotide polymorphism (SNP) data set from 21 three-spined stickleback individuals. In each case, we readily identified five distinct LD network clusters (single-outlier clusters, SOCs), each comprising many loci connected by high LD. In A. baimaii, further population-genetic analyses supported the inference that each SOC corresponds to a large inversion, consistent with previous cytological studies. For sticklebacks, we inferred that each SOC was associated with a distinct evolutionary phenomenon: two chromosomal inversions, local adaptation, population-demographic history and geographic structure. LDna is thus a useful exploratory tool, able to give a global overview of LD associated with diverse evolutionary phenomena and identify loci potentially involved. LDna does not require a linkage map or reference genome, so it is applicable to any population-genomic data set, making it especially valuable for nonmodel species.
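
The network representation described above, with loci as vertices and pairwise LD as connections, can be illustrated with a minimal sketch. This is not the LDna implementation, and it does not use LDna's own cluster-identification procedure; it only thresholds pairwise r² and reports connected components. The function names, the threshold value and the toy haplotypes are hypothetical.

```python
from itertools import combinations


def r_squared(a, b):
    """r^2 between two biallelic loci, given 0/1 alleles per haplotype."""
    n = len(a)
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_ab = sum(x * y for x, y in zip(a, b)) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    # A monomorphic locus carries no LD information.
    return 0.0 if denom == 0 else d * d / denom


def ld_clusters(haplotypes, threshold):
    """Connect loci whose r^2 meets the threshold; return connected components."""
    loci = list(haplotypes)
    adjacency = {locus: set() for locus in loci}
    for u, v in combinations(loci, 2):
        if r_squared(haplotypes[u], haplotypes[v]) >= threshold:
            adjacency[u].add(v)
            adjacency[v].add(u)

    seen, clusters = set(), []
    for start in loci:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


if __name__ == "__main__":
    # Hypothetical data: six haplotypes scored at three loci.
    toy = {
        "L1": [0, 0, 1, 1, 0, 1],
        "L2": [0, 0, 1, 1, 0, 1],
        "L3": [0, 1, 0, 1, 0, 1],
    }
    print(ld_clusters(toy, threshold=0.8))
```

Because only pairwise LD between loci is needed, a representation of this kind requires neither a linkage map nor a reference genome, which matches the point made in the abstract.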

Journal ArticleDOI
TL;DR: The ability to standardise genotype scoring combined with low error rates promises to constitute a major technological advancement and could establish SNPs as a standard marker for future wildlife monitoring.
Abstract: Noninvasive genetics based on microsatellite markers has become an indispensable tool for wildlife monitoring and conservation research over the past decades. However, microsatellites have several drawbacks, such as the lack of standardisation between laboratories and high error rates. Here, we propose an alternative single-nucleotide polymorphism (SNP)-based marker system for noninvasively collected samples, which promises to solve these problems. Using nanofluidic SNP genotyping technology (Fluidigm), we genotyped 158 wolf samples (tissue, scats, hairs, urine) for 192 SNP loci selected from the Affymetrix v2 Canine SNP Array. We carefully selected an optimised final set of 96 SNPs (and discarded the worse half), based on assay performance and reliability. We found rates of missing data in this SNP set of <10% and genotyping error of ~1%, which improves genotyping accuracy by nearly an order of magnitude when compared to published data for other marker types. Our approach provides a tool for rapid and cost-effective genotyping of noninvasively collected wildlife samples. The ability to standardise genotype scoring combined with low error rates promises to constitute a major technological advancement and could establish SNPs as a standard marker for future wildlife monitoring.