Journal Article

FLASH: Fast Length Adjustment of Short Reads to Improve Genome Assemblies

01 Nov 2011-Bioinformatics (Oxford University Press)-Vol. 27, Iss: 21, pp 2957-2963
TL;DR: FLASH is a fast computational tool that extends the length of short reads by overlapping paired-end reads from fragment libraries that are sufficiently short. When FLASH was used to extend reads prior to assembly, the resulting assemblies had substantially greater N50 lengths for both contigs and scaffolds.
Abstract: Motivation: Next-generation sequencing technologies generate very large numbers of short reads. Even with very deep genome coverage, short read lengths cause problems in de novo assemblies. The use of paired-end libraries with a fragment size shorter than twice the read length provides an opportunity to generate much longer reads by overlapping and merging read pairs before assembling a genome. Results: We present FLASH, a fast computational tool to extend the length of short reads by overlapping paired-end reads from fragment libraries that are sufficiently short. We tested the correctness of the tool on one million simulated read pairs, and we then applied it as a pre-processor for genome assemblies of Illumina reads from the bacterium Staphylococcus aureus and human chromosome 14. FLASH correctly extended and merged reads >99% of the time on simulated reads with an error rate of <1%. With adequately set parameters, FLASH correctly merged reads over 90% of the time even when the reads contained up to 5% errors. When FLASH was used to extend reads prior to assembly, the resulting assemblies had substantially greater N50 lengths for both contigs and scaffolds. Availability and Implementation: The FLASH system is implemented in C and is freely available as open-source code at http://www.cbcb.umd.edu/software/flash. Contact: t.magoc@gmail.com
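The merging idea in the abstract can be illustrated with a minimal Python sketch. This is not FLASH's implementation (FLASH is written in C, and its scoring is not described in the abstract); the function names, the `min_overlap` and `max_mismatch_ratio` parameters, and their default values are assumptions chosen for illustration. The sketch tries every candidate overlap between read 1 and the reverse complement of read 2 and keeps the one with the lowest mismatch ratio.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def merge_pair(read1, read2, min_overlap=10, max_mismatch_ratio=0.25):
    """Merge a paired-end read pair into one longer read.

    Hypothetical helper: returns the merged sequence, or None when no
    overlap of at least min_overlap bases passes the mismatch threshold.
    """
    # Read 2 comes from the opposite strand, so bring it onto read 1's strand.
    mate = reverse_complement(read2)

    best = None  # (mismatch_ratio, overlap_length)
    for olen in range(min_overlap, min(len(read1), len(mate)) + 1):
        tail = read1[-olen:]  # end of read 1
        head = mate[:olen]    # start of the reoriented read 2
        mismatches = sum(a != b for a, b in zip(tail, head))
        ratio = mismatches / olen
        # Strict '<' means ties keep the shortest overlap seen first.
        if ratio <= max_mismatch_ratio and (best is None or ratio < best[0]):
            best = (ratio, olen)

    if best is None:
        return None

    olen = best[1]
    # Simplification: inside the overlap, read 1's bases are kept as-is.
    # A real tool would also consult base quality scores.
    return read1 + mate[olen:]
```

For a fragment shorter than twice the read length, `merge_pair` should reconstruct the whole fragment from its two end reads; pairs with no acceptable overlap are left unmerged (`None`).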

Citations
Journal Article
TL;DR: The PEAR software for merging raw Illumina paired-end reads from target fragments of varying length evaluates all possible paired-end read overlaps, does not require the target fragment size as input, and implements a statistical test for minimizing false-positive results.
Abstract: Motivation: The Illumina paired-end sequencing technology can generate reads from both ends of target DNA fragments, which can subsequently be merged to increase the overall read length. There already exist tools for merging these paired-end reads when the target fragments are equally long. However, when fragment lengths vary and, in particular, when either the fragment size is shorter than a single-end read, or longer than twice the size of a single-end read, most state-of-the-art mergers fail to generate reliable results. Therefore, a robust tool is needed to merge paired-end reads that exhibit varying overlap lengths because of varying target fragment lengths. Results: We present the PEAR software for merging raw Illumina paired-end reads from target fragments of varying length. The program evaluates all possible paired-end read overlaps and does not require the target fragment size as input. It also implements a statistical test for minimizing false-positive results. Tests on simulated and empirical data show that PEAR consistently generates highly accurate merged paired-end reads. A highly optimized implementation allows for merging millions of paired-end reads within a few minutes on a standard desktop computer. On multi-core architectures, the parallel version of PEAR shows linear speedups compared with the sequential version of PEAR. Availability and implementation: PEAR is implemented in C and uses POSIX threads. It is freely available at http://www.exelixis-lab.org/web/software/pear.

3,270 citations

Journal Article
05 Oct 2012-Science
TL;DR: High-throughput sequencing revealed that inflammation modifies gut microbial composition in colitis-susceptible interleukin-10–deficient (Il10−/−) mice, suggesting that in mice, colitis can promote tumorigenesis by altering microbial composition and inducing the expansion of microorganisms with genotoxic capabilities.
Abstract: Inflammation alters host physiology to promote cancer, as seen in colitis-associated colorectal cancer (CRC). Here, we identify the intestinal microbiota as a target of inflammation that affects the progression of CRC. High-throughput sequencing revealed that inflammation modifies gut microbial composition in colitis-susceptible interleukin-10-deficient (Il10(-/-)) mice. Monocolonization with the commensal Escherichia coli NC101 promoted invasive carcinoma in azoxymethane (AOM)-treated Il10(-/-) mice. Deletion of the polyketide synthase (pks) genotoxic island from E. coli NC101 decreased tumor multiplicity and invasion in AOM/Il10(-/-) mice, without altering intestinal inflammation. Mucosa-associated pks(+) E. coli were found in a significantly high percentage of inflammatory bowel disease and CRC patients. This suggests that in mice, colitis can promote tumorigenesis by altering microbial composition and inducing the expansion of microorganisms with genotoxic capabilities.

1,720 citations

Journal Article
05 Jul 2013-Science
TL;DR: Low-error sequencing data suggest that initial microbial colonizers of infant guts could persist over the life span of an individual, and members of Bacteroidetes and Actinobacteria are significantly more stable components than the population average.
Abstract: A low-error 16S ribosomal RNA amplicon sequencing method, in combination with whole-genome sequencing of >500 cultured isolates, was used to characterize bacterial strain composition in the fecal microbiota of 37 U.S. adults sampled for up to 5 years. Microbiota stability followed a power-law function, which when extrapolated suggests that most strains in an individual are residents for decades. Shared strains were recovered from family members but not from unrelated individuals. Sampling of individuals who consumed a monotonous liquid diet for up to 32 weeks indicated that changes in strain composition were better predicted by changes in weight than by differences in sampling interval. This combination of stability and responsiveness to physiologic change confirms the potential of the gut microbiota as a diagnostic tool and therapeutic target.

1,641 citations

Journal Article
29 Apr 2016-Science
TL;DR: Stool consistency showed the largest effect size, whereas medication explained the largest total variance and interacted with other covariate-microbiota associations; sixty-nine clinical and questionnaire-based covariates were associated with microbiota compositional variation, with a 92% replication rate.
Abstract: Fecal microbiome variation in the average, healthy population has remained under-investigated. Here, we analyzed two independent, extensively phenotyped cohorts: the Belgian Flemish Gut Flora Project (FGFP; discovery cohort; N = 1106) and the Dutch LifeLines-DEEP study (LLDeep; replication; N = 1135). Integration with global data sets (N combined = 3948) revealed a 14-genera core microbiota, but the 664 identified genera still underexplore total gut diversity. Sixty-nine clinical and questionnaire-based covariates were found associated to microbiota compositional variation with a 92% replication rate. Stool consistency showed the largest effect size, whereas medication explained largest total variance and interacted with other covariate-microbiota associations. Early-life events such as birth mode were not reflected in adult microbiota composition. Finally, we found that proposed disease marker genera associated to host covariates, urging inclusion of the latter in study design.

1,562 citations

Journal Article
10 Dec 2015-Nature
TL;DR: A unified signature of gut microbiome shifts in T2D with a depletion of butyrate-producing taxa is reported, highlighting the need to disentangle gut microbiota signatures of specific human diseases from those of medication.
Abstract: In recent years, several associations between common chronic human disorders and altered gut microbiome composition and function have been reported. In most of these reports, treatment regimens were not controlled for and conclusions could thus be confounded by the effects of various drugs on the microbiota, which may obscure microbial causes, protective factors or diagnostically relevant signals. Our study addresses disease and drug signatures in the human gut microbiome of type 2 diabetes mellitus (T2D). Two previous quantitative gut metagenomics studies of T2D patients that were unstratified for treatment yielded divergent conclusions regarding its associated gut microbial dysbiosis. Here we show, using 784 available human gut metagenomes, how antidiabetic medication confounds these results, and analyse in detail the effects of the most widely used antidiabetic drug metformin. We provide support for microbial mediation of the therapeutic effects of metformin through short-chain fatty acid production, as well as for potential microbiota-mediated mechanisms behind known intestinal adverse effects in the form of a relative increase in abundance of Escherichia species. Controlling for metformin treatment, we report a unified signature of gut microbiome shifts in T2D with a depletion of butyrate-producing taxa. These in turn cause functional microbiome shifts, in part alleviated by metformin-induced changes. Overall, the present study emphasizes the need to disentangle gut microbiota signatures of specific human diseases from those of medication.

1,473 citations

References
Journal Article
TL;DR: SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments.
Abstract: Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments. Availability: http://samtools.sourceforge.net Contact: [email protected]

45,957 citations


"FLASH: Fast Length Adjustment of Sh..." refers to methods in this paper

  • ...The paired-end reads were generated from fragments with a mean length of 180 bp, normally distributed with an SD of 20 bp. Error-free reads were generated using wgsim from the SAMtools package (Li et al., 2009)....
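The simulation parameters quoted above determine how much the two reads of a pair overlap. The following Python sketch works through that arithmetic under stated assumptions: the 100 bp read length and the 10 bp minimum overlap are illustrative values that do not appear in the excerpt, and the helper names are hypothetical. It treats the overlap as twice the read length minus the fragment length and ignores fragments shorter than a single read.

```python
from statistics import NormalDist


def expected_overlap(read_len, mean_fragment):
    """Overlap between the two reads of a pair for an average fragment."""
    return 2 * read_len - mean_fragment


def fraction_mergeable(read_len, mean_fragment, sd_fragment, min_overlap):
    """Fraction of fragments whose reads overlap by at least min_overlap.

    Assumes normally distributed fragment lengths, as in the excerpt.
    """
    # A fragment is mergeable when 2 * read_len - fragment >= min_overlap.
    longest_ok = 2 * read_len - min_overlap
    return NormalDist(mu=mean_fragment, sigma=sd_fragment).cdf(longest_ok)


# Fragment mean 180 bp and SD 20 bp come from the excerpt;
# read length 100 bp and minimum overlap 10 bp are assumptions.
print(expected_overlap(100, 180))
print(round(fraction_mergeable(100, 180, 20, 10), 3))
```

Under these assumptions the average pair overlaps by 20 bp, and fragments up to 190 bp (half a standard deviation above the mean) still leave at least 10 bp of overlap to merge on.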


Journal Article
TL;DR: Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches, and multiple processor cores can be used simultaneously to achieve even greater alignment speeds.
Abstract: Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes. For the human genome, Burrows-Wheeler indexing allows Bowtie to align more than 25 million reads per CPU hour with a memory footprint of approximately 1.3 gigabytes. Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches. Multiple processor cores can be used simultaneously to achieve even greater alignment speeds. Bowtie is open source http://bowtie.cbcb.umd.edu.

20,335 citations

Journal Article
TL;DR: The newest version of MUMmer easily handles comparisons of large eukaryotic genomes at varying evolutionary distances, as demonstrated by applications to multiple genomes.
Abstract: The newest version of MUMmer easily handles comparisons of large eukaryotic genomes at varying evolutionary distances, as demonstrated by applications to multiple genomes. Two new graphical viewing tools provide alternative ways to analyze genome alignments. The new system is the first version of MUMmer to be released as open-source software. This allows other developers to contribute to the code base and freely redistribute the code. The MUMmer sources are available at http://www.tigr.org/software/mummer.

4,886 citations

Journal Article
TL;DR: The development of this de novo short read assembly method creates new opportunities for building reference sequences and carrying out accurate analyses of unexplored genomes in a cost-effective way.
Abstract: Next-generation massively parallel DNA sequencing technologies provide ultrahigh throughput at a substantially lower unit data cost; however, the data are very short read length sequences, making de novo assembly extremely challenging. Here, we describe a novel method for de novo assembly of large genomes from short read sequences. We successfully assembled both the Asian and African human genome sequences, achieving an N50 contig size of 7.4 and 5.9 kilobases (kb) and scaffold of 446.3 and 61.9 kb, respectively. The development of this de novo short read assembly method creates new opportunities for building reference sequences and carrying out accurate analyses of unexplored genomes in a cost-effective way.

2,760 citations

Journal Article
TL;DR: ALLPATHS-LG, an algorithm for genome assembly, is applied to massively parallel DNA sequence data from the human and mouse genomes generated on the Illumina platform; the resulting draft assemblies have good accuracy, short-range contiguity, long-range connectivity, and coverage of the genome.
Abstract: Massively parallel DNA sequencing technologies are revolutionizing genomics by making it possible to generate billions of relatively short (~100-base) sequence reads at very low cost. Whereas such data can be readily used for a wide range of biomedical applications, it has proven difficult to use them to generate high-quality de novo genome assemblies of large, repeat-rich vertebrate genomes. To date, the genome assemblies generated from such data have fallen far short of those obtained with the older (but much more expensive) capillary-based sequencing approach. Here, we report the development of an algorithm for genome assembly, ALLPATHS-LG, and its application to massively parallel DNA sequence data from the human and mouse genomes, generated on the Illumina platform. The resulting draft genome assemblies have good accuracy, short-range contiguity, long-range connectivity, and coverage of the genome. In particular, the base accuracy is high (≥99.95%) and the scaffold sizes (N50 size = 11.5 Mb for human and 7.2 Mb for mouse) approach those obtained with capillary-based sequencing. The combination of improved sequencing technology and improved computational methods should now make it possible to increase dramatically the de novo sequencing of large genomes. The ALLPATHS-LG program is available at http://www.broadinstitute.org/science/programs/genome-biology/crd.

1,616 citations
