scispace - formally typeset
Open accessJournal ArticleDOI: 10.1093/BIOINFORMATICS/BTU170

Trimmomatic: a flexible trimmer for Illumina sequence data

01 Aug 2014-Bioinformatics (Oxford University Press)-Vol. 30, Iss: 15, pp 2114-2120
Abstract: Motivation: Although many next-generation sequencing (NGS) read preprocessing tools already existed, we could not find any tool or combination of tools that met our requirements in terms of flexibility, correct handling of paired-end data and high performance. We have developed Trimmomatic as a more flexible and efficient preprocessing tool, which could correctly handle paired-end data. Results: The value of NGS read preprocessing is demonstrated for both reference-based and reference-free tasks. Trimmomatic is shown to produce output that is at least competitive with, and in many cases superior to, that produced by other tools, in all scenarios tested. Availability and implementation: Trimmomatic is licensed under GPL V3. It is cross-platform (Java 1.5+ required) and available at Contact: ed.nehcaa-htwr.1oib@ledasu Supplementary information: Supplementary data are available at Bioinformatics online. more


Open accessJournal ArticleDOI: 10.1093/BIOINFORMATICS/BTU638
15 Jan 2015-Bioinformatics
Abstract: Motivation: A large choice of tools exists for many standard tasks in the analysis of high-throughput sequencing (HTS) data. However, once a project deviates from standard workflows, custom scripts are needed. Results: We present HTSeq, a Python library to facilitate the rapid development of such scripts. HTSeq offers parsers for many common data formats in HTS projects, as well as classes to represent data, such as genomic coordinates, sequences, sequencing reads, alignments, gene model information and variant calls, and provides data structures that allow for querying via genomic coordinates. We also present htseq-count, a tool developed with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with genes. Availability and implementation: HTSeq is released as an opensource software under the GNU General Public Licence and available from or from the Python Package Index at Contact: more

11,833 Citations

Open accessJournal ArticleDOI: 10.1038/S41586-020-2008-3
Fan Wu1, Su Zhao2, Bin Yu3, Yan-Mei Chen1  +17 moreInstitutions (4)
03 Feb 2020-Nature
Abstract: Emerging infectious diseases, such as severe acute respiratory syndrome (SARS) and Zika virus disease, present a major threat to public health1–3. Despite intense research efforts, how, when and where new diseases appear are still a source of considerable uncertainty. A severe respiratory disease was recently reported in Wuhan, Hubei province, China. As of 25 January 2020, at least 1,975 cases had been reported since the first patient was hospitalized on 12 December 2019. Epidemiological investigations have suggested that the outbreak was associated with a seafood market in Wuhan. Here we study a single patient who was a worker at the market and who was admitted to the Central Hospital of Wuhan on 26 December 2019 while experiencing a severe respiratory syndrome that included fever, dizziness and a cough. Metagenomic RNA sequencing4 of a sample of bronchoalveolar lavage fluid from the patient identified a new RNA virus strain from the family Coronaviridae, which is designated here ‘WH-Human 1’ coronavirus (and has also been referred to as ‘2019-nCoV’). Phylogenetic analysis of the complete viral genome (29,903 nucleotides) revealed that the virus was most closely related (89.1% nucleotide similarity) to a group of SARS-like coronaviruses (genus Betacoronavirus, subgenus Sarbecovirus) that had previously been found in bats in China5. This outbreak highlights the ongoing ability of viral spill-over from animals to cause severe disease in humans. Phylogenetic and metagenomic analyses of the complete viral genome of a new coronavirus from the family Coronaviridae reveal that the virus is closely related to a group of SARS-like coronaviruses found in bats in China. more

Topics: Coronavirus (62%), Betacoronavirus (59%), Zika virus disease (54%) more

6,266 Citations

Open accessJournal ArticleDOI: 10.1093/BIOINFORMATICS/BTY560
Shifu Chen1, Yanqing Zhou, Yaru Chen, Jia Gu1Institutions (1)
01 Sep 2018-Bioinformatics
Abstract: Motivation Quality control and preprocessing of FASTQ files are essential to providing clean data for downstream analysis. Traditionally, a different tool is used for each operation, such as quality control, adapter trimming and quality filtering. These tools are often insufficiently fast as most are developed using high-level programming languages (e.g. Python and Java) and provide limited multi-threading support. Reading and loading data multiple times also renders preprocessing slow and I/O inefficient. Results We developed fastp as an ultra-fast FASTQ preprocessor with useful quality control and data-filtering features. It can perform quality control, adapter trimming, quality filtering, per-read quality pruning and many other operations with a single scan of the FASTQ data. This tool is developed in C++ and has multi-threading support. Based on our evaluation, fastp is 2-5 times faster than other FASTQ preprocessing tools such as Trimmomatic or Cutadapt despite performing far more operations than similar tools. Availability and implementation The open-source code and corresponding instructions are available at more

Topics: FASTQ format (65%), Preprocessor (50%)

2,501 Citations

Open accessJournal ArticleDOI: 10.1016/J.CELL.2015.04.044
Allon M. Klein1, Linas Mazutis1, Linas Mazutis2, Ilke Akartuna1  +6 moreInstitutions (2)
21 May 2015-Cell
Abstract: It has long been the dream of biologists to map gene expression at the single-cell level With such data one might track heterogeneous cell sub-populations, and infer regulatory relationships between genes and pathways Recently, RNA sequencing has achieved single-cell resolution What is limiting is an effective way to routinely isolate and process large numbers of individual cells for quantitative in-depth sequencing We have developed a high-throughput droplet-microfluidic approach for barcoding the RNA from thousands of individual cells for subsequent analysis by next-generation sequencing The method shows a surprisingly low noise profile and is readily adaptable to other sequencing-based assays We analyzed mouse embryonic stem cells, revealing in detail the population structure and the heterogeneous onset of differentiation after leukemia inhibitory factor (LIF) withdrawal The reproducibility of these high-throughput single-cell data allowed us to deconstruct cell populations and infer gene expression relationships VIDEO ABSTRACT more

2,293 Citations

Open accessJournal ArticleDOI: 10.1186/S12915-014-0087-Z
12 Nov 2014-BMC Biology
Abstract: The study of microbial communities has been revolutionised in recent years by the widespread adoption of culture independent analytical techniques such as 16S rRNA gene sequencing and metagenomics. One potential confounder of these sequence-based approaches is the presence of contamination in DNA extraction kits and other laboratory reagents. In this study we demonstrate that contaminating DNA is ubiquitous in commonly used DNA extraction kits and other laboratory reagents, varies greatly in composition between different kits and kit batches, and that this contamination critically impacts results obtained from samples containing a low microbial biomass. Contamination impacts both PCR-based 16S rRNA gene surveys and shotgun metagenomics. We provide an extensive list of potential contaminating genera, and guidelines on how to mitigate the effects of contamination. These results suggest that caution should be advised when applying sequence-based techniques to the study of microbiota present in low biomass environments. Concurrent sequencing of negative control samples is strongly advised. more

Topics: DNA Contamination (60%), Metagenomics (57%)

1,968 Citations


Open accessJournal ArticleDOI: 10.1093/BIOINFORMATICS/BTP324
Heng Li1, Richard Durbin1Institutions (1)
01 Jul 2009-Bioinformatics
Abstract: Motivation: The enormous amount of short reads generated by the new DNA sequencing technologies call for the development of fast and accurate read alignment programs. A first generation of hash table-based methods has been developed, including MAQ, which is accurate, feature rich and fast enough to align short reads from a single individual. However, MAQ does not support gapped alignment for single-end reads, which makes it unsuitable for alignment of longer reads where indels may occur frequently. The speed of MAQ is also a concern when the alignment is scaled up to the resequencing of hundreds of individuals. Results: We implemented Burrows-Wheeler Alignment tool (BWA), a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps. BWA supports both base space reads, e.g. from Illumina sequencing machines, and color space reads from AB SOLiD machines. Evaluations on both simulated and real data suggest that BWA is ~10–20× faster than MAQ, while achieving similar accuracy. In addition, BWA outputs alignment in the new standard SAM (Sequence Alignment/Map) format. Variant calling and other downstream analyses after the alignment can be achieved with the open source SAMtools software package. Availability: Contact: [email protected] more

Topics: Hybrid genome assembly (54%), Sequence assembly (53%), 2 base encoding (52%) more

35,234 Citations

Open accessJournal ArticleDOI: 10.1038/NMETH.1923
01 Apr 2012-Nature Methods
Abstract: As the rate of sequencing increases, greater throughput is demanded from read aligners. The full-text minute index is often used to make alignment very fast and memory-efficient, but the approach is ill-suited to finding longer, gapped alignments. Bowtie 2 combines the strengths of the full-text minute index with the flexibility and speed of hardware-accelerated dynamic programming algorithms to achieve a combination of high speed, sensitivity and accuracy. more

27,973 Citations

Journal ArticleDOI: 10.14806/EJ.17.1.200
Marcel Martin1Institutions (1)
02 May 2011-EMBnet.journal
Abstract: When small RNA is sequenced on current sequencing machines, the resulting reads are usually longer than the RNA and therefore contain parts of the 3' adapter. That adapter must be found and removed error-tolerantly from each read before read mapping. Previous solutions are either hard to use or do not offer required features, in particular support for color space data. As an easy to use alternative, we developed the command-line tool cutadapt, which supports 454, Illumina and SOLiD (color space) data, offers two adapter trimming algorithms, and has other useful features. Cutadapt, including its MIT-licensed source code, is available for download at more

Topics: Adapter (genetics) (50%)

13,576 Citations

Open accessJournal ArticleDOI: 10.1101/GR.074492.107
Daniel R. Zerbino1, Ewan Birney1Institutions (1)
01 May 2008-Genome Research
Abstract: We have developed a new set of algorithms, collectively called "Velvet," to manipulate de Bruijn graphs for genomic sequence assembly. A de Bruijn graph is a compact representation based on short words (k-mers) that is ideal for high coverage, very short read (25-50 bp) data sets. Applying Velvet to very short reads and paired-ends information only, one can produce contigs of significant length, up to 50-kb N50 length in simulations of prokaryotic data and 3-kb N50 on simulated mammalian BACs. When applied to real Solexa data sets without read pairs, Velvet generated contigs of approximately 8 kb in a prokaryote and 2 kb in a mammalian BAC, in close agreement with our simulated results without read-pair information. Velvet represents a new approach to assembly that can leverage very short reads in combination with read pairs to produce useful assemblies. more

Topics: De Bruijn sequence (59%), De Bruijn graph (55%), Assembly software (54%) more

8,867 Citations

Journal ArticleDOI: 10.1146/ANNUREV.GENOM.9.081307.164359
Elaine R. Mardis1Institutions (1)
Abstract: Recent scientific discoveries that resulted from the application of nextgeneration DNA sequencing technologies highlight the striking impact of these massively parallel platforms on genetics. These new methods have expanded previously focused readouts from a variety of DNA preparation protocols to a genome-wide scale and have fine-tuned their resolution to single base precision. The sequencing of RNA also has transitioned and now includes full-length cDNA analyses, serial analysis of gene expression (SAGE)-based methods, and noncoding RNA discovery. Next-generation sequencing has also enabled novel applications such as the sequencing of ancient DNA samples, and has substantially widened the scope of metagenomic analysis of environmentally derived samples. Taken together, an astounding potential exists for these technologies to bring enormous change in genetic and biological research and to enhance our fundamental biological knowledge. more

Topics: Massive parallel sequencing (68%), 2 base encoding (63%), Single cell sequencing (61%) more

2,240 Citations

No. of citations received by the Paper in previous years
Network Information