Journal ArticleDOI

Minimap2: pairwise alignment for nucleotide sequences

Heng Li
15 Sep 2018-Bioinformatics (Bioinformatics)-Vol. 34, Iss: 18, pp 3094-3100
TL;DR: Minimap2 is a general-purpose alignment program that maps DNA or long mRNA sequences against a large reference database. It is 3-4 times as fast as mainstream short-read mappers at comparable accuracy, and ≥30 times faster than long-read genomic or cDNA mappers at higher accuracy, surpassing most aligners specialized in one type of alignment.
Abstract: Motivation: Recent advances in sequencing technologies promise ultra-long reads of ∼100 kb on average, full-length mRNA or cDNA reads in high throughput and genomic contigs over 100 Mb in length. Existing alignment programs are unable to process such data at scale or do so inefficiently, which calls for the development of new alignment algorithms. Results: Minimap2 is a general-purpose alignment program to map DNA or long mRNA sequences against a large reference database. It works with accurate short reads of ≥100 bp in length, ≥1 kb genomic reads at an error rate of ∼15%, full-length noisy Direct RNA or cDNA reads and assembly contigs or closely related full chromosomes of hundreds of megabases in length. Minimap2 does split-read alignment, employs a concave gap cost for long insertions and deletions and introduces new heuristics to reduce spurious alignments. It is 3-4 times as fast as mainstream short-read mappers at comparable accuracy, and is ≥30 times faster than long-read genomic or cDNA mappers at higher accuracy, surpassing most aligners specialized in one type of alignment. Availability and implementation: https://github.com/lh3/minimap2. Supplementary information: Supplementary data are available at Bioinformatics online.
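The "concave gap cost for long insertions and deletions" mentioned in the abstract can be illustrated with a small sketch. One common way to approximate a concave cost is to take the minimum of two affine functions: one with a cheap opening and expensive extension, the other with an expensive opening and cheap extension. The function names and parameter values below (`q`, `e`, `q2`, `e2`) are illustrative assumptions and are not taken from this page.

```python
def affine_gap_cost(length, q=4, e=2):
    """Ordinary affine gap cost: open penalty q plus e per gap base.

    q and e are illustrative values (assumptions), not quoted from the page.
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    return 0 if length == 0 else q + length * e


def concave_gap_cost(length, q=4, e=2, q2=24, e2=1):
    """Two-piece affine gap cost: the minimum of two affine functions.

    Short gaps are charged by (q, e); long gaps switch to (q2, e2),
    whose per-base extension is cheaper. The resulting cost is concave
    in the gap length. All four parameters are illustrative assumptions.
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0
    return min(q + length * e, q2 + length * e2)


if __name__ == "__main__":
    # Compare the two cost models on short and long gaps.
    for gap in (1, 10, 20, 100, 1000):
        print(gap, affine_gap_cost(gap), concave_gap_cost(gap))
```

With these assumed values the two pieces cross at a gap length of 20, so shorter gaps are scored exactly as under the plain affine model, while longer gaps (for example a 100 bp deletion) are penalized far less, which is what makes long insertions and deletions alignable.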
Citations
Proceedings ArticleDOI
28 Mar 2022
TL;DR: This work improves on GenASM, a recent algorithm for genomic sequence alignment, by significantly reducing its memory footprint and bandwidth requirement, and efficiently parallelizes the algorithm for GPUs.
Abstract: We improve on GenASM, a recent algorithm for genomic sequence alignment, by significantly reducing its memory footprint and bandwidth requirement. Our algorithmic improvements reduce the memory footprint by 24× and the number of memory accesses by 12×. We efficiently parallelize the algorithm for GPUs, achieving a 4.1× speedup over a CPU implementation of the same algorithm, a 62× speedup over minimap2's CPU-based KSW2 and a 7.2× speedup over the CPU-based Edlib for long reads.

3 citations

Posted ContentDOI
12 Aug 2020-bioRxiv
TL;DR: PuffAligner is introduced, a fast, accurate and versatile aligner built on top of the Pufferfish index that strikes a desirable balance with respect to the time, space, and accuracy tradeoffs made by different alignment tools.
Abstract: Motivation: Sequence alignment is one of the first steps in many modern genomic analyses, such as variant detection, transcript abundance estimation and metagenomic profiling. Unfortunately, it is often a computationally expensive procedure. As the quantity of data and wealth of different assays and applications continue to grow, the need for accurate and fast alignment tools persists. Results: In this paper, we introduce PuffAligner, a fast, accurate and versatile aligner built on top of the Pufferfish index. PuffAligner is able to produce highly sensitive alignments, similar to those of Bowtie2, but much more quickly. While exhibiting similar speed to the ultrafast STAR aligner, PuffAligner requires considerably less memory to construct its index and align reads. PuffAligner strikes a desirable balance with respect to the time, space, and accuracy tradeoffs made by different alignment tools, and provides a promising foundation on which to test new alignment ideas over large collections of sequences. Availability: PuffAligner is free and open-source software. It is implemented in C++14 and can be obtained from https://github.com/COMBINE-lab/pufferfish/tree/cigar-strings

3 citations

Journal ArticleDOI
TL;DR: In this article, the authors describe the semi-automated generation of a de novo TE library using the newly developed EDTA pipeline and DeepTE classifier in a non-model teleost (Corydoras fulleri).
Abstract: Transposable elements (TEs) are significant genomic components which can be detected either through sequence homology against existing databases or de novo, with the latter potentially reducing the risk of underestimating TE abundance. Here, we describe the semi-automated generation of a de novo TE library using the newly developed EDTA pipeline and DeepTE classifier in a non-model teleost (Corydoras fulleri). Using both genomic and transcriptomic data, we assess this de novo pipeline's performance across four TE-based metrics: (i) abundance, (ii) composition, (iii) fragmentation, and (iv) age distributions. We then compare the results to those found when using a curated teleost library (Danio rerio). We identify quantitative differences in these metrics and highlight how TE library choice can have major impacts on TE-based estimates in non-model species.

3 citations

Proceedings ArticleDOI
21 Feb 2021
TL;DR: In this article, the authors present an FPGA-based accelerator for Minimap2 with a focus on its operation for short reads, built around a hardware block that can be integrated into a parallelizable architecture.
Abstract: Recent advances in DNA sequencing technologies include the generation of long reads, with lengths from tens to hundreds of thousands of base pairs each. The state-of-the-art algorithm Minimap2 is able to process these data as well as the most commonly used short reads, but is memory- and computationally intensive: to process a human genome, its running time can reach several hours on powerful machines. As a means of making this technology more available to hospitals and clinics, hardware accelerators have addressed these shortcomings for many short-read mappers in the past, and their application to this new generation of software is an area of active research. Here we present an FPGA-based accelerator for Minimap2 with a focus on its operation for short reads. We gathered profiling data to determine the algorithm's bottleneck. We generated a hardware block for one recurrent loop in the critical function that can be integrated into a parallelizable architecture. Execution with short reads has shown a reduction of 155x in terms of required clock cycles in the accelerated section. Data transfer overhead is measured and discussed.

3 citations

Journal ArticleDOI
TL;DR: In this article, the authors investigated the distance at which paired 5'-overhanging SSBs are mutagenic and which DNA repair pathways are essential for insertion formation in Arabidopsis thaliana.
Abstract: In nature, single-strand breaks (SSBs) in DNA occur more frequently (by orders of magnitude) than double-strand breaks (DSBs). SSBs induced by the CRISPR/Cas9 nickase at a distance of 50-100 bp on opposite strands are highly mutagenic, leading to insertions/deletions (InDels), with insertions mainly occurring as direct tandem duplications. As short tandem repeats are overrepresented in plant genomes, this mechanism seems to be important for genome evolution. We investigated the distance at which paired 5'-overhanging SSBs are mutagenic and which DNA repair pathways are essential for insertion formation in Arabidopsis thaliana. We were able to detect InDel formation up to a distance of 250 bp, although with much reduced efficiency. Surprisingly, the loss of the classical nonhomologous end joining (NHEJ) pathway factors KU70 or DNA ligase 4 completely abolished tandem repeat formation. The microhomology-mediated NHEJ factor POLQ was required only for patch-like insertions, which are well-known from DSB repair as templated insertions from ectopic sites. As SSBs can also be repaired using homology, we furthermore asked whether the classical homologous recombination (HR) pathway is involved in this process in plants. The fact that RAD54 is not required for homology-mediated SSB repair demonstrates that the mechanisms for DSB- and SSB-induced HR differ in plants.

3 citations

References
Journal ArticleDOI
TL;DR: A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original.
Abstract: The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.

70,111 citations

Journal ArticleDOI
TL;DR: SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant calling and alignment viewing, and thus provides universal tools for processing read alignments.
Abstract: Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments. Availability: http://samtools.sourceforge.net Contact: [email protected]

45,957 citations

Journal ArticleDOI
TL;DR: Burrows-Wheeler Alignment tool (BWA) is implemented, a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps.
Abstract: Motivation: The enormous amount of short reads generated by the new DNA sequencing technologies calls for the development of fast and accurate read alignment programs. A first generation of hash table-based methods has been developed, including MAQ, which is accurate, feature rich and fast enough to align short reads from a single individual. However, MAQ does not support gapped alignment for single-end reads, which makes it unsuitable for alignment of longer reads where indels may occur frequently. The speed of MAQ is also a concern when the alignment is scaled up to the resequencing of hundreds of individuals. Results: We implemented Burrows-Wheeler Alignment tool (BWA), a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps. BWA supports both base space reads, e.g. from Illumina sequencing machines, and color space reads from AB SOLiD machines. Evaluations on both simulated and real data suggest that BWA is ~10–20× faster than MAQ, while achieving similar accuracy. In addition, BWA outputs alignment in the new standard SAM (Sequence Alignment/Map) format. Variant calling and other downstream analyses after the alignment can be achieved with the open source SAMtools software package. Availability: http://maq.sourceforge.net Contact: [email protected]

43,862 citations

Journal ArticleDOI
TL;DR: Bowtie 2 combines the strengths of the full-text minute index with the flexibility and speed of hardware-accelerated dynamic programming algorithms to achieve a combination of high speed, sensitivity and accuracy.
Abstract: As the rate of sequencing increases, greater throughput is demanded from read aligners. The full-text minute index is often used to make alignment very fast and memory-efficient, but the approach is ill-suited to finding longer, gapped alignments. Bowtie 2 combines the strengths of the full-text minute index with the flexibility and speed of hardware-accelerated dynamic programming algorithms to achieve a combination of high speed, sensitivity and accuracy.

37,898 citations


"Minimap2: pairwise alignment for nu..." refers to this paper for background or methods

  • ...Most of them were five times as slow as mainstream short-read aligners (Langmead and Salzberg, 2012; Li, 2013) in terms of the number of bases mapped per second....

  • ...We evaluated minimap2 along with Bowtie2 (v2.3.3; Langmead and Salzberg 2012), BWA-MEM and SNAP (v1....

Journal ArticleDOI
TL;DR: The Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure outperforms other aligners by a factor of >50 in mapping speed.
Abstract: Motivation: Accurate alignment of high-throughput RNA-seq data is a challenging and yet unsolved problem because of the non-contiguous transcript structure, relatively short read lengths and constantly increasing throughput of the sequencing technologies. Currently available RNA-seq aligners suffer from high mapping error rates, low mapping speed, read length limitation and mapping biases. Results: To align our large (>80 billion reads) ENCODE Transcriptome RNA-seq dataset, we developed the Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure. STAR outperforms other aligners by a factor of >50 in mapping speed, aligning to the human genome 550 million 2 × 76 bp paired-end reads per hour on a modest 12-core server, while at the same time improving alignment sensitivity and precision. In addition to unbiased de novo detection of canonical junctions, STAR can discover non-canonical splices and chimeric (fusion) transcripts, and is also capable of mapping full-length RNA sequences. Using Roche 454 sequencing of reverse transcription polymerase chain reaction amplicons, we experimentally validated 1960 novel intergenic splice junctions with an 80-90% success rate, corroborating the high precision of the STAR mapping strategy. Availability and implementation: STAR is implemented as standalone C++ code. STAR is free open source software distributed under the GPLv3 license and can be downloaded from http://code.google.com/p/rna-star/.

30,684 citations