Minimap2: pairwise alignment for nucleotide sequences

doi:10.1093/BIOINFORMATICS/BTY191

Journal Article

Heng Li¹

Broad Institute¹

15 Sep 2018-Bioinformatics-Vol. 34, Iss: 18, pp 3094-3100

TL;DR: Minimap2 is a general-purpose alignment program that maps DNA or long mRNA sequences against a large reference database. It is 3-4 times as fast as mainstream short-read mappers at comparable accuracy and ≥30 times faster than long-read genomic or cDNA mappers at higher accuracy, surpassing most aligners specialized in one type of alignment.

Abstract: Motivation: Recent advances in sequencing technologies promise ultra-long reads of ∼100 kb on average, full-length mRNA or cDNA reads in high throughput and genomic contigs over 100 Mb in length. Existing alignment programs are unable or inefficient to process such data at scale, which presses for the development of new alignment algorithms. Results: Minimap2 is a general-purpose alignment program to map DNA or long mRNA sequences against a large reference database. It works with accurate short reads of ≥100 bp in length, ≥1 kb genomic reads at error rate ∼15%, full-length noisy Direct RNA or cDNA reads and assembly contigs or closely related full chromosomes of hundreds of megabases in length. Minimap2 does split-read alignment, employs concave gap cost for long insertions and deletions and introduces new heuristics to reduce spurious alignments. It is 3-4 times as fast as mainstream short-read mappers at comparable accuracy, and is ≥30 times faster than long-read genomic or cDNA mappers at higher accuracy, surpassing most aligners specialized in one type of alignment. Availability and implementation: https://github.com/lh3/minimap2. Supplementary information: Supplementary data are available at Bioinformatics online.
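
The "concave gap cost" mentioned in the abstract can be illustrated with a two-piece affine function, i.e. the minimum of two affine costs, which is one common way to obtain concavity. The sketch below assumes that form. The function names and the numeric parameters (4, 2, 24, 1) are illustrative assumptions and are not stated in the abstract.

```python
def affine_gap_cost(length, gap_open, gap_extend):
    """Cost of a gap of `length` bases under a single affine model."""
    return gap_open + length * gap_extend


def two_piece_gap_cost(length, open1=4, ext1=2, open2=24, ext2=1):
    """Concave gap cost taken as the cheaper of two affine models.

    The first model (cheap to open, expensive to extend) wins for short
    gaps; the second (expensive to open, cheap to extend) wins for long
    gaps. All default values are assumptions chosen for illustration.
    """
    if length <= 0:
        return 0
    return min(
        affine_gap_cost(length, open1, ext1),
        affine_gap_cost(length, open2, ext2),
    )


if __name__ == "__main__":
    for n in (1, 10, 20, 100):
        print(n, two_piece_gap_cost(n))
```

With these assumed parameters the two models cross at a gap length of 20, so longer insertions and deletions are penalized at the lower per-base rate instead of growing at the short-gap rate.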

Citations

Proceedings Article

Algorithmic Improvement and GPU Acceleration of the GenASM Algorithm

Joel Lindegger, Damla Senol Cali, Mohammed Alser, Juan Gómez-Luna, Onur Mutlu

28 Mar 2022

TL;DR: This work improves on GenASM, a recent algorithm for genomic sequence alignment, by significantly reducing its memory footprint and bandwidth requirement, and efficiently parallelizes the algorithm for GPUs.

Abstract: We improve on GenASM, a recent algorithm for genomic sequence alignment, by significantly reducing its memory footprint and bandwidth requirement. Our algorithmic improvements reduce the memory footprint by 24× and the number of memory accesses by 12×. We efficiently parallelize the algorithm for GPUs, achieving a 4.1× speedup over a CPU implementation of the same algorithm, a 62× speedup over minimap2's CPU-based KSW2 and a 7.2× speedup over the CPU-based Edlib for long reads.

3 citations

Posted Content

PuffAligner: An Efficient and Accurate Aligner Based on the Pufferfish Index

Fatemeh Almodaresi, Mohsen Zakeri, Rob Patro

12 Aug 2020-bioRxiv

TL;DR: PuffAligner is introduced, a fast, accurate and versatile aligner built on top of the Pufferfish index that strikes a desirable balance with respect to the time, space, and accuracy tradeoffs made by different alignment tools.

Abstract: Motivation: Sequence alignment is one of the first steps in many modern genomic analyses, such as variant detection, transcript abundance estimation and metagenomic profiling. Unfortunately, it is often a computationally expensive procedure. As the quantity of data and wealth of different assays and applications continue to grow, the need for accurate and fast alignment tools persists. Results: In this paper, we introduce PuffAligner, a fast, accurate and versatile aligner built on top of the Pufferfish index. PuffAligner is able to produce highly sensitive alignments, similar to those of Bowtie2, but much more quickly. While exhibiting similar speed to the ultrafast STAR aligner, PuffAligner requires considerably less memory to construct its index and align reads. PuffAligner strikes a desirable balance with respect to the time, space, and accuracy tradeoffs made by different alignment tools, and provides a promising foundation on which to test new alignment ideas over large collections of sequences. Availability: PuffAligner is free and open-source software. It is implemented in C++14 and can be obtained from https://github.com/COMBINE-lab/pufferfish/tree/cigar-strings

3 citations

Journal Article

Transposable element annotation in non-model species - the benefits of species-specific repeat libraries using semi-automated EDTA and DeepTE de novo pipelines

Ellen A. Bell¹, Christopher L. Butler¹, Claudio Oliveira², Sarah Marburger³, Levi Yant⁴, Martin I. Taylor¹

University of East Anglia¹, Sao Paulo State University², John Innes Centre³, University of Nottingham⁴

18 Aug 2021-Molecular Ecology Resources

TL;DR: In this article, the authors describe the semi-automated generation of a de novo TE library using the newly developed EDTA pipeline and DeepTE classifier in a non-model teleost (Corydoras fulleri).

Abstract: Transposable elements (TEs) are significant genomic components which can be detected either through sequence homology against existing databases or de novo, with the latter potentially reducing the risk of underestimating TE abundance. Here, we describe the semi-automated generation of a de novo TE library using the newly developed EDTA pipeline and DeepTE classifier in a non-model teleost (Corydoras fulleri). Using both genomic and transcriptomic data, we assess this de novo pipeline's performance across four TE based metrics: (i) abundance, (ii) composition, (iii) fragmentation, and (iv) age distributions. We then compare the results to those found when using a curated teleost library (Danio rerio). We identify quantitative differences in these metrics and highlight how TE library choice can have major impacts on TE-based estimates in non-model species.

3 citations

Proceedings Article

Accelerating the base-level alignment step of DNA assembling in Minimap2 Algorithm using FPGA

Carolina Teng¹, Renan W. Achjian¹, Caio C. Braga¹, Marcelo Knörich Zuffo¹, Wang J. Chau¹

University of São Paulo¹

21 Feb 2021

TL;DR: In this article, the authors present an FPGA-based accelerator for Minimap2 with focus on its operation for short reads, which can be integrated into a parallelizable architecture.

Abstract: Recent advances in DNA sequencing technologies include the generation of long reads, with lengths from tens to hundreds of thousands of base pairs each. The state-of-the-art algorithm Minimap2 is able to process these data and the most commonly used short reads, but is memory- and computationally-intensive: to process a human genome, its running times can reach up to several hours on powerful machines. As a means of making this technology more available to hospitals and clinics, hardware accelerators have addressed these shortcomings with many short read mappers in the past, and their application to this new generation of software is an area of active research. Here we present an FPGA-based accelerator for Minimap2 with focus on its operation for short reads. We gathered profiling behaviors to determine the algorithm's bottleneck. We generated a hardware block for one recurrent loop in the critical function that can be integrated into a parallelizable architecture. Execution with short reads has shown a reduction of 155x in terms of required clock cycles in the accelerated section. Data transfer overhead is measured and discussed.

3 citations

Journal Article

Different DNA repair pathways are involved in single-strand break-induced genomic changes in plants.

Felix Wolter¹, Patrick Schindele¹, Natalja Beying¹, Armin Scheben², Holger Puchta¹

Karlsruhe Institute of Technology¹, Cold Spring Harbor Laboratory²

04 Nov 2021-The Plant Cell

TL;DR: In this article, the authors investigated the distance at which paired 5'-overhanging SSBs are mutagenic and which DNA repair pathways are essential for insertion formation in Arabidopsis thaliana.

Abstract: In nature, single-strand breaks (SSBs) in DNA occur more frequently (by orders of magnitude) than double-strand breaks (DSBs). SSBs induced by the CRISPR/Cas9 nickase at a distance of 50-100 bp on opposite strands are highly mutagenic, leading to insertions/deletions (InDels), with insertions mainly occurring as direct tandem duplications. As short tandem repeats are overrepresented in plant genomes, this mechanism seems to be important for genome evolution. We investigated the distance at which paired 5'-overhanging SSBs are mutagenic and which DNA repair pathways are essential for insertion formation in Arabidopsis thaliana. We were able to detect InDel formation up to a distance of 250 bp, although with much reduced efficiency. Surprisingly, the loss of the classical nonhomologous end joining (NHEJ) pathway factors KU70 or DNA ligase 4 completely abolished tandem repeat formation. The microhomology-mediated NHEJ factor POLQ was required only for patch-like insertions, which are well-known from DSB repair as templated insertions from ectopic sites. As SSBs can also be repaired using homology, we furthermore asked whether the classical homologous recombination (HR) pathway is involved in this process in plants. The fact that RAD54 is not required for homology-mediated SSB repair demonstrates that the mechanisms for DSB- and SSB-induced HR differ in plants.

3 citations

References

Journal Article

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Stephen F. Altschul¹, Thomas L. Madden, Alejandro A. Schäffer¹, Jinghui Zhang, Zheng Zhang², Webb Miller², David J. Lipman

National Institutes of Health¹, Pennsylvania State University²

01 Sep 1997-Nucleic Acids Research

TL;DR: A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original.

Abstract: The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.

70,111 citations

Journal Article

The Sequence Alignment/Map format and SAMtools

Heng Li¹, Bob Handsaker², Alec Wysoker², T. J. Fennell², Jue Ruan³, Nils Homer², Gabor T. Marth⁴, Gonçalo R. Abecasis², Richard Durbin¹

Wellcome Trust Sanger Institute¹, University of California, Los Angeles², Chinese Academy of Sciences³, Boston College⁴

01 Aug 2009-Bioinformatics

TL;DR: SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant calling and alignment viewing, and thus provides universal tools for processing read alignments.

Abstract: Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments. Availability: http://samtools.sourceforge.net Contact: [email protected]
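
To make the SAM format concrete, the following Python sketch parses one tab-separated alignment line and computes how many reference bases its CIGAR string covers. It is a minimal illustration, not SAMtools code: the names `SamRecord`, `parse_sam_line` and `reference_span` are hypothetical, the example read is invented, and optional tag fields are ignored.

```python
import re
from typing import NamedTuple

# One CIGAR element: a length followed by an operation character.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the position on the reference sequence.
REF_CONSUMING = set("MDN=X")


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost mapping position
    mapq: int
    cigar: str
    seq: str


def parse_sam_line(line):
    """Parse the mandatory columns of a single SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has 11 mandatory fields")
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[9])


def reference_span(cigar):
    """Number of reference bases covered by a CIGAR string ('*' gives 0)."""
    if cigar == "*":
        return 0
    return sum(int(n) for n, op in CIGAR_RE.findall(cigar)
               if op in REF_CONSUMING)


if __name__ == "__main__":
    # Invented example record for illustration only.
    line = "read1\t0\tchr1\t100\t60\t5M2I3M1D4M\t*\t0\t0\tACGTACGTACGTAC\t*"
    rec = parse_sam_line(line)
    end = rec.pos + reference_span(rec.cigar) - 1
    print(rec.rname, rec.pos, end, bool(rec.flag & 0x10))
```

Insertions and soft clips consume read bases only, so they do not extend the reference span, while deletions and skipped regions do.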

45,957 citations

Journal Article

Fast and accurate short read alignment with Burrows–Wheeler transform

Heng Li¹, Richard Durbin¹

Wellcome Trust Sanger Institute¹

01 Jul 2009-Bioinformatics

TL;DR: Burrows-Wheeler Alignment tool (BWA) is implemented, a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps.

Abstract: Motivation: The enormous amount of short reads generated by the new DNA sequencing technologies call for the development of fast and accurate read alignment programs. A first generation of hash table-based methods has been developed, including MAQ, which is accurate, feature rich and fast enough to align short reads from a single individual. However, MAQ does not support gapped alignment for single-end reads, which makes it unsuitable for alignment of longer reads where indels may occur frequently. The speed of MAQ is also a concern when the alignment is scaled up to the resequencing of hundreds of individuals. Results: We implemented Burrows-Wheeler Alignment tool (BWA), a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps. BWA supports both base space reads, e.g. from Illumina sequencing machines, and color space reads from AB SOLiD machines. Evaluations on both simulated and real data suggest that BWA is ~10–20× faster than MAQ, while achieving similar accuracy. In addition, BWA outputs alignment in the new standard SAM (Sequence Alignment/Map) format. Variant calling and other downstream analyses after the alignment can be achieved with the open source SAMtools software package. Availability: http://maq.sourceforge.net Contact: [email protected]
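
The backward search with the Burrows–Wheeler Transform named in the abstract can be sketched on a toy scale in Python. This is a textbook-style illustration under stated assumptions, not BWA's implementation: it handles exact matches only (no mismatches or gaps), builds the suffix array by naive sorting, and the function names `build_fm_index` and `backward_search` are hypothetical.

```python
def build_fm_index(text):
    """Build a toy FM-index for `text` (a '$' terminator is appended)."""
    text += "$"
    # Naive suffix array: fine for a demo, far too slow for a genome.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to '$').
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # c_table[c]: number of characters in text strictly smaller than c.
    c_table, total = {}, 0
    for c in alphabet:
        c_table[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, c_table, occ


def backward_search(pattern, sa, c_table, occ):
    """Return sorted start positions of exact matches of `pattern`."""
    lo, hi = 0, len(sa)  # half-open interval of suffix array rows
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


if __name__ == "__main__":
    sa, c_table, occ = build_fm_index("GATTACATTA")
    print(backward_search("TTA", sa, c_table, occ))
```

Each pattern character narrows the suffix array interval in constant time given the precomputed tables, so the search cost depends on the pattern length, not the reference length.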

43,862 citations

Journal Article

Fast gapped-read alignment with Bowtie 2

Ben Langmead¹, Steven L. Salzberg¹,²,³

University of Maryland, College Park¹, Johns Hopkins University², Johns Hopkins University School of Medicine³

01 Apr 2012-Nature Methods

TL;DR: Bowtie 2 combines the strengths of the full-text minute index with the flexibility and speed of hardware-accelerated dynamic programming algorithms to achieve a combination of high speed, sensitivity and accuracy.

Abstract: As the rate of sequencing increases, greater throughput is demanded from read aligners. The full-text minute index is often used to make alignment very fast and memory-efficient, but the approach is ill-suited to finding longer, gapped alignments. Bowtie 2 combines the strengths of the full-text minute index with the flexibility and speed of hardware-accelerated dynamic programming algorithms to achieve a combination of high speed, sensitivity and accuracy.

37,898 citations

"Minimap2: pairwise alignment for nu..." refers background or methods in this paper

...Most of them were five times as slow as mainstream short-read aligners (Langmead and Salzberg, 2012; Li, 2013) in terms of the number of bases mapped per second....
...We evaluated minimap2 along with Bowtie2 (v2.3.3; Langmead and Salzberg 2012), BWA-MEM and SNAP (v1....

Journal Article

STAR: ultrafast universal RNA-seq aligner

Alexander Dobin¹, Carrie A. Davis¹, Felix Schlesinger¹, Jorg Drenkow¹, Chris Zaleski¹, Sonali Jha¹, Philippe Batut¹, Mark Chaisson¹, Thomas R. Gingeras¹

Cold Spring Harbor Laboratory¹

01 Jan 2013-Bioinformatics

TL;DR: The Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure outperforms other aligners by a factor of >50 in mapping speed.

Abstract: Motivation: Accurate alignment of high-throughput RNA-seq data is a challenging and yet unsolved problem because of the non-contiguous transcript structure, relatively short read lengths and constantly increasing throughput of the sequencing technologies. Currently available RNA-seq aligners suffer from high mapping error rates, low mapping speed, read length limitation and mapping biases. Results: To align our large (>80 billion reads) ENCODE Transcriptome RNA-seq dataset, we developed the Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure. STAR outperforms other aligners by a factor of >50 in mapping speed, aligning to the human genome 550 million 2 × 76 bp paired-end reads per hour on a modest 12-core server, while at the same time improving alignment sensitivity and precision. In addition to unbiased de novo detection of canonical junctions, STAR can discover non-canonical splices and chimeric (fusion) transcripts, and is also capable of mapping full-length RNA sequences. Using Roche 454 sequencing of reverse transcription polymerase chain reaction amplicons, we experimentally validated 1960 novel intergenic splice junctions with an 80-90% success rate, corroborating the high precision of the STAR mapping strategy. Availability and implementation: STAR is implemented as a standalone C++ code. STAR is free open source software distributed under GPLv3 license and can be downloaded from http://code.google.com/p/rna-star/.

30,684 citations