Minimap2: pairwise alignment for nucleotide sequences

doi:10.1093/BIOINFORMATICS/BTY191

Journal Article•DOI•

Heng Li¹•Institutions (1)

Broad Institute¹

15 Sep 2018-Bioinformatics (Bioinformatics)-Vol. 34, Iss: 18, pp 3094-3100

TL;DR: Minimap2 is a general-purpose alignment program that maps DNA or long mRNA sequences against a large reference database. It is 3-4 times as fast as mainstream short-read mappers at comparable accuracy and ≥30 times faster than long-read genomic or cDNA mappers at higher accuracy, surpassing most aligners specialized in one type of alignment.

Abstract: Motivation: Recent advances in sequencing technologies promise ultra-long reads of ∼100 kb in average, full-length mRNA or cDNA reads in high throughput and genomic contigs over 100 Mb in length. Existing alignment programs are unable or inefficient to process such data at scale, which presses for the development of new alignment algorithms. Results: Minimap2 is a general-purpose alignment program to map DNA or long mRNA sequences against a large reference database. It works with accurate short reads of ≥100 bp in length, ≥1 kb genomic reads at error rate ∼15%, full-length noisy Direct RNA or cDNA reads and assembly contigs or closely related full chromosomes of hundreds of megabases in length. Minimap2 does split-read alignment, employs concave gap cost for long insertions and deletions and introduces new heuristics to reduce spurious alignments. It is 3-4 times as fast as mainstream short-read mappers at comparable accuracy, and is ≥30 times faster than long-read genomic or cDNA mappers at higher accuracy, surpassing most aligners specialized in one type of alignment. Availability and implementation: https://github.com/lh3/minimap2. Supplementary information: Supplementary data are available at Bioinformatics online.
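
The "concave gap cost" named in the abstract can be illustrated with a two-piece affine function, in which a gap of a given length costs the smaller of two affine penalties. The sketch below is only an illustration of that idea: the function name and the parameter values (4, 2, 24, 1) are assumptions made for this example and are not stated in the text above.

```python
def two_piece_gap_cost(length, open1=4, ext1=2, open2=24, ext2=1):
    """Cost of a gap of `length` bases under a two-piece affine model.

    The cost is the minimum of two affine penalties, which makes it
    concave in the gap length: short gaps follow the first piece, long
    gaps follow the second piece, which is cheaper per base.
    All default values are illustrative assumptions.
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0
    return min(open1 + length * ext1, open2 + length * ext2)


if __name__ == "__main__":
    # Print the cost for a few gap lengths to show where the pieces cross.
    for n in (1, 10, 20, 100):
        print(n, two_piece_gap_cost(n))
```

With these assumed values the two pieces cross at a gap length of 20, so a long insertion or deletion is penalized less per base than it would be under a single affine cost.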

Citations

Journal Article•DOI•

Improved metagenomic analysis with Kraken 2.

Derrick E. Wood¹, Jennifer Lu¹, Ben Langmead¹•Institutions (1)

Johns Hopkins University¹

28 Nov 2019-Genome Biology

TL;DR: Kraken 2 improves upon Kraken 1 by reducing memory usage by 85%, allowing greater amounts of reference genomic data to be used, while maintaining high accuracy and increasing speed fivefold.

Abstract: Although Kraken’s k-mer-based approach provides a fast taxonomic classification of metagenomic sequence data, its large memory requirements can be limiting for some applications. Kraken 2 improves upon Kraken 1 by reducing memory usage by 85%, allowing greater amounts of reference genomic data to be used, while maintaining high accuracy and increasing speed fivefold. Kraken 2 also introduces a translated search mode, providing increased sensitivity in viral metagenomics analysis.

2,261 citations

Cites methods from "Minimap2: pairwise alignment for nu..."

...A similar minimizer-based approach has proven useful in accelerating read alignment [16]....
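
The minimizer idea referred to in this excerpt can be sketched as follows. This is a simplified illustration, not the scheme of any particular tool: it picks the lexicographically smallest k-mer in each window, whereas real implementations typically hash k-mers and handle both strands. The function name and the values of `k` and `w` are assumptions.

```python
def minimizers(seq, k=5, w=4):
    """Return the sorted list of (position, k-mer) minimizers of `seq`.

    For every window of `w` consecutive k-mers, keep the lexicographically
    smallest k-mer (the leftmost one on ties). Adjacent windows often share
    the same minimizer, so far fewer k-mers are kept than exist in `seq`.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # min() returns the first smallest index, i.e. the leftmost on ties.
        offset = min(range(w), key=lambda j: window[j])
        picked.add((start + offset, window[offset]))
    return sorted(picked)


if __name__ == "__main__":
    print(minimizers("ACGTACGTAC", k=3, w=2))
```

Two sequences that share a long enough exact substring select the same minimizers within it, which is what makes a minimizer index useful for finding candidate matches quickly.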

Integrative Genomics Viewer

James T. Robinson¹, Helga Thorvaldsdottir¹, Wendy Winckler¹, Mitchell Guttman¹, Eric S. Lander¹,², Gad Getz¹, Jill P. Mesirov¹•Institutions (2)

Massachusetts Institute of Technology¹, Harvard University²

01 Jan 2011

TL;DR: The sheer volume and scope of genome-wide data pose a significant challenge to the development of efficient and intuitive visualization tools able to scale to very large data sets and to flexibly integrate multiple data types, including clinical data.

Abstract: Rapid improvements in sequencing and array-based platforms are resulting in a flood of diverse genome-wide data, including data from exome and whole-genome sequencing, epigenetic surveys, expression profiling of coding and noncoding RNAs, single nucleotide polymorphism (SNP) and copy number profiling, and functional assays. Analysis of these large, diverse data sets holds the promise of a more comprehensive understanding of the genome and its relation to human disease. Experienced and knowledgeable human review is an essential component of this process, complementing computational approaches. This calls for efficient and intuitive visualization tools able to scale to very large data sets and to flexibly integrate multiple data types, including clinical data. However, the sheer volume and scope of data pose a significant challenge to the development of such tools.

2,187 citations

Journal Article•DOI•

The Architecture of SARS-CoV-2 Transcriptome.

Dong Wan Kim¹, Joo Yeon Lee², Jeong Sun Yang², Jun Won Kim², V. Narry Kim¹, Hyeshik Chang¹•Institutions (2)

Seoul National University¹, Centers for Disease Control and Prevention²

14 May 2020-Cell

TL;DR: Functional investigation of the unknown transcripts and RNA modifications discovered in this study will open new directions to the understanding of the life cycle and pathogenicity of SARS-CoV-2.

1,626 citations

Cites methods from "Minimap2: pairwise alignment for nu..."

...The sequence reads were aligned to the reference sequence database composed of the C. sabaeus genome (ENSEMBL release 99), a SARS-CoV-2 genome, yeast ENO2 cDNA (SGD: YHR174W), and human ribosomal DNA complete repeat unit (GenBank: U13369.1) using minimap2 2.17 (Li, 2018) with options "-k 13 -x splice -N 32 -un". We used the sequence of the Wuhan-Hu-1 strain (GenBank: NC_045512.2) as a backbone for the viral reference genome, then corrected the four single nucleotide variants found in BetaCoV/Korea/KCDC03/2020; T4402C, G5062T, C8782T, and T28143C (GISAID: EPI_ISL_407193)....
...…and Algorithms guppy 3.4.5 Oxford Nanopore Technologies https://community.nanoporetech.com/sso/login?next_url=%2Fdownloads minimap2 2.17 Li, 2018 https://github.com/lh3/minimap2 poreplex 0.5.0 Hyeshik Chang, Seoul National University,…...
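
The option string quoted in these excerpts can be turned into a command line as sketched below. Only the option string itself comes from the excerpt; the helper name and the file names `reference.fa` and `reads.fq` are hypothetical, and the sketch only assembles the argument list rather than running the aligner.

```python
import shlex


def build_minimap2_command(reference, reads,
                           options="-k 13 -x splice -N 32 -un"):
    """Assemble a minimap2 command line as an argument list.

    `options` defaults to the option string quoted in the excerpt;
    `reference` and `reads` are hypothetical file paths supplied by
    the caller.
    """
    return ["minimap2", *shlex.split(options), reference, reads]


if __name__ == "__main__":
    cmd = build_minimap2_command("reference.fa", "reads.fq")
    print(shlex.join(cmd))
    # To run it (requires minimap2 on PATH), pass `cmd` to
    # subprocess.run(cmd, check=True) and capture or redirect stdout.
```

Building the command as a list rather than a single string avoids shell quoting problems when file paths contain spaces.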

Journal Article•DOI•

Deep Mutational Scanning of SARS-CoV-2 Receptor Binding Domain Reveals Constraints on Folding and ACE2 Binding.

Tyler N. Starr¹, Allison J. Greaney¹,², Sarah K Hilton¹,², Daniel Ellis², Katharine H.D. Crawford¹,², Adam S. Dingens¹, Mary Jane Navarro², John E. Bowen², M. Alejandra Tortorici², Alexandra C. Walls², Neil P. King², David Veesler², Jesse D. Bloom¹,²,³•Institutions (3)

Fred Hutchinson Cancer Research Center¹, University of Washington², Howard Hughes Medical Institute³

03 Sep 2020-Cell

TL;DR: It is found that a substantial number of mutations to the RBD are well tolerated or even enhance ACE2 binding, including at ACE2 interface residues that vary across SARS-related coronaviruses.

1,517 citations

Cites background or methods from "Minimap2: pairwise alignment for nu..."

...To do this, we used alignparse (Crawford and Bloom, 2019), version 0.1.3, which in turn makes use of minimap2 (Li, 2018), version 2.17....
...…version 0.1.3 Crawford and Bloom, 2019 https://github.com/jbloomlab/alignparse minimap, version 2.17 Li 2018 https://github.com/lh3/minimap2 dms_variants, version 0.6.0 GitHub https://jbloomlab.github.io/dms_variants/ custom code This paper all…...

Journal Article•DOI•

Performance of neural network basecalling tools for Oxford Nanopore sequencing.

Ryan R. Wick¹, Louise M. Judd¹, Kathryn E. Holt¹,²•Institutions (2)

Monash University¹, University of London²

24 Jun 2019-Genome Biology

TL;DR: The current version of ONT’s Guppy basecaller performs well overall, with good accuracy and fast performance, and users should consider producing a custom model using a larger neural network and/or training data from the same species.

Abstract: Basecalling, the computational process of translating raw electrical signal to nucleotide sequence, is of critical importance to the sequencing platforms produced by Oxford Nanopore Technologies (ONT). Here, we examine the performance of different basecalling tools, looking at accuracy at the level of bases within individual reads and at majority-rule consensus basecalls in an assembly. We also investigate some additional aspects of basecalling: training using a taxon-specific dataset, using a larger neural network model and improving consensus basecalls in an assembly by additional signal-level analysis with Nanopolish. Training basecallers on taxon-specific data results in a significant boost in consensus accuracy, mostly due to the reduction of errors in methylation motifs. A larger neural network is able to improve both read and consensus accuracy, but at a cost to speed. Improving consensus sequences (‘polishing’) with Nanopolish somewhat negates the accuracy differences in basecallers, but pre-polish accuracy does have an effect on post-polish accuracy. Basecalling accuracy has seen significant improvements over the last 2 years. The current version of ONT’s Guppy basecaller performs well overall, with good accuracy and fast performance. If higher accuracy is required, users should consider producing a custom model using a larger neural network and/or training data from the same species.

1,488 citations

Cites methods from "Minimap2: pairwise alignment for nu..."

...0 (the current version at the time of read selection), aligning the resulting reads (using minimap2 [18] v2....
...To assess read accuracy, we aligned each basecalled read set to the reference INF032 genome using minimap2 [18] (v2....

References

Journal Article•DOI•

Integrative genomics viewer

James T. Robinson¹, Helga Thorvaldsdottir¹, Wendy Winckler¹, Mitchell Guttman¹, Eric S. Lander¹,², Gad Getz¹, Jill P. Mesirov¹•Institutions (2)

Massachusetts Institute of Technology¹, Harvard University²

10 Jan 2011-Nature Biotechnology

TL;DR: The authors present the Integrative Genomics Viewer, an efficient and intuitive visualization tool able to scale to very large data sets and to flexibly integrate multiple data types, including clinical data.

10,798 citations

Journal Article•DOI•

A framework for variation discovery and genotyping using next-generation DNA sequencing data

Mark A. DePristo¹, Eric Banks¹, Ryan Poplin¹, Kiran V. Garimella¹, Jared Maguire¹, Christopher Hartl¹, Anthony A. Philippakis¹,²,³, Guillermo del Angel¹, Manuel A. Rivas¹,², Matt Hanna¹, Aaron McKenna¹, Timothy Fennell¹, Andrew Kernytsky¹, Andrey Sivachenko¹, Kristian Cibulskis¹, Stacey Gabriel¹, David Altshuler¹,², Mark J. Daly¹,²•Institutions (3)

Broad Institute¹, Harvard University², Brigham and Women's Hospital³

01 May 2011-Nature Genetics

TL;DR: A unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs is presented.

Abstract: Recent advances in sequencing technology make it possible to comprehensively catalogue genetic variation in population samples, creating a foundation for understanding human disease, ancestry and evolution. The amounts of raw data produced are prodigious and many computational steps are required to translate this output into high-quality variant calls. We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs. Our process includes (1) initial read mapping; (2) local realignment around indels; (3) base quality score recalibration; (4) SNP discovery and genotyping to find all potential variants; and (5) machine learning to separate true segregating variation from machine artifacts common to next-generation sequencing technologies. We discuss the application of these tools, instantiated in the Genome Analysis Toolkit (GATK), to deep whole-genome, whole-exome capture, and multi-sample low-pass (~4×) 1000 Genomes Project datasets.

10,056 citations

Posted Content•DOI•

Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM

Heng Li

16 Mar 2013-arXiv: Genomics

TL;DR: BWA-MEM automatically chooses between local and end-to-end alignments, supports paired-end reads and performs chimeric alignment, which is robust to sequencing errors and applicable to a wide range of sequence lengths from 70bp to a few megabases.

Abstract: Summary: BWA-MEM is a new alignment algorithm for aligning sequence reads or long query sequences against a large reference genome such as human. It automatically chooses between local and end-to-end alignments, supports paired-end reads and performs chimeric alignment. The algorithm is robust to sequencing errors and applicable to a wide range of sequence lengths from 70bp to a few megabases. For mapping 100bp sequences, BWA-MEM shows better performance than several state-of-art read aligners to date. Availability and implementation: BWA-MEM is implemented as a component of BWA, which is available at this http URL. Contact: hengli@broadinstitute.org

8,090 citations

"Minimap2: pairwise alignment for nu..." refers background or methods in this paper

...Several aligners have been developed for such data (Chaisson and Tesler, 2012; Li, 2013; Liu et al., 2016; Sović et al., 2016; Liu et al., 2017; Lin and Hsu, 2017; Sedlazeck et al., 2017)....
...) sequencing technology and Oxford Nanopore technologies (ONT) produce reads over 10kbp in length at an error rate ∼15%. Several aligners have been developed for such data (Chaisson and Tesler, 2012; Li, 2013; Liu et al., 2016; Sović et al., 2016; Liu et al., 2017; Lin and Hsu, 2017; Sedlazeck et al., 2017). Most of them were five times as slow as mainstream short-read aligners (Langmead and Salzberg, 201...
...Most of them were five times as slow as mainstream short-read aligners (Langmead and Salzberg, 2012; Li, 2013) in terms of the number of bases mapped per second....
...ses such as nt from NCBI. 3.1 Aligning long genomic reads As a sanity check, we evaluated minimap2 on simulated human reads along with BLASR (v1.MC.rc64; Chaisson and Tesler, 2012), BWA-MEM (v0.7.15; Li, 2013), GraphMap (v0.5.2; Sović et al., 2016), Kart (v2.2.5; Lin and Hsu, 2017), minialign (v0.5.3; https://github.com/ocxtal/minialign) and NGMLR (v0.2.5; Sedlazeck et al., 2017). We excluded rHAT (Liu et...

Journal Article•DOI•

GMAP: a genomic mapping and alignment program for mRNA and EST sequences

Thomas D. Wu¹, Colin K. Watanabe¹•Institutions (1)

Genentech¹

01 May 2005-Bioinformatics

TL;DR: GMAP, a standalone program for mapping and aligning cDNA sequences to a genome with minimal startup time and memory requirements, and provides fast batch processing of large sequence sets, demonstrates a several-fold increase in speed over existing programs.

Abstract: Motivation: We introduce gmap, a standalone program for mapping and aligning cDNA sequences to a genome. The program maps and aligns a single sequence with minimal startup time and memory requirements, and provides fast batch processing of large sequence sets. The program generates accurate gene structures, even in the presence of substantial polymorphisms and sequence errors, without using probabilistic splice site models. Methodology underlying the program includes a minimal sampling strategy for genomic mapping, oligomer chaining for approximate alignment, sandwich DP for splice site detection, and microexon identification with statistical significance testing. Results: On a set of human messenger RNAs with random mutations at a 1 and 3% rate, gmap identified all splice sites accurately in over 99.3% of the sequences, which was one-tenth the error rate of existing programs. On a large set of human expressed sequence tags, gmap provided higher-quality alignments more often than blat did. On a set of Arabidopsis cDNAs, gmap performed comparably with GeneSeqer. In these experiments, gmap demonstrated a several-fold increase in speed over existing programs. Availability: Source code for gmap and associated programs is available at http://www.gene.com/share/gmap Supplementary information: http://www.gene.com/share/gmap

2,058 citations

Journal Article•DOI•

An improved algorithm for matching biological sequences

Osamu Gotoh

15 Dec 1982-Journal of Molecular Biology

TL;DR: The algorithm of Waterman et al. (1976) for matching biological sequences was modified under some limitations to be accomplished in essentially MN steps, instead of the M²N steps necessary in the original algorithm.

1,760 citations
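
The MN-step idea summarized for this reference can be sketched with a dynamic program that assumes affine gap penalties, where a gap of length l costs `gap_open + l * gap_extend`. This is a minimal illustration in the spirit of that approach, not a transcription of the published algorithm; the function name and all scoring values are assumptions.

```python
def affine_global_score(a, b, match=2, mismatch=-4, gap_open=4, gap_extend=2):
    """Global alignment score of `a` and `b` with affine gap penalties.

    Runs in O(len(a) * len(b)) time by tracking, per cell, the best score
    overall (H) and the best score ending in a gap in either sequence
    (E horizontally, F vertically). Scoring values are illustrative.
    """
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    # Row 0: b[:j] aligned entirely against a single gap of length j.
    prev_h = [0] + [-(gap_open + j * gap_extend) for j in range(1, m + 1)]
    prev_f = [neg_inf] * (m + 1)
    for i in range(1, n + 1):
        cur_h = [-(gap_open + i * gap_extend)] + [neg_inf] * m
        cur_f = [neg_inf] * (m + 1)
        e = neg_inf  # best score in this row ending with a gap that consumes b
        for j in range(1, m + 1):
            # Either extend an existing gap or open a new one.
            e = max(e - gap_extend, cur_h[j - 1] - gap_open - gap_extend)
            cur_f[j] = max(prev_f[j] - gap_extend,
                           prev_h[j] - gap_open - gap_extend)
            diag = prev_h[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur_h[j] = max(diag, e, cur_f[j])
        prev_h, prev_f = cur_h, cur_f
    return prev_h[m]


if __name__ == "__main__":
    print(affine_global_score("AATTAA", "AAAA"))
```

Because each cell only looks at its three neighbours instead of scanning all possible gap lengths, the work per cell is constant, which is where the reduction from M²N to MN steps comes from.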