Minimap2: pairwise alignment for nucleotide sequences

doi:10.1093/BIOINFORMATICS/BTY191

Journal Article•DOI•

Heng Li¹•Institutions (1)

Broad Institute¹

15 Sep 2018-Bioinformatics (Bioinformatics)-Vol. 34, Iss: 18, pp 3094-3100

TL;DR: Minimap2 is a general-purpose alignment program that maps DNA or long mRNA sequences against a large reference database. It is 3-4 times as fast as mainstream short-read mappers at comparable accuracy and ≥30 times faster than long-read genomic or cDNA mappers at higher accuracy, surpassing most aligners specialized in one type of alignment.

Abstract: Motivation: Recent advances in sequencing technologies promise ultra-long reads of ∼100 kb on average, full-length mRNA or cDNA reads in high throughput and genomic contigs over 100 Mb in length. Existing alignment programs are either unable to process such data at scale or inefficient at doing so, which calls for the development of new alignment algorithms. Results: Minimap2 is a general-purpose alignment program to map DNA or long mRNA sequences against a large reference database. It works with accurate short reads of ≥100 bp in length, ≥1 kb genomic reads at error rate ∼15%, full-length noisy Direct RNA or cDNA reads and assembly contigs or closely related full chromosomes of hundreds of megabases in length. Minimap2 does split-read alignment, employs concave gap cost for long insertions and deletions and introduces new heuristics to reduce spurious alignments. It is 3-4 times as fast as mainstream short-read mappers at comparable accuracy, and is ≥30 times faster than long-read genomic or cDNA mappers at higher accuracy, surpassing most aligners specialized in one type of alignment. Availability and implementation: https://github.com/lh3/minimap2. Supplementary information: Supplementary data are available at Bioinformatics online.
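
The abstract mentions a concave gap cost for long insertions and deletions but does not say how it is computed. One common way to obtain a concave cost is to take the minimum of two affine functions, which is sketched below. The function names and every parameter value here are assumptions chosen for illustration; they are not taken from the abstract.

```python
def affine_gap_cost(length, q=4, e=2):
    """Ordinary affine gap cost: opening penalty q plus e per gap base."""
    if length < 0:
        raise ValueError("gap length must be non-negative")
    return 0 if length == 0 else q + length * e


def two_piece_gap_cost(length, q=4, e=2, q_long=24, e_long=1):
    """Gap cost as the cheaper of two affine functions.

    The first piece (q, e) has a low opening penalty and a steep extension
    penalty, so it wins for short gaps. The second piece (q_long, e_long)
    has a high opening penalty and a shallow extension penalty, so it wins
    for long gaps. The minimum of two affine functions is concave in the
    gap length. All parameter values are illustrative assumptions.
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0
    return min(q + length * e, q_long + length * e_long)


if __name__ == "__main__":
    # Compare the two models for a few gap lengths.
    for n in (1, 10, 20, 100, 1000):
        print(n, affine_gap_cost(n), two_piece_gap_cost(n))
```

With these illustrative values the two pieces cross at a gap length of 20, so longer gaps are charged at the shallower rate and a long insertion or deletion is penalized less than under a single affine cost.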

Citations

Journal Article•DOI•

Improved metagenomic analysis with Kraken 2.

Derrick E. Wood¹, Jennifer Lu¹, Ben Langmead¹•Institutions (1)

Johns Hopkins University¹

28 Nov 2019-Genome Biology

TL;DR: Kraken 2 improves upon Kraken 1 by reducing memory usage by 85%, allowing greater amounts of reference genomic data to be used, while maintaining high accuracy and increasing speed fivefold.

Abstract: Although Kraken’s k-mer-based approach provides a fast taxonomic classification of metagenomic sequence data, its large memory requirements can be limiting for some applications. Kraken 2 improves upon Kraken 1 by reducing memory usage by 85%, allowing greater amounts of reference genomic data to be used, while maintaining high accuracy and increasing speed fivefold. Kraken 2 also introduces a translated search mode, providing increased sensitivity in viral metagenomics analysis.

2,261 citations

Cites methods from "Minimap2: pairwise alignment for nu..."

...A similar minimizer-based approach has proven useful in accelerating read alignment [16]....
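
To illustrate what "minimizer-based" means in the excerpt above: a minimizer is the smallest k-mer within each window of w consecutive k-mers, and keeping only minimizers gives a compact sample of a sequence. The sketch below orders k-mers lexicographically, which is a simplifying assumption; practical tools usually order k-mers by a hash value, and this code does not reproduce the scheme of Kraken 2 or minimap2. The function name `minimizers` and its parameters are hypothetical.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs of the (w, k)-minimizers of seq.

    For every window of w consecutive k-mers, the lexicographically smallest
    k-mer is selected; ties are broken by taking the leftmost occurrence.
    """
    if k <= 0 or w <= 0:
        raise ValueError("k and w must be positive")
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # min() returns the first minimal index, i.e. the leftmost smallest k-mer.
        offset = min(range(w), key=lambda j: window[j])
        picked.add((start + offset, window[offset]))
    return sorted(picked)


if __name__ == "__main__":
    print(minimizers("ACGTACGT", k=3, w=3))
    # [(0, 'ACG'), (1, 'CGT'), (4, 'ACG')]
```

Adjacent windows often share the same minimizer, which is why the six 3-mers of the example sequence are reduced to three sampled positions.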

Integrative Genomics Viewer

James T. Robinson¹, Helga Thorvaldsdottir¹, Wendy Winckler¹, Mitchell Guttman¹, Eric S. Lander¹,², Gad Getz¹, Jill P. Mesirov¹•Institutions (2)

Massachusetts Institute of Technology¹, Harvard University²

01 Jan 2011

TL;DR: The sheer volume and scope of genome-wide data pose a significant challenge to the development of efficient and intuitive visualization tools that can scale to very large data sets and flexibly integrate multiple data types, including clinical data.

Abstract: Rapid improvements in sequencing and array-based platforms are resulting in a flood of diverse genome-wide data, including data from exome and whole-genome sequencing, epigenetic surveys, expression profiling of coding and noncoding RNAs, single nucleotide polymorphism (SNP) and copy number profiling, and functional assays. Analysis of these large, diverse data sets holds the promise of a more comprehensive understanding of the genome and its relation to human disease. Experienced and knowledgeable human review is an essential component of this process, complementing computational approaches. This calls for efficient and intuitive visualization tools able to scale to very large data sets and to flexibly integrate multiple data types, including clinical data. However, the sheer volume and scope of data pose a significant challenge to the development of such tools.

2,187 citations

Journal Article•DOI•

The Architecture of SARS-CoV-2 Transcriptome.

Dong Wan Kim¹, Joo Yeon Lee², Jeong Sun Yang², Jun Won Kim², V. Narry Kim¹, Hyeshik Chang¹•Institutions (2)

Seoul National University¹, Centers for Disease Control and Prevention²

14 May 2020-Cell

TL;DR: Functional investigation of the unknown transcripts and RNA modifications discovered in this study will open new directions to the understanding of the life cycle and pathogenicity of SARS-CoV-2.

1,626 citations

Cites methods from "Minimap2: pairwise alignment for nu..."

...The sequence reads were aligned to the reference sequence database composed of the C. sabaeus genome (ENSEMBL release 99), a SARS-CoV-2 genome, yeast ENO2 cDNA (SGD: YHR174W), and human ribosomal DNA complete repeat unit (GenBank: U13369.1) using minimap2 2.17 (Li, 2018) with options “-k 13 -x splice -N 32 -un.” We used the sequence of the Wuhan-Hu-1 strain (GenBank: NC_045512.2) as a backbone for the viral reference genome, then corrected the four single nucleotide variants found in BetaCoV/Korea/KCDC03/2020; T4402C, G5062T, C8782T, and T28143C (GISAID: EPI_ISL_407193)....
...…and Algorithms: guppy 3.4.5, Oxford Nanopore Technologies, https://community.nanoporetech.com/sso/login?next_url=%2Fdownloads; minimap2 2.17, Li, 2018, https://github.com/lh3/minimap2; poreplex 0.5.0, Hyeshik Chang, Seoul National University,…...
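
The options quoted in the excerpts above can be assembled into a full command line as in the following minimal sketch. The option string is copied from the excerpt; the helper name `build_minimap2_command` and the input file names are hypothetical, and the positional order (reference first, then reads) follows minimap2's documented usage.

```python
import shlex


def build_minimap2_command(reference, reads, options="-k 13 -x splice -N 32 -un"):
    """Return the argument list for a minimap2 run with the quoted options.

    `reference` and `reads` are paths supplied by the caller; nothing is
    executed here, the function only builds the command.
    """
    return ["minimap2", *shlex.split(options), reference, reads]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    argv = build_minimap2_command("reference.fa", "reads.fastq")
    print(shlex.join(argv))
    # To run it (requires minimap2 on PATH and real input files):
    # import subprocess
    # with open("alignments.paf", "w") as out:
    #     subprocess.run(argv, stdout=out, check=True)
```

Building the command as a list rather than a single shell string avoids quoting problems when file paths contain spaces.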

Journal Article•DOI•

Deep Mutational Scanning of SARS-CoV-2 Receptor Binding Domain Reveals Constraints on Folding and ACE2 Binding.

Tyler N. Starr¹, Allison J. Greaney¹,², Sarah K Hilton¹,², Daniel Ellis², Katharine H.D. Crawford¹,², Adam S. Dingens¹, Mary Jane Navarro², John E. Bowen², M. Alejandra Tortorici², Alexandra C. Walls², Neil P. King², David Veesler², Jesse D. Bloom¹,²,³•Institutions (3)

Fred Hutchinson Cancer Research Center¹, University of Washington², Howard Hughes Medical Institute³

03 Sep 2020-Cell

TL;DR: It is found that a substantial number of mutations to the RBD are well tolerated or even enhance ACE2 binding, including at ACE2 interface residues that vary across SARS-related coronaviruses.

1,517 citations

Cites background or methods from "Minimap2: pairwise alignment for nu..."

...To do this, we used alignparse (Crawford and Bloom, 2019), version 0.1.3, which in turn makes use of minimap2 (Li, 2018), version 2.17....
...…version 0.1.3, Crawford and Bloom, 2019, https://github.com/jbloomlab/alignparse; minimap, version 2.17, Li 2018, https://github.com/lh3/minimap2; dms_variants, version 0.6.0, GitHub, https://jbloomlab.github.io/dms_variants/; custom code, This paper, all…...

Journal Article•DOI•

Performance of neural network basecalling tools for Oxford Nanopore sequencing.

Ryan R. Wick¹, Louise M. Judd¹, Kathryn E. Holt¹,²•Institutions (2)

Monash University¹, University of London²

24 Jun 2019-Genome Biology

TL;DR: The current version of ONT’s Guppy basecaller performs well overall, with good accuracy and fast performance; if higher accuracy is required, users should consider producing a custom model using a larger neural network and/or training data from the same species.

Abstract: Basecalling, the computational process of translating raw electrical signal to nucleotide sequence, is of critical importance to the sequencing platforms produced by Oxford Nanopore Technologies (ONT). Here, we examine the performance of different basecalling tools, looking at accuracy at the level of bases within individual reads and at majority-rule consensus basecalls in an assembly. We also investigate some additional aspects of basecalling: training using a taxon-specific dataset, using a larger neural network model and improving consensus basecalls in an assembly by additional signal-level analysis with Nanopolish. Training basecallers on taxon-specific data results in a significant boost in consensus accuracy, mostly due to the reduction of errors in methylation motifs. A larger neural network is able to improve both read and consensus accuracy, but at a cost to speed. Improving consensus sequences (‘polishing’) with Nanopolish somewhat negates the accuracy differences in basecallers, but pre-polish accuracy does have an effect on post-polish accuracy. Basecalling accuracy has seen significant improvements over the last 2 years. The current version of ONT’s Guppy basecaller performs well overall, with good accuracy and fast performance. If higher accuracy is required, users should consider producing a custom model using a larger neural network and/or training data from the same species.

1,488 citations

Cites methods from "Minimap2: pairwise alignment for nu..."

...0 (the current version at the time of read selection), aligning the resulting reads (using minimap2 [18] v2....
...To assess read accuracy, we aligned each basecalled read set to the reference INF032 genome using minimap2 [18] (v2....

References

Journal Article•DOI•

Improved spliced alignment from an information theoretic approach

Miao Zhang¹, Warren Gish¹•Institutions (1)

Washington University in St. Louis¹

01 Jan 2006-Bioinformatics

TL;DR: A novel approach to spliced alignment that meaningfully combined information from sequence similarity with that obtained from PSSM splice site models was taken and the resultant program, EXALIN, performed better than other popular tools tested under a wide range of conditions.

Abstract: Motivation: mRNA sequences and expressed sequence tags represent some of the most abundant experimental data for identifying genes and alternatively spliced products in metazoans. These transcript sequences are frequently studied by aligning them to a genomic sequence template. For existing programs, error-prone, polymorphic and cross-species data, as well as non-canonical splice sites, still present significant barriers to producing accurate, complete alignments. Results: We took a novel approach to spliced alignment that meaningfully combined information from sequence similarity with that obtained from PSSM splice site models. Scoring systems were chosen to maximize their power of discrimination, and dynamic programming (DP) was employed to guarantee optimal solutions would be found. The resultant program, EXALIN, performed better than other popular tools tested under a wide range of conditions that included detection of micro-exons and human-mouse cross-species comparisons. For improved speed with only a marginal decrease in splice site prediction accuracy, EXALIN could perform limited DP guided by a result from BLASTN. Availability: The source code, binaries, scripts, scoring matrices and splice site models for human, mouse, rice and Caenorhabditis elegans utilized in this study are posted at http://blast.wustl.edu/exalin. The software (scripts, source code and binaries) is copyrighted but free for all to use. Contact: gish@blast.wustl.edu. Supplementary information: http://blast.wustl.edu/exalin/exalin-supplement.pdf

40 citations

"Minimap2: pairwise alignment for nu..." refers to methods in this paper

...(6) is almost equivalent to the equation used by EXALIN (Zhang and Gish, 2006) except that we allow insertions immediately followed by deletions and vice versa; in addition, we use Suzuki’s diagonal formulation in actual implementation....

Journal Article•DOI•

Kart: a divide-and-conquer algorithm for NGS read alignment.

Hsin-Nan Lin¹, Wen-Lian Hsu¹•Institutions (1)

Academia Sinica¹

01 Aug 2017-Bioinformatics

TL;DR: Kart is a divide-and-conquer algorithm that can process long reads as fast as short reads by dividing a read into small fragments that can be aligned independently; it can also tolerate much higher error rates.

Abstract: Motivation: Next-generation sequencing (NGS) provides a great opportunity to investigate genome-wide variation at nucleotide resolution. Due to the huge amount of data, NGS applications require very fast and accurate alignment algorithms. Most existing algorithms for read mapping adopt a seed-and-extend strategy, which is sequential in nature and takes much longer on longer reads. Results: We develop a divide-and-conquer algorithm, called Kart, which can process long reads as fast as short reads by dividing a read into small fragments that can be aligned independently. Our experimental results indicate that the average size of fragments requiring the more time-consuming gapped alignment is around 20 bp regardless of the original read length. Furthermore, it can tolerate much higher error rates. The experiments show that Kart spends much less time on longer reads than other aligners and still produces reliable alignments even when the error rate is as high as 15%. Availability and implementation: Kart is available at https://github.com/hsinnan75/Kart/. Contact: hsu@iis.sinica.edu.tw. Supplementary information: Supplementary data are available at Bioinformatics online.

39 citations

Journal Article•DOI•

LAMSA: fast split read alignment with long approximate matches.

Bo Liu¹, Yan Gao¹, Yadong Wang¹•Institutions (1)

Harbin Institute of Technology¹

15 Jan 2017-Bioinformatics

TL;DR: Long approximate matches-based split aligner (LAMSA) takes advantage of the rareness of structural variants to implement a specifically designed two-step strategy, whose first step splits the read into relatively long fragments and co-linearly aligns them to resolve small variations or sequencing errors and to mitigate the effect of repeats.

Abstract: Motivation: Read length is continuously increasing with the development of novel high-throughput sequencing technologies, which has enormous potential for cutting-edge genomic studies. However, longer reads could span the breakpoints of structural variants (SVs) more frequently than shorter reads. This may greatly influence read alignment, since most state-of-the-art aligners are designed to handle relatively small variants in a co-linear alignment framework. Meanwhile, long-read alignment is still not as efficient as short-read alignment, which could also be a bottleneck for the upcoming wide application. Results: We propose long approximate matches-based split aligner (LAMSA), a novel split read alignment approach. It takes advantage of the rareness of SVs to implement a specifically designed two-step strategy. That is, LAMSA initially splits the read into relatively long fragments and co-linearly aligns them to resolve small variations or sequencing errors and to mitigate the effect of repeats. The alignments of the fragments are then used in a sparse dynamic programming-based split alignment approach to handle large or non-co-linear variants. We benchmarked LAMSA with simulated and real datasets having various read lengths and sequencing error rates; the results demonstrate that it is substantially faster than state-of-the-art long read aligners, while also handling various categories of SVs well. Availability and implementation: LAMSA is available at https://github.com/hitbc/LAMSA. Contact: Ydwang@hit.edu.cn. Supplementary information: Supplementary data are available at Bioinformatics online.

29 citations

Posted Content•DOI•

New synthetic-diploid benchmark for accurate variant calling evaluation

Heng Li¹, Jonathan M. Bloom¹, Yossi Farjoun¹, Mark Fleharty¹, Laura D. Gauthier¹, Benjamin M. Neale¹,², Daniel G. MacArthur¹,²•Institutions (2)

Broad Institute¹, Harvard University²

22 Nov 2017-bioRxiv

TL;DR: A new benchmark dataset is derived from the de novo PacBio assemblies of two human cell lines that are homozygous across the whole genome that provides a more accurate and less biased estimate of the error rate of small variant calls in a realistic context.

Abstract: Constructed from the consensus of multiple variant callers based on short-read data, existing benchmark datasets for evaluating variant calling accuracy are biased toward easy regions accessible by known algorithms. We derived a new benchmark dataset from the de novo PacBio assemblies of two human cell lines that are homozygous across the whole genome. This benchmark provides a more accurate and less biased estimate of the error rate of small variant calls in a realistic context.

17 citations