Minimap2: pairwise alignment for nucleotide sequences

doi:10.1093/BIOINFORMATICS/BTY191

Journal Article•DOI•

Heng Li¹

Broad Institute¹

15 Sep 2018 - Bioinformatics - Vol. 34, Iss. 18, pp. 3094-3100

TL;DR: Minimap2 is a general-purpose alignment program that maps DNA or long mRNA sequences against a large reference database. It is 3-4 times as fast as mainstream short-read mappers at comparable accuracy and ≥30 times faster than long-read genomic or cDNA mappers at higher accuracy, surpassing most aligners specialized in one type of alignment.

Abstract: Motivation: Recent advances in sequencing technologies promise ultra-long reads of ∼100 kb on average, full-length mRNA or cDNA reads in high throughput and genomic contigs over 100 Mb in length. Existing alignment programs are unable to process such data at scale or do so inefficiently, which calls for the development of new alignment algorithms. Results: Minimap2 is a general-purpose alignment program to map DNA or long mRNA sequences against a large reference database. It works with accurate short reads of ≥100 bp in length, ≥1 kb genomic reads at an error rate of ∼15%, full-length noisy Direct RNA or cDNA reads, and assembly contigs or closely related full chromosomes of hundreds of megabases in length. Minimap2 does split-read alignment, employs a concave gap cost for long insertions and deletions, and introduces new heuristics to reduce spurious alignments. It is 3-4 times as fast as mainstream short-read mappers at comparable accuracy, and is ≥30 times faster than long-read genomic or cDNA mappers at higher accuracy, surpassing most aligners specialized in one type of alignment. Availability and implementation: https://github.com/lh3/minimap2. Supplementary information: Supplementary data are available at Bioinformatics online.
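
The abstract mentions a concave gap cost for long insertions and deletions. One common way to approximate a concave cost is a two-piece affine function, which takes the cheaper of two affine penalties so that long gaps are charged less per base than short ones. The sketch below is only an illustration under that assumption: the function names and the parameter values (open1, ext1, open2, ext2) are hypothetical and are not taken from this page.

```python
def affine_gap_cost(length, gap_open=4, gap_ext=2):
    """Classic affine gap cost: one opening penalty plus a per-base extension."""
    if length <= 0:
        return 0
    return gap_open + length * gap_ext


def two_piece_gap_cost(length, open1=4, ext1=2, open2=24, ext2=1):
    """Two-piece affine gap cost (illustrative parameters, assumed here).

    The first piece is cheap to open but expensive to extend, the second is
    expensive to open but cheap to extend. Taking the minimum of the two
    gives a piecewise-linear concave cost.
    """
    if length <= 0:
        return 0
    return min(open1 + length * ext1, open2 + length * ext2)


if __name__ == "__main__":
    # Compare both models on short and long gaps.
    for n in (1, 10, 20, 100, 1000):
        print(n, affine_gap_cost(n), two_piece_gap_cost(n))
```

With these assumed values the two pieces cross at a gap length of 20 bases. Shorter gaps are scored by the first piece, and longer gaps by the second, which is why a long deletion is penalized far less than under a single affine cost.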

Citations

Posted Content•DOI•

Towards complete and error-free genome assemblies of all vertebrate species

Arang Rhie¹, Shane A. McCarthy², Olivier Fedrigo³, Joana Damas⁴, Giulio Formenti³, Sergey Koren¹, Marcela Uliano-Silva², William Chow², Arkarachai Fungtammasan, Gregory Gedman³, Lindsey J. Cantin³, Françoise Thibaud-Nissen¹, Leanne Haggerty⁵, Chul Hee Lee⁶, Byung June Ko⁶, J. H. Kim⁶, Iliana Bista², Michelle Smith², Bettina Haase³, Jacquelyn Mountcastle³, Sylke Winkler⁷, Sadye Paez³, Jason T. Howard⁸, Sonja C. Vernes⁷, Tanya M. Lama⁹, Frank Grützner¹⁰, Wesley C. Warren¹¹, Christopher N. Balakrishnan¹², Dave W Burt¹³, Jimin George¹⁴, Matthew T. Biegler³, David Iorns¹⁵, Andrew Digby, Daryl Eason, Taylor Edwards¹⁶, Mark Wilkinson¹⁷, George F. Turner¹⁸, Axel Meyer¹⁹, Andreas F. Kautt¹⁹, Paolo Franchini¹⁹, H. William Detrich²⁰, Hannes Svardal²¹, Maximilian Wagner²², Gavin J. P. Naylor²³, Martin Pippel⁷, Milan Malinsky², Mark Mooney, Maria Simbirsky, Brett T. Hannigan, Trevor Pesout²⁴, Marlys L. Houck, Ann C Misuraca, Sarah B. Kingan²⁵, Richard Hall²⁵, Zev N. Kronenberg²⁵, Jonas Korlach²⁵, Ivan Sović²⁵, Christopher Dunn²⁵, Zemin Ning², Alex Hastie, Joyce V. Lee, Siddarth Selvaraj, Richard E. Green²⁴, Nicholas H. Putnam, Jay Ghurye²⁶, Erik Garrison²⁴, Ying Sims², Joanna Collins², Sarah Pelan², James Torrance², Alan Tracey², Jonathan Wood², Dengfeng Guan²⁷, Sarah E. London²⁸, David F. Clayton¹⁴, Claudio V. Mello²⁹, Samantha R. Friedrich²⁹, Peter V. Lovell²⁹, Ekaterina Osipova⁷, Farooq O. Al-Ajli³⁰, Simona Secomandi³¹, Heebal Kim⁶, Constantina Theofanopoulou³, Yang Zhou³², Robert S. Harris³³, Kateryna D. Makova³³, Paul Medvedev³³, Jinna Hoffman¹, Patrick Masterson¹, Karen Clark¹, Fergal J. Martin⁵, Kevin L. Howe⁵, Paul Flicek⁵, Brian P. Walenz¹, Woori Kwak, Hiram Clawson²⁴, Mark Diekhans²⁴, Luis R Nassar²⁴, Benedict Paten²⁴, Robert H. S. Kraus¹⁹, Harris A. Lewin⁴, Andrew J. Crawford³⁴, M. Thomas P. Gilbert³², Guojie Zhang³², Byrappa Venkatesh³⁵, Robert W. Murphy³⁶, Klaus-Peter Koepfli³⁷, Beth Shapiro²⁴, Warren E. 
Johnson³⁷, Federica Di Palma³⁸, Tomas Marques-Bonet³⁹, Emma C. Teeling⁴⁰, Tandy Warnow⁴¹, Jennifer A. Marshall Graves⁴², Oliver A. Ryder⁴³, David Haussler²⁴, Stephen J. O'Brien⁴⁴, Kerstin Howe², Eugene W. Myers⁴⁵, Richard Durbin², Adam M. Phillippy¹, Erich D. Jarvis³

National Institutes of Health¹, Wellcome Trust Sanger Institute², Rockefeller University³, University of California, Davis⁴, European Bioinformatics Institute⁵, Seoul National University⁶, Max Planck Society⁷, Durham University⁸, University of Massachusetts Amherst⁹, University of Adelaide¹⁰, University of Missouri¹¹, East Carolina University¹², University of Queensland¹³, Queen Mary University of London¹⁴, Wellington Management Company¹⁵, University of Arizona¹⁶, Natural History Museum¹⁷, Bangor University¹⁸, University of Konstanz¹⁹, Northeastern University²⁰, Naturalis²¹, University of Graz²², Florida Museum of Natural History²³, University of California, Santa Cruz²⁴, Pacific Biosciences²⁵, University of Maryland, College Park²⁶, Harbin Institute of Technology²⁷, University of Chicago²⁸, Oregon Health & Science University²⁹, Monash University Malaysia Campus³⁰, University of Milan³¹, University of Copenhagen³², Pennsylvania State University³³, University of Los Andes³⁴, Agency for Science, Technology and Research³⁵, Royal Ontario Museum³⁶, Smithsonian Conservation Biology Institute³⁷, University of East Anglia³⁸, Pompeu Fabra University³⁹, University College Dublin⁴⁰, University of Illinois at Urbana–Champaign⁴¹, La Trobe University⁴², University of California, San Diego⁴³, UPRRP College of Natural Sciences⁴⁴, Dresden University of Technology⁴⁵

23 May 2020-bioRxiv

TL;DR: The authors have embarked on the Vertebrate Genomes Project (VGP), an effort to generate high-quality, complete reference genomes for all ~70,000 extant vertebrate species and help enable a new era of discovery across the life sciences.

Abstract: High-quality and complete reference genome assemblies are fundamental for the application of genomics to biology, disease, and biodiversity conservation. However, such assemblies are only available for a few non-microbial species. To address this issue, the international Genome 10K (G10K) consortium has worked over a five-year period to evaluate and develop cost-effective methods for assembling the most accurate and complete reference genomes to date. Here we summarize these developments, introduce a set of quality standards, and present lessons learned from sequencing and assembling 16 species representing major vertebrate lineages (mammals, birds, reptiles, amphibians, teleost fishes and cartilaginous fishes). We confirm that long-read sequencing technologies are essential for maximizing genome quality and that unresolved complex repeats and haplotype heterozygosity are major sources of error in assemblies. Our new assemblies identify and correct substantial errors in some of the best historical reference genomes. Adopting these lessons, we have embarked on the Vertebrate Genomes Project (VGP), an effort to generate high-quality, complete reference genomes for all ~70,000 extant vertebrate species and help enable a new era of discovery across the life sciences.

567 citations

Journal Article•DOI•

Assignment of epidemiological lineages in an emerging pandemic using the pangolin tool.

Áine O'Toole¹, Emily Scher¹, Anthony Underwood², Ben Jackson¹, Verity Hill¹, John T. McCrone¹, Rachel M. Colquhoun¹, Christopher Ruis³, Khalil Abudahab², Ben Taylor², Corin Yeats², Louis du Plessis², Daniel Maloney¹, Nathan C Medd¹, Stephen W Attwood², David M. Aanensen², Edward C. Holmes⁴, Oliver G. Pybus², Andrew Rambaut¹

University of Edinburgh¹, University of Oxford², University of Cambridge³, University of Sydney⁴

30 Jul 2021-Virus Evolution

TL;DR: Pangolin is a computational tool developed to assign the most likely lineage to a given SARS-CoV-2 genome sequence according to the Pango dynamic lineage nomenclature scheme.

Abstract: The response of the global virus genomics community to the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic has been unprecedented, with significant advances made towards the 'real-time' generation and sharing of SARS-CoV-2 genomic data. The rapid growth in virus genome data production has necessitated the development of new analytical methods that can deal with orders of magnitude more genomes than previously available. Here, we present and describe Phylogenetic Assignment of Named Global Outbreak Lineages (pangolin), a computational tool that has been developed to assign the most likely lineage to a given SARS-CoV-2 genome sequence according to the Pango dynamic lineage nomenclature scheme. To date, nearly two million virus genomes have been submitted to the web-application implementation of pangolin, which has facilitated SARS-CoV-2 genomic epidemiology and provided researchers with access to actionable information about the pandemic's transmission lineages.

567 citations

Journal Article•DOI•

The PATRIC Bioinformatics Resource Center: expanding data and analysis capabilities.

James J. Davis¹,², Alice R. Wattam²,³, Ramy K. Aziz⁴, Thomas Brettin¹,², Ralph Butler²,⁵, Rory Butler², Philippe Chlenski, Neal Conrad¹,², Allan Dickerman³, Emily M. Dietrich¹,², Joseph L. Gabbard⁶, Svetlana Gerdes, Andrew Guard¹, Ronald W. Kenyon³, Dustin Machi³, Chunhong Mao³, Daniel E. Murphy-Olson¹,², Marcus Nguyen¹,², Eric K. Nordberg⁶, Gary J. Olsen⁷, Robert Olson¹,², Jamie C. Overbeek¹,², Ross Overbeek¹, Bruce Parrello¹,², Gordon D. Pusch, Maulik Shukla¹,², Chris Thomas¹, Margo VanOeffelen, Veronika Vonstein, Andrew S. Warren³, Fangfang Xia¹,², Dawen Xie³, Hyunseung Yoo¹,², Rick Stevens¹,²

University of Chicago¹, Argonne National Laboratory², University of Virginia³, Cairo University⁴, Middle Tennessee State University⁵, Virginia Tech⁶, University of Illinois at Urbana–Champaign⁷

31 Oct 2019-Nucleic Acids Research

TL;DR: The recent updates to the PATRIC resource are reported, including new web-based comparative analysis tools, eight new services and the release of a command-line interface to access, query and analyze data.

Abstract: The PathoSystems Resource Integration Center (PATRIC) is the bacterial Bioinformatics Resource Center funded by the National Institute of Allergy and Infectious Diseases (https://www.patricbrc.org). PATRIC supports bioinformatic analyses of all bacteria with a special emphasis on pathogens, offering a rich comparative analysis environment that provides users with access to over 250 000 uniformly annotated and publicly available genomes with curated metadata. PATRIC offers web-based visualization and comparative analysis tools, a private workspace in which users can analyze their own data in the context of the public collections, services that streamline complex bioinformatic workflows and command-line tools for bulk data analysis. Over the past several years, as genomic and other omics-related experiments have become more cost-effective and widespread, we have observed considerable growth in the usage of and demand for easy-to-use, publicly available bioinformatic tools and services. Here we report the recent updates to the PATRIC resource, including new web-based comparative analysis tools, eight new services and the release of a command-line interface to access, query and analyze data.

554 citations

Journal Article•DOI•

Telomere-to-telomere assembly of a complete human X chromosome

Karen H. Miga¹, Sergey Koren², Arang Rhie², Mitchell R. Vollger³, Ariel Gershman⁴, Andrey Bzikadze⁵, Shelise Brooks², Edmund Howe⁶, David Porubsky³, Glennis A. Logsdon³, Valerie A. Schneider², Tamara A. Potapova⁶, Jonathan Wood⁷, William Chow⁷, Joel Armstrong¹, Jeanne Fredrickson³, Evgenia Pak², Kristof Tigyi¹, Milinn Kremitzki⁸, Christopher Markovic⁸, Valerie Maduro², Amalia Dutra², Gerard G. Bouffard², Alexander M. Chang², Nancy F. Hansen², Amy B. Wilfert³, Françoise Thibaud-Nissen², Anthony D. Schmitt, Jon Matthew Belton, Siddarth Selvaraj, Megan Y. Dennis⁹, Daniela C. Soto⁹, Ruta Sahasrabudhe⁹, Gulhan Kaya⁹, Josh Quick¹⁰, Nicholas J. Loman¹⁰, Nadine Holmes¹¹, Matthew Loose¹¹, Urvashi Surti¹², Rosa Ana Risques³, Tina A. Graves Lindsay⁸, Robert S. Fulton⁸, Ira M. Hall⁸, Benedict Paten¹, Kerstin Howe⁷, Winston Timp⁴, Alice Young², James C. Mullikin², Pavel A. Pevzner⁵, Jennifer L. Gerton⁶, Beth A. Sullivan¹³, Evan E. Eichler³, Adam M. Phillippy²

University of California, Santa Cruz¹, National Institutes of Health², University of Washington³, Johns Hopkins University⁴, University of California, San Diego⁵, Stowers Institute for Medical Research⁶, Wellcome Trust Sanger Institute⁷, Washington University in St. Louis⁸, University of California, Davis⁹, University of Birmingham¹⁰, University of Nottingham¹¹, University of Pittsburgh¹², Duke University¹³

03 Sep 2020-Nature

TL;DR: High-coverage, ultra-long-read nanopore sequencing is used to create a new human genome assembly that improves on the coverage and accuracy of the current reference (GRCh38) and includes the gap-free, telomere-to-telomere sequence of the X chromosome.

Abstract: After two decades of improvements, the current human reference genome (GRCh38) is the most accurate and complete vertebrate genome ever produced. However, no single chromosome has been finished end to end, and hundreds of unresolved gaps persist [1,2]. Here we present a human genome assembly that surpasses the continuity of GRCh38 [2], along with a gapless, telomere-to-telomere assembly of a human chromosome. This was enabled by high-coverage, ultra-long-read nanopore sequencing of the complete hydatidiform mole CHM13 genome, combined with complementary technologies for quality improvement and validation. Focusing our efforts on the human X chromosome [3], we reconstructed the centromeric satellite DNA array (approximately 3.1 Mb) and closed the 29 remaining gaps in the current reference, including new sequences from the human pseudoautosomal regions and from cancer-testis ampliconic gene families (CT-X and GAGE). These sequences will be integrated into future human reference genome releases. In addition, the complete chromosome X, combined with the ultra-long nanopore data, allowed us to map methylation patterns across complex tandem repeats and satellite arrays. Our results demonstrate that finishing the entire human genome is now within reach, and the data presented here will facilitate ongoing efforts to complete the other human chromosomes.

502 citations

Journal Article•DOI•

Circulating SARS-CoV-2 spike N439K variants maintain fitness while evading antibody-mediated immunity.

E. Thomson¹,², Laura E. Rosen, James G Shepherd¹, Roberto Spreafico, Ana da Silva Filipe¹, Jason A. Wojcechowskyj, Chris Davis¹, Luca Piccoli, David J Pascall¹, Josh Dillen, Spyros Lytras¹, Nadine Czudnochowski, Rajiv Shah¹, Marcel Meury, Natasha Jesudason¹, Anna De Marco, Kathy Li¹, Jessica Bassi, Áine O'Toole³, Dora Pinto, Rachel M. Colquhoun³, Katja Culap, Ben Jackson³, Fabrizia Zatta, Andrew Rambaut³, Stefano Jaconi, Vattipally B. Sreenu¹, Jay C. Nix⁴, Ivy Zhang⁵,⁶, Ruth F. Jarrett¹, William G. Glass⁶, Martina Beltramello, Kyriaki Nomikou¹, Matteo Samuele Pizzuto, Lily Tong¹, Elisabetta Cameroni, Tristan I. Croll⁷, Natasha Johnson¹, Julia di Iulio, Arthur Wickenhagen¹, Alessandro Ceschi⁸,⁹, Aoife M. Harbison¹⁰, Daniel Mair¹, Paolo Ferrari¹¹, Katherine Smollett¹, Federica Sallusto⁸,¹², Stephen Carmichael¹, Christian Garzoni, Jenna Nichols¹, Massimo Galli, Joseph Hughes¹, Agostino Riva, Antonia Ho¹, Marco Schiuma, Malcolm G Semple¹³,¹⁴, Peter J. M. Openshaw¹⁵, Elisa Fadda¹⁰, J Kenneth Baillie³, John D. Chodera⁶, Suzannah J. Rihn¹, Samantha Lycett³, Herbert W. Virgin¹⁶, Amalio Telenti, Davide Corti, David Robertson¹, Gyorgy Snell

University of Glasgow¹, University of London², University of Edinburgh³, Lawrence Berkeley National Laboratory⁴, Cornell University⁵, Memorial Sloan Kettering Cancer Center⁶, University of Cambridge⁷, University of Lugano⁸, University of Zurich⁹, Maynooth University¹⁰, University of New South Wales¹¹, ETH Zurich¹², University of Liverpool¹³, Boston Children's Hospital¹⁴, National Institutes of Health¹⁵, Washington University in St. Louis¹⁶

04 Mar 2021-Cell

TL;DR: In this paper, the authors demonstrate that the immunodominant SARS-CoV-2 spike (S) receptor binding motif (RBM) is a highly variable region of S and provide epidemiological, clinical, and molecular characterization of a prevalent, sentinel RBM mutation, N439K.

483 citations

References

Journal Article•DOI•

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Stephen F. Altschul¹, Thomas L. Madden, Alejandro A. Schäffer¹, Jinghui Zhang, Zheng Zhang², Webb Miller², David J. Lipman

National Institutes of Health¹, Pennsylvania State University²

01 Sep 1997-Nucleic Acids Research

TL;DR: A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original.

Abstract: The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.

70,111 citations

Journal Article•DOI•

The Sequence Alignment/Map format and SAMtools

Heng Li¹, Bob Handsaker², Alec Wysoker², T. J. Fennell², Jue Ruan³, Nils Homer², Gabor T. Marth⁴, Gonçalo R. Abecasis², Richard Durbin¹

Wellcome Trust Sanger Institute¹, University of California, Los Angeles², Chinese Academy of Sciences³, Boston College⁴

01 Aug 2009-Bioinformatics

TL;DR: SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant calling and alignment viewing, and thus provides universal tools for processing read alignments.

Abstract: Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments. Availability: http://samtools.sourceforge.net Contact: [email protected]
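
To make the SAM format described above more concrete, here is a minimal sketch of reading one alignment line. It assumes the standard layout of eleven mandatory tab-separated columns (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL). The helper names and the example record are hypothetical and were made up for illustration. Real pipelines would normally use SAMtools or a dedicated library instead.

```python
import re
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The eleven mandatory SAM columns, in file order."""
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def parse_sam_line(line):
    """Parse one alignment line; optional tag columns are ignored."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("expected at least 11 tab-separated columns")
    return SamRecord(cols[0], int(cols[1]), cols[2], int(cols[3]), int(cols[4]),
                     cols[5], cols[6], int(cols[7]), int(cols[8]), cols[9], cols[10])


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    if cigar == "*":
        return 0
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)


def is_reverse(flag):
    """True when flag bit 0x10 (read mapped to the reverse strand) is set."""
    return bool(flag & 0x10)


# Hypothetical record, invented for this example.
EXAMPLE_LINE = "read1\t16\tchr1\t100\t60\t5M2D3M1I2M\t*\t0\t0\tACGTACGTACG\t*"

if __name__ == "__main__":
    rec = parse_sam_line(EXAMPLE_LINE)
    end = rec.pos + reference_span(rec.cigar) - 1
    print(rec.qname, rec.rname, rec.pos, end, "-" if is_reverse(rec.flag) else "+")
```

The CIGAR string is what lets a single format serve both short and long reads: insertions, deletions and clipping are all expressed as run-length operations, so the reference interval of an alignment can be recovered without the reference sequence itself.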

45,957 citations

Journal Article•DOI•

Fast and accurate short read alignment with Burrows–Wheeler transform

Heng Li¹, Richard Durbin¹

Wellcome Trust Sanger Institute¹

01 Jul 2009-Bioinformatics

TL;DR: The Burrows-Wheeler Alignment tool (BWA) is a new read alignment package based on backward search with the Burrows–Wheeler Transform (BWT) that efficiently aligns short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps.

Abstract: Motivation: The enormous amount of short reads generated by the new DNA sequencing technologies call for the development of fast and accurate read alignment programs. A first generation of hash table-based methods has been developed, including MAQ, which is accurate, feature rich and fast enough to align short reads from a single individual. However, MAQ does not support gapped alignment for single-end reads, which makes it unsuitable for alignment of longer reads where indels may occur frequently. The speed of MAQ is also a concern when the alignment is scaled up to the resequencing of hundreds of individuals. Results: We implemented Burrows-Wheeler Alignment tool (BWA), a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps. BWA supports both base space reads, e.g. from Illumina sequencing machines, and color space reads from AB SOLiD machines. Evaluations on both simulated and real data suggest that BWA is ~10–20× faster than MAQ, while achieving similar accuracy. In addition, BWA outputs alignment in the new standard SAM (Sequence Alignment/Map) format. Variant calling and other downstream analyses after the alignment can be achieved with the open source SAMtools software package. Availability: http://maq.sourceforge.net Contact: [email protected]
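
The abstract above names backward search with the Burrows–Wheeler Transform as BWA's core technique. As a rough illustration only, exact-match backward search can be sketched as follows. This is not BWA's implementation: BWA also allows mismatches and gaps and uses compact data structures, whereas this sketch builds the suffix array by naive sorting and stores full occurrence tables. All function names and the toy sequence are hypothetical.

```python
def build_bwt(text):
    """Return (suffix array, BWT) of text with a '$' terminator appended."""
    text = text + "$"  # '$' sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each suffix
    return sa, bwt


def build_tables(bwt):
    """Build first[c] (count of characters smaller than c) and occ[c][i]
    (count of c in bwt[:i])."""
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return first, occ


def backward_search(pattern, first, occ, n):
    """Return the half-open suffix-array interval [lo, hi) of exact matches."""
    lo, hi = 0, n
    for ch in reversed(pattern):  # extend the match one character to the left
        if ch not in first:
            return 0, 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


def find_all(pattern, text):
    """Sorted 0-based start positions of pattern in text."""
    sa, bwt = build_bwt(text)
    first, occ = build_tables(bwt)
    lo, hi = backward_search(pattern, first, occ, len(bwt))
    return sorted(sa[lo:hi])


if __name__ == "__main__":
    print(find_all("ACG", "ACGTACGTAC"))
```

Each step of the search narrows an interval of the suffix array using only two table lookups, so the work depends on the pattern length and not on the size of the reference. That property is what makes the approach attractive for aligning many short reads against a large genome.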

43,862 citations

Journal Article•DOI•

Fast gapped-read alignment with Bowtie 2

Ben Langmead¹, Steven L. Salzberg¹,²,³

University of Maryland, College Park¹, Johns Hopkins University², Johns Hopkins University School of Medicine³

01 Apr 2012-Nature Methods

TL;DR: Bowtie 2 combines the strengths of the full-text minute index with the flexibility and speed of hardware-accelerated dynamic programming algorithms to achieve a combination of high speed, sensitivity and accuracy.

Abstract: As the rate of sequencing increases, greater throughput is demanded from read aligners. The full-text minute index is often used to make alignment very fast and memory-efficient, but the approach is ill-suited to finding longer, gapped alignments. Bowtie 2 combines the strengths of the full-text minute index with the flexibility and speed of hardware-accelerated dynamic programming algorithms to achieve a combination of high speed, sensitivity and accuracy.

37,898 citations

"Minimap2: pairwise alignment for nu..." refers to background or methods in this paper

...Most of them were five times as slow as mainstream short-read aligners (Langmead and Salzberg, 2012; Li, 2013) in terms of the number of bases mapped per second....
[...]
...We evaluated minimap2 along with Bowtie2 (v2.3.3; Langmead and Salzberg 2012), BWA-MEM and SNAP (v1....
[...]

Journal Article•DOI•

STAR: ultrafast universal RNA-seq aligner

Alexander Dobin¹, Carrie A. Davis¹, Felix Schlesinger¹, Jorg Drenkow¹, Chris Zaleski¹, Sonali Jha¹, Philippe Batut¹, Mark Chaisson¹, Thomas R. Gingeras¹

Cold Spring Harbor Laboratory¹

01 Jan 2013-Bioinformatics

TL;DR: The Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure outperforms other aligners by a factor of >50 in mapping speed.

Abstract: Motivation: Accurate alignment of high-throughput RNA-seq data is a challenging and yet unsolved problem because of the non-contiguous transcript structure, relatively short read lengths and constantly increasing throughput of the sequencing technologies. Currently available RNA-seq aligners suffer from high mapping error rates, low mapping speed, read length limitation and mapping biases. Results: To align our large (>80 billion reads) ENCODE Transcriptome RNA-seq dataset, we developed the Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure. STAR outperforms other aligners by a factor of >50 in mapping speed, aligning to the human genome 550 million 2 × 76 bp paired-end reads per hour on a modest 12-core server, while at the same time improving alignment sensitivity and precision. In addition to unbiased de novo detection of canonical junctions, STAR can discover non-canonical splices and chimeric (fusion) transcripts, and is also capable of mapping full-length RNA sequences. Using Roche 454 sequencing of reverse transcription polymerase chain reaction amplicons, we experimentally validated 1960 novel intergenic splice junctions with an 80-90% success rate, corroborating the high precision of the STAR mapping strategy. Availability and implementation: STAR is implemented as standalone C++ code. STAR is free open source software distributed under GPLv3 license and can be downloaded from http://code.google.com/p/rna-star/.

30,684 citations