The GEM mapper: fast, accurate and versatile alignment by filtration

doi:10.1038/NMETH.2221

Journal Article•DOI•

Santiago Marco-Sola, Michael Sammeth, Roderic Guigó¹, Paolo Ribeca•Institutions (1)

Pompeu Fabra University¹

01 Dec 2012-Nature Methods (Nat Methods)-Vol. 9, Iss: 12, pp 1185-1188

TL;DR: The Genome Multitool (GEM) mapper can leverage string matching by filtration to search the alignment space more efficiently, simultaneously delivering precision and speed.

Abstract: Because of ever-increasing throughput requirements of sequencing data, most existing short-read aligners have been designed to focus on speed at the expense of accuracy. The Genome Multitool (GEM) mapper can leverage string matching by filtration to search the alignment space more efficiently, simultaneously delivering precision (performing fully tunable exhaustive searches that return all existing matches, including gapped ones) and speed (being several times faster than comparable state-of-the-art tools).
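
The idea of string matching by filtration can be illustrated with a minimal sketch based on the pigeonhole principle: if a pattern matches with at most k mismatches, then at least one of k+1 disjoint pieces of the pattern must match exactly, so exact hits of the pieces yield candidate positions that are then verified. This is not the GEM implementation; the function names are hypothetical, the sketch handles mismatches only (no gaps), and it uses plain `str.find` where a real mapper would query an index of the reference.

```python
def split_into_pieces(pattern, k):
    """Split pattern into k + 1 contiguous pieces of near-equal length.

    Returns a list of (offset_in_pattern, piece) tuples.
    """
    n = k + 1
    base, extra = divmod(len(pattern), n)
    pieces, start = [], 0
    for i in range(n):
        length = base + (1 if i < extra else 0)
        pieces.append((start, pattern[start:start + length]))
        start += length
    return pieces


def hamming_within(a, b, k):
    """Return True if equal-length strings a and b differ in at most k positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > k:
                return False
    return True


def filtration_search(text, pattern, k):
    """Return sorted start positions where pattern occurs in text with <= k mismatches."""
    if len(pattern) < k + 1:
        raise ValueError("pattern must have at least k + 1 characters")
    candidates = set()
    # Filtration step: every exact hit of a piece proposes one candidate start.
    for offset, piece in split_into_pieces(pattern, k):
        pos = text.find(piece)
        while pos != -1:
            start = pos - offset
            if 0 <= start <= len(text) - len(pattern):
                candidates.add(start)
            pos = text.find(piece, pos + 1)
    # Verification step: only the candidates are compared in full.
    return sorted(
        s for s in candidates
        if hamming_within(text[s:s + len(pattern)], pattern, k)
    )
```

For example, `filtration_search("ACGTACGTTTGACCA", "ACGTTT", 1)` inspects only two candidate positions instead of all ten possible starts.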

Citations

Journal Article•DOI•

featureCounts: an efficient general-purpose program for assigning sequence reads to genomic features

Yang Liao¹, Gordon K. Smyth¹, Wei Shi¹•Institutions (1)

Walter and Eliza Hall Institute of Medical Research¹

01 Apr 2014-Bioinformatics

TL;DR: featureCounts is a read summarization program suitable for counting reads generated from either RNA or genomic DNA sequencing experiments; it implements highly efficient chromosome hashing and feature blocking techniques.

Abstract: MOTIVATION: Next-generation sequencing technologies generate millions of short sequence reads, which are usually aligned to a reference genome. In many applications, the key information required for downstream analysis is the number of reads mapping to each genomic feature, for example to each exon or each gene. The process of counting reads is called read summarization. Read summarization is required for a great variety of genomic analyses but has so far received relatively little attention in the literature. RESULTS: We present featureCounts, a read summarization program suitable for counting reads generated from either RNA or genomic DNA sequencing experiments. featureCounts implements highly efficient chromosome hashing and feature blocking techniques. It is considerably faster than existing methods (by an order of magnitude for gene-level summarization) and requires far less computer memory. It works with either single or paired-end reads and provides a wide range of options appropriate for different sequencing applications. AVAILABILITY AND IMPLEMENTATION: featureCounts is available under GNU General Public License as part of the Subread (http://subread.sourceforge.net) or Rsubread (http://www.bioconductor.org) software packages.
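
Read summarization as described above can be sketched in a few lines. The abstract names chromosome hashing and feature blocking without detailing them, so the layout below is only an assumption of what such a scheme could look like: a dictionary keyed by chromosome, with features grouped into fixed-width blocks. `BIN_SIZE`, `build_index` and `count_reads` are hypothetical names, and the rule of counting a read only when it overlaps exactly one feature is a choice made for this sketch, not a statement about featureCounts itself.

```python
from collections import defaultdict

BIN_SIZE = 1000  # hypothetical block width in bases


def build_index(features):
    """Index features by chromosome, then by fixed-width block.

    features: iterable of (feature_id, chrom, start, end), coordinates inclusive.
    """
    index = defaultdict(lambda: defaultdict(list))
    for fid, chrom, start, end in features:
        for b in range(start // BIN_SIZE, end // BIN_SIZE + 1):
            index[chrom][b].append((start, end, fid))
    return index


def count_reads(index, reads):
    """Count reads per feature.

    reads: iterable of (chrom, start, end), coordinates inclusive.
    A read is counted only if it overlaps exactly one feature (sketch-level rule).
    """
    counts = defaultdict(int)
    for chrom, r_start, r_end in reads:
        bins = index.get(chrom)
        if bins is None:
            continue  # chromosome carries no features
        hits = set()
        # Only the blocks touched by the read are scanned.
        for b in range(r_start // BIN_SIZE, r_end // BIN_SIZE + 1):
            for f_start, f_end, fid in bins.get(b, ()):
                if f_start <= r_end and r_start <= f_end:
                    hits.add(fid)
        if len(hits) == 1:
            counts[hits.pop()] += 1
    return dict(counts)
```

The chromosome lookup removes most features from consideration in one step, and the blocks limit the overlap test to features near the read.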

14,103 citations

Posted Content•DOI•

Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM

Heng Li

16 Mar 2013-arXiv: Genomics

TL;DR: BWA-MEM automatically chooses between local and end-to-end alignments, supports paired-end reads and performs chimeric alignment; it is robust to sequencing errors and applicable to a wide range of sequence lengths from 70bp to a few megabases.

Abstract: Summary: BWA-MEM is a new alignment algorithm for aligning sequence reads or long query sequences against a large reference genome such as human. It automatically chooses between local and end-to-end alignments, supports paired-end reads and performs chimeric alignment. The algorithm is robust to sequencing errors and applicable to a wide range of sequence lengths from 70bp to a few megabases. For mapping 100bp sequences, BWA-MEM shows better performance than several state-of-art read aligners to date. Availability and implementation: BWA-MEM is implemented as a component of BWA, which is available at this http URL. Contact: hengli@broadinstitute.org

8,090 citations

Cites methods from "The GEM mapper: fast, accurate and ..."

...On speed, BWA-MEM is similar to GEM and Bowtie2 for this data set, but is about 6 times as fast as Bowtie2 and Cushaw2 for a 650bp long-read data set....
[...]
...In this background, a few long-read alignment algorithms, notably including BWA-SW (Li and Durbin, 2010), Bowtie2 (Langmead and Salzberg, 2012), Cushaw2 (Liu and Schmidt, 2012) and GEM (Marco-Sola et al., 2012), have been developed....
[...]
...While GEM is both fast and accurate for up to approximately 1000bp reads, it mandates end-to-end alignment and does not perform affine-gap alignment, which limits its uses for long-read alignment....
[...]
...We evaluated the performance of BWA-MEM on simulated data together with NovoAlign (http://novocraft.com), GEM, Bowtie2, Cushaw2, SeqAlto (Mu et al., 2012), BWA-SW and BWA (Figure 1)....
[...]
...BWA-MEM is close to NovoAlign for PE reads and is comparable to GEM and Cushaw2 for SE....
[...]

Journal Article•DOI•

The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote

Yang Liao¹, Gordon K. Smyth¹, Wei Shi¹•Institutions (1)

Walter and Eliza Hall Institute of Medical Research¹

01 May 2013-Nucleic Acids Research

TL;DR: This article proposes an elegantly simple multi-seed strategy, called seed-and-vote, for mapping reads to a reference genome, which uses a relatively large number of short seeds extracted from each read and allows all the seeds to vote on the optimal location.

Abstract: Read alignment is an ongoing challenge for the analysis of data from sequencing technologies. This article proposes an elegantly simple multi-seed strategy, called seed-and-vote, for mapping reads to a reference genome. The new strategy chooses the mapped genomic location for the read directly from the seeds. It uses a relatively large number of short seeds (called subreads) extracted from each read and allows all the seeds to vote on the optimal location. When the read length is <160 bp, overlapping subreads are used. More conventional alignment algorithms are then used to fill in detailed mismatch and indel information between the subreads that make up the winning voting block. The strategy is fast because the overall genomic location has already been chosen before the detailed alignment is done. It is sensitive because no individual subread is required to map exactly, nor are individual subreads constrained to map close by other subreads. It is accurate because the final location must be supported by several different subreads. The strategy extends easily to find exon junctions, by locating reads that contain sets of subreads mapping to different exons of the same gene. It scales up efficiently for longer reads.
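
The voting step described above can be sketched as follows. This is a simplified illustration rather than the Subread implementation: seeds must match exactly, indels between seeds are ignored, the detailed alignment that fills in mismatches is omitted, and `index_reference`, `seed_and_vote`, `seed_len` and `step` are hypothetical names and parameters.

```python
from collections import Counter, defaultdict


def index_reference(reference, seed_len):
    """Hash every substring of length seed_len to its start positions in the reference."""
    table = defaultdict(list)
    for i in range(len(reference) - seed_len + 1):
        table[reference[i:i + seed_len]].append(i)
    return table


def seed_and_vote(read, table, seed_len, step):
    """Let each seed vote for the read start position it implies.

    Returns (location, votes) for the best-supported location, or None if no
    seed hits the reference. Ties go to the leftmost location.
    """
    votes = Counter()
    for offset in range(0, len(read) - seed_len + 1, step):
        seed = read[offset:offset + seed_len]
        for hit in table.get(seed, ()):
            # A seed found at `hit` implies the read starts at hit - offset.
            votes[hit - offset] += 1
    if not votes:
        return None
    return max(votes.items(), key=lambda kv: (kv[1], -kv[0]))
```

Because no single seed is required to match, a read with a mismatch inside one or two seeds can still be placed as long as the remaining seeds agree on a location.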

2,228 citations

Cites background from "The GEM mapper: fast, accurate and ..."

...Most aligners then work out from the location that the seed mapped to, trying to match the remainder of the read to the genome surrounding the original location, a process often called the extension step (2)....
[...]

Journal Article•DOI•

A survey of best practices for RNA-seq data analysis

Ana Conesa¹, Pedro Madrigal²,³, Sonia Tarazona⁴, David Gomez-Cabrero, Alejandra Cervera⁵, Andrew McPherson⁶, Michał Wojciech Szcześniak⁷, Daniel J. Gaffney², Laura L. Elo⁸, Xuegong Zhang⁹, Ali Mortazavi¹⁰•Institutions (10)

University of Florida¹, Wellcome Trust Sanger Institute², University of Cambridge³, Polytechnic University of Valencia⁴, University of Helsinki⁵, Simon Fraser University⁶, Adam Mickiewicz University in Poznań⁷, Åbo Akademi University⁸, Tsinghua University⁹, University of California, Irvine¹⁰

26 Jan 2016-Genome Biology

TL;DR: All of the major steps in RNA-seq data analysis are reviewed, including experimental design, quality control, read alignment, quantification of gene and transcript levels, visualization, differential gene expression, alternative splicing, functional analysis, gene fusion detection and eQTL mapping.

Abstract: RNA-sequencing (RNA-seq) has a wide variety of applications, but no single analysis pipeline can be used in all cases. We review all of the major steps in RNA-seq data analysis, including experimental design, quality control, read alignment, quantification of gene and transcript levels, visualization, differential gene expression, alternative splicing, functional analysis, gene fusion detection and eQTL mapping. We highlight the challenges associated with each step. We discuss the analysis of small RNAs and the integration of RNA-seq with other functional genomics techniques. Finally, we discuss the outlook for novel technologies that are changing the state of the art in transcriptomics.

1,963 citations

Cites background from "The GEM mapper: fast, accurate and ..."

...[204]), achieve ultra-fast mapping (GEM [205]) or map long-reads...
[...]

Journal Article•DOI•

Transcriptome and genome sequencing uncovers functional variation in humans

Tuuli Lappalainen¹, Michael Sammeth, Marc R. Friedländer, Peter A C 't Hoen², Jean Monlong³, Manuel A. Rivas⁴, Mar Gonzàlez-Porta⁵, Natalja Kurbatova⁵, Thasso Griebel, Pedro G. Ferreira³, Matthias Barann⁶, Thomas Wieland, Liliana Greger⁵, Maarten van Iterson², Jonas Carlsson Almlöf⁷, Paolo Ribeca, Irina Pulyakhina², Daniela Esser⁶, Thomas Giger¹, Andrew Tikhonov⁵, Marc Sultan⁸, Gabrielle Bertier³, Daniel G. MacArthur⁹,¹⁰, Monkol Lek⁹,¹⁰, Esther Lizano, Henk P. J. Buermans², Ismael Padioleau¹,¹¹, Thomas Schwarzmayr, Olof Karlberg⁷, Halit Ongen¹,¹¹, Helena Kilpinen¹,¹¹, Sergi Beltran, Marta Gut, Katja Kahlem, Vyacheslav Amstislavskiy⁸, Oliver Stegle⁵, Matti Pirinen⁴, Stephen B. Montgomery¹,¹², Peter Donnelly⁴, Mark I. McCarthy⁴,¹³, Paul Flicek⁵, Tim M. Strom¹⁴, Hans Lehrach⁸, Stefan Schreiber⁶, Ralf Sudbrak⁸, Angel Carracedo¹⁵, Stylianos E. Antonarakis¹, Robert Häsler⁶, Ann-Christine Syvänen⁷, Gert-Jan B. van Ommen², Alvis Brazma⁵, Thomas Meitinger¹⁴, Philip Rosenstiel⁶, Roderic Guigó³, Ivo Gut, Xavier Estivill, Emmanouil T. Dermitzakis¹,¹¹•Institutions (15)

University of Geneva¹, Leiden University Medical Center², Pompeu Fabra University³, Wellcome Trust Centre for Human Genetics⁴, European Bioinformatics Institute⁵, University of Kiel⁶, Science for Life Laboratory⁷, Max Planck Society⁸, Broad Institute⁹, Harvard University¹⁰, Swiss Institute of Bioinformatics¹¹, Stanford University¹², University of Oxford¹³, Technische Universität München¹⁴, University of Santiago de Compostela¹⁵

26 Sep 2013-Nature

TL;DR: Sequencing and deep analysis of messenger RNA and microRNA from lymphoblastoid cell lines of 462 individuals from the 1000 Genomes Project, the first uniformly processed high-throughput RNA-sequencing data from multiple human populations with high-quality genome sequences, reveal extremely widespread genetic variation affecting the regulation of most genes.

Abstract: Genome sequencing projects are discovering millions of genetic variants in humans, and interpretation of their functional effects is essential for understanding the genetic basis of variation in human traits. Here we report sequencing and deep analysis of messenger RNA and microRNA from lymphoblastoid cell lines of 462 individuals from the 1000 Genomes Project--the first uniformly processed high-throughput RNA-sequencing data from multiple human populations with high-quality genome sequences. We discover extremely widespread genetic variation affecting the regulation of most genes, with transcript structure and expression level variation being equally common but genetically largely independent. Our characterization of causal regulatory variation sheds light on the cellular mechanisms of regulatory and loss-of-function variation, and allows us to infer putative causal variants for dozens of disease-associated loci. Altogether, this study provides a deep understanding of the cellular mechanisms of transcriptome variation and of the landscape of functional variants in the human genome.

1,892 citations

References

Journal Article•DOI•

The Sequence Alignment/Map format and SAMtools

Heng Li¹, Bob Handsaker², Alec Wysoker², T. J. Fennell², Jue Ruan³, Nils Homer², Gabor T. Marth⁴, Gonçalo R. Abecasis², Richard Durbin¹•Institutions (4)

Wellcome Trust Sanger Institute¹, University of California, Los Angeles², Chinese Academy of Sciences³, Boston College⁴

01 Aug 2009-Bioinformatics

TL;DR: SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant calling and alignment viewing, and thus provides universal tools for processing read alignments.

Abstract: Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments. Availability: http://samtools.sourceforge.net Contact: [email protected]
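
As a small illustration of the format, the sketch below parses one SAM alignment line into its eleven mandatory tab-separated fields and decodes two bits of the FLAG field (0x4 for an unmapped read, 0x10 for a read on the reverse strand). It does not use SAMtools; `SamRecord`, `parse_sam_line`, `is_unmapped` and `is_reverse` are hypothetical names, and optional tags after the eleventh column are ignored.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The eleven mandatory fields of a SAM alignment line."""
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


def parse_sam_line(line):
    """Parse one alignment line; header lines (starting with '@') are not handled."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]), fields[9], fields[10],
    )


def is_unmapped(record):
    """FLAG bit 0x4: the read itself is unmapped."""
    return bool(record.flag & 0x4)


def is_reverse(record):
    """FLAG bit 0x10: the read is aligned to the reverse strand."""
    return bool(record.flag & 0x10)
```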

45,957 citations

Journal Article•DOI•

Fast and accurate short read alignment with Burrows–Wheeler transform

Heng Li¹, Richard Durbin¹•Institutions (1)

Wellcome Trust Sanger Institute¹

01 Jul 2009-Bioinformatics

TL;DR: The Burrows-Wheeler Alignment tool (BWA) is a read alignment package based on backward search with the Burrows–Wheeler Transform (BWT) that efficiently aligns short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps.

Abstract: Motivation: The enormous amount of short reads generated by the new DNA sequencing technologies call for the development of fast and accurate read alignment programs. A first generation of hash table-based methods has been developed, including MAQ, which is accurate, feature rich and fast enough to align short reads from a single individual. However, MAQ does not support gapped alignment for single-end reads, which makes it unsuitable for alignment of longer reads where indels may occur frequently. The speed of MAQ is also a concern when the alignment is scaled up to the resequencing of hundreds of individuals. Results: We implemented Burrows-Wheeler Alignment tool (BWA), a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps. BWA supports both base space reads, e.g. from Illumina sequencing machines, and color space reads from AB SOLiD machines. Evaluations on both simulated and real data suggest that BWA is ~10–20× faster than MAQ, while achieving similar accuracy. In addition, BWA outputs alignment in the new standard SAM (Sequence Alignment/Map) format. Variant calling and other downstream analyses after the alignment can be achieved with the open source SAMtools software package. Availability: http://maq.sourceforge.net Contact: [email protected]
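
Backward search with the BWT can be illustrated with a toy exact-matching example. This is a sketch under simplifying assumptions, not BWA's implementation: it builds the suffix array by naive sorting, stores full occurrence tables, and does not handle mismatches or gaps. The function names are hypothetical, the text is assumed not to contain the sentinel `$`, and the pattern is assumed to be non-empty.

```python
def bwt_index(text):
    """Build a suffix array, first-column offsets and occurrence tables for text."""
    text += "$"  # sentinel that sorts before A, C, G and T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each sorted suffix

    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1

    # first[c]: number of characters in the text that sort before c.
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, first, occ


def backward_search(pattern, sa, first, occ):
    """Return sorted start positions of exact occurrences of pattern."""
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for ch in reversed(pattern):
        if ch not in first:
            return []
        # Narrow the range to suffixes that start with ch followed by the
        # part of the pattern matched so far.
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))
```

Each pattern character costs two table lookups regardless of the reference length, which is the property that makes this family of indexes attractive for short-read alignment.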

43,862 citations

Journal Article•DOI•

Fast gapped-read alignment with Bowtie 2

Ben Langmead¹, Steven L. Salzberg¹,²,³•Institutions (3)

University of Maryland, College Park¹, Johns Hopkins University², Johns Hopkins University School of Medicine³

01 Apr 2012-Nature Methods

TL;DR: Bowtie 2 combines the strengths of the full-text minute index with the flexibility and speed of hardware-accelerated dynamic programming algorithms to achieve a combination of high speed, sensitivity and accuracy.

Abstract: As the rate of sequencing increases, greater throughput is demanded from read aligners. The full-text minute index is often used to make alignment very fast and memory-efficient, but the approach is ill-suited to finding longer, gapped alignments. Bowtie 2 combines the strengths of the full-text minute index with the flexibility and speed of hardware-accelerated dynamic programming algorithms to achieve a combination of high speed, sensitivity and accuracy.

37,898 citations

Journal Article•DOI•

Ultrafast and memory-efficient alignment of short DNA sequences to the human genome

Ben Langmead¹, Cole Trapnell¹, Mihai Pop¹, Steven L. Salzberg¹•Institutions (1)

University of Maryland, College Park¹

04 Mar 2009-Genome Biology

TL;DR: Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches; multiple processor cores can be used simultaneously to achieve even greater alignment speeds.

Abstract: Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes. For the human genome, Burrows-Wheeler indexing allows Bowtie to align more than 25 million reads per CPU hour with a memory footprint of approximately 1.3 gigabytes. Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches. Multiple processor cores can be used simultaneously to achieve even greater alignment speeds. Bowtie is open source http://bowtie.cbcb.umd.edu.

20,335 citations

Journal Article•DOI•

Sequencing technologies-the next generation

Michael L. Metzker¹•Institutions (1)

Baylor College of Medicine¹

01 Jan 2010-Nature Reviews Genetics

TL;DR: A technical review of template preparation, sequencing and imaging, genome alignment and assembly approaches, and recent advances in current and near-term commercially available NGS instruments is presented.

Abstract: Demand has never been greater for revolutionary technologies that deliver fast, inexpensive and accurate genome information. This challenge has catalysed the development of next-generation sequencing (NGS) technologies. The inexpensive production of large volumes of sequence data is the primary advantage over conventional methods. Here, I present a technical review of template preparation, sequencing and imaging, genome alignment and assembly approaches, and recent advances in current and near-term commercially available NGS instruments. I also outline the broad range of applications for NGS technologies, in addition to providing guidelines for platform selection to address biological questions of interest.

7,023 citations