Simultaneous alignment of short reads against multiple genomes

doi:10.1186/GB-2009-10-9-R98

Journal Article

Korbinian Schneeberger¹, Jörg Hagmann¹, Stephan Ossowski¹, Norman Warthmann¹, Sandra Gesing², Oliver Kohlbacher², Detlef Weigel¹

Max Planck Society¹, University of Tübingen²

17 Sep 2009-Genome Biology (BioMed Central)-Vol. 10, Iss: 9, pp 1-12

TL;DR: GenomeMapper supports simultaneous mapping of short reads against multiple genomes by integrating related genomes into a single graph structure and introduces representations for alignments against complex structures.

Abstract: Genome resequencing with short reads generally relies on alignments against a single reference. GenomeMapper supports simultaneous mapping of short reads against multiple genomes by integrating related genomes (e.g., individuals of the same species) into a single graph structure. It constitutes the first approach for handling multiple references and introduces representations for alignments against complex structures. Demonstrated benefits include access to polymorphisms that cannot be identified by alignments against the reference alone. Download GenomeMapper at http://1001genomes.org.

Citations

Journal Article

Assembly algorithms for next-generation sequencing data.

Jason R. Miller¹, Sergey Koren¹, Granger G. Sutton¹

J. Craig Venter Institute¹

01 Jun 2010-Genomics

TL;DR: This review summarizes the published descriptions of the packages SSAKE, SHARCGS, VCAKE, Newbler, Celera Assembler, Euler, Velvet, ABySS, AllPaths, and SOAPdenovo, and uses them to compare the two standard approaches to assembly: the de Bruijn graph approach and the overlap/layout/consensus approach.

1,176 citations

Cites methods from "Simultaneous alignment of short rea..."

...Published software includes SOAP [68,69], MAQ [70], Bowtie [71], RMAP [72], CloudBurst [73], SHRiMP [74], RazerS [75], PerM [76], segemehl [77], GenomeMapper [78], and BOAT [79]....

Journal Article

The Rate and Molecular Spectrum of Spontaneous Mutations in Arabidopsis thaliana

Stephan Ossowski¹, Korbinian Schneeberger¹, José Ignacio Lucas-Lledó², Norman Warthmann¹, Richard M. Clark³, Ruth G. Shaw, Detlef Weigel¹, Michael Lynch²

Max Planck Society¹, Indiana University², University of Utah³

01 Jan 2010-Science

TL;DR: This work searched for de novo spontaneous mutations in the complete nuclear genomes of five Arabidopsis thaliana mutation accumulation lines that had been maintained by single-seed descent for 30 generations, and identified and validated 99 base substitutions and 17 small and large insertions and deletions.

Abstract: To take complete advantage of information on within-species polymorphism and divergence from close relatives, one needs to know the rate and the molecular spectrum of spontaneous mutations. To this end, we have searched for de novo spontaneous mutations in the complete nuclear genomes of five Arabidopsis thaliana mutation accumulation lines that had been maintained by single-seed descent for 30 generations. We identified and validated 99 base substitutions and 17 small and large insertions and deletions. Our results imply a spontaneous mutation rate of 7 × 10⁻⁹ base substitutions per site per generation, the majority of which are G:C→A:T transitions. We explain this very biased spectrum of base substitution mutations as a result of two main processes: deamination of methylated cytosines and ultraviolet light–induced mutagenesis.

1,021 citations

Journal Article

Whole-genome sequencing of multiple Arabidopsis thaliana populations

Jun Cao¹, Korbinian Schneeberger¹, Stephan Ossowski¹,², Torsten Günther³, Sebastian Bender¹, Joffrey Fitz¹, Daniel Koenig¹, Christa Lanz¹, Oliver Stegle¹, Christoph Lippert¹, Xi Wang¹, Felix Ott¹, Jonas Müller¹, Carlos Alonso-Blanco⁴, Karsten M. Borgwardt¹, Karl Schmid³, Detlef Weigel¹

Max Planck Society¹, Pompeu Fabra University², University of Hohenheim³, Spanish National Research Council⁴

01 Oct 2011-Nature Genetics

TL;DR: This work describes the majority of common small-scale polymorphisms, as well as many larger insertions and deletions, in the A. thaliana pan-genome, together with their effects on gene function and the patterns of local and global linkage among these variants.

Abstract: The plant Arabidopsis thaliana occurs naturally in many different habitats throughout Eurasia. As a foundation for identifying genetic variation contributing to adaptation to diverse environments, a 1001 Genomes Project to sequence geographically diverse A. thaliana strains has been initiated. Here we present the first phase of this project, based on population-scale sequencing of 80 strains drawn from eight regions throughout the species' native range. We describe the majority of common small-scale polymorphisms as well as many larger insertions and deletions in the A. thaliana pan-genome, their effects on gene function, and the patterns of local and global linkage among these variants. The action of processes other than spontaneous mutation is identified by comparing the spectrum of mutations that have accumulated since A. thaliana diverged from its closest relative 10 million years ago with the spectrum observed in the laboratory. Recent species-wide selective sweeps are rare, and potentially deleterious mutations are more common in marginal populations.

965 citations

Journal Article

A survey of sequence alignment algorithms for next-generation sequencing.

Heng Li¹, Nils Homer

Broad Institute¹

01 Sep 2010-Briefings in Bioinformatics

TL;DR: This article systematically reviews the wide variety of alignment algorithms and software developed over the preceding two years and introduces their practical applications to different types of experimental data.

Abstract: Rapidly evolving sequencing technologies produce data on an unparalleled scale. A central challenge to the analysis of this data is sequence alignment, whereby sequence reads must be compared to a reference. A wide variety of alignment algorithms and software have been subsequently developed over the past two years. In this article, we will systematically review the current development of these algorithms and introduce their practical applications on different types of experimental data. We come to the conclusion that short-read alignment is no longer the bottleneck of data analyses. We also consider future development of alignment algorithms with respect to emerging long sequence reads and the prospect of cloud computing.

958 citations

Journal Article

The Arabidopsis lyrata genome sequence and the basis of rapid genome size change

Tina T. Hu¹, Pedro Pattyn², E. G. Bakker, Jun Cao³, Jan Fang Cheng⁴, Richard M. Clark³, Noah Fahlgren⁵, Jeffrey A. Fawcett², Jane Grimwood⁴, Heidrun Gundlach, Georg Haberer, Jesse D. Hollister⁶, Stephan Ossowski³, Robert P. Ottilar⁴, Asaf Salamov⁴, Korbinian Schneeberger³, Manuel Spannagl, Xi-Mo Wang, Liang Yang⁶, Mikhail E. Nasrallah⁷, Joy Bergelson⁸, James C. Carrington⁵, Brandon S. Gaut⁶, Jeremy Schmutz⁴, Klaus F. X. Mayer, Yves Van de Peer², Igor V. Grigoriev⁴, Magnus Nordborg¹,⁹, Detlef Weigel³, Ya-Long Guo³

University of Southern California¹, Ghent University², Max Planck Society³, United States Department of Energy⁴, Oregon State University⁵, University of California, Irvine⁶, Cornell University⁷, University of Chicago⁸, Gregor Mendel Institute⁹

01 May 2011-Nature Genetics

TL;DR: The 207-Mb genome sequence of the North American Arabidopsis lyrata strain MN47, based on 8.3× dideoxy sequence coverage, is reported; analysis of deletions and insertions still segregating in A. thaliana suggests pervasive selection for a smaller genome.

Abstract: We present the 207 Mb genome sequence of the outcrosser Arabidopsis lyrata, which diverged from the self-fertilizing species A. thaliana about 10 million years ago. It is generally assumed that the much smaller A. thaliana genome, which is only 125 Mb, constitutes the derived state for the family. Apparent genome reduction in this genus can be partially attributed to the loss of DNA from large-scale rearrangements, but the main cause lies in the hundreds of thousands of small deletions found throughout the genome. These occurred primarily in non-coding DNA and transposons, but protein-coding multi-gene families are smaller in A. thaliana as well. Analysis of deletions and insertions still segregating in A. thaliana indicates that the process of DNA loss is ongoing, suggesting pervasive selection for a smaller genome.

845 citations

References

Journal Article

Fast and accurate short read alignment with Burrows–Wheeler transform

Heng Li¹, Richard Durbin¹

Wellcome Trust Sanger Institute¹

01 Jul 2009-Bioinformatics

TL;DR: This work implements the Burrows-Wheeler Alignment tool (BWA), a new read alignment package based on backward search with the Burrows–Wheeler Transform (BWT), which efficiently aligns short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps.

Abstract: Motivation: The enormous amount of short reads generated by the new DNA sequencing technologies call for the development of fast and accurate read alignment programs. A first generation of hash table-based methods has been developed, including MAQ, which is accurate, feature rich and fast enough to align short reads from a single individual. However, MAQ does not support gapped alignment for single-end reads, which makes it unsuitable for alignment of longer reads where indels may occur frequently. The speed of MAQ is also a concern when the alignment is scaled up to the resequencing of hundreds of individuals. Results: We implemented Burrows-Wheeler Alignment tool (BWA), a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps. BWA supports both base space reads, e.g. from Illumina sequencing machines, and color space reads from AB SOLiD machines. Evaluations on both simulated and real data suggest that BWA is ~10–20× faster than MAQ, while achieving similar accuracy. In addition, BWA outputs alignment in the new standard SAM (Sequence Alignment/Map) format. Variant calling and other downstream analyses after the alignment can be achieved with the open source SAMtools software package. Availability: http://maq.sourceforge.net

43,862 citations

"Simultaneous alignment of short rea..." refers methods in this paper

...Different approaches for fast mapping of short reads have been suggested, including methods for indexing substrings of either the short reads or the reference sequence with the use of k-mers or spaced seeds (academic tools such as Bowtie, BWA, CloudBurst, MAQ, MOM, MosaikAligner, mrFAST, mrsFAST, Pash, PASS, PatMaN, RazorS, RMAP, SeqMap, SHRiMP, SliderII, SOAP, SOAP2, ssaha2 [2,11-28], and commercial tools such as ZOOM [29])....
[...]

Journal Article

Ultrafast and memory-efficient alignment of short DNA sequences to the human genome

Ben Langmead¹, Cole Trapnell¹, Mihai Pop¹, Steven L. Salzberg¹

University of Maryland, College Park¹

04 Mar 2009-Genome Biology

TL;DR: Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches; multiple processor cores can be used simultaneously to achieve even greater alignment speeds.

Abstract: Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes. For the human genome, Burrows-Wheeler indexing allows Bowtie to align more than 25 million reads per CPU hour with a memory footprint of approximately 1.3 gigabytes. Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches. Multiple processor cores can be used simultaneously to achieve even greater alignment speeds. Bowtie is open source and available at http://bowtie.cbcb.umd.edu.

20,335 citations

"Simultaneous alignment of short rea..." refers background or methods in this paper

..., Burrows-Wheeler indexing [16]), the latter are usually geared toward ungapped alignments and are not easily extendable to nonlinear structures imposed by multiple genomes....
[...]
...It has been reported that the current high demand for rapid alignments, to accommodate the flood of data generated by efforts such as the 1000 Genomes Project, can be met with new indexing strategies [16]....
[...]
...Different approaches for fast mapping of short reads have been suggested, including methods for indexing substrings of either the short reads or the reference sequence with the use of k-mers or spaced seeds (academic tools such as Bowtie, BWA, CloudBurst, MAQ, MOM, MosaikAligner, mrFAST, mrsFAST, Pash, PASS, PatMaN, RazorS, RMAP, SeqMap, SHRiMP, SliderII, SOAP, SOAP2, ssaha2 [2,11-28], and commercial tools such as ZOOM [29])....
[...]
...SOAP and MAQ were previously compared with bowtie [16], but with a human target....
[...]

Journal Article

A general method applicable to the search for similarities in the amino acid sequence of two proteins

Saul B. Needleman¹, Christian D. Wunsch¹

Northwestern University¹

28 Mar 1970-Journal of Molecular Biology

TL;DR: A computer-adaptable method for finding similarities in the amino acid sequences of two proteins has been developed; with it, one can determine whether significant homology exists between the proteins and trace their possible evolutionary development.

11,844 citations

Journal Article

Analysis of the genome sequence of the flowering plant Arabidopsis thaliana.

Arabidopsis Genome Initiative¹

J. Craig Venter Institute¹

14 Dec 2000-Nature

TL;DR: This is the first complete genome sequence of a plant and provides the foundations for more comprehensive comparison of conserved processes in all eukaryotes, identifying a wide range of plant-specific gene functions and establishing rapid systematic ways to identify genes for crop improvement.

Abstract: The flowering plant Arabidopsis thaliana is an important model system for identifying genes and determining their functions. Here we report the analysis of the genomic sequence of Arabidopsis. The sequenced regions cover 115.4 megabases of the 125-megabase genome and extend into centromeric regions. The evolution of Arabidopsis involved a whole-genome duplication, followed by subsequent gene loss and extensive local gene duplications, giving rise to a dynamic genome enriched by lateral gene transfer from a cyanobacterial-like ancestor of the plastid. The genome contains 25,498 genes encoding proteins from 11,000 families, similar to the functional diversity of Drosophila and Caenorhabditis elegans--the other sequenced multicellular eukaryotes. Arabidopsis has many families of new proteins but also lacks several common protein families, indicating that the sets of common proteins have undergone differential expansion and contraction in the three multicellular eukaryotes. This is the first complete genome sequence of a plant and provides the foundations for more comprehensive comparison of conserved processes in all eukaryotes, identifying a wide range of plant-specific gene functions and establishing rapid systematic ways to identify genes for crop improvement.

8,742 citations

Journal Article

Accurate whole human genome sequencing using reversible terminator chemistry

David R. Bentley¹, Shankar Balasubramanian², Harold Swerdlow¹,³ +198 more

06 Nov 2008-Nature

TL;DR: An approach that generates several billion bases of accurate nucleotide sequence per experiment at low cost is reported, effective for accurate, rapid and economical whole-genome re-sequencing and many other biomedical applications.

Abstract: DNA sequence information underpins genetic research, enabling discoveries of important biological or medical benefit. Sequencing projects have traditionally used long (400-800 base pair) reads, but the existence of reference sequences for the human and many other genomes makes it possible to develop new, fast approaches to re-sequencing, whereby shorter reads are compared to a reference to identify intraspecies genetic variation. Here we report an approach that generates several billion bases of accurate nucleotide sequence per experiment at low cost. Single molecules of DNA are attached to a flat surface, amplified in situ and used as templates for synthetic sequencing with fluorescent reversible terminator deoxyribonucleotides. Images of the surface are analysed to generate high-quality sequence. We demonstrate application of this approach to human genome sequencing on flow-sorted X chromosomes and then scale the approach to determine the genome sequence of a male Yoruba from Ibadan, Nigeria. We build an accurate consensus sequence from >30x average depth of paired 35-base reads. We characterize four million single-nucleotide polymorphisms and four hundred thousand structural variants, many of which were previously unknown. Our approach is effective for accurate, rapid and economical whole-genome re-sequencing and many other biomedical applications.

3,802 citations

"Simultaneous alignment of short rea..." refers methods in this paper

...The initial resequencing of Caenorhabditis elegans and Arabidopsis thaliana (Arabidopsis) strains with Illumina reads [1,2] was recently complemented by genome sequences of several human individuals, generated with data derived from technologies from Illumina (San Diego, CA, USA), Applied Biosystems (Foster City, CA, USA), and Helicos (Cambridge, MA, USA) [3-10]....
[...]