Journal ArticleDOI

Searching for SNPs with cloud computing

20 Nov 2009-Genome Biology (BioMed Central)-Vol. 10, Iss: 11, pp 1-10
TL;DR: Crossbow is a cloud-computing software tool that combines the aligner Bowtie and the SNP caller SOAPsnp; it analyzes data comprising 38-fold coverage of the human genome in three hours using a 320-CPU cluster rented from a cloud computing service for about $85.
Abstract: As DNA sequencing outpaces improvements in computer speed, there is a critical need to accelerate tasks like alignment and SNP calling. Crossbow is a cloud-computing software tool that combines the aligner Bowtie and the SNP caller SOAPsnp. Executing in parallel using Hadoop, Crossbow analyzes data comprising 38-fold coverage of the human genome in three hours using a 320-CPU cluster rented from a cloud computing service for about $85. Crossbow is available from http://bowtie-bio.sourceforge.net/crossbow/.
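As a quick sanity check on the abstract's figures (320 CPUs, three hours, about $85), the implied cloud price per CPU-hour can be computed directly. The inputs below come straight from the abstract; the per-CPU-hour rate is an inference of this sketch, not a figure from the paper:

```python
# Back-of-envelope check of the cluster figures quoted in the abstract.
cpus = 320        # rented cluster size
hours = 3.0       # wall-clock time for the whole-genome analysis
cost_usd = 85.0   # approximate total rental cost

cpu_hours = cpus * hours                   # total compute consumed
price_per_cpu_hour = cost_usd / cpu_hours  # implied cloud price

print(cpu_hours)                     # 960.0
print(round(price_per_cpu_hour, 3))  # 0.089
```

About 9 cents per CPU-hour, consistent with commodity cloud pricing at the time of publication.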


Citations
Journal ArticleDOI
TL;DR: A unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs is presented.
Abstract: Recent advances in sequencing technology make it possible to comprehensively catalogue genetic variation in population samples, creating a foundation for understanding human disease, ancestry and evolution. The amounts of raw data produced are prodigious and many computational steps are required to translate this output into high-quality variant calls. We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs. Our process includes (1) initial read mapping; (2) local realignment around indels; (3) base quality score recalibration; (4) SNP discovery and genotyping to find all potential variants; and (5) machine learning to separate true segregating variation from machine artifacts common to next-generation sequencing technologies. We discuss the application of these tools, instantiated in the Genome Analysis Toolkit (GATK), to deep whole-genome, whole-exome capture, and multi-sample low-pass (~4×) 1000 Genomes Project datasets.
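The five numbered stages in the abstract form a strictly ordered data flow. The sketch below makes that ordering explicit; the stage functions are hypothetical stand-ins for the real GATK tools, shown only to illustrate how each stage consumes the previous stage's output:

```python
# Illustrative sketch of the five-stage variant-calling framework described
# above. Function names are hypothetical, not GATK APIs.
def map_reads(reads):            return {"reads": reads, "stages": ["mapped"]}
def realign_indels(x):           x["stages"].append("realigned"); return x
def recalibrate_base_quality(x): x["stages"].append("recalibrated"); return x
def discover_and_genotype(x):    x["stages"].append("genotyped"); return x
def filter_artifacts(x):         x["stages"].append("filtered"); return x

def pipeline(raw_reads):
    # Stages run strictly in order (1)-(5); each consumes the prior output.
    result = map_reads(raw_reads)
    for stage in (realign_indels, recalibrate_base_quality,
                  discover_and_genotype, filter_artifacts):
        result = stage(result)
    return result

print(pipeline(["ACGT", "TTGA"])["stages"])
# ['mapped', 'realigned', 'recalibrated', 'genotyped', 'filtered']
```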

10,056 citations


Additional excerpts

  • ...Comparison of this calling pipeline to Crossbow To calibrate the additional value of the tools described here, we contrasted our results with SNPs called on our raw NA12878 exome data using Crossbo...


Journal ArticleDOI
TL;DR: Estimates show that genomics is a “four-headed beast”—it is either on par with or the most demanding of the domains analyzed here in terms of data acquisition, storage, distribution, and analysis.
Abstract: Genomics is a Big Data science and is going to get much bigger, very soon, but it is not known whether the needs of genomics will exceed other Big Data domains. Projecting to the year 2025, we compared genomics with three other major generators of Big Data: astronomy, YouTube, and Twitter. Our estimates show that genomics is a “four-headed beast”—it is either on par with or the most demanding of the domains analyzed here in terms of data acquisition, storage, distribution, and analysis. We discuss aspects of new technologies that will need to be developed to rise up and meet the computational challenges that genomics poses for the near future. Now is the time for concerted, community-wide planning for the “genomical” challenges of the next decade.

1,128 citations


Additional excerpts

  • ...Variant calling on 2 billion genomes per year, with 100,000 CPUs in parallel, would require methods that process 2 genomes per CPU-hour, three-to-four orders of magnitude faster than current capabilities [42]....

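The throughput requirement in this excerpt follows from simple arithmetic. One assumption is added here that the excerpt leaves implicit: every CPU is kept busy all year.

```python
# Reproduce the "2 genomes per CPU-hour" requirement from the excerpt.
genomes_per_year = 2_000_000_000
cpus = 100_000
cpu_hours_per_year = cpus * 365 * 24   # assumes fully utilized CPUs year-round

required_rate = genomes_per_year / cpu_hours_per_year
print(round(required_rate, 2))   # 2.28 genomes per CPU-hour
```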

Journal ArticleDOI
28 Sep 2012-Cell
TL;DR: To observe Myc target expression and function in a system where Myc is temporally and physiologically regulated, the transcriptomes and the genome-wide distributions of Myc, RNA polymerase II, and chromatin modifications were compared during lymphocyte activation and in ES cells.

970 citations


Cites methods from "Searching for SNPs with cloud compu..."

  • ...Sequence reads of 25 bp for ChIP-Seq and 36 bp for RNA-Seq were generated from an Illumina Genome Analyzer, mapped to mouse genome (mm8) by using Bowtie (Langmead et al., 2009)....


Journal ArticleDOI
Heng Li1, Nils Homer
TL;DR: A wide variety of alignment algorithms and software have been developed over the past two years; the authors systematically review the current development of these algorithms and their practical applications to different types of experimental data.
Abstract: Rapidly evolving sequencing technologies produce data on an unparalleled scale. A central challenge to the analysis of this data is sequence alignment, whereby sequence reads must be compared to a reference. A wide variety of alignment algorithms and software have been subsequently developed over the past two years. In this article, we will systematically review the current development of these algorithms and introduce their practical applications on different types of experimental data. We come to the conclusion that short-read alignment is no longer the bottleneck of data analyses. We also consider future development of alignment algorithms with respect to emerging long sequence reads and the prospect of cloud computing.

958 citations

Journal ArticleDOI
TL;DR: How to master the different types of computational environments that exist — such as cloud and heterogeneous computing — to successfully tackle the authors' big data problems is discussed.
Abstract: Today we can generate hundreds of gigabases of DNA and RNA sequencing data in a week for less than US$5,000. The astonishing rate of data generation by these low-cost, high-throughput technologies in genomics is being matched by that of other technologies, such as real-time imaging and mass spectrometry-based flow cytometry. Success in the life sciences will depend on our ability to properly interpret the large-scale, high-dimensional data sets that are generated by these technologies, which in turn requires us to adopt advances in informatics. Here we discuss how we can master the different types of computational environments that exist — such as cloud and heterogeneous computing — to successfully tackle our big data problems.

612 citations

References
Journal ArticleDOI
TL;DR: Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches and can be used simultaneously to achieve even greater alignment speeds.
Abstract: Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes. For the human genome, Burrows-Wheeler indexing allows Bowtie to align more than 25 million reads per CPU hour with a memory footprint of approximately 1.3 gigabytes. Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches. Multiple processor cores can be used simultaneously to achieve even greater alignment speeds. Bowtie is open source http://bowtie.cbcb.umd.edu.
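The Burrows-Wheeler/FM-index machinery the abstract refers to can be shown in miniature. The sketch below builds a BWT by sorting rotations (real tools use suffix-array construction, not this quadratic approach) and counts pattern occurrences with FM-style backward search. It is a toy model of the underlying technique, not Bowtie's implementation, which adds the quality-aware backtracking for mismatches on top:

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (toy-scale only)."""
    text += "$"  # unique sentinel, lexicographically smallest character
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)

def fm_count(pattern, b):
    """Count occurrences of pattern via FM-index backward search over BWT b."""
    chars = sorted(set(b))
    # C[c]: number of characters in b strictly smaller than c
    C, total = {}, 0
    for c in chars:
        C[c] = total
        total += b.count(c)
    # occ[c][i]: occurrences of c in b[:i]
    occ = {c: [0] * (len(b) + 1) for c in chars}
    for i, ch in enumerate(b):
        for c in chars:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    lo, hi = 0, len(b)           # range of matching suffix-array rows
    for c in reversed(pattern):  # extend the match one character leftward
        if c not in C:
            return 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo

b = bwt("banana")
print(b)                   # annb$aa
print(fm_count("ana", b))  # 2
```

Each backward-search step is a constant number of rank lookups, which is what makes alignment both fast and memory-efficient at genome scale.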

20,335 citations


"Searching for SNPs with cloud compu..." refers methods in this paper

  • ...For alignment, Crossbow uses Bowtie [17], which employs a Burrows-Wheeler index [25] based on the full-text minute-space (FM) index [26] to enable fast and memory-efficient alignment of short reads to mammalian genomes....


  • ...We present Crossbow, a Hadoop-based software tool that combines the speed of the short read aligner Bowtie [17] with the accuracy of the SNP caller SOAPsnp [18] to perform alignment and SNP detection for multiple whole-human datasets per day....


Journal ArticleDOI
Jeffrey Dean1, Sanjay Ghemawat1
06 Dec 2004
TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
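The map/reduce contract the abstract describes is easy to demonstrate in miniature. The canonical word-count example below runs the two user-supplied functions around an explicit shuffle step; it is single-process, so the parallelism, partitioning, and fault tolerance that the paper's runtime provides are deliberately elided:

```python
from collections import defaultdict

def map_fn(doc):
    # User-supplied map: emit (key, value) pairs from one input record.
    for word in doc.split():
        yield (word, 1)

def reduce_fn(key, values):
    # User-supplied reduce: merge all intermediate values sharing a key.
    return key, sum(values)

def mapreduce(docs):
    # Shuffle: group intermediate values by key before the reduce phase.
    groups = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

print(mapreduce(["to be or not", "to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Crossbow's use of Hadoop follows this same pattern: alignment is the map phase and per-locus SNP calling is the reduce phase, with the shuffle grouping alignments by genomic region.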

20,309 citations

Journal ArticleDOI
TL;DR: Although >90% of uniquely mapped reads fell within known exons, the remaining data suggest new and revised gene models, including changed or additional promoters, exons and 3′ untranslated regions, as well as new candidate microRNA precursors.
Abstract: We have mapped and quantified mouse transcriptomes by deeply sequencing them and recording how frequently each gene is represented in the sequence sample (RNA-Seq). This provides a digital measure of the presence and prevalence of transcripts from known and previously unknown genes. We report reference measurements composed of 41–52 million mapped 25-base-pair reads for poly(A)-selected RNA from adult mouse brain, liver and skeletal muscle tissues. We used RNA standards to quantify transcript prevalence and to test the linear range of transcript detection, which spanned five orders of magnitude. Although >90% of uniquely mapped reads fell within known exons, the remaining data suggest new and revised gene models, including changed or additional promoters, exons and 3′ untranslated regions, as well as new candidate microRNA precursors. RNA splice events, which are not readily measured by standard gene expression microarray or serial analysis of gene expression methods, were detected directly by mapping splice-crossing sequence reads. We observed 1.45 × 10^5 distinct splices, and alternative splices were prominent, with 3,500 different genes expressing one or more alternate internal splices. The mRNA population specifies a cell’s identity and helps to govern its present and future activities. This has made transcriptome analysis a general phenotyping method, with expression microarrays of many kinds in routine use. Here we explore the possibility that transcriptome analysis, transcript discovery and transcript refinement can be done effectively in large and complex mammalian genomes by ultra-high-throughput sequencing. Expression microarrays are currently the most widely used methodology for transcriptome analysis, although some limitations persist. These include hybridization and cross-hybridization artifacts (refs 1–3), dye-based detection issues and design constraints that preclude or seriously limit the detection of RNA splice patterns and previously unmapped genes. These issues have made it difficult for standard array designs to provide full sequence comprehensiveness (coverage of all possible genes, including unknown ones, in large genomes) or transcriptome comprehensiveness (reliable detection of all RNAs of all prevalence classes, including the least abundant ones that are physiologically relevant). Other

12,293 citations


"Searching for SNPs with cloud compu..." refers background in this paper

  • ...Technologies from Illumina (San Diego, CA, USA), Applied Biosystems (Foster City, CA, USA) and 454 Life Sciences (Branford, CT, USA) have been used to detect genomic variations among humans [1-5], to profile methylation patterns [6], to map DNA-protein interactions [7], and to identify differentially expressed genes and novel splice junctions [8,9]....


Journal ArticleDOI
TL;DR: The TopHat pipeline is much faster than previous systems, mapping nearly 2.2 million reads per CPU hour, which is sufficient to process an entire RNA-Seq experiment in less than a day on a standard desktop computer.
Abstract: Motivation: A new protocol for sequencing the messenger RNA in a cell, known as RNA-Seq, generates millions of short sequence fragments in a single run. These fragments, or ‘reads’, can be used to measure levels of gene expression and to identify novel splice variants of genes. However, current software for aligning RNA-Seq data to a genome relies on known splice junctions and cannot identify novel ones. TopHat is an efficient read-mapping algorithm designed to align reads from an RNA-Seq experiment to a reference genome without relying on known splice sites. Results: We mapped the RNA-Seq reads from a recent mammalian RNA-Seq experiment and recovered more than 72% of the splice junctions reported by the annotation-based software from that study, along with nearly 20 000 previously unreported junctions. The TopHat pipeline is much faster than previous systems, mapping nearly 2.2 million reads per CPU hour, which is sufficient to process an entire RNA-Seq experiment in less than a day on a standard desktop computer. We describe several challenges unique to ab initio splice site discovery from RNA-Seq reads that will require further algorithm development. Availability: TopHat is free, open-source software available from http://tophat.cbcb.umd.edu Contact: cole@cs.umd.edu Supplementary information: Supplementary data are available at Bioinformatics online.
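The "less than a day" claim follows from the mapping rate in the abstract. The experiment size below is an assumption of this sketch (chosen to match the 41–52 million mapped reads quoted for the mouse RNA-Seq study elsewhere on this page), not a figure from the TopHat paper:

```python
# Check the "less than a day" claim. The mapping rate is from the abstract;
# the experiment size (50 million reads) is an assumed, representative value.
reads_per_cpu_hour = 2.2e6
experiment_reads = 50e6

hours = experiment_reads / reads_per_cpu_hour
print(round(hours, 1))   # 22.7 -> under a day on a single CPU
```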

11,473 citations


"Searching for SNPs with cloud compu..." refers background in this paper

  • ...Technologies from Illumina (San Diego, CA, USA), Applied Biosystems (Foster City, CA, USA) and 454 Life Sciences (Branford, CT, USA) have been used to detect genomic variations among humans [1-5], to profile methylation patterns [6], to map DNA-protein interactions [7], and to identify differentially expressed genes and novel splice junctions [8,9]....


Journal ArticleDOI
TL;DR: The dbSNP database is a general catalog of genome variation to address the large-scale sampling designs required by association studies, gene mapping and evolutionary biology, and is integrated with other sources of information at NCBI such as GenBank, PubMed, LocusLink and the Human Genome Project data.
Abstract: In response to a need for a general catalog of genome variation to address the large-scale sampling designs required by association studies, gene mapping and evolutionary biology, the National Center for Biotechnology Information (NCBI) has established the dbSNP database [S.T.Sherry, M.Ward and K.Sirotkin (1999) Genome Res., 9, 677–679]. Submissions to dbSNP will be integrated with other sources of information at NCBI such as GenBank, PubMed, LocusLink and the Human Genome Project data. The complete contents of dbSNP are available to the public at website: http://www.ncbi.nlm.nih.gov/SNP. The complete contents of dbSNP can also be downloaded in multiple formats via anonymous FTP at ftp://ncbi.nlm.nih.gov/snp/.

6,449 citations


"Searching for SNPs with cloud compu..." refers methods in this paper

  • ...Positions for known SNPs were calculated according to data in dbSNP [28] versions 128 and 130, and allele frequencies were calculated according to data from the HapMap project [22]....


  • ...Files containing known SNP locations and allele frequencies derived from dbSNP [28] are distributed to worker nodes via the same mechanism used to... (Figure 2: Crossbow workflow)....
