Toil enables reproducible, open source, big biomedical data analyses

doi:10.1038/NBT.3772

Journal Article•DOI•

Toil enables reproducible, open source, big biomedical data analyses

John Vivian¹, Arjun A. Rao¹, Frank Austin Nothaft², Christopher Ketchum¹, Joel Armstrong¹, Adam M. Novak¹, Jacob Pfeil¹, Jake Narkizian¹, Alden Deran¹, Audrey Musselman-Brown¹, Hannes Schmidt¹, Peter Amstutz, Brian Craft¹, Mary Goldman¹, Kate R. Rosenbloom¹, Melissa S. Cline¹, Brian O'Connor¹, Megan Hanna³, Chet Birger³, W. James Kent¹, David A. Patterson², Anthony D. Joseph², Jingchun Zhu¹, Sasha Zaranek, Gad Getz³, David Haussler¹, Benedict Paten¹•Institutions (3)

University of California, Santa Cruz¹, University of California, Berkeley², Massachusetts Institute of Technology³

01 Apr 2017-Nature Biotechnology (Nature Research)-Vol. 35, Iss: 4, pp 314-316

TL;DR: Toil is portable, open-source workflow software for running scientific workflows securely, reproducibly and efficiently at large scale; to demonstrate it, the authors processed over 20,000 RNA-seq samples, nearly all of them in under four days on a commercial cloud cluster of 32,000 preemptable cores.

Abstract: Toil is portable, open-source workflow software that supports contemporary workflow definition languages and can be used to securely and reproducibly run scientific workflows efficiently at large scale. To demonstrate Toil, we processed over 20,000 RNA-seq samples to create a consistent meta-analysis of five datasets free of computational batch effects that we make freely available. Nearly all the samples were analysed in under four days using a commercial cloud cluster of 32,000 preemptable cores.
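
To put the scale quoted in the abstract into perspective, the following is a rough back-of-envelope sketch in Python. It only combines the three figures stated above (32,000 cores, under four days, over 20,000 samples). It assumes, hypothetically, that every core was busy for the full four days, so the results are upper bounds rather than measured values from the paper, and the function names are illustrative only.

```python
# Figures taken from the abstract; everything derived from them is an estimate.
CORES = 32_000        # size of the preemptable cloud cluster
MAX_DAYS = 4          # "under four days", so this is a ceiling
MIN_SAMPLES = 20_000  # "over 20,000 RNA-seq samples", so this is a floor


def core_hours_upper_bound(cores: int, days: int) -> int:
    """Core-hours available if every core ran for the whole period (assumption)."""
    return cores * days * 24


def per_sample_upper_bound(cores: int, days: int, samples: int) -> float:
    """Upper bound on average core-hours per sample under the same assumption."""
    return core_hours_upper_bound(cores, days) / samples


if __name__ == "__main__":
    total = core_hours_upper_bound(CORES, MAX_DAYS)
    per_sample = per_sample_upper_bound(CORES, MAX_DAYS, MIN_SAMPLES)
    print(f"At most {total:,} core-hours in total")
    print(f"At most {per_sample:.1f} core-hours per sample on average")
```

Because the run took less than four days and covered more than 20,000 samples, the true per-sample average would be below this ceiling.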

Citations

Journal Article•DOI•

xCell: digitally portraying the tissue cellular heterogeneity landscape

Dvir Aran¹, Zicheng Hu¹, Atul J. Butte¹•Institutions (1)

University of California, San Francisco¹

15 Nov 2017-Genome Biology

TL;DR: This work presents xCell, a novel gene signature-based method, and uses it to infer 64 immune and stromal cell types and shows that xCell outperforms other methods.

Abstract: Tissues are complex milieus consisting of numerous cell types. Several recent methods have attempted to enumerate cell subsets from transcriptomes. However, the available methods have used limited sources for training and give only a partial portrayal of the full cellular landscape. Here we present xCell, a novel gene signature-based method, and use it to infer 64 immune and stromal cell types. We harmonized 1822 pure human cell type transcriptomes from various sources and employed a curve fitting approach for linear comparison of cell types and introduced a novel spillover compensation technique for separating them. Using extensive in silico analyses and comparison to cytometry immunophenotyping, we show that xCell outperforms other methods. xCell is available at http://xCell.ucsf.edu/.

2,040 citations

Cites methods from "Toil enables reproducible, open sou..."

...We next applied our methodology to 9947 primary tumor samples across 37 cancer types from the TCGA and TARGET projects [30] (Additional file 2: Figure S10)....
[...]

Journal Article•DOI•

GEPIA2: an enhanced web server for large-scale expression profiling and interactive analysis.

Zefang Tang¹, Boxi Kang¹, Chenwei Li¹, Tianxiang Chen¹, Zemin Zhang¹•Institutions (1)

Peking University¹

02 Jul 2019-Nucleic Acids Research

TL;DR: GEPIA2 has adopted new analysis techniques of gene signature quantification inspired by single-cell sequencing studies, and provides customized analysis where users can upload their own RNA-seq data and compare them with TCGA and GTEx samples.

Abstract: Introduced in 2017, the GEPIA (Gene Expression Profiling Interactive Analysis) web server has been a valuable and highly cited resource for gene expression analysis based on tumor and normal samples from the TCGA and the GTEx databases. Here, we present GEPIA2, an updated and enhanced version to provide insights with higher resolution and more functionalities. Featuring 198 619 isoforms and 84 cancer subtypes, GEPIA2 has extended gene expression quantification from the gene level to the transcript level, and supports analysis of a specific cancer subtype, and comparison between subtypes. In addition, GEPIA2 has adopted new analysis techniques of gene signature quantification inspired by single-cell sequencing studies, and provides customized analysis where users can upload their own RNA-seq data and compare them with TCGA and GTEx samples. We also offer an API for batch process and easy retrieval of the analysis results. The updated web server is publicly accessible at http://gepia2.cancer-pku.cn/.

1,988 citations

Cites methods from "Toil enables reproducible, open sou..."

...We downloaded the TCGA and GTEx isoform expression data that are re-computed from raw RNA-Seq data by the UCSC Xena project (11) based on a uniform pipeline and collected different cancer subtypes information from TCGA papers....
[...]
...In GEPIA2, we utilized the UCSC Xena (11) recomputed data of TCGA and GTEx for 198,619 coding and a series of other non-coding transcripts, and developed new computational functionalities to explore such events....
[...]

Journal Article•DOI•

Visualizing and interpreting cancer genomics data via the Xena platform.

Mary Goldman¹, Brian Craft¹, Mim Hastie, Kristupas Repečka², Fran McDade, Akhil Kamath³, Ayan Banerjee⁴, Yunhai Luo⁵, Dave Rogers, Angela N. Brooks¹, Jingchun Zhu¹, David Haussler¹•Institutions (5)

University of California, Santa Cruz¹, Vilnius University², Birla Institute of Technology and Science³, National Institute of Technology, Durgapur⁴, Stanford University⁵

22 May 2020-Nature Biotechnology

TL;DR: Xena’s Visual Spreadsheet visualization integrates gene-centric and genomic-coordinate-centric views across multiple data modalities, providing a deep, comprehensive view of genomic events within a cohort of tumors.

Abstract: To the Editor — There is a great need for easy-to-use cancer genomics visualization tools for both large public data resources such as TCGA (The Cancer Genome Atlas)1 and the GDC (Genomic Data Commons)2, as well as smaller-scale datasets generated by individual labs. Commonly used interactive visualization tools are either web-based portals or desktop applications. Data portals have a dedicated back end and are a powerful means of viewing centrally hosted resource datasets (for example, Xena’s predecessor, the University of California, Santa Cruz (UCSC) Cancer Browser (currently retired3), cBioPortal4, ICGC (International Cancer Genomics Consortium) Data Portal5, GDC Data Portal2). However, researchers wishing to use a data portal to explore their own data have to either redeploy the entire platform, a difficult task even for bioinformaticians, or upload private data to a server outside the user’s control, a non-starter for protected patient data, such as germline variants (for example, MAGI (Mutation Annotation and Genome Interpretation6), WebMeV7 or Ordino8). Desktop tools can view a user’s own data securely (for example, Integrated Genomics Viewer (IGV)9, Gitools10), but lack well-maintained, prebuilt files for the ever-evolving and expanding public data resources. This dichotomy between data portals and desktop tools highlights the challenge of using a single platform for both large public data and smaller-scale datasets generated by individual labs. Complicating this dichotomy is the expanding amount, and complexity, of cancer genomics data resulting from numerous technological advances, including lower-cost high-throughput sequencing and single-cell-based technologies. Cancer genomics datasets are now being generated using new assays, such as whole-genome sequencing11, DNA methylation whole-genome bisulfite sequencing12 and ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing13). 
Visualizing and exploring these diverse data modalities is important but challenging, especially as many tools have traditionally specialized in only one or perhaps a few data types. And although these complex datasets generate insights individually, integration with other omics datasets is crucial to help researchers discover and validate findings. UCSC Xena was developed as a high-performance visualization and analysis tool for both large public repositories and private datasets. It was built to scale with the current and future data growth and complexity. Xena’s privacy-aware architecture enables cancer researchers of all computational backgrounds to explore large, diverse datasets. Researchers use the same system to securely explore their own data, together or separately from the public data, all the while keeping private data secure. The system easily supports many tens of thousands of samples and has been tested with up to a million cells. The simple and flexible architecture supports a variety of common and uncommon data types. Xena’s Visual Spreadsheet visualization integrates gene-centric and genomic-coordinate-centric views across multiple data modalities, providing a deep, comprehensive view of genomic events within a cohort of tumors. UCSC Xena (http://xena.ucsc.edu) has two components: the front end Xena Browser and the back end Xena Hubs (Fig. 1). The web-based Xena Browser empowers biologists to explore data across multiple Xena Hubs with a variety of visualizations and analyses. The back end Xena Hubs host genomics data from laptops, public servers, behind a firewall, or in the cloud, and can be public or private (Supplementary Fig. 1). The Xena Browser receives data simultaneously from multiple Xena Hubs and integrates them into a single coherent visualization within the browser. A private Xena Hub is a hub installed on a user’s own computer (Supplementary Fig. 2). 
It is configured to only respond to requests from the computer’s localhost network interface (that is, http://127.0.0.1). This ensures that the hub only communicates with the computer on which the hub is installed. A public hub is configured to respond to requests from external computers. There are two types of public Xena Hubs (Supplementary Fig. 2). The first type is an open-public hub, which is a public hub accessible by everyone. While we host several open-public hubs (Supplementary Table 1), users can also set up their own as a way to share data. An example of one is the Treehouse Hub set up by the Childhood Cancer Initiative to share pediatric cancer RNA-seq gene expression data (Supplementary Note). The second type [...]
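
The localhost-only behaviour described for a private hub can be illustrated with a minimal Python sketch using only the standard library. This is not Xena's implementation; the handler class, the function names and the response text are hypothetical. It only shows the general technique: a server bound to 127.0.0.1 accepts connections from the same machine, whereas binding to 0.0.0.0 would expose it to external computers, as a public hub would be.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HubHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a hub endpoint; returns a fixed text body."""

    def do_GET(self):
        body = b"private hub: ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence the default per-request logging for this demo.
        pass


def start_hub(host: str = "127.0.0.1", port: int = 0) -> HTTPServer:
    """Start a server bound to `host`; port 0 lets the OS pick a free port.

    Binding to 127.0.0.1 restricts the listener to the loopback interface.
    Binding to 0.0.0.0 instead would accept requests from other machines.
    """
    server = HTTPServer((host, port), HubHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch(server: HTTPServer) -> str:
    """Request the root path from the same machine, bypassing any proxy settings."""
    host, port = server.server_address
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(f"http://{host}:{port}/") as resp:
        return resp.read().decode()


if __name__ == "__main__":
    hub = start_hub()
    print("listening on", hub.server_address)
    print(fetch(hub), end="")
    hub.shutdown()
    hub.server_close()
```

The only difference between the private and public configurations in this sketch is the bind address, which is why a loopback-bound hub can keep protected data on the user's own computer.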

1,644 citations

Posted Content•DOI•

xCell: Digitally portraying the tissue cellular heterogeneity landscape

Dvir Aran¹, Zicheng Hu¹, Atul J. Butte¹•Institutions (1)

University of California, San Francisco¹

08 Mar 2017-bioRxiv

TL;DR: xCell is a gene-signature-based method for inferring 64 immune and stroma cell types; the authors harmonized 1,822 transcriptomic profiles of pure human cell types from various sources, employed a curve fitting approach for linear comparison of cell types, and introduced a novel spillover compensation technique for separating closely related cell types.

Abstract: Tissues are a complex milieu consisting of numerous cell types. For example, understanding the cellular heterogeneity of the tumor microenvironment is an emerging field of research. Numerous methods have been published in recent years for the enumeration of cell subsets from tissue expression profiles. However, the available methods suffer from three major problems: inferring cell subsets based on gene sets learned and verified from limited sources; displaying only a partial portrayal of the full cellular heterogeneity; and insufficient validation in mixed tissues. To address these issues we developed xCell, a novel gene-signature based method for inferring 64 immune and stroma cell types. We first curated and harmonized 1,822 transcriptomic profiles of pure human cell types from various sources, employed a curve fitting approach for linear comparison of cell types, and introduced a novel spillover compensation technique for separating between closely related cell types. We test the ability of our model learned from pure cell types to infer enrichments of cell types in mixed tissues, using both comprehensive in silico analyses and comparison to cytometry immunophenotyping, to show that our scores outperform previously published methods. Finally, we explore the cell type enrichments in tumor samples and show that the cellular heterogeneity of the tumor microenvironment uniquely characterizes different cancer types. We provide our method for inferring cell type abundances as a public resource to allow researchers to portray the cellular heterogeneity landscape of tissue expression profiles: http://xCell.ucsf.edu/.

995 citations

Cites methods from "Toil enables reproducible, open sou..."

...Cell types enrichments in tumor samples We next applied our methodology on 9,947 primary tumor samples across thirty-seven cancer types from the TCGA and TARGET projects [29] (Supplementary Figure 9)....
[...]

Journal Article•DOI•

The UCSC Genome Browser database: 2019 update.

Maximilian Haeussler¹, Ann S. Zweig¹, Cath Tyner¹, Matthew L. Speir¹, Kate R. Rosenbloom¹, Brian J. Raney¹, Christopher Lee¹, Brian T. Lee¹, Angie S. Hinrichs¹, Jairo Navarro Gonzalez¹, David Gibson¹, Mark Diekhans¹, Hiram Clawson¹, Jonathan Casper¹, Galt P. Barber¹, David Haussler¹, Robert M. Kuhn¹, W. James Kent¹•Institutions (1)

University of California, Santa Cruz¹

08 Jan 2019-Nucleic Acids Research

TL;DR: This update to the UCSC Genome Browser adds a new tool that lets users interactively arrange existing graphing tracks into new groups, along with new annotations including a 30-way primate alignment on the human genome.

Abstract: The UCSC Genome Browser (https://genome.ucsc.edu) is a graphical viewer for exploring genome annotations. For almost two decades, the Browser has provided visualization tools for genetics and molecular biology and continues to add new data and features. This year, we added a new tool that lets users interactively arrange existing graphing tracks into new groups. Other software additions include new formats for chromosome interactions, a ChIP-Seq peak display for track hubs and improved support for HGVS. On the annotation side, we have added gnomAD, TCGA expression, RefSeq Functional elements, GTEx eQTLs, CRISPR Guides, SNPpedia and created a 30-way primate alignment on the human genome. Nine assemblies now have RefSeq-mapped gene models.

649 citations

References

Journal Article•DOI•

STAR: ultrafast universal RNA-seq aligner

Alexander Dobin¹, Carrie A. Davis¹, Felix Schlesinger¹, Jorg Drenkow¹, Chris Zaleski¹, Sonali Jha¹, Philippe Batut¹, Mark Chaisson¹, Thomas R. Gingeras¹•Institutions (1)

Cold Spring Harbor Laboratory¹

01 Jan 2013-Bioinformatics

TL;DR: The Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure outperforms other aligners by a factor of >50 in mapping speed.

Abstract: Motivation: Accurate alignment of high-throughput RNA-seq data is a challenging and yet unsolved problem because of the non-contiguous transcript structure, relatively short read lengths and constantly increasing throughput of the sequencing technologies. Currently available RNA-seq aligners suffer from high mapping error rates, low mapping speed, read length limitation and mapping biases. Results: To align our large (>80 billion reads) ENCODE Transcriptome RNA-seq dataset, we developed the Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure. STAR outperforms other aligners by a factor of >50 in mapping speed, aligning to the human genome 550 million 2 × 76 bp paired-end reads per hour on a modest 12-core server, while at the same time improving alignment sensitivity and precision. In addition to unbiased de novo detection of canonical junctions, STAR can discover non-canonical splices and chimeric (fusion) transcripts, and is also capable of mapping full-length RNA sequences. Using Roche 454 sequencing of reverse transcription polymerase chain reaction amplicons, we experimentally validated 1960 novel intergenic splice junctions with an 80-90% success rate, corroborating the high precision of the STAR mapping strategy. Availability and implementation: STAR is implemented as a standalone C++ code. STAR is free open source software distributed under GPLv3 license and can be downloaded from http://code.google.com/p/rna-star/.

30,684 citations

Journal Article•DOI•

The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data

Aaron McKenna¹, Matthew Hanna, Eric Banks, Andrey Sivachenko, Kristian Cibulskis, Andrew Kernytsky, Kiran V. Garimella, David Altshuler, Stacey Gabriel, Mark J. Daly, Mark A. DePristo•Institutions (1)

Broad Institute¹

01 Sep 2010-Genome Research

TL;DR: The GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.

Abstract: Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS—the 1000 Genome pilot alone includes nearly five terabases—make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by the complexity of accessing and manipulating the data produced by these machines. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.

20,557 citations

Journal Article•DOI•

RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome

Bo Li¹, Colin N. Dewey¹•Institutions (1)

University of Wisconsin-Madison¹

04 Aug 2011-BMC Bioinformatics

TL;DR: It is shown that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads, and estimates of the relative frequencies of isoforms within single genes may be improved through the use of paired-end reads, depending on the number of possible splice forms for each gene.

Abstract: RNA-Seq is revolutionizing the way transcript abundances are measured. A key challenge in transcript quantification from RNA-Seq data is the handling of reads that map to multiple genes or isoforms. This issue is particularly important for quantification with de novo transcriptome assemblies in the absence of sequenced genomes, as it is difficult to determine which transcripts are isoforms of the same gene. A second significant issue is the design of RNA-Seq experiments, in terms of the number of reads, read length, and whether reads come from one or both ends of cDNA fragments. We present RSEM, a user-friendly software package for quantifying gene and isoform abundances from single-end or paired-end RNA-Seq data. RSEM outputs abundance estimates, 95% credibility intervals, and visualization files and can also simulate RNA-Seq data. In contrast to other existing tools, the software does not require a reference genome. Thus, in combination with a de novo transcriptome assembler, RSEM enables accurate transcript quantification for species without sequenced genomes. On simulated and real data sets, RSEM has superior or comparable performance to quantification methods that rely on a reference genome. Taking advantage of RSEM's ability to effectively use ambiguously-mapping reads, we show that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads. On the other hand, estimates of the relative frequencies of isoforms within single genes may be improved through the use of paired-end reads, depending on the number of possible splice forms for each gene. RSEM is an accurate and user-friendly software tool for quantifying transcript abundances from RNA-Seq data. As it does not rely on the existence of a reference genome, it is particularly useful for quantification with de novo transcriptome assemblies. In addition, RSEM has enabled valuable guidance for cost-efficient design of quantification experiments with RNA-Seq, which is currently relatively expensive.

14,524 citations

Journal Article•DOI•

The Human Genome Browser at UCSC

W. James Kent¹, Charles W. Sugnet¹, Terrence S. Furey¹, Krishna M. Roskin¹, Tom H. Pringle, Alan M. Zahler¹, and David Haussler¹•Institutions (1)

University of California, Santa Cruz¹

01 Jun 2002-Genome Research

TL;DR: A mature web tool for rapid and reliable display of any requested portion of the genome at any scale, together with several dozen aligned annotation tracks, is provided at http://genome.ucsc.edu.

Abstract: As vertebrate genome sequences near completion and research refocuses to their analysis, the issue of effective genome annotation display becomes critical. A mature web tool for rapid and reliable display of any requested portion of the genome at any scale, together with several dozen aligned annotation tracks, is provided at http://genome.ucsc.edu. This browser displays assembly contigs and gaps, mRNA and expressed sequence tag alignments, multiple gene predictions, cross-species homologies, single nucleotide polymorphisms, sequence-tagged sites, radiation hybrid data, transposon repeats, and more as a stack of coregistered tracks. Text and sequence-based searches provide quick and precise access to any region of specific interest. Secondary links from individual features lead to sequence details and supplementary off-site databases. One-half of the annotation tracks are computed at the University of California, Santa Cruz from publicly available sequence data; collaborators worldwide provide the rest. Users can stably add their own custom tracks to the browser for educational or research purposes. The conceptual and technical framework of the browser, its underlying MYSQL database, and overall use are described. The web site currently serves over 50,000 pages per day to over 3000 different users.

9,605 citations

Journal Article•DOI•

Near-optimal probabilistic RNA-seq quantification

Nicolas Bray¹, Harold Pimentel¹, Páll Melsted², Lior Pachter¹•Institutions (2)

University of California, Berkeley¹, University of Iceland²

01 May 2016-Nature Biotechnology

TL;DR: Kallisto pseudoaligns reads to a reference, producing a list of transcripts that are compatible with each read while avoiding alignment of individual bases, which removes a major computational bottleneck in RNA-seq analysis.

Abstract: We present kallisto, an RNA-seq quantification program that is two orders of magnitude faster than previous approaches and achieves similar accuracy. Kallisto pseudoaligns reads to a reference, producing a list of transcripts that are compatible with each read while avoiding alignment of individual bases. We use kallisto to analyze 30 million unaligned paired-end RNA-seq reads in <10 min on a standard laptop computer. This removes a major computational bottleneck in RNA-seq analysis.

6,468 citations