scispace - formally typeset
Open AccessPosted ContentDOI

MetaGraph: Indexing and Analysing Nucleotide Archives at Petabase-scale

Reads0
Chats0
TLDR
This work presents MetaGraph, a versatile framework for the scalable analysis of extensive sequence repositories, and introduces the concept of differential assembly, which allows for the extraction of sequences present in a foreground set of samples but absent in a given background set.
Abstract
The amount of biological sequencing data available in public repositories is growing exponentially, forming an invaluable biomedical research resource. Yet, making all this sequencing data searchable and easily accessible to life science and data science researchers is an unsolved problem. We present MetaGraph, a versatile framework for the scalable analysis of extensive sequence repositories. MetaGraph efficiently indexes vast collections of sequences to enable fast search and comprehensive analysis. A wide range of underlying data structures offer different practically relevant trade-offs between the space taken by an index and its query performance. Achieving compression ratios of up to 1,000-fold over the already compressed raw input data, MetaGraph indexes can represent the content of large sequencing archives in the working memory of a single compute server. We demonstrate our framework’s scalability by indexing over 1.4 million whole genome sequencing (WGS) records from NCBI’s Sequence Read Archive, representing a total input of more than three petabases. Meta-Graphprovides a flexible methodological framework allowing for index construction to be scaled from consumer laptops to distribution onto a cloud compute cluster for processing terabases to petabases of input data. Notably, processing of data sets ranging from 1 TB of raw WGS reads to 20 TB of human RNA-sequencing data results in indexes whose memory footprints are small enough to host on standard desktop workstations. Besides demonstrating the utility of MetaGraph indexes on key applications, such as experiment discovery, sequence alignment, error correction, and differential assembly, we make a wide range of indexes available as a community resource, including indexes of over 450,000 microbial WGS records, more than 110,000 fungi WGS records, and more than 40,000 whole metagenome sequencing records. A subset of these indexes is made available online for interactive queries. All indexes will be available for download and in the cloud. In total, indexes comprising more than 1 million sequencing records are available for download. As an example of our indexes’ integrative analysis capabilities, we introduce the concept of differential assembly, which allows for the extraction of sequences present in a foreground set of samples but absent in a given background set. We apply this technique to differentially assemble contigs to identify pathogenic agents transfected via human kidney transplants. In a second example, we indexed more than 20,000 human RNA-Seq records from the TCGA and GTEx cohorts and use them to extract transcriptome features that are hard to characterize using a classical linear reference. We discovered over 200 trans-splicing events in GTEx and found broad evidence for tissue-specific non-A-to-I RNA-editing in GTEx and TCGA.

read more

Citations
More filters

SPAdes, a new genome assembly algorithm and its applications to single-cell sequencing ( 7th Annual SFAF Meeting, 2012)

Glenn Tesler
TL;DR: SPAdes as mentioned in this paper is a new assembler for both single-cell and standard (multicell) assembly, and demonstrate that it improves on the recently released E+V-SC assembler and on popular assemblers Velvet and SoapDeNovo (for multicell data).
Journal ArticleDOI

Petabase-scale sequence alignment catalyses viral discovery

TL;DR: This article developed a cloud computing infrastructure, Serratus, to enable ultra-high-throughput sequence alignment at the petabase scale and identified well over 105 novel RNA viruses, thereby expanding the number of known species by roughly an order of magnitude.
Journal ArticleDOI

A global metagenomic map of urban microbiomes and antimicrobial resistance

David Danko, +681 more
- 24 Jun 2021 - 
TL;DR: This paper presented a global atlas of 4,728 metagenomic samples from mass-transit systems in 60 cities over three years, representing the first systematic, worldwide catalog of the urban microbial ecosystem.
Journal ArticleDOI

DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks

TL;DR: In this paper, the authors perform a large-scale characterization of a wide variety of applications, across a wide range of application domains, to identify fundamental program properties that lead to data movement to/from main memory.
References
More filters
Journal ArticleDOI

Basic Local Alignment Search Tool

TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.
Journal ArticleDOI

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

TL;DR: A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original.
Journal ArticleDOI

STAR: ultrafast universal RNA-seq aligner

TL;DR: The Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure outperforms other aligners by a factor of >50 in mapping speed.

SPAdes, a new genome assembly algorithm and its applications to single-cell sequencing ( 7th Annual SFAF Meeting, 2012)

Glenn Tesler
TL;DR: SPAdes as mentioned in this paper is a new assembler for both single-cell and standard (multicell) assembly, and demonstrate that it improves on the recently released E+V-SC assembler and on popular assemblers Velvet and SoapDeNovo (for multicell data).
Related Papers (5)