Showing papers by "Rob Patro" published in 2015

Posted Content•DOI•

Salmon: Accurate, Versatile and Ultrafast Quantification from RNA-seq Data using Lightweight-Alignment

Rob Patro¹, Geet Duggal², Carl Kingsford²•Institutions (2)

Stony Brook University¹, Carnegie Mellon University²

27 Jun 2015-bioRxiv

TL;DR: Salmon is introduced, a novel method and software tool for transcript quantification that exhibits state-of-the-art accuracy while being significantly faster than most other tools.

Abstract: Transcript quantification is a central task in the analysis of RNA-seq data. Accurate computational methods for the quantification of transcript abundances are essential for downstream analysis. However, most existing approaches are much slower than is necessary for their degree of accuracy. We introduce Salmon, a novel method and software tool for transcript quantification that exhibits state-of-the-art accuracy while being significantly faster than most other tools. Salmon achieves this through the combined application of a two-phase inference procedure, a reduced data representation, and a novel lightweight read alignment algorithm. Salmon is written in C++11, and is available under the GPL v3 license as open-source software at https://combine-lab.github.io/salmon.

132 citations

Posted Content•DOI•

TransRate: reference free quality assessment of de-novo transcriptome assemblies

Richard Smith-Unna¹, Chris Boursnell¹, Rob Patro², Julian M. Hibberd¹, Steven L. Kelly³•Institutions (3)

University of Cambridge¹, Stony Brook University², University of Oxford³

27 Jun 2015-bioRxiv

TL;DR: TransRate accurately evaluates assemblies of conserved and novel RNA molecules of any kind in any species, is shown to be more accurate than comparable methods, and is demonstrated on a variety of data.

Abstract: TransRate is a tool for reference-free quality assessment of de novo transcriptome assemblies. Using only sequenced reads as the input, TransRate measures the quality of individual contigs and whole assemblies, enabling assembly optimization and comparison. TransRate can accurately evaluate assemblies of conserved and novel RNA molecules of any kind in any species. We show that it is more accurate than comparable methods and demonstrate its use on a variety of data.

118 citations

Posted Content•DOI•

Maximum Likelihood Estimation of Biological Relatedness from Low Coverage Sequencing Data

Mikhail Lipatov¹, Sanjeev K¹, Rob Patro¹, Krishna R. Veeramah¹•Institutions (1)

Stony Brook University¹

29 Jul 2015-bioRxiv

TL;DR: An approach that uses genotype likelihoods, rather than a single observed best genotype, to estimate the kinship coefficient Φ is described; it accurately infers relatedness down to at least the third degree in both simulated and real second-generation sequencing data from a wide variety of human populations.

Abstract: The inference of biological relatedness from DNA sequence data has a wide array of applications, such as in the study of human disease, anthropology and ecology. One of the most common analytical frameworks for performing this inference is to genotype individuals for large numbers of independent genomewide markers and use population allele frequencies to infer the probability of identity-by-descent (IBD) given observed genotypes. Current implementations of this class of methods assume genotypes are known without error. However, with the advent of second generation sequencing data there are now an increasing number of situations where the confidence attached to a particular genotype may be poor because of low coverage. Such scenarios may lead to biased estimates of the kinship coefficient, Φ. We describe an approach that utilizes genotype likelihoods rather than a single observed best genotype to estimate Φ and demonstrate that we can accurately infer relatedness in both simulated and real second generation sequencing data from a wide variety of human populations down to at least the third degree when coverage is as low as 2x for both individuals, while other commonly used methods such as PLINK exhibit large biases in such situations. In addition the method appears to be robust when the assumed population allele frequencies are diverged from the true frequencies for realistic levels of genetic drift. This approach has been implemented in the C++ software lcmlkin.
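To make the contrast between genotype likelihoods and a single best genotype concrete, here is a minimal Python sketch. It is not the lcmlkin estimator: it only combines per-genotype likelihoods with a Hardy-Weinberg prior built from a population allele frequency, which is a standard calculation. The function names and the example numbers are assumptions chosen for illustration.

```python
def genotype_posteriors(likelihoods, alt_freq):
    """Posterior P(G = g | reads) for g = 0, 1, 2 copies of the alternate allele.

    likelihoods: (P(reads | G=0), P(reads | G=1), P(reads | G=2))
    alt_freq:    assumed population frequency of the alternate allele
    The prior assumes Hardy-Weinberg proportions (an illustrative assumption).
    """
    if not 0.0 <= alt_freq <= 1.0:
        raise ValueError("alt_freq must lie in [0, 1]")
    p, q = alt_freq, 1.0 - alt_freq
    prior = (q * q, 2.0 * p * q, p * p)
    joint = [lik * pr for lik, pr in zip(likelihoods, prior)]
    total = sum(joint)
    if total == 0.0:
        raise ValueError("likelihoods and prior have no overlapping support")
    return [j / total for j in joint]


def expected_dosage(posteriors):
    """Expected number of alternate alleles under the posterior."""
    return posteriors[1] + 2.0 * posteriors[2]


# One read supporting the reference allele: the single best genotype by
# likelihood is homozygous reference, yet the posterior stays close to even
# between homozygous reference and heterozygous.
post = genotype_posteriors((0.99, 0.5, 0.01), alt_freq=0.5)
```

With so little coverage, committing to the highest-likelihood genotype discards most of the uncertainty that the posterior retains, which is the situation the abstract describes.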

84 citations

Posted Content•DOI•

Accurate, fast, and model-aware transcript expression quantification with Salmon

Rob Patro¹, Geet Duggal², Carl Kingsford²•Institutions (2)

Stony Brook University¹, Carnegie Mellon University²

03 Oct 2015-bioRxiv

TL;DR: Salmon, a quantification method that avoids the usual compromise between accuracy and speed by combining a novel "lightweight" alignment procedure with a streaming parallel inference algorithm and a feature-rich model, yields both exceptional accuracy and order-of-magnitude speed benefits over traditional alignment-based methods.

Abstract: Existing methods for quantifying transcript abundance require a fundamental compromise: either use high quality read alignments and experiment-specific models or sacrifice them for speed. We introduce Salmon, a quantification method that overcomes this restriction by combining a novel "lightweight" alignment procedure with a streaming parallel inference algorithm and a feature-rich model. These innovations yield both exceptional accuracy and order-of-magnitude speed benefits over traditional alignment-based methods.

62 citations

Journal Article•DOI•

AtRTD – a comprehensive reference transcript dataset resource for accurate quantification of transcript‐specific expression in Arabidopsis thaliana

Runxuan Zhang¹, Cristiane P. G. Calixto², Nikoleta A. Tzioutziou², Allan James³, Craig G. Simpson¹, Wenbin Guo¹,², Yamile Marquez⁴, Maria Kalyna⁵, Rob Patro⁶, Eduardo Eyras⁷,⁸, Andrea Barta⁴, Hugh G. Nimmo³, John W. S. Brown¹,²•Institutions (8)

James Hutton Institute¹, University of Dundee², University of Glasgow³, Medical University of Vienna⁴, University of Natural Resources and Life Sciences, Vienna⁵, Stony Brook University⁶, Pompeu Fabra University⁷, Catalan Institution for Research and Advanced Studies⁸

01 Oct 2015-New Phytologist

TL;DR: A comprehensive, nonredundant Arabidopsis reference transcript dataset (AtRTD) containing over 74 000 transcripts is presented for use with algorithms that quantify alternatively spliced (AS) transcript isoforms in RNA‐seq; validation by HR RT‐PCR demonstrates the accuracy of abundances calculated for individual transcripts in RNA‐seq.

Abstract: RNA-sequencing (RNA-seq) allows global gene expression analysis at the individual transcript level. Accurate quantification of transcript variants generated by alternative splicing (AS) remains a challenge. We have developed a comprehensive, nonredundant Arabidopsis reference transcript dataset (AtRTD) containing over 74 000 transcripts for use with algorithms to quantify AS transcript isoforms in RNA-seq. The AtRTD was formed by merging transcripts from TAIR10 and novel transcripts identified in an AS discovery project. We have estimated transcript abundance in RNA-seq data using the transcriptome-based alignment-free programmes Sailfish and Salmon and have validated quantification of splicing ratios from RNA-seq by high resolution reverse transcription polymerase chain reaction (HR RT-PCR). Good correlations between splicing ratios from RNA-seq and HR RT-PCR were obtained demonstrating the accuracy of abundances calculated for individual transcripts in RNA-seq. The AtRTD is a resource that will have immediate utility in analysing Arabidopsis RNA-seq data to quantify differential transcript abundance and expression.

46 citations

Journal Article•DOI•

Reference-based compression of short-read sequences using path encoding

Carl Kingsford¹, Rob Patro¹•Institutions (1)

Carnegie Mellon University¹

15 Jun 2015-Bioinformatics

TL;DR: An approach to compression that reduces the difficulty of managing large-scale sequencing data is presented and is able to encode RNA-seq reads using 3–11% of the space of the sequence in raw FASTA files, which is on average more than 34% smaller than competing approaches.

Abstract: Motivation: Storing, transmitting and archiving data produced by next-generation sequencing is a significant computational burden. New compression techniques tailored to short-read sequence data are needed. Results: We present here an approach to compression that reduces the difficulty of managing large-scale sequencing data. Our novel approach sits between pure reference-based compression and reference-free compression and combines much of the benefit of reference-based approaches with the flexibility of de novo encoding. Our method, called path encoding, draws a connection between storing paths in de Bruijn graphs and context-dependent arithmetic coding. Supporting this method is a system to compactly store sets of kmers that is of independent interest. We are able to encode RNA-seq reads using 3–11% of the space of the sequence in raw FASTA files, which is on average more than 34% smaller than competing approaches. We also show that even if the reference is very poorly matched to the reads that are being encoded, good compression can still be achieved. Availability and implementation: Source code and binaries freely available for download at http://www.cs.cmu.edu/~ckingsf/software/pathenc/, implemented in Go and supported on Linux and Mac OS X. Contact: carlk@cs.cmu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.

45 citations

Journal Article•DOI•

Data-dependent bucketing improves reference-free compression of sequencing reads.

Rob Patro¹, Carl Kingsford²•Institutions (2)

Stony Brook University¹, Carnegie Mellon University²

01 Sep 2015-Bioinformatics

TL;DR: A novel technique to boost the compression of sequencing reads, based on bucketing similar reads so that they appear nearby in the file, which achieves up to a 45% reduction in file sizes compared with existing state-of-the-art de novo compression schemes.

Abstract: Motivation: The storage and transmission of high-throughput sequencing data consumes significant resources. As our capacity to produce such data continues to increase, this burden will only grow. One approach to reduce storage and transmission requirements is to compress this sequencing data. Results: We present a novel technique to boost the compression of sequencing reads that is based on the concept of bucketing similar reads so that they appear nearby in the file. We demonstrate that, by adopting a data-dependent bucketing scheme and employing a number of encoding ideas, we can achieve substantially better compression ratios than existing de novo sequence compression tools, including other bucketing and reordering schemes. Our method, Mince, achieves up to a 45% reduction in file sizes (28% on average) compared with existing state-of-the-art de novo compression schemes. Availability and implementation: Mince is written in C++11, is open source and has been made available under the GPLv3 license. It is available at http://www.cs.cmu.edu/~ckingsf/software/mince. Contact: carlk@cs.cmu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
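The idea of placing similar reads near each other can be sketched in a few lines of Python. This is a deliberately simpler stand-in, not Mince's data-dependent scheme: each read is assigned to a bucket keyed by its lexicographically smallest k-mer (a minimizer), and reads are then written out bucket by bucket. The function names and the choice of k are assumptions for illustration.

```python
from collections import defaultdict


def minimizer(read, k):
    """Return the lexicographically smallest k-mer contained in the read."""
    if len(read) < k:
        raise ValueError("read is shorter than k")
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def bucket_reads(reads, k=4):
    """Group reads by minimizer; reads sharing a k-mer tend to share a bucket."""
    buckets = defaultdict(list)
    for read in reads:
        buckets[minimizer(read, k)].append(read)
    return dict(buckets)


def reorder(reads, k=4):
    """Emit reads bucket by bucket so that similar reads sit next to each other."""
    buckets = bucket_reads(reads, k)
    return [read for key in sorted(buckets) for read in buckets[key]]


reads = ["GGACGT", "TTTTTT", "CCACGT", "TTTTTG"]
ordered = reorder(reads)
```

In this toy input the two reads containing `ACGT` end up adjacent. A general-purpose compressor run over the reordered file then sees repeated content at short distances, which is the effect that bucketing schemes aim for.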

40 citations

Posted Content•DOI•

RapMap: A Rapid, Sensitive and Accurate Tool for Mapping RNA-seq Reads to Transcriptomes

Avi Srivastava¹, Hirak Sarkar¹, Rob Patro¹•Institutions (1)

Stony Brook University¹

28 Oct 2015-bioRxiv

TL;DR: It is demonstrated how quasi-mapping can be successfully applied to the problems of transcript-level quantification from RNA-seq reads and the clustering of contigs from de novo assembled transcriptomes into biologically-meaningful groups.

Abstract: Motivation: The alignment of sequencing reads to a transcriptome is a common and important step in many RNA-seq analysis tasks. When aligning RNA-seq reads directly to a transcriptome (as is common in the de novo setting or when a trusted reference annotation is available), care must be taken to report the potentially large number of multi-mapping locations per read. This can pose a substantial computational burden for existing aligners, and can considerably slow downstream analysis. Results: We introduce a novel algorithm, quasi-mapping, for mapping sequencing reads to a transcriptome. By attempting only to report the potential loci of origin of a sequencing read, and not the base-to-base alignment by which it derives from the reference, RapMap (the tool implementing this quasi-mapping algorithm) is capable of mapping sequencing reads to a target transcriptome substantially faster than existing alignment tools. The quasi-mapping algorithm itself uses several efficient data structures and takes advantage of the special structure of shared sequence prevalent in transcriptomes to rapidly provide highly-accurate mapping information. Availability: RapMap is implemented in C++11 and is available as open-source software, under GPL v3, at https://github.com/COMBINE-lab/RapMap.

15 citations

Posted Content•DOI•

Rich chromatin structure prediction from Hi-C data

Laraib Malik¹, Rob Patro¹•Institutions (1)

Stony Brook University¹

26 Nov 2015-bioRxiv

TL;DR: This work presents an efficient methodology to predict hierarchical chromatin domains from chromatin conformation capture data; domains predicted at different resolutions are used to construct a hierarchy based on intrinsic properties of the data, and the predicted structure is highly enriched for CTCF and various other chromatin markers.

Abstract: Recent studies involving the 3-dimensional conformation of chromatin have revealed the important role it has to play in different processes within the cell. These studies have also led to the discovery of densely interacting segments of the chromosome, called topologically associating domains. The accurate identification of these domains from Hi-C interaction data is an interesting and important computational problem for which numerous methods have been proposed. Unfortunately, most existing algorithms designed to identify these domains assume that they are non-overlapping whereas there is substantial evidence to believe a nested structure exists. We present an efficient methodology to predict hierarchical chromatin domains using chromatin conformation capture data. Our method predicts domains at different resolutions and uses these to construct a hierarchy that is based on intrinsic properties of the chromatin data. The hierarchy consists of a set of non-overlapping domains, that maximize intra-domain interaction frequencies, at each level. We show that our predicted structure is highly enriched for CTCF and various other chromatin markers. We also show that large-scale domains, at multiple resolutions within our hierarchy, are conserved across cell types and species. Our software, Matryoshka, is written in C++11 and licensed under GPL v3; it is available at https://github.com/COMBINE-lab/matryoshka.

12 citations

Book Chapter•DOI•

Optimizing Read Reversals for Sequence Compression

Zhong Sichen¹, Lu Zhao¹, Yan Liang¹, Mohammadzaman Zamani¹, Rob Patro¹, Rezaul Chowdhury¹, Esther M. Arkin¹, Joseph S. B. Mitchell¹, Steven Skiena¹•Institutions (1)

Stony Brook University¹

10 Sep 2015

TL;DR: Because the sequential order of the reads in sequence read files typically conveys no biologically significant information, the reads can be reordered to facilitate compression; for many problems the orientation of each read can likewise be chosen freely.

Abstract: New generation sequencing technologies produce massive data sets of millions of reads, making the compression of sequence read files an important problem. The sequential order of the reads in these files typically conveys no biologically significant information, providing the freedom to reorder them so as to facilitate compression. Similarly, for many problems the two orientations of a read (original or reverse complement) are indistinguishable from an information-theoretic perspective, providing the freedom to optimize the orientation of each read.
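The freedom to choose a read's orientation can be illustrated with a small Python sketch. It applies a simple per-read rule (keep whichever of the read and its reverse complement sorts first) and is not the optimization studied in this chapter. The function names are assumptions for illustration.

```python
# Translation table for DNA bases; N (unknown base) maps to itself.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(read):
    """Return the reverse complement of an uppercase DNA read."""
    return read.translate(_COMPLEMENT)[::-1]


def canonical_orientation(read):
    """Pick the lexicographically smaller of a read and its reverse complement.

    Returns (sequence, flipped), where flipped records whether the read was
    reversed, so the original orientation can be restored if it matters.
    """
    rc = reverse_complement(read)
    if read <= rc:
        return read, False
    return rc, True


oriented = [canonical_orientation(r) for r in ["TTGA", "ACGT", "TCAA"]]
```

Here `TTGA` and `TCAA` are reverse complements of each other, so both collapse to the same stored sequence. A compressor that needs the original strand would also have to keep the one-bit `flipped` flag per read.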

3 citations

Proceedings Article•

Optimizing Read Reversals for Sequence Compression - (Extended Abstract).

Zhong Sichen¹, Lu Zhao¹, Yan Liang¹, Mohammadzaman Zamani¹, Rob Patro², Rezaul Chowdhury¹, Esther M. Arkin¹, Joseph S. B. Mitchell¹, Steven Skiena¹•Institutions (2)

Stony Brook University¹, Carnegie Mellon University²

01 Jan 2015