Other affiliations: Harvard University, Brigham and Women's Hospital, Fred Hutchinson Cancer Research Center ...read more
Bio: Robert Gentleman is an academic researcher from Genentech. The author has contributed to research in topic(s): Bioconductor & Gene expression profiling. The author has an hindex of 52, co-authored 139 publication(s) receiving 48510 citation(s). Previous affiliations of Robert Gentleman include Harvard University & Brigham and Women's Hospital.
Papers published on a yearly basis
Harvard University1, Brigham and Women's Hospital2, University of Wisconsin-Madison3, University of California, Berkeley4, Technical University of Denmark5, Icahn School of Medicine at Mount Sinai6, Vienna University of Technology7, University of Erlangen-Nuremberg8, German Cancer Research Center9, University of Milan10, Johns Hopkins University11, University of Washington12, Scripps Research Institute13, Walter and Eliza Hall Institute of Medical Research14, University of Iowa15
15 Sep 2004-Genome Biology
TL;DR: Details of the aims and methods of Bioconductor, the collaborative creation of extensible software for computational biology and bioinformatics, and current challenges are described.
Abstract: The Bioconductor project is an initiative for the collaborative creation of extensible software for computational biology and bioinformatics. The goals of the project include: fostering collaborative development and widespread use of innovative software, reducing barriers to entry into interdisciplinary scientific research, and promoting the achievement of remote reproducibility of research results. We describe details of our aims and methods, identify current challenges, compare Bioconductor to other open bioinformatics projects, and provide working examples.
TL;DR: In this article, the authors discuss their experience designing and implementing a statistical computing language, which combines what they felt were useful features from two existing computer languages, and they feel that the new language provides advantages in the areas of portability, computational efficiency, memory management, and scope.
Abstract: In this article we discuss our experience designing and implementing a statistical computing language. In developing this new language, we sought to combine what we felt were useful features from two existing computer languages. We feel that the new language provides advantages in the areas of portability, computational efficiency, memory management, and scoping.
TL;DR: It is shown here that miRNAs are present in human plasma in a remarkably stable form that is protected from endogenous RNase activity and established the measurement of tumor-derived mi RNAs in serum or plasma as an important approach for the blood-based detection of human cancer.
Abstract: Improved approaches for the detection of common epithelial malignancies are urgently needed to reduce the worldwide morbidity and mortality caused by cancer. MicroRNAs (miRNAs) are small (≈22 nt) regulatory RNAs that are frequently dysregulated in cancer and have shown promise as tissue-based markers for cancer classification and prognostication. We show here that miRNAs are present in human plasma in a remarkably stable form that is protected from endogenous RNase activity. miRNAs originating from human prostate cancer xenografts enter the circulation, are readily measured in plasma, and can robustly distinguish xenografted mice from controls. This concept extends to cancer in humans, where serum levels of miR-141 (a miRNA expressed in prostate cancer) can distinguish patients with prostate cancer from healthy controls. Our results establish the measurement of tumor-derived miRNAs in serum or plasma as an important approach for the blood-based detection of human cancer.
01 Jan 2015
Harvard University1, Genentech2, Fred Hutchinson Cancer Research Center3, State University of Campinas4, University of Maryland, College Park5, National Institutes of Health6, University of Cambridge7, University of California, Riverside8, Novartis9, Johns Hopkins University10, University of Washington11, Walter and Eliza Hall Institute of Medical Research12, City University of New York13
01 Feb 2015-Nature Methods
TL;DR: An overview of Bioconductor, an open-source, open-development software project for the analysis and comprehension of high-throughput data in genomics and molecular biology, which comprises 934 interoperable packages contributed by a large, diverse community of scientists.
Abstract: Bioconductor is an open-source, open-development software project for the analysis and comprehension of high-throughput data in genomics and molecular biology. The project aims to enable interdisciplinary research, collaboration and rapid development of scientific software. Based on the statistical programming language R, Bioconductor comprises 934 interoperable packages contributed by a large, diverse community of scientists. Packages cover a range of bioinformatic and statistical applications. They undergo formal initial review and continuous automated testing. We present an overview for prospective users and contributors.
05 Dec 2014-Genome Biology
TL;DR: This work presents DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates, which enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression.
Abstract: In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-seq, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression. The DESeq2 package is available at http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html .
01 Jan 2010-Bioinformatics
TL;DR: EdgeR as mentioned in this paper is a Bioconductor software package for examining differential expression of replicated count data, which uses an overdispersed Poisson model to account for both biological and technical variability and empirical Bayes methods are used to moderate the degree of overdispersion across transcripts, improving the reliability of inference.
Abstract: Summary: It is expected that emerging digital gene expression (DGE) technologies will overtake microarray technologies in the near future for many functional genomics applications. One of the fundamental data analysis tasks, especially for gene expression studies, involves determining whether there is evidence that counts for a transcript or exon are significantly different across experimental conditions. edgeR is a Bioconductor software package for examining differential expression of replicated count data. An overdispersed Poisson model is used to account for both biological and technical variability. Empirical Bayes methods are used to moderate the degree of overdispersion across transcripts, improving the reliability of inference. The methodology can be used even with the most minimal levels of replication, provided at least one phenotype or experimental condition is replicated. The software may have other applications beyond sequencing data, such as proteome peptide count data. Availability: The package is freely available under the LGPL licence from the Bioconductor web site (http://bioconductor.org).
01 Sep 2010-Genome Research
TL;DR: The GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
Abstract: Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS—the 1000 Genome pilot alone includes nearly five terabases—make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by the complexity of accessing and manipulating the data produced by these machines. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
20 Apr 2015-Nucleic Acids Research
TL;DR: The philosophy and design of the limma package is reviewed, summarizing both new and historical features, with an emphasis on recent enhancements and features that have not been previously described.
Abstract: limma is an R/Bioconductor software package that provides an integrated solution for analysing data from gene expression experiments. It contains rich features for handling complex experimental designs and for information borrowing to overcome the problem of small sample sizes. Over the past decade, limma has been a popular choice for gene discovery through differential expression analyses of microarray and high-throughput PCR data. The package contains particularly strong facilities for reading, normalizing and exploring such data. Recently, the capabilities of limma have been significantly expanded in two important directions. First, the package can now perform both differential expression and differential splicing analyses of RNA sequencing (RNA-seq) data. All the downstream analysis tools previously restricted to microarray data are now available for RNA-seq as well. These capabilities allow users to analyse both RNA-seq and microarray data with very similar pipelines. Second, the package is now able to go past the traditional gene-wise expression analyses in a variety of ways, analysing expression profiles in terms of co-regulated sets of genes or in terms of higher-order expression signatures. This provides enhanced possibilities for biological interpretation of gene expression differences. This article reviews the philosophy and design of the limma package, summarizing both new and historical features, with an emphasis on recent enhancements and features that have not been previously described.
15 Jan 2015-Bioinformatics
TL;DR: This work presents HTSeq, a Python library to facilitate the rapid development of custom scripts for high-throughput sequencing data analysis, and presents htseq-count, a tool developed with HTSequ that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with genes.
Abstract: Motivation: A large choice of tools exists for many standard tasks in the analysis of high-throughput sequencing (HTS) data. However, once a project deviates from standard workflows, custom scripts are needed. Results: We present HTSeq, a Python library to facilitate the rapid development of such scripts. HTSeq offers parsers for many common data formats in HTS projects, as well as classes to represent data, such as genomic coordinates, sequences, sequencing reads, alignments, gene model information and variant calls, and provides data structures that allow for querying via genomic coordinates. We also present htseq-count, a tool developed with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with genes. Availability and implementation: HTSeq is released as an opensource software under the GNU General Public Licence and available from http://www-huber.embl.de/HTSeq or from the Python Package Index at https://pypi.python.org/pypi/HTSeq. Contact: email@example.com