
Showing papers by "Robert Gentleman published in 2005"


Book
01 Aug 2005
TL;DR: Code underlying all of the computations that are shown is made available on a companion website, and readers can reproduce every number, figure, and table on their own computers.
Abstract: Full four-color book. Some of the editors created the Bioconductor project and Robert Gentleman is one of the two originators of R. All methods are illustrated with publicly available data, and a major section of the book is devoted to fully worked case studies. Code underlying all of the computations that are shown is made available on a companion website, and readers can reproduce every number, figure, and table on their own computers.

211 citations


Journal ArticleDOI
TL;DR: It is found that VPA and other HDAC inhibitors cause very similar and characteristic developmental defects whereas VPA analogs with poor inhibitory activity in vivo have little teratogenic effect.
Abstract: Chemically induced birth defects are an important public health problem. Here we use Xenopus and zebrafish as models to investigate the mechanism of action of a well-known teratogen, valproic acid (VPA). VPA is a drug used in the treatment of epilepsy and bipolar disorder but causes spina bifida if taken during pregnancy. VPA has several biochemical activities, including inhibition of histone deacetylases (HDACs). To investigate the mechanism of action of VPA, we compared its effects in Xenopus and zebrafish embryos with those of known HDAC inhibitors and noninhibitory VPA analogs. We found that VPA and other HDAC inhibitors cause very similar and characteristic developmental defects, whereas VPA analogs with poor inhibitory activity in vivo have little teratogenic effect. Unbiased microarray analysis revealed that the effects of VPA and trichostatin A (TSA), a structurally unrelated HDAC inhibitor, are strikingly concordant. The concordance is apparent both by en masse correlation of fold-changes and by detailed similarity of dose-response profiles of individual genes. Together, the results demonstrate that the teratogenic effects of VPA are very likely mediated specifically by inhibition of HDACs.

180 citations


Journal ArticleDOI
TL;DR: The authors apply these concepts to a seminal paper in bioinformatics, The Molecular Classification of Cancer by Golub et al. (1999), demonstrating that such a reproduction is possible while concentrating on the usefulness of the compendium concept itself.
Abstract: While scientific research and the methodologies involved have gone through substantial technological evolution, the technology involved in the publication of the results of these endeavors has remained relatively stagnant. Publication is largely done in the same manner today as it was fifty years ago. Many journals have adopted electronic formats; however, their orientation and style differ little from a printed document. The documents tend to be static and take little advantage of computational resources that might be available. Recent work (Gentleman and Temple Lang, 2003) suggests a methodology and basic infrastructure that can be used to publish documents in a substantially different way. Their approach is suitable for the publication of papers whose message relies on computation. Stated quite simply, Gentleman and Temple Lang (2003) propose a paradigm where documents are mixtures of code and text. Such documents may be self-contained, or they may be a component of a compendium which provides the infrastructure needed to provide access to data and supporting software. These documents, or compendiums, can be processed in a number of different ways. One transformation is to replace the code with its output, thereby providing the familiar, but limited, static document. In this paper we apply these concepts to a seminal paper in bioinformatics, namely The Molecular Classification of Cancer by Golub et al. (1999). The authors of that paper have generously provided data and other information that have allowed us to largely reproduce their results. Rather than reproduce the paper exactly, we demonstrate that such a reproduction is possible and concentrate instead on demonstrating the usefulness of the compendium concept itself.
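To make the code-plus-text idea concrete, here is a minimal Sweave source file (a hypothetical example.Rnw; Sweave, shipped in R's utils package, is the facility underlying such compendia, and everything below except the boilerplate is an illustrative assumption):

\documentclass{article}
\begin{document}
The mean of our example data is:
<<summary, echo=TRUE>>=
x <- rnorm(10)   # stands in for real expression data
mean(x)
@
\end{document}

Running Sweave("example.Rnw") transforms this into a .tex file in which the chunk is replaced by its echoed code and printed output, yielding the familiar static view of the document.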

142 citations


Journal ArticleDOI
TL;DR: Genomic signatures are associated with phenotypically and molecularly well defined subgroups of adult ALL; profiling also identifies genes associated with poor outcome in cases without molecular aberrations and specific genes that may be new therapeutic targets in adult ALL.
Abstract: Purpose: To characterize gene expression signatures in acute lymphocytic leukemia (ALL) cells associated with known genotypic abnormalities in adult patients. Experimental Design: Gene expression profiles from 128 adult patients with newly diagnosed ALL were characterized using high-density oligonucleotide microarrays. All patients were enrolled in the Italian GIMEMA multicenter clinical trial 0496 and samples had >90% leukemic cells. Uniform phenotypic, cytogenetic, and molecular data were also available for all cases. Results: T-lineage ALL was characterized by a homogeneous gene expression pattern, whereas several subgroups of B-lineage ALL were evident. Within B-lineage ALL, distinct signatures were associated with ALL1/AF4 and E2A/PBX1 gene rearrangements. Expression profiles associated with ALL1/AF4 and E2A/PBX1 are similar in adults and children. BCR/ABL+ gene expression pattern was more heterogeneous and was most similar to ALL without known molecular rearrangements. We also identified a set of 83 genes that were highly expressed in leukemia blasts from patients without known molecular abnormalities who subsequently relapsed following therapy. Supervised analysis of kinase genes revealed a high-level FLT3 expression in a subset of cases without molecular rearrangements. Two other kinases (PRKCB1 and DDR1) were highly expressed in cases without molecular rearrangements, as well as in BCR/ABL-positive ALL. Conclusions: Genomic signatures are associated with phenotypically and molecularly well defined subgroups of adult ALL. Genomic profiling also identifies genes associated with poor outcome in cases without molecular aberrations and specific genes that may be new therapeutic targets in adult ALL.
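The expression data from this study are distributed by Bioconductor as the ALL data package, so the subgroup structure can be inspected directly (a brief sketch assuming that package is installed):

library(ALL)            # Bioconductor data package with these 128 arrays
data(ALL)
dim(exprs(ALL))         # probe sets x samples
table(ALL$mol.biol)     # molecular subgroups (e.g. BCR/ABL, ALL1/AF4, E2A/PBX1)
table(ALL$BT)           # B- versus T-lineage classification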

140 citations


Journal ArticleDOI
TL;DR: This work extends partial least squares (PLS), a popular dimension reduction tool in chemometrics, in the context of generalized linear regression, based on a previous approach, iteratively reweighted partial least squares (IRWPLS), and shows that by phrasing the problem in a generalized linear model setting and by applying Firth's procedure to avoid (quasi)separation, one gets lower classification error rates.
Abstract: Advances in computational biology have made simultaneous monitoring of thousands of features possible. The high throughput technologies not only bring about a much richer information context in which to study various aspects of gene function, but they also present the challenge of analyzing data with a large number of covariates and few samples. As an integral part of machine learning, classification of samples into two or more categories is almost always of interest to scientists. We address the question of classification in this setting by extending partial least squares (PLS), a popular dimension reduction tool in chemometrics, in the context of generalized linear regression, based on a previous approach, iteratively reweighted partial least squares, that is, IRWPLS. We compare our results with two-stage PLS and with other classifiers. We show that by phrasing the problem in a generalized linear model setting and by applying Firth's procedure to avoid (quasi)separation, we often get lower classification error rates.
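A minimal sketch of the two-stage variant (our illustration, not the authors' IRWPLS code): extract a few PLS components by NIPALS, then fit a logistic regression on the component scores. Firth's correction (e.g. via the brglm package) would replace the plain glm here; all data are simulated.

set.seed(1)
n <- 60; p <- 500
X <- matrix(rnorm(n * p), n, p)              # many covariates, few samples
beta <- c(rep(2, 10), rep(0, p - 10))
y <- rbinom(n, 1, plogis(as.vector(scale(X %*% beta))))

pls1 <- function(X, y, K) {                  # NIPALS for a univariate response
  Xc <- scale(X, scale = FALSE)
  yc <- y - mean(y)
  Tmat <- matrix(0, nrow(X), K)
  for (k in seq_len(K)) {
    w  <- crossprod(Xc, yc)
    w  <- w / sqrt(sum(w^2))                 # unit weight vector
    tk <- Xc %*% w                           # component scores
    Tmat[, k] <- tk
    pv <- crossprod(Xc, tk) / sum(tk^2)      # loadings
    Xc <- Xc - tk %*% t(pv)                  # deflate X
    yc <- yc - tk * drop(crossprod(tk, yc)) / sum(tk^2)  # deflate y
  }
  Tmat
}

S <- pls1(X, y, K = 3)
fit <- glm(y ~ S, family = binomial)         # stage two: logistic regression
mean((fitted(fit) > 0.5) != y)               # apparent (training) error rate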

104 citations


Journal ArticleDOI
TL;DR: The local modeling methodology proposed by Scholtens and Gentleman (2004) is applied to two publicly available datasets and it is formally shown that accurate local interactome models require both Y2H and AP-MS data, even in idealized situations.
Abstract: Motivation: Systems biology requires accurate models of protein complexes, including physical interactions that assemble and regulate these molecular machines. Yeast two-hybrid (Y2H) and affinity-purification/mass-spectrometry (AP-MS) technologies measure different protein-protein relationships, and issues of completeness, sensitivity and specificity fuel debate over which is best for high-throughput 'interactome' data collection. Static graphs currently used to model Y2H and AP-MS data neglect dynamic and spatial aspects of macromolecular complexes and pleiotropic protein function. Results: We apply the local modeling methodology proposed by Scholtens and Gentleman (2004) to two publicly available datasets and demonstrate its uses, interpretation and limitations. Specifically, we use this technology to address four major issues pertaining to protein-protein networks. (1) We motivate the need to move from static global interactome graphs to local protein complex models. (2) We formally show that accurate local interactome models require both Y2H and AP-MS data, even in idealized situations. (3) We briefly discuss experimental design issues and how bait selection affects interpretability of results. (4) We point to the implications of local modeling for systems biology, including functional annotation, new complex prediction, pathway interactivity and coordination with gene-expression data. Availability: The local modeling algorithm and all protein complex estimates reported here can be found in the R package apComplex, available at http://www.bioconductor.org Contact: dscholtens@northwestern.edu Supplementary information: http://daisy.prevmed.northwestern.edu/~denise/pubs/LocalModeling

84 citations


Journal ArticleDOI
TL;DR: Interfaces to open source resources for visualization and network algorithms have been developed to support analysis of graphical structures in genomics and computational biology.
Abstract: Summary: In this paper, we review the central concepts and implementations of tools for working with network structures in Bioconductor. Interfaces to open source resources for visualization (AT&T Graphviz) and network algorithms (Boost) have been developed to support analysis of graphical structures in genomics and computational biology. Availability: Packages graph, Rgraphviz, RBGL of Bioconductor (www.bioconductor.org). Contact: stvjc@channing.harvard.edu
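A brief sketch of how these interfaces fit together (assuming the graph and RBGL packages; Rgraphviz adds plotting; the toy graph below is our own example):

library(graph)   # graphNEL data structure
library(RBGL)    # interface to the Boost Graph Library

g <- new("graphNEL", nodes = letters[1:5], edgemode = "undirected")
g <- addEdge(c("a", "a", "b", "c", "d"),
             c("b", "c", "c", "d", "e"), g)
connectedComp(g)           # connected components, computed via Boost
sp.between(g, "a", "e")    # shortest path from node a to node e
## library(Rgraphviz); plot(g)   # Graphviz layout and rendering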

81 citations


Book ChapterDOI
15 Nov 2005
TL;DR: Different approaches to identifying changes in gene expression associated with particular biological conditions are discussed, and their application is illustrated using software from the Bioconductor Project.
Abstract: A basic, yet challenging task in the analysis of microarray gene expression data is the identification of changes in gene expression that are associated with particular biological conditions. We discuss different approaches to this task and illustrate how they can be applied using software from the Bioconductor Project. A central problem is the high dimensionality of gene expression space, which prohibits a comprehensive statistical analysis without focusing on particular aspects of the joint distribution of the genes' expression levels. Possible strategies are to do univariate gene-by-gene analysis, and to perform data-driven nonspecific filtering of genes before the actual statistical analysis. However, more focused strategies that make use of biologically relevant knowledge are more likely to increase our understanding of the data. Keywords: differential gene expression; microarrays; multiple testing; statistical software; biological metadata
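A minimal sketch of the nonspecific-filtering-then-testing strategy (assuming the genefilter package; the data are simulated for illustration):

library(genefilter)                      # Bioconductor filtering utilities

set.seed(42)
x <- matrix(rnorm(1000 * 20), 1000, 20)  # 1000 genes, 20 samples
fac <- factor(rep(c("A", "B"), each = 10))
x[1:25, fac == "B"] <- x[1:25, fac == "B"] + 2   # 25 truly changed genes

keep <- rowSds(x) > quantile(rowSds(x), 0.5)  # nonspecific variance filter
tt   <- rowttests(x[keep, ], fac)             # fast gene-by-gene t-tests
padj <- p.adjust(tt$p.value, method = "BH")   # multiple-testing adjustment
sum(padj < 0.05)                              # genes called differentially expressed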

51 citations


Book ChapterDOI
01 Jan 2005
TL;DR: Both supervised and unsupervised machine learning techniques require selection of a measure of distance between, or similarity among, the objects to be classified or clustered.
Abstract: Both supervised and unsupervised machine learning techniques require selection of a measure of distance between, or similarity among, the objects to be classified or clustered. Different measures of distance or similarity will lead to different machine learning performance. The appropriateness of a distance measure will typically depend on the types of features being used in the learning process.
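For example, Euclidean and correlation-based distances can rank the same pairs of genes quite differently; a base-R sketch on simulated profiles:

set.seed(7)
x <- matrix(rnorm(5 * 12), nrow = 5)     # 5 genes measured in 12 conditions
d.euc <- dist(x, method = "euclidean")   # distances between rows (genes)
d.cor <- as.dist(1 - cor(t(x)))          # 1 - Pearson correlation
round(d.euc, 2)
round(d.cor, 2)
## hclust(d.euc) and hclust(d.cor) can produce quite different clusterings.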

33 citations


Journal ArticleDOI
TL;DR: A Bayesian error-in-variable model is described for the analysis of microarray data from a clinical study of patients with acute lymphoblastic leukemia, focusing in particular on the problem of identifying genes whose expression patterns are associated with duration of remission.
Abstract: DNA microarrays in conjunction with statistical models may help gain a deeper understanding of the molecular basis for specific diseases. An intense area of research is concerned with the identification of genes related to particular phenotypes. The technology, however, is subject to various sources of error that may lead to expression readings that are substantially different from the true transcript levels. Few methods for microarray data analysis have accounted for measurement error in a substantial way and that is the purpose of this investigation. We describe a Bayesian error-in-variable model for the analysis of microarray data from a clinical study of patients with acute lymphoblastic leukemia. We focus in particular on the problem of identifying genes whose expression patterns are associated with duration of remission. This is a question of great practical interest since relapse is a major concern in the treatment of this disease. We explore the effects of ignoring the uncertainty in the expression estimates on the selection and ranking of genes.
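A small simulation (ours, not the paper's Bayesian model) shows the core problem: regressing an outcome on error-contaminated expression attenuates associations, which can change how genes are ranked.

set.seed(1)
n <- 100
true.expr <- rnorm(n)                       # true (log) transcript level
outcome   <- 0.8 * true.expr + rnorm(n)     # phenotype tied to the true level
observed  <- true.expr + rnorm(n, sd = 1)   # microarray reading with error

coef(lm(outcome ~ true.expr))["true.expr"]  # close to the true slope 0.8
coef(lm(outcome ~ observed))["observed"]    # attenuated toward zero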

22 citations


Journal ArticleDOI
TL;DR: This article proposes controlling the positive FDR using a Bayesian approach where the rejection rule is based on the posterior probabilities of the null hypotheses, and illustrates the procedure with an application to wavelet thresholding.
Abstract: The False Discovery Rate (FDR) procedure has become a widely used method for handling multiplicity issues arising with high-dimensional data. The definition of the FDR has a natural Bayesian interpretation, namely the expected proportion of null hypotheses rejected in error at a given significance threshold. In this article, we propose controlling the positive FDR using a Bayesian approach in which the rejection rule is based on the posterior probabilities of the null hypotheses. The connection between Bayesian and frequentist measures of significance has been studied in several contexts. Here we extend the comparison to multiple testing with FDR control and illustrate the procedure with an application to wavelet thresholding. The problem consists of recovering a signal from noisy measurements. This involves extracting the wavelet coefficients that arise from the true signal and can be formulated as a multiple hypothesis testing problem. We use simulated examples to compare the performance of our approach with the procedure of Benjamini and Hochberg (1995). We also illustrate the method with data from nuclear magnetic resonance spectroscopy of the human brain.
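As a toy illustration (our own sketch with the mixture densities assumed known, not the paper's implementation), a posterior-probability rejection rule can be mimicked in R and compared with Benjamini-Hochberg:

set.seed(2)
m <- 2000; pi0 <- 0.9
isnull <- rbinom(m, 1, pi0) == 1                 # which hypotheses are truly null
z <- rnorm(m, mean = ifelse(isnull, 0, 3))       # observed test statistics

post0 <- pi0 * dnorm(z) /
         (pi0 * dnorm(z) + (1 - pi0) * dnorm(z, mean = 3))

## Reject the hypotheses with smallest posterior null probability while the
## average posterior null probability among rejections (a Bayesian estimate
## of the positive FDR) stays below 0.05.
o <- order(post0)
k <- max(which(cumsum(post0[o]) / seq_len(m) <= 0.05))
mean(isnull[o[seq_len(k)]])                      # realized false discovery proportion

## Frequentist comparison: Benjamini-Hochberg (1995) on two-sided p-values.
p <- 2 * pnorm(-abs(z))
sum(p.adjust(p, method = "BH") < 0.05)           # number of BH rejections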

Journal ArticleDOI
TL;DR: A graph-theoretic/statistical algorithm is discussed for local dynamic modeling of protein complexes from affinity purification-mass spectrometry data; it readily accommodates multicomplex membership by individual proteins and dynamic complex composition, two biological realities not accounted for in existing topological descriptions of the overall protein network.
Abstract: Accurate systems biology modeling requires a complete catalog of protein complexes and their constituent proteins. We discuss a graph-theoretic/statistical algorithm for local dynamic modeling of protein complexes using data from affinity purification-mass spectrometry experiments. The algorithm readily accommodates multicomplex membership by individual proteins and dynamic complex composition, two biological realities not accounted for in existing topological descriptions of the overall protein network. A likelihood-based objective function guides the protein complex modeling algorithm. With an accurate complex membership catalog in place, systems biology can proceed with greater precision.
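A toy version of such a likelihood-based objective (our simplification for illustration; the sensitivity s and false-positive rate f are assumed parameters, and this is not the published algorithm):

## Score a candidate protein-to-complex membership matrix A against
## observed bait-prey detections Y.
loglik <- function(A, Y, s = 0.75, f = 0.01) {
  C  <- (A %*% t(A)) > 0                # TRUE if two proteins share a complex
  C  <- C[rownames(Y), colnames(Y)]     # bait rows, prey columns
  pr <- ifelse(C, s, f)                 # detection probability per pair
  sum(Y * log(pr) + (1 - Y) * log(1 - pr))
}

prot <- c("P1", "P2", "P3", "P4")
A <- matrix(c(1, 1, 1, 0,               # complex c1 = {P1, P2, P3}
              0, 0, 1, 1),              # complex c2 = {P3, P4}; P3 is in both
            ncol = 2, dimnames = list(prot, c("c1", "c2")))
Y <- matrix(c(1, 1, 1, 0,               # bait P1 detects P1, P2, P3
              0, 0, 1, 1),              # bait P4 detects P3, P4
            nrow = 2, byrow = TRUE, dimnames = list(c("P1", "P4"), prot))
loglik(A, Y)                            # higher values indicate better fit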

Journal ArticleDOI
TL;DR: The authors propose a reduction technique and versions of the EM algorithm and the vertex exchange method to perform constrained nonparametric maximum likelihood estimation of the cumulative distribution function given interval-censored data.
Abstract: The authors propose a reduction technique and versions of the EM algorithm and the vertex exchange method to perform constrained nonparametric maximum likelihood estimation of the cumulative distribution function given interval-censored data. The constrained vertex exchange method can be used in practice to produce likelihood intervals for the cumulative distribution function. In particular, the authors show how to produce a confidence interval with known asymptotic coverage for the survival function given current status data.
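For intuition, a minimal self-consistency EM for the NPMLE from interval-censored data can be written in a few lines of R (a Turnbull-style sketch on invented intervals; the paper's reduction technique and vertex exchange method are more refined):

L <- c(0, 2, 4, 1)                          # left endpoints of censoring intervals
R <- c(3, 5, 7, 2)                          # right endpoints
s <- sort(unique(c(L, R)))                  # candidate support points
A <- outer(L, s, "<=") & outer(R, s, ">=")  # A[i, j]: s[j] lies in [L[i], R[i]]

p <- rep(1 / length(s), length(s))          # initial probability masses
for (it in 1:500) {
  W <- sweep(A, 2, p, "*")                  # E-step: A[i, j] * p[j]
  W <- W / rowSums(W)                       # conditional mass per observation
  p <- colMeans(W)                          # M-step: self-consistency update
}
round(rbind(support = s, mass = p), 3)
cumsum(p)                                   # estimated CDF at the support points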

Journal ArticleDOI
TL;DR: The book’s impressive breadth and depth make it an essential reference for any researcher interested in understanding the state-of-the-art methods and potential applications in latent multilevel, longitudinal, and structural equation modeling.
Abstract: approaches with spatial dependence using empirical Bayes methods have been applied in Chapter 11 to examine the geographical distribution of diseases. This chapter analyses disease mapping and small-area estimation using count models with a spatial dependence structure for non-Gaussian random effects that are correlated with surrounding random effects. The Bayesian fitting of the model relies on sampling-based approximations to the distribution of interest via Markov chain Monte Carlo (MCMC) methods. The interested reader may find it useful to compare the results of GLLAMM with those in MLwiN, which also uses MCMC methods (Browne 2003) to estimate a conditional autoregressive distribution of the random effects (see Besag, York, and Mollié 1991). Overall, I find the book to be an exceedingly valuable reference that would be ideal for graduate-level courses on generalized latent variable modeling. It is very straightforward to build from it a comprehensive course where the statistical section is complemented with a multidisciplinary set of easily replicated examples, because both the datasets and the software are available online. In addition, the book's impressive breadth and depth make it an essential reference for any researcher interested in understanding the state-of-the-art methods and potential applications in latent multilevel, longitudinal, and structural equation modeling.

Book ChapterDOI
01 Jan 2005
TL;DR: This chapter considers four specific data-analytic and inferential problems that can be addressed using graphs and shows how one can investigate relationships between gene expression and protein-protein interaction data, how GO annotations can be used to analyze gene sets, and how literature citations can be related to experimental data.
Abstract: In this chapter we consider four specific data-analytic and inferential problems that can be addressed using graphs. We demonstrate the use of the software and methods described in Chapters 20 and 21 on real problems in computational biology. We will show how one can investigate relationships between gene expression and protein-protein interaction data, how GO annotations can be used to analyze gene sets, how literature citations can be related to experimental data, and how gene expression data can be mapped on pathways.
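The arithmetic underlying GO gene-set analysis is typically a hypergeometric over-representation test; a base-R sketch (all counts below are invented; tools such as the GOstats package automate this over real annotations):

universe <- 10000   # annotated genes on the array
inGO     <- 400     # universe genes annotated to the GO category
selected <- 200     # genes in the list of interest
overlap  <- 25      # list genes annotated to the category

## P(X >= overlap) under random sampling without replacement:
phyper(overlap - 1, inGO, universe - inGO, selected, lower.tail = FALSE)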

01 Jan 2005
TL;DR: The synthesis of different microarray data sets using a random effects paradigm is considered and it is demonstrated how relatively standard statistical approaches yield good results.
Abstract: With many different investigators studying the same disease and with a strong commitment to publish supporting data in the scientific community, there are often many different datasets available for any given disease. Hence there is substantial interest in finding methods for combining these datasets to provide better and more detailed understanding of the underlying biology. We consider the synthesis of different microarray data sets using a random effects paradigm and demonstrate how relatively standard statistical approaches yield good results. We identify a number of important and substantive areas which require further investigation.
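For a single gene, such a random-effects synthesis can be sketched with the DerSimonian-Laird estimator (the effect sizes and variances below are invented for illustration):

d <- c(0.45, 0.62, 0.30)   # study-specific effect sizes for one gene
v <- c(0.04, 0.06, 0.05)   # their estimated variances

w    <- 1 / v                                     # fixed-effect weights
Q    <- sum(w * (d - sum(w * d) / sum(w))^2)      # Cochran's heterogeneity Q
tau2 <- max(0, (Q - (length(d) - 1)) /
               (sum(w) - sum(w^2) / sum(w)))      # between-study variance
wr   <- 1 / (v + tau2)                            # random-effects weights
est  <- sum(wr * d) / sum(wr)                     # combined effect estimate
se   <- sqrt(1 / sum(wr))
c(estimate = est, lower = est - 1.96 * se, upper = est + 1.96 * se)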

Book ChapterDOI
01 Jan 2005
TL;DR: This chapter describes software tools for creating, manipulating, and visualizing graphs in the Bioconductor project and gives the rationale for the design decisions and brief outlines of how to make use of these tools.
Abstract: We describe software tools for creating, manipulating, and visualizing graphs in the Bioconductor project. We give the rationale for our design decisions and provide brief outlines of how to make use of these tools. The discussion mirrors that of Chapter 20, where the different mathematical constructs were described. It is worth differentiating between packages that are mainly infrastructure (sets of tools that can be used to create other pieces of software) and packages that are designed to provide an end-user application. The packages graph, RBGL, and Rgraphviz are infrastructure packages. Software developers may use these packages to construct tools aimed at specific application areas, such as the GOstats package.
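A minimal session with the infrastructure might look like this (a sketch assuming the graph package; randomEGraph and subGraph are utilities it provides):

library(graph)   # Bioconductor graph infrastructure

set.seed(123)
g <- randomEGraph(LETTERS[1:10], edges = 12)   # random graph for illustration
numNodes(g)
numEdges(g)
degree(g)                                      # per-node degree
sg <- subGraph(LETTERS[1:5], g)                # induced subgraph on 5 nodes
## Application packages such as GOstats build on these structures;
## Rgraphviz::plot(g) would render the graph.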

Journal Article
TL;DR: In this paper, the authors argue for an increased emphasis on computing in the training of statisticians and in their professional practice, and they describe some of the current technological challenges and demonstrate the importance for statisticians of becoming more active in computational aspects of their work and specifically in producing software for carrying out statistical procedures.
Abstract: The author argues for an increased emphasis on computing in the training of statisticians and in their professional practice. He describes some of the current technological challenges and demonstrates the importance for statisticians of becoming more active in computational aspects of their work, and specifically in producing software for carrying out statistical procedures. Such a reorientation will require substantial changes in thinking, pedagogy and infrastructure; the author mentions some of the conditions required to achieve these goals.

Book ChapterDOI
15 Apr 2005
TL;DR: The requirements, language features, and design and development methodology guiding the evolution of the Bioconductor project are described; its commitments are expected to foster the propagation of standards of transparency and explicit reproducibility from wet-lab science to in silico biology, where explicit reproduction of important published results is often very difficult.
Abstract: Bioconductor is an open source initiative for the creation and dissemination of methods in statistical genomics and computational biology based on R. This article describes the requirements, language features, and methodology of design and development guiding the evolution of this project. Commitments to software interoperability, computable task-oriented documentation, and full transparency of algorithm development and use are found to be valuable in reducing barriers to access faced by statistical, computational, or biological researchers attempting interdisciplinary work. These commitments are expected to foster the propagation of standards of transparency and explicit reproducibility from wet-lab science, where they are well accepted, to in silico biology, where explicit reproduction of important published results is often very difficult. Keywords: computational biology; open source software; object-oriented programming; documentation; network algorithms; software quality assurance; reproducible research

Book ChapterDOI
01 Jan 2005
TL;DR: This section considers some of the different sources of biological information as well as the software tools that can be used to access these data and to integrate them into an analysis.
Abstract: Closing the gap between knowledge of sequence and knowledge of function requires aggressive, integrative use of biological research databases of many different types. For greatest effectiveness, analysis processes and interpretation of analytic results must be guided using relevant knowledge about the systems under investigation. However, this knowledge is often widely scattered and encoded in a variety of formats. In this section, we consider some of the different sources of biological information as well as the software tools that can be used to access these data and to integrate them into an analysis. Bioconductor provides tools for creating, distributing, and accessing annotation resources in ways that have been found effective in workflows for statistical analysis of microarray and other high-throughput assays.
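An era-appropriate sketch of such metadata access (assuming the environment-based hgu95av2 annotation package used circa 2005; the probe identifier is illustrative):

library(hgu95av2)                  # Affymetrix HG-U95Av2 annotation maps
mget("1001_at", hgu95av2SYMBOL)    # probe set -> gene symbol
mget("1001_at", hgu95av2CHR)       # probe set -> chromosome
length(ls(hgu95av2GO))             # probe sets carrying GO annotation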
