Showing papers by "Robert Gentleman" published in 2004


Journal Article
TL;DR: Details of the aims and methods of Bioconductor, the collaborative creation of extensible software for computational biology and bioinformatics, and current challenges are described.
Abstract: The Bioconductor project is an initiative for the collaborative creation of extensible software for computational biology and bioinformatics. The goals of the project include: fostering collaborative development and widespread use of innovative software, reducing barriers to entry into interdisciplinary scientific research, and promoting the achievement of remote reproducibility of research results. We describe details of our aims and methods, identify current challenges, compare Bioconductor to other open bioinformatics projects, and provide working examples.

12,142 citations


Journal Article
TL;DR: The default ad hoc adjustment, provided as part of the Affymetrix system, can be improved through the use of estimators derived from a statistical model that uses probe sequence information, which greatly improves the performance of the technology in various practical applications.
Abstract: High-density oligonucleotide expression arrays are widely used in many areas of biomedical research. Affymetrix GeneChip arrays are the most popular. In the Affymetrix system, a fair amount of further preprocessing and data reduction occurs after the image-processing step. Statistical procedures developed by academic groups have been successful in improving the default algorithms provided by the Affymetrix system. In this article we present a solution to one of the preprocessing steps, background adjustment, based on a formal statistical framework. Our solution greatly improves the performance of the technology in various practical applications. These arrays use short oligonucleotides to probe for genes in an RNA sample. Typically, each gene is represented by 11–20 pairs of oligonucleotide probes. The first component of these pairs is referred to as a perfect match probe and is designed to hybridize only with transcripts from the intended gene (i.e., specific hybridization). However, hybridization by other...

1,925 citations


Journal Article
01 Apr 2004-Blood
TL;DR: It is demonstrated that gene expression profiling can identify a limited number of genes that are predictive of response to induction therapy and remission duration in adult patients with T-ALL.

367 citations


Posted Content
TL;DR: In this article, a statistical framework for background adjustment of Affymetrix GeneChip arrays is presented, which is based on simple hybridization theory from molecular biology and experiments specifically designed to help develop it.
Abstract: High density oligonucleotide expression arrays are widely used in many areas of biomedical research. Affymetrix GeneChip arrays are the most popular. In the Affymetrix system, a fair amount of further pre-processing and data reduction occurs following the image processing step. Statistical procedures developed by academic groups have been successful at improving the default algorithms provided by the Affymetrix system. In this paper we present a solution to one of the pre-processing steps, background adjustment, based on a formal statistical framework. Our solution greatly improves the performance of the technology in various practical applications. Affymetrix GeneChip arrays use short oligonucleotides to probe for genes in an RNA sample. Typically each gene will be represented by 11-20 pairs of oligonucleotide probes. The first component of these pairs is referred to as a perfect match probe and is designed to hybridize only with transcripts from the intended gene (specific hybridization). However, hybridization by other sequences (non-specific hybridization) is unavoidable. Furthermore, hybridization strengths are measured by a scanner that introduces optical noise. Therefore, the observed intensities need to be adjusted to give accurate measurements of specific hybridization. One approach to adjusting is to pair each perfect match probe with a mismatch probe that is designed with the intention of measuring non-specific hybridization. The default adjustment, provided as part of the Affymetrix system, is based on the difference between perfect match and mismatch probe intensities. We have found that this approach can be improved via the use of estimators derived from a statistical model that uses probe sequence information.
The model is based on simple hybridization theory from molecular biology and experiments specifically designed to help develop it. A final step in the pre-processing of these arrays is to combine the 11-20 probe pair intensities, after background adjustment and normalization, for a given gene to define a measure of expression that represents the amount of the corresponding mRNA species. In this paper we illustrate the practical consequences of not adjusting appropriately for the presence of nonspecific hybridization and provide a solution based on our background adjustment procedure. Software that computes our adjustment is available as part of the Bioconductor project (http://www.bioconductor.
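The difference-based adjustment mentioned in this abstract can be illustrated with a minimal Python sketch. It assumes the adjustment is a plain per-pair subtraction of the mismatch intensity from the perfect match intensity; the abstract does not give the exact Affymetrix procedure, and the model-based estimator proposed by the authors is not reproduced here. The function name `pm_minus_mm` and the intensity values are hypothetical, chosen only for illustration.

```python
def pm_minus_mm(pm, mm):
    """Subtract each mismatch (MM) intensity from its paired perfect match (PM) intensity.

    Hypothetical helper: a plain difference, assumed here as a stand-in for the
    difference-based adjustment described in the abstract.
    """
    if len(pm) != len(mm):
        raise ValueError("pm and mm must contain the same number of probes")
    return [p - m for p, m in zip(pm, mm)]


# Made-up intensities for three probe pairs of one gene (illustration only).
pm = [1200.0, 340.0, 95.0]
mm = [300.0, 310.0, 140.0]

adjusted = pm_minus_mm(pm, mm)
print(adjusted)  # the third pair has MM > PM, so its difference is negative
```

In this toy input the third probe pair has a mismatch intensity larger than its perfect match intensity, so the difference is negative and has no real logarithm. The sketch only makes that arithmetic visible; it says nothing about how often such pairs occur on real arrays.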

159 citations


Book Chapter
01 Jan 2004
TL;DR: Meta-data packages from the Bioconductor Project are used to carry out statistical analyses of gene expression data and many of the methods described here could be applied to other types of high-throughput data.
Abstract: In this paper we use meta-data packages from the Bioconductor Project to carry out statistical analyses of gene expression data. We would like to note, however, that the potential scope of these applications is much broader and many of the methods described here could be applied to other types of high-throughput data. To provide context we make use of data from an investigation into acute lymphoblastic leukemia.

49 citations


Journal Article
TL;DR: A graph-theoretic approach is presented to test the significance of the association between multiple disparate sources of functional genomics data by proposing two statistical tests, namely edge permutation and node label permutation tests.
Abstract: Motivation: The last few years have seen the advent of high-throughput technologies to analyze various properties of the transcriptome and proteome of several organisms. The congruency of these different data sources, or lack thereof, can shed light on the mechanisms that govern cellular function. A central challenge for bioinformatics research is to develop a unified framework for combining the multiple sources of functional genomics information and testing associations between them, thus obtaining a robust and integrated view of the underlying biology. Results: We present a graph-theoretic approach to test the significance of the association between multiple disparate sources of functional genomics data by proposing two statistical tests, namely edge permutation and node label permutation tests. We demonstrate the use of the proposed tests by finding significant association between a Gene Ontology-derived predictome and data obtained from mRNA expression and phenotypic experiments for Saccharomyces cerevisiae. Moreover, we employ the graph-theoretic framework to recast a surprising discrepancy presented elsewhere between gene expression and knockout phenotype, using expression data from a different set of experiments. Availability: An R software package, GraphAT, containing the data and statistical procedures is available from Bioconductor: http://www.bioconductor.org
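As a rough illustration of the node label permutation idea named in this abstract, the following Python sketch tests whether two graphs on the same node set share more edges than expected by chance. It is not the GraphAT implementation: the test statistic (number of shared undirected edges), the add-one p-value correction, and all function names are assumptions made for this example. The edge permutation test is not sketched.

```python
import random


def normalize(edges):
    """Store undirected edges as frozensets so that (u, v) equals (v, u)."""
    return {frozenset(e) for e in edges}


def shared_edge_count(edges_a, edges_b):
    """Number of undirected edges present in both graphs."""
    return len(edges_a & edges_b)


def node_label_permutation_test(nodes, edges_a, edges_b, n_perm=1000, seed=0):
    """Estimate how often a random relabeling of graph B's nodes shares at
    least as many edges with graph A as the observed labeling does."""
    rng = random.Random(seed)
    nodes = list(nodes)
    a = normalize(edges_a)
    b = normalize(edges_b)
    observed = shared_edge_count(a, b)

    hits = 0
    for _ in range(n_perm):
        shuffled = nodes[:]
        rng.shuffle(shuffled)
        relabel = dict(zip(nodes, shuffled))
        # Relabel every node of B; the structure of B is unchanged.
        b_perm = {frozenset(relabel[v] for v in e) for e in b}
        if shared_edge_count(a, b_perm) >= observed:
            hits += 1

    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (hits + 1) / (n_perm + 1)
    return observed, p_value


# Toy example: two identical path graphs on six nodes.
path = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")]
observed, p_value = node_label_permutation_test("abcdef", path, path)
print(observed, p_value)
```

Permuting node labels keeps each graph's structure intact and breaks only the correspondence between the two graphs, which is why it serves as a null model for "no association" in this sketch.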

46 citations


Journal Article
TL;DR: Software and a paradigm for the creation of data packages for curating, distributing and working with probe sequence data in a uniform, across-types-of-microarrays manner are described.
Abstract: Summary: The nucleotide sequences of the probes on a microarray can be used for a variety of purposes in the analysis of microarray experiments. We describe software and a paradigm for the creation of data packages for curating, distributing and working with probe sequence data in a uniform, across-types-of-microarrays manner. While the implementation is specific to the Bioconductor project, the ideas and general strategies are more general and could be easily adopted by other projects. Availability: The R package matchprobes is available under LGPL at http://www.bioconductor.org Supplementary information: The package contains documentation in the form of a vignette and manual pages.

40 citations


Journal Article
TL;DR: The book is an edited collection of contributions by a wide range of researchers from different backgrounds and provides a good overview of the current state of the art in computational molecular biology, with little reference to topics that one might classify as bioinformatics.
Abstract: The book is an edited collection of contributions by a wide range of researchers from different backgrounds. In all, there are 19 chapters organized in four sections: Introduction, Comparative Sequence and Genome Analysis, Data Mining and Pattern Discovery, and Computational Structural Biology. I believe that any statistician working in the area, or interested in working in the area, will find this a useful resource and source of ideas, with references to the relevant literature, software, and web-based resources. The last three sections cover three of the major areas of research in computational molecular biology. Within these sections, the topics covered are quite varied and provide the reader with a good overview of the current state of the art. Each chapter is a self-contained, comprehensive overview of the subject. The chapter authors have good credentials, and the materials are largely accessible to statisticians. The authors tend to focus on the mathematical and statistical problems that need to be addressed, rather than on the biological issues. In the Preface, the editors suggest that the book is about bioinformatics; my own view is that it is not. Rather, the book is largely about computational molecular biology. There is little reference to topics that one might classify as bioinformatics (Chaps. 11 and 12 being notable exceptions). Readers interested in this field will not find much that is helpful here. In fact, the introduction by T. F. Smith is in some ways at odds with the remainder of the book, as it does focus on bioinformatics to some extent and is forward-looking in a way that few of the other contributions are. The emphasis throughout is on the mathematical and statistical models used. Although some discussion of the biology is often given in the introduction of each chapter, the exposition is mainly mathematical and often quite detailed. Most of the chapters are complete and self-contained.
There is often overlap in broad concepts between chapters, but where there is overlap, different chapters tend to emphasize different aspects. Chapters 4, 5, and 6 all touch on the concept of evolution, and the study of evolution, at the genomic level. However, they are focused largely on different areas, and, unfortunately, there is no comparison of these methods or guidance on when one might be more appropriate than the others. As in all books that are collections of chapters, some chapters are better written than others, and some provide very good motivation for the problems, whereas others do not. Overall, the level is good and the chapters are well written. The one deficiency I found was the index. It is intended to be comprehensive but does not achieve that goal. The reader cannot easily find related concepts reported and used in different contexts. Some of the reasons why statisticians should be interested in this area of research are noted by Jun Liu in Chapter 2 (p. 11): Formal statistical modeling together with advanced statistical algorithms provide us a powerful “workbench” for developing innovative computational strategies for making proper inferences to account for estimation uncertainties.

29 citations


Journal Article
TL;DR: An analytic approach for framing biological questions in terms of statistical parameters to efficiently and confidently answer questions of interest using microarray data from factorial designed experiments is discussed.

25 citations


Journal Article
TL;DR: It is shown that the RECORD libraries may be used for digital karyotyping and for pathogen identification by computational subtraction in genomic representation using Type IIB restriction endonucleases.
Abstract: We have developed a method for genomic representation using Type IIB restriction endonucleases. Representation by concatenation of restriction digests, or RECORD, is an approach to sample the fragments generated by cleavage with these enzymes. Here, we show that the RECORD libraries may be used for digital karyotyping and for pathogen identification by computational subtraction.

24 citations



Journal Article
TL;DR: In this paper, the authors argue for an increased emphasis on computing in the training of statisticians and in their professional practice, and they describe some of the current technological challenges and demonstrate the importance for statisticians of becoming more active in computational aspects of their work and specifically in producing software for carrying out statistical procedures.
Abstract: The author argues for an increased emphasis on computing in the training of statisticians and in their professional practice. He describes some of the current technological challenges and demonstrates the importance for statisticians of becoming more active in computational aspects of their work and specifically in producing software for carrying out statistical procedures. Such a reorientation will require substantial changes in thinking, pedagogy and infrastructure; the author mentions some of the conditions required to achieve these goals.