
Showing papers by "Robert Gentleman published in 2007"


Journal ArticleDOI
TL;DR: The capabilities of GOstats, a Bioconductor package written in R, that allows users to test GO terms for over- or under-representation using either a classical hypergeometric test or a conditional hypergeometric test that uses the relationships among GO terms to decorrelate the results are discussed.
Abstract: Motivation: Functional analyses based on the association of Gene Ontology (GO) terms to genes in a selected gene list are useful bioinformatic tools and the GOstats package has been widely used to perform such computations. In this paper we report significant improvements and extensions such as support for conditional testing. Results: We discuss the capabilities of GOstats, a Bioconductor package written in R, that allows users to test GO terms for over or under-representation using either a classical hypergeometric test or a conditional hypergeometric that uses the relationships among GO terms to decorrelate the results. Availability: GOstats is available as an R package from the Bioconductor project: http://bioconductor.org Contact: [email protected]
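The classical test described in this abstract can be sketched outside of R. Below is a minimal Python illustration of the hypergeometric over-representation p-value; the function name and all counts are invented toy numbers, not the GOstats API or data from the paper.

```python
from math import comb

def hypergeom_over_pvalue(universe, annotated, selected, hits):
    """P(X >= hits) when drawing `selected` genes without replacement
    from a universe in which `annotated` genes carry the GO term.
    (Hypothetical helper, not part of GOstats.)"""
    total = comb(universe, selected)
    upper = min(annotated, selected)
    return sum(
        comb(annotated, k) * comb(universe - annotated, selected - k)
        for k in range(hits, upper + 1)
    ) / total

# Toy numbers: 1000-gene universe, 40 genes with the term,
# a 100-gene selected list containing 10 of them (expected ~4 by chance).
p = hypergeom_over_pvalue(universe=1000, annotated=40, selected=100, hits=10)
```

The conditional variant discussed in the paper goes further by testing each GO term after accounting for its significant child terms; the sketch above covers only the unconditional case.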

1,890 citations


Journal ArticleDOI
TL;DR: This article describes a software framework for both authoring and distributing integrated, dynamic documents that contain text, code, data, and any auxiliary content needed to recreate the computations in data analyses, methodological descriptions, simulations, and so on.
Abstract: It is important, if not essential, to integrate the computations and code used in data analyses, methodological descriptions, simulations, and so on with the documents that describe and rely on them. This integration allows readers to both verify and adapt the claims in the documents. Authors can easily reproduce the results in the future, and they can present the document's contents in a different medium, for example, with interactive controls. This article describes a software framework for both authoring and distributing these integrated, dynamic documents that contain text, code, data, and any auxiliary content needed to recreate the computations. The documents are dynamic in that the contents—including figures, tables, and so on—can be recalculated each time a view of the document is generated. Our model treats a dynamic document as a master or “source” document from which one can generate different views in the form of traditional, derived documents for different audiences. We introduce the concept o...

272 citations


Journal ArticleDOI
TL;DR: A well-defined procedure is provided to address interpretation issues that can arise when gene sets have substantial overlap, and it is shown how standard dimension reduction methods, such as PCA, can be used to help further interpret GSEA.
Abstract: Motivation: Gene Set Enrichment Analysis (GSEA) has been developed recently to capture changes in the expression of pre-defined sets of genes. We propose a number of extensions to GSEA, including the use of different statistics to describe the association between genes and phenotypes of interest. We make use of dimension reduction procedures, such as principal component analysis, to identify gene sets with correlated expression. We also address issues that arise when gene sets overlap. Results: Our proposals extend the range of applicability of GSEA and allow for adjustments based on other covariates. We have provided a well-defined procedure to address interpretation issues that can arise when gene sets have substantial overlap. We have shown how standard dimension reduction methods, such as PCA, can be used to help further interpret GSEA. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
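As a hedged illustration of the dimension-reduction idea (a Python sketch with simulated data, not the authors' R code): the first principal component of a gene set's centered expression matrix can serve as a per-sample summary of correlated expression within the set.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy expression matrix: 20 samples x 5 genes in one gene set,
# generated so the genes share a common correlated signal.
signal = rng.normal(size=(20, 1))
X = signal @ np.ones((1, 5)) + 0.3 * rng.normal(size=(20, 5))

Xc = X - X.mean(axis=0)              # center each gene
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = U[:, 0] * s[0]                 # per-sample score on the first PC
var_explained = s[0] ** 2 / (s ** 2).sum()
```

When genes in a set are strongly correlated, as constructed here, the first component captures most of the variance, which is what makes it a useful one-number summary per sample for downstream association tests.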

230 citations


Journal ArticleDOI
TL;DR: A brief introduction is given to graph theoretical concepts and their areas of application in molecular biology, and a simple application to the integration of a protein-protein interaction and a co-expression network is presented.
Abstract: Graph theoretical concepts are useful for the description and analysis of interactions and relationships in biological systems. We give a brief introduction into some of the concepts and their areas of application in molecular biology. We discuss software that is available through the Bioconductor project and present a simple example application to the integration of a protein-protein interaction and a co-expression network.
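The integration the abstract describes amounts to combining two edge sets over a shared set of genes. A minimal Python sketch (the paper itself uses Bioconductor graph classes in R; the gene names below are arbitrary yeast-style identifiers invented for this example):

```python
# Hypothetical edge lists: one from a protein-protein interaction
# screen, one from a co-expression network over the same genes.
ppi_edges = {("YAL001C", "YBR123W"), ("YAL001C", "YCR042C"),
             ("YDR145W", "YER148W")}
coexpr_edges = {("YAL001C", "YBR123W"), ("YDR145W", "YER148W"),
                ("YGL112C", "YML098W")}

def normalize(edges):
    # Undirected edges: store each pair in sorted order so that
    # (a, b) and (b, a) compare equal.
    return {tuple(sorted(e)) for e in edges}

# Interactions supported by both data types.
supported = normalize(ppi_edges) & normalize(coexpr_edges)
```

Edges present in both networks carry more evidence than edges seen in only one, which is the basic payoff of this kind of integration.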

136 citations


Journal ArticleDOI
TL;DR: The recent development of semiautomated techniques for staining and analyzing flow cytometry samples has presented new challenges, and experience suggests that significant bottlenecks remain in the development of high throughput flow cytometry methods for data analysis and display.
Abstract: Traditionally, flow cytometry (FCM) has been a tube-based technique limited to small-scale laboratory and clinical studies. High throughput methods for FCM have recently been developed for drug discovery and advanced research methods (1-4). As an example, flow cytometry high content screening (FC-HCS) can process up to a thousand samples daily at a single workstation, and the results have been equivalent or superior to traditional manual multiparameter staining and analysis techniques. The amount of information generated by high throughput technologies, such as FC-HCS, needs to be transformed into executive summaries that are brief enough for creative studies by a human researcher (5). Quality control and quality assessment are crucial steps in the development and use of new high throughput technologies and their associated information services (5-7). Quality control in clinical cell analysis by FCM has been considered (8,9). As an example, Edwards et al. (9) proposed quality scores for monitoring the quality of the immunophenotyping process (e.g., blood acquisition, cell preparation, lymphocyte staining). They showed that a low degree of temporal parameter variation exists within individuals, whereas significant variations can exist between donors with respect to the parameters monitored. However, little has been done with high throughput FCM. For example, quality control of FCM experiments should include the assessment of instrument parameters that affect the accuracy and precision of data. In that respect, Gratama et al. (10) have proposed guidelines such as monitoring the fluorescence measurements by computing calibration plots for each fluorescent parameter. However, such procedures are not yet systematically applied, and data quality assessment is often needed to overcome a lack of data quality control.
The aim of data quality assessment is to detect whether any measurements of any samples are substantially different from the others, in ways that are not likely to be biologically motivated. The rationale is that such samples should be identified, investigated, and potentially removed from any downstream analyses. Quality control, on the other hand, measures such quantities during the assaying procedure and can alert the user to problems at a time when they can be corrected. Data quality assessment in high throughput FCM experiments is complicated by the volume of data involved and by the many processing steps required to produce those data. Each instrument manufacturer has created software to drive the data acquisition process of the cytometer (e.g., CellQuest Pro by BD Biosciences, San Jose, CA; Summit by DakoCytomation, Fort Collins, CO; or Expo32 by Beckman Coulter, Fullerton, CA). These tools are primarily designed for their proprietary instrument interface and offer few, or no, data quality assessment functions. Third party analysis and management tools, such as FlowJo (Tree Star, Ashland, OR), WinList (Verity Software House, Topsham, ME) or FCSExpress (Denovo Software, Thornhill, Canada), provide researchers with more capable “offline” analysis tools but remain limited in terms of data quality assessment. We propose a number of one- and two-dimensional graphical methods for exploring the data in the hope that they will be of use to investigators. The basis of our approach is that, given a cell line, or a single sample, divided into several aliquots, the distribution of the same physical or chemical characteristics (e.g., side light scatter (SSC) or forward light scatter (FSC)) should be similar between aliquots. To test this hypothesis, we made use of graphical exploratory data analysis (EDA).
Five distinct visualization methods were implemented to explore the distributions and densities of ungated FCM data: Empirical Cumulative Distribution Function (ECDF) plots, histograms, boxplots, and two types of bivariate plots. These different graphical methods should provide investigators with different views of the data. ECDF plots have been widely used in the analysis of microarray data, where they help to detect defective print tips or plates of reagents that have not been well handled (11). These plots can quickly reveal differences in the distributions, but are not particularly useful for understanding the shape of a distribution. Histograms help to visualize the shape of the distribution and can reveal structure, such as the mode. Boxplots summarize the location of the distribution and can reveal asymmetry, but are mainly applicable to unimodal distributions. Boxplots are also commonly used in the processing of microarray data, where they help to identify hybridization artifacts and assess the need for between-array normalization to deal with scale differences among different arrays (11). Finally, we use bivariate plots in two different ways. In some cases, when comparing two samples, we found two-dimensional displays more informative: two-dimensional summaries can show differences between samples even when the one-dimensional summaries mentioned earlier are similar. One common use of bivariate plots in FCM experiments is to display the joint distribution of two continuous variables as dot plots (e.g., FSC versus SSC). However, the analysis of such dot plots can be a challenge, as the high density of plotted data points (an average of 10,000 data points per sample) can form a blob in which the frequency of the observations is not easily appreciated. To overcome this issue we propose to use contour plots, where contour lines can be interpreted as the frequency of observations with respect to the x–y plane.
The second use of bivariate plots, for high throughput FCM data, is to render per-well summary statistics for a particular plate in the format of a scatterplot. In this view, each point represents a single well, and the x and y values are chosen to be various summary statistics. We illustrate the need for and usefulness of these visualization tools in assessing FCM data quality through examination of two FC-HCS datasets. Our results demonstrate that the application of these graphical analysis methods to ungated FCM data provides a systematic and efficient method of data quality assessment, preventing time-consuming gating and further analysis of unreliable samples. Although the methods we propose are primarily aimed at the discovery of data quality problems, they may detect differences that are biologically motivated. Hence, we discourage the automatic removal of aberrant samples and emphasize the need to check whether such underlying biological causes are present.
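The ECDF comparison underlying these plots can be sketched numerically: aliquots of the same sample should have similar distributions, so a large maximum ECDF difference flags a suspect aliquot. The sketch below uses invented measurement values, not the rflowcyt API or the paper's data.

```python
import bisect

def ecdf(data):
    """Return F(x) = fraction of observations <= x."""
    xs = sorted(data)
    n = len(xs)
    return lambda x: bisect.bisect_right(xs, x) / n

# Hypothetical FSC-like readings for three aliquots of one sample.
aliquot_a = [100, 102, 98, 101, 99, 103, 97, 100]
aliquot_b = [100, 101, 99, 102, 98, 104, 96, 101]    # similar distribution
aliquot_c = [150, 155, 149, 152, 148, 153, 151, 154]  # shifted: suspect

Fa, Fb, Fc = ecdf(aliquot_a), ecdf(aliquot_b), ecdf(aliquot_c)
grid = range(90, 160)
max_diff_ab = max(abs(Fa(x) - Fb(x)) for x in grid)  # small
max_diff_ac = max(abs(Fa(x) - Fc(x)) for x in grid)  # large: flag aliquot c
```

The plotted ECDF curves make the same comparison visually, across many wells at once, which is why they work well for rapid screening of ungated data.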

47 citations


Journal ArticleDOI
TL;DR: This work reviews the estimation of coverage and error rate in high-throughput protein-protein interaction datasets and argues that reports of the low quality of such data are to a substantial extent based on misinterpretations.
Abstract: We review the estimation of coverage and error rate in high-throughput protein-protein interaction datasets and argue that reports of the low quality of such data are to a substantial extent based on misinterpretations. Probabilistic statistical models and methods can be used to estimate properties of interest and to make the best use of the available data.

44 citations


Journal ArticleDOI
TL;DR: This work assessed the error statistics in all published large-scale datasets for Saccharomyces cerevisiae and characterized them by three traits: the set of tested interactions, artifacts that lead to false-positive or false-negative observations, and estimates of the stochastic error rates that affect the data.
Abstract: Using a directed graph model for bait to prey systems and a multinomial error model, we assessed the error statistics in all published large-scale datasets for Saccharomyces cerevisiae and characterized them by three traits: the set of tested interactions, artifacts that lead to false-positive or false-negative observations, and estimates of the stochastic error rates that affect the data. These traits provide a prerequisite for the estimation of the protein interactome and its modules.
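One ingredient of this kind of error analysis is reciprocity bookkeeping in the directed bait-to-prey graph: when both proteins of a pair were used as baits, the interaction was tested in both directions, and an edge observed in only one direction hints at a false negative (or false positive). A toy Python sketch of that bookkeeping, with invented protein names, not the authors' model:

```python
# Observed directed edges (bait, prey) and the set of proteins used as baits.
observed = {("A", "B"), ("B", "A"), ("A", "C"), ("D", "A")}
baits = {"A", "B", "C", "D"}

# Edges whose endpoints were both baits, hence tested in both directions.
tested_both = {(u, v) for (u, v) in observed
               if u in baits and v in baits}

reciprocated = {(u, v) for (u, v) in tested_both if (v, u) in observed}
unreciprocated = tested_both - reciprocated
```

Counts of reciprocated versus unreciprocated observations over the tested interactions are the kind of raw statistics a multinomial error model can turn into stochastic error-rate estimates.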

41 citations


Journal ArticleDOI
TL;DR: The central concepts and implementation of data structures and methods for studying genetics of gene expression with the GGtools package of Bioconductor are reviewed.
Abstract: Summary: This paper reviews the central concepts and implementation of data structures and methods for studying genetics of gene expression with the GGtools package of Bioconductor. Illustration with a HapMap+expression dataset is provided. Availability: Package GGtools is part of Bioconductor 1.9 (http://bioconductor.org). Open source with Artistic License. Contact: stvjc@channing.harvard.edu

13 citations


Proceedings ArticleDOI
01 Dec 2007
TL;DR: The approach consists of data capture and modeling processes rooted in R/Bioconductor, sample annotation and sequence constituent ontology management based in R, secure data archiving in PostgreSQL, and browser-based workspace creation and management rooted in Zope.
Abstract: This paper describes a framework for collecting, annotating, and archiving high-throughput assays from multiple experiments conducted on one or more series of samples. Specific applications include support for large-scale surveys of related transcriptional profiling studies, for investigations of the genetics of gene expression and for joint analysis of copy number variation and mRNA abundance. Our approach consists of data capture and modeling processes rooted in R/Bioconductor, sample annotation and sequence constituent ontology management based in R, secure data archiving in PostgreSQL, and browser-based workspace creation and management rooted in Zope. This effort has generated a completely transparent, extensible, and customizable interface to large archives of high-throughput assays. Sources and prototype interfaces are accessible at www.sgdi.org/software.

4 citations


Posted Content
TL;DR: The Bioconductor project as discussed by the authors is an initiative for the collaborative creation of extensible software for computational biology and bioinformatics, which aims to foster collaborative development and widespread use of innovative software, reduce barriers to entry into interdisciplinary scientific research, and promote the achievement of remote reproducibility of research results.
Abstract: The Bioconductor project is an initiative for the collaborative creation of extensible software for computational biology and bioinformatics. The goals of the project include: fostering collaborative development and widespread use of innovative software, reducing barriers to entry into interdisciplinary scientific research, and promoting the achievement of remote reproducibility of research results. We describe details of our aims and methods, identify current challenges, compare Bioconductor to other open bioinformatics projects, and provide working examples.

4 citations


01 Jan 2007
TL;DR: Graphical exploratory data analytic tools are quick and useful means of assessing data quality and should be used for quality assessment and, where possible, for quality control.
Abstract: Background: The recent development of semi-automated techniques for staining and analyzing flow cytometry samples has presented new challenges. Quality control and quality assessment are critical when developing new high throughput technologies and their associated information services. Our experience suggests that significant bottlenecks remain in the development of high throughput flow cytometry methods for data analysis and display. In particular, data quality control and quality assessment are crucial steps in processing and analyzing high throughput flow cytometry data. Methods: We propose a variety of graphical exploratory data analytic tools for exploring ungated flow cytometry data. We have implemented a number of specialized functions and methods in the Bioconductor package rflowcyt. We demonstrate the use of these approaches by investigating two independent sets of high throughput flow cytometry data. Results: We found that graphical representations can reveal substantial non-biological differences in samples. Empirical Cumulative Distribution Function plots and summary scatterplots were especially useful in the rapid identification of problems not identified by manual review. Conclusions: Graphical exploratory data analytic tools are quick and useful means of assessing data quality. We propose that the described visualizations be used as quality assessment tools and, where possible, for quality control.

Data Quality Assessment of Ungated Flow Cytometry Data in High Throughput Experiments
Nolwenn Le Meur (a)*, Anthony Rossini (b), Maura Gasparetto (c), Clay Smith (c), Ryan R. Brinkman (c) and Robert Gentleman (a)
(a) Fred Hutchinson Cancer Research Center, Seattle, Washington, USA; (b) Novartis Pharma AG, Basel, Switzerland; (c) Terry Fox Laboratory, British Columbia Cancer Agency, Vancouver, BC, Canada
*Correspondence to Nolwenn Le Meur, Fred Hutchinson Cancer Research Center, Computational Biology, Division of Public Health Science, 1100 Fairview Ave. N., M2-B876, Seattle, Washington 98109-1024; Phone: 206-667-5434; Fax: 206-667-1319; Email: nlemeur@fhcrc.org
Funded by: NIH-NIBIB