Showing papers by "Pablo Tamayo" published in 2003


Journal ArticleDOI
TL;DR: An analytical strategy is introduced, Gene Set Enrichment Analysis, designed to detect modest but coordinate changes in the expression of groups of functionally related genes, which identifies a set of genes involved in oxidative phosphorylation whose expression is coordinately decreased in human diabetic muscle.
Abstract: DNA microarrays can be used to identify gene expression changes characteristic of human disease. This is challenging, however, when relevant differences are subtle at the level of individual genes. We introduce an analytical strategy, Gene Set Enrichment Analysis, designed to detect modest but coordinate changes in the expression of groups of functionally related genes. Using this approach, we identify a set of genes involved in oxidative phosphorylation whose expression is coordinately decreased in human diabetic muscle. Expression of these genes is high at sites of insulin-mediated glucose disposal, activated by PGC-1α and correlated with total-body aerobic capacity. Our results associate this gene set with clinically important variation in human metabolism and illustrate the value of pathway relationships in the analysis of genomic profiling experiments.

7,997 citations
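
The abstract above does not state the statistic itself, so the following is only a minimal sketch of one way to score "modest but coordinate" changes: a Kolmogorov-Smirnov-style running sum over a ranked gene list. The function name, the step weights, and the toy gene labels are assumptions made for illustration; assessing significance (for example by permuting class labels) is left out.

```python
import math

def enrichment_score(ranked_genes, gene_set):
    """Maximum of a running sum walked down a ranked gene list.

    Assumed weighting (not taken from the abstract): each gene in the set
    adds sqrt((N - G) / G), each gene outside it subtracts sqrt(G / (N - G)),
    where N is the list length and G the number of set members in the list.
    With these weights the walk returns to zero at the end of the list.
    """
    n = len(ranked_genes)
    hits = sum(1 for gene in ranked_genes if gene in gene_set)
    if hits == 0 or hits == n:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    step_up = math.sqrt((n - hits) / hits)
    step_down = math.sqrt(hits / (n - hits))
    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        running += step_up if gene in gene_set else -step_down
        best = max(best, running)
    return best

# Hypothetical ranking, most differentially expressed gene first.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
top_score = enrichment_score(ranked, {"g1", "g2"})
bottom_score = enrichment_score(ranked, {"g5", "g6"})
```

A set whose members cluster at the top of the ranking scores high even if no single gene stands out, which is the behaviour the abstract describes.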


Journal ArticleDOI
TL;DR: A new methodology of class discovery and clustering validation tailored to the task of analyzing gene expression data is presented, and in conjunction with resampling techniques, it provides for a method to represent the consensus across multiple runs of a clustering algorithm and to assess the stability of the discovered clusters.
Abstract: In this paper we present a new methodology of class discovery and clustering validation tailored to the task of analyzing gene expression data. The method can best be thought of as an analysis approach, to guide and assist in the use of any of a wide range of available clustering algorithms. We call the new methodology consensus clustering, and in conjunction with resampling techniques, it provides for a method to represent the consensus across multiple runs of a clustering algorithm and to assess the stability of the discovered clusters. The method can also be used to represent the consensus over multiple runs of a clustering algorithm with random restart (such as K-means, model-based Bayesian clustering, SOM, etc.), so as to account for its sensitivity to the initial conditions. Finally, it provides for a visualization tool to inspect cluster number, membership, and boundaries. We present the results of our experiments on both simulated data and real gene expression data aimed at evaluating the effectiveness of the methodology in discovering biologically meaningful clusters.

1,831 citations
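
As a rough illustration of the resampling idea in the abstract above, the sketch below builds a consensus matrix: for each pair of items, the fraction of resampled runs in which both were drawn and ended up in the same cluster. The tiny 1-D k-means, the 80% subsampling fraction, and all function names are assumptions chosen to keep the example self-contained; any clustering algorithm could be substituted.

```python
import random
from itertools import combinations

def kmeans_1d(values, k, rng, iters=20):
    """Tiny 1-D k-means with random initial centers; returns one label per value."""
    centers = rng.sample(values, k)
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = sum(members) / len(members)
    return labels

def consensus_matrix(data, k, runs=50, frac=0.8, seed=0):
    """Entry (i, j) = times i and j co-clustered / times both were sampled."""
    rng = random.Random(seed)
    n = len(data)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    size = max(k, int(frac * n))
    for _ in range(runs):
        idx = rng.sample(range(n), size)          # subsample without replacement
        labels = kmeans_1d([data[i] for i in idx], k, rng)
        for (a, la), (b, lb) in combinations(zip(idx, labels), 2):
            sampled[a][b] += 1
            sampled[b][a] += 1
            if la == lb:
                together[a][b] += 1
                together[b][a] += 1
    return [
        [
            1.0 if i == j
            else (together[i][j] / sampled[i][j] if sampled[i][j] else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]

# Hypothetical data: two well-separated groups of four values each.
toy = [0.0, 0.1, 0.2, 0.3, 10.0, 10.1, 10.2, 10.3]
matrix = consensus_matrix(toy, k=2)
```

Entries near 1 or 0 indicate pairs that are consistently grouped together or apart; intermediate values point to unstable cluster assignments.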


Journal Article
TL;DR: It is suggested that class prediction models, based on defined molecular profiles, classify diagnostically challenging malignant gliomas in a manner that better correlates with clinical outcome than does standard pathology.
Abstract: In modern clinical neuro-oncology, histopathological diagnosis affects therapeutic decisions and prognostic estimation more than any other variable. Among high-grade gliomas, histologically classic glioblastomas and anaplastic oligodendrogliomas follow markedly different clinical courses. Unfortunately, many malignant gliomas are diagnostically challenging; these nonclassic lesions are difficult to classify by histological features, generating considerable interobserver variability and limited diagnostic reproducibility. The resulting tentative pathological diagnoses create significant clinical confusion. We investigated whether gene expression profiling, coupled with class prediction methodology, could be used to classify high-grade gliomas in a manner more objective, explicit, and consistent than standard pathology. Microarray analysis was used to determine the expression of ∼12,000 genes in a set of 50 gliomas, 28 glioblastomas and 22 anaplastic oligodendrogliomas. Supervised learning approaches were used to build a two-class prediction model based on a subset of 14 glioblastomas and 7 anaplastic oligodendrogliomas with classic histology. A 20-feature k-nearest neighbor model correctly classified 18 of the 21 classic cases in leave-one-out cross-validation when compared with pathological diagnoses. This model was then used to predict the classification of clinically common, histologically nonclassic samples. When tumors were classified according to pathology, the survival of patients with nonclassic glioblastoma and nonclassic anaplastic oligodendroglioma was not significantly different (P = 0.19). However, class distinctions according to the model were significantly associated with survival outcome (P = 0.05). This class prediction model was capable of classifying high-grade, nonclassic glial tumors objectively and reproducibly. Moreover, the model provided a more accurate predictor of prognosis in these nonclassic lesions than did pathological classification. These data suggest that class prediction models, based on defined molecular profiles, classify diagnostically challenging malignant gliomas in a manner that better correlates with clinical outcome than does standard pathology.

926 citations
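
The evaluation scheme named in the abstract above, a k-nearest neighbor model scored by leave-one-out cross-validation, can be sketched as follows. This is a minimal illustration under stated assumptions: Euclidean distance, majority voting, k = 3, and the two-feature toy samples are all hypothetical, and the selection of the 20 features is omitted.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k training samples closest to `query`."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def loocv_correct(samples, k=3):
    """Hold out each sample in turn, predict it from the rest, count hits."""
    correct = 0
    for i, (features, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        if knn_predict(rest, features, k) == label:
            correct += 1
    return correct

# Hypothetical expression profiles: (feature vector, class label).
toy_samples = [
    ((1.0, 1.2), "A"), ((0.9, 1.0), "A"), ((1.1, 0.8), "A"), ((1.3, 1.1), "A"),
    ((5.0, 5.2), "B"), ((4.8, 5.1), "B"), ((5.3, 4.9), "B"), ((5.1, 5.4), "B"),
]
n_correct = loocv_correct(toy_samples, k=3)
```

The count returned by `loocv_correct` corresponds to a figure of the form "18 of the 21 cases" in the abstract: each sample is classified by a model that never saw it.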


Journal ArticleDOI
TL;DR: A statistical methodology for estimating dataset size requirements for classifying microarray data using learning curves is introduced, based on fitting inverse power-law models to construct empirical learning curves.
Abstract: A statistical methodology for estimating dataset size requirements for classifying microarray data using learning curves is introduced. The goal is to use existing classification results to estimate dataset size requirements for future classification experiments and to evaluate the gain in accuracy and significance of classifiers built with additional data. The method is based on fitting inverse power-law models to construct empirical learning curves. It also includes a permutation test procedure to assess the statistical significance of classification performance for a given dataset size. This procedure is applied to several molecular classification problems representing a broad spectrum of levels of complexity.

274 citations
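
To make the curve-fitting step in the abstract above concrete, here is a minimal sketch that fits an inverse power law of the assumed form error(n) = a * n^(-alpha) + b to observed error rates and extrapolates to a larger dataset. The parameter names, the choice to treat the asymptotic error b as known, and the synthetic numbers are assumptions for illustration; the permutation test mentioned in the abstract is not shown.

```python
import math

def fit_inverse_power_law(sizes, errors, b=0.0):
    """Least-squares fit of log(error - b) = log(a) - alpha * log(n).

    `b` is an assumed asymptotic error rate and must be smaller than
    every observed error. Returns (a, alpha).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e - b) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

def predict_error(n, a, alpha, b=0.0):
    """Error rate the fitted curve predicts for a training set of size n."""
    return a * n ** (-alpha) + b

# Synthetic learning curve generated from a = 1.5, alpha = 0.5, b = 0.05.
sizes = [10, 20, 40, 80]
errors = [1.5 * n ** -0.5 + 0.05 for n in sizes]
a_hat, alpha_hat = fit_inverse_power_law(sizes, errors, b=0.05)
projected = predict_error(320, a_hat, alpha_hat, b=0.05)
```

Evaluating `predict_error` at sizes beyond those observed gives the kind of estimate the abstract describes: how much accuracy additional samples are expected to buy.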


Journal ArticleDOI
TL;DR: Different patterns of gene expression, following carefully tuned biological programs according to tissue type, developmental stage, environment and genetic background, account for the huge variety of cell states and types.
Abstract: All organisms on Earth, except for viruses, consist of cells. Yeast, for example, has one cell, while humans have trillions of cells. All cells have a nucleus, and inside the nucleus there is DNA, which encodes the “program” for making future organisms. DNA has coding and non-coding segments, and coding segments, called “genes”, specify the structure of proteins, which are large molecules, like hemoglobin, that do the essential work in every organism. Practically all cells in the same organism have the same genes, but these genes can be expressed differently at different times and under different conditions. Genes make proteins in two steps. First, DNA is transcribed into messenger RNA or mRNA, which in turn is translated into proteins. Different patterns of gene expression, following carefully tuned biological programs according to tissue type, developmental stage, environment and genetic background, account for the huge variety of cell states and types. Virtually all major differences in cell state or type are correlated with changes in the mRNA levels of many genes.

196 citations


01 Jan 2003
TL;DR: A new methodology of class discovery and clustering validation tailored to the task of analyzing gene expression data is presented and in conjunction with resampling techniques, it provides for a method to represent the consensus across multiple runs of a clustering algorithm and to assess the stability of the discovered clusters.
Abstract: In this paper we present a new methodology of class discovery and clustering validation tailored to the task of analyzing gene expression data. The method can best be thought of as an analysis approach, to guide and assist in the use of any of a wide range of available clustering algorithms. We call the new methodology consensus clustering, and in conjunction with resampling techniques, it provides for a method to represent the consensus across multiple runs of a clustering algorithm and to assess the stability of the discovered clusters. The method can also be used to represent the consensus over multiple runs of a clustering algorithm with random restart (such as K-means, model-based Bayesian clustering, SOM, etc.), so as to account for its sensitivity to the initial conditions. Finally, it provides for a visualization tool to inspect cluster number, membership, and boundaries. We present the results of our experiments on both simulated data and real gene expression data aimed at evaluating the effectiveness of the methodology in discovering biologically meaningful clusters.

118 citations


Journal ArticleDOI
TL;DR: In this paper, a computational methodology for multiclass prediction that combines class-specific (one vs. all) binary support vector machines was proposed for the diagnosis of multiple common adult malignancies using DNA microarray data.
Abstract: Modern cancer treatment relies upon microscopic tissue examination to classify tumors according to anatomical site of origin. This approach is effective but subjective and variable even among experienced clinicians and pathologists. Recently, DNA microarray-generated gene expression data has been used to build molecular cancer classifiers. Previous work from our group and others demonstrated methods for solving pairwise classification problems using such global gene expression patterns. However, classification across multiple primary tumor classes poses new methodological and computational challenges. In this paper we describe a computational methodology for multiclass prediction that combines class-specific (one vs. all) binary support vector machines. We apply this methodology to the diagnosis of multiple common adult malignancies using DNA microarray data from a collection of 198 tumor samples, spanning 14 of the most common tumor types. Overall classification accuracy is 78%, far exceeding the expecte...

63 citations


Patent
06 Aug 2003
TL;DR: Systems and methods for cross-platform, multiple-dataset classification are described. In one embodiment, the systems combine a Large Bayes classification framework with a definition of combined relative features to represent the original values.
Abstract: Systems and methods for across platform and multiple dataset classification. In one embodiment, the systems combine a Large Bayes classification framework, constructed from discovered itemsets or common patterns of data, with a definition of combined relative features to represent the original values. One realization of this method is that different datasets representing the same biological system display some amount of invariant biological characteristics independent of the idiosyncrasies of sample sources, preparation and the technological platform used to obtain the measurements. These invariant biological characteristics, when captured and exposed, can provide the basis to build robust, general and accurate classification models based on reproducible biological behavior.

12 citations