Journal Article

An Analytical Method for Multiclass Molecular Cancer Classification

01 Jan 2003-SIAM Review (Society for Industrial and Applied Mathematics)-Vol. 45, Iss. 4, pp. 706-723
TL;DR: In this paper, a computational methodology for multiclass prediction that combines class-specific (one vs. all) binary support vector machines was proposed for the diagnosis of multiple common adult malignancies using DNA microarray data.
Abstract: Modern cancer treatment relies upon microscopic tissue examination to classify tumors according to anatomical site of origin. This approach is effective but subjective and variable even among experienced clinicians and pathologists. Recently, DNA microarray-generated gene expression data has been used to build molecular cancer classifiers. Previous work from our group and others demonstrated methods for solving pairwise classification problems using such global gene expression patterns. However, classification across multiple primary tumor classes poses new methodological and computational challenges. In this paper we describe a computational methodology for multiclass prediction that combines class-specific (one vs. all) binary support vector machines. We apply this methodology to the diagnosis of multiple common adult malignancies using DNA microarray data from a collection of 198 tumor samples, spanning 14 of the most common tumor types. Overall classification accuracy is 78%, far exceeding the expecte...
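The one vs. all combination described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear scorers, their weights, the class labels, and the helper names `decision_value` and `ova_predict` are all hypothetical. In practice each scorer would come from training a binary support vector machine on one class against all the others.

```python
from typing import Dict, List, Tuple

# Hypothetical linear scorer: (weights, bias) for one "class k vs. rest" problem.
LinearScorer = Tuple[List[float], float]


def decision_value(scorer: LinearScorer, x: List[float]) -> float:
    """Signed output w . x + b of one binary classifier."""
    weights, bias = scorer
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def ova_predict(scorers: Dict[str, LinearScorer], x: List[float]) -> Tuple[str, float]:
    """Combine one-vs-all classifiers by picking the class with the largest output.

    Also returns the gap to the runner-up, used here as a rough confidence
    measure (an assumption of this sketch).
    """
    ranked = sorted(
        ((decision_value(s, x), label) for label, s in scorers.items()),
        reverse=True,
    )
    best_value, best_label = ranked[0]
    runner_up = ranked[1][0] if len(ranked) > 1 else float("-inf")
    return best_label, best_value - runner_up


# Toy example: three made-up classes and two made-up expression features.
toy_scorers = {
    "breast": ([1.0, -1.0], 0.0),
    "lung": ([-1.0, 1.0], 0.0),
    "colon": ([0.5, 0.5], -1.0),
}
label, gap = ova_predict(toy_scorers, [2.0, 0.5])
print(label, gap)
```

A small gap between the top two outputs suggests an ambiguous call, which is one reason to report the gap alongside the predicted label.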
Citations
Journal Article
TL;DR: On average and in the majority of microarray datasets, support vector machines outperform random forests, both when no gene selection is performed and when several popular gene selection methods are used.
Abstract: Cancer diagnosis and clinical outcome prediction are among the most important emerging applications of gene expression microarray technology with several molecular signatures on their way toward clinical deployment. Use of the most accurate classification algorithms available for microarray gene expression data is a critical ingredient in order to develop the best possible molecular signatures for patient care. As suggested by a large body of literature to date, support vector machines can be considered "best of class" algorithms for classification of such data. Recent work, however, suggests that random forest classifiers may outperform support vector machines in this domain. In the present paper we identify methodological biases of prior work comparing random forests and support vector machines and conduct a new rigorous evaluation of the two algorithms that corrects these limitations. Our experiments use 22 diagnostic and prognostic datasets and show that support vector machines outperform random forests, often by a large margin. Our data also underlines the importance of sound research design in benchmarking and comparison of bioinformatics algorithms. We found that both on average and in the majority of microarray datasets, random forests are outperformed by support vector machines both in the settings when no gene selection is performed and when several popular gene selection methods are used.

616 citations

Journal Article
Xin Zhou, David Tuck
TL;DR: A family of four extensions to SVM-RFE is proposed to solve the multiclass gene selection problem, based on different frameworks of multiclass SVMs, and identifies genes leading to more accurate classification.
Abstract: Motivation: Given the thousands of genes and the small number of samples, gene selection has emerged as an important research problem in microarray data analysis. Support Vector Machine-Recursive Feature Elimination (SVM-RFE) is one of a group of recently described algorithms which represent the state-of-the-art for gene selection. Just like SVM itself, SVM-RFE was originally designed to solve binary gene selection problems. Several groups have extended SVM-RFE to solve multiclass problems using one-versus-all techniques. However, the genes selected from one binary gene selection problem may reduce the classification performance in other binary problems. Results: In the present study, we propose a family of four extensions to SVM-RFE (called MSVM-RFE) to solve the multiclass gene selection problem, based on different frameworks of multiclass SVMs. By simultaneously considering all classes during the gene selection stages, our proposed extensions identify genes leading to more accurate classification. Contact: david.tuck@yale.edu Supplementary information: Supplementary materials, including a detailed review of both binary and multiclass SVMs, and complete experimental results, are available at Bioinformatics online.

231 citations
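
The recursive elimination loop behind SVM-RFE, with a ranking score that looks at all classes at once, might be sketched as below. Several things here are assumptions rather than the paper's method: summing squared weights across classes is only one plausible multiclass criterion and is not claimed to match any of the four MSVM-RFE variants, and `centroid_weights` is a hypothetical stand-in for SVM training (it is not an SVM) so that the sketch runs without external libraries.

```python
from typing import Callable, List, Sequence

Matrix = List[List[float]]
WeightFitter = Callable[[Matrix, Sequence[int], int], Matrix]


def centroid_weights(X: Matrix, y: Sequence[int], n_classes: int) -> Matrix:
    """Stand-in for SVM training (NOT an SVM): one weight vector per class,
    computed per feature as mean(class k) - mean(all other samples)."""
    n_features = len(X[0])
    weights = []
    for k in range(n_classes):
        inside = [row for row, label in zip(X, y) if label == k]
        outside = [row for row, label in zip(X, y) if label != k]
        weights.append([
            sum(r[j] for r in inside) / len(inside)
            - sum(r[j] for r in outside) / len(outside)
            for j in range(n_features)
        ])
    return weights


def recursive_feature_elimination(
    X: Matrix,
    y: Sequence[int],
    n_classes: int,
    n_keep: int,
    fit_weights: WeightFitter = centroid_weights,
) -> List[int]:
    """Refit on the surviving features and drop the one with the smallest
    summed squared weight, until n_keep features remain.

    Returns the surviving original feature indices.
    """
    active = list(range(len(X[0])))
    while len(active) > n_keep:
        X_active = [[row[j] for j in active] for row in X]
        W = fit_weights(X_active, y, n_classes)
        # Ranking criterion: squared weight summed over all classes.
        scores = [sum(w[j] ** 2 for w in W) for j in range(len(active))]
        worst = min(range(len(active)), key=scores.__getitem__)
        del active[worst]
    return active


# Toy data: features 0 and 2 separate the classes, features 1 and 3 are noise.
toy_X = [
    [5.0, 1.0, 0.0, 1.1],
    [5.2, 1.1, 0.1, 1.0],
    [0.0, 1.0, 4.0, 1.0],
    [0.1, 0.9, 4.1, 1.1],
    [0.0, 1.1, 0.0, 1.0],
    [0.2, 1.0, 0.1, 1.1],
]
toy_y = [0, 0, 1, 1, 2, 2]
selected = recursive_feature_elimination(toy_X, toy_y, n_classes=3, n_keep=2)
print(selected)
```

Passing a different `fit_weights` callable, for example one that wraps a trained linear multiclass SVM, would leave the elimination loop unchanged.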

Journal Article
TL;DR: It is found that random forests, support vector machines, kernel ridge regression, and Bayesian logistic regression with Laplace priors are the most effective machine learning techniques for performing accurate classification from these microbiomic data.
Abstract: Recent advances in next-generation DNA sequencing enable rapid high-throughput quantitation of microbial community composition in human samples, opening up a new field of microbiomics. One of the promises of this field is linking abundances of microbial taxa to phenotypic and physiological states, which can inform development of new diagnostic, personalized medicine, and forensic modalities. Prior research has demonstrated the feasibility of applying machine learning methods to perform body site and subject classification with microbiomic data. However, it is currently unknown which classifiers perform best among the many available alternatives for classification with microbiomic data. In this work, we performed a systematic comparison of 18 major classification methods, 5 feature selection methods, and 2 accuracy metrics using 8 datasets spanning 1,802 human samples and various classification tasks: body site and subject classification and diagnosis. We found that random forests, support vector machines, kernel ridge regression, and Bayesian logistic regression with Laplace priors are the most effective machine learning techniques for performing accurate classification from these microbiomic data.

166 citations

Journal Article
01 Jan 2004
TL;DR: It is demonstrated that the self-organizing clusters of cancerous cells exhibit distinctive graph metrics that distinguish them from the healthy cells and the unhealthy inflamed cells at the cellular level with an accuracy of at least 85%.
Abstract: Summary: We report a novel, proof-of-concept, computational method that models a type of brain cancer (glioma) only by using the topological properties of its cells in the tissue image. From low-magnification (80×) tissue images of 384 × 384 pixels, we construct the graphs of the cells based on the locations of the cells within the images. We generate such cell graphs of 1000–3000 cells (nodes) with 2000–10 000 links, each of which is calculated as a decaying exponential function of the Euclidean distance between every pair of cells in accordance with the Waxman model. At the cellular level, we compute the graph metrics of the cell graphs, including the degree, clustering coefficient, eccentricity and closeness for each cell. Working with a total of 285 tissue samples surgically removed from 12 different patients, we demonstrate that the self-organizing clusters of cancerous cells exhibit distinctive graph metrics that distinguish them from the healthy cells and the unhealthy inflamed cells at the cellular level with an accuracy of at least 85%. At the tissue level, we accomplish correct tissue classifications of cancerous, healthy and non-neoplastic inflamed tissue samples with an accuracy of 100% by requiring correct classification for the majority of the cells within the tissue sample.

165 citations
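
The cell-graph construction in the abstract above can be illustrated with a short sketch. It assumes the standard Waxman parameterization, in which two nodes at distance d are linked with probability alpha * exp(-d / (beta * L)), where L is the largest pairwise distance. The parameter values, the random cell positions, and the function names are hypothetical and are not taken from the paper. Only two of the four listed metrics (degree and clustering coefficient) are shown.

```python
import math
import random
from itertools import combinations
from typing import Dict, List, Set, Tuple

Point = Tuple[float, float]
Graph = Dict[int, Set[int]]


def waxman_cell_graph(cells: List[Point], alpha: float, beta: float,
                      rng: random.Random) -> Graph:
    """Link cells u, v with probability alpha * exp(-d(u, v) / (beta * L)).

    d is the Euclidean distance and L the largest pairwise distance;
    beta is assumed to be strictly positive.
    """
    graph: Graph = {i: set() for i in range(len(cells))}
    pairs = list(combinations(range(len(cells)), 2))
    longest = max((math.dist(cells[u], cells[v]) for u, v in pairs), default=0.0)
    for u, v in pairs:
        d = math.dist(cells[u], cells[v])
        p = alpha * math.exp(-d / (beta * longest)) if longest > 0 else alpha
        if rng.random() < p:
            graph[u].add(v)
            graph[v].add(u)
    return graph


def degree(graph: Graph, node: int) -> int:
    """Number of links attached to `node`."""
    return len(graph[node])


def clustering_coefficient(graph: Graph, node: int) -> float:
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbours, 2) if b in graph[a])
    return 2.0 * links / (k * (k - 1))


# Toy example: 30 made-up cell positions inside a 384 x 384 pixel image.
rng = random.Random(0)
toy_cells = [(rng.uniform(0, 384), rng.uniform(0, 384)) for _ in range(30)]
toy_graph = waxman_cell_graph(toy_cells, alpha=0.5, beta=0.2, rng=rng)
print(degree(toy_graph, 0), clustering_coefficient(toy_graph, 0))
```

Per-cell metrics like these could then serve as the feature vector for a cell-level classifier, with the tissue label taken as the majority vote over cells, as the abstract describes.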

Journal Article
TL;DR: It is shown how a metagene projection methodology can greatly reduce the number of features used to characterize microarray data, and how this approach can help assess and interpret similarities and differences between independent data sets, enable cross-platform and cross-species analysis, improve clustering and class prediction, and provide a computational means to detect and remove sample contamination.
Abstract: The high dimensionality of global transcription profiles, the expression level of 20,000 genes in a much smaller number of samples, presents challenges that affect the sensitivity and general applicability of analysis results. In principle, it would be better to describe the data in terms of a small number of metagenes, positive linear combinations of genes, which could reduce noise while still capturing the invariant biological features of the data. Here, we describe how to accomplish such a reduction in dimension by a metagene projection methodology, which can greatly reduce the number of features used to characterize microarray data. We show, in applications to the analysis of leukemia and lung cancer data sets, how this approach can help assess and interpret similarities and differences between independent data sets, enable cross-platform and cross-species analysis, improve clustering and class prediction, and provide a computational means to detect and remove sample contamination.

154 citations

References
Book
01 Jan 1993
TL;DR: This article presents bootstrap methods for estimation, using simple arguments, and gives Minitab macros for implementing these methods.
Abstract: This article presents bootstrap methods for estimation, using simple arguments. Minitab macros for implementing these methods are given.

37,183 citations
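
As a rough illustration of bootstrap estimation, the sketch below estimates the standard error of a statistic by resampling with replacement. It is written in Python rather than as the Minitab macros the abstract mentions, and the function name, the default of 2000 resamples, and the sample values are all hypothetical.

```python
import random
import statistics
from typing import Callable, List, Sequence


def bootstrap_standard_error(
    data: Sequence[float],
    estimator: Callable[[Sequence[float]], float],
    n_resamples: int = 2000,
    seed: int = 0,
) -> float:
    """Estimate the standard error of `estimator` by resampling `data`
    with replacement and taking the spread of the replicated estimates."""
    rng = random.Random(seed)
    n = len(data)
    replicates: List[float] = []
    for _ in range(n_resamples):
        # One bootstrap resample: n draws with replacement from the data.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        replicates.append(estimator(resample))
    return statistics.stdev(replicates)


# Made-up sample used only to exercise the function.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9, 3.7, 2.5]
se_mean = bootstrap_standard_error(sample, statistics.mean)
se_median = bootstrap_standard_error(sample, statistics.median)
print(se_mean, se_median)
```

The same function works for statistics such as the median, for which no simple closed-form standard error is available, which is where the bootstrap is most useful.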

01 Jan 1998
TL;DR: Presenting a method for determining the necessary and sufficient conditions for consistency of learning process, the author covers function estimates from small data pools, applying these estimations to real-life problems, and much more.
Abstract: A comprehensive look at learning and generalization theory. The statistical theory of learning and generalization concerns the problem of choosing desired functions on the basis of empirical data. Highly applicable to a variety of computer science and robotics fields, this book offers lucid coverage of the theory as a whole. Presenting a method for determining the necessary and sufficient conditions for consistency of learning process, the author covers function estimates from small data pools, applying these estimations to real-life problems, and much more.

26,531 citations

Journal Article
15 Oct 1999-Science
TL;DR: A generic approach to cancer classification based on gene expression monitoring by DNA microarrays is described and applied to human acute leukemias as a test case and suggests a general strategy for discovering and predicting cancer classes for other types of cancer, independent of previous biological knowledge.
Abstract: Although cancer classification has improved over the past 30 years, there has been no general approach for identifying new cancer classes (class discovery) or for assigning tumors to known classes (class prediction). Here, a generic approach to cancer classification based on gene expression monitoring by DNA microarrays is described and applied to human acute leukemias as a test case. A class discovery procedure automatically discovered the distinction between acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) without previous knowledge of these classes. An automatically derived class predictor was able to determine the class of new leukemia cases. The results demonstrate the feasibility of cancer classification based solely on gene expression monitoring and suggest a general strategy for discovering and predicting cancer classes for other types of cancer, independent of previous biological knowledge.

12,530 citations

Book
01 Jan 1951
TL;DR: Originally published in 1951, Social Choice and Individual Values introduced "Arrow's Impossibility Theorem" and founded the field of social choice theory in economics and political science; this new edition includes a new foreword by Nobel laureate Eric Maskin.
Abstract: Originally published in 1951, Social Choice and Individual Values introduced "Arrow's Impossibility Theorem" and founded the field of social choice theory in economics and political science. This new edition, including a new foreword by Nobel laureate Eric Maskin, reintroduces Arrow's seminal book to a new generation of students and researchers. "Far beyond a classic, this small book unleashed the ongoing explosion of interest in social choice and voting theory. A half-century later, the book remains full of profound insight: its central message, 'Arrow's Theorem,' has changed the way we think."-Donald G. Saari, author of Decisions and Elections: Explaining the Unexpected

8,219 citations