An Analytical Method for Multiclass Molecular Cancer Classification

doi:10.1137/S0036144502411986

Journal Article

Ryan Rifkin, Sayan Mukherjee, Pablo Tamayo, Sridhar Ramaswamy, Chen-Hsiang Yeang, Michael Angelo, Michael R. Reich, Tomaso Poggio, Eric S. Lander, Todd R. Golub, Jill P. Mesirov

01 Jan 2003-SIAM Review (Society for Industrial and Applied Mathematics)-Vol. 45, Iss. 4, pp. 706-723

TL;DR: This paper proposes a computational methodology for multiclass prediction that combines class-specific (one vs. all) binary support vector machines, and applies it to the diagnosis of multiple common adult malignancies using DNA microarray data.

Abstract: Modern cancer treatment relies upon microscopic tissue examination to classify tumors according to anatomical site of origin. This approach is effective but subjective and variable even among experienced clinicians and pathologists. Recently, DNA microarray-generated gene expression data has been used to build molecular cancer classifiers. Previous work from our group and others demonstrated methods for solving pairwise classification problems using such global gene expression patterns. However, classification across multiple primary tumor classes poses new methodological and computational challenges. In this paper we describe a computational methodology for multiclass prediction that combines class-specific (one vs. all) binary support vector machines. We apply this methodology to the diagnosis of multiple common adult malignancies using DNA microarray data from a collection of 198 tumor samples, spanning 14 of the most common tumor types. Overall classification accuracy is 78%, far exceeding the expecte...
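The combination step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each class-specific (one vs. all) classifier is linear and already trained, and that the per-class outputs are combined with the common winner-takes-all rule (the class whose classifier gives the largest decision value wins). The names `decision_value` and `ova_predict`, and the weights in the usage note, are hypothetical.

```python
from typing import Dict, Sequence, Tuple

# A trained linear binary classifier, represented as (weights, bias).
# In a one-vs-all setup there is one such classifier per tumor class,
# trained to separate that class from all the others.
LinearClassifier = Tuple[Sequence[float], float]


def decision_value(weights: Sequence[float], bias: float,
                   x: Sequence[float]) -> float:
    """Signed output w.x + b of one binary classifier for sample x."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def ova_predict(classifiers: Dict[str, LinearClassifier],
                x: Sequence[float]) -> Tuple[str, Dict[str, float]]:
    """Combine one-vs-all classifiers by winner-takes-all (assumed rule).

    Returns the label with the largest decision value together with
    all per-class decision values.
    """
    scores = {
        label: decision_value(weights, bias, x)
        for label, (weights, bias) in classifiers.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores
```

For example, with three made-up classifiers `{"breast": ([1.0, 0.0], -0.5), "lung": ([0.0, 1.0], -0.5), "colon": ([-1.0, -1.0], 0.5)}` and the sample `[0.2, 0.9]`, only the "lung" classifier produces a positive decision value, so `ova_predict` returns "lung".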

Citations

Journal Article

A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification

Alexander Statnikov¹, Lily Wang¹, Constantin F. Aliferis

Vanderbilt University¹

22 Jul 2008-BMC Bioinformatics

TL;DR: Both on average and in the majority of microarray datasets, random forests are outperformed by support vector machines both in the settings when no gene selection is performed and when several popular gene selection methods are used.

Abstract: Cancer diagnosis and clinical outcome prediction are among the most important emerging applications of gene expression microarray technology with several molecular signatures on their way toward clinical deployment. Use of the most accurate classification algorithms available for microarray gene expression data is a critical ingredient in order to develop the best possible molecular signatures for patient care. As suggested by a large body of literature to date, support vector machines can be considered "best of class" algorithms for classification of such data. Recent work, however, suggests that random forest classifiers may outperform support vector machines in this domain. In the present paper we identify methodological biases of prior work comparing random forests and support vector machines and conduct a new rigorous evaluation of the two algorithms that corrects these limitations. Our experiments use 22 diagnostic and prognostic datasets and show that support vector machines outperform random forests, often by a large margin. Our data also underlines the importance of sound research design in benchmarking and comparison of bioinformatics algorithms. We found that both on average and in the majority of microarray datasets, random forests are outperformed by support vector machines both in the settings when no gene selection is performed and when several popular gene selection methods are used.

616 citations

Journal Article

MSVM-RFE

Xin Zhou¹, David Tuck¹

Yale University¹

06 Mar 2007-Bioinformatics

TL;DR: A family of four extensions to SVM-RFE is proposed to solve the multiclass gene selection problem, based on different frameworks of multiclass SVMs, and identifies genes leading to more accurate classification.

Abstract: Motivation: Given the thousands of genes and the small number of samples, gene selection has emerged as an important research problem in microarray data analysis. Support Vector Machine-Recursive Feature Elimination (SVM-RFE) is one of a group of recently described algorithms which represent the state-of-the-art for gene selection. Just like SVM itself, SVM-RFE was originally designed to solve binary gene selection problems. Several groups have extended SVM-RFE to solve multiclass problems using one-versus-all techniques. However, the genes selected from one binary gene selection problem may reduce the classification performance in other binary problems. Results: In the present study, we propose a family of four extensions to SVM-RFE (called MSVM-RFE) to solve the multiclass gene selection problem, based on different frameworks of multiclass SVMs. By simultaneously considering all classes during the gene selection stages, our proposed extensions identify genes leading to more accurate classification. Contact: david.tuck@yale.edu Supplementary information: Supplementary materials, including a detailed review of both binary and multiclass SVMs, and complete experimental results, are available at Bioinformatics online.

231 citations

Journal Article

A comprehensive evaluation of multicategory classification methods for microbiomic data

Alexander Statnikov¹, Mikael Henaff¹, Varun Narendra¹, Kranti Konganti², Zhiguo Li¹, Liying Yang¹, Zhiheng Pei¹,³, Martin J. Blaser¹,³, Constantin F. Aliferis¹,⁴, Alexander V. Alekseyenko¹

New York University¹, Texas A&M University², United States Department of Veterans Affairs³, Vanderbilt University⁴

05 Apr 2013-Microbiome

TL;DR: It is found that random forests, support vector machines, kernel ridge regression, and Bayesian logistic regression with Laplace priors are the most effective machine learning techniques for performing accurate classification from these microbiomic data.

Abstract: Recent advances in next-generation DNA sequencing enable rapid high-throughput quantitation of microbial community composition in human samples, opening up a new field of microbiomics. One of the promises of this field is linking abundances of microbial taxa to phenotypic and physiological states, which can inform development of new diagnostic, personalized medicine, and forensic modalities. Prior research has demonstrated the feasibility of applying machine learning methods to perform body site and subject classification with microbiomic data. However, it is currently unknown which classifiers perform best among the many available alternatives for classification with microbiomic data. In this work, we performed a systematic comparison of 18 major classification methods, 5 feature selection methods, and 2 accuracy metrics using 8 datasets spanning 1,802 human samples and various classification tasks: body site and subject classification and diagnosis. We found that random forests, support vector machines, kernel ridge regression, and Bayesian logistic regression with Laplace priors are the most effective machine learning techniques for performing accurate classification from these microbiomic data.

166 citations

Journal Article

The cell graphs of cancer

Cigdem Gunduz¹, Bülent Yener¹, S. Humayun Gultekin²

Rensselaer Polytechnic Institute¹, Icahn School of Medicine at Mount Sinai²

01 Jan 2004

TL;DR: It is demonstrated that the self-organizing clusters of cancerous cells exhibit distinctive graph metrics that distinguish them from the healthy cells and the unhealthy inflamed cells at the cellular level with an accuracy of at least 85%.

Abstract: Summary: We report a novel, proof-of-concept, computational method that models a type of brain cancer (glioma) only by using the topological properties of its cells in the tissue image. From low-magnification (80×) tissue images of 384 × 384 pixels, we construct the graphs of the cells based on the locations of the cells within the images. We generate such cell graphs of 1000-3000 cells (nodes) with 2000-10,000 links, each of which is calculated as a decaying exponential function of the Euclidean distance between every pair of cells in accordance with the Waxman model. At the cellular level, we compute the graph metrics of the cell graphs, including the degree, clustering coefficient, eccentricity and closeness for each cell. Working with a total of 285 tissue samples surgically removed from 12 different patients, we demonstrate that the self-organizing clusters of cancerous cells exhibit distinctive graph metrics that distinguish them from the healthy cells and the unhealthy inflamed cells at the cellular level with an accuracy of at least 85%. At the tissue level, we accomplish correct tissue classifications of cancerous, healthy and non-neoplastic inflamed tissue samples with an accuracy of 100% by requiring correct classification for the majority of the cells within the tissue sample.
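The link-generation step in this abstract can be illustrated with a short sketch. It assumes the usual textbook form of the Waxman model, in which two nodes at Euclidean distance d are linked with probability alpha * exp(-d / (beta * L)), where L is the largest pairwise distance. The parameters `alpha` and `beta`, their default values, and the function names are assumptions for illustration and are not taken from the paper.

```python
import math
import random
from itertools import combinations
from typing import List, Sequence, Tuple


def link_probability(distance: float, max_distance: float,
                     alpha: float = 0.5, beta: float = 0.2) -> float:
    """Waxman-style link probability: decays exponentially with distance."""
    if max_distance <= 0:  # all cells coincide; no distance to decay over
        return alpha
    return alpha * math.exp(-distance / (beta * max_distance))


def build_cell_graph(cells: Sequence[Tuple[float, float]],
                     alpha: float = 0.5, beta: float = 0.2,
                     seed: int = 0) -> List[Tuple[int, int]]:
    """Return a list of undirected links (i, j) between cell indices."""
    pairs = list(combinations(range(len(cells)), 2))
    if not pairs:
        return []
    rng = random.Random(seed)
    dist = {(i, j): math.dist(cells[i], cells[j]) for i, j in pairs}
    max_distance = max(dist.values())
    # Keep each candidate pair with its distance-dependent probability.
    return [
        pair for pair in pairs
        if rng.random() < link_probability(dist[pair], max_distance,
                                           alpha, beta)
    ]


def degrees(n_cells: int, edges: Sequence[Tuple[int, int]]) -> List[int]:
    """Degree of each cell, one of the graph metrics named in the abstract."""
    deg = [0] * n_cells
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    return deg
```

Nearby cells are linked far more often than distant ones, so dense clusters of cells show up as nodes with high degree.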

165 citations

Journal Article

Metagene projection for cross-platform, cross-species characterization of global transcriptional states

Pablo Tamayo¹, Daniel Scanfeld, Benjamin L. Ebert, Michael A. Gillette, Charles W. M. Roberts, Jill P. Mesirov

Massachusetts Institute of Technology¹

03 Apr 2007-Proceedings of the National Academy of Sciences of the United States of America

TL;DR: It is shown how a metagene projection methodology can greatly reduce the number of features used to characterize microarray data, and how this approach can help assess and interpret similarities and differences between independent data sets, enable cross-platform and cross-species analysis, improve clustering and class prediction, and provide a computational means to detect and remove sample contamination.

Abstract: The high dimensionality of global transcription profiles, the expression level of 20,000 genes in a much smaller number of samples, presents challenges that affect the sensitivity and general applicability of analysis results. In principle, it would be better to describe the data in terms of a small number of metagenes, positive linear combinations of genes, which could reduce noise while still capturing the invariant biological features of the data. Here, we describe how to accomplish such a reduction in dimension by a metagene projection methodology, which can greatly reduce the number of features used to characterize microarray data. We show, in applications to the analysis of leukemia and lung cancer data sets, how this approach can help assess and interpret similarities and differences between independent data sets, enable cross-platform and cross-species analysis, improve clustering and class prediction, and provide a computational means to detect and remove sample contamination.

154 citations

References

Book

An introduction to the bootstrap

Bradley Efron¹, Robert Tibshirani

South Dakota School of Mines and Technology¹

01 Jan 1993

TL;DR: This article presents bootstrap methods for estimation, using simple arguments, with Minitab macros for implementing these methods, as well as some examples of how these methods could be used for estimation purposes.

Abstract: This article presents bootstrap methods for estimation, using simple arguments. Minitab macros for implementing these methods are given.
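As a rough illustration of bootstrap estimation (in Python rather than the Minitab macros the abstract mentions), the sketch below estimates the standard error of a statistic by resampling the data with replacement. The function name `bootstrap_se`, the default of 1000 resamples, and the fixed seed are choices made for this example only.

```python
import random
import statistics
from typing import Callable, Sequence


def bootstrap_se(sample: Sequence[float],
                 statistic: Callable[[Sequence[float]], float],
                 n_resamples: int = 1000,
                 seed: int = 0) -> float:
    """Nonparametric bootstrap estimate of the standard error of a statistic.

    Draws n_resamples resamples of the same size as the data, with
    replacement, evaluates the statistic on each one, and returns the
    standard deviation of those replicate values.
    """
    rng = random.Random(seed)
    n = len(sample)
    replicates = []
    for _ in range(n_resamples):
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        replicates.append(statistic(resample))
    return statistics.stdev(replicates)
```

Calling `bootstrap_se(data, statistics.median)` gives a standard error for the median, a quantity with no simple closed-form formula, which is the kind of problem the bootstrap is commonly used for.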

37,183 citations

Statistical learning theory

Vladimir Vapnik

01 Jan 1998

TL;DR: Presenting a method for determining the necessary and sufficient conditions for consistency of learning process, the author covers function estimates from small data pools, applying these estimations to real-life problems, and much more.

Abstract: A comprehensive look at learning and generalization theory. The statistical theory of learning and generalization concerns the problem of choosing desired functions on the basis of empirical data. Highly applicable to a variety of computer science and robotics fields, this book offers lucid coverage of the theory as a whole. Presenting a method for determining the necessary and sufficient conditions for consistency of learning process, the author covers function estimates from small data pools, applying these estimations to real-life problems, and much more.

26,531 citations

Journal Article

Molecular classification of cancer: class discovery and class prediction by gene expression monitoring.

Todd R. Golub¹,², Donna K. Slonim¹, Pablo Tamayo¹, Christine Huard¹, Michelle Gaasenbeek¹, Jill P. Mesirov¹, Hilary A. Coller¹, Mignon L. Loh², James R. Downing³, Michael A. Caligiuri⁴, Clara D. Bloomfield⁴, Eric S. Lander¹

Massachusetts Institute of Technology¹, Harvard University², St. Jude Children's Research Hospital³, Ohio State University⁴

15 Oct 1999-Science

TL;DR: A generic approach to cancer classification based on gene expression monitoring by DNA microarrays is described and applied to human acute leukemias as a test case and suggests a general strategy for discovering and predicting cancer classes for other types of cancer, independent of previous biological knowledge.

Abstract: Although cancer classification has improved over the past 30 years, there has been no general approach for identifying new cancer classes (class discovery) or for assigning tumors to known classes (class prediction). Here, a generic approach to cancer classification based on gene expression monitoring by DNA microarrays is described and applied to human acute leukemias as a test case. A class discovery procedure automatically discovered the distinction between acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) without previous knowledge of these classes. An automatically derived class predictor was able to determine the class of new leukemia cases. The results demonstrate the feasibility of cancer classification based solely on gene expression monitoring and suggest a general strategy for discovering and predicting cancer classes for other types of cancer, independent of previous biological knowledge.

12,530 citations

Book

Social Choice and Individual Values

Kenneth J. Arrow¹

National Research University – Higher School of Economics¹

01 Jan 1951

TL;DR: Originally published in 1951, this book introduced "Arrow's Impossibility Theorem" and founded the field of social choice theory in economics and political science; the new edition includes a foreword by Nobel laureate Eric Maskin.

Abstract: Originally published in 1951, Social Choice and Individual Values introduced "Arrow's Impossibility Theorem" and founded the field of social choice theory in economics and political science. This new edition, including a new foreword by Nobel laureate Eric Maskin, reintroduces Arrow's seminal book to a new generation of students and researchers. "Far beyond a classic, this small book unleashed the ongoing explosion of interest in social choice and voting theory. A half-century later, the book remains full of profound insight: its central message, 'Arrow's Theorem,' has changed the way we think." (Donald G. Saari, author of Decisions and Elections: Explaining the Unexpected)

8,219 citations

Book

Solutions of ill-posed problems

Andreĭ Nikolaevich Tikhonov, Vasiliy Yakovlevich Arsenin

01 Jan 1977

8,009 citations