Open Access · Journal Article · DOI

Optimally splitting cases for training and testing high dimensional classifiers

TLDR
A non-parametric algorithm is developed for determining the optimal training/testing split proportion for a specific dataset and classifier algorithm; it can be applied to any dataset, using any predictor development method, to determine the best split.
Abstract
We consider the problem of designing a study to develop a predictive classifier from high dimensional data. A common study design is to split the sample into a training set and an independent test set, where the former is used to develop the classifier and the latter to evaluate its performance. In this paper we address the question of what proportion of the samples should be devoted to the training set, and how this proportion impacts the mean squared error (MSE) of the prediction accuracy estimate. We develop a non-parametric algorithm for determining an optimal splitting proportion that can be applied with a specific dataset and classifier algorithm. We also perform a broad simulation study to better understand the factors that determine the best split proportions and to evaluate commonly used splitting strategies (1/2 training or 2/3 training) under a wide variety of conditions. These methods are based on a decomposition of the MSE into three intuitive component parts. By applying these approaches to a number of synthetic and real microarray datasets we show that for linear classifiers the optimal proportion depends on the full dataset size (n) and the degree of differential expression between the classes: higher classification accuracy and smaller n result in more cases being assigned to the training set. The commonly used strategy of allocating two-thirds of cases for training was close to optimal for reasonably sized datasets (n ≥ 100) with strong signals (i.e. 85% or greater full dataset accuracy). In general, we recommend use of our nonparametric resampling approach for determining the optimal split. This approach can be applied to any dataset, using any predictor development method, to determine the best split.
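The resampling idea described in the abstract can be sketched in a few lines: for each candidate training fraction, repeatedly split the data at random, fit a classifier on the training portion, and record the test-set accuracy; the spread and level of those accuracy estimates indicate how good each split proportion is. The sketch below is an illustration only, not the authors' exact algorithm — the nearest-centroid classifier and the simple "variance plus squared shortfall from the best mean accuracy" score are stand-in assumptions for the paper's linear classifiers and full MSE decomposition.

```python
import numpy as np

def nearest_centroid_accuracy(X_tr, y_tr, X_te, y_te):
    """Fit a two-class nearest-centroid classifier and return test accuracy."""
    centroids = np.array([X_tr[y_tr == c].mean(axis=0) for c in (0, 1)])
    # Distance of each test point to each class centroid.
    d = np.linalg.norm(X_te[:, None, :] - centroids[None, :, :], axis=2)
    return float(np.mean(d.argmin(axis=1) == y_te))

def evaluate_split(X, y, train_frac, n_rep=50, seed=0):
    """Resample random splits at `train_frac`; return mean and variance
    of the test-set accuracy estimate across the resamples."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_tr = int(round(train_frac * n))
    accs = []
    for _ in range(n_rep):
        idx = rng.permutation(n)
        tr, te = idx[:n_tr], idx[n_tr:]
        accs.append(nearest_centroid_accuracy(X[tr], y[tr], X[te], y[te]))
    return np.mean(accs), np.var(accs)

def best_split(X, y, fracs=(0.4, 0.5, 0.6, 2/3, 0.75, 0.8), n_rep=50, seed=0):
    """Pick the candidate fraction with the best trade-off. As a crude
    stand-in for the paper's MSE criterion, score each fraction by the
    variance of the accuracy estimate plus its squared shortfall from
    the best mean accuracy over all candidates."""
    results = {f: evaluate_split(X, y, f, n_rep, seed) for f in fracs}
    best_mean = max(m for m, _ in results.values())
    return min(fracs,
               key=lambda f: results[f][1] + (best_mean - results[f][0]) ** 2)
```

For example, on a synthetic two-class dataset with a clear mean shift, `best_split` returns the candidate fraction minimizing this surrogate score; with a real dataset one would substitute the intended classifier-development procedure for the nearest-centroid step.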


Citations
Journal Article · DOI

Criteria for the use of omics-based predictors in clinical trials

TL;DR: A checklist of criteria is developed that can be used to determine the readiness of omics-based tests for guiding patient care in clinical trials, covering issues relating to specimens, assays, mathematical modelling, clinical trial design, and ethical, legal and regulatory aspects.
Journal Article · DOI

IDRiD: Diabetic Retinopathy – Segmentation and Grading Challenge

TL;DR: The set-up and results of this challenge, based primarily on the Indian Diabetic Retinopathy Image Dataset (IDRiD), received a positive response from the scientific community and have the potential to enable new developments in retinal image analysis and image-based DR screening in particular.
Journal Article · DOI

Deep learning approach for microarray cancer data classification

TL;DR: A deep feedforward method is developed to classify the given microarray cancer data into a set of classes for subsequent diagnosis purposes, using a 7-layer deep neural network architecture with parameters tuned for each dataset.
Journal Article · DOI

Texture Analysis of Imaging: What Radiologists Need to Know

TL;DR: Some parameters that affect the performance of texture metrics are discussed and recommendations that can guide both the design and evaluation of future radiomics studies are proposed.
Journal Article · DOI

Criteria for the use of omics-based predictors in clinical trials: explanation and elaboration.

TL;DR: A checklist of criteria to consider when evaluating the body of evidence supporting the clinical use of a predictor to guide patient therapy is presented, including issues pertaining to specimen and assay requirements, the soundness of the process for developing predictor models, expectations regarding clinical study design and conduct, and attention to regulatory, ethical, and legal issues.
References
Book

An introduction to the bootstrap

TL;DR: This article presents bootstrap methods for estimation, developed using simple arguments, together with Minitab macros for implementing the methods and examples of their use in estimation problems.
Journal Article · DOI

Molecular classification of cancer: class discovery and class prediction by gene expression monitoring.

TL;DR: A generic approach to cancer classification based on gene expression monitoring by DNA microarrays is described and applied to human acute leukemias as a test case and suggests a general strategy for discovering and predicting cancer classes for other types of cancer, independent of previous biological knowledge.
Book

Introduction to Statistical Pattern Recognition

TL;DR: This completely revised second edition presents an introduction to statistical pattern recognition, which is appropriate as a text for introductory courses in pattern recognition and as a reference book for workers in the field.
Journal Article · DOI

Gene expression profiling predicts clinical outcome of breast cancer

TL;DR: DNA microarray analysis on primary breast tumours of 117 young patients is used and supervised classification is applied to identify a gene expression signature strongly predictive of a short interval to distant metastases (‘poor prognosis’ signature) in patients without tumour cells in local lymph nodes at diagnosis, providing a strategy to select patients who would benefit from adjuvant therapy.