Open Access · Journal Article · DOI

Optimally splitting cases for training and testing high dimensional classifiers

TLDR
A non-parametric algorithm is developed for determining the optimal training/testing split proportion for a specific dataset and classifier algorithm; it can be applied to any dataset, using any predictor development method, to determine the best split.
Abstract
We consider the problem of designing a study to develop a predictive classifier from high dimensional data. A common study design is to split the sample into a training set and an independent test set, where the former is used to develop the classifier and the latter to evaluate its performance. In this paper we address the question of what proportion of the samples should be devoted to the training set, and how this proportion impacts the mean squared error (MSE) of the prediction accuracy estimate. We develop a non-parametric algorithm for determining an optimal splitting proportion that can be applied with a specific dataset and classifier algorithm. We also perform a broad simulation study to better understand the factors that determine the best split proportions and to evaluate commonly used splitting strategies (1/2 training or 2/3 training) under a wide variety of conditions. These methods are based on a decomposition of the MSE into three intuitive component parts. By applying these approaches to a number of synthetic and real microarray datasets we show that for linear classifiers the optimal proportion depends on the full dataset size (n) and the degree of differential expression between the classes: higher classification accuracy and smaller n result in more cases being assigned to the training set. The commonly used strategy of allocating two-thirds of cases for training was close to optimal for reasonably sized datasets (n ≥ 100) with strong signals (i.e. 85% or greater full dataset accuracy). In general, we recommend use of our nonparametric resampling approach for determining the optimal split. This approach can be applied to any dataset, using any predictor development method, to determine the best split.
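The resampling idea described in the abstract can be sketched in a few lines: for each candidate training fraction, repeatedly split the data at random, fit a classifier on the training portion, and record the test-set accuracy; the spread and level of those accuracy estimates indicate how good each split proportion is. The sketch below is an illustration only, not the authors' exact algorithm — the nearest-centroid classifier and the simple "variance plus squared shortfall from the best mean accuracy" score are stand-in assumptions for the paper's linear classifiers and full MSE decomposition.

```python
import numpy as np

def nearest_centroid_accuracy(X_tr, y_tr, X_te, y_te):
    """Fit a two-class nearest-centroid classifier and return test accuracy."""
    centroids = np.array([X_tr[y_tr == c].mean(axis=0) for c in (0, 1)])
    # Distance of each test point to each class centroid.
    d = np.linalg.norm(X_te[:, None, :] - centroids[None, :, :], axis=2)
    return float(np.mean(d.argmin(axis=1) == y_te))

def evaluate_split(X, y, train_frac, n_rep=50, seed=0):
    """Resample random splits at `train_frac`; return mean and variance
    of the test-set accuracy estimate across the resamples."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_tr = int(round(train_frac * n))
    accs = []
    for _ in range(n_rep):
        idx = rng.permutation(n)
        tr, te = idx[:n_tr], idx[n_tr:]
        accs.append(nearest_centroid_accuracy(X[tr], y[tr], X[te], y[te]))
    return np.mean(accs), np.var(accs)

def best_split(X, y, fracs=(0.4, 0.5, 0.6, 2/3, 0.75, 0.8), n_rep=50, seed=0):
    """Pick the candidate fraction with the best trade-off. As a crude
    stand-in for the paper's MSE criterion, score each fraction by the
    variance of the accuracy estimate plus its squared shortfall from
    the best mean accuracy over all candidates."""
    results = {f: evaluate_split(X, y, f, n_rep, seed) for f in fracs}
    best_mean = max(m for m, _ in results.values())
    return min(fracs,
               key=lambda f: results[f][1] + (best_mean - results[f][0]) ** 2)
```

For example, on a synthetic two-class dataset with a clear mean shift, `best_split` returns the candidate fraction minimizing this surrogate score; with a real dataset one would substitute the intended classifier-development procedure for the nearest-centroid step.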


Citations
Journal Article · DOI

Criteria for the use of omics-based predictors in clinical trials

TL;DR: A checklist of criteria is developed that can be used to determine the readiness of omics-based tests for guiding patient care in clinical trials, covering issues relating to specimens, assays, mathematical modelling, clinical trial design, and ethical, legal and regulatory aspects.
Journal Article · DOI

IDRiD: Diabetic Retinopathy – Segmentation and Grading Challenge

TL;DR: The set-up and results of this challenge, based primarily on the Indian Diabetic Retinopathy Image Dataset (IDRiD), received a positive response from the scientific community and have the potential to enable new developments in retinal image analysis and image-based DR screening in particular.
Journal Article · DOI

Deep learning approach for microarray cancer data classification

TL;DR: A deep feedforward method is developed to classify the given microarray cancer data into a set of classes for subsequent diagnosis purposes, using a 7-layer deep neural network architecture with parameters tuned for each dataset.
Journal Article · DOI

Texture Analysis of Imaging: What Radiologists Need to Know

TL;DR: Some parameters that affect the performance of texture metrics are discussed and recommendations that can guide both the design and evaluation of future radiomics studies are proposed.
Journal Article · DOI

Criteria for the use of omics-based predictors in clinical trials: explanation and elaboration.

TL;DR: A checklist of criteria to consider when evaluating the body of evidence supporting the clinical use of a predictor to guide patient therapy is presented, including issues pertaining to specimen and assay requirements, the soundness of the process for developing predictor models, expectations regarding clinical study design and conduct, and attention to regulatory, ethical, and legal issues.
References
Book

An introduction to the bootstrap

TL;DR: This article presents bootstrap methods for estimation, developed using simple arguments, together with Minitab macros for implementing the methods and examples of their use in estimation problems.
Journal Article · DOI

Molecular classification of cancer: class discovery and class prediction by gene expression monitoring.

TL;DR: A generic approach to cancer classification based on gene expression monitoring by DNA microarrays is described and applied to human acute leukemias as a test case and suggests a general strategy for discovering and predicting cancer classes for other types of cancer, independent of previous biological knowledge.
Book

Introduction to Statistical Pattern Recognition

TL;DR: This completely revised second edition presents an introduction to statistical pattern recognition, which is appropriate as a text for introductory courses in pattern recognition and as a reference book for workers in the field.
Journal Article · DOI

Gene expression profiling predicts clinical outcome of breast cancer

TL;DR: DNA microarray analysis on primary breast tumours of 117 young patients is used and supervised classification is applied to identify a gene expression signature strongly predictive of a short interval to distant metastases (‘poor prognosis’ signature) in patients without tumour cells in local lymph nodes at diagnosis, providing a strategy to select patients who would benefit from adjuvant therapy.