Optimally splitting cases for training and testing high dimensional classifiers
Kevin K. Dobbin, Richard M. Simon
TLDR
A non-parametric algorithm is developed for determining the optimal splitting proportion for a specific dataset and classifier algorithm; it can be applied to any dataset, using any predictor development method, to determine the best split.
Abstract:
We consider the problem of designing a study to develop a predictive classifier from high dimensional data. A common study design is to split the sample into a training set and an independent test set, where the former is used to develop the classifier and the latter to evaluate its performance. In this paper we address the question of what proportion of the samples should be devoted to the training set. How does this proportion impact the mean squared error (MSE) of the prediction accuracy estimate? We develop a non-parametric algorithm for determining an optimal splitting proportion that can be applied with a specific dataset and classifier algorithm. We also perform a broad simulation study to better understand the factors that determine the best split proportions and to evaluate commonly used splitting strategies (1/2 training or 2/3 training) under a wide variety of conditions. These methods are based on a decomposition of the MSE into three intuitive component parts. By applying these approaches to a number of synthetic and real microarray datasets, we show that for linear classifiers the optimal proportion depends on the overall number of samples available and the degree of differential expression between the classes. The optimal proportion was found to depend on the full dataset size (n) and classification accuracy, with higher accuracy and smaller n resulting in more cases being assigned to the training set. The commonly used strategy of allocating 2/3 of cases for training was close to optimal for reasonably sized datasets (n ≥ 100) with strong signals (i.e. 85% or greater full dataset accuracy). In general, we recommend use of our non-parametric resampling approach for determining the optimal split. This approach can be applied to any dataset, using any predictor development method, to determine the best split.
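The trade-off described in the abstract can be illustrated with a rough resampling sketch. This is not the authors' algorithm, and it does not reproduce their three-part MSE decomposition, which the abstract does not spell out. It assumes a simple nearest-centroid classifier, a hypothetical set of candidate training fractions, and a crude two-term MSE proxy: the variance of the test-set accuracy across random splits plus a squared-bias term measured against the mean accuracy at the largest candidate training fraction (a stand-in for full-dataset accuracy). All function names here are illustrative.

```python
import numpy as np

def simulate_two_class(n, p, n_informative, shift, seed=0):
    """Synthetic two-class data: the first n_informative features differ by `shift`."""
    rng = np.random.default_rng(seed)
    y = np.arange(n) % 2                      # balanced labels 0/1
    X = rng.standard_normal((n, p))
    X[y == 1, :n_informative] += shift
    return X, y

def nearest_centroid_fit(X, y):
    """Return the mean vector of each class (labels must be 0/1)."""
    return X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)

def nearest_centroid_predict(centroids, X):
    """Assign each row to the class whose centroid is closer."""
    c0, c1 = centroids
    d0 = ((X - c0) ** 2).sum(axis=1)
    d1 = ((X - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

def split_accuracies(X, y, frac, n_repeats, rng):
    """Test-set accuracies over repeated random splits at one training fraction."""
    n = len(y)
    t = int(round(frac * n))
    if t < 2 or t >= n:
        raise ValueError("training fraction leaves no usable training or test set")
    accs = []
    while len(accs) < n_repeats:
        perm = rng.permutation(n)
        train, test = perm[:t], perm[t:]
        if np.unique(y[train]).size < 2:
            continue                          # both classes are needed to train
        model = nearest_centroid_fit(X[train], y[train])
        pred = nearest_centroid_predict(model, X[test])
        accs.append(np.mean(pred == y[test]))
    return np.array(accs)

def choose_split(X, y, fractions=(0.3, 0.5, 0.67, 0.8, 0.9), n_repeats=200, seed=0):
    """Pick the training fraction that minimizes the assumed MSE proxy."""
    rng = np.random.default_rng(seed)
    accs = {f: split_accuracies(X, y, f, n_repeats, rng) for f in fractions}
    # Assumption: accuracy at the largest fraction approximates full-data accuracy.
    reference = accs[max(fractions)].mean()
    mse = {f: a.var() + (a.mean() - reference) ** 2 for f, a in accs.items()}
    return min(mse, key=mse.get), mse

if __name__ == "__main__":
    X, y = simulate_two_class(n=100, p=500, n_informative=10, shift=1.0)
    best, mse = choose_split(X, y)
    for f in sorted(mse):
        print(f"train fraction {f:.2f}: MSE proxy {mse[f]:.5f}")
    print("selected fraction:", best)
```

Because the reference accuracy is taken from the largest candidate fraction, that fraction has zero bias by construction, so the proxy only shows the qualitative tension between a small test set (high variance) and a small training set (pessimistic bias). The procedure in the paper should be preferred for real study design.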