Author

Yanni Zhu

Bio: Yanni Zhu is an academic researcher from the University of Minnesota. The author has contributed to research in topics: Support vector machine & Hinge loss. The author has an h-index of 2 and has co-authored 2 publications receiving 112 citations.

Papers
Journal Article (DOI)
TL;DR: A network-based support vector machine is proposed for binary classification problems, constructing a penalty term from the F∞-norm applied to pairwise gene neighbors with the aim of improving predictive performance and gene selection.
Abstract: The importance of network-based approaches to identifying biological markers for diagnostic classification and prognostic assessment in the context of microarray data has been increasingly recognized. To our knowledge, there have been few, if any, statistical tools that explicitly incorporate the prior information of gene networks into classifier building. The main idea of this paper is to take full advantage of the biological observation that neighboring genes in a network tend to function together in biological processes and to embed this information into a formal statistical framework. We propose a network-based support vector machine for binary classification problems, constructing a penalty term from the F∞-norm applied to pairwise gene neighbors with the aim of improving predictive performance and gene selection. Simulation studies in both low- and high-dimensional data settings, as well as two real microarray applications, indicate that the proposed method is able to identify more clinically relevant genes while maintaining a sparse model with similar or higher prediction accuracy compared with the standard and L1-penalized support vector machines. The proposed network-based support vector machine has the potential to be a practically useful classification tool for microarrays and other high-dimensional data.
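To make the penalty concrete, here is a minimal sketch in Python of how such a penalized hinge-loss objective could be evaluated, assuming the F∞-norm penalty takes the form of a sum over network edges of max(|w_i|, |w_j|); the paper's exact weighting scheme may differ, and the function name, edge-list format, and tuning parameter lam are illustrative only.

    import numpy as np

    def network_svm_objective(w, b, X, y, edges, lam):
        """Hinge loss plus an F-infinity (max) penalty over gene-neighbor pairs.

        w     : (p,) coefficient vector, one entry per gene
        b     : intercept
        X, y  : (n, p) expression matrix and labels in {-1, +1}
        edges : list of (i, j) index pairs, the gene-network neighbors
        lam   : penalty weight (hypothetical tuning parameter)
        """
        margins = y * (X @ w + b)
        hinge = np.maximum(0.0, 1.0 - margins).sum()                 # standard SVM hinge loss
        penalty = sum(max(abs(w[i]), abs(w[j])) for i, j in edges)   # F-infinity norm per edge
        return hinge + lam * penalty

In practice this objective would be minimized over (w, b), for example via a linear-programming reformulation since both terms are piecewise linear, but that step is omitted here.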

113 citations

Journal Article (DOI)
TL;DR: The proposed DGC-SVM uses the hinge loss penalized by a sum of L∞-norms, one applied to each group of genes, yielding an effective classification tool that encourages gene selection along paths to, or clustering around, known disease genes in microarray data.
Abstract: With the availability of genetic pathways or networks and accumulating knowledge of genes with variants predisposing to diseases (disease genes), we propose a disease-gene-centric support vector machine (DGC-SVM) that directly incorporates these two sources of prior information into building microarray-based classifiers for binary classification problems. DGC-SVM aims to detect the genes clustering together and around some key disease genes in a gene network. To achieve this goal, we propose a penalty over suitably defined groups of genes. A hierarchy is imposed on an undirected gene network to facilitate the definition of such gene groups. Our proposed DGC-SVM uses the hinge loss penalized by a sum of L∞-norms, one applied to each group. The simulation studies show that DGC-SVM not only detects more disease genes along pathways than the existing standard SVM and the SVM with an L1-penalty (L1-SVM), but also captures disease genes that potentially affect the outcome only weakly. Two real data applications demonstrate that DGC-SVM improves gene selection with predictive performance comparable to the standard SVM and L1-SVM. The proposed method has the potential to be an effective classification tool that encourages gene selection along paths to, or clustering around, known disease genes for microarray data.
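A minimal sketch of the group-penalized objective described above, assuming the penalty is simply the sum, over predefined gene groups, of the L∞-norm of the corresponding coefficients; the group definitions and parameter names are illustrative rather than the paper's exact formulation.

    import numpy as np

    def dgc_svm_objective(w, b, X, y, groups, lam):
        """Hinge loss plus a sum of group-wise L-infinity penalties.

        groups : list of index arrays, each a group of genes defined around a
                 known disease gene via the network hierarchy (illustrative)
        """
        margins = y * (X @ w + b)
        hinge = np.maximum(0.0, 1.0 - margins).sum()                       # standard SVM hinge loss
        penalty = sum(np.max(np.abs(w[np.asarray(g)])) for g in groups)    # L-infinity norm per group
        return hinge + lam * penalty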

6 citations


Cited by
Journal Article (DOI)
TL;DR: The new concept of dynamical network biomarkers (DNBs) has been developed; unlike traditional static approaches, a DNB can distinguish a predisease state from normal and disease states using even a small number of samples, and therefore has great potential to achieve "real" early diagnosis of complex diseases.
Abstract: Many studies have sought early diagnosis of complex diseases by finding accurate and robust biomarkers specific to the respective diseases. In particular, the recent rapid advance of high-throughput technologies provides unprecedentedly rich information for characterizing disease genotypes and phenotypes in a global and dynamical manner, which significantly accelerates the study of biomarkers from both theoretical and clinical perspectives. Traditionally, molecular biomarkers that distinguish disease samples from normal samples have been widely adopted in clinical practice because the required data are easy to measure. However, many of them suffer from low coverage and high false-positive or false-negative rates, which seriously limits their further clinical application. To overcome these difficulties, network biomarkers (or module biomarkers) have attracted much attention and also achieve better performance, because a network (or subnetwork) is considered a more robust way to characterize diseases than individual molecules. Still, both molecular and network biomarkers mainly distinguish disease samples from normal samples; owing to their static nature, they generally cannot be relied on to identify predisease samples and thus lack the ability to support early diagnosis. Based on nonlinear dynamical theory and complex network theory, the new concept of dynamical network biomarkers (DNBs, or a dynamical network of biomarkers) has been developed; unlike traditional static approaches, a DNB can distinguish a predisease state from normal and disease states using even a small number of samples, and therefore has great potential to achieve "real" early diagnosis of complex diseases. In this paper, we comprehensively review recent advances and developments in molecular biomarkers, network biomarkers, and DNBs in particular, focusing on biomarkers for early diagnosis of complex diseases given a small number of samples and high-throughput data (or big data). Detailed comparisons of the various types of biomarkers and their applications are also discussed.

230 citations

Journal Article (DOI)
TL;DR: A grouped penalty based on the Lγ-norm that smoothes the regression coefficients of the predictors over the network is proposed; the method performs best in variable selection across all simulation set-ups considered.
Abstract: We consider penalized linear regression, especially for "large p, small n" problems, in which the relationships among predictors are described a priori by a network. A class of motivating examples includes modeling a phenotype through gene expression profiles while accounting for the coordinated functioning of genes in the form of biological pathways or networks. To incorporate the prior knowledge that neighboring predictors in a network have similar effect sizes, we propose a grouped penalty based on the Lγ-norm that smoothes the regression coefficients of the predictors over the network. The main feature of the proposed method is its ability to automatically realize grouped variable selection and exploit grouping effects. We also discuss the effects of the choice of γ and of the weights inside the Lγ-norm. Simulation studies demonstrate the superior finite-sample performance of the proposed method compared with the Lasso, the elastic net and a recently proposed network-based method. The new method performs best in variable selection across all simulation set-ups considered. For illustration, the method is applied to a microarray dataset to predict survival times for glioblastoma patients, using a gene expression dataset and a gene network compiled from KEGG pathways.
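For illustration, a rough Python sketch of such an objective, assuming a squared-error loss and a penalty that sums, over network edges, the Lγ-norm of each (optionally weighted) coefficient pair; the paper's exact weighting (for example by node degree) may differ, and all names here are hypothetical.

    import numpy as np

    def grouped_lgamma_objective(beta, X, y, edges, lam, gamma=2.0, weights=None):
        """Least-squares fit plus a grouped L-gamma penalty over network edges."""
        if weights is None:
            weights = np.ones(len(beta))                      # per-predictor weights (e.g. degree-based)
        rss = 0.5 * np.sum((y - X @ beta) ** 2)               # ordinary least-squares term
        penalty = 0.0
        for i, j in edges:
            pair = np.array([beta[i] / weights[i], beta[j] / weights[j]])
            penalty += np.sum(np.abs(pair) ** gamma) ** (1.0 / gamma)   # L-gamma norm of the pair
        return rss + lam * penalty

Because each pair of neighbors enters through a single Lγ-norm, their coefficients tend to be shrunk toward zero together, which is what produces the grouped variable selection described in the abstract.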

127 citations

Journal Article (DOI)
TL;DR: This study develops a novel method that combines particle swarm optimization with a decision tree as the classifier; the method outperforms other popular classifiers on all test datasets and is comparable to SVM on certain specific datasets.
Abstract: In the application of microarray data, how to select a small number of informative genes, from among thousands, that may contribute to the occurrence of cancers is an important issue. Many researchers use various computational intelligence methods to analyze gene expression data. To achieve efficient gene selection from thousands of candidate genes that can contribute to identifying cancers, this study develops a novel method that combines particle swarm optimization with a decision tree as the classifier. This study also compares the performance of the proposed method with other well-known benchmark classification methods (support vector machine, self-organizing map, back-propagation neural network, C4.5 decision tree, Naive Bayes, CART decision tree, and artificial immune recognition system) in experiments on 11 gene expression cancer datasets. Based on statistical analysis, the proposed method outperforms the other popular classifiers on all test datasets and is comparable to SVM on certain specific datasets. Further, housekeeping genes with various expression patterns and tissue-specific genes are identified. These genes provide high discrimination power for cancer classification.
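The abstract describes a wrapper approach; the following Python sketch shows one plausible way to combine a binary particle swarm with a decision-tree fitness function using scikit-learn. The swarm parameters, fitness definition, and update rule are illustrative assumptions, not the settings used in the study.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    def pso_gene_selection(X, y, n_particles=20, n_iter=30, seed=0):
        """Binary PSO over gene subsets, scored by a cross-validated decision tree."""
        rng = np.random.default_rng(seed)
        p = X.shape[1]
        pos = rng.integers(0, 2, size=(n_particles, p))       # particle positions: 0/1 gene masks
        vel = rng.normal(0.0, 0.1, size=(n_particles, p))     # particle velocities

        def fitness(mask):
            if mask.sum() == 0:
                return 0.0
            tree = DecisionTreeClassifier(random_state=seed)
            return cross_val_score(tree, X[:, mask.astype(bool)], y, cv=3).mean()

        pbest = pos.copy()                                     # personal best positions
        pbest_fit = np.array([fitness(m) for m in pos])
        gbest = pbest[pbest_fit.argmax()].copy()               # global best position

        for _ in range(n_iter):
            r1, r2 = rng.random((2, n_particles, p))
            vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
            prob = 1.0 / (1.0 + np.exp(-vel))                  # sigmoid: velocity -> bit probability
            pos = (rng.random((n_particles, p)) < prob).astype(int)
            fit = np.array([fitness(m) for m in pos])
            improved = fit > pbest_fit
            pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
            gbest = pbest[pbest_fit.argmax()].copy()

        return gbest.astype(bool), pbest_fit.max()             # selected genes and their CV accuracy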

123 citations

Journal Article (DOI)
TL;DR: A newly developed classifier named Forest Deep Neural Network (fDNN) integrates a deep neural network architecture with a supervised forest feature detector, learning sparse feature representations and feeding them into a neural network to mitigate the overfitting problem.
Abstract: In predictive model development, gene expression data pose the unique challenge that the number of samples (n) is much smaller than the number of features (p). This "n ≪ p" property has prevented the classification of gene expression data from benefiting from deep learning techniques, which have proved powerful under "n > p" scenarios in other application fields, such as image classification. Further, the sparsity of effective features with unknown correlation structures in gene expression profiles brings more challenges for classification tasks. To tackle these problems, we propose a newly developed classifier named the Forest Deep Neural Network (fDNN), which integrates a deep neural network architecture with a supervised forest feature detector. Using this built-in feature detector, the method is able to learn sparse feature representations and feed them into a neural network to mitigate the overfitting problem. Simulation experiments and real data analyses using two RNA-seq expression datasets are conducted to evaluate fDNN's capability. The method is demonstrated to be a useful addition to current predictive models, with better classification performance and more meaningful selected features than ordinary random forests and deep neural networks.
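A rough sketch of the forest-as-feature-detector idea using scikit-learn, assuming each tree's predicted probability for the positive class serves as one learned feature for a small downstream neural network; the actual fDNN architecture, feature encoding, and training procedure may differ, and the layer sizes below are illustrative.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neural_network import MLPClassifier

    def fit_forest_dnn(X_train, y_train, n_trees=500, seed=0):
        """Fit a supervised forest, then a neural net on its per-tree outputs."""
        forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
        forest.fit(X_train, y_train)
        # (n_samples, n_trees): each column is one tree's positive-class probability
        feats = np.column_stack(
            [tree.predict_proba(X_train)[:, 1] for tree in forest.estimators_]
        )
        net = MLPClassifier(hidden_layer_sizes=(64, 16), max_iter=500, random_state=seed)
        net.fit(feats, y_train)
        return forest, net

    def predict_forest_dnn(forest, net, X):
        """Transform new samples through the forest, then classify with the net."""
        feats = np.column_stack(
            [tree.predict_proba(X)[:, 1] for tree in forest.estimators_]
        )
        return net.predict(feats)

The forest compresses the "n ≪ p" input into a low-dimensional, supervised representation, which is what allows the downstream network to be trained without severe overfitting.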

113 citations