
Showing papers on "Feature selection" published in 2010


Journal ArticleDOI
TL;DR: The Boruta package provides a convenient interface to the Boruta algorithm, implementing a novel feature selection algorithm for finding all relevant variables.
Abstract: This article describes an R package, Boruta, implementing a novel feature selection algorithm for finding all relevant variables. The algorithm is designed as a wrapper around a Random Forest classification algorithm. It iteratively removes the features which are shown by a statistical test to be less relevant than random probes. The Boruta package provides a convenient interface to the algorithm. A short description of the algorithm and examples of its application are presented.
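
As a rough illustration of the shadow-feature idea (not the R package itself), a minimal Python sketch of one comparison round might look like this; the helper `boruta_step` and its defaults are invented for the example, and the real algorithm iterates this comparison and applies a statistical test per feature:

```python
# Minimal sketch of Boruta's shadow-feature idea (illustrative, not the R package).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def boruta_step(X, y, random_state=0):
    rng = np.random.default_rng(random_state)
    # "Shadow" features: column-wise permuted copies of X (random probes).
    X_shadow = np.apply_along_axis(rng.permutation, 0, X)
    X_aug = np.hstack([X, X_shadow])
    rf = RandomForestClassifier(n_estimators=500, random_state=random_state)
    rf.fit(X_aug, y)
    n = X.shape[1]
    real_imp = rf.feature_importances_[:n]
    shadow_max = rf.feature_importances_[n:].max()
    # A feature scores a "hit" if it beats the best random probe.
    return real_imp > shadow_max
```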

2,832 citations


Journal ArticleDOI
TL;DR: It is proved that at a universal penalty level, the MC+ has high probability of matching the signs of the unknowns, and thus correct selection, without assuming the strong irrepresentable condition required by the LASSO.
Abstract: We propose MC+, a fast, continuous, nearly unbiased and accurate method of penalized variable selection in high-dimensional linear regression. The LASSO is fast and continuous, but biased. The bias of the LASSO may prevent consistent variable selection. Subset selection is unbiased but computationally costly. The MC+ has two elements: a minimax concave penalty (MCP) and a penalized linear unbiased selection (PLUS) algorithm. The MCP provides the convexity of the penalized loss in sparse regions to the greatest extent given certain thresholds for variable selection and unbiasedness. The PLUS computes multiple exact local minimizers of a possibly nonconvex penalized loss function in a certain main branch of the graph of critical points of the penalized loss. Its output is a continuous piecewise linear path extending from the origin for infinite penalty to a least squares solution for zero penalty. We prove that at a universal penalty level, the MC+ has high probability of matching the signs of the unknowns, and thus correct selection, without assuming the strong irrepresentable condition required by the LASSO. This selection consistency applies to the case of $p\gg n$, and is proved to hold for exactly the MC+ solution among possibly many local minimizers. We prove that the MC+ attains certain minimax convergence rates in probability for the estimation of regression coefficients in $\ell_r$ balls. We use the SURE method to derive degrees of freedom and $C_p$-type risk estimates for general penalized LSE, including the LASSO and MC+ estimators, and prove their unbiasedness. Based on the estimated degrees of freedom, we propose an estimator of the noise level for proper choice of the penalty level. For full-rank designs and general sub-quadratic penalties, we provide necessary and sufficient conditions for the continuity of the penalized LSE. Simulation results overwhelmingly support our claim of superior variable selection properties and demonstrate the computational efficiency of the proposed method.
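
For reference, the minimax concave penalty at the heart of MC+ is commonly written, for penalty level $\lambda$ and concavity parameter $\gamma>1$, as

$$\rho_{\lambda,\gamma}(t)=\begin{cases}\lambda|t|-\dfrac{t^{2}}{2\gamma}, & |t|\le\gamma\lambda,\\[4pt] \tfrac{1}{2}\gamma\lambda^{2}, & |t|>\gamma\lambda,\end{cases}$$

so the penalty has a lasso-like slope $\lambda$ near the origin but a vanishing derivative beyond $\gamma\lambda$, which is what removes the bias on large coefficients.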

2,727 citations


Journal ArticleDOI
TL;DR: Focusing on random forests, the increasingly used statistical method for classification and regression introduced by Leo Breiman in 2001, this paper investigates two classical issues of variable selection and proposes a strategy that ranks explanatory variables by their random forest importance scores and introduces them in a stepwise ascending fashion.
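
A hedged sketch of the two-step strategy (ranking by importance, then stepwise ascending introduction) could look as follows; the helper `stepwise_rf_selection` and its cross-validation acceptance rule are illustrative simplifications of the paper's OOB-error-based procedure:

```python
# Sketch: rank variables by RF importance, then introduce them stepwise,
# keeping each variable only if it improves a cross-validated score.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def stepwise_rf_selection(X, y, cv=5, random_state=0):
    rf = RandomForestRegressor(n_estimators=500, random_state=random_state)
    rf.fit(X, y)
    order = np.argsort(rf.feature_importances_)[::-1]  # decreasing importance
    selected, best_score = [], -np.inf
    for j in order:
        trial = selected + [j]
        score = cross_val_score(rf, X[:, trial], y, cv=cv).mean()
        if score > best_score:  # keep the variable only if it helps
            selected, best_score = trial, score
    return selected
```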

1,766 citations


Proceedings Article
06 Dec 2010
TL;DR: A new robust feature selection method emphasizing joint l2,1-norm minimization on both the loss function and the regularization is proposed and applied to both genomic and proteomic biomarker discovery.
Abstract: Feature selection is an important component of many machine learning applications. Especially in many bioinformatics tasks, efficient and robust feature selection methods are desired to extract meaningful features and eliminate noisy ones. In this paper, we propose a new robust feature selection method that emphasizes joint l2,1-norm minimization on both the loss function and the regularization. The l2,1-norm based loss function is robust to outliers in data points, and the l2,1-norm regularization selects features across all data points with joint sparsity. An efficient algorithm is introduced with proven convergence. Our regression-based objective makes the feature selection process more efficient. Our method has been applied to both genomic and proteomic biomarker discovery. Extensive empirical studies are performed on six data sets to demonstrate the performance of our feature selection method.
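
In one common convention, with data matrix $X\in\mathbb{R}^{d\times n}$, targets $Y\in\mathbb{R}^{n\times c}$, and projection $W\in\mathbb{R}^{d\times c}$, the joint objective takes the form

$$\min_{W}\;\|X^{\top}W-Y\|_{2,1}+\gamma\|W\|_{2,1},\qquad \|W\|_{2,1}=\sum_{i}\Big(\sum_{j}W_{ij}^{2}\Big)^{1/2},$$

and features are ranked by the row norms of the learned $W$; the row-wise l2,1 norm is what induces sparsity jointly across all outputs.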

1,697 citations


Journal ArticleDOI
TL;DR: A Bayesian "sum-of-trees" model where each tree is constrained by a regularization prior to be a weak learner, and fitting and inference are accomplished via an iterative Bayesian backfitting MCMC algorithm that generates samples from a posterior.
Abstract: We develop a Bayesian “sum-of-trees” model where each tree is constrained by a regularization prior to be a weak learner, and fitting and inference are accomplished via an iterative Bayesian backfitting MCMC algorithm that generates samples from a posterior. Effectively, BART is a nonparametric Bayesian regression approach which uses dimensionally adaptive random basis elements. Motivated by ensemble methods in general, and boosting algorithms in particular, BART is defined by a statistical model: a prior and a likelihood. This approach enables full posterior inference including point and interval estimates of the unknown regression function as well as the marginal effects of potential predictors. By keeping track of predictor inclusion frequencies, BART can also be used for model-free variable selection. BART’s many features are illustrated with a bake-off against competing methods on 42 different data sets, with a simulation experiment and on a drug discovery classification problem.
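
Concretely, the sum-of-trees model takes the form

$$Y=\sum_{j=1}^{m} g(x;\,T_j,M_j)+\varepsilon,\qquad \varepsilon\sim N(0,\sigma^{2}),$$

where $g(x;T_j,M_j)$ is the step function induced by tree $T_j$ with leaf parameters $M_j$, and the regularization prior keeps each tree's contribution small, so no single tree dominates the fit.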

1,439 citations


Proceedings ArticleDOI
25 Jul 2010
TL;DR: Inspired by recent developments in manifold learning and L1-regularized models for subset selection, a new approach called Multi-Cluster Feature Selection (MCFS) is proposed for unsupervised feature selection, which selects features such that the multi-cluster structure of the data is best preserved.
Abstract: In many data analysis tasks, one is often confronted with very high dimensional data. Feature selection techniques are designed to find the relevant feature subset of the original features which can facilitate clustering, classification and retrieval. In this paper, we consider the feature selection problem in the unsupervised learning scenario, which is particularly difficult due to the absence of class labels that would guide the search for relevant information. The feature selection problem is essentially a combinatorial optimization problem which is computationally expensive. Traditional unsupervised feature selection methods address this issue by selecting the top ranked features based on certain scores computed independently for each feature. These approaches neglect the possible correlation between different features and thus cannot produce an optimal feature subset. Inspired by recent developments in manifold learning and L1-regularized models for subset selection, we propose in this paper a new approach, called Multi-Cluster Feature Selection (MCFS), for unsupervised feature selection. Specifically, we select those features such that the multi-cluster structure of the data can be best preserved. The corresponding optimization problem can be efficiently solved since it only involves a sparse eigen-problem and an L1-regularized least squares problem. Extensive experimental results over various real-life data sets have demonstrated the superiority of the proposed algorithm.
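
A rough sketch of the MCFS pipeline using scikit-learn stand-ins (the paper solves the sparse eigen-problem and fits the L1 regressions with LARS; a fixed-alpha Lasso is a simplification here):

```python
# Sketch of MCFS: spectral embedding + per-dimension L1 regression, then
# score each feature by its largest absolute coefficient across dimensions.
import numpy as np
from sklearn.manifold import SpectralEmbedding
from sklearn.linear_model import Lasso

def mcfs_scores(X, n_clusters=5, alpha=0.01):
    # Step 1: embed the data to capture the multi-cluster structure.
    Y = SpectralEmbedding(n_components=n_clusters,
                          affinity="nearest_neighbors").fit_transform(X)
    # Step 2: regress each embedding dimension on the features with L1 penalty.
    coefs = np.column_stack([
        Lasso(alpha=alpha).fit(X, Y[:, k]).coef_ for k in range(n_clusters)
    ])
    # Step 3: MCFS score = max absolute coefficient over embedding dimensions.
    return np.abs(coefs).max(axis=1)

# Usage: keep the d highest-scoring features (d chosen by the user):
# selected = np.argsort(mcfs_scores(X))[::-1][:d]
```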

975 citations


Journal Article
TL;DR: In this paper, a brief account of the recent developments of theory, methods, and implementations for high-dimensional variable selection is presented, with emphasis on independence screening and two-scale methods.
Abstract: High dimensional statistical problems arise from diverse fields of scientific research and technological development. Variable selection plays a pivotal role in contemporary statistical learning and scientific discoveries. The traditional idea of best subset selection methods, which can be regarded as a specific form of penalized likelihood, is computationally too expensive for many modern statistical applications. Other forms of penalized likelihood methods have been successfully developed over the last decade to cope with high dimensionality. They have been widely applied for simultaneously selecting important variables and estimating their effects in high dimensional statistical inference. In this article, we present a brief account of the recent developments in theory, methods, and implementations for high dimensional variable selection. Questions of what limits of dimensionality such methods can handle, what the role of penalty functions is, and what statistical properties are attainable rapidly drive the advances of the field. The properties of non-concave penalized likelihood and its roles in high dimensional statistical modeling are emphasized. We also review some recent advances in ultra-high dimensional variable selection, with emphasis on independence screening and two-scale methods.
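
A hedged sketch of the independence-screening-plus-penalization recipe reviewed here; the n/log(n) screening size follows the sure independence screening literature, the correlations are computed up to scaling constants, and LassoCV stands in for non-concave penalties such as SCAD:

```python
# Sketch of two-scale variable selection: marginal (independence) screening
# to cut the dimension, then a penalized fit on the survivors.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

def sis_then_lasso(X, y):
    n, d = X.shape
    Xs = StandardScaler().fit_transform(X)
    corr = np.abs(Xs.T @ (y - y.mean())) / n           # marginal correlations
    k = min(d, int(n / np.log(n)))                     # screening threshold
    keep = np.argsort(corr)[::-1][:k]
    model = LassoCV(cv=5).fit(Xs[:, keep], y)
    return keep[model.coef_ != 0]                      # surviving variables
```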

892 citations


Journal ArticleDOI
TL;DR: This review focuses on variable selection methods in NIR spectroscopy, covering classical approaches as well as sophisticated methods such as the successive projections algorithm (SPA), uninformative variable elimination (UVE), and elaborate search-based strategies.

860 citations


Journal ArticleDOI
TL;DR: The proposed OP-ELM methodology performs several orders of magnitude faster than the other algorithms used in this brief, except the original ELM, and is still able to maintain an accuracy that is comparable to the performance of the SVM.
Abstract: In this brief, the optimally pruned extreme learning machine (OP-ELM) methodology is presented. It is based on the original extreme learning machine (ELM) algorithm with additional steps to make it more robust and generic. The whole methodology is presented in detail and then applied to several regression and classification problems. Results for both computational time and accuracy (mean square error) are compared to the original ELM and to three other widely used methodologies: multilayer perceptron (MLP), support vector machine (SVM), and Gaussian process (GP). As the experiments for both regression and classification illustrate, the proposed OP-ELM methodology performs several orders of magnitude faster than the other algorithms used in this brief, except the original ELM. Despite the simplicity and fast performance, the OP-ELM is still able to maintain an accuracy that is comparable to the performance of the SVM. A toolbox for the OP-ELM is publicly available online.
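
The ELM core on which OP-ELM builds is small enough to sketch; this minimal regression-style version omits the pruning steps (LARS ranking of neurons plus leave-one-out selection) that distinguish OP-ELM from the original algorithm:

```python
# Core of an extreme learning machine: a random hidden layer followed by a
# least-squares solve for the output weights. OP-ELM adds neuron ranking and
# leave-one-out pruning on top of this.
import numpy as np

class TinyELM:
    def __init__(self, n_hidden=100, random_state=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(random_state)

    def fit(self, X, y):
        d = X.shape[1]
        self.W = self.rng.normal(size=(d, self.n_hidden))  # random input weights
        self.b = self.rng.normal(size=self.n_hidden)       # random biases
        H = np.tanh(X @ self.W + self.b)                   # hidden activations
        self.beta = np.linalg.pinv(H) @ y                  # least-squares solve
        return self

    def predict(self, X):
        return np.tanh(X @ self.W + self.b) @ self.beta
```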

745 citations


Journal ArticleDOI
TL;DR: Support vector machines (SVM) are attractive for the classification of remotely sensed data with some claims that the method is insensitive to the dimensionality of the data and, therefore, does not require a dimensionality-reduction analysis in preprocessing, but it is shown that the accuracy of a classification by an SVM does vary as a function of the number of features used.
Abstract: Support vector machines (SVM) are attractive for the classification of remotely sensed data with some claims that the method is insensitive to the dimensionality of the data and, therefore, does not require a dimensionality-reduction analysis in preprocessing. Here, a series of classification analyses with two hyperspectral sensor data sets reveals that the accuracy of a classification by an SVM does vary as a function of the number of features used. Critically, it is shown that the accuracy of a classification may decline significantly (at 0.05 level of statistical significance) with the addition of features, particularly if a small training sample is used. This highlights a dependence of the accuracy of classification by an SVM on the dimensionality of the data and, therefore, the potential value of undertaking a feature-selection analysis prior to classification. Additionally, it is demonstrated that, even when a large training sample is available, feature selection may still be useful. For example, the accuracy derived from the use of a small number of features may be noninferior (at 0.05 level of significance) to that derived from the use of a larger feature set providing potential advantages in relation to issues such as data storage and computational processing costs. Feature selection may, therefore, be a valuable analysis to include in preprocessing operations for classification by an SVM.

Journal ArticleDOI
TL;DR: A theoretic framework based on rough set theory, called positive approximation, is introduced, which can be used to accelerate a heuristic process of attribute reduction; several representative heuristic attribute reduction algorithms in rough set theory have been enhanced with it.

Journal ArticleDOI
TL;DR: The emphasis in this paper is on how to use variable selection in practice and avoid the most common pitfalls.
Abstract: This paper provides a practical guide to variable selection in chemometrics with a focus on regression-based calibration models. Several approaches, such as genetic algorithms (GAs), jack-knifing, forward selection, etc., are explained; it is also explained how to choose between different kinds of variable selection methods. The emphasis in this paper is on how to use variable selection in practice and avoid the most common pitfalls. Copyright © 2010 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: A brief review of the existing work on sequence classification in terms of methodologies and application domains is presented and several extensions of the sequence classification problem are provided, such as early classification on sequences and semi-supervised learning on sequences.
Abstract: Sequence classification has a broad range of applications such as genomic analysis, information retrieval, health informatics, finance, and anomaly detection. Different from the classification task on feature vectors, sequences do not have explicit features. Even with sophisticated feature selection techniques, the dimensionality of potential features may still be very high and the sequential nature of features is difficult to capture. This makes sequence classification a more challenging task than classification on feature vectors. In this paper, we present a brief review of the existing work on sequence classification. We summarize sequence classification in terms of methodologies and application domains. We also provide a review of several extensions of the sequence classification problem, such as early classification on sequences and semi-supervised learning on sequences.

Journal ArticleDOI
TL;DR: The performance of the Bayesian lassos is compared to their frequentist counterparts using simulations, data sets that previous lasso papers have used, and a difficult modeling problem for predicting the collapse of governments around the world.
Abstract: Penalized regression methods for simultaneous variable selection and coefficient estimation, especially those based on the lasso of Tibshirani (1996), have received a great deal of attention in recent years, mostly through frequentist models. Properties such as consistency have been studied, and are achieved by different lasso variations. Here we look at a fully Bayesian formulation of the problem, which is flexible enough to encompass most versions of the lasso that have been previously considered. The advantages of the hierarchical Bayesian formulations are many. In addition to the usual ease of interpretation of hierarchical models, the Bayesian formulation produces valid standard errors (which can be problematic for the frequentist lasso), and is based on a geometrically ergodic Markov chain. We compare the performance of the Bayesian lassos to their frequentist counterparts using simulations, data sets that previous lasso papers have used, and a difficult modeling problem for predicting the collapse of governments around the world. In terms of prediction mean squared error, the Bayesian lasso performance is similar to and, in some cases, better than, the frequentist lasso.
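
For context, the hierarchical formulation behind the Bayesian lasso (in the style of Park and Casella, 2008) is

$$y\mid X,\beta,\sigma^{2}\sim N(X\beta,\sigma^{2}I_{n}),\qquad \beta_{j}\mid\sigma^{2},\tau_{j}^{2}\sim N(0,\sigma^{2}\tau_{j}^{2}),\qquad \tau_{j}^{2}\sim\mathrm{Exp}(\lambda^{2}/2),$$

and integrating out the $\tau_{j}^{2}$ yields independent Laplace (double-exponential) priors on the $\beta_{j}$, whose posterior mode is exactly the lasso estimate.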

Journal ArticleDOI
TL;DR: A blockwise path-following scheme that approximately traces the regularization path and theoretical results showing that this random projection approach converges to the solution yielded by trace-norm regularization are presented.
Abstract: We address the problem of recovering a common set of covariates that are relevant simultaneously to several classification problems. By penalizing the sum of l2 norms of the blocks of coefficients associated with each covariate across the different classification problems, similar sparsity patterns in all models are encouraged. To take computational advantage of the sparsity of solutions at high regularization levels, we propose a blockwise path-following scheme that approximately traces the regularization path. As the regularization coefficient decreases, the algorithm maintains and updates concurrently a growing set of covariates that are simultaneously active for all problems. We also show how to use random projections to extend this approach to the problem of joint subspace selection, where multiple predictors are found in a common low-dimensional subspace. We present theoretical results showing that this random projection approach converges to the solution yielded by trace-norm regularization. Finally, we present a variety of experimental results exploring joint covariate selection and joint subspace selection, comparing the path-following approach to competing algorithms in terms of prediction accuracy and running time.

Journal ArticleDOI
TL;DR: It is found that non-causal feature selection methods cannot be interpreted causally even when they achieve excellent predictivity, so only local causal techniques should be used when insight into causal structure is sought.
Abstract: We present an algorithmic framework for learning local causal structure around target variables of interest in the form of direct causes/effects and Markov blankets applicable to very large data sets with relatively small samples. The selected feature sets can be used for causal discovery and classification. The framework (Generalized Local Learning, or GLL) can be instantiated in numerous ways, giving rise to both existing state-of-the-art as well as novel algorithms. The resulting algorithms are sound under well-defined sufficient conditions. In a first set of experiments we evaluate several algorithms derived from this framework in terms of predictivity and feature set parsimony and compare to other local causal discovery methods and to state-of-the-art non-causal feature selection methods using real data. A second set of experimental evaluations compares the algorithms in terms of ability to induce local causal neighborhoods using simulated and resimulated data and examines the relation of predictivity with causal induction performance. Our experiments demonstrate, consistently with causal feature selection theory, that local causal feature selection methods (under broad assumptions encompassing appropriate family of distributions, types of classifiers, and loss functions) exhibit strong feature set parsimony, high predictivity and local causal interpretability. Although non-causal feature selection methods are often used in practice to shed light on causal relationships, we find that they cannot be interpreted causally even when they achieve excellent predictivity. Therefore we conclude that only local causal techniques should be used when insight into causal structure is sought. In a companion paper we examine in depth the behavior of GLL algorithms, provide extensions, and show how local techniques can be used for scalable and accurate global causal graph learning.

Journal ArticleDOI
TL;DR: A general framework for analyzing the robustness of biomarker selection algorithms is presented, together with a large-scale analysis of ensemble feature selection, where multiple feature selections are combined in order to increase the robustness of the final set of selected features.
Abstract: Motivation: Biomarker discovery is an important topic in biomedical applications of computational biology, including applications such as gene and SNP selection from high-dimensional data. Surprisingly, the stability with respect to sampling variation or robustness of such selection processes has received attention only recently. However, robustness of biomarkers is an important issue, as it may greatly influence subsequent biological validations. In addition, a more robust set of markers may strengthen the confidence of an expert in the results of a selection method. Results: Our first contribution is a general framework for the analysis of the robustness of a biomarker selection algorithm. Secondly, we conducted a large-scale analysis of the recently introduced concept of ensemble feature selection, where multiple feature selections are combined in order to increase the robustness of the final set of selected features. We focus on selection methods that are embedded in the estimation of support vector machines (SVMs). SVMs are powerful classification models that have shown state-of-the-art performance on several diagnosis and prognosis tasks on biological data. Their feature selection extensions also offered good results for gene selection tasks. We show that the robustness of SVMs for biomarker discovery can be substantially increased by using ensemble feature selection techniques, while at the same time improving upon classification performances. The proposed methodology is evaluated on four microarray datasets showing increases of up to almost 30% in robustness of the selected biomarkers, along with an improvement of ~15% in classification performance. The stability improvement with ensemble methods is particularly noticeable for small signature sizes (a few tens of genes), which is most relevant for the design of a diagnosis or prognosis model from a gene signature. Contact: yvan.saeys@psb.ugent.be Supplementary information: Supplementary data are available at Bioinformatics online.
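
A minimal sketch of the ensemble idea, assuming binary labels and a linear-SVM embedded ranker; the mean-rank aggregation used here is one simple choice, and the paper's aggregation schemes are richer:

```python
# Sketch of ensemble feature selection: rank features by |w| of a linear SVM
# on bootstrap resamples, then aggregate the per-resample rankings.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.utils import resample

def ensemble_svm_ranks(X, y, n_bootstraps=40, random_state=0):
    d = X.shape[1]
    ranks = np.zeros(d)
    for b in range(n_bootstraps):
        Xb, yb = resample(X, y, random_state=random_state + b)
        w = LinearSVC(C=1.0, dual=False).fit(Xb, yb).coef_.ravel()  # binary case
        ranks += np.argsort(np.argsort(np.abs(w)))  # higher |w| -> higher rank
    return ranks / n_bootstraps                     # aggregate by mean rank

# A signature of size k: np.argsort(ensemble_svm_ranks(X, y))[::-1][:k]
```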

Journal ArticleDOI
TL;DR: The adaptive group Lasso is shown to correctly select the nonzero components with probability approaching one and to achieve the optimal rate of convergence in a nonparametric additive model in which the number of additive components may be larger than the sample size but the number of nonzero components is small relative to the sample size; the procedure works well with samples of moderate size.
Abstract: We consider a nonparametric additive model of a conditional mean function in which the number of variables and additive components may be larger than the sample size but the number of nonzero additive components is "small" relative to the sample size. The statistical problem is to determine which additive components are nonzero. The additive components are approximated by truncated series expansions with B-spline bases. With this approximation, the problem of component selection becomes that of selecting the groups of coefficients in the expansion. We apply the adaptive group Lasso to select nonzero components, using the group Lasso to obtain an initial estimator and reduce the dimension of the problem. We give conditions under which the group Lasso selects a model whose number of components is comparable with the underlying model, and the adaptive group Lasso selects the nonzero components correctly with probability approaching one as the sample size increases and achieves the optimal rate of convergence. The results of Monte Carlo experiments show that the adaptive group Lasso procedure works well with samples of moderate size. A data example is used to illustrate the application of the proposed method.

Journal ArticleDOI
TL;DR: This article provides a review of the most widely used ensemble learning methods and their application in various bioinformatics problems, including the main topics of gene expression, mass spectrometry-based proteomics, gene-gene interaction identification from genome-wide association studies, and prediction of regulatory elements from DNA and protein sequences.
Abstract: Ensemble learning is an intensively studied technique in machine learning and pattern recognition. Recent work in computational biology has seen an increasing use of ensemble learning methods due to their unique advantages in dealing with small sample sizes, high dimensionality, and complex data structures. The aim of this article is two-fold. First, it provides a review of the most widely used ensemble learning methods and their application in various bioinformatics problems, including the main topics of gene expression, mass spectrometry-based proteomics, gene-gene interaction identification from genome-wide association studies, and prediction of regulatory elements from DNA and protein sequences. Second, we try to identify and summarize future trends of ensemble methods in bioinformatics. Promising directions such as ensembles of support vector machines, meta-ensembles, and ensemble-based feature selection are discussed.

Journal ArticleDOI
TL;DR: A modified discrete particle swarm optimization (PSO) algorithm is developed which dynamically accounts for the relevance and dependence of the features included in the feature subset in an adaptive feature selection procedure.

01 Jan 2010
TL;DR: Gain ratio and correlation-based feature selection (CFS) are used to illustrate the significance of feature subset selection for classifying the Pima Indian diabetes database (PIDD); results show that the feature subsets selected by the CFS filter yield a marginal improvement in classification accuracy for both the back-propagation neural network and the radial basis function network.
Abstract: Feature subset selection is of great importance in the field of data mining. High-dimensional data make testing and training of general classification methods difficult. In the present paper, two filter approaches, namely gain ratio and correlation-based feature selection (CFS), are used to illustrate the significance of feature subset selection for classifying the Pima Indian diabetes database (PIDD). The C4.5 tree uses gain ratio to determine the splits and to select the most important features. A genetic algorithm is used as the search method with CFS as the subset evaluating mechanism. The resulting feature subset is then tested using two supervised classification methods, namely a back-propagation neural network and a radial basis function network. Experimental results show that the feature subsets selected by the CFS filter yielded a marginal improvement in classification accuracy for both networks when compared to the feature subset selected by the information gain filter.
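
As a hedged illustration of the filter setup, scikit-learn's mutual_info_classif can stand in for the information-gain side of such an experiment; CFS additionally penalizes inter-feature correlation, which this sketch omits, and the choice k=8 is arbitrary:

```python
# Sketch of a filter-style feature ranking in the spirit of the experiment above.
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def filter_select(X, y, k=8):
    # Rank features by estimated mutual information with the class label
    # and keep the top k.
    selector = SelectKBest(mutual_info_classif, k=k).fit(X, y)
    return selector.get_support(indices=True)  # indices of the k kept features
```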

Journal ArticleDOI
TL;DR: An improved version of the algorithm for identifying the full set of truly important variables in an information system is presented; it is an extension of the random forest method which utilises the importance measure generated by the original algorithm.
Abstract: Machine learning methods are often used to classify objects described by hundreds of attributes; in many applications of this kind a great fraction of attributes may be totally irrelevant to the classification problem. Even more, usually one cannot decide a priori which attributes are relevant. In this paper we present an improved version of the algorithm for identification of the full set of truly important variables in an information system. It is an extension of the random forest method which utilises the importance measure generated by the original algorithm. It compares, in an iterative fashion, the importances of the original attributes with the importances of their randomised copies. We analyse the performance of the algorithm on several examples of synthetic data, as well as on a biologically important problem, namely the identification of sequence motifs that are important for the aptameric activity of short RNA sequences.

Journal ArticleDOI
TL;DR: This work proposes a novel discriminative semi-supervised feature selection method based on the idea of manifold regularization that selects features through maximizing the classification margin between different classes and simultaneously exploiting the geometry of the probability distribution that generates both labeled and unlabeled data.
Abstract: Feature selection has attracted a huge amount of interest in both research and application communities of data mining. We consider the problem of semi-supervised feature selection, where we are given a small amount of labeled examples and a large amount of unlabeled examples. Since a small number of labeled samples are usually insufficient for identifying the relevant features, the critical problem arising from semi-supervised feature selection is how to take advantage of the information underneath the unlabeled data. To address this problem, we propose a novel discriminative semi-supervised feature selection method based on the idea of manifold regularization. The proposed approach selects features through maximizing the classification margin between different classes and simultaneously exploiting the geometry of the probability distribution that generates both labeled and unlabeled data. In comparison with previous semi-supervised feature selection algorithms, our proposed semi-supervised feature selection method is an embedded feature selection method and is able to find more discriminative features. We formulate the proposed feature selection method into a convex-concave optimization problem, where the saddle point corresponds to the optimal solution. To find the optimal solution, the level method, a fairly recent optimization method, is employed. We also present a theoretic proof of the convergence rate for the application of the level method to our problem. Empirical evaluation on several benchmark data sets demonstrates the effectiveness of the proposed semi-supervised feature selection method.

Journal ArticleDOI
TL;DR: A new feature selection mechanism based on ant colony optimization is proposed in an attempt to combat the aforementioned difficulties; the results indicate that the SVM learning system has an advantage when the information preprocessing is based on data mining technology.
Abstract: This paper creates a system for power load forecasting using a support vector machine (SVM) and ant colony optimization. Ant colony optimization is employed to process large amounts of data and eliminate redundant information. The system mines the historical daily load that has the same meteorological category as the forecasting day in order to compose a data sequence with highly similar meteorological features. With this method, we reduced the SVM training data and overcame the disadvantages of very large data volumes and slow processing speed when constructing the SVM model. This paper proposes a new feature selection mechanism based on ant colony optimization in an attempt to combat the aforementioned difficulties. The method is then applied to find optimal feature subsets in the fuzzy-rough data reduction process. Applied to complex systems monitoring, ant colony optimization can mine the data more comprehensively and accurately than the original fuzzy-rough method, an entropy-based feature selector, and a transformation-based reduction method (PCA). Compared with a single SVM and a BP neural network in short-term load forecasting, this new method achieves greater forecasting accuracy. This indicates that the SVM learning system has an advantage when the information preprocessing is based on data mining technology.

Journal ArticleDOI
TL;DR: An improved practical version of this method is provided by combining it with a reduced version of the dynamic programming algorithm and it is proved that, in an appropriate asymptotic framework, this method provides consistent estimators of the change points with an almost optimal rate.
Abstract: We propose a new approach for dealing with the estimation of the location of change-points in one-dimensional piecewise constant signals observed in white noise. Our approach consists in reframing this task in a variable selection context. We use a penalized least-squares criterion with an l1-type penalty for this purpose. We explain how to implement this method in practice by using the LARS/LASSO algorithm. We then prove that, in an appropriate asymptotic framework, this method provides consistent estimators of the change points with an almost optimal rate. We finally provide an improved practical version of this method by combining it with a reduced version of the dynamic programming algorithm, and we successfully compare it with classical methods.
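
The reframing is concrete enough to sketch: a piecewise constant signal equals a lower-triangular all-ones design matrix times a sparse jump vector, so an l1 penalty on the jumps selects candidate change points. This toy version uses coordinate-descent Lasso in place of the paper's LARS/LASSO path, and the alpha and zero-threshold values are invented for illustration:

```python
# Sketch: change-point detection as variable selection. The signal is
# X @ beta with X lower-triangular ones; nonzero entries of beta are jumps.
import numpy as np
from sklearn.linear_model import Lasso

def lasso_changepoints(y, alpha=0.1):
    n = len(y)
    X = np.tril(np.ones((n, n)))               # cumulative-sum design
    fit = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000).fit(X, y)
    beta = fit.coef_
    # beta[0] is the initial level; later nonzeros mark jump locations.
    return np.nonzero(np.abs(beta[1:]) > 1e-8)[0] + 1

# Example: y = np.r_[np.zeros(50), 2 * np.ones(50)] + 0.1 * np.random.randn(100)
```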

Journal ArticleDOI
TL;DR: This paper presents a first systematic comparison of the three types of methods developed for imbalanced data classification problems and of seven feature selection metrics evaluated on small sample data sets from different applications and showed that signal-to-noise correlation coefficient and Feature Assessment by Sliding Thresholds are great candidates for feature selection in most applications.
Abstract: The class imbalance problem is encountered in real-world applications of machine learning and results in a classifier's suboptimal performance. Researchers have rigorously studied the resampling, algorithms, and feature selection approaches to this problem. No systematic studies have been conducted to understand how well these methods combat the class imbalance problem and which of these methods best manage the different challenges posed by imbalanced data sets. In particular, feature selection has rarely been studied outside of text classification problems. Additionally, no studies have looked at the additional problem of learning from small samples. This paper presents a first systematic comparison of the three types of methods developed for imbalanced data classification problems and of seven feature selection metrics evaluated on small sample data sets from different applications. We evaluated the performance of these metrics using area under the receiver operating characteristic (AUC) and area under the precision-recall curve (PRC). We compared each metric on the average performance across all problems and on the likelihood of a metric yielding the best performance on a specific problem. We examined the performance of these metrics inside each problem domain. Finally, we evaluated the efficacy of these metrics to see which perform best across algorithms. Our results showed that signal-to-noise correlation coefficient (S2N) and Feature Assessment by Sliding Thresholds (FAST) are great candidates for feature selection in most applications, especially when selecting very small numbers of features.
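
For reference, the signal-to-noise (S2N) correlation coefficient evaluated in the paper scores each feature of a binary problem by the class-mean separation over the summed class standard deviations; a minimal sketch, assuming labels in {0, 1}:

```python
# Per-feature signal-to-noise score: (mu_pos - mu_neg) / (sd_pos + sd_neg).
import numpy as np

def s2n_scores(X, y):
    Xp, Xn = X[y == 1], X[y == 0]
    return (Xp.mean(axis=0) - Xn.mean(axis=0)) / (Xp.std(axis=0) + Xn.std(axis=0))
```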

Journal ArticleDOI
TL;DR: This paper considers feature selection for data classification in the presence of a huge number of irrelevant features, and proposes a new feature-selection algorithm that addresses several major issues with prior work, including problems with algorithm implementation, computational complexity, and solution accuracy.
Abstract: This paper considers feature selection for data classification in the presence of a huge number of irrelevant features. We propose a new feature-selection algorithm that addresses several major issues with prior work, including problems with algorithm implementation, computational complexity, and solution accuracy. The key idea is to decompose an arbitrarily complex nonlinear problem into a set of locally linear ones through local learning, and then learn feature relevance globally within the large margin framework. The proposed algorithm is based on well-established machine learning and numerical analysis techniques, without making any assumptions about the underlying data distribution. It is capable of processing many thousands of features within minutes on a personal computer while maintaining a very high accuracy that is nearly insensitive to a growing number of irrelevant features. Theoretical analyses of the algorithm's sample complexity suggest that the algorithm has a logarithmical sample complexity with respect to the number of features. Experiments on 11 synthetic and real-world data sets demonstrate the viability of our formulation of the feature-selection problem for supervised learning and the effectiveness of our algorithm.

Journal ArticleDOI
01 Dec 2010
TL;DR: Three well-known feature selection methods, which are Principal Component Analysis (PCA), Genetic Algorithms (GA) and decision trees (CART), are used and the back-propagation neural network is developed for the prediction model.
Abstract: Effectively predicting stock prices for investors is a very important research problem. In the literature, data mining techniques have been applied to stock (market) prediction. Feature selection, a pre-processing step of data mining, aims at filtering out unrepresentative variables from a given dataset for effective prediction. Since using different feature selection methods leads to different selected features and thus affects prediction performance, the purpose of this paper is to combine multiple feature selection methods to identify more representative variables for better prediction. In particular, three well-known feature selection methods, Principal Component Analysis (PCA), Genetic Algorithms (GA), and decision trees (CART), are used. The combination methods to filter out unrepresentative variables are based on union, intersection, and multi-intersection strategies. For the prediction model, a back-propagation neural network is developed. Experimental results show that the intersection of PCA and GA and the multi-intersection of PCA, GA, and CART perform best, achieving 79% and 78.98% accuracy, respectively. In addition, these two combined feature selection methods filter out nearly 80% of the unrepresentative features from the 85 original variables, resulting in 14 and 17 important features, respectively. These variables are important factors for stock prediction and can be used for future investment decisions.

Journal ArticleDOI
TL;DR: This article reviews existing stable feature selection methods for biomarker discovery using a generic hierarchical framework, providing an overview on this new yet fast growing topic for a convenient reference and categorizing existing methods under an expandable framework for future research and development.