
Showing papers on "Feature selection published in 2002"


Journal ArticleDOI
TL;DR: In this article, a Support Vector Machine (SVM) method based on recursive feature elimination (RFE) was proposed to select a small subset of genes from broad patterns of gene expression data, recorded on DNA micro-arrays.
Abstract: DNA micro-arrays now permit scientists to screen thousands of genes simultaneously and determine whether those genes are active, hyperactive or silent in normal or cancerous tissue. Because these new micro-array devices generate bewildering amounts of raw data, new analytical methods must be developed to sort out whether cancer tissues have distinctive signatures of gene expression over normal tissues or other types of cancer tissues. In this paper, we address the problem of selection of a small subset of genes from broad patterns of gene expression data, recorded on DNA micro-arrays. Using available training examples from cancer and normal patients, we build a classifier suitable for genetic diagnosis, as well as drug discovery. Previous attempts to address this problem select genes with correlation techniques. We propose a new method of gene selection utilizing Support Vector Machine methods based on Recursive Feature Elimination (RFE). We demonstrate experimentally that the genes selected by our techniques yield better classification performance and are biologically relevant to cancer. In contrast with the baseline method, our method eliminates gene redundancy automatically and yields better and more compact gene subsets. In patients with leukemia our method discovered 2 genes that yield zero leave-one-out error, while 64 genes are necessary for the baseline method to get the best result (one leave-one-out error). In the colon cancer database, using only 4 genes our method is 98% accurate, while the baseline method is only 86% accurate.
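As an illustration, here is a minimal sketch of the RFE idea using scikit-learn's generic RFE wrapper around a linear SVM; the data below are random placeholders, not the leukemia or colon sets, and the parameter choices are ours.

import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

X = np.random.randn(62, 2000)    # placeholder micro-array matrix (samples x genes)
y = np.random.randint(0, 2, 62)  # placeholder tumor/normal labels

# A linear SVM ranks genes by the squared weight w_i^2; RFE repeatedly
# retrains and drops the lowest-ranked gene. Raising step removes several
# genes per iteration, as the paper does for speed on large gene sets.
selector = RFE(estimator=SVC(kernel="linear", C=1.0),
               n_features_to_select=4, step=1)
selector.fit(X, y)
print(np.where(selector.support_)[0])    # indices of the surviving genes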

7,939 citations


Journal ArticleDOI
TL;DR: An unsupervised feature selection algorithm suitable for data sets that are large in both dimension and size is presented; it is based on measuring similarity between features, whereby redundancy is removed, and because it requires no search it is fast.
Abstract: In this article, we describe an unsupervised feature selection algorithm suitable for data sets, large in both dimension and size. The method is based on measuring similarity between features whereby redundancy therein is removed. This does not need any search and, therefore, is fast. A new feature similarity measure, called maximum information compression index, is introduced. The algorithm is generic in nature and has the capability of multiscale representation of data sets. The superiority of the algorithm, in terms of speed and performance, is established extensively over various real-life data sets of different sizes and dimensions. It is also demonstrated how redundancy and information loss in feature selection can be quantified with an entropy measure.
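A greatly simplified sketch of the similarity measure and the redundancy-removal idea follows; the paper's actual algorithm uses a k-nearest-neighbour scheme, so the greedy loop below is our own simplification.

import numpy as np

def mici(x, y):
    """Maximum information compression index of two features: the
    smallest eigenvalue of their 2x2 covariance matrix (zero iff the
    features are linearly dependent)."""
    return np.linalg.eigvalsh(np.cov(x, y))[0]

def remove_redundant(X, k):
    keep = list(range(X.shape[1]))
    while len(keep) > k:
        # distance of each kept feature to its most similar other feature
        nearest = [min(mici(X[:, i], X[:, j]) for j in keep if j != i)
                   for i in keep]
        keep.pop(int(np.argmin(nearest)))  # drop the most redundant feature
    return keep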

1,432 citations


Proceedings ArticleDOI
23 Jul 2002
TL;DR: An overview of various measures proposed in the statistics, machine learning and data mining literature is presented, and it is shown that each measure has different properties which make it useful for some application domains, but not for others.
Abstract: Many techniques for association rule mining and feature selection require a suitable metric to capture the dependencies among variables in a data set. For example, metrics such as support, confidence, lift, correlation, and collective strength are often used to determine the interestingness of association patterns. However, many such measures provide conflicting information about the interestingness of a pattern, and the best metric to use for a given application domain is rarely known. In this paper, we present an overview of various measures proposed in the statistics, machine learning and data mining literature. We describe several key properties one should examine in order to select the right measure for a given application domain. A comparative study of these properties is made using twenty-one of the existing measures. We show that each measure has different properties which make it useful for some application domains, but not for others. We also present two scenarios in which most of the existing measures agree with each other, namely, support-based pruning and table standardization. Finally, we present an algorithm to select a small set of tables such that an expert can select a desirable measure by looking at just this small set of tables.
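Three of the measures named in the abstract, written out as a sketch for a rule A -> B over N transactions; n_a, n_b and n_ab denote the occurrence counts of A, B, and both together.

def support(n_ab, N):
    return n_ab / N

def confidence(n_ab, n_a):
    return n_ab / n_a

def lift(n_ab, n_a, n_b, N):
    # ratio of observed co-occurrence to that expected under independence
    return (n_ab / N) / ((n_a / N) * (n_b / N))

# Example: A and B co-occur in 40 of 100 transactions, A in 50, B in 60:
# support = 0.40, confidence = 0.80, lift = 1.33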

985 citations


Journal ArticleDOI
TL;DR: It is demonstrated that the proposed method can provide the performance of the ideal greedy selection algorithm when information is distributed uniformly, and it should prove to be a useful method in selecting features for classification problems.
Abstract: Feature selection plays an important role in classifying systems such as neural networks (NNs). The available attributes may be relevant, irrelevant or redundant, and from the viewpoint of managing a dataset, which can be huge, reducing the number of attributes by selecting only the relevant ones is desirable. In doing so, higher performance with lower computational effort is expected. In this paper, we propose two feature selection algorithms. The limitation of the mutual information feature selector (MIFS) is analyzed and a method to overcome this limitation is studied. One of the proposed algorithms makes more considered use of the mutual information between input attributes and output classes than MIFS does. We demonstrate that the proposed method can provide the performance of the ideal greedy selection algorithm when information is distributed uniformly. The computational load for this algorithm is nearly the same as that of MIFS. In addition, another feature selection algorithm using the Taguchi method is proposed. This is advanced as a solution to the question of how to identify good features with as few experiments as possible. The proposed algorithms are applied to several classification problems and compared with MIFS. These two algorithms can be combined to complement each other's limitations. The combined algorithm performed well in several experiments and should prove to be a useful method in selecting features for classification problems.
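For context, here is a sketch of the original greedy MIFS criterion that the paper analyzes and improves on (Battiti's form with redundancy parameter beta), not the proposed variant itself; features are assumed discrete so that mutual_info_score applies.

from sklearn.metrics import mutual_info_score

def mifs(X, y, k, beta=0.5):
    remaining = list(range(X.shape[1]))
    selected = []
    while len(selected) < k:
        def score(f):
            relevance = mutual_info_score(X[:, f], y)           # I(f; C)
            redundancy = sum(mutual_info_score(X[:, f], X[:, s])
                             for s in selected)                 # sum of I(f; s)
            return relevance - beta * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected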

918 citations


Proceedings ArticleDOI
09 Dec 2002
TL;DR: This work assesses the performance of several fundamental feature selection algorithms from the literature in a controlled scenario, ranking them with a scoring measure that takes into account the amount of relevance, irrelevance and redundancy in sample data sets.
Abstract: In view of the substantial number of existing feature selection algorithms, criteria are needed to decide adequately which algorithm to use in a given situation. This work assesses the performance of several fundamental algorithms found in the literature in a controlled scenario. A scoring measure ranks the algorithms by taking into account the amount of relevance, irrelevance and redundancy in sample data sets. This measure computes the degree of matching between the output given by the algorithm and the known optimal solution. Sample size effects are also studied.

657 citations


Journal ArticleDOI
TL;DR: A new method of calculating mutual information between input and class variables based on the Parzen window is proposed, and this is applied to a feature selection algorithm for classification problems.
Abstract: Mutual information is a good indicator of relevance between variables, and has been used as a measure in several feature selection algorithms. However, calculating mutual information is difficult, and the performance of a feature selection algorithm depends on the accuracy of the estimate. In this paper, we propose a new method of calculating the mutual information between input and class variables based on the Parzen window, and we apply this to a feature selection algorithm for classification problems.
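A hedged sketch of a Parzen-window estimate of I(X; C) between one continuous feature and a discrete class, in the spirit of the abstract; class-conditional densities use Gaussian kernels via scipy, and this is not the paper's exact estimator.

import numpy as np
from scipy.stats import gaussian_kde

def parzen_mi(x, c):
    classes, counts = np.unique(c, return_counts=True)
    priors = counts / len(c)
    kdes = [gaussian_kde(x[c == k]) for k in classes]
    # posterior p(class | x_i) by Bayes' rule on the Parzen densities
    dens = np.vstack([p * kde(x) for p, kde in zip(priors, kdes)])
    post = dens / dens.sum(axis=0)
    h_c = -np.sum(priors * np.log(priors))                         # H(C)
    h_c_given_x = -np.mean(np.sum(post * np.log(post + 1e-12), axis=0))
    return h_c - h_c_given_x                                       # I = H(C) - H(C|X)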

639 citations


Journal ArticleDOI
TL;DR: Fan and Li as mentioned in this paper extended the nonconcave penalized likelihood approach to the Cox proportional hazards model and the Cox proportional hazards frailty model, two commonly used semi-parametric models in survival analysis, and proposed new variable selection procedures for these two commonly-used models.
Abstract: A class of variable selection procedures for parametric models via nonconcave penalized likelihood was proposed in Fan and Li (2001a). It has been shown there that the resulting procedures perform as well as if the subset of significant variables were known in advance. Such a property is called an oracle property. The proposed procedures were illustrated in the context of linear regression, robust linear regression and generalized linear models. In this paper, the nonconcave penalized likelihood approach is extended further to the Cox proportional hazards model and the Cox proportional hazards frailty model, two commonly used semi-parametric models in survival analysis. As a result, new variable selection procedures for these two commonly-used models are proposed. It is demonstrated how the rates of convergence depend on the regularization parameter in the penalty function. Further, with a proper choice of the regularization parameter and the penalty function, the proposed estimators possess an oracle property. Standard error formulae are derived and their accuracies are empirically tested. Simulation studies show that the proposed procedures are more stable in prediction and more effective in computation than the best subset variable selection, and they reduce model complexity as effectively as the best subset variable selection. Compared with the LASSO, which is the penalized likelihood method with the $L_1$-penalty, proposed by Tibshirani, the newly proposed approaches have better theoretic properties and finite sample performance.
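For reference, the nonconcave penalty at the heart of this approach is the smoothly clipped absolute deviation (SCAD) penalty of Fan and Li (2001), defined through its derivative (with the suggested choice $a = 3.7$):

$$p'_\lambda(\theta) = \lambda \left\{ I(\theta \le \lambda) + \frac{(a\lambda - \theta)_+}{(a-1)\lambda}\, I(\theta > \lambda) \right\}, \qquad \theta > 0,\; a > 2,$$

and, for the Cox model, variable selection proceeds by maximizing a penalized partial likelihood of the form $\ell(\beta) - n \sum_j p_\lambda(|\beta_j|)$.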

570 citations


Journal ArticleDOI
TL;DR: The objective of this review paper is to provide a summary of, and guidelines for using, the most widely used pattern analysis techniques, and to identify research directions at the frontier of sensor-based machine olfaction.
Abstract: Pattern analysis constitutes a critical building block in the development of gas sensor array instruments capable of detecting, identifying, and measuring volatile compounds, a technology that has been proposed as an artificial substitute for the human olfactory system. The successful design of a pattern analysis system for machine olfaction requires a careful consideration of the various issues involved in processing multivariate data: signal-preprocessing, feature extraction, feature selection, classification, regression, clustering, and validation. A considerable number of methods from statistical pattern recognition, neural networks, chemometrics, machine learning, and biological cybernetics have been used to process electronic nose data. The objective of this review paper is to provide a summary and guidelines for using the most widely used pattern analysis techniques, as well as to identify research directions that are at the frontier of sensor-based machine olfaction.

556 citations


01 Jan 2002
TL;DR: The proposed algorithm, GUIDE, is specifically designed to eliminate variable selection bias, a problem that can undermine the reliability of inferences from a tree structure, and allows fast computation speed, natural extension to data sets with categorical variables, and direct detection of local two-variable interactions.
Abstract: We propose an algorithm for regression tree construction called GUIDE. It is specifically designed to eliminate variable selection bias, a problem that can undermine the reliability of inferences from a tree structure. GUIDE controls bias by employing chi-square analysis of residuals and bootstrap calibration of significance probabilities. This approach allows fast computation speed, natural extension to data sets with categorical variables, and direct detection of local two-variable interactions. Previous algorithms are not unbiased and are insensitive to local interactions during split selection. The speed of GUIDE enables two further enhancements: complex modeling at the terminal nodes, such as polynomial or best simple linear models, and bagging. In an experiment with real data sets, the prediction mean square error of the piecewise constant GUIDE model is within ±20% of that of CART®. Piecewise linear GUIDE models are more accurate; with bagging they can outperform the spline-based MARS® method.
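A much-simplified sketch of the residual chi-square idea behind GUIDE's split-variable selection; the bootstrap calibration and interaction tests are omitted, and the quartile binning is our own choice.

import numpy as np
from scipy.stats import chi2_contingency

def select_split_variable(X, residuals):
    """Cross-tabulate the sign of the node's residuals against each
    quartile-binned candidate variable; the smallest chi-square p-value
    wins, which avoids favoring variables with many split points."""
    pos = residuals > 0
    pvals = []
    for j in range(X.shape[1]):
        grp = np.digitize(X[:, j], np.quantile(X[:, j], [0.25, 0.5, 0.75]))
        table = np.array([[np.sum(pos & (grp == g)),
                           np.sum(~pos & (grp == g))] for g in range(4)])
        table = table[table.sum(axis=1) > 0]   # drop empty bins
        pvals.append(chi2_contingency(table)[1])
    return int(np.argmin(pvals))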

484 citations


Journal ArticleDOI
TL;DR: Several MCMC methods for estimating probabilities of models and associated ‘model-averaged’ posterior distributions in the presence of model uncertainty are discussed, compared, developed and illustrated, with a focus on connections between them.
Abstract: Several MCMC methods have been proposed for estimating probabilities of models and associated ‘model-averaged’ posterior distributions in the presence of model uncertainty. We discuss, compare, develop and illustrate several of these methods, focussing on connections between them.

479 citations


Journal ArticleDOI
TL;DR: This work presents a comparative study on six feature selection heuristics by applying them to two sets of data, which are gene expression profiles from Acute Lymphoblastic Leukemia patients and proteomic patterns from ovarian cancer patients.
Abstract: Feature selection plays an important role in classification. We present a comparative study on six feature selection heuristics by applying them to two sets of data. The first set of data are gene expression profiles from Acute Lymphoblastic Leukemia (ALL) patients. The second set of data are proteomic patterns from ovarian cancer patients. Based on features chosen by these methods, error rates of several classification algorithms were obtained for analysis. Our results demonstrate the importance of feature selection in accurately classifying new samples.

Journal ArticleDOI
TL;DR: The performance of both types of classifiers is examined in two-class fault/no-fault recognition examples, together with attempts to improve the overall generalisation performance of both techniques through a genetic-algorithm-based feature selection process.

Proceedings ArticleDOI
04 Nov 2002
TL;DR: It is found that feature selection methods based on chi2 statistics consistently outperformed those based on other criteria (including information gain) for all four classifiers and both data collections, and that a further increase in performance was obtained by combining uncorrelated and high-performing feature selection methods.
Abstract: This paper reports a controlled study on a large number of filter feature selection methods for text classification. Over 100 variants of five major feature selection criteria were examined using four well-known classification algorithms: a Naive Bayesian (NB) approach, a Rocchio-style classifier, a k-nearest neighbor (kNN) method and a Support Vector Machine (SVM) system. Two benchmark collections were chosen as the testbeds: Reuters-21578 and a small portion of Reuters Corpus Version 1 (RCV1), making the new results comparable to published results. We found that feature selection methods based on chi2 statistics consistently outperformed those based on other criteria (including information gain) for all four classifiers and both data collections, and that a further increase in performance was obtained by combining uncorrelated and high-performing feature selection methods. The results we obtained using only 3% of the available features are among the best reported, including results obtained with the full feature set.
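A minimal reproduction of the chi2 filtering setup with scikit-learn; the three toy documents stand in for the Reuters collections, and k is an illustrative choice.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["grain exports rise", "oil prices fall", "wheat crop grows"]
labels = [0, 1, 0]

X = CountVectorizer().fit_transform(docs)
# keep the k terms with the highest chi-square score w.r.t. the labels
X_reduced = SelectKBest(chi2, k=3).fit_transform(X, labels)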

Proceedings ArticleDOI
24 Aug 2002
TL;DR: It is shown that an NE recognizer based on Support Vector Machines (SVMs) gives better scores than conventional systems, but off-the-shelf SVM classifiers are too inefficient for this task.
Abstract: Named Entity (NE) recognition is a task in which proper nouns and numerical information are extracted from documents and are classified into categories such as person, organization, and date. It is a key technology of Information Extraction and Open-Domain Question Answering. First, we show that an NE recognizer based on Support Vector Machines (SVMs) gives better scores than conventional systems. However, off-the-shelf SVM classifiers are too inefficient for this task. Therefore, we present a method that makes the system substantially faster. This approach can also be applied to other similar tasks such as chunking and part-of-speech tagging. We also present an SVM-based feature selection method and an efficient training method.

Journal ArticleDOI
TL;DR: The algorithm developed is compared with five other feature selection methods, each of which banks on a different concept, and outperformed them by achieving higher classification accuracy on all the problems tested.

Journal ArticleDOI
TL;DR: This paper uses tabu search to solve the feature selection problem and compares it with classic algorithms, such as sequential methods and the branch and bound method, as well as with most other suboptimal methods proposed recently.

Journal ArticleDOI
TL;DR: This paper presents an approach that integrates a potentially powerful fuzzy rule induction algorithm with a rough set-assisted feature reduction method, and the integrated rule generation mechanism maintains the underlying semantics of the feature set.

Journal ArticleDOI
TL;DR: It is demonstrated that the variables selected by the genetic algorithm are consistent with expert knowledge, providing a convincing demonstration that the algorithm can select correct variables in an automated fashion.

Proceedings ArticleDOI
03 Dec 2002
TL;DR: It is argued that feature selection is an important issue in gender classification and demonstrated that Genetic Algorithms (GA) can select good subsets of features (i.e., features that encode mostly gender information), reducing the classification error.
Abstract: We consider the problem of gender classification from frontal facial images using genetic feature subset selection. We argue that feature selection is an important issue in gender classification and demonstrate that Genetic Algorithms (GA) can select good subsets of features (i.e., features that encode mostly gender information), reducing the classification error. First, Principal Component Analysis (PCA) is used to represent each image as a feature vector (i.e., eigen-features) in a low-dimensional space. Genetic Algorithms (GAs) are then employed to select a subset of features from the low-dimensional representation by disregarding certain eigenvectors that do not seem to encode important gender information. Four different classifiers were compared in this study using genetic feature subset selection: a Bayes classifier, a Neural Network (NN) classifier, a Support Vector Machine (SVM) classifier, and a classifier based on Linear Discriminant Analysis (LDA). Our experimental results show a significant error rate reduction in all cases. The best performance was obtained using the SVM classifier. Using only 8.4% of the features in the complete set, the SVM classifier achieved an error rate of 4.7% from an average error rate of 8.9% using manually selected features.
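A compact sketch of the pipeline described above: PCA eigen-features followed by a bit-string GA. Truncation selection plus bit-flip mutation stand in for a full GA (no crossover), and the classifier, rates, sizes and data are illustrative placeholders.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 400))          # placeholder face images
y = rng.integers(0, 2, 100)                  # placeholder gender labels
Z = PCA(n_components=30).fit_transform(X)    # eigen-feature representation

def fitness(mask):
    if not mask.any():
        return 0.0
    return cross_val_score(SVC(), Z[:, mask], y, cv=3).mean()

pop = rng.random((20, 30)) < 0.5             # population of feature masks
for _ in range(10):                          # a few generations
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-10:]]                # keep the best half
    children = parents[rng.integers(0, 10, size=10)].copy()
    children ^= rng.random(children.shape) < 0.05          # bit-flip mutation
    pop = np.vstack([parents, children])
best = pop[np.argmax([fitness(m) for m in pop])]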

Journal ArticleDOI
TL;DR: The authors propose an adaptive model selection procedure that uses a data-adaptive complexity penalty based on a concept of generalized degrees of freedom; by combining the benefits of a class of nonadaptive procedures, it approximates the best performance of this class of procedures across a variety of different situations.
Abstract: Most model selection procedures use a fixed penalty penalizing an increase in the size of a model. These nonadaptive selection procedures perform well only in one type of situation. For instance, Bayesian information criterion (BIC) with a large penalty performs well for “small” models and poorly for “large” models, and Akaike's information criterion (AIC) does just the opposite. This article proposes an adaptive model selection procedure that uses a data-adaptive complexity penalty based on a concept of generalized degrees of freedom. The proposed procedure, combining the benefit of a class of nonadaptive procedures, approximates the best performance of this class of procedures across a variety of different situations. This class includes many well-known procedures, such as AIC, BIC, Mallows's Cp, and risk inflation criterion (RIC). The proposed procedure is applied to wavelet thresholding in nonparametric regression and variable selection in least squares regression. Simulation results and an asymptotic...

Proceedings ArticleDOI
04 Nov 2002
TL;DR: This paper processes the data resource and sets up the vector space model in order to provide a convenient data structure for text categorization, and calculates the precision of this feature selection approach with the help of categorization results.
Abstract: This paper describes the feature selection method TFIDF (term frequency, inverse document frequency). With it, we process the data resource and set up the vector space model in order to provide a convenient data structure for text categorization. We calculate the precision of this method with the help of categorization results. According to the empirical results, we analyze its advantages and disadvantages and present a new TFIDF-based feature selection approach to improve its accuracy.
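A small sketch of TFIDF-based term scoring with scikit-learn; the corpus is a placeholder, and ranking terms by their maximum TFIDF weight across documents is one simple variant of the idea, not the paper's exact procedure.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["feature selection for text", "text categorization methods",
        "a vector space model of text"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)
scores = np.asarray(X.max(axis=0).todense()).ravel()   # best weight per term
top = np.argsort(scores)[::-1][:5]
print([vec.get_feature_names_out()[i] for i in top])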

Proceedings Article
01 Jan 2002
TL;DR: The resulting algorithm compares favorably to state-of-the-art feature selection procedures and demonstrates its effectiveness on a demanding facial expression recognition problem.
Abstract: This paper introduces an algorithm for the automatic relevance determination of input variables in kernelized Support Vector Machines. Relevance is measured by scale factors defining the input space metric, and feature selection is performed by assigning zero weights to irrelevant variables. The metric is automatically tuned by the minimization of the standard SVM empirical risk, where scale factors are added to the usual set of parameters defining the classifier. Feature selection is achieved by constraints encouraging the sparsity of scale factors. The resulting algorithm compares favorably to state-of-the-art feature selection procedures and demonstrates its effectiveness on a demanding facial expression recognition problem.

Proceedings Article
08 Jul 2002
TL;DR: It is shown how linkage and autocorrelation affect a representative algorithm for feature selection by applying the algorithm to synthetic data and to data drawn from the Internet Movie Database.
Abstract: Two common characteristics of relational data sets — concentrated linkage and relational autocorrelation — can cause learning algorithms to be strongly biased toward certain features, irrespective of their predictive power. We identify these characteristics, define quantitative measures of their severity, and explain how they produce this bias. We show how linkage and autocorrelation affect a representative algorithm for feature selection by applying the algorithm to synthetic data and to data drawn from the Internet Movie Database.

Journal ArticleDOI
TL;DR: Two new Bayesian classification algorithms incorporating feature selection are investigated and it is argued that the feature selection is meaningful and some of the highlighted genes appear to be medically important.
Abstract: Motivation: We investigate two new Bayesian classification algorithms incorporating feature selection. These algorithms are applied to the classification of gene expression data derived from cDNA microarrays. Results: We demonstrate the effectiveness of the algorithms on three gene expression datasets for cancer, showing they compare well with alternative kernel-based techniques. By automatically incorporating feature selection, accurate classifiers can be constructed utilizing very few features and with minimal hand-tuning. We argue that the feature selection is meaningful and some of the highlighted genes appear to be medically important.

Journal ArticleDOI
TL;DR: The extended approach proposed in the paper proved able to outperform the standard approach on some benchmark problems at a statistically significant level, and classifiers induced using the representation enriched by the GP-constructed features provide better classification accuracy on the test set.
Abstract: In this paper we use genetic programming for changing the representation of the input data for machine learners. In particular, the topic of interest here is feature construction in the learning-from-examples paradigm, where new features are built based on the original set of attributes. The paper first introduces the general framework for GP-based feature construction. Then, an extended approach is proposed where the useful components of representation (features) are preserved during an evolutionary run, as opposed to the standard approach where valuable features are often lost during search. Finally, we present and discuss the results of an extensive computational experiment carried out on several reference data sets. The outcomes show that classifiers induced using the representation enriched by the GP-constructed features provide better accuracy of classification on the test set. In particular, the extended approach proposed in the paper proved to be able to outperform the standard approach on some benchmark problems at a statistically significant level.

Journal ArticleDOI
TL;DR: The algorithm is applied in the construction of parsimonious quantitative structure-activity relationship (QSAR) models based on feed-forward neural networks and is tested on three classical data sets from the QSAR literature.
Abstract: We present a new feature selection algorithm for structure-activity and structure-property correlation based on particle swarms. Particle swarms explore the search space through a population of individuals that adapt by returning stochastically toward previously successful regions, influenced by the success of their neighbors. This method, which was originally intended for searching multidimensional continuous spaces, is adapted to the problem of feature selection by viewing the location vectors of the particles as probabilities and employing roulette wheel selection to construct candidate subsets. The algorithm is applied in the construction of parsimonious quantitative structure-activity relationship (QSAR) models based on feed-forward neural networks and is tested on three classical data sets from the QSAR literature. It is shown that the method compares favorably with simulated annealing and is able to identify a better and more diverse set of solutions given the same amount of simulation time.
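A sketch of the subset-construction step the abstract describes, with particle location vectors read as selection probabilities; the swarm update and the neural-network fitness evaluation are omitted, and all names are ours.

import numpy as np

def roulette_subset(location, k, rng):
    """Draw a k-feature candidate subset by roulette-wheel selection,
    treating the particle's location vector as selection probabilities."""
    probs = location / location.sum()
    return rng.choice(len(location), size=k, replace=False, p=probs)

rng = np.random.default_rng(0)
location = rng.random(20)        # one particle over 20 molecular descriptors
print(roulette_subset(location, k=5, rng=rng))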

Journal ArticleDOI
TL;DR: In this article, a fast algorithm for updating regressions in the Markov chain Monte Carlo searches for posterior inference is developed, allowing many more variables than observations to be considered, which can greatly aid the interpretation of the model.
Abstract: When a number of distinct models contend for use in prediction, the choice of a single model can offer rather unstable predictions. In regression, stochastic search variable selection with Bayesian model averaging offers a cure for this robustness issue but at the expense of requiring very many predictors. Here we look at Bayes model averaging incorporating variable selection for prediction. This offers similar mean-square errors of prediction but with a vastly reduced predictor space. This can greatly aid the interpretation of the model. It also reduces the cost if measured variables have costs. The development here uses decision theory in the context of the multivariate general linear model. In passing, this reduced predictor space Bayes model averaging is contrasted with single-model approximations. A fast algorithm for updating regressions in the Markov chain Monte Carlo searches for posterior inference is developed, allowing many more variables than observations to be contemplated. We discuss the merits of absolute rather than proportionate shrinkage in regression, especially when there are more variables than observations. The methodology is illustrated on a set of spectroscopic data used for measuring the amounts of different sugars in an aqueous solution.

Patent
20 May 2002
TL;DR: In this paper, a pre-processing step prior to training a learning machine is described, which includes reducing the quantity of features to be processed using feature selection methods selected from the group consisting of recursive feature elimination (RFE), minimizing the number of non-zero parameters of the system (l0-norm minimization), evaluation of cost function to identify a subset of features that are compatible with constraints imposed by the learning set, unbalanced correlation score and transductive feature selection.
Abstract: In a pre-processing step prior to training a learning machine, pre-processing includes reducing the quantity of features to be processed using feature selection methods selected from the group consisting of recursive feature elimination (RFE), minimizing the number of non-zero parameters of the system (l0-norm minimization), evaluation of cost function to identify a subset of features that are compatible with constraints imposed by the learning set, unbalanced correlation score and transductive feature selection. The features remaining after feature selection are then used to train a learning machine for purposes of pattern classification, regression, clustering and/or novelty detection. (Fig. 3, 300, 301, 302, 304, 306, 308, 309, 310, 311, 312, 314)

Patent
05 Jul 2002
TL;DR: In this article, a method of automatically selecting regions of interest within an image in response to a selection signal, and panning across an image so as to keep the selected region in view is disclosed, together with an image processing system employing the method.
Abstract: A method of automatically selecting regions of interest within an image in response to a selection signal, and panning across an image so as to keep the selected region in view is disclosed, together with an image processing system employing the method.

Journal ArticleDOI
TL;DR: This work begins with a filter model, exploiting the geometrical information contained in the minimum spanning tree (MST) built on the learning set, which leads to a feature selection algorithm belonging to a new category of hybrid models (filter-wrapper).