
Showing papers on "Feature selection published in 1997"


Journal ArticleDOI
TL;DR: The wrapper method searches for an optimal feature subset tailored to a particular algorithm and domain, and compares the wrapper approach to induction without feature subset selection and to Relief, a filter approach to feature subset selection.

8,610 citations


Proceedings Article
08 Jul 1997
TL;DR: This paper finds strong correlations between the DF, IG, and CHI values of a term and suggests that DF thresholding, the simplest method with the lowest cost in computation, can be reliably used instead of IG or CHI when the computation of these measures is too expensive.
Abstract: This paper is a comparative study of feature selection methods in statistical learning of text categorization. The focus is on aggressive dimensionality reduction. Five methods were evaluated, including term selection based on document frequency (DF), information gain (IG), mutual information (MI), a χ² test (CHI), and term strength (TS). We found IG and CHI most effective in our experiments. Using IG thresholding with a k-nearest-neighbor classifier on the Reuters corpus, removal of up to 98% of unique terms actually yielded an improved classification accuracy, measured by average precision. DF thresholding performed similarly. Indeed, we found strong correlations between the DF, IG, and CHI values of a term. This suggests that DF thresholding, the simplest method with the lowest cost in computation, can be reliably used instead of IG or CHI when the computation of these measures is too expensive. TS compares favorably with the other methods with up to 50% vocabulary reduction but is not competitive at higher vocabulary reduction levels. In contrast, MI had relatively poor performance due to its bias towards favoring rare terms and its sensitivity to probability estimation errors.
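As a rough illustration of the two cheapest criteria compared above (DF thresholding and IG), the sketch below ranks the terms of a binary-labelled corpus by document frequency and by information gain. The function and variable names, and the two-class simplification, are assumptions rather than anything taken from the paper.

```python
import math
from collections import Counter

def rank_terms(docs, labels, top_k=100):
    """docs: list of token sets; labels: parallel list of 0/1 class labels.
    Returns the top_k terms by document frequency (DF) and by information gain (IG)."""
    n = len(docs)
    n_pos = sum(labels)
    df = Counter(t for d in docs for t in d)                      # document frequency of each term
    df_pos = Counter(t for d, y in zip(docs, labels) if y == 1 for t in d)

    def entropy(p):
        return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    h_class = entropy(n_pos / n)                                  # entropy of the class prior
    ig = {}
    for t, n_t in df.items():
        p_t = n_t / n                                             # P(term present)
        p_pos_t = df_pos[t] / n_t                                 # P(class = 1 | term present)
        p_pos_not_t = (n_pos - df_pos[t]) / (n - n_t) if n > n_t else 0.0
        ig[t] = h_class - p_t * entropy(p_pos_t) - (1 - p_t) * entropy(p_pos_not_t)

    by_df = [t for t, _ in df.most_common(top_k)]
    by_ig = sorted(ig, key=ig.get, reverse=True)[:top_k]
    return by_df, by_ig
```

Ranking by DF needs only the raw counts, which is what makes it so much cheaper than IG or CHI.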

5,366 citations


Journal ArticleDOI
01 May 1997
TL;DR: This survey identifies the future research areas in feature selection, introduces newcomers to this field, and paves the way for practitioners who search for suitable methods for solving domain-specific real-world applications.
Abstract: Feature selection has been the focus of interest for quite some time and much work has been done. With the creation of huge databases and the consequent requirements for good machine learning techniques, new problems arise and novel approaches to feature selection are in demand. This survey is a comprehensive overview of many existing methods from the 1970's to the present. It identifies four steps of a typical feature selection method, and categorizes the different existing methods in terms of generation procedures and evaluation functions, and reveals hitherto unattempted combinations of generation procedures and evaluation functions. Representative methods are chosen from each category for detailed explanation and discussion via example. Benchmark datasets with different characteristics are used for comparative study. The strengths and weaknesses of different methods are explained. Guidelines for applying feature selection methods are given based on data types and domain characteristics. This survey identifies the future research areas in feature selection, introduces newcomers to this field, and paves the way for practitioners who search for suitable methods for solving domain-specific real-world applications.
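To make the survey's four-step decomposition concrete, here is a minimal generic driver in which a generation procedure (greedy forward search) is paired with a pluggable evaluation function and a simple stopping criterion. This is an illustrative skeleton under those assumptions, not a method from the survey.

```python
def forward_select(features, evaluate, max_size=None):
    """Greedy forward generation procedure with a pluggable evaluation function.
    evaluate(subset) returns a score (higher is better): a filter measure or a
    wrapper-style estimate of predictive accuracy."""
    selected, best_score = [], float("-inf")
    remaining = list(features)
    while remaining and (max_size is None or len(selected) < max_size):
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, best_f = max(scored, key=lambda pair: pair[0])
        if score <= best_score:            # stopping criterion: no further improvement
            break
        selected.append(best_f)
        remaining.remove(best_f)
        best_score = score
    return selected, best_score            # the fourth step, validation, happens on held-out data
```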

3,174 citations


Journal ArticleDOI
TL;DR: Simulations indicate that the lasso can be more accurate than stepwise selection in this setting and reduce the estimation variance while providing an interpretable final model in Cox's proportional hazards model.
Abstract: I propose a new method for variable selection and shrinkage in Cox's proportional hazards model. My proposal minimizes the log partial likelihood subject to the sum of the absolute values of the parameters being bounded by a constant. Because of the nature of this constraint, it shrinks coefficients and produces some coefficients that are exactly zero. As a result it reduces the estimation variance while providing an interpretable final model. The method is a variation of the 'lasso' proposal of Tibshirani, designed for the linear regression context. Simulations indicate that the lasso can be more accurate than stepwise selection in this setting.
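In generic lasso notation (not taken from the paper), the constrained estimator described above can be written as:

```latex
\hat{\beta} \;=\; \arg\min_{\beta} \; -\,\ell(\beta)
\quad \text{subject to} \quad \sum_{j=1}^{p} \lvert \beta_j \rvert \le t ,
```

where $\ell(\beta)$ denotes Cox's log partial likelihood and the bound $t \ge 0$ controls how strongly the coefficients are shrunk towards zero, with some shrunk exactly to zero.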

3,004 citations


Journal ArticleDOI
TL;DR: This survey reviews work in machine learning on methods for handling data sets containing large amounts of irrelevant information and describes the advances that have been made in both empirical and theoretical work in this area.

2,869 citations


Journal ArticleDOI
TL;DR: This work studies the problem of choosing an optimal feature set for land use classification based on SAR satellite images using four different texture models, and shows that pooling features derived from different texture models, followed by feature selection, results in a substantial improvement in classification accuracy.

Abstract: A large number of algorithms have been proposed for feature subset selection. Our experimental results show that the sequential forward floating selection algorithm, proposed by Pudil et al. (1994), dominates the other algorithms tested. We study the problem of choosing an optimal feature set for land use classification based on SAR satellite images using four different texture models. Pooling features derived from different texture models, followed by feature selection, results in a substantial improvement in the classification accuracy. We also illustrate the dangers of using feature selection in small sample size situations.
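A schematic sketch of the sequential forward floating selection (SFFS) idea credited to Pudil et al. above: each greedy inclusion is followed by conditional exclusions, accepted only when they beat the best subset previously found at that smaller size. The criterion function J is a placeholder for whatever subset-quality measure is being maximized, and target_size is assumed to be at most the number of features.

```python
def sffs(features, J, target_size):
    """Sequential forward floating selection (schematic): greedy inclusion, then
    conditional exclusions accepted only if they beat the best subset previously
    found at that smaller size (the 'floating' rule)."""
    S, best = [], {}                                  # best[k] = best J over visited subsets of size k
    while len(S) < target_size:
        # inclusion step: add the feature that maximizes the criterion
        S.append(max((f for f in features if f not in S), key=lambda f: J(S + [f])))
        best[len(S)] = max(best.get(len(S), float("-inf")), J(S))
        # conditional exclusion steps
        while len(S) > 2:
            worst = max(S, key=lambda f: J([g for g in S if g != f]))
            reduced = [g for g in S if g != worst]
            if J(reduced) > best.get(len(reduced), float("-inf")):
                S = reduced
                best[len(S)] = J(S)
            else:
                break
    return S
```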

2,238 citations


Journal ArticleDOI
TL;DR: The use of a naive Bayesian classifier is described, and it is demonstrated that it can incrementally learn profiles from user feedback on the interestingness of Web sites and may easily be extended to revise user provided profiles.
Abstract: We discuss algorithms for learning and revising user profiles that can determine which World Wide Web sites on a given topic would be interesting to a user. We describe the use of a naive Bayesian classifier for this task, and demonstrate that it can incrementally learn profiles from user feedback on the interestingness of Web sites. Furthermore, the Bayesian classifier may easily be extended to revise user-provided profiles. In an experimental evaluation we compare the Bayesian classifier to computationally more intensive alternatives, and show that it performs at least as well as these approaches throughout a range of different domains. In addition, we empirically analyze the effects of providing the classifier with background knowledge in the form of user-defined profiles and examine the use of lexical knowledge for feature selection. We find that both approaches can substantially increase the prediction accuracy.
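A minimal sketch of the kind of incremental updating a naive Bayesian profile learner permits, assuming Boolean word features and a binary interesting/uninteresting label; the class and method names are illustrative, not the paper's.

```python
import math
from collections import defaultdict

class NaiveBayesProfile:
    """Naive Bayes over word features, updated one labelled page at a time."""
    def __init__(self):
        self.n = {0: 0, 1: 0}                                # pages seen per class
        self.word_counts = {0: defaultdict(int), 1: defaultdict(int)}

    def update(self, words, interesting):
        """Incremental learning step from one piece of user feedback."""
        y = 1 if interesting else 0
        self.n[y] += 1
        for w in set(words):
            self.word_counts[y][w] += 1

    def log_odds(self, words):
        """Log odds of 'interesting' vs 'uninteresting' given the observed words,
        with add-one smoothing of the per-class word probabilities."""
        score = math.log((self.n[1] + 1) / (self.n[0] + 1))  # smoothed class prior
        for w in set(words):
            p1 = (self.word_counts[1][w] + 1) / (self.n[1] + 2)
            p0 = (self.word_counts[0][w] + 1) / (self.n[0] + 2)
            score += math.log(p1 / p0)
        return score
```

Because the profile is just counts, revising a user-provided profile amounts to seeding those counts before feedback arrives.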

1,353 citations


Journal Article
TL;DR: The authors compare various hierarchical mixture prior formulations of variable selection uncertainty in normal linear regression models, including the nonconjugate SSVS formulation of George and McCulloch (1993), as well as conjugate formulations which allow for analytical simplification.
Abstract: This paper describes and compares various hierarchical mixture prior formulations of variable selection uncertainty in normal linear regression models. These include the nonconjugate SSVS formulation of George and McCulloch (1993), as well as conjugate formulations which allow for analytical simplification. Hyperparameter settings which base selection on practical significance, and the implications of using mixtures with point priors are discussed. Computational methods for posterior evaluation and exploration are considered. Rapid updating methods are seen to provide feasible methods for exhaustive evaluation using Gray Code sequencing in moderately sized problems, and fast Markov Chain Monte Carlo exploration in large problems. Estimation of normalization constants is seen to provide improved posterior estimates of individual model probabilities and the total visited probability. Various procedures are illustrated on simulated sample problems and on a real problem concerning the construction of financial index tracking portfolios.
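The hierarchical mixture ("spike-and-slab") prior at the heart of SSVS-type formulations can be written, in the usual generic notation, as:

```latex
\beta_j \mid \gamma_j \;\sim\; (1-\gamma_j)\,\mathcal{N}(0,\tau_j^2) \;+\; \gamma_j\,\mathcal{N}(0, c_j^2 \tau_j^2),
\qquad \gamma_j \sim \mathrm{Bernoulli}(p_j),
```

so that $\gamma_j = 1$ flags variable $j$ for inclusion, with $\tau_j$ small (the spike concentrating negligible coefficients near zero) and $c_j \gg 1$ widening the slab for coefficients of practical significance.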

1,291 citations


Journal ArticleDOI
TL;DR: A class of weight-setting methods for lazy learning algorithms which use performance feedback to assign weight settings demonstrated three advantages over other methods: they require less pre-processing, perform better in the presence of interacting features, and generally require less training data to learn good settings.
Abstract: Many lazy learning algorithms are derivatives of the k-nearest neighbor (k-NN) classifier, which uses a distance function to generate predictions from stored instances. Several studies have shown that k-NN's performance is highly sensitive to the definition of its distance function. Many k-NN variants have been proposed to reduce this sensitivity by parameterizing the distance function with feature weights. However, these variants have not been categorized nor empirically compared. This paper reviews a class of weight-setting methods for lazy learning algorithms. We introduce a framework for distinguishing these methods and empirically compare them. We observed four trends from our experiments and conducted further studies to highlight them. Our results suggest that methods which use performance feedback to assign weight settings demonstrated three advantages over other methods: they require less pre-processing, perform better in the presence of interacting features, and generally require less training data to learn good settings. We also found that continuous weighting methods tend to outperform feature selection algorithms for tasks where some features are useful but less important than others.
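The parameterization these variants share is, in essence, a feature-weighted distance (generic notation):

```latex
d(\mathbf{x}, \mathbf{q}) \;=\; \sqrt{\sum_{f=1}^{p} w_f \,\bigl(x_f - q_f\bigr)^2 },
```

where restricting the weights $w_f$ to $\{0,1\}$ recovers feature selection as the special case contrasted with continuous weighting in the final sentence above.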

762 citations


Proceedings ArticleDOI
01 Jul 1997
TL;DR: An automated learning approach to text categorization based on perceptron learning and a new feature selection metric, called correlation coefficient, is described, and empirical results indicate that this approach outperforms the best published results on the Reuters collection.
Abstract: In this paper, we describe an automated learning approach to text categorization based on perceptron learning and a new feature selection metric, called correlation coefficient. Our approach has been tested on the standard Reuters text categorization collection. Empirical results indicate that our approach outperforms the best published results on this Reuters collection. In particular, our new feature selection method yields considerable improvement. We also investigate the usability of our automated learning approach by actually developing a system that categorizes texts into a tree of categories. We compare the accuracy of our learning approach to a rule-based, expert system approach that uses a text categorization shell built by Carnegie Group. Although our automated learning approach still gives a lower accuracy, by appropriately incorporating a set of manually chosen words to use as features, the combined, semi-automated approach yields accuracy close to the rule-based approach.

521 citations


Journal ArticleDOI
TL;DR: This paper proposes the use of a three-layer feedforward neural network to select those input attributes that are most useful for discriminating classes in a given set of input patterns.
Abstract: Feature selection is an integral part of most learning algorithms. Due to the existence of irrelevant and redundant attributes, by selecting only the relevant attributes of the data, higher predictive accuracy can be expected from a machine learning method. In this paper, we propose the use of a three-layer feedforward neural network to select those input attributes that are most useful for discriminating classes in a given set of input patterns. A network pruning algorithm is the foundation of the proposed algorithm. By adding a penalty term to the error function of the network, redundant network connections can be distinguished from those relevant ones by their small weights when the network training process has been completed. A simple criterion to remove an attribute based on the accuracy rate of the network is developed. The network is retrained after removal of an attribute, and the selection process is repeated until no attribute meets the criterion for removal. Our experimental results suggest that the proposed method works very well on a wide variety of classification problems.
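A loose schematic of the iterative remove-and-retrain loop described above. It is not the paper's penalty-weight removal criterion: train_network, accuracy, the pandas-style column indexing of X, and the accuracy-drop threshold are all placeholder assumptions.

```python
def select_by_pruning(X, y, attributes, train_network, accuracy, max_drop=0.01):
    """Iteratively drop the attribute whose removal costs the least accuracy,
    retraining after each removal, until no attribute can be removed cheaply."""
    current = list(attributes)
    base = accuracy(train_network(X[current], y), X[current], y)   # training includes the penalty term
    while len(current) > 1:
        # score each attribute by the accuracy of a network retrained without it
        trials = {}
        for a in current:
            reduced = [b for b in current if b != a]
            trials[a] = accuracy(train_network(X[reduced], y), X[reduced], y)
        a_best = max(trials, key=trials.get)
        if base - trials[a_best] > max_drop:     # removal criterion not met: stop
            break
        current.remove(a_best)
        base = trials[a_best]                    # the retrained network becomes the new baseline
    return current
```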


Journal ArticleDOI
TL;DR: Chi2 is a simple and general algorithm that uses the χ² statistic to discretize numeric attributes repeatedly until some inconsistencies are found in the data and achieves feature selection via discretization.
Abstract: Discretization can turn numeric attributes into discrete ones. Feature selection can eliminate some irrelevant and/or redundant attributes. Chi2 is a simple and general algorithm that uses the χ² statistic to discretize numeric attributes repeatedly until some inconsistencies are found in the data. It achieves feature selection via discretization. It can handle mixed attributes, work with multiclass data, and remove irrelevant and redundant attributes.
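The core quantity in a Chi2-style discretizer is the χ² statistic of two adjacent intervals against the class, which decides whether they may be merged; a small helper for that statistic, with generic names, might look like this:

```python
def chi2_adjacent(counts_a, counts_b):
    """counts_a, counts_b: per-class example counts in two adjacent intervals.
    Returns the chi-square statistic used to decide whether to merge them."""
    n_a, n_b = sum(counts_a), sum(counts_b)
    total = n_a + n_b
    chi2 = 0.0
    for j in range(len(counts_a)):
        col = counts_a[j] + counts_b[j]                       # total count of class j in both intervals
        for observed, row_total in ((counts_a[j], n_a), (counts_b[j], n_b)):
            expected = row_total * col / total
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2
```

Adjacent intervals with a low statistic are indistinguishable with respect to the class and are merged; an attribute merged down to a single interval carries no class information and can be discarded, which is how discretization doubles as feature selection.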

Proceedings Article
01 Jan 1997
TL;DR: This paper describes a feature subset selector that uses a correlation-based heuristic to evaluate the goodness of feature subsets, and tests its effectiveness with three common ML algorithms: a decision tree inducer, a naive Bayes classifier, and an instance-based learner.
Abstract: Recent work has shown that feature subset selection can have a positive effect on the performance of machine learning algorithms. Some algorithms can be slowed or their performance degraded by features that are irrelevant or redundant to the learning task. Feature subset selection, then, is a method for enhancing the performance of learning algorithms, reducing the hypothesis search space, and, in some cases, reducing the storage requirement. This paper describes a feature subset selector that uses a correlation-based heuristic to evaluate the goodness of feature subsets, and evaluates its effectiveness with three common ML algorithms: a decision tree inducer (C4.5), a naive Bayes classifier, and an instance based learner (IB1). Experiments using a number of standard data sets drawn from real and artificial domains are presented. Feature subset selection gave significant improvement for all three algorithms; C4.5 generated smaller decision trees.
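The correlation-based heuristic used in this line of work scores a candidate subset $S$ of $k$ features by trading mean feature-class correlation against mean feature-feature correlation; a standard statement of the merit is:

```latex
\mathrm{Merit}(S) \;=\; \frac{k\,\overline{r}_{cf}}{\sqrt{\,k + k(k-1)\,\overline{r}_{ff}\,}},
```

where $\overline{r}_{cf}$ is the mean feature-class correlation over $S$ and $\overline{r}_{ff}$ is the mean pairwise feature-feature correlation, so the merit rewards features predictive of the class while penalizing redundancy among them.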

Journal ArticleDOI
TL;DR: For the Cardiovascular Health Study, Bayesian model averaging predictively outperforms standard model selection and does a better job of assessing who is at high risk for a stroke.
Abstract: In the context of the Cardiovascular Health Study, a comprehensive investigation into the risk factors for strokes, we apply Bayesian model averaging to the selection of variables in Cox proportional hazard models. We use an extension of the leaps-and-bounds algorithm for locating the models that are to be averaged over and make available S-PLUS software to implement the methods. Bayesian model averaging provides a posterior probability that each variable belongs in the model, a more directly interpretable measure of variable importance than a P-value. P-values from models preferred by stepwise methods tend to overstate the evidence for the predictive value of a variable and do not account for model uncertainty. We introduce the partial predictive score to evaluate predictive performance. For the Cardiovascular Health Study, Bayesian model averaging predictively outperforms standard model selection and does a better job of assessing who is at high risk for a stroke.
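The "posterior probability that each variable belongs in the model" mentioned above is simply the sum of posterior model probabilities over the models that contain that variable (generic notation):

```latex
\Pr(\beta_j \neq 0 \mid D) \;=\; \sum_{k:\; x_j \in M_k} \Pr(M_k \mid D),
```

which replaces a per-variable P-value with a quantity that averages over model uncertainty rather than conditioning on a single selected model.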

Journal ArticleDOI
TL;DR: This paper develops a technique called “racing” that tests the set of models in parallel, quickly discards those models that are clearly inferior and concentrates the computational effort on differentiating among the better models.
Abstract: Given a set of models and some training data, we would like to find the model that best describes the data. Finding the model with the lowest generalization error is a computationally expensive process, especially if the number of testing points is high or if the number of models is large. Optimization techniques such as hill climbing or genetic algorithms are helpful but can end up with a model that is arbitrarily worse than the best one or cannot be used because there is no distance metric on the space of discrete models. In this paper we develop a technique called "racing" that tests the set of models in parallel, quickly discards those models that are clearly inferior and concentrates the computational effort on differentiating among the better models. Racing is especially suitable for selecting among lazy learners since training requires negligible expense, and incremental testing using leave-one-out cross validation is efficient. We use racing to select among various lazy learning algorithms and to find relevant features in applications ranging from robot juggling to lesion detection in MRI scans.
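A compressed sketch of the racing idea: all candidate models are tested in parallel, one held-out point at a time, and a model is discarded as soon as its confidence interval is clearly worse than the current leader's. The Hoeffding-style bound used here is one common choice and is an assumption, not necessarily the exact bound of the paper.

```python
import math

def race(models, points, test_error, delta=0.05, error_range=1.0):
    """models: candidate models; points: evaluation points (e.g. leave-one-out folds);
    test_error(model, point) -> error in [0, error_range]. Returns the surviving models."""
    alive = {m: [] for m in models}
    for i, point in enumerate(points, start=1):
        for m in list(alive):
            alive[m].append(test_error(m, point))
        # Hoeffding-style half-width of the confidence interval after i tests
        eps = error_range * math.sqrt(math.log(2 * len(models) / delta) / (2 * i))
        means = {m: sum(errs) / i for m, errs in alive.items()}
        best_upper = min(means.values()) + eps
        # discard models whose optimistic (lower) bound is already worse than the leader's upper bound
        for m in list(alive):
            if means[m] - eps > best_upper:
                del alive[m]
        if len(alive) == 1:
            break
    return list(alive)
```

Feature selection fits the same mold by treating each candidate feature subset (with a fixed lazy learner) as one of the racing models.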

01 Apr 1997
TL;DR: In this paper, the problem of discriminating between two finite point sets in n-dimensional feature space by a separating plane that utilizes as few of the features as possible is formulated as a mathematical program with a parametric objective function and linear constraints.
Abstract: The problem of discriminating between two finite point sets in n-dimensional feature space by a separating plane that utilizes as few of the features as possible is formulated as a mathematical program with a parametric objective function and linear constraints. The step function that appears in the objective function can be approximated by a sigmoid or by a concave exponential on the nonnegative real line, or it can be treated exactly by considering the equivalent linear program with equilibrium constraints. Computational tests of these three approaches on publicly available real-world databases have been carried out and compared with an adaptation of the optimal brain damage method for reducing neural network complexity. One feature selection algorithm via concave minimization reduced cross-validation error on a cancer prognosis database by 35.4% while reducing problem features from 32 to 4.

Proceedings ArticleDOI
03 Nov 1997
TL;DR: This paper proposes an entropy measure for ranking features, and conducts extensive experiments to show that the method is able to find the important features and compares well with a similar feature ranking method that requires class information unlike this method.
Abstract: Dimensionality reduction is an important problem for efficient handling of large databases. Many feature selection methods exist for supervised data having class information. Little work has been done for dimensionality reduction of unsupervised data in which class information is not available. Principal component analysis (PCA) is often used. However, PCA creates new features. It is difficult to obtain intuitive understanding of the data using the new features only. We are concerned with the problem of determining and choosing the important original features for unsupervised data. Our method is based on the observation that removing an irrelevant feature from the feature set may not change the underlying concept of the data, but not so otherwise. We propose an entropy measure for ranking features, and conduct extensive experiments to show that our method is able to find the important features. Also it compares well with a similar feature ranking method (Relief) that requires class information unlike our method.
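One common form of the similarity-based entropy used for this kind of unsupervised ranking (the exact weighting in the paper may differ) is:

```latex
E \;=\; -\sum_{i=1}^{N}\sum_{j=1}^{N} \Bigl[\, S_{ij}\log S_{ij} + (1 - S_{ij})\log\bigl(1 - S_{ij}\bigr) \Bigr],
\qquad S_{ij} = e^{-\alpha\, d_{ij}},
```

where $d_{ij}$ is the distance between instances $i$ and $j$. Each feature is then ranked by how much $E$ changes when that feature is removed: dropping an irrelevant feature leaves the entropy, and hence the underlying structure of the data, essentially unchanged.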

Proceedings ArticleDOI
22 Jul 1997
TL;DR: The advanced mine detection and classification (AMDAC) algorithm consists of an improved detection density algorithm, a classification feature extractor that uses a stepwise feature selection strategy, a k-nearest neighbor attractor-based neural network (KNN) classifier, and an optimal discriminatory filter classifier.
Abstract: An advanced capability for automated detection and classification of sea mines in sonar imagery has been developed. The advanced mine detection and classification (AMDAC) algorithm consists of an improved detection density algorithm, a classification feature extractor that uses a stepwise feature selection strategy, a k-nearest neighbor attractor-based neural network (KNN) classifier, and an optimal discriminatory filter classifier. The detection stage uses a nonlinear matched filter to identify mine-size regions in the sonar image that closely match a mine's signature. For each detected mine-like region, the feature extractor calculates a large set of candidate classification features. A stepwise feature selection process then determines the subset of features that optimizes probability of detection and probability of classification for each of the classifiers while minimizing false alarms.

Journal ArticleDOI
TL;DR: A novel approach is proposed that purposely tolerates a small error in the training process in order to avoid overfitting data that may contain errors and is utilized to discover very useful survival curves for breast cancer patients from a medical database.
Abstract: Mathematical programming approaches to three fundamental problems will be described: feature selection, clustering and robust representation. The feature selection problem considered is that of discriminating between two sets while recognizing irrelevant and redundant features and suppressing them. This creates a lean model that often generalizes better to new unseen data. Computational results on real data confirm improved generalization of leaner models. Clustering is exemplified by the unsupervised learning of patterns and clusters that may exist in a given database and is a useful tool for knowledge discovery in databases (KDD). A mathematical programming formulation of this problem is proposed that is theoretically justifiable and computationally implementable in a finite number of steps. A resulting k-Median Algorithm is utilized to discover very useful survival curves for breast cancer patients from a medical database. Robust representation is concerned with minimizing trained model degradation when applied to new problems. A novel approach is proposed that purposely tolerates a small error in the training process in order to avoid overfitting data that may contain errors. Examples of applications of these concepts are given.

Journal ArticleDOI
TL;DR: This paper presents a novel functional analysis of the weight matrix based on a technique developed for determining the behavioral significance of hidden neurons, compared with the application of the same technique to the training and test data.
Abstract: The problem of data encoding and feature selection for training back-propagation neural networks is well known. The basic principles are to avoid encrypting the underlying structure of the data, and to avoid using irrelevant inputs. This is not easy in the real world, where we often receive data which has been processed by at least one previous user. The data may contain too many instances of some class, and too few instances of other classes. Real data sets often include many irrelevant or redundant input fields. This paper examines the use of weight matrix analysis techniques and functional measures using two real (and hence noisy) data sets. The first part of this paper examines the use of the weight matrix of the trained neural network itself to determine which inputs are significant. A new technique is introduced and compared with two other techniques from the literature. We present our experience and results on some satellite data augmented by a terrain model. The task was to predict the forest supra-type based on the available information. A brute force technique eliminating randomly selected inputs was used to validate our approach. The second part of this paper examines the use of measures to determine the functional contribution of inputs to outputs. Inputs which include minor but unique information to the network are more significant than inputs with higher magnitude contribution but providing redundant information, which is also provided by another input. A comparison is made to sensitivity analysis, where the sensitivity of outputs to input perturbation is used as a measure of the significance of inputs. This paper presents a novel functional analysis of the weight matrix based on a technique developed for determining the behavioral significance of hidden neurons. This is compared with the application of the same technique to the training and test data. Finally, a novel aggregation technique is introduced.

Journal ArticleDOI
Se June Hong
TL;DR: A new approach to deriving classification rules or decision trees from examples assigns merit to each feature by finding that feature's "obligation" to the class discrimination in the context of other features, and is a powerful alternative to the traditional methods.
Abstract: Deriving classification rules or decision trees from examples is an important problem. When there are too many features, discarding weak features before the derivation process is highly desirable. When there are numeric features, they need to be discretized for the rule generation. We present a new approach to these problems. Traditional techniques make use of feature merits based on either the information theoretic, or the statistical correlation between each feature and the class. We instead assign merits to features by finding each feature's "obligation" to the class discrimination in the context of other features. The merits are then used to rank the features, select a feature subset, and discretize the numeric variables. Experience with benchmark example sets demonstrates that the new approach is a powerful alternative to the traditional methods. This paper concludes by posing some new technical issues that arise from this approach.

Proceedings Article
08 Jul 1997
TL;DR: An implementation of a sequential feature selection algorithm based on an existing conceptual clustering system is described and an implementation which employs a technique for improving the efficiency of the search for an optimal description is presented.
Abstract: Feature selection has proven to be a valuable technique in supervised learning for improving predictive accuracy while reducing the number of attributes considered in a task. We investigate the potential for similar benefits in an unsupervised learning task, conceptual clustering. The issues raised in feature selection by the absence of class labels are discussed and an implementation of a sequential feature selection algorithm based on an existing conceptual clustering system is described. Additionally, we present a second implementation which employs a technique for improving the efficiency of the search for an optimal description and compare the performance of both algorithms.

Journal ArticleDOI
01 Jan 1997-Analyst
TL;DR: In this article, a new method for the selection of wavelengths from near infrared spectra using partial least squares (PLS) analysis is presented, aiming to find wavelengths that produce significant improvements in PLS prediction accuracy over using all wavelengths.
Abstract: A new method for the selection of wavelengths from near infrared spectra using partial least squares (PLS) analysis is presented. The method aims to find wavelengths that produce significant improvements in PLS prediction accuracy over using all wavelengths. The method is based on data splitting and evaluation of the appropriate prediction errors. Analyses of interactance spectra of kiwifruit using three evaluation criteria are compared with the results obtained from full spectrum analysis and with a recently proposed feature selection method. Using the recommended criterion, the method was found to produce models with lower standard errors than the optimum model obtained using the feature selection method 87% of the time. Properties of initiating the search method from starting points selected by three procedures are compared and recommendations are given for selecting the initial wavelengths. The new search method also has a low probability of obtaining significant correlations through chance.

Journal ArticleDOI
TL;DR: By combining features derived from different texture models, the classification accuracy increased significantly and the discrimination ability of four different methods for texture computation in ERS SAR imagery was examined and compared.
Abstract: The discrimination ability of four different methods for texture computation in ERS SAR imagery is examined and compared. Feature selection methodology and discriminant analysis are applied to find the optimal combination of texture features. By combining features derived from different texture models, the classification accuracy increased significantly.

Book ChapterDOI
TL;DR: Experiments show that RC almost always improves accuracy with respect to FSS and BSS, and a study using artificial domains confirms the hypothesis that this difference in performance is due to RC's context sensitivity, and suggests conditions where this sensitivity will and will not be an advantage.
Abstract: High sensitivity to irrelevant features is arguably the main shortcoming of simple lazy learners. In response to it, many feature selection methods have been proposed, including forward sequential selection (FSS) and backward sequential selection (BSS). Although they often produce substantial improvements in accuracy, these methods select the same set of relevant features everywhere in the instance space, and thus represent only a partial solution to the problem. In general, some features will be relevant only in some parts of the space; deleting them may hurt accuracy in those parts, but selecting them will have the same effect in parts where they are irrelevant. This article introduces RC, a new feature selection algorithm that uses a clustering-like approach to select sets of locally relevant features (i.e., the features it selects may vary from one instance to another). Experiments in a large number of domains from the UCI repository show that RC almost always improves accuracy with respect to FSS and BSS, often with high significance. A study using artificial domains confirms the hypothesis that this difference in performance is due to RC’s context sensitivity, and also suggests conditions where this sensitivity will and will not be an advantage. Another feature of RC is that it is faster than FSS and BSS, often by an order of magnitude or more.

Proceedings ArticleDOI
13 Apr 1997
TL;DR: This paper presents a genetic algorithm for feature selection that improves on previous genetic-based feature selection results in the literature; it is independent of a specific learning algorithm and requires less CPU time to reach a relevant subset of features.
Abstract: The goal of the feature selection process is, given a dataset described by n attributes (features), to find the minimum number m of relevant attributes which describe the data as well as the original set of attributes do. Genetic algorithms have been used to implement feature selection algorithms. Previous algorithms presented in the literature used the predictive accuracy of a specific learning algorithm as the fitness function to maximize over the space of possible feature subsets. Such an approach to feature selection requires a large amount of CPU time to reach a good solution on large datasets. This paper presents a genetic algorithm for feature selection which improves previous results presented in the literature for genetic-based feature selection. It is independent of a specific learning algorithm and requires less CPU time to reach a relevant subset of features. Reported experiments show that the proposed algorithm is at least ten times faster than a standard genetic algorithm for feature selection without a loss of predictive accuracy when a learning algorithm is applied to reduced data.
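A bare-bones sketch of a genetic algorithm over bit-string feature masks. Consistent with the learning-algorithm-independent approach described above, the fitness is left as any criterion computable without running an inducer (for example a filter measure minus a subset-size penalty); every name and parameter value here is illustrative.

```python
import random

def ga_feature_selection(n_features, fitness, pop_size=50, generations=100,
                         crossover_rate=0.8, mutation_rate=0.01):
    """fitness(mask) -> score for a tuple of 0/1 bits; higher is better and should
    reward small, informative subsets."""
    pop = [tuple(random.randint(0, 1) for _ in range(n_features)) for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        next_pop = scored[:2]                                     # elitism: keep the two best masks
        while len(next_pop) < pop_size:
            p1, p2 = random.sample(scored[:pop_size // 2], 2)     # truncation selection
            if random.random() < crossover_rate:                  # one-point crossover
                cut = random.randrange(1, n_features)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1
            child = tuple(b ^ (random.random() < mutation_rate) for b in child)  # bit-flip mutation
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)                                  # best mask found
```

Because the fitness never trains a classifier, each generation is cheap, which is the source of the reported speed advantage over wrapper-style genetic selection.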

Journal ArticleDOI
TL;DR: Empirical investigations show that the proposed MLP-based scheme is superior to the other schemes implemented.

Journal ArticleDOI
TL;DR: A measure of the saliency of the input variables that is based upon the connection weights of the neural network is examined, and it is found that the method works quite well in identifying significant variables under a variety of experimental conditions, including neural network architectures and data configurations.

Journal ArticleDOI
TL;DR: This work examines various important concepts and approaches that are used for modelling a target attribute by other attributes in the data and contrasts their strengths.