
Showing papers on "Feature selection published in 1999"


Book
10 Sep 1999
TL;DR: This book provides an introduction to statistical pattern recognition, covering density estimation, linear and nonlinear discriminant analysis (via both neural networks and statistical methods), classification trees, feature selection and extraction, and clustering.
Abstract: Introduction to statistical pattern recognition * Estimation * Density estimation * Linear discriminant analysis * Nonlinear discriminant analysis - neural networks * Nonlinear discriminant analysis - statistical methods * Classification trees * Feature selection and extraction * Clustering * Additional topics * Measures of dissimilarity * Parameter estimation * Linear algebra * Data * Probability theory

1,813 citations


01 Apr 1999
TL;DR: This paper describes a fast, correlation-based filter algorithm that can be applied to continuous and discrete problems and performs more feature selection than ReliefF does—reducing the data dimensionality by fifty percent in most cases.
Abstract: Algorithms for feature selection fall into two broad categories: wrappers that use the learning algorithm itself to evaluate the usefulness of features and filters that evaluate features according to heuristics based on general characteristics of the data. For application to large databases, filters have proven to be more practical than wrappers because they are much faster. However, most existing filter algorithms only work with discrete classification problems. This paper describes a fast, correlation-based filter algorithm that can be applied to continuous and discrete problems. The algorithm often outperforms the well-known ReliefF attribute estimator when used as a preprocessing step for naive Bayes, instance-based learning, decision trees, locally weighted regression, and model trees. It performs more feature selection than ReliefF does—reducing the data dimensionality by fifty percent in most cases. Also, decision and model trees built from the preprocessed data are often significantly smaller.

1,653 citations


Journal ArticleDOI
01 Jun 1999
TL;DR: An algorithmic framework for solving the projected clustering problem, in which the subsets of dimensions selected are specific to the clusters themselves, is developed and tested.
Abstract: The clustering problem is well known in the database literature for its numerous applications in problems such as customer segmentation, classification and trend analysis. Unfortunately, all known algorithms tend to break down in high dimensional spaces because of the inherent sparsity of the points. In such high dimensional spaces not all dimensions may be relevant to a given cluster. One way of handling this is to pick the closely correlated dimensions and find clusters in the corresponding subspace. Traditional feature selection algorithms attempt to achieve this. The weakness of this approach is that in typical high dimensional data mining applications different sets of points may cluster better for different subsets of dimensions. The number of dimensions in each such cluster-specific subspace may also vary. Hence, it may be impossible to find a single small subset of dimensions for all the clusters. We therefore discuss a generalization of the clustering problem, referred to as the projected clustering problem, in which the subsets of dimensions selected are specific to the clusters themselves. We develop an algorithmic framework for solving the projected clustering problem, and test its performance on synthetic data.
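The abstract describes projected clustering only at a high level. As a rough illustration of the core idea (each cluster keeps its own subset of dimensions), the sketch below alternates between assigning points to medoids and picking, for each cluster, the dimensions with the smallest within-cluster spread. This is a simplified analogue, not the authors' algorithm, and all names and parameter values are illustrative.

```python
import numpy as np

def projected_clusters(X, k, dims_per_cluster, n_iter=10, seed=0):
    """Toy projected clustering: k medoids, each with its own dimension subset.

    A simplified illustration of cluster-specific dimension selection,
    not the algorithm from the paper.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    medoid_idx = rng.choice(n, size=k, replace=False)
    dims = [np.arange(d)] * k            # start with all dimensions per cluster

    for _ in range(n_iter):
        # Assign each point to the medoid that is closest in that medoid's subspace.
        dists = np.stack([
            np.abs(X[:, dims[j]] - X[medoid_idx[j], dims[j]]).mean(axis=1)
            for j in range(k)
        ], axis=1)
        labels = dists.argmin(axis=1)

        # For each cluster, keep the dimensions with the smallest average spread.
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue
            spread = np.abs(members - X[medoid_idx[j]]).mean(axis=0)
            dims[j] = np.argsort(spread)[:dims_per_cluster]

    return labels, dims

# Usage: two clusters that are tight in different pairs of dimensions.
rng = np.random.default_rng(1)
X = np.vstack([
    np.c_[rng.normal(0, 0.1, (50, 2)), rng.uniform(-5, 5, (50, 2))],
    np.c_[rng.uniform(-5, 5, (50, 2)), rng.normal(3, 0.1, (50, 2))],
])
labels, dims = projected_clusters(X, k=2, dims_per_cluster=2)
print(dims)  # ideally each cluster picks the dimensions in which it is tight
```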

1,111 citations


Book
Shigeo Abe1
26 Oct 1999
TL;DR: This book presents architectures for multiclass classification and function approximation problems, as well as evaluation criteria for classifiers and regressors, and discusses kernel methods for improving the generalization ability of neural networks and fuzzy systems.
Abstract: A guide on the use of SVMs in pattern classification, including a rigorous performance comparison of classifiers and regressors. The book presents architectures for multiclass classification and function approximation problems, as well as evaluation criteria for classifiers and regressors. Features: Clarifies the characteristics of two-class SVMs; Discusses kernel methods for improving the generalization ability of neural networks and fuzzy systems; Contains ample illustrations and examples; Includes performance evaluation using publicly available data sets; Examines Mahalanobis kernels, empirical feature space, and the effect of model selection by cross-validation; Covers sparse SVMs, learning using privileged information, semi-supervised learning, multiple classifier systems, and multiple kernel learning; Explores incremental training based on batch training and active-set training methods, and decomposition techniques for linear programming SVMs; Discusses variable selection for support vector regressors.

1,002 citations


Proceedings ArticleDOI
01 Aug 1999
TL;DR: An unsupervised, near-linear time text clustering system that offers a number of algorithm choices for each phase, and a refinement to center adjustment, “vector average damping,” that further improves cluster quality.
Abstract: Clustering is a powerful technique for large-scale topic discovery from text. It involves two phases: first, feature extraction maps each document or record to a point in high-dimensional space, then clustering algorithms automatically group the points into a hierarchy of clusters. We describe an unsupervised, near-linear time text clustering system that offers a number of algorithm choices for each phase. We introduce a methodology for measuring the quality of a cluster hierarchy in terms of F-measure, and present the results of experiments comparing different algorithms. The evaluation considers some feature selection parameters (tf-idf and feature vector length) but focuses on the clustering algorithms, namely techniques from Scatter/Gather (buckshot, fractionation, and split/join) and k-means. Our experiments suggest that continuous center adjustment contributes more to cluster quality than seed selection does. It follows that using a simpler seed selection algorithm gives a better time/quality tradeoff. We describe a refinement to center adjustment, “vector average damping,” that further improves cluster quality. We also compare the near-linear time algorithms to a group average greedy agglomerative clustering algorithm to demonstrate the time/quality tradeoff quantitatively.
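As a rough illustration of the two phases described above (feature extraction into a high-dimensional vector space, then clustering), the sketch below uses scikit-learn's tf-idf vectorizer and plain k-means. It does not implement the buckshot/fractionation seeding or the vector-average-damping refinement from the paper, and the sample documents are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "stock markets fell sharply on interest rate fears",
    "central bank raises interest rates again",
    "the team won the championship after extra time",
    "injury forces star striker out of the final match",
]

# Phase 1: feature extraction -- map each document to a point in tf-idf space.
vectorizer = TfidfVectorizer(stop_words="english", max_features=1000)
X = vectorizer.fit_transform(docs)

# Phase 2: clustering -- plain k-means with random seeding stands in for
# the buckshot/fractionation variants discussed in the paper.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)
print(labels)  # e.g. [0 0 1 1]: finance vs. sport
```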

958 citations


Proceedings Article
01 May 1999
TL;DR: A new filter approach to feature selection that uses a correlation-based heuristic to evaluate the worth of feature subsets when applied as a data preprocessing step for two common machine learning algorithms.
Abstract: Feature selection is often an essential data processing step prior to applying a learning algorithm. The removal of irrelevant and redundant information often improves the performance of machine learning algorithms. There are two common approaches: a wrapper uses the intended learning algorithm itself to evaluate the usefulness of features, while a filter evaluates features according to heuristics based on general characteristics of the data. The wrapper approach is generally considered to produce better feature subsets but runs much more slowly than a filter. This paper describes a new filter approach to feature selection that uses a correlation-based heuristic to evaluate the worth of feature subsets. When applied as a data preprocessing step for two common machine learning algorithms, the new method compares favourably with the wrapper but requires much less computation.
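The correlation-based heuristic is described only verbally above. The merit score commonly associated with correlation-based feature selection rewards subsets whose features correlate strongly with the class but weakly with each other. Below is a minimal sketch under that assumption, using Pearson correlation as the association measure and a greedy forward search; all names are illustrative and this is not the paper's exact procedure.

```python
import numpy as np

def merit(X, y, subset):
    """Correlation-based merit of a feature subset.

    merit = k * r_cf / sqrt(k + k*(k-1) * r_ff), where r_cf is the mean
    absolute feature-class correlation and r_ff the mean absolute
    feature-feature correlation within the subset.
    """
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, f], y)[0, 1]) for f in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def greedy_cfs(X, y):
    """Greedy forward search that adds the feature improving the merit most."""
    remaining, chosen, best = list(range(X.shape[1])), [], -np.inf
    while remaining:
        score, f = max((merit(X, y, chosen + [f]), f) for f in remaining)
        if score <= best:
            break
        best, chosen = score, chosen + [f]
        remaining.remove(f)
    return chosen

# Usage: features 0 and 1 carry the signal, 2 is redundant with 0, 3 is noise.
rng = np.random.default_rng(0)
f0, f1 = rng.normal(size=200), rng.normal(size=200)
X = np.c_[f0, f1, f0 + 0.01 * rng.normal(size=200), rng.normal(size=200)]
y = (f0 + f1 > 0).astype(float)
print(greedy_cfs(X, y))  # expected to favour features 0 and 1
```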

547 citations


Proceedings Article
27 Jun 1999
TL;DR: This paper describes an approach to feature subset selection that takes into account problem specifics and learning algorithm characteristics, and shows that considering domain and algorithm characteristics significantly improves the results of classification.
Abstract: This paper describes an approach to feature subset selection that takes into account problem specifics and learning algorithm characteristics. It is developed for the Naive Bayesian classifier applied on text data, since it combines well with the addressed learning problems. We focus on domains with many features that also have a highly unbalanced class distribution and asymmetric misclassification costs given only implicitly in the problem. By asymmetric misclassification costs we mean that one of the class values is the target class value for which we want to get predictions, and we prefer false positives over false negatives. Our example problem is automatic document categorization using machine learning, where we want to identify documents relevant for the selected category. Usually, only about 1%-10% of examples belong to the selected category. Our experimental comparison of eleven feature scoring measures shows that considering domain and algorithm characteristics significantly improves the results of classification.
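The abstract compares eleven scoring measures without spelling any of them out. One measure often used in this unbalanced, target-class setting is the odds ratio, which favours words that are frequent in the positive (target) class and rare in the negative class. The sketch below computes it from word document frequencies with add-one smoothing; the exact measures and smoothing in the paper may differ, and the tiny corpus is made up.

```python
import math
from collections import Counter

def odds_ratio_scores(docs, labels):
    """Score each word by the log odds ratio of appearing in positive vs. negative docs.

    docs: list of token lists; labels: 1 for the target class, 0 otherwise.
    Add-one smoothing keeps the ratio finite for words seen in only one class.
    """
    pos_docs = [set(d) for d, l in zip(docs, labels) if l == 1]
    neg_docs = [set(d) for d, l in zip(docs, labels) if l == 0]
    pos_df, neg_df = Counter(), Counter()
    for d in pos_docs:
        pos_df.update(d)
    for d in neg_docs:
        neg_df.update(d)
    scores = {}
    for w in set(pos_df) | set(neg_df):
        p_pos = (pos_df[w] + 1) / (len(pos_docs) + 2)
        p_neg = (neg_df[w] + 1) / (len(neg_docs) + 2)
        scores[w] = math.log((p_pos * (1 - p_neg)) / ((1 - p_pos) * p_neg))
    return scores

# Usage: tiny corpus where "refund" marks the rare target class.
docs = [["refund", "invoice"], ["refund", "claim"],
        ["meeting", "agenda"], ["lunch", "menu"], ["meeting", "notes"]]
labels = [1, 1, 0, 0, 0]
scores = odds_ratio_scores(docs, labels)
print(sorted(scores, key=scores.get, reverse=True)[:3])
```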

464 citations


Journal ArticleDOI
TL;DR: It is concluded that stepwise selection may result in a substantial bias of estimated regression coefficients of selected covariables, similar to that found in the GUSTO-I trial.

451 citations


Journal ArticleDOI
TL;DR: A segmented, and possibly multistage, principal components transformation (PCT) is proposed for efficient hyperspectral remote-sensing image classification and display and results have been obtained in terms of classification accuracy, speed, and quality of color image display using two airborne visible/infrared imaging spectrometer (AVIRIS) data sets.
Abstract: A segmented, and possibly multistage, principal components transformation (PCT) is proposed for efficient hyperspectral remote-sensing image classification and display. The scheme requires, initially, partitioning the complete set of bands into several highly correlated subgroups. After separate transformation of each subgroup, the single-band separabilities are used as a guide to carry out feature selection. The selected features can then be transformed again to achieve a satisfactory data reduction ratio and generate the three most significant components for color display. The scheme reduces the computational load significantly for feature extraction, compared with the conventional PCT. A reduced number of features will also accelerate the maximum likelihood classification process significantly, and the process will not suffer the limitations encountered by trying to use the full set of hyperspectral data when training samples are limited. Encouraging results have been obtained in terms of classification accuracy, speed, and quality of color image display using two airborne visible/infrared imaging spectrometer (AVIRIS) data sets.
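A compact way to picture the segmented PCT is to partition the bands into groups, run an ordinary PCA within each group, and keep only the leading components of each group before any further transformation or feature selection. The sketch below does just that on synthetic data; in the paper the grouping is driven by inter-band correlation rather than the fixed, equal-size segments assumed here.

```python
import numpy as np

def segmented_pct(X, n_groups, comps_per_group):
    """First stage of a segmented principal components transformation.

    X: (n_pixels, n_bands). Bands are split into contiguous groups, each group
    is transformed separately, and the leading components of each group are kept.
    """
    groups = np.array_split(np.arange(X.shape[1]), n_groups)
    features = []
    for g in groups:
        Xg = X[:, g] - X[:, g].mean(axis=0)
        # PCA of the group via eigendecomposition of its covariance matrix.
        cov = np.cov(Xg, rowvar=False)
        vals, vecs = np.linalg.eigh(cov)
        order = np.argsort(vals)[::-1][:comps_per_group]
        features.append(Xg @ vecs[:, order])
    return np.hstack(features)

# Usage: 500 "pixels" with 60 highly correlated synthetic bands.
rng = np.random.default_rng(1)
base = rng.normal(size=(500, 6))
X = np.repeat(base, 10, axis=1) + 0.05 * rng.normal(size=(500, 60))
reduced = segmented_pct(X, n_groups=3, comps_per_group=2)
print(reduced.shape)  # (500, 6)
```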

408 citations


Proceedings ArticleDOI
29 Jan 1999
TL;DR: The Support Vector Machine (SVM) as discussed by the authors is a new way to design classification algorithms which learn from examples (supervised learning) and generalize when applied to new data.
Abstract: The Support Vector Machine provides a new way to design classification algorithms which learn from examples (supervised learning) and generalize when applied to new data. We demonstrate its success on a difficult classification problem from hyperspectral remote sensing, where we obtain performances of 96%, and 87% correct for a 4 class problem, and a 16 class problem respectively. These results are somewhat better than other recent results on the same data. A key feature of this classifier is its ability to use high-dimensional data without the usual recourse to a feature selection step to reduce the dimensionality of the data. For this application, this is important, as hyperspectral data consists of several hundred contiguous spectral channels for each exemplar. We provide an introduction to this new approach, and demonstrate its application to classification of an agriculture scene.
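To make the classifier's key property concrete (it consumes hundreds of spectral channels directly, with no feature selection step), here is a minimal sketch with scikit-learn's SVC; since the AVIRIS data are not reproduced here, synthetic stand-in data are used and the numbers will not match the paper's.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Synthetic stand-in for hyperspectral pixels: 4 classes, 200 "spectral channels".
rng = np.random.default_rng(0)
n_per_class, n_channels = 100, 200
means = rng.normal(scale=2.0, size=(4, n_channels))
X = np.vstack([m + rng.normal(size=(n_per_class, n_channels)) for m in means])
y = np.repeat(np.arange(4), n_per_class)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# The SVM is applied to the full-dimensional data directly -- no prior
# feature selection or dimensionality reduction step.
clf = SVC(kernel="rbf", C=10.0, gamma="scale").fit(X_tr, y_tr)
print(f"test accuracy: {clf.score(X_te, y_te):.2f}")
```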

383 citations


Proceedings Article
18 Jul 1999
TL;DR: This paper presents an ensemble feature selection approach that is based on genetic algorithms and shows improved performance over the popular and powerful ensemble approaches of AdaBoost and Bagging and demonstrates the utility of ensemble features selection.
Abstract: The traditional motivation behind feature selection algorithms is to find the best subset of features for a task using one particular learning algorithm. Given the recent success of ensembles, however, we investigate the notion of ensemble feature selection in this paper. This task is harder than traditional feature selection in that one not only needs to find features germane to the learning task and learning algorithm, but one also needs to find a set of feature subsets that will promote disagreement among the ensemble's classifiers. In this paper, we present an ensemble feature selection approach that is based on genetic algorithms. Our algorithm shows improved performance over the popular and powerful ensemble approaches of AdaBoost and Bagging and demonstrates the utility of ensemble feature selection.
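The abstract describes the approach only at a high level. A stripped-down sketch of the genetic-algorithm machinery is shown below, applied to selecting a single feature subset rather than the paper's one-subset-per-ensemble-member setup, and scoring each candidate by validation accuracy alone (the paper's fitness also rewards diversity among members). All parameter values are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           n_redundant=5, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
rng = np.random.default_rng(0)

def fitness(mask):
    """Validation accuracy of a tree trained on the features selected by mask."""
    if mask.sum() == 0:
        return 0.0
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr[:, mask], y_tr)
    return clf.score(X_val[:, mask], y_val)

# Simple generational GA over feature bit-masks.
pop = rng.integers(0, 2, size=(20, X.shape[1])).astype(bool)
for generation in range(15):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-10:]]            # keep the best half
    children = []
    while len(children) < len(pop) - len(parents):
        a, b = parents[rng.integers(len(parents), size=2)]
        cut = rng.integers(1, X.shape[1])               # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(X.shape[1]) < 0.05            # bit-flip mutation
        children.append(np.where(flip, ~child, child))
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected features:", np.flatnonzero(best))
```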

Journal ArticleDOI
TL;DR: A new suboptimal search strategy for feature selection that represents a more sophisticated version of “classical” floating search algorithms and facilitates finding a solution even closer to the optimal one.

Proceedings Article
29 Nov 1999
TL;DR: The efficacy of the methods is illustrated on a radar signal analysis problem to find 2-D viewing coordinates for data visualization and to select inputs for a neural network classifier.
Abstract: Data visualization and feature selection methods are proposed based on the joint mutual information and ICA. The visualization methods can find many good 2-D projections for high dimensional data interpretation, which cannot be easily found by the other existing methods. The new variable selection method is found to be better in eliminating redundancy in the inputs than other methods based on simple mutual information. The efficacy of the methods is illustrated on a radar signal analysis problem to find 2-D viewing coordinates for data visualization and to select inputs for a neural network classifier.
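As a rough, simplified counterpart to the variable selection described above, the sketch below ranks inputs by their estimated mutual information with the class using scikit-learn. The paper's method goes further by using joint mutual information (and ICA) to avoid picking redundant inputs, which this one-variable-at-a-time ranking does not capture; the data here are synthetic.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Synthetic inputs: x0 and x1 determine the class, x2 duplicates x0, x3 is noise.
rng = np.random.default_rng(0)
x0, x1 = rng.normal(size=1000), rng.normal(size=1000)
X = np.c_[x0, x1, x0 + 0.01 * rng.normal(size=1000), rng.normal(size=1000)]
y = (x0 + x1 > 0).astype(int)

mi = mutual_info_classif(X, y, random_state=0)
print(np.round(mi, 3))
# A plain MI ranking scores the redundant x2 almost as high as x0 because it
# ignores dependence between inputs -- the gap joint-mutual-information methods address.
```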

Journal ArticleDOI
01 Dec 1999
TL;DR: Two new clustering algorithms are introduced that can effectively cluster documents, even in the presence of a very high dimensional feature space, and do not require pre-specified ad hoc distance functions and are capable of automatically discovering document similarities or associations.
Abstract: Clustering techniques have been used by many intelligent software agents in order to retrieve, filter, and categorize documents available on the World Wide Web. Clustering is also useful in extracting salient features of related Web documents to automatically formulate queries and search for other similar documents on the Web. Traditional clustering algorithms either use a priori knowledge of document structures to define a distance or similarity among these documents, or use probabilistic techniques such as Bayesian classification. Many of these traditional algorithms, however, falter when the dimensionality of the feature space becomes high relative to the size of the document space. In this paper, we introduce two new clustering algorithms that can effectively cluster documents, even in the presence of a very high dimensional feature space. These clustering techniques, which are based on generalizations of graph partitioning, do not require pre-specified ad hoc distance functions, and are capable of automatically discovering document similarities or associations. We conduct several experiments on real Web data using various feature selection heuristics, and compare our clustering schemes to standard distance-based techniques, such as hierarchical agglomeration clustering, and Bayesian classification methods, such as AutoClass.

Book
02 Mar 1999
TL;DR: Decision functions * Classification by distance functions and clustering * Classification using statistical approach * Feature selection * Fuzzy classification and pattern recognition * Syntactic pattern recognition * Neural nets and pattern classification.
Abstract: Decision functions * Classification by distance functions and clustering * Classification using statistical approach * Feature selection * Fuzzy classification and pattern recognition * Syntactic pattern recognition * Neural nets and pattern classification.

Journal ArticleDOI
TL;DR: The GA was found to be an expedient solution compared to editing followed by feature selection, feature selection followed by editing, and the individual results from feature selection and editing.

Journal ArticleDOI
01 May 1999
TL;DR: MFS, a combining algorithm designed to improve the accuracy of the nearest neighbor (NN) classifier, is presented, which significantly outperformed several standard NN variants and was competitive with boosted decision trees.
Abstract: Combining multiple classifiers is an effective technique for improving accuracy. There are many general combining algorithms, such as Bagging, Boosting, or Error Correcting Output Coding, that significantly improve classifiers like decision trees, rule learners, or neural networks. Unfortunately, these combining methods do not improve the nearest neighbor classifier. In this paper, we present MFS, a combining algorithm designed to improve the accuracy of the nearest neighbor (NN) classifier. MFS combines multiple NN classifiers each using only a random subset of features. The experimental results are encouraging: On 25 datasets from the UCI repository, MFS significantly outperformed several standard NN variants and was competitive with boosted decision trees. In additional experiments, we show that MFS is robust to irrelevant features, and is able to reduce both bias and variance components of error.
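The idea behind MFS is simple enough to sketch directly: train several nearest-neighbor classifiers, each restricted to its own random subset of the features, and combine their predictions by voting. The version below is a minimal illustration with scikit-learn, not the authors' exact experimental setup; the subset size and number of members are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
rng = np.random.default_rng(0)

# Train several 1-NN classifiers, each on its own random feature subset.
n_members, subset_size = 15, 2
subsets, members = [], []
for _ in range(n_members):
    feats = rng.choice(X.shape[1], size=subset_size, replace=False)
    clf = KNeighborsClassifier(n_neighbors=1).fit(X_tr[:, feats], y_tr)
    subsets.append(feats)
    members.append(clf)

# Combine the members by simple majority vote.
votes = np.stack([m.predict(X_te[:, f]) for m, f in zip(members, subsets)])
ensemble_pred = np.array([np.bincount(col).argmax() for col in votes.T])
print("ensemble accuracy:", np.mean(ensemble_pred == y_te))
```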

Journal ArticleDOI
TL;DR: In this article, four methods of variable selection along with different criteria levels for deciding on the number of variables to retain were examined along with a selection method that requires one principal component analysis and retains variables by starting with selection from the first component.
Abstract: In many large environmental datasets redundant variables can be discarded without the loss of extra variation. Principal components analysis can be used to select those variables that contain the most information. Using an environmental dataset consisting of 36 meteorological variables spanning 37 years, four methods of variable selection are examined along with different criteria levels for deciding on the number of variables to retain. Procrustes analysis, a measure of similarity and bivariate plots are used to assess the success of the alternative variable selection methods and criteria levels in extracting representative variables. The Broken-stick model is a consistent approach to choosing significant principal components and is chosen here as the more suitable criterion in combination with a selection method that requires one principal component analysis and retains variables by starting with selection from the first component. Copyright © 1999 John Wiley & Sons, Ltd.
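The broken-stick criterion mentioned above has a simple closed form: the expected size of the k-th largest of p random pieces of a unit stick is b_k = (1/p) * sum_{i=k}^{p} 1/i, and a principal component is retained while its proportion of explained variance exceeds b_k. The sketch below covers only this rule, not the full variable-selection and Procrustes comparison; names and data are illustrative.

```python
import numpy as np

def broken_stick(p):
    """Expected proportions b_1..b_p of a unit stick broken into p random pieces."""
    return np.array([np.sum(1.0 / np.arange(k, p + 1)) / p for k in range(1, p + 1)])

def n_significant_components(X):
    """Number of leading PCs whose variance proportion exceeds the broken-stick value."""
    Xc = X - X.mean(axis=0)
    eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]
    proportions = eigvals / eigvals.sum()
    above = proportions > broken_stick(len(eigvals))
    return int(np.argmin(above)) if not above.all() else len(eigvals)

# Usage: 8 variables driven by only ~2 underlying sources of variation.
rng = np.random.default_rng(0)
sources = rng.normal(size=(300, 2))
X = sources @ rng.normal(size=(2, 8)) + 0.1 * rng.normal(size=(300, 8))
print(n_significant_components(X))  # typically 2
```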

Proceedings ArticleDOI
10 Jul 1999
TL;DR: The wavelet packet transform is introduced as an alternative means of extracting time-frequency information from vibration signatures with the aid of statistically based feature selection criteria, which significantly reduces the long training time often associated with neural network classifiers and increases the generalization ability of the neural network classifier.
Abstract: Condition monitoring of dynamic systems based on vibration signatures has generally relied upon Fourier-based analysis as a means of translating vibration signals from the time domain into the frequency domain. However, Fourier analysis provides a poor representation of signals that are well localized in time. The wavelet packet transform is introduced as an alternative means of extracting time-frequency information from vibration signatures. Moreover, with the aid of statistically based feature selection criteria, feature components containing little discriminant information can be discarded, resulting in a feature subset with a reduced number of parameters. This significantly reduces the long training time often associated with neural network classifiers and increases the generalization ability of the neural network classifier.
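A minimal sketch of the extraction step described above might decompose each vibration signal with a wavelet packet transform (here via the PyWavelets package), use the energy of each terminal node as a feature, and rank those features with a simple statistical criterion such as the Fisher score. The paper does not specify the wavelet, depth, or selection criterion, so these choices and the synthetic signals are assumptions.

```python
import numpy as np
import pywt

def wavelet_packet_energies(signal, wavelet="db4", level=3):
    """Energy of each terminal node of the wavelet packet decomposition."""
    wp = pywt.WaveletPacket(data=signal, wavelet=wavelet, maxlevel=level)
    nodes = wp.get_level(level, order="natural")
    return np.array([np.sum(np.asarray(n.data) ** 2) for n in nodes])

def fisher_scores(F, y):
    """Per-feature Fisher score: between-class over within-class variance."""
    classes = np.unique(y)
    overall = F.mean(axis=0)
    between = sum((F[y == c].mean(axis=0) - overall) ** 2 * np.sum(y == c) for c in classes)
    within = sum(F[y == c].var(axis=0) * np.sum(y == c) for c in classes)
    return between / (within + 1e-12)

# Usage: two synthetic "conditions" differing in a high-frequency component.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 512)
healthy = [np.sin(2 * np.pi * 10 * t) + 0.1 * rng.normal(size=t.size) for _ in range(20)]
faulty = [np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)
          + 0.1 * rng.normal(size=t.size) for _ in range(20)]
F = np.array([wavelet_packet_energies(s) for s in healthy + faulty])
y = np.array([0] * 20 + [1] * 20)
scores = fisher_scores(F, y)
print("most discriminative packet nodes:", np.argsort(scores)[::-1][:3])
```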


Journal ArticleDOI
TL;DR: A novel artificial neural-network decision tree algorithm (ANN-DT), which extracts binary decision trees from a trained neural network, and is shown to have significant benefits in certain cases when compared with the standard criteria of minimum weighted variance over the branches.
Abstract: Although artificial neural networks can represent a variety of complex systems with a high degree of accuracy, these connectionist models are difficult to interpret. This significantly limits the applicability of neural networks in practice, especially where a premium is placed on the comprehensibility or reliability of systems. A novel artificial neural-network decision tree algorithm (ANN-DT) is therefore proposed, which extracts binary decision trees from a trained neural network. The ANN-DT algorithm uses the neural network to generate outputs for samples interpolated from the training data set. In contrast to existing techniques, ANN-DT can extract rules from feedforward neural networks with continuous outputs. These rules are extracted from the neural network without making assumptions about the internal structure of the neural network or the features of the data. A novel attribute selection criterion based on a significance analysis of the variables on the neural-network output is examined. It is shown to have significant benefits in certain cases when compared with the standard criteria of minimum weighted variance over the branches. In three case studies the ANN-DT algorithm compared favorably with CART, a standard decision tree algorithm.
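The core mechanism described above (use the trained network as an oracle to label extra samples interpolated from the training data, then fit a decision tree to those labels) is easy to sketch. The version below uses standard scikit-learn models and plain interpolation between random pairs of training points; it does not reproduce the paper's significance-based attribute selection criterion.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
net = MLPClassifier(hidden_layer_sizes=(20, 20), max_iter=2000,
                    random_state=0).fit(X, y)

# Generate extra points by interpolating between random pairs of training samples,
# and label them with the trained network (the network acts as an oracle).
rng = np.random.default_rng(0)
i, j = rng.integers(len(X), size=(2, 3000))
alpha = rng.random((3000, 1))
X_interp = alpha * X[i] + (1 - alpha) * X[j]
X_aug = np.vstack([X, X_interp])
y_aug = net.predict(X_aug)

# Fit a shallow decision tree to the network's behaviour.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_aug, y_aug)
print("fidelity to the network:", np.mean(tree.predict(X_aug) == y_aug))
print(export_text(tree, feature_names=["x1", "x2"]))
```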

Journal ArticleDOI
TL;DR: A new approach to computer-supported diagnosis of skin tumors in dermatology is presented, using neural networks with error back-propagation as the learning paradigm to optimize the classification performance of the neural classifiers.

Proceedings Article
18 Jul 1999
TL;DR: Results suggest a simple strategy for SVM text categorization: use the full set of words found through a rough filtering technique like part-of-speech tagging, since SVMs cannot by themselves discard irrelevant parts of speech.
Abstract: This paper investigates the effect of prior feature selection in Support Vector Machine (SVM) text categorization. The input space was gradually increased by using mutual information (MI) filtering and part-of-speech (POS) filtering, which determine the portion of words that are appropriate for learning from the information-theoretic and the linguistic perspectives, respectively. We tested the two filtering methods on SVMs as well as a decision tree algorithm, C4.5. The SVMs' results common to both filtering methods are that 1) the optimal number of features differed completely across categories, and 2) the average performance for all categories was best when all of the words were used. In addition, a comparison of the two filtering methods clarified that POS filtering on SVMs consistently outperformed MI filtering, which indicates that SVMs cannot find irrelevant parts of speech. These results suggest a simple strategy for SVM text categorization: use the full set of words found through a rough filtering technique like part-of-speech tagging.

Journal ArticleDOI
TL;DR: This paper first briefly introduces baseline statistical methods used in regression and classification, then describes families of methods which have been developed specifically for neural networks, and compared on different test problems.
Abstract: The observed features of a given phenomenon are not all equally informative: some may be noisy, others correlated or irrelevant. The purpose of feature selection is to select a set of features pertinent to a given task. This is a complex process, but it is an important issue in many fields. In neural networks, feature selection has been studied for the last ten years, using conventional and original methods. This paper is a review of neural network approaches to feature selection. We first briefly introduce baseline statistical methods used in regression and classification. We then describe families of methods which have been developed specifically for neural networks. Representative methods are then compared on different test problems.

Journal ArticleDOI
TL;DR: Results on applying the evidence framework to the real-world data sets showed that committees of Bayesian networks achieved classification accuracies similar to the best alternative methods with a minimum of human intervention.

Journal ArticleDOI
TL;DR: A new approach to combine multiple features in handwriting recognition based on two ideas: feature selection-based combination and class dependent features that are effective in separating pattern classes and the new feature vector derived from a combination of two types of such features further improves the recognition rate.
Abstract: In this paper, we propose a new approach to combine multiple features in handwriting recognition based on two ideas: feature selection-based combination and class dependent features. A nonparametric method is used for feature evaluation, and the first part of this paper is devoted to the evaluation of features in terms of their class separation and recognition capabilities. In the second part, multiple feature vectors are combined to produce a new feature vector. Based on the fact that a feature has different discriminating powers for different classes, a new scheme of selecting and combining class-dependent features is proposed. In this scheme, a class is considered to have its own optimal feature vector for discriminating itself from the other classes. Using an architecture of modular neural networks as the classifier, a series of experiments were conducted on unconstrained handwritten numerals. The results indicate that the selected features are effective in separating pattern classes and the new feature vector derived from a combination of two types of such features further improves the recognition rate.

Journal ArticleDOI
TL;DR: This work proposes an informative prior distribution for variable selection and proposes novel methods for computing the marginal distribution of the data for the logistic regression model.
Abstract: Summary. Bayesian selection of variables is often difficult to carry out because of the challenge in specifying prior distributions for the regression parameters for all possible models, specifying a prior distribution on the model space and computations. We address these three issues for the logistic regression model. For the first, we propose an informative prior distribution for variable selection. Several theoretical and computational properties of the prior are derived and illustrated with several examples. For the second, we propose a method for specifying an informative prior on the model space, and for the third we propose novel methods for computing the marginal distribution of the data. The new computational algorithms only require Gibbs samples from the full model to facilitate the computation of the prior and posterior model probabilities for all possible models. Several properties of the algorithms are also derived. The prior specification for the first challenge focuses on the observables in that the elicitation is based on a prior prediction y0 for the response vector and a quantity a0 quantifying the uncertainty in y0. Then, y0 and a0 are used to specify a prior for the regression coefficients semi-automatically. Examples using real data are given to demonstrate the methodology.

Journal ArticleDOI
TL;DR: Using a forward selection procedure with the root mean square error from a multilinear regression model as the selection criterion, it was possible to obtain good prediction accuracy from a back-propagation artificial neural network (ANN).
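The selection procedure summarized above can be sketched directly: at each step, add the variable whose inclusion most reduces the root mean square error of a multilinear regression fit, and (in the paper) hand the selected variables to a back-propagation network. The sketch below covers only the forward-selection part, on synthetic data with illustrative names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def forward_select(X, y, max_vars):
    """Greedy forward selection using multilinear-regression RMSE as the criterion."""
    remaining, chosen = list(range(X.shape[1])), []
    while remaining and len(chosen) < max_vars:
        rmses = []
        for f in remaining:
            cols = chosen + [f]
            pred = LinearRegression().fit(X[:, cols], y).predict(X[:, cols])
            rmses.append((np.sqrt(np.mean((y - pred) ** 2)), f))
        best_rmse, best_f = min(rmses)
        chosen.append(best_f)
        remaining.remove(best_f)
    return chosen

# Usage: only variables 0 and 3 actually drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2 * X[:, 0] - 3 * X[:, 3] + 0.1 * rng.normal(size=200)
print(forward_select(X, y, max_vars=2))  # expected: [3, 0] or [0, 3]
```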

Journal Article
TL;DR: A practical framework for the semi-automatic construction of evaluation-functions for games based on a structured evaluation function representation is presented that is able to discover new features in a computationally feasible way.
Abstract: This paper discusses a practical framework for the semiautomatic construction of evaluation-functions for games. Based on a structured evaluation function representation, a procedure for exploring the feature space is presented that is able to discover new features in a computationally feasible way. Besides the theoretical aspects, related practical issues such as the generation of training positions, feature selection, and weight fitting in large linear systems are discussed. Finally, we present experimental results for Othello, which demonstrate the potential of the described approach.

Proceedings ArticleDOI
06 Jul 1999
TL;DR: A survey of the approaches presented in the literature to select relevant features by using genetic algorithms is given and the different values of the genetic parameters utilized as well as the fitness functions are compared.
Abstract: In this paper, we review the feature selection problem in mining issues. The application of soft computing techniques to data mining and knowledge discovery is now emerging in order to enhance the effectiveness of the traditional classification methods coming from machine learning. A survey of the approaches presented in the literature to select relevant features by using genetic algorithms is given. The different values of the genetic parameters utilized as well as the fitness functions are compared. A more detailed review of the proposals in the mining fields of databases, text and the Web is also given.