
Showing papers on "Feature selection" published in 2007


Journal ArticleDOI
TL;DR: A basic taxonomy of feature selection techniques is provided, discussing their use, variety, and potential in a number of both common and upcoming bioinformatics applications.
Abstract: Feature selection techniques have become an apparent need in many bioinformatics applications. In addition to the large pool of techniques that have already been developed in the machine learning and data mining fields, specific applications in bioinformatics have led to a wealth of newly proposed techniques. In this article, we make the interested reader aware of the possibilities of feature selection, providing a basic taxonomy of feature selection techniques, and discussing their use, variety and potential in a number of both common and upcoming bioinformatics applications. Contact: yvan.saeys@psb.ugent.be Supplementary information: http://bioinformatics.psb.ugent.be/supplementary_data/yvsae/fsreview

4,706 citations
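
To make the filter/wrapper/embedded taxonomy concrete, here is a minimal sketch (not from the paper) of one representative per family, using scikit-learn with synthetic data and placeholder parameters:

```python
# Minimal sketch: one representative per feature-selection family
# (filter / wrapper / embedded). Dataset and parameters are placeholders.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)

# Filter: score each feature independently of any classifier (ANOVA F-test).
filter_sel = SelectKBest(f_classif, k=10).fit(X, y)

# Wrapper: recursive feature elimination driven by a classifier's weights.
wrapper_sel = RFE(LogisticRegression(max_iter=1000),
                  n_features_to_select=10).fit(X, y)

# Embedded: selection happens inside model fitting (L1 penalty).
embedded_sel = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1)).fit(X, y)

for name, sel in [("filter", filter_sel), ("wrapper", wrapper_sel),
                  ("embedded", embedded_sel)]:
    print(name, sel.get_support().nonzero()[0])
```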


Journal ArticleDOI
TL;DR: An alternative implementation of random forests is proposed that provides unbiased variable selection in the individual classification trees and can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories.
Abstract: Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories. Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are biased variable selection in the individual classification trees used to build the random forest on the one hand, and effects induced by bootstrap sampling with replacement on the other hand. We propose to employ an alternative implementation of random forests that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories. The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analyzing data from a study on RNA editing. Therefore, the suggested method can be applied straightforwardly by scientists in bioinformatics research.

2,697 citations
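
The paper's remedy is conditional inference forests grown on subsamples drawn without replacement (available in the R package party). As a loosely related illustration only, the sketch below shows permutation importance on held-out data, a common way to reduce the bias of impurity-based importances toward high-cardinality predictors:

```python
# Sketch: permutation importance on held-out data, which avoids the bias of
# impurity-based importances toward high-cardinality predictors. The paper's
# own remedy (conditional inference forests, subsampling without replacement)
# lives in the R package "party"; this is only a related illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
result = permutation_importance(rf, X_te, y_te, n_repeats=20, random_state=0)
print(result.importances_mean.round(3))
```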


Reference BookDOI
29 Oct 2007
TL;DR: This book surveys supervised, unsupervised, and semi-supervised feature selection, spanning randomized, causal, weighting-based, and local methods, with chapters on text classification, clustering, and bioinformatics applications.
Abstract: PREFACE
Introduction and Background
- Less Is More (Huan Liu and Hiroshi Motoda): Background and Basics; Supervised, Unsupervised, and Semi-Supervised Feature Selection; Key Contributions and Organization of the Book; Looking Ahead
- Unsupervised Feature Selection (Jennifer G. Dy): Introduction; Clustering; Feature Selection; Feature Selection for Unlabeled Data; Local Approaches; Summary
- Randomized Feature Selection (David J. Stracuzzi): Introduction; Types of Randomizations; Randomized Complexity Classes; Applying Randomization to Feature Selection; The Role of Heuristics; Examples of Randomized Selection Algorithms; Issues in Randomization; Summary
- Causal Feature Selection (Isabelle Guyon, Constantin Aliferis, and Andre Elisseeff): Introduction; Classical "Non-Causal" Feature Selection; The Concept of Causality; Feature Relevance in Bayesian Networks; Causal Discovery Algorithms; Examples of Applications; Summary, Conclusions, and Open Problems
Extending Feature Selection
- Active Learning of Feature Relevance (Emanuele Olivetti, Sriharsha Veeramachaneni, and Paolo Avesani): Introduction; Active Sampling for Feature Relevance Estimation; Derivation of the Sampling Benefit Function; Implementation of the Active Sampling Algorithm; Experiments; Conclusions and Future Work
- A Study of Feature Extraction Techniques Based on Decision Border Estimate (Claudia Diamantini and Domenico Potena): Introduction; Feature Extraction Based on Decision Boundary; Generalities about Labeled Vector Quantizers; Feature Extraction Based on Vector Quantizers; Experiments; Conclusions
- Ensemble-Based Variable Selection Using Independent Probes (Eugene Tuv, Alexander Borisov, and Kari Torkkola): Introduction; Tree Ensemble Methods in Feature Ranking; The Algorithm: Ensemble-Based Ranking against Independent Probes; Experiments; Discussion
- Efficient Incremental-Ranked Feature Selection in Massive Data (Roberto Ruiz, Jesus S. Aguilar-Ruiz, and Jose C. Riquelme): Introduction; Related Work; Preliminary Concepts; Incremental Performance over Ranking; Experimental Results; Conclusions
Weighting and Local Methods
- Non-Myopic Feature Quality Evaluation with (R)ReliefF (Igor Kononenko and Marko Robnik Sikonja): Introduction; From Impurity to Relief; ReliefF for Classification and RReliefF for Regression; Extensions; Interpretation; Implementation Issues; Applications; Conclusion
- Weighting Method for Feature Selection in k-Means (Joshua Zhexue Huang, Jun Xu, Michael Ng, and Yunming Ye): Introduction; Feature Weighting in k-Means; W-k-Means Clustering Algorithm; Feature Selection; Subspace Clustering with k-Means; Text Clustering; Related Work; Discussions
- Local Feature Selection for Classification (Carlotta Domeniconi and Dimitrios Gunopulos): Introduction; The Curse of Dimensionality; Adaptive Metric Techniques; Large Margin Nearest Neighbor Classifiers; Experimental Comparisons; Conclusions
- Feature Weighting through Local Learning (Yijun Sun): Introduction; Mathematical Interpretation of Relief; Iterative Relief Algorithm; Extension to Multiclass Problems; Online Learning; Computational Complexity; Experiments; Conclusion
Text Classification and Clustering
- Feature Selection for Text Classification (George Forman): Introduction; Text Feature Generators; Feature Filtering for Classification; Practical and Scalable Computation; A Case Study; Conclusion and Future Work
- A Bayesian Feature Selection Score Based on Naive Bayes Models (Susana Eyheramendy and David Madigan): Introduction; Feature Selection Scores; Classification Algorithms; Experimental Settings and Results; Conclusion
- Pairwise Constraints-Guided Dimensionality Reduction (Wei Tang and Shi Zhong): Introduction; Pairwise Constraints-Guided Feature Projection; Pairwise Constraints-Guided Co-Clustering; Experimental Studies; Conclusion and Future Work
- Aggressive Feature Selection by Feature Ranking (Masoud Makrehchi and Mohamed S. Kamel): Introduction; Feature Selection by Feature Ranking; Proposed Approach to Reducing Term Redundancy; Experimental Results; Summary
Feature Selection in Bioinformatics
- Feature Selection for Genomic Data Analysis (Lei Yu): Introduction; Redundancy-Based Feature Selection; Empirical Study; Summary
- A Feature Generation Algorithm with Applications to Biological Sequence Classification (Rezarta Islamaj Dogan, Lise Getoor, and W. John Wilbur): Introduction; Splice-Site Prediction; Feature Generation Algorithm; Experiments and Discussion; Conclusions
- An Ensemble Method for Identifying Robust Features for Biomarker Discovery (Diana Chan, Susan M. Bridges, and Shane C. Burgess): Introduction; Biomarker Discovery from Proteome Profiles; Challenges of Biomarker Identification; Ensemble Method for Feature Selection; Feature Selection Ensemble; Results and Discussion; Conclusion
- Model Building and Feature Selection with Genomic Data (Hui Zou and Trevor Hastie): Introduction; Ridge Regression, Lasso, and Bridge; Drawbacks of the Lasso; The Elastic Net; The Elastic-Net Penalized SVM; Sparse Eigen-Genes; Summary
INDEX

1,097 citations


Proceedings ArticleDOI
24 Sep 2007
TL;DR: Techniques from text categorization are applied to generate image representations that differ in the dimension, selection, and weighting of visual words, providing an empirical basis for designing visual-word representations that are likely to produce superior classification performance.
Abstract: Based on keypoints extracted as salient image patches, an image can be described as a "bag of visual words", and this representation has been used in scene classification. The choice of dimension, selection, and weighting of visual words in this representation is crucial to the classification performance but has not been thoroughly studied in previous work. Given the analogy between this representation and the bag-of-words representation of text documents, we apply techniques used in text categorization, including term weighting, stop word removal, and feature selection, to generate image representations that differ in the dimension, selection, and weighting of visual words. The impact of these representation choices on scene classification is studied through extensive experiments on the TRECVID and PASCAL collections. This study provides an empirical basis for designing visual-word representations that are likely to produce superior classification performance.

900 citations
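
As a hedged sketch of the text-categorization analogy (synthetic counts standing in for real visual-word histograms; the paper's exact weighting schemes may differ):

```python
# Sketch: visual-word histograms treated like text, with term weighting
# (tf-idf) and feature selection (chi-squared) borrowed from text
# categorization. The count matrix and labels are synthetic placeholders.
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(100, 1000))  # 100 images x 1000 visual words
labels = rng.integers(0, 5, size=100)        # 5 scene classes (toy)

weighted = TfidfTransformer().fit_transform(counts)                  # weighting
selected = SelectKBest(chi2, k=200).fit_transform(weighted, labels)  # selection
print(selected.shape)  # (100, 200)
```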


Journal ArticleDOI
TL;DR: A statistical perspective on boosting is presented, with special emphasis on estimating potentially complex parametric or nonparametric models, including generalized linear and additive models as well as regression models for survival analysis.
Abstract: We present a statistical perspective on boosting. Special emphasis is given to estimating potentially complex parametric or nonparametric models, including generalized linear and additive models as well as regression models for survival analysis. Concepts of degrees of freedom and corresponding Akaike or Bayesian information criteria, particularly useful for regularization and variable selection in high-dimensional covariate spaces, are discussed as well. The practical aspects of boosting procedures for fitting statistical models are illustrated by means of the dedicated open-source software package mboost. This package implements functions which can be used for model fitting, prediction and variable selection. It is flexible, allowing for the implementation of new boosting algorithms optimizing user-specified loss functions.

891 citations
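
A minimal sketch of componentwise L2-boosting, the flavor of boosting that mboost implements for linear base-learners; because each iteration updates only the best-fitting covariate, variable selection happens as a side effect. Step size and iteration count are toy choices, not the paper's recommendations:

```python
# Componentwise L2-boosting sketch: each iteration fits every single covariate
# to the current residuals and updates only the best one, so variable
# selection happens as a side effect. nu and n_iter are toy values.
import numpy as np

def componentwise_l2_boost(X, y, n_iter=200, nu=0.1):
    n, p = X.shape
    beta = np.zeros(p)
    intercept = y.mean()
    resid = y - intercept
    for _ in range(n_iter):
        coefs = X.T @ resid / (X ** 2).sum(axis=0)   # per-covariate LS fits
        rss = ((resid[:, None] - X * coefs) ** 2).sum(axis=0)
        j = rss.argmin()                             # best-fitting covariate
        beta[j] += nu * coefs[j]                     # shrunken update
        resid = y - intercept - X @ beta
    return intercept, beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = 2 * X[:, 0] - 3 * X[:, 3] + rng.normal(size=100)
_, beta = componentwise_l2_boost(X, y)
print(np.nonzero(np.abs(beta) > 1e-8)[0])  # typically a small subset
```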


Proceedings ArticleDOI
20 Jun 2007
TL;DR: This work exploits intrinsic properties underlying supervised and unsupervised feature selection algorithms, and proposes a unified framework for feature selection based on spectral graph theory, and shows that existing powerful algorithms such as ReliefF and Laplacian Score are special cases of the proposed framework.
Abstract: Feature selection aims to reduce dimensionality for building comprehensible learning models with good generalization performance. Feature selection algorithms are largely studied separately according to the type of learning: supervised or unsupervised. This work exploits intrinsic properties underlying supervised and unsupervised feature selection algorithms, and proposes a unified framework for feature selection based on spectral graph theory. The proposed framework is able to generate families of algorithms for both supervised and unsupervised feature selection. We also show that existing powerful algorithms such as ReliefF (supervised) and Laplacian Score (unsupervised) are special cases of the proposed framework. To the best of our knowledge, this work is the first attempt to unify supervised and unsupervised feature selection and to enable their joint study under a general framework. Experiments demonstrate the efficacy of the novel algorithms derived from the framework.

857 citations
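
As one concrete special case, a minimal numpy sketch of the Laplacian Score (unsupervised): features that best preserve the nearest-neighbor graph structure receive the lowest scores. The graph construction (k, RBF bandwidth) is a toy choice:

```python
# Laplacian Score sketch (smaller = better): build a kNN graph, weight edges
# with an RBF kernel, and score each feature by how well it respects the
# graph. k and the bandwidth are toy choices.
import numpy as np
from sklearn.neighbors import kneighbors_graph

def laplacian_score(X, k=5):
    W = kneighbors_graph(X, k, mode="distance", include_self=False).toarray()
    sigma = W[W > 0].mean()
    W[W > 0] = np.exp(-W[W > 0] ** 2 / (2 * sigma ** 2))  # RBF edge weights
    W = np.maximum(W, W.T)                                 # symmetrize
    d = W.sum(axis=1)
    L = np.diag(d) - W                                     # graph Laplacian
    scores = []
    for f in X.T:
        f_t = f - (f @ d) / d.sum()                        # weighted centering
        scores.append((f_t @ L @ f_t) / ((f_t ** 2) @ d))
    return np.array(scores)

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))
print(laplacian_score(X).round(3))
```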


Journal ArticleDOI
TL;DR: A simple Bayesian logistic regression approach that uses a Laplace prior to avoid overfitting and produces sparse predictive models for text data is presented and applied to a range of document classification problems.
Abstract: Logistic regression analysis of high-dimensional data, such as natural language text, poses computational and statistical challenges. Maximum likelihood estimation often fails in these applications. We present a simple Bayesian logistic regression approach that uses a Laplace prior to avoid overfitting and produces sparse predictive models for text data. We apply this approach to a range of document classification problems and show that it produces compact predictive models at least as effective as those produced by support vector machine classifiers or ridge logistic regression combined with feature selection. We describe our model fitting algorithm, our open source implementations (BBR and BMR), and experimental results.

829 citations
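
MAP estimation under a Laplace prior is equivalent to L1-penalized logistic regression, so a rough stand-in for BBR-style sparse models (not the authors' implementation) is scikit-learn's L1 logistic regression; the data below is synthetic "text-like" counts:

```python
# Sketch: MAP estimation with a Laplace prior == L1-penalized logistic
# regression; C plays the role of the (inverse) prior scale. Synthetic
# "text-like" count data; not the authors' BBR/BMR implementations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.poisson(0.3, size=(300, 500)).astype(float)  # sparse count features
w_true = np.zeros(500)
w_true[:10] = 1.5
y = (X @ w_true + rng.normal(size=300) > 1.0).astype(int)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
print("nonzero coefficients:", int(np.count_nonzero(clf.coef_)), "of 500")
```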


Journal ArticleDOI
TL;DR: It is argued why MFP is the preferred approach for multivariable model building with continuous covariates, and it is shown that spline modelling, while extremely flexible, can generate fitted curves with uninterpretable 'wiggles'.
Abstract: In developing regression models, data analysts are often faced with many predictor variables that may influence an outcome variable. After more than half a century of research, the 'best' way of selecting a multivariable model is still unresolved. It is generally agreed that subject matter knowledge, when available, should guide model building. However, such knowledge is often limited, and data-dependent model building is required. We limit the scope of the modelling exercise to selecting important predictors and choosing interpretable and transportable functions for continuous predictors. Assuming linear functions, stepwise selection and all-subset strategies are discussed; the key tuning parameters are the nominal P-value for testing a variable for inclusion and the penalty for model complexity, respectively. We argue that stepwise procedures perform better than a literature-based assessment would suggest. Concerning selection of functional form for continuous predictors, the principal competitors are fractional polynomial functions and various types of spline techniques. We note that a rigorous selection strategy known as multivariable fractional polynomials (MFP) has been developed. No spline-based procedure for simultaneously selecting variables and functional forms has found wide acceptance. Results of FP and spline modelling are compared in two data sets. It is shown that spline modelling, while extremely flexible, can generate fitted curves with uninterpretable 'wiggles', particularly when automatic methods for choosing the smoothness are employed. We give general recommendations to practitioners for carrying out variable and function selection. While acknowledging that further research is needed, we argue why MFP is our preferred approach for multivariable model building with continuous covariates.

806 citations
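
A minimal sketch of the first-degree fractional polynomial (FP1) idea: fit y ~ b0 + b1*x^p over the standard power set (with p = 0 meaning log x) and keep the best-fitting power. Full MFP also considers FP2 terms and variable inclusion tests, which are omitted here:

```python
# FP1 sketch: try each power in the standard fractional-polynomial set
# (p = 0 means log x) and keep the best least-squares fit. Full MFP also
# handles FP2 terms and variable inclusion, omitted here.
import numpy as np

POWERS = [-2, -1, -0.5, 0, 0.5, 1, 2, 3]

def fp1_best_power(x, y):
    best_p, best_rss = None, np.inf
    for p in POWERS:
        z = np.log(x) if p == 0 else x ** p
        A = np.column_stack([np.ones_like(z), z])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        rss = ((y - A @ coef) ** 2).sum()
        if rss < best_rss:
            best_p, best_rss = p, rss
    return best_p

rng = np.random.default_rng(0)
x = rng.uniform(0.5, 5.0, size=200)
y = 3.0 / x + rng.normal(scale=0.3, size=200)  # true power is -1
print(fp1_best_power(x, y))                    # expect -1
```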


Journal ArticleDOI
TL;DR: Based on the concept of automatic relevance determination, this paper uses an empirical Bayesian prior to estimate a convenient posterior distribution over candidate basis vectors that consistently places its prominent posterior mass on the appropriate region of weight-space necessary for simultaneous sparse recovery.
Abstract: Given a large overcomplete dictionary of basis vectors, the goal is to simultaneously represent L>1 signal vectors using coefficient expansions marked by a common sparsity profile. This generalizes the standard sparse representation problem to the case where multiple responses exist that were putatively generated by the same small subset of features. Ideally, the associated sparse generating weights should be recovered, which can have physical significance in many applications (e.g., source localization). The generic solution to this problem is intractable and, therefore, approximate procedures are sought. Based on the concept of automatic relevance determination, this paper uses an empirical Bayesian prior to estimate a convenient posterior distribution over candidate basis vectors. This particular approximation enforces a common sparsity profile and consistently places its prominent posterior mass on the appropriate region of weight-space necessary for simultaneous sparse recovery. The resultant algorithm is then compared with multiple response extensions of matching pursuit, basis pursuit, FOCUSS, and Jeffreys prior-based Bayesian methods, finding that it often outperforms the others. Additional motivation for this particular choice of cost function is also provided, including the analysis of global and local minima and a variational derivation that highlights the similarities and differences between the proposed algorithm and previous approaches.

796 citations
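
The paper's algorithm is an empirical-Bayes (ARD) procedure; as a loosely related convex alternative for the same row-sparse setting (in the family of baselines such methods are compared against, not the paper's algorithm), scikit-learn's MultiTaskLasso applies an l1/l2 penalty that zeroes dictionary rows jointly across responses:

```python
# Row-sparse recovery with a convex l1/l2 penalty (MultiTaskLasso), a
# related alternative to the paper's empirical-Bayes/ARD method -- not the
# paper's algorithm. Five signals share one 8-row support in the dictionary.
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(0)
D = rng.normal(size=(60, 200))                # overcomplete dictionary
W = np.zeros((200, 5))                        # 5 responses, common support
rows = rng.choice(200, size=8, replace=False)
W[rows] = rng.normal(size=(8, 5))
Y = D @ W + 0.01 * rng.normal(size=(60, 5))

est = MultiTaskLasso(alpha=0.05).fit(D, Y)    # rows zeroed jointly
recovered = np.nonzero(est.coef_.T.any(axis=1))[0]
print(sorted(rows), "->", sorted(recovered))
```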


Journal ArticleDOI
TL;DR: A new feature selection strategy based on rough sets and particle swarm optimization (PSO) is proposed; PSO does not need complex operators such as crossover and mutation, requires only primitive and simple mathematical operators, and is computationally inexpensive in terms of both memory and runtime.

794 citations
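
A hedged sketch of binary PSO for feature selection; the paper's rough-set-based fitness is replaced here by cross-validated kNN accuracy for brevity, and all constants (inertia, acceleration, swarm size) are toy values:

```python
# Binary PSO sketch for feature selection. The paper's rough-set fitness is
# swapped for cross-validated kNN accuracy; all constants are toy values.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = load_wine(return_X_y=True)
n_feat, n_part, n_iter = X.shape[1], 10, 15

def fitness(mask):
    if not mask.any():
        return 0.0
    acc = cross_val_score(KNeighborsClassifier(), X[:, mask], y, cv=3).mean()
    return acc - 0.01 * mask.sum() / n_feat   # mild penalty on subset size

pos = rng.integers(0, 2, size=(n_part, n_feat)).astype(bool)
vel = rng.normal(scale=0.1, size=(n_part, n_feat))
pbest = pos.copy()
pbest_fit = np.array([fitness(p) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()

for _ in range(n_iter):
    r1, r2 = rng.random((2, n_part, n_feat))
    vel = (0.7 * vel
           + 1.5 * r1 * (pbest.astype(int) - pos.astype(int))
           + 1.5 * r2 * (gbest.astype(int) - pos.astype(int)))
    pos = rng.random((n_part, n_feat)) < 1 / (1 + np.exp(-vel))  # sigmoid
    fit = np.array([fitness(p) for p in pos])
    improved = fit > pbest_fit
    pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
    gbest = pbest[pbest_fit.argmax()].copy()

print("selected features:", np.nonzero(gbest)[0])
```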


Proceedings ArticleDOI
26 Dec 2007
TL;DR: The approach builds on recent work on object recognition based on hierarchical feedforward architectures and extends a neurobiological model of motion processing in the visual cortex; sparse features in intermediate stages are found to outperform dense ones, and a simple feature selection approach leads to an efficient system that performs better with far fewer features.
Abstract: We present a biologically-motivated system for the recognition of actions from video sequences. The approach builds on recent work on object recognition based on hierarchical feedforward architectures [25, 16, 20] and extends a neurobiological model of motion processing in the visual cortex [10]. The system consists of a hierarchy of spatio-temporal feature detectors of increasing complexity: an input sequence is first analyzed by an array of motion-direction sensitive units which, through a hierarchy of processing stages, lead to position-invariant spatio-temporal feature detectors. We experiment with different types of motion-direction sensitive units as well as different system architectures. As in [16], we find that sparse features in intermediate stages outperform dense ones and that using a simple feature selection approach leads to an efficient system that performs better with far fewer features. We test the approach on different publicly available action datasets, in all cases achieving the highest results reported to date.

Journal ArticleDOI
TL;DR: Three strategies are used to construct hybrid SVM-based credit scoring models, and experimental results show that SVM is a promising addition to the existing data mining methods.
Abstract: The credit card industry has been growing rapidly recently, and thus huge numbers of consumers' credit data are collected by the credit department of the bank. The credit scoring manager often evaluates the consumer's credit with intuitive experience. However, with the support of the credit classification model, the manager can accurately evaluate the applicant's credit score. Support Vector Machine (SVM) classification is currently an active research area and successfully solves classification problems in many domains. This study used three strategies to construct hybrid SVM-based credit scoring models to evaluate the applicant's credit score from the applicant's input features. Two credit datasets from the UCI database are selected as the experimental data to demonstrate the accuracy of the SVM classifier. Compared with neural networks, genetic programming, and decision tree classifiers, the SVM classifier achieved identical classificatory accuracy with relatively few input features. Additionally, by combining genetic algorithms with the SVM classifier, the proposed hybrid GA-SVM strategy can simultaneously perform feature selection and model parameter optimization. Experimental results show that SVM is a promising addition to the existing data mining methods.
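
A hedged sketch of the GA-SVM idea: chromosomes encode a feature mask plus SVM hyperparameters, and fitness is cross-validated accuracy. A bundled scikit-learn dataset stands in for the UCI credit data, and the GA operators below are generic choices, not the paper's exact design:

```python
# GA-SVM sketch: each chromosome = feature mask + (log C, log gamma); fitness
# is 3-fold CV accuracy. Population sizes and rates are toy values.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
n_feat, pop_size, n_gen = X.shape[1], 12, 8

def random_chrom():
    return {"mask": rng.random(n_feat) < 0.5,
            "logC": rng.uniform(-2, 3), "logG": rng.uniform(-5, 1)}

def fitness(c):
    if not c["mask"].any():
        return 0.0
    svm = SVC(C=10 ** c["logC"], gamma=10 ** c["logG"])
    return cross_val_score(svm, X[:, c["mask"]], y, cv=3).mean()

pop = [random_chrom() for _ in range(pop_size)]
for _ in range(n_gen):
    pop = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]  # selection
    children = []
    for _ in range(pop_size - len(pop)):
        a, b = rng.choice(len(pop), 2, replace=False)
        cut = rng.integers(1, n_feat)             # one-point crossover (mask)
        mask = np.concatenate([pop[a]["mask"][:cut], pop[b]["mask"][cut:]])
        mask ^= rng.random(n_feat) < 0.05         # bit-flip mutation
        children.append({"mask": mask,
                         "logC": pop[a]["logC"] + rng.normal(scale=0.3),
                         "logG": pop[b]["logG"] + rng.normal(scale=0.3)})
    pop += children

best = max(pop, key=fitness)
print(int(best["mask"].sum()), "features; fitness =", round(fitness(best), 3))
```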

Journal ArticleDOI
TL;DR: This paper presents the most well-known algorithms for each step of data pre-processing so that one can achieve the best performance for a given data set.
Abstract: Many factors affect the success of Machine Learning (ML) on a given task. The representation and quality of the instance data is first and foremost. If there is much irrelevant and redundant information present, or noisy and unreliable data, then knowledge discovery during the training phase is more difficult. It is well known that data preparation and filtering steps take a considerable amount of processing time in ML problems. Data pre-processing includes data cleaning, normalization, transformation, feature extraction and selection, etc. The product of data pre-processing is the final training set. It would be ideal if a single sequence of data pre-processing algorithms had the best performance for each data set, but this is not the case. Thus, we present the most well-known algorithms for each step of data pre-processing so that one achieves the best performance for their data set. Keywords: Data mining, feature selection, data cleaning.
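
The enumerated steps compose naturally into a single pipeline; a minimal scikit-learn sketch (one arbitrary algorithm per step, synthetic data):

```python
# Sketch: the survey's pre-processing steps (cleaning/imputation,
# normalization, feature selection) composed as one scikit-learn Pipeline.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("clean", SimpleImputer(strategy="median")),    # data cleaning
    ("scale", StandardScaler()),                    # normalization
    ("select", SelectKBest(f_classif, k=5)),        # feature selection
    ("model", LogisticRegression(max_iter=1000)),
])

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
X[rng.random(X.shape) < 0.05] = np.nan              # simulate missing values
y = rng.integers(0, 2, size=100)
pipe.fit(X, y)
print(pipe.score(X, y))
```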

Posted Content
TL;DR: In this paper, the authors consider the least-square regression problem with regularization by a block l1-norm and derive necessary and sufficient conditions for the consistency of the group Lasso under practical assumptions, such as model misspecification.
Abstract: We consider the least-square regression problem with regularization by a block l1-norm, i.e., a sum of Euclidean norms over spaces of dimensions larger than one. This problem, referred to as the group Lasso, extends the usual regularization by the l1-norm where all spaces have dimension one, where it is commonly referred to as the Lasso. In this paper, we study the asymptotic model consistency of the group Lasso. We derive necessary and sufficient conditions for the consistency of the group Lasso under practical assumptions, such as model misspecification. When the linear predictors and Euclidean norms are replaced by functions and reproducing kernel Hilbert norms, the problem is usually referred to as multiple kernel learning and is commonly used for learning from heterogeneous data sources and for nonlinear variable selection. Using tools from functional analysis, and in particular covariance operators, we extend the consistency results to this infinite-dimensional case and also propose an adaptive scheme to obtain a consistent model estimate, even when the necessary condition required for the non-adaptive scheme is not satisfied.
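
For reference, the criterion being analyzed can be written as follows (reconstructed from the description above, so normalization constants and group weights may differ from the paper's):

```latex
% Group-lasso criterion: squared loss plus a block l1-norm, i.e., a sum of
% Euclidean norms over the coefficient groups (group weights omitted).
\min_{\beta}\ \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
  \;+\; \lambda \sum_{g=1}^{G} \lVert \beta_g \rVert_2
```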

Journal ArticleDOI
TL;DR: In this paper, the adaptive Lasso estimator is proposed for Cox's proportional hazards model, based on a penalized log partial likelihood with an adaptively weighted L1 penalty on regression coefficients.
Abstract: We investigate the variable selection problem for Cox's proportional hazards model, and propose a unified model selection and estimation procedure with desired theoretical properties and computational convenience. The new method is based on a penalized log partial likelihood with the adaptively weighted L1 penalty on regression coefficients, providing what we call the adaptive Lasso estimator. The method incorporates different penalties for different coefficients: unimportant variables receive larger penalties than important ones, so that important variables tend to be retained in the selection process, whereas unimportant variables are more likely to be dropped. Theoretical properties, such as consistency and rate of convergence of the estimator, are studied. We also show that, with proper choice of regularization parameters, the proposed estimator has the oracle properties. The convex optimization nature of the method leads to an efficient algorithm. Both simulated and real examples show that the method performs competitively.
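
The penalized criterion described above, reconstructed in LaTeX; here \ell_n is the Cox log partial likelihood, and the weight form w_j is the usual adaptive-lasso choice, stated as an assumption rather than quoted from the paper:

```latex
% Adaptive-lasso penalized log partial likelihood; \tilde{\beta} is an
% initial consistent estimate and gamma > 0 a tuning constant (assumed form).
\hat{\beta} \;=\; \arg\min_{\beta}\;
  -\,\ell_n(\beta) \;+\; \lambda_n \sum_{j=1}^{p} w_j\,\lvert \beta_j \rvert,
\qquad w_j = 1 / \lvert \tilde{\beta}_j \rvert^{\gamma}
```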

Journal ArticleDOI
TL;DR: A methodological approach to the classification of pigmented skin lesions in dermoscopy images is presented; the issue of class imbalance is addressed using various sampling strategies, and the classifier generalization error is estimated using Monte Carlo cross-validation.
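
A hedged sketch of the two ingredients named in the summary: random oversampling of the minority class, and Monte Carlo cross-validation (repeated random splits). Data, classifier, and split counts are toy stand-ins for the dermoscopy setting:

```python
# Sketch: random oversampling of the minority class + Monte Carlo CV
# (repeated random splits). Data, classifier, and counts are toy stand-ins.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = (rng.random(300) < 0.1).astype(int)     # ~10% positives (imbalanced)

scores = []
for tr, te in ShuffleSplit(n_splits=50, test_size=0.3, random_state=0).split(X):
    pos, neg = tr[y[tr] == 1], tr[y[tr] == 0]
    if len(pos) == 0:
        continue
    extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
    tr_bal = np.concatenate([tr, extra])    # oversampled training indices
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X[tr_bal], y[tr_bal])
    scores.append(clf.score(X[te], y[te]))

print("MC-CV accuracy:", round(float(np.mean(scores)), 3))
```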

Journal ArticleDOI
TL;DR: The results demonstrate that very simple network-classification models perform quite well, well enough that they should be used regularly as baseline classifiers for studies of learning with networked data.
Abstract: This paper is about classifying entities that are interlinked with entities for which the class is known. After surveying prior work, we present NetKit, a modular toolkit for classification in networked data, and a case study of its application to networked data used in prior machine learning research. NetKit is based on a node-centric framework in which classifiers comprise a local classifier, a relational classifier, and a collective inference procedure. Various existing node-centric relational learning algorithms can be instantiated with appropriate choices for these components, and new combinations of components realize new algorithms. The case study focuses on univariate network classification, for which the only information used is the structure of class linkage in the network (i.e., only links and some class labels). To our knowledge, no work previously has evaluated systematically the power of class-linkage alone for classification in machine learning benchmark data sets. The results demonstrate that very simple network-classification models perform quite well, well enough that they should be used regularly as baseline classifiers for studies of learning with networked data. The simplest method (which performs remarkably well) highlights the close correspondence between several existing methods introduced for different purposes: Gaussian-field classifiers, Hopfield networks, and relational-neighbor classifiers. The case study also shows that there are two sets of techniques that are preferable in different situations, namely when few versus many labels are known initially. We also demonstrate that link selection plays an important role similar to traditional feature selection.
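
As an illustration of the relational-neighbor idea the paper highlights (a minimal sketch, not NetKit itself): a node's class score is the average of its neighbors' scores, iterated as a simple collective inference loop over a toy graph:

```python
# Weighted-vote relational-neighbor (wvRN) sketch: only link structure and
# a few known labels are used; unlabeled nodes take the mean of neighbors'
# scores until the scores settle. Graph and labels are toy values.
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 2), (3, 5)]
labels = {0: 1, 1: 1, 4: 0, 5: 0}            # known labels; 2 and 3 unknown
n = 6

adj = [[] for _ in range(n)]
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)

score = {u: labels.get(u, 0.5) for u in range(n)}
for _ in range(20):                           # simple collective inference
    for u in range(n):
        if u not in labels:                   # only update unlabeled nodes
            score[u] = np.mean([score[v] for v in adj[u]])

print({u: round(score[u], 2) for u in range(n) if u not in labels})
```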

Journal ArticleDOI
TL;DR: This study quantifies the sensitivity of feature selection algorithms to variations in the training set by assessing the stability of the feature preferences that they express in the form of weights-scores, ranks, or a selected feature subset.
Abstract: With the proliferation of extremely high-dimensional data, feature selection algorithms have become indispensable components of the learning process. Strangely, despite extensive work on the stability of learning algorithms, the stability of feature selection algorithms has been relatively neglected. This study is an attempt to fill that gap by quantifying the sensitivity of feature selection algorithms to variations in the training set. We assess the stability of feature selection algorithms based on the stability of the feature preferences that they express in the form of weights-scores, ranks, or a selected feature subset. We examine a number of measures to quantify the stability of feature preferences and propose an empirical way to estimate them. We perform a series of experiments with several feature selection algorithms on a set of proteomics datasets. The experiments allow us to explore the merits of each stability measure and create stability profiles of the feature selection algorithms. Finally, we show how stability profiles can support the choice of a feature selection algorithm.
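
A minimal sketch of one such stability estimate: rerun a selector on bootstrap resamples and average the pairwise Jaccard similarity of the selected subsets (the paper studies a broader family of measures for scores, ranks, and subsets):

```python
# Sketch: stability as mean pairwise Jaccard similarity of feature subsets
# selected on bootstrap resamples of the training set.
from itertools import combinations

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=150, n_features=40, n_informative=5,
                           random_state=0)
rng = np.random.default_rng(0)

subsets = []
for _ in range(20):
    idx = rng.choice(len(y), size=len(y), replace=True)   # bootstrap sample
    sel = SelectKBest(f_classif, k=8).fit(X[idx], y[idx])
    subsets.append(frozenset(np.nonzero(sel.get_support())[0]))

jac = [len(a & b) / len(a | b) for a, b in combinations(subsets, 2)]
print("stability (mean pairwise Jaccard):", round(float(np.mean(jac)), 3))
```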

Journal ArticleDOI
TL;DR: The least absolute deviation (LAD) regression is a useful method for robust regression, and the least absolute shrinkage and selection operator (lasso) is a popular choice for shrinkage estimation and variable selection; the two are combined here to produce LAD-lasso.
Abstract: The least absolute deviation (LAD) regression is a useful method for robust regression, and the least absolute shrinkage and selection operator (lasso) is a popular choice for shrinkage estimation and variable selection. In this article we combine these two classical ideas together to produce LAD-lasso. Compared with the LAD regression, LAD-lasso can do parameter estimation and variable selection simultaneously. Compared with the traditional lasso, LAD-lasso is resistant to heavy-tailed errors or outliers in the response. Furthermore, with easily estimated tuning parameters, the LAD-lasso estimator enjoys the same asymptotic efficiency as the unpenalized LAD estimator obtained under the true model (i.e., the oracle property). Extensive simulation studies demonstrate satisfactory finite-sample performance of LAD-lasso, and a real example is analyzed for illustration purposes.
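
The LAD-lasso criterion as described, reconstructed in LaTeX (the paper allows coefficient-specific tuning parameters, written here as lambda_j):

```latex
% LAD-lasso criterion: absolute-error loss (robust to heavy-tailed errors
% and outliers in the response) plus an l1 penalty with coefficient-specific
% tuning parameters.
\hat{\beta} \;=\; \arg\min_{\beta}\;
  \sum_{i=1}^{n} \bigl\lvert y_i - x_i^{\top}\beta \bigr\rvert
  \;+\; n \sum_{j=1}^{p} \lambda_j\,\lvert \beta_j \rvert
```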

Journal ArticleDOI
TL;DR: This paper presents a dimensionality reduction technique for hyperspectral images based on a hierarchical clustering structure that groups bands so as to minimize the intracluster variance and maximize the intercluster variance.
Abstract: Hyperspectral imaging involves large amounts of information. This paper presents a technique for dimensionality reduction to deal with hyperspectral images. The proposed method is based on a hierarchical clustering structure that groups bands to minimize the intracluster variance and maximize the intercluster variance. This aim is pursued using information measures, such as distances based on mutual information or Kullback-Leibler divergence, in order to reduce data redundancy and non-useful information among image bands. Experimental results include a comparison among some relevant and recent methods for hyperspectral band selection using no labeled information, showing their performance with regard to pixel image classification tasks. The presented technique behaves stably across different image data sets and achieves noticeable accuracy, mainly when selecting small sets of bands.
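
A hedged sketch of the band-clustering recipe: a mutual-information-based distance between bands, hierarchical clustering, and one representative band per cluster. The data is synthetic, and the exact information measures and representative-selection rule differ from the paper's:

```python
# Sketch: cluster hyperspectral bands with an MI-based distance and keep one
# band per cluster. MI is estimated by simple quantile binning; the first
# band per cluster is kept for brevity (the paper uses its own criterion).
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 4))                      # 4 latent "materials"
bands = np.hstack([base + 0.2 * rng.normal(size=(1000, 4)) for _ in range(5)])
disc = np.digitize(bands, np.quantile(bands, np.linspace(0.1, 0.9, 9)))

n_bands = bands.shape[1]
D = np.zeros((n_bands, n_bands))
for i in range(n_bands):
    for j in range(i + 1, n_bands):
        D[i, j] = D[j, i] = 1 - normalized_mutual_info_score(disc[:, i],
                                                             disc[:, j])

Z = linkage(squareform(D), method="average")
clusters = fcluster(Z, t=4, criterion="maxclust")
selected = [int(np.nonzero(clusters == c)[0][0]) for c in np.unique(clusters)]
print("representative bands:", selected)
```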

Journal ArticleDOI
01 Feb 2007
TL;DR: This correspondence presents a novel hybrid wrapper and filter feature selection algorithm for a classification problem using a memetic framework that incorporates a filter ranking method in the traditional genetic algorithm to improve classification performance and accelerate the search in identifying the core feature subsets.
Abstract: This correspondence presents a novel hybrid wrapper and filter feature selection algorithm for a classification problem using a memetic framework. It incorporates a filter ranking method in the traditional genetic algorithm to improve classification performance and accelerate the search in identifying the core feature subsets. In particular, the method adds or deletes a feature from a candidate feature subset based on the univariate feature ranking information. This empirical study on commonly used data sets from the University of California, Irvine repository and microarray data sets shows that the proposed method outperforms existing methods in terms of classification accuracy, number of selected features, and computational efficiency. Furthermore, we investigate several major issues of the memetic algorithm (MA) to identify a good balance between local search and genetic search so as to maximize search quality and efficiency in the hybrid filter and wrapper MA.
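
A minimal sketch of the filter-guided local-search operator described above (only the local step, not the full memetic algorithm): try adding the highest-ranked excluded feature and deleting the lowest-ranked included one, keeping changes that improve cross-validated accuracy:

```python
# Memetic local-search sketch: filter ranking (ANOVA F-scores) guides which
# feature to try adding/deleting; CV accuracy decides whether to keep it.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
rank = np.argsort(-f_classif(X, y)[0])        # univariate filter ranking

def cv_acc(mask):
    return cross_val_score(KNeighborsClassifier(), X[:, mask], y, cv=3).mean()

def local_search(mask):
    best = cv_acc(mask)
    add = [j for j in rank if not mask[j]]    # best-ranked excluded features
    dele = [j for j in rank[::-1] if mask[j]] # worst-ranked included features
    for j in (add[:1] + dele[:1]):
        trial = mask.copy()
        trial[j] = not trial[j]
        if trial.any() and cv_acc(trial) > best:
            mask, best = trial, cv_acc(trial)
    return mask

rng = np.random.default_rng(0)
mask = rng.random(X.shape[1]) < 0.5
print("before:", np.nonzero(mask)[0])
print("after: ", np.nonzero(local_search(mask))[0])
```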

Journal ArticleDOI
TL;DR: In this paper, the authors explore variable selection in high-dimensional models through a multi-stage procedure that screens candidate variables and then cleans the selection by hypothesis testing; three screening methods are considered (the lasso, marginal regression, and forward stepwise regression), and the procedure gives consistent variable selection under certain conditions.
Abstract: This paper explores the following question: what kind of statistical guarantees can be given when doing variable selection in high-dimensional models? In particular, we look at the error rates and power of some multi-stage regression methods. In the first stage we fit a set of candidate models. In the second stage we select one model by cross-validation. In the third stage we use hypothesis testing to eliminate some variables. We refer to the first two stages as "screening" and the last stage as "cleaning." We consider three screening methods: the lasso, marginal regression, and forward stepwise regression. Our method gives consistent variable selection under certain conditions.
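
A hedged sketch of the screen-and-clean recipe: screen with the lasso on one half of the data, then clean on the other half by refitting least squares and keeping only coefficients that survive a hypothesis test. The split, threshold, and Bonferroni correction are toy choices:

```python
# Screen-and-clean sketch: lasso screening on one half, OLS + hypothesis
# tests ("cleaning") on the other half. Thresholds are toy values.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=0)

screened = np.nonzero(LassoCV(cv=5).fit(X1, y1).coef_)[0]   # screen
ols = sm.OLS(y2, sm.add_constant(X2[:, screened])).fit()    # clean
kept = screened[ols.pvalues[1:] < 0.05 / len(screened)]     # Bonferroni test
print("screened:", screened, "-> kept:", kept)
```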

Journal ArticleDOI
TL;DR: This paper illustrates the use of a decision tree to identify the best features from a given set of samples, and of a Proximal Support Vector Machine (PSVM) that can efficiently classify the faults using the selected statistical features.
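
A hedged sketch of the pipeline described: decision-tree importances select the features, and a linear SVM stands in for the proximal SVM (PSVM itself is not in scikit-learn); the fault data is replaced by a synthetic set:

```python
# Sketch: tree-based feature selection followed by an SVM classifier; a
# LinearSVC stands in for PSVM, and the data is a toy substitute.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=25, n_informative=6,
                           random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
top = np.argsort(-tree.feature_importances_)[:6]     # tree-selected features
acc = cross_val_score(LinearSVC(max_iter=5000), X[:, top], y, cv=5).mean()
print("selected:", top, "CV accuracy:", round(float(acc), 3))
```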

Journal ArticleDOI
TL;DR: This paper investigates a novel approach based on fuzzy-rough sets, fuzzy rough feature selection (FRFS), that addresses problems and retains dataset semantics and is applied to two challenging domains where a feature reducing step is important; namely, web content classification and complex systems monitoring.
Abstract: Attribute selection (AS) refers to the problem of selecting those input attributes or features that are most predictive of a given outcome; a problem encountered in many areas such as machine learning, pattern recognition and signal processing. Unlike other dimensionality reduction methods, attribute selectors preserve the original meaning of the attributes after reduction. This has found application in tasks that involve datasets containing huge numbers of attributes (in the order of tens of thousands) which, for some learning algorithms, might be impossible to process further. Recent examples include text processing and web content classification. AS techniques have also been applied to small and medium-sized datasets in order to locate the most informative attributes for later use. One of the many successful applications of rough set theory has been to this area. The rough set ideology of using only the supplied data and no other information has many benefits in AS, where most other methods require supplementary knowledge. However, the main limitation of rough set-based attribute selection in the literature is the restrictive requirement that all data is discrete. In classical rough set theory, it is not possible to consider real-valued or noisy data. This paper investigates a novel approach based on fuzzy-rough sets, fuzzy rough feature selection (FRFS), that addresses these problems and retains dataset semantics. FRFS is applied to two challenging domains where a feature reducing step is important; namely, web content classification and complex systems monitoring. The utility of this approach is demonstrated and is compared empirically with several dimensionality reducers. In the experimental studies, FRFS is shown to equal or improve classification accuracy when compared to the results from unreduced data. Classifiers that use a lower dimensional set of attributes which are retained by fuzzy-rough reduction outperform those that employ more attributes returned by the existing crisp rough reduction method. In addition, it is shown that FRFS is more powerful than the other AS techniques in the comparative study.
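
A rough sketch of greedy fuzzy-rough selection in the spirit of FRFS, under explicit assumptions (Gaussian fuzzy similarity per attribute, min t-norm over a subset, dependency = mean fuzzy positive-region membership); this is a simplified reading, not a faithful reimplementation:

```python
# Greedy fuzzy-rough selection sketch: grow the subset while the fuzzy-rough
# dependency degree keeps increasing. Similarity/t-norm choices are assumed.
import numpy as np

def similarity(col):
    d = np.abs(col[:, None] - col[None, :])
    return np.exp(-(d ** 2) / (2 * max(col.std(), 1e-9) ** 2))

def dependency(X, y, subset):
    R = np.min([similarity(X[:, a]) for a in subset], axis=0)  # min t-norm
    same = (y[:, None] == y[None, :]).astype(float)
    # fuzzy lower approximation of each object's own decision class:
    pos = np.min(np.maximum(1 - R, same), axis=1)
    return pos.mean()

def frfs(X, y):
    subset, best = [], 0.0
    while True:
        gains = [(dependency(X, y, subset + [a]), a)
                 for a in range(X.shape[1]) if a not in subset]
        top, a = max(gains)
        if top <= best + 1e-6:
            return subset
        subset.append(a)
        best = top

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))
y = (X[:, 0] + X[:, 3] > 0).astype(int)
print("selected attributes:", frfs(X, y))
```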

Journal ArticleDOI
TL;DR: A method for fault diagnosis is proposed based on empirical mode decomposition (EMD), an improved distance evaluation technique, and the combination of multiple adaptive neuro-fuzzy inference systems (ANFISs); experimental results show that the multiple ANFIS combination can reliably recognise different fault categories and severities.

Journal ArticleDOI
TL;DR: A new SVM approach, named Enhanced SVM, is proposed; it combines supervised and unsupervised SVM methods in order to provide unsupervised learning with a low false alarm capability, similar to that of a supervised SVM approach.

Journal ArticleDOI
TL;DR: In this article, the maximum relevance/minimum redundancy (MRMR) principle is used to select among the least redundant variables the ones that have the highest mutual information with the target.
Abstract: The paper presents MRNET, an original method for inferring genetic networks from microarray data. The method is based on maximum relevance/minimum redundancy (MRMR), an effective information-theoretic technique for feature selection in supervised learning. The MRMR principle consists in selecting among the least redundant variables the ones that have the highest mutual information with the target. MRNET extends this feature selection principle to networks in order to infer gene-dependence relationships from microarray data. The paper assesses MRNET by benchmarking it against RELNET, CLR, and ARACNE, three state-of-the-art information-theoretic methods for large (up to several thousands of genes) network inference. Experimental results on thirty synthetically generated microarray datasets show that MRNET is competitive with these methods.
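
A hedged sketch of the MRNET recipe: estimate a mutual-information matrix (here by simple binning on synthetic expression data), run MRMR with each gene as the target, and score an edge by the best relevance-minus-redundancy value it attains:

```python
# MRNET-style sketch: MI matrix + per-target MRMR ranking; the edge weight
# is the best MRMR score in either direction. MI estimation and the number
# of MRMR steps are simplified toy choices.
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
n_genes, n_samples = 8, 300
E = rng.normal(size=(n_samples, n_genes))
E[:, 1] = E[:, 0] + 0.3 * rng.normal(size=n_samples)   # 0 -> 1 dependency
E[:, 2] = E[:, 1] + 0.3 * rng.normal(size=n_samples)   # 1 -> 2 dependency

disc = np.digitize(E, np.quantile(E, np.linspace(0.2, 0.8, 4)))
MI = np.array([[mutual_info_score(disc[:, i], disc[:, j])
                for j in range(n_genes)] for i in range(n_genes)])

W = np.zeros((n_genes, n_genes))
for t in range(n_genes):
    selected = []
    for _ in range(3):                                  # a few MRMR steps
        cand = [j for j in range(n_genes) if j != t and j not in selected]
        score = {j: MI[j, t] - (np.mean([MI[j, k] for k in selected])
                                if selected else 0.0)
                 for j in cand}
        j = max(score, key=score.get)                   # relevance - redundancy
        W[t, j] = W[j, t] = max(W[t, j], score[j])
        selected.append(j)

print(np.round(W, 2))
```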

Book ChapterDOI
17 Sep 2007
TL;DR: Two new techniques are proposed: the first is based on a smooth (differentiable) convex approximation to the L1 regularizer that does not depend on any assumptions about the loss function used; the second addresses the non-differentiability of the L1 regularizer by casting the problem as a constrained optimization problem.
Abstract: L1 regularization is effective for feature selection, but the resulting optimization is challenging due to the non-differentiability of the L1-norm. In this paper we compare state-of-the-art optimization techniques for solving this problem across several loss functions. Furthermore, we propose two new techniques. The first is based on a smooth (differentiable) convex approximation to the L1 regularizer that does not depend on any assumptions about the loss function used. The other technique is a new strategy that addresses the non-differentiability of the L1 regularizer by casting the problem as a constrained optimization problem that is then solved using a specialized gradient projection method. Extensive comparisons show that our newly proposed approaches consistently rank among the best in terms of convergence speed and efficiency, as measured by the number of function evaluations required.
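
A minimal sketch of the first idea: replace |w| with the smooth approximation sqrt(w^2 + eps) so that any smooth loss plus the "L1" term can be handed to a standard quasi-Newton optimizer (logistic loss below; eps and lambda are toy values, not the paper's settings):

```python
# Smooth-L1 sketch: |w| ~= sqrt(w^2 + eps) makes the penalized objective
# differentiable, so L-BFGS can be applied directly.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
w_true = np.zeros(30)
w_true[:4] = 2.0
y = (X @ w_true + rng.normal(size=200) > 0) * 2.0 - 1.0  # labels in {-1, +1}

lam, eps = 1.0, 1e-6

def objective(w):
    m = np.clip(y * (X @ w), -30, 30)
    loss = np.log1p(np.exp(-m)).sum()                    # logistic loss
    return loss + lam * np.sqrt(w ** 2 + eps).sum()      # smooth |w|

def grad(w):
    m = np.clip(y * (X @ w), -30, 30)
    g_loss = -(X * (y / (1 + np.exp(m)))[:, None]).sum(axis=0)
    return g_loss + lam * w / np.sqrt(w ** 2 + eps)

res = minimize(objective, np.zeros(30), jac=grad, method="L-BFGS-B")
print("near-zero weights:", int((np.abs(res.x) < 1e-3).sum()), "of 30")
```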

Proceedings Article
12 Feb 2007
TL;DR: A stability index is proposed based on the cardinality of the intersection and a correction for chance; experimental results indicate that the index can be useful for selecting the final feature subset.
Abstract: Sequential forward selection (SFS) is one of the most widely used feature selection procedures. It starts with an empty set and adds one feature at each step. The estimate of the quality of the candidate subsets usually depends on the training/testing split of the data. Therefore different sequences of features may be returned from repeated runs of SFS. A substantial discrepancy between such sequences will signal a problem with the selection. A stability index is proposed here based on cardinality of the intersection and a correction for chance. The experimental results with 10 real data sets indicate that the index can be useful for selecting the final feature subset. If stability is high, then we should return a subset of features based on their total rank across the SFS runs. If stability is low, then it is better to return the feature subset which gave the minimum error across all SFS runs.
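
The proposed index is of the consistency-index form sketched below: for two subsets of size k drawn from n features, the expected overlap by chance is k^2/n, and the index rescales the observed overlap accordingly. Values near 1 indicate stable selection; the exact formula here follows that description and should be checked against the paper:

```python
# Chance-corrected stability sketch for two selected subsets of equal size k
# out of n features: (observed overlap - k^2/n) / (k - k^2/n).
def consistency_index(a, b, n):
    k = len(a)                       # both subsets assumed to have size k
    r = len(set(a) & set(b))         # observed overlap
    expected = k * k / n
    return (r - expected) / (k - expected)

run1 = [3, 7, 11, 19, 24]            # toy outputs of two SFS runs, n = 30
run2 = [3, 7, 12, 19, 28]
print(round(consistency_index(run1, run2, n=30), 3))
```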

Journal ArticleDOI
TL;DR: A simple and efficient hybrid attribute reduction algorithm is introduced, based on a generalized fuzzy-rough model defined over fuzzy relations; using the technique of variable precision fuzzy inclusion in computing the decision positive region, it can attain optimal classification performance.