Journal ArticleDOI

Measuring relevance between discrete and continuous features based on neighborhood mutual information

01 Sep 2011 - Expert Systems With Applications (Pergamon Press, Inc.) - Vol. 38, Iss. 9, pp. 10737-10750
TL;DR: It is shown that the proposed measure is a natural extension of classical mutual information which reduces to the classical one if features are discrete; thus the new measure can also be used to compute the relevance between discrete variables.
Abstract: Measures of relevance between features play an important role in classification and regression analysis. Mutual information has proved to be an effective measure for decision tree construction and feature selection. However, there is a limitation in computing relevance between numerical features with mutual information due to problems of estimating probability density functions in high-dimensional spaces. In this work, we generalize Shannon's information entropy to neighborhood information entropy and propose a measure of neighborhood mutual information. It is shown that the new measure is a natural extension of classical mutual information, which reduces to the classical one if features are discrete; thus the new measure can also be used to compute the relevance between discrete variables. In addition, the new measure introduces a parameter delta to control the granularity in analyzing data. With numerical experiments, we show that neighborhood mutual information produces nearly the same outputs as mutual information. However, unlike mutual information, no discretization is required when computing relevance with the proposed measure. We combine the proposed measure with four classes of evaluation strategies used for feature selection. Finally, the proposed algorithms are tested on several benchmark data sets. The results show that neighborhood mutual information based algorithms yield better performance than some classical ones.
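
The measure lends itself to a direct implementation. Below is a minimal Python sketch assuming the definitions described in the abstract: each sample's delta-neighborhood is taken under the Chebyshev distance, neighborhood entropy is NH_delta(B) = -(1/n) * sum_i log(|delta_B(x_i)|/n), and mutual information is formed from the marginal and joint (intersected-neighborhood) entropies. Function names and the distance choice are illustrative, not the authors' code.

```python
import numpy as np

def delta_neighborhoods(X, delta):
    """Boolean matrix N where N[i, j] is True iff sample x_j lies in
    the delta-neighborhood of x_i under the Chebyshev distance."""
    dist = np.max(np.abs(X[:, None, :] - X[None, :, :]), axis=2)
    return dist <= delta

def neighborhood_entropy(N):
    """NH_delta = -(1/n) * sum_i log(|neighborhood(x_i)| / n)."""
    n = N.shape[0]
    return -np.mean(np.log(N.sum(axis=1) / n))

def neighborhood_mutual_info(XB, XC, delta):
    """NMI_delta(B; C) = NH(B) + NH(C) - NH(B, C), where the joint
    entropy is built from the intersection of the two neighborhoods."""
    NB = delta_neighborhoods(XB, delta)
    NC = delta_neighborhoods(XC, delta)
    n = NB.shape[0]
    joint = -np.mean(np.log((NB & NC).sum(axis=1) / n))
    return neighborhood_entropy(NB) + neighborhood_entropy(NC) - joint

# Example: relevance between one numerical feature and a discrete class.
X = np.random.rand(100, 1)                       # feature values in [0, 1]
y = (X[:, 0] > 0.5).astype(float).reshape(-1, 1) # class labels as a column
print(neighborhood_mutual_info(X, y, delta=0.1))
```

With delta = 0 and integer-coded discrete features, each neighborhood collapses to an equivalence class and the expression reduces to classical mutual information, which is the extension property the abstract describes.
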
Citations
Journal ArticleDOI
TL;DR: This paper considers two factors of multi-label features, feature dependency and feature redundancy, and proposes an evaluation measure that combines mutual information with a max-dependency and min-redundancy algorithm, which allows selecting a superior feature subset for multi-label learning.
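
As a rough illustration of the max-dependency/min-redundancy idea this TL;DR describes, the sketch below scores one candidate feature as its average mutual information with the label columns minus its average mutual information with already-selected features. This is a generic mRMR-style criterion, not the paper's exact measure; the function and variable names are hypothetical.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mrmr_score(candidate, selected, label_columns):
    """Score one discrete candidate feature: mean mutual information
    with the label columns (dependency) minus mean mutual information
    with the already-selected features (redundancy)."""
    relevance = np.mean([mutual_info_score(candidate, l) for l in label_columns])
    if not selected:
        return relevance
    redundancy = np.mean([mutual_info_score(candidate, s) for s in selected])
    return relevance - redundancy
```
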

178 citations

Journal ArticleDOI
TL;DR: A neighborhood discrimination index is proposed to characterize the distinguishing information of a neighborhood relation; it reflects the distinguishing ability of a feature subset, and the resulting algorithm yields superior performance compared to classical algorithms.
Abstract: Feature selection is viewed as an important preprocessing step for pattern recognition, machine learning, and data mining. Neighborhood is one of the most important concepts in classification learning and can be used to distinguish samples with different decisions. In this paper, a neighborhood discrimination index is proposed to characterize the distinguishing information of a neighborhood relation. It reflects the distinguishing ability of a feature subset. The proposed discrimination index is computed by considering the cardinality of a neighborhood relation rather than neighborhood similarity classes. Variants of the discrimination index, including joint discrimination index, conditional discrimination index, and mutual discrimination index, are introduced to compute the change of distinguishing information caused by the combination of multiple feature subsets. They have properties similar to those of Shannon entropy and its variants. A parameter, named neighborhood radius, is introduced in these discrimination measures to address the analysis of real-valued data. Based on the proposed discrimination measures, the significance measure of a candidate feature is defined and a greedy forward algorithm for feature selection is designed. Data sets selected from public data sources are used to compare the proposed algorithm with existing algorithms. The experimental results confirm that the discrimination index-based algorithm yields superior performance compared to other classical algorithms.
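
A minimal sketch of the idea, assuming the index takes the form log(n^2 / |R|), where |R| counts the sample pairs related by the delta-neighborhood relation; this form is inferred from the abstract's description and is not copied from the paper.

```python
import numpy as np

def relation_cardinality(X, delta):
    """|R|: number of ordered sample pairs (x_i, x_j) whose Chebyshev
    distance over the feature subset is at most delta."""
    dist = np.max(np.abs(X[:, None, :] - X[None, :, :]), axis=2)
    return np.sum(dist <= delta)

def discrimination_index(X, delta):
    """H_delta(B) = log(n^2 / |R_B|): the fewer related pairs a feature
    subset leaves, the greater its distinguishing power and the larger
    the index."""
    n = X.shape[0]
    return np.log(n ** 2 / relation_cardinality(X, delta))
```

Counting pairs in the relation avoids materializing neighborhood similarity classes, which is the computational point the abstract makes.
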

158 citations


Cites background or methods from "Measuring relevance between discret..."

  • ...where [x_i]_{R_1} is the successor neighborhood of x_i with respect to R_1 (see [16], [17])....


  • ...According to neighborhood entropy [16], [17]...


  • ...neighborhood rough set-based algorithm (NRS) [15], neighborhood entropy-based algorithm (NEIEN) [16], fuzzy information entropy-based algorithm (FINEN) [17], [58], and fuzzy rough dependence constructed by intersection operations of...


  • ...To present the selected feature subset of a data set, in the following we employ the NEIEN, FINEN, and HANDI algorithms to reduce the entire data set based on the parameters where the classification accuracies were obtained in the above experiments....


  • ...The complexity of HANDI is less than that of the NEIEN, FINEN, and FRSINT algorithms....


Journal ArticleDOI
TL;DR: It is proven that the fourth measure, called relative neighborhood self-information, is better for feature selection than the other measures: it considers both the lower and the upper approximations, and its magnitude changes the most as feature subsets vary.
Abstract: The concept of dependency in a neighborhood rough set model is an important evaluation function for feature selection. This function considers only the classification information contained in the lower approximation of the decision while ignoring the upper approximation. In this paper, we construct a class of uncertainty measures, decision self-information, for feature selection. These measures take into account the uncertainty information in both the lower and the upper approximations. The relationships between these measures and their properties are discussed in detail. It is proven that the fourth measure, called relative neighborhood self-information, is better for feature selection than the other measures, because it not only considers both the lower and the upper approximations but also exhibits the largest change in magnitude as feature subsets vary. This facilitates the selection of optimal feature subsets. Finally, a greedy algorithm for feature selection has been designed and a series of numerical experiments was carried out to verify the effectiveness of the proposed algorithm. The experimental results show that the proposed algorithm often chooses fewer features and improves the classification accuracy in most cases.
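
The measures in this abstract are built on the lower and upper approximations of a neighborhood rough set. The sketch below computes those two approximations for each decision class; the paper's exact self-information formulas are not reproduced, so treat this as scaffolding only.

```python
import numpy as np

def lower_upper_approximations(X, y, delta):
    """Lower and upper approximations of each decision class under the
    delta-neighborhood relation (Chebyshev distance)."""
    dist = np.max(np.abs(X[:, None, :] - X[None, :, :]), axis=2)
    N = dist <= delta                      # N[i]: neighborhood of x_i
    lower, upper = {}, {}
    for c in np.unique(y):
        in_c = (y == c)
        # lower: the whole neighborhood of x_i lies inside class c
        lower[c] = np.where([np.all(in_c | ~N[i]) for i in range(len(y))])[0]
        # upper: the neighborhood of x_i intersects class c
        upper[c] = np.where([np.any(in_c & N[i]) for i in range(len(y))])[0]
    return lower, upper
```
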

147 citations


Cites background or methods from "Measuring relevance between discret..."

  • ...Thus, the Nemenyi tests demonstrate that NSI is significantly better than NRE and NRS at α = 0.1....


  • ...[54] defined a feature relevance measure to characterize the classification ability of feature subsets....


  • ...Three excellent algorithms, including neighborhood entropy (NRE) [54], NRS [14] and neighborhood discrimination index (NDI) [56], are selected and used to compare the proposed method....


  • ...For the 3NN classifier, it is easily observed from Table IX that the distances between NSI to NRE, NRS, and NDI are all greater than 1.2612....


  • ...According to Table VIII, the distances between average orderings of NSI to NRE and NRS are greater than 1.2612 for SVM....


Journal ArticleDOI
TL;DR: This study develops a new multigranulation rough set model, called an intuitionistic fuzzy multigranulation rough set (IFMGRS). Three types of IFMGRSs are proposed, each a generalization of an existing intuitionistic fuzzy rough set.

120 citations

Journal ArticleDOI
01 Jan 2016
TL;DR: This paper introduces the margin of an instance to granulate all instances under different labels, defines three concepts of neighborhood based on different cognitive viewpoints, and on that basis generalizes neighborhood information entropy to multi-label learning, proposing three new measures of neighborhood mutual information.
Abstract:
Highlights:
  • Unlike traditional multi-label feature selection, the proposed algorithm derives from different cognitive viewpoints.
  • A simple and intuitive metric for evaluating candidate features is proposed.
  • The proposed algorithm is applicable to both categorical and numerical features.
  • The proposed method outperforms other state-of-the-art multi-label feature selection methods in our experiments.

Multi-label learning deals with data associated with a set of labels simultaneously. Like traditional single-label learning, the high dimensionality of data is a stumbling block for multi-label learning. In this paper, we first introduce the margin of an instance to granulate all instances under different labels, and three different concepts of neighborhood are defined based on different cognitive viewpoints. On this basis, we generalize neighborhood information entropy to fit multi-label learning and propose three new measures of neighborhood mutual information. It is shown that these new measures are a natural extension from single-label learning to multi-label learning. Then, we present an optimization objective function to evaluate the quality of candidate features, which can be solved by approximating the multi-label neighborhood mutual information. Finally, extensive experiments conducted on publicly available data sets verify the effectiveness of the proposed algorithm by comparing it with state-of-the-art methods.
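
The multi-label definitions are specific to this paper, but the margin-based granulation it starts from has a simple single-label form: an instance's margin is its distance to the nearest sample of a different class minus its distance to the nearest sample of the same class, and that margin can serve as a per-instance neighborhood radius. The sketch below shows the single-label version only; the per-label multi-label adaptation is the paper's contribution and is not reproduced here.

```python
import numpy as np

def instance_margin(X, y, i):
    """Single-label sample margin of x_i: distance to the nearest
    sample of a different class minus distance to the nearest sample
    of the same class (assumes both exist)."""
    dist = np.linalg.norm(X - X[i], axis=1)
    dist[i] = np.inf                       # exclude x_i itself
    nearest_hit = np.min(dist[y == y[i]])
    nearest_miss = np.min(dist[y != y[i]])
    return nearest_miss - nearest_hit
```
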

117 citations

References
Journal ArticleDOI
TL;DR: This final installment of the paper considers the case where the signals or the messages or both are continuously variable, in contrast with the discrete nature assumed until now.
Abstract: In this final installment of the paper we consider the case where the signals or the messages or both are continuously variable, in contrast with the discrete nature assumed until now. To a considerable extent the continuous case can be obtained through a limiting process from the discrete case by dividing the continuum of messages and signals into a large but finite number of small regions and calculating the various parameters involved on a discrete basis. As the size of the regions is decreased these parameters in general approach as limits the proper values for the continuous case. There are, however, a few new effects that appear and also a general change of emphasis in the direction of specialization of the general results to particular cases.
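
The limiting process this abstract describes can be made concrete. Partitioning the real line into cells of width Δ and applying the discrete entropy formula to the quantized variable X^Δ gives (a standard derivation, not quoted from the paper):

```latex
H\!\left(X^{\Delta}\right)
  = -\sum_{i} f(x_i)\,\Delta \,\log\!\bigl(f(x_i)\,\Delta\bigr)
  = -\sum_{i} f(x_i)\log f(x_i)\,\Delta \;-\; \log\Delta
    % using \sum_i f(x_i)\,\Delta \approx 1
  \;\xrightarrow{\;\Delta \to 0\;}\;
  -\int f(x)\log f(x)\,dx \;-\; \log\Delta
```

The -log Δ term diverges as Δ → 0, which is why differential entropy is not a literal limit of discrete entropy; differences of entropies, such as mutual information, remain finite because the term cancels.
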

65,425 citations


"Measuring relevance between discret..." refers background in this paper

  • ...Shannon’s entropy, first introduced in 1948 (Shannon, 1948), is a measure of uncertainty of random variables....

    [...]

Journal ArticleDOI
TL;DR: This paper describes an approach to synthesizing decision trees that has been used in a variety of systems, presents one such system, ID3, in detail, and discusses a reported shortcoming of the basic algorithm.
Abstract: The technology for building knowledge-based systems by inductive inference from examples has been demonstrated successfully in several practical applications. This paper summarizes an approach to synthesizing decision trees that has been used in a variety of systems, and it describes one such system, ID3, in detail. Results from recent studies show ways in which the methodology can be modified to deal with information that is noisy and/or incomplete. A reported shortcoming of the basic algorithm is discussed and two means of overcoming it are compared. The paper concludes with illustrations of current research directions.

17,177 citations

Journal ArticleDOI
TL;DR: This article gives an introduction to the subject of classification and regression trees by reviewing some widely available algorithms and comparing their capabilities, strengths, and weakness in two examples.
Abstract: Classification and regression trees are machine-learning methods for constructing prediction models from data. The models are obtained by recursively partitioning the data space and fitting a simple prediction model within each partition. As a result, the partitioning can be represented graphically as a decision tree. Classification trees are designed for dependent variables that take a finite number of unordered values, with prediction error measured in terms of misclassification cost. Regression trees are for dependent variables that take continuous or ordered discrete values, with prediction error typically measured by the squared difference between the observed and predicted values. This article gives an introduction to the subject by reviewing some widely available algorithms and comparing their capabilities, strengths, and weaknesses in two examples.
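
To make the split-selection machinery concrete: each candidate split is scored by the decrease in an impurity index it produces. The sketch below computes the Gini decrease (CART-style) and the information gain (ID3-style) for a binary threshold split; it is an illustration of the indexes named in this article and in the citation context below, not code from either source.

```python
import numpy as np

def gini(y):
    """Gini impurity: 1 - sum_c p_c^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(y):
    """Shannon entropy in bits: -sum_c p_c * log2(p_c)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def split_scores(x, y, threshold):
    """Gini decrease and information gain for the split x <= threshold;
    assumes both partitions are non-empty."""
    left, right = y[x <= threshold], y[x > threshold]
    w_l, w_r = len(left) / len(y), len(right) / len(y)
    gini_gain = gini(y) - (w_l * gini(left) + w_r * gini(right))
    info_gain = entropy(y) - (w_l * entropy(left) + w_r * entropy(right))
    return gini_gain, info_gain
```
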

16,974 citations


"Measuring relevance between discret..." refers methods in this paper

  • ...In decision tree construction, indexes such as Gini, twoing, deviance and mutual information were introduced to compute the relevance between inputs and output, thus guiding the algorithms to select an informative feature to split samples (Breiman, 1993; Quinlan, 1986, 1993)....


Book
01 Jan 1983
TL;DR: The methodology used to construct tree-structured rules is the focus of this monograph, which covers the use of trees as a data analysis method and, in a more mathematical framework, proves some of their fundamental properties.
Abstract: The methodology used to construct tree structured rules is the focus of this monograph. Unlike many other statistical procedures, which moved from pencil and paper to calculators, this text's use of trees was unthinkable before computers. Both the practical and theoretical sides have been developed in the authors' study of tree methods. Classification and Regression Trees reflects these two sides, covering the use of trees as a data analysis method, and in a more mathematical framework, proving some of their fundamental properties.

14,825 citations