
Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy

05 Aug 2005 - Vol. 27, Iss: 8
TL;DR: This work derives an equivalent form, called the minimal-redundancy-maximal-relevance (mRMR) criterion, for first-order incremental feature selection, and presents a two-stage feature selection algorithm that combines mRMR with other more sophisticated feature selectors (e.g., wrappers).
About: The article was published on 2005-08-05 and is currently open access. It has received 7,075 citations to date. The article focuses on the topics: Feature selection & Mutual information.
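The TL;DR above can be made concrete with a small sketch of the first-order incremental (greedy) mRMR selection it describes. This is an illustrative reading of the criterion, not the authors' reference implementation; it assumes discretized features and uses scikit-learn's mutual_info_score as the mutual-information estimator.

```python
# Minimal sketch of first-order incremental (greedy) mRMR selection.
# Assumes discretized features; uses scikit-learn's mutual_info_score
# as the mutual-information estimator. Illustrative only.
import numpy as np
from sklearn.metrics import mutual_info_score

def mrmr_select(X, y, k):
    """Greedily pick k feature indices maximizing relevance minus mean redundancy."""
    n_features = X.shape[1]
    relevance = np.array([mutual_info_score(y, X[:, j]) for j in range(n_features)])
    selected = [int(np.argmax(relevance))]          # start with the most relevant feature
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            redundancy = np.mean([mutual_info_score(X[:, j], X[:, s]) for s in selected])
            score = relevance[j] - redundancy       # mRMR: max-relevance minus mean redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected
```

In the two-stage algorithm mentioned in the TL;DR, a candidate list ranked this way would subsequently be pruned by a more expensive selector such as a wrapper.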
Citations
Journal ArticleDOI
01 Apr 2012
TL;DR: ELM provides a unified learning platform with a widespread type of feature mappings and can be applied in regression and multiclass classification applications directly and in theory, ELM can approximate any target continuous function and classify any disjoint regions.
Abstract: Due to the simplicity of their implementations, least square support vector machine (LS-SVM) and proximal support vector machine (PSVM) have been widely used in binary classification applications. The conventional LS-SVM and PSVM cannot be used in regression and multiclass classification applications directly, although variants of LS-SVM and PSVM have been proposed to handle such cases. This paper shows that both LS-SVM and PSVM can be simplified further and a unified learning framework of LS-SVM, PSVM, and other regularization algorithms referred to extreme learning machine (ELM) can be built. ELM works for the “generalized” single-hidden-layer feedforward networks (SLFNs), but the hidden layer (or called feature mapping) in ELM need not be tuned. Such SLFNs include but are not limited to SVM, polynomial network, and the conventional feedforward neural networks. This paper shows the following: 1) ELM provides a unified learning platform with a widespread type of feature mappings and can be applied in regression and multiclass classification applications directly; 2) from the optimization method point of view, ELM has milder optimization constraints compared to LS-SVM and PSVM; 3) in theory, compared to ELM, LS-SVM and PSVM achieve suboptimal solutions and require higher computational complexity; and 4) in theory, ELM can approximate any target continuous function and classify any disjoint regions. As verified by the simulation results, ELM tends to have better scalability and achieve similar (for regression and binary class cases) or much better (for multiclass cases) generalization performance at much faster learning speed (up to thousands times) than traditional SVM and LS-SVM.
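The unified ELM formulation summarized above reduces, in its simplest form, to a random, untuned hidden layer followed by a regularized least-squares readout. The sketch below is a minimal illustration of that idea rather than the paper's exact algorithm; the hidden-layer size n_hidden and regularization constant C are illustrative parameters.

```python
# Rough sketch of a basic ELM classifier: a random, untuned hidden layer
# followed by a regularized least-squares readout. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def elm_train(X, y, n_hidden=200, C=1.0):
    n_classes = y.max() + 1
    T = np.eye(n_classes)[y]                       # one-hot targets
    W = rng.normal(size=(X.shape[1], n_hidden))    # random input weights (not tuned)
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)                         # random feature mapping
    # Regularized least-squares output weights (ridge solution)
    beta = np.linalg.solve(H.T @ H + np.eye(n_hidden) / C, H.T @ T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.argmax(np.tanh(X @ W + b) @ beta, axis=1)
```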

4,835 citations


Cites methods from "Feature selection based on mutual i..."

  • ...Performances of the different algorithms have also been tested on both leukemia data set and colon microarray data set after the minimum-redundancy–maximum-relevance feature selection method [54] being taken (cf....

Journal ArticleDOI
TL;DR: The objective is to provide a generic introduction to variable elimination that can be applied to a wide array of machine learning problems, with a focus on filter, wrapper, and embedded methods.

3,517 citations


Cites background or methods from "Feature selection based on mutual i..."

  • ...In [23,24] the authors develop a ranking criteria based on class densities for binary data....

  • ...The mRMR (max-relevancy, min-redundancy) [24] is another method based on MI....

  • ...SVM [51,2,24,18] is a marginal classifier which maximizes the margin between the data samples in the two classes....

Journal ArticleDOI
TL;DR: The state of the art in HAR based on wearable sensors is surveyed, and a two-level taxonomy according to the learning approach and the response time is proposed.
Abstract: Providing accurate and opportune information on people's activities and behaviors is one of the most important tasks in pervasive computing. Innumerable applications can be visualized, for instance, in medical, security, entertainment, and tactical scenarios. Despite human activity recognition (HAR) being an active field for more than a decade, there are still key aspects that, if addressed, would constitute a significant turn in the way people interact with mobile devices. This paper surveys the state of the art in HAR based on wearable sensors. A general architecture is first presented along with a description of the main components of any HAR system. We also propose a two-level taxonomy in accordance to the learning approach (either supervised or semi-supervised) and the response time (either offline or online). Then, the principal issues and challenges are discussed, as well as the main solutions to each one of them. Twenty eight systems are qualitatively evaluated in terms of recognition performance, energy consumption, obtrusiveness, and flexibility, among others. Finally, we present some open problems and ideas that, due to their high relevance, should be addressed in future research.

2,184 citations

Proceedings Article
06 Dec 2010
TL;DR: A new robust feature selection method emphasizing joint l2,1-norm minimization on both the loss function and the regularization is proposed and applied to both genomic and proteomic biomarker discovery.
Abstract: Feature selection is an important component of many machine learning applications. Especially in many bioinformatics tasks, efficient and robust feature selection methods are desired to extract meaningful features and eliminate noisy ones. In this paper, we propose a new robust feature selection method with emphasizing joint l2,1-norm minimization on both loss function and regularization. The l2,1-norm based loss function is robust to outliers in data points and the l2,1-norm regularization selects features across all data points with joint sparsity. An efficient algorithm is introduced with proved convergence. Our regression based objective makes the feature selection process more efficient. Our method has been applied into both genomic and proteomic biomarkers discovery. Extensive empirical studies are performed on six data sets to demonstrate the performance of our feature selection method.
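The abstract's joint l2,1-norm idea can be illustrated with a simplified variant: keep a squared loss but add an l2,1 penalty on the weight matrix, solve it with the usual iterative-reweighting trick, and rank features by the resulting row norms. This is a sketch of the general technique, not the paper's exact algorithm (which also places the l2,1 norm on the loss term); all names below are illustrative.

```python
# Simplified sketch: l2,1-regularized least squares solved by iterative
# reweighting, showing how the l2,1 penalty drives whole rows of W toward
# zero (i.e., discards features). Y is a 2-D target matrix (e.g., one-hot).
import numpy as np

def l21_feature_scores(X, Y, gamma=1.0, n_iter=50, eps=1e-8):
    d = X.shape[1]
    D = np.eye(d)                                   # start from a plain ridge solution
    for _ in range(n_iter):
        W = np.linalg.solve(X.T @ X + gamma * D, X.T @ Y)
        row_norms = np.sqrt((W ** 2).sum(axis=1)) + eps
        D = np.diag(1.0 / (2.0 * row_norms))        # reweighting induced by the l2,1 term
    return np.sqrt((W ** 2).sum(axis=1))            # rank features by ||w_i||_2
```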

1,697 citations

Journal ArticleDOI
TL;DR: This survey revisits feature selection research from a data perspective and reviews representative feature selection algorithms for conventional data, structured data, heterogeneous data and streaming data, and categorizes them into four main groups: similarity-based, information-theoretical-based, sparse-learning-based, and statistical-based.
Abstract: Feature selection, as a data preprocessing strategy, has been proven to be effective and efficient in preparing data (especially high-dimensional data) for various data-mining and machine-learning problems. The objectives of feature selection include building simpler and more comprehensible models, improving data-mining performance, and preparing clean, understandable data. The recent proliferation of big data has presented some substantial challenges and opportunities to feature selection. In this survey, we provide a comprehensive and structured overview of recent advances in feature selection research. Motivated by current challenges and opportunities in the era of big data, we revisit feature selection research from a data perspective and review representative feature selection algorithms for conventional data, structured data, heterogeneous data and streaming data. Methodologically, to emphasize the differences and similarities of most existing feature selection algorithms for conventional data, we categorize them into four main groups: similarity-based, information-theoretical-based, sparse-learning-based, and statistical-based methods. To facilitate and promote the research in this community, we also present an open source feature selection repository that consists of most of the popular feature selection algorithms (http://featureselection.asu.edu/). Also, we use it as an example to show how to evaluate feature selection algorithms. At the end of the survey, we present a discussion about some open problems and challenges that require more attention in future research.

1,566 citations


Cites background from "Feature selection based on mutual i..."

  • ...Peng et al. (2005) proposes a Minimum Redundancy Maximum Relevance (MRMR) criterion to set the value of β to be the reverse of the number of selected features: J_MRMR(X_k) = I(X_k; Y) − (1/|S|) Σ_{X_j ∈ S} I(X_k; X_j)...

  • ...…Du et al. 2013; Tang et al. 2014), feature correlation (Koller and Sahami 1995; Guyon and Elisseeff 2003), mutual information (Yu and Liu 2003; Peng et al. 2005; Nguyen et al. 2014; Shishkin et al. 2016; Gao et al. 2016), feature ability to preserve data manifold structure (He et al. 2005;…...

References
Book
01 Jan 1991
TL;DR: The authors examine the role of entropy, inequalities, and randomness in the design and construction of codes.
Abstract: Preface to the Second Edition. Preface to the First Edition. Acknowledgments for the Second Edition. Acknowledgments for the First Edition. 1. Introduction and Preview. 1.1 Preview of the Book. 2. Entropy, Relative Entropy, and Mutual Information. 2.1 Entropy. 2.2 Joint Entropy and Conditional Entropy. 2.3 Relative Entropy and Mutual Information. 2.4 Relationship Between Entropy and Mutual Information. 2.5 Chain Rules for Entropy, Relative Entropy, and Mutual Information. 2.6 Jensen's Inequality and Its Consequences. 2.7 Log Sum Inequality and Its Applications. 2.8 Data-Processing Inequality. 2.9 Sufficient Statistics. 2.10 Fano's Inequality. Summary. Problems. Historical Notes. 3. Asymptotic Equipartition Property. 3.1 Asymptotic Equipartition Property Theorem. 3.2 Consequences of the AEP: Data Compression. 3.3 High-Probability Sets and the Typical Set. Summary. Problems. Historical Notes. 4. Entropy Rates of a Stochastic Process. 4.1 Markov Chains. 4.2 Entropy Rate. 4.3 Example: Entropy Rate of a Random Walk on a Weighted Graph. 4.4 Second Law of Thermodynamics. 4.5 Functions of Markov Chains. Summary. Problems. Historical Notes. 5. Data Compression. 5.1 Examples of Codes. 5.2 Kraft Inequality. 5.3 Optimal Codes. 5.4 Bounds on the Optimal Code Length. 5.5 Kraft Inequality for Uniquely Decodable Codes. 5.6 Huffman Codes. 5.7 Some Comments on Huffman Codes. 5.8 Optimality of Huffman Codes. 5.9 Shannon-Fano-Elias Coding. 5.10 Competitive Optimality of the Shannon Code. 5.11 Generation of Discrete Distributions from Fair Coins. Summary. Problems. Historical Notes. 6. Gambling and Data Compression. 6.1 The Horse Race. 6.2 Gambling and Side Information. 6.3 Dependent Horse Races and Entropy Rate. 6.4 The Entropy of English. 6.5 Data Compression and Gambling. 6.6 Gambling Estimate of the Entropy of English. Summary. Problems. Historical Notes. 7. Channel Capacity. 7.1 Examples of Channel Capacity. 7.2 Symmetric Channels. 7.3 Properties of Channel Capacity. 7.4 Preview of the Channel Coding Theorem. 7.5 Definitions. 7.6 Jointly Typical Sequences. 7.7 Channel Coding Theorem. 7.8 Zero-Error Codes. 7.9 Fano's Inequality and the Converse to the Coding Theorem. 7.10 Equality in the Converse to the Channel Coding Theorem. 7.11 Hamming Codes. 7.12 Feedback Capacity. 7.13 Source-Channel Separation Theorem. Summary. Problems. Historical Notes. 8. Differential Entropy. 8.1 Definitions. 8.2 AEP for Continuous Random Variables. 8.3 Relation of Differential Entropy to Discrete Entropy. 8.4 Joint and Conditional Differential Entropy. 8.5 Relative Entropy and Mutual Information. 8.6 Properties of Differential Entropy, Relative Entropy, and Mutual Information. Summary. Problems. Historical Notes. 9. Gaussian Channel. 9.1 Gaussian Channel: Definitions. 9.2 Converse to the Coding Theorem for Gaussian Channels. 9.3 Bandlimited Channels. 9.4 Parallel Gaussian Channels. 9.5 Channels with Colored Gaussian Noise. 9.6 Gaussian Channels with Feedback. Summary. Problems. Historical Notes. 10. Rate Distortion Theory. 10.1 Quantization. 10.2 Definitions. 10.3 Calculation of the Rate Distortion Function. 10.4 Converse to the Rate Distortion Theorem. 10.5 Achievability of the Rate Distortion Function. 10.6 Strongly Typical Sequences and Rate Distortion. 10.7 Characterization of the Rate Distortion Function. 10.8 Computation of Channel Capacity and the Rate Distortion Function. Summary. Problems. Historical Notes. 11. Information Theory and Statistics. 11.1 Method of Types. 11.2 Law of Large Numbers. 
11.3 Universal Source Coding. 11.4 Large Deviation Theory. 11.5 Examples of Sanov's Theorem. 11.6 Conditional Limit Theorem. 11.7 Hypothesis Testing. 11.8 Chernoff-Stein Lemma. 11.9 Chernoff Information. 11.10 Fisher Information and the Cramér-Rao Inequality. Summary. Problems. Historical Notes. 12. Maximum Entropy. 12.1 Maximum Entropy Distributions. 12.2 Examples. 12.3 Anomalous Maximum Entropy Problem. 12.4 Spectrum Estimation. 12.5 Entropy Rates of a Gaussian Process. 12.6 Burg's Maximum Entropy Theorem. Summary. Problems. Historical Notes. 13. Universal Source Coding. 13.1 Universal Codes and Channel Capacity. 13.2 Universal Coding for Binary Sequences. 13.3 Arithmetic Coding. 13.4 Lempel-Ziv Coding. 13.5 Optimality of Lempel-Ziv Algorithms. Summary. Problems. Historical Notes. 14. Kolmogorov Complexity. 14.1 Models of Computation. 14.2 Kolmogorov Complexity: Definitions and Examples. 14.3 Kolmogorov Complexity and Entropy. 14.4 Kolmogorov Complexity of Integers. 14.5 Algorithmically Random and Incompressible Sequences. 14.6 Universal Probability. 14.7 Kolmogorov Complexity. 14.9 Universal Gambling. 14.10 Occam's Razor. 14.11 Kolmogorov Complexity and Universal Probability. 14.12 Kolmogorov Sufficient Statistic. 14.13 Minimum Description Length Principle. Summary. Problems. Historical Notes. 15. Network Information Theory. 15.1 Gaussian Multiple-User Channels. 15.2 Jointly Typical Sequences. 15.3 Multiple-Access Channel. 15.4 Encoding of Correlated Sources. 15.5 Duality Between Slepian-Wolf Encoding and Multiple-Access Channels. 15.6 Broadcast Channel. 15.7 Relay Channel. 15.8 Source Coding with Side Information. 15.9 Rate Distortion with Side Information. 15.10 General Multiterminal Networks. Summary. Problems. Historical Notes. 16. Information Theory and Portfolio Theory. 16.1 The Stock Market: Some Definitions. 16.2 Kuhn-Tucker Characterization of the Log-Optimal Portfolio. 16.3 Asymptotic Optimality of the Log-Optimal Portfolio. 16.4 Side Information and the Growth Rate. 16.5 Investment in Stationary Markets. 16.6 Competitive Optimality of the Log-Optimal Portfolio. 16.7 Universal Portfolios. 16.8 Shannon-McMillan-Breiman Theorem (General AEP). Summary. Problems. Historical Notes. 17. Inequalities in Information Theory. 17.1 Basic Inequalities of Information Theory. 17.2 Differential Entropy. 17.3 Bounds on Entropy and Relative Entropy. 17.4 Inequalities for Types. 17.5 Combinatorial Bounds on Entropy. 17.6 Entropy Rates of Subsets. 17.7 Entropy and Fisher Information. 17.8 Entropy Power Inequality and Brunn-Minkowski Inequality. 17.9 Inequalities for Determinants. 17.10 Inequalities for Ratios of Determinants. Summary. Problems. Historical Notes. Bibliography. List of Symbols. Index.
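Since the entropy and mutual-information definitions from Chapter 2 of this book are what the mRMR criteria are built on, a tiny numerical illustration may help; the joint distribution below is an arbitrary example, not data from the paper.

```python
# Small illustration of the entropy and mutual-information definitions,
# computed for a discrete joint distribution p(x, y).
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def mutual_information(p_xy):
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    # I(X;Y) = H(X) + H(Y) - H(X,Y)
    return entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())

p_xy = np.array([[0.25, 0.25],
                 [0.00, 0.50]])
print(mutual_information(p_xy))   # about 0.311 bits
```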

45,034 citations


"Feature selection based on mutual i..." refers background in this paper

  • ...Index Terms—Feature selection, mutual information, minimal redundancy, maximal relevance, maximal dependency, classification....

Book
Vladimir Vapnik
01 Jan 1995
TL;DR: Setting of the learning problem; consistency of learning processes; bounds on the rate of convergence of learning processes; controlling the generalization ability of learning processes; constructing learning algorithms; what is important in learning theory?
Abstract: Setting of the learning problem; consistency of learning processes; bounds on the rate of convergence of learning processes; controlling the generalization ability of learning processes; constructing learning algorithms; what is important in learning theory?

40,147 citations


"Feature selection based on mutual i..." refers methods in this paper

  • ...(There are two exceptions in Table 4, for which the obtained feature subsets are comparable: 1) “NCI+LDA+Forward,” where five mRMR features lead to 20 errors (33.33 percent) and seven MaxRel features lead to 19 errors (31.67 percent) and 2) “LYM+SVM+Backward,” the same error (3.13 percent) is obtained.)...

  • ...With SVM + 40 features, we obtained the error rate 23-26 percent for mRMR, and 35-38 percent for MaxRel....

  • ...3a, 3b, and 3c show the classification error rates with classifiers NB, SVM, and LDA, respectively....

  • ...We use the LIBSVM package [9], which supports both 2-class and multiclass classification....

  • ...To test this, we consider three widely used classifiers, i.e., Naive Bayes (NB), Support Vector Machine (SVM), and Linear Discriminant Analysis (LDA)....

Journal ArticleDOI
TL;DR: In this paper, the problems of estimating a probability density function and of determining the mode of a probability function are discussed. Only estimates which are consistent and asymptotically normal are constructed.
Abstract: : Given a sequence of independent identically distributed random variables with a common probability density function, the problem of the estimation of a probability density function and of determining the mode of a probability function are discussed. Only estimates which are consistent and asymptotically normal are constructed. (Author)
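Kernel (Parzen-window) density estimates of the kind analyzed in this reference are one common way to estimate the densities needed to compute mutual information for continuous features. The snippet below is a minimal one-dimensional sketch with a Gaussian kernel; the bandwidth h and the test point are illustrative choices.

```python
# Hedged sketch of a one-dimensional Parzen-window (kernel) density estimate
# with a Gaussian kernel and a fixed, hand-picked bandwidth h.
import numpy as np

def parzen_density(x, samples, h=0.2):
    """Estimate p(x) as the average of Gaussian kernels centred on the samples."""
    z = (x - samples) / h
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean() / h

samples = np.random.default_rng(0).normal(size=1000)
print(parzen_density(0.0, samples))   # roughly the standard normal density at 0 (about 0.4)
```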

10,114 citations


"Feature selection based on mutual i..." refers background in this paper

  • ...As one of the earliest classifiers, LDA [30] learns a linear classification boundary in the input feature space....

Journal ArticleDOI
03 Feb 2000 - Nature
TL;DR: It is shown that there is diversity in gene expression among the tumours of DLBCL patients, apparently reflecting the variation in tumour proliferation rate, host response and differentiation state of the tumour.
Abstract: Diffuse large B-cell lymphoma (DLBCL), the most common subtype of non-Hodgkin's lymphoma, is clinically heterogeneous: 40% of patients respond well to current therapy and have prolonged survival, whereas the remainder succumb to the disease. We proposed that this variability in natural history reflects unrecognized molecular heterogeneity in the tumours. Using DNA microarrays, we have conducted a systematic characterization of gene expression in B-cell malignancies. Here we show that there is diversity in gene expression among the tumours of DLBCL patients, apparently reflecting the variation in tumour proliferation rate, host response and differentiation state of the tumour. We identified two molecularly distinct forms of DLBCL which had gene expression patterns indicative of different stages of B-cell differentiation. One type expressed genes characteristic of germinal centre B cells ('germinal centre B-like DLBCL'); the second type expressed genes normally induced during in vitro activation of peripheral blood B cells ('activated B-like DLBCL'). Patients with germinal centre B-like DLBCL had a significantly better overall survival than those with activated B-like DLBCL. The molecular classification of tumours on the basis of gene expression can thus identify previously undetected and clinically significant subtypes of cancer.

9,493 citations


"Feature selection based on mutual i..." refers background or methods in this paper

  • ...(There are two exceptions in Table 4, for which the obtained feature subsets are comparable: 1) “NCI+LDA+Forward,” where five mRMR features lead to 20 errors (33.33 percent) and seven MaxRel features lead to 19 errors (31.67 percent) and 2) “LYM+SVM+Backward,” the same error (3.13 percent) is obtained.)...

  • ...For LYM data in Fig....

  • ...For the LYM data, MaxDep needs more than 200 seconds to find the 50th feature, while mRMR uses only 5 seconds....

  • ...For example, we compared the average computational time cost to select the top 50 mRMR and MaxDep features for both continuous data sets NCI and LYM, based on parallel experiments on a cluster of eight 3.06G Xeon CPUs running Redhat Linux 9, with the Matlab implementation....

  • ...The data set LYM [1] has 96 samples of 4,026 gene features....

Journal ArticleDOI
TL;DR: The wrapper method searches for an optimal feature subset tailored to a particular algorithm and domain, and the wrapper approach is compared to induction without feature subset selection and to Relief, a filter approach to feature subset selection.
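As a counterpart to the filter-style mRMR criterion, the wrapper idea summarized above can be sketched as a simple forward selection driven by the cross-validated accuracy of a chosen classifier. This is a generic illustration, not the paper's specific procedure; the classifier and scoring choices are placeholders.

```python
# Minimal sketch of a wrapper-style forward selection: add one candidate
# feature at a time, keeping whichever addition gives the best
# cross-validated accuracy for the chosen classifier.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def wrapper_forward_select(X, y, k, clf=None):
    clf = clf or GaussianNB()          # placeholder classifier
    selected = []
    while len(selected) < k:
        best_j, best_acc = None, -np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            acc = cross_val_score(clf, X[:, selected + [j]], y, cv=5).mean()
            if acc > best_acc:
                best_j, best_acc = j, acc
        selected.append(best_j)
    return selected
```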

8,610 citations


"Feature selection based on mutual i..." refers background or methods in this paper

  • ...A wrapper [ 15 ], [18] is a feature selector that convolves with a classifier (e.g., naive Bayes classifier), with the direct goal to minimize the classification error of the particular classifier....

  • ...Second, we investigate how to combine mRMR with other feature selection methods (such as wrappers [18], [ 15 ]) into a two-stage selection algorithm....

  • ...The latter type of approach (e.g., mRMR and Max-Relevance), sometimes called “filter” [18], [ 15 ], often selects features by testing whether some preset conditions about the features and the target class are satisfied....

  • ...[ 15 ], [22], [12], [5]) and select features with the minimal redundancy (Min-Redundancy)....

  • ...N many pattern recognition applications, identifying the most characterizing features (or attributes) of the observed data, i.e., feature selection (or variable selection, among many other names) [30], [14], [17], [18], [ 15 ], [12], [11], [19], [31], [32], [5], is critical to minimize the classification error....
