Infinite Latent Feature Selection: A Probabilistic Latent Graph-Based Ranking Approach
Summary
1. Introduction
- Performance of machine learning methods is heavily dependent on the choice of features on which they are applied.
- The authors' approach aims to model an important hidden variable behind the data: the relevancy of features.
- The proposed method is compared against 11 state-of-the-art feature selection methods from the recent machine learning and pattern recognition literature, reporting results for a total of 576 unique tests (the source code is available at Matlab-Central).
3. Our Approach
- Each weight represents the likelihood that features x_i and x_j are good candidates (i.e., relevant).
- The learning framework models the probability of each co-occurrence of x_i and x_j as a mixture of conditionally independent multinomial distributions, whose parameters are learnt using the EM algorithm.
- Given the weighted graph G, the proposed approach analyses subsets of features as paths connecting them.
- The cost of each path is given by the joint probability of all the nodes belonging to it.
- For this reason, the authors dub their approach infinite latent feature selection (ILFS).
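To make the path-based view concrete, here is a minimal sketch (not the authors' code) of scoring a feature subset as a path over a toy weighted graph. One simple way to realize a joint probability along a path is to multiply the pairwise weights on its edges; the adjacency values below are made up for illustration.

```python
import numpy as np

# Toy weighted graph over 4 features: A[i, j] is the weight of the edge
# between features i and j (illustrative values, not learned weights).
A = np.array([[0.0, 0.8, 0.1, 0.3],
              [0.8, 0.0, 0.2, 0.5],
              [0.1, 0.2, 0.0, 0.4],
              [0.3, 0.5, 0.4, 0.0]])

def path_cost(A, path):
    """Cost of a path = product of the edge weights it traverses,
    mirroring a joint probability over the features along the path."""
    return float(np.prod([A[i, j] for i, j in zip(path[:-1], path[1:])]))

print(path_cost(A, [0, 1, 3]))  # the subset {0, 1, 3} visited in this order
```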
3.1. Discriminative Quantization process
- Tokens are the words of their dictionary of features.
- Thus, each feature will be represented by a new low-dimensional vocabulary of meaningful tokens.
- First, the range of values of each feature is partitioned into a small set of intervals; secondly, the authors assign a token to the values falling into each interval (see the sketch below).
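As a rough illustration of the tokenization step, the following sketch quantizes a real-valued feature into a small alphabet of tokens using equal-width intervals. The paper's discriminative quantization is more involved, so treat this as a simple stand-in; all names are ours.

```python
import numpy as np

def tokenize_feature(x, n_tokens=4):
    """Partition the range of feature x into n_tokens equal-width
    intervals and map each value to the token of its interval."""
    edges = np.linspace(x.min(), x.max(), n_tokens + 1)[1:-1]
    return np.digitize(x, edges)  # token ids in {0, ..., n_tokens - 1}

x = np.random.default_rng(0).uniform(0, 1000, size=100)
tokens = tokenize_feature(x)      # low-dimensional vocabulary for x
```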
3.2. From co-occurrences to graph weighting
- Weighting the graph according to the nodes' discriminatory power has a great influence on the quality of the ranking process.
- In order to better understand the intuition behind the proposed model, the authors need to make some assumptions.
- Fig. 1(a) shows the general structure of the model: each feature can be represented as a mixture of concepts (Relevant/Irrelevant) weighted by the probability P(z|f), and each concept z expresses tokens with probability P(t|z).
- The unknown parameters of this model are P(t|z) and P(z|f).
- The responsibility for assigning the “condition of being relevant” to features lies to a great extent with the unobserved class variable Z.
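For readers unfamiliar with PLSA, here is a compact EM sketch for estimating P(t|z) and P(z|f) from a feature-by-token count matrix, with two latent concepts playing the role of Relevant/Irrelevant. This illustrates standard PLSA fitting, not the authors' implementation; all names are ours.

```python
import numpy as np

def plsa_em(N, n_topics=2, n_iters=100, seed=0):
    """PLSA via EM on a feature-by-token count matrix N of shape (F, T).
    Returns p_t_z with shape (n_topics, T) and p_z_f with shape (F, n_topics)."""
    rng = np.random.default_rng(seed)
    F, T = N.shape
    p_t_z = rng.random((n_topics, T))
    p_t_z /= p_t_z.sum(axis=1, keepdims=True)
    p_z_f = rng.random((F, n_topics))
    p_z_f /= p_z_f.sum(axis=1, keepdims=True)
    for _ in range(n_iters):
        # E-step: responsibilities P(z | f, t), shape (F, T, n_topics)
        joint = p_z_f[:, None, :] * p_t_z.T[None, :, :]
        resp = joint / joint.sum(axis=2, keepdims=True).clip(1e-12)
        # M-step: re-estimate parameters from expected counts n(f,t) * P(z|f,t)
        expected = N[:, :, None] * resp
        p_t_z = expected.sum(axis=0).T
        p_t_z /= p_t_z.sum(axis=1, keepdims=True).clip(1e-12)
        p_z_f = expected.sum(axis=1)
        p_z_f /= p_z_f.sum(axis=1, keepdims=True).clip(1e-12)
    return p_t_z, p_z_f

# Hypothetical usage: counts of 4 tokens across 6 features.
N = np.random.default_rng(1).integers(0, 10, size=(6, 4)).astype(float)
p_t_z, p_z_f = plsa_em(N)   # p_z_f provides the mixing weights of Sec. 3.2
```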
3.3. Probabilistic Infinite Feature Selection
- For simplicity, suppose that the length l of a path is smaller than the total number of nodes n in the graph.
- The authors want to consider all the possible paths of any length in the graph, which turns out to be the same as considering all the possible subsets of features of any cardinality.
- Therefore, extending the path length to infinity implies that the authors have to calculate the geometric series of the matrix A, Ĉ = ∑_{l=1}^{∞} A^l (Eq. 7). Summing infinitely many A^l terms can diverge.
- The series is therefore regularized by a factor r (each term A^l is damped to (rA)^l); for appropriate choices of r, below the reciprocal of the spectral radius of A, the infinite sum is ensured to converge.
- Ranking the č(i) scores in decreasing order gives the output of the algorithm: a ranked list of features, where the most discriminative and relevant features are positioned at the top of the list.
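The regularized series has a standard closed form, ∑_{l≥1} (rA)^l = (I − rA)^{-1} − I. A minimal sketch of the resulting scoring step, assuming (as in the related Inf-FS work) that each feature's score is the sum of its row of Ĉ:

```python
import numpy as np

def geometric_series_scores(A, alpha=0.9):
    """Ĉ = sum_{l>=1} (rA)^l = (I - rA)^{-1} - I, with r chosen below
    the reciprocal of the spectral radius of A so that the sum converges."""
    n = A.shape[0]
    r = alpha / max(abs(np.linalg.eigvals(A)))   # ensures convergence
    C_hat = np.linalg.inv(np.eye(n) - r * A) - np.eye(n)
    return C_hat.sum(axis=1)                     # one score per feature

A = np.random.default_rng(0).random((5, 5))      # toy weighted adjacency
scores = geometric_series_scores(A)
ranking = np.argsort(-scores)                    # best features first
```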
3.4. Markov chains and random walks
- This section provides a probabilistic interpretation of the proposed algorithm based on Absorbing Random Walks.
- Here, the authors reformulate the problem in terms of Markov chains and random walks.
- The probabilities t_ij are called transition probabilities.
- In the canonical block form of the transition matrix, [[Q, R], [0, I]], note that R and 0 are not necessarily square.
- Note that C, a square matrix with rows and columns corresponding to the non-absorbing states, is derived in the same way as Eq. 9: C(i, j) is the expected number of periods that the chain spends in the j-th non-absorbing state, given that the chain began in the i-th non-absorbing state.
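The connection can be seen numerically: for a transition matrix in the canonical form above, the standard fundamental matrix (I − Q)^{-1} collects exactly the expected-visit counts that C describes. A small sketch with made-up transition probabilities:

```python
import numpy as np

# Transitions among three transient (non-absorbing) states; each row's
# missing mass (1 - row sum) goes to one or more absorbing states.
Q = np.array([[0.0, 0.5, 0.2],
              [0.3, 0.0, 0.3],
              [0.1, 0.4, 0.0]])

# Fundamental matrix: entry (i, j) is the expected number of periods the
# chain spends in transient state j, given that it started in state i.
Nf = np.linalg.inv(np.eye(3) - Q)
print(Nf)
```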
4. Experiments and Results
- The first goal is to evaluate the robustness of the proposed method by choosing datasets spanning a variety of domains and difficulties.
- The second goal is to analyze and empirically clarify how reliably important features are ranked high by ILFS.
- The authors also include several comparative algorithms from recent literature, including filters, wrappers, and embedded methods.
- The last goal is to assess the reliability and validity of their research results.
- Finally, Tab. 2 reports the execution time of each method when applied to a randomly generated dataset consisting of 2 classes, 10k samples, and 5k features (uniformly distributed in the range [0, 1000]), on an Intel i7 CPU at 3.4 GHz with 16 GB of RAM, using MATLAB 2016b; the sketch below reproduces this setup.
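A sketch of that timing setup, using a simple univariate scorer as a placeholder for the methods in Tab. 2 (the paper times MATLAB implementations, so absolute numbers will differ):

```python
import time
import numpy as np
from sklearn.feature_selection import f_classif

# Synthetic benchmark matching the stated setup: 2 classes, 10k samples,
# 5k features drawn uniformly from [0, 1000] (~400 MB as float64).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1000, size=(10_000, 5_000))
y = rng.integers(0, 2, size=10_000)

t0 = time.perf_counter()
scores, _ = f_classif(X, y)          # placeholder selector being timed
print(f"elapsed: {time.perf_counter() - t0:.2f}s")
```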
4.1. Exp. #1: Deep Representation with pretraining
- The authors selected the pre-trained model called very deep ConvNets [31], which performed favorably against the state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014.
- The performance is measured as mean Average Precision (mAP) across all classes (see the sketch at the end of this subsection).
- The number of features used in both experiments is set to 50% of the total.
- Their method achieved the best performance in terms of mAP on VOC-2007, followed by Fisher and MI; the results are statistically significant.
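For reference, mAP here is the mean over classes of the per-class average precision. A minimal implementation, using scikit-learn's average_precision_score (our tooling choice, not necessarily the authors'):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(Y_true, Y_score):
    """Y_true: (n_samples, n_classes) binary labels;
    Y_score: real-valued classifier scores of the same shape."""
    return float(np.mean([average_precision_score(Y_true[:, c], Y_score[:, c])
                          for c in range(Y_true.shape[1])]))

# Tiny illustrative usage with made-up labels and scores.
Y_true = np.array([[1, 0], [0, 1], [1, 1]])
Y_score = np.array([[0.9, 0.2], [0.1, 0.8], [0.7, 0.6]])
print(mean_average_precision(Y_true, Y_score))
```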
4.2. Exp.#2: Miscellaneous Datasets
- The authors selected 10 publicly available benchmarks, including cancer classification and prediction on DNA microarray data (Colon [32], Lymphoma [14], Leukemia [14], Lung [15], Prostate [1]), handwritten character recognition (GINA [2]), text classification from the NIPS feature selection challenge (DEXTER [18]), and a movie-review corpus for sentiment analysis (POLARITY [26]).
- To avoid any bias due to a particularly favorable split, this procedure is repeated 20 times and results are averaged over the trials.
- Feature selection is applied only to the training set, generating feature subsets of different cardinalities (10, 50, 100, 150, and 200).
- Results are reported in terms of mAP as for the previous experiment.
- Fisher, which performs well across all the datasets, does not show the same ranking quality as ILFS.
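The evaluation loop can be summarized as follows. Here `ranker` is a hypothetical callable returning feature indices (best first), and accuracy stands in for the paper's mAP for brevity; treat this as an approximation of the protocol, not the authors' code.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def evaluate_selector(X, y, ranker, ks=(10, 50, 100, 150, 200), n_trials=20):
    """Repeat random splits; rank features on the training set only,
    then train/test a linear SVM on the top-k features."""
    results = {k: [] for k in ks}
    for t in range(n_trials):
        Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5,
                                              random_state=t)
        rank = ranker(Xtr, ytr)          # hypothetical: indices, best first
        for k in ks:
            clf = LinearSVC().fit(Xtr[:, rank[:k]], ytr)
            results[k].append(clf.score(Xte[:, rank[:k]], yte))
    return {k: float(np.mean(v)) for k, v in results.items()}
```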
4.3. Reliability and Validity
- In order to assess whether the difference in performance is statistically significant, a set of Student's t-tests has been applied to the results [3].
- Each entry of Tab. 4 comes from the average of the accuracies obtained from a series of SVM classifications over 20 different splits of the data for 5 different subsets of features (i.e., a total of 100 different tests for each method).
- Thus, given the distribution of these accuracies for the proposed method, d_p, and that of the i-th competitor, d_ci, a two-sample t-test has been applied, obtaining a test decision for the null hypothesis H0 that the two come from independent random samples of normal distributions with equal means.
- Based on this result, the authors assess the validity of the reported results by the binomial cumulative distribution function [3, 9].
- A trial counts as a success when ILFS outperforms all the other methods, with a certain probability p of doing so by chance (see the sketch below).
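A sketch of both checks with illustrative numbers: scipy's ttest_ind for the two-sample test, and the binomial CDF for the probability of winning by chance. The accuracy vectors and comparison counts below are made up for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind, binom

rng = np.random.default_rng(0)
dp = 0.90 + 0.02 * rng.standard_normal(100)   # proposed method: 100 accuracies
dc = 0.88 + 0.02 * rng.standard_normal(100)   # one competitor:  100 accuracies
t_stat, p_value = ttest_ind(dp, dc)           # H0: equal means

# Probability of winning at least s of n head-to-head comparisons purely
# by chance, with per-comparison chance level p (made-up numbers).
n_cmp, s_wins, p_chance = 8, 7, 0.5
p_by_chance = 1.0 - binom.cdf(s_wins - 1, n_cmp, p_chance)
```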
5. Conclusion
- In this paper, the authors proposed a probabilistic feature selection algorithm that performs the ranking step by considering all the possible subsets of features, bypassing the combinatorial problem.
- The most appealing characteristic of ILFS is that it aims to model the features' "relevancy" using a PLSA-inspired process.
- The derived mixing weights P(z|f) are used to weight a graph of features.
- Finally, for the sake of repeatability, the source code is available at https://goo.gl/uTuZhc to provide the material needed to replicate their experiments.