
Showing papers on "Feature selection published in 2003"


Journal ArticleDOI
TL;DR: The contributions of this special issue cover a wide range of aspects of variable selection: providing a better definition of the objective function, feature construction, feature ranking, multivariate feature selection, efficient search methods, and feature validity assessment methods.
Abstract: Variable and feature selection have become the focus of much research in areas of application for which datasets with tens or hundreds of thousands of variables are available. These areas include text processing of internet documents, gene expression array analysis, and combinatorial chemistry. The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data. The contributions of this special issue cover a wide range of aspects of such problems: providing a better definition of the objective function, feature construction, feature ranking, multivariate feature selection, efficient search methods, and feature validity assessment methods.

14,509 citations


05 Aug 2003
TL;DR: This work derives an equivalent form, called minimal-redundancy-maximal-relevance criterion (mRMR), for first-order incremental feature selection, and presents a two-stage feature selection algorithm by combining mRMR and other more sophisticated feature selectors (e.g., wrappers).

7,075 citations
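For readers who want a concrete picture of the first-order incremental rule summarized above, the following is a minimal sketch of mRMR-style greedy selection, assuming the features have already been discretized; the function name and the use of scikit-learn's mutual_info_score are illustrative choices, not the authors' implementation.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mrmr_select(X, y, k):
    """Greedily pick k columns of X maximizing MI(feature; y) minus mean MI with already-selected features."""
    relevance = np.array([mutual_info_score(X[:, j], y) for j in range(X.shape[1])])
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            redundancy = np.mean([mutual_info_score(X[:, j], X[:, s]) for s in selected])
            score = relevance[j] - redundancy          # max-relevance, min-redundancy trade-off
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected

rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(200, 15))                 # stand-in for discretized features
y = (X[:, 0] + X[:, 3] > 3).astype(int)
print(mrmr_select(X, y, k=5))
```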


Proceedings Article
21 Aug 2003
TL;DR: An approach to semi-supervised learning is proposed that is based on a Gaussian random field model, and methods to incorporate class priors and the predictions of classifiers obtained by supervised learning are discussed.
Abstract: An approach to semi-supervised learning is proposed that is based on a Gaussian random field model. Labeled and unlabeled data are represented as vertices in a weighted graph, with edge weights encoding the similarity between instances. The learning problem is then formulated in terms of a Gaussian random field on this graph, where the mean of the field is characterized in terms of harmonic functions, and is efficiently obtained using matrix methods or belief propagation. The resulting learning algorithms have intimate connections with random walks, electric networks, and spectral graph theory. We discuss methods to incorporate class priors and the predictions of classifiers obtained by supervised learning. We also propose a method of parameter learning by entropy minimization, and show the algorithm's ability to perform feature selection. Promising experimental results are presented for synthetic data, digit classification, and text classification tasks.

3,908 citations
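The mean of the Gaussian field on the unlabeled vertices has the closed form the abstract alludes to; the sketch below illustrates it, assuming the labeled nodes occupy the first l rows and columns of the affinity matrix W and the labels are 0/1. Names and the toy graph are illustrative, not the authors' code.

```python
import numpy as np

def harmonic_labels(W, y_labeled):
    """W: symmetric affinity matrix; y_labeled: 0/1 labels for the first l nodes."""
    l = len(y_labeled)
    D = np.diag(W.sum(axis=1))
    L = D - W                                   # combinatorial graph Laplacian
    # Mean of the Gaussian field on unlabeled nodes: f_u = L_uu^{-1} W_ul f_l
    f_u = np.linalg.solve(L[l:, l:], W[l:, :l] @ np.asarray(y_labeled, dtype=float))
    return f_u                                  # soft labels in [0, 1]; threshold at 0.5 to classify

# Toy chain graph: two labeled endpoints, two unlabeled interior nodes.
W = np.array([[0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
print(harmonic_labels(W, [1, 0]))               # interior nodes are pulled towards their labeled neighbours
```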


Journal ArticleDOI
TL;DR: How and why Relief algorithms work, their theoretical and practical properties, their parameters, what kinds of dependencies they detect, how they scale up to large numbers of examples and features, how to sample data for them, how robust they are to noise, how irrelevant and redundant attributes influence their output, and how different metrics influence them.
Abstract: Relief algorithms are general and successful attribute estimators. They are able to detect conditional dependencies between attributes and provide a unified view on attribute estimation in regression and classification. In addition, their quality estimates have a natural interpretation. While they have commonly been viewed as feature subset selection methods applied in a preprocessing step before a model is learned, they have actually been used successfully in a variety of settings, e.g., to select splits or to guide constructive induction in the building phase of decision or regression tree learning, as an attribute weighting method, and in inductive logic programming. This broad spectrum of successful uses calls for an especially careful investigation of the various properties Relief algorithms have. In this paper we theoretically and empirically investigate and discuss how and why they work, their theoretical and practical properties, their parameters, what kinds of dependencies they detect, how they scale up to large numbers of examples and features, how to sample data for them, how robust they are to noise, how irrelevant and redundant attributes influence their output, and how different metrics influence them.

2,651 citations
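As a rough illustration of the attribute estimation idea discussed above, here is a minimal sketch of the basic Relief weight update for a two-class problem with numeric features scaled to [0, 1]; the single nearest hit/miss and the Manhattan distance are simplifying assumptions (the ReliefF variants analyzed in the paper average over k nearest neighbours).

```python
import numpy as np

def relief_weights(X, y, n_iter=200, seed=0):
    """Basic Relief for two classes; X is assumed scaled to [0, 1]."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        i = rng.integers(n)
        dist = np.abs(X - X[i]).sum(axis=1)                     # Manhattan distance to the sampled instance
        dist[i] = np.inf                                        # never pick the instance itself
        hit = np.argmin(np.where(y == y[i], dist, np.inf))      # nearest same-class neighbour
        miss = np.argmin(np.where(y != y[i], dist, np.inf))     # nearest other-class neighbour
        # Penalize features that differ on the hit, reward those that differ on the miss.
        w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / n_iter
    return w

rng = np.random.default_rng(1)
X = rng.random((300, 10))
y = (X[:, 0] + X[:, 1] > 1).astype(int)            # only the first two features matter
print(relief_weights(X, y).round(3))                # they should receive the largest weights
```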


Journal ArticleDOI
TL;DR: It is the combination of relatively high prediction accuracy and its collection of desired features that makes Random Forest uniquely suited for modeling in cheminformatics.
Abstract: A new classification and regression tool, Random Forest, is introduced and investigated for predicting a compound's quantitative or categorical biological activity based on a quantitative description of the compound's molecular structure. Random Forest is an ensemble of unpruned classification or regression trees created by using bootstrap samples of the training data and random feature selection in tree induction. Prediction is made by aggregating (majority vote or averaging) the predictions of the ensemble. We built predictive models for six cheminformatics data sets. Our analysis demonstrates that Random Forest is a powerful tool that delivers accuracy among the best of the methods available to date. We also present three additional features of Random Forest: built-in performance assessment, a measure of relative importance of descriptors, and a measure of compound similarity that is weighted by the relative importance of descriptors. It is the combination of relatively high prediction accuracy and its collection of desired features that makes Random Forest uniquely suited for modeling in cheminformatics.

2,634 citations
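A minimal sketch of the workflow described above, using scikit-learn's random forest as a stand-in and a synthetic matrix in place of real molecular descriptors; names and parameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a descriptor matrix; real use would load molecular descriptors.
X, y = make_classification(n_samples=300, n_features=50, n_informative=10, random_state=0)

forest = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
forest.fit(X, y)

print("out-of-bag accuracy:", round(forest.oob_score_, 3))   # built-in performance assessment
importances = forest.feature_importances_                     # relative importance of descriptors
print("top descriptors:", importances.argsort()[::-1][:5])
```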


Journal Article
George Forman
TL;DR: An empirical comparison of twelve feature selection methods evaluated on a benchmark of 229 text classification problem instances, revealing that a new feature selection metric, called 'Bi-Normal Separation' (BNS), outperformed the others by a substantial margin in most situations and was the top single choice for all goals except precision.
Abstract: Machine learning for text classification is the cornerstone of document categorization, news filtering, document routing, and personalization. In text domains, effective feature selection is essential to make the learning task efficient and more accurate. This paper presents an empirical comparison of twelve feature selection methods (e.g. Information Gain) evaluated on a benchmark of 229 text classification problem instances that were gathered from Reuters, TREC, OHSUMED, etc. The results are analyzed from multiple goal perspectives (accuracy, F-measure, precision, and recall), since each is appropriate in different situations. The results reveal that a new feature selection metric we call 'Bi-Normal Separation' (BNS) outperformed the others by a substantial margin in most situations. This margin widened in tasks with high class skew, which is rampant in text classification problems and is particularly challenging for induction algorithms. A new evaluation methodology is offered that focuses on the needs of the data mining practitioner faced with a single dataset who seeks to choose one (or a pair of) metrics that are most likely to yield the best performance. From this perspective, BNS was the top single choice for all goals except precision, for which Information Gain yielded the best result most often. This analysis also revealed, for example, that Information Gain and Chi-Squared have correlated failures, and so they work poorly together. When choosing optimal pairs of metrics for each of the four performance goals, BNS is consistently a member of the pair; e.g., for greatest recall, the pair BNS + F1-measure yielded the best performance on the greatest number of tasks by a considerable margin.

2,621 citations
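The Bi-Normal Separation score itself is simple to compute: it is the gap between the inverse standard normal CDF evaluated at a feature's true-positive rate and at its false-positive rate. The sketch below assumes per-class document counts as input; the clipping constant is an illustrative choice to keep the inverse CDF finite.

```python
import numpy as np
from scipy.stats import norm

def bns_score(n_pos_with_word, n_pos, n_neg_with_word, n_neg, eps=5e-4):
    """Bi-Normal Separation: |F^{-1}(tpr) - F^{-1}(fpr)| with F the standard normal CDF."""
    tpr = np.clip(n_pos_with_word / n_pos, eps, 1 - eps)
    fpr = np.clip(n_neg_with_word / n_neg, eps, 1 - eps)
    return abs(norm.ppf(tpr) - norm.ppf(fpr))

# e.g. a word appearing in 40 of 100 positive documents and 5 of 900 negative documents
print(round(bns_score(40, 100, 5, 900), 3))
```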


Proceedings Article
21 Aug 2003
TL;DR: A novel concept, predominant correlation, is introduced, and a fast filter method is proposed which can identify relevant features as well as redundancy among relevant features without pairwise correlation analysis.
Abstract: Feature selection, as a preprocessing step to machine learning, is effective in reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving result comprehensibility. However, the recent increase of dimensionality of data poses a severe challenge to many existing feature selection methods with respect to efficiency and effectiveness. In this work, we introduce a novel concept, predominant correlation, and propose a fast filter method which can identify relevant features as well as redundancy among relevant features without pairwise correlation analysis. The efficiency and effectiveness of our method are demonstrated through extensive comparisons with other methods using real-world data of high dimensionality.

2,251 citations
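A minimal sketch of the fast filter idea described above: rank features by symmetrical uncertainty with the class, then drop any feature that is more correlated with an already-kept feature than with the class. It assumes discretized features; function names and the threshold parameter are illustrative, not the authors' implementation.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def entropy(x):
    p = np.unique(x, return_counts=True)[1] / len(x)
    return -(p * np.log2(p)).sum()

def symmetrical_uncertainty(a, b):
    h = entropy(a) + entropy(b)
    return 0.0 if h == 0 else 2 * mutual_info_score(a, b) / np.log(2) / h   # convert MI to bits

def fcbf(X, y, delta=0.0):
    su_class = np.array([symmetrical_uncertainty(X[:, j], y) for j in range(X.shape[1])])
    order = [j for j in np.argsort(-su_class) if su_class[j] > delta]       # relevance ranking
    kept = []
    for j in order:
        # keep j only if no stronger, already-kept feature predominates it
        if all(symmetrical_uncertainty(X[:, i], X[:, j]) < su_class[j] for i in kept):
            kept.append(j)
    return kept

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(300, 12))
X[:, 5] = X[:, 0]                                    # a deliberately redundant copy of feature 0
y = (X[:, 0] + X[:, 2] > 2).astype(int)
print(fcbf(X, y))                                    # one of the two identical features is dropped as redundant
```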


Journal ArticleDOI
TL;DR: A benchmark comparison of several attribute selection methods for supervised classification, in which attribute rankings are cross-validated with respect to a classification learner to find the best attributes.
Abstract: Data engineering is generally considered to be a central issue in the development of data mining applications. The success of many learning schemes, in their attempts to construct models of data, hinges on the reliable identification of a small set of highly predictive attributes. The inclusion of irrelevant, redundant, and noisy attributes in the model building process can result in poor predictive performance and increased computation. Attribute selection generally involves a combination of search and attribute utility estimation plus evaluation with respect to specific learning schemes. This leads to a large number of possible permutations and has led to a situation where very few benchmark studies have been conducted. This paper presents a benchmark comparison of several attribute selection methods for supervised classification. All the methods produce an attribute ranking, a useful device for isolating the individual merit of an attribute. Attribute selection is achieved by cross-validating the attribute rankings with respect to a classification learner to find the best attributes. Results are reported for a selection of standard data sets and two diverse learning schemes: C4.5 and naive Bayes.

1,248 citations
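A minimal sketch of the evaluation protocol described above: rank attributes with a univariate score, then cross-validate a learner (naive Bayes here) on the top-ranked subsets to decide how many attributes to keep. Synthetic data stands in for the standard data sets, and the ranking score and cut-off grid are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=400, n_features=40, n_informative=8, random_state=0)
ranking = np.argsort(-mutual_info_classif(X, y, random_state=0))   # best attributes first

best_k, best_acc = None, -np.inf
for k in (5, 10, 20, 40):
    acc = cross_val_score(GaussianNB(), X[:, ranking[:k]], y, cv=10).mean()
    if acc > best_acc:
        best_k, best_acc = k, acc
print(f"keep top {best_k} attributes (CV accuracy {best_acc:.3f})")
```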


Journal ArticleDOI
TL;DR: The results indicate that the performance of the univariate decision tree (DT) is acceptably good in comparison with that of other classifiers, except with high-dimensional data, and that the use of attribute selection methods does not appear to be justified in terms of accuracy increases.

1,013 citations


Journal ArticleDOI
TL;DR: An empirical study is conducted to examine the pros and cons of these search methods, give some guidelines on choosing a search method, and compare the classifier error rates before and after feature selection.

846 citations


Journal ArticleDOI
TL;DR: The algorithm for feature selection is based on an application of a rough set method to the result of principal components analysis (PCA) used for feature projection and reduction.

Journal ArticleDOI
TL;DR: It is seen that relatively few features are needed to achieve the same classification accuracies as in the original feature space when classifying panchromatic high-resolution data from urban areas using morphological and neural approaches.
Abstract: Classification of panchromatic high-resolution data from urban areas using morphological and neural approaches is investigated. The proposed approach is based on three steps. First, the composition of geodesic opening and closing operations of different sizes is used in order to build a differential morphological profile that records image structural information. Although the original panchromatic image has only one data channel, the composition operations give many additional channels, which may contain redundancies. Therefore, feature extraction or feature selection is applied in the second step. Both discriminant analysis feature extraction and decision boundary feature extraction are investigated in the second step, along with a simple feature selection based on picking the largest indexes of the differential morphological profiles. Third, a neural network is used to classify the features from the second step. The proposed approach is applied in experiments on high-resolution Indian Remote Sensing 1C (IRS-1C) and IKONOS remote sensing data from urban areas. In these experiments, the proposed method performs well in terms of classification accuracies. It is seen that relatively few features are needed to achieve the same classification accuracies as in the original feature space.
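A minimal sketch of building a differential morphological profile from a single-band image, using plain grey-scale openings and closings as a simplified stand-in for the geodesic (reconstruction-based) operators used in the paper; the structuring-element sizes and names are illustrative.

```python
import numpy as np
from scipy import ndimage

def differential_morphological_profile(image, sizes=(3, 5, 7, 9)):
    """Stack differences between successive openings/closings as extra channels."""
    opens = [image] + [ndimage.grey_opening(image, size=s) for s in sizes]
    closes = [image] + [ndimage.grey_closing(image, size=s) for s in sizes]
    d_open = [opens[i] - opens[i + 1] for i in range(len(sizes))]
    d_close = [closes[i + 1] - closes[i] for i in range(len(sizes))]
    return np.stack(d_open + d_close, axis=-1)       # channels to feed feature extraction/selection

profile = differential_morphological_profile(np.random.rand(64, 64))
print(profile.shape)                                  # (64, 64, 8)
```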

Journal Article
TL;DR: In this article, the authors explore the use of the zero-norm of the parameters of linear models in learning and derive a simple but practical method for variable or feature selection, minimizing training error and ensuring sparsity in solutions.
Abstract: We explore the use of the so-called zero-norm of the parameters of linear models in learning. Minimization of such a quantity has many uses in a machine learning context: for variable or feature selection, minimizing training error and ensuring sparsity in solutions. We derive a simple but practical method for achieving these goals and discuss its relationship to existing techniques of minimizing the zero-norm. The method boils down to implementing a simple modification of vanilla SVM, namely via an iterative multiplicative rescaling of the training data. Applications we investigate which aid our discussion include variable and feature selection on biological microarray data, and multicategory classification.
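A minimal sketch of the iterative multiplicative rescaling the abstract describes: train a linear SVM, rescale each input dimension by the magnitude of its learned weight, and repeat so that the scales of irrelevant variables are driven towards zero. The iteration count, normalization, and selection threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=30, n_informative=5, random_state=0)

scale = np.ones(X.shape[1])
for _ in range(20):
    w = LinearSVC(C=1.0, max_iter=5000).fit(X * scale, y).coef_.ravel()
    scale *= np.abs(w)                # multiplicative rescaling of the training data
    scale /= scale.max()              # keep the scale numerically well-behaved

selected = np.flatnonzero(scale > 1e-6)
print("surviving variables:", selected)
```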

Journal Article
TL;DR: New methods to evaluate variable subset relevance with a view to variable selection; the criterion based on the derivative of the weight vector norm achieves good results and performs consistently well over the datasets used.
Abstract: We propose new methods to evaluate variable subset relevance with a view to variable selection. Relevance criteria are derived from Support Vector Machines and are based on the sensitivity of either the squared weight vector norm ||w||^2 or generalization error bounds with respect to a variable. Experiments on linear and non-linear toy problems and real-world datasets have been carried out to assess the effectiveness of these criteria. Results show that the criterion based on the derivative of ||w||^2 achieves good results and performs consistently well over the datasets we used.

Journal Article
Kari Torkkola
TL;DR: A quadratic divergence measure is used instead of a commonly used mutual information measure based on Kullback-Leibler divergence, which allows for an efficient non-parametric implementation and requires no prior assumptions about class densities.
Abstract: We present a method for learning discriminative feature transforms using as criterion the mutual information between class labels and transformed features. Instead of a commonly used mutual information measure based on Kullback-Leibler divergence, we use a quadratic divergence measure, which allows us to make an efficient non-parametric implementation and requires no prior assumptions about class densities. In addition to linear transforms, we also discuss nonlinear transforms that are implemented as radial basis function networks. Extensions to reduce the computational complexity are also presented, and a comparison to greedy feature selection is made.

Journal ArticleDOI
28 Jul 2003
TL;DR: The results of a linear (linear discriminant analysis) and two nonlinear classifiers applied to the classification of spontaneous EEG during five mental tasks are reported, showing that nonlinear classifiers produce only slightly better classification results.
Abstract: The reliable operation of brain-computer interfaces (BCIs) based on spontaneous electroencephalogram (EEG) signals requires accurate classification of multichannel EEG. The design of EEG representations and classifiers for BCI are open research questions whose difficulty stems from the need to extract complex spatial and temporal patterns from noisy multidimensional time series obtained from EEG measurements. The high-dimensional and noisy nature of EEG may limit the advantage of nonlinear classification methods over linear ones. This paper reports the results of a linear (linear discriminant analysis) and two nonlinear classifiers (neural networks and support vector machines) applied to the classification of spontaneous EEG during five mental tasks, showing that nonlinear classifiers produce only slightly better classification results. An approach to feature selection based on genetic algorithms is also presented with preliminary results of application to EEG during finger movement.

Proceedings ArticleDOI
Collins, Liu
13 Oct 2003
TL;DR: This paper presents an online feature selection mechanism for evaluating multiple features while tracking and for adjusting the set of features used to improve tracking performance; it also notes the susceptibility of the variance-ratio feature selection method to distraction by spatially correlated background clutter.
Abstract: We present a method for evaluating multiple feature spaces while tracking, and for adjusting the set of features used to improve tracking performance. Our hypothesis is that the features that best discriminate between object and background are also best for tracking the object. We develop an online feature selection mechanism based on the two-class variance ratio measure, applied to log likelihood distributions computed with respect to a given feature from samples of object and background pixels. This feature selection mechanism is embedded in a tracking system that adaptively selects the top-ranked discriminative features for tracking. Examples are presented to illustrate how the method adapts to changing appearances of both tracked object and scene background.
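A minimal sketch of the two-class variance ratio used to rank candidate features: histogram a feature over object and background pixels, form the per-bin log-likelihood ratio, and prefer features for which that ratio spreads the two classes apart while staying tight within each class. The bin count and smoothing constant are illustrative.

```python
import numpy as np

def variance_ratio(obj_values, bg_values, bins=32, eps=1e-3):
    lo = min(obj_values.min(), bg_values.min())
    hi = max(obj_values.max(), bg_values.max())
    p = np.histogram(obj_values, bins=bins, range=(lo, hi))[0] + eps
    q = np.histogram(bg_values, bins=bins, range=(lo, hi))[0] + eps
    p, q = p / p.sum(), q / q.sum()
    L = np.log(p / q)                                 # log-likelihood ratio per bin

    def weighted_var(weights):
        mean = np.sum(weights * L)
        return np.sum(weights * (L - mean) ** 2)

    # High between-class spread, low within-class spread => discriminative feature
    return weighted_var((p + q) / 2) / (weighted_var(p) + weighted_var(q))

rng = np.random.default_rng(0)
print(variance_ratio(rng.normal(2, 1, 500), rng.normal(-2, 1, 500)))
```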

Journal ArticleDOI
TL;DR: A Bayesian approach to supervised learning, which leads to sparse solutions; that is, in which irrelevant parameters are automatically set exactly to zero, and involves no tuning or adjustment of sparseness-controlling hyperparameters.
Abstract: The goal of supervised learning is to infer a functional mapping based on a set of training examples. To achieve good generalization, it is necessary to control the "complexity" of the learned function. In Bayesian approaches, this is done by adopting a prior for the parameters of the function being learned. We propose a Bayesian approach to supervised learning, which leads to sparse solutions; that is, in which irrelevant parameters are automatically set exactly to zero. Other ways to obtain sparse classifiers (such as Laplacian priors, support vector machines) involve (hyper)parameters which control the degree of sparseness of the resulting classifiers; these parameters have to be somehow adjusted/estimated from the training data. In contrast, our approach does not involve any (hyper)parameters to be adjusted or estimated. This is achieved by a hierarchical-Bayes interpretation of the Laplacian prior, which is then modified by the adoption of a Jeffreys' noninformative hyperprior. Implementation is carried out by an expectation-maximization (EM) algorithm. Experiments with several benchmark data sets show that the proposed approach yields state-of-the-art performance. In particular, our method outperforms SVMs and performs competitively with the best alternative techniques, although it involves no tuning or adjustment of sparseness-controlling hyperparameters.

Journal Article
TL;DR: This paper addresses a common methodological flaw in the comparison of variable selection methods: cross-validation performance estimates computed for different variable subsets during computationally intensive searches may overfit and cannot be used reliably to compare selection methods.
Abstract: This paper addresses a common methodological flaw in the comparison of variable selection methods. A practical approach to guiding the search or selection process is to compute cross-validation performance estimates for the different variable subsets. Used with computationally intensive search algorithms, these estimates may overfit and yield biased predictions. Therefore, they cannot be used reliably to compare two selection methods, as is shown by the empirical results of this paper. Instead, as in other instances of the model selection problem, independent test sets should be used for determining the final performance. The claims made in the literature about the superiority of more exhaustive search algorithms over simpler ones are also revisited, and some of them are called into question.
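A minimal sketch of the methodological point, with illustrative data and an off-the-shelf forward search: the cross-validation score that guided the search is reported next to the score on a held-out test set, which is the estimate the paper argues should be used for the final comparison.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=30, n_informative=5, random_state=0)
X_search, X_test, y_search, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier()
selector = SequentialFeatureSelector(knn, n_features_to_select=5).fit(X_search, y_search)

# The CV estimate below reuses the data that steered the search, so it tends to be optimistic.
inner_cv = cross_val_score(knn, selector.transform(X_search), y_search, cv=5).mean()
outer = knn.fit(selector.transform(X_search), y_search).score(selector.transform(X_test), y_test)
print(f"CV estimate used during search: {inner_cv:.3f}  independent test set: {outer:.3f}")
```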

Proceedings Article
01 Jan 2003
TL;DR: A low-order polynomial algorithm and several variants that soundly induce the Markov Blanket under certain broad conditions in datasets with thousands of variables are introduced and compared to other state-of-the-art local and global methods with excellent results.
Abstract: This paper presents a number of new algorithms for discovering the Markov Blanket of a target variable T from training data. The Markov Blanket can be used for variable selection for classification, for causal discovery, and for Bayesian Network learning. We introduce a low-order polynomial algorithm and several variants that soundly induce the Markov Blanket under certain broad conditions in datasets with thousands of variables and compare them to other state-of-the-art local and global methods with excellent results.

Proceedings Article
21 Aug 2003
TL;DR: It is empirically demonstrate that learning a distance metric using the RCA algorithm significantly improves clustering performance, similarly to the alternative algorithm.
Abstract: We address the problem of learning distance metrics using side-information in the form of groups of "similar" points. We propose to use the RCA algorithm, which is a simple and efficient algorithm for learning a full-rank Mahalanobis metric (Shental et al., 2002). We first show that RCA obtains the solution to an interesting optimization problem, founded on an information-theoretic basis. If the Mahalanobis matrix is allowed to be singular, we show that Fisher's linear discriminant followed by RCA is the optimal dimensionality reduction algorithm under the same criterion. We then show how this optimization problem is related to the criterion optimized by another recent algorithm for metric learning (Xing et al., 2002), which uses the same kind of side information. We empirically demonstrate that learning a distance metric using the RCA algorithm significantly improves clustering performance, similarly to the alternative algorithm. Since the RCA algorithm is much more efficient and cost-effective than the alternative, as it only uses closed-form expressions of the data, it seems like a preferable choice for learning full-rank Mahalanobis distances.
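A minimal sketch of the closed-form computation behind RCA, assuming the side information is given as chunklets of points known to be similar: centre each chunklet, pool the within-chunklet covariance, and whiten the data with its inverse square root. The chunklet contents and names are illustrative.

```python
import numpy as np

def rca_transform(chunklets):
    """Return the whitening transform C^{-1/2} of the pooled within-chunklet covariance."""
    centred = np.vstack([c - c.mean(axis=0) for c in chunklets])
    C = centred.T @ centred / len(centred)          # pooled within-chunklet covariance
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T    # Mahalanobis / whitening transform

rng = np.random.default_rng(0)
chunklets = [rng.normal(m, [1.0, 0.1], size=(20, 2)) for m in ([0, 0], [5, 5], [9, 1])]
W = rca_transform(chunklets)
X_new = np.vstack(chunklets) @ W.T                  # Euclidean distances here equal the learned metric
```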

Journal Article
TL;DR: The method constructs a series of sparse linear SVMs to generate linear models that can generalize well, and uses a subset of nonzero weighted variables found by the linear models to produce a final nonlinear model.
Abstract: We describe a methodology for performing variable ranking and selection using support vector machines (SVMs). The method constructs a series of sparse linear SVMs to generate linear models that can generalize well, and uses a subset of nonzero weighted variables found by the linear models to produce a final nonlinear model. The method exploits the fact that a linear SVM (no kernels) with l1-norm regularization inherently performs variable selection as a side-effect of minimizing capacity of the SVM model. The distribution of the linear model weights provides a mechanism for ranking and interpreting the effects of variables. Starplots are used to visualize the magnitude and variance of the weights for each variable. We illustrate the effectiveness of the methodology on synthetic data, benchmark problems, and challenging regression problems in drug design. This method can dramatically reduce the number of variables and outperforms SVMs trained using all attributes and using the attributes selected according to correlation coefficients. The visualization of the resulting models is useful for understanding the role of underlying variables.
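A minimal sketch of the two-stage idea, using scikit-learn as a stand-in: an l1-regularized linear SVM zeroes out many weights as a side effect of training, and the surviving variables feed a nonlinear (RBF) SVM. The data, regularization constant, and threshold are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC, SVC

X, y = make_classification(n_samples=300, n_features=60, n_informative=6, random_state=0)

sparse_svm = LinearSVC(penalty="l1", dual=False, C=0.1, max_iter=5000).fit(X, y)
selected = np.flatnonzero(np.abs(sparse_svm.coef_.ravel()) > 1e-8)   # nonzero-weighted variables
print("variables kept by the linear stage:", selected)

nonlinear = SVC(kernel="rbf").fit(X[:, selected], y)                  # final nonlinear model
```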

Journal ArticleDOI
TL;DR: A study comparing the performance of bearing fault detection using two different classifiers, namely artificial neural networks and support vector machines (SVMs), based on time-domain vibration signals of a rotating machine with normal and defective bearings.

Journal ArticleDOI
TL;DR: A hierarchical Bayesian model for gene (variable) selection is proposed and applied to cancer classification via cDNA microarrays, where the genes BRCA1 and BRCA2 are associated with a hereditary disposition to breast cancer, and the method is used to identify a set of significant genes.
Abstract: Selection of significant genes via expression patterns is an important problem in microarray experiments. Owing to small sample size and the large number of variables (genes), the selection process can be unstable. This paper proposes a hierarchical Bayesian model for gene (variable) selection. We employ latent variables to specialize the model to a regression setting and use a Bayesian mixture prior to perform the variable selection. We control the size of the model by assigning a prior distribution over the dimension (number of significant genes) of the model. The posterior distributions of the parameters are not in explicit form, and we use a combination of truncated sampling and Markov chain Monte Carlo (MCMC)-based computation techniques to simulate the parameters from the posteriors. The Bayesian model is flexible enough to identify significant genes as well as to perform future predictions. The method is applied to cancer classification via cDNA microarrays, where the genes BRCA1 and BRCA2 are associated with a hereditary disposition to breast cancer, and is used to identify a set of significant genes. The method is also applied successfully to the leukemia data.

Proceedings ArticleDOI
03 Nov 2003
TL;DR: This paper presents a special genetic algorithm, which especially takes into account the existing bounds on the generalization error for support vector machines (SVMs), which is compared to the traditional method of performing cross-validation and to other existing algorithms for feature selection.
Abstract: The problem of feature selection is a difficult combinatorial task in machine learning and of high practical relevance, e.g. in bioinformatics. Genetic algorithms (GAs) offer a natural way to solve this problem. In this paper, we present a special genetic algorithm, which specifically takes into account the existing bounds on the generalization error for support vector machines (SVMs). This new approach is compared to the traditional method of performing cross-validation and to other existing algorithms for feature selection.

Journal ArticleDOI
TL;DR: The model selector should instead focus on the parameter singled out for interest; in particular, a model that gives good precision for one estimand may be worse when used for inference about another estimand.
Abstract: A variety of model selection criteria have been developed, of general and specific types. Most of these aim at selecting a single model with good overall properties, for example, formulated via average prediction quality or shortest estimated overall distance to the true model. The Akaike, the Bayesian, and the deviance information criteria, along with many suitable variations, are examples of such methods. These methods are not concerned, however, with the actual use of the selected model, which varies with context and application. The present article takes the view that the model selector should instead focus on the parameter singled out for interest; in particular, a model that gives good precision for one estimand may be worse when used for inference for another estimand. We develop a method that, for a given focus parameter, estimates the precision of any submodel-based estimator. The framework is that of large-sample likelihood inference. Using an unbiased estimate of limiting risk, we propose a foc...

Journal Article
TL;DR: Grafting treats the selection of suitable features as an integral part of learning a predictor in a regularized learning framework, and operates in an incremental iterative fashion, gradually building up a feature set while training a predictor model using gradient descent.
Abstract: We present a novel and flexible approach to the problem of feature selection, called grafting. Rather than considering feature selection as separate from learning, grafting treats the selection of suitable features as an integral part of learning a predictor in a regularized learning framework. To make this regularized learning process sufficiently fast for large scale problems, grafting operates in an incremental iterative fashion, gradually building up a feature set while training a predictor model using gradient descent. At each iteration, a fast gradient-based heuristic is used to quickly assess which feature is most likely to improve the existing model, that feature is then added to the model, and the model is incrementally optimized using gradient descent. The algorithm scales linearly with the number of data points and at most quadratically with the number of features. Grafting can be used with a variety of predictor model classes, both linear and non-linear, and can be used for both classification and regression. Experiments are reported here on a variant of grafting for classification, using both linear and non-linear models, and using a logistic regression-inspired loss function. Results on a variety of synthetic and real world data sets are presented. Finally the relationship between grafting, stagewise additive modelling, and boosting is explored.
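A minimal sketch of the grafting heuristic for a linear model with logistic loss and an l1 penalty: score every unused feature by the magnitude of the loss gradient at a zero weight, add the best one if it beats the regularization strength, and re-optimize the working model. The constants are illustrative, and the re-fit uses scikit-learn's l1 logistic regression as a stand-in for the paper's gradient-descent optimizer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=40, n_informative=5, random_state=0)

lam, active = 0.05, []
p = np.full(len(y), 0.5)                               # model output before any feature is added
for _ in range(X.shape[1]):
    grad = X.T @ (p - y) / len(y)                      # d(mean logistic loss)/d w_j at the current fit
    grad[active] = 0.0                                 # only score features not yet in the model
    j = int(np.argmax(np.abs(grad)))
    if np.abs(grad[j]) <= lam:                         # no remaining feature can improve the penalized objective
        break
    active.append(j)
    model = LogisticRegression(penalty="l1", C=1 / (lam * len(y)), solver="liblinear")
    model.fit(X[:, active], y)                         # incremental re-optimization of the working model
    p = model.predict_proba(X[:, active])[:, 1]
print("features added by grafting:", active)
```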

Proceedings ArticleDOI
13 Oct 2003
TL;DR: The evaluation shows that local invariant descriptors are an appropriate representation for object classes such as cars, and it underlines the importance of feature selection.
Abstract: We introduce a novel method for constructing and selecting scale-invariant object parts. Scale-invariant local descriptors are first grouped into basic parts. A classifier is then learned for each of these parts, and feature selection is used to determine the most discriminative ones. This approach allows robust part detection, and it is invariant under scale changes; that is, neither the training images nor the test images have to be normalized. The proposed method is evaluated in car detection tasks with significant variations in viewing conditions, and promising results are demonstrated. Different local regions, classifiers and feature selection methods are quantitatively compared. Our evaluation shows that local invariant descriptors are an appropriate representation for object classes such as cars, and it underlines the importance of feature selection.

Proceedings ArticleDOI
24 Aug 2003
TL;DR: A novel local algorithm that returns all variables with direct edges to and from a target variable T, as well as a local algorithm that returns the Markov Blanket of T; both are promising not only for discovery of local causal structure and variable selection for classification, but also for the induction of complete BNs.
Abstract: Data mining with Bayesian Network learning has two important characteristics: first, under certain conditions the learned edges between variables correspond to causal influences, and second, for every variable T in the network a special subset (the Markov Blanket), identifiable from the network, is the minimal variable set required to predict T. However, all known algorithms for learning a complete BN do not scale up beyond a few hundred variables. On the other hand, all known sound algorithms for learning a local region of the network require a number of training instances exponential in the size of the learned region. The contribution of this paper is two-fold. We introduce a novel local algorithm that returns all variables with direct edges to and from a target variable T, as well as a local algorithm that returns the Markov Blanket of T. Both algorithms (i) are sound, (ii) can be run efficiently in datasets with thousands of variables, and (iii) significantly outperform previous state-of-the-art algorithms in approximating the true neighborhood, using only a fraction of the training size required by existing methods. A fundamental difference between our approach and existing ones is that the required sample depends on the connectivity of the generating graph and not on the size of the local region; this yields up to exponential savings in sample relative to previously known algorithms. The results presented here are promising not only for discovery of local causal structure and variable selection for classification, but also for the induction of complete BNs.

Proceedings Article
01 Jan 2003
TL;DR: A novel, sound, sample-efficient, and highly scalable algorithm for variable selection for classification, regression and prediction called HITON, which reduces the number of variables in the prediction models by three orders of magnitude relative to the original variable set while improving or maintaining accuracy.
Abstract: We introduce a novel, sound, sample-efficient, and highly scalable algorithm for variable selection for classification, regression and prediction called HITON. The algorithm works by inducing the Markov Blanket of the variable to be classified or predicted. A wide variety of biomedical tasks with different characteristics were used for an empirical evaluation. Namely, (i) bioactivity prediction for drug discovery, (ii) clinical diagnosis of arrhythmias, (iii) bibliographic text categorization, (iv) lung cancer diagnosis from gene expression array data, and (v) proteomics-based prostate cancer detection. State-of-the-art algorithms for each domain were selected for baseline comparison. Results: (1) HITON reduces the number of variables in the prediction models by three orders of magnitude relative to the original variable set while improving or maintaining accuracy. (2) HITON outperforms the baseline algorithms by selecting variable sets that are more than two orders of magnitude smaller than those of the baselines, in the selected tasks and datasets.