
Showing papers on "Feature selection" published in 2014


Journal ArticleDOI
TL;DR: The objective is to provide a generic introduction to variable elimination that can be applied to a wide array of machine learning problems, with a focus on Filter, Wrapper and Embedded methods.

3,517 citations


Proceedings ArticleDOI
23 Jun 2014
TL;DR: It is shown how an ensemble of regression trees can be used to estimate the face's landmark positions directly from a sparse subset of pixel intensities, achieving super-realtime performance with high quality predictions.
Abstract: This paper addresses the problem of Face Alignment for a single image. We show how an ensemble of regression trees can be used to estimate the face's landmark positions directly from a sparse subset of pixel intensities, achieving super-realtime performance with high quality predictions. We present a general framework based on gradient boosting for learning an ensemble of regression trees that optimizes the sum of square error loss and naturally handles missing or partially labelled data. We show how using appropriate priors exploiting the structure of image data helps with efficient feature selection. Different regularization strategies and their importance in combating overfitting are also investigated. In addition, we analyse the effect of the quantity of training data on the accuracy of the predictions and explore the effect of data augmentation using synthesized data.

2,545 citations
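The core regression idea above, stripped of the paper's cascade, shape-indexed features and priors, can be sketched with off-the-shelf gradient-boosted trees. This is a minimal illustration on synthetic data, not the authors' implementation; all shapes and hyperparameters here are assumptions.

```python
# Hedged sketch: regressing landmark coordinates from sparse pixel
# intensities with gradient-boosted regression trees (square error loss).
# The paper's cascade, shape-indexed features and priors are not reproduced.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
n_images, n_pixels, n_landmarks = 500, 400, 5

X = rng.normal(size=(n_images, n_pixels))          # sparse pixel intensities
Y = rng.normal(size=(n_images, 2 * n_landmarks))   # (x, y) per landmark

# One boosted ensemble per output coordinate; shallow trees keep it fast.
model = MultiOutputRegressor(
    GradientBoostingRegressor(n_estimators=100, max_depth=4)
)
model.fit(X, Y)
pred = model.predict(X[:1])   # predicted landmark vector for one image
```

In the paper, such ensembles sit inside a cascade: each stage regresses a shape update from intensities indexed relative to the current shape estimate, which is what makes the method super-realtime.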


Journal ArticleDOI
TL;DR: This survey paper presents a comprehensive overview of the latest developments in sentiment analysis, with a detailed categorization of a large number of recent articles and an illustration of recent research trends in sentiment analysis and its related areas.

2,152 citations


Journal ArticleDOI
Xudong Cao1, Yichen Wei1, Fang Wen1, Jian Sun1
TL;DR: A very efficient, highly accurate, “Explicit Shape Regression” approach for face alignment that significantly outperforms the state-of-the-art in terms of both accuracy and efficiency.
Abstract: We present a very efficient, highly accurate, "Explicit Shape Regression" approach for face alignment. Unlike previous regression-based approaches, we directly learn a vectorial regression function to infer the whole facial shape (a set of facial landmarks) from the image and explicitly minimize the alignment errors over the training data. The inherent shape constraint is naturally encoded into the regressor in a cascaded learning framework and applied from coarse to fine during the test, without using a fixed parametric shape model as in most previous methods. To make the regression more effective and efficient, we design a two-level boosted regression, shape indexed features and a correlation-based feature selection method. This combination enables us to learn accurate models from large training data in a short time (20 min for 2,000 training images), and run regression extremely fast in test (15 ms for an 87-landmark shape). Experiments on challenging data show that our approach significantly outperforms the state-of-the-art in terms of both accuracy and efficiency.

1,239 citations


Journal ArticleDOI
01 Oct 2014-Genetics
TL;DR: The BGLR R-package implements a large collection of Bayesian regression models, including parametric variable selection and shrinkage methods and semiparametric procedures, allowing various parametric and nonparametric shrinkage and variable selection procedures to be integrated in a unified and consistent manner.
Abstract: Many modern genomic data analyses require implementing regressions where the number of parameters (p, e.g., the number of marker effects) exceeds sample size (n). Implementing these large-p-with-small-n regressions poses several statistical and computational challenges, some of which can be confronted using Bayesian methods. This approach allows integrating various parametric and nonparametric shrinkage and variable selection procedures in a unified and consistent manner. The BGLR R-package implements a large collection of Bayesian regression models, including parametric variable selection and shrinkage methods and semiparametric procedures (Bayesian reproducing kernel Hilbert spaces regressions, RKHS). The software was originally developed for genomic applications; however, the methods implemented are useful for many nongenomic applications as well. The response can be continuous (censored or not) or categorical (either binary or ordinal). The algorithm is based on a Gibbs sampler with scalar updates and the implementation takes advantage of efficient compiled C and Fortran routines. In this article we describe the methods implemented in BGLR, present examples of the use of the package, and discuss practical issues emerging in real-data analysis.

987 citations


Journal ArticleDOI
TL;DR: This work reviews feature extraction methods for emotion recognition from EEG based on 33 studies, and results suggest a preference for electrode locations over the parietal and centro-parietal lobes.
Abstract: Emotion recognition from EEG signals allows the direct assessment of the "inner" state of a user, which is considered an important factor in human-machine-interaction. Many methods for feature extraction have been studied, and the selection of both appropriate features and electrode locations is usually based on neuro-scientific findings. Their suitability for emotion recognition, however, has been tested using a small number of distinct feature sets and on different, usually small data sets. A major limitation is that no systematic comparison of features exists. Therefore, we review feature extraction methods for emotion recognition from EEG based on 33 studies. An experiment is conducted comparing these features using machine learning techniques for feature selection on a self-recorded data set. Results are presented with respect to performance of different feature selection methods, usage of selected feature types, and selection of electrode locations. Features selected by multivariate methods slightly outperform univariate methods. Advanced feature extraction techniques are found to have advantages over commonly used spectral power bands. Results also suggest a preference for electrode locations over the parietal and centro-parietal lobes.

743 citations
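As a concrete example of the commonly used spectral power-band features that the review compares against more advanced techniques, the sketch below computes per-channel band power with Welch's method. The sampling rate, band limits and synthetic signal are illustrative assumptions, not values from the paper.

```python
# Hedged sketch: spectral power-band features for EEG, via Welch's method.
import numpy as np
from scipy.signal import welch

fs = 256                               # sampling rate in Hz (assumed)
eeg = np.random.randn(32, fs * 60)     # 32 channels, 60 s of synthetic EEG

bands = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}

freqs, psd = welch(eeg, fs=fs, nperseg=fs * 2)    # PSD per channel

features = {}
df = freqs[1] - freqs[0]
for name, (lo, hi) in bands.items():
    mask = (freqs >= lo) & (freqs < hi)
    # rectangle-rule integration of the PSD -> one power feature per channel
    features[name] = psd[:, mask].sum(axis=1) * df
```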


Journal ArticleDOI
TL;DR: An algorithm for repeated grid-search V-fold cross-validation for parameter tuning in classification and regression, and a repeated nested cross-validation algorithm for model assessment are described and evaluated.
Abstract: We address the problem of selecting and assessing classification and regression models using cross-validation. Current state-of-the-art methods can yield models with high variance, rendering them unsuitable for a number of practical applications including QSAR. In this paper we describe and evaluate best practices which improve reliability and increase confidence in selected models. A key operational component of the proposed methods is cloud computing which enables routine use of previously infeasible approaches. We describe in detail an algorithm for repeated grid-search V-fold cross-validation for parameter tuning in classification and regression, and we define a repeated nested cross-validation algorithm for model assessment. As regards variable selection and parameter tuning we define two algorithms (repeated grid-search cross-validation and double cross-validation), and provide arguments for using the repeated grid-search in the general case. We show results of our algorithms on seven QSAR datasets. The variation of the prediction performance, which is the result of choosing different splits of the dataset in V-fold cross-validation, needs to be taken into account when selecting and assessing classification and regression models. We demonstrate the importance of repeating cross-validation when selecting an optimal model, as well as the importance of repeating nested cross-validation when assessing a prediction error.

644 citations
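The two procedures described above map directly onto standard scikit-learn primitives. The following sketch shows repeated grid-search cross-validation for tuning wrapped in a repeated outer loop for assessment; the dataset, model and grid are placeholders rather than the paper's QSAR setups.

```python
# Hedged sketch: repeated grid-search CV (tuning) inside repeated nested CV
# (assessment), using scikit-learn. Placeholders throughout.
from sklearn.datasets import make_classification
from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                     cross_val_score)
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}

# Inner loop: repeated V-fold grid search for parameter tuning.
inner = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=1)
tuner = GridSearchCV(SVC(), grid, cv=inner)

# Outer loop: repeated V-fold CV around the tuner = repeated nested CV.
outer = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=2)
scores = cross_val_score(tuner, X, y, cv=outer)
print(scores.mean(), scores.std())  # the spread reflects split-to-split variance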


Proceedings ArticleDOI
23 Jun 2014
TL;DR: A novel Boosted Deep Belief Network (BDBN) performs the three training stages iteratively in a unified loopy framework; experiments showed that the BDBN framework yielded dramatic improvements in facial expression analysis.
Abstract: A training process for facial expression recognition is usually performed sequentially in three individual stages: feature learning, feature selection, and classifier construction. Extensive empirical studies are needed to search for an optimal combination of feature representation, feature set, and classifier to achieve good recognition performance. This paper presents a novel Boosted Deep Belief Network (BDBN) for performing the three training stages iteratively in a unified loopy framework. Through the proposed BDBN framework, a set of features, which are effective in characterizing expression-related facial appearance/shape changes, can be learned and selected to form a boosted strong classifier in a statistical way. As learning continues, the strong classifier is improved iteratively and, more importantly, the discriminative capabilities of the selected features are strengthened as well, according to their relative importance to the strong classifier, via a joint fine-tuning process in the BDBN framework. Extensive experiments on two public databases showed that the BDBN framework yielded dramatic improvements in facial expression analysis.

608 citations


Journal ArticleDOI
TL;DR: An experimental evaluation on the most representative datasets using well-known feature selection methods is presented, bearing in mind that the aim is not to provide the best feature selection method, but to facilitate their comparative study by the research community.

530 citations


Journal ArticleDOI
TL;DR: The knockoff filter is introduced, a new variable selection procedure controlling the FDR in the statistical linear model whenever there are at least as many observations as variables, and empirical results show that the resulting method has far more power than existing selection rules when the proportion of null variables is high.
Abstract: In many fields of science, we observe a response variable together with a large number of potential explanatory variables, and would like to be able to discover which variables are truly associated with the response. At the same time, we need to know that the false discovery rate (FDR) - the expected fraction of false discoveries among all discoveries - is not too high, in order to assure the scientist that most of the discoveries are indeed true and replicable. This paper introduces the knockoff filter, a new variable selection procedure controlling the FDR in the statistical linear model whenever there are at least as many observations as variables. This method achieves exact FDR control in finite sample settings no matter the design or covariates, the number of variables in the model, or the amplitudes of the unknown regression coefficients, and does not require any knowledge of the noise level. As the name suggests, the method operates by manufacturing knockoff variables that are cheap - their construction does not require any new data - and are designed to mimic the correlation structure found within the existing variables, in a way that allows for accurate FDR control, beyond what is possible with permutation-based methods. The method of knockoffs is very general and flexible, and can work with a broad class of test statistics. We test the method in combination with statistics from the Lasso for sparse regression, and obtain empirical results showing that the resulting method has far more power than existing selection rules when the proportion of null variables is high.

503 citations
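Once the knockoff variables have been constructed and a statistic W_j computed for each original/knockoff pair (for example, a lasso coefficient difference), the selection step itself is only a few lines. The sketch below implements the knockoff+ thresholding rule; the W values are fabricated for illustration, and the construction of the knockoffs themselves is omitted.

```python
# Hedged sketch: the knockoff+ selection rule, assuming statistics W
# (one per variable, sign-symmetric under the null) are already computed.
import numpy as np

def knockoff_select(W, q=0.10):
    """Return indices selected at target FDR level q (knockoff+ rule)."""
    ts = np.sort(np.abs(W[W != 0]))           # candidate thresholds
    for t in ts:
        # estimated false discovery proportion at threshold t
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return np.where(W >= t)[0]        # selected variables
    return np.array([], dtype=int)            # no threshold achieves level q

# Fake statistics: large positive W suggests a true signal.
W = np.array([3.1, 2.7, 2.0, 1.8, 1.5, 1.2, 0.9, 0.7, 0.4, -0.2])
print(knockoff_select(W, q=0.25))             # selects indices 0-8 here
```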


Journal ArticleDOI
TL;DR: A review of the state of the art of information-theoretic feature selection methods can be found in this paper, where the concepts of feature relevance, redundancy, and complementarity are clearly defined, as well as the Markov blanket.
Abstract: In this work, we present a review of the state of the art of information-theoretic feature selection methods. The concepts of feature relevance, redundancy, and complementarity (synergy) are clearly defined, as well as the Markov blanket. The problem of optimal feature selection is defined. A unifying theoretical framework is described, which can retrofit successful heuristic criteria, indicating the approximations made by each method. A number of open problems in the field are presented.
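Much of this literature can be organized around a single greedy scoring rule. A common unifying form (the parameterization below follows Brown et al.'s framework, which reviews of this kind build on; treating it as this paper's exact notation is an assumption) scores a candidate feature X_k against the already-selected set S:

```latex
J(X_k) \;=\; I(X_k;Y)
       \;-\; \beta \sum_{j \in S} I(X_k;X_j)
       \;+\; \gamma \sum_{j \in S} I(X_k;X_j \mid Y)
```

The first term captures relevance to the class Y, the second redundancy with selected features, and the third class-conditional complementarity (synergy). Choosing \beta = 1/|S| and \gamma = 0 recovers an mRMR-style criterion, while \beta = \gamma = 1/|S| yields a JMI-style criterion.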

Journal ArticleDOI
TL;DR: This paper proposes a novel unsupervised feature selection framework, termed joint embedding learning and sparse regression (JELSR), in which embedding learning and sparse regression are performed jointly to carry out feature selection.
Abstract: Feature selection has aroused considerable research interest during the last few decades. Traditional learning-based feature selection methods separate embedding learning and feature ranking. In this paper, we propose a novel unsupervised feature selection framework, termed joint embedding learning and sparse regression (JELSR), in which embedding learning and sparse regression are jointly performed. Specifically, the proposed JELSR joins embedding learning with sparse regression to perform feature selection. To show the effectiveness of the proposed framework, we also provide an instantiation that constructs the weight via local linear approximation and adds $\ell_{2,1}$-norm regularization, and design an effective algorithm to solve the corresponding optimization problem. Furthermore, we discuss the proposed feature selection approach in depth, including its convergence analysis, computational complexity, and parameter determination. In all, the proposed framework not only provides a new perspective on traditional methods but also suggests further research directions for feature selection. Compared with traditional unsupervised feature selection methods, our approach integrates the merits of embedding learning and sparse regression. Promising experimental results on different kinds of data sets, including image, voice, and biological data, have validated the effectiveness of the proposed algorithm.

Journal ArticleDOI
01 May 2014
TL;DR: Experiments on twenty benchmark datasets show that PSO with the new initialisation strategies and/or the new updating mechanisms can automatically evolve a feature subset with a smaller number of features and higher classification performance than using all features.
Abstract: In classification, feature selection is an important data pre-processing technique, but it is a difficult problem due mainly to the large search space. Particle swarm optimisation (PSO) is an efficient evolutionary computation technique. However, the traditional personal best and global best updating mechanism in PSO limits its performance for feature selection, and the potential of PSO for feature selection has not been fully investigated. This paper proposes three new initialisation strategies and three new personal best and global best updating mechanisms in PSO to develop novel feature selection approaches with the goals of maximising the classification performance, minimising the number of features and reducing the computational time. The proposed initialisation strategies and updating mechanisms are compared with the traditional initialisation and the traditional updating mechanism. Meanwhile, the most promising initialisation strategy and updating mechanism are combined to form a new approach (PSO(4-2)) to address feature selection problems, and it is compared with two traditional feature selection methods and two PSO-based methods. Experiments on twenty benchmark datasets show that PSO with the new initialisation strategies and/or the new updating mechanisms can automatically evolve a feature subset with a smaller number of features and higher classification performance than using all features. PSO(4-2) outperforms the two traditional methods and the two PSO-based algorithms in terms of computational time, the number of features and the classification performance. The superior performance of this algorithm is due mainly to the proposed initialisation strategy, which takes advantage of both forward selection and backward selection to decrease the number of features and the computational time, and the new updating mechanism, which overcomes the limitations of traditional updating mechanisms by taking the number of features into account, thereby reducing the number of features and the computational time.
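For reference, the traditional binary PSO wrapper that the paper improves on can be sketched as follows. It uses the standard sigmoid position update and a fitness that trades cross-validated accuracy against subset size; the dataset, swarm parameters and penalty weight are illustrative assumptions, and none of the paper's new initialisation strategies or updating mechanisms are reproduced.

```python
# Hedged sketch: a plain binary PSO wrapper for feature selection.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
n_particles, n_feats, iters = 10, X.shape[1], 15

def fitness(mask):
    if mask.sum() == 0:
        return 0.0
    acc = cross_val_score(KNeighborsClassifier(),
                          X[:, mask.astype(bool)], y, cv=3).mean()
    return acc - 0.01 * mask.mean()        # small penalty on subset size

pos = rng.integers(0, 2, size=(n_particles, n_feats))
vel = rng.normal(scale=0.1, size=(n_particles, n_feats))
pbest, pbest_fit = pos.copy(), np.array([fitness(p) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()

for _ in range(iters):
    r1, r2 = rng.random((2, n_particles, n_feats))
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    # sigmoid transfer: bit j is set with probability sigmoid(v_j)
    pos = (rng.random((n_particles, n_feats)) < 1 / (1 + np.exp(-vel))).astype(int)
    fits = np.array([fitness(p) for p in pos])
    improved = fits > pbest_fit
    pbest[improved], pbest_fit[improved] = pos[improved], fits[improved]
    gbest = pbest[pbest_fit.argmax()].copy()

print("selected features:", np.flatnonzero(gbest))
```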

Journal ArticleDOI
TL;DR: Feature reduction is an essential step before training a machine learning model to avoid overfitting, thereby improving model prediction accuracy and generalization ability; this review discusses feature reduction techniques used with machine learning in neuroimaging studies.
Abstract: Machine learning techniques are increasingly being used to make predictions and inferences from individual subjects' neuroimaging scan data. Previous studies have mostly focused on categorical discrimination of patients from matched healthy controls and, more recently, on prediction of individual continuous variables such as clinical scores or age. However, these studies are greatly hampered by the large number of predictor variables (voxels) and the low number of observations (subjects), also known as the curse of dimensionality or the small-n-large-p problem. As a result, feature reduction techniques such as feature subset selection and dimensionality reduction are used to remove redundant predictor variables and experimental noise, a process which mitigates the curse-of-dimensionality and small-n-large-p effects. Feature reduction is an essential step before training a machine learning model, helping to avoid overfitting and thereby improving model prediction accuracy and generalization ability. In this review, we discuss feature reduction techniques used with machine learning in neuroimaging studies.

Journal ArticleDOI
TL;DR: This paper presents a data mining (DM) based approach to developing ensemble models for predicting next-day energy consumption and peak power demand, with the aim of improving the prediction accuracy.

Journal ArticleDOI
TL;DR: The concepts of feature relevance, general procedures, evaluation criteria, and the characteristics of feature selection are introduced, and guidelines are provided for users to select a feature selection algorithm without requiring detailed knowledge of each algorithm.
Abstract: Relevant feature identification has become an essential task for applying data mining algorithms effectively in real-world scenarios. Many feature selection methods have therefore been proposed in the literature to obtain relevant features or feature subsets for classification and clustering. This paper introduces the concepts of feature relevance, general procedures, evaluation criteria, and the characteristics of feature selection. A comprehensive overview, categorization, and comparison of existing feature selection methods are also presented, along with guidelines for users to select a feature selection algorithm without requiring detailed knowledge of each algorithm. We conclude this work with real-world applications, challenges, and future research directions of feature selection.

Journal ArticleDOI
TL;DR: A novel spam detection method focused on reducing the false positive error of mislabeling nonspam as spam; experiments demonstrated that the MBPSO is superior to GA, RSA, PSO, and BPSO in terms of classification performance, and that wrappers are more effective than filters with regard to classification performance indexes.
Abstract: In this paper, we propose a novel spam detection method focused on reducing the false positive error of mislabeling nonspam as spam. First, we used a wrapper-based feature selection method to extract crucial features. Second, the decision tree was chosen as the classifier model, with C4.5 as the training algorithm. Third, a cost matrix was introduced to give different weights to the two error types, i.e., the false positive and the false negative errors; we defined a weight parameter a to adjust their relative importance. Fourth, K-fold cross validation was employed to reduce out-of-sample error. Finally, the binary PSO with mutation operator (MBPSO) was used as the subset search strategy. Our experimental dataset contains 6000 emails collected during 2012. We conducted a Kolmogorov–Smirnov hypothesis test on the capital-run-length related features and found that all the p values were less than 0.001. Afterwards, we found a = 7 to be the most appropriate in our model. Among seven meta-heuristic algorithms, we demonstrated that the MBPSO is superior to GA, RSA, PSO, and BPSO in terms of classification performance. The sensitivity, specificity, and accuracy of the decision tree with feature selection by MBPSO were 91.02%, 97.51%, and 94.27%, respectively. We also compared the MBPSO with conventional feature selection methods such as SFS and SBS; the results showed that the MBPSO performs better than both. We also demonstrated that wrappers are more effective than filters with regard to classification performance indexes. These results clearly show that the proposed method is effective and can reduce the false positive error without compromising the sensitivity and accuracy values.
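The cost-matrix idea, making a false positive (nonspam flagged as spam) several times costlier than a false negative, can be approximated in scikit-learn with class weights standing in for the paper's C4.5-with-costs setup. The synthetic data below and the weight of 7 (mirroring the paper's a = 7) are assumptions for illustration only.

```python
# Hedged sketch: cost-sensitive decision tree penalizing false positives.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=6000, n_features=57,
                           weights=[0.6], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Mistakes on class 0 (nonspam) cost 7 times more than mistakes on class 1.
clf = DecisionTreeClassifier(class_weight={0: 7, 1: 1}, random_state=0)
clf.fit(X_tr, y_tr)
print(confusion_matrix(y_te, clf.predict(X_te)))
```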

Journal ArticleDOI
TL;DR: Experimental results that were achieved using the proposed novel HGA-NN classifier are promising for feature selection and classification in retail credit risk assessment and indicate that the HGA-NN classifier is a promising addition to existing data mining techniques.
Abstract: In this paper, an advanced novel heuristic algorithm is presented, the hybrid genetic algorithm with neural networks (HGA-NN), which is used to identify an optimum feature subset and to increase the classification accuracy and scalability in credit risk assessment. This algorithm is based on the following basic hypothesis: the high-dimensional input feature space can be preliminarily restricted to only the important features. In this preliminary restriction, fast algorithms for feature ranking and earlier experience are used. Additionally, enhancements are made in the creation of the initial population, as well as by introducing an incremental stage in the genetic algorithm. The performance of the proposed HGA-NN classifier is evaluated using a real-world credit dataset collected at a Croatian bank, and the findings are further validated on another real-world credit dataset selected from the UCI database. The classification accuracy is compared with that presented in the literature. Experimental results achieved using the proposed novel HGA-NN classifier are promising for feature selection and classification in retail credit risk assessment and indicate that the HGA-NN classifier is a promising addition to existing data mining techniques.

Journal ArticleDOI
TL;DR: In this paper, a masked EM algorithm is proposed for high-dimensional data, which allows accurate and time-efficient clustering of up to millions of points in thousands of dimensions.
Abstract: Cluster analysis faces two problems in high dimensions: the "curse of dimensionality" that can lead to overfitting and poor generalization performance and the sheer time taken for conventional algorithms to process large amounts of high-dimensional data. We describe a solution to these problems, designed for the application of spike sorting for next-generation, high-channel-count neural probes. In this problem, only a small subset of features provides information about the cluster membership of any one data vector, but this informative feature subset is not the same for all data points, rendering classical feature selection ineffective. We introduce a "masked EM" algorithm that allows accurate and time-efficient clustering of up to millions of points in thousands of dimensions. We demonstrate its applicability to synthetic data and to real-world high-channel-count spike sorting data.

Journal ArticleDOI
TL;DR: Experimental results show that the proposed kernel-based feature selection method with a criterion that is an integration of the previous work and the linear combination of features improves the classification performance of the SVM.
Abstract: Hyperspectral imaging fully portrays materials through numerous and contiguous spectral bands. It is a very useful technique in various fields, including astronomy, medicine, food safety, forensics, and target detection. However, hyperspectral images include redundant measurements, and most classification studies have encountered the Hughes phenomenon. Finding a small subset of effective features to model the characteristics of classes represented in the data is a critical preprocessing step required to render a classifier effective in hyperspectral image classification. In our previous work, an automatic method for selecting the radial basis function (RBF) parameter (i.e., σ) for a support vector machine (SVM) was proposed. A criterion that contains the between-class and within-class information was proposed to measure the separability of the feature space with respect to the RBF kernel. Thereafter, the optimal RBF kernel parameter was obtained by optimizing the criterion. This study proposes a kernel-based feature selection method with a criterion that integrates the previous work and the linear combination of features. In this new method, two properties can be achieved according to the magnitudes of the calculated coefficients: a small subset of features and a ranking of features. Experimental results on one simulated dataset and two hyperspectral images (the Indian Pine Site dataset and the Pavia University dataset) show that the proposed method improves the classification performance of the SVM.

Journal ArticleDOI
TL;DR: This paper presents an unsupervised feature selection method based on ant colony optimization, called UFSACO, which seeks to find the optimal feature subset through several iterations without using any learning algorithms.

Journal ArticleDOI
TL;DR: A greedy feature selection method using mutual information is introduced, which combines feature–feature and feature–class mutual information to find an optimal subset of features that minimizes redundancy and maximizes relevance.
Abstract: Feature selection is used to choose a subset of relevant features for effective classification of data. In high dimensional data classification, the performance of a classifier often depends on the feature subset used for classification. In this paper, we introduce a greedy feature selection method using mutual information. This method combines both feature–feature mutual information and feature–class mutual information to find an optimal subset of features that minimizes redundancy and maximizes relevance among features. The effectiveness of the selected feature subset is evaluated using multiple classifiers on multiple datasets. The performance of our method, in terms of both classification accuracy and execution time, was found to be significantly better than several competing feature selection techniques on twelve real-life datasets of varied dimensionality and numbers of instances.
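A minimal sketch of this kind of greedy criterion, combining feature-class relevance with average feature-feature redundancy, is shown below. The exact weighting the paper uses may differ, and the dataset and quartile discretization are illustrative choices.

```python
# Hedged sketch: greedy mutual-information feature selection
# (relevance minus mean redundancy with already-selected features).
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

X, y = load_wine(return_X_y=True)

# Feature-class relevance, estimated directly on the continuous features.
relevance = mutual_info_classif(X, y, random_state=0)

# Discretize each feature at its quartiles for feature-feature MI estimates.
Xd = np.column_stack([np.digitize(col, np.quantile(col, [0.25, 0.5, 0.75]))
                      for col in X.T])

selected, remaining, k = [], list(range(X.shape[1])), 5
while len(selected) < k:
    def score(j):
        if not selected:
            return relevance[j]
        redundancy = np.mean([mutual_info_score(Xd[:, j], Xd[:, s])
                              for s in selected])
        return relevance[j] - redundancy
    best = max(remaining, key=score)
    selected.append(best)
    remaining.remove(best)

print("selected features:", selected)
```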

Journal ArticleDOI
TL;DR: A novel disease-specific feature selection method is proposed, consisting of a one-versus-one (OvO) feature ranking stage and a feature search stage wrapped in the same OvO-rule support vector machine (SVM) binary classifier.

Journal ArticleDOI
TL;DR: A novel unsupervised feature selection algorithm, named clustering-guided sparse structural learning (CGSSL), is proposed by integrating cluster analysis and sparse structural analysis into a joint framework; experimental evaluation demonstrates its efficiency and effectiveness.
Abstract: Many pattern analysis and data mining problems have witnessed high-dimensional data represented by a large number of features, which are often redundant and noisy. Feature selection is one main technique for dimensionality reduction that involves identifying a subset of the most useful features. In this paper, a novel unsupervised feature selection algorithm, named clustering-guided sparse structural learning (CGSSL), is proposed by integrating cluster analysis and sparse structural analysis into a joint framework and experimentally evaluated. Nonnegative spectral clustering is developed to learn more accurate cluster labels of the input samples, which guide feature selection simultaneously. Meanwhile, the cluster labels are also predicted by exploiting the hidden structure shared by different features, which can uncover feature correlations to make the results more reliable. Row-wise sparse models are leveraged to make the proposed model suitable for feature selection. To optimize the proposed formulation, we propose an efficient iterative algorithm. Finally, extensive experiments are conducted on 12 diverse benchmarks, including face data, handwritten digit data, document data, and biomedical data. The encouraging experimental results in comparison with several representative algorithms and the theoretical analysis demonstrate the efficiency and effectiveness of the proposed algorithm for feature selection.

Journal ArticleDOI
TL;DR: This review provides an overview of wavelength selection methods in food-related areas and offers a thoughtful perspective on future potentials and challenges in the development of HSI systems.
Abstract: During the past decade, hyperspectral imaging (HSI) has been rapidly developing and widely applied in the food industry by virtue of the use of chemometric techniques in which wavelength selection methods play an important role. This paper is a review of such variable selection methods and their limitations, describing the basic taxonomy of the methods and their respective advantages and disadvantages. Special attention is paid to recent developments in wavelength selection techniques for HSI in the field of food quality and safety evaluations. Typical and commonly used methods in HSI, such as partial least squares regression, stepwise regression and spectrum analysis, are described in detail. Some sophisticated methods, such as successive projections algorithm, uninformative variable elimination, simulated annealing, artificial neural network and genetic algorithm methods, are also discussed. Finally, new methods not currently used but that could have substantial impact on the field are presented. In short, this review provides an overview of wavelength selection methods in food-related areas and offers a thoughtful perspective on future potentials and challenges in the development of HSI systems.

Journal ArticleDOI
TL;DR: New supervised feature selection methods based on hybridizing Particle Swarm Optimization (PSO), PSO-based Relative Reduct, and PSO-based Quick Reduct are presented for disease diagnosis, demonstrating the efficiency of the proposed technique as well as enhancements over existing feature selection techniques.

Journal ArticleDOI
TL;DR: This work introduces incremental mechanisms for three representative information entropies and develops a group incremental rough feature selection algorithm based on information entropy that aims to find the new feature subset in a much shorter time when multiple objects are added to a decision table.
Abstract: Many real data sets increase dynamically in size. This phenomenon occurs in several fields including economics, population studies, and medical research. As an effective and efficient mechanism to deal with such data, incremental techniques have been proposed in the literature and have attracted much attention, which motivates the work in this paper. When a group of objects is added to a decision table, we first introduce incremental mechanisms for three representative information entropies and then develop a group incremental rough feature selection algorithm based on information entropy. When multiple objects are added to a decision table, the algorithm aims to find the new feature subset in a much shorter time. Experiments have been carried out on eight UCI data sets, and the experimental results show that the algorithm is effective and efficient.
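The key idea, avoiding a full rescan of the decision table when a group of objects arrives, can be illustrated by caching the counts from which an entropy is computed. The sketch below maintains joint counts of condition-attribute equivalence classes and decision classes and recomputes the conditional entropy H(D|B) from those counts alone; the paper derives finer-grained update formulas for three entropies, which are not reproduced here.

```python
# Hedged sketch: group-incremental conditional entropy via cached counts.
from collections import Counter
import math

counts = Counter()   # (condition tuple, decision label) -> frequency
n = 0

def add_group(rows):
    """rows: iterable of (condition_tuple, decision_label) new objects."""
    global n
    for cond, dec in rows:
        counts[(cond, dec)] += 1
        n += 1

def conditional_entropy():
    """H(D|B) computed from the cached counts only, no rescan of old data."""
    cond_totals = Counter()
    for (cond, _), c in counts.items():
        cond_totals[cond] += c
    h = 0.0
    for (cond, _), c in counts.items():
        h -= (c / n) * math.log2(c / cond_totals[cond])
    return h

add_group([(("a", 1), "yes"), (("a", 1), "no"), (("b", 0), "yes")])
print(conditional_entropy())
add_group([(("b", 0), "yes"), (("b", 0), "no")])   # incremental update
print(conditional_entropy())
```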

Journal ArticleDOI
TL;DR: The experimental results illustrate that the proposed multi-sensor system can achieve an overall recognition accuracy of 96.4% by adopting the mean and variance features, using the Decision Tree classifier.

Journal ArticleDOI
TL;DR: A complete algorithmic description, a learning code, and a learned face detector that can be applied to any color image are provided, and a post-processing step is proposed to reduce detection redundancy using a robustness argument.
Abstract: In this article, we decipher the Viola-Jones algorithm, the first ever real-time face detection system. There are three ingredients working in concert to enable fast and accurate detection: the integral image for feature computation, AdaBoost for feature selection, and an attentional cascade for efficient computational resource allocation. Here we propose a complete algorithmic description, a learning code, and a learned face detector that can be applied to any color image. Since the Viola-Jones algorithm typically gives multiple detections, a post-processing step is also proposed to reduce detection redundancy using a robustness argument. The source code and the online demo are accessible at the IPOL web page of this article.
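The first of the three ingredients, the integral image, is simple enough to sketch directly: a cumulative-sum table lets any rectangular Haar-like feature be evaluated with four lookups regardless of its size. The image and window coordinates below are arbitrary.

```python
# Hedged sketch: the integral image trick at the heart of Viola-Jones.
import numpy as np

img = np.random.rand(240, 320)

# Integral image padded with a zero row/column to simplify corner lookups.
ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)

def box_sum(r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] in O(1) via four lookups."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

# A two-rectangle Haar-like feature: left half minus right half of a window.
feat = box_sum(10, 10, 34, 22) - box_sum(10, 22, 34, 34)
assert np.isclose(box_sum(10, 10, 34, 34), img[10:34, 10:34].sum())
```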

Journal ArticleDOI
TL;DR: This article investigates the problem of online feature selection (OFS), in which an online learner is only allowed to maintain a classifier involving only a small and fixed number of features, and presents novel algorithms for both the full-input and partial-input settings.
Abstract: Feature selection is an important technique for data mining. Despite its importance, most studies of feature selection are restricted to batch learning. Unlike traditional batch learning methods, online learning represents a promising family of efficient and scalable machine learning algorithms for large-scale applications. Most existing studies of online learning require accessing all the attributes/features of training instances. Such a classical setting is not always appropriate for real-world applications when data instances are of high dimensionality or it is expensive to acquire the full set of attributes/features. To address this limitation, we investigate the problem of online feature selection (OFS), in which an online learner is only allowed to maintain a classifier involving only a small and fixed number of features. The key challenge of online feature selection is how to make accurate predictions for an instance using a small number of active features. This is in contrast to the classical setup of online learning where all the features can be used for prediction. We attempt to tackle this challenge by studying sparsity regularization and truncation techniques. Specifically, this article addresses two different tasks of online feature selection: 1) learning with full input, where a learner is allowed to access all the features to decide the subset of active features, and 2) learning with partial input, where only a limited number of features is allowed to be accessed for each instance by the learner. We present novel algorithms to solve each of the two problems and give their performance analysis. We evaluate the performance of the proposed algorithms for online feature selection on several public data sets, and demonstrate their applications to real-world problems including image classification in computer vision and microarray gene expression analysis in bioinformatics. The encouraging results of our experiments validate the efficacy and efficiency of the proposed techniques.
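A minimal sketch of the "learning with full input" setting is given below: an online gradient step on the hinge loss followed by truncation to the B largest-magnitude weights, so the classifier never holds more than B active features. The step size, budget and synthetic stream are assumptions, and the paper's projection step and partial-input algorithm are omitted.

```python
# Hedged sketch: online feature selection via gradient step + truncation.
import numpy as np

rng = np.random.default_rng(0)
d, B, eta = 100, 10, 0.2          # dimension, feature budget, step size
w = np.zeros(d)

true_w = np.zeros(d)
true_w[:5] = 1.0                  # sparse ground truth for the stream

for _ in range(2000):
    x = rng.normal(size=d)
    y = 1 if true_w @ x > 0 else -1
    if y * (w @ x) < 1:                       # hinge loss is positive
        w += eta * y * x                      # online gradient step
        keep = np.argsort(np.abs(w))[-B:]     # indices of top-B weights
        mask = np.zeros(d, dtype=bool)
        mask[keep] = True
        w[~mask] = 0.0                        # truncate all other weights

print("active features:", np.flatnonzero(w))  # should concentrate on 0-4
```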