scispace - formally typeset
Search or ask a question

Showing papers on "Feature (machine learning) published in 2008"


Journal ArticleDOI
TL;DR: A novel unsupervised learning method for human action categories that can recognize and localize multiple actions in long and complex video sequences containing multiple motions.
Abstract: We present a novel unsupervised learning method for human action categories. A video sequence is represented as a collection of spatial-temporal words by extracting space-time interest points. The algorithm automatically learns the probability distributions of the spatial-temporal words and the intermediate topics corresponding to human action categories. This is achieved by using latent topic models such as the probabilistic Latent Semantic Analysis (pLSA) model and Latent Dirichlet Allocation (LDA). Our approach can handle noisy feature points arisen from dynamic background and moving cameras due to the application of the probabilistic models. Given a novel video sequence, the algorithm can categorize and localize the human action(s) contained in the video. We test our algorithm on three challenging datasets: the KTH human motion dataset, the Weizmann human action dataset, and a recent dataset of figure skating actions. Our results reflect the promise of such a simple approach. In addition, our algorithm can recognize and localize multiple actions in long and complex video sequences containing multiple motions.

1,440 citations


Book
21 Feb 2008
TL;DR: The aim of this review is first to present the core architecture of a HMM-based LVCSR system and then to describe the various refinements which are needed to achieve state-of-the-art performance.
Abstract: Hidden Markov Models (HMMs) provide a simple and effective framework for modelling time-varying spectral vector sequences. As a consequence, almost all present day large vocabulary continuous speech recognition (LVCSR) systems are based on HMMs. Whereas the basic principles underlying HMM-based LVCSR are rather straightforward, the approximations and simplifying assumptions involved in a direct implementation of these principles would result in a system which has poor accuracy and unacceptable sensitivity to changes in operating environment. Thus, the practical application of HMMs in modern systems involves considerable sophistication. The aim of this review is first to present the core architecture of a HMM-based LVCSR system and then describe the various refinements which are needed to achieve state-of-the-art performance. These refinements include feature projection, improved covariance modelling, discriminative parameter estimation, adaptation and normalisation, noise compensation and multi-pass system combination. The review concludes with a case study of LVCSR for Broadcast News and Conversation transcription in order to illustrate the techniques described.

763 citations


Journal ArticleDOI
TL;DR: Recursive Feature Elimination is evaluated in terms of sensitivity of discriminative maps (Receiver Operative Characteristic analysis) and generalization performances and compare it to previously used univariate voxel selection strategies based on activation and discrimination measures.

532 citations


Journal ArticleDOI
TL;DR: This study proposed the use of stylometric analysis techniques to help identify individuals based on writing style, and incorporated a rich set of stylistic features, including lexical, syntactic, structural, content-specific, and idiosyncratic attributes.
Abstract: One of the problems often associated with online anonymity is that it hinders social accountability, as substantiated by the high levels of cybercrime. Although identity cues are scarce in cyberspace, individuals often leave behind textual identity traces. In this study we proposed the use of stylometric analysis techniques to help identify individuals based on writing style. We incorporated a rich set of stylistic features, including lexical, syntactic, structural, content-specific, and idiosyncratic attributes. We also developed the Writeprints technique for identification and similarity detection of anonymous identities. Writeprints is a Karhunen-Loeve transforms-based technique that uses a sliding window and pattern disruption algorithm with individual author-level feature sets. The Writeprints technique and extended feature set were evaluated on a testbed encompassing four online datasets spanning different domains: email, instant messaging, feedback comments, and program code. Writeprints outperformed benchmark techniques, including SVM, Ensemble SVM, PCA, and standard Karhunen-Loeve transforms, on the identification and similarity detection tasks with accuracy as high as 94p when differentiating between 100 authors. The extended feature set also significantly outperformed a baseline set of features commonly used in previous research. Furthermore, individual-author-level feature sets generally outperformed use of a single group of attributes.

442 citations


Proceedings Article
08 Dec 2008
TL;DR: Through experiments on the text-aided image classification and cross-language classification tasks, it is demonstrated that the translated learning framework can greatly outperform many state-of-the-art baseline methods.
Abstract: This paper investigates a new machine learning strategy called translated learning. Unlike many previous learning tasks, we focus on how to use labeled data from one feature space to enhance the classification of other entirely different learning spaces. For example, we might wish to use labeled text data to help learn a model for classifying image data, when the labeled images are difficult to obtain. An important aspect of translated learning is to build a "bridge" to link one feature space (known as the "source space") to another space (known as the "target space") through a translator in order to migrate the knowledge from source to target. The translated learning solution uses a language model to link the class labels to the features in the source spaces, which in turn is translated to the features in the target spaces. Finally, this chain of linkages is completed by tracing back to the instances in the target spaces. We show that this path of linkage can be modeled using a Markov chain and risk minimization. Through experiments on the text-aided image classification and cross-language classification tasks, we demonstrate that our translated learning framework can greatly outperform many state-of-the-art baseline methods.

305 citations


Journal ArticleDOI
TL;DR: The authors proposed a new model of human concept learning based on Bayesian inference for a grammatically structured hypothesis space and compared the model predictions to human generalization judgments in several well-known category learning experiments, and found good agreement for both average and individual participant generalizations.

294 citations


Proceedings ArticleDOI
25 Oct 2008
TL;DR: This work explores the use of the MIRA algorithm of Crammer et al. as an alternative to MERT and shows that by parallel processing and exploiting more of the parse forest, it can obtain results using MIRA that match or surpass MERT in terms of both translation quality and computational cost.
Abstract: Minimum-error-rate training (MERT) is a bottleneck for current development in statistical machine translation because it is limited in the number of weights it can reliably optimize. Building on the work of Watanabe et al., we explore the use of the MIRA algorithm of Crammer et al. as an alternative to MERT. We first show that by parallel processing and exploiting more of the parse forest, we can obtain results using MIRA that match or surpass MERT in terms of both translation quality and computational cost. We then test the method on two classes of features that address deficiencies in the Hiero hierarchical phrase-based model: first, we simultaneously train a large number of Marton and Resnik's soft syntactic constraints, and, second, we introduce a novel structural distortion model. In both cases we obtain significant improvements in translation performance. Optimizing them in combination, for a total of 56 feature weights, we improve performance by 2.6 Bleu on a subset of the NIST 2006 Arabic-English evaluation data.

283 citations


Proceedings ArticleDOI
20 Jul 2008
TL;DR: This paper proposes a method for training discriminative probabilistic models with labeled features and unlabeled instances and expresses soft constraints using generalized expectation (GE) criteria terms in a parameter estimation objective function that express preferences on values of a model expectation.
Abstract: It is difficult to apply machine learning to new domains because often we lack labeled problem instances. In this paper, we provide a solution to this problem that leverages domain knowledge in the form of affinities between input features and classes. For example, in a baseball vs. hockey text classification problem, even without any labeled data, we know that the presence of the word puck is a strong indicator of hockey. We refer to this type of domain knowledge as a labeled feature. In this paper, we propose a method for training discriminative probabilistic models with labeled features and unlabeled instances. Unlike previous approaches that use labeled features to create labeled pseudo-instances, we use labeled features directly to constrain the model's predictions on unlabeled instances. We express these soft constraints using generalized expectation (GE) criteria --- terms in a parameter estimation objective function that express preferences on values of a model expectation. In this paper we train multinomial logistic regression models using GE criteria, but the method we develop is applicable to other discriminative probabilistic models. The complete objective function also includes a Gaussian prior on parameters, which encourages generalization by spreading parameter weight to unlabeled features. Experimental results on text classification data sets show that this method outperforms heuristic approaches to training classifiers with labeled features. Experiments with human annotators show that it is more beneficial to spend limited annotation time labeling features rather than labeling instances. For example, after only one minute of labeling features, we can achieve 80% accuracy on the ibm vs. mac text classification problem using GE-FL, whereas ten minutes labeling documents results in an accuracy of only 77%

278 citations


Journal ArticleDOI
TL;DR: A novel, detailed classification of developed AFR systems has been introduced and potentials and limitations of these approaches are discussed, and directions for further research work are emphasized.

273 citations


Proceedings ArticleDOI
23 Jun 2008
TL;DR: This work proposes a new procedure for recognition of low-resolution faces, when there is a high-resolution training set available, and shows that recognition of faces of as low as 6 times 6 pixel size is considerably improved compared to matching using a super-resolution reconstruction followed by classification, and to matching with a low- resolution training set.
Abstract: Face recognition degrades when faces are of very low resolution since many details about the difference between one person and another can only be captured in images of sufficient resolution. In this work, we propose a new procedure for recognition of low-resolution faces, when there is a high-resolution training set available. Most previous super-resolution approaches are aimed at reconstruction, with recognition only as an after-thought. In contrast, in the proposed method, face features, as they would be extracted for a face recognition algorithm (e.g., eigenfaces, Fisher-faces, etc.), are included in a super-resolution method as prior information. This approach simultaneously provides measures of fit of the super-resolution result, from both reconstruction and recognition perspectives. This is different from the conventional paradigms of matching in a low-resolution domain, or, alternatively, applying a super-resolution algorithm to a low-resolution face and then classifying the super-resolution result. We show, for example, that recognition of faces of as low as 6 times 6 pixel size is considerably improved compared to matching using a super-resolution reconstruction followed by classification, and to matching with a low-resolution training set.

272 citations


Book ChapterDOI
01 Jun 2008
TL;DR: Fusing the multi-modality data resulted in a large increase in the recognition rates in comparison with the unimodal systems: the multimodal approach gave an improvement of more than 10% when compared to the most successful unimmodal system.
Abstract: In this paper we present a multimodal approach for the recognition of eight emotions. Our approach integrates information from facial expressions, body movement and gestures and speech. We trained and tested a model with a Bayesian classifier, using a multimodal corpus with eight emotions and ten subjects. Firstly, individual classifiers were trained for each modality. Next, data were fused at the feature level and the decision level. Fusing the multimodal data resulted in a large increase in the recognition rates in comparison with the unimodal systems: the multimodal approach gave an improvement of more than 10% when compared to the most successful unimodal system. Further, the fusion performed at the feature level provided better results than the one performed at the decision level.

Journal Article
TL;DR: A novel coordinate descent algorithm for training linear SVM with the L2-loss function that is more efficient and stable than state of the art methods such as Pegasos and TRON.
Abstract: Linear support vector machines (SVM) are useful for classifying large-scale sparse data. Problems with sparse features are common in applications such as document classification and natural language processing. In this paper, we propose a novel coordinate descent algorithm for training linear SVM with the L2-loss function. At each step, the proposed method minimizes a one-variable sub-problem while fixing other variables. The sub-problem is solved by Newton steps with the line search technique. The procedure globally converges at the linear rate. As each sub-problem involves only values of a corresponding feature, the proposed approach is suitable when accessing a feature is more convenient than accessing an instance. Experiments show that our method is more efficient and stable than state of the art methods such as Pegasos and TRON.

Journal ArticleDOI
TL;DR: The importance of feature selection in accurately classifying new samples and how an integrated feature selection and classification algorithm is performing and is capable of identifying significant genes are revealed.
Abstract: Several classification and feature selection methods have been studied for the identification of differentially expressed genes in microarray data. Classification methods such as SVM, RBF Neural Nets, MLP Neural Nets, Bayesian, Decision Tree and Random Forrest methods have been used in recent studies. The accuracy of these methods has been calculated with validation methods such as v-fold validation. However there is lack of comparison between these methods to find a better framework for classification, clustering and analysis of microarray gene expression results. In this study, we compared the efficiency of the classification methods including; SVM, RBF Neural Nets, MLP Neural Nets, Bayesian, Decision Tree and Random Forrest methods. The v-fold cross validation was used to calculate the accuracy of the classifiers. Some of the common clustering methods including K-means, DBC, and EM clustering were applied to the datasets and the efficiency of these methods have been analysed. Further the efficiency of the feature selection methods including support vector machine recursive feature elimination (SVM-RFE), Chi Squared, and CSF were compared. In each case these methods were applied to eight different binary (two class) microarray datasets. We evaluated the class prediction efficiency of each gene list in training and test cross-validation using supervised classifiers. We presented a study in which we compared some of the common used classification, clustering, and feature selection methods. We applied these methods to eight publicly available datasets, and compared how these methods performed in class prediction of test datasets. We reported that the choice of feature selection methods, the number of genes in the gene list, the number of cases (samples) substantially influence classification success. Based on features chosen by these methods, error rates and accuracy of several classification algorithms were obtained. Results revealed the importance of feature selection in accurately classifying new samples and how an integrated feature selection and classification algorithm is performing and is capable of identifying significant genes.

01 Jan 2008
TL;DR: The results show that the classification accuracy based on PCA is highly sensitive to the type of data and that the variance captured the principal components is not necessarily a vital indicator for the classification performance.
Abstract: Dimensionality reduction and feature subset selection are two techniques for reducing the attribute space of a feature set, which is an important component of both supervised and unsupervised classification or regression problems. While in feature subset selection a subset of the original attributes is extracted, dimensionality reduction in general produces linear combinations of the original attribute set. In this paper we investigate the relationship between several attribute space reduction techniques and the resulting classification accuracy for two very different application areas. On the one hand, we consider e-mail filtering, where the feature space contains various properties of e-mail messages, and on the other hand, we consider drug discovery problems, where quantitative representations of molecular structures are encoded in terms of information-preserving descriptor values. Subsets of the original attributes constructed by filter and wrapper techniques as well as subsets of linear combinations of the original attributes constructed by three different variants of the principle component analysis (PCA) are compared in terms of the classification performance achieved with various machine learning algorithms as well as in terms of runtime performance. We successively reduce the size of the attribute sets and investigate the changes in the classification results. Moreover, we explore the relationship between the variance captured in the linear combinations within PCA and the resulting classification accuracy. The results show that the classification accuracy based on PCA is highly sensitive to the type of data and that the variance captured the principal components is not necessarily a vital indicator for the classification performance.

Proceedings ArticleDOI
Qi Su1, Xinying Xu1, Honglei Guo2, Zhili Guo2, Xian Wu2, Xiaoxun Zhang2, Bin Swen1, Zhong Su2 
21 Apr 2008
TL;DR: A novel mutual reinforcement approach to deal with the feature-level opinion mining problem, which can largely predict opinions relating to different product features, even for the case without the explicit appearance of product feature words in reviews.
Abstract: The boom of product review websites, blogs and forums on the web has attracted many research efforts on opinion mining. Recently, there was a growing interest in the finer-grained opinion mining, which detects opinions on different review features as opposed to the whole review level. The researches on feature-level opinion mining mainly rely on identifying the explicit relatedness between product feature words and opinion words in reviews. However, the sentiment relatedness between the two objects is usually complicated. For many cases, product feature words are implied by the opinion words in reviews. The detection of such hidden sentiment association is still a big challenge in opinion mining. Especially, it is an even harder task of feature-level opinion mining on Chinese reviews due to the nature of Chinese language. In this paper, we propose a novel mutual reinforcement approach to deal with the feature-level opinion mining problem. More specially, 1) the approach clusters product features and opinion words simultaneously and iteratively by fusing both their content information and sentiment link information. 2) under the same framework, based on the product feature categories and opinion word groups, we construct the sentiment association set between the two groups of data objects by identifying their strongest n sentiment links. Moreover, knowledge from multi-source is incorporated to enhance clustering in the procedure. Based on the pre-constructed association set, our approach can largely predict opinions relating to different product features, even for the case without the explicit appearance of product feature words in reviews. Thus it provides a more accurate opinion evaluation. The experimental results demonstrate that our method outperforms the state-of-art algorithms.

Posted Content
TL;DR: The extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.
Abstract: For supervised and unsupervised learning, positive definite kernels allow to use large and potentially infinite dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the l1-norm or the block l1-norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to non linear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.

Journal ArticleDOI
TL;DR: The application of Gabor filter based feature extraction in combination with learning vector quantization (LVQ) for recognition of seven different facial expressions from still pictures of the human face proves the feasibility of computer vision based facial expression recognition for practical applications like surveillance and human computer interaction.

15 Sep 2008
TL;DR: In this article, the authors investigate the relationship between several attribute space reduction techniques and the resulting classification accuracy for two very different application areas, e.g., e-mail filtering and drug discovery.
Abstract: Dimensionality reduction and feature subset selection are two techniques for reducing the attribute space of a feature set, which is an important component of both supervised and unsupervised classification or regression problems. While in feature subset selection a subset of the original attributes is extracted, dimensionality reduction in general produces linear combinations of the original attribute set. In this paper we investigate the relationship between several attribute space reduction techniques and the resulting classification accuracy for two very different application areas. On the one hand, we consider e-mail filtering, where the feature space contains various properties of e-mail messages, and on the other hand, we consider drug discovery problems, where quantitative representations of molecular structures are encoded in terms of information-preserving descriptor values. Subsets of the original attributes constructed by filter and wrapper techniques as well as subsets of linear combinations of the original attributes constructed by three different variants of the principle component analysis (PCA) are compared in terms of the classification performance achieved with various machine learning algorithms as well as in terms of runtime performance. We successively reduce the size of the attribute sets and investigate the changes in the classification results. Moreover, we explore the relationship between the variance captured in the linear combinations within PCA and the resulting classification accuracy. The results show that the classification accuracy based on PCA is highly sensitive to the type of data and that the variance captured the principal components is not necessarily a vital indicator for the classification performance.

Proceedings ArticleDOI
24 Aug 2008
TL;DR: This paper considers a general framework for extracting shared structures in multi-label classification, and includes several well-known algorithms as special cases, thus elucidating their intrinsic relationships.
Abstract: Multi-label problems arise in various domains such as multi-topic document categorization and protein function prediction. One natural way to deal with such problems is to construct a binary classifier for each label, resulting in a set of independent binary classification problems. Since the multiple labels share the same input space, and the semantics conveyed by different labels are usually correlated, it is essential to exploit the correlation information contained in different labels. In this paper, we consider a general framework for extracting shared structures in multi-label classification. In this framework, a common subspace is assumed to be shared among multiple labels. We show that the optimal solution to the proposed formulation can be obtained by solving a generalized eigenvalue problem, though the problem is non-convex. For high-dimensional problems, direct computation of the solution is expensive, and we develop an efficient algorithm for this case. One appealing feature of the proposed framework is that it includes several well-known algorithms as special cases, thus elucidating their intrinsic relationships. We have conducted extensive experiments on eleven multi-topic web page categorization tasks, and results demonstrate the effectiveness of the proposed formulation in comparison with several representative algorithms.

Proceedings ArticleDOI
12 Dec 2008
TL;DR: This work develops a new cross-domain SVM (CDSVM) algorithm for adapting previously learned support vectors from one domain to help classification in another domain, and proposes an intuitive selection criterion to determine which cross- domain learning method to use for each concept.
Abstract: Exploding amounts of multimedia data increasingly require automatic indexing and classification, e.g. training classifiers to produce high-level features, or semantic concepts, chosen to represent image content, like car, person, etc. When changing the applied domain (i.e. from news domain to consumer home videos), the classifiers trained in one domain often perform poorly in the other domain due to changes in feature distributions. Additionally, classifiers trained on the new domain alone may suffer from too few positive training samples. Appropriately adapting data/models from an old domain to help classify data in a new domain is an important issue. In this work, we develop a new cross-domain SVM (CDSVM) algorithm for adapting previously learned support vectors from one domain to help classification in another domain. Better precision is obtained with almost no additional computational cost. Also, we give a comprehensive summary and comparative study of the state- of-the-art SVM-based cross-domain learning methods. Evaluation over the latest large-scale TRECVID benchmark data set shows that our CDSVM method can improve mean average precision over 36 concepts by 7.5%. For further performance gain, we also propose an intuitive selection criterion to determine which cross-domain learning method to use for each concept.

Proceedings ArticleDOI
23 Jun 2008
TL;DR: A novel approach using a 4D (x,y,z,t) action feature model (4D-AFM) for recognizing actions from arbitrary views and promising recognition results have been demonstrated on the multi-view IXMAS dataset.
Abstract: In this paper we present a novel approach using a 4D (x,y,z,t) action feature model (4D-AFM) for recognizing actions from arbitrary views The 4D-AFM elegantly encodes shape and motion of actors observed from multiple views The modeling process starts with reconstructing 3D visual hulls of actors at each time instant Spatiotemporal action features are then computed in each view by analyzing the differential geometric properties of spatio-temporal volumes (3D STVs) generated by concatenating the actorpsilas silhouette over the course of the action (x, y, t) These features are mapped to the sequence of 3D visual hulls over time (4D) to build the initial 4D-AFM Actions are recognized based on the scores of matching action features from the input videos to the model points of 4D-AFMs by exploiting pairwise interactions of features Promising recognition results have been demonstrated on the multi-view IXMAS dataset using both single and multi-view input videos

Journal ArticleDOI
TL;DR: Several kernel functions to model parse tree properties in kernel-based machines, for example, perceptrons or support vector machines are proposed and tree kernels allow for a general and easily portable feature engineering method which is applicable to a large family of natural language processing tasks.
Abstract: The availability of large scale data sets of manually annotated predicate-argument structures has recently favored the use of machine learning approaches to the design of automated semantic role labeling (SRL) systems. The main research in this area relates to the design choices for feature representation and for effective decompositions of the task in different learning models. Regarding the former choice, structural properties of full syntactic parses are largely employed as they represent ways to encode different principles suggested by the linking theory between syntax and semantics. The latter choice relates to several learning schemes over global views of the parses. For example, re-ranking stages operating over alternative predicate-argument sequences of the same sentence have shown to be very effective. In this article, we propose several kernel functions to model parse tree properties in kernel-based machines, for example, perceptrons or support vector machines. In particular, we define different kinds of tree kernels as general approaches to feature engineering in SRL. Moreover, we extensively experiment with such kernels to investigate their contribution to individual stages of an SRL architecture both in isolation and in combination with other traditional manually coded features. The results for boundary recognition, classification, and re-ranking stages provide systematic evidence about the significant impact of tree kernels on the overall accuracy, especially when the amount of training data is small. As a conclusive result, tree kernels allow for a general and easily portable feature engineering method which is applicable to a large family of natural language processing tasks.

Journal ArticleDOI
TL;DR: A single multi-class kernel machine that informatively combines the available feature groups and is able to provide the state-of-the-art in performance accuracy on the fold recognition problem is demonstrated.
Abstract: Motivation: The problems of protein fold recognition and remote homology detection have recently attracted a great deal of interest as they represent challenging multi-feature multi-class problems for which modern pattern recognition methods achieve only modest levels of performance. As with many pattern recognition problems, there are multiple feature spaces or groups of attributes available, such as global characteristics like the amino-acid composition (C), predicted secondary structure (S), hydrophobicity (H), van der Waals volume (V), polarity (P), polarizability (Z), as well as attributes derived from local sequence alignment such as the Smith–Waterman scores. This raises the need for a classification method that is able to assess the contribution of these potentially heterogeneous object descriptors while utilizing such information to improve predictive performance. To that end, we offer a single multi-class kernel machine that informatively combines the available feature groups and, as is demonstrated in this article, is able to provide the state-of-the-art in performance accuracy on the fold recognition problem. Furthermore, the proposed approach provides some insight by assessing the significance of recently introduced protein features and string kernels. The proposed method is well-founded within a Bayesian hierarchical framework and a variational Bayes approximation is derived which allows for efficient CPU processing times. Results: The best performance which we report on the SCOP PDB-40D benchmark data-set is a 70% accuracy by combining all the available feature groups from global protein characteristics but also including sequence-alignment features. We offer an 8% improvement on the best reported performance that combines multi-class k-nn classifiers while at the same time reducing computational costs and assessing the predictive power of the various available features. Furthermore, we examine the performance of our methodology on the SCOP 1.53 benchmark data-set that simulates remote homology detection and examine the combination of various state-of-the-art string kernels that have recently been proposed. Contact: theo@dcs.gla.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.

Book ChapterDOI
Amr Ahmed1, Kai Yu, Wei Xu, Yihong Gong, Eric P. Xing1 
12 Oct 2008
TL;DR: This paper presents a framework for training hierarchical feed-forward models for visual recognition, using transfer learning from pseudo tasks, and shows that these pseudo tasks induce an informative inverse-Wishart prior on the functional behavior of the network, offering an effective way to incorporate useful prior knowledge into the network training.
Abstract: Building visual recognition models that adapt across different domains is a challenging task for computer vision. While feature-learning machines in the form of hierarchial feed-forward models (e.g., convolutional neural networks) showed promise in this direction, they are still difficult to train especially when few training examples are available. In this paper, we present a framework for training hierarchical feed-forward models for visual recognition, using transfer learning from pseudo tasks. These pseudo tasks are automatically constructed from data without supervision and comprise a set of simple pattern-matching operations. We show that these pseudo tasks induce an informative inverse-Wishart prior on the functional behavior of the network, offering an effective way to incorporate useful prior knowledge into the network training. In addition to being extremely simple to implement, and adaptable across different domains with little or no extra tuning, our approach achieves promising results on challenging visual recognition tasks, including object recognition, gender recognition, and ethnicity recognition.

Journal ArticleDOI
TL;DR: This work proposes a submitochondria localizer whose feature extraction method is based on the Chou’s pseudo-amino acid composition, and proposes a very few parameterized method, named ALL-Loc, where all the “ original” features are used to train a linear support vector machine.
Abstract: Given a protein that is localized in the mitochondria it is very important to know the submitochondria localization of that protein to understand its function. In this work, we propose a submitochondria localizer whose feature extraction method is based on the Chou's pseudo-amino acid composition. The pseudo-amino acid based features are obtained by combining pseudo-amino acid compositions with hundreds of amino-acid indices and amino-acid substitution matrices, then from this huge set of features a small set of 15 "artificial" features is created. The feature creation is performed by genetic programming combining one or more "original" features by means of some mathematical operators. Finally, the set of combined features are used to train a radial basis function support vector machine. This method is named GP-Loc. Moreover, we also propose a very few parameterized method, named ALL-Loc, where all the "original" features are used to train a linear support vector machine. The overall prediction accuracy obtained by GP-Loc is 89% when the jackknife cross-validation is used, this result outperforms the performance obtained in the literature (85.2%) using the same dataset. While the overall prediction accuracy obtained by ALL-Loc is 83.9%.

Journal ArticleDOI
TL;DR: Experimental results show that the method of combining 2D-LDA (Linear Discriminant Analysis) and SVM (Support Vector Machine) outperforms others and takes only 0.0357 second to process one image of size 256 × 256.
Abstract: Facial expression provides an important behavioral measure for studies of emotion, cognitive processes, and social interaction. Facial expression recognition has recently become a promising research area. Its applications include human-computer interfaces, human emotion analysis, and medical care and cure. In this paper, we investigate various feature representation and expression classification schemes to recognize seven different facial expressions, such as happy, neutral, angry, disgust, sad, fear and surprise, in the JAFFE database. Experimental results show that the method of combining 2D-LDA (Linear Discriminant Analysis) and SVM (Support Vector Machine) outperforms others. The recognition rate of this method is 95.71% by using leave-one-out strategy and 94.13% by using cross-validation strategy. It takes only 0.0357 second to process one image of size 256 × 256.

Proceedings ArticleDOI
23 Jun 2008
TL;DR: A novel automatic feature selection method based on maximizing the average relative entropy of marginalized class-conditional feature distributions and apply it to a complete pool of candidate features composed of normalized Euclidean distances between 83 facial feature points in the 3D space is proposed.
Abstract: In this paper, the problem of person-independent facial expression recognition from 3D facial shapes is investigated. We propose a novel automatic feature selection method based on maximizing the average relative entropy of marginalized class-conditional feature distributions and apply it to a complete pool of candidate features composed of normalized Euclidean distances between 83 facial feature points in the 3D space. Using a regularized multi-class AdaBoost classification algorithm, we achieve a 95.1% average recognition rate for six universal facial expressions on the publicly available 3D facial expression database BU-3DFE [1], with a highest average recognition rate of 99.2% for the recognition of surprise. We compare these results with the results based on a set of manually devised features and demonstrate that the auto features yield better results than the manual features. Our results outperform the results presented in the previous work [2] and [3], namely average recognition rates of 83.6% and 91.3% on the same database, respectively.

Patent
07 Mar 2008
TL;DR: In this paper, a facial expression recognition system that uses a face detection apparatus realizing efficient learning and high-speed detection processing based on ensemble learning when detecting an area representing a detection target and that is robust against shifts of face position included in images and capable of highly accurate expression recognition, and a learning method for the system, are provided.
Abstract: A facial expression recognition system that uses a face detection apparatus realizing efficient learning and high-speed detection processing based on ensemble learning when detecting an area representing a detection target and that is robust against shifts of face position included in images and capable of highly accurate expression recognition, and a learning method for the system, are provided. When learning data to be used by the face detection apparatus by Adaboost, processing to select high-performance weak hypotheses from all weak hypotheses, then generate new weak hypotheses from these high-performance weak hypotheses on the basis of statistical characteristics, and select one weak hypothesis having the highest discrimination performance from these weak hypotheses, is repeated to sequentially generate a weak hypothesis, and a final hypothesis is thus acquired. In detection, using an abort threshold value that has been learned in advance, whether provided data can be obviously judged as a non-face is determined every time one weak hypothesis outputs the result of discrimination. If it can be judged so, processing is aborted. A predetermined Gabor filter is selected from the detected face image by an Adaboost technique, and a support vector for only a feature quantity extracted by the selected filter is learned, thus performing expression recognition.

Proceedings ArticleDOI
12 May 2008
TL;DR: Improvements over standard cepstral features and probabilistic MLP features are shown for different tasks and different neural net input representations, and a significant improvement is observed when phoneme MLP training targets are replaced by phoneme states and when delta features are added.
Abstract: This work continues in development of the recently proposed bottle-neck features for ASR. A five-layers MLP used in bottleneck feature extraction allows to obtain arbitrary feature size without dimensionality reduction by transforms, independently on the MLP training targets. The MLP topology - number and sizes of layers, suitable training targets, the impact of output feature transforms, the need of delta features, and the dimensionality of the final feature vector are studied with respect to the best ASR result. Optimized features are employed in three LVCSR tasks: Arabic broadcast news, English conversational telephone speech and English meetings. Improvements over standard cepstral features and probabilistic MLP features are shown for different tasks and different neural net input representations. A significant improvement is observed when phoneme MLP training targets are replaced by phoneme states and when delta features are added.

Proceedings ArticleDOI
12 Jul 2008
TL;DR: The results show that AR coefficients obvious discriminate different human activities and it can be extract as an effective feature for the recognition of accelerometer date.
Abstract: In this paper, the autoregressive (AR) model of time-series is presented to recognize human activity from a tri-axial accelerometer data. Four orders of autoregressive model for accelerometer data is built and the AR coefficients are extracted as features for activity recognition. Classification of the human activities is performed with support vector machine (SVM). The average recognition results for four activities (running, still, jumping and walking) using the proposed AR-based features are 92.25%, which are better than using traditional frequently used time domains features (mean, standard deviation, energy and correlation of acceleration data) and FFT features. The results show that AR coefficients obvious discriminate different human activities and it can be extract as an effective feature for the recognition of accelerometer date.