
Showing papers on "Feature (machine learning)" published in 2015


Proceedings ArticleDOI
07 Jun 2015
TL;DR: In this article, a recurrent neural network that uses Long Short-Term Memory (LSTM) cells which are connected to the output of the underlying CNN was proposed to model the video as an ordered sequence of frames.
Abstract: Convolutional neural networks (CNNs) have been extensively applied for image recognition problems giving state-of-the-art results on recognition, detection, segmentation and retrieval. In this work we propose and evaluate several deep neural network architectures to combine image information across a video over longer time periods than previously attempted. We propose two methods capable of handling full length videos. The first method explores various convolutional temporal feature pooling architectures, examining the various design choices which need to be made when adapting a CNN for this task. The second proposed method explicitly models the video as an ordered sequence of frames. For this purpose we employ a recurrent neural network that uses Long Short-Term Memory (LSTM) cells which are connected to the output of the underlying CNN. Our best networks exhibit significant performance improvements over previously published results on the Sports 1 million dataset (73.1% vs. 60.9%) and the UCF-101 datasets with (88.6% vs. 88.0%) and without additional optical flow information (82.6% vs. 73.0%).

2,066 citations
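
As an illustration of the frame-level CNN + LSTM idea described in this abstract, here is a minimal PyTorch sketch (not the authors' code): it assumes per-frame CNN features have already been extracted, and stacks a single LSTM and a classifier on top; all dimensions are placeholders.

```python
# Illustrative sketch: an LSTM over per-frame CNN features for video classification.
import torch
import torch.nn as nn

class FrameLSTMClassifier(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, num_classes=101):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, frame_feats):           # (batch, time, feat_dim)
        out, _ = self.lstm(frame_feats)       # (batch, time, hidden)
        return self.fc(out[:, -1])            # classify from the last time step

model = FrameLSTMClassifier()
video = torch.randn(4, 30, 2048)              # 4 videos, 30 frames of 2048-d features each
logits = model(video)                          # (4, 101)
```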


Proceedings Article
07 Dec 2015
TL;DR: This work introduces a robust new AutoML system based on scikit-learn, which improves on existing AutoML methods by automatically taking into account past performance on similar datasets, and by constructing ensembles from the models evaluated during the optimization.
Abstract: The success of machine learning in a broad range of applications has led to an ever-growing demand for machine learning systems that can be used off the shelf by non-experts. To be effective in practice, such systems need to automatically choose a good algorithm and feature preprocessing steps for a new dataset at hand, and also set their respective hyperparameters. Recent work has started to tackle this automated machine learning (AutoML) problem with the help of efficient Bayesian optimization methods. Building on this, we introduce a robust new AutoML system based on scikit-learn (using 15 classifiers, 14 feature preprocessing methods, and 4 data preprocessing methods, giving rise to a structured hypothesis space with 110 hyperparameters). This system, which we dub AUTO-SKLEARN, improves on existing AutoML methods by automatically taking into account past performance on similar datasets, and by constructing ensembles from the models evaluated during the optimization. Our system won the first phase of the ongoing ChaLearn AutoML challenge, and our comprehensive analysis on over 100 diverse datasets shows that it substantially outperforms the previous state of the art in AutoML. We also demonstrate the performance gains due to each of our contributions and derive insights into the effectiveness of the individual components of AUTO-SKLEARN.

1,318 citations
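
For readers unfamiliar with the system, a hedged usage sketch of AUTO-SKLEARN's scikit-learn-style interface follows; the time budgets are illustrative, and the example assumes the `auto-sklearn` package is installed.

```python
# Usage sketch: the AutoSklearnClassifier searches over pipelines and builds an ensemble.
import autosklearn.classification
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,   # total optimization budget in seconds (illustrative)
    per_run_time_limit=30,         # budget per candidate pipeline (illustrative)
)
automl.fit(X_train, y_train)       # Bayesian optimization over algorithms and preprocessors
print(accuracy_score(y_test, automl.predict(X_test)))
```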


Proceedings Article
07 Dec 2015
TL;DR: A new model of natural textures based on the feature spaces of convolutional neural networks optimised for object recognition is introduced, showing that across layers the texture representations increasingly capture the statistical properties of natural images while making object information more and more explicit.
Abstract: Here we introduce a new model of natural textures based on the feature spaces of convolutional neural networks optimised for object recognition. Samples from the model are of high perceptual quality demonstrating the generative power of neural networks trained in a purely discriminative fashion. Within the model, textures are represented by the correlations between feature maps in several layers of the network. We show that across layers the texture representations increasingly capture the statistical properties of natural images while making object information more and more explicit. The model provides a new tool to generate stimuli for neuroscience and might offer insights into the deep representations learned by convolutional neural networks.

1,081 citations
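
A minimal sketch of the core representation described above: texture statistics are the Gram matrices (feature-map correlations) computed at several layers of a CNN trained for object recognition. The network choice, layer indices, and normalization below are illustrative assumptions, not the authors' configuration.

```python
# Sketch: Gram-matrix texture statistics from a few CNN layers.
import torch
import torchvision.models as models

def gram_matrix(fmap):                      # fmap: (channels, H, W)
    c, h, w = fmap.shape
    f = fmap.reshape(c, h * w)
    return f @ f.t() / (h * w)              # (channels, channels) feature correlations

vgg = models.vgg19(weights=None).features.eval()   # pretrained weights omitted in this sketch
image = torch.randn(1, 3, 224, 224)                # stand-in for a texture image
grams = []
x = image
for i, layer in enumerate(vgg):
    x = layer(x)
    if i in (0, 5, 10):                            # a few early layers, chosen for illustration
        grams.append(gram_matrix(x[0]))
```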


Proceedings Article
25 Jul 2015
TL;DR: This method adopts a deep convolutional neural network (CNN) to automate feature learning from the raw inputs in a systematic way, which makes it outperform other HAR algorithms, as verified in the experiments on the Opportunity Activity Recognition Challenge and other benchmark datasets.
Abstract: This paper focuses on the human activity recognition (HAR) problem, in which inputs are multichannel time series signals acquired from a set of body-worn inertial sensors and outputs are predefined human activities. In this problem, extracting effective features for identifying activities is a critical but challenging task. Most existing work relies on heuristic hand-crafted feature design and shallow feature learning architectures, which cannot find those distinguishing features to accurately classify different activities. In this paper, we propose a systematic feature learning method for the HAR problem. This method adopts a deep convolutional neural network (CNN) to automate feature learning from the raw inputs in a systematic way. Through the deep architecture, the learned features are deemed as the higher level abstract representation of low level raw time series signals. By leveraging the labelled information via supervised learning, the learned features are endowed with more discriminative power. Unified in one model, feature learning and classification are mutually enhanced. All these unique advantages of the CNN make it outperform other HAR algorithms, as verified in the experiments on the Opportunity Activity Recognition Challenge and other benchmark datasets.

894 citations
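
The following sketch illustrates the kind of feature-learning architecture the abstract describes: a 1D convolutional network applied directly to multichannel inertial time series. Channel count, window length, and layer sizes are assumptions for illustration, not the paper's configuration.

```python
# Sketch: 1D CNN over multichannel sensor windows for activity recognition.
import torch
import torch.nn as nn

class HARConvNet(nn.Module):
    def __init__(self, channels=9, num_classes=18):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, 64, kernel_size=5), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(64, 64, kernel_size=5), nn.ReLU(), nn.MaxPool1d(2),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x):                 # x: (batch, channels, time)
        return self.net(x)

model = HARConvNet()
window = torch.randn(8, 9, 128)           # 8 windows of 128 samples, 9 sensor channels
logits = model(window)                    # (8, 18)
```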


Proceedings ArticleDOI
07 Dec 2015
TL;DR: This paper uses Convolutional Neural Networks to learn discriminant patch representations and in particular train a Siamese network with pairs of (non-)corresponding patches to develop 128-D descriptors whose euclidean distances reflect patch similarity and can be used as a drop-in replacement for any task involving SIFT.
Abstract: Deep learning has revolutionized image-level tasks such as classification, but patch-level tasks, such as correspondence, still rely on hand-crafted features, e.g. SIFT. In this paper we use Convolutional Neural Networks (CNNs) to learn discriminant patch representations and in particular train a Siamese network with pairs of (non-)corresponding patches. We deal with the large number of potential pairs with the combination of a stochastic sampling of the training set and an aggressive mining strategy biased towards patches that are hard to classify. By using the L2 distance during both training and testing we develop 128-D descriptors whose Euclidean distances reflect patch similarity, and which can be used as a drop-in replacement for any task involving SIFT. We demonstrate consistent performance gains over the state of the art, and generalize well against scaling and rotation, perspective transformation, non-rigid deformation, and illumination changes. Our descriptors are efficient to compute and amenable to modern GPUs, and are publicly available.

848 citations
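
Below is a minimal sketch of the Siamese setup the abstract describes: a shared CNN maps patches to 128-D descriptors, and a margin-based loss on Euclidean distances pulls corresponding pairs together and pushes non-corresponding pairs apart. The architecture, loss form, and margin are assumptions; the hard-negative mining step is omitted.

```python
# Sketch: Siamese descriptor network with a contrastive loss on L2 distances.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchDescriptor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 7, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 128),
        )

    def forward(self, patch):                  # (batch, 1, 64, 64)
        return self.net(patch)                 # (batch, 128)

def contrastive_loss(d1, d2, same, margin=1.0):
    dist = F.pairwise_distance(d1, d2)         # Euclidean distance per pair
    return (same * dist.pow(2) +
            (1 - same) * F.relu(margin - dist).pow(2)).mean()

net = PatchDescriptor()
a, b = torch.randn(16, 1, 64, 64), torch.randn(16, 1, 64, 64)
same = torch.randint(0, 2, (16,)).float()      # 1 = corresponding pair, 0 = non-corresponding
loss = contrastive_loss(net(a), net(b), same)
```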


Proceedings ArticleDOI
30 Jul 2015
TL;DR: It is shown that the observed features model is most effective at capturing the information present for entity pairs with textual relations, and a combination of the two combines the strengths of both model types.
Abstract: In this paper we show the surprising effectiveness of a simple observed features model in comparison to latent feature models on two benchmark knowledge base completion datasets, FB15K and WN18. We also compare latent and observed feature models on a more challenging dataset derived from FB15K, and additionally coupled with textual mentions from a web-scale corpus. We show that the observed features model is most effective at capturing the information present for entity pairs with textual relations, and a combination of the two combines the strengths of both model types.

808 citations


Proceedings ArticleDOI
09 Aug 2015
TL;DR: This paper presents a convolutional neural network architecture for reranking pairs of short texts, where the optimal representation of text pairs and a similarity function to relate them in a supervised way from the available training data are learned.
Abstract: Learning a similarity function between pairs of objects is at the core of learning to rank approaches. In information retrieval tasks we typically deal with query-document pairs, in question answering -- question-answer pairs. However, before learning can take place, such pairs need to be mapped from the original space of symbolic words into some feature space encoding various aspects of their relatedness, e.g. lexical, syntactic and semantic. Feature engineering is often a laborious task and may require external knowledge sources that are not always available or difficult to obtain. Recently, deep learning approaches have gained a lot of attention from the research community and industry for their ability to automatically learn optimal feature representations for a given task, while claiming state-of-the-art performance in many tasks in computer vision, speech recognition and natural language processing. In this paper, we present a convolutional neural network architecture for reranking pairs of short texts, where we learn the optimal representation of text pairs and a similarity function to relate them in a supervised way from the available training data. Our network takes only words in the input, thus requiring minimal preprocessing. In particular, we consider the task of reranking short text pairs where elements of the pair are sentences. We test our deep learning system on two popular retrieval tasks from TREC: Question Answering and Microblog Retrieval. Our model demonstrates strong performance on the first task, beating previous state-of-the-art systems by about 3% absolute points in both MAP and MRR, and shows comparable results on tweet reranking, while enjoying the benefits of no manual feature engineering and no additional syntactic parsers.

796 citations
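
A rough sketch of the reranking architecture described above: each short text is encoded by a small convolutional sentence model over word embeddings, and a learned bilinear similarity relates the query and candidate encodings. Vocabulary size, dimensions, and pooling are illustrative assumptions, and the paper's additional features and full ranking layer are omitted.

```python
# Sketch: convolutional sentence encoders plus a bilinear similarity for text-pair reranking.
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    def __init__(self, vocab=10000, emb=50, filters=100, width=5):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.conv = nn.Conv1d(emb, filters, width, padding=width // 2)

    def forward(self, tokens):                        # (batch, seq_len) word ids
        x = self.emb(tokens).transpose(1, 2)          # (batch, emb, seq_len)
        return torch.relu(self.conv(x)).max(dim=2).values   # max-pooled (batch, filters)

class PairRanker(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = SentenceEncoder()
        self.M = nn.Parameter(torch.randn(100, 100) * 0.01)  # learned bilinear similarity

    def forward(self, query, candidate):
        q, c = self.encoder(query), self.encoder(candidate)
        return (q @ self.M * c).sum(dim=1)            # higher score = better match

ranker = PairRanker()
score = ranker(torch.randint(0, 10000, (4, 20)), torch.randint(0, 10000, (4, 30)))
```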


Proceedings ArticleDOI
Heechul Jung, Sihaeng Lee, Junho Yim, Sunjeong Park, Junmo Kim
07 Dec 2015
TL;DR: A deep learning technique, which is regarded as a tool to automatically extract useful features from raw data, is adopted and is combined using a new integration method in order to boost the performance of the facial expression recognition.
Abstract: Temporal information has useful features for recognizing facial expressions. However, to manually design useful features requires a lot of effort. In this paper, to reduce this effort, a deep learning technique, which is regarded as a tool to automatically extract useful features from raw data, is adopted. Our deep network is based on two different models. The first deep network extracts temporal appearance features from image sequences, while the other deep network extracts temporal geometry features from temporal facial landmark points. These two models are combined using a new integration method in order to boost the performance of the facial expression recognition. Through several experiments, we show that the two models cooperate with each other. As a result, we achieve superior performance to other state-of-the-art methods in the CK+ and Oulu-CASIA databases. Furthermore, we show that our new integration method gives more accurate results than traditional methods, such as a weighted summation and a feature concatenation method.

668 citations


Posted Content
TL;DR: Zhang et al. as mentioned in this paper proposed an online visual tracking algorithm by learning discriminative saliency map using Convolutional Neural Network (CNN), which takes outputs from hidden layers of the network as feature descriptors since they show excellent representation performance in various general visual recognition problems.
Abstract: We propose an online visual tracking algorithm by learning discriminative saliency map using Convolutional Neural Network (CNN). Given a CNN pre-trained on a large-scale image repository in offline, our algorithm takes outputs from hidden layers of the network as feature descriptors since they show excellent representation performance in various general visual recognition problems. The features are used to learn discriminative target appearance models using an online Support Vector Machine (SVM). In addition, we construct target-specific saliency map by backpropagating CNN features with guidance of the SVM, and obtain the final tracking result in each frame based on the appearance model generatively constructed with the saliency map. Since the saliency map visualizes spatial configuration of target effectively, it improves target localization accuracy and enables us to achieve pixel-level target segmentation. We verify the effectiveness of our tracking algorithm through extensive experiment on a challenging benchmark, where our method illustrates outstanding performance compared to the state-of-the-art tracking algorithms.

665 citations
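
The saliency-map construction can be sketched roughly as follows: hidden-layer CNN features for a frame are scored by a linear (SVM-like) target model, and that score is backpropagated to the input image. The network and weights below are stand-ins for illustration, not the method's trained components.

```python
# Sketch: target-specific saliency by backpropagating a linear score through a CNN.
import torch
import torchvision.models as models

cnn = models.alexnet(weights=None).features.eval()   # stand-in feature extractor
frame = torch.randn(1, 3, 224, 224, requires_grad=True)

feats = cnn(frame).flatten(1)              # hidden-layer features as descriptors
w = torch.randn(feats.shape[1])            # stand-in for online SVM weights
score = feats @ w                          # SVM-like decision value for the target
score.sum().backward()                     # backpropagate the guidance to the input

saliency = frame.grad.abs().max(dim=1).values[0]   # (224, 224) saliency map
```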


Posted Content
TL;DR: This work proposes and evaluates several deep neural network architectures to combine image information across a video over longer time periods than previously attempted, and proposes two methods capable of handling full length videos.
Abstract: Convolutional neural networks (CNNs) have been extensively applied for image recognition problems giving state-of-the-art results on recognition, detection, segmentation and retrieval. In this work we propose and evaluate several deep neural network architectures to combine image information across a video over longer time periods than previously attempted. We propose two methods capable of handling full length videos. The first method explores various convolutional temporal feature pooling architectures, examining the various design choices which need to be made when adapting a CNN for this task. The second proposed method explicitly models the video as an ordered sequence of frames. For this purpose we employ a recurrent neural network that uses Long Short-Term Memory (LSTM) cells which are connected to the output of the underlying CNN. Our best networks exhibit significant performance improvements over previously published results on the Sports 1 million dataset (73.1% vs. 60.9%) and the UCF-101 datasets with (88.6% vs. 88.0%) and without additional optical flow information (82.6% vs. 72.8%).

496 citations


Journal ArticleDOI
TL;DR: A machine learning-based approach to extract mentions of adverse drug reactions (ADRs) from highly informal text in social media, suitable for social media mining, as it relies on large volumes of unlabeled data, thus diminishing the need for large, annotated training data sets.

Journal ArticleDOI
TL;DR: ELM theories manage to address the open problem which has puzzled the neural networks, machine learning and neuroscience communities for 60 years: whether hidden nodes/neurons need to be tuned in learning, and proved that in contrast to the common knowledge and conventional neural network learning tenets, hidden nodes/neurons do not need to be iteratively tuned in wide types of neural networks and learning models.
Abstract: The emergent machine learning technique—extreme learning machines (ELMs)—has become a hot area of research over the past years, which is attributed to the growing research activities and significant contributions made by numerous researchers around the world. Recently, it has come to our attention that a number of misplaced notions and misunderstandings are being dissipated on the relationships between ELM and some earlier works. This paper wishes to clarify that (1) ELM theories manage to address the open problem which has puzzled the neural networks, machine learning and neuroscience communities for 60 years: whether hidden nodes/neurons need to be tuned in learning, and proved that in contrast to the common knowledge and conventional neural network learning tenets, hidden nodes/neurons do not need to be iteratively tuned in wide types of neural networks and learning models (Fourier series, biological learning, etc.). Unlike ELM theories, none of those earlier works provides theoretical foundations on feedforward neural networks with random hidden nodes; (2) ELM is proposed for both generalized single-hidden-layer feedforward networks and multi-hidden-layer feedforward networks (including biological neural networks); (3) homogeneous architecture-based ELM is proposed for feature learning, clustering, regression and (binary/multi-class) classification; and (4) compared to ELM, SVM and LS-SVM tend to provide suboptimal solutions, and SVM and LS-SVM do not consider feature representations in hidden layers of multi-hidden-layer feedforward networks either.
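
For concreteness, a minimal numpy sketch of the basic ELM recipe referenced here: the hidden-layer weights are random and left untuned, and only the output weights are obtained in closed form by least squares. The data and sizes are toy values chosen for illustration.

```python
# Sketch: basic extreme learning machine with random, untuned hidden weights.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                       # 200 samples, 10 features
y = (X[:, 0] + X[:, 1] > 0).astype(float)            # toy binary target

n_hidden = 100
W = rng.normal(size=(10, n_hidden))                  # random input weights (never tuned)
b = rng.normal(size=n_hidden)
H = np.tanh(X @ W + b)                               # random-feature hidden layer

beta, *_ = np.linalg.lstsq(H, y, rcond=None)         # closed-form output weights
pred = (H @ beta > 0.5).astype(float)
print("training accuracy:", (pred == y).mean())
```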

Journal ArticleDOI
TL;DR: This work presents the application of single DNN for both SR and LR using the 2013 Domain Adaptation Challenge speaker recognition (DAC13) and the NIST 2011 language recognition evaluation (LRE11) benchmarks and demonstrates large gains on performance.
Abstract: The impressive gains in performance obtained using deep neural networks (DNNs) for automatic speech recognition (ASR) have motivated the application of DNNs to other speech technologies such as speaker recognition (SR) and language recognition (LR). Prior work has shown performance gains for separate SR and LR tasks using DNNs for direct classification or for feature extraction. In this work we present the application of single DNN for both SR and LR using the 2013 Domain Adaptation Challenge speaker recognition (DAC13) and the NIST 2011 language recognition evaluation (LRE11) benchmarks. Using a single DNN trained for ASR on Switchboard data we demonstrate large gains on performance in both benchmarks: a 55% reduction in EER for the DAC13 out-of-domain condition and a 48% reduction in $C_{avg}$ on the LRE11 30 s test condition. It is also shown that further gains are possible using score or feature fusion, leading to the possibility of a single i-vector extractor producing state-of-the-art SR and LR performance.

Proceedings ArticleDOI
TL;DR: A new paradigm for steganalysis to learn features automatically via deep learning models through a customized Convolutional Neural Network that achieves comparable performance on BOSSbase and the realistic and large ImageNet database.
Abstract: Current work on steganalysis for digital images is focused on the construction of complex handcrafted features. This paper proposes a new paradigm for steganalysis to learn features automatically via deep learning models. We propose a novel customized Convolutional Neural Network for steganalysis. The proposed model can capture the complex dependencies that are useful for steganalysis. Compared with existing schemes, this model can automatically learn feature representations with several convolutional layers. The feature extraction and classification steps are unified under a single architecture, which means the guidance of classification can be used during the feature extraction step. We demonstrate the effectiveness of the proposed model on three state-of-the-art spatial domain steganographic algorithms - HUGO, WOW, and S-UNIWARD. Compared to the Spatial Rich Model (SRM), our model achieves comparable performance on BOSSbase and the realistic and large ImageNet database.

Proceedings ArticleDOI
Zuxuan Wu, Xi Wang, Yu-Gang Jiang, Hao Ye, Xiangyang Xue
13 Oct 2015
TL;DR: Wang et al. as discussed by the authors proposed a hybrid deep learning framework for video classification, which is able to model static spatial information, short-term motion, as well as long-term temporal clues in the videos.
Abstract: Classifying videos according to content semantics is an important problem with a wide range of applications. In this paper, we propose a hybrid deep learning framework for video classification, which is able to model static spatial information, short-term motion, as well as long-term temporal clues in the videos. Specifically, the spatial and the short-term motion features are extracted separately by two Convolutional Neural Networks (CNN). These two types of CNN-based features are then combined in a regularized feature fusion network for classification, which is able to learn and utilize feature relationships for improved performance. In addition, Long Short Term Memory (LSTM) networks are applied on top of the two features to further model longer-term temporal clues. The main contribution of this work is the hybrid learning framework that can model several important aspects of the video data. We also show that (1) combining the spatial and the short-term motion features in the regularized fusion network is better than direct classification and fusion using the CNN with a softmax layer, and (2) the sequence-based LSTM is highly complementary to the traditional classification strategy without considering the temporal frame orders. Extensive experiments are conducted on two popular and challenging benchmarks, the UCF-101 Human Actions and the Columbia Consumer Videos (CCV). On both benchmarks, our framework achieves very competitive performance: 91.3% on the UCF-101 and 83.5% on the CCV.

Proceedings ArticleDOI
Jie Zhou, Wei Xu
01 Jul 2015
TL;DR: This work proposes to use a deep bi-directional recurrent network as an end-to-end system for SRL, which takes only original text information as input feature, without using any syntactic knowledge.
Abstract: Semantic role labeling (SRL) is one of the basic natural language processing (NLP) problems. To this date, most of the successful SRL systems were built on top of some form of parsing results (Koomen et al., 2005; Palmer et al., 2010; Pradhan et al., 2013), where pre-defined feature templates over the syntactic structure are used. The attempts of building an end-to-end SRL learning system without using parsing were less successful (Collobert et al., 2011). In this work, we propose to use a deep bi-directional recurrent network as an end-to-end system for SRL. We take only original text information as input feature, without using any syntactic knowledge. The proposed algorithm for semantic role labeling was mainly evaluated on the CoNLL-2005 shared task and achieved an F1 score of 81.07. This result outperforms the previous state-of-the-art system built from the combination of different parsing trees or models. We also obtained the same conclusion with F1 = 81.27 on the CoNLL-2012 shared task. As a result of its simplicity, our model is also computationally efficient, with a parsing speed of 6.7k tokens per second. Our analysis shows that our model is better at handling longer sentences than traditional models, and the latent variables of our model implicitly capture the syntactic structure of a sentence.
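
A minimal sketch of an end-to-end bidirectional recurrent tagger in the spirit of this system: word ids in, one label score vector per token out, and no syntactic features. Embedding, hidden, and label-set sizes are assumptions.

```python
# Sketch: bidirectional LSTM tagger producing per-token label scores.
import torch
import torch.nn as nn

class BiRNNTagger(nn.Module):
    def __init__(self, vocab=20000, emb=100, hidden=128, num_labels=60):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.rnn = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_labels)

    def forward(self, tokens):                 # (batch, seq_len) word ids
        h, _ = self.rnn(self.emb(tokens))      # (batch, seq_len, 2*hidden)
        return self.out(h)                     # per-token label scores

tagger = BiRNNTagger()
scores = tagger(torch.randint(0, 20000, (2, 15)))   # (2, 15, 60)
```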

Journal ArticleDOI
07 Jun 2015
TL;DR: This paper finds that features from different channels could share some similar hidden structures, and proposes a joint learning model to simultaneously explore the shared and feature-specific components as an instance of heterogeneous multi-task learning for RGB-D activity recognition.
Abstract: In this paper, we focus on heterogeneous features learning for RGB-D activity recognition. We find that features from different channels (RGB, depth) could share some similar hidden structures, and then propose a joint learning model to simultaneously explore the shared and feature-specific components as an instance of heterogeneous multi-task learning. The proposed model formed in a unified framework is capable of: 1) jointly mining a set of subspaces with the same dimensionality to exploit latent shared features across different feature channels, 2) meanwhile, quantifying the shared and feature-specific components of features in the subspaces, and 3) transferring feature-specific intermediate transforms (i-transforms) for learning fusion of heterogeneous features across datasets. To efficiently train the joint model, a three-step iterative optimization algorithm is proposed, followed by a simple inference model. Extensive experimental results on four activity datasets have demonstrated the efficacy of the proposed method. A new RGB-D activity dataset focusing on human-object interaction is further contributed, which presents more challenges for RGB-D activity benchmarking.

Proceedings ArticleDOI
01 Nov 2015
TL;DR: Wang et al. as mentioned in this paper proposed an end-to-end hierarchical architecture for skeleton-based action recognition with CNN, which represented a skeleton sequence as a matrix by concatenating the joint coordinates in each instant and arranging those vector representations in a chronological order.
Abstract: Temporal dynamics of postures over time is crucial for sequence-based action recognition. Human actions can be represented by the corresponding motions of the articulated skeleton. Most of the existing approaches for skeleton-based action recognition model the spatial-temporal evolution of actions based on hand-crafted features. As a kind of hierarchically adaptive filter banks, a Convolutional Neural Network (CNN) performs well in representation learning. In this paper, we propose an end-to-end hierarchical architecture for skeleton-based action recognition with CNN. Firstly, we represent a skeleton sequence as a matrix by concatenating the joint coordinates in each instant and arranging those vector representations in a chronological order. Then the matrix is quantified into an image and normalized to handle the variable-length problem. The final image is fed into a CNN model for feature extraction and recognition. For the specific structure of such images, the simple max-pooling plays an important role on spatial feature selection as well as temporal frequency adjustment, which can obtain more discriminative joint information for different actions and meanwhile address the variable-frequency problem. Experimental results demonstrate that our method achieves state-of-the-art performance with high computational efficiency, especially surpassing the existing result by more than 15 percentage points on the challenging ChaLearn gesture recognition dataset.

Journal ArticleDOI
TL;DR: Important topics from different classification techniques, such as databases available for experimentation, appropriate feature extraction and selection methods, classifiers and performance issues are discussed, with emphasis on research published in the last decade.
Abstract: Speaker emotion recognition is achieved through processing methods that include isolation of the speech signal and extraction of selected features for the final classification. In terms of acoustics, speech processing techniques offer extremely valuable paralinguistic information derived mainly from prosodic and spectral features. In some cases, the process is assisted by speech recognition systems, which contribute to the classification using linguistic information. Both frameworks deal with a very challenging problem, as emotional states do not have clear-cut boundaries and often differ from person to person. In this article, research papers that investigate emotion recognition from audio channels are surveyed and classified, based mostly on extracted and selected features and their classification methodology. Important topics from different classification techniques, such as databases available for experimentation, appropriate feature extraction and selection methods, classifiers and performance issues are discussed, with emphasis on research published in the last decade. This survey also provides a discussion on open trends, along with directions for future research on this topic.

Proceedings Article
06 Jul 2015
TL;DR: An online visual tracking algorithm by learning discriminative saliency map using Convolutional Neural Network using hidden layers of the network to improve target localization accuracy and achieve pixel-level target segmentation.
Abstract: We propose an online visual tracking algorithm by learning discriminative saliency map using Convolutional Neural Network (CNN). Given a CNN pre-trained on a large-scale image repository in offline, our algorithm takes outputs from hidden layers of the network as feature descriptors since they show excellent representation performance in various general visual recognition problems. The features are used to learn discriminative target appearance models using an online Support Vector Machine (SVM). In addition, we construct target-specific saliency map by backprojecting CNN features with guidance of the SVM, and obtain the final tracking result in each frame based on the appearance model generatively constructed with the saliency map. Since the saliency map reveals spatial configuration of target effectively, it improves target localization accuracy and enables us to achieve pixel-level target segmentation. We verify the effectiveness of our tracking algorithm through extensive experiment on a challenging benchmark, where our method illustrates outstanding performance compared to the state-of-the-art tracking algorithms.

Posted Content
TL;DR: R*CNN as mentioned in this paper exploits the simple observation that actions are accompanied by contextual cues to build a strong action recognition system and adapts RCNN to use more than one region for classification while still maintaining the ability to localize the action.
Abstract: There are multiple cues in an image which reveal what action a person is performing. For example, a jogger has a pose that is characteristic for jogging, but the scene (e.g. road, trail) and the presence of other joggers can be an additional source of information. In this work, we exploit the simple observation that actions are accompanied by contextual cues to build a strong action recognition system. We adapt RCNN to use more than one region for classification while still maintaining the ability to localize the action. We call our system R*CNN. The action-specific models and the feature maps are trained jointly, allowing for action specific representations to emerge. R*CNN achieves 90.2% mean AP on the PASCAL VOC Action dataset, outperforming all other approaches in the field by a significant margin. Last, we show that R*CNN is not limited to action recognition. In particular, R*CNN can also be used to tackle fine-grained tasks such as attribute classification. We validate this claim by reporting state-of-the-art performance on the Berkeley Attributes of People dataset.

Proceedings ArticleDOI
06 Sep 2015
TL;DR: This paper presents a speech emotion recognition system using a recurrent neural network (RNN) model trained by an efficient learning algorithm that takes into account the long-range context effect and the uncertainty of emotional label expressions.
Abstract: This paper presents a speech emotion recognition system using a recurrent neural network (RNN) model trained by an efficient learning algorithm. The proposed system takes into account the long-range context effect and the uncertainty of emotional label expressions. To extract high-level representation of emotional states with regard to its temporal dynamics, a powerful learning method with a bidirectional long short-term memory (BLSTM) model is adopted. To overcome the uncertainty of emotional labels, such that all frames in the same utterance are mapped into the same emotional label, it is assumed that the label of each frame is regarded as a sequence of random variables. Then, the sequences are trained by the proposed learning algorithm. The weighted accuracy of the proposed emotion recognition system is improved up to 12% compared to the DNN-ELM based emotion recognition system used as a baseline.

Journal ArticleDOI
TL;DR: The research results indicate that using advanced NLP techniques for generating information-rich features from text can significantly improve classification accuracies over existing benchmarks and that integration of information from compatible corpora can significantly improve classification performance.

Posted Content
TL;DR: In this paper, a new model of natural textures based on the feature spaces of convolutional neural networks optimised for object recognition is introduced, where textures are represented by the correlations between feature maps in several layers of the network.
Abstract: Here we introduce a new model of natural textures based on the feature spaces of convolutional neural networks optimised for object recognition. Samples from the model are of high perceptual quality demonstrating the generative power of neural networks trained in a purely discriminative fashion. Within the model, textures are represented by the correlations between feature maps in several layers of the network. We show that across layers the texture representations increasingly capture the statistical properties of natural images while making object information more and more explicit. The model provides a new tool to generate stimuli for neuroscience and might offer insights into the deep representations learned by convolutional neural networks.

Proceedings ArticleDOI
07 Dec 2015
TL;DR: This work exploits the simple observation that actions are accompanied by contextual cues to build a strong action recognition system and adapt RCNN to use more than one region for classification while still maintaining the ability to localize the action.
Abstract: There are multiple cues in an image which reveal what action a person is performing. For example, a jogger has a pose that is characteristic for jogging, but the scene (e.g. road, trail) and the presence of other joggers can be an additional source of information. In this work, we exploit the simple observation that actions are accompanied by contextual cues to build a strong action recognition system. We adapt RCNN to use more than one region for classification while still maintaining the ability to localize the action. We call our system R*CNN. The action-specific models and the feature maps are trained jointly, allowing for action specific representations to emerge. R*CNN achieves 90.2% mean AP on the PASCAL VOC Action dataset, outperforming all other approaches in the field by a significant margin. Last, we show that R*CNN is not limited to action recognition. In particular, R*CNN can also be used to tackle fine-grained tasks such as attribute classification. We validate this claim by reporting state-of-the-art performance on the Berkeley Attributes of People dataset.

Journal ArticleDOI
TL;DR: This article extended two Dirichlet multinomial topic models by incorporating latent feature vector representations of words trained on very large corpora to improve the word-topic mapping learnt on a smaller corpus.
Abstract: Probabilistic topic models are widely used to discover latent topics in document collections, while latent feature vector representations of words have been used to obtain high performance in many NLP tasks. In this paper, we extend two different Dirichlet multinomial topic models by incorporating latent feature vector representations of words trained on very large corpora to improve the word-topic mapping learnt on a smaller corpus. Experimental results show that by using information from the external corpora, our new models produce significant improvements on topic coherence, document clustering and document classification tasks, especially on datasets with few or short documents.

Journal ArticleDOI
TL;DR: This work describes a representation of repeated structures suitable for scalable retrieval and geometric verification and demonstrates that the explicit detection of repeated patterns is beneficial for robust visual word matching for geometric verification.
Abstract: Repeated structures such as building facades, fences or road markings often represent a significant challenge for place recognition. Repeated structures are notoriously hard for establishing correspondences using multi-view geometry. They violate the feature independence assumed in the bag-of-visual-words representation which often leads to over-counting evidence and significant degradation of retrieval performance. In this work we show that repeated structures are not a nuisance but, when appropriately represented, they form an important distinguishing feature for many places. We describe a representation of repeated structures suitable for scalable retrieval and geometric verification. The retrieval is based on robust detection of repeated image structures and a suitable modification of weights in the bag-of-visual-word model. We also demonstrate that the explicit detection of repeated patterns is beneficial for robust visual word matching for geometric verification. Place recognition results are shown on datasets of street-level imagery from Pittsburgh and San Francisco demonstrating significant gains in recognition performance compared to the standard bag-of-visual-words baseline as well as the more recently proposed burstiness weighting and Fisher vector encoding.

Journal ArticleDOI
TL;DR: This paper adopts different supervised learning methods from the field of machine learning to develop multiclass classifiers that identify the transportation mode, including driving a car, riding a bicycle, Riding a bus, walking, and running.
Abstract: This paper adopts different supervised learning methods from the field of machine learning to develop multiclass classifiers that identify the transportation mode, including driving a car, riding a bicycle, riding a bus, walking, and running. Methods that were considered include K-nearest neighbor, support vector machines (SVMs), and tree-based models that comprise a single decision tree, bagging, and random forest (RF) methods. For training and validating purposes, data were obtained from smartphone sensors, including accelerometer, gyroscope, and rotation vector sensors. K-fold cross-validation as well as out-of-bag error was used for model selection and validation purposes. Several features were created from which a subset was identified through the minimum redundancy maximum relevance method. Data obtained from the smartphone sensors were found to provide important information to distinguish between transportation modes. The performance of different methods was evaluated and compared. The RF and SVM methods were found to produce the best performance. Furthermore, an effort was made to develop a new additional feature that entails creating a combination of other features by adopting a simulated annealing algorithm and a random forest method.
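
The modeling pipeline can be sketched with scikit-learn as below: random forest and SVM classifiers compared by k-fold cross-validation on feature vectors derived from the smartphone sensors. The data here is synthetic and the mRMR feature-selection step is not reproduced.

```python
# Sketch: comparing RF and SVM classifiers with 5-fold cross-validation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                 # stand-in for sensor-derived feature vectors
y = rng.integers(0, 5, size=500)               # 5 modes: car, bicycle, bus, walk, run

for name, clf in [("random forest", RandomForestClassifier(n_estimators=200)),
                  ("SVM", SVC(kernel="rbf", C=1.0))]:
    scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validation accuracy
    print(name, scores.mean())
```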

Proceedings ArticleDOI
05 Jan 2015
TL;DR: A new framework for age feature extraction based on the deep learning model is built and experimental results show that the proposed approach is significantly better than the state-of-the-art.
Abstract: Human age provides key demographic information. It is also considered as an important soft biometric trait for human identification or search. Compared to other pattern recognition problems (e.g., object classification, scene categorization), age estimation is much more challenging since the difference between facial images with age variations can be more subtle and the process of aging varies greatly among different individuals. In this work, we investigate deep learning techniques for age estimation based on the convolutional neural network (CNN). A new framework for age feature extraction based on the deep learning model is built. Compared to previous models based on CNN, we use feature maps obtained in different layers for our estimation work instead of using the feature obtained at the top layer. Additionally, a manifold learning algorithm is incorporated in the proposed scheme and this improves the performance significantly. Furthermore, we also evaluate different classification and regression schemes in estimating age using the deep learned aging pattern (DLA). To the best of our knowledge, this is the first time that a deep learning technique is introduced and applied to solve the age estimation problem. Experimental results on two datasets show that the proposed approach is significantly better than the state-of-the-art.

Proceedings Article
07 Dec 2015
TL;DR: It is shown that winner-take-all autoencoders can be used to learn deep sparse representations from the MNIST, CIFAR-10, ImageNet, Street View House Numbers and Toronto Face datasets, and achieve competitive classification performance.
Abstract: In this paper, we propose a winner-take-all method for learning hierarchical sparse representations in an unsupervised fashion. We first introduce fully-connected winner-take-all autoencoders which use mini-batch statistics to directly enforce a lifetime sparsity in the activations of the hidden units. We then propose the convolutional winner-take-all autoencoder which combines the benefits of convolutional architectures and autoencoders for learning shift-invariant sparse representations. We describe a way to train convolutional autoencoders layer by layer, where in addition to lifetime sparsity, a spatial sparsity within each feature map is achieved using winner-take-all activation functions. We will show that winner-take-all autoencoders can be used to learn deep sparse representations from the MNIST, CIFAR-10, ImageNet, Street View House Numbers and Toronto Face datasets, and achieve competitive classification performance.
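
A hedged sketch of the lifetime-sparsity mechanism for the fully-connected winner-take-all autoencoder: within a mini-batch, each hidden unit keeps only its largest activations before decoding. Layer sizes and the sparsity rate below are assumptions, not the paper's settings.

```python
# Sketch: lifetime sparsity in a fully-connected winner-take-all autoencoder.
import torch
import torch.nn as nn

def lifetime_sparsity(h, rate=0.05):
    # keep only the top `rate` fraction of each hidden unit's activations across the batch
    k = max(1, int(rate * h.shape[0]))
    thresh = h.topk(k, dim=0).values[-1]          # k-th largest activation per hidden unit
    return h * (h >= thresh).float()

encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU())
decoder = nn.Linear(256, 784)

x = torch.rand(128, 784)                          # a mini-batch of flattened images
h = lifetime_sparsity(encoder(x))                 # sparse winner-take-all codes
recon = decoder(h)
loss = ((recon - x) ** 2).mean()                  # reconstruction objective
```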