scispace - formally typeset
Search or ask a question

Showing papers on "Feature extraction published in 2015"


Proceedings ArticleDOI
07 Dec 2015
TL;DR: The learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks.
Abstract: We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large scale supervised video dataset. Our findings are three-fold: 1) 3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets, 2) A homogeneous architecture with small 3x3x3 convolution kernels in all layers is among the best performing architectures for 3D ConvNets, and 3) Our learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks. In addition, the features are compact: achieving 52.8% accuracy on UCF101 dataset with only 10 dimensions and also very efficient to compute due to the fast inference of ConvNets. Finally, they are conceptually very simple and easy to train and use.

7,091 citations


Journal ArticleDOI
TL;DR: This work equips the networks with another pooling strategy, "spatial pyramid pooling", to eliminate the above requirement, and develops a new network structure, called SPP-net, which can generate a fixed-length representation regardless of image size/scale.
Abstract: Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g., 224 $\times$ 224) input image. This requirement is “artificial” and may reduce the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with another pooling strategy, “spatial pyramid pooling”, to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. Pyramid pooling is also robust to object deformations. With these advantages, SPP-net should in general improve all CNN-based image classification methods. On the ImageNet 2012 dataset, we demonstrate that SPP-net boosts the accuracy of a variety of CNN architectures despite their different designs. On the Pascal VOC 2007 and Caltech101 datasets, SPP-net achieves state-of-the-art classification results using a single full-image representation and no fine-tuning. The power of SPP-net is also significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This method avoids repeatedly computing the convolutional features. In processing test images, our method is 24-102 $\times$ faster than the R-CNN method, while achieving better or comparable accuracy on Pascal VOC 2007. In ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014, our methods rank #2 in object detection and #3 in image classification among all 38 teams. This manuscript also introduces the improvement made for this competition.

5,919 citations


Proceedings ArticleDOI
07 Dec 2015
TL;DR: HED turns pixel-wise edge classification into image-to-image prediction by means of a deep learning model that leverages fully convolutional neural networks and deeply-supervised nets to approach the human ability to resolve the challenging ambiguity in edge and object boundary detection.
Abstract: We develop a new edge detection algorithm that addresses two critical issues in this long-standing vision problem: (1) holistic image training, and (2) multi-scale feature learning. Our proposed method, holistically-nested edge detection (HED), turns pixel-wise edge classification into image-to-image prediction by means of a deep learning model that leverages fully convolutional neural networks and deeply-supervised nets. HED automatically learns rich hierarchical representations (guided by deep supervision on side responses) that are crucially important in order to approach the human ability to resolve the challenging ambiguity in edge and object boundary detection. We significantly advance the state-of-the-art on the BSD500 dataset (ODS F-score of 0.782) and the NYU Depth dataset (ODS F-score of 0.746), and do so with an improved speed (0.4 second per image) that is orders of magnitude faster than recent CNN-based edge detection algorithms.

2,173 citations


Proceedings Article
07 Dec 2015
TL;DR: In this paper, a convolutional neural network that operates directly on graphs is proposed to learn end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape.
Abstract: We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.

1,857 citations


Proceedings ArticleDOI
07 Dec 2015
TL;DR: Blinear models, a recognition architecture that consists of two feature extractors whose outputs are multiplied using outer product at each location of the image and pooled to obtain an image descriptor, are proposed.
Abstract: We propose bilinear models, a recognition architecture that consists of two feature extractors whose outputs are multiplied using outer product at each location of the image and pooled to obtain an image descriptor. This architecture can model local pairwise feature interactions in a translationally invariant manner which is particularly useful for fine-grained categorization. It also generalizes various orderless texture descriptors such as the Fisher vector, VLAD and O2P. We present experiments with bilinear models where the feature extractors are based on convolutional neural networks. The bilinear form simplifies gradient computation and allows end-to-end training of both networks using image labels only. Using networks initialized from the ImageNet dataset followed by domain specific fine-tuning we obtain 84.1% accuracy of the CUB-200-2011 dataset requiring only category labels at training time. We present experiments and visualizations that analyze the effects of fine-tuning and the choice two networks on the speed and accuracy of the models. Results show that the architecture compares favorably to the existing state of the art on a number of fine-grained datasets while being substantially simpler and easier to train. Moreover, our most accurate model is fairly efficient running at 8 frames/sec on a NVIDIA Tesla K40 GPU. The source code for the complete system will be made available at http://vis-www.cs.umass.edu/bcnn.

1,386 citations


Proceedings ArticleDOI
07 Dec 2015
TL;DR: An in-depth study on the properties of CNN features offline pre-trained on massive image data and classification task on ImageNet shows that the proposed tacker outperforms the state-of-the-art significantly.
Abstract: We propose a new approach for general object tracking with fully convolutional neural network. Instead of treating convolutional neural network (CNN) as a black-box feature extractor, we conduct in-depth study on the properties of CNN features offline pre-trained on massive image data and classification task on ImageNet. The discoveries motivate the design of our tracking system. It is found that convolutional layers in different levels characterize the target from different perspectives. A top layer encodes more semantic features and serves as a category detector, while a lower layer carries more discriminative information and can better separate the target from distracters with similar appearance. Both layers are jointly used with a switch mechanism during tracking. It is also found that for a tracking target, only a subset of neurons are relevant. A feature map selection method is developed to remove noisy and irrelevant feature maps, which can reduce computation redundancy and improve tracking accuracy. Extensive evaluation on the widely used tracking benchmark [36] shows that the proposed tacker outperforms the state-of-the-art significantly.

1,061 citations


Proceedings ArticleDOI
01 Sep 2015
TL;DR: This paper proposes a novel model dubbed the Piecewise Convolutional Neural Networks (PCNNs) with multi-instance learning to address the problem of wrong label problem when using distant supervision for relation extraction and adopts convolutional architecture with piecewise max pooling to automatically learn relevant features.
Abstract: Two problems arise when using distant supervision for relation extraction. First, in this method, an already existing knowledge base is heuristically aligned to texts, and the alignment results are treated as labeled data. However, the heuristic alignment can fail, resulting in wrong label problem. In addition, in previous approaches, statistical models have typically been applied to ad hoc features. The noise that originates from the feature extraction process can cause poor performance. In this paper, we propose a novel model dubbed the Piecewise Convolutional Neural Networks (PCNNs) with multi-instance learning to address these two problems. To solve the first problem, distant supervised relation extraction is treated as a multi-instance problem in which the uncertainty of instance labels is taken into account. To address the latter problem, we avoid feature engineering and instead adopt convolutional architecture with piecewise max pooling to automatically learn relevant features. Experiments show that our method is effective and outperforms several competitive baseline methods.

1,059 citations


Journal ArticleDOI
TL;DR: Surprisingly, for all tasks, such a seemingly naive PCANet model is on par with the state-of-the-art features either prefixed, highly hand-crafted, or carefully learned [by deep neural networks (DNNs)].
Abstract: In this paper, we propose a very simple deep learning network for image classification that is based on very basic data processing components: 1) cascaded principal component analysis (PCA); 2) binary hashing; and 3) blockwise histograms. In the proposed architecture, the PCA is employed to learn multistage filter banks. This is followed by simple binary hashing and block histograms for indexing and pooling. This architecture is thus called the PCA network (PCANet) and can be extremely easily and efficiently designed and learned. For comparison and to provide a better understanding, we also introduce and study two simple variations of PCANet: 1) RandNet and 2) LDANet. They share the same topology as PCANet, but their cascaded filters are either randomly selected or learned from linear discriminant analysis. We have extensively tested these basic networks on many benchmark visual data sets for different tasks, including Labeled Faces in the Wild (LFW) for face verification; the MultiPIE, Extended Yale B, AR, Facial Recognition Technology (FERET) data sets for face recognition; and MNIST for hand-written digit recognition. Surprisingly, for all tasks, such a seemingly naive PCANet model is on par with the state-of-the-art features either prefixed, highly hand-crafted, or carefully learned [by deep neural networks (DNNs)]. Even more surprisingly, the model sets new records for many classification tasks on the Extended Yale B, AR, and FERET data sets and on MNIST variations. Additional experiments on other public data sets also demonstrate the potential of PCANet to serve as a simple but highly competitive baseline for texture classification and object recognition.

1,034 citations


Journal ArticleDOI
TL;DR: A new feature extraction (FE) and image classification framework are proposed for hyperspectral data analysis based on deep belief network (DBN) and a novel deep architecture is proposed, which combines the spectral-spatial FE and classification together to get high classification accuracy.
Abstract: Hyperspectral data classification is a hot topic in remote sensing community. In recent years, significant effort has been focused on this issue. However, most of the methods extract the features of original data in a shallow manner. In this paper, we introduce a deep learning approach into hyperspectral image classification. A new feature extraction (FE) and image classification framework are proposed for hyperspectral data analysis based on deep belief network (DBN). First, we verify the eligibility of restricted Boltzmann machine (RBM) and DBN by the following spectral information-based classification. Then, we propose a novel deep architecture, which combines the spectral–spatial FE and classification together to get high classification accuracy. The framework is a hybrid of principal component analysis (PCA), hierarchical learning-based FE, and logistic regression (LR). Experimental results with hyperspectral data indicate that the classifier provide competitive solution with the state-of-the-art methods. In addition, this paper reveals that deep learning system has huge potential for hyperspectral data classification.

1,028 citations


Proceedings ArticleDOI
07 Dec 2015
TL;DR: The results suggest that activations from the first layer provide superior tracking performance compared to the deeper layers, and show that the convolutional features provide improved results compared to standard hand-crafted features.
Abstract: Visual object tracking is a challenging computer vision problem with numerous real-world applications. This paper investigates the impact of convolutional features for the visual tracking problem. We propose to use activations from the convolutional layer of a CNN in discriminative correlation filter based tracking frameworks. These activations have several advantages compared to the standard deep features (fully connected layers). Firstly, they miti-gate the need of task specific fine-tuning. Secondly, they contain structural information crucial for the tracking problem. Lastly, these activations have low dimensionality. We perform comprehensive experiments on three benchmark datasets: OTB, ALOV300++ and the recently introduced VOT2015. Surprisingly, different to image classification, our results suggest that activations from the first layer provide superior tracking performance compared to the deeper layers. Our results further show that the convolutional features provide improved results compared to standard hand-crafted features. Finally, results comparable to state-of-the-art trackers are obtained on all three benchmark datasets.

961 citations


Posted Content
TL;DR: This paper proposes two very deep neural network architectures, referred to as DeepID3, for face recognition, rebuilt from stacked convolution and inception layers proposed in VGG net and GoogLeNet to make them suitable to face recognition.
Abstract: The state-of-the-art of face recognition has been significantly advanced by the emergence of deep learning. Very deep neural networks recently achieved great success on general object recognition because of their superb learning capacity. This motivates us to investigate their effectiveness on face recognition. This paper proposes two very deep neural network architectures, referred to as DeepID3, for face recognition. These two architectures are rebuilt from stacked convolution and inception layers proposed in VGG net and GoogLeNet to make them suitable to face recognition. Joint face identification-verification supervisory signals are added to both intermediate and final feature extraction layers during training. An ensemble of the proposed two architectures achieves 99.53% LFW face verification accuracy and 96.0% LFW rank-1 face identification accuracy, respectively. A further discussion of LFW face verification result is given in the end.

Proceedings ArticleDOI
07 Jun 2015
TL;DR: Zhang et al. as discussed by the authors introduced a neural network architecture which has fully connected layers on top of CNNs responsible for feature extraction at three different scales, and proposed a refinement method to enhance the spatial coherence of their saliency results.
Abstract: Visual saliency is a fundamental problem in both cognitive and computational sciences, including computer vision. In this paper, we discover that a high-quality visual saliency model can be learned from multiscale features extracted using deep convolutional neural networks (CNNs), which have had many successes in visual recognition tasks. For learning such saliency models, we introduce a neural network architecture, which has fully connected layers on top of CNNs responsible for feature extraction at three different scales. We then propose a refinement method to enhance the spatial coherence of our saliency results. Finally, aggregating multiple saliency maps computed for different levels of image segmentation can further boost the performance, yielding saliency maps better than those generated from a single segmentation. To promote further research and evaluation of visual saliency models, we also construct a new large database of 4447 challenging images and their pixelwise saliency annotations. Experimental results demonstrate that our proposed method is capable of achieving state-of-the-art performance on all public benchmarks, improving the F-Measure by 5.0% and 13.2% respectively on the MSRA-B dataset and our new dataset (HKU-IS), and lowering the mean absolute error by 5.7% and 35.1% respectively on these two datasets.

Proceedings ArticleDOI
07 Dec 2015
TL;DR: This paper uses Convolutional Neural Networks to learn discriminant patch representations and in particular train a Siamese network with pairs of (non-)corresponding patches to develop 128-D descriptors whose euclidean distances reflect patch similarity and can be used as a drop-in replacement for any task involving SIFT.
Abstract: Deep learning has revolutionalized image-level tasks such as classification, but patch-level tasks, such as correspondence, still rely on hand-crafted features, e.g. SIFT. In this paper we use Convolutional Neural Networks (CNNs) to learn discriminant patch representations and in particular train a Siamese network with pairs of (non-)corresponding patches. We deal with the large number of potential pairs with the combination of a stochastic sampling of the training set and an aggressive mining strategy biased towards patches that are hard to classify. By using the L2 distance during both training and testing we develop 128-D descriptors whose euclidean distances reflect patch similarity, and which can be used as a drop-in replacement for any task involving SIFT. We demonstrate consistent performance gains over the state of the art, and generalize well against scaling and rotation, perspective transformation, non-rigid deformation, and illumination changes. Our descriptors are efficient to compute and amenable to modern GPUs, and are publicly available.

Proceedings ArticleDOI
07 Jun 2015
TL;DR: A unified approach to combining feature computation and similarity networks for training a patch matching system that improves accuracy over previous state-of-the-art results on patch matching datasets, while reducing the storage requirement for descriptors is confirmed.
Abstract: Motivated by recent successes on learning feature representations and on learning feature comparison functions, we propose a unified approach to combining both for training a patch matching system. Our system, dubbed Match-Net, consists of a deep convolutional network that extracts features from patches and a network of three fully connected layers that computes a similarity between the extracted features. To ensure experimental repeatability, we train MatchNet on standard datasets and employ an input sampler to augment the training set with synthetic exemplar pairs that reduce overfitting. Once trained, we achieve better computational efficiency during matching by disassembling MatchNet and separately applying the feature computation and similarity networks in two sequential stages. We perform a comprehensive set of experiments on standard datasets to carefully study the contributions of each aspect of MatchNet, with direct comparisons to established methods. Our results confirm that our unified approach improves accuracy over previous state-of-the-art results on patch matching datasets, while reducing the storage requirement for descriptors. We make pre-trained MatchNet publicly available.

Journal ArticleDOI
TL;DR: Various ways of performing dimensionality reduction on high-dimensional microarray data are summarised to provide a clearer idea of when to use each one of them for saving computational time and resources.
Abstract: We summarise various ways of performing dimensionality reduction on high-dimensional microarray data. Many different feature selection and feature extraction methods exist and they are being widely used. All these methods aim to remove redundant and irrelevant features so that classification of new instances will be more accurate. A popular source of data is microarrays, a biological platform for gathering gene expressions. Analysing microarrays can be difficult due to the size of the data they provide. In addition the complicated relations among the different genes make analysis more difficult and removing excess features can improve the quality of the results. We present some of the most popular methods for selecting significant features and provide a comparison between them. Their advantages and disadvantages are outlined in order to provide a clearer idea of when to use each one of them for saving computational time and resources.

Proceedings ArticleDOI
07 Dec 2015
TL;DR: This paper shows that deep features and traditional hand-engineered features have quite different distributions of pairwise similarities, hence existing aggregation methods have to be carefully re-evaluated and reveals that in contrast to shallow features, the simple aggregation method based on sum pooling provides the best performance for deep convolutional features.
Abstract: Several recent works have shown that image descriptors produced by deep convolutional neural networks provide state-of-the-art performance for image classification and retrieval problems. It also has been shown that the activations from the convolutional layers can be interpreted as local features describing particular image regions. These local features can be aggregated using aggregating methods developed for local features (e.g. Fisher vectors), thus providing new powerful global descriptor. In this paper we investigate possible ways to aggregate local deep features to produce compact descriptors for image retrieval. First, we show that deep features and traditional hand-engineered features have quite different distributions of pairwise similarities, hence existing aggregation methods have to be carefully re-evaluated. Such re-evaluation reveals that in contrast to shallow features, the simple aggregation method based on sum pooling provides the best performance for deep convolutional features. This method is efficient, has few parameters, and bears little risk of overfitting when e.g. learning the PCA matrix. In addition, we suggest a simple yet efficient query expansion scheme suitable for the proposed aggregation method. Overall, the new compact global descriptor improves the state-of-the-art on four common benchmarks considerably.

Proceedings ArticleDOI
Heechul Jung1, Sihaeng Lee1, Junho Yim1, Sunjeong Park1, Junmo Kim1 
07 Dec 2015
TL;DR: A deep learning technique, which is regarded as a tool to automatically extract useful features from raw data, is adopted and is combined using a new integration method in order to boost the performance of the facial expression recognition.
Abstract: Temporal information has useful features for recognizing facial expressions. However, to manually design useful features requires a lot of effort. In this paper, to reduce this effort, a deep learning technique, which is regarded as a tool to automatically extract useful features from raw data, is adopted. Our deep network is based on two different models. The first deep network extracts temporal appearance features from image sequences, while the other deep network extracts temporal geometry features from temporal facial landmark points. These two models are combined using a new integration method in order to boost the performance of the facial expression recognition. Through several experiments, we show that the two models cooperate with each other. As a result, we achieve superior performance to other state-of-the-art methods in the CK+ and Oulu-CASIA databases. Furthermore, we show that our new integration method gives more accurate results than traditional methods, such as a weighted summation and a feature concatenation method.

Journal ArticleDOI
TL;DR: A novel and effective geospatial object detection framework is proposed by combining the weakly supervised learning (WSL) and high-level feature learning by jointly integrating saliency, intraclass compactness, and interclass separability in a Bayesian framework.
Abstract: The abundant spatial and contextual information provided by the advanced remote sensing technology has facilitated subsequent automatic interpretation of the optical remote sensing images (RSIs). In this paper, a novel and effective geospatial object detection framework is proposed by combining the weakly supervised learning (WSL) and high-level feature learning. First, deep Boltzmann machine is adopted to infer the spatial and structural information encoded in the low-level and middle-level features to effectively describe objects in optical RSIs. Then, a novel WSL approach is presented to object detection where the training sets require only binary labels indicating whether an image contains the target object or not. Based on the learnt high-level features, it jointly integrates saliency, intraclass compactness, and interclass separability in a Bayesian framework to initialize a set of training examples from weakly labeled images and start iterative learning of the object detector. A novel evaluation criterion is also developed to detect model drift and cease the iterative learning. Comprehensive experiments on three optical RSI data sets have demonstrated the efficacy of the proposed approach in benchmarking with several state-of-the-art supervised-learning-based object detection approaches.

Journal ArticleDOI
TL;DR: A novel deep-learning-based feature-selection method is proposed, which formulates the feature- selection problem as a feature reconstruction problem, and an iterative algorithm is developed to adapt the DBN to produce the inquired reconstruction weights.
Abstract: With the popular use of high-resolution satellite images, more and more research efforts have been placed on remote sensing scene classification/recognition. In scene classification, effective feature selection can significantly boost the final performance. In this letter, a novel deep-learning-based feature-selection method is proposed, which formulates the feature-selection problem as a feature reconstruction problem. Note that the popular deep-learning technique, i.e., the deep belief network (DBN), achieves feature abstraction by minimizing the reconstruction error over the whole feature set, and features with smaller reconstruction errors would hold more feature intrinsics for image representation. Therefore, the proposed method selects features that are more reconstructible as the discriminative features. Specifically, an iterative algorithm is developed to adapt the DBN to produce the inquired reconstruction weights. In the experiments, 2800 remote sensing scene images of seven categories are collected for performance evaluation. Experimental results demonstrate the effectiveness of the proposed method.

Proceedings ArticleDOI
17 Dec 2015
TL;DR: This paper leverages recent progress on Convolutional Neural Networks (CNNs) and proposes a novel RGB-D architecture for object recognition that is composed of two separate CNN processing streams - one for each modality - which are consecutively combined with a late fusion network.
Abstract: Robust object recognition is a crucial ingredient of many, if not all, real-world robotics applications. This paper leverages recent progress on Convolutional Neural Networks (CNNs) and proposes a novel RGB-D architecture for object recognition. Our architecture is composed of two separate CNN processing streams - one for each modality - which are consecutively combined with a late fusion network. We focus on learning with imperfect sensor data, a typical problem in real-world robotics tasks. For accurate learning, we introduce a multi-stage training methodology and two crucial ingredients for handling depth data with CNNs. The first, an effective encoding of depth information for CNNs that enables learning without the need for large depth datasets. The second, a data augmentation scheme for robust learning with depth images by corrupting them with realistic noise patterns. We present state-of-the-art results on the RGB-D object dataset [15] and show recognition in challenging RGB-D real-world noisy settings.

Proceedings ArticleDOI
07 Dec 2015
TL;DR: In this paper, a Pose-based Convolutional Neural Network descriptor (P-CNN) is proposed for action recognition in videos, which aggregates motion and appearance information along tracks of human body parts.
Abstract: This work targets human action recognition in video. While recent methods typically represent actions by statistics of local video features, here we argue for the importance of a representation derived from human pose. To this end we propose a new Pose-based Convolutional Neural Network descriptor (P-CNN) for action recognition. The descriptor aggregates motion and appearance information along tracks of human body parts. We investigate different schemes of temporal aggregation and experiment with P-CNN features obtained both for automatically estimated and manually annotated human poses. We evaluate our method on the recent and challenging JHMDB and MPII Cooking datasets. For both datasets our method shows consistent improvement over the state of the art.

Journal ArticleDOI
TL;DR: The proposed framework employs local binary patterns to extract local image features, such as edges, corners, and spots, and employs the efficient extreme learning machine with a very simple structure as the classifier.
Abstract: It is of great interest in exploiting texture information for classification of hyperspectral imagery (HSI) at high spatial resolution. In this paper, a classification paradigm to exploit rich texture information of HSI is proposed. The proposed framework employs local binary patterns (LBPs) to extract local image features, such as edges, corners, and spots. Two levels of fusion (i.e., feature-level fusion and decision-level fusion) are applied to the extracted LBP features along with global Gabor features and original spectral features, where feature-level fusion involves concatenation of multiple features before the pattern classification process while decision-level fusion performs on probability outputs of each individual classification pipeline and soft-decision fusion rule is adopted to merge results from the classifier ensemble. Moreover, the efficient extreme learning machine with a very simple structure is employed as the classifier. Experimental results on several HSI data sets demonstrate that the proposed framework is superior to some traditional alternatives.

Proceedings ArticleDOI
07 Dec 2015
TL;DR: In this paper, the authors investigated if the awareness of egomotion (i.e. self motion) can be used as a supervisory signal for feature learning and found that features learnt using self-motion as supervision compare favourably to features learned using class-label as supervision on the tasks of scene recognition, object recognition, visual odometry and keypoint matching.
Abstract: The current dominant paradigm for feature learning in computer vision relies on training neural networks for the task of object recognition using millions of hand labelled images. Is it also possible to learn features for a diverse set of visual tasks using any other form of supervision? In biology, living organisms developed the ability of visual perception for the purpose of moving and acting in the world. Drawing inspiration from this observation, in this work we investigated if the awareness of egomotion(i.e. self motion) can be used as a supervisory signal for feature learning. As opposed to the knowledge of class labels, information about egomotion is freely available to mobile agents. We found that using the same number of training images, features learnt using egomotion as supervision compare favourably to features learnt using class-label as supervision on the tasks of scene recognition, object recognition, visual odometry and keypoint matching.

Journal ArticleDOI
TL;DR: A new no-reference (NR) image quality assessment (IQA) metric is proposed using the recently revealed free-energy-based brain theory and classical human visual system (HVS)-inspired features to predict an image that the HVS perceives from a distorted image based on the free energy theory.
Abstract: In this paper we propose a new no-reference (NR) image quality assessment (IQA) metric using the recently revealed free-energy-based brain theory and classical human visual system (HVS)-inspired features. The features used can be divided into three groups. The first involves the features inspired by the free energy principle and the structural degradation model. Furthermore, the free energy theory also reveals that the HVS always tries to infer the meaningful part from the visual stimuli. In terms of this finding, we first predict an image that the HVS perceives from a distorted image based on the free energy theory, then the second group of features is composed of some HVS-inspired features (such as structural information and gradient magnitude) computed using the distorted and predicted images. The third group of features quantifies the possible losses of “naturalness” in the distorted image by fitting the generalized Gaussian distribution to mean subtracted contrast normalized coefficients. After feature extraction, our algorithm utilizes the support vector machine based regression module to derive the overall quality score. Experiments on LIVE, TID2008, CSIQ, IVC, and Toyama databases confirm the effectiveness of our introduced NR IQA metric compared to the state-of-the-art.

Journal ArticleDOI
TL;DR: The proposed algorithm clearly outperforms standard principal component analysis and its kernel counterpart (kPCA), as well as current state-of-the-art algorithms of aerial classification, while being extremely computationally efficient at learning representations of data.
Abstract: This paper introduces the use of single layer and deep convolutional networks for remote sensing data analysis. Direct application to multi- and hyper-spectral imagery of supervised (shallow or deep) convolutional networks is very challenging given the high input data dimensionality and the relatively small amount of available labeled data. Therefore, we propose the use of greedy layer-wise unsupervised pre-training coupled with a highly efficient algorithm for unsupervised learning of sparse features. The algorithm is rooted on sparse representations and enforces both population and lifetime sparsity of the extracted features, simultaneously. We successfully illustrate the expressive power of the extracted representations in several scenarios: classification of aerial scenes, as well as land-use classification in very high resolution (VHR), or land-cover classification from multi- and hyper-spectral images. The proposed algorithm clearly outperforms standard Principal Component Analysis (PCA) and its kernel counterpart (kPCA), as well as current state-of-the-art algorithms of aerial classification, while being extremely computationally efficient at learning representations of data. Results show that single layer convolutional networks can extract powerful discriminative features only when the receptive field accounts for neighboring pixels, and are preferred when the classification requires high resolution and detailed results. However, deep architectures significantly outperform single layers variants, capturing increasing levels of abstraction and complexity throughout the feature hierarchy.

Journal ArticleDOI
TL;DR: A novel pose recovery method using non-linear mapping with multi-layered deep neural network and back-propagation deep learning to obtain a unified feature description by standard eigen-decomposition of the hypergraph Laplacian matrix.
Abstract: Video-based human pose recovery is usually conducted by retrieving relevant poses using image features. In the retrieving process, the mapping between 2D images and 3D poses is assumed to be linear in most of the traditional methods. However, their relationships are inherently non-linear, which limits recovery performance of these methods. In this paper, we propose a novel pose recovery method using non-linear mapping with multi-layered deep neural network. It is based on feature extraction with multimodal fusion and back-propagation deep learning. In multimodal fusion, we construct hypergraph Laplacian with low-rank representation. In this way, we obtain a unified feature description by standard eigen-decomposition of the hypergraph Laplacian matrix. In back-propagation deep learning, we learn a non-linear mapping from 2D images to 3D poses with parameter fine-tuning. The experimental results on three data sets show that the recovery error has been reduced by 20%–25%, which demonstrates the effectiveness of the proposed method.

Journal ArticleDOI
TL;DR: It is demonstrated that the selection of optimal neighborhoods for individual 3D points significantly improves the results of 3D scene analysis and may even further increase the quality of the derived results while significantly reducing both processing time and memory consumption.
Abstract: 3D scene analysis in terms of automatically assigning 3D points a respective semantic label has become a topic of great importance in photogrammetry, remote sensing, computer vision and robotics. In this paper, we address the issue of how to increase the distinctiveness of geometric features and select the most relevant ones among these for 3D scene analysis. We present a new, fully automated and versatile framework composed of four components: (i) neighborhood selection, (ii) feature extraction, (iii) feature selection and (iv) classification. For each component, we consider a variety of approaches which allow applicability in terms of simplicity, efficiency and reproducibility, so that end-users can easily apply the different components and do not require expert knowledge in the respective domains. In a detailed evaluation involving 7 neighborhood definitions, 21 geometric features, 7 approaches for feature selection, 10 classifiers and 2 benchmark datasets, we demonstrate that the selection of optimal neighborhoods for individual 3D points significantly improves the results of 3D scene analysis. Additionally, we show that the selection of adequate feature subsets may even further increase the quality of the derived results while significantly reducing both processing time and memory consumption.

Proceedings ArticleDOI
07 Dec 2015
TL;DR: In this article, a multi-task autoencoder (MTAE) is proposed to transform the original image into analogs in multiple related domains, which are then used as inputs to a classifier.
Abstract: The problem of domain generalization is to take knowledge acquired from a number of related domains, where training data is available, and to then successfully apply it to previously unseen domains. We propose a new feature learning algorithm, Multi-Task Autoencoder (MTAE), that provides good generalization performance for cross-domain object recognition. The algorithm extends the standard denoising autoencoder framework by substituting artificially induced corruption with naturally occurring inter-domain variability in the appearance of objects. Instead of reconstructing images from noisy versions, MTAE learns to transform the original image into analogs in multiple related domains. It thereby learns features that are robust to variations across domains. The learnt features are then used as inputs to a classifier. We evaluated the performance of the algorithm on benchmark image recognition datasets, where the task is to learn features from multiple datasets and to then predict the image label from unseen datasets. We found that (denoising) MTAE outperforms alternative autoencoder-based models as well as the current state-of-the-art algorithms for domain generalization.

Journal ArticleDOI
TL;DR: A connectionist-hidden Markov model (HMM) system for noise-robust AVSR is introduced and it is demonstrated that approximately 65 % word recognition rate gain is attained with denoised MFCCs under 10 dB signal-to-noise-ratio (SNR) for the audio signal input.
Abstract: Audio-visual speech recognition (AVSR) system is thought to be one of the most promising solutions for reliable speech recognition, particularly when the audio is corrupted by noise. However, cautious selection of sensory features is crucial for attaining high recognition performance. In the machine-learning community, deep learning approaches have recently attracted increasing attention because deep neural networks can effectively extract robust latent features that enable various recognition algorithms to demonstrate revolutionary generalization capabilities under diverse application conditions. This study introduces a connectionist-hidden Markov model (HMM) system for noise-robust AVSR. First, a deep denoising autoencoder is utilized for acquiring noise-robust audio features. By preparing the training data for the network with pairs of consecutive multiple steps of deteriorated audio features and the corresponding clean features, the network is trained to output denoised audio features from the corresponding features deteriorated by noise. Second, a convolutional neural network (CNN) is utilized to extract visual features from raw mouth area images. By preparing the training data for the CNN as pairs of raw images and the corresponding phoneme label outputs, the network is trained to predict phoneme labels from the corresponding mouth area input images. Finally, a multi-stream HMM (MSHMM) is applied for integrating the acquired audio and visual HMMs independently trained with the respective features. By comparing the cases when normal and denoised mel-frequency cepstral coefficients (MFCCs) are utilized as audio features to the HMM, our unimodal isolated word recognition results demonstrate that approximately 65 % word recognition rate gain is attained with denoised MFCCs under 10 dB signal-to-noise-ratio (SNR) for the audio signal input. Moreover, our multimodal isolated word recognition results utilizing MSHMM with denoised MFCCs and acquired visual features demonstrate that an additional word recognition rate gain is attained for the SNR conditions below 10 dB.

Journal ArticleDOI
TL;DR: The proposed unsupervised-feature-learning-based scene classification method provides more accurate classification results than the other latent-Dirichlet-allocation-based methods and the sparse coding method.
Abstract: Due to the rapid technological development of various different satellite sensors, a huge volume of high-resolution image data sets can now be acquired. How to efficiently represent and recognize the scenes from such high-resolution image data has become a critical task. In this paper, we propose an unsupervised feature learning framework for scene classification. By using the saliency detection algorithm, we extract a representative set of patches from the salient regions in the image data set. These unlabeled data patches are exploited by an unsupervised feature learning method to learn a set of feature extractors which are robust and efficient and do not need elaborately designed descriptors such as the scale-invariant-feature-transform-based algorithm. We show that the statistics generated from the learned feature extractors can characterize a complex scene very well and can produce excellent classification accuracy. In order to reduce overfitting in the feature learning step, we further employ a recently developed regularization method called “dropout,” which has proved to be very effective in image classification. In the experiments, the proposed method was applied to two challenging high-resolution data sets: the UC Merced data set containing 21 different aerial scene categories with a submeter resolution and the Sydney data set containing seven land-use categories with a 60-cm spatial resolution. The proposed method obtained results that were equal to or even better than the previous best results with the UC Merced data set, and it also obtained the highest accuracy with the Sydney data set, demonstrating that the proposed unsupervised-feature-learning-based scene classification method provides more accurate classification results than the other latent-Dirichlet-allocation-based methods and the sparse coding method.