
Showing papers on "Contextual image classification" published in 2015


Journal ArticleDOI
TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark in object category classification and detection on hundreds of object categories and millions of images; it has been run annually from 2010 to the present, attracting participation from more than fifty institutions.
Abstract: The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the 5 years of the challenge, and propose future directions and improvements.

30,811 citations


Journal ArticleDOI
TL;DR: This work equips the networks with another pooling strategy, "spatial pyramid pooling", to eliminate the above requirement, and develops a new network structure, called SPP-net, which can generate a fixed-length representation regardless of image size/scale.
Abstract: Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g., 224×224) input image. This requirement is “artificial” and may reduce the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with another pooling strategy, “spatial pyramid pooling”, to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. Pyramid pooling is also robust to object deformations. With these advantages, SPP-net should in general improve all CNN-based image classification methods. On the ImageNet 2012 dataset, we demonstrate that SPP-net boosts the accuracy of a variety of CNN architectures despite their different designs. On the Pascal VOC 2007 and Caltech101 datasets, SPP-net achieves state-of-the-art classification results using a single full-image representation and no fine-tuning. The power of SPP-net is also significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This method avoids repeatedly computing the convolutional features. In processing test images, our method is 24-102× faster than the R-CNN method, while achieving better or comparable accuracy on Pascal VOC 2007. In ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014, our methods rank #2 in object detection and #3 in image classification among all 38 teams. This manuscript also introduces the improvement made for this competition.
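
The fixed-length property can be illustrated with a minimal NumPy sketch (purely illustrative, not the authors' implementation; the pyramid levels and channel count are arbitrary, and inputs are assumed to be at least as large as the finest grid):

    import numpy as np

    def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
        """Max-pool a (C, H, W) feature map over a pyramid of grids.

        The output length is C * sum(l*l for l in levels), independent of H and W.
        """
        C, H, W = feature_map.shape
        pooled = []
        for l in levels:
            h_edges = np.linspace(0, H, l + 1).astype(int)   # grid cell boundaries
            w_edges = np.linspace(0, W, l + 1).astype(int)
            for i in range(l):
                for j in range(l):
                    cell = feature_map[:, h_edges[i]:h_edges[i + 1],
                                          w_edges[j]:w_edges[j + 1]]
                    pooled.append(cell.max(axis=(1, 2)))      # max over each spatial cell
        return np.concatenate(pooled)

    # Two inputs of different spatial sizes yield representations of identical length.
    v1 = spatial_pyramid_pool(np.random.rand(256, 13, 13))
    v2 = spatial_pyramid_pool(np.random.rand(256, 9, 17))
    assert v1.shape == v2.shape == (256 * (1 + 4 + 16),)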

5,919 citations


Posted Content
TL;DR: In this article, a new convolutional network module is proposed to aggregate multi-scale contextual information without losing resolution, and the architecture is based on the fact that dilated convolutions support exponential expansion of the receptive field without loss of resolution or coverage.
Abstract: State-of-the-art models for semantic segmentation are based on adaptations of convolutional networks that had originally been designed for image classification. However, dense prediction and image classification are structurally different. In this work, we develop a new convolutional network module that is specifically designed for dense prediction. The presented module uses dilated convolutions to systematically aggregate multi-scale contextual information without losing resolution. The architecture is based on the fact that dilated convolutions support exponential expansion of the receptive field without loss of resolution or coverage. We show that the presented context module increases the accuracy of state-of-the-art semantic segmentation systems. In addition, we examine the adaptation of image classification networks to dense prediction and show that simplifying the adapted network can increase accuracy.
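
A brief sketch of the idea, assuming PyTorch is available (generic code, not the released context module): stacking 3×3 convolutions with dilations 1, 2 and 4 grows the receptive field to 15×15 pixels, while matching the padding to the dilation keeps the output at the input resolution.

    import torch
    import torch.nn as nn

    # Three 3x3 convolutions with dilations 1, 2, 4; the receptive field grows to
    # 1 + 2*(1 + 2 + 4) = 15 pixels per side, with no pooling or striding involved.
    context = nn.Sequential(
        nn.Conv2d(64, 64, kernel_size=3, dilation=1, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, dilation=2, padding=2), nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, dilation=4, padding=4), nn.ReLU(),
    )

    x = torch.randn(1, 64, 128, 128)   # e.g. feature maps from a segmentation front end
    y = context(x)
    assert y.shape == x.shape          # resolution and coverage are preserved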

4,018 citations


Journal ArticleDOI
10 Jul 2015-PLOS ONE
TL;DR: This work proposes a general solution to the problem of understanding classification decisions by pixel-wise decomposition of nonlinear classifiers, introducing a methodology that visualizes the contributions of single pixels to predictions, both for kernel-based classifiers over Bag of Words features and for multilayered neural networks.
Abstract: Understanding and interpreting the classification decisions of automated image classification systems is of high value in many applications, as it allows one to verify the reasoning of the system and provides additional information to the human expert. Although machine learning methods are very successfully solving a plethora of tasks, in most cases they have the disadvantage of acting as a black box, not providing any information about what made them arrive at a particular decision. This work proposes a general solution to the problem of understanding classification decisions by pixel-wise decomposition of nonlinear classifiers. We introduce a methodology that allows one to visualize the contributions of single pixels to predictions for kernel-based classifiers over Bag of Words features and for multilayered neural networks. These pixel contributions can be visualized as heatmaps and are provided to a human expert who can intuitively not only verify the validity of the classification decision, but also focus further analysis on regions of potential interest. We evaluate our method for classifiers trained on PASCAL VOC 2009 images, synthetic image data containing geometric shapes, the MNIST handwritten digits data set and for the pre-trained ImageNet model available as part of the Caffe open source package.
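
The pixel-wise decomposition can be sketched for a toy ReLU network with a simple relevance-redistribution rule in NumPy (an epsilon-stabilised rule; the network, sizes and names are illustrative, not the paper's exact formulation):

    import numpy as np

    def lrp_epsilon(weights, biases, x, eps=1e-6):
        """Decompose the network's top score into one relevance value per input dimension."""
        activations = [x]                                   # forward pass, keep all activations
        for W, b in zip(weights, biases):
            activations.append(np.maximum(0.0, activations[-1] @ W + b))
        R = np.zeros_like(activations[-1])
        c = int(np.argmax(activations[-1]))
        R[c] = activations[-1][c]                           # start from the predicted class score
        # Backward pass: redistribute relevance in proportion to each input's contribution.
        for W, b, a in zip(reversed(weights), reversed(biases), reversed(activations[:-1])):
            z = a @ W + b
            z = z + eps * np.where(z >= 0, 1.0, -1.0)       # stabiliser avoids division by zero
            s = R / z
            R = a * (W @ s)
        return R                                            # e.g. reshaped into a heatmap for images

    rng = np.random.default_rng(0)
    weights = [rng.normal(size=(4, 3)), rng.normal(size=(3, 2))]   # 4 "pixels", 3 hidden, 2 classes
    biases = [np.zeros(3), np.zeros(2)]
    print(lrp_epsilon(weights, biases, rng.random(4)))      # per-pixel contributions to the decision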

3,330 citations


Proceedings Article
07 May 2015
TL;DR: DeepLab combines the responses at the final DCNN layer with a fully connected CRF to localize segment boundaries at a level of accuracy beyond previous methods, reaching 71.6% IOU accuracy on the PASCAL VOC-2012 test set.
Abstract: Deep Convolutional Neural Networks (DCNNs) have recently shown state of the art performance in high level vision tasks, such as image classification and object detection. This work brings together methods from DCNNs and probabilistic graphical models for addressing the task of pixel-level classification (also called "semantic image segmentation"). We show that responses at the final layer of DCNNs are not sufficiently localized for accurate object segmentation. This is due to the very invariance properties that make DCNNs good for high level tasks. We overcome this poor localization property of deep networks by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF). Qualitatively, our "DeepLab" system is able to localize segment boundaries at a level of accuracy which is beyond previous methods. Quantitatively, our method sets the new state of the art on the PASCAL VOC-2012 semantic image segmentation task, reaching 71.6% IOU accuracy on the test set. We show how these results can be obtained efficiently: careful network re-purposing and a novel application of the 'hole' algorithm from the wavelet community allow dense computation of neural net responses at 8 frames per second on a modern GPU.

2,469 citations


Journal ArticleDOI
TL;DR: Surprisingly, for all tasks, such a seemingly naive PCANet model is on par with the state-of-the-art features either prefixed, highly hand-crafted, or carefully learned [by deep neural networks (DNNs)].
Abstract: In this paper, we propose a very simple deep learning network for image classification that is based on very basic data processing components: 1) cascaded principal component analysis (PCA); 2) binary hashing; and 3) blockwise histograms. In the proposed architecture, the PCA is employed to learn multistage filter banks. This is followed by simple binary hashing and block histograms for indexing and pooling. This architecture is thus called the PCA network (PCANet) and can be extremely easily and efficiently designed and learned. For comparison and to provide a better understanding, we also introduce and study two simple variations of PCANet: 1) RandNet and 2) LDANet. They share the same topology as PCANet, but their cascaded filters are either randomly selected or learned from linear discriminant analysis. We have extensively tested these basic networks on many benchmark visual data sets for different tasks, including Labeled Faces in the Wild (LFW) for face verification; the MultiPIE, Extended Yale B, AR, Facial Recognition Technology (FERET) data sets for face recognition; and MNIST for hand-written digit recognition. Surprisingly, for all tasks, such a seemingly naive PCANet model is on par with the state-of-the-art features either prefixed, highly hand-crafted, or carefully learned [by deep neural networks (DNNs)]. Even more surprisingly, the model sets new records for many classification tasks on the Extended Yale B, AR, and FERET data sets and on MNIST variations. Additional experiments on other public data sets also demonstrate the potential of PCANet to serve as a simple but highly competitive baseline for texture classification and object recognition.
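
The first PCANet stage can be sketched compactly in NumPy (illustrative, not the released code; patch size, filter count and the toy images are arbitrary). A complete PCANet would add a second PCA stage, binary hashing of the filter responses, and blockwise histograms.

    import numpy as np

    def learn_pca_filters(images, patch=7, n_filters=8):
        """PCANet stage-1 filters: leading eigenvectors of mean-removed image patches."""
        patches = []
        for img in images:                                    # img: (H, W) grayscale array
            H, W = img.shape
            for i in range(H - patch + 1):
                for j in range(W - patch + 1):
                    p = img[i:i + patch, j:j + patch].ravel()
                    patches.append(p - p.mean())              # remove the patch mean
        X = np.stack(patches)
        _, _, Vt = np.linalg.svd(X, full_matrices=False)      # principal directions of patches
        return Vt[:n_filters].reshape(n_filters, patch, patch)

    def convolve_bank(img, filters):
        """Apply each PCA filter as a (valid) convolution; returns (n_filters, H', W')."""
        n, k, _ = filters.shape
        H, W = img.shape
        out = np.empty((n, H - k + 1, W - k + 1))
        for f in range(n):
            for i in range(H - k + 1):
                for j in range(W - k + 1):
                    out[f, i, j] = np.sum(img[i:i + k, j:j + k] * filters[f])
        return out

    images = [np.random.rand(32, 32) for _ in range(5)]       # stand-ins for training images
    filters = learn_pca_filters(images)
    stage1_maps = convolve_bank(images[0], filters)           # inputs to the next PCA stage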

1,034 citations


Journal ArticleDOI
TL;DR: A new feature extraction (FE) and image classification framework is proposed for hyperspectral data analysis based on a deep belief network (DBN), and a novel deep architecture is proposed which combines spectral-spatial FE and classification together to achieve high classification accuracy.
Abstract: Hyperspectral data classification is a hot topic in the remote sensing community. In recent years, significant effort has been focused on this issue. However, most of the methods extract the features of the original data in a shallow manner. In this paper, we introduce a deep learning approach into hyperspectral image classification. A new feature extraction (FE) and image classification framework is proposed for hyperspectral data analysis based on a deep belief network (DBN). First, we verify the eligibility of the restricted Boltzmann machine (RBM) and DBN by the following spectral information-based classification. Then, we propose a novel deep architecture, which combines the spectral-spatial FE and classification together to get high classification accuracy. The framework is a hybrid of principal component analysis (PCA), hierarchical learning-based FE, and logistic regression (LR). Experimental results with hyperspectral data indicate that the classifier provides a competitive solution compared with the state-of-the-art methods. In addition, this paper reveals that the deep learning system has huge potential for hyperspectral data classification.
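
The building block of the DBN, a restricted Boltzmann machine, can be sketched with one contrastive-divergence (CD-1) update in NumPy (an illustrative toy with binary units and made-up sizes, not the paper's training setup):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_step(v0, W, b_vis, b_hid, rng, lr=0.05):
        """One CD-1 update for an RBM; v0 is a (batch, n_visible) data batch."""
        ph0 = sigmoid(v0 @ W + b_hid)                         # positive phase
        h0 = (rng.random(ph0.shape) < ph0).astype(float)      # sample hidden states
        pv1 = sigmoid(h0 @ W.T + b_vis)                       # one Gibbs step down...
        ph1 = sigmoid(pv1 @ W + b_hid)                        # ...and back up
        W += lr * (v0.T @ ph0 - pv1.T @ ph1) / v0.shape[0]    # approximate gradient
        b_vis += lr * (v0 - pv1).mean(axis=0)
        b_hid += lr * (ph0 - ph1).mean(axis=0)
        return W, b_vis, b_hid

    rng = np.random.default_rng(1)
    v = (rng.random((100, 50)) > 0.5).astype(float)           # toy "spectral" vectors, 50 bands
    W = rng.normal(0.0, 0.01, (50, 20))                       # 20 hidden units
    b_vis, b_hid = np.zeros(50), np.zeros(20)
    for _ in range(10):                                       # a few unsupervised updates
        W, b_vis, b_hid = cd1_step(v, W, b_vis, b_hid, rng)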

1,028 citations


Proceedings ArticleDOI
07 Dec 2015
TL;DR: The results suggest that activations from the first layer provide superior tracking performance compared to the deeper layers, and show that the convolutional features provide improved results compared to standard hand-crafted features.
Abstract: Visual object tracking is a challenging computer vision problem with numerous real-world applications. This paper investigates the impact of convolutional features for the visual tracking problem. We propose to use activations from the convolutional layer of a CNN in discriminative correlation filter based tracking frameworks. These activations have several advantages compared to the standard deep features (fully connected layers). Firstly, they mitigate the need of task-specific fine-tuning. Secondly, they contain structural information crucial for the tracking problem. Lastly, these activations have low dimensionality. We perform comprehensive experiments on three benchmark datasets: OTB, ALOV300++ and the recently introduced VOT2015. Surprisingly, unlike in image classification, our results suggest that activations from the first layer provide superior tracking performance compared to the deeper layers. Our results further show that the convolutional features provide improved results compared to standard hand-crafted features. Finally, results comparable to state-of-the-art trackers are obtained on all three benchmark datasets.
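
The kind of discriminative correlation filter those activations are plugged into can be sketched for a single feature channel in NumPy (a generic MOSSE-style filter learned in the Fourier domain, not the authors' tracker; cosine windowing, multi-channel features and online updates are omitted):

    import numpy as np

    def train_filter(feat, sigma=2.0, lam=1e-2):
        """Learn a correlation filter for one (H, W) feature channel."""
        H, W = feat.shape
        ys, xs = np.mgrid[0:H, 0:W]
        g = np.exp(-((ys - H // 2) ** 2 + (xs - W // 2) ** 2) / (2 * sigma ** 2))
        F, G = np.fft.fft2(feat), np.fft.fft2(g)              # desired Gaussian response g
        return (G * np.conj(F)) / (F * np.conj(F) + lam)      # regularised closed-form solution

    def detect(filter_hat, feat):
        """Correlate the filter with a new feature channel; the response peak gives the shift."""
        response = np.real(np.fft.ifft2(np.fft.fft2(feat) * filter_hat))
        return np.unravel_index(int(np.argmax(response)), response.shape)

    # feat could be one activation channel from an early convolutional layer, cropped
    # around the target; random data stands in here.
    feat = np.random.rand(64, 64)
    h = train_filter(feat)
    print(detect(h, feat))                                    # near the centre (32, 32)
    print(detect(h, np.roll(feat, (3, 5), axis=(0, 1))))      # peak shifted by roughly (3, 5)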

961 citations


Proceedings ArticleDOI
07 Jun 2015
TL;DR: A general framework is introduced to train CNNs with only a limited number of clean labels and millions of easily obtained noisy labels; the relationships between images, class labels and label noise are modeled with a probabilistic graphical model that is further integrated into an end-to-end deep learning system.
Abstract: Large-scale supervised datasets are crucial to train convolutional neural networks (CNNs) for various computer vision problems. However, obtaining a massive amount of well-labeled data is usually very expensive and time consuming. In this paper, we introduce a general framework to train CNNs with only a limited number of clean labels and millions of easily obtained noisy labels. We model the relationships between images, class labels and label noises with a probabilistic graphical model and further integrate it into an end-to-end deep learning system. To demonstrate the effectiveness of our approach, we collect a large-scale real-world clothing classification dataset with both noisy and clean labels. Experiments on this dataset indicate that our approach can better correct the noisy labels and improves the performance of trained CNNs.

893 citations


Proceedings ArticleDOI
07 Jun 2015
TL;DR: This paper proposes to apply visual attention to the fine-grained classification task using a deep neural network; it achieves the best accuracy under the weakest supervision condition and is competitive against other methods that rely on additional annotations.
Abstract: Fine-grained classification is challenging because categories can only be discriminated by subtle and local differences. Variances in the pose, scale or rotation usually make the problem more difficult. Most fine-grained classification systems follow the pipeline of finding the foreground object or object parts (where) to extract discriminative features (what).

755 citations


Journal ArticleDOI
TL;DR: This paper surveys state-of-the-art transfer learning algorithms in visual categorization applications, such as object recognition, image classification, and human action recognition, where typical cross-domain problems can be efficiently solved by transferring knowledge from related domains.
Abstract: Regular machine learning and data mining techniques study the training data for future inferences under a major assumption that the future data are within the same feature space or have the same distribution as the training data. However, due to the limited availability of human-labeled training data, training data that stay in the same feature space or have the same distribution as the future data cannot be guaranteed to be sufficient to avoid the over-fitting problem. In real-world applications, apart from data in the target domain, related data in a different domain can also be included to expand the availability of our prior knowledge about the target future data. Transfer learning addresses such cross-domain learning problems by extracting useful information from data in a related domain and transferring it for use in target tasks. In recent years, with transfer learning being applied to visual categorization, some typical problems, e.g., view divergence in action recognition tasks and concept drifting in image classification tasks, can be efficiently solved. In this paper, we survey state-of-the-art transfer learning algorithms in visual categorization applications, such as object recognition, image classification, and human action recognition.

Proceedings ArticleDOI
07 Jun 2015
TL;DR: This project shows that compelling classification performance can be achieved on fine-grained categories even without labeled training data, and establishes a substantially improved state-of-the-art on the Animals with Attributes and Caltech-UCSD Birds datasets.
Abstract: Image classification has advanced significantly in recent years with the availability of large-scale image sets. However, fine-grained classification remains a major challenge due to the annotation cost of large numbers of fine-grained categories. This project shows that compelling classification performance can be achieved on such categories even without labeled training data. Given image and class embeddings, we learn a compatibility function such that matching embeddings are assigned a higher score than mismatching ones; zero-shot classification of an image proceeds by finding the label yielding the highest joint compatibility score. We use state-of-the-art image features and focus on different supervised attributes and unsupervised output embeddings either derived from hierarchies or learned from unlabeled text corpora. We establish a substantially improved state-of-the-art on the Animals with Attributes and Caltech-UCSD Birds datasets. Most encouragingly, we demonstrate that purely unsupervised output embeddings (learned from Wikipedia and improved with fine-grained text) achieve compelling results, even outperforming the previous supervised state-of-the-art. By combining different output embeddings, we further improve results.
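
The compatibility-learning step can be sketched in NumPy (a toy with synthetic features and embeddings; the update is a simple ranking/hinge rule rather than the paper's exact objective): a bilinear score x^T W phi(y) is trained on seen classes and then evaluated against unseen class embeddings only.

    import numpy as np

    rng = np.random.default_rng(0)
    d_img, d_cls = 64, 16
    A = rng.normal(size=(d_img, d_cls))                       # hidden mapping used only to make toy data

    def make_example(class_emb, y):
        return A @ class_emb[y] + 0.1 * rng.normal(size=d_img)

    def scores(W, x, class_emb):
        return class_emb @ (W.T @ x)                          # F(x, y) = x^T W phi(y), for every y

    seen_emb = rng.normal(size=(10, d_cls))                   # attribute/text embeddings of seen classes
    unseen_emb = rng.normal(size=(5, d_cls))                  # unseen classes share the same space

    W = np.zeros((d_img, d_cls))
    for _ in range(5000):                                     # ranking SGD on seen classes only
        y = int(rng.integers(len(seen_emb)))
        x = make_example(seen_emb, y)
        s = scores(W, x, seen_emb)
        wrong = int(np.argmax(np.where(np.arange(len(s)) == y, -np.inf, s)))
        if s[wrong] + 1.0 > s[y]:                             # hinge margin violated
            W += 0.01 * np.outer(x, seen_emb[y] - seen_emb[wrong])

    x_test = make_example(unseen_emb, 3)                      # image of an unseen class
    print(int(np.argmax(scores(W, x_test, unseen_emb))))      # zero-shot prediction, ideally 3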

Proceedings ArticleDOI
07 Jun 2015
TL;DR: ConvNets trained to recognize everyday objects were evaluated for the classification of aerial and remote sensing images; they obtained the best results for aerial images, while for remote sensing they performed well but were outperformed by low-level color descriptors such as BIC.
Abstract: In this paper, we evaluate the generalization power of deep features (ConvNets) in two new scenarios: aerial and remote sensing image classification. We evaluate experimentally ConvNets trained for recognizing everyday objects for the classification of aerial and remote sensing images. ConvNets obtained the best results for aerial images, while for remote sensing, they performed well but were outperformed by low-level color descriptors, such as BIC. We also present a correlation analysis, showing the potential for combining/fusing different ConvNets with other descriptors or even for combining multiple ConvNets. A preliminary set of experiments fusing ConvNets obtains state-of-the-art results for the well-known UCMerced dataset.

Journal ArticleDOI
TL;DR: The proposed algorithm clearly outperforms standard principal component analysis and its kernel counterpart (kPCA), as well as current state-of-the-art algorithms of aerial classification, while being extremely computationally efficient at learning representations of data.
Abstract: This paper introduces the use of single-layer and deep convolutional networks for remote sensing data analysis. Direct application to multi- and hyper-spectral imagery of supervised (shallow or deep) convolutional networks is very challenging given the high input data dimensionality and the relatively small amount of available labeled data. Therefore, we propose the use of greedy layer-wise unsupervised pre-training coupled with a highly efficient algorithm for unsupervised learning of sparse features. The algorithm is rooted in sparse representations and enforces both population and lifetime sparsity of the extracted features, simultaneously. We successfully illustrate the expressive power of the extracted representations in several scenarios: classification of aerial scenes, as well as land-use classification in very high resolution (VHR), or land-cover classification from multi- and hyper-spectral images. The proposed algorithm clearly outperforms standard Principal Component Analysis (PCA) and its kernel counterpart (kPCA), as well as current state-of-the-art algorithms of aerial classification, while being extremely computationally efficient at learning representations of data. Results show that single-layer convolutional networks can extract powerful discriminative features only when the receptive field accounts for neighboring pixels, and are preferred when the classification requires high resolution and detailed results. However, deep architectures significantly outperform single-layer variants, capturing increasing levels of abstraction and complexity throughout the feature hierarchy.

Journal ArticleDOI
TL;DR: The proposed unsupervised-feature-learning-based scene classification method provides more accurate classification results than the other latent-Dirichlet-allocation-based methods and the sparse coding method.
Abstract: Due to the rapid technological development of various different satellite sensors, a huge volume of high-resolution image data sets can now be acquired. How to efficiently represent and recognize the scenes from such high-resolution image data has become a critical task. In this paper, we propose an unsupervised feature learning framework for scene classification. By using the saliency detection algorithm, we extract a representative set of patches from the salient regions in the image data set. These unlabeled data patches are exploited by an unsupervised feature learning method to learn a set of feature extractors which are robust and efficient and do not need elaborately designed descriptors such as the scale-invariant-feature-transform-based algorithm. We show that the statistics generated from the learned feature extractors can characterize a complex scene very well and can produce excellent classification accuracy. In order to reduce overfitting in the feature learning step, we further employ a recently developed regularization method called “dropout,” which has proved to be very effective in image classification. In the experiments, the proposed method was applied to two challenging high-resolution data sets: the UC Merced data set containing 21 different aerial scene categories with a submeter resolution and the Sydney data set containing seven land-use categories with a 60-cm spatial resolution. The proposed method obtained results that were equal to or even better than the previous best results with the UC Merced data set, and it also obtained the highest accuracy with the Sydney data set, demonstrating that the proposed unsupervised-feature-learning-based scene classification method provides more accurate classification results than the other latent-Dirichlet-allocation-based methods and the sparse coding method.

Posted Content
TL;DR: A general methodology based on region perturbation is presented for evaluating ordered collections of pixels such as heatmaps; it shows that the recently proposed layer-wise relevance propagation algorithm qualitatively and quantitatively provides a better explanation of what made a DNN arrive at a particular classification decision than the sensitivity-based approach or the deconvolution method.
Abstract: Deep Neural Networks (DNNs) have demonstrated impressive performance in complex machine learning tasks such as image classification or speech recognition. However, due to their multi-layer nonlinear structure, they are not transparent, i.e., it is hard to grasp what makes them arrive at a particular classification or recognition decision given a new unseen data sample. Recently, several approaches have been proposed enabling one to understand and interpret the reasoning embodied in a DNN for a single test image. These methods quantify the "importance" of individual pixels with respect to the classification decision and allow a visualization in terms of a heatmap in pixel/input space. While the usefulness of heatmaps can be judged subjectively by a human, an objective quality measure is missing. In this paper we present a general methodology based on region perturbation for evaluating ordered collections of pixels such as heatmaps. We compare heatmaps computed by three different methods on the SUN397, ILSVRC2012 and MIT Places data sets. Our main result is that the recently proposed Layer-wise Relevance Propagation (LRP) algorithm qualitatively and quantitatively provides a better explanation of what made a DNN arrive at a particular classification decision than the sensitivity-based approach or the deconvolution method. We provide theoretical arguments to explain this result and discuss its practical implications. Finally, we investigate the use of heatmaps for unsupervised assessment of neural network performance.
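
The region-perturbation protocol itself is simple to sketch in NumPy (the classifier and heatmap below are placeholders; real experiments would use a trained DNN and heatmaps from the compared explanation methods): regions are degraded in the order given by the heatmap while the classifier score is tracked, so a steeper drop means a better explanation.

    import numpy as np

    def perturbation_curve(image, heatmap, score_fn, patch=9, steps=50):
        """Classifier score as the most relevant patch-sized regions are degraded.

        Regions are zeroed here for simplicity; random or blurred replacements are
        equally valid perturbations.
        """
        img = image.copy()
        H, W = heatmap.shape
        order = np.column_stack(np.unravel_index(np.argsort(heatmap, axis=None)[::-1], (H, W)))
        scores = [score_fn(img)]
        half = patch // 2
        for i, j in order[:steps]:                            # most relevant locations first
            img[max(i - half, 0):i + half + 1, max(j - half, 0):j + half + 1] = 0.0
            scores.append(score_fn(img))
        return np.array(scores)                               # area under this curve compares methods

    # Placeholders: a "classifier" that relies on one region, and a heatmap pointing at it.
    score_fn = lambda x: float(x[8:16, 8:16].mean())
    image = np.random.rand(32, 32)
    heatmap = np.zeros((32, 32))
    heatmap[8:16, 8:16] = 1.0
    print(perturbation_curve(image, heatmap, score_fn)[:5])   # the score drops quickly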

Proceedings ArticleDOI
07 Jun 2015
TL;DR: This paper attempts to model deep learning in a weakly supervised (multiple instance learning) framework in which each image follows a dual multi-instance assumption: its object proposals and its possible text annotations are regarded as two instance sets.
Abstract: The recent development in learning deep representations has demonstrated its wide applications in traditional vision tasks like classification and detection. However, there has been little investigation on how we could build up a deep learning framework in a weakly supervised setting. In this paper, we attempt to model deep learning in a weakly supervised learning (multiple instance learning) framework. In our setting, each image follows a dual multi-instance assumption, where its object proposals and possible text annotations can be regarded as two instance sets. We thus design effective systems to exploit the MIL property with deep learning strategies from the two ends; we also try to jointly learn the relationship between object and annotation proposals. We conduct extensive experiments and prove that our weakly supervised deep learning framework not only achieves convincing performance in vision tasks including classification and image annotation, but also extracts reasonable region-keyword pairs with little supervision, on both widely used benchmarks like PASCAL VOC and MIT Indoor Scene 67, and also on a dataset for image- and patch-level annotations.

Journal ArticleDOI
TL;DR: A hybrid method using Random Forest and texture analysis is proposed to accurately differentiate land covers of urban vegetated areas and to analyze how classification accuracy changes with texture window size; the results demonstrate that UAVs provide an efficient and ideal platform for urban vegetation mapping.
Abstract: Unmanned aerial vehicle (UAV) remote sensing has great potential for vegetation mapping in complex urban landscapes due to the ultra-high resolution imagery acquired at low altitudes. Because of payload capacity restrictions, off-the-shelf digital cameras are widely used on medium- and small-sized UAVs. The limitation of low spectral resolution in digital cameras for vegetation mapping can be reduced by incorporating texture features and robust classifiers. Random Forest has been widely used in satellite remote sensing applications, but its usage in UAV image classification has not been well documented. The objectives of this paper were to propose a hybrid method using Random Forest and texture analysis to accurately differentiate land covers of urban vegetated areas, and to analyze how classification accuracy changes with texture window size. Six least-correlated second-order texture measures were calculated at nine different window sizes and added to original Red-Green-Blue (RGB) images as ancillary data. A Random Forest classifier consisting of 200 decision trees was used for classification in the spectral-textural feature space. Results indicated the following: (1) Random Forest outperformed the traditional Maximum Likelihood classifier and showed similar performance to object-based image analysis in urban vegetation classification; (2) the inclusion of texture features improved classification accuracy significantly; (3) classification accuracy followed an inverted-U relationship with texture window size. The results demonstrate that the UAV provides an efficient and ideal platform for urban vegetation mapping. The hybrid method proposed in this paper shows good performance in urban vegetation mapping. The drawbacks of off-the-shelf digital cameras can be reduced by adopting Random Forest and texture analysis at the same time.
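
A condensed sketch of such a spectral-textural pipeline, assuming scikit-image and scikit-learn are available (newer scikit-image spells the GLCM functions graycomatrix/graycoprops; the toy data, single window size and three texture measures simplify the paper's six measures at nine window sizes):

    import numpy as np
    from skimage.feature import graycomatrix, graycoprops
    from sklearn.ensemble import RandomForestClassifier

    def texture_features(window, props=("contrast", "dissimilarity", "homogeneity")):
        """Second-order (GLCM) texture measures for one grey-level window."""
        glcm = graycomatrix(window, distances=[1], angles=[0, np.pi / 2],
                            levels=32, symmetric=True, normed=True)
        return np.hstack([graycoprops(glcm, p).ravel() for p in props])

    def pixel_features(rgb, gray, i, j, win=7):
        """Spectral values of the pixel plus texture of the surrounding window."""
        half = win // 2
        return np.hstack([rgb[i, j], texture_features(gray[i - half:i + half + 1,
                                                           j - half:j + half + 1])])

    rng = np.random.default_rng(0)
    rgb = rng.integers(0, 256, (64, 64, 3), dtype=np.uint8)   # stand-in for a UAV RGB mosaic
    gray = (rgb.mean(axis=2) // 8).astype(np.uint8)           # quantised to 32 grey levels
    labels = (rng.random((64, 64)) > 0.5).astype(int)         # stand-in reference classes

    coords = [(i, j) for i in range(3, 61) for j in range(3, 61)]
    X = np.array([pixel_features(rgb, gray, i, j) for i, j in coords])
    y = np.array([labels[i, j] for i, j in coords])

    clf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0).fit(X, y)
    print(clf.oob_score_)                                     # out-of-bag accuracy estimate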

Journal ArticleDOI
TL;DR: An analysis of the effects of input data characteristics on RF classifications (including RF out-of-bag error, independent classification accuracy and class proportion error) through a case study in peatland classification using LiDAR derivatives is presented.
Abstract: Random Forest (RF) is a widely used algorithm for classification of remotely sensed data. Through a case study in peatland classification using LiDAR derivatives, we present an analysis of the effects of input data characteristics on RF classifications (including RF out-of-bag error, independent classification accuracy and class proportion error). Training data selection and specific input variables (i.e., image channels) have a large impact on the overall accuracy of the image classification. High-dimension datasets should be reduced so that only uncorrelated important variables are used in classifications. Despite the fact that RF is an ensemble approach, independent error assessments should be used to evaluate RF results, and iterative classifications are recommended to assess the stability of predicted classes. Results are also shown to be highly sensitive to the size of the training data set. In addition to being as large as possible, the training data sets used in RF classification should also be (a) randomly distributed or created in a manner that allows for the class proportions of the training data to be representative of actual class proportions in the landscape; and (b) should have minimal spatial autocorrelation to improve classification results and to mitigate inflated estimates of RF out-of-bag classification accuracy.

Journal ArticleDOI
TL;DR: The main objective of this survey paper is to recall the concept of the attribute profile (AP) along with all its modifications and generalizations, with special emphasis on remote sensing image classification, to summarize the important aspects of its efficient utilization, and to list potential future works.
Abstract: Just over a decade has passed since the concept of the morphological profile was defined for the analysis of remote sensing images. Since then, the morphological profile has largely proved to be a powerful tool able to model spatial information (e.g., contextual relations) of the image. However, due to the shortcomings of using the morphological profiles, many variants, extensions, and refinements of its definition have appeared, showing that the morphological profile is still under continuous development. In this context, the recently introduced and theoretically sound attribute profiles (APs) can be considered as a generalization of the morphological profile, which is a powerful tool to model spatial information existing in the scene. Although the concept of the AP has been introduced in remote sensing only recently, an extensive literature on its use in different applications and on different types of data has appeared. To that end, the great number of contributions in the literature that address the application of the AP to many tasks (e.g., classification, object detection, segmentation, change detection, etc.) and to different types of images (e.g., panchromatic, multispectral, and hyperspectral) proves how the AP is an effective and modern tool. The main objective of this survey paper is to recall the concept of the APs along with all its modifications and generalizations, with special emphasis on remote sensing image classification, to summarize the important aspects of its efficient utilization, and to list potential future works.
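
The morphological profile that the AP generalizes is easy to sketch with scikit-image (illustrative; a true attribute profile would instead filter a max-tree representation by region attributes such as area or standard deviation): openings and closings by reconstruction at increasing scales are stacked as per-pixel spatial features.

    import numpy as np
    from skimage.morphology import disk, erosion, dilation, reconstruction

    def morphological_profile(image, radii=(1, 2, 4, 8)):
        """Per-pixel stack of openings/closings by reconstruction at several scales."""
        layers = []
        for r in radii:
            selem = disk(r)
            # Opening by reconstruction: erode, then reconstruct under the original image.
            opened = reconstruction(erosion(image, selem), image, method="dilation")
            # Closing by reconstruction: dilate, then reconstruct above the original image.
            closed = reconstruction(dilation(image, selem), image, method="erosion")
            layers.extend([opened, closed])
        return np.stack([image] + layers, axis=-1)            # (H, W, 1 + 2 * len(radii))

    band = np.random.rand(128, 128)                           # e.g. a principal component of a scene
    print(morphological_profile(band).shape)                  # (128, 128, 9) features per pixel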

Journal ArticleDOI
TL;DR: Experimental results with three Radarsat-2 images in quad polarization mode indicate that classification accuracies could be significantly increased by integrating spatial and polarimetric features using ensemble learning strategies.
Abstract: Fully Polarimetric Synthetic Aperture Radar (PolSAR) has the advantages of all-weather, day-and-night observation and high-resolution capabilities. The collected data are usually stored as Sinclair, coherence, or covariance matrices, which are directly related to the physical properties of natural media and the backscattering mechanism. Additional information related to the nature of the scattering medium can be exploited through polarimetric decomposition theorems. Accordingly, PolSAR image classification has gained increasing attention from the remote sensing community in recent years. However, the above polarimetric measurements or parameters cannot provide sufficient information for accurate PolSAR image classification in some scenarios, e.g. in complex urban areas where different scattering media may exhibit a similar PolSAR response for a number of unavoidable reasons. Inspired by the complementarity between spectral and spatial features bringing remarkable improvements in optical image classification, the complementary information between polarimetric and spatial features may also contribute to PolSAR image classification. Therefore, the roles of textural features such as contrast, dissimilarity, homogeneity and local range, as well as morphological profiles (MPs), in PolSAR image classification are investigated using two advanced ensemble learning (EL) classifiers: Random Forest and Rotation Forest. The supervised Wishart classifier and support vector machines (SVMs) are used as benchmark classifiers for evaluation and comparison purposes. Experimental results with three Radarsat-2 images in quad polarization mode indicate that classification accuracies can be significantly increased by integrating spatial and polarimetric features using ensemble learning strategies. Rotation Forest achieves better accuracy than SVM and Random Forest; meanwhile, Random Forest is much faster than Rotation Forest.

Proceedings ArticleDOI
07 Jun 2015
TL;DR: A valve linkage function (VLF) for back-propagation chaining is proposed to form the deep localization, alignment and classification (LAC) system and can adaptively compromise the errors of classification and alignment when training the LAC model.
Abstract: We propose a fine-grained recognition system that incorporates part localization, alignment, and classification in one deep neural network. This is a nontrivial process, as the input to the classification module should be functions that enable back-propagation in constructing the solver. Our major contribution is to propose a valve linkage function (VLF) for back-propagation chaining and form our deep localization, alignment and classification (LAC) system. The VLF can adaptively compromise the errors of classification and alignment when training the LAC model. It in turn helps update localization. The performance on fine-grained object data bears out the effectiveness of our LAC system.

Posted Content
TL;DR: In this article, the authors show that for instance-level image retrieval, lower layers often perform better than the last layers in convolutional neural networks, and adopt VLAD encoding to encode features into a single vector for each image.
Abstract: Deep convolutional neural networks have been successfully applied to image classification tasks. When these same networks have been applied to image retrieval, the assumption has been made that the last layers would give the best performance, as they do in classification. We show that for instance-level image retrieval, lower layers often perform better than the last layers in convolutional neural networks. We present an approach for extracting convolutional features from different layers of the networks, and adopt VLAD encoding to encode features into a single vector for each image. We investigate the effect of different layers and scales of input images on the performance of convolutional features using the recent deep networks OxfordNet and GoogLeNet. Experiments demonstrate that intermediate layers or higher layers with finer scales produce better results for image retrieval, compared to the last layer. When using compressed 128-D VLAD descriptors, our method obtains state-of-the-art results and outperforms other VLAD and CNN based approaches on two out of three test datasets. Our work provides guidance for transferring deep networks trained on image classification to image retrieval tasks.
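
The VLAD encoding step can be sketched with NumPy and scikit-learn (illustrative; the paper's multi-scale inputs and 128-D PCA compression are omitted, and the activations below are random stand-ins): each spatial position of a convolutional feature map acts as a local descriptor, residuals to the nearest visual word are accumulated, and the result is power- and L2-normalised.

    import numpy as np
    from sklearn.cluster import KMeans

    def conv_map_to_descriptors(feature_map):
        """Treat every spatial position of a (C, H, W) map as a C-dimensional local descriptor."""
        C, H, W = feature_map.shape
        return feature_map.reshape(C, H * W).T

    def vlad_encode(local_feats, codebook):
        """VLAD: per-word sum of residuals to the cluster centre, then normalise."""
        k, d = codebook.cluster_centers_.shape
        assignment = codebook.predict(local_feats)
        v = np.zeros((k, d))
        for word in range(k):
            members = local_feats[assignment == word]
            if len(members):
                v[word] = (members - codebook.cluster_centers_[word]).sum(axis=0)
        v = v.ravel()
        v = np.sign(v) * np.sqrt(np.abs(v))                   # signed square-root (power) normalisation
        return v / (np.linalg.norm(v) + 1e-12)                # L2 normalisation

    maps = [np.random.rand(256, 14, 14) for _ in range(4)]    # stand-ins for conv-layer activations
    all_desc = np.vstack([conv_map_to_descriptors(m) for m in maps])
    codebook = KMeans(n_clusters=16, n_init=4, random_state=0).fit(all_desc)
    print(vlad_encode(conv_map_to_descriptors(maps[0]), codebook).shape)   # (16 * 256,) per image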

Proceedings ArticleDOI
07 Jun 2015
TL;DR: Nearest Non-Outlier (NNO) is an open-world recognition algorithm that evolves the model efficiently, adding object categories incrementally while detecting outliers and managing open space risk.
Abstract: With the advent of rich classification models and high computational power, visual recognition systems have found many operational applications. Recognition in the real world poses multiple challenges that are not apparent in controlled lab environments. The datasets are dynamic and novel categories must be continuously detected and then added. At prediction time, a trained system has to deal with myriad unseen categories. Operational systems require minimal downtime, even to learn. To handle these operational issues, we present the problem of Open World Recognition and formally define it. We prove that thresholding sums of monotonically decreasing functions of distances in linearly transformed feature space can balance “open space risk” and empirical risk. Our theory extends existing algorithms for open world recognition. We present a protocol for evaluation of open world recognition systems. We present the Nearest Non-Outlier (NNO) algorithm, which evolves the model efficiently, adding object categories incrementally while detecting outliers and managing open space risk. We perform experiments on the ImageNet dataset with 1.2M+ images to validate the effectiveness of our method on large scale visual recognition tasks. NNO consistently yields superior results on open world recognition.
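
The open-world behaviour can be sketched with a much-simplified nearest-mean variant in NumPy (illustrative; the actual NNO works in a learned linear transformation of the feature space and calibrates its threshold, which is skipped here): scores decrease monotonically with distance to the class means, anything scoring non-positively is rejected as unknown, and new categories are added by appending a mean.

    import numpy as np

    class NearestNonOutlierSketch:
        def __init__(self, tau=3.0):
            self.tau = tau                     # rejection threshold controlling open-space risk
            self.means, self.labels = [], []

        def add_class(self, feats, label):
            """Incrementally add a category from its training features."""
            self.means.append(feats.mean(axis=0))
            self.labels.append(label)

        def predict(self, x):
            """Nearest class label, or None ("unknown") if x is an outlier to every class."""
            d = np.linalg.norm(np.stack(self.means) - x, axis=1)
            score = 1.0 - d / self.tau         # decreases monotonically with distance
            best = int(np.argmax(score))
            return self.labels[best] if score[best] > 0 else None

    rng = np.random.default_rng(0)
    model = NearestNonOutlierSketch(tau=3.0)
    model.add_class(rng.normal(loc=0.0, size=(50, 16)), "cat")
    model.add_class(rng.normal(loc=5.0, size=(50, 16)), "dog")   # category added incrementally
    print(model.predict(np.full(16, 0.1)))     # "cat"
    print(model.predict(np.full(16, 50.0)))    # None: far from all known classes, i.e. unknown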

Posted Content
TL;DR: This work unifies the two-stage process for semantic segmentation into a single joint training algorithm, demonstrates the method on the semantic image segmentation task, and shows encouraging results on the challenging PASCAL VOC 2012 dataset.
Abstract: Convolutional neural networks with many layers have recently been shown to achieve excellent results on many high-level tasks such as image classification, object detection and more recently also semantic segmentation. Particularly for semantic segmentation, a two-stage procedure is often employed. Hereby, convolutional networks are trained to provide good local pixel-wise features for the second step being traditionally a more global graphical model. In this work we unify this two-stage process into a single joint training algorithm. We demonstrate our method on the semantic image segmentation task and show encouraging results on the challenging PASCAL VOC 2012 dataset.

Journal ArticleDOI
07 Aug 2015
TL;DR: This paper provides a taxonomical view of the field and reviews the current methodologies for multimodal classification of remote sensing images, highlighting the most recent advances, which exploit synergies with machine learning and signal processing.
Abstract: Earth observation through remote sensing images allows the accurate characterization and identification of materials on the surface from space and airborne platforms. Multiple and heterogeneous image sources can be available for the same geographical region: multispectral, hyperspectral, radar, multitemporal, and multiangular images can today be acquired over a given scene. These sources can be combined/fused to improve classification of the materials on the surface. Even if this type of systems is generally accurate, the field is about to face new challenges: the upcoming constellations of satellite sensors will acquire large amounts of images of different spatial, spectral, angular, and temporal resolutions. In this scenario, multimodal image fusion stands out as the appropriate framework to address these problems. In this paper, we provide a taxonomical view of the field and review the current methodologies for multimodal classification of remote sensing images. We also highlight the most recent advances, which exploit synergies with machine learning and signal processing: sparse methods, kernel-based fusion, Markov modeling, and manifold alignment. Then, we illustrate the different approaches in seven challenging remote sensing applications: 1) multiresolution fusion for multispectral image classification; 2) image downscaling as a form of multitemporal image fusion and multidimensional interpolation among sensors of different spatial, spectral, and temporal resolutions; 3) multiangular image classification; 4) multisensor image fusion exploiting physically-based feature extractions; 5) multitemporal image classification of land covers in incomplete, inconsistent, and vague image sources; 6) spatiospectral multisensor fusion of optical and radar images for change detection; and 7) cross-sensor adaptation of classifiers. The adoption of these techniques in operational settings will help to monitor our planet from space in the very near future.

Journal ArticleDOI
TL;DR: Experimental results on multi-focus and multi-modal image sets demonstrate that the ASR-based fusion method can outperform the conventional SR-based method in terms of both visual quality and objective assessment.
Abstract: In this study, a novel adaptive sparse representation (ASR) model is presented for simultaneous image fusion and denoising. As a powerful signal modelling technique, sparse representation (SR) has been successfully employed in many image processing applications such as denoising and fusion. In traditional SR-based applications, a highly redundant dictionary is always needed to satisfy signal reconstruction requirement since the structures vary significantly across different image patches. However, it may result in potential visual artefacts as well as high computational cost. In the proposed ASR model, instead of learning a single redundant dictionary, a set of more compact sub-dictionaries are learned from numerous high-quality image patches which have been pre-classified into several corresponding categories based on their gradient information. At the fusion and denoising processes, one of the sub-dictionaries is adaptively selected for a given set of source image patches. Experimental results on multi-focus and multi-modal image sets demonstrate that the ASR-based fusion method can outperform the conventional SR-based method in terms of both visual quality and objective assessment.
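
The adaptive sub-dictionary idea can be roughed out with NumPy and scikit-learn's OMP solver (very much a sketch: the sub-dictionaries are random stand-ins rather than ones learned from pre-classified high-quality patches, routing uses a simple gradient-orientation histogram, and fusion uses the common max-L1 activity rule):

    import numpy as np
    from sklearn.linear_model import orthogonal_mp

    PATCH = 8
    rng = np.random.default_rng(0)
    # Stand-ins for sub-dictionaries learned from patches grouped by gradient direction.
    sub_dicts = [rng.normal(size=(PATCH * PATCH, 128)) for _ in range(4)]
    sub_dicts = [D / np.linalg.norm(D, axis=0) for D in sub_dicts]

    def pick_subdict(patch):
        """Route a patch to a sub-dictionary by its dominant gradient orientation."""
        gy, gx = np.gradient(patch)
        angle = np.mod(np.arctan2(gy, gx).ravel(), np.pi)
        mag = np.hypot(gx, gy).ravel()
        bins = np.clip(np.digitize(angle, np.linspace(0, np.pi, 5)) - 1, 0, 3)
        return int(np.argmax(np.bincount(bins, weights=mag, minlength=4)))

    def fuse_patches(pa, pb, k=8):
        """Sparse-code two co-located source patches over the selected sub-dictionary
        and keep the code with the larger activity level (L1 norm)."""
        D = sub_dicts[pick_subdict(pa)]            # one shared sub-dictionary for both sources
        ca = orthogonal_mp(D, pa.ravel() - pa.mean(), n_nonzero_coefs=k)
        cb = orthogonal_mp(D, pb.ravel() - pb.mean(), n_nonzero_coefs=k)
        c, m = (ca, pa.mean()) if np.abs(ca).sum() >= np.abs(cb).sum() else (cb, pb.mean())
        return (D @ c + m).reshape(PATCH, PATCH)   # fused (and implicitly denoised) patch

    a, b = rng.random((PATCH, PATCH)), rng.random((PATCH, PATCH))
    print(fuse_patches(a, b).shape)                # (8, 8)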

Posted Content
TL;DR: This paper investigates possible ways to aggregate local deep features to produce compact global descriptors for image retrieval and shows that deep features and traditional hand-engineered features have quite different distributions of pairwise similarities, hence existing aggregation methods have to be carefully re-evaluated.
Abstract: Several recent works have shown that image descriptors produced by deep convolutional neural networks provide state-of-the-art performance for image classification and retrieval problems. It has also been shown that the activations from the convolutional layers can be interpreted as local features describing particular image regions. These local features can be aggregated using aggregation approaches developed for local features (e.g. Fisher vectors), thus providing new powerful global descriptors. In this paper we investigate possible ways to aggregate local deep features to produce compact global descriptors for image retrieval. First, we show that deep features and traditional hand-engineered features have quite different distributions of pairwise similarities, hence existing aggregation methods have to be carefully re-evaluated. Such re-evaluation reveals that in contrast to shallow features, the simple aggregation method based on sum pooling provides arguably the best performance for deep convolutional features. This method is efficient, has few parameters, and bears little risk of overfitting when e.g. learning the PCA matrix. Overall, the new compact global descriptor improves the state-of-the-art on four common benchmarks considerably.
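
The favoured sum-pooling aggregation is compact enough to sketch in NumPy (SPoC-style but simplified: the centering prior and whitening are left out, the PCA is plain, and the activations are random stand-ins):

    import numpy as np

    def sum_pooled_descriptor(feature_map):
        """Sum (C, H, W) conv activations over spatial positions and L2-normalise."""
        v = feature_map.sum(axis=(1, 2))
        return v / (np.linalg.norm(v) + 1e-12)

    def fit_pca(descriptors, dim=16):
        """Plain PCA learned on a held-out descriptor set (whitening omitted)."""
        mean = descriptors.mean(axis=0)
        _, _, Vt = np.linalg.svd(descriptors - mean, full_matrices=False)
        return mean, Vt[:dim]

    def compress(desc, mean, components):
        v = components @ (desc - mean)
        return v / (np.linalg.norm(v) + 1e-12)                # final L2 normalisation

    maps = [np.random.rand(512, 37, 37) for _ in range(20)]   # stand-ins for last-conv-layer activations
    descs = np.stack([sum_pooled_descriptor(m) for m in maps])
    mean, comps = fit_pca(descs)
    print(compress(descs[0], mean, comps).shape)              # (16,) compact global descriptor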

Journal ArticleDOI
TL;DR: Experiments on a TerraSAR-X image demonstrate that the DCAE network can extract efficient features and achieve better classification results than some related algorithms.
Abstract: Synthetic aperture radar (SAR) image classification is a hot topic in the interpretation of SAR images. However, the absence of effective feature representation and the presence of speckle noise in SAR images make classification difficult to handle. In order to overcome these problems, a deep convolutional autoencoder (DCAE) is proposed to extract features and conduct classification automatically. The deep network is composed of eight layers: a convolutional layer to extract texture features, a scale transformation layer to aggregate neighbor information, four layers based on sparse autoencoders to optimize features and classify, and two final layers for postprocessing. Compared with hand-crafted features, the DCAE network provides an automatic method to learn discriminative features from the image. A series of filters is designed as convolutional units to combine the gray-level co-occurrence matrix and Gabor features together. Scale transformation is conducted to reduce the influence of the noise by integrating correlated neighbor pixels. Sparse autoencoders seek a better representation of features to match the classifier, and training labels are added to fine-tune the parameters of the networks. Morphological smoothing removes the isolated points of the classification map. The whole network is designed ingeniously, and each part contributes to the classification accuracy. Experiments on a TerraSAR-X image demonstrate that the DCAE network can extract efficient features and achieve better classification results than some related algorithms.