
Showing papers on "Contextual image classification" published in 2014


Posted Content
TL;DR: A series of experiments conducted for different recognition tasks using the publicly available code and model of the OverFeat network which was trained to perform object classification on ILSVRC13 suggest that features obtained from deep learning with convolutional nets should be the primary candidate in most visual recognition tasks.
Abstract: Recent results indicate that the generic descriptors extracted from the convolutional neural networks are very powerful. This paper adds to the mounting evidence that this is indeed the case. We report on a series of experiments conducted for different recognition tasks using the publicly available code and model of the OverFeat network which was trained to perform object classification on ILSVRC13. We use features extracted from the OverFeat network as a generic image representation to tackle the diverse range of recognition tasks of object image classification, scene recognition, fine grained recognition, attribute detection and image retrieval applied to a diverse set of datasets. We selected these tasks and datasets as they gradually move further away from the original task and data the OverFeat network was trained to solve. Astonishingly, we report consistent superior results compared to the highly tuned state-of-the-art systems in all the visual classification tasks on various datasets. For instance retrieval, it consistently outperforms low memory footprint methods except for the sculptures dataset. The results are achieved using a linear SVM classifier (or L2 distance in case of retrieval) applied to a feature representation of size 4096 extracted from a layer in the net. The representations are further modified using simple augmentation techniques, e.g. jittering. The results strongly suggest that features obtained from deep learning with convolutional nets should be the primary candidate in most visual recognition tasks.
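
A minimal sketch of this recipe, assuming a pretrained torchvision AlexNet as a stand-in for OverFeat (which is not distributed with torchvision) and a dummy two-class dataset in place of a real target task:

```python
import numpy as np
import torch
import torchvision.models as models
from sklearn.svm import LinearSVC

# Pretrained AlexNet stands in for OverFeat; dropping the last layer exposes
# the 4096-d penultimate activation used as a generic image representation.
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.classifier = torch.nn.Sequential(*list(model.classifier.children())[:-1])
model.eval()

def extract_features(batch):
    """batch: (N, 3, 224, 224) preprocessed images -> (N, 4096) numpy array."""
    with torch.no_grad():
        return model(batch).numpy()

# Dummy two-class 'dataset' just to exercise the pipeline; in practice these
# would be the images and labels of the target recognition task.
images = torch.randn(20, 3, 224, 224)
labels = np.repeat([0, 1], 10)

features = extract_features(images)
svm = LinearSVC(C=1.0).fit(features, labels)
print(svm.score(features, labels))
```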

4,033 citations


Book ChapterDOI
06 Sep 2014
TL;DR: This work equips the networks with another pooling strategy, “spatial pyramid pooling”, to eliminate the above requirement, and develops a new network structure, called SPP-net, which can generate a fixed-length representation regardless of image size/scale.
Abstract: Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g. 224×224) input image. This requirement is “artificial” and may hurt the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with a more principled pooling strategy, “spatial pyramid pooling”, to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. By removing the fixed-size limitation, we can improve all CNN-based image classification methods in general. Our SPP-net achieves state-of-the-art accuracy on the datasets of ImageNet 2012, Pascal VOC 2007, and Caltech101.
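
The fixed-length property is easy to verify in a small sketch (PyTorch here, not the authors' implementation): adaptive max pooling at a few pyramid levels yields the same output dimensionality for any input feature-map size.

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """feature_map: (N, C, H, W) with arbitrary H, W.
    Returns (N, C * sum(l * l for l in levels)) regardless of H and W."""
    n, c = feature_map.shape[:2]
    pooled = [F.adaptive_max_pool2d(feature_map, output_size=l).view(n, -1)
              for l in levels]
    return torch.cat(pooled, dim=1)

# Two different input sizes map to the same output dimensionality (256 * 21).
print(spatial_pyramid_pool(torch.randn(1, 256, 13, 13)).shape)  # torch.Size([1, 5376])
print(spatial_pyramid_pool(torch.randn(1, 256, 20, 31)).shape)  # torch.Size([1, 5376])
```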

3,945 citations


Posted Content
TL;DR: This work brings together methods from DCNNs and probabilistic graphical models for addressing the task of pixel-level classification by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF).
Abstract: Deep Convolutional Neural Networks (DCNNs) have recently shown state of the art performance in high level vision tasks, such as image classification and object detection. This work brings together methods from DCNNs and probabilistic graphical models for addressing the task of pixel-level classification (also called "semantic image segmentation"). We show that responses at the final layer of DCNNs are not sufficiently localized for accurate object segmentation. This is due to the very invariance properties that make DCNNs good for high level tasks. We overcome this poor localization property of deep networks by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF). Qualitatively, our "DeepLab" system is able to localize segment boundaries at a level of accuracy which is beyond previous methods. Quantitatively, our method sets the new state-of-art at the PASCAL VOC-2012 semantic image segmentation task, reaching 71.6% IOU accuracy in the test set. We show how these results can be obtained efficiently: Careful network re-purposing and a novel application of the 'hole' algorithm from the wavelet community allow dense computation of neural net responses at 8 frames per second on a modern GPU.
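
The 'hole' algorithm mentioned above corresponds to what is now usually called dilated (atrous) convolution. A minimal PyTorch sketch, not the authors' code and without the CRF step, shows how the same 3x3 kernel gains a wider field of view without losing resolution:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 512, 64, 64)  # a feature map from an earlier layer

conv_dense = nn.Conv2d(512, 256, kernel_size=3, padding=1)               # ordinary conv
conv_atrous = nn.Conv2d(512, 256, kernel_size=3, padding=2, dilation=2)  # 'hole' conv

# Same number of parameters and same output resolution, but the atrous kernel
# samples the input on a sparser grid, enlarging its receptive field.
print(conv_dense(x).shape)   # torch.Size([1, 256, 64, 64])
print(conv_atrous(x).shape)  # torch.Size([1, 256, 64, 64])
```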

3,389 citations


Proceedings ArticleDOI
23 Jun 2014
TL;DR: In this paper, features extracted from the OverFeat network are used as a generic image representation to tackle the diverse range of recognition tasks of object image classification, scene recognition, fine grained recognition, attribute detection and image retrieval applied to a diverse set of datasets.
Abstract: Recent results indicate that the generic descriptors extracted from the convolutional neural networks are very powerful. This paper adds to the mounting evidence that this is indeed the case. We report on a series of experiments conducted for different recognition tasks using the publicly available code and model of the OverFeat network which was trained to perform object classification on ILSVRC13. We use features extracted from the OverFeat network as a generic image representation to tackle the diverse range of recognition tasks of object image classification, scene recognition, fine grained recognition, attribute detection and image retrieval applied to a diverse set of datasets. We selected these tasks and datasets as they gradually move further away from the original task and data the OverFeat network was trained to solve. Astonishingly, we report consistent superior results compared to the highly tuned state-of-the-art systems in all the visual classification tasks on various datasets. For instance retrieval, it consistently outperforms low memory footprint methods except for the sculptures dataset. The results are achieved using a linear SVM classifier (or L2 distance in case of retrieval) applied to a feature representation of size 4096 extracted from a layer in the net. The representations are further modified using simple augmentation techniques, e.g. jittering. The results strongly suggest that features obtained from deep learning with convolutional nets should be the primary candidate in most visual recognition tasks.

3,346 citations


Proceedings ArticleDOI
23 Jun 2014
TL;DR: This work designs a method to reuse layers trained on the ImageNet dataset to compute mid-level image representation for images in the PASCAL VOC dataset, and shows that despite differences in image statistics and tasks in the two datasets, the transferred representation leads to significantly improved results for object and action classification.
Abstract: Convolutional neural networks (CNN) have recently shown outstanding image classification performance in the large-scale visual recognition challenge (ILSVRC2012). The success of CNNs is attributed to their ability to learn rich mid-level image representations as opposed to hand-designed low-level features used in other image classification methods. Learning CNNs, however, amounts to estimating millions of parameters and requires a very large number of annotated image samples. This property currently prevents application of CNNs to problems with limited training data. In this work we show how image representations learned with CNNs on large-scale annotated datasets can be efficiently transferred to other visual recognition tasks with limited amount of training data. We design a method to reuse layers trained on the ImageNet dataset to compute mid-level image representation for images in the PASCAL VOC dataset. We show that despite differences in image statistics and tasks in the two datasets, the transferred representation leads to significantly improved results for object and action classification, outperforming the current state of the art on Pascal VOC 2007 and 2012 datasets. We also show promising results for object and action localization.
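
A hedged sketch of the transfer recipe, using a torchvision ResNet-18 as a stand-in backbone (the paper uses its own network and adaptation layers) and a dummy batch in place of PASCAL VOC images:

```python
import torch
import torch.nn as nn
import torchvision.models as models

num_target_classes = 20  # e.g., the PASCAL VOC object classes
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

for p in model.parameters():          # freeze the ImageNet-trained layers
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, num_target_classes)  # new task head

optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# One illustrative fine-tuning step on a dummy batch.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_target_classes, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```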

3,316 citations


Book ChapterDOI
TL;DR: SPP-Net as mentioned in this paper proposes a spatial pyramid pooling strategy, which can generate a fixed-length representation regardless of image size/scale, and achieves state-of-the-art performance in object detection.
Abstract: Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g., 224x224) input image. This requirement is "artificial" and may reduce the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with another pooling strategy, "spatial pyramid pooling", to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. Pyramid pooling is also robust to object deformations. With these advantages, SPP-net should in general improve all CNN-based image classification methods. On the ImageNet 2012 dataset, we demonstrate that SPP-net boosts the accuracy of a variety of CNN architectures despite their different designs. On the Pascal VOC 2007 and Caltech101 datasets, SPP-net achieves state-of-the-art classification results using a single full-image representation and no fine-tuning. The power of SPP-net is also significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This method avoids repeatedly computing the convolutional features. In processing test images, our method is 24-102x faster than the R-CNN method, while achieving better or comparable accuracy on Pascal VOC 2007. In ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014, our methods rank #2 in object detection and #3 in image classification among all 38 teams. This manuscript also introduces the improvement made for this competition.

2,304 citations


Posted Content
TL;DR: In this article, a recurrent neural network (RNN) model is proposed to extract information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution.
Abstract: Applying convolutional neural networks to large images is computationally expensive because the amount of computation scales linearly with the number of image pixels. We present a novel recurrent neural network model that is capable of extracting information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution. Like convolutional neural networks, the proposed model has a degree of translation invariance built-in, but the amount of computation it performs can be controlled independently of the input image size. While the model is non-differentiable, it can be trained using reinforcement learning methods to learn task-specific policies. We evaluate our model on several image classification tasks, where it significantly outperforms a convolutional neural network baseline on cluttered images, and on a dynamic visual control problem, where it learns to track a simple object without an explicit training signal for doing so.
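
A rough sketch of the glimpse idea on synthetic data, with a fixed location and a GRU core; the location network and the REINFORCE training described in the abstract are omitted, and all sizes are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def take_glimpse(image, loc, size=8):
    """image: (C, H, W); loc in [-1, 1]^2. Returns a size x size crop."""
    _, h, w = image.shape
    cy = int((loc[0] + 1) / 2 * (h - size))
    cx = int((loc[1] + 1) / 2 * (w - size))
    return image[:, cy:cy + size, cx:cx + size]

glimpse_encoder = nn.Linear(1 * 8 * 8, 128)   # encode the retina-like crop
core = nn.GRUCell(128, 256)                    # recurrent state over glimpses

image = torch.rand(1, 28, 28)                  # e.g., a cluttered digit image
state = torch.zeros(1, 256)
loc = torch.tensor([0.0, 0.0])                 # start at the image centre
for _ in range(6):                             # a fixed number of glimpses
    g = take_glimpse(image, loc).reshape(1, -1)
    state = core(F.relu(glimpse_encoder(g)), state)
    # a location network would emit the next loc here; kept fixed in this sketch
```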

2,107 citations


Proceedings Article
08 Dec 2014
TL;DR: A novel recurrent neural network model that is capable of extracting information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution is presented.
Abstract: Applying convolutional neural networks to large images is computationally expensive because the amount of computation scales linearly with the number of image pixels. We present a novel recurrent neural network model that is capable of extracting information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution. Like convolutional neural networks, the proposed model has a degree of translation invariance built-in, but the amount of computation it performs can be controlled independently of the input image size. While the model is non-differentiable, it can be trained using reinforcement learning methods to learn task-specific policies. We evaluate our model on several image classification tasks, where it significantly outperforms a convolutional neural network baseline on cluttered images, and on a dynamic visual control problem, where it learns to track a simple object without an explicit training signal for doing so.

1,649 citations


Journal ArticleDOI
TL;DR: In this article, the authors introduce attribute-based classification, where objects are identified based on a high-level description that is phrased in terms of semantic attributes, such as the object's color or shape.
Abstract: We study the problem of object recognition for categories for which we have no training examples, a task also called zero-data or zero-shot learning. This situation has hardly been studied in computer vision research, even though it occurs frequently; the world contains tens of thousands of different object classes, and image collections have been formed and suitably annotated for only a few of them. To tackle the problem, we introduce attribute-based classification: Objects are identified based on a high-level description that is phrased in terms of semantic attributes, such as the object's color or shape. Because the identification of each such property transcends the specific learning task at hand, the attribute classifiers can be prelearned independently, for example, from existing image data sets unrelated to the current task. Afterward, new classes can be detected based on their attribute representation, without the need for a new training phase. In this paper, we also introduce a new data set, Animals with Attributes, of over 30,000 images of 50 animal classes, annotated with 85 semantic attributes. Extensive experiments on this and two more data sets show that attribute-based classification indeed is able to categorize images without access to any training images of the target classes.
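
A toy sketch of the attribute-based (direct attribute prediction) idea on synthetic data: per-attribute classifiers are trained on images of seen classes, and an unseen class is recognized by matching predicted attribute probabilities against its known attribute signature. All data and signatures below are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_attr, dim = 5, 20
X_train = rng.normal(size=(200, dim))                  # features of seen-class images
A_train = (rng.random(size=(200, n_attr)) > 0.5) * 1   # per-image attribute labels

attr_clfs = [LogisticRegression(max_iter=200).fit(X_train, A_train[:, a])
             for a in range(n_attr)]

# Attribute signatures of two *unseen* classes, e.g. from expert annotation.
class_signatures = np.array([[1, 0, 1, 1, 0],
                             [0, 1, 0, 1, 1]])

def zero_shot_predict(x):
    p = np.array([clf.predict_proba(x[None])[0, 1] for clf in attr_clfs])
    # log-likelihood of each unseen class's signature under the predicted attributes
    scores = class_signatures @ np.log(p + 1e-9) + (1 - class_signatures) @ np.log(1 - p + 1e-9)
    return int(scores.argmax())

print(zero_shot_predict(rng.normal(size=dim)))
```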

1,559 citations


Book ChapterDOI
06 Sep 2014
TL;DR: A novel method to mine discriminative parts using Random Forests (RF), which allows us to mine for parts simultaneously for all classes and to share knowledge among them, and compares well to other state-of-the-art component-based classification methods.
Abstract: In this paper we address the problem of automatically recognizing pictured dishes. To this end, we introduce a novel method to mine discriminative parts using Random Forests (RF), which allows us to mine for parts simultaneously for all classes and to share knowledge among them. To improve efficiency of mining and classification, we only consider patches that are aligned with image superpixels, which we call components. To measure the performance of our RF component mining for food recognition, we introduce a novel and challenging dataset of 101 food categories, with 101,000 images. With an average accuracy of 50.76%, our model outperforms alternative classification methods except for CNN, including SVM classification on Improved Fisher Vectors and existing discriminative part-mining algorithms by 11.88% and 8.13%, respectively. On the challenging MIT-Indoor dataset, our method compares well to other state-of-the-art component-based classification methods.

1,216 citations


Journal ArticleDOI
TL;DR: PCANet as discussed by the authors is a simple deep learning network for image classification which comprises only the very basic data processing components: cascaded principal component analysis (PCA), binary hashing, and block-wise histograms.
Abstract: In this work, we propose a very simple deep learning network for image classification which comprises only the very basic data processing components: cascaded principal component analysis (PCA), binary hashing, and block-wise histograms. In the proposed architecture, PCA is employed to learn multistage filter banks. It is followed by simple binary hashing and block histograms for indexing and pooling. This architecture is thus named a PCA network (PCANet) and can be designed and learned extremely easily and efficiently. For comparison and better understanding, we also introduce and study two simple variations of the PCANet, namely the RandNet and LDANet. They share the same topology as the PCANet, but their cascaded filters are either selected randomly or learned from LDA. We have tested these basic networks extensively on many benchmark visual datasets for different tasks, such as LFW for face verification; MultiPIE, Extended Yale B, AR, and FERET for face recognition; and MNIST for handwritten digit recognition. Surprisingly, for all tasks, such a seemingly naive PCANet model is on par with state-of-the-art features, whether prefixed, highly hand-crafted, or carefully learned (by DNNs). Even more surprisingly, it sets new records for many classification tasks on the Extended Yale B, AR, and FERET datasets, and on MNIST variations. Additional experiments on other public datasets also demonstrate the potential of the PCANet serving as a simple but highly competitive baseline for texture classification and object recognition.
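
A single-stage sketch of the pipeline on a synthetic image, using NumPy/SciPy rather than the authors' code: PCA filters learned from local patches, binary hashing of the filter responses, and a block-wise histogram as the output feature.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
image = rng.random((32, 32))
k, n_filters = 7, 8

# 1) Learn PCA filters from mean-removed k x k patches.
patches = sliding_window_view(image, (k, k)).reshape(-1, k * k)
patches = patches - patches.mean(axis=1, keepdims=True)
_, _, vt = np.linalg.svd(patches, full_matrices=False)
filters = vt[:n_filters].reshape(n_filters, k, k)      # leading principal directions

# 2) Convolve, binarize, and hash the binary maps into one integer code map.
responses = np.stack([convolve2d(image, f, mode="same") for f in filters])
binary = (responses > 0).astype(np.int64)
codes = sum(binary[i] << i for i in range(n_filters))   # values in [0, 256)

# 3) Block-wise histograms of the codes form the final feature vector.
blocks = codes.reshape(4, 8, 4, 8).transpose(0, 2, 1, 3).reshape(-1, 64)
feature = np.concatenate([np.bincount(b, minlength=256) for b in blocks])
print(feature.shape)  # (4096,)
```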

Posted Content
TL;DR: This paper achieves 16-24 times compression of the network with only 1% loss of classification accuracy using the state-of-the-art CNN, and finds that, for compressing the most storage-demanding densely connected layers, vector quantization methods have a clear gain over existing matrix factorization methods.
Abstract: Deep convolutional neural networks (CNNs) have become the most promising method for object recognition, repeatedly demonstrating record-breaking results for image classification and object detection in recent years. However, a very deep CNN generally involves many layers with millions of parameters, making the network model extremely large to store. This prohibits the usage of deep CNNs on resource-limited hardware, especially cell phones or other embedded devices. In this paper, we tackle this model storage issue by investigating information-theoretical vector quantization methods for compressing the parameters of CNNs. In particular, we have found that, for compressing the most storage-demanding densely connected layers, vector quantization methods have a clear gain over existing matrix factorization methods. Simply applying k-means clustering to the weights or conducting product quantization can lead to a very good balance between model size and recognition accuracy. For the 1000-category classification task in the ImageNet challenge, we are able to achieve 16-24 times compression of the network with only 1% loss of classification accuracy using the state-of-the-art CNN.
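
A sketch of the simplest variant, scalar k-means quantization of a dense layer's weights on a synthetic matrix (MiniBatchKMeans is used here for speed; the product quantization that reaches the higher ratios reported above is not shown):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
W = rng.normal(scale=0.05, size=(1024, 512)).astype(np.float32)  # a dense layer's weights

k = 256
km = MiniBatchKMeans(n_clusters=k, n_init=3, random_state=0).fit(W.reshape(-1, 1))
codebook = km.cluster_centers_.ravel()
indices = km.labels_.astype(np.uint8).reshape(W.shape)   # one byte per weight
W_quantized = codebook[indices]

original_bytes = W.size * 4                    # float32 storage
compressed_bytes = indices.size + k * 4        # 8-bit indices plus the codebook
print(original_bytes / compressed_bytes)        # about 4x for scalar quantization
print(np.abs(W - W_quantized).mean())           # mean quantization error
```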

Book ChapterDOI
06 Sep 2014
TL;DR: Multi-scale orderless pooling (MOP-CNN) as discussed by the authors extracts CNN activations for local patches at multiple scale levels, performs orderless VLAD pooling of these activations at each level separately, and concatenates the result.
Abstract: Deep convolutional neural networks (CNN) have shown their promise as a universal representation for recognition. However, global CNN activations lack geometric invariance, which limits their robustness for classification and matching of highly variable scenes. To improve the invariance of CNN activations without degrading their discriminative power, this paper presents a simple but effective scheme called multi-scale orderless pooling (MOP-CNN). This scheme extracts CNN activations for local patches at multiple scale levels, performs orderless VLAD pooling of these activations at each level separately, and concatenates the result. The resulting MOP-CNN representation can be used as a generic feature for either supervised or unsupervised recognition tasks, from image classification to instance-level retrieval; it consistently outperforms global CNN activations without requiring any joint training of prediction layers for a particular target dataset. In absolute terms, it achieves state-of-the-art results on the challenging SUN397 and MIT Indoor Scenes classification datasets, and competitive results on ILSVRC2012/2013 classification and INRIA Holidays retrieval datasets.
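
A sketch of the orderless VLAD pooling step on synthetic descriptors (standing in for the CNN activations of local patches); this is not the authors' implementation, and the codebook size is arbitrary:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(500, 64))   # e.g., activations of 500 local patches
k = 10
centres = KMeans(n_clusters=k, n_init=4, random_state=0).fit(descriptors).cluster_centers_

def vlad(desc, centres):
    # assign each descriptor to its nearest centre and accumulate the residuals
    assign = np.argmin(((desc[:, None, :] - centres[None]) ** 2).sum(-1), axis=1)
    v = np.zeros_like(centres)
    for i, c in enumerate(centres):
        if np.any(assign == i):
            v[i] = (desc[assign == i] - c).sum(axis=0)
    v = np.sign(v) * np.sqrt(np.abs(v))        # power normalization
    return (v / (np.linalg.norm(v) + 1e-12)).ravel()

print(vlad(descriptors, centres).shape)        # (640,) = k * descriptor dimension
```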

Proceedings ArticleDOI
TL;DR: In this article, given image and class embeddings, they learn a compatibility function such that matching embedding are assigned a higher score than mismatching ones; zero-shot classification of an image proceeds by finding the label yielding the highest joint compatibility score.
Abstract: Image classification has advanced significantly in recent years with the availability of large-scale image sets. However, fine-grained classification remains a major challenge due to the annotation cost of large numbers of fine-grained categories. This project shows that compelling classification performance can be achieved on such categories even without labeled training data. Given image and class embeddings, we learn a compatibility function such that matching embeddings are assigned a higher score than mismatching ones; zero-shot classification of an image proceeds by finding the label yielding the highest joint compatibility score. We use state-of-the-art image features and focus on different supervised attributes and unsupervised output embeddings either derived from hierarchies or learned from unlabeled text corpora. We establish a substantially improved state-of-the-art on the Animals with Attributes and Caltech-UCSD Birds datasets. Most encouragingly, we demonstrate that purely unsupervised output embeddings (learned from Wikipedia and improved with fine-grained text) achieve compelling results, even outperforming the previous supervised state-of-the-art. By combining different output embeddings, we further improve results.
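
A minimal sketch of the compatibility rule with a random matrix standing in for the learned one; in the paper the matrix is trained with a ranking objective on seen classes, and the class embeddings come from attributes, hierarchies, or text.

```python
import numpy as np

rng = np.random.default_rng(0)
img_dim, emb_dim, n_classes = 128, 50, 4

W = rng.normal(size=(img_dim, emb_dim))                    # compatibility matrix (learned in practice)
class_embeddings = rng.normal(size=(n_classes, emb_dim))   # attribute/word-vector embedding per class

def predict(image_feature):
    # F(x, y) = theta(x)^T W phi(y); pick the class with the highest score
    scores = image_feature @ W @ class_embeddings.T
    return int(scores.argmax())

print(predict(rng.normal(size=img_dim)))
```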

Journal ArticleDOI
TL;DR: Experimental results demonstrate that the proposed edge-preserving filtering based classification method can improve the classification accuracy significantly in a very short time and can be easily applied in real applications.
Abstract: The integration of spatial context in the classification of hyperspectral images is known to be an effective way in improving classification accuracy. In this paper, a novel spectral-spatial classification framework based on edge-preserving filtering is proposed. The proposed framework consists of the following three steps. First, the hyperspectral image is classified using a pixelwise classifier, e.g., the support vector machine classifier. Then, the resulting classification map is represented as multiple probability maps, and edge-preserving filtering is conducted on each probability map, with the first principal component or the first three principal components of the hyperspectral image serving as the gray or color guidance image. Finally, according to the filtered probability maps, the class of each pixel is selected based on the maximum probability. Experimental results demonstrate that the proposed edge-preserving filtering based classification method can improve the classification accuracy significantly in a very short time. Thus, it can be easily applied in real applications.
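
A sketch of the three-step pipeline on a small synthetic cube, with OpenCV's guided filter (from opencv-contrib) standing in for the edge-preserving filter and the first principal component as the guidance image; labels and sizes are invented for illustration.

```python
import numpy as np
import cv2
from sklearn.decomposition import PCA
from sklearn.svm import SVC

rng = np.random.default_rng(0)
h, w, bands, n_classes = 40, 40, 30, 3
cube = rng.random((h, w, bands)).astype(np.float32)
train_idx = rng.choice(h * w, size=120, replace=False)
train_lab = rng.integers(0, n_classes, size=120)

# 1) Pixelwise SVM classification, kept as per-class probability maps.
pixels = cube.reshape(-1, bands)
svm = SVC(probability=True).fit(pixels[train_idx], train_lab)
prob_maps = svm.predict_proba(pixels).reshape(h, w, n_classes)

# 2) Edge-preserving filtering of each probability map, guided by the first PC.
guide = PCA(n_components=1).fit_transform(pixels).reshape(h, w).astype(np.float32)
filtered = np.stack([cv2.ximgproc.guidedFilter(guide, prob_maps[..., c].astype(np.float32), 4, 1e-2)
                     for c in range(n_classes)], axis=-1)

# 3) Each pixel takes the class with the maximum filtered probability.
classification_map = filtered.argmax(axis=-1)
print(classification_map.shape)  # (40, 40)
```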

Journal ArticleDOI
TL;DR: In this paper, the authors focus on the challenging problem of hyperspectral image classification, which has recently gained in popularity and attracted the interest of other scientific disciplines such as machine learning, image processing, and computer vision.
Abstract: The technological evolution of optical sensors over the last few decades has provided remote sensing analysts with rich spatial, spectral, and temporal information. In particular, the increase in spectral resolution of hyperspectral images (HSIs) and infrared sounders opens the doors to new application domains and poses new methodological challenges in data analysis. HSIs allow the characterization of objects of interest (e.g., land-cover classes) with unprecedented accuracy and keep inventories up to date. Improvements in spectral resolution have called for advances in signal processing and exploitation algorithms. This article focuses on the challenging problem of hyperspectral image classification, which has recently gained in popularity and attracted the interest of other scientific disciplines such as machine learning, image processing, and computer vision. In the remote sensing community, the term classification is used to denote the process that assigns single pixels to a set of classes, while the term segmentation is used for methods that aggregate pixels into objects, which are then assigned to a class.

Journal ArticleDOI
TL;DR: Comparative experimental results indicate that the proposed HDNN significantly outperforms the traditional DNN on vehicle detection, by dividing the maps of the last convolutional layer and the max-pooling layer of DNN into multiple blocks of variable receptive field sizes or max- pooling field sizes to enable the HDNN to extract variable-scale features.
Abstract: Detecting small objects such as vehicles in satellite images is a difficult problem. Many features (such as histogram of oriented gradients, local binary patterns, scale-invariant feature transform, etc.) have been used to improve the performance of object detection, but mostly in simple environments such as those on roads. Kembhavi et al. noted that no satisfactory accuracy had been achieved in complex environments such as the City of San Francisco. Deep convolutional neural networks (DNNs) can learn rich features from the training data automatically and have achieved state-of-the-art performance on many image classification databases. Though the DNN has shown robustness to distortion, it only extracts features of the same scale, and hence is insufficient to tolerate large scale variance of objects. In this letter, we present a hybrid DNN (HDNN), by dividing the maps of the last convolutional layer and the max-pooling layer of the DNN into multiple blocks of variable receptive field sizes or max-pooling field sizes, to enable the HDNN to extract variable-scale features. Comparative experimental results indicate that our proposed HDNN significantly outperforms the traditional DNN on vehicle detection.

Journal ArticleDOI
TL;DR: Comprehensive evaluations on two remote sensing image databases and comparisons with some state-of-the-art approaches demonstrate the effectiveness and superiority of the developed framework.
Abstract: The rapid development of remote sensing technology has facilitated the acquisition of remote sensing images with higher and higher spatial resolution, but how to automatically understand the image contents is still a big challenge. In this paper, we develop a practical and rotation-invariant framework for multi-class geospatial object detection and geographic image classification based on a collection of part detectors (COPD). The COPD is composed of a set of representative and discriminative part detectors, where each part detector is a linear support vector machine (SVM) classifier used for the detection of objects or recurring spatial patterns within a certain range of orientation. Specifically, when performing multi-class geospatial object detection, we learn a set of seed-based part detectors where each part detector corresponds to a particular viewpoint of an object class, so the collection of them provides a solution for rotation-invariant detection of multi-class objects. When performing geographic image classification, we utilize a large number of pre-trained part detectors to discover distinctive visual parts from images and use them as attributes to represent the images. Comprehensive evaluations on two remote sensing image databases and comparisons with some state-of-the-art approaches demonstrate the effectiveness and superiority of the developed framework.

Proceedings ArticleDOI
01 Jan 2014
TL;DR: A customized Convolutional Neural Network (CNN) with a shallow convolution layer is designed to classify lung image patches with interstitial lung disease, and the same architecture can be generalized to perform other medical image or texture classification tasks.
Abstract: Image patch classification is an important task in many different medical imaging applications. In this work, we have designed a customized Convolutional Neural Network (CNN) with a shallow convolution layer to classify lung image patches with interstitial lung disease (ILD). While many feature descriptors have been proposed over the past years, they can be quite complicated and domain-specific. Our customized CNN framework can, on the other hand, automatically and efficiently learn the intrinsic image features from lung image patches that are most suitable for the classification purpose. The same architecture can be generalized to perform other medical image or texture classification tasks.
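
A sketch of such a shallow patch-classification CNN in PyTorch; the patch size, channel counts, and number of classes below are placeholders rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PatchCNN(nn.Module):
    def __init__(self, n_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5), nn.ReLU(),  # single shallow conv layer
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 14 * 14, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = PatchCNN()
patches = torch.randn(8, 1, 32, 32)   # a batch of grey-level image patches
print(model(patches).shape)            # torch.Size([8, 5])
```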

Posted Content
TL;DR: The creation of this benchmark dataset and the advances in object recognition that have been possible as a result are described, and the state-of-the-art computer vision accuracy with human accuracy is compared.
Abstract: The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.

Journal ArticleDOI
TL;DR: The proposed FDDL model is extensively evaluated on various image datasets, and it shows superior performance to many state-of-the-art dictionary learning methods in a variety of classification tasks.
Abstract: The employed dictionary plays an important role in sparse representation or sparse coding based image reconstruction and classification, while learning dictionaries from the training data has led to state-of-the-art results in image classification tasks. However, many dictionary learning models exploit only the discriminative information in either the representation coefficients or the representation residual, which limits their performance. In this paper we present a novel dictionary learning method based on the Fisher discrimination criterion. A structured dictionary, whose atoms have correspondences to the subject class labels, is learned, with which not only the representation residual can be used to distinguish different classes, but also the representation coefficients have small within-class scatter and big between-class scatter. The classification scheme associated with the proposed Fisher discrimination dictionary learning (FDDL) model is consequently presented by exploiting the discriminative information in both the representation residual and the representation coefficients. The proposed FDDL model is extensively evaluated on various image datasets, and it shows superior performance to many state-of-the-art dictionary learning methods in a variety of classification tasks.

Journal ArticleDOI
TL;DR: The plain DBN-based model gives a call-routing classification accuracy that is equal to the best of the other models, however, using additional unlabeled data for DBN pre-training and combining Dbn-based learned features with the original features provides significant gains over SVMs, which, in turn, performed better than both MaxEnt and Boosting.
Abstract: Applications of Deep Belief Nets (DBN) to various problems have been the subject of a number of recent studies ranging from image classification and speech recognition to audio classification. In this study we apply DBNs to a natural language understanding problem. The recent surge of activity in this area was largely spurred by the development of a greedy layer-wise pretraining method that uses an efficient learning algorithm called Contrastive Divergence (CD). CD allows DBNs to learn a multi-layer generative model from unlabeled data and the features discovered by this model are then used to initialize a feed-forward neural network which is fine-tuned with backpropagation. We compare a DBN-initialized neural network to three widely used text classification algorithms: Support Vector Machines (SVM), boosting and Maximum Entropy (MaxEnt). The plain DBN-based model gives a call-routing classification accuracy that is equal to the best of the other models. However, using additional unlabeled data for DBN pre-training and combining DBN-based learned features with the original features provides significant gains over SVMs, which, in turn, performed better than both MaxEnt and Boosting.

Journal ArticleDOI
TL;DR: The proposed aerial scene classification method can be highly effective in developing a detection system that can be used to automatically scan large-scale high-resolution satellite imagery for detecting large facilities such as a shopping mall.
Abstract: The rich data provided by high-resolution satellite imagery allow us to directly model aerial scenes by understanding their spatial and structural patterns. While pixel- and object-based classification approaches are widely used for satellite image analysis, often these approaches exploit the high-fidelity image data in a limited way. In this paper, we explore an unsupervised feature learning approach for scene classification. Dense low-level feature descriptors are extracted to characterize the local spatial patterns. These unlabeled feature measurements are exploited in a novel way to learn a set of basis functions. The low-level feature descriptors are encoded in terms of the basis functions to generate new sparse representation for the feature descriptors. We show that the statistics generated from the sparse features characterize the scene well producing excellent classification accuracy. We apply our technique to several challenging aerial scene data sets: ORNL-I data set consisting of 1-m spatial resolution satellite imagery with diverse sensor and scene characteristics representing five land-use categories, UCMERCED data set representing twenty one different aerial scene categories with sub-meter resolution, and ORNL-II data set for large-facility scene detection. Our results are highly promising and, on the UCMERCED data set we outperform the previous best results. We demonstrate that the proposed aerial scene classification method can be highly effective in developing a detection system that can be used to automatically scan large-scale high-resolution satellite imagery for detecting large facilities such as a shopping mall.

Journal ArticleDOI
TL;DR: In this article, the importance of incorporating spatio-contextual information in remote sensing image classification was highlighted, including texture extraction, Markov random fields (MRFs), image segmentation and object-based image analysis.
Abstract: This paper reviewed major remote sensing image classification techniques, including pixel-wise, sub-pixel-wise, and object-based image classification methods, and highlighted the importance of incorporating spatio-contextual information in remote sensing image classification. Further, this paper grouped spatio-contextual analysis techniques into three major categories: 1) texture extraction, 2) Markov random field (MRF) modeling, and 3) image segmentation and object-based image analysis. Finally, this paper argued for the necessity of developing geographic information analysis models for spatio-contextual classification using two case studies.

Proceedings ArticleDOI
08 Feb 2014
TL;DR: Various types of features and feature extraction techniques are discussed, with an explanation of which feature extraction technique works better in which scenario, with reference to character recognition applications.
Abstract: Features play a very important role in the area of image processing. Before features are obtained, various image preprocessing techniques such as binarization, thresholding, resizing, and normalization are applied to the sampled image. After that, feature extraction techniques are applied to obtain features that are useful in classifying and recognizing images. Feature extraction techniques are helpful in various image processing applications, e.g. character recognition. Since features define the behavior of an image, they determine the storage required, the efficiency of classification, and the time consumed. In this paper, we discuss various types of features and feature extraction techniques, and explain which feature extraction technique works better in which scenario, with reference to character recognition applications.

Journal ArticleDOI
TL;DR: Experimental results show that the proposed kernel-based feature selection method with a criterion that is an integration of the previous work and the linear combination of features improves the classification performance of the SVM.
Abstract: Hyperspectral imaging fully portrays materials through numerous and contiguous spectral bands. It is a very useful technique in various fields, including astronomy, medicine, food safety, forensics, and target detection. However, hyperspectral images include redundant measurements, and most classification studies encountered the Hughes phenomenon. Finding a small subset of effective features to model the characteristics of classes represented in the data for classification is a critical preprocessing step required to render a classifier effective in hyperspectral image classification. In our previous work, an automatic method for selecting the radial basis function (RBF) parameter (i.e., σ) for a support vector machine (SVM) was proposed. A criterion that contains the between-class and within-class information was proposed to measure the separability of the feature space with respect to the RBF kernel. Thereafter, the optimal RBF kernel parameter was obtained by optimizing the criterion. This study proposes a kernel-based feature selection method with a criterion that is an integration of the previous work and the linear combination of features. In this new method, two properties can be achieved according to the magnitudes of the coefficients being calculated: the small subset of features and the ranking of features. Experimental results on both one simulated dataset and two hyperspectral images (the Indian Pine Site dataset and the Pavia University dataset) show that the proposed method improves the classification performance of the SVM.

Journal ArticleDOI
TL;DR: An evaluation of different RapidEye bands using the two classifiers showed that incorporation of the red-edge band has a significant effect on the overall classification accuracy for vegetation cover types, indicating that the pursuit of high classification accuracy using high-spatial-resolution imagery on complex landscapes remains paramount.
Abstract: Mapping of patterns and spatial distribution of land-use/cover (LULC) has long been based on remotely sensed data. In the recent past, efforts to improve the reliability of LULC maps have seen a proliferation of image classification techniques. Despite these efforts, derived LULC maps are still often judged to be of insufficient quality for operational applications, due to disagreement between generated maps and reference data. In this study we sought to pursue two objectives: first, to test the new-generation multispectral RapidEye imagery classification output using machine-learning random forest (RF) and support vector machines (SVM) classifiers in a heterogeneous coastal landscape; and second, to determine the importance of different RapidEye bands on classification output. Accuracy of the derived thematic maps was assessed by computing confusion matrices of the classifiers’ cover maps with respective independent validation data sets. An overall classification accuracy of 93.07% with a kappa value of ...
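
A sketch of the comparison protocol with scikit-learn on synthetic pixels: the same training samples feed a random forest and an SVM, and each resulting map is assessed against held-out reference data with a confusion matrix, overall accuracy, and Cohen's kappa.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, cohen_kappa_score

rng = np.random.default_rng(0)
X = rng.random((1000, 5))            # five band values per pixel (synthetic)
y = rng.integers(0, 4, size=1000)    # four land-cover classes
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for name, clf in [("RF", RandomForestClassifier(n_estimators=200, random_state=0)),
                  ("SVM", SVC(kernel="rbf", C=10))]:
    pred = clf.fit(X_tr, y_tr).predict(X_te)
    print(name, accuracy_score(y_te, pred), cohen_kappa_score(y_te, pred))
    print(confusion_matrix(y_te, pred))
```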

Journal ArticleDOI
TL;DR: In this paper, the spectral information provided by the Landsat Thematic Mapper (TM) data set and the same classification scheme over Guangzhou City, China, was tested with two unsupervised and 13 supervised classification algorithms, including a number of machine learning algorithms.
Abstract: Although a large number of new image classification algorithms have been developed, they are rarely tested with the same classification task. In this research, with the same Landsat Thematic Mapper (TM) data set and the same classification scheme over Guangzhou City, China, we tested two unsupervised and 13 supervised classification algorithms, including a number of machine learning algorithms that became popular in remote sensing during the past 20 years. Our analysis focused primarily on the spectral information provided by the TM data. We assessed all algorithms in a per-pixel classification decision experiment and all supervised algorithms in a segment-based experiment. We found that when sufficiently representative training samples were used, most algorithms performed reasonably well. Lack of training samples led to greater classification accuracy discrepancies than classification algorithms themselves. Some algorithms were more tolerable to insufficient (less representative) training samples than others. Many algorithms improved the overall accuracy marginally with per-segment decision making.

Journal ArticleDOI
TL;DR: Experimental results verify that the proposed evolutionary learning methodology significantly outperforms many state-of-the-art hand-designed features and two feature learning techniques in terms of classification accuracy.
Abstract: Feature extraction is the first and most critical step in image classification. Most existing image classification methods use hand-crafted features, which are not adaptive for different image domains. In this paper, we develop an evolutionary learning methodology to automatically generate domain-adaptive global feature descriptors for image classification using multiobjective genetic programming (MOGP). In our architecture, a set of primitive 2-D operators are randomly combined to construct feature descriptors through the MOGP evolving and then evaluated by two objective fitness criteria, i.e., the classification error and the tree complexity. After the entire evolution procedure finishes, the best-so-far solution selected by the MOGP is regarded as the (near-)optimal feature descriptor obtained. To evaluate its performance, the proposed approach is systematically tested on the Caltech-101, the MIT urban and nature scene, the CMU PIE, and Jochen Triesch Static Hand Posture II data sets, respectively. Experimental results verify that our method significantly outperforms many state-of-the-art hand-designed features and two feature learning techniques in terms of classification accuracy.

Journal ArticleDOI
TL;DR: Compared with other hyperspectral classification methods, the proposed IFRF method shows outstanding performance in terms of classification accuracy and computational efficiency.
Abstract: Feature extraction is known to be an effective way in both reducing computational complexity and increasing accuracy of hyperspectral image classification. In this paper, a simple yet quite powerful feature extraction method based on image fusion and recursive filtering (IFRF) is proposed. First, the hyperspectral image is partitioned into multiple subsets of adjacent hyperspectral bands. Then, the bands in each subset are fused together by averaging, which is one of the simplest image fusion methods. Finally, the fused bands are processed with transform domain recursive filtering to get the resulting features for classification. Experiments are performed on different hyperspectral images, with the support vector machines (SVMs) serving as the classifier. By using the proposed method, the accuracy of the SVM classifier can be improved significantly. Furthermore, compared with other hyperspectral classification methods, the proposed IFRF method shows outstanding performance in terms of classification accuracy and computational efficiency.
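
A sketch of the IFRF steps on a synthetic cube: adjacent bands are averaged within subsets, each fused band is smoothed with OpenCV's domain-transform recursive filter (requires opencv-contrib), and an SVM classifies the filtered features. The cube, labels, and filter parameters are placeholders.

```python
import numpy as np
import cv2
from sklearn.svm import SVC

rng = np.random.default_rng(0)
h, w, bands, n_subsets = 40, 40, 60, 6
cube = rng.random((h, w, bands)).astype(np.float32)
band_per_subset = bands // n_subsets

# 1) Fuse each subset of adjacent bands by simple averaging.
fused = np.stack([cube[..., i * band_per_subset:(i + 1) * band_per_subset].mean(axis=-1)
                  for i in range(n_subsets)], axis=-1).astype(np.float32)

# 2) Edge-preserving recursive filtering of each fused band (guided by itself).
features = np.stack([cv2.ximgproc.dtFilter(f, f, 30, 0.2, mode=cv2.ximgproc.DTF_RF)
                     for f in np.moveaxis(fused, -1, 0)], axis=-1)

# 3) Pixelwise SVM classification on the filtered features.
train_idx = rng.choice(h * w, size=100, replace=False)
train_lab = rng.integers(0, 3, size=100)
svm = SVC(kernel="rbf").fit(features.reshape(-1, n_subsets)[train_idx], train_lab)
classification_map = svm.predict(features.reshape(-1, n_subsets)).reshape(h, w)
print(classification_map.shape)  # (40, 40)
```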