Remote Sens. 2017, 9, 67; doi:10.3390/rs9010067 www.mdpi.com/journal/remotesensing
Article
Spectral–Spatial Classification of Hyperspectral
Imagery with 3D Convolutional Neural Network
Ying Li 1,*, Haokui Zhang 1 and Qiang Shen 2

1 School of Computer Science, Northwestern Polytechnical University, Shaanxi, Xi’an 710129, China;
hkzhang1991@mail.nwpu.edu.cn
2 Department of Computer Science, Institute of Mathematics, Physics and Computer Science,
Aberystwyth University, SY23 3DB Aberystwyth, UK; qqs@aber.ac.uk
* Correspondence: lybyp@nwpu.edu.cn; Tel.: +138-9143-3893

Academic Editors: Gonzalo Pajares Martinsanz and Prasad S. Thenkabail
Received: 17 September 2016; Accepted: 9 January 2017; Published: date
Abstract: Recent research has shown that using spectral–spatial information can considerably
improve the performance of hyperspectral image (HSI) classification. HSI data is typically presented
in the format of 3D cubes. Thus, 3D spatial filtering naturally offers a simple and effective method
for simultaneously extracting the spectral–spatial features within such images. In this paper, a 3D
convolutional neural network (3D-CNN) framework is proposed for accurate HSI classification.
The proposed method views the HSI cube data altogether without relying on any preprocessing or
post-processing, extracting the deep spectral–spatial-combined features effectively. In addition, it
requires fewer parameters than other deep learning-based methods. Thus, the model is lighter, less
likely to over-fit, and easier to train. For comparison and validation, we test the proposed method
along with three other deep learning-based HSI classification methods, namely stacked
autoencoder (SAE), deep belief network (DBN), and 2D-CNN-based methods, on three real-world
HSI datasets captured by different sensors. Experimental results demonstrate that our
3D-CNN-based method outperforms these state-of-the-art methods and sets a new record.
Keywords: hyperspectral image classification; deep learning; 2D convolutional neural networks;
3D convolutional neural networks; 3D structure
1. Introduction
By capturing digital images in hundreds of continuous narrow spectral bands spanning the
visible to infrared wavelengths, hyperspectral remote sensors produce 3D hyperspectral imagery
(HSI) containing both spectral and spatial information. The rich spectral information of HSI is
powerful, and has been widely employed in a range of successful applications in agriculture [1],
environmental sciences [2], wild-land fire tracking, and biological threat detection [3]. Classification
of each pixel in HSI plays a crucial role in these applications. Thus, a large number of HSI
classification methods have been proposed over the recent decades.
Conventional HSI classification methods are often based only on spectral information. Typical
classifiers include those based on distance measure [4], k-nearest-neighbors [5], maximum likelihood
criterion [6], and logistic regression [7]. The classification accuracy of these methods is usually
unsatisfactory due to the well-known “small-sample problem”: a sufficient number of training
samples may not be available for the high number of spectral bands. This imbalance between the high
dimensionality of spectral bands and the limited number of training samples is known as the Hughes
phenomenon [8]. Spectral redundancy is also observed, as certain spectral bands of hyperspectral
data can be highly correlated. Furthermore, classification algorithms exploiting only the spectral
information fail to capture the important spatial variability perceived for high-resolution data,
generally resulting in lower performance. To improve classification performance, an intuitive idea is
to design classifiers using both spectral and spatial information, incorporating the spatial structure
into the pixel-level classifiers. Spatial information provides additional discriminant information
related to the shape and size of different structures, which, if properly exploited, leads to more
accurate classification maps [9].
Spectralspatial classification methods can be generally divided into two categories. The first
exploits the spectral and spatial contextual information separately. In other words, the spatial
dependence is extracted in advance through various spatial filters, such as morphological profiles
[10–12], entropy [13], attribute profiles [14], and low-rank representation [15,16]. Then, these
transformed spatial features are combined with the spectral features, where dimensionality reduction
(DR) may be applied (when appropriate) to perform pixel-wise classification. One can also use spatial
information to refine the classification results through a regularization process such as Markov
random field (MRF) [17] and graph cut [18] at the post-processing stages. In addition, optimization
approaches, including Hopfield neural networks [19] or simulated annealing [20,21], have been
adopted to capture both spatial and spectral information on remote sensing images. The second
category usually conjunctively fuses spatial information with spectral features to produce joint
features [22]. For example, a series of 3D wavelet filters [23], 3D Gabor filters [24], or 3D scattering
wavelet filters [25] generated at different scales and frequencies are applied to hyperspectral data to
extract spectral–spatial-combined features. Again, DR techniques may be utilized to extract
low-dimensional spectral–spatial features while preserving the discriminative information, such as tensor
discriminative locality alignment (TDLA)-based feature extraction [26] and sparse low-rank
approximation-based feature embedding [27]. Since HSI data are typically presented in 3D cubes, the
second type of approach can result in a large number of feature cubes containing important
information about local signal changes in space, spectrum, and joint spatial/spectral correlations,
which are essential for better performance.
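To give a concrete flavor of this second category, the sketch below builds a tiny bank of 3D Gabor-like filters in Python: a Gaussian envelope modulated by a cosine wave along a chosen spatial/spectral direction. This is a minimal illustration of the idea only; the kernel size, scales, frequencies, and orientations are assumptions rather than the filter designs used in [23–25].

```python
import numpy as np

def gabor_3d(size, sigma, freq, direction):
    """One 3D Gabor-like kernel: Gaussian envelope times a cosine
    carrier along a given (x, y, band) direction."""
    ax = np.arange(size) - size // 2
    x, y, z = np.meshgrid(ax, ax, ax, indexing="ij")
    u, v, w = direction / np.linalg.norm(direction)
    envelope = np.exp(-(x**2 + y**2 + z**2) / (2.0 * sigma**2))
    carrier = np.cos(2.0 * np.pi * freq * (u * x + v * y + w * z))
    return envelope * carrier

# A small bank over two (scale, frequency) pairs and a few orientations
# (values chosen for illustration only).
directions = [np.array(d, float) for d in
              [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)]]
bank = [gabor_3d(size=7, sigma=s, freq=f, direction=d)
        for s, f in [(1.5, 0.2), (3.0, 0.1)] for d in directions]

# Convolving an HSI cube of shape (rows, cols, bands) with each kernel,
# e.g. via scipy.ndimage.convolve(cube, bank[0], mode="nearest"),
# yields one spectral-spatial feature cube per filter.
```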
Most conventional feature extraction methods are, however, based on handcrafted features and
“shallow” learning models, highly relying on domain knowledge. Handcrafted features may fail to
address the need to consider the details embedded in the real data; it is challenging to achieve an
optimal balance between discriminability and robustness for many types of HSI data [28]. Most
recently, deep learning has emerged as the state-of-the-art machine learning technique with great
potential for HSI classification [28–33]. Instead of depending on shallow manually-engineered
features, deep learning techniques are able to automatically learn hierarchical features (from low-
level to high-level) from raw input data. Such learned features have achieved tremendous success in
many machine vision tasks. For example, Chen et al. applied unsupervised deep feature
learning, including stacked autoencoder (SAE) [28] and deep belief network (DBN) [31], for
spectral–spatial feature extraction and classification. While SAE and DBN can extract deep features
hierarchically in a layer-wise training fashion, the training samples composed of image patches have
to be flattened to one dimension in order to meet the input requirement of such models.
Unfortunately, the flattened training samples do not retain the same spatial information that the
original image may contain. Moreover, SAE and DBN are unsupervised, and do not directly make
use of the label information when learning the features. Zhao, Yue, Makantasis, and Liang et al. [32–35]
have utilized convolutional neural networks (CNN) for HSI classification, where the spatial features
are obtained by a 2D-CNN model by exploiting the first few principal component (PC) bands of the
original HSI data. The CNN-based models have the ability to detect local features that are shown to
be capable of achieving improved classification performance over the fully connected SAE and DBN
models of Chen et al. A drawback is that these methods work by firstly employing principal
component analysis (PCA) to reduce the HSI data to a manageable scale prior to the training of the
2D-CNN model. As the spatial features and spectral features are extracted separately, they may not
fully exploit the joint spatial/spectral correlation information, which can be important for
classification.
In this paper, we present a novel approach, introducing 3D-CNN into HSI classification. By
applying 3D kernels to 3D HSI, 3D-CNN can learn the local signal changes in both the spatial and the
spectral dimension of the feature cubes, exploiting important discrimination information for
classification. As the spectral features and the spatial features are extracted simultaneously, this work
takes full advantage of the structural characteristics of the 3D HSI data. Note that 3D-CNN has been
proposed in computer vision, mainly for video-based applications [36,37], to learn spatiotemporal
features. In particular, the 3D-CNN method developed in [36] applied a set of hardwired kernels to
generate multiple channels (denoted by gray, gradient-x, gradient-y, and so on) of information from
the input frames. In contrast, our proposed approach takes full spectral bands as inputs, and does
not require any preprocessing or post-processing. The resulting deep classifier model is trained in an
end-to-end fashion. At the same scale, our 3D-CNN involves fewer parameters than other deep
learning-based methods, making it more appropriate for HSI classification problems that typically
have limited access to training samples. We compare our 3D-CNN-based approach with the
aforementioned state-of-the-art deep learning based techniques on three real HSI datasets which
were captured by different remote sensors. Experimental results demonstrate that the proposed
approach outperforms the compared methods.
The remainder of this paper is organized as follows. Section 2 first provides an introduction to
the relevant background, and then presents our 3D-CNN-based HSI classification framework. We
describe the datasets and experimental setup in Section 3, and discuss the experimental results in
Section 4, empirically comparing the proposed method with three other deep learning-based HSI
classification approaches, namely SAE-LR (logistic regression) [29], DBN-LR [31], and 2D-CNN [33].
Finally, we summarize the work and conclude this paper in Section 5.
2. Proposed Method
In this section, we explain the basic operations of our 3D-CNN-based classification method in
detail, elaborate on how to train this network, and analyze what the 3D-CNN model extracts from
HSI.
2.1. 3D Convolution Operation
2D-CNN has demonstrated great promise in the field of computer vision and image
processing, with applications such as image classification [38–40], object detection [41,42], and depth
estimation from a single image [43]. The most significant advantage of 2D-CNN is that it offers a
principled way to extract features directly from the raw input imagery. However, directly applying
2D-CNN to HSI requires convolving every one of the network’s 2D inputs with its own set of
learnable kernels. The hundreds of channels along the spectral dimension (the network inputs) of
HSI therefore require a large number of kernels (parameters), making the model prone to over-fitting
and computationally expensive.
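To see the scale of this problem, consider the parameter count of a single convolutional layer. The short calculation below is purely illustrative (the band count, filter count, and kernel sizes are assumed, not taken from any of the cited networks): a 2D layer operating on the raw cube needs one kernel slice per spectral band, while a 3D kernel, as used later in this section, spans only a small spectral window.

```python
# Hypothetical sizes for illustration: 200 spectral bands,
# 32 output feature maps, 5x5 spatial kernels, depth-7 3D kernels.
bands, n_filters, k, d = 200, 32, 5, 7

# 2D convolution over the raw cube: every filter needs a 5x5 slice
# for each of the 200 input channels (plus one bias per filter).
params_2d = n_filters * (k * k * bands + 1)   # 160,032

# 3D convolution: each filter is a single 5x5x7 kernel that also
# slides along the spectral axis (plus one bias per filter).
params_3d = n_filters * (k * k * d + 1)       # 5,632

print(params_2d, params_3d)  # 160032 5632
```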
In order to deal with this problem, DR methods are usually applied to reduce the spectral
dimensionality prior to 2D-CNN being employed for feature extraction and classification [33–35]. For
instance, in [33], the first three principal components (PCs) are extracted from HSI by PCA, and then
a 2D-CNN is used to extract deep features from condensed HSI with a window size of 42 × 42 in order
to predict the label of each pixel. Randomized PCA (R-PCA) was also introduced along the spectral
dimension to compress the entire HSI in [34], with the first 10 or 30 PCs being retained. This was
carried out prior to the 2D-CNN being used to extract deep features from the compressed HSI (with
a window size of 5 × 5), and subsequently to complete the classification task. Furthermore, the
approach presented in [35] requires three computational steps: the high-level features are first
extracted by a 2D-CNN, where the entire HSI is whitened with the PCA algorithm, retaining the
top several bands; the sparse representation technique is then applied to further reduce the high-level
spatial features generated by the first step. Only after these two steps are classification results
obtained, based on the learned sparse dictionary. A clear disadvantage of these approaches is that they
do not preserve the spectral information well. To address this important issue, a more sophisticated
procedure for additional spectral feature extraction can be employed as reported in [32].
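As a reference point for these PCA-based pipelines, the following sketch condenses the spectral dimension of an HSI cube to its first few principal components, the step that precedes the 2D-CNN in [33–35]; the cube shape and the number of retained PCs here are illustrative assumptions.

```python
import numpy as np

def reduce_spectral_pca(cube, n_components=3):
    """Project each pixel's spectrum onto the top principal components.

    cube: HSI array of shape (rows, cols, bands).
    Returns an array of shape (rows, cols, n_components).
    """
    rows, cols, bands = cube.shape
    pixels = cube.reshape(-1, bands).astype(np.float64)
    pixels -= pixels.mean(axis=0)          # center each band
    # Eigendecomposition of the band-by-band covariance matrix.
    cov = np.cov(pixels, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return (pixels @ top).reshape(rows, cols, n_components)

# e.g., condense a 145 x 145 x 200 cube to three PC "bands", from which
# 2D patches around each pixel would then be fed to a 2D-CNN.
pcs = reduce_spectral_pca(np.random.rand(145, 145, 200), n_components=3)
```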
To take advantage of the capability of deep learning to automatically learn features, we
herein introduce 3D-CNN into HSI processing. 3D-CNN uses 3D kernels for the 3D convolution
operation, and can extract spatial features and spectral features simultaneously. Figure 1 illustrates
the key difference between the 2D convolution operation and the 3D convolution operation.
Figure 1. (a) 2D convolution operation, as per Formula (1). (b) 3D convolution operation, as per
Formula (2).
In the 2D convolution operation, input data is convolved with 2D kernels (see Figure 1a), before
going through the activation function to form the output data (i.e., feature maps). This operation can
be formulated as
$$v_{lj}^{xy} = f\left( \sum_{m} \sum_{h=0}^{H_l - 1} \sum_{w=0}^{W_l - 1} k_{ljm}^{hw} \, v_{(l-1)m}^{(x+h)(y+w)} + b_{lj} \right) \quad (1)$$

where $l$ indicates the layer that is considered, $j$ is the number of feature maps in this layer,
$v_{lj}^{xy}$ stands for the output at position $(x, y)$ on the $j$th feature map in the $l$th layer,
$b_{lj}$ is the bias, $f(\cdot)$ is the activation function, $m$ indexes over the set of feature maps
in the $(l-1)$th layer connected to the current feature map, and finally, $k_{ljm}^{hw}$ is the value
at position $(h, w)$ of the kernel connected to the $j$th feature map, with $H_l$ and $W_l$ being the
height and width of the kernel, respectively.
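A direct, unoptimized Python transcription of Formula (1) may help make the indexing concrete. ReLU is used here as a stand-in activation and the array shapes are illustrative; both are assumptions rather than choices stated in this paper.

```python
import numpy as np

def conv2d_layer(v_prev, kernels, biases, f=lambda a: np.maximum(a, 0.0)):
    """Formula (1): v_prev has shape (M, H_in, W_in) -- the M feature maps
    of layer l-1; kernels has shape (J, M, H_l, W_l); biases has shape (J,).
    Returns the J feature maps of layer l ('valid' convolution)."""
    J, M, H, W = kernels.shape
    _, H_in, W_in = v_prev.shape
    out = np.zeros((J, H_in - H + 1, W_in - W + 1))
    for j in range(J):
        for x in range(out.shape[1]):
            for y in range(out.shape[2]):
                # Sum over input maps m and kernel offsets (h, w),
                # add the bias, then apply the activation f.
                patch = v_prev[:, x:x + H, y:y + W]
                out[j, x, y] = f(np.sum(kernels[j] * patch) + biases[j])
    return out

# e.g., 3 input maps of 8x8 with 4 kernels of 3x3 -> 4 maps of 6x6
maps = conv2d_layer(np.random.rand(3, 8, 8),
                    np.random.rand(4, 3, 3, 3) - 0.5,
                    np.zeros(4))
```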
In conventional 2D-CNN, convolution operations are applied to the 2D feature maps that
capture features from the spatial dimension only. When applied to 3D data (e.g., for video analysis
[36]), it is desirable to capture features from both the spatial and temporal dimensions. To this end,
3D-CNN was proposed [36], where 3D convolution operations are applied to the 3D feature cubes in
an effort to compute spatiotemporal features from the 3D input data. Formally, the value at position
$(x, y, z)$ on the $j$th feature cube in the $l$th layer is given by:

$$v_{lj}^{xyz} = f\left( \sum_{m} \sum_{h=0}^{H_l - 1} \sum_{w=0}^{W_l - 1} \sum_{r=0}^{R_l - 1} k_{ljm}^{hwr} \, v_{(l-1)m}^{(x+h)(y+w)(z+r)} + b_{lj} \right) \quad (2)$$

where $R_l$ is the size of the 3D kernel along the spectral dimension, $j$ is the number of kernels in
this layer, and $k_{ljm}^{hwr}$ is the $(h, w, r)$th value of the kernel connected to the $m$th feature
cube in the preceding layer.
In our 3D-CNN-based HSI classification model, each feature cube is treated independently. Thus,
$m$ is set to 1 in Equation (2), and the 3D convolution operation can be (re-)formulated as

$$v_{lij}^{xyz} = f\left( \sum_{h=0}^{H_l - 1} \sum_{w=0}^{W_l - 1} \sum_{d=0}^{D_l - 1} k_{lj}^{hwd} \, v_{(l-1)i}^{(x+h)(y+w)(z+d)} + b_{lj} \right) \quad (3)$$

where $D_l$ is the spectral depth of the 3D kernel, $i$ is the number of feature cubes in the previous
layer, $j$ is the number of kernels in this layer, and $v_{lij}^{xyz}$ is the output at position
$(x, y, z)$ that results from convolving the $i$th feature cube of the preceding layer with the $j$th
kernel of this layer.
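Analogously, Formula (3) amounts to sliding a single 3D kernel over one feature cube along all three axes. The sketch below is a minimal transcription under the same assumptions as before (ReLU as the activation, illustrative shapes).

```python
import numpy as np

def conv3d_cube(v_prev_i, kernel, bias, f=lambda a: np.maximum(a, 0.0)):
    """Formula (3): convolve one feature cube of layer l-1 with one
    3D kernel. v_prev_i: (H_in, W_in, Z_in); kernel: (H_l, W_l, D_l)."""
    H, W, D = kernel.shape
    H_in, W_in, Z_in = v_prev_i.shape
    out = np.zeros((H_in - H + 1, W_in - W + 1, Z_in - D + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            for z in range(out.shape[2]):
                # Sum over spatial offsets (h, w) and spectral offset d,
                # add the bias, then apply the activation f.
                patch = v_prev_i[x:x + H, y:y + W, z:z + D]
                out[x, y, z] = f(np.sum(kernel * patch) + bias)
    return out

# e.g., a 9x9x32 cube with a 3x3x7 kernel -> a 7x7x26 feature cube;
# each of the j kernels in the layer produces one such cube per input cube.
cube = conv3d_cube(np.random.rand(9, 9, 32),
                   np.random.rand(3, 3, 7) - 0.5, 0.0)
```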
