Remote Sens. 2017, 9, 67; doi:10.3390/rs9010067 www.mdpi.com/journal/remotesensing
Article
Spectral–Spatial Classification of Hyperspectral
Imagery with 3D Convolutional Neural Network
Ying Li 1,*, Haokui Zhang 1 and Qiang Shen 2

1 School of Computer Science, Northwestern Polytechnical University, Shaanxi, Xi’an 710129, China;
hkzhang1991@mail.nwpu.edu.cn
2 Department of Computer Science, Institute of Mathematics, Physics and Computer Science,
Aberystwyth University, SY23 3DB Aberystwyth, UK; qqs@aber.ac.uk
* Correspondence: lybyp@nwpu.edu.cn; Tel.: +138-9143-3893

Academic Editors: Gonzalo Pajares Martinsanz and Prasad S. Thenkabail
Received: 17 September 2016; Accepted: 9 January 2017; Published: date
Abstract: Recent research has shown that using spectral–spatial information can considerably
improve the performance of hyperspectral image (HSI) classification. HSI data is typically presented
in the format of 3D cubes. Thus, 3D spatial filtering naturally offers a simple and effective method
for simultaneously extracting the spectral–spatial features within such images. In this paper, a 3D
convolutional neural network (3D-CNN) framework is proposed for accurate HSI classification.
The proposed method views the HSI cube data altogether without relying on any preprocessing or
post-processing, extracting the deep spectral–spatial-combined features effectively. In addition, it
requires fewer parameters than other deep learning-based methods. Thus, the model is lighter, less
likely to over-fit, and easier to train. For comparison and validation, we test the proposed method
along with three other deep learning-based HSI classification methods, namely stacked
autoencoder (SAE), deep belief network (DBN), and 2D-CNN-based methods, on three real-world
HSI datasets captured by different sensors. Experimental results demonstrate that our
3D-CNN-based method outperforms these state-of-the-art methods and sets a new record.
Keywords: hyperspectral image classification; deep learning; 2D convolutional neural networks;
3D convolutional neural networks; 3D structure
1. Introduction
By capturing digital images in hundreds of continuous narrow spectral bands spanning the
visible to infrared wavelengths, hyperspectral remote sensors produce 3D hyperspectral imagery
(HSI) containing both spectral and spatial information. The rich spectral information of HSI is
powerful, and has been widely employed in a range of successful applications in agriculture [1],
environmental sciences [2], wild-land fire tracking, and biological threat detection [3]. Classification
of each pixel in HSI plays a crucial role in these applications. Thus, a large number of HSI
classification methods have been proposed over the recent decades.
Conventional HSI classification methods are often based only on spectral information. Typical
classifiers include those based on distance measure [4], k-nearest-neighbors [5], maximum likelihood
criterion [6], and logistic regression [7]. The classification accuracy of these methods is usually
unsatisfactory due to the well-known “small-sample problem”: a sufficient number of training
samples may not be available for the high number of spectral bands. This imbalance between the high
dimensionality of spectral bands and the limited number of training samples is known as the Hughes
phenomenon [8]. Spectral redundancy is also observed, as certain spectral bands of hyperspectral
data can be highly correlated. Furthermore, classification algorithms exploiting only the spectral
information fail to capture the important spatial variability perceived for high-resolution data,
generally resulting in lower performance. To improve classification performance, an intuitive idea is
to design classifiers using both spectral and spatial information, incorporating the spatial structure
into the pixel-level classifiers. Spatial information provides additional discriminant information
related to the shape and size of different structures, which, if properly exploited, leads to more
accurate classification maps [9].
Spectralspatial classification methods can be generally divided into two categories. The first
exploits the spectral and spatial contextual information separately. In other words, the spatial
dependence is extracted in advance through various spatial filters, such as morphological profiles
[10–12], entropy [13], attribute profiles [14], and low-rank representation [15,16]. Then, these
transformed spatial features are combined with the spectral features, where dimensionality reduction
(DR) may be applied (when appropriate) to perform pixel-wise classification. One can also use spatial
information to refine the classification results through a regularization process such as Markov
random field (MRF) [17] and graph cut [18] at the post-processing stages. In addition, optimization
approaches, including Hopfield neural networks [19] or simulated annealing [20,21], have been
adopted to capture both spatial and spectral information on remote sensing images. The second
category usually conjunctively fuses spatial information with spectral features to produce joint
features [22]. For example, a series of 3D wavelet filters [23], 3D Gabor filters [24], or 3D scattering
wavelet filters [25] generated at different scales and frequencies are applied to hyperspectral data to
extract spectral–spatial-combined features. Again, DR techniques may be utilized to extract
low-dimensional spectral–spatial features while preserving the discriminative information, such as tensor
discriminative locality alignment (TDLA)-based feature extraction [26] and sparse low-rank
approximation-based feature embedding [27]. Since HSI data are typically presented in 3D cubes, the
second type of approach can result in a large number of feature cubes containing important
information about local signal changes in space, spectrum, and joint spatial/spectral correlations,
which are essential for better performance.
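To give a concrete flavor of this second category, the sketch below builds a tiny bank of 3D Gabor-like filters in Python: a Gaussian envelope modulated by a cosine wave along a chosen spatial/spectral direction. This is a minimal illustration of the idea only; the kernel size, scales, frequencies, and orientations are assumptions rather than the filter designs used in [23–25].

```python
import numpy as np

def gabor_3d(size, sigma, freq, direction):
    """One 3D Gabor-like kernel: Gaussian envelope times a cosine
    carrier along a given (x, y, band) direction."""
    ax = np.arange(size) - size // 2
    x, y, z = np.meshgrid(ax, ax, ax, indexing="ij")
    u, v, w = direction / np.linalg.norm(direction)
    envelope = np.exp(-(x**2 + y**2 + z**2) / (2.0 * sigma**2))
    carrier = np.cos(2.0 * np.pi * freq * (u * x + v * y + w * z))
    return envelope * carrier

# A small bank over two (scale, frequency) pairs and a few orientations
# (values chosen for illustration only).
directions = [np.array(d, float) for d in
              [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)]]
bank = [gabor_3d(size=7, sigma=s, freq=f, direction=d)
        for s, f in [(1.5, 0.2), (3.0, 0.1)] for d in directions]

# Convolving an HSI cube of shape (rows, cols, bands) with each kernel,
# e.g. via scipy.ndimage.convolve(cube, bank[0], mode="nearest"),
# yields one spectral-spatial feature cube per filter.
```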
Most conventional feature extraction methods are, however, based on handcrafted features and
“shallow” learning models, highly relying on domain knowledge. Handcrafted features may fail to
address the need to consider the details embedded in the real data; it is challenging to achieve an
optimal balance between discriminability and robustness for many types of HSI data [28]. Most
recently, deep learning has emerged as the state-of-the-art machine learning technique with great
potential for HSI classification [28–33]. Instead of depending on shallow manually-engineered
features, deep learning techniques are able to automatically learn hierarchical features (from low-
level to high-level) from raw input data. Such learned features have achieved tremendous success in
many machine vision tasks. For example, Chen et al. applied unsupervised deep feature
learning, including stacked autoencoder (SAE) [28] and deep belief network (DBN) [31], for
spectral–spatial feature extraction and classification. While SAE and DBN can extract deep features
hierarchically in a layer-wise training fashion, the training samples composed of image patches have
to be flattened to one dimension in order to meet the input requirement of such models.
Unfortunately, the flattened training samples do not retain the same spatial information that the
original image may contain. Moreover, SAE and DBN are unsupervised, and do not directly make
use of the label information when learning the features. Zhao, Yue, Makantasis, and Liang et al. [32–35]
have utilized convolutional neural networks (CNN) for HSI classification, where the spatial features
are obtained by a 2D-CNN model by exploiting the first few principal component (PC) bands of the
original HSI data. The CNN-based models have the ability to detect local features that are shown to
be capable of achieving improved classification performance over the fully connected SAE and DBN
models of Chen et al. A drawback is that these methods work by firstly employing principal
component analysis (PCA) to reduce the HSI data to a manageable scale prior to the training of the
2D-CNN model. As the spatial features and spectral features are extracted separately, they may not
fully exploit the joint spatial/spectral correlation information, which can be important for
classification.
In this paper, we present a novel approach, introducing 3D-CNN into HSI classification. By
applying 3D kernels to 3D HSI, 3D-CNN can learn the local signal changes in both the spatial and the
spectral dimension of the feature cubes, exploiting important discrimination information for
classification. As the spectral features and the spatial features are extracted simultaneously, this work
takes full advantage of the structural characteristics of the 3D HSI data. Note that 3D-CNN has been
proposed in computer vision, mainly for video-based applications [36,37], to learn spatiotemporal
features. In particular, the 3D-CNN method developed in [36] applied a set of hardwired kernels to
generate multiple channels (denoted by gray, gradient-x, gradient-y, and so on) of information from
the input frames. In contrast, our proposed approach takes full spectral bands as inputs, and does
not require any preprocessing or post-processing. The resulting deep classifier model is trained in an
end-to-end fashion. At the same scale, our 3D-CNN involves fewer parameters than other deep
learning-based methods, making it more appropriate for HSI classification problems that typically
have limited access to training samples. We compare our 3D-CNN-based approach with the
aforementioned state-of-the-art deep learning based techniques on three real HSI datasets which
were captured by different remote sensors. Experimental results demonstrate that the proposed
approach outperforms the compared methods.
The remainder of this paper is organized as follows. Section 2 first provides an introduction to
the relevant background, and then presents our 3D-CNN-based HSI classification framework. We
describe the datasets and experimental setup in Section 3, and discuss the experimental results in
Section 4, empirically comparing the proposed method with three other deep learning-based HSI
classification approaches, namely SAE-LR (logistic regression) [29], DBN-LR [31], and 2D-CNN [33].
Finally, we summarize the work and conclude this paper in Section 5.
2. Proposed Method
In this section, we explain the basic operations of our 3D-CNN-based classification method in
detail, elaborate on how to train this network, and analyze what the 3D-CNN model extracts from
HSI.
2.1. 3D Convolution Operation
2D-CNN has demonstrated great promise in the field of computer vision and image
processing, with applications such as image classification [38–40], object detection [41,42], and depth
estimation from a single image [43]. The most significant advantage of 2D-CNN is that it offers a
principled way to extract features directly from the raw input imagery. However, directly applying
2D-CNN to HSI requires convolving every one of the network’s 2D inputs with its own set of
learnable kernels. The hundreds of channels along the spectral dimension (the network inputs) of
HSI therefore require a large number of kernels (parameters), making the model prone to over-fitting
and computationally expensive.
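To see the scale of this problem, consider the parameter count of a single convolutional layer. The short calculation below is purely illustrative (the band count, filter count, and kernel sizes are assumed, not taken from any of the cited networks): a 2D layer operating on the raw cube needs one kernel slice per spectral band, while a 3D kernel, as used later in this section, spans only a small spectral window.

```python
# Hypothetical sizes for illustration: 200 spectral bands,
# 32 output feature maps, 5x5 spatial kernels, depth-7 3D kernels.
bands, n_filters, k, d = 200, 32, 5, 7

# 2D convolution over the raw cube: every filter needs a 5x5 slice
# for each of the 200 input channels (plus one bias per filter).
params_2d = n_filters * (k * k * bands + 1)   # 160,032

# 3D convolution: each filter is a single 5x5x7 kernel that also
# slides along the spectral axis (plus one bias per filter).
params_3d = n_filters * (k * k * d + 1)       # 5,632

print(params_2d, params_3d)  # 160032 5632
```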
In order to deal with this problem, DR methods are usually applied to reduce the spectral
dimensionality prior to 2D-CNN being employed for feature extraction and classification [33–35]. For
instance, in [33], the first three principal components (PCs) are extracted from HSI by PCA, and then
a 2D-CNN is used to extract deep features from condensed HSI with a window size of 42 × 42 in order
to predict the label of each pixel. Randomized PCA (R-PCA) was also introduced along the spectral
dimension to compress the entire HSI in [34], with the first 10 or 30 PCs being retained. This was
carried out prior to the 2D-CNN being used to extract deep features from the compressed HSI (with
a window size of 5 × 5), and subsequently to complete the classification task. Furthermore, the
approach presented in [35] requires three computational steps: the high-level features are first
extracted by a 2D-CNN, where the entire HSI is whitened with the PCA algorithm, retaining the
top several bands; the sparse representation technique is then applied to further reduce the high-level
spatial features generated by the first step. Only after these two steps are classification results
obtained, based on the learned sparse dictionary. A clear disadvantage of these approaches is that they
do not preserve the spectral information well. To address this important issue, a more sophisticated
procedure for additional spectral feature extraction can be employed as reported in [32].
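As a reference point for these PCA-based pipelines, the following sketch condenses the spectral dimension of an HSI cube to its first few principal components, the step that precedes the 2D-CNN in [33–35]; the cube shape and the number of retained PCs here are illustrative assumptions.

```python
import numpy as np

def reduce_spectral_pca(cube, n_components=3):
    """Project each pixel's spectrum onto the top principal components.

    cube: HSI array of shape (rows, cols, bands).
    Returns an array of shape (rows, cols, n_components).
    """
    rows, cols, bands = cube.shape
    pixels = cube.reshape(-1, bands).astype(np.float64)
    pixels -= pixels.mean(axis=0)          # center each band
    # Eigendecomposition of the band-by-band covariance matrix.
    cov = np.cov(pixels, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return (pixels @ top).reshape(rows, cols, n_components)

# e.g., condense a 145 x 145 x 200 cube to three PC "bands", from which
# 2D patches around each pixel would then be fed to a 2D-CNN.
pcs = reduce_spectral_pca(np.random.rand(145, 145, 200), n_components=3)
```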
To take advantage of the capability of deep learning to automatically learn features, we
herein introduce 3D-CNN into HSI processing. 3D-CNN uses 3D kernels for the 3D convolution
operation, and can extract spatial features and spectral features simultaneously. Figure 1 illustrates
the key difference between the 2D convolution operation and the 3D convolution operation.
Figure 1. (a) 2D convolution operation, as per Formula (1). (b) 3D convolution operation, as per
Formula (2).
In the 2D convolution operation, input data is convolved with 2D kernels (see Figure 1a), before
going through the activation function to form the output data (i.e., feature maps). This operation can
be formulated as
$$v_{lj}^{xy} = f\left( \sum_{m} \sum_{h=0}^{H_l - 1} \sum_{w=0}^{W_l - 1} k_{ljm}^{hw} \, v_{(l-1)m}^{(x+h)(y+w)} + b_{lj} \right) \quad (1)$$

where $l$ indicates the layer that is considered, $j$ is the number of feature maps in this layer,
$v_{lj}^{xy}$ stands for the output at position $(x, y)$ on the $j$th feature map in the $l$th layer,
$b_{lj}$ is the bias, $f(\cdot)$ is the activation function, $m$ indexes over the set of feature maps
in the $(l-1)$th layer connected to the current feature map, and finally, $k_{ljm}^{hw}$ is the value
at position $(h, w)$ of the kernel connected to the $j$th feature map, with $H_l$ and $W_l$ being the
height and width of the kernel, respectively.
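A direct, unoptimized Python transcription of Formula (1) may help make the indexing concrete. ReLU is used here as a stand-in activation and the array shapes are illustrative; both are assumptions rather than choices stated in this paper.

```python
import numpy as np

def conv2d_layer(v_prev, kernels, biases, f=lambda a: np.maximum(a, 0.0)):
    """Formula (1): v_prev has shape (M, H_in, W_in) -- the M feature maps
    of layer l-1; kernels has shape (J, M, H_l, W_l); biases has shape (J,).
    Returns the J feature maps of layer l ('valid' convolution)."""
    J, M, H, W = kernels.shape
    _, H_in, W_in = v_prev.shape
    out = np.zeros((J, H_in - H + 1, W_in - W + 1))
    for j in range(J):
        for x in range(out.shape[1]):
            for y in range(out.shape[2]):
                # Sum over input maps m and kernel offsets (h, w),
                # add the bias, then apply the activation f.
                patch = v_prev[:, x:x + H, y:y + W]
                out[j, x, y] = f(np.sum(kernels[j] * patch) + biases[j])
    return out

# e.g., 3 input maps of 8x8 with 4 kernels of 3x3 -> 4 maps of 6x6
maps = conv2d_layer(np.random.rand(3, 8, 8),
                    np.random.rand(4, 3, 3, 3) - 0.5,
                    np.zeros(4))
```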
In conventional 2D-CNN, convolution operations are applied to the 2D feature maps that
capture features from the spatial dimension only. When applied to 3D data (e.g., for video analysis
[36]), it is desirable to capture features from both the spatial and temporal dimensions. To this end,
3D-CNN was proposed [36], where 3D convolution operations are applied to the 3D feature cubes in
an effort to compute spatiotemporal features from the 3D input data. Formally, the value at position
$(x, y, z)$ on the $j$th feature cube in the $l$th layer is given by:

$$v_{lj}^{xyz} = f\left( \sum_{m} \sum_{h=0}^{H_l - 1} \sum_{w=0}^{W_l - 1} \sum_{r=0}^{R_l - 1} k_{ljm}^{hwr} \, v_{(l-1)m}^{(x+h)(y+w)(z+r)} + b_{lj} \right) \quad (2)$$

where $R_l$ is the size of the 3D kernel along the spectral dimension, $j$ is the number of kernels in
this layer, and $k_{ljm}^{hwr}$ is the $(h, w, r)$th value of the kernel connected to the $m$th feature
cube in the preceding layer.
In our 3D-CNN-based HSI classification model, each feature cube is treated independently. Thus,
$m$ is set to 1 in Equation (2), and the 3D convolution operation can be (re-)formulated as

$$v_{lij}^{xyz} = f\left( \sum_{h=0}^{H_l - 1} \sum_{w=0}^{W_l - 1} \sum_{d=0}^{D_l - 1} k_{lj}^{hwd} \, v_{(l-1)i}^{(x+h)(y+w)(z+d)} + b_{lj} \right) \quad (3)$$

where $D_l$ is the spectral depth of the 3D kernel, $i$ is the number of feature cubes in the previous
layer, $j$ is the number of kernels in this layer, and $v_{lij}^{xyz}$ is the output at position
$(x, y, z)$ that results from convolving the $i$th feature cube of the preceding layer with the $j$th
kernel of this layer.
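Analogously, Formula (3) amounts to sliding a single 3D kernel over one feature cube along all three axes. The sketch below is a minimal transcription under the same assumptions as before (ReLU as the activation, illustrative shapes).

```python
import numpy as np

def conv3d_cube(v_prev_i, kernel, bias, f=lambda a: np.maximum(a, 0.0)):
    """Formula (3): convolve one feature cube of layer l-1 with one
    3D kernel. v_prev_i: (H_in, W_in, Z_in); kernel: (H_l, W_l, D_l)."""
    H, W, D = kernel.shape
    H_in, W_in, Z_in = v_prev_i.shape
    out = np.zeros((H_in - H + 1, W_in - W + 1, Z_in - D + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            for z in range(out.shape[2]):
                # Sum over spatial offsets (h, w) and spectral offset d,
                # add the bias, then apply the activation f.
                patch = v_prev_i[x:x + H, y:y + W, z:z + D]
                out[x, y, z] = f(np.sum(kernel * patch) + bias)
    return out

# e.g., a 9x9x32 cube with a 3x3x7 kernel -> a 7x7x26 feature cube;
# each of the j kernels in the layer produces one such cube per input cube.
cube = conv3d_cube(np.random.rand(9, 9, 32),
                   np.random.rand(3, 3, 7) - 0.5, 0.0)
```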
