
Automatic Multi-Organ Segmentation on Abdominal CT With Dense V-Networks


Eli Gibson, Francesco Giganti§, Yipeng Hu, Ester Bonmati, Steve Bandula‖, Kurinchi Gurusamy, Brian Davidson, Stephen P. Pereira∗∗, Matthew J. Clarkson, and Dean C. Barratt

UCL Centre for Medical Image Computing, Department of Medical Physics & Biomedical Engineering, University College London, UK
§ Department of Radiology, University College Hospital Trust, UK
Division of Surgery and Interventional Science, University College London, UK
‖ UCL Centre for Medical Imaging, University College London, UK
∗∗ Institute for Liver and Digestive Health, University College London, UK
Wellcome / EPSRC Centre for Interventional and Surgical Sciences, University College London, UK
Abstract—Automatic segmentation of abdominal anatomy on
computed tomography (CT) images can support diagnosis, treat-
ment planning and treatment delivery workflows. Segmentation
methods using statistical models and multi-atlas label fusion
(MALF) require inter-subject image registrations which are
challenging for abdominal images, but alternative methods with-
out registration have not yet achieved higher accuracy for
most abdominal organs. We present a registration-free deep-
learning-based segmentation algorithm for eight organs that are
relevant for navigation in endoscopic pancreatic and biliary
procedures, including the pancreas, the GI tract (esophagus,
stomach, duodenum) and surrounding organs (liver, spleen, left
kidney, gallbladder). We directly compared the segmentation
accuracy of the proposed method to existing deep learning and
MALF methods in a cross-validation on a multi-centre data
set with 90 subjects. The proposed method yielded significantly
higher Dice scores for all organs and lower mean absolute
distances for most organs, including Dice scores of 0.78 vs. 0.71,
0.74 and 0.74 for the pancreas, 0.90 vs 0.85, 0.87 and 0.83 for
the stomach and 0.76 vs 0.68, 0.69 and 0.66 for the esophagus.
We conclude that deep-learning-based segmentation represents a
registration-free method for multi-organ abdominal CT segmen-
tation whose accuracy can surpass current methods, potentially
supporting image-guided navigation in gastrointestinal endoscopy
procedures.
Index Terms—Abdominal CT, Segmentation, Deep learning,
Pancreas, Gastrointestinal tract, Stomach, Duodenum, Esopha-
gus, Liver, Spleen, Kidney, Gallbladder
I. INTRODUCTION
Segmentation of organs in abdominal images can
support clinical workflows in multiple domains, including
diagnostic interventions, treatment planning and treatment
delivery. Organ segmentation is a crucially important step
for computer-assisted diagnostic and biomarker measurement
systems [1]. Segmentations of treatment volumes and organs-
at-risk are also central to planning radiation therapies [2].
More generally, segmentation-based patient-specific anatom-
ical models can support surgical planning and delivery via
intra-operative image-guidance systems [3].
Corresponding author: E. Gibson (email: eli.gibson@ucl.ac.uk). Copyright
(c) 2017 IEEE. Personal use of this material is permitted. However, permission
to use this material for any other purposes must be obtained from the IEEE
by sending a request to pubs-permissions@ieee.org.
In endoscopic pancreatobiliary procedures, an endoscope
is inserted orally and navigated through the gastrointestinal
tract to specific positions on the stomach or duodenal wall to
allow pancreatobiliary imaging and intervention. Due to the
small endoscopic field of view and the lack of visual orienta-
tion cues, this navigation task is challenging, particularly for
novice endoscopists [4]. Image-guidance showing registered
anatomical models would provide orientation and targeting
cues that are outside of the endoscopic field of view or
challenging to see on endoscopic images. To support targeting
and navigation, segmentations of multiple organs are needed:
the pancreas, gastrointestinal organs (esophagus, stomach and
duodenum), and nearby organs used as navigational landmarks
(liver, gallbladder, spleen and left kidney).
Manual segmentation of 3D abdominal images is labor-
intensive and impractical for most clinical workflows, moti-
vating (semi-)automated segmentation tools [2]. Research into
such tools has focused on computed tomography (CT), due to
its clinical prevalence, and on three methodologies: statistical
models (SM) [5], [6], multi-atlas label fusion (MALF) [6]–
[10] and registration-free methods [11]–[14]. SM and MALF,
reviewed in more detail in Section I-A1, rely on establishing
anatomic correspondences between images from different sub-
jects, a task that remains challenging due to high inter-subject
variability in organ shape and appearance as well as soft tissue
deformation [15]. Registration-free methods trade registration
challenges for the challenges of constructing variability- and
deformation-invariant features ("hand-tuned" or learnt) that
characterize anatomy in an unregistered training data set.
Despite the claimed advantage of this approach, registration-
free methods have achieved less accurate multi-organ segmen-
tations than the registration-based approaches [16].
Recent advances in machine learning, computational power
and data availability, however, have enabled the training of
more complex registration-free methods, including deep fully
convolutional networks (FCNs), promising increased segmen-
tation accuracy [17]. FCNs, discussed in detail in Section I-A2,
are particularly well-suited to multi-organ abdominal segmen-
tation because they require neither explicit anatomical corre-
spondences nor hand-tuned image features. In multi-organ ab-

dominal segmentation, they have been used alone [18] or with
pre- or post-processing, such as level sets [19] and MALF [20],
demonstrating their potential value. However, these pipelines
still have not achieved higher accuracies than the most accurate
registration-based methods for most organs [16].
This study presents the dense V-network FCN (DenseVNet)
and its application to multi-organ segmentation on abdominal
CT, yielding higher accuracies than three existing methods.
The contributions of this work are four-fold:
1) The DenseVNet segmentation network is presented,
which enables high-resolution activation maps through
memory-efficient dropout and feature reuse.
2) A batch-wise spatial dropout scheme is proposed, which
lowers the memory and computational costs of dropout.
3) The accuracy of DenseVNet for multi-organ segmen-
tation from abdominal CT is evaluated using a cross-
validation over 90 manually segmented images from
multiple centres. The results indicate that higher seg-
mentation accuracy can be achieved than a state-of-the-
art MALF method and two existing FCNs.
4) The parts of DenseVNet critical for accuracy are identified
by comparing the accuracies of network variants.
This builds on our preliminary work [21], with an improved
network architecture, a larger data set, and more extensive
comparisons with other algorithms and network variants.
A. Related work
1) Common multi-organ segmentation methodologies: Sta-
tistical models [5], [6] involve co-registering images in a
training data set to estimate anatomical correspondences, con-
structing a statistical model of the distribution of shapes [22]
and/or appearances [23] of corresponding anatomy in the
training data, and fitting the resulting model to new images to
generate segmentations. Multi-atlas label fusion methods [6]–
[10] register images in a training data set to each new image
and combine propagated reference segmentations to generate
new segmentations. Statistical models and multi-atlas methods
are limited by image registration accuracy. This registration,
while extensively studied, remains challenging [15]. The size,
shape, appearance, and relative positions of abdominal organs
vary considerably between patients due to natural variability,
disease status and previous treatments and within each patient
due to soft tissue deformation. To avoid challenging regis-
trations, registration-free methods train a voxel-wise classifier
on unregistered images. Some methods have relied on hand-
crafted organ-specific image features [11], [12], but many
recent approaches involve training classifiers on selected (but
typically organ-agnostic) image features [13], [14]. Regis-
tration challenges notwithstanding, MALF has yielded more
accurate multi-organ abdominal CT segmentations than reg-
istration-free methods for most organs [16]. However, recent
advances in registration-free methods may change this.
2) FCNs for segmentation: FCNs are compositions of
simple image-to-image functions with trainable parameters,
including convolution with linear kernels and voxelwise non-
linearities. FCNs are efficient architectures for deep-learning-
based tasks that require image outputs like segmentation.
FCNs have recently been applied to segmentation of volu-
metric images in medical image analysis [18], [19], [24]–[26]
where such images are common. Segmentation of volumetric images faces particular challenges, mainly due to the need to
process large volumetric images under memory constraints.
One strategy to constrain the memory usage is to process
smaller images: small patches of a larger image or lower res-
olution images. Image-patch segmentations consider various patch types (single 2D slices, slabs of adjacent 2D slices, or smaller cropped regions) and orientations (single axis-aligned slices, multiple slices from multiple axes, or oblique slices). These methods gain memory efficiency but lose spatial context. In contrast, Milletari et al. [25] and Çiçek et al. [24]
used 3D representations of the entire image by downsampling
the image sequentially so that most image features are only
represented at low resolution. Our previous work [21] used 3D
representations with fewer, but higher-resolution, features by
using dense blocks [27], stacks of convolutional units in which
the input of each layer comprises the outputs of all preceding
stack layers, compensating for using fewer features.
Another strategy to constrain the memory usage is to limit
the network depth. However, this affects the FCN receptive
field (i.e. the size of the image region affecting each output
voxel), which grows linearly with the network depth. Larger
convolutional kernels mitigate this by increasing the linear
growth rate; however, this can result in a very high parameter
count (which grows as the cube of kernel size in 3D). Sequen-
tial downsampling, mentioned above, also mitigates this effect,
as the receptive field grows exponentially with the number of
downsampling stages. Dilated convolutions [28], used in our
previous work [21], instead use large, but sparse kernels to
give exponential receptive field size with few parameters.
Multi-organ segmentation poses additional challenges. First,
more information must be propagated through the network,
exacerbating the aforementioned memory challenges. The rela-
tive weighting of the losses for different organs (with high vol-
ume imbalance) can have unpredictable effects on convergence
and final errors; using the Dice coefficient is common but
remains poorly characterized. Imposing shape [29] and topo-
logical [30] constraints between specified organs also remains
challenging. Despite these challenges, deep learning has been
used in multi-organ abdominal CT segmentation alone [18],
[31] or as part of a larger segmentation pipeline [19], [20].
Zhou et al. [18] segmented 19 abdominal organs on 2D
slices in axial, sagittal and coronal views and combined the
results using majority-voting label fusion. Roth et al. [31]
segmented 7 organs using a two-stage hierarchical pipeline
based on 3D UNet [24]. Hu et al. [19] segmented 4 organs
using a 3D FCN to generate organ probability maps as
features for a level-set-based segmentation. Larsson et al. [20]
used MALF to identify a region of interest for each organ
and a 3D FCN with hand-tuned input features to complete
the segmentation. Compared to registration-based methods in
a recent segmentation challenge [16], these methods were
substantially more accurate (>2% Dice score improvement)
for gallbladder, achieved parity (within 2% Dice score) for the
liver, left kidney, right adrenal gland and aorta, but have lower
accuracy for the pancreas, gastrointestinal tract (esophagus,

stomach) and other organs (spleen, right kidney, vena cava,
portal/splenic vein, and left adrenal gland).
II. DATA
Ninety abdominal CT images and corresponding reference
standard segmentations of the spleen, left kidney, gallbladder,
esophagus, liver, stomach, pancreas and duodenum were used
for this study. The CT images and some of the segmentations
were drawn from two publicly available data sets: forty-
three subjects from the Cancer Imaging Archive Pancreas-CT
data set [26], [32], [33] with pancreas segmentations and 47
subjects from the ‘Beyond the Cranial Vault’ (BTCV) segmen-
tation challenge [16] with segmentations of all organs except
duodenum. The remaining reference standard segmentations
were performed at our centre. The completed segmentations
and subject identifiers have been made publicly available
(DOI:http://doi.org/10.5281/zenodo.1169361).
A. Image data
The Pancreas-CT data set comprises abdominal CT ac-
quired at the National Institutes of Health Clinical Center
from pre-nephrectomy healthy kidney donors or patients with
neither major abdominal pathologies nor pancreatic cancer
lesions [33]. The BTCV data set comprises abdominal CT
acquired at the Vanderbilt University Medical Center from
metastatic liver cancer patients or post-operative ventral hernia
patients [15]. Images had voxel sizes from 0.6–0.9 mm in the
anterior-posterior (AP) and left-right (LR) axes and 0.5–5.0
mm in the inferior-superior (IS) axis. Images were manually
cropped to the rib-cage and abdominal cavity transversely, to
the superior extent of the liver or spleen and the inferior extent
of the liver or kidneys, resulting in fields of view ranging from
172–318 mm AP, 246–367 mm LR and 138–283 mm IS.
B. Reference standard segmentations
Segmentations from the Pancreas-CT and BTCV datasets
were used where available. An imaging research fellow (E.G.),
under the supervision of a board-certified radiologist with 8
years of experience in gastrointestinal CT and MRI image
interpretation (F.G.), interactively segmented the unsegmented
organs on both data sets and edited the segmented organs to
ensure a consistent segmentation protocol, using Matlab 2015b
and ITK-SNAP 3.2 (http://itksnap.com).
III. METHODS
This study compares our proposed algorithm to multiple
automated segmentation algorithms in two experiments. First,
to evaluate the improvements to the state of the art in segmen-
tation accuracy due to our algorithm, we compare three distinct
algorithms detailed below: the multi-atlas-label-fusion-based
DEEDS+JLF [34], [35], the deep-learning-based VoxRes-
Net [36], and the proposed deep-learning-based DenseVNet.
Second, to clarify the architectural factors contributing to
these improvements, we compare variations of the proposed
DenseVNet architecture.
TABLE I
TABLE OF SYMBOLS

Tensors
  L                   logit segmentation from V-network
  P                   logit spatial prior
  L′, L″, L″_l        logit and probabilistic segmentation and l-th channel
  R_l                 l-th channel of reference standard segmentation
  B^I_i, B^O_i        stochastic binary masks for dropout
  W                   convolution kernel
Operators
  c(X, W, s, γ)       convolutional unit
  r(X)                rectified linear non-linearity
  b(X, γ)             channel-wise batch normalization
  ō(X, B^I, B^O)      batch-wise spatial dropout
  f_m(X)              dense feature stack
  u(X)                bilinear upsampling
Operator parameters (operator: parameters)
  c: s, γ             stride, scale parameter
  f: m, a, d_i, n_f   # layers, kernel size, i-th layer dilation rate,
                      # features in each unit
Other notation
  p                   approximate probability of keeping each channel
  x, y, z             voxel coordinates

Symbols used within one paragraph are omitted for brevity.
A. Proposed algorithm: Dense V-network segmentation
The proposed segmentation method uses a fully-
convolutional neural network [37] based on convolutional
units composed as shown in Figure 1. The architecture design
can be understood in terms of 5 key features described below:
batch-wise spatial dropout, dense feature stacks, V-network
downsampling and upsampling, dilated convolutions, and an
explicit spatial prior. For clarity and precision, each of these
will be described conceptually and specified mathematically.
The supplementary material, available in the multimedia tab
online, has guidance for implementing the network.
Each convolutional unit comprised three functions: (1) a
3D convolution with a learned kernel, (2) a batch normal-
ization [38] to facilitate robust gradient propagation, and (3) a
rectified linear unit (ReLU) non-linearity [39] to represent non-
linear functions. Specifically, convolutional units are denoted,
    c(X, W, s, γ)_{x,y,z} = r(b((X ∗ W)_{sx,sy,sz}, γ))                (1)

where W is a convolutional kernel, ∗ denotes convolution, batch normalization b(X, γ) transforms the mean of each channel to 0 and the variance to a learned per-channel scale parameter γ, and the rectified linear unit r(X) = max(0, X) induces non-linearity.
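For illustration, a minimal sketch of such a unit in TensorFlow (the names are ours, not the paper's; the batch normalization layer, which owns the learned scale γ, is assumed to be created once by the caller and reused):

    import tensorflow as tf

    def conv_unit(x, kernel, stride, batch_norm, training=True):
        # (X * W) with stride s in each spatial dimension
        y = tf.nn.conv3d(x, kernel,
                         strides=[1, stride, stride, stride, 1],
                         padding='SAME')
        # b(., gamma): batch normalization, then r(.): rectified linear unit
        return tf.nn.relu(batch_norm(y, training=training))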
For computational and memory efficiency, we introduce our
new batch-wise spatial dropout. In regular spatial dropout [40],
to regularize the network, random channels are dropped (i.e.
set to zero) with an independent specified probability,
    X̂_i = ô(c(X̂_{i-1}, W, s, γ), B^O)                                  (2)

where ô(X, B^O) sets channels masked by the stochastic binary mask B^O to zero, and X̂_{i-1} = ô(X_{i-1}, B^I) is the previous unit's output after dropout with mask B^I. Standard implementations calculate and store the dropped activations that do not affect subsequent layers. Our proposed batch-wise spatial dropout avoids computing these activations by modifying the convolution kernels instead of the activation maps, denoted

    X̄_i = c(X̄_{i-1}, ō(W, B^I, B^O), s, γ̄)                             (3)

where ō(W, B^I, B^O) is a new kernel without the input and output channels masked by B^I and B^O, X̄_{i-1} is the output of the previous unit after batch-wise spatial dropout, and γ̄ is the scale parameter of the undropped channels. Note that X̄_i is identical to the undropped channels of X̂_i but does not compute or store the dropped channels, and that subsequent convolutions are unaffected if their kernels are similarly modified. To realize the efficiency gains, two further changes are made. First, the same channels are dropped for all images in each mini-batch, so that the same convolution kernels can be used for the whole mini-batch. Second, the distribution of dropped channels is changed to limit the maximum memory usage. In spatial dropout, the probability distribution of keeping k out of n channels is the binomial distribution p(K = k) = C(n, k) p^k (1 − p)^{n−k}; although the expected value E[K] = pn, the maximum value (corresponding to the maximum memory usage) is n. Instead, the proposed batch-wise spatial dropout drops channels using dependent Bernoulli distributions, such that a fixed number of channels ⌈pn⌉ is kept. Segmentation inference can use all features by scaling the convolutional unit outputs by n_f/⌈n_f p⌉; this requires more memory per subject than training, as all n_f feature maps are generated. Alternatively, Monte Carlo inference [41] can be used (increasing the computation cost but lowering the memory usage) by inferring multiple segmentation samples using dropout and combining them. Both of these approaches are evaluated in the experiments below. An implementation of batch-wise spatial dropout is available in the NiftyNet platform (niftynet.layer.channel_sparse_convolution.ChannelSparseConvolutionalLayer in the http://niftynet.io code repositories).
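To make the kernel modification concrete, the following is a minimal TensorFlow sketch of the idea, not the NiftyNet implementation (function and variable names are ours): the kernel's input- and output-channel slices are gathered before the convolution, so dropped channels are never computed or stored, and the kept indices are passed on so the next unit can mask its kernel's input channels to match.

    import math
    import tensorflow as tf

    def dropout_conv(x, kernel, kept_in, p=0.5):
        # x:       [batch, D, H, W, len(kept_in)] activations of kept channels only
        # kernel:  [k, k, k, n_in, n_out] full convolution kernel
        # kept_in: indices of channels the previous unit actually produced (B^I)
        n_out = int(kernel.shape[4])
        n_keep = math.ceil(p * n_out)  # fixed count: bounded memory per mini-batch
        # One channel selection shared by the whole mini-batch (batch-wise dropout)
        kept_out = tf.sort(tf.random.shuffle(tf.range(n_out))[:n_keep])
        w = tf.gather(kernel, kept_in, axis=3)   # drop masked input channels
        w = tf.gather(w, kept_out, axis=4)       # drop masked output channels (B^O)
        y = tf.nn.conv3d(x, w, strides=[1, 1, 1, 1, 1], padding='SAME')
        # batch normalization and ReLU omitted for brevity
        return y, kept_out  # kept_out becomes the next unit's kept_in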
Dense feature stacks, adapted from the dense block defined
by Huang et al. [27], are a sequence of composite functions
where the input of each function is the concatenated output of
all preceding functions. In contrast to Huang’s dense block,
our composite functions use our batch-wise spatial dropout for regularization, and do not use 1 × 1 bottleneck layers. Specifically, the output of an m-layer dense feature stack is f_m(X_0) = [X_0; X_1; ...; X_m], where

    X_i = f̂_i([X_0; X_1; ...; X_{i-1}])                                (4)
    f̂_i(X) = c(X, ō(W_{a,n_f,d_i}, B^I_i, B^O_i), 1, γ̄)                (5)

where [A; B] denotes channel-wise concatenation; W_{a,n_f,d_i} is an a × a × a convolution kernel (a = 3) with n_f output channels (4, 8 and 16 for the high, medium and low resolution dense feature stacks) and dilation rate d_i (d_2 = 3, d_3 = 9, and d_i = 1 for i ∉ {2, 3}); B^I_i = [B^O_0; B^O_1; ...; B^O_{i-1}] selects all previously computed channels; B^O_0 selects all channels from X_0; and otherwise B^O_i is sampled stochastically such that ⌈p n_f⌉ channels are selected (p = 0.5).
First, like residual networks [42], the feature stacks inherently
encode identity functions, as the final output channels include
the inputs. Second, they combine multiple network depths
within a single network [43] allowing both effective propa-
gation of gradients through the network (every kernel weight
lies in a shallow sub-graph of depth 1) and representation
of complex functions (every kernel weight lies in multiple
deeper sub-graphs with depths 2 to m). Finally, when memory
constraints limit the number of activation maps, information
from earlier layers is stored only once in memory, but accessed
by later layers. Memory-efficient dense blocks [44], where a
careful implementation of feature concatenation avoids storing
multiple copies of feature maps, can achieve O(m) memory
usage. The improvements of batch-wise spatial dropout can be
combined with those of memory-efficient dense blocks by only
allocating shared memory storage for the number of computed
activation maps, which is fixed for our dependent Bernoulli
distributions.
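As a sketch of the stack's wiring per Eq. (4), assuming a convolutional unit like the dropout_conv above is wrapped into a single-argument callable (names are illustrative, not from the paper):

    import tensorflow as tf

    def dense_feature_stack(x, num_layers, conv_unit_fn):
        # unit i consumes [X_0; X_1; ...; X_{i-1}];
        # the stack returns [X_0; X_1; ...; X_m]
        outputs = [x]
        for i in range(num_layers):
            stacked = tf.concat(outputs, axis=-1)  # channel-wise concatenation
            outputs.append(conv_unit_fn(stacked, layer_index=i))
        return tf.concat(outputs, axis=-1)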
A V-network architecture comprises downsampling and
upsampling subnetworks, with skip connections to propa-
gate higher resolution information to the final segmenta-
tion. Previous V-networks [24], [25] typically use shallow
strided-convolution downsampling units followed by shallow
transpose-convolutional upsampling units with additive or
concatenating skip connections within each resolution. Den-
seVNet differs in several ways: the downsampling subnetwork
is a sequence of three dense feature stacks connected by down-
sampling strided convolutions; each skip connection is a single
convolution of the corresponding dense feature stack output,
and the upsampling network comprises bilinear upsampling
to the final segmentation resolution. Memory efficiencies of
dense feature stacks and batch-wise spatial dropout enable
deeper networks at higher resolutions, which is advantageous
for segmentation of smaller structures. The bilinear upsam-
pling of skip connections to the segmentation resolution (72³)
limits artifacts induced by transpose convolution [45]. The V-
network generates a logit label prediction L with 9 classes.
Dilated convolutions use sparse convolution kernels to rep-
resent functions with large receptive fields but few train-
able parameters. Specifically, a dilated kernel W_{a,k,d} is a (d(a − 1) + 1)³ kernel with a trainable parameter every d elements in each dimension and 0 elsewhere. For the i-th convolutional layer of a FCN, the relative resolution is ∏_{j=1}^{i} 1/s_j, and the receptive field size, expressed recursively, is r_i = r_{i−1} + d_i (a_i − 1) ∏_{j=1}^{i−1} s_j, where d_i, s_i and a_i are the dilation rate, stride and kernel size (before dilation) of layer i. Because both resolution and receptive field size depend on s_i, sequential downsampling can generate either local high-resolution features in early layers or global low-resolution features after the downsampling layers. In contrast, by increasing d_i exponentially with s_i = 1, dilated convolutions can generate high-resolution features with exponentially growing receptive fields in the early layers. This allows more complex functions of these features to be computed in later layers. The high-resolution large-receptive-field features in lower layers may help the segmentation of small structures (e.g. the esophagus) whose location can be inferred from large structures nearby.
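For instance, applying the recursion with a small helper of our own to three 3 × 3 × 3 layers with stride 1 and dilation rates 1, 3 and 9 (the schedule used in the dense feature stacks above) gives a 27-voxel receptive field without any downsampling:

    def receptive_field(layers):
        # r_i = r_{i-1} + d_i*(a_i - 1)*prod(s_1..s_{i-1}), starting from r_0 = 1;
        # each layer is a (kernel size a, stride s, dilation d) tuple
        r, cumulative_stride = 1, 1
        for a, s, d in layers:
            r += d * (a - 1) * cumulative_stride
            cumulative_stride *= s
        return r

    print(receptive_field([(3, 1, 1), (3, 1, 3), (3, 1, 9)]))  # -> 27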
Finally, we use an explicit spatial prior introduced in our
previous work [21]. Medical images are frequently acquired in
standard anatomically aligned views with relatively consistent
organ positions and orientations, motivating spatial segmen-
tation priors. Spatial priors can be encoded implicitly, due
to boundary effects of convolution or by providing image
coordinates as an input channel [46]. Our previous work [21]
introduced an explicit spatial prior. The spatial prior P is
This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication.
The final version of record is available at http://dx.doi.org/10.1109/TMI.2018.2806309
Copyright (c) 2018 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing pubs-permissions@ieee.org.

5
a low-resolution 3D map of trainable parameters, which is
bilinearly upsampled to the segmentation resolution and added
to the outputs of the V-network (i.e. L′ = u(P) + L). Conceptually, this could represent the posterior log-probability L′ = log p(L|x, I) of the class label L at voxel x given image I as the sum of a log-likelihood L = log p(I|x, L) generated by the V-network and a prior log-probability u(P) = log p(L|x) generated by the spatial prior. However, the spatial prior parameters are trained as part of the end-to-end gradient-based optimization and may not represent the true prior probability.
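A minimal NumPy/SciPy illustration of the prior addition (the 12³ × 9 prior shape is taken from Table II; random values stand in for trained parameters):

    import numpy as np
    from scipy.ndimage import zoom

    P = np.random.randn(12, 12, 12, 9)     # low-resolution trainable prior P
    L = np.random.randn(72, 72, 72, 9)     # V-network logits at 72^3
    u_P = zoom(P, (6, 6, 6, 1), order=1)   # linear upsampling u(P) to 72^3
    L_prime = L + u_P                      # L' = u(P) + L
    probs = np.exp(L_prime)
    probs /= probs.sum(axis=-1, keepdims=True)  # softmax over the 9 classes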
1) Implementation details: The loss function was the
weighted sum of an L2 regularisation loss with label-
smoothed [47] probabilistic Dice scores for each organ l
averaged across subjects in each minibatch,
    pDice_l(L″_l, R_l) = (min(L″_l, 0.9) · R_l) / (‖R_l‖² + ‖min(L″_l, 0.9)‖²)   (6)

where the vectors L″_l = softmax(L′)_l and R_l are the algorithm's probabilistic segmentation and the binary reference standard segmentation for organ l for each subject, respectively. To further mitigate the extreme class imbalance (e.g. the esophagus averaged 0.09% of the image and the liver averaged 11.7%), Dice score hinge losses heavily penalizing Dice scores below 0.01 and 0.10 were introduced after warm-up periods of 25 and 100 iterations, respectively. The loss function at iteration i was

    loss(L″, i) = (Σ_{w∈W} w²)/40 − (1/8) Σ_{l=1}^{8} d(pDice_l(L″_l, R_l), i)   (7)
    d(l, i) = l + 100 h(l, i, 0.01, 25) + 10 h(l, i, 0.1, 100)                   (8)
    h(l, i, v, t) = sigmoid(6(i − t)/t) (max(0, v − l)/v)⁴                       (9)

where w ∈ W are kernel values, l is the Dice loss, v is the hinge loss threshold, and t is the delay in iterations.
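A sketch of the label-smoothed probabilistic Dice of Eq. (6) for one organ (our own naming; the regularization and hinge schedule of Eqs. (7)-(9) are omitted):

    import tensorflow as tf

    def label_smoothed_dice(prob_l, ref_l, cap=0.9):
        # prob_l, ref_l: flattened probabilistic and binary masks for organ l
        p = tf.minimum(prob_l, cap)             # min(L''_l, 0.9): label smoothing
        numerator = tf.reduce_sum(p * ref_l)    # min(L''_l, 0.9) . R_l
        denominator = tf.reduce_sum(ref_l ** 2) + tf.reduce_sum(p ** 2)
        return numerator / denominator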
The network was trained using the Adam optimizer with a learning rate of 0.001 and mini-batch size 10 for 5000 iterations (i.e. 625 epochs). Training each instance of the network took approximately 6 hours using Titan X Pascal or P100 GPUs (NVIDIA Corporation, Los Alamitos, CA). A TensorFlow implementation of a trained DenseVNet network is available in the NiftyNet platform model zoo (http://niftynet.io/model zoo).

The cropped region of interest, ranging from 209–471 voxels (172–367 mm) transversely and 32–450 voxels (138–283 mm) in the IS axis, was resampled to a 144³-voxel volume. During training, for data augmentation, affine perturbations were applied, yielding skewed subregions 0% to 10% smaller in each dimension. For the baseline DenseVNet used in the algorithm comparison, we used Monte Carlo inference using the mode of 30 72³ segmentation samples (chosen heuristically a priori), taking approximately 8–15 seconds per image. In post-processing, the 72³ segmentation labels were resampled to the original cropped region at the original image resolution in Matlab using curvature flow smoothing [48] with 2 iterations (chosen visually a priori to avoid quantization artifacts). Then, for each organ, the union of all connected components comprising >10% (chosen ad hoc, a priori) of the segmented organ volume was taken as the final mask.
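The connected-component rule at the end of the pipeline might look like the following SciPy sketch (the function name is ours):

    import numpy as np
    from scipy.ndimage import label

    def keep_major_components(mask, min_fraction=0.1):
        # union of connected components holding >10% of the segmented volume
        labelled, n_components = label(mask)
        total = mask.sum()
        keep = np.zeros_like(mask, dtype=bool)
        for c in range(1, n_components + 1):
            component = labelled == c
            if component.sum() > min_fraction * total:
                keep |= component
        return keep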
TABLE II
DETAILED PARAMETERS FOR DENSEVNET ARCHITECTURE

Layer     Input       Output      Kernel  Stride  Subunits (m × n_f)
Feature   144³ × 1    72³ × 24    5³      2
DFS 1     72³ × 24    72³ × 20    3³      1       5 × 4
Skip 1    72³ × 20    72³ × 12    3³      1
Down 1-2  72³ × 20    36³ × 24    3³      2
DFS 2     36³ × 24    36³ × 80    3³      1       10 × 8
Skip 2    36³ × 80    36³ × 24    3³      1
Up 2      36³ × 24    72³ × 24
Down 2-3  36³ × 80    18³ × 24    3³      2
DFS 3     18³ × 24    18³ × 160   3³      1       10 × 16
Skip 3    18³ × 160   18³ × 24    3³      1
Up 3      18³ × 24    72³ × 24
Up Prior  12³ × 9     72³ × 9
B. Evaluation metrics and statistical methods
We compared the accuracy of segmentation algorithms
using a 9-fold cross-validation over 90 subjects. For each test
image in each fold, we compared each organ segmentation to
the reference standard segmentation using three metrics:
- Dice coefficient: 2|A ∩ B|/(|A| + |B|),
- symmetric mean boundary distance: (D̄(A, B) + D̄(B, A))/2, and
- symmetric 95% Hausdorff distance: (P_95(D(A, B)) + P_95(D(B, A)))/2,

where A and B are the algorithm and reference segmentations; D(A, B) is the set of distances from the boundary pixels ∂A of A to the nearest boundary pixel in ∂B, i.e. D(A, B) = { min_{y ∈ ∂B} ‖x − y‖ : x ∈ ∂A }; D̄ denotes the mean of D; and P_95(D) is the 95th percentile of D. The Dice coefficient measures the relative volumetric overlap between segmentations. The mean boundary and 95% Hausdorff distances reflect the agreement between segmentation boundaries, with the latter being more sensitive to localized disagreements.
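These three metrics can be computed from binary masks, for example with a SciPy-based sketch like the following (voxels are assumed isotropic here; anisotropic spacing would be passed to the distance transform via its sampling argument):

    import numpy as np
    from scipy.ndimage import binary_erosion, distance_transform_edt

    def boundary_distances(a, b):
        # distances from each boundary voxel of a to the nearest boundary voxel of b
        a_border = a & ~binary_erosion(a)
        b_border = b & ~binary_erosion(b)
        return distance_transform_edt(~b_border)[a_border]

    def evaluate(a, b):
        dice = 2 * np.logical_and(a, b).sum() / (a.sum() + b.sum())
        d_ab, d_ba = boundary_distances(a, b), boundary_distances(b, a)
        mean_bd = (d_ab.mean() + d_ba.mean()) / 2
        hd95 = (np.percentile(d_ab, 95) + np.percentile(d_ba, 95)) / 2
        return dice, mean_bd, hd95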
In each analysis, we compared the accuracy of the proposed
algorithm to each comparator using a sign test for correlated
data [49], which is insensitive to the skewed distribution
of accuracy differences observed in our data, and accounts
for the correlation between values within each fold due to
the shared training set. We used Benjamini–Hochberg false-
discovery-rate multiple-comparison correction (α = 0.05) for
pairwise tests. This correction was performed separately for
the primary analysis comparing algorithms and the secondary
analysis comparing architecture variants. In several subjects,
one or more organs were not present in the images due to
prior surgeries; these organs (7 gallbladders, 1 left kidney and
1 esophagus) were excluded from the aggregate descriptive
statistics and statistical comparisons above as the measures
used are not meaningful in this scenario. In these cases, we
reported the segmented volume (ideally 0) for these organs
(Supplementary material Table II, available in the multimedia
tab online).
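The false-discovery-rate step can be reproduced, for example, with statsmodels' Benjamini-Hochberg procedure (the p-values below are made up for illustration):

    from statsmodels.stats.multitest import multipletests

    # one p-value per pairwise sign test
    p_values = [0.001, 0.008, 0.03, 0.04, 0.20]
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05,
                                             method='fdr_bh')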
C. Primary analysis: algorithm comparison
We compared the segmentation accuracy of our algorithm
to those of two existing algorithms: the deep-learning-based
