
Automatic Multi-Organ Segmentation on Abdominal CT With Dense V-Networks


Eli Gibson, Francesco Giganti§, Yipeng Hu, Ester Bonmati, Steve Bandula‖, Kurinchi Gurusamy, Brian Davidson, Stephen P. Pereira∗∗, Matthew J. Clarkson, and Dean C. Barratt

UCL Centre for Medical Image Computing, Department of Medical Physics & Biomedical Engineering, University College London, UK
§ Department of Radiology, University College Hospital Trust, UK
Division of Surgery and Interventional Science, University College London, UK
‖ UCL Centre for Medical Imaging, University College London, UK
∗∗ Institute for Liver and Digestive Health, University College London, UK
Wellcome / EPSRC Centre for Interventional and Surgical Sciences, University College London, UK
Abstract—Automatic segmentation of abdominal anatomy on
computed tomography (CT) images can support diagnosis, treat-
ment planning and treatment delivery workflows. Segmentation
methods using statistical models and multi-atlas label fusion
(MALF) require inter-subject image registrations which are
challenging for abdominal images, but alternative methods with-
out registration have not yet achieved higher accuracy for
most abdominal organs. We present a registration-free deep-
learning-based segmentation algorithm for eight organs that are
relevant for navigation in endoscopic pancreatic and biliary
procedures, including the pancreas, the GI tract (esophagus,
stomach, duodenum) and surrounding organs (liver, spleen, left
kidney, gallbladder). We directly compared the segmentation
accuracy of the proposed method to existing deep learning and
MALF methods in a cross-validation on a multi-centre data
set with 90 subjects. The proposed method yielded significantly
higher Dice scores for all organs and lower mean absolute
distances for most organs, including Dice scores of 0.78 vs. 0.71,
0.74 and 0.74 for the pancreas, 0.90 vs 0.85, 0.87 and 0.83 for
the stomach and 0.76 vs 0.68, 0.69 and 0.66 for the esophagus.
We conclude that deep-learning-based segmentation represents a
registration-free method for multi-organ abdominal CT segmen-
tation whose accuracy can surpass current methods, potentially
supporting image-guided navigation in gastrointestinal endoscopy
procedures.
Index Terms—Abdominal CT, Segmentation, Deep learning,
Pancreas, Gastrointestinal tract, Stomach, Duodenum, Esopha-
gus, Liver, Spleen, Kidney, Gallbladder
I. INTRODUCTION
Segmentation of organs in abdominal images can
support clinical workflows in multiple domains, including
diagnostic interventions, treatment planning and treatment
delivery. Organ segmentation is a crucially important step
for computer-assisted diagnostic and biomarker measurement
systems [1]. Segmentations of treatment volumes and organs-
at-risk are also central to planning radiation therapies [2].
More generally, segmentation-based patient-specific anatom-
ical models can support surgical planning and delivery via
intra-operative image-guidance systems [3].
Corresponding author: E. Gibson (email: eli.gibson@ucl.ac.uk). Copyright
(c) 2017 IEEE. Personal use of this material is permitted. However, permission
to use this material for any other purposes must be obtained from the IEEE
by sending a request to pubs-permissions@ieee.org.
In endoscopic pancreatobiliary procedures, an endoscope
is inserted orally and navigated through the gastrointestinal
tract to specific positions on the stomach or duodenal wall to
allow pancreatobiliary imaging and intervention. Due to the
small endoscopic field of view and the lack of visual orienta-
tion cues, this navigation task is challenging, particularly for
novice endoscopists [4]. Image-guidance showing registered
anatomical models would provide orientation and targeting
cues that are outside of the endoscopic field of view or
challenging to see on endoscopic images. To support targeting
and navigation, segmentations of multiple organs are needed:
the pancreas, gastrointestinal organs (esophagus, stomach and
duodenum), and nearby organs used as navigational landmarks
(liver, gallbladder, spleen and left kidney).
Manual segmentation of 3D abdominal images is labor-
intensive and impractical for most clinical workflows, moti-
vating (semi-)automated segmentation tools [2]. Research into
such tools has focused on computed tomography (CT), due to
its clinical prevalence, and on three methodologies: statistical
models (SM) [5], [6], multi-atlas label fusion (MALF) [6]–
[10] and registration-free methods [11]–[14]. SM and MALF,
reviewed in more detail in Section I-A1, rely on establishing
anatomic correspondences between images from different sub-
jects, a task that remains challenging due to high inter-subject
variability in organ shape and appearance as well as soft tissue
deformation [15]. Registration-free methods trade registration
challenges for the challenges of constructing variability- and
deformation-invariant features ("hand-tuned" or learnt) that
characterize anatomy in an unregistered training data set.
Despite the claimed advantage of this approach, registration-
free methods have achieved less accurate multi-organ segmen-
tations than the registration-based approaches [16].
Recent advances in machine learning, computational power
and data availability, however, have enabled the training of
more complex registration-free methods, including deep fully
convolutional networks (FCNs), promising increased segmen-
tation accuracy [17]. FCNs, discussed in detail in Section I-A2,
are particularly well-suited to multi-organ abdominal segmen-
tation because they require neither explicit anatomical corre-
spondences nor hand-tuned image features. In multi-organ ab-

dominal segmentation, they have been used alone [18] or with
pre- or post-processing, such as level sets [19] and MALF [20],
demonstrating their potential value. However, these pipelines
still have not achieved higher accuracies than the most accurate
registration-based methods for most organs [16].
This study presents the dense V-network FCN (DenseVNet)
and its application to multi-organ segmentation on abdominal
CT, yielding higher accuracies than three existing methods.
The contributions of this work are four-fold:
1) The DenseVNet segmentation network is presented,
which enables high-resolution activation maps through
memory-efficient dropout and feature reuse.
2) A batch-wise spatial dropout scheme is proposed, which
lowers the memory and computational costs of dropout.
3) The accuracy of DenseVNet for multi-organ segmen-
tation from abdominal CT is evaluated using a cross-
validation over 90 manually segmented images from
multiple centres. The results indicate that higher seg-
mentation accuracy can be achieved than a state-of-the-
art MALF method and two existing FCNs.
4) The parts of DenseVNet critical for accuracy are identified
by comparing the accuracies of network variants.
This builds on our preliminary work [21], with an improved
network architecture, a larger data set, and more extensive
comparisons with other algorithms and network variants.
A. Related work
1) Common multi-organ segmentation methodologies: Sta-
tistical models [5], [6] involve co-registering images in a
training data set to estimate anatomical correspondences, con-
structing a statistical model of the distribution of shapes [22]
and/or appearances [23] of corresponding anatomy in the
training data, and fitting the resulting model to new images to
generate segmentations. Multi-atlas label fusion methods [6]–
[10] register images in a training data set to each new image
and combine propagated reference segmentations to generate
new segmentations. Statistical models and multi-atlas methods
are limited by image registration accuracy. This registration,
while extensively studied, remains challenging [15]. The size,
shape, appearance, and relative positions of abdominal organs
vary considerably between patients due to natural variability,
disease status and previous treatments and within each patient
due to soft tissue deformation. To avoid challenging regis-
trations, registration-free methods train a voxel-wise classifier
on unregistered images. Some methods have relied on hand-
crafted organ-specific image features [11], [12], but many
recent approaches involve training classifiers on selected (but
typically organ-agnostic) image features [13], [14]. Regis-
tration challenges notwithstanding, MALF has yielded more
accurate multi-organ abdominal CT segmentations than reg-
istration-free methods for most organs [16]. However, recent
advances in registration-free methods may change this.
2) FCNs for segmentation: FCNs are compositions of
simple image-to-image functions with trainable parameters,
including convolution with linear kernels and voxelwise non-
linearities. FCNs are efficient architectures for deep-learning-
based tasks that require image outputs like segmentation.
FCNs have recently been applied to segmentation of volu-
metric images in medical image analysis [18], [19], [24]–[26]
where such images are common. Segmentation of volumetric images faces particular challenges, mainly due to the need to
process large volumetric images under memory constraints.
One strategy to constrain the memory usage is to process
smaller images: small patches of a larger image or lower res-
olution images. Image-patch segmentations consider various patch types (single 2D slices, slabs of adjacent 2D slices, or smaller cropped regions) and orientations (single axis-aligned slices, multiple slices from multiple axes, or oblique slices). These methods gain memory efficiency but lose spatial context. In contrast, Milletari et al. [25] and Çiçek et al. [24]
used 3D representations of the entire image by downsampling
the image sequentially so that most image features are only
represented at low resolution. Our previous work [21] used 3D
representations with fewer, but higher-resolution, features by
using dense blocks [27], stacks of convolutional units in which
the input of each layer comprises the outputs of all preceding
stack layers, compensating for using fewer features.
Another strategy to constrain the memory usage is to limit
the network depth. However, this affects the FCN receptive
field (i.e. the size of the image region affecting each output
voxel), which grows linearly with the network depth. Larger
convolutional kernels mitigate this by increasing the linear
growth rate; however, this can result in a very high parameter
count (which grows as the cube of kernel size in 3D). Sequen-
tial downsampling, mentioned above, also mitigates this effect,
as the receptive field grows exponentially with the number of
downsampling stages. Dilated convolutions [28], used in our
previous work [21], instead use large, but sparse kernels to
give exponential receptive field size with few parameters.
Multi-organ segmentation poses additional challenges. First,
more information must be propagated through the network,
exacerbating the aforementioned memory challenges. The rela-
tive weighting of the losses for different organs (with high vol-
ume imbalance) can have unpredictable effects on convergence
and final errors; using the Dice coefficient is common but
remains poorly characterized. Imposing shape [29] and topo-
logical [30] constraints between specified organs also remains
challenging. Despite these challenges, deep learning has been
used in multi-organ abdominal CT segmentation alone [18],
[31] or as part of a larger segmentation pipeline [19], [20].
Zhou et al. [18] segmented 19 abdominal organs on 2D
slices in axial, sagittal and coronal views and combined the
results using majority-voting label fusion. Roth et al. [31]
segmented 7 organs using a two-stage hierarchical pipeline
based on 3D UNet [24]. Hu et al. [19] segmented 4 organs
using a 3D FCN to generate organ probability maps as
features for a level-set-based segmentation. Larsson et al. [20]
used MALF to identify a region of interest for each organ
and a 3D FCN with hand-tuned input features to complete
the segmentation. Compared to registration-based methods in
a recent segmentation challenge [16], these methods were
substantially more accurate (>2% Dice score improvement)
for gallbladder, achieved parity (within 2% Dice score) for the
liver, left kidney, right adrenal gland and aorta, but have lower
accuracy for the pancreas, gastrointestinal tract (esophagus,

stomach) and other organs (spleen, right kidney, vena cava,
portal/splenic vein, and left adrenal gland).
II. DATA
Ninety abdominal CT images and corresponding reference
standard segmentations of the spleen, left kidney, gallbladder,
esophagus, liver, stomach, pancreas and duodenum were used
for this study. The CT images and some of the segmentations
were drawn from two publicly available data sets: forty-
three subjects from the Cancer Imaging Archive Pancreas-CT
data set [26], [32], [33] with pancreas segmentations and 47
subjects from the ‘Beyond the Cranial Vault’ (BTCV) segmen-
tation challenge [16] with segmentations of all organs except
duodenum. The remaining reference standard segmentations
were performed at our centre. The completed segmentations
and subject identifiers have been made publicly available
(DOI:http://doi.org/10.5281/zenodo.1169361).
A. Image data
The Pancreas-CT data set comprises abdominal CT ac-
quired at the National Institutes of Health Clinical Center
from pre-nephrectomy healthy kidney donors or patients with
neither major abdominal pathologies nor pancreatic cancer
lesions [33]. The BTCV data set comprises abdominal CT
acquired at the Vanderbilt University Medical Center from
metastatic liver cancer patients or post-operative ventral hernia
patients [15]. Images had voxel sizes from 0.6–0.9 mm in the
anterior-posterior (AP) and left-right (LR) axes and 0.5–5.0
mm in the inferior-superior (IS) axis. Images were manually
cropped to the rib-cage and abdominal cavity transversely, to
the superior extent of the liver or spleen and the inferior extent
of the liver or kidneys, resulting in fields of view ranging from
172–318 mm AP, 246–367 mm LR and 138–283 mm IS.
B. Reference standard segmentations
Segmentations from the Pancreas-CT and BTCV datasets
were used where available. An imaging research fellow (E.G.),
under the supervision of a board-certified radiologist with 8
years of experience in gastrointestinal CT and MRI image
interpretation (F.G.), interactively segmented the unsegmented
organs on both data sets and edited the segmented organs to
ensure a consistent segmentation protocol, using Matlab 2015b
and ITK-SNAP 3.2 (http://itksnap.com).
III. METHODS
This study compares our proposed algorithm to multiple
automated segmentation algorithms in two experiments. First,
to evaluate the improvements to the state of the art in segmen-
tation accuracy due to our algorithm, we compare three distinct
algorithms detailed below: the multi-atlas-label-fusion-based
DEEDS+JLF [34], [35], the deep-learning-based VoxRes-
Net [36], and the proposed deep-learning-based DenseVNet.
Second, to clarify the architectural factors contributing to
these improvements, we compare variations of the proposed
DenseVNet architecture.
TABLE I
TABLE OF SYMBOLS

Tensors
  L                   logit segmentation from V-network
  P                   logit spatial prior
  L′, L″, L″_l        logit and probabilistic segmentation and l-th channel
  R_l                 l-th channel of reference standard segmentation
  B^I_i, B^O_i        stochastic binary masks for dropout
  W                   convolution kernel
Operators
  c(X, W, s, γ)       convolutional unit
  r(X)                rectified linear non-linearity
  b(X, γ)             channel-wise batch normalization
  ō(X, B^I, B^O)      batch-wise spatial dropout
  f_m(X)              dense feature stack
  u(X)                bilinear upsampling
Operator parameters (operator: parameters)
  c: s, γ             stride, scale parameter
  f: m, a, d_i, n_f   # layers, kernel size, i-th layer dilation rate,
                      # features in each unit
Other notation
  p                   approximate probability of keeping each channel
  x, y, z             voxel coordinates

Symbols used within one paragraph are omitted for brevity.
A. Proposed algorithm: Dense V-network segmentation
The proposed segmentation method uses a fully-
convolutional neural network [37] based on convolutional
units composed as shown in Figure 1. The architecture design
can be understood in terms of 5 key features described below:
batch-wise spatial dropout, dense feature stacks, V-network
downsampling and upsampling, dilated convolutions, and an
explicit spatial prior. For clarity and precision, each of these
will be described conceptually and specified mathematically.
The supplementary material, available in the multimedia tab
online, has guidance for implementing the network.
Each convolutional unit comprised three functions: (1) a
3D convolution with a learned kernel, (2) a batch normal-
ization [38] to facilitate robust gradient propagation, and (3) a
rectified linear unit (ReLU) non-linearity [39] to represent non-
linear functions. Specifically, convolutional units are denoted,
    c(X, W, s, γ)_{x,y,z} = r(b((X ∗ W)_{sx,sy,sz}, γ))                (1)

where W is a convolutional kernel, ∗ denotes convolution, batch normalization b(X, γ) transforms the mean of each channel to 0 and the variance to a learned per-channel scale parameter γ, and the rectified linear unit r(X) = max(0, X) induces non-linearity.
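For illustration, a minimal sketch of such a unit in TensorFlow (the names are ours, not the paper's; the batch normalization layer, which owns the learned scale γ, is assumed to be created once by the caller and reused):

    import tensorflow as tf

    def conv_unit(x, kernel, stride, batch_norm, training=True):
        # (X * W) with stride s in each spatial dimension
        y = tf.nn.conv3d(x, kernel,
                         strides=[1, stride, stride, stride, 1],
                         padding='SAME')
        # b(., gamma): batch normalization, then r(.): rectified linear unit
        return tf.nn.relu(batch_norm(y, training=training))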
For computational and memory efficiency, we introduce our
new batch-wise spatial dropout. In regular spatial dropout [40],
to regularize the network, random channels are dropped (i.e.
set to zero) with an independent specified probability,
    X̂_i = ô(c(X̂_{i-1}, W, s, γ), B^O)                                  (2)

where ô(X, B^O) sets channels masked by the stochastic binary mask B^O to zero, and X̂_{i-1} = ô(X_{i-1}, B^I) is the previous unit's output after dropout with mask B^I. Standard implementations calculate and store the dropped activations that do not affect subsequent layers. Our proposed batch-wise spatial dropout avoids computing these activations by modifying the convolution kernels instead of the activation maps, denoted

    X̄_i = c(X̄_{i-1}, ō(W, B^I, B^O), s, γ̄)                             (3)

where ō(W, B^I, B^O) is a new kernel without the input and output channels masked by B^I and B^O, X̄_{i-1} is the output of the previous unit after batch-wise spatial dropout, and γ̄ is the scale parameter of the undropped channels. Note that X̄_i is identical to the undropped channels of X̂_i but does not compute or store the dropped channels, and that subsequent convolutions are unaffected if their kernels are similarly modified. To realize the efficiency gains, two further changes are made. First, the same channels are dropped for all images in each mini-batch, so that the same convolution kernels can be used for the whole mini-batch. Second, the distribution of dropped channels is changed to limit the maximum memory usage. In spatial dropout, the probability distribution of keeping k out of n channels is the binomial distribution p(K = k) = C(n, k) p^k (1 − p)^{n−k}; although the expected value E[K] = pn, the maximum value (corresponding to the maximum memory usage) is n. Instead, the proposed batch-wise spatial dropout drops channels using dependent Bernoulli distributions, such that a fixed number of channels ⌈pn⌉ is kept. Segmentation inference can use all features by scaling the convolutional unit outputs by n_f/⌈n_f p⌉; this requires more memory per subject than training, as all n_f feature maps are generated. Alternatively, Monte Carlo inference [41] can be used (increasing the computation cost but lowering the memory usage) by inferring multiple segmentation samples using dropout and combining them. Both of these approaches are evaluated in the experiments below. An implementation of batch-wise spatial dropout is available in the NiftyNet platform (niftynet.layer.channel_sparse_convolution.ChannelSparseConvolutionalLayer in the http://niftynet.io code repositories).
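To make the kernel modification concrete, the following is a minimal TensorFlow sketch of the idea, not the NiftyNet implementation (function and variable names are ours): the kernel's input- and output-channel slices are gathered before the convolution, so dropped channels are never computed or stored, and the kept indices are passed on so the next unit can mask its kernel's input channels to match.

    import math
    import tensorflow as tf

    def dropout_conv(x, kernel, kept_in, p=0.5):
        # x:       [batch, D, H, W, len(kept_in)] activations of kept channels only
        # kernel:  [k, k, k, n_in, n_out] full convolution kernel
        # kept_in: indices of channels the previous unit actually produced (B^I)
        n_out = int(kernel.shape[4])
        n_keep = math.ceil(p * n_out)  # fixed count: bounded memory per mini-batch
        # One channel selection shared by the whole mini-batch (batch-wise dropout)
        kept_out = tf.sort(tf.random.shuffle(tf.range(n_out))[:n_keep])
        w = tf.gather(kernel, kept_in, axis=3)   # drop masked input channels
        w = tf.gather(w, kept_out, axis=4)       # drop masked output channels (B^O)
        y = tf.nn.conv3d(x, w, strides=[1, 1, 1, 1, 1], padding='SAME')
        # batch normalization and ReLU omitted for brevity
        return y, kept_out  # kept_out becomes the next unit's kept_in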
Dense feature stacks, adapted from the dense block defined
by Huang et al. [27], are a sequence of composite functions
where the input of each function is the concatenated output of
all preceding functions. In contrast to Huang’s dense block,
our composite functions use our batch-wise spatial dropout for regularization, and do not use 1 × 1 bottleneck layers. Specifically, the output of an m-layer dense feature stack is f_m(X_0) = [X_0; X_1; ...; X_m], where

    X_i = f̂_i([X_0; X_1; ...; X_{i-1}])                                (4)
    f̂_i(X) = c(X, ō(W_{a,n_f,d_i}, B^I_i, B^O_i), 1, γ̄)                (5)

where [A; B] denotes channel-wise concatenation; W_{a,n_f,d_i} is an a × a × a convolution kernel (a = 3) with n_f output channels (4, 8 and 16 for the high, medium and low resolution dense feature stacks) and dilation rate d_i (d_2 = 3, d_3 = 9, and d_i = 1 for i ∉ {2, 3}); B^I_i = [B^O_0; B^O_1; ...; B^O_{i-1}] selects all previously computed channels; B^O_0 selects all channels from X_0; and otherwise B^O_i is sampled stochastically such that ⌈p n_f⌉ channels are selected (p = 0.5).
First, like residual networks [42], the feature stacks inherently
encode identity functions, as the final output channels include
the inputs. Second, they combine multiple network depths
within a single network [43] allowing both effective propa-
gation of gradients through the network (every kernel weight
lies in a shallow sub-graph of depth 1) and representation
of complex functions (every kernel weight lies in multiple
deeper sub-graphs with depths 2 to m). Finally, when memory
constraints limit the number of activation maps, information
from earlier layers is stored only once in memory, but accessed
by later layers. Memory-efficient dense blocks [44], where a
careful implementation of feature concatenation avoids storing
multiple copies of feature maps, can achieve O(m) memory
usage. The improvements of batch-wise spatial dropout can be
combined with those of memory-efficient dense blocks by only
allocating shared memory storage for the number of computed
activation maps, which is fixed for our dependent Bernoulli
distributions.
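As a sketch of the stack's wiring per Eq. (4), assuming a convolutional unit like the dropout_conv above is wrapped into a single-argument callable (names are illustrative, not from the paper):

    import tensorflow as tf

    def dense_feature_stack(x, num_layers, conv_unit_fn):
        # unit i consumes [X_0; X_1; ...; X_{i-1}];
        # the stack returns [X_0; X_1; ...; X_m]
        outputs = [x]
        for i in range(num_layers):
            stacked = tf.concat(outputs, axis=-1)  # channel-wise concatenation
            outputs.append(conv_unit_fn(stacked, layer_index=i))
        return tf.concat(outputs, axis=-1)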
A V-network architecture comprises downsampling and
upsampling subnetworks, with skip connections to propa-
gate higher resolution information to the final segmenta-
tion. Previous V-networks [24], [25] typically use shallow
strided-convolution downsampling units followed by shallow
transpose-convolutional upsampling units with additive or
concatenating skip connections within each resolution. Den-
seVNet differs in several ways: the downsampling subnetwork
is a sequence of three dense feature stacks connected by down-
sampling strided convolutions; each skip connection is a single
convolution of the corresponding dense feature stack output,
and the upsampling network comprises bilinear upsampling
to the final segmentation resolution. Memory efficiencies of
dense feature stacks and batch-wise spatial dropout enable
deeper networks at higher resolutions, which is advantageous
for segmentation of smaller structures. The bilinear upsam-
pling of skip connections to the segmentation resolution (72³)
limits artifacts induced by transpose convolution [45]. The V-
network generates a logit label prediction L with 9 classes.
Dilated convolutions use sparse convolution kernels to rep-
resent functions with large receptive fields but few train-
able parameters. Specifically, a dilated kernel W_{a,k,d} is a (d(a − 1) + 1)³ kernel with a trainable parameter every d elements in each dimension and 0 elsewhere. For the i-th convolutional layer of a FCN, the relative resolution is ∏_{j=1}^{i} 1/s_j, and the receptive field size, expressed recursively, is r_i = r_{i−1} + d_i (a_i − 1) ∏_{j=1}^{i−1} s_j, where d_i, s_i and a_i are the dilation rate, stride and kernel size (before dilation) of layer i. Because both resolution and receptive field size depend on s_i, sequential downsampling can generate either local high-resolution features in early layers or global low-resolution features after the downsampling layers. In contrast, by increasing d_i exponentially with s_i = 1, dilated convolutions can generate high-resolution features with exponentially growing receptive fields in the early layers. This allows more complex functions of these features to be computed in later layers. The high-resolution large-receptive-field features in lower layers may help the segmentation of small structures (e.g. the esophagus) whose location can be inferred from large structures nearby.
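For instance, applying the recursion with a small helper of our own to three 3 × 3 × 3 layers with stride 1 and dilation rates 1, 3 and 9 (the schedule used in the dense feature stacks above) gives a 27-voxel receptive field without any downsampling:

    def receptive_field(layers):
        # r_i = r_{i-1} + d_i*(a_i - 1)*prod(s_1..s_{i-1}), starting from r_0 = 1;
        # each layer is a (kernel size a, stride s, dilation d) tuple
        r, cumulative_stride = 1, 1
        for a, s, d in layers:
            r += d * (a - 1) * cumulative_stride
            cumulative_stride *= s
        return r

    print(receptive_field([(3, 1, 1), (3, 1, 3), (3, 1, 9)]))  # -> 27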
Finally, we use an explicit spatial prior introduced in our
previous work [21]. Medical images are frequently acquired in
standard anatomically aligned views with relatively consistent
organ positions and orientations, motivating spatial segmen-
tation priors. Spatial priors can be encoded implicitly, due
to boundary effects of convolution or by providing image
coordinates as an input channel [46]. Our previous work [21]
introduced an explicit spatial prior. The spatial prior P is
This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication.
The final version of record is available at http://dx.doi.org/10.1109/TMI.2018.2806309
Copyright (c) 2018 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing pubs-permissions@ieee.org.

5
a low-resolution 3D map of trainable parameters, which is
bilinearly upsampled to the segmentation resolution and added
to the outputs of the V-network (i.e. L′ = u(P) + L). Conceptually, this could represent the posterior log-probability L′ = log p(L|x, I) of the class label L at voxel x given image I as the sum of a log-likelihood L = log p(I|x, L) generated by the V-network and a prior log-probability u(P) = log p(L|x) generated by the spatial prior. However, the spatial prior parameters are trained as part of the end-to-end gradient-based optimization and may not represent the true prior probability.
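A minimal NumPy/SciPy illustration of the prior addition (the 12³ × 9 prior shape is taken from Table II; random values stand in for trained parameters):

    import numpy as np
    from scipy.ndimage import zoom

    P = np.random.randn(12, 12, 12, 9)     # low-resolution trainable prior P
    L = np.random.randn(72, 72, 72, 9)     # V-network logits at 72^3
    u_P = zoom(P, (6, 6, 6, 1), order=1)   # linear upsampling u(P) to 72^3
    L_prime = L + u_P                      # L' = u(P) + L
    probs = np.exp(L_prime)
    probs /= probs.sum(axis=-1, keepdims=True)  # softmax over the 9 classes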
1) Implementation details: The loss function was the
weighted sum of an L2 regularisation loss with label-
smoothed [47] probabilistic Dice scores for each organ l
averaged across subjects in each minibatch,
    pDice_l(L″_l, R_l) = (min(L″_l, 0.9) · R_l) / (‖R_l‖² + ‖min(L″_l, 0.9)‖²)   (6)

where the vectors L″_l = softmax(L′)_l and R_l are the algorithm's probabilistic segmentation and the binary reference standard segmentation for organ l for each subject, respectively. To further mitigate the extreme class imbalance (e.g. the esophagus averaged 0.09% of the image and the liver averaged 11.7%), Dice score hinge losses heavily penalizing Dice scores below 0.01 and 0.10 were introduced after warm-up periods of 25 and 100 iterations, respectively. The loss function at iteration i was

    loss(L″, i) = (Σ_{w∈W} w²)/40 − (1/8) Σ_{l=1}^{8} d(pDice_l(L″_l, R_l), i)   (7)
    d(l, i) = l + 100 h(l, i, 0.01, 25) + 10 h(l, i, 0.1, 100)                   (8)
    h(l, i, v, t) = sigmoid(6(i − t)/t) (max(0, v − l)/v)⁴                       (9)

where w ∈ W are kernel values, l is the Dice loss, v is the hinge loss threshold, and t is the delay in iterations.
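A sketch of the label-smoothed probabilistic Dice of Eq. (6) for one organ (our own naming; the regularization and hinge schedule of Eqs. (7)-(9) are omitted):

    import tensorflow as tf

    def label_smoothed_dice(prob_l, ref_l, cap=0.9):
        # prob_l, ref_l: flattened probabilistic and binary masks for organ l
        p = tf.minimum(prob_l, cap)             # min(L''_l, 0.9): label smoothing
        numerator = tf.reduce_sum(p * ref_l)    # min(L''_l, 0.9) . R_l
        denominator = tf.reduce_sum(ref_l ** 2) + tf.reduce_sum(p ** 2)
        return numerator / denominator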
The network was trained using the Adam optimizer with a learning rate of 0.001 and mini-batch size 10 for 5000 iterations (i.e. 625 epochs). Training each instance of the network took approximately 6 hours using Titan X Pascal or P100 GPUs (NVIDIA Corporation, Los Alamitos, CA). A TensorFlow implementation of a trained DenseVNet network is available in the NiftyNet platform model zoo (http://niftynet.io/model zoo).

The cropped region of interest, ranging from 209–471 voxels (172–367 mm) transversely and 32–450 voxels (138–283 mm) in the IS axis, was resampled to a 144³-voxel volume. During training, for data augmentation, affine perturbations were applied, yielding skewed subregions 0% to 10% smaller in each dimension. For the baseline DenseVNet used in the algorithm comparison, we used Monte Carlo inference using the mode of 30 72³ segmentation samples (chosen heuristically a priori), taking approximately 8–15 seconds per image. In post-processing, the 72³ segmentation labels were resampled to the original cropped region at the original image resolution in Matlab using curvature flow smoothing [48] with 2 iterations (chosen visually a priori to avoid quantization artifacts). Then, for each organ, the union of all connected components comprising >10% (chosen ad hoc, a priori) of the segmented organ volume was taken as the final mask.
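The connected-component rule at the end of the pipeline might look like the following SciPy sketch (the function name is ours):

    import numpy as np
    from scipy.ndimage import label

    def keep_major_components(mask, min_fraction=0.1):
        # union of connected components holding >10% of the segmented volume
        labelled, n_components = label(mask)
        total = mask.sum()
        keep = np.zeros_like(mask, dtype=bool)
        for c in range(1, n_components + 1):
            component = labelled == c
            if component.sum() > min_fraction * total:
                keep |= component
        return keep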
TABLE II
DETAILED PARAMETERS FOR DENSEVNET ARCHITECTURE

Layer     Input       Output      Kernel  Stride  Subunits (m × n_f)
Feature   144³ × 1    72³ × 24    5³      2
DFS 1     72³ × 24    72³ × 20    3³      1       5 × 4
Skip 1    72³ × 20    72³ × 12    3³      1
Down 1-2  72³ × 20    36³ × 24    3³      2
DFS 2     36³ × 24    36³ × 80    3³      1       10 × 8
Skip 2    36³ × 80    36³ × 24    3³      1
Up 2      36³ × 24    72³ × 24
Down 2-3  36³ × 80    18³ × 24    3³      2
DFS 3     18³ × 24    18³ × 160   3³      1       10 × 16
Skip 3    18³ × 160   18³ × 24    3³      1
Up 3      18³ × 24    72³ × 24
Up Prior  12³ × 9     72³ × 9
B. Evaluation metrics and statistical methods
We compared the accuracy of segmentation algorithms
using a 9-fold cross-validation over 90 subjects. For each test
image in each fold, we compared each organ segmentation to
the reference standard segmentation using three metrics:
- Dice coefficient: 2|A ∩ B|/(|A| + |B|),
- symmetric mean boundary distance: (D̄(A, B) + D̄(B, A))/2, and
- symmetric 95% Hausdorff distance: (P_95(D(A, B)) + P_95(D(B, A)))/2,

where A and B are the algorithm and reference segmentations; D(A, B) is the set of distances from the boundary pixels ∂A of A to the nearest boundary pixel in ∂B, i.e. D(A, B) = { min_{y ∈ ∂B} ‖x − y‖ : x ∈ ∂A }; D̄ denotes the mean of D; and P_95(D) is the 95th percentile of D. The Dice coefficient measures the relative volumetric overlap between segmentations. The mean boundary and 95% Hausdorff distances reflect the agreement between segmentation boundaries, with the latter being more sensitive to localized disagreements.
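These three metrics can be computed from binary masks, for example with a SciPy-based sketch like the following (voxels are assumed isotropic here; anisotropic spacing would be passed to the distance transform via its sampling argument):

    import numpy as np
    from scipy.ndimage import binary_erosion, distance_transform_edt

    def boundary_distances(a, b):
        # distances from each boundary voxel of a to the nearest boundary voxel of b
        a_border = a & ~binary_erosion(a)
        b_border = b & ~binary_erosion(b)
        return distance_transform_edt(~b_border)[a_border]

    def evaluate(a, b):
        dice = 2 * np.logical_and(a, b).sum() / (a.sum() + b.sum())
        d_ab, d_ba = boundary_distances(a, b), boundary_distances(b, a)
        mean_bd = (d_ab.mean() + d_ba.mean()) / 2
        hd95 = (np.percentile(d_ab, 95) + np.percentile(d_ba, 95)) / 2
        return dice, mean_bd, hd95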
In each analysis, we compared the accuracy of the proposed
algorithm to each comparator using a sign test for correlated
data [49], which is insensitive to the skewed distribution
of accuracy differences observed in our data, and accounts
for the correlation between values within each fold due to
the shared training set. We used Benjamini–Hochberg false-
discovery-rate multiple-comparison correction (α = 0.05) for
pairwise tests. This correction was performed separately for
the primary analysis comparing algorithms and the secondary
analysis comparing architecture variants. In several subjects,
one or more organs were not present in the images due to
prior surgeries; these organs (7 gallbladders, 1 left kidney and
1 esophagus) were excluded from the aggregate descriptive
statistics and statistical comparisons above as the measures
used are not meaningful in this scenario. In these cases, we
reported the segmented volume (ideally 0) for these organs
(Supplementary material Table II, available in the multimedia
tab online).
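The false-discovery-rate step can be reproduced, for example, with statsmodels' Benjamini-Hochberg procedure (the p-values below are made up for illustration):

    from statsmodels.stats.multitest import multipletests

    # one p-value per pairwise sign test
    p_values = [0.001, 0.008, 0.03, 0.04, 0.20]
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05,
                                             method='fdr_bh')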
C. Primary analysis: algorithm comparison
We compared the segmentation accuracy of our algorithm
to those of two existing algorithms: the deep-learning-based
