
Journal ArticleDOI

A shape-constraint adversarial framework with instance-normalized spatio-temporal features for inter-fetal membrane segmentation

19 Feb 2021 - Medical Image Analysis (Elsevier) - Vol. 70, 102008

Abstract. Background and Objectives: During Twin-to-Twin Transfusion Syndrome (TTTS), abnormal vascular anastomoses in the monochorionic placenta can produce uneven blood flow between the fetuses. In the current practice, this syndrome is surgically treated by closing the abnormal connections using laser ablation. Surgeons commonly use the inter-fetal membrane as a reference. Limited field of view, low fetoscopic image quality and high inter-subject variability make the membrane identification a challenging task. However, currently available tools are not optimal for automatic membrane segmentation in fetoscopic videos, due to membrane texture homogeneity and high illumination variability. Methods: To tackle these challenges, we present a new deep-learning framework for inter-fetal membrane segmentation on in-vivo fetoscopic videos. The framework enhances existing architectures by (i) encoding a novel (instance-normalized) dense block, invariant to illumination changes, that extracts spatio-temporal features to enforce pixel connectivity in time, and (ii) relying on an adversarial training, which constrains macro appearance. Results: We performed a comprehensive validation using 20 different videos (2000 frames) from 20 different surgeries, achieving a mean Dice Similarity Coefficient of 0.8780 ± 0.1383. Conclusions: The proposed framework has great potential to positively impact the actual surgical practice for TTTS treatment, allowing the implementation of surgical guidance systems that can enhance context awareness and potentially lower the duration of the surgeries.

Summary (3 min read)

1. Introduction

  • Twin-to-twin transfusion syndrome (TTTS) may occur, during identical twin pregnancies, when abnormal vascular anastomoses in the monochorionic placenta result in uneven blood flow between the fetuses.
  • At the beginning of the surgical treatment, the surgeon identifies the interfetal membrane, which is used as a reference to explore the placenta vascular network and identify vessels to be treated.
  • As for placental vessel segmentation, the work in Almoussa et al. (2011) proposes a neural network trained on handcrafted features from ex-vivo placenta images.
  • The instance-normalized topology can tackle the illumination variability typical of fetoscopic videos acquired during TTTS surgery.
  • The spatio-temporal features can boost segmentation performance enforcing the consistency of segmentation masks across sequential frames.

1.1. Contribution of the work

  • The authors address the problem of automatic inter-fetal membrane segmentation to enhance surgeon context awareness during TTTS surgery.
  • Specifically, the authors extend the adversarial framework presented in Casella et al. (2020) to process, via spatio-temporal convolution, surgical video clips.
  • This allows the authors to exploit the temporal information naturally encoded in videos.
  • The authors further design a dense block that encodes instance normalization, to account for illumination changes in the video clips.
  • The authors will make the dataset collected for this work publicly available, to foster further research in the field.

2. Methods

  • The proposed framework consists of the segmentor, described in Sec. 2.1, and a discriminator network (the critic), described in Sec. 2.2.
  • The segmentor and critic are trained in an adversarial fashion, following the strategy proposed in Casella et al. (2020) and described in Sec. 2.3.

2.1. Segmentor

  • The segmentor has a dense UNet-like architecture consisting of a downsampling and an upsampling path, linked via long-skip connections.
  • This process is repeated while frames are available, resulting in a collection of temporal clips.
  • Each dense block is followed by a transition down module for downscaling.
  • Building upon the dense module proposed in Huang et al. (2017), the authors propose a new dense module that uses two (leaky ReLU) pre-activated convolutions, instead of a single one.
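The dense module described above can be sketched as follows. This is an illustrative PyTorch sketch, not the authors' code: the channel counts, growth rate and leaky-ReLU slope are assumptions; only the structure (instance normalization before the first pre-activated 3D convolution, batch normalization before the second, dense concatenation) follows the paper's description.

```python
import torch
import torch.nn as nn

class DenseModule3D(nn.Module):
    """Sketch of a dense module with two pre-activated 3D convolutions:
    instance normalization before the first convolution (for illumination
    invariance) and batch normalization before the second. Channel sizes
    and the growth rate are hypothetical."""

    def __init__(self, in_channels: int, growth_rate: int = 16):
        super().__init__()
        self.block = nn.Sequential(
            nn.InstanceNorm3d(in_channels),
            nn.LeakyReLU(0.2),
            nn.Conv3d(in_channels, growth_rate, kernel_size=3, padding=1),
            nn.BatchNorm3d(growth_rate),
            nn.LeakyReLU(0.2),
            nn.Conv3d(growth_rate, growth_rate, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Dense connectivity: concatenate input and new features on channels.
        return torch.cat([x, self.block(x)], dim=1)

# A clip of 4 frames, shaped (batch, channels, time, height, width).
clip = torch.randn(1, 3, 4, 128, 128)
out = DenseModule3D(3).forward(clip)
```

Because of the dense concatenation, the output carries the input channels plus `growth_rate` new feature channels, which is what lets later modules reuse earlier features.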

2.2. Critic

  • It is composed of two branches, as described in Table 1 and shown in Fig. 2, for extracting features from both the gold-standard segmentation and the segmentor output.
  • The authors decided to keep the critic architecture similar to its original implementation because the role of the critic is to provide a shape constraining mechanism for the segmentor output.
  • The use of dense blocks would have introduced unnecessary complexity with an increase in memory requirements.
  • The segmentor branch takes as input x masked by the output of the segmentor (S(x)).
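The masking described above can be sketched as an element-wise product between the input clip and a broadcast segmentation mask; a minimal illustration, where the tensor shapes are assumptions:

```python
import torch

# Hypothetical shapes: input clip x is (batch, 3, time, H, W); the segmentor
# output S(x) and the gold standard y are masks of shape (batch, 1, time, H, W).
x = torch.rand(1, 3, 4, 128, 128)
s_x = torch.rand(1, 1, 4, 128, 128)                    # segmentor output in [0, 1]
y = torch.randint(0, 2, (1, 1, 4, 128, 128)).float()   # gold-standard mask

# Each critic branch receives the input masked by one segmentation, so the
# critic compares features of the predicted vs the true membrane region.
critic_in_pred = x * s_x   # mask broadcasts over the channel dimension
critic_in_gold = x * y
```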

2.3. Adversarial training strategy

  • The segmentor and critic layers are initialised using the robust initialization of He et al. (2015).
  • While there is a possible risk of loss divergence during training, hyperparameters can be introduced to balance the action of the two terms in the loss function and avoid divergence. However, this never occurred in their experiments.
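Per the framework description (Fig. 2), the training loss combines a per-pixel binary cross-entropy on the segmentor output with the mean absolute error between the critic's feature vectors for the predicted and gold-standard inputs. A hedged sketch, in which the weighting term `lambda_adv` and the feature-vector size are hypothetical:

```python
import torch
import torch.nn.functional as F

def segmentor_loss(pred, gold, feat_pred, feat_gold, lambda_adv=1.0):
    """Sketch of the combined loss: per-pixel binary cross-entropy plus the
    mean absolute error between critic feature vectors. The relative weight
    lambda_adv is a hypothetical hyperparameter balancing the two terms."""
    bce = F.binary_cross_entropy(pred, gold)   # per-pixel term
    mae = F.l1_loss(feat_pred, feat_gold)      # shape-constraining term
    return bce + lambda_adv * mae

pred = torch.rand(1, 1, 4, 128, 128)
gold = torch.randint(0, 2, (1, 1, 4, 128, 128)).float()
feat_pred, feat_gold = torch.rand(1, 256), torch.rand(1, 256)
loss = segmentor_loss(pred, gold, feat_pred, feat_gold)
```

A weighting term of this kind is one way to balance the two loss components and guard against the divergence risk mentioned above.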

3.1. Dataset

  • To experimentally evaluate their two research hypotheses, the authors collected a dataset of 20 fetoscopic videos acquired during 20 different surgical procedures for treating TTTS in 20 women.
  • The membrane was manually annotated in each frame under the supervision of the surgeon.
  • This dataset, to the best of their knowledge, is the biggest dataset currently available for inter-fetal membrane segmentation.
  • Each frame was cropped to contain only the FoV of the fetoscope and resized to 128×128 pixels, both to smooth noise and to limit memory usage.

3.2. Parameter setting

  • The authors used wlength = 4 due to the complexity of their framework, which required high memory usage and computational power.
  • Validation and testing temporal clips were built using the same parameters but with ∆w = 4 (i.e., without overlap).
  • During training, at each iteration step, each batch was augmented with random rotation in the range (−25°, +25°), horizontal and vertical flips, and scaling with a scaling factor in the range (0.5, 1.5).
  • The Mann–Whitney–Wilcoxon test on Acc and DSC, with a significance level (p) of 0.05, was used to assess whether remarkable differences existed between the tested architectures.
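The temporal-clip construction used above (window length wlength = 4, stride ∆w) can be sketched in plain Python; the frame counts here are illustrative:

```python
def build_clips(num_frames: int, w_length: int = 4, delta_w: int = 4):
    """Build temporal clips of w_length consecutive frame indices with
    stride delta_w. delta_w = w_length gives non-overlapping clips, as
    used for validation and testing; a smaller delta_w gives overlapping
    clips. Trailing frames that do not fill a full window are dropped
    in this sketch."""
    return [
        list(range(start, start + w_length))
        for start in range(0, num_frames - w_length + 1, delta_w)
    ]

# 100 frames per video -> 25 non-overlapping 4-frame clips
clips = build_clips(100, w_length=4, delta_w=4)
```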

3.4. Ablation studies

  • The authors compared the results of the proposed framework against those of the adversarial network presented in Casella et al. (2020), which is the closest work with respect to ours.
  • Considering that a comprehensive comparison with standard state-of-the-art approaches (e.g., UNet (Ronneberger et al., 2015) and ResNet (He et al., 2016)) is already provided in Casella et al. (2020), the authors here focused on the ablation studies.
  • For E6, the lowest performance was obtained with ∆w = 4 (no overlap between temporal clips).
  • Visual samples for the tested models are shown in Fig.
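Comparisons between the tested architectures, such as per-frame DSC distributions of two models, can be run with the Mann–Whitney–Wilcoxon test mentioned in Sec. 3.2. A sketch using SciPy, where the scores are synthetic stand-ins, not the paper's results:

```python
import random
from scipy.stats import mannwhitneyu

random.seed(0)
# Synthetic per-frame DSC scores for two hypothetical models; the means
# and spread are illustrative only.
dsc_model_a = [random.gauss(0.88, 0.05) for _ in range(100)]
dsc_model_b = [random.gauss(0.80, 0.05) for _ in range(100)]

# Two-sided Mann-Whitney-Wilcoxon test at significance level 0.05.
stat, p_value = mannwhitneyu(dsc_model_a, dsc_model_b, alternative="two-sided")
significant = p_value < 0.05
```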

5. Discussion and conclusions

  • This paper introduced a shape-constrained adversarial framework with instance-normalized spatio-temporal features to perform automatic inter-fetal membrane segmentation in fetoscopic video clips, while tackling the high illumination variability in fetoscopic videos.
  • The authors noticed that 3D convolution alone was not able to boost segmentation consistency, as the results are comparable with the 2D vanilla adversarial framework (E3).
  • In such cases, the temporal connectivity introduced to guarantee consistency across consecutive frames can affect the accuracy of segmentation negatively.
  • To conclude, the achieved results suggest that the proposed approach may be effective in supporting surgeons in the identification of the inter-fetal membrane in fetoscopic videos.
  • Data used for the analysis were acquired during actual surgery procedures and then anonymized to allow researchers to conduct the study.


A shape-constraint adversarial framework with instance-normalized spatio-temporal features for inter-fetal membrane segmentation

Alessandro Casella (a,b), Sara Moccia (c,d), Dario Paladini (f), Emanuele Frontoni (e), Elena De Momi (b), Leonardo S. Mattos (a)

a Department of Advanced Robotics, Istituto Italiano di Tecnologia, Genoa, Italy
b Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milan, Italy
c The BioRobotics Institute, Scuola Superiore Sant'Anna, Pisa, Italy
d Department of Excellence in Robotics and AI, Scuola Superiore Sant'Anna, Pisa, Italy
e Department of Information Engineering, Università Politecnica delle Marche, Ancona, Italy
f Department of Fetal and Perinatal Medicine, Istituto "Giannina Gaslini", Genoa, Italy
Corresponding author: Alessandro Casella (alessandro.casella@iit.it)
Preprint submitted to Medical Image Analysis, February 22, 2021

Figure 1: Sample frames from our dataset. The frames are extracted from intra-operative videos acquired in the actual surgical practice for Twin-to-Twin Transfusion Syndrome (TTTS). Each frame refers to a different video. Although video acquisition was performed with the same equipment, the frames present high variability, in terms of: (i) different membrane position, shape, tissue area in the field of view, contrast and texture, (ii) noise and blur, (iii) presence of amniotic fluid particles, (iv) vessels along the membrane equator, (v) different levels of illumination, (vi) presence of laser-guide light.
Keywords: Inter-Fetal Membrane, Twin-to-Twin Transfusion Syndrome (TTTS), Deep Learning, Fetoscopy
1. Introduction
Twin-to-twin transfusion syndrome (TTTS) may occur, during identical twin pregnancies, when abnormal vascular anastomoses in the monochorionic placenta result in uneven blood flow between the fetuses. If not treated, the risk of perinatal mortality of one or both fetuses can exceed 90% (Baschat et al., 2011). To recover the blood flow balance, the most effective treatment is minimally invasive laser surgery in fetoscopy (Quintero, 2003; Roberts et al., 2014).

At the beginning of the surgical treatment, the surgeon identifies the inter-fetal membrane, which is used as a reference to explore the placenta vascular network and identify vessels to be treated. Limited field of view (FoV), poor visibility, fetuses' movements, high illumination variability (as shown in Fig. 1) and limited maneuverability of the fetoscope make membrane identification a challenging task. This results in increased surgery duration, as well as increased risks of complications on the patients' side, such as premature rupture of the membranes (Beck et al., 2012), and mental workload on the surgeons' side.

The Surgical Data Science (SDS) (Maier-Hein et al., 2017) community is working towards developing computer-assisted algorithms to perform intra-operative tissue segmentation (Moccia et al., 2020). However, SDS approaches for membrane segmentation have only been marginally explored.

Work relevant to TTTS video analysis focuses on surgical planning, surgical-phase detection, intrauterine cavity segmentation, placental vessel segmentation and mosaicking reconstruction. Examples of surgical-phase detection in TTTS include the work of Vasconcelos et al. (2018), where a ResNet encoder is used to detect the ablation phase, and Bano et al. (2020), which extends Vasconcelos et al. (2018) by adding an LSTM layer to integrate temporal information and detect different surgical phases. In Torrents-Barrena et al. (2020), a reinforcement learning approach that relies on capsule networks has been proposed to perform automatic intrauterine cavity segmentation from multi-planar placenta magnetic-resonance imaging recordings, for surgical planning purposes. As for placental vessel segmentation, the work in Almoussa et al. (2011) proposes a neural network trained on handcrafted features from ex-vivo placenta images. In Sadda et al. (2019), a UNet architecture is proposed to perform patch-based vessel segmentation from intra-operative fetoscopic frames. Large efforts have also been put into mosaicking strategies to provide the surgeons with navigation maps of the placenta. In Daga et al. (2016), SIFT is used as feature extractor for frame registration, while in Gaisser et al. (2018); Peter et al. (2018); Bano et al. (2019); Tella-Amo et al. (2019) deep-learning strategies are presented.
Figure 2: Proposed framework for inter-fetal membrane segmentation in fetoscopic videos. The segmentor is a U-shaped network with long-skip connections, consisting of dense blocks, each of which is composed of multiple (number below each block) dense modules. Each module is composed of two pre-activated 3D convolutions, where normalization is performed at instance (1st convolution) and batch (2nd convolution) level. The transition down and transition up modules perform downsampling and upsampling, respectively. The critic, inspired by Casella et al. (2020), consists of a 3D version of the encoder branch of UNet. During training, as explained in Sec. 2.3, the critic extracts the feature vectors from the input masked by the segmentor output and the gold standard. The Mean Absolute Error (MAE) computed between the two vectors contributes, along with the per-pixel binary cross entropy (BCE), to the loss that is minimized during training.

Previous work (Casella et al., 2020) implemented a residual network along with an adversarial training strategy to enforce placenta-shape constraining. Despite achieving promising results, the work does not address the problem of high illumination variability in fetoscopic frames. Furthermore, the temporal information naturally encoded in the fetoscopic videos is not processed. 3D architectures such as V-Net (Milletari et al., 2016) have been widely used for volumetric segmentation in medical images. More recently, 3D architectures have been used for processing endoscopic videos. Temporal feature processing has proven effective in segmentation tasks in close fields (e.g., instrument joint detection (Colleoni et al., 2019) and pose estimation (Moccia et al., 2019)), enhancing the temporal continuity in feature processing.

Following such considerations, in this work we implement an adversarial strategy to train a novel densely connected 3D fully convolutional neural network (FCNN), which we call the segmentor, for inter-fetal membrane segmentation. The third dimension refers to time, for spatio-temporal feature extraction. The dense topology of the segmentor is here built with an adaptive mechanism for instance illumination normalization. With a comprehensive study on 20 videos (2000 frames) acquired from 20 women during actual surgery, we investigated the following research hypotheses:

Hypothesis 1 (H1): The instance-normalized topology can tackle the illumination variability typical of fetoscopic videos acquired during TTTS surgery.

Hypothesis 2 (H2): The spatio-temporal features can boost segmentation performance by enforcing the consistency of segmentation masks across sequential frames.

Here, the gold standard annotation was obtained manually under the supervision of an expert surgeon.


References

He, K., Zhang, X., Ren, S., Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Ronneberger, O., Fischer, P., Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI).

Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

He, K., Zhang, X., Ren, S., Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. arXiv preprint.

Ulyanov, D., Vedaldi, A., Lempitsky, V. (2016). Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.