
Polyp Detection and Segmentation using Mask R-CNN: Does a Deeper Feature Extractor CNN Always Perform Better?

Hemin Ali Qadir (1,2,5), Younghak Shin (6), Johannes Solhusvik (2,5), Jacob Bergsland (1), Lars Aabakken (1,4), Ilangko Balasingham (1,3)

(1) Intervention Centre, Oslo University Hospital, Oslo, Norway
(2) Department of Informatics, University of Oslo, Oslo, Norway
(3) Department of Electronic Systems, Norwegian University of Science and Technology, Trondheim, Norway
(4) Department of Transplantation Medicine, University of Oslo, Oslo, Norway
(5) OmniVision Technologies Norway AS, Oslo, Norway
(6) LG CNS, Seoul, Korea
Abstract—Automatic polyp detection and segmentation are highly desirable for colon screening due to the polyp miss rate by physicians during colonoscopy, which is about 25%. However, this computerization is still an unsolved problem due to various polyp-like structures in the colon and high interclass polyp variations in terms of size, color, shape and texture. In this paper, we adapt Mask R-CNN and evaluate its performance with different modern convolutional neural networks (CNN) as its feature extractor for polyp detection and segmentation. We investigate the performance improvement of each feature extractor by adding extra polyp images to the training dataset, to answer whether we need deeper and more complex CNNs, or a better dataset for training, in automatic polyp detection and segmentation. Finally, we propose an ensemble method for further performance improvement. We evaluate the performance on the 2015 MICCAI polyp detection dataset. The best results achieved are 72.59% recall, 80% precision, 70.42% Dice, and 61.24% Jaccard. The model achieved state-of-the-art segmentation performance.
Index Terms—polyp detection, polyp segmentation, convolutional neural network, Mask R-CNN, ensemble
I. INTRODUCTION

Colorectal cancer is the second most common cause of cancer-related death in the United States for both men and women, and its incidence increases every year [1]. Colonic polyps, growths of glandular tissue at the colonic mucosa, are the major cause of colorectal cancer. Although they are initially benign, they might become malignant over time if left untreated [2]. Colonoscopy is the primary method for screening and preventing polyps from becoming cancerous [3]. However, colonoscopy depends on highly skilled endoscopists and a high level of eye-hand coordination, and recent clinical studies have shown that 22%–28% of polyps are missed in patients undergoing colonoscopy [4].
Over the past decades, various computer-aided diagnosis systems have been developed to reduce the polyp miss rate and improve the detection capability during colonoscopy [5]–[19]. The existing automatic polyp detection and segmentation methods can be roughly grouped into two categories: 1) those which use hand-crafted features [5]–[11], and 2) those which use a data-driven approach, more specifically deep learning [12]–[18].

This work was supported by the Research Council of Norway through the industrial Ph.D. project under contract number 271542/O30.
The majority of the hand-crafted methods can be categorized into two groups: texture/color based [5]–[8] and shape based [9]–[11]. In [5]–[8], color wavelet, texture, Haar, histogram of oriented gradients, and local binary pattern features were investigated to differentiate polyps from the normal mucosa. Hwang et al. [9] assumed that polyps have an elliptical shape that distinguishes them from non-polyp regions. Bernal et al. [10] used valley information based on polyp appearance to segment potential regions by watersheds, followed by region merging and classification. Tajbakhsh et al. [11] used edge shape and context information to accumulate votes for polyp regions. These feature patterns are frequently similar in polyps and polyp-like normal structures, resulting in decreased performance.
To overcome the shortcomings of hand-crafted features, data-driven approaches based on CNNs were proposed for polyp detection [12]–[19]. In the 2015 MICCAI sub-challenge on automatic polyp detection [12], most of the proposed methods were based on CNNs, including the winner. The authors in [13] and [14] showed that fully convolutional network (FCN) architectures could be refined and adapted to recognize polyp structures. Zhang et al. [15] used FCN-8S to segment polyp region candidates, and texton features computed from each region were used by a random forest classifier for the final decision. Shin et al. [16] showed that Faster R-CNN is a promising technique for polyp detection. Zhang et al. [17] added a tracker to enhance the performance of a CNN polyp detector. Yu et al. [18] adapted a 3D-CNN model in which a sequence of frames was used for polyp detection.
In this paper, we adapt Mask R-CNN [20] for polyp detection and segmentation. Segmenting out polyps from the normal mucosa can help physicians reduce their segmentation errors and subjectivity. We have several objectives in this study. We first evaluate the performance of Mask R-CNN and compare it to existing methods. Secondly, we evaluate different CNN architectures (e.g., Resnet50 and Resnet101 [21], and Inception Resnet V2 [22]) as the feature extractor of the Mask R-CNN for polyp segmentation. Thirdly, we aim to answer to what extent adding extra training images can help improve the performance of each of the CNN feature extractors. Do we really need to go for a deeper and more complex CNN to extract a higher level of features, or do we just need to build a better dataset for training? Finally, we propose an ensemble method for further performance improvement.
II. MATERIALS AND METHODS

A. Datasets
Most of the proposed methods mentioned in Section I were tested on different datasets. The authors in [14], [15] used a dataset containing images of the same polyps for the training and testing phases after randomly splitting it into two subsets. This is not a very realistic case for validating a method, as we may have the same polyps in the training and testing phases. These two issues limit the comparison between the reported results. The 2015 MICCAI sub-challenge on automatic polyp detection was an attempt to evaluate different methods on the same datasets. We therefore use the same datasets of the 2015 MICCAI polyp detection challenge for training and testing the models. We only use the two datasets of still images: 1) CVC-ClinicDB [23], containing 32 different polyps presented in 612 images, and 2) ETIS-Larib [24], containing 36 different polyps presented in 196 images. In addition, we use CVC-ColonDB [25], which contains 15 different polyps presented in 300 images.
B. Evaluation Metrics
For polyp detection performance evaluation, we calculate recall and precision using the well-known medical parameters True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN) as follows:

$$\mathrm{recall} = \frac{TP}{TP + FN}, \quad (1)$$

$$\mathrm{precision} = \frac{TP}{TP + FP}. \quad (2)$$
For evaluation of polyp segmentation, we use common segmentation evaluation metrics: the Jaccard index (also known as intersection over union, IoU) and the Dice similarity score:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}, \quad (3)$$

$$\mathrm{Dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}, \quad (4)$$

where A represents the output image of the method and B the actual ground truth.
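All four metrics are straightforward to compute from raw detection counts and binary masks. The following is a minimal sketch of Eqs. (1)–(4), assuming NumPy boolean arrays for the predicted and ground-truth masks; it is our illustration, not the authors' evaluation code:

```python
import numpy as np

def recall(tp, fn):
    # Eq. (1): fraction of true polyps that were detected.
    return tp / (tp + fn)

def precision(tp, fp):
    # Eq. (2): fraction of detections that are true polyps.
    return tp / (tp + fp)

def jaccard(pred, gt):
    # Eq. (3): |A ∩ B| / |A ∪ B| for boolean mask arrays.
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union

def dice(pred, gt):
    # Eq. (4): 2|A ∩ B| / (|A| + |B|).
    inter = np.logical_and(pred, gt).sum()
    return 2 * inter / (pred.sum() + gt.sum())
```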
C. Mask R-CNN
Mask R-CNN [20] is a general framework for object instance segmentation. It is an intuitive extension of Faster R-CNN [26], the state-of-the-art object detector. Mask R-CNN adopts the same first stage as Faster R-CNN, namely the region proposal network (RPN). It adds a new branch to the second stage for predicting an object mask in parallel with the existing branches for bounding box regression and confidence value. Instead of using RoIPool, which performs coarse quantization for feature extraction in Faster R-CNN, Mask R-CNN uses RoIAlign, a quantization-free layer, to fix the misalignment problem.
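The practical difference is easy to probe with torchvision's `roi_align` op, which samples the feature map with bilinear interpolation instead of snapping box coordinates to the feature grid. A small sketch (our own illustration, not the paper's code; the tensor sizes are made up):

```python
import torch
from torchvision.ops import roi_align

# A 256-channel feature map for one image, e.g. stride-16 features of an
# 800x800 input.
features = torch.randn(1, 256, 50, 50)

# One region of interest: (batch_index, x1, y1, x2, y2) in input-image coordinates.
rois = torch.tensor([[0, 40.0, 60.0, 360.0, 420.0]])

# RoIAlign pools each region to a fixed 14x14 grid (the mask-branch resolution
# mentioned below) without quantizing the box coordinates; spatial_scale maps
# image coordinates onto the feature map.
pooled = roi_align(features, rois, output_size=(14, 14),
                   spatial_scale=1.0 / 16, sampling_ratio=2)
print(pooled.shape)  # torch.Size([1, 256, 14, 14])
```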
For our polyp detection and segmentation, we use the architecture shown in Fig. 1 to evaluate the performance of Mask R-CNN with different CNN based feature extractors. To train our models, we use a multi-task loss on each region of interest, called an anchor, proposed by the RPN. For each anchor $a$, we find the best matching ground-truth box $b$. If there is a match, anchor $a$ acts as a positive anchor, and we assign a class label $y_a = 1$ and a vector $\phi(b_a; a)$ encoding box $b$ with respect to anchor $a$. If there is no match, anchor $a$ acts as a negative sample, and the class label is set to $y_a = 0$. The mask branch has a $14 \times 14$ dimensional output for each anchor. The loss for each anchor $a$ then consists of three losses: location-based loss $\ell_{loc}$ for the predicted box $f_{loc}(I; a, \theta)$, classification loss $\ell_{cls}$ for the predicted class $f_{cls}(I; a, \theta)$, and mask loss $\ell_{mask}$ for the predicted mask $f_{mask}(I; a, \theta)$, where $I$ is the image and $\theta$ is the model parameter:

$$L(a, I; \theta) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{N} \sum_{j=1}^{N} \Big[\, \mathbb{1}[a\ \text{is positive}] \cdot \ell_{loc}\big(\phi(b_a; a), f_{loc}(I; a, \theta)\big) + \ell_{cls}\big(y_a, f_{cls}(I; a, \theta)\big) + \ell_{mask}\big(\mathrm{mask}_a, f_{mask}(I; a, \theta)\big) \,\Big], \quad (5)$$

where $m$ is the size of the mini-batch and $N$ is the number of anchors for each frame. We use the following loss functions: smooth L1 for the localization loss, softmax for the classification loss, and binary cross-entropy for the mask loss.
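Concretely, Eq. (5) can be assembled from standard library losses. Below is a hedged sketch in PyTorch; the tensor shapes and the function name are our assumptions, since the paper does not give implementation details:

```python
import torch
import torch.nn.functional as F

def multi_task_loss(loc_pred, loc_target, cls_logits, cls_target,
                    mask_logits, mask_target, positive):
    """Sketch of the per-image loss in Eq. (5).

    positive: boolean tensor over anchors. Localization and mask terms are
    only counted for anchors matched to a ground-truth box (the box and mask
    targets are undefined for negatives); classification covers all anchors.
    """
    # Smooth L1 between predicted box offsets and encoded targets phi(b_a; a).
    loc_loss = F.smooth_l1_loss(loc_pred[positive], loc_target[positive],
                                reduction="sum")
    # Softmax cross-entropy over the class labels y_a (0 or 1 here).
    cls_loss = F.cross_entropy(cls_logits, cls_target, reduction="sum")
    # Binary cross-entropy over the 14x14 mask outputs of positive anchors.
    mask_loss = F.binary_cross_entropy_with_logits(
        mask_logits[positive], mask_target[positive], reduction="sum")
    # Normalize by the number of anchors N; averaging over the mini-batch m
    # happens outside this function.
    n_anchors = cls_target.numel()
    return (loc_loss + cls_loss + mask_loss) / n_anchors
```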
D. CNN Feature Extractor Networks
In the first stage of Mask R-CNN, we need a CNN based feature extractor to extract high-level features from the input image. The choice of the feature extractor is essential because the CNN architecture, the number of parameters and the type of layers directly affect the speed, memory usage and, most importantly, the performance of the Mask R-CNN. In this study, we select three feature extractors to compare and evaluate their performance in polyp detection and segmentation: a deep CNN (Resnet50 [21]), a deeper CNN (Resnet101 [21]), and a complex CNN (Inception Resnet (v2) [22]). Resnet is a residual learning framework that eases the training of substantially deep networks and avoids the degradation problem, in which accuracy saturates and then degrades rapidly as depth increases [21]. With residual learning, we can benefit from deeper CNN networks to obtain an even higher level of features, which is essential for difficult tasks such as polyp detection and segmentation. With the inception technique, we can increase the depth and width of a CNN network without increasing the computational cost [27]. Szegedy et al. [22] proposed Inception Resnet (v2) to combine the optimization benefits of residual learning with the computational efficiency of inception units.

Fig. 1. Our Mask R-CNN framework. In the first stage, we use Resnet50, Resnet101 and Inception Resnet (v2) as the feature extractor for the performance evaluation of polyp detection and segmentation. The region proposal network (RPN) utilizes feature maps at one of the intermediate layers (usually the last convolutional layer) of the CNN feature extractor to generate box proposals (300 boxes in our study). The proposed boxes are a grid of anchors tiled in different aspect ratios and scales. The second stage predicts the confidence value, the offsets for the proposed box, and the mask within the box for each anchor.
For all three feature extractors, it is important to choose one of the layers at which to extract features for predicting region proposals by the RPN. In our experiments, we use the layers recommended by the original papers. For both Resnet50 and Resnet101, we use the last layer of the conv4 block. For Inception Resnet (v2), we use the Mixed_6a layer and its associated residual layers.
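As a rough illustration of how such a backbone comparison can be set up, torchvision allows ResNet feature extractors of different depths to be swapped under a single Mask R-CNN head. This is a sketch under our own assumptions: the authors' framework is not specified here, Inception Resnet (v2) is not available in torchvision, torchvision attaches an FPN rather than a single conv4 feature layer, and exact keyword arguments vary across versions.

```python
from torchvision.models.detection import MaskRCNN
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

def build_polyp_model(backbone_name="resnet50", num_classes=2):
    # num_classes = 2: background + polyp.
    backbone = resnet_fpn_backbone(backbone_name=backbone_name, weights=None)
    return MaskRCNN(backbone, num_classes=num_classes)

# Same detection/mask heads, different feature extractor depths.
model_resnet50 = build_polyp_model("resnet50")
model_resnet101 = build_polyp_model("resnet101")
```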
E. Ensemble Model
The three CNN feature extractors compute different types of features due to differences in their number of layers and architectures. A deeper CNN can compute a higher level of features from the input image, while it loses some spatial information due to the contraction and pooling layers. Some polyps might be missed by one of the CNN models while being detected by another. To partly solve this problem, we propose an ensemble model that combines the results of two Mask R-CNN models with two different CNN feature extractors. We use one of the models as the main model, whose output is always relied on, and the second model as an auxiliary model to support the main model. We only take into account the outputs from the auxiliary model when the confidence of the detection is > 95% (a value optimized using a validation dataset, see Section III-B).
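A sketch of this decision rule follows; the detection structure and field names are our assumptions, and the paper does not specify how overlapping detections from the two models are merged:

```python
CONFIDENCE_THRESHOLD = 0.95  # tuned on the validation set, see Section III-B

def ensemble_detections(main_dets, aux_dets, threshold=CONFIDENCE_THRESHOLD):
    # Each detection is assumed to be a dict: {"box": ..., "mask": ..., "score": float}.
    # The main model's outputs are always kept; the auxiliary model contributes
    # only its high-confidence detections.
    kept = list(main_dets)
    kept.extend(d for d in aux_dets if d["score"] > threshold)
    return kept
```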
F. Training Details
The available polyp datasets are not large enough to train a deep CNN. To prevent the models from overfitting, we enlarge the dataset by applying different augmentation strategies, following the same augmentation methods recommended by Shin et al. [16]. Image augmentation cannot improve the data distribution of the training set; it only provides image-level transformations through depth and scale, which does not prevent the model from overfitting. Therefore, we use transfer learning by initializing the weights of our CNN feature extractors from models pre-trained on Microsoft's COCO dataset [28]. We use SGD with a momentum of 0.9, a learning rate of 0.0003, and a batch size of 1 to fine-tune the pre-trained CNNs on the augmented dataset. We keep the original image size during both the training and test phases.
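For illustration, this configuration maps onto a short fine-tuning skeleton. The sketch below uses torchvision under our own assumptions (the authors' code and framework are not given), and omits re-sizing the predictor heads for the two-class polyp task:

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# COCO pre-trained weights stand in for the transfer-learning initialization;
# for the 2-class (background + polyp) task the box/mask predictor heads
# would be replaced before fine-tuning.
model = maskrcnn_resnet50_fpn(weights="DEFAULT")
optimizer = torch.optim.SGD(model.parameters(), lr=0.0003, momentum=0.9)

def train_one_epoch(model, loader, optimizer, device="cuda"):
    # Batch size 1: the loader yields one (image, target) pair per step, and
    # images keep their original sizes.
    model.to(device).train()
    for images, targets in loader:
        images = [img.to(device) for img in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        loss_dict = model(images, targets)  # dict of the detection losses
        loss = sum(loss_dict.values())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```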
III. RESULTS AND DISCUSSION

A. Performance Evaluation of the CNN Feature Extractors
In this section, we report the performance of our Mask R-CNN model shown in Fig. 1 with the three CNN feature extractors as the base networks. In this experiment, we used CVC-ColonDB for training and CVC-ClinicDB for testing. We trained the three Mask R-CNN models for 10, 20, and 30 epochs and drew curves to show the performance improvement (see Fig. 2). We noticed that 20 epochs were enough to fine-tune the parameters of the three Mask R-CNN models for polyp detection and segmentation; in the case of Resnet50 and Resnet101, only 10 epochs were needed. It seems that the models become overfitted on the training dataset after 30 epochs, which results in performance degradation.
For comparison, we chose 20 epochs and summarized the results in Table I. Inception Resnet (v2) and Resnet101 have shown the best performance for many object classification, detection and segmentation tasks on datasets of natural images [29]. However, Mask R-CNN with Resnet50 outperformed the counterpart models in all evaluation metrics, with a recall of 83.49%, precision of 92.95%, Dice of 71.6% and Jaccard of 63.9%. This might be due to the fact that deeper and more complex networks need a larger number of images for training.
Fig. 2. Accuracy of the CNN feature extractors (Resnet50, Resnet101, Inception Resnet) vs. number of training epochs.
The CVC-ColonDB dataset contains 300 images with only 15 different polyps. This dataset might not have enough unique polyps for Resnet101 and Inception Resnet (v2) to show their actual performance. This outcome is important because it could be used as evidence to properly choose a CNN feature extractor according to the size of the available dataset.
TABLE I
COMPARISON OF THE RESULTS OBTAINED ON THE CVC-CLINICDB AFTER THE MODELS HAVE BEEN TRAINED FOR 20 EPOCHS

Mask R-CNNs         Recall %   Precision %   Dice %   Jaccard %
Resnet50            83.49      92.95         71.6     63.9
Resnet101           80.71      92.1          70.42    63.3
Inception Resnet    77.31      91.25         70.31    63.6
Fig. 3 illustrates three examples with different output results. The polyp shown in the first column is correctly detected and nicely segmented by all three models. The polyp in the second column is detected correctly by all three models, but only Resnet50 succeeded in segmenting out most of the polyp pixels from the background. The polyp in the third column is detected and segmented only by Resnet50.
B. Ensemble Results
It is important to know whether detection and segmentation performance can be improved by combining the output results of two Mask R-CNN models. Table II shows the results of this combination.
TABLE II
ENSEMBLE RESULTS OBTAINED ON THE CVC-CLINICDB BY COMBINING THE RESULTS OF TWO MASK R-CNN MODELS

Mask R-CNNs          Recall %   Precision %   Dice %   Jaccard %
Resnet50             83.49      92.95         71.6     63.9
Resnet101            80.71      92.1          70.42    63.3
Resnet Inception     77.31      91.25         70.31    63.6
Ensemble 50+101      86.42      92.41         75.72    68.28
Improvement          2.93       -0.54         4.12     4.38
Ensemble 50+Incep    83.95      90.67         74.73    67.41
Improvement          0.46       -2.28         3.13     3.51

50+101: Resnet50 used as main, Resnet101 used as auxiliary
50+Incep: Resnet50 used as main, Resnet Inception used as auxiliary
We chose Resnet50 as our main model because it performed better than its counterparts, as seen in Table I, and each of the two other models in turn as the auxiliary model.
Fig. 3. Example of three outputs produced by our Mask R-CNN models. The images in the 1st row show the ground truths for the polyps shown in the 2nd row. The images in the 3rd row show the output results produced by Mask R-CNN with Resnet50. The images in the 4th row are outputs from Mask R-CNN with Resnet101. The images in the 5th row are outputs from Mask R-CNN with Inception Resnet (v2).
We first used the ETIS-Larib dataset as the validation set to select a suitable confidence threshold for the auxiliary model. This is an essential preprocessing step to prevent an increase in the number of FP detections. Based on this optimization step, the output of the auxiliary model is only taken into account when the confidence of the detection is > 95%.
Table II demonstrates that the auxiliary model could only add a small improvement to the performance of the main model. Resnet101 improved recall by 2.93%, Dice by 4.12%, and Jaccard by 4.38%, whereas Resnet Inception only improved recall by 0.46%, Dice by 3.13%, and Jaccard by 3.51%. Precision decreased in both cases. The improvement in detection is smaller than in segmentation, which means that Resnet50 was able to detect most of the polyps detected by the two auxiliary models. Fig. 4 illustrates two polyp examples. The first polyp is partially segmented and the second polyp is missed by Resnet50. However, both are precisely segmented by Resnet101 and Resnet Inception with a confidence of 99%.

Fig. 4. Example of two outputs produced by the three Mask R-CNN models. Column 1 shows two polyps with their ground truths. Columns 2, 3 and 4 show the results of Resnet50, Resnet101 and Resnet Inception, respectively.
C. The Effect of Adding New Images to the Training Set
In this experiment, we aim to find out to what extent adding extra training images with new polyps can help the CNN feature extractors improve their performance. We thus trained the three models again for 20 epochs using the images in both the ETIS-Larib and CVC-ColonDB datasets for training (51 different polyps). Table III shows that all three models were able to greatly improve both the detection and segmentation capabilities of the Mask R-CNN (especially Inception Resnet) after adding the 36 new polyps of ETIS-Larib (196 images) to the training data. Unlike the ensemble approach, all the metrics, including precision, improved by larger margins in this experiment. As can be noticed in the results, Inception Resnet is the model with the most improvement in all metrics, which indicates the ability of this CNN architecture to extract richer features from larger training data.
TABLE III
COMPARISON OF RESULTS OBTAINED ON THE CVC-CLINICDB AFTER ETIS-LARIB WAS ADDED TO THE TRAINING DATA AND THE MODELS TRAINED FOR 20 EPOCHS

Mask R-CNNs          Recall %   Precision %   Dice %   Jaccard %
Resnet50*            83.49      92.95         71.6     63.9
Resnet50+            85.34      93.1          80.42    73.4
Improvement          1.85       0.15          8.82     9.5
Resnet101*           80.71      92.1          70.42    63.3
Resnet101+           84.87      95            77.48    70.13
Improvement          4.16       2.9           7.06     6.83
Inception Resnet*    77.31      91.25         70.31    63.6
Inception Resnet+    86.1       94.1          80.19    73.2
Improvement          8.79       2.85          9.88     9.6

*: only CVC-ColonDB was used for training
+: CVC-ColonDB and ETIS-Larib were used for training
As shown in Fig. 5, the new polyp images added to the training data helped Mask R-CNN with Inception Resnet (v2) to predict a better mask for the polyp shown in the first column, correctly detect and segment the previously missed polyp shown in the second column, and correct the FP detection for the polyp shown in the third column.
D. Comparison with Other Methods
Each output produced by Mask R-CNN consists of three components: a confidence value, the coordinates of a bounding box, and a mask (see Fig. 3). This makes Mask R-CNN eligible for performance comparison with other methods in terms of both detection and segmentation capabilities.
Fig. 5. Example of three outputs produced by Mask R-CNN with Inception Resnet (v2). The images in the 1st row show the ground truths for the polyps shown in the 2nd row. The images in the 3rd row are output results of the model when trained on CVC-ColonDB (Inception Resnet*). The images in the 4th row are output results of the model when trained on CVC-ColonDB and ETIS-Larib (Inception Resnet+).
For comparison against the methods presented in MICCAI 2015, we followed the same dataset guidelines, i.e., the CVC-ClinicDB dataset was used for the training stage whereas the ETIS-Larib dataset was used for the testing stage.
TABLE IV
SEGMENTATION RESULTS OBTAINED ON THE ETIS-LARIB DATASET

Segmentation Models                 Dice %   Jaccard %
FCN-VGG [13]                        70.23    54.20
Mask R-CNN with Resnet50            58.14    51.32
Mask R-CNN with Resnet101           70.42    61.24
Mask R-CNN with Inception Resnet    63.78    56.85
In Table IV, we compare our Mask R-CNN models against FCN-VGG [13], which is the only segmentation method fully tested on ETIS-Larib. Our Mask R-CNN with Resnet101 outperformed all the other methods including FCN-VGG, with a Dice of 70.42% and Jaccard of 61.24%. To fairly compare the detection capability of our Mask R-CNN models, we followed the same procedure as in MICCAI 2015 to compute TP, FP, FN, and TN. As can be seen in Table V, our Mask R-CNN with Resnet101 achieved the highest precision (80%) and a good recall (72.59%), outperforming Mask R-CNN with Resnet50, Mask R-CNN with Inception Resnet (v2), and the best method in MICCAI 2015. FCN-VGG has a better recall because both CVC-ClinicDB and ASU-Mayo were used in the training stage (more data for training). The results in Tables IV and V are inconsistent with the results in Table I, where Resnet50 achieved the best performance.

References (partial)
[1] "Cancer statistics, 2018."
[20] "Mask R-CNN."
[21] "Deep Residual Learning for Image Recognition."
[27] "Going deeper with convolutions."
[28] "Microsoft COCO: Common Objects in Context."