Showing papers on "Image segmentation" published in 2020


Journal ArticleDOI
TL;DR: Mask R-CNN, as discussed by the authors, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition, and achieves state-of-the-art performance in instance segmentation.
Abstract: We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without bells and whistles, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition. Code has been made available at: https://github.com/facebookresearch/Detectron .

1,506 citations


Journal ArticleDOI
TL;DR: UNet++ as mentioned in this paper proposes an efficient ensemble of U-Nets of varying depths, which partially share an encoder and co-learn simultaneously using deep supervision, leading to a highly flexible feature fusion scheme.
Abstract: The state-of-the-art models for medical image segmentation are variants of U-Net and fully convolutional networks (FCN). Despite their success, these models have two limitations: (1) their optimal depth is a priori unknown, requiring extensive architecture search or an inefficient ensemble of models of varying depths; and (2) their skip connections impose an unnecessarily restrictive fusion scheme, forcing aggregation only at the same-scale feature maps of the encoder and decoder sub-networks. To overcome these two limitations, we propose UNet++, a new neural architecture for semantic and instance segmentation, by (1) alleviating the unknown network depth with an efficient ensemble of U-Nets of varying depths, which partially share an encoder and co-learn simultaneously using deep supervision; (2) redesigning skip connections to aggregate features of varying semantic scales at the decoder sub-networks, leading to a highly flexible feature fusion scheme; and (3) devising a pruning scheme to accelerate the inference speed of UNet++. We have evaluated UNet++ using six different medical image segmentation datasets, covering multiple imaging modalities such as computed tomography (CT), magnetic resonance imaging (MRI), and electron microscopy (EM), and demonstrating that (1) UNet++ consistently outperforms the baseline models for the task of semantic segmentation across different datasets and backbone architectures; (2) UNet++ enhances segmentation quality of varying-size objects, an improvement over the fixed-depth U-Net; (3) Mask RCNN++ (Mask R-CNN with UNet++ design) outperforms the original Mask R-CNN for the task of instance segmentation; and (4) pruned UNet++ models achieve significant speedup while showing only modest performance degradation. Our implementation and pre-trained models are available at https://github.com/MrGiovanni/UNetPlusPlus .

1,487 citations


Proceedings ArticleDOI
14 Jun 2020
TL;DR: nuScenes as discussed by the authors is the first dataset to carry the full autonomous vehicle sensor suite: 6 cameras, 5 radars and 1 lidar, all with full 360 degree field of view.
Abstract: Robust detection and tracking of objects is crucial for the deployment of autonomous vehicle technology. Image based benchmark datasets have driven development in computer vision tasks such as object detection, tracking and segmentation of agents in the environment. Most autonomous vehicles, however, carry a combination of cameras and range sensors such as lidar and radar. As machine learning based methods for detection and tracking become more prevalent, there is a need to train and evaluate such methods on datasets containing range sensor data along with images. In this work we present nuTonomy scenes (nuScenes), the first dataset to carry the full autonomous vehicle sensor suite: 6 cameras, 5 radars and 1 lidar, all with full 360 degree field of view. nuScenes comprises 1000 scenes, each 20s long and fully annotated with 3D bounding boxes for 23 classes and 8 attributes. It has 7x as many annotations and 100x as many images as the pioneering KITTI dataset. We define novel 3D detection and tracking metrics. We also provide careful dataset analysis as well as baselines for lidar and image based detection and tracking. Data, development kit and more information are available online.

1,378 citations


Journal ArticleDOI
TL;DR: This work develops a novel architecture, MultiResUNet, as a potential successor to the U-Net architecture, and tests and compares it with the classical U-Net on a vast repertoire of multimodal medical images.

1,027 citations


Proceedings ArticleDOI
14 Jun 2020
TL;DR: This paper introduces RandLA-Net, an efficient and lightweight neural architecture to directly infer per-point semantics for large-scale point clouds, and introduces a novel local feature aggregation module to progressively increase the receptive field for each 3D point, thereby effectively preserving geometric details.
Abstract: We study the problem of efficient semantic segmentation for large-scale 3D point clouds. By relying on expensive sampling techniques or computationally heavy pre/post-processing steps, most existing approaches are only able to be trained and operate over small-scale point clouds. In this paper, we introduce RandLA-Net, an efficient and lightweight neural architecture to directly infer per-point semantics for large-scale point clouds. The key to our approach is to use random point sampling instead of more complex point selection approaches. Although remarkably computation and memory efficient, random sampling can discard key features by chance. To overcome this, we introduce a novel local feature aggregation module to progressively increase the receptive field for each 3D point, thereby effectively preserving geometric details. Extensive experiments show that our RandLA-Net can process 1 million points in a single pass, up to 200x faster than existing approaches. Moreover, our RandLA-Net clearly surpasses state-of-the-art approaches for semantic segmentation on two large-scale benchmarks, Semantic3D and SemanticKITTI.

977 citations


Posted Content
TL;DR: A comprehensive review of recent pioneering efforts in semantic and instance segmentation is provided, covering convolutional pixel-labeling networks, encoder-decoder architectures, multiscale and pyramid-based approaches, recurrent networks, visual attention models, and generative models in adversarial settings.
Abstract: Image segmentation is a key topic in image processing and computer vision with applications such as scene understanding, medical image analysis, robotic perception, video surveillance, augmented reality, and image compression, among many others. Various algorithms for image segmentation have been developed in the literature. Recently, due to the success of deep learning models in a wide range of vision applications, there has been a substantial amount of works aimed at developing image segmentation approaches using deep learning models. In this survey, we provide a comprehensive review of the literature at the time of this writing, covering a broad spectrum of pioneering works for semantic and instance-level segmentation, including fully convolutional pixel-labeling networks, encoder-decoder architectures, multi-scale and pyramid based approaches, recurrent networks, visual attention models, and generative models in adversarial settings. We investigate the similarity, strengths and challenges of these deep learning models, examine the most widely used datasets, report performances, and discuss promising future research directions in this area.

950 citations


Proceedings ArticleDOI
04 May 2020
TL;DR: A novel UNet 3+ is proposed, which takes advantage of full-scale skip connections and deep supervisions, and can reduce the network parameters to improve the computation efficiency.
Abstract: Recently, a growing interest has been seen in deep learning-based semantic segmentation. UNet, which is one of the deep learning networks with an encoder-decoder architecture, is widely used in medical image segmentation. Combining multi-scale features is one of the important factors for accurate segmentation. UNet++ was developed as a modified UNet by designing an architecture with nested and dense skip connections. However, it does not explore sufficient information from full scales and there is still large room for improvement. In this paper, we propose a novel UNet 3+, which takes advantage of full-scale skip connections and deep supervisions. The full-scale skip connections incorporate low-level details with high-level semantics from feature maps in different scales, while the deep supervision learns hierarchical representations from the full-scale aggregated feature maps. The proposed method is especially beneficial for organs that appear at varying scales. In addition to accuracy improvements, the proposed UNet 3+ can reduce the network parameters to improve the computation efficiency. We further propose a hybrid loss function and devise a classification-guided module to enhance the organ boundary and reduce the over-segmentation in a non-organ image, yielding more accurate segmentation results. The effectiveness of the proposed method is demonstrated on two datasets. The code is available at: github.com/ZJUGiveLab/UNet-Version

897 citations


Journal ArticleDOI
TL;DR: Li et al. as discussed by the authors proposed a COVID-19 Lung Infection Segmentation Deep Network (Inf-Net) to automatically identify infected regions from chest CT slices, where a parallel partial decoder is used to aggregate the high-level features and generate a global map.
Abstract: Coronavirus Disease 2019 (COVID-19) spread globally in early 2020, causing the world to face an existential health crisis. Automated detection of lung infections from computed tomography (CT) images offers a great potential to augment the traditional healthcare strategy for tackling COVID-19. However, segmenting infected regions from CT slices faces several challenges, including high variation in infection characteristics, and low intensity contrast between infections and normal tissues. Further, collecting a large amount of data is impractical within a short time period, inhibiting the training of a deep model. To address these challenges, a novel COVID-19 Lung Infection Segmentation Deep Network (Inf-Net) is proposed to automatically identify infected regions from chest CT slices. In our Inf-Net, a parallel partial decoder is used to aggregate the high-level features and generate a global map. Then, the implicit reverse attention and explicit edge-attention are utilized to model the boundaries and enhance the representations. Moreover, to alleviate the shortage of labeled data, we present a semi-supervised segmentation framework based on a randomly selected propagation strategy, which only requires a few labeled images and leverages primarily unlabeled data. Our semi-supervised framework can improve the learning ability and achieve a higher performance. Extensive experiments on our COVID-SemiSeg and real CT volumes demonstrate that the proposed Inf-Net outperforms most cutting-edge segmentation models and advances the state-of-the-art performance.

633 citations


Journal ArticleDOI
TL;DR: This article provides a detailed review of the solutions above, summarizing both the technical novelties and empirical results, and compares the benefits and requirements of the surveyed methodologies and provides recommended solutions.

487 citations


Proceedings ArticleDOI
14 Jun 2020
TL;DR: PointPainting as mentioned in this paper projects lidar points into the output of an image-only semantic segmentation network and appends the class scores to each point, which can then be fed to any lidar-only method.
Abstract: Camera and lidar are important sensor modalities for robotics in general and self-driving cars in particular. The sensors provide complementary information offering an opportunity for tight sensor-fusion. Surprisingly, lidar-only methods outperform fusion methods on the main benchmark datasets, suggesting a gap in the literature. In this work, we propose PointPainting: a sequential fusion method to fill this gap. PointPainting works by projecting lidar points into the output of an image-only semantic segmentation network and appending the class scores to each point. The appended (painted) point cloud can then be fed to any lidar-only method. Experiments show large improvements on three different state-of-the-art methods, PointRCNN, VoxelNet and PointPillars, on the KITTI and nuScenes datasets. The painted version of PointRCNN represents a new state of the art on the KITTI leaderboard for the bird's-eye view detection task. In ablation, we study how the effects of Painting depend on the quality and format of the semantic segmentation output, and demonstrate how latency can be minimized through pipelining.
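
The painting step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the lidar points have already been transformed into the camera frame, that `K` is a 3x3 pinhole intrinsics matrix, and that points falling outside the image or behind the camera receive all-zero scores. The function name `paint_points` and these conventions are hypothetical.

```python
import numpy as np

def paint_points(points_cam, K, seg_scores):
    """Append per-pixel class scores to 3D points (hypothetical helper).

    points_cam: (N, 3) points in the camera frame (assumption).
    K:          (3, 3) pinhole intrinsics with last row [0, 0, 1] (assumption).
    seg_scores: (H, W, C) class scores from an image segmentation network.
    Returns an (N, 3 + C) array: xyz followed by the sampled class scores.
    """
    h, w, c = seg_scores.shape
    painted = np.zeros((points_cam.shape[0], 3 + c), dtype=np.float64)
    painted[:, :3] = points_cam

    z = points_cam[:, 2]
    in_front = z > 1e-6                      # points behind the camera are skipped
    safe_z = np.where(in_front, z, 1.0)      # avoid dividing by zero or negatives
    uvw = points_cam @ K.T                   # homogeneous pixel coordinates
    u = np.floor(uvw[:, 0] / safe_z).astype(int)
    v = np.floor(uvw[:, 1] / safe_z).astype(int)

    valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    painted[valid, 3:] = seg_scores[v[valid], u[valid]]
    return painted
```

The painted array keeps the original coordinates in its first three columns, so in this sketch it could be passed to a lidar-only detector that accepts extra per-point channels.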

486 citations


Proceedings ArticleDOI
27 Oct 2020
TL;DR: A new log-cosh dice loss function is introduced and it is showcased that certain loss functions perform well across all data-sets and can be taken as a good baseline choice in unknown data distribution scenarios.
Abstract: Image Segmentation has been an active field of research as it has a wide range of applications, ranging from automated disease detection to self driving cars. In the past five years, various papers came up with different objective loss functions used in different cases such as biased data, sparse segmentation, etc. In this paper, we have summarized some of the well-known loss functions widely used for Image Segmentation and listed out the cases where their usage can help in fast and better convergence of a model. Furthermore, we have also introduced a new log-cosh dice loss function and compared its performance on NBFS skull-segmentation open source data-set with widely used loss functions. We also showcased that certain loss functions perform well across all data-sets and can be taken as a good baseline choice in unknown data distribution scenarios.
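
As a rough illustration of the loss named in this abstract, the sketch below applies the log-cosh function to a soft Dice loss. The abstract does not spell out the formula, so this composition, the smoothing constant `smooth`, and both function names are assumptions made for the example.

```python
import math

def dice_loss(pred, target, smooth=1.0):
    """Soft Dice loss over flat sequences of probabilities and binary labels."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)

def log_cosh_dice_loss(pred, target, smooth=1.0):
    """log(cosh(x)) applied to the Dice loss (assumed form of the proposed loss)."""
    return math.log(math.cosh(dice_loss(pred, target, smooth)))
```

Because log(cosh(x)) behaves like x^2/2 near zero and like |x| for larger x, this wrapper flattens the loss near a perfect prediction, which is one plausible motivation for such a design.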

Journal ArticleDOI
Xinyu Huang, Peng Wang, Cheng Xinjing, Dingfu Zhou, Qichuan Geng, Ruigang Yang
TL;DR: This paper provides a sensor fusion scheme integrating camera videos, consumer-grade motion sensors (GPS/IMU), and a 3D semantic map in order to achieve robust self-localization and semantic segmentation for autonomous driving.
Abstract: Autonomous driving has attracted tremendous attention, especially in the past few years. The key techniques for a self-driving car include solving tasks like 3D map construction, self-localization, parsing the driving road and understanding objects, which enable vehicles to reason and act. However, the lack of large-scale datasets for training and system evaluation is still a bottleneck for developing robust perception models. In this paper, we present the ApolloScape dataset [1] and its applications for autonomous driving. Compared with existing public datasets from real scenes, e.g., KITTI [2] or Cityscapes [3], ApolloScape contains much larger and richer labelling, including a holistic semantic dense point cloud for each site, stereo, per-pixel semantic labelling, lanemark labelling, instance segmentation, 3D car instances, and highly accurate location for every frame in various driving videos from multiple sites, cities and daytimes. For each task, it contains at least 15x more images than SOTA datasets. To label such a complete dataset, we develop various tools and algorithms specific to each task to accelerate the labelling process, such as joint 3D-2D segment labelling, active labelling in videos, etc. Based on ApolloScape, we are able to develop algorithms that jointly consider the learning and inference of multiple tasks. In this paper, we provide a sensor fusion scheme integrating camera videos, consumer-grade motion sensors (GPS/IMU), and a 3D semantic map in order to achieve robust self-localization and semantic segmentation for autonomous driving. We show that, practically, sensor fusion and joint learning of multiple tasks are beneficial to achieve a more robust and accurate system. We expect our dataset and proposed relevant algorithms can support and motivate researchers for further development of multi-sensor fusion and multi-task learning in the field of computer vision.

Proceedings ArticleDOI
14 Jun 2020
TL;DR: PointRend as discussed by the authors proposes a point-based rendering module that performs segmentation predictions at adaptively selected locations based on an iterative subdivision algorithm, which produces crisp object boundaries in regions that are over-smoothed by previous methods.
Abstract: We present a new method for efficient high-quality image segmentation of objects and scenes. By analogizing classical computer graphics methods for efficient rendering with over- and undersampling challenges faced in pixel labeling tasks, we develop a unique perspective of image segmentation as a rendering problem. From this vantage, we present the PointRend (Point-based Rendering) neural network module: a module that performs point-based segmentation predictions at adaptively selected locations based on an iterative subdivision algorithm. PointRend can be flexibly applied to both instance and semantic segmentation tasks by building on top of existing state-of-the-art models. While many concrete implementations of the general idea are possible, we show that a simple design already achieves excellent results. Qualitatively, PointRend outputs crisp object boundaries in regions that are over-smoothed by previous methods. Quantitatively, PointRend yields significant gains on COCO and Cityscapes, for both instance and semantic segmentation. PointRend's efficiency enables output resolutions that are otherwise impractical in terms of memory or computation compared to existing approaches. Code has been made available at https://github.com/facebookresearch/detectron2/tree/master/projects/PointRend.

Proceedings ArticleDOI
14 Jun 2020
TL;DR: This work explores the multi-scale collaborative representation for rain streaks from the perspective of input image scales and hierarchical deep features in a unified framework, termed multi-scale progressive fusion network (MSPFN) for single image rain streak removal.
Abstract: Rain streaks in the air appear in various blurring degrees and resolutions due to different distances from their positions to the camera. Similar rain patterns are visible in a rain image as well as its multi-scale (or multi-resolution) versions, which makes it possible to exploit such complementary information for rain streak representation. In this work, we explore the multi-scale collaborative representation for rain streaks from the perspective of input image scales and hierarchical deep features in a unified framework, termed multi-scale progressive fusion network (MSPFN) for single image rain streak removal. For similar rain streaks at different positions, we employ recurrent calculation to capture the global texture, thus allowing us to explore the complementary and redundant information at the spatial dimension to characterize target rain streaks. Besides, we construct a multi-scale pyramid structure, and further introduce the attention mechanism to guide the fine fusion of this correlated information from different scales. This multi-scale progressive fusion strategy not only promotes the cooperative representation, but also boosts the end-to-end training. Our proposed method is extensively evaluated on several benchmark datasets and achieves state-of-the-art results. Moreover, we conduct experiments on joint deraining, detection, and segmentation tasks, and inspire a new research direction of vision task driven image deraining. The source code is available at https://github.com/kuihua/MSPFN.

Proceedings ArticleDOI
14 Jun 2020
TL;DR: A powerful student-teacher framework for the challenging problem of unsupervised anomaly detection and pixel-precise anomaly segmentation in high-resolution images, in which student networks are trained to regress the output of a descriptive teacher network that was pretrained on a large dataset of patches from natural images.
Abstract: We introduce a powerful student-teacher framework for the challenging problem of unsupervised anomaly detection and pixel-precise anomaly segmentation in high-resolution images. Student networks are trained to regress the output of a descriptive teacher network that was pretrained on a large dataset of patches from natural images. This circumvents the need for prior data annotation. Anomalies are detected when the outputs of the student networks differ from that of the teacher network. This happens when they fail to generalize outside the manifold of anomaly-free training data. The intrinsic uncertainty in the student networks is used as an additional scoring function that indicates anomalies. We compare our method to a large number of existing deep learning based methods for unsupervised anomaly detection. Our experiments demonstrate improvements over state-of-the-art methods on a number of real-world datasets, including the recently introduced MVTec Anomaly Detection dataset that was specifically designed to benchmark anomaly segmentation algorithms.
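
The two scoring signals mentioned in this abstract (student-teacher disagreement and the uncertainty among students) could be combined along the following lines. This is a simplified sketch under stated assumptions: descriptors are given as dense arrays, the two terms are simply added without any normalization, and the function name `anomaly_map` is hypothetical.

```python
import numpy as np

def anomaly_map(teacher, students):
    """Per-pixel anomaly score from teacher and student descriptors.

    teacher:  (H, W, D) descriptors from the pretrained teacher network.
    students: (M, H, W, D) descriptors from M student networks.
    Returns an (H, W) score map; larger values suggest anomalies.
    """
    mean_student = students.mean(axis=0)
    # Regression error: squared distance between the student mean and the teacher.
    regression_error = ((mean_student - teacher) ** 2).sum(axis=-1)
    # Predictive uncertainty: variance across students, summed over channels.
    predictive_variance = students.var(axis=0).sum(axis=-1)
    # Unweighted sum of the two terms (an assumption of this sketch).
    return regression_error + predictive_variance
```

On anomaly-free pixels where every student reproduces the teacher, both terms vanish and the score is zero.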

Journal ArticleDOI
TL;DR: An automatic classification and segmentation tool to help screen for COVID-19 pneumonia using chest CT imaging, which shows very encouraging performance with a Dice coefficient higher than 0.88 for the segmentation and an area under the ROC curve higher than 97% for the classification.

Proceedings ArticleDOI
14 Jun 2020
TL;DR: PolarMask, as discussed by the authors, formulates instance segmentation as predicting the contour of an instance through instance center classification and dense distance regression in polar coordinates, and can be easily embedded into most off-the-shelf detection methods.
Abstract: In this paper, we introduce an anchor-box free and single shot instance segmentation method, which is conceptually simple, fully convolutional and can be used by easily embedding it into most off-the-shelf detection methods. Our method, termed PolarMask, formulates the instance segmentation problem as predicting contour of instance through instance center classification and dense distance regression in a polar coordinate. Moreover, we propose two effective approaches to deal with sampling high-quality center examples and optimization for dense distance regression, respectively, which can significantly improve the performance and simplify the training process. Without any bells and whistles, PolarMask achieves 32.9% in mask mAP with single-model and single-scale training/testing on the challenging COCO dataset. For the first time, we show that the complexity of instance segmentation, in terms of both design and computation complexity, can be the same as bounding box object detection and this much simpler and flexible instance segmentation framework can achieve competitive accuracy. We hope that the proposed PolarMask framework can serve as a fundamental and strong baseline for single shot instance segmentation task.
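
To make the polar formulation concrete, the sketch below turns a predicted center and a list of ray lengths into contour vertices. It assumes the rays are spaced at uniform angles starting from the positive x axis, which the abstract does not specify; `polar_to_contour` is a hypothetical name.

```python
import math

def polar_to_contour(center, distances):
    """Convert a center and per-ray distances into contour points.

    center:    (x, y) of the instance center.
    distances: ray lengths at uniformly spaced angles (assumption),
               starting at angle 0 along the positive x axis.
    Returns a list of (x, y) contour vertices, one per ray.
    """
    cx, cy = center
    n = len(distances)
    contour = []
    for i, d in enumerate(distances):
        theta = 2.0 * math.pi * i / n
        contour.append((cx + d * math.cos(theta), cy + d * math.sin(theta)))
    return contour
```

For example, four equal distances of 2 around the center (5, 5) give the vertices of a diamond at (7, 5), (5, 7), (3, 5) and (5, 3), up to floating-point rounding.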

Journal ArticleDOI
TL;DR: A noise-robust Dice loss that is a generalization of Dice loss for segmentation and Mean Absolute Error (MAE) loss for robustness against noise is introduced and combined with an adaptive self-ensembling framework for training.
Abstract: Segmentation of pneumonia lesions from CT scans of COVID-19 patients is important for accurate diagnosis and follow-up. Deep learning has a potential to automate this task but requires a large set of high-quality annotations that are difficult to collect. Learning from noisy training labels that are easier to obtain has a potential to alleviate this problem. To this end, we propose a novel noise-robust framework to learn from noisy labels for the segmentation task. We first introduce a noise-robust Dice loss that is a generalization of Dice loss for segmentation and Mean Absolute Error (MAE) loss for robustness against noise, then propose a novel COVID-19 Pneumonia Lesion segmentation network (COPLE-Net) to better deal with the lesions with various scales and appearances. The noise-robust Dice loss and COPLE-Net are combined with an adaptive self-ensembling framework for training, where an Exponential Moving Average (EMA) of a student model is used as a teacher model that is adaptively updated by suppressing the contribution of the student to EMA when the student has a large training loss. The student model is also adaptive by learning from the teacher only when the teacher outperforms the student. Experimental results showed that: (1) our noise-robust Dice loss outperforms existing noise-robust loss functions, (2) the proposed COPLE-Net achieves higher performance than state-of-the-art image segmentation networks, and (3) our framework with adaptive self-ensembling significantly outperforms a standard training process and surpasses other noise-robust training approaches in the scenario of learning from noisy labels for COVID-19 pneumonia lesion segmentation.

Journal ArticleDOI
TL;DR: In this article, a new PDE interpretation of a class of deep convolutional neural networks (CNN) was established, which are commonly used to learn from speech, image, and video data.
Abstract: Partial differential equations (PDEs) are indispensable for modeling many physical phenomena and also commonly used for solving image processing tasks. In the latter area, PDE-based approaches interpret image data as discretizations of multivariate functions and the output of image processing algorithms as solutions to certain PDEs. Posing image processing problems in the infinite-dimensional setting provides powerful tools for their analysis and solution. For the last few decades, the reinterpretation of classical image processing problems through the PDE lens has been creating multiple celebrated approaches that benefit a vast area of tasks including image segmentation, denoising, registration, and reconstruction. In this paper, we establish a new PDE interpretation of a class of deep convolutional neural networks (CNN) that are commonly used to learn from speech, image, and video data. Our interpretation includes convolution residual neural networks (ResNet), which are among the most promising approaches for tasks such as image classification having improved the state-of-the-art performance in prestigious benchmark challenges. Despite their recent successes, deep ResNets still face some critical challenges associated with their design, immense computational costs and memory requirements, and lack of understanding of their reasoning. Guided by well-established PDE theory, we derive three new ResNet architectures that fall into two new classes: parabolic and hyperbolic CNNs. We demonstrate how PDE theory can provide new insights and algorithms for deep learning and demonstrate the competitiveness of three new CNN architectures using numerical experiments.
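
The link between residual networks and differential equations that underlies this interpretation can be written in one line. The symbols below ($Y_j$ for the features at layer $j$, $\theta_j$ for the layer weights, $h$ for a step size, $F$ for the layer function) are notation chosen for this sketch and do not appear in the abstract.

```latex
% A residual block read as one forward Euler step of size h:
Y_{j+1} = Y_j + h\,F(\theta_j, Y_j), \qquad j = 0, \dots, N-1,
% which discretizes the continuous-time dynamics
\partial_t Y(t) = F\bigl(\theta(t), Y(t)\bigr), \qquad Y(0) = Y_0 .
```

Under this reading, choosing $F$ so that the continuous equation is parabolic or hyperbolic is what would give rise to the two architecture classes named in the abstract.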

Journal ArticleDOI
TL;DR: This article proposes a simple yet effective similarity guidance network to tackle the one-shot (SG-One) segmentation problem, aiming at predicting the segmentation mask of a query image with the reference to one densely labeled support image of the same category.
Abstract: One-shot image semantic segmentation poses a challenging task of recognizing the object regions from unseen categories with only one annotated example as supervision. In this article, we propose a simple yet effective similarity guidance network to tackle the one-shot (SG-One) segmentation problem. We aim at predicting the segmentation mask of a query image with the reference to one densely labeled support image of the same category. To obtain the robust representative feature of the support image, we first adopt a masked average pooling strategy for producing the guidance features by only taking the pixels belonging to the support image into account. We then leverage the cosine similarity to build the relationship between the guidance features and features of pixels from the query image. In this way, the possibilities embedded in the produced similarity maps can be adopted to guide the process of segmenting objects. Furthermore, our SG-One is a unified framework that can efficiently process both support and query images within one network and be learned in an end-to-end manner. We conduct extensive experiments on Pascal VOC 2012. In particular, our SG-One achieves the mIoU score of 46.3%, surpassing the baseline methods.
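
The two operations named in this abstract, masked average pooling and a cosine similarity map, can be sketched as below. This is an illustration on plain arrays with a channels-first layout, not the authors' network code; the function names and the `eps` constant are assumptions.

```python
import numpy as np

def masked_average_pooling(features, mask):
    """Average support features over the pixels selected by a binary mask.

    features: (C, H, W) support feature map.
    mask:     (H, W) binary mask of the labeled object.
    Returns a (C,) guidance vector.
    """
    weights = mask.astype(features.dtype)
    total = weights.sum()
    if total == 0:
        raise ValueError("mask selects no pixels")
    return (features * weights).sum(axis=(1, 2)) / total

def cosine_similarity_map(query_features, guidance, eps=1e-8):
    """Cosine similarity between the guidance vector and every query pixel.

    query_features: (C, H, W) query feature map.
    guidance:       (C,) vector from masked_average_pooling.
    Returns an (H, W) map with values in [-1, 1].
    """
    dot = np.tensordot(guidance, query_features, axes=([0], [0]))
    norms = np.linalg.norm(query_features, axis=0) * np.linalg.norm(guidance)
    return dot / (norms + eps)
```

The resulting map is high wherever query pixels point in the same feature direction as the pooled support object, which is how it can guide the segmentation of the query image.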

Proceedings ArticleDOI
14 Jun 2020
TL;DR: A simple method for unsupervised domain adaptation, whereby the discrepancy between the source and target distributions is reduced by swapping the low-frequency spectrum of one with the other; the results indicate that even simple procedures can discount nuisance variability in the data that more sophisticated methods struggle to learn away.
Abstract: We describe a simple method for unsupervised domain adaptation, whereby the discrepancy between the source and target distributions is reduced by swapping the low-frequency spectrum of one with the other. We illustrate the method in semantic segmentation, where densely annotated images are aplenty in one domain (synthetic data), but difficult to obtain in another (real images). Current state-of-the-art methods are complex, some requiring adversarial optimization to render the backbone of a neural network invariant to the discrete domain selection variable. Our method does not require any training to perform the domain alignment, just a simple Fourier Transform and its inverse. Despite its simplicity, it achieves state-of-the-art performance in the current benchmarks, when integrated into a relatively standard semantic segmentation model. Our results indicate that even simple procedures can discount nuisance variability in the data that more sophisticated methods struggle to learn away.
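
A spectrum swap of the kind described here might look like the following for a single-channel image. The sketch assumes that only the amplitude of the low frequencies is exchanged while the source phase is kept, and that the low-frequency region is a centered square whose half-width is a fraction `beta` of the image size; neither detail is stated in the abstract, and `swap_low_frequency` is a hypothetical name.

```python
import numpy as np

def swap_low_frequency(source, target, beta=0.1):
    """Replace the low-frequency amplitude of `source` with that of `target`.

    source, target: (H, W) float arrays of the same shape.
    beta: half-width of the swapped square as a fraction of each side,
          restricted to [0, 0.5] in this sketch.
    """
    if source.shape != target.shape:
        raise ValueError("source and target must have the same shape")
    if not 0.0 <= beta <= 0.5:
        raise ValueError("beta must lie in [0, 0.5]")

    # Centered spectra: after fftshift the zero frequency sits at (H//2, W//2).
    fft_src = np.fft.fftshift(np.fft.fft2(source))
    fft_tgt = np.fft.fftshift(np.fft.fft2(target))
    amp_src, phase_src = np.abs(fft_src), np.angle(fft_src)
    amp_tgt = np.abs(fft_tgt)

    h, w = source.shape
    bh, bw = int(np.floor(h * beta)), int(np.floor(w * beta))
    ch, cw = h // 2, w // 2
    rows = slice(ch - bh, ch + bh + 1)
    cols = slice(cw - bw, cw + bw + 1)
    amp_src[rows, cols] = amp_tgt[rows, cols]

    # Recombine the mixed amplitude with the source phase and invert.
    mixed = amp_src * np.exp(1j * phase_src)
    return np.real(np.fft.ifft2(np.fft.ifftshift(mixed)))
```

With `beta=0` only the zero-frequency term is exchanged, so the output keeps the source structure but takes on the target's mean intensity; larger values transfer progressively more of the target's coarse appearance.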

Proceedings ArticleDOI
28 Jul 2020
TL;DR: Encouraging results show that DoubleU-Net can be used as a strong baseline for both medical image segmentation and cross-dataset evaluation testing to measure the generalizability of Deep Learning (DL) models.
Abstract: Semantic image segmentation is the process of labeling each pixel of an image with its corresponding class. An encoder-decoder based approach, like U-Net and its variants, is a popular strategy for solving medical image segmentation tasks. To improve the performance of U-Net on various segmentation tasks, we propose a novel architecture called DoubleU-Net, which is a combination of two U-Net architectures stacked on top of each other. The first U-Net uses a pre-trained VGG-19 as the encoder, which has already learned features from ImageNet and can be transferred to another task easily. To capture more semantic information efficiently, we added another U-Net at the bottom. We also adopt Atrous Spatial Pyramid Pooling (ASPP) to capture contextual information within the network. We have evaluated DoubleU-Net using four medical segmentation datasets, covering various imaging modalities such as colonoscopy, dermoscopy, and microscopy. Experiments on the MICCAI 2015 segmentation challenge, the CVC-ClinicDB, the 2018 Data Science Bowl challenge, and the Lesion boundary segmentation datasets demonstrate that the DoubleU-Net outperforms U-Net and the baseline models. Moreover, DoubleU-Net produces more accurate segmentation masks, especially in the case of the CVC-ClinicDB and MICCAI 2015 segmentation challenge datasets, which have challenging images such as smaller and flat polyps. These results show the improvement over the existing U-Net model. The encouraging results, produced on various medical image segmentation datasets, show that DoubleU-Net can be used as a strong baseline for both medical image segmentation and cross-dataset evaluation testing to measure the generalizability of Deep Learning (DL) models.

Proceedings ArticleDOI
14 Jun 2020
TL;DR: Semantic Region Adaptive Normalization (SEAN) as mentioned in this paper is a simple but effective building block for Generative Adversarial Networks conditioned on segmentation masks that describe the semantic regions in the desired output image.
Abstract: We propose semantic region-adaptive normalization (SEAN), a simple but effective building block for Generative Adversarial Networks conditioned on segmentation masks that describe the semantic regions in the desired output image. Using SEAN normalization, we can build a network architecture that can control the style of each semantic region individually, e.g., we can specify one style reference image per region. SEAN is better suited to encode, transfer, and synthesize style than the best previous method in terms of reconstruction quality, variability, and visual quality. We evaluate SEAN on multiple datasets and report better quantitative metrics (e.g. FID, PSNR) than the current state of the art. SEAN also pushes the frontier of interactive image editing. We can interactively edit images by changing segmentation masks or the style for any given region. We can also interpolate styles from two reference images per region.

Proceedings ArticleDOI
14 Jun 2020
TL;DR: Panoptic-DeepLab as discussed by the authors adopts dual-ASPP and dual-decoder structures specific to semantic and instance segmentation, respectively, aiming to establish a solid baseline for bottom-up methods that can achieve performance comparable to two-stage methods.
Abstract: In this work, we introduce Panoptic-DeepLab, a simple, strong, and fast system for panoptic segmentation, aiming to establish a solid baseline for bottom-up methods that can achieve performance comparable to two-stage methods while yielding fast inference speed. In particular, Panoptic-DeepLab adopts dual-ASPP and dual-decoder structures specific to semantic and instance segmentation, respectively. The semantic segmentation branch is the same as the typical design of any semantic segmentation model (e.g., DeepLab), while the instance segmentation branch is class-agnostic, involving a simple instance center regression. As a result, our single Panoptic-DeepLab simultaneously ranks first at all three Cityscapes benchmarks, setting the new state of the art of 84.2% mIoU, 39.0% AP, and 65.5% PQ on the test set. Additionally, equipped with MobileNetV3, Panoptic-DeepLab runs nearly in real time with a single 1025x2049 image (15.8 frames per second), while achieving competitive performance on Cityscapes (54.1% PQ on the test set). On the Mapillary Vistas test set, our ensemble of six models attains 42.7% PQ, outperforming the challenge winner in 2018 by a healthy margin of 1.5%. Finally, our Panoptic-DeepLab also performs on par with several top-down approaches on the challenging COCO dataset. For the first time, we demonstrate that a bottom-up approach can deliver state-of-the-art results on panoptic segmentation.
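The class-agnostic instance branch is described here only as "a simple instance center regression". One plausible reading, sketched below in plain Python, is that every foreground pixel predicts a vector pointing to its instance center and is then assigned to the nearest detected center. The data layout (`offsets` as a dictionary, `centers` as a list) and the function name are assumptions made for illustration.

```python
def group_pixels(offsets, centers):
    """Assign each pixel to the detected centre closest to its regressed centre.

    offsets: dict mapping a pixel (y, x) to its predicted vector (dy, dx)
             towards the instance centre (hypothetical layout).
    centers: non-empty list of detected centre coordinates (y, x).
    Returns a dict mapping each pixel to a 1-based instance id.
    """
    if not centers:
        raise ValueError("at least one centre is required")

    assignment = {}
    for (y, x), (dy, dx) in offsets.items():
        voted_y, voted_x = y + dy, x + dx
        # Squared Euclidean distance from the voted location to every centre.
        distances = [(voted_y - cy) ** 2 + (voted_x - cx) ** 2 for cy, cx in centers]
        assignment[(y, x)] = distances.index(min(distances)) + 1
    return assignment
```

Because the grouping never looks at class labels, the semantic branch can supply the class of each resulting instance separately.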

Book ChapterDOI
05 Jan 2020
TL;DR: This paper presents Kvasir-SEG: an open-access dataset of gastrointestinal polyp images and corresponding segmentation masks, manually annotated by a medical doctor and then verified by an experienced gastroenterologist, and demonstrates the use of the dataset with a traditional segmentation approach and a modern deep-learning based Convolutional Neural Network approach.
Abstract: Pixel-wise image segmentation is a highly demanding task in medical-image analysis. In practice, it is difficult to find annotated medical images with corresponding segmentation masks. In this paper, we present Kvasir-SEG: an open-access dataset of gastrointestinal polyp images and corresponding segmentation masks, manually annotated by a medical doctor and then verified by an experienced gastroenterologist. Moreover, we also generated the bounding boxes of the polyp regions with the help of segmentation masks. We demonstrate the use of our dataset with a traditional segmentation approach and a modern deep-learning based Convolutional Neural Network (CNN) approach. The dataset will be of value for researchers to reproduce results and compare methods. By adding segmentation masks to the Kvasir dataset, which only provides frame-wise annotations, we enable multimedia and computer vision researchers to contribute to the field of polyp segmentation and automatic analysis of colonoscopy images.
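Deriving a bounding box from a segmentation mask, as mentioned in this abstract, can be illustrated with a short plain-Python sketch. It assumes a binary mask stored as a list of rows and returns a single box around all foreground pixels; how the dataset authors handled images with several polyps is not stated here, so per-region boxes are outside this sketch.

```python
def mask_to_bbox(mask):
    """Return (x_min, y_min, x_max, y_max) in inclusive pixel coordinates.

    mask: list of rows, each a list of 0/1 (or falsy/truthy) values.
    Returns None when the mask has no foreground pixel.
    """
    # Rows that contain at least one foreground pixel, in ascending order.
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    # Column index of every foreground pixel.
    cols = [x for row in mask for x, value in enumerate(row) if value]
    return (min(cols), rows[0], max(cols), rows[-1])
```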

Proceedings ArticleDOI
Yude Wang, Jie Zhang, Meina Kan, Shiguang Shan, Xilin Chen
14 Jun 2020
TL;DR: Wang et al. as mentioned in this paper proposed a self-supervised equivariant attention mechanism (SEAM) to discover additional supervision and narrow the gap between full and weak supervision.
Abstract: Image-level weakly supervised semantic segmentation is a challenging problem that has been deeply studied in recent years. Most advanced solutions exploit class activation maps (CAMs). However, CAMs can hardly serve as the object mask due to the gap between full and weak supervision. In this paper, we propose a self-supervised equivariant attention mechanism (SEAM) to discover additional supervision and narrow the gap. Our method is based on the observation that equivariance is an implicit constraint in fully supervised semantic segmentation, whose pixel-level labels take the same spatial transformation as the input images during data augmentation. However, this constraint is lost on the CAMs trained by image-level supervision. Therefore, we propose consistency regularization on predicted CAMs from various transformed images to provide self-supervision for network learning. Moreover, we propose a pixel correlation module (PCM), which exploits context appearance information and refines the prediction of the current pixel by its similar neighbors, leading to further improvement on CAM consistency. Extensive experiments on the PASCAL VOC 2012 dataset demonstrate that our method outperforms state-of-the-art methods using the same level of supervision. The code is released online.
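The equivariance constraint behind the consistency regularization can be written as a comparison between "transform, then predict" and "predict, then transform". The sketch below is a toy illustration in plain Python: it assumes a horizontal flip as the spatial transformation and a mean absolute difference as the penalty, and `cam_fn` is a hypothetical stand-in for the network that produces a CAM.

```python
def hflip(grid):
    """Mirror a 2-D grid (list of rows) left to right."""
    return [row[::-1] for row in grid]


def equivariance_loss(cam_fn, image, transform=hflip):
    """Mean absolute difference between cam_fn(T(image)) and T(cam_fn(image)).

    cam_fn: hypothetical callable mapping an image grid to a CAM grid of the
            same size. The loss is zero when cam_fn commutes with `transform`.
    """
    cam_of_transformed = cam_fn(transform(image))
    transformed_cam = transform(cam_fn(image))
    diffs = [
        abs(p - q)
        for row_a, row_b in zip(cam_of_transformed, transformed_cam)
        for p, q in zip(row_a, row_b)
    ]
    return sum(diffs) / len(diffs)
```

A predictor that depends on absolute pixel position, for instance by adding a column-dependent bias, yields a positive loss, which is the signal such a regularizer would push towards zero.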

Journal ArticleDOI
Neeraj Kumar, Ruchika Verma, Deepak Anand, Yanning Zhou, Omer Fahri Onder, E. D. Tsougenis, Hao Chen, Pheng-Ann Heng, Jiahui Li, Zhiqiang Hu, Yunzhi Wang, Navid Alemi Koohbanani, Mostafa Jahanifar, Neda Zamani Tajeddin, Ali Gooya, Nasir M. Rajpoot, Xuhua Ren, Sihang Zhou, Qian Wang, Dinggang Shen, Cheng-Kun Yang, Chi-Hung Weng, Wei-Hsiang Yu, Chao-Yuan Yeh, Shuang Yang, Shuoyu Xu, Pak-Hei Yeung, Peng Sun, Amirreza Mahbod, Gerald Schaefer, Isabella Ellinger, Rupert Ecker, Örjan Smedby, Chunliang Wang, Benjamin Chidester, That-Vinh Ton, Minh-Triet Tran, Jian Ma, Minh N. Do, Simon Graham, Quoc Dang Vu, Jin Tae Kwak, Akshaykumar Gunda, Raviteja Chunduri, Corey Hu, Xiaoyang Zhou, Dariush Lotfi, Reza Safdari, Antanas Kascenas, Alison O'Neil, Dennis Eschweiler, Johannes Stegmaier, Yanping Cui, Baocai Yin, Kailin Chen, Xinmei Tian, Philipp Gruening, Erhardt Barth, Elad Arbel, Itay Remer, Amir Ben-Dor, Ekaterina Sirazitdinova, Matthias Kohl, Stefan Braunewell, Yuexiang Li, Xinpeng Xie, Linlin Shen, Jun Ma, Krishanu Das Baksi, Mohammad Azam Khan, Jaegul Choo, Adrián Colomer, Valery Naranjo, Linmin Pei, Khan M. Iftekharuddin, Kaushiki Roy, Debotosh Bhattacharjee, Anibal Pedraza, Maria Gloria Bueno, Sabarinathan Devanathan, Saravanan Radhakrishnan, Praveen Koduganty, Zihan Wu, Guanyu Cai, Xiaojie Liu, Yuqin Wang, Amit Sethi
TL;DR: Several of the top techniques in the MoNuSeg 2018 challenge compared favorably to an individual human annotator and can be used with confidence for nuclear morphometrics; color normalization and heavy data augmentation were among the trends that contributed to increased accuracy.
Abstract: Generalized nucleus segmentation techniques can contribute greatly to reducing the time to develop and validate visual biomarkers for new digital pathology datasets. We summarize the results of the MoNuSeg 2018 Challenge, whose objective was to develop generalizable nuclei segmentation techniques in digital pathology. The challenge was an official satellite event of the MICCAI 2018 conference in which 32 teams with more than 80 participants from geographically diverse institutes participated. Contestants were given a training set with 30 images from seven organs with annotations of 21,623 individual nuclei. A test dataset with 14 images taken from seven organs, including two organs that did not appear in the training set, was released without annotations. Entries were evaluated based on the average aggregated Jaccard index (AJI) on the test set to prioritize accurate instance segmentation as opposed to mere semantic segmentation. More than half the teams that completed the challenge outperformed a previous baseline. Among the trends observed that contributed to increased accuracy were the use of color normalization as well as heavy data augmentation. Additionally, fully convolutional networks inspired by variants of U-Net, FCN, and Mask-RCNN were popularly used, typically based on ResNet or VGG base architectures. Watershed segmentation on predicted semantic segmentation maps was a popular post-processing strategy. Several of the top techniques compared favorably to an individual human annotator and can be used with confidence for nuclear morphometrics.
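The aggregated Jaccard index used for ranking can be sketched as follows. This plain-Python version follows the commonly cited definition (each ground-truth nucleus is paired with the predicted nucleus of highest IoU, and predictions that are never paired are added to the union); it is an illustrative assumption about the metric, not the challenge's official evaluation code. Label maps are lists of rows with 0 as background.

```python
from collections import defaultdict


def pixels_by_label(label_map):
    """Map each nonzero label to the set of (y, x) pixels carrying it."""
    regions = defaultdict(set)
    for y, row in enumerate(label_map):
        for x, label in enumerate(row):
            if label != 0:
                regions[label].add((y, x))
    return regions


def aggregated_jaccard_index(gt_map, pred_map):
    """Sum of matched intersections over the sum of matched unions plus unmatched predictions."""
    gt = pixels_by_label(gt_map)
    pred = pixels_by_label(pred_map)

    intersection_sum, union_sum = 0, 0
    used = set()
    for gt_pixels in gt.values():
        # Defaults cover a ground-truth nucleus with no overlapping prediction.
        best_label, best_iou = None, 0.0
        best_inter, best_union = 0, len(gt_pixels)
        for label, pred_pixels in pred.items():
            inter = len(gt_pixels & pred_pixels)
            if inter == 0:
                continue
            union = len(gt_pixels | pred_pixels)
            if inter / union > best_iou:
                best_label, best_iou = label, inter / union
                best_inter, best_union = inter, union
        intersection_sum += best_inter
        union_sum += best_union
        if best_label is not None:
            used.add(best_label)

    # Predictions never chosen by any ground-truth nucleus only enlarge the union.
    union_sum += sum(len(p) for label, p in pred.items() if label not in used)
    return intersection_sum / union_sum if union_sum else 0.0
```

Unlike a pixel-level Jaccard score on the foreground mask, this measure penalizes merged, split, and spurious nuclei, which is why it favors instance segmentation over semantic segmentation.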

Journal ArticleDOI
TL;DR: A deep stacked transformation approach (BigAug) for domain generalization that, when trained on large datasets, reaches the performance of state-of-the-art fully supervised models trained and tested on their source domains, and that can be generalized to the design of highly robust deep segmentation models for clinical deployment.
Abstract: Recent advances in deep learning for medical image segmentation demonstrate expert-level accuracy. However, application of these models in clinically realistic environments can result in poor generalization and decreased accuracy, mainly due to the domain shift across different hospitals, scanner vendors, imaging protocols, and patient populations. Common transfer learning and domain adaptation techniques are proposed to address this bottleneck. However, these solutions require data (and annotations) from the target domain to retrain the model, and are therefore restrictive in practice for widespread model deployment. Ideally, we wish to have a trained (locked) model that can work uniformly well across unseen domains without further training. In this paper, we propose a deep stacked transformation approach for domain generalization. Specifically, a series of $n$ stacked transformations are applied to each image during network training. The underlying assumption is that the “expected” domain shift for a specific medical imaging modality could be simulated by applying extensive data augmentation on a single source domain, and consequently, a deep model trained on the augmented “big” data (BigAug) could generalize well on unseen domains. We exploit four surprisingly effective, but previously understudied, image-based characteristics for data augmentation to overcome the domain generalization problem. We train and evaluate the BigAug model (with $n = 9$ transformations) on three different 3D segmentation tasks (prostate gland, left atrial, left ventricle) covering two medical imaging modalities (MRI and ultrasound) involving eight publicly available challenge datasets. The results show that when training on relatively small datasets (n = 10 to 32 volumes, depending on the size of the available datasets) from a single source domain: (i) BigAug models degrade an average of 11% (Dice score change) from source to unseen domain, substantially better than conventional augmentation (degrading 39%) and CycleGAN-based domain adaptation method (degrading 25%), (ii) BigAug is better than “shallower” stacked transforms (i.e. those with fewer transforms) on unseen domains and demonstrates modest improvement to conventional augmentation on the source domain, (iii) after training with BigAug on one source domain, performance on an unseen domain is similar to training a model from scratch on that domain when using the same number of training samples. When training on large datasets (n = 465 volumes) with BigAug, (iv) application to unseen domains reaches the performance of state-of-the-art fully supervised models that are trained and tested on their source domains. These findings establish a strong benchmark for the study of domain generalization in medical imaging, and can be generalized to the design of highly robust deep segmentation models for clinical deployment.
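The degradation figures above are expressed as changes in Dice score. A minimal plain-Python sketch of the Dice coefficient for binary masks follows, together with a helper for the drop between source and unseen domains. The abstract does not say whether the reported percentages are relative or absolute changes, so `relative_drop` is only one possible reading and its name is hypothetical.

```python
def dice_score(pred, truth):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two binary masks (lists of rows)."""
    pred_pixels = sum(1 for row in pred for v in row if v)
    truth_pixels = sum(1 for row in truth for v in row if v)
    overlap = sum(
        1
        for pred_row, truth_row in zip(pred, truth)
        for p, t in zip(pred_row, truth_row)
        if p and t
    )
    total = pred_pixels + truth_pixels
    # Two empty masks agree perfectly by convention in this sketch.
    return 1.0 if total == 0 else 2.0 * overlap / total


def relative_drop(source_dice, unseen_dice):
    """Fraction of the source-domain Dice lost on the unseen domain (assumed reading)."""
    return (source_dice - unseen_dice) / source_dice
```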

Journal ArticleDOI
TL;DR: Experimental results show that the proposed novel Context Pyramid Fusion Network (named CPFNet) is very competitive with other state-of-the-art methods on four different challenging tasks, including skin lesion segmentation, retinal linear lesion segmentation, multi-class segmentation of thoracic organs at risk and multi-class segmentation of retinal edema lesions.
Abstract: Accurate and automatic segmentation of medical images is a crucial step for clinical diagnosis and analysis. Convolutional neural network (CNN) approaches based on the U-shape structure have achieved remarkable performance in many different medical image segmentation tasks. However, the context information extraction capability of a single stage is insufficient in this structure, due to problems such as class imbalance and blurred boundaries. In this paper, we propose a novel Context Pyramid Fusion Network (named CPFNet) by combining two pyramidal modules to fuse global/multi-scale context information. Based on the U-shape structure, we first design multiple global pyramid guidance (GPG) modules between the encoder and the decoder, aiming at providing different levels of global context information for the decoder by reconstructing the skip connections. We further design a scale-aware pyramid fusion (SAPF) module to dynamically fuse multi-scale context information in high-level features. These two pyramidal modules can exploit and fuse rich context information progressively. Experimental results show that our proposed method is very competitive with other state-of-the-art methods on four different challenging tasks, including skin lesion segmentation, retinal linear lesion segmentation, multi-class segmentation of thoracic organs at risk and multi-class segmentation of retinal edema lesions.

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a novel unsupervised domain adaptation framework, named as synergistic image and feature alignment (SIFA), to effectively adapt a segmentation network to an unlabeled target domain.
Abstract: Unsupervised domain adaptation has increasingly gained interest in medical image computing, aiming to tackle the performance degradation of deep neural networks when being deployed to unseen data with heterogeneous characteristics. In this work, we present a novel unsupervised domain adaptation framework, named Synergistic Image and Feature Alignment (SIFA), to effectively adapt a segmentation network to an unlabeled target domain. Our proposed SIFA conducts synergistic alignment of domains from both image and feature perspectives. In particular, we simultaneously transform the appearance of images across domains and enhance domain-invariance of the extracted features by leveraging adversarial learning in multiple aspects and with a deeply supervised mechanism. The feature encoder is shared between both adaptive perspectives to leverage their mutual benefits via end-to-end learning. We have extensively evaluated our method with cardiac substructure segmentation and abdominal multi-organ segmentation for bidirectional cross-modality adaptation between MRI and CT images. Experimental results on two different tasks demonstrate that our SIFA method is effective in improving segmentation performance on unlabeled target images, and outperforms the state-of-the-art domain adaptation approaches by a large margin.