Journal ArticleDOI

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

TL;DR: This work addresses the task of semantic image segmentation with Deep Learning, proposes atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales, and improves the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models.
Abstract: In this work we address the task of semantic image segmentation with Deep Learning and make three main contributions that are experimentally shown to have substantial practical merit. First, we highlight convolution with upsampled filters, or ‘atrous convolution’, as a powerful tool in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-view, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models. The commonly deployed combination of max-pooling and downsampling in DCNNs achieves invariance but has a toll on localization accuracy. We overcome this by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF), which is shown both qualitatively and quantitatively to improve localization performance. Our proposed “DeepLab” system sets the new state-of-the-art at the PASCAL VOC-2012 semantic image segmentation task, reaching 79.7% mIOU on the test set, and advances the results on three other datasets: PASCAL-Context, PASCAL-Person-Part, and Cityscapes. All of our code is made publicly available online.
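
As a concrete illustration of the atrous convolution and ASPP ideas described above, here is a minimal sketch in PyTorch. The framework choice, the fusion by summation, and the rates and channel sizes are assumptions of this example, not the paper's exact configuration (the original implementation was in Caffe):

```python
import torch
import torch.nn as nn

class ASPPSketch(nn.Module):
    """Illustrative atrous spatial pyramid pooling: parallel 3x3
    convolutions over the same feature map at different dilation
    rates, fused here by summation. Rates/channels are illustrative."""
    def __init__(self, in_ch=2048, out_ch=256, rates=(6, 12, 18, 24)):
        super().__init__()
        # dilation=r inserts r-1 zeros between filter taps, enlarging
        # the field of view without adding parameters or computation
        # per tap; padding=r keeps the spatial resolution unchanged.
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        )

    def forward(self, x):
        # Each branch sees the input at a different effective
        # field of view; their responses are summed.
        return torch.stack([b(x) for b in self.branches]).sum(dim=0)

features = torch.randn(1, 2048, 33, 33)   # e.g. a ResNet conv5 output
print(ASPPSketch()(features).shape)       # torch.Size([1, 256, 33, 33])
```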
Citations
Proceedings ArticleDOI
21 Jul 2017
TL;DR: This paper exploits global context information via different-region-based context aggregation through a pyramid pooling module in the proposed pyramid scene parsing network (PSPNet), producing good-quality results on the scene parsing task.
Abstract: Scene parsing is challenging for unrestricted open vocabulary and diverse scenes. In this paper, we exploit the capability of global context information by different-region-based context aggregation through our pyramid pooling module together with the proposed pyramid scene parsing network (PSPNet). Our global prior representation is effective for producing good-quality results on the scene parsing task, while PSPNet provides a superior framework for pixel-level prediction. The proposed approach achieves state-of-the-art performance on various datasets. It came first in the ImageNet scene parsing challenge 2016, the PASCAL VOC 2012 benchmark, and the Cityscapes benchmark. A single PSPNet yields a new record mIoU of 85.4% on PASCAL VOC 2012 and accuracy of 80.2% on Cityscapes.
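
A minimal PyTorch sketch of the pyramid pooling module described in the abstract: the feature map is average-pooled to several grid sizes, projected, upsampled, and concatenated with the input. The grid sizes (1, 2, 3, 6) follow the paper; the channel sizes and framework are assumptions of this example:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPoolingSketch(nn.Module):
    """Illustrative pyramid pooling module: multi-scale region pooling
    gives a global prior that is fused with the original features."""
    def __init__(self, in_ch=2048, bins=(1, 2, 3, 6)):
        super().__init__()
        self.bins = bins
        # 1x1 convs reduce each pooled map to in_ch // len(bins) channels.
        self.projs = nn.ModuleList(
            nn.Conv2d(in_ch, in_ch // len(bins), kernel_size=1)
            for _ in bins
        )

    def forward(self, x):
        h, w = x.shape[2:]
        outs = [x]
        for bin_size, proj in zip(self.bins, self.projs):
            pooled = F.adaptive_avg_pool2d(x, bin_size)  # context at this grid scale
            outs.append(F.interpolate(proj(pooled), size=(h, w),
                                      mode='bilinear', align_corners=False))
        return torch.cat(outs, dim=1)   # 2 * in_ch channels

print(PyramidPoolingSketch()(torch.randn(1, 2048, 60, 60)).shape)
```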

10,189 citations

Book ChapterDOI
Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, Hartwig Adam
08 Sep 2018
TL;DR: This work extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results, especially along object boundaries, and applies the depthwise separable convolution to both the Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network.
Abstract: Spatial pyramid pooling modules and encoder-decoder structures are both used in deep neural networks for the semantic segmentation task. The former encode multi-scale contextual information by probing the incoming features with filters or pooling operations at multiple rates and multiple effective fields-of-view, while the latter capture sharper object boundaries by gradually recovering spatial information. In this work, we propose to combine the advantages of both methods. Specifically, our proposed model, DeepLabv3+, extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results, especially along object boundaries. We further explore the Xception model and apply the depthwise separable convolution to both the Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network. We demonstrate the effectiveness of the proposed model on the PASCAL VOC 2012 and Cityscapes datasets, achieving test set performance of 89% and 82.1% without any post-processing. Our paper is accompanied by a publicly available reference implementation of the proposed models in TensorFlow at https://github.com/tensorflow/models/tree/master/research/deeplab.
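
The depthwise separable convolution mentioned in the abstract factors a dense spatial convolution into a per-channel (depthwise) convolution followed by a 1x1 (pointwise) convolution, cutting computation substantially. A minimal PyTorch sketch, with illustrative channel sizes:

```python
import torch
import torch.nn as nn

class SeparableConvSketch(nn.Module):
    """Illustrative depthwise separable convolution, optionally atrous,
    as DeepLabv3+ applies inside its ASPP and decoder modules."""
    def __init__(self, in_ch, out_ch, dilation=1):
        super().__init__()
        # groups=in_ch makes the 3x3 convolution act on each channel
        # independently (depthwise); dilation makes it atrous as well.
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3,
                                   padding=dilation, dilation=dilation,
                                   groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 256, 65, 65)
print(SeparableConvSketch(256, 256, dilation=2)(x).shape)
```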

7,113 citations


Cites background or methods from "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs"

  • ...We refer interested readers to [9] for more details....


  • ...Spatial pyramid pooling: Models, such as PSPNet [81] or DeepLab [9, 10], perform spatial pyramid pooling [23, 40] at several grid scales (including image-level pooling [47]) or apply several parallel atrous convolution with different rates (called Atrous Spatial Pyramid Pooling, or ASPP)....


  • ...In this subsection, we evaluate the segmentation accuracy with the trimap experiment [36, 37, 9] to quantify the accuracy of the proposed decoder module near object boundaries....


  • ..., image pyramid) [18, 16, 58, 44, 11, 9] or those that adopt probabilistic graphical models (such as DenseCRF [37] with efficient inference algorithm [2]) [8, 4, 82, 44, 48, 55, 63, 34, 72, 6, 7, 9]....


Posted Content
TL;DR: This work uses new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, DropBlock regularization, and CIoU loss, and combines some of them to achieve state-of-the-art results: 43.5% AP on the MS COCO dataset at a real-time speed of ~65 FPS on a Tesla V100.
Abstract: There are a huge number of features which are said to improve Convolutional Neural Network (CNN) accuracy. Practical testing of combinations of such features on large datasets, and theoretical justification of the result, is required. Some features operate on certain models exclusively and for certain problems exclusively, or only for small-scale datasets; while some features, such as batch-normalization and residual-connections, are applicable to the majority of models, tasks, and datasets. We assume that such universal features include Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT) and Mish-activation. We use new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50) for the MS COCO dataset at a real-time speed of ~65 FPS on a Tesla V100. Source code is at this https URL
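
Of the features listed, the Mish activation has a particularly compact definition, f(x) = x * tanh(softplus(x)): a smooth, non-monotonic alternative to ReLU. A minimal sketch in PyTorch (the framework is an assumption of this example):

```python
import torch
import torch.nn.functional as F

def mish(x):
    """Mish activation referenced in the abstract:
    f(x) = x * tanh(softplus(x))."""
    return x * torch.tanh(F.softplus(x))

x = torch.linspace(-3.0, 3.0, 7)
print(mish(x))   # smooth curve: slightly negative for x < 0, ~x for large x
```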

5,709 citations


Cites background or methods from "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs"

  • ...The difference in operation between ASPP [5] module and improved SPP module is mainly from the original k×k kernel size, max-pooling of stride equals to 1 to several 3 × 3 kernel size, dilated ratio equals to k, and stride equals to 1 in dilated convolution operation....


  • ...RFB module is to use several dilated convolutions of k×k kernel, dilated ratio equals to k, and stride equals to 1 to obtain a more comprehensive spatial coverage than ASPP....


  • ...To sum up, an ordinary object detector is composed of several parts:
      • Input: Image, Patches, Image Pyramid
      • Backbones: VGG16 [68], ResNet-50 [26], SpineNet [12], EfficientNet-B0/B7 [75], CSPResNeXt50 [81], CSPDarknet53 [81]
      • Neck:
          • Additional blocks: SPP [25], ASPP [5], RFB [47], SAM [85]
          • Path-aggregation blocks: FPN [44], PAN [49], NAS-FPN [17], Fully-connected FPN, BiFPN [77], ASFF [48], SFAM [98]
      • Heads:
          • Dense Prediction (one-stage): RPN [64], SSD [50], YOLO [61], RetinaNet [45] (anchor based); CornerNet [37], CenterNet [13], MatrixNet [60], FCOS [78] (anchor free)
          • Sparse Prediction (two-stage): Faster R-CNN [64], R-FCN [9], Mask R-CNN [23] (anchor based); RepPoints [87] (anchor free)...


  • ...Common modules that can be used to enhance receptive field are SPP [25], ASPP [5], and RFB [47]....


  • ...• Additional blocks: SPP [25], ASPP [5], RFB [47], SAM [85]...


Proceedings ArticleDOI
15 Jun 2019
TL;DR: New state-of-the-art segmentation performance is achieved on three challenging scene segmentation datasets (Cityscapes, PASCAL Context, and COCO Stuff) without using coarse data.
Abstract: In this paper, we address the scene segmentation task by capturing rich contextual dependencies based on the self-attention mechanism. Unlike previous works that capture contexts by multi-scale feature fusion, we propose a Dual Attention Network (DANet) to adaptively integrate local features with their global dependencies. Specifically, we append two types of attention modules on top of a traditional dilated FCN, which model the semantic interdependencies in the spatial and channel dimensions respectively. The position attention module selectively aggregates the features at each position by a weighted sum of the features at all positions; similar features are related to each other regardless of their distance. Meanwhile, the channel attention module selectively emphasizes interdependent channel maps by integrating associated features among all channel maps. We sum the outputs of the two attention modules to further improve the feature representation, which contributes to more precise segmentation results. We achieve new state-of-the-art segmentation performance on three challenging scene segmentation datasets: Cityscapes, PASCAL Context, and COCO Stuff. In particular, a mean IoU score of 81.5% on the Cityscapes test set is achieved without using coarse data.
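
A minimal PyTorch sketch of the position attention module described above: each spatial location is updated with a similarity-weighted sum of features at all locations (self-attention). The channel reduction factor of 8 and other sizes are assumptions of this example:

```python
import torch
import torch.nn as nn

class PositionAttentionSketch(nn.Module):
    """Illustrative position attention: pairwise feature similarity
    across all spatial positions weights a global aggregation,
    added back to the input through a learned residual scale."""
    def __init__(self, ch):
        super().__init__()
        self.query = nn.Conv2d(ch, ch // 8, 1)
        self.key = nn.Conv2d(ch, ch // 8, 1)
        self.value = nn.Conv2d(ch, ch, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual scale

    def forward(self, x):
        n, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # N x HW x C'
        k = self.key(x).flatten(2)                     # N x C' x HW
        attn = torch.softmax(q @ k, dim=-1)            # N x HW x HW similarities
        v = self.value(x).flatten(2)                   # N x C x HW
        out = (v @ attn.transpose(1, 2)).view(n, c, h, w)
        return self.gamma * out + x

print(PositionAttentionSketch(64)(torch.randn(1, 64, 16, 16)).shape)
```

The channel attention module in the paper is analogous, with the similarity computed between channel maps rather than spatial positions.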

4,327 citations


Cites background or methods from "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs"

  • ...We employ a pretrained residual network with the dilated strategy [3] as the backbone....


  • ...First, Deeplabv2 [3] and Deeplabv3 [4] adopt atrous spatial pyramid pooling to embed contextual information, which consist of parallel dilated convolutions with different dilated rates....


  • ...For example, some works [3, 4, 29] aggregate multi-scale contexts via combining feature maps generated by different dilated convolutions and pooling operations....


  • ...Following [3], we adopt multi-loss on the end of the network when both two attention modules are used....


Proceedings ArticleDOI
17 Mar 2017
TL;DR: Deformable convolutional networks augment the spatial sampling locations in their modules with additional offsets and learn the offsets from the target tasks without additional supervision; the new modules can readily replace their plain counterparts in existing CNNs and can be trained end-to-end by standard backpropagation.
Abstract: Convolutional neural networks (CNNs) are inherently limited in modeling geometric transformations due to the fixed geometric structures of their building modules. In this work, we introduce two new modules to enhance the transformation modeling capability of CNNs, namely, deformable convolution and deformable RoI pooling. Both are based on the idea of augmenting the spatial sampling locations in the modules with additional offsets and learning the offsets from the target tasks, without additional supervision. The new modules can readily replace their plain counterparts in existing CNNs and can be easily trained end-to-end by standard back-propagation, giving rise to deformable convolutional networks. Extensive experiments validate the performance of our approach. For the first time, we show that learning dense spatial transformations in deep CNNs is effective for sophisticated vision tasks such as object detection and semantic segmentation. The code is released at https://github.com/msracver/Deformable-ConvNets.
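
A minimal sketch of deformable convolution using torchvision's operator rather than the paper's original implementation: a plain convolution predicts per-location 2D offsets for every kernel tap, and DeformConv2d samples the input at those shifted positions. Sizes are illustrative:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformBlockSketch(nn.Module):
    """Illustrative deformable convolution block: offsets are learned
    from the features themselves, with no extra supervision."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        # 2 offset coordinates (dy, dx) per kernel tap per location.
        self.offset = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        nn.init.zeros_(self.offset.weight)   # zero offsets: starts as a regular conv
        nn.init.zeros_(self.offset.bias)
        self.deform = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x):
        return self.deform(x, self.offset(x))

print(DeformBlockSketch(32, 64)(torch.randn(1, 32, 20, 20)).shape)
```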

3,318 citations

References
Proceedings ArticleDOI
01 Jun 2016
TL;DR: This work proposes replacing the fully-connected CRF with domain transform (DT), a modern edge-preserving filtering method in which the amount of smoothing is controlled by a reference edge map, and shows that it yields comparable semantic segmentation results, accurately capturing object boundaries.
Abstract: Deep convolutional neural networks (CNNs) are the backbone of state-of-the-art semantic image segmentation systems. Recent work has shown that complementing CNNs with fully-connected conditional random fields (CRFs) can significantly enhance their object localization accuracy, yet dense CRF inference is computationally expensive. We propose replacing the fully-connected CRF with domain transform (DT), a modern edge-preserving filtering method in which the amount of smoothing is controlled by a reference edge map. Domain transform filtering is several times faster than dense CRF inference and we show that it yields comparable semantic segmentation results, accurately capturing object boundaries. Importantly, our formulation allows learning the reference edge map from intermediate CNN features instead of using the image gradient magnitude as in standard DT filtering. This produces task-specific edges in an end-to-end trainable system optimizing the target semantic segmentation quality.
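
For intuition, here is a minimal 1-D NumPy sketch of domain-transform recursive filtering (after Gastal & Oliveira, 2011), the edge-preserving filter this paper adopts: smoothing is strong where the reference edge signal is flat and suppressed across strong edges. Parameter values and the 1-D simplification are assumptions of this example:

```python
import numpy as np

def dt_filter_1d(signal, edges, sigma_s=20.0, sigma_r=0.2, iterations=3):
    """Edge-aware recursive smoothing of `signal`, gated by `edges`."""
    # Distance between neighbors in the transformed domain: large where
    # the reference edge signal changes quickly, which blocks smoothing.
    d = 1.0 + (sigma_s / sigma_r) * np.abs(np.diff(edges, prepend=edges[0]))
    out = signal.astype(float).copy()
    for it in range(iterations):
        # Shrink the filter support at each iteration.
        sigma_h = (sigma_s * np.sqrt(3.0) * 2 ** (iterations - it - 1)
                   / np.sqrt(4 ** iterations - 1))
        a = np.exp(-np.sqrt(2.0) / sigma_h) ** d  # per-sample feedback weight
        for i in range(1, len(out)):              # left-to-right pass
            out[i] += a[i] * (out[i - 1] - out[i])
        for i in range(len(out) - 2, -1, -1):     # right-to-left pass
            out[i] += a[i + 1] * (out[i + 1] - out[i])
    return out

step = np.repeat([0.0, 1.0], 50)
noisy = step + 0.1 * np.random.randn(100)
smoothed = dt_filter_1d(noisy, edges=step)   # noise removed, edge at i=50 kept
```

The paper's key addition is that the reference edge map is itself predicted from CNN features and trained end-to-end, rather than taken from the image gradient as above.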

334 citations


"DeepLab: Semantic Image Segmentatio..." refers background or methods in this paper

  • ...In a different direction, [63] replace the bilateral filtering module used in mean field inference with a faster domain transform module [67], improving the speed and lowering the memory requirements of the overall system, while [18], [68] combine semantic segmentation with edge detection....


  • ...This direction has been extended by several follow-up papers [17], [40], [58], [59], [60], [61], [62], [63], [65], since the first version of our work was published [38]....


  • ...Multiple groups have made important advances, significantly raising the bar on the PASCAL VOC 2012 semantic segmentation benchmark, as reflected to the high level of activity in the benchmark’s leaderboard(1) [17], [40], [58], [59], [60], [61], [62], [63]....


Journal ArticleDOI
TL;DR: In this letter, a precise relationship between RDWT-domain and original-signal-domain distortion for additive white noise in the RDWT domain is derived.
Abstract: The behavior under additive noise of the redundant discrete wavelet transform (RDWT), which is a frame expansion that is essentially an undecimated discrete wavelet transform, is studied. Known prior results in the form of inequalities bound distortion energy in the original signal domain from additive noise in frame-expansion coefficients. In this letter, a precise relationship between RDWT-domain and original-signal-domain distortion for additive white noise in the RDWT domain is derived.
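
A small numerical illustration of the letter's setting, using PyWavelets (the library choice and parameters are assumptions of this sketch): white noise is added to the coefficients of an undecimated (stationary) wavelet transform and the resulting signal-domain distortion is measured after reconstruction:

```python
import numpy as np
import pywt

rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 8 * np.pi, 256))

coeffs = pywt.swt(signal, 'db4', level=3)          # redundant (undecimated) DWT
noisy = [(cA + 0.01 * rng.standard_normal(cA.shape),
          cD + 0.01 * rng.standard_normal(cD.shape)) for cA, cD in coeffs]
recon = pywt.iswt(noisy, 'db4')

# Because the RDWT is a redundant frame expansion, coefficient-domain
# noise energy is not passed through one-to-one to the signal domain;
# the letter derives the precise relationship.
print('signal-domain distortion:', np.mean((recon - signal) ** 2))
```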

331 citations


"DeepLab: Semantic Image Segmentatio..." refers background in this paper

  • ...We refer the interested reader to [74] for early references from the wavelet literature....


Posted Content
TL;DR: This work unifies the usual two-stage process for semantic segmentation into a single joint training algorithm, demonstrating the method on the semantic image segmentation task with encouraging results on the challenging PASCAL VOC 2012 dataset.
Abstract: Convolutional neural networks with many layers have recently been shown to achieve excellent results on many high-level tasks such as image classification, object detection and, more recently, also semantic segmentation. Particularly for semantic segmentation, a two-stage procedure is often employed: convolutional networks are trained to provide good local pixel-wise features for the second step, which is traditionally a more global graphical model. In this work we unify this two-stage process into a single joint training algorithm. We demonstrate our method on the semantic image segmentation task and show encouraging results on the challenging PASCAL VOC 2012 dataset.
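
A minimal PyTorch sketch of the joint-training idea: CRF mean-field inference is unrolled into differentiable layers so the CNN and the graphical model train together. This toy version approximates the bilateral message passing with a fixed Gaussian blur, and all sizes and the learned-compatibility design are assumptions of this example:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MeanFieldSketch(nn.Module):
    """Illustrative unrolled mean-field inference: each iteration
    passes messages between pixels and mixes them through a learned
    label-compatibility transform, all differentiable end-to-end."""
    def __init__(self, num_classes=21, iterations=5):
        super().__init__()
        self.iterations = iterations
        self.num_classes = num_classes
        # Learned label-compatibility mixing (1x1 conv over class scores).
        self.compat = nn.Conv2d(num_classes, num_classes, 1, bias=False)
        # Fixed 5x5 blur as a stand-in for bilateral message passing.
        g = torch.ones(1, 1, 5, 5) / 25.0
        self.register_buffer('blur', g.repeat(num_classes, 1, 1, 1))

    def forward(self, unary_logits):
        q = F.softmax(unary_logits, dim=1)
        for _ in range(self.iterations):
            msg = F.conv2d(q, self.blur, padding=2, groups=self.num_classes)
            q = F.softmax(unary_logits - self.compat(msg), dim=1)
        return q

print(MeanFieldSketch()(torch.randn(1, 21, 64, 64)).shape)
```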

324 citations


"DeepLab: Semantic Image Segmentatio..." refers background or methods in this paper

  • ...While we employ the CRF as a post-processing method, [40], [59], [62], [64], [65] have successfully pursued joint learning of the DCNN and CRF....


  • ...This direction has been extended by several follow-up papers [17], [40], [58], [59], [60], [61], [62], [63], [65], since the first version of our work was published [38]....


  • ...In particular, [59], [65] unroll the CRF mean-field inference steps to convert the whole system into an end-to-end trainable feed-forward network, while [62] approximates one iteration of the dense CRF mean field inference [22] by convolutional layers with learnable filters....


Book ChapterDOI
08 Oct 2016
TL;DR: This paper proposes the Graph Long Short-Term Memory (Graph LSTM) network, which generalizes LSTM from sequential and multi-dimensional data to general graph-structured data.
Abstract: By taking the semantic object parsing task as an exemplar application scenario, we propose the Graph Long Short-Term Memory (Graph LSTM) network, which generalizes LSTM from sequential data or multi-dimensional data to general graph-structured data. In particular, instead of evenly and fixedly dividing an image into pixels or patches as in existing multi-dimensional LSTM structures (e.g., Row, Grid and Diagonal LSTMs), we take each arbitrary-shaped superpixel as a semantically consistent node and adaptively construct an undirected graph for each image, where the spatial relations of the superpixels are naturally used as edges. Constructed on such an adaptive graph topology, the Graph LSTM is more naturally aligned with the visual patterns in the image (e.g., object boundaries or appearance similarities) and provides a more economical information propagation route. Furthermore, for each optimization step over the Graph LSTM, we propose a confidence-driven scheme to update the hidden and memory states of nodes progressively until all nodes are updated. In addition, for each node, the forget gates are adaptively learned to capture different degrees of semantic correlation with neighboring nodes. Comprehensive evaluations on four diverse semantic object parsing datasets demonstrate the significant superiority of our Graph LSTM over other state-of-the-art solutions.
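
A minimal PyTorch sketch of a single Graph LSTM node update: a node aggregates the hidden states of its graph neighbors (averaged here) and runs LSTM-style gating. The paper additionally learns per-neighbor adaptive forget gates and a confidence-driven update order, which this toy version omits; dimensions are illustrative:

```python
import torch
import torch.nn as nn

class GraphLSTMCellSketch(nn.Module):
    """Illustrative Graph LSTM cell: the recurrence runs over graph
    neighborhoods (e.g., adjacent superpixels) instead of a sequence."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        # One linear map producing input, output, forget and cell gates.
        self.gates = nn.Linear(in_dim + 2 * hid_dim, 4 * hid_dim)

    def forward(self, x_v, h_v, c_v, neighbor_h):
        h_nbr = neighbor_h.mean(dim=0, keepdim=True)   # aggregate neighbor states
        i, o, f, g = self.gates(
            torch.cat([x_v, h_v, h_nbr], dim=-1)).chunk(4, dim=-1)
        c_new = torch.sigmoid(f) * c_v + torch.sigmoid(i) * torch.tanh(g)
        h_new = torch.sigmoid(o) * torch.tanh(c_new)
        return h_new, c_new

cell = GraphLSTMCellSketch(in_dim=64, hid_dim=32)
h, c = cell(torch.randn(1, 64), torch.zeros(1, 32), torch.zeros(1, 32),
            neighbor_h=torch.zeros(4, 32))   # a node with 4 neighbors
print(h.shape, c.shape)
```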

312 citations

Posted Content
TL;DR: The proposed ABC-CNN architecture for the visual question answering task (VQA) achieves significant improvements over state-of-the-art methods on three benchmark VQA datasets, and its question-guided attention is shown to reflect the regions that are highly relevant to the questions.
Abstract: We propose a novel attention based deep learning architecture for visual question answering task (VQA). Given an image and an image related natural language question, VQA generates the natural language answer for the question. Generating the correct answers requires the model's attention to focus on the regions corresponding to the question, because different questions inquire about the attributes of different image regions. We introduce an attention based configurable convolutional neural network (ABC-CNN) to learn such question-guided attention. ABC-CNN determines an attention map for an image-question pair by convolving the image feature map with configurable convolutional kernels derived from the question's semantics. We evaluate the ABC-CNN architecture on three benchmark VQA datasets: Toronto COCO-QA, DAQUAR, and VQA dataset. ABC-CNN model achieves significant improvements over state-of-the-art methods on these datasets. The question-guided attention generated by ABC-CNN is also shown to reflect the regions that are highly relevant to the questions.
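
A minimal PyTorch sketch of ABC-CNN's core idea: the question embedding is turned into a convolutional kernel and convolved with the image feature map to produce a question-specific attention map. The single-kernel setup and all dimensions are assumptions of this example:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedAttentionSketch(nn.Module):
    """Illustrative question-guided attention via dynamic convolution:
    different questions yield different kernels, hence different maps."""
    def __init__(self, q_dim=512, feat_ch=256, k=3):
        super().__init__()
        self.feat_ch, self.k = feat_ch, k
        # Project the question embedding to the weights of one k x k kernel.
        self.to_kernel = nn.Linear(q_dim, feat_ch * k * k)

    def forward(self, img_feat, q_emb):
        # img_feat: 1 x C x H x W, q_emb: 1 x q_dim (batch size 1 for clarity)
        kernel = self.to_kernel(q_emb).view(1, self.feat_ch, self.k, self.k)
        scores = F.conv2d(img_feat, kernel, padding=self.k // 2)  # 1 x 1 x H x W
        n, _, h, w = scores.shape
        return F.softmax(scores.view(n, -1), dim=-1).view(n, 1, h, w)

attn = QuestionGuidedAttentionSketch()(torch.randn(1, 256, 14, 14),
                                       torch.randn(1, 512))
print(attn.shape, float(attn.sum()))   # attention sums to 1 over positions
```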

285 citations


"DeepLab: Semantic Image Segmentatio..." refers methods in this paper

  • ...Interestingly, the atrous convolution technique has also been adopted for a broader set of tasks, such as object detection [12], [77], instance-level segmentation [78], visual question answering [79], and optical flow [80]....
