Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

doi:10.1007/978-3-030-01234-2_49

Open Access Book Chapter

Liang-Chieh Chen, +4 more

- pp 833-851

TLDR

This work extends DeepLabv3 by adding a simple yet effective decoder module that refines the segmentation results, especially along object boundaries, and applies the depthwise separable convolution to both the Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network.

Abstract:

Deep neural networks for semantic segmentation typically use either a spatial pyramid pooling module or an encoder-decoder structure. The former encodes multi-scale contextual information by probing the incoming features with filters or pooling operations at multiple rates and multiple effective fields-of-view, while the latter captures sharper object boundaries by gradually recovering the spatial information. In this work, we propose to combine the advantages of both methods. Specifically, our proposed model, DeepLabv3+, extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results, especially along object boundaries. We further explore the Xception model and apply the depthwise separable convolution to both the Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network. We demonstrate the effectiveness of the proposed model on the PASCAL VOC 2012 and Cityscapes datasets, achieving test set performance of 89% and 82.1%, respectively, without any post-processing. Our paper is accompanied by a publicly available reference implementation of the proposed models in TensorFlow at https://github.com/tensorflow/models/tree/master/research/deeplab.
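The atrous separable convolution named in the title can be illustrated with a small pure-Python sketch. It is not taken from the reference implementation: the function names, the nested-list tensor layout, the zero padding, and the odd kernel size are all assumptions made for illustration. The depthwise step filters each channel on its own with taps spaced `rate` pixels apart, and the pointwise (1x1) step then mixes channels.

```python
def atrous_depthwise_conv2d(x, dw_kernels, rate):
    """Filter each channel independently with a dilated k x k kernel.

    x: nested lists shaped [C][H][W]; dw_kernels: [C][k][k] with odd k.
    Zero padding keeps the H x W size (an assumption of this sketch).
    """
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    k = len(dw_kernels[0])
    half = k // 2
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for c in range(channels):
        for i in range(height):
            for j in range(width):
                acc = 0.0
                for u in range(k):
                    for v in range(k):
                        # Atrous sampling: neighbouring taps are `rate` pixels apart.
                        ii = i + (u - half) * rate
                        jj = j + (v - half) * rate
                        if 0 <= ii < height and 0 <= jj < width:
                            acc += dw_kernels[c][u][v] * x[c][ii][jj]
                out[c][i][j] = acc
    return out


def pointwise_conv2d(x, pw_weights):
    """1x1 convolution that mixes channels; pw_weights is [C_out][C_in]."""
    c_in, height, width = len(x), len(x[0]), len(x[0][0])
    return [
        [[sum(w[c] * x[c][i][j] for c in range(c_in)) for j in range(width)]
         for i in range(height)]
        for w in pw_weights
    ]


def atrous_separable_conv2d(x, dw_kernels, pw_weights, rate):
    """Depthwise atrous convolution followed by a pointwise convolution."""
    return pointwise_conv2d(atrous_depthwise_conv2d(x, dw_kernels, rate), pw_weights)


def effective_kernel_size(k, rate):
    """Spatial extent covered by a k-tap kernel dilated by `rate`."""
    return k + (k - 1) * (rate - 1)


# One channel, 5 x 5 input with a single impulse at the centre.
x = [[[0.0] * 5 for _ in range(5)]]
x[0][2][2] = 1.0
ones_3x3 = [[[1.0] * 3 for _ in range(3)]]
y = atrous_separable_conv2d(x, ones_3x3, [[2.0]], rate=2)
```

With `rate=2`, the 3 x 3 kernel covers a 5 x 5 window, so the impulse reaches the corners of the output while the pixels between the taps stay at zero. This is how a larger field-of-view is obtained without adding parameters.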

Citations

Posted Content

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows.

Ze Liu, +7 more

- 25 Mar 2021 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: This paper proposes a new vision Transformer called Swin Transformer, whose representation is computed with shifted windows to address the differences between the language and vision domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text.

Proceedings Article

EfficientDet: Scalable and Efficient Object Detection

Mingxing Tan, +2 more

TL;DR: EfficientDet proposes a weighted bi-directional feature pyramid network (BiFPN), which allows easy and fast multi-scale feature fusion, and a compound scaling method that uniformly scales the resolution, depth, and width for all backbone, feature network, and box/class prediction networks at the same time.

Posted Content

Searching for MobileNetV3.

Andrew Howard, +11 more

- 06 May 2019 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: This paper starts the exploration of how automated search algorithms and network design can work together to harness complementary approaches, improving the overall state of the art of MobileNets.

Proceedings Article

Searching for MobileNetV3

Andrew Howard, +11 more

TL;DR: MobileNetV3 is the next generation of MobileNets, based on a combination of complementary search techniques as well as a novel architecture design, and achieves state-of-the-art results for mobile classification, detection and segmentation.

Journal Article

Res2Net: A New Multi-Scale Backbone Architecture

Shanghua Gao, +5 more

- 01 Feb 2021 -

IEEE Transactions on Pattern Analysis an...

TL;DR: Res2Net constructs hierarchical residual-like connections within one single residual block to represent multi-scale features at a granular level and to increase the range of receptive fields for each network layer.

References

Book Chapter

Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

Kaiming He, +3 more

TL;DR: This work equips the networks with another pooling strategy, “spatial pyramid pooling”, to eliminate the requirement of a fixed-size input image, and develops a new network structure, called SPP-net, which can generate a fixed-length representation regardless of image size/scale.
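The fixed-length property can be seen in a minimal pure-Python sketch of spatial pyramid pooling. The function name, the pyramid levels `(1, 2, 4)`, and the bin boundaries (floor for the start, ceiling for the end, as in adaptive pooling) are assumptions chosen for illustration and may differ from the window and stride rule used in SPP-net itself.

```python
def spatial_pyramid_pool(feature_maps, levels=(1, 2, 4)):
    """Max-pool each channel into n x n bins for every n in `levels`.

    feature_maps: nested lists shaped [C][H][W].
    Returns a flat list of length C * sum(n * n for n in levels),
    which does not depend on H or W.
    """
    pooled = []
    for n in levels:
        for channel in feature_maps:
            height, width = len(channel), len(channel[0])
            for bi in range(n):
                r0 = (bi * height) // n             # floor: first row of the bin
                r1 = -(-((bi + 1) * height) // n)   # ceiling: one past the last row
                for bj in range(n):
                    c0 = (bj * width) // n
                    c1 = -(-((bj + 1) * width) // n)
                    pooled.append(max(channel[r][c]
                                      for r in range(r0, r1)
                                      for c in range(c0, c1)))
    return pooled


def make_maps(channels, height, width):
    """Hypothetical helper that builds toy feature maps with distinct values."""
    return [[[float(c * 100 + r * width + col) for col in range(width)]
             for r in range(height)]
            for c in range(channels)]


small = spatial_pyramid_pool(make_maps(2, 5, 7))
large = spatial_pyramid_pool(make_maps(2, 8, 6))
print(len(small), len(large))  # 42 42
```

Both inputs have two channels but different spatial sizes, and both yield 2 * (1 + 4 + 16) = 42 values, so a fully connected layer placed after the pooling would see the same input length in either case.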

Book Chapter

Stacked Hourglass Networks for Human Pose Estimation

Alejandro Newell, +2 more

TL;DR: This work introduces a novel convolutional network architecture for the task of human pose estimation that is described as a “stacked hourglass” network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions.

Proceedings Article

Deformable Convolutional Networks

Jifeng Dai, +6 more

TL;DR: Deformable convolutional networks augment the spatial sampling locations in convolution and RoI pooling modules with additional offsets and learn the offsets from the target tasks, without additional supervision. The new modules can readily replace their plain counterparts in existing CNNs and can be easily trained end-to-end by standard backpropagation.

Proceedings Article

Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials

Philipp Krähenbühl, +1 more

TL;DR: This paper considers fully connected CRF models defined on the complete set of pixels in an image and proposes a highly efficient approximate inference algorithm in which the pairwise edge potentials are defined by a linear combination of Gaussian kernels.

Proceedings Article

OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks

Pierre Sermanet, +5 more

TL;DR: This article proposes a multiscale, sliding-window approach that learns to predict object boundaries; the resulting bounding boxes are accumulated rather than suppressed in order to increase detection confidence. OverFeat is the winner of the localization task of the ImageNet Large Scale Visual Recognition Challenge 2013.

Related Papers (5)

Deep Residual Learning for Image Recognition

U-Net: Convolutional Networks for Biomedical Image Segmentation

Fully convolutional networks for semantic segmentation

Microsoft COCO: Common Objects in Context

ImageNet: A large-scale hierarchical image database