Home
/
Authors
/
Jun Fu

Author

Jun Fu

Bio: Jun Fu is an academic researcher from Chinese Academy of Sciences. The author has contributed to research in topics: Computer science & Segmentation. The author has an hindex of 11, co-authored 16 publications receiving 2303 citations.

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

Dual Attention Network for Scene Segmentation

[...]

Jun Fu¹, Jing Liu¹, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang¹, Hanqing Lu - Show less +3 more•Institutions (1)

Chinese Academy of Sciences¹

15 Jun 2019

TL;DR: New state-of-the-art segmentation performance on three challenging scene segmentation datasets, i.e., Cityscapes, PASCAL Context and COCO Stuff dataset is achieved without using coarse data.

...read moreread less

Abstract: In this paper, we address the scene segmentation task by capturing rich contextual dependencies based on the self-attention mechanism. Unlike previous works that capture contexts by multi-scale features fusion, we propose a Dual Attention Networks (DANet) to adaptively integrate local features with their global dependencies. Specifically, we append two types of attention modules on top of traditional dilated FCN, which model the semantic interdependencies in spatial and channel dimensions respectively. The position attention module selectively aggregates the features at each position by a weighted sum of the features at all positions. Similar features would be related to each other regardless of their distances. Meanwhile, the channel attention module selectively emphasizes interdependent channel maps by integrating associated features among all channel maps. We sum the outputs of the two attention modules to further improve feature representation which contributes to more precise segmentation results. We achieve new state-of-the-art segmentation performance on three challenging scene segmentation datasets, i.e., Cityscapes, PASCAL Context and COCO Stuff dataset. In particular, a Mean IoU score of 81.5% on Cityscapes test set is achieved without using coarse data.

...read moreread less

4,327 citations

Journal Article•DOI•

Stacked Deconvolutional Network for Semantic Segmentation

[...]

Jun Fu¹, Jing Liu¹, Yuhang Wang¹, Jin Zhou², Changyong Wang², Hanqing Lu¹ - Show less +2 more•Institutions (2)

Chinese Academy of Sciences¹, Academy of Military Medical Sciences²

25 Jan 2019-IEEE Transactions on Image Processing

TL;DR: This work proposes a Stacked Deconvolutional Network (SDN) for semantic segmentation and achieves the new state-ofthe- art results on four datasets, including PASCAL VOC 2012, CamVid, GATECH, COCO Stuff.

...read moreread less

Abstract: Recent progress in semantic segmentation has been driven by improving the spatial resolution under Fully Convolutional Networks (FCNs). To address this problem, we propose a Stacked Deconvolutional Network (SDN) for semantic segmentation. In SDN, multiple shallow deconvolutional networks, which are called as SDN units, are stacked one by one to integrate contextual information and bring the fine recovery of localization information. Meanwhile, inter-unit and intra-unit connections are designed to assist network training and enhance feature fusion since the connections improve the flow of information and gradient propagation throughout the network. Besides, hierarchical supervision is applied during the upsampling process of each SDN unit, which enhances the discrimination of feature representations and benefits the network optimization. We carry out comprehensive experiments and achieve the new state-ofthe- art results on four datasets, including PASCAL VOC 2012, CamVid, GATECH, COCO Stuff. In particular, our best model without CRF post-processing achieves an intersection-over-union score of 86.6% in the test set.

...read moreread less

174 citations

Journal Article•DOI•

Scene Segmentation With Dual Relation-Aware Attention Network

[...]

Jun Fu¹, Jing Liu¹, Jie Jiang¹, Yong Li, Yongjun Bao, Hanqing Lu¹ - Show less +2 more•Institutions (1)

Chinese Academy of Sciences¹

01 Jun 2021-IEEE Transactions on Neural Networks

TL;DR: A Dual Relation-aware Attention Network (DRANet) is proposed to handle the task of scene segmentation and designs two types of compact attention modules, which model the contextual dependencies in spatial and channel dimensions, respectively.

...read moreread less

Abstract: In this article, we propose a Dual Relation-aware Attention Network (DRANet) to handle the task of scene segmentation. How to efficiently exploit context is essential for pixel-level recognition. To address the issue, we adaptively capture contextual information based on the relation-aware attention mechanism. Especially, we append two types of attention modules on the top of the dilated fully convolutional network (FCN), which model the contextual dependencies in spatial and channel dimensions, respectively. In the attention modules, we adopt a self-attention mechanism to model semantic associations between any two pixels or channels. Each pixel or channel can adaptively aggregate context from all pixels or channels according to their correlations. To reduce the high cost of computation and memory caused by the abovementioned pairwise association computation, we further design two types of compact attention modules. In the compact attention modules, each pixel or channel is built into association only with a few numbers of gathering centers and obtains corresponding context aggregation over these gathering centers. Meanwhile, we add a cross-level gating decoder to selectively enhance spatial details that boost the performance of the network. We conduct extensive experiments to validate the effectiveness of our network and achieve new state-of-the-art segmentation performance on four challenging scene segmentation data sets, i.e., Cityscapes, ADE20K, PASCAL Context, and COCO Stuff data sets. In particular, a Mean IoU score of 82.9% on the Cityscapes test set is achieved without using extra coarse annotated data.

...read moreread less

133 citations

Proceedings Article•DOI•

Adaptive Context Network for Scene Parsing

[...]

Jun Fu¹, Jing Liu, Yuhang Wang¹, Yong Li, Yongjun Bao, Jinhui Tang², Hanqing Lu - Show less +3 more•Institutions (2)

Chinese Academy of Sciences¹, Nanjing University of Science and Technology²

01 Oct 2019

TL;DR: Adaptive Context Network (ACNet) as discussed by the authors proposes a competitive fusion of global context and local context according to different per-pixel demands, where the global context demand is measured by the similarity between the global feature and its local feature, whose reverse value can also be used to measure the local context demand.

...read moreread less

Abstract: Recent works attempt to improve scene parsing performance by exploring different levels of contexts, and typically train a well-designed convolutional network to exploit useful contexts across all pixels equally. However, in this paper, we find that the context demands are varying from different pixels or regions in each image. Based on this observation, we propose an Adaptive Context Network (ACNet) to capture the pixel-aware contexts by a competitive fusion of global context and local context according to different per-pixel demands. Specifically, when given a pixel, the global context demand is measured by the similarity between the global feature and its local feature, whose reverse value can also be used to measure the local context demand. We model the two demanding measurements by the proposed global context module and local context module, respectively, to generate their adaptive contextual features. Furthermore, we import multiple such modules to build several adaptive context blocks in different levels of network to obtain a coarse-to-fine result. Finally, comprehensive experimental evaluations demonstrate the effectiveness of the proposed ACNet, and new state-of-the-arts performances are achieved on all four public datasets, i.e. Cityscapes, ADE20K, PASCAL Context, and COCO Stuff.

...read moreread less

99 citations

Posted Content•

Dual Attention Network for Scene Segmentation

[...]

Jun Fu¹, Jing Liu¹, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang¹, Hanqing Lu - Show less +3 more•Institutions (1)

Chinese Academy of Sciences¹

09 Sep 2018-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, a dual attention network is proposed to adaptively integrate local features with their global dependencies, which achieves state-of-the-art segmentation performance on three challenging scene segmentation datasets, i.e., Cityscapes, PASCAL Context and COCO Stuff.

...read moreread less

Abstract: In this paper, we address the scene segmentation task by capturing rich contextual dependencies based on the selfattention mechanism. Unlike previous works that capture contexts by multi-scale features fusion, we propose a Dual Attention Networks (DANet) to adaptively integrate local features with their global dependencies. Specifically, we append two types of attention modules on top of traditional dilated FCN, which model the semantic interdependencies in spatial and channel dimensions respectively. The position attention module selectively aggregates the features at each position by a weighted sum of the features at all positions. Similar features would be related to each other regardless of their distances. Meanwhile, the channel attention module selectively emphasizes interdependent channel maps by integrating associated features among all channel maps. We sum the outputs of the two attention modules to further improve feature representation which contributes to more precise segmentation results. We achieve new state-of-the-art segmentation performance on three challenging scene segmentation datasets, i.e., Cityscapes, PASCAL Context and COCO Stuff dataset. In particular, a Mean IoU score of 81.5% on Cityscapes test set is achieved without using coarse data. We make the code and trained model publicly available at this https URL

...read moreread less

78 citations

1
2
3
4
…
5
6
7

Collapse

Cited by

PDF

Open Access

More filters

Book Chapter•DOI•

Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

[...]

Liang-Chieh Chen¹, Yukun Zhu¹, George Papandreou¹, Florian Schroff¹, Hartwig Adam¹ - Show less +1 more•Institutions (1)

Google¹

08 Sep 2018

TL;DR: This work extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results especially along object boundaries and applies the depthwise separable convolution to both Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network.

...read moreread less

Abstract: Spatial pyramid pooling module or encode-decoder structure are used in deep neural networks for semantic segmentation task. The former networks are able to encode multi-scale contextual information by probing the incoming features with filters or pooling operations at multiple rates and multiple effective fields-of-view, while the latter networks can capture sharper object boundaries by gradually recovering the spatial information. In this work, we propose to combine the advantages from both methods. Specifically, our proposed model, DeepLabv3+, extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results especially along object boundaries. We further explore the Xception model and apply the depthwise separable convolution to both Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network. We demonstrate the effectiveness of the proposed model on PASCAL VOC 2012 and Cityscapes datasets, achieving the test set performance of 89% and 82.1% without any post-processing. Our paper is accompanied with a publicly available reference implementation of the proposed models in Tensorflow at https://github.com/tensorflow/models/tree/master/research/deeplab.

...read moreread less

7,113 citations

Proceedings Article•DOI•

Dual Attention Network for Scene Segmentation

[...]

Jun Fu¹, Jing Liu¹, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang¹, Hanqing Lu - Show less +3 more•Institutions (1)

Chinese Academy of Sciences¹

15 Jun 2019

TL;DR: New state-of-the-art segmentation performance on three challenging scene segmentation datasets, i.e., Cityscapes, PASCAL Context and COCO Stuff dataset is achieved without using coarse data.

...read moreread less

4,327 citations

Posted Content•

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows.

[...]

Ze Liu¹, Yutong Lin¹, Yue Cao¹, Han Hu¹, Yixuan Wei¹, Zheng Zhang¹, Stephen Lin¹, Baining Guo¹ - Show less +4 more•Institutions (1)

Microsoft¹

25 Mar 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: Wang et al. as mentioned in this paper proposed a new vision Transformer called Swin Transformer, which is computed with shifted windows to address the differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text.

...read moreread less

Abstract: This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (86.4 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The code and models will be made publicly available at~\url{this https URL}.

...read moreread less

3,518 citations

Proceedings Article•DOI•

Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

[...]

Sixiao Zheng¹, Jiachen Lu¹, Hengshuang Zhao², Xiatian Zhu³, Zekun Luo⁴, Yabiao Wang⁴, Yanwei Fu¹, Jianfeng Feng¹, Tao Xiang³, Philip H. S. Torr², Li Zhang¹ - Show less +7 more•Institutions (4)

Fudan University¹, University of Oxford², University of Surrey³, Tencent⁴

20 Jun 2021

TL;DR: Zhang et al. as discussed by the authors proposed a pure transformer to encode an image as a sequence of patches, which can be combined with a simple decoder to provide a powerful segmentation model.

...read moreread less

Abstract: Most recent semantic segmentation methods adopt a fully-convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract/semantic visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have been focused on increasing the receptive field, through either dilated/atrous convolutions or inserting attention modules. However, the encoder-decoder based FCN architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer (i.e., without convolution and resolution reduction) to encode an image as a sequence of patches. With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR). Extensive experiments show that SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes. Particularly, we achieve the first position in the highly competitive ADE20K test server leaderboard on the day of submission.

...read moreread less

1,761 citations

Proceedings Article•DOI•

ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks

[...]

Qilong Wang¹, Banggu Wu¹, Pengfei Zhu¹, Peihua Li², Wangmeng Zuo³, Qinghua Hu¹ - Show less +2 more•Institutions (3)

Tianjin University¹, Dalian University of Technology², Harbin Institute of Technology³

14 Jun 2020

TL;DR: The Efficient Channel Attention (ECA) module as discussed by the authors proposes a local cross-channel interaction strategy without dimensionality reduction, which can be efficiently implemented via 1D convolution, which only involves a handful of parameters while bringing clear performance gain.

...read moreread less

Abstract: Recently, channel attention mechanism has demonstrated to offer great potential in improving the performance of deep convolutional neural networks (CNNs). However, most existing methods dedicate to developing more sophisticated attention modules for achieving better performance, which inevitably increase model complexity. To overcome the paradox of performance and complexity trade-off, this paper proposes an Efficient Channel Attention (ECA) module, which only involves a handful of parameters while bringing clear performance gain. By dissecting the channel attention module in SENet, we empirically show avoiding dimensionality reduction is important for learning channel attention, and appropriate cross-channel interaction can preserve performance while significantly decreasing model complexity. Therefore, we propose a local cross-channel interaction strategy without dimensionality reduction, which can be efficiently implemented via 1D convolution. Furthermore, we develop a method to adaptively select kernel size of 1D convolution, determining coverage of local cross-channel interaction. The proposed ECA module is both efficient and effective, e.g., the parameters and computations of our modules against backbone of ResNet50 are 80 vs. 24.37M and 4.7e-4 GFlops vs. 3.86 GFlops, respectively, and the performance boost is more than 2% in terms of Top-1 accuracy. We extensively evaluate our ECA module on image classification, object detection and instance segmentation with backbones of ResNets and MobileNetV2. The experimental results show our module is more efficient while performing favorably against its counterparts.

...read moreread less

1,378 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse