Author

Ke Sun

University of Science and Technology of China

Bio: Ke Sun is an academic researcher at the University of Science and Technology of China. The author has contributed to research on the topics of pose and feature learning. The author has an h-index of 8 and has co-authored 13 publications receiving 2,493 citations.

Papers

Proceedings Article•DOI•

Deep High-Resolution Representation Learning for Human Pose Estimation

Ke Sun¹, Bin Xiao², Dong Liu¹, Jingdong Wang²

University of Science and Technology of China¹, Microsoft²

25 Feb 2019

TL;DR: This paper proposes a network that maintains high-resolution representations through the whole process of human pose estimation and empirically demonstrates the effectiveness of the network through the superior pose estimation results over two benchmark datasets: the COCO keypoint detection dataset and the MPII Human Pose dataset.

Abstract: In this paper, we are interested in the human pose estimation problem with a focus on learning reliable high-resolution representations. Most existing methods recover high-resolution representations from low-resolution representations produced by a high-to-low resolution network. Instead, our proposed network maintains high-resolution representations through the whole process. We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the multi-resolution subnetworks in parallel. We conduct repeated multi-scale fusions such that each of the high-to-low resolution representations receives information from other parallel representations over and over, leading to rich high-resolution representations. As a result, the predicted keypoint heatmap is potentially more accurate and spatially more precise. We empirically demonstrate the effectiveness of our network through the superior pose estimation results over two benchmark datasets: the COCO keypoint detection dataset and the MPII Human Pose dataset. In addition, we show the superiority of our network in pose tracking on the PoseTrack dataset. The code and models are publicly available at https://github.com/leoxiaobin/deep-high-resolution-net.pytorch.
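
The repeated multi-scale fusion described in this abstract can be illustrated with a toy example. What follows is a minimal sketch under stated assumptions, not the authors' implementation: it assumes two parallel streams holding single-channel feature maps as nested lists with even dimensions, uses 2x2 average pooling and nearest-neighbor upsampling as stand-ins for the learned resampling layers of the real network, and fuses by summation. The function names (`downsample2x`, `upsample2x`, `fuse`) are hypothetical.

```python
def downsample2x(fmap):
    """Halve resolution with 2x2 average pooling (assumed stand-in for a learned layer)."""
    h, w = len(fmap), len(fmap[0])
    return [
        [
            (fmap[i][j] + fmap[i][j + 1] + fmap[i + 1][j] + fmap[i + 1][j + 1]) / 4.0
            for j in range(0, w, 2)
        ]
        for i in range(0, h, 2)
    ]


def upsample2x(fmap):
    """Double resolution with nearest-neighbor upsampling."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add(a, b):
    """Element-wise sum of two equally sized 2D maps."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def fuse(high, low):
    """One exchange step: each stream receives the other, resampled to its own size."""
    new_high = add(high, upsample2x(low))
    new_low = add(low, downsample2x(high))
    return new_high, new_low


high = [[1.0, 2.0, 3.0, 4.0],
        [5.0, 6.0, 7.0, 8.0],
        [9.0, 10.0, 11.0, 12.0],
        [13.0, 14.0, 15.0, 16.0]]
low = [[10.0, 20.0],
       [30.0, 40.0]]

# Repeated fusion: both resolutions are kept throughout, and each keeps
# receiving information from the other.
for _ in range(2):
    high, low = fuse(high, low)
```

The point of the sketch is structural: the high-resolution map is never discarded and recovered later; it persists across every fusion step while absorbing context from the coarser stream.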

2,979 citations

Posted Content•

Deep High-Resolution Representation Learning for Visual Recognition

Jingdong Wang¹, Ke Sun², Tianheng Cheng³, Borui Jiang⁴, Chaorui Deng⁵, Yang Zhao⁶, Dong Liu², Yadong Mu⁴, Mingkui Tan⁵, Xinggang Wang³, Wenyu Liu³, Bin Xiao¹

Microsoft¹, University of Science and Technology of China², Huazhong University of Science and Technology³, Peking University⁴, South China University of Technology⁵, Griffith University⁶

20 Aug 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: The superiority of the proposed HRNet in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, is shown, suggesting that the HRNet is a stronger backbone for computer vision problems.

Abstract: High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. Existing state-of-the-art frameworks first encode the input image as a low-resolution representation through a subnetwork that is formed by connecting high-to-low resolution convolutions in series (e.g., ResNet, VGGNet), and then recover the high-resolution representation from the encoded low-resolution representation. Instead, our proposed network, named as High-Resolution Network (HRNet), maintains high-resolution representations through the whole process. There are two key characteristics: (i) Connect the high-to-low resolution convolution streams in parallel; (ii) Repeatedly exchange the information across resolutions. The benefit is that the resulting representation is semantically richer and spatially more precise. We show the superiority of the proposed HRNet in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, suggesting that the HRNet is a stronger backbone for computer vision problems. All the codes are available at this https URL.

1,278 citations

Journal Article•DOI•

Deep High-Resolution Representation Learning for Visual Recognition

Microsoft¹, University of Science and Technology of China², Huazhong University of Science and Technology³, Peking University⁴, South China University of Technology⁵, Griffith University⁶

01 Oct 2021-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: The High-Resolution Network (HRNet) maintains high-resolution representations through the whole process by connecting the high-to-low resolution convolution streams in parallel and repeatedly exchanging information across resolutions.

Abstract: High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. Existing state-of-the-art frameworks first encode the input image as a low-resolution representation through a subnetwork that is formed by connecting high-to-low resolution convolutions in series (e.g., ResNet, VGGNet), and then recover the high-resolution representation from the encoded low-resolution representation. Instead, our proposed network, named as High-Resolution Network (HRNet), maintains high-resolution representations through the whole process. There are two key characteristics: (i) Connect the high-to-low resolution convolution streams in parallel and (ii) repeatedly exchange the information across resolutions. The benefit is that the resulting representation is semantically richer and spatially more precise. We show the superiority of the proposed HRNet in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, suggesting that the HRNet is a stronger backbone for computer vision problems. All the codes are available at https://github.com/HRNet.

1,162 citations

Posted Content•

High-Resolution Representations for Labeling Pixels and Regions

Ke Sun, Yang Zhao, Borui Jiang, Tianheng Cheng, Bin Xiao, Dong Liu, Yadong Mu, Xinggang Wang, Wenyu Liu, Jingdong Wang

09 Apr 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: A simple modification is introduced to augment the high-resolution representation by aggregating the (upsampled) representations from all the parallel convolutions rather than only the representation from the high-resolution convolution, which leads to stronger representations, evidenced by superior results.

Abstract: High-resolution representation learning plays an essential role in many vision problems, e.g., pose estimation and semantic segmentation. The high-resolution network (HRNet) (Sun et al., 2019), recently developed for human pose estimation, maintains high-resolution representations through the whole process by connecting high-to-low resolution convolutions in parallel and produces strong high-resolution representations by repeatedly conducting fusions across parallel convolutions. In this paper, we conduct a further study on high-resolution representations by introducing a simple yet effective modification and apply it to a wide range of vision tasks. We augment the high-resolution representation by aggregating the (upsampled) representations from all the parallel convolutions rather than only the representation from the high-resolution convolution as done in (Sun et al., 2019). This simple modification leads to stronger representations, evidenced by superior results. We show top results in semantic segmentation on Cityscapes, LIP, and PASCAL Context, and facial landmark detection on AFLW, COFW, 300W, and WFLW. In addition, we build a multi-level representation from the high-resolution representation and apply it to the Faster R-CNN object detection framework and the extended frameworks. The proposed approach achieves superior results to existing single-model networks on COCO object detection. The code and models are publicly available at this https URL.
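
The modification described in this abstract, aggregating the upsampled representations from all parallel streams, can be sketched as follows. This is a hedged illustration rather than the released code: it assumes each stream is a list of single-channel maps stored as nested lists, that resolutions differ by integer factors, that nearest-neighbor upsampling is an acceptable stand-in for the upsampling actually used, and that "aggregating" means concatenating channels. The helper names `upsample` and `aggregate` are hypothetical.

```python
def upsample(fmap, factor):
    """Nearest-neighbor upsampling of a 2D map by an integer factor."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def aggregate(streams):
    """Bring every stream to the highest resolution and concatenate channels.

    streams: list of streams ordered from highest to lowest resolution;
    each stream is a list of channels, each channel a 2D nested list.
    """
    target_h = len(streams[0][0])
    channels = []
    for stream in streams:
        factor = target_h // len(stream[0])
        for ch in stream:
            channels.append(ch if factor == 1 else upsample(ch, factor))
    return channels


streams = [
    [[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]],  # 1 channel, 4x4
    [[[1, 2], [3, 4]], [[5, 6], [7, 8]]],                               # 2 channels, 2x2
    [[[9]]],                                                            # 1 channel, 1x1
]
combined = aggregate(streams)  # 4 channels, all at 4x4
```

Using only `streams[0]` would correspond to taking the high-resolution output alone; the aggregated version keeps the coarser streams' channels as well.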

512 citations

Proceedings Article•DOI•

Bottom-Up Human Pose Estimation Via Disentangled Keypoint Regression

Zigang Geng¹, Ke Sun¹, Bin Xiao², Zhaoxiang Zhang³, Jingdong Wang²

University of Science and Technology of China¹, Microsoft², Chinese Academy of Sciences³

06 Apr 2021

TL;DR: In this paper, the authors proposed a disentangled keypoint regression (DEKR) method, which adopts adaptive convolutions through pixel-wise spatial transformer to activate the pixels in the keypoint regions and accordingly learn representations from them.

Abstract: In this paper, we are interested in the bottom-up paradigm of estimating human poses from an image. We study the dense keypoint regression framework that was previously inferior to the keypoint detection and grouping framework. Our motivation is that regressing keypoint positions accurately needs to learn representations that focus on the keypoint regions. We present a simple yet effective approach, named disentangled keypoint regression (DEKR). We adopt adaptive convolutions through pixel-wise spatial transformer to activate the pixels in the keypoint regions and accordingly learn representations from them. We use a multi-branch structure for separate regression: each branch learns a representation with dedicated adaptive convolutions and regresses one keypoint. The resulting disentangled representations are able to attend to the keypoint regions, respectively, and thus the keypoint regression is spatially more accurate. We empirically show that the proposed direct regression method outperforms keypoint detection and grouping methods and achieves superior bottom-up pose estimation results on two benchmark datasets, COCO and CrowdPose. The code and models are available at https://github.com/HRNet/DEKR.

137 citations

Cited by

Posted Content•

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows.

Ze Liu¹, Yutong Lin¹, Yue Cao¹, Han Hu¹, Yixuan Wei¹, Zheng Zhang¹, Stephen Lin¹, Baining Guo¹

Microsoft¹

25 Mar 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper proposes Swin Transformer, a hierarchical vision Transformer whose representation is computed with shifted windows, to address differences between language and vision such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text.

Abstract: This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (86.4 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The code and models will be made publicly available at this https URL.
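
The shifted windowing scheme in this abstract can be made concrete with a small index calculation. The sketch below is an illustration under assumptions, not the Swin Transformer code: it assumes a square window size that divides the feature map evenly and models the shift as a cyclic roll of the token grid before partitioning, ignoring any special handling of windows that wrap around the border. The functions `window_ids` and `attention_pairs` are hypothetical names.

```python
def window_ids(height, width, window, shift=0):
    """Map each token position to the id of the local window it attends within.

    Positions are cyclically shifted by `shift` before partitioning into
    non-overlapping window x window blocks.
    """
    assert height % window == 0 and width % window == 0
    per_row = width // window
    ids = []
    for i in range(height):
        row = []
        for j in range(width):
            si, sj = (i + shift) % height, (j + shift) % width
            row.append((si // window) * per_row + sj // window)
        ids.append(row)
    return ids


def attention_pairs(height, width, window):
    """Query-key pairs when attention is restricted to local windows."""
    return height * width * window * window


regular = window_ids(4, 4, window=2)
shifted = window_ids(4, 4, window=2, shift=1)

# Tokens (1, 1) and (1, 2) sit in different windows in the regular layout
# but share a window after the shift, which gives a cross-window connection.
same_before = regular[1][1] == regular[1][2]
same_after = shifted[1][1] == shifted[1][2]
```

Because each token attends to a fixed number of neighbours (`window * window`), `attention_pairs` grows in proportion to the number of tokens, which is the sense in which the cost is linear in image size; unrestricted attention would grow with the square of the token count.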

3,518 citations

Proceedings Article•DOI•

Deep High-Resolution Representation Learning for Human Pose Estimation

Ke Sun¹, Bin Xiao², Dong Liu¹, Jingdong Wang²

University of Science and Technology of China¹, Microsoft²

25 Feb 2019

2,979 citations

Proceedings Article•DOI•

FCOS: Fully Convolutional One-Stage Object Detection

Zhi Tian¹, Chunhua Shen¹, Hao Chen¹, Tong He¹

University of Adelaide¹

02 Apr 2019

TL;DR: For the first time, a much simpler and flexible detection framework achieving improved detection accuracy is demonstrated, and it is hoped that the proposed FCOS framework can serve as a simple and strong alternative for many other instance-level tasks.

Abstract: We propose a fully convolutional one-stage object detector (FCOS) to solve object detection in a per-pixel prediction fashion, analogue to semantic segmentation. Almost all state-of-the-art object detectors such as RetinaNet, SSD, YOLOv3, and Faster R-CNN rely on pre-defined anchor boxes. In contrast, our proposed detector FCOS is anchor box free, as well as proposal free. By eliminating the pre-defined set of anchor boxes, FCOS completely avoids the complicated computation related to anchor boxes such as calculating overlapping during training. More importantly, we also avoid all hyper-parameters related to anchor boxes, which are often very sensitive to the final detection performance. With the only post-processing non-maximum suppression (NMS), FCOS with ResNeXt-64x4d-101 achieves 44.7% in AP with single-model and single-scale testing, surpassing previous one-stage detectors with the advantage of being much simpler. For the first time, we demonstrate a much simpler and flexible detection framework achieving improved detection accuracy. We hope that the proposed FCOS framework can serve as a simple and strong alternative for many other instance-level tasks. Code is available at: https://tinyurl.com/FCOSv1
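
Per-pixel prediction without anchor boxes can be illustrated with a toy target-assignment routine. The sketch below is a simplified illustration, not the FCOS code: it assumes that every location falling inside a ground-truth box regresses its distances to the box's left, top, right, and bottom sides, an encoding the abstract itself does not spell out. The names `ltrb_targets` and `decode` are hypothetical.

```python
def ltrb_targets(x, y, box):
    """Distances from location (x, y) to the left, top, right, bottom sides of box.

    box is (x0, y0, x1, y1). Returns None when the location lies outside
    the box, i.e. it would count as background for this box.
    """
    x0, y0, x1, y1 = box
    left, top, right, bottom = x - x0, y - y0, x1 - x, y1 - y
    if min(left, top, right, bottom) < 0:
        return None
    return (left, top, right, bottom)


def decode(x, y, ltrb):
    """Recover the box from a location and its four predicted distances."""
    left, top, right, bottom = ltrb
    return (x - left, y - top, x + right, y + bottom)


box = (10, 20, 50, 80)
# Every location gets a target directly; no anchor boxes or overlap
# computation are involved.
targets = {(x, y): ltrb_targets(x, y, box) for x in (0, 30) for y in (0, 40)}
```

In this toy setting a location needs only its own coordinates and four distances to describe a box, so there are no anchor shapes, scales, or matching thresholds to tune.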

2,244 citations

Posted Content•

FCOS: Fully Convolutional One-Stage Object Detection

Zhi Tian¹, Chunhua Shen¹, Hao Chen¹, Tong He¹

University of Adelaide¹

02 Apr 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, a fully convolutional one-stage object detector (FCOS) is proposed to solve object detection in a per-pixel prediction fashion, analogue to semantic segmentation.

Abstract: We propose a fully convolutional one-stage object detector (FCOS) to solve object detection in a per-pixel prediction fashion, analogue to semantic segmentation. Almost all state-of-the-art object detectors such as RetinaNet, SSD, YOLOv3, and Faster R-CNN rely on pre-defined anchor boxes. In contrast, our proposed detector FCOS is anchor box free, as well as proposal free. By eliminating the predefined set of anchor boxes, FCOS completely avoids the complicated computation related to anchor boxes such as calculating overlapping during training. More importantly, we also avoid all hyper-parameters related to anchor boxes, which are often very sensitive to the final detection performance. With the only post-processing non-maximum suppression (NMS), FCOS with ResNeXt-64x4d-101 achieves 44.7% in AP with single-model and single-scale testing, surpassing previous one-stage detectors with the advantage of being much simpler. For the first time, we demonstrate a much simpler and flexible detection framework achieving improved detection accuracy. We hope that the proposed FCOS framework can serve as a simple and strong alternative for many other instance-level tasks. Code is available at: this https URL

2,160 citations

Journal Article•DOI•

Deep Learning for Generic Object Detection: A Survey

Li Liu¹, Li Liu², Wanli Ouyang³, Xiaogang Wang⁴, Paul Fieguth⁵, Jie Chen², Xinwang Liu¹, Matti Pietikäinen²

National University of Defense Technology¹, University of Oulu², University of Sydney³, The Chinese University of Hong Kong⁴, University of Waterloo⁵

01 Feb 2020-International Journal of Computer Vision

TL;DR: A comprehensive survey of the recent achievements in this field brought about by deep learning techniques, covering many aspects of generic object detection: detection frameworks, object feature representation, object proposal generation, context modeling, training strategies, and evaluation metrics.

Abstract: Object detection, one of the most fundamental and challenging problems in computer vision, seeks to locate object instances from a large number of predefined categories in natural images. Deep learning techniques have emerged as a powerful strategy for learning feature representations directly from data and have led to remarkable breakthroughs in the field of generic object detection. Given this period of rapid evolution, the goal of this paper is to provide a comprehensive survey of the recent achievements in this field brought about by deep learning techniques. More than 300 research contributions are included in this survey, covering many aspects of generic object detection: detection frameworks, object feature representation, object proposal generation, context modeling, training strategies, and evaluation metrics. We finish the survey by identifying promising directions for future research.

1,897 citations
