Chaorui Deng

Other affiliations: University of Adelaide

Bio: Chaorui Deng is an academic researcher from South China University of Technology. The author has contributed to research in topics including object detection and computer science. The author has an h-index of 8 and has co-authored 13 publications receiving 942 citations. Previous affiliations of Chaorui Deng include University of Adelaide.

Papers

Posted Content

Deep High-Resolution Representation Learning for Visual Recognition

Jingdong Wang¹, Ke Sun², Tianheng Cheng³, Borui Jiang⁴, Chaorui Deng⁵, Yang Zhao⁶, Dong Liu², Yadong Mu⁴, Mingkui Tan⁵, Xinggang Wang³, Wenyu Liu³, Bin Xiao¹

Microsoft¹, University of Science and Technology of China², Huazhong University of Science and Technology³, Peking University⁴, South China University of Technology⁵, Griffith University⁶

20 Aug 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: The proposed HRNet is shown to be superior in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, suggesting that it is a stronger backbone for computer vision problems.

Abstract: High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. Existing state-of-the-art frameworks first encode the input image as a low-resolution representation through a subnetwork that is formed by connecting high-to-low resolution convolutions in series (e.g., ResNet, VGGNet), and then recover the high-resolution representation from the encoded low-resolution representation. Instead, our proposed network, named as High-Resolution Network (HRNet), maintains high-resolution representations through the whole process. There are two key characteristics: (i) Connect the high-to-low resolution convolution streams in parallel; (ii) Repeatedly exchange the information across resolutions. The benefit is that the resulting representation is semantically richer and spatially more precise. We show the superiority of the proposed HRNet in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, suggesting that the HRNet is a stronger backbone for computer vision problems. All the codes are available at this https URL.

1,278 citations
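
The two characteristics named in the abstract above (parallel streams at different resolutions, and repeated exchange between them) can be illustrated with a toy sketch. This is not the HRNet implementation: it uses 1-D lists instead of feature maps, pair averaging as a stand-in for a strided convolution, value repetition as a stand-in for upsampling, and it omits the convolutions inside each stream. The function names `downsample`, `upsample`, and `exchange` are hypothetical.

```python
def downsample(x):
    """Halve the resolution by averaging adjacent pairs (stand-in for a strided convolution)."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]


def upsample(x):
    """Double the resolution by repeating each value (nearest-neighbour upsampling)."""
    return [v for v in x for _ in range(2)]


def exchange(high, low):
    """One fusion step: each stream adds the other stream resampled to its own resolution."""
    new_high = [h + u for h, u in zip(high, upsample(low))]
    new_low = [l + d for l, d in zip(low, downsample(high))]
    return new_high, new_low


# Two streams kept in parallel: the high-resolution one is never discarded.
high = [1.0, 3.0, 5.0, 7.0]   # 4 positions
low = [10.0, 20.0]            # 2 positions

for _ in range(2):            # exchange information repeatedly
    high, low = exchange(high, low)

print(high)  # still 4 positions
print(low)   # still 2 positions
```

The point of the sketch is only that the high-resolution stream keeps its full length throughout, while still receiving information from the coarser stream at every exchange step.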

Journal Article

Deep High-Resolution Representation Learning for Visual Recognition

Microsoft¹, University of Science and Technology of China², Huazhong University of Science and Technology³, Peking University⁴, South China University of Technology⁵, Griffith University⁶

01 Oct 2021-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: The High-Resolution Network (HRNet) maintains high-resolution representations through the whole process by connecting the high-to-low resolution convolution streams in parallel and repeatedly exchanging information across resolutions.

Abstract: High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. Existing state-of-the-art frameworks first encode the input image as a low-resolution representation through a subnetwork that is formed by connecting high-to-low resolution convolutions in series (e.g., ResNet, VGGNet), and then recover the high-resolution representation from the encoded low-resolution representation. Instead, our proposed network, named as High-Resolution Network (HRNet), maintains high-resolution representations through the whole process. There are two key characteristics: (i) Connect the high-to-low resolution convolution streams in parallel and (ii) repeatedly exchange the information across resolutions. The benefit is that the resulting representation is semantically richer and spatially more precise. We show the superiority of the proposed HRNet in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, suggesting that the HRNet is a stronger backbone for computer vision problems. All the codes are available at https://github.com/HRNet .

1,162 citations

Proceedings Article

Visual Grounding via Accumulated Attention

Chaorui Deng¹, Qi Wu², Qingyao Wu¹, Fuyuan Hu³, Fan Lyu³, Mingkui Tan¹

South China University of Technology¹, University of Adelaide², Suzhou University of Science and Technology³

18 Jun 2018

TL;DR: The A-ATT mechanism circularly accumulates attention on useful information in the image, query, and objects while gradually ignoring noise; experimental results show the superiority of the proposed method in terms of accuracy.

Abstract: Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence or even a multi-round dialogue. There are three main challenges in VG: 1) what is the main focus in a query; 2) how to understand an image; 3) how to locate an object. Most existing methods combine all the information curtly, which may suffer from the problem of information redundancy (i.e. ambiguous query, complicated image and a large number of objects). In this paper, we formulate these challenges as three attention problems and propose an accumulated attention (A-ATT) mechanism to reason among them jointly. Our A-ATT mechanism can circularly accumulate the attention for useful information in image, query, and objects, while the noises are ignored gradually. We evaluate the performance of A-ATT on four popular datasets (namely ReferCOCO, ReferCOCO+, ReferCOCOg, and GuessWhat?!), and the experimental results show the superiority of the proposed method in terms of accuracy.

197 citations

Proceedings Article

Sketch, Ground, and Refine: Top-Down Dense Video Captioning

Chaorui Deng¹, Shizhe Chen², Da Chen³, Yuan He³, Qi Wu¹

University of Adelaide¹, Renmin University of China², Alibaba Group³

01 Jun 2021

TL;DR: The authors propose a Sketch, Ground, and Refine (SGR) model that first generates a paragraph from a global view and then grounds each event description to a video segment for detailed refinement.

Abstract: The dense video captioning task aims to detect and describe a sequence of events in a video for detailed and coherent storytelling. Previous works mainly adopt a "detect-then-describe" framework, which firstly detects event proposals in the video and then generates descriptions for the detected events. However, the definitions of events are diverse which could be as simple as a single action or as complex as a set of events, depending on different semantic contexts. Therefore, directly detecting events based on video information is ill-defined and hurts the coherency and accuracy of generated dense captions. In this work, we reverse the predominant "detect-then-describe" fashion, proposing a top-down way to first generate paragraphs from a global view and then ground each event description to a video segment for detailed refinement. It is formulated as a Sketch, Ground, and Refine process (SGR). The sketch stage first generates a coarse-grained multi-sentence paragraph to describe the whole video, where each sentence is treated as an event and gets localised in the grounding stage. In the refining stage, we improve captioning quality via refinement-enhanced training and dual-path cross attention on both coarse-grained event captions and aligned event segments. The updated event caption can further adjust its segment boundaries. Our SGR model outperforms state-of-the-art methods on ActivityNet Captioning benchmark under traditional and story-oriented dense caption evaluations. Code will be released at github.com/bearcatt/SGR.

41 citations

Proceedings Article

Double Forward Propagation for Memorized Batch Normalization

Yong Guo¹, Qingyao Wu¹, Chaorui Deng¹, Jian Chen¹, Mingkui Tan¹

South China University of Technology¹

29 Apr 2018

TL;DR: The authors propose memorized batch normalization (MBN), which considers multiple recent batches to obtain more accurate and robust statistics, greatly reducing the sensitivity to data and improving generalization performance.

Abstract: Batch Normalization (BN) has been a standard component in designing deep neural networks (DNNs). Although the standard BN can significantly accelerate the training of DNNs and improve the generalization performance, it has several underlying limitations which may hamper the performance in both training and inference. In the training stage, BN relies on estimating the mean and variance of data using a single mini-batch. Consequently, BN can be unstable when the batch size is very small or the data is poorly sampled. In the inference stage, BN often uses the so-called moving mean and moving variance instead of batch statistics, i.e., the training and inference rules in BN are not consistent. Regarding these issues, we propose a memorized batch normalization (MBN), which considers multiple recent batches to obtain more accurate and robust statistics. Note that after the SGD update for each batch, the model parameters will change, and the features will change accordingly, leading to the Distribution Shift before and after the update for the considered batch. To alleviate this issue, we present a simple Double-Forward scheme in MBN which can further improve the performance. Compared to related methods, the proposed MBN exhibits consistent behaviors in both training and inference. Empirical results show that the MBN based models trained with the Double-Forward scheme greatly reduce the sensitivity of data and significantly improve the generalization performance.

28 citations
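
The idea of normalizing with statistics drawn from several recent batches, rather than a single mini-batch, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: the class name `MemorizedNorm`, the fixed memory size, and the unweighted averaging of per-batch means and variances are all hypothetical choices, and the Double-Forward scheme described in the abstract is not shown.

```python
from collections import deque
from statistics import fmean, pvariance


class MemorizedNorm:
    """Toy normalizer that remembers statistics of the most recent batches."""

    def __init__(self, memory=3, eps=1e-5):
        # Only the last `memory` batches contribute; older ones are dropped.
        self.means = deque(maxlen=memory)
        self.variances = deque(maxlen=memory)
        self.eps = eps

    def update(self, batch):
        """Record the mean and (population) variance of one batch."""
        self.means.append(fmean(batch))
        self.variances.append(pvariance(batch))

    def normalize(self, batch):
        """Normalize with statistics averaged over the remembered batches.

        Assumption: a plain unweighted average; the same rule is used in
        training and inference, so the two behave consistently.
        """
        mean = fmean(self.means)
        var = fmean(self.variances)
        return [(x - mean) / (var + self.eps) ** 0.5 for x in batch]


norm = MemorizedNorm(memory=2)
for batch in ([0.0, 2.0], [2.0, 4.0], [4.0, 6.0]):
    norm.update(batch)

# Only the last two batches are remembered: averaged mean 4.0, averaged variance 1.0.
out = norm.normalize([4.0, 5.0])
print(out)
```

With a very small batch, a single-batch estimate of the mean and variance can fluctuate strongly; pooling over a few recent batches smooths that estimate, which is the motivation the abstract gives.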

Cited by

Posted Content

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Ze Liu¹, Yutong Lin¹, Yue Cao¹, Han Hu¹, Yixuan Wei¹, Zheng Zhang¹, Stephen Lin¹, Baining Guo¹

Microsoft¹

25 Mar 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: The authors propose a new vision Transformer, called Swin Transformer, whose hierarchical representation is computed with shifted windows to address differences between language and vision, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text.

Abstract: This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (86.4 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The code and models will be made publicly available at this https URL.

3,518 citations
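
The shifted windowing scheme described in the abstract above can be illustrated by looking only at which tokens end up in the same window. The sketch below is a toy under stated assumptions: it partitions a small grid of token ids, uses a cyclic shift of one position (half of a 2 x 2 window) as an assumed way to realize the shift, requires the grid size to be divisible by the window size, and computes no attention at all. The function names are hypothetical.

```python
def window_partition(grid, window):
    """Split an H x W grid into non-overlapping window x window blocks (flattened)."""
    h, w = len(grid), len(grid[0])
    windows = []
    for top in range(0, h, window):
        for left in range(0, w, window):
            windows.append([grid[r][c]
                            for r in range(top, top + window)
                            for c in range(left, left + window)])
    return windows


def cyclic_shift(grid, shift):
    """Roll the grid up and to the left by `shift` positions, wrapping around."""
    h, w = len(grid), len(grid[0])
    return [[grid[(r + shift) % h][(c + shift) % w] for c in range(w)]
            for r in range(h)]


# A 4 x 4 grid of token ids 0..15.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]

regular = window_partition(grid, 2)                   # windows of one layer
shifted = window_partition(cyclic_shift(grid, 1), 2)  # windows of the next layer

print(regular[0])  # tokens grouped together without a shift
print(shifted[0])  # tokens grouped together after the shift
```

In the regular partition, tokens 5, 6, 9 and 10 fall in four different windows; after the shift they share one window. Alternating the two partitions is what lets information cross window boundaries while each attention computation stays local.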

Journal Article

Review of deep learning: concepts, CNN architectures, challenges, applications, future directions

Laith Alzubaidi¹, Jinglan Zhang¹, Amjad J. Humaidi², Ayad Q. Al-Dujaili, Ye Duan³, Omran Al-Shamma, José Santamaría⁴, Mohammed A. Fadhel⁵, Muthana Al-Amidie³, Laith Farhan⁶

Queensland University of Technology¹, University of Technology, Iraq², University of Missouri³, University of Jaén⁴, Information Technology University⁵, Manchester Metropolitan University⁶

01 Jan 2021-Journal of Big Data

TL;DR: This review provides a comprehensive survey of the most important aspects of DL, including enhancements recently added to the field, and presents challenges and suggested solutions to help researchers understand the existing research gaps.

Abstract: In the last few years, the deep learning (DL) computing paradigm has been deemed the Gold Standard in the machine learning (ML) community. Moreover, it has gradually become the most widely used computational approach in the field of ML, thus achieving outstanding results on several complex cognitive tasks, matching or even beating those provided by human performance. One of the benefits of DL is the ability to learn from massive amounts of data. The DL field has grown fast in the last few years and it has been extensively used to successfully address a wide range of traditional applications. More importantly, DL has outperformed well-known ML techniques in many domains, e.g., cybersecurity, natural language processing, bioinformatics, robotics and control, and medical information processing, among many others. Although several works have reviewed the state of the art in DL, each of them tackled only one aspect of DL, which leads to an overall lack of knowledge about it. Therefore, in this contribution, we propose using a more holistic approach in order to provide a more suitable starting point from which to develop a full understanding of DL. Specifically, this review attempts to provide a more comprehensive survey of the most important aspects of DL, including those enhancements recently added to the field. In particular, this paper outlines the importance of DL and presents the types of DL techniques and networks. It then presents convolutional neural networks (CNNs), which are the most utilized DL network type, and describes the development of CNN architectures together with their main features, e.g., starting with the AlexNet network and closing with the High-Resolution network (HR.Net). Finally, we further present the challenges and suggested solutions to help researchers understand the existing research gaps. It is followed by a list of the major DL applications. Computational tools including FPGA, GPU, and CPU are summarized along with a description of their influence on DL. The paper ends with the evolution matrix, benchmark datasets, and summary and conclusion.

1,084 citations

Journal Article

FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking

Yifu Zhang¹, Chunyu Wang², Xinggang Wang¹, Wenjun Zeng², Wenyu Liu¹

Huazhong University of Science and Technology¹, Microsoft²

04 Apr 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: FairMOT, a simple approach consisting of two homogeneous branches that predict pixel-wise objectness scores and re-ID features, obtains high levels of detection and tracking accuracy and outperforms previous state-of-the-art methods by a large margin on several public datasets.

Abstract: There has been remarkable progress on object detection and re-identification (re-ID) in recent years which are the key components of multi-object tracking. However, little attention has been focused on jointly accomplishing the two tasks in a single network. Our study shows that the previous attempts ended up with degraded accuracy mainly because the re-ID task is not fairly learned which causes many identity switches. The unfairness is two-fold: (1) they treat re-ID as a secondary task whose accuracy heavily depends on the primary detection task. So training is largely biased to the detection task but ignores the re-ID task; (2) they use ROI-Align to extract re-ID features which is directly borrowed from object detection. However, this introduces a lot of ambiguity in characterizing objects because many sampling points may belong to disturbing instances or background. To solve the problems, we present a simple approach FairMOT which consists of two homogeneous branches to predict pixel-wise objectness scores and re-ID features. The achieved fairness between the tasks allows FairMOT to obtain high levels of detection and tracking accuracy and outperform previous state-of-the-art methods by a large margin on several public datasets. The source code and pre-trained models are released at this https URL.

507 citations

Proceedings Article

Graph Convolutional Networks for Temporal Action Localization

Runhao Zeng¹, Wenbing Huang², Chuang Gan³, Mingkui Tan¹, Yu Rong², Peilin Zhao², Junzhou Huang⁴

South China University of Technology¹, Tencent², Massachusetts Institute of Technology³, University of Texas at Arlington⁴

01 Oct 2019

TL;DR: The authors exploit proposal-proposal relations using Graph Convolutional Networks (GCNs), with one relation type capturing the context information for each proposal and another characterizing the correlations between distinct actions.

Abstract: Most state-of-the-art action localization systems process each action proposal individually, without explicitly exploiting their relations during learning. However, the relations between proposals actually play an important role in action localization, since a meaningful action always consists of multiple proposals in a video. In this paper, we propose to exploit the proposal-proposal relations using Graph Convolutional Networks (GCNs). First, we construct an action proposal graph, where each proposal is represented as a node and the relation between two proposals as an edge. Here, we use two types of relations, one for capturing the context information for each proposal and the other one for characterizing the correlations between distinct actions. Then we apply the GCNs over the graph to model the relations among different proposals and learn powerful representations for the action classification and localization. Experimental results show that our approach significantly outperforms the state-of-the-art on THUMOS14 (49.1% versus 42.8%). Moreover, augmentation experiments on ActivityNet also verify the efficacy of modeling action proposal relationships.

460 citations
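
The construction in the abstract above (proposals as nodes, relations as edges, then graph convolution over the result) can be sketched in a few lines. This is a simplified illustration, not the paper's model: it builds only one kind of edge, using a temporal IoU threshold as an assumed stand-in for the contextual relation, and it replaces a learned graph convolution layer with a single mean-aggregation step without weights or nonlinearity. The names `tiou`, `build_graph`, and `aggregate` are hypothetical.

```python
def tiou(a, b):
    """Temporal IoU of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def build_graph(proposals, threshold=0.5):
    """Adjacency lists: link proposals whose tIoU exceeds the threshold, plus self-loops."""
    n = len(proposals)
    return [[j for j in range(n)
             if j == i or tiou(proposals[i], proposals[j]) > threshold]
            for i in range(n)]


def aggregate(features, neighbors):
    """One propagation step: each node takes the mean of its neighbours' feature vectors."""
    dim = len(features[0])
    return [[sum(features[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]
            for nbrs in neighbors]


# Three proposals as (start, end) in seconds; the first two overlap heavily.
proposals = [(0.0, 10.0), (1.0, 10.0), (30.0, 40.0)]
features = [[1.0, 0.0], [3.0, 2.0], [10.0, 10.0]]

graph = build_graph(proposals)
updated = aggregate(features, graph)

print(graph)    # which proposals are connected
print(updated)  # features after one round of message passing
```

After one step the two overlapping proposals share information and the isolated proposal keeps its own features, which is the sense in which a proposal is no longer processed individually.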

Proceedings Article

HigherHRNet: Scale-Aware Representation Learning for Bottom-Up Human Pose Estimation

Bowen Cheng¹, Bin Xiao², Jingdong Wang², Honghui Shi³, Thomas S. Huang¹, Lei Zhang²

University of Illinois at Urbana–Champaign¹, Microsoft², University of Oregon³

14 Jun 2020

TL;DR: HigherHRNet is presented, a novel bottom-up human pose estimation method for learning scale-aware representations using high-resolution feature pyramids that surpasses all top-down methods on CrowdPose test and achieves a new state-of-the-art result on COCO test-dev, suggesting its robustness in crowded scenes.

Abstract: Bottom-up human pose estimation methods have difficulties in predicting the correct pose for small persons due to challenges in scale variation. In this paper, we present HigherHRNet: a novel bottom-up human pose estimation method for learning scale-aware representations using high-resolution feature pyramids. Equipped with multi-resolution supervision for training and multi-resolution aggregation for inference, the proposed approach is able to solve the scale variation challenge in bottom-up multi-person pose estimation and localize keypoints more precisely, especially for small persons. The feature pyramid in HigherHRNet consists of feature map outputs from HRNet and upsampled higher-resolution outputs through a transposed convolution. HigherHRNet outperforms the previous best bottom-up method by 2.5% AP for medium person on COCO test-dev, showing its effectiveness in handling scale variation. Furthermore, HigherHRNet achieves a new state-of-the-art result on COCO test-dev (70.5% AP) without using refinement or other post-processing techniques, surpassing all existing bottom-up methods. HigherHRNet even surpasses all top-down methods on CrowdPose test (67.6% AP), suggesting its robustness in crowded scenes.

459 citations
