Top 35 papers published by Chen Qian from SenseTime in 2021

Proceedings Article•DOI•

Reformulating HOI Detection as Adaptive Set Prediction

[...]

Mingfei Chen¹, Yue Liao², Si Liu², Zhiyuan Chen³, Fei Wang³, Chen Qian³ - Show less +2 more•Institutions (3)

Huazhong University of Science and Technology¹, Beihang University², SenseTime³

01 Jun 2021

TL;DR: AS-Net as mentioned in this paper proposes an adaptive set-based one-stage framework with parallel instance and interaction branches, where each query adaptively aggregates the interaction-relevant features from global contexts through multi-head co-attention.

...read moreread less

Abstract: Determining which image regions to concentrate is critical for Human-Object Interaction (HOI) detection. Conventional HOI detectors focus on either detected human and object pairs or pre-defined interaction locations, which limits learning of the effective features. In this paper, we reformulate HOI detection as an adaptive set prediction problem, with this novel formulation, we propose an Adaptive Set-based one-stage framework (AS-Net) with parallel instance and interaction branches. To attain this, we map a trainable interaction query set to an interaction prediction set with transformer. Each query adaptively aggregates the interaction-relevant features from global contexts through multi-head co-attention. Besides, the training process is supervised adaptively by matching each ground-truth with the interaction prediction. Furthermore, we design an effective instance-aware attention module to introduce instructive features from the instance branch into the interaction branch. Our method outperforms previous state-of-the-art methods without any extra human pose and language features on three challenging HOI detection datasets. Especially, we achieve over 31% relative improvement on a large scale HICO-DET dataset. Code is available at https://github.com/yoyomimi/AS-Net.

...read moreread less

82 citations

Journal Article•

Everybody's Talkin': Let Me Talk as You Want

[...]

Linsen Song¹, Wayne Wu², Chen Qian³, Ran He¹, Chen Change Loy⁴ - Show less +1 more•Institutions (4)

Chinese Academy of Sciences¹, Tsinghua University², The Chinese University of Hong Kong³, Nanyang Technological University⁴

04 May 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: A method to edit a target portrait footage by taking a sequence of audio as input to synthesize a photo-realistic video, which is end-to-end learnable and robust to voice variations in the source audio.

...read moreread less

Abstract: We present a method to edit a target portrait footage by taking a sequence of audio as input to synthesize a photo-realistic video. This method is unique because it is highly dynamic. It does not assume a person-specific rendering network yet capable of translating one source audio into one random chosen video output within a set of speech videos. Instead of learning a highly heterogeneous and nonlinear mapping from audio to the video directly, we first factorize each target video frame into orthogonal parameter spaces, i.e., expression, geometry, and pose, via monocular 3D face reconstruction. Next, a recurrent network is introduced to translate source audio into expression parameters that are primarily related to the audio content. The audio-translated expression parameters are then used to synthesize a photo-realistic human subject in each video frame, with the movement of the mouth regions precisely mapped to the source audio. The geometry and pose parameters of the target human portrait are retained, therefore preserving the con-text of the original video footage. Finally, we introduce a novel video rendering network and a dynamic programming method to construct a temporally coherent and photo-realistic video. Extensive experiments demonstrate the superiority of our method over existing approaches. Our method is end-to-end learnable and robust to voice variations in the source audio.

...read moreread less

72 citations

Proceedings Article•DOI•

Prioritized Architecture Sampling with Monto-Carlo Tree Search

[...]

Xiu Su¹, Tao Huang², Yanxi Li¹, Shan You², Fei Wang², Chen Qian², Changshui Zhang³, Chang Xu¹ - Show less +4 more•Institutions (3)

University of Sydney¹, SenseTime², Tsinghua University³

01 Jun 2021

TL;DR: NAS-Bench-Macro as discussed by the authors proposes a sampling strategy based on Monte Carlo tree search (MCTS) with the search space modeled as a MCT, which captures the dependency among layers.

...read moreread less

Abstract: One-shot neural architecture search (NAS) methods significantly reduce the search cost by considering the whole search space as one network, which only needs to be trained once. However, current methods select each operation independently without considering previous layers. Besides, the historical information obtained with huge computation costs is usually used only once and then discarded. In this paper, we introduce a sampling strategy based on Monte Carlo tree search (MCTS) with the search space modeled as a Monte Carlo tree (MCT), which captures the dependency among layers. Furthermore, intermediate results are stored in the MCT for future decisions and a better exploration-exploitation balance. Concretely, MCT is updated using the training loss as a reward to the architecture performance; for accurately evaluating the numerous nodes, we propose node communication and hierarchical node selection methods in the training and search stages, respectively, making better uses of the operation rewards and hierarchical information. Moreover, for a fair comparison of different NAS methods, we construct an open-source NAS benchmark of a macro search space evaluated on CIFAR-10, namely NAS-Bench-Macro. Extensive experiments on NAS-Bench-Macro and ImageNet demonstrate that our method significantly improves search efficiency and performance. For example, by only searching 20 architectures, our obtained architecture achieves 78.0% top-1 accuracy with 442M FLOPs on ImageNet. Code (Benchmark) is available at: https://github.com/xiusu/NAS-Bench-Macro.

...read moreread less

35 citations

Proceedings Article•DOI•

Towards Improving the Consistency, Efficiency, and Flexibility of Differentiable Neural Architecture Search

[...]

Yibo Yang¹, Shan You, Hongyang Li¹, Fei Wang, Chen Qian, Zhouchen Lin¹ - Show less +2 more•Institutions (1)

Peking University¹

01 Jun 2021

TL;DR: EnTranNAS as discussed by the authors is composed of Engine-cells and Transit-cells, which is differentiable for architecture search, while the Transit-cell only transits a sub-graph by architecture derivation.

...read moreread less

Abstract: Most differentiable neural architecture search methods construct a super-net for search and derive a target-net as its sub-graph for evaluation. There exists a significant gap between the architectures in search and evaluation. As a result, current methods suffer from an inconsistent, inefficient, and inflexible search process. In this paper, we introduce EnTranNAS that is composed of Engine-cells and Transit-cells. The Engine-cell is differentiable for architecture search, while the Transit-cell only transits a sub-graph by architecture derivation. Consequently, the gap between the architectures in search and evaluation is significantly reduced. Our method also spares much memory and computation cost, which speeds up the search process. A feature sharing strategy is introduced for more balanced optimization and more efficient search. Furthermore, we develop an architecture derivation method to replace the traditional one that is based on a hand-crafted rule. Our method enables differentiable sparsification, and keeps the derived architecture equivalent to that of Engine-cell, which further improves the consistency between search and evaluation. More importantly, it supports the search for topology where a node can be connected to prior nodes with any number of connections, so that the searched architectures could be more flexible. Our search on CIFAR-10 has an error rate of 2.22% with only 0.07 GPU-day. We can also directly perform the search on ImageNet with topology learnable and achieve a top-1 error rate of 23.8% in 2.1 GPU-day.

...read moreread less

34 citations

Proceedings Article•DOI•

When Human Pose Estimation Meets Robustness: Adversarial Algorithms and Benchmarks

[...]

Jiahang Wang¹, Sheng Jin², Wentao Liu³, Weizhong Liu¹, Chen Qian³, Ping Luo² - Show less +2 more•Institutions (3)

Huazhong University of Science and Technology¹, University of Hong Kong², SenseTime³

13 May 2021

TL;DR: In this paper, the authors proposed AdvMix, which consists of adversarial augmentation and knowledge distillation to improve the robustness of the pose estimator by learning from harder samples.

...read moreread less

Abstract: Human pose estimation is a fundamental yet challenging task in computer vision, which aims at localizing human anatomical keypoints. However, unlike human vision that is robust to various data corruptions such as blur and pixelation, current pose estimators are easily confused by these corruptions. This work comprehensively studies and addresses this problem by building rigorous robust benchmarks, termed COCO-C, MPII-C, and OCHuman-C, to evaluate the weaknesses of current advanced pose estimators, and a new algorithm termed AdvMix is proposed to improve their robustness in different corruptions. Our work has several unique benefits. (1) AdvMix is model-agnostic and capable in a wide-spectrum of pose estimation models. (2) AdvMix consists of adversarial augmentation and knowledge distillation. Adversarial augmentation contains two neural network modules that are trained jointly and competitively in an adversarial manner, where a generator network mixes different corrupted images to confuse a pose estimator, improving the robustness of the pose estimator by learning from harder samples. To compensate for the noise patterns by adversarial augmentation, knowledge distillation is applied to transfer clean pose structure knowledge to the target pose estimator. (3) Extensive experiments show that AdvMix significantly increases the robustness of pose estimations across a wide range of corruptions, while maintaining accuracy on clean data in various challenging benchmark datasets.

...read moreread less

32 citations

Proceedings Article•DOI•

ViPNAS: Efficient Video Pose Estimation via Neural Architecture Search

[...]

Lumin Xu¹, Guan Yingda², Sheng Jin³, Wentao Liu², Chen Qian², Ping Luo³, Wanli Ouyang⁴, Xiaogang Wang¹ - Show less +4 more•Institutions (4)

The Chinese University of Hong Kong¹, SenseTime², University of Hong Kong³, University of Sydney⁴

01 Jun 2021

TL;DR: Li et al. as mentioned in this paper proposed a novel neural architecture search (NAS) method, termed ViP-NAS, to search networks in both spatial and temporal levels for fast online video pose estimation.

...read moreread less

Abstract: Human pose estimation has achieved significant progress in recent years. However, most of the recent methods focus on improving accuracy using complicated models and ignoring real-time efficiency. To achieve a better trade-off between accuracy and efficiency, we propose a novel neural architecture search (NAS) method, termed ViP-NAS, to search networks in both spatial and temporal levels for fast online video pose estimation. In the spatial level, we carefully design the search space with five different dimensions including network depth, width, kernel size, group number, and attentions. In the temporal level, we search from a series of temporal feature fusions to optimize the total accuracy and speed across multiple video frames. To the best of our knowledge, we are the first to search for the temporal feature fusion and automatic computation allocation in videos. Extensive experiments demonstrate the effectiveness of our approach on the challenging COCO2017 and PoseTrack2018 datasets. Our discovered model family, S-ViPNAS and T-ViPNAS, achieve significantly higher inference speed (CPU real-time) without sacrificing the accuracy compared to the previous state-of-the-art methods.

...read moreread less

30 citations

Proceedings Article•DOI•

BCNet: Searching for Network Width with Bilaterally Coupled Network

[...]

Xiu Su¹, Shan You², Fei Wang², Chen Qian², Changshui Zhang³, Chang Xu¹ - Show less +2 more•Institutions (3)

University of Sydney¹, SenseTime², Tsinghua University³

01 Jun 2021

TL;DR: In this article, a Bilaterally Coupled Network (BCNet) is proposed to evaluate the performance w.r.t. different network widths, where each channel is fairly trained and responsible for the same amount of network width.

...read moreread less

Abstract: Searching for a more compact network width recently serves as an effective way of channel pruning for the deployment of convolutional neural networks (CNNs) under hardware constraints. To fulfill the searching, a one-shot supernet is usually leveraged to efficiently evaluate the performance w.r.t. different network widths. However, current methods mainly follow a unilaterally augmented (UA) principle for the evaluation of each width, which induces the training unfairness of channels in supernet. In this paper, we introduce a new supernet called Bilaterally Coupled Network (BCNet) to address this issue. In BCNet, each channel is fairly trained and responsible for the same amount of network widths, thus each network width can be evaluated more accurately. Besides, we leverage a stochastic complementary strategy for training the BCNet, and propose a prior initial population sampling method to boost the performance of the evolutionary search. Extensive experiments on benchmark CIFAR-10 and ImageNet datasets indicate that our method can achieve state-of-the-art or competing performance over other baseline methods. Moreover, our method turns out to further boost the performance of NAS models by refining their network widths. For example, with the same FLOPs budget, our obtained EfficientNet-B0 achieves 77.36% Top-1 accuracy on ImageNet dataset, surpassing the performance of original setting by 0.48%.

...read moreread less

26 citations

Proceedings Article•DOI•

Video Semantic Segmentation via Sparse Temporal Transformer

[...]

Jiangtong Li¹, Wentao Wang¹, Junjie Chen¹, Li Niu¹, Jianlou Si², Chen Qian², Liqing Zhang¹ - Show less +3 more•Institutions (2)

Shanghai Jiao Tong University¹, SenseTime²

17 Oct 2021

TL;DR: Wang et al. as discussed by the authors proposed a sparse temporal transformer (STT) to bridge temporal relation among video frames adaptively, which is also equipped with query selection and key selection.

...read moreread less

Abstract: Currently, video semantic segmentation mainly faces two challenges: 1) the demand of temporal consistency; 2) the balance between segmentation accuracy and inference efficiency. For the first challenge, existing methods usually use optical flow to capture the temporal relation in consecutive frames and maintain the temporal consistency, but the low inference speed by means of optical flow limits the real-time applications. For the second challenge, flow based key frame warping is one mainstream solution. However, the unbalanced inference latency of flow-based key frame warping makes it unsatisfactory for real-time applications. Considering the segmentation accuracy and inference efficiency, we propose a novel Sparse Temporal Transformer (STT) to bridge temporal relation among video frames adaptively, which is also equipped with query selection and key selection. The key selection and query selection strategies are separately applied to filter out temporal and spatial redundancy in our temporal transformer. Specifically, our STT can reduce the time complexity of temporal transformer by a large margin without harming the segmentation accuracy and temporal consistency. Experiments on two benchmark datasets, Cityscapes and Camvid, demonstrate that our method achieves the state-of-the-art segmentation accuracy and temporal consistency with comparable inference speed.

...read moreread less

17 citations

Journal Article•DOI•

Structure-Coherent Deep Feature Learning for Robust Face Alignment

[...]

Chunze Lin¹, Beier Zhu¹, Quan Wang¹, Renjie Liao², Chen Qian¹, Jiwen Lu³, Jie Zhou³ - Show less +3 more•Institutions (3)

SenseTime¹, University of Toronto², Tsinghua University³

26 May 2021-IEEE Transactions on Image Processing

TL;DR: Zhu et al. as discussed by the authors leverage a landmark-graph relational network to enforce the structural relationships among landmarks, and dynamically adapt the weights of node neighborhood to eliminate distracted information from noisy nodes, such as occluded landmark point.

...read moreread less

Abstract: In this paper, we propose a structure-coherent deep feature learning method for face alignment. Unlike most existing face alignment methods which overlook the facial structure cues, we explicitly exploit the relation among facial landmarks to make the detector robust to hard cases such as occlusion and large pose. Specifically, we leverage a landmark-graph relational network to enforce the structural relationships among landmarks. We consider the facial landmarks as structural graph nodes and carefully design the neighborhood to passing features among the most related nodes. Our method dynamically adapts the weights of node neighborhood to eliminate distracted information from noisy nodes, such as occluded landmark point. Moreover, different from most previous works which only tend to penalize the landmarks absolute position during the training, we propose a relative location loss to enhance the information of relative location of landmarks. This relative location supervision further regularizes the facial structure. Our approach considers the interactions among facial landmarks and can be easily implemented on top of any convolutional backbone to boost the performance. Extensive experiments on three popular benchmarks, including WFLW, COFW and 300W, demonstrate the effectiveness of the proposed method. In particular, due to explicit structure modeling, our approach is especially robust to challenging cases resulting in impressive low failure rate on COFW and WFLW datasets. The model and code are publicly available at https://github.com/BeierZhu/Sturcture-Coherency-Face-Alignment

...read moreread less

16 citations

Journal Article•

Explicit Learning Topology for Differentiable Neural Architecture Search

[...]

Tao Huang, Shan You¹, Yibo Yang², Zhuozhuo Tu³, Fei Wang, Chen Qian⁴, Changshui Zhang¹ - Show less +3 more•Institutions (4)

Tsinghua University¹, Peking University², University of Sydney³, The Chinese University of Hong Kong⁴

04 May 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper proposes an explicit topology modeling method, named TopoNAS, to directly decouple the operation selection and topology during search, and introduces a set of topological variables and a combinatorial probabilistic distribution to explicitly indicate the target topology.

...read moreread less

Abstract: Differentiable neural architecture search (NAS) has gained much success in discovering more ﬂexible and diverse cell types. Current methods couple the operations and topology during search, and simply derive optimal topology by a hand-craft rule. However, topology also matters for neural architectures since it controls the interactions between features of operations. In this paper, we highlight the topology learning in differentiable NAS, and propose an explicit topology modeling method, named TopoNAS, to directly decouple the operation selection and topology during search. Concretely, we introduce a set of topological variables and a combinatorial probabilistic distribution to explicitly indicate the target topology. Besides, we also leverage a passive-aggressive regularization to suppress invalid topology within supernet. Our introduced topological variables can be jointly learned with operation variables and supernet weights, and apply to various DARTS variants. Extensive experiments on CIFAR-10 and ImageNet validate the effectiveness of our proposed TopoNAS. The results show that TopoNAS does enable to search cells with more diverse and complex topology, and boost the performance signiﬁcantly. For example, TopoNAS can improve DARTS by 0.16% accuracy on CIFAR-10 dataset with 40% parameters reduced or 0.35% with similar parameters.

...read moreread less

12 citations

Posted Content•

Locally Free Weight Sharing for Network Width Search

[...]

Xiu Su¹, Shan You², Tao Huang, Fei Wang, Chen Qian³, Changshui Zhang², Chang Xu¹ - Show less +3 more•Institutions (3)

University of Sydney¹, Tsinghua University², The Chinese University of Hong Kong³

10 Feb 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: CafeNet as discussed by the authors proposes a locally free weight sharing strategy, where weights are more freely shared, and each width is jointly indicated by its base channels and free channels, where free channels are supposed to be placed in a local zone to better represent each width.

...read moreread less

Abstract: Searching for network width is an effective way to slim deep neural networks with hardware budgets. With this aim, a one-shot supernet is usually leveraged as a performance evaluator to rank the performance \wrt~different width. Nevertheless, current methods mainly follow a manually fixed weight sharing pattern, which is limited to distinguish the performance gap of different width. In this paper, to better evaluate each width, we propose a locally free weight sharing strategy (CafeNet) accordingly. In CafeNet, weights are more freely shared, and each width is jointly indicated by its base channels and free channels, where free channels are supposed to loCAte FrEely in a local zone to better represent each width. Besides, we propose to further reduce the search space by leveraging our introduced FLOPs-sensitive bins. As a result, our CafeNet can be trained stochastically and get optimized within a min-min strategy. Extensive experiments on ImageNet, CIFAR-10, CelebA and MS COCO dataset have verified our superiority comparing to other state-of-the-art baselines. For example, our method can further boost the benchmark NAS network EfficientNet-B0 by 0.41\% via searching its width more delicately.

...read moreread less

Proceedings Article•

Graph-Based 3D Multi-Person Pose Estimation Using Multi-View Images

[...]

Size Wu, Sheng Jin¹, Wentao Liu², Lei Bai³, Chen Qian², Dong Liu, Wanli Ouyang⁴ - Show less +3 more•Institutions (4)

Zhejiang University¹, SenseTime², University of New South Wales³, University of Sydney⁴

01 Jan 2021

TL;DR: Zhang et al. as discussed by the authors decompose the task into two stages, i.e. person localization and pose estimation, and propose three task-specific graph neural networks for effective message passing.

...read moreread less

Abstract: This paper studies the task of estimating the 3D human poses of multiple persons from multiple calibrated camera views. Following the top-down paradigm, we decompose the task into two stages, i.e. person localization and pose estimation. Both stages are processed in coarse-to-fine manners. And we propose three task-specific graph neural networks for effective message passing. For 3D person localization, we first use Multi-view Matching Graph Module (MMG) to learn the cross-view association and recover coarse human proposals. The Center Refinement Graph Module (CRG) further refines the results via flexible point-based prediction. For 3D pose estimation, the Pose Regression Graph Module (PRG) learns both the multi-view geometry and structural relations between human joints. Our approach achieves state-of-the-art performance on CMU Panoptic and Shelf datasets with significantly lower computation complexity.

...read moreread less

Posted Content•

Towards Improving the Consistency, Efficiency, and Flexibility of Differentiable Neural Architecture Search.

[...]

Yibo Yang¹, Shan You, Hongyang Li¹, Fei Wang, Chen Qian, Zhouchen Lin¹ - Show less +2 more•Institutions (1)

Peking University¹

27 Jan 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: EnTranNAS as mentioned in this paper is composed of Engine-cells and Transit-cells, which is differentiable for architecture search, while the Transit-cell only transits a sub-graph by architecture derivation.

...read moreread less

Abstract: Most differentiable neural architecture search methods construct a super-net for search and derive a target-net as its sub-graph for evaluation. There exists a significant gap between the architectures in search and evaluation. As a result, current methods suffer from an inconsistent, inefficient, and inflexible search process. In this paper, we introduce EnTranNAS that is composed of Engine-cells and Transit-cells. The Engine-cell is differentiable for architecture search, while the Transit-cell only transits a sub-graph by architecture derivation. Consequently, the gap between the architectures in search and evaluation is significantly reduced. Our method also spares much memory and computation cost, which speeds up the search process. A feature sharing strategy is introduced for more balanced optimization and more efficient search. Furthermore, we develop an architecture derivation method to replace the traditional one that is based on a hand-crafted rule. Our method enables differentiable sparsification, and keeps the derived architecture equivalent to that of Engine-cell, which further improves the consistency between search and evaluation. Besides, it supports the search for topology where a node can be connected to prior nodes with any number of connections, so that the searched architectures could be more flexible. For experiments on CIFAR-10, our search on the standard space requires only 0.06 GPU-day. We further have an error rate of 2.22% with 0.07 GPU-day for the search on an extended space. We can also directly perform the search on ImageNet with topology learnable and achieve a top-1 error rate of 23.8% in 2.1 GPU-day.

...read moreread less

Proceedings Article•

TAM: Temporal Adaptive Module for Video Recognition

[...]

Zhaoyang Liu¹, Limin Wang¹, Wayne Wu², Chen Qian³, Tong Lu¹ - Show less +1 more•Institutions (3)

Nanjing University¹, Tsinghua University², The Chinese University of Hong Kong³

04 May 2021

TL;DR: In this article, a new temporal adaptive module ({\bf TAM}) is proposed to generate video-specific kernels based on its own feature maps, which decouples dynamic kernels into a location sensitive importance map and a location invariant aggregation weight.

...read moreread less

Abstract: Temporal modeling is crucial for capturing spatiotemporal structure in videos for action recognition. Video data is with extremely complex dynamics along its temporal dimension due to various factors such as camera motion, speed variation, and different activities. To effectively capture this diverse motion pattern, this paper presents a new temporal adaptive module ({\bf TAM}) to generate video-specific kernels based on its own feature maps. TAM proposes a unique two-level adaptive modeling scheme by decoupling dynamic kernels into a location sensitive importance map and a location invariant aggregation weight. The importance map is learned in a local temporal window to capture short term information, while the aggregation weight is generated from a global view with a focus on long-term structure. TAM is a principled module and could be integrated into 2D CNNs to yield a powerful video architecture (TANet) with a very small extra computational cost. The extensive experiments on Kinetics-400 and Something-Something datasets, demonstrate that the TAM outperforms other temporal modeling methods consistently owing to its temporal adaptive modeling strategy.

...read moreread less

Posted Content•

Reformulating HOI Detection as Adaptive Set Prediction

[...]

Mingfei Chen¹, Yue Liao², Si Liu², Zhiyuan Chen³, Fei Wang³, Chen Qian³ - Show less +2 more•Institutions (3)

Huazhong University of Science and Technology¹, Beihang University², SenseTime³

10 Mar 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: Wang et al. as discussed by the authors reformulate HOI detection as an adaptive set prediction problem, with this novel formulation, they propose an Adaptive Set-based one-stage framework (AS-Net) with parallel instances and interaction branches.

...read moreread less

Abstract: Determining which image regions to concentrate on is critical for Human-Object Interaction (HOI) detection. Conventional HOI detectors focus on either detected human and object pairs or pre-defined interaction locations, which limits learning of the effective features. In this paper, we reformulate HOI detection as an adaptive set prediction problem, with this novel formulation, we propose an Adaptive Set-based one-stage framework (AS-Net) with parallel instances and interaction branches. To attain this, we map a trainable interaction query set to an interaction prediction set with a transformer. Each query adaptively aggregates the interaction-relevant features from global contexts through multi-head co-attention. Besides, the training process is supervised adaptively by matching each ground truth with the interaction prediction. Furthermore, we design an effective instance-aware attention module to introduce instructive features from the instance branch into the interaction branch. Our method outperforms previous state-of-the-art methods without any extra human pose and language features on three challenging HOI detection datasets. Especially, we achieve over $31\%$ relative improvement on a large-scale HICO-DET dataset. Code is available at this https URL.

...read moreread less

Proceedings Article•DOI•

Pareidolia Face Reenactment

[...]

Linsen Song¹, Wayne Wu², Chaoyou Fu¹, Chen Qian², Chen Change Loy³, Ran He¹ - Show less +2 more•Institutions (3)

Chinese Academy of Sciences¹, SenseTime², Nanyang Technological University³

01 Jun 2021

TL;DR: Wang et al. as discussed by the authors proposed to decompose the reenactment into three catenate processes: shape modeling, motion transfer and texture synthesis, and introduce three crucial components, i.e., Parametric Shape Modeling, Expansionary Motion Transfer and Unsupervised Texture Synthesizer, to overcome the remarkably variances on pareidolia faces.

...read moreread less

Abstract: We present a new application direction named Pareidolia Face Reenactment, which is defined as animating a static illusory face to move in tandem with a human face in the video. For the large differences between pareidolia face reenactment and traditional human face reenactment, two main challenges are introduced, i.e., shape variance and texture variance. In this work, we propose a novel Parametric Unsupervised Reenactment Algorithm to tackle these two challenges. Specifically, we propose to decompose the reenactment into three catenate processes: shape modeling, motion transfer and texture synthesis. With the decomposition, we introduce three crucial components, i.e., Parametric Shape Modeling, Expansionary Motion Transfer and Unsupervised Texture Synthesizer, to overcome the problems brought by the remarkably variances on pareidolia faces. Extensive experiments show the superior performance of our method both qualitatively and quantitatively. Code, model and data are available on our project page1.

...read moreread less

Posted Content•

When Human Pose Estimation Meets Robustness: Adversarial Algorithms and Benchmarks

[...]

Jiahang Wang¹, Sheng Jin², Wentao Liu³, Weizhong Liu¹, Chen Qian³, Ping Luo² - Show less +2 more•Institutions (3)

Huazhong University of Science and Technology¹, University of Hong Kong², SenseTime³

13 May 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, the authors proposed AdvMix, which consists of adversarial augmentation and knowledge distillation to improve the robustness of the pose estimator by learning from harder samples.

...read moreread less

Abstract: Human pose estimation is a fundamental yet challenging task in computer vision, which aims at localizing human anatomical keypoints. However, unlike human vision that is robust to various data corruptions such as blur and pixelation, current pose estimators are easily confused by these corruptions. This work comprehensively studies and addresses this problem by building rigorous robust benchmarks, termed COCO-C, MPII-C, and OCHuman-C, to evaluate the weaknesses of current advanced pose estimators, and a new algorithm termed AdvMix is proposed to improve their robustness in different corruptions. Our work has several unique benefits. (1) AdvMix is model-agnostic and capable in a wide-spectrum of pose estimation models. (2) AdvMix consists of adversarial augmentation and knowledge distillation. Adversarial augmentation contains two neural network modules that are trained jointly and competitively in an adversarial manner, where a generator network mixes different corrupted images to confuse a pose estimator, improving the robustness of the pose estimator by learning from harder samples. To compensate for the noise patterns by adversarial augmentation, knowledge distillation is applied to transfer clean pose structure knowledge to the target pose estimator. (3) Extensive experiments show that AdvMix significantly increases the robustness of pose estimations across a wide range of corruptions, while maintaining accuracy on clean data in various challenging benchmark datasets.

...read moreread less

Posted Content•

Vision Transformer Architecture Search

[...]

Xiu Su, Shan You, Jiyang Xie, Mingkai Zheng, Fei Wang, Chen Qian, Changshui Zhang, Xiaogang Wang, Chang Xu - Show less +5 more

25 Jun 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: Zhang et al. as mentioned in this paper proposed an architecture search method, dubbed ViTAS, to search for the optimal architecture with similar hardware budgets, such that architectures with different token embedding, sequence size, number of heads, width, and depth can be derived from a single super-transformer.

...read moreread less

Abstract: Recently, transformers have shown great superiority in solving computer vision tasks by modeling images as a sequence of manually-split patches with self-attention mechanism. However, current architectures of vision transformers (ViTs) are simply inherited from natural language processing (NLP) tasks and have not been sufficiently investigated and optimized. In this paper, we make a further step by examining the intrinsic structure of transformers for vision tasks and propose an architecture search method, dubbed ViTAS, to search for the optimal architecture with similar hardware budgets. Concretely, we design a new effective yet efficient weight sharing paradigm for ViTs, such that architectures with different token embedding, sequence size, number of heads, width, and depth can be derived from a single super-transformer. Moreover, to cater for the variance of distinct architectures, we introduce \textit{private} class token and self-attention maps in the super-transformer. In addition, to adapt the searching for different budgets, we propose to search the sampling probability of identity operation. Experimental results show that our ViTAS attains excellent results compared to existing pure transformer architectures. For example, with $1.3$G FLOPs budget, our searched architecture achieves $74.7\%$ top-$1$ accuracy on ImageNet and is $2.5\%$ superior than the current baseline ViT architecture. Code is available at \url{this https URL}.

...read moreread less

Posted Content•

BCNet: Searching for Network Width with Bilaterally Coupled Network

[...]

Xiu Su¹, Shan You², Fei Wang², Chen Qian², Changshui Zhang³, Chang Xu¹ - Show less +2 more•Institutions (3)

University of Sydney¹, SenseTime², Tsinghua University³

21 May 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: The Bilaterally Coupled Network (BCNet) as discussed by the authors proposes a prior initial population sampling method to boost the performance of the evolutionary search and achieves state-of-the-art or competing performance over other baseline methods.

...read moreread less

Abstract: Searching for a more compact network width recently serves as an effective way of channel pruning for the deployment of convolutional neural networks (CNNs) under hardware constraints. To fulfill the searching, a one-shot supernet is usually leveraged to efficiently evaluate the performance \wrt~different network widths. However, current methods mainly follow a \textit{unilaterally augmented} (UA) principle for the evaluation of each width, which induces the training unfairness of channels in supernet. In this paper, we introduce a new supernet called Bilaterally Coupled Network (BCNet) to address this issue. In BCNet, each channel is fairly trained and responsible for the same amount of network widths, thus each network width can be evaluated more accurately. Besides, we leverage a stochastic complementary strategy for training the BCNet, and propose a prior initial population sampling method to boost the performance of the evolutionary search. Extensive experiments on benchmark CIFAR-10 and ImageNet datasets indicate that our method can achieve state-of-the-art or competing performance over other baseline methods. Moreover, our method turns out to further boost the performance of NAS models by refining their network widths. For example, with the same FLOPs budget, our obtained EfficientNet-B0 achieves 77.36\% Top-1 accuracy on ImageNet dataset, surpassing the performance of original setting by 0.48\%.

...read moreread less

Posted Content•

K-shot NAS: Learnable Weight-Sharing for NAS with K-shot Supernets

[...]

Xiu Su¹, Shan You², Mingkai Zheng, Fei Wang, Chen Qian, Changshui Zhang³, Chang Xu¹ - Show less +3 more•Institutions (3)

University of Sydney¹, SenseTime², Tsinghua University³

11 Jun 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, instead of counting on a single supernet, instead of taking their weights for each operation as a dictionary, the operation weight for each path is represented as a convex combination of items in a dictionary with a simplex code.

...read moreread less

Abstract: In one-shot weight sharing for NAS, the weights of each operation (at each layer) are supposed to be identical for all architectures (paths) in the supernet. However, this rules out the possibility of adjusting operation weights to cater for different paths, which limits the reliability of the evaluation results. In this paper, instead of counting on a single supernet, we introduce $K$-shot supernets and take their weights for each operation as a dictionary. The operation weight for each path is represented as a convex combination of items in a dictionary with a simplex code. This enables a matrix approximation of the stand-alone weight matrix with a higher rank ($K>1$). A \textit{simplex-net} is introduced to produce architecture-customized code for each path. As a result, all paths can adaptively learn how to share weights in the $K$-shot supernets and acquire corresponding weights for better evaluation. $K$-shot supernets and simplex-net can be iteratively trained, and we further extend the search to the channel dimension. Extensive experiments on benchmark datasets validate that K-shot NAS significantly improves the evaluation accuracy of paths and thus brings in impressive performance improvements.

...read moreread less

Book Chapter•DOI•

PNO: Personalized Network Optimization for Human Pose and Shape Reconstruction

[...]

Zhijie Cao¹, Min Wang², Shanyan Guan¹, Wentao Liu², Chen Qian², Lizhuang Ma¹ - Show less +2 more•Institutions (2)

Shanghai Jiao Tong University¹, SenseTime²

14 Sep 2021

TL;DR: In this paper, a Personalized Network Optimization (PNO) method is proposed to maintain both generalization and personality for human pose and shape reconstruction by optimizing with only a few unlabeled video frames of the target person.

...read moreread less

Abstract: Most previous human pose and shape reconstruction methods focus on the generalization ability and learn a prior of the general pose and shape, however the personalized features are often ignored. We argue that the personalized features such as appearance and body shape are always consistent for the specific person and can further improve the accuracy. In this paper, we propose a Personalized Network Optimization (PNO) method to maintain both generalization and personality for human pose and shape reconstruction. The general trained network is adapted to the personalized network by optimizing with only a few unlabeled video frames of the target person. Moreover, we specially propose geometry-aware temporal constraints that help the network better exploit the geometry knowledge of the target person. In order to prove the effectiveness of PNO, we re-design the benchmark of pose and shape reconstruction to test on each person independently. Experiments show that our method achieve the state-of-the-art results in both 3DPW and MPI-INF-3DHP datasets.

...read moreread less

Proceedings Article•

Weakly Supervised Contrastive Learning

[...]

Mingkai Zheng, Fei Wang¹, Shan You², Chen Qian¹, Changshui Zhang², Xiaogang Wang³, Chang Xu⁴ - Show less +3 more•Institutions (4)

SenseTime¹, Tsinghua University², The Chinese University of Hong Kong³, University of Sydney⁴

01 Jan 2021

TL;DR: Wang et al. as discussed by the authors proposed a weakly supervised contrastive learning framework (WCL) to tackle the problem of class collision by using a graph-based method to explore similar samples and generate a weak label.

...read moreread less

Abstract: Unsupervised visual representation learning has gained much attention from the computer vision community because of the recent achievement of contrastive learning. Most of the existing contrastive learning frameworks adopt the instance discrimination as the pretext task, which treating every single instance as a different class. However, such method will inevitably cause class collision problems, which hurts the quality of the learned representation. Motivated by this observation, we introduced a weakly supervised contrastive learning framework (WCL) to tackle this issue. Specifically, our proposed framework is based on two projection heads, one of which will perform the regular instance discrimination task. The other head will use a graph-based method to explore similar samples and generate a weak label, then perform a supervised contrastive learning task based on the weak label to pull the similar images closer. We further introduced a K-Nearest Neighbor based multi-crop strategy to expand the number of positive samples. Extensive experimental results demonstrate WCL improves the quality of self-supervised representations across different datasets. Notably, we get a new state-of-the-art result for semi-supervised learning. With only 1\% and 10\% labeled examples, WCL achieves 65\% and 72\% ImageNet Top-1 Accuracy using ResNet50, which is even higher than SimCLRv2 with ResNet101.

...read moreread less

Proceedings Article•

K-shot NAS: Learnable Weight-Sharing for NAS with K-shot Supernets

[...]

Xiu Su¹, Shan You², Mingkai Zheng, Fei Wang, Chen Qian, Changshui Zhang³, Chang Xu¹ - Show less +3 more•Institutions (3)

University of Sydney¹, SenseTime², Tsinghua University³

18 Jul 2021

TL;DR: In this article, instead of counting on a single supernet, instead of taking their weights for each operation as a dictionary, the operation weight for each path is represented as a convex combination of items in a dictionary with a simplex code.

...read moreread less

Abstract: In one-shot weight sharing for NAS, the weights of each operation (at each layer) are supposed to be identical for all architectures (paths) in the supernet. However, this rules out the possibility of adjusting operation weights to cater for different paths, which limits the reliability of the evaluation results. In this paper, instead of counting on a single supernet, we introduce $K$-shot supernets and take their weights for each operation as a dictionary. The operation weight for each path is represented as a convex combination of items in a dictionary with a simplex code. This enables a matrix approximation of the stand-alone weight matrix with a higher rank ($K>1$). A \textit{simplex-net} is introduced to produce architecture-customized code for each path. As a result, all paths can adaptively learn how to share weights in the $K$-shot supernets and acquire corresponding weights for better evaluation. $K$-shot supernets and simplex-net can be iteratively trained, and we further extend the search to the channel dimension. Extensive experiments on benchmark datasets validate that K-shot NAS significantly improves the evaluation accuracy of paths and thus brings in impressive performance improvements.

...read moreread less

Proceedings Article•

ReSSL: Relational Self-Supervised Learning with Weak Augmentation

[...]

Mingkai Zheng, Shan You¹, Fei Wang¹, Chen Qian¹, Changshui Zhang², Xiaogang Wang³, Chang Xu⁴ - Show less +3 more•Institutions (4)

SenseTime¹, Tsinghua University², The Chinese University of Hong Kong³, University of Sydney⁴

06 Dec 2021

TL;DR: Li et al. as mentioned in this paper proposed a relational self-supervised learning (ReSSL) framework, which employs sharpened distribution of pairwise similarities among different instances as \textit{relation} metric, which is thus utilized to match the feature embeddings of different augmentations.

...read moreread less

Abstract: Self-supervised Learning (SSL) including the mainstream contrastive learning has achieved great success in learning visual representations without data annotations. However, most of methods mainly focus on the instance level information (\ie, the different augmented images of the same instance should have the same feature or cluster into the same class), but there is a lack of attention on the relationships between different instances. In this paper, we introduced a novel SSL paradigm, which we term as relational self-supervised learning (ReSSL) framework that learns representations by modeling the relationship between different instances. Specifically, our proposed method employs sharpened distribution of pairwise similarities among different instances as \textit{relation} metric, which is thus utilized to match the feature embeddings of different augmentations. Moreover, to boost the performance, we argue that weak augmentations matter to represent a more reliable relation, and leverage momentum strategy for practical efficiency. Experimental results show that our proposed ReSSL significantly outperforms the previous state-of-the-art algorithms in terms of both performance and training efficiency. Code is available at \url{this https URL}.

...read moreread less

Posted Content•

Weak-shot Semantic Segmentation by Transferring Semantic Affinity and Boundary.

[...]

Siyuan Zhou, Li Niu, Jianlou Si, Chen Qian, Liqing Zhang - Show less +1 more

04 Oct 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, a weakly-supervised semantic segmentation (WSSS) with image-level labels was proposed to relieve the annotation burden of the traditional segmentation task, which could also be treated as WSSS with auxiliary fully-annotated categories.

...read moreread less

Abstract: Weakly-supervised semantic segmentation (WSSS) with image-level labels has been widely studied to relieve the annotation burden of the traditional segmentation task. In this paper, we show that existing fully-annotated base categories can help segment objects of novel categories with only image-level labels, even if base and novel categories have no overlap. We refer to this task as weak-shot semantic segmentation, which could also be treated as WSSS with auxiliary fully-annotated categories. Recent advanced WSSS methods usually obtain class activation maps (CAMs) and refine them by affinity propagation. Based on the observation that semantic affinity and boundary are class-agnostic, we propose a method under the WSSS framework to transfer semantic affinity and boundary from base categories to novel ones. As a result, we find that pixel-level annotation of base categories can facilitate affinity learning and propagation, leading to higher-quality CAMs of novel categories. Extensive experiments on PASCAL VOC 2012 dataset demonstrate that our method significantly outperforms WSSS baselines on novel categories.

...read moreread less

Posted Content•

ReSSL: Relational Self-Supervised Learning with Weak Augmentation

[...]

Mingkai Zheng, Shan You¹, Fei Wang¹, Chen Qian¹, Changshui Zhang², Xiaogang Wang³, Chang Xu⁴ - Show less +3 more•Institutions (4)

SenseTime¹, Tsinghua University², The Chinese University of Hong Kong³, University of Sydney⁴

20 Jul 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: Li et al. as discussed by the authors proposed a relational self-supervised learning (ReSSL) framework, which employs sharpened distribution of pairwise similarities among different instances as \textit{relation} metric, which is thus utilized to match the feature embeddings of different augmentations.

...read moreread less

Abstract: Self-supervised Learning (SSL) including the mainstream contrastive learning has achieved great success in learning visual representations without data annotations. However, most of methods mainly focus on the instance level information (\ie, the different augmented images of the same instance should have the same feature or cluster into the same class), but there is a lack of attention on the relationships between different instances. In this paper, we introduced a novel SSL paradigm, which we term as relational self-supervised learning (ReSSL) framework that learns representations by modeling the relationship between different instances. Specifically, our proposed method employs sharpened distribution of pairwise similarities among different instances as \textit{relation} metric, which is thus utilized to match the feature embeddings of different augmentations. Moreover, to boost the performance, we argue that weak augmentations matter to represent a more reliable relation, and leverage momentum strategy for practical efficiency. Experimental results show that our proposed ReSSL significantly outperforms the previous state-of-the-art algorithms in terms of both performance and training efficiency. Code is available at \url{this https URL}.

...read moreread less

Proceedings Article•

Locally Free Weight sharing for Network Width Search

[...]

Xiu Su¹, Shan You², Tao Huang, Fei Wang, Chen Qian³, Changshui Zhang², Chang Xu¹ - Show less +3 more•Institutions (3)

University of Sydney¹, Tsinghua University², The Chinese University of Hong Kong³

03 May 2021

TL;DR: CafeNet as discussed by the authors proposes a locally free weight sharing strategy, where weights are more freely shared and each width is jointly indicated by its base channels and free channels, where free channels are supposed to locate freely in a local zone to better represent each width.

...read moreread less

Abstract: Searching for network width is an effective way to slim deep neural networks with hardware budgets. With this aim, a one-shot supernet is usually leveraged as a performance evaluator to rank the performance \wrt~different width. Nevertheless, current methods mainly follow a manually fixed weight sharing pattern, which is limited to distinguish the performance gap of different width. In this paper, to better evaluate each width, we propose a locally free weight sharing strategy (CafeNet) accordingly. In CafeNet, weights are more freely shared, and each width is jointly indicated by its base channels and free channels, where free channels are supposed to locate freely in a local zone to better represent each width. Besides, we propose to further reduce the search space by leveraging our introduced FLOPs-sensitive bins. As a result, our CafeNet can be trained stochastically and get optimized within a min-min strategy. Extensive experiments on ImageNet, CIFAR-10, CelebA and MS COCO dataset have verified our superiority comparing to other state-of-the-art baselines. For example, our method can further boost the benchmark NAS network EfficientNet-B0 by 0.41\% via searching its width more delicately.

...read moreread less

Posted Content•

GreedyNASv2: Greedier Search with a Greedy Path Filter

[...]

Tao Huang, Shan You, Fei Wang, Chen Qian, Changshui Zhang, Xiaogang Wang, Chang Xu - Show less +3 more

24 Nov 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: GreedyNASv2 as discussed by the authors leverages an explicit path filter to capture the characteristics of paths and directly filter those weak ones, so that the search can be thus implemented on the shrunk space more greedily and efficiently.

...read moreread less

Abstract: Training a good supernet in one-shot NAS methods is difficult since the search space is usually considerably huge (\eg, $13^{21}$). In order to enhance the supernet's evaluation ability, one greedy strategy is to sample good paths, and let the supernet lean towards the good ones and ease its evaluation burden as a result. However, in practice the search can be still quite inefficient since the identification of good paths is not accurate enough and sampled paths still scatter around the whole search space. In this paper, we leverage an explicit path filter to capture the characteristics of paths and directly filter those weak ones, so that the search can be thus implemented on the shrunk space more greedily and efficiently. Concretely, based on the fact that good paths are much less than the weak ones in the space, we argue that the label of ``weak paths" will be more confident and reliable than that of ``good paths" in multi-path sampling. In this way, we thus cast the training of path filter in the positive and unlabeled (PU) learning paradigm, and also encourage a \textit{path embedding} as better path/operation representation to enhance the identification capacity of the learned filter. By dint of this embedding, we can further shrink the search space by aggregating similar operations with similar embeddings, and the search can be more efficient and accurate. Extensive experiments validate the effectiveness of the proposed method GreedyNASv2. For example, our obtained GreedyNASv2-L achieves $81.1\%$ Top-1 accuracy on ImageNet dataset, significantly outperforming the ResNet-50 strong baselines.

...read moreread less

Posted Content•

Everything's Talkin': Pareidolia Face Reenactment

[...]

Linsen Song¹, Wayne Wu, Chaoyou Fu, Chen Qian¹, Chen Change Loy¹, Ran He² - Show less +2 more•Institutions (2)

Chinese Academy of Sciences¹, Nanyang Technological University²

07 Apr 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: Wang et al. as discussed by the authors proposed to decompose the reenactment into three catenate processes: shape modeling, motion transfer and texture synthesis, and introduce three crucial components, i.e., Parametric Shape Modeling, Expansionary Motion Transfer and Unsupervised Texture Synthesizer, to overcome the remarkably variances on pareidolia faces.

...read moreread less

Abstract: We present a new application direction named Pareidolia Face Reenactment, which is defined as animating a static illusory face to move in tandem with a human face in the video. For the large differences between pareidolia face reenactment and traditional human face reenactment, two main challenges are introduced, i.e., shape variance and texture variance. In this work, we propose a novel Parametric Unsupervised Reenactment Algorithm to tackle these two challenges. Specifically, we propose to decompose the reenactment into three catenate processes: shape modeling, motion transfer and texture synthesis. With the decomposition, we introduce three crucial components, i.e., Parametric Shape Modeling, Expansionary Motion Transfer and Unsupervised Texture Synthesizer, to overcome the problems brought by the remarkably variances on pareidolia faces. Extensive experiments show the superior performance of our method both qualitatively and quantitatively. Code, model and data are available on our project page.

...read moreread less

Posted Content•

Prioritized Architecture Sampling with Monto-Carlo Tree Search

[...]

Xiu Su¹, Tao Huang², Yanxi Li¹, Shan You², Fei Wang², Chen Qian², Changshui Zhang³, Chang Xu¹ - Show less +4 more•Institutions (3)

University of Sydney¹, SenseTime², Tsinghua University³

22 Mar 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: Zhang et al. as discussed by the authors proposed a sampling strategy based on Monte Carlo tree search (MCTS) with the search space modeled as a MCT, which captures the dependency among layers.

...read moreread less

Abstract: One-shot neural architecture search (NAS) methods significantly reduce the search cost by considering the whole search space as one network, which only needs to be trained once. However, current methods select each operation independently without considering previous layers. Besides, the historical information obtained with huge computation cost is usually used only once and then discarded. In this paper, we introduce a sampling strategy based on Monte Carlo tree search (MCTS) with the search space modeled as a Monte Carlo tree (MCT), which captures the dependency among layers. Furthermore, intermediate results are stored in the MCT for the future decision and a better exploration-exploitation balance. Concretely, MCT is updated using the training loss as a reward to the architecture performance; for accurately evaluating the numerous nodes, we propose node communication and hierarchical node selection methods in the training and search stages, respectively, which make better uses of the operation rewards and hierarchical information. Moreover, for a fair comparison of different NAS methods, we construct an open-source NAS benchmark of a macro search space evaluated on CIFAR-10, namely NAS-Bench-Macro. Extensive experiments on NAS-Bench-Macro and ImageNet demonstrate that our method significantly improves search efficiency and performance. For example, by only searching $20$ architectures, our obtained architecture achieves $78.0\%$ top-1 accuracy with 442M FLOPs on ImageNet. Code (Benchmark) is available at: \url{this https URL}.

...read moreread less

Showing papers by "Chen Qian published in 2021"