Top 31 papers published by Chen Qian from SenseTime in 2019

Proceedings Article•DOI•

The Seventh Visual Object Tracking VOT2019 Challenge Results

[...]

Matej Kristan¹, Amanda Berg², Linyu Zheng³, Litu Rout⁴ +176 more•Institutions (43)

01 Oct 2019

TL;DR: The Visual Object Tracking challenge VOT2019 is the seventh annual tracker benchmarking activity organized by the VOT initiative; results of 81 trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in the recent years.

...read moreread less

Abstract: The Visual Object Tracking challenge VOT2019 is the seventh annual tracker benchmarking activity organized by the VOT initiative. Results of 81 trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in the recent years. The evaluation included the standard VOT and other popular methodologies for short-term tracking analysis as well as the standard VOT methodology for long-term tracking analysis. The VOT2019 challenge was composed of five challenges focusing on different tracking domains: (i) VOTST2019 challenge focused on short-term tracking in RGB, (ii) VOT-RT2019 challenge focused on "real-time" shortterm tracking in RGB, (iii) VOT-LT2019 focused on longterm tracking namely coping with target disappearance and reappearance. Two new challenges have been introduced: (iv) VOT-RGBT2019 challenge focused on short-term tracking in RGB and thermal imagery and (v) VOT-RGBD2019 challenge focused on long-term tracking in RGB and depth imagery. The VOT-ST2019, VOT-RT2019 and VOT-LT2019 datasets were refreshed while new datasets were introduced for VOT-RGBT2019 and VOT-RGBD2019. The VOT toolkit has been updated to support both standard shortterm, long-term tracking and tracking with multi-channel imagery. Performance of the tested trackers typically by far exceeds standard baselines. The source code for most of the trackers is publicly available from the VOT page. The dataset, the evaluation kit and the results are publicly available at the challenge website.

...read moreread less

393 citations

Proceedings Article•DOI•

Deep Comprehensive Correlation Mining for Image Clustering

[...]

Jianlong Wu¹, Keyu Long², Fei Wang, Chen Qian, Cheng Li³, Zhouchen Lin¹, Hongbin Zha¹ - Show less +3 more•Institutions (3)

Peking University¹, Chinese Academy of Sciences², SenseTime³

01 Oct 2019

TL;DR: A novel clustering framework, named deep comprehensive correlation mining~(DCCM), for exploring and taking full advantage of various kinds of correlations behind the unlabeled data from three aspects: Instead of only using pair-wise information, pseudo-label supervision is proposed to investigate category information and learn discriminative features.

...read moreread less

Abstract: Recent developed deep unsupervised methods allow us to jointly learn representation and cluster unlabelled data. These deep clustering methods %like DAC start with mainly focus on the correlation among samples, e.g., selecting high precision pairs to gradually tune the feature representation, which neglects other useful correlations. In this paper, we propose a novel clustering framework, named deep comprehensive correlation mining~(DCCM), for exploring and taking full advantage of various kinds of correlations behind the unlabeled data from three aspects: 1) Instead of only using pair-wise information, pseudo-label supervision is proposed to investigate category information and learn discriminative features. 2) The features' robustness to image transformation of input space is fully explored, which benefits the network learning and significantly improves the performance. 3) The triplet mutual information among features is presented for clustering problem to lift the recently discovered instance-level deep mutual information to a triplet-level formation, which further helps to learn more discriminative features. Extensive experiments on several challenging datasets show that our method achieves good performance, e.g., attaining 62.3% clustering accuracy on CIFAR-10, which is 10.1% higher than the state-of-the-art results.

...read moreread less

141 citations

Proceedings Article•DOI•

Weakly-Supervised Discovery of Geometry-Aware Representation for 3D Human Pose Estimation

[...]

Xipeng Chen¹, Kwan-Yee Lin², Wentao Liu, Chen Qian, Liang Lin¹ - Show less +1 more•Institutions (2)

Sun Yat-sen University¹, Peking University²

15 Jun 2019

TL;DR: Zhang et al. as mentioned in this paper propose a skeleton-based encoder-decoder mechanism to distil only pose-related representation in the latent space, and a learning-based representation consistency constraint is further introduced to facilitate the robustness of latent 3D representation.

...read moreread less

Abstract: Recent studies have shown remarkable advances in 3D human pose estimation from monocular images, with the help of large-scale in-door 3D datasets and sophisticated network architectures. However, the generalizability to different environments remains an elusive goal. In this work, we propose a geometry-aware 3D representation for the human pose to address this limitation by using multiple views in a simple auto-encoder model at the training stage and only 2D keypoint information as supervision. A view synthesis framework is proposed to learn the shared 3D representation between viewpoints with synthe- sizing the human pose from one viewpoint to the other one. Instead of performing a direct transfer in the raw image- level, we propose a skeleton-based encoder-decoder mechanism to distil only pose-related representation in the latent space. A learning-based representation consistency constraint is further introduced to facilitate the robustness of latent 3D representation. Since the learnt representation encodes 3D geometry information, mapping it to 3D pose will be much easier than conventional frameworks that use an image or 2D coordinates as the input of 3D pose estimator. We demonstrate our approach on the task of 3D human pose estimation. Comprehensive experiments on three popular benchmarks show that our model can significantly improve the performance of state-of-the-art methods with simply injecting the representation as a robust 3D prior.

...read moreread less

109 citations

Proceedings Article•DOI•

TransGaGa: Geometry-Aware Unsupervised Image-To-Image Translation

[...]

Wayne Wu¹, Kaidi Cao², Cheng Li¹, Chen Qian, Chen Change Loy³ - Show less +1 more•Institutions (3)

SenseTime¹, Stanford University², Nanyang Technological University³

15 Jun 2019

TL;DR: TGaGa as mentioned in this paper disentangles the image space into a Cartesian product of the appearance and the geometry latent spaces, and then builds the translation on appearance and geometry space separately.

...read moreread less

Abstract: Unsupervised image-to-image translation aims at learning a mapping between two visual domains. However, learning a translation across large geometry variations al- ways ends up with failure. In this work, we present a novel disentangle-and-translate framework to tackle the complex objects image-to-image translation task. Instead of learning the mapping on the image space directly, we disentangle image space into a Cartesian product of the appearance and the geometry latent spaces. Specifically, we first in- troduce a geometry prior loss and a conditional VAE loss to encourage the network to learn independent but com- plementary representations. The translation is then built on appearance and geometry space separately. Extensive experiments demonstrate the superior performance of our method to other state-of-the-art approaches, especially in the challenging near-rigid and non-rigid objects translation tasks. In addition, by taking different exemplars as the ap- pearance references, our method also supports multimodal translation. Project page: https://wywu.github. io/projects/TGaGa/TGaGa.html

...read moreread less

92 citations

Posted Content•

A Real-Time Cross-modality Correlation Filtering Method for Referring Expression Comprehension

[...]

Yue Liao¹, Si Liu¹, Guanbin Li², Fei Wang³, Chen Yanjie³, Chen Qian³, Bo Li¹ - Show less +3 more•Institutions (3)

Beihang University¹, Sun Yat-sen University², SenseTime³

16 Sep 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: Wang et al. as discussed by the authors proposed a real-time cross-modality correlation filtering method (RCCF), which reformulates the referring expression comprehension as a correlation filtering process.

...read moreread less

Abstract: Referring expression comprehension aims to localize the object instance described by a natural language expression. Current referring expression methods have achieved good performance. However, none of them is able to achieve real-time inference without accuracy drop. The reason for the relatively slow inference speed is that these methods artificially split the referring expression comprehension into two sequential stages including proposal generation and proposal ranking. It does not exactly conform to the habit of human cognition. To this end, we propose a novel Realtime Cross-modality Correlation Filtering method (RCCF). RCCF reformulates the referring expression comprehension as a correlation filtering process. The expression is first mapped from the language domain to the visual domain and then treated as a template (kernel) to perform correlation filtering on the image feature map. The peak value in the correlation heatmap indicates the center points of the target box. In addition, RCCF also regresses a 2-D object size and 2-D offset. The center point coordinates, object size and center point offset together to form the target bounding box. Our method runs at 40 FPS while achieving leading performance in RefClef, RefCOCO, RefCOCO+ and RefCOCOg benchmarks. In the challenging RefClef dataset, our methods almost double the state-of-the-art performance (34.70% increased to 63.79%). We hope this work can arouse more attention and studies to the new cross-modality correlation filtering framework as well as the one-stage framework for referring expression comprehension.

...read moreread less

80 citations

Proceedings Article•DOI•

Multi-Person Articulated Tracking With Spatial and Temporal Embeddings

[...]

Sheng Jin¹, Wentao Liu, Wanli Ouyang², Chen Qian•Institutions (2)

Tsinghua University¹, University of Sydney²

15 Jun 2019

TL;DR: A unified framework for multi-person pose estimation and tracking, which combines SpatialNet and TemporalNet, and model the grouping procedure into a differentiable Pose-Guided Grouping (PGG) module to make the whole part detection and grouping pipeline fully end-to-end trainable.

...read moreread less

Abstract: We propose a unified framework for multi-person pose estimation and tracking. Our framework consists of two main components, i.e. SpatialNet and TemporalNet. The SpatialNet accomplishes body part detection and part-level data association in a single frame, while the TemporalNet groups human instances in consecutive frames into trajectories. Specifically, besides body part detection heatmaps, SpatialNet also predicts the Keypoint Embedding (KE) and Spatial Instance Embedding (SIE) for body part association. We model the grouping procedure into a differentiable Pose-Guided Grouping (PGG) module to make the whole part detection and grouping pipeline fully end-to-end trainable. TemporalNet extends the spatial grouping of keypoints to temporal grouping of human instances. Given human proposals from two consecutive frames, TemporalNet exploits both appearance features encoded in Human Embedding (HE) and temporally consistent geometric features embodied in Temporal Instance Embedding (TIE) for robust tracking. Extensive experiments demonstrate the effectiveness of our proposed model. Remarkably, we demonstrate substantial improvements over the state-of-the-art pose tracking method from 65.4% to 71.8% Multi-Object Tracking Accuracy (MOTA) on the ICCV'17 PoseTrack Dataset.

...read moreread less

76 citations

Proceedings Article•DOI•

Make a Face: Towards Arbitrary High Fidelity Face Manipulation

[...]

Shengju Qian¹, Kwan-Yee Lin², Wayne Wu³, Yangxiaokang Liu⁴, Quan Wang, Fumin Shen⁴, Chen Qian, Ran He⁵ - Show less +4 more•Institutions (5)

The Chinese University of Hong Kong¹, Peking University², Tsinghua University³, University of Electronic Science and Technology of China⁴, Chinese Academy of Sciences⁵

01 Oct 2019

TL;DR: This work proposes Additive Focal Variational Auto-encoder (AF-VAE), a novel approach that can arbitrarily manipulate high-resolution face images using a simple yet effective model and only weak supervision of reconstruction and KL divergence losses.

...read moreread less

Abstract: Recent studies have shown remarkable success in face manipulation task with the advance of GANs and VAEs paradigms, but the outputs are sometimes limited to low-resolution and lack of diversity. In this work, we propose Additive Focal Variational Auto-encoder (AF-VAE), a novel approach that can arbitrarily manipulate high-resolution face images using a simple yet effective model and only weak supervision of reconstruction and KL divergence losses. First, a novel additive Gaussian Mixture assumption is introduced with an unsupervised clustering mechanism in the structural latent space, which endows better disentanglement and boosts multi-modal representation with external memory. Second, to improve the perceptual quality of synthesized results, two simple strategies in architecture design are further tailored and discussed on the behavior of Human Visual System (HVS) for the first time, allowing for fine control over the model complexity and sample quality. Human opinion studies and new state-of-the-art Inception Score (IS) / Frechet Inception Distance (FID) demonstrate the superiority of our approach over existing algorithms, advancing both the fidelity and extremity of face manipulation task.

...read moreread less

71 citations

Proceedings Article•DOI•

Aggregation via Separation: Boosting Facial Landmark Detector With Semi-Supervised Style Translation

[...]

Shengju Qian¹, Keqiang Sun², Wayne Wu², Chen Qian, Jiaya Jia¹ - Show less +1 more•Institutions (2)

The Chinese University of Hong Kong¹, Tsinghua University²

01 Oct 2019

TL;DR: In this paper, the authors leverage disentangled style and shape space of each individual to augment existing structures via style translation, which leads to further notable improvement in facial landmark detection.

...read moreread less

Abstract: Facial landmark detection, or face alignment, is a fundamental task that has been extensively studied. In this paper, we investigate a new perspective of facial landmark detection and demonstrate it leads to further notable improvement. Given that any face images can be factored into space of style that captures lighting, texture and image environment, and a style-invariant structure space, our key idea is to leverage disentangled style and shape space of each individual to augment existing structures via style translation. With these augmented synthetic samples, our semi-supervised model surprisingly outperforms the fully-supervised one by a large margin. Extensive experiments verify the effectiveness of our idea with state-of-the-art results on WFLW, 300W, COFW, and AFLW datasets. Our proposed structure is general and could be assembled into any face alignment frameworks. The code is made publicly available at https://github.com/thesouthfrog/stylealign.

...read moreread less

56 citations

Posted Content•

Aggregation via Separation: Boosting Facial Landmark Detector with Semi-Supervised Style Translation

[...]

Shengju Qian¹, Keqiang Sun², Wayne Wu², Chen Qian, Jiaya Jia¹ - Show less +1 more•Institutions (2)

The Chinese University of Hong Kong¹, Tsinghua University²

18 Aug 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: A new perspective of facial landmark detection is investigated and it is demonstrated it leads to further notable improvement and surprisingly outperforms the fully-supervised model by a large margin.

...read moreread less

Abstract: Facial landmark detection, or face alignment, is a fundamental task that has been extensively studied. In this paper, we investigate a new perspective of facial landmark detection and demonstrate it leads to further notable improvement. Given that any face images can be factored into space of style that captures lighting, texture and image environment, and a style-invariant structure space, our key idea is to leverage disentangled style and shape space of each individual to augment existing structures via style translation. With these augmented synthetic samples, our semi-supervised model surprisingly outperforms the fully-supervised one by a large margin. Extensive experiments verify the effectiveness of our idea with state-of-the-art results on WFLW, 300W, COFW, and AFLW datasets. Our proposed structure is general and could be assembled into any face alignment frameworks. The code is made publicly available at this https URL.

...read moreread less

38 citations

Posted Content•

Weakly-Supervised Discovery of Geometry-Aware Representation for 3D Human Pose Estimation.

[...]

Xipeng Chen¹, Kwan-Yee Lin², Wentao Liu, Chen Qian, Xiaogang Wang³, Liang Lin - Show less +2 more•Institutions (3)

Sun Yat-sen University¹, Peking University², The Chinese University of Hong Kong³

21 Mar 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: A geometry-aware 3D representation for the human pose is proposed to address this limitation by using multiple views in a simple auto-encoder model at the training stage and only 2D keypoint information as supervision, and injecting the representation as a robust 3D prior.

...read moreread less

Abstract: Recent studies have shown remarkable advances in 3D human pose estimation from monocular images, with the help of large-scale in-door 3D datasets and sophisticated network architectures. However, the generalizability to different environments remains an elusive goal. In this work, we propose a geometry-aware 3D representation for the human pose to address this limitation by using multiple views in a simple auto-encoder model at the training stage and only 2D keypoint information as supervision. A view synthesis framework is proposed to learn the shared 3D representation between viewpoints with synthesizing the human pose from one viewpoint to the other one. Instead of performing a direct transfer in the raw image-level, we propose a skeleton-based encoder-decoder mechanism to distil only pose-related representation in the latent space. A learning-based representation consistency constraint is further introduced to facilitate the robustness of latent 3D representation. Since the learnt representation encodes 3D geometry information, mapping it to 3D pose will be much easier than conventional frameworks that use an image or 2D coordinates as the input of 3D pose estimator. We demonstrate our approach on the task of 3D human pose estimation. Comprehensive experiments on three popular benchmarks show that our model can significantly improve the performance of state-of-the-art methods with simply injecting the representation as a robust 3D prior.

...read moreread less

33 citations

Posted Content•

TransGaGa: Geometry-Aware Unsupervised Image-to-Image Translation

[...]

Wayne Wu¹, Kaidi Cao², Cheng Li¹, Chen Qian, Chen Change Loy³ - Show less +1 more•Institutions (3)

SenseTime¹, Stanford University², Nanyang Technological University³

21 Apr 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: A novel disentangle-and-translate framework to tackle the complex objects image-to-image translation task, which disentangles image space into a Cartesian product of the appearance and the geometry latent spaces and supports multimodal translation.

...read moreread less

Abstract: Unsupervised image-to-image translation aims at learning a mapping between two visual domains. However, learning a translation across large geometry variations always ends up with failure. In this work, we present a novel disentangle-and-translate framework to tackle the complex objects image-to-image translation task. Instead of learning the mapping on the image space directly, we disentangle image space into a Cartesian product of the appearance and the geometry latent spaces. Specifically, we first introduce a geometry prior loss and a conditional VAE loss to encourage the network to learn independent but complementary representations. The translation is then built on appearance and geometry space separately. Extensive experiments demonstrate the superior performance of our method to other state-of-the-art approaches, especially in the challenging near-rigid and non-rigid objects translation tasks. In addition, by taking different exemplars as the appearance references, our method also supports multimodal translation. Project page: this https URL

...read moreread less

Posted Content•

3D Human Pose Machines with Self-supervised Learning

[...]

Keze Wang¹, Liang Lin¹, Chenhan Jiang¹, Chen Qian², Pengxu Wei¹ - Show less +1 more•Institutions (2)

Sun Yat-sen University¹, SenseTime²

12 Jan 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: A simple yet effective self-supervised correction mechanism to learn all intrinsic structures of human poses from abundant images and develops a 3D human pose machine, which jointly integrates the 2D spatial relationship, temporal smoothness of predictions and 3D geometric knowledge.

...read moreread less

Abstract: Driven by recent computer vision and robotic applications, recovering 3D human poses has become increasingly important and attracted growing interests. In fact, completing this task is quite challenging due to the diverse appearances, viewpoints, occlusions and inherently geometric ambiguities inside monocular images. Most of the existing methods focus on designing some elaborate priors /constraints to directly regress 3D human poses based on the corresponding 2D human pose-aware features or 2D pose predictions. However, due to the insufficient 3D pose data for training and the domain gap between 2D space and 3D space, these methods have limited scalabilities for all practical scenarios (e.g., outdoor scene). Attempt to address this issue, this paper proposes a simple yet effective self-supervised correction mechanism to learn all intrinsic structures of human poses from abundant images. Specifically, the proposed mechanism involves two dual learning tasks, i.e., the 2D-to-3D pose transformation and 3D-to-2D pose projection, to serve as a bridge between 3D and 2D human poses in a type of "free" self-supervision for accurate 3D human pose estimation. The 2D-to-3D pose implies to sequentially regress intermediate 3D poses by transforming the pose representation from the 2D domain to the 3D domain under the sequence-dependent temporal context, while the 3D-to-2D pose projection contributes to refining the intermediate 3D poses by maintaining geometric consistency between the 2D projections of 3D poses and the estimated 2D poses. We further apply our self-supervised correction mechanism to develop a 3D human pose machine, which jointly integrates the 2D spatial relationship, temporal smoothness of predictions and 3D geometric knowledge. Extensive evaluations demonstrate the superior performance and efficiency of our framework over all the compared competing methods.

...read moreread less

Posted Content•

PPDM: Parallel Point Detection and Matching for Real-time Human-Object Interaction Detection

[...]

Yue Liao¹, Si Liu¹, Fei Wang², Chen Yanjie², Chen Qian², Jiashi Feng³ - Show less +2 more•Institutions (3)

Beihang University¹, SenseTime², National University of Singapore³

30 Dec 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: A single-stage Human-Object Interaction (HOI) detection method that has outperformed all existing methods on HICO-DET dataset at 37 fps on a single Titan XP GPU is proposed, and is the first real-time HOI detection method.

...read moreread less

Abstract: We propose a single-stage Human-Object Interaction (HOI) detection method that has outperformed all existing methods on HICO-DET dataset at 37 fps on a single Titan XP GPU. It is the first real-time HOI detection method. Conventional HOI detection methods are composed of two stages, i.e., human-object proposals generation, and proposals classification. Their effectiveness and efficiency are limited by the sequential and separate architecture. In this paper, we propose a Parallel Point Detection and Matching (PPDM) HOI detection framework. In PPDM, an HOI is defined as a point triplet . Human and object points are the center of the detection boxes, and the interaction point is the midpoint of the human and object points. PPDM contains two parallel branches, namely point detection branch and point matching branch. The point detection branch predicts three points. Simultaneously, the point matching branch predicts two displacements from the interaction point to its corresponding human and object points. The human point and the object point originated from the same interaction point are considered as matched pairs. In our novel parallel architecture, the interaction points implicitly provide context and regularization for human and object detection. The isolated detection boxes are unlikely to form meaning HOI triplets are suppressed, which increases the precision of HOI detection. Moreover, the matching between human and object detection boxes is only applied around limited numbers of filtered candidate interaction points, which saves much computational cost. Additionally, we build a new application-oriented database named HOI-A, which severs as a good supplement to the existing datasets. The source code and the dataset will be made publicly available to facilitate the development of HOI detection.

...read moreread less

Proceedings Article•DOI•

Semi-Supervised Monocular 3D Face Reconstruction With End-to-End Shape-Preserved Domain Transfer

[...]

Jingtan Piao¹, Chen Qian, Hongsheng Li¹•Institutions (1)

The Chinese University of Hong Kong¹

01 Oct 2019

TL;DR: A semi-supervised monocular reconstruction method, which jointly optimizes a shape-preserved domain-transfer CycleGAN and a shape estimation network to jointly solve the challenging face reconstruction problem.

...read moreread less

Abstract: Monocular face reconstruction is a challenging task in computer vision, which aims to recover 3D face geometry from a single RGB face image. Recently, deep learning based methods have achieved great improvements on monocular face reconstruction. However, for deep learning-based methods to reach optimal performance, it is paramount to have large-scale training images with ground-truth 3D face geometry, which is generally difficult for human to annotate. To tackle this problem, we propose a semi-supervised monocular reconstruction method, which jointly optimizes a shape-preserved domain-transfer CycleGAN and a shape estimation network. The framework is semi-supervised trained with 3D rendered images with ground-truth shapes and in-the-wild face images without any extra annotation. The CycleGAN network transforms all realistic images to have the rendered style and is end-to-end trained within the overall framework. This is the key difference compared with existing CycleGAN-based learning methods, which just used CycleGAN as a separate training sample generator. Novel landmark consistency loss and edge-aware shape estimation loss are proposed for our two networks to jointly solve the challenging face reconstruction problem. Extensive experiments on public face reconstruction datasets demonstrate the effectiveness of our overall method as well as the individual components.

...read moreread less

Posted Content•

Deep Comprehensive Correlation Mining for Image Clustering

[...]

Jianlong Wu¹, Keyu Long², Fei Wang, Chen Qian, Cheng Li³, Zhouchen Lin¹, Hongbin Zha¹ - Show less +3 more•Institutions (3)

Peking University¹, Chinese Academy of Sciences², SenseTime³

15 Apr 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: Wang et al. as discussed by the authors proposed a novel clustering framework, named deep comprehensive correlation mining (DCCM), for exploring and taking full advantage of various kinds of correlations behind the unlabeled data from three aspects: 1) Instead of only using pair-wise information, pseudo-label supervision is proposed to investigate category information and learn discriminative features.

...read moreread less

Abstract: Recent developed deep unsupervised methods allow us to jointly learn representation and cluster unlabelled data. These deep clustering methods mainly focus on the correlation among samples, e.g., selecting high precision pairs to gradually tune the feature representation, which neglects other useful correlations. In this paper, we propose a novel clustering framework, named deep comprehensive correlation mining(DCCM), for exploring and taking full advantage of various kinds of correlations behind the unlabeled data from three aspects: 1) Instead of only using pair-wise information, pseudo-label supervision is proposed to investigate category information and learn discriminative features. 2) The features' robustness to image transformation of input space is fully explored, which benefits the network learning and significantly improves the performance. 3) The triplet mutual information among features is presented for clustering problem to lift the recently discovered instance-level deep mutual information to a triplet-level formation, which further helps to learn more discriminative features. Extensive experiments on several challenging datasets show that our method achieves good performance, e.g., attaining $62.3\%$ clustering accuracy on CIFAR-10, which is $10.1\%$ higher than the state-of-the-art results.

...read moreread less

Proceedings Article•DOI•

TRB: A Novel Triplet Representation for Understanding 2D Human Body

[...]

Haodong Duan¹, Kwan-Yee Lin¹, Sheng Jin², Wentao Liu, Chen Qian, Wanli Ouyang³ - Show less +2 more•Institutions (3)

Peking University¹, Tsinghua University², University of Sydney³

25 Oct 2019

TL;DR: This paper proposes the Triplet Representation for Body --- a compact 2D human body representation, with skeleton keypoints capturing human pose information and contour keypoints containing human shape information, and proposes a two-branch network (TRB-net) with three novel techniques, namely X-structure, Directional Convolution and Pairwise mapping.

...read moreread less

Abstract: Human pose and shape are two important components of 2D human body. However, how to efficiently represent both of them in images is still an open question. In this paper, we propose the Triplet Representation for Body (TRB) --- a compact 2D human body representation, with skeleton keypoints capturing human pose information and contour keypoints containing human shape information. TRB not only preserves the flexibility of skeleton keypoint representation, but also contains rich pose and human shape information. Therefore, it promises broader application areas, such as human shape editing and conditional image generation. We further introduce the challenging problem of TRB estimation, where joint learning of human pose and shape is required. We construct several large-scale TRB estimation datasets, based on the popular 2D pose datasets LSP, MPII and COCO. To effectively solve TRB estimation, we propose a two-branch network (TRB-net) with three novel techniques, namely X-structure (Xs), Directional Convolution (DC) and Pairwise mapping (PM), to enforce multi-level message passing for joint feature learning. We evaluate our proposed TRB-net and several leading approaches on our proposed TRB datasets, and demonstrate the superiority of our method through extensive evaluations.

...read moreread less

Posted Content•

FAB: A Robust Facial Landmark Detection Framework for Motion-Blurred Videos

[...]

Keqiang Sun¹, Wayne Wu¹, Liu Tinghao¹, Shuo Yang¹, Quan Wang, Qiang Zhou¹, Zuochang Ye¹, Chen Qian² - Show less +4 more•Institutions (2)

Tsinghua University¹, Amazon.com²

26 Oct 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: A framework named FAB is proposed that takes advantage of structure consistency in the temporal dimension for facial landmark detection in motion-blurred videos, and a structure predictor is proposed to predict the missing face structural information temporally, which serves as a geometry prior.

...read moreread less

Abstract: Recently, facial landmark detection algorithms have achieved remarkable performance on static images. However, these algorithms are neither accurate nor stable in motion-blurred videos. The missing of structure information makes it difficult for state-of-the-art facial landmark detection algorithms to yield good results. In this paper, we propose a framework named FAB that takes advantage of structure consistency in the temporal dimension for facial landmark detection in motion-blurred videos. A structure predictor is proposed to predict the missing face structural information temporally, which serves as a geometry prior. This allows our framework to work as a virtuous circle. On one hand, the geometry prior helps our structure-aware deblurring network generates high quality deblurred images which lead to better landmark detection results. On the other hand, better landmark detection results help structure predictor generate better geometry prior for the next frame. Moreover, it is a flexible video-based framework that can incorporate any static image-based methods to provide a performance boost on video datasets. Extensive experiments on Blurred-300VW, the proposed Real-world Motion Blur (RWMB) datasets and 300VW demonstrate the superior performance to the state-of-the-art methods. Datasets and models will be publicly available at this https URL.

...read moreread less

Posted Content•

Make a Face: Towards Arbitrary High Fidelity Face Manipulation

[...]

Shengju Qian¹, Kwan-Yee Lin², Wayne Wu³, Yangxiaokang Liu⁴, Quan Wang, Fumin Shen⁴, Chen Qian, Ran He⁵ - Show less +4 more•Institutions (5)

The Chinese University of Hong Kong¹, Peking University², Tsinghua University³, University of Electronic Science and Technology of China⁴, Chinese Academy of Sciences⁵

20 Aug 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: Wang et al. as mentioned in this paper proposed an additive Gaussian mixture assumption with an unsupervised clustering mechanism in the structural latent space, which endows better disentanglement and boosts multi-modal representation with external memory, and two simple strategies in architecture design are further tailored and discussed on the behavior of human visual system (HVS) for the first time, allowing for fine control over the model complexity and sample quality.

...read moreread less

Abstract: Recent studies have shown remarkable success in face manipulation task with the advance of GANs and VAEs paradigms, but the outputs are sometimes limited to low-resolution and lack of diversity. In this work, we propose Additive Focal Variational Auto-encoder (AF-VAE), a novel approach that can arbitrarily manipulate high-resolution face images using a simple yet effective model and only weak supervision of reconstruction and KL divergence losses. First, a novel additive Gaussian Mixture assumption is introduced with an unsupervised clustering mechanism in the structural latent space, which endows better disentanglement and boosts multi-modal representation with external memory. Second, to improve the perceptual quality of synthesized results, two simple strategies in architecture design are further tailored and discussed on the behavior of Human Visual System (HVS) for the first time, allowing for fine control over the model complexity and sample quality. Human opinion studies and new state-of-the-art Inception Score (IS) / Frechet Inception Distance (FID) demonstrate the superiority of our approach over existing algorithms, advancing both the fidelity and extremity of face manipulation task.

...read moreread less

Proceedings Article•

Disentangling Content and Style via Unsupervised Geometry Distillation

[...]

Wayne Wu¹, Kaidi Cao², Cheng Li³, Chen Qian⁴, Chen Change Loy⁵ - Show less +1 more•Institutions (5)

SenseTime¹, Stanford University², University of Illinois at Urbana–Champaign³, Amazon.com⁴, Nanyang Technological University⁵

11 May 2019

TL;DR: This paper presents a novel framework to learn this disentangled representation of object's content-style representation in a completely unsupervised manner and demonstrates the superior disentanglement and visual analogy quality both in synthesized and real-world data.

...read moreread less

Abstract: It is challenging to disentangle an object into two orthogonal spaces of content and style since each can influence the visual observation differently and unpredictably. It is rare for one to have access to a large number of data to help separate the influences. In this paper, we present a novel framework to learn this disentangled representation in a completely unsupervised manner. We address this problem in a two-branch Autoencoder framework. For the structural content branch, we project the latent factor into a soft structured point tensor and constrain it with losses derived from prior knowledge. This constraint encourages the branch to distill geometry information. Another branch learns the complementary style information. The two branches form an effective framework that can disentangle object's content-style representation without any human annotation. We evaluate our approach on four image datasets, on which we demonstrate the superior disentanglement and visual analogy quality both in synthesized and real-world data. We are able to generate photo-realistic images with 256*256 resolution that are clearly disentangled in content and style.

...read moreread less

Proceedings Article•DOI•

FAB: A Robust Facial Landmark Detection Framework for Motion-Blurred Videos

[...]

Keqiang Sun¹, Wayne Wu¹, Liu Tinghao, Shuo Yang², Quan Wang, Qiang Zhou¹, Zuochang Ye¹, Chen Qian - Show less +4 more•Institutions (2)

Tsinghua University¹, Amazon.com²

01 Oct 2019

TL;DR: Sun et al. as mentioned in this paper proposed a structure predictor to predict the missing face structural information temporally, which serves as a geometry prior for facial landmark detection in motion-blurred videos.

...read moreread less

Abstract: Recently, facial landmark detection algorithms have achieved remarkable performance on static images. However, these algorithms are neither accurate nor stable in motion-blurred videos. The missing of structure information makes it difficult for state-of-the-art facial landmark detection algorithms to yield good results. In this paper, we propose a framework named FAB that takes advantage of structure consistency in the temporal dimension for facial landmark detection in motion-blurred videos. A structure predictor is proposed to predict the missing face structural information temporally, which serves as a geometry prior. This allows our framework to work as a virtuous circle. On one hand, the geometry prior helps our structure-aware deblurring network generates high quality deblurred images which lead to better landmark detection results. On the other hand, better landmark detection results help structure predictor generate better geometry prior for the next frame. Moreover, it is a flexible video-based framework that can incorporate any static image-based methods to provide a performance boost on video datasets. Extensive experiments on Blurred-300VW, the proposed Real-world Motion Blur (RWMB) datasets and 300VW demonstrate the superior performance to the state-of-the-art methods. Datasets and model will be publicly available at \href{https://github.com/KeqiangSun/FAB}{https://github.com/KeqiangSun/FAB}.

...read moreread less

Journal Article•DOI•

Turbo Learning Framework for Human-Object Interactions Recognition and Human Pose Estimation

[...]

Wei Feng, Wentao Liu, Li Tong, Jing Peng, Chen Qian, Xiaolin Hu¹ - Show less +2 more•Institutions (1)

Tsinghua University¹

17 Jul 2019

TL;DR: A turbo learning framework to perform HOI recognition and pose estimation simultaneously and achieves the state-of-the-art performance on two public benchmarks including Verbs in COCO (V-COCO) and HICO-DET datasets.

...read moreread less

Abstract: Human-object interactions (HOI) recognition and pose estimation are two closely related tasks. Human pose is an essential cue for recognizing actions and localizing the interacted objects. Meanwhile, human action and their interacted objects’ localizations provide guidance for pose estimation. In this paper, we propose a turbo learning framework to perform HOI recognition and pose estimation simultaneously. First, two modules are designed to enforce message passing between the tasks, i.e. pose aware HOI recognition module and HOI guided pose estimation module. Then, these two modules form a closed loop to utilize the complementary information iteratively, which can be trained in an end-to-end manner. The proposed method achieves the state-of-the-art performance on two public benchmarks including Verbs in COCO (V-COCO) and HICO-DET datasets.

...read moreread less

Patent•

Gesture identification, control, and neural network training methods and apparatuses, and electronic devices

[...]

Wang Quan¹, Wentao Liu, Chen Qian•Institutions (1)

SenseTime¹

29 Sep 2019

TL;DR: In this paper, a neural network is used to detect the potential hand regions, a potential gesture category and a potential hand category probability in an image, including a gesture-free category and at least one gesture category.

...read moreread less

Abstract: A gesture identification method includes: performing gesture information detection on an image by means of a neural network, to obtain a potential hand region, a potential gesture category and a potential gesture category probability in the image, the potential gesture category including a gesture-free category and at least one gesture category; and if the obtained potential gesture category with the maximum probability is the gesture-free category, not outputting position information of the potential hand region of the image; or otherwise, outputting the position information of the potential hand region of the image and the obtained potential gesture category with the maximum probability.

...read moreread less

Posted Content•

Multi-person Articulated Tracking with Spatial and Temporal Embeddings

[...]

Sheng Jin¹, Wentao Liu, Wanli Ouyang², Chen Qian•Institutions (2)

Tsinghua University¹, University of Sydney²

21 Mar 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: Zhang et al. as discussed by the authors proposed a unified framework for multi-person pose estimation and tracking, which consists of two main components, SpatialNet and TemporalNet, for body part detection and part-level data association in a single frame.

...read moreread less

Abstract: We propose a unified framework for multi-person pose estimation and tracking. Our framework consists of two main components,~\ie~SpatialNet and TemporalNet. The SpatialNet accomplishes body part detection and part-level data association in a single frame, while the TemporalNet groups human instances in consecutive frames into trajectories. Specifically, besides body part detection heatmaps, SpatialNet also predicts the Keypoint Embedding (KE) and Spatial Instance Embedding (SIE) for body part association. We model the grouping procedure into a differentiable Pose-Guided Grouping (PGG) module to make the whole part detection and grouping pipeline fully end-to-end trainable. TemporalNet extends spatial grouping of keypoints to temporal grouping of human instances. Given human proposals from two consecutive frames, TemporalNet exploits both appearance features encoded in Human Embedding (HE) and temporally consistent geometric features embodied in Temporal Instance Embedding (TIE) for robust tracking. Extensive experiments demonstrate the effectiveness of our proposed model. Remarkably, we demonstrate substantial improvements over the state-of-the-art pose tracking method from 65.4\% to 71.8\% Multi-Object Tracking Accuracy (MOTA) on the ICCV'17 PoseTrack Dataset.

...read moreread less

Patent•

Image processing method and device, detection equipment and storage medium

[...]

Sheng Jin, Liu Wentao, Chen Qian

25 Jun 2019

TL;DR: In this article, an image processing method for determining a target area of a target in an image, extracting a first type of features from the target area, including image features of the target, and obtaining a second type of feature according to the distribution of the same target in the front and back two frames of images.

...read moreread less

Abstract: The embodiment of the invention discloses an image processing method and device, a detection device and a storage medium. The image processing method comprises the steps of determining a target area of a target in an image; extracting a first type of features from the target area, the first type of features comprising image features of the target; obtaining a second type of features according to the distribution of the same target in the front and back two frames of images; and carrying out target tracking according to the first type of features and the second type of features.

...read moreread less

Patent•

Deep learning model training method and device, training equipment and storage medium

[...]

Sheng Jin, Liu Wentao, Chen Qian

21 Jun 2019

TL;DR: In this paper, a deep learning model training method consisting of three steps is described: training a model by using a training image to obtain training characteristics output by the model; carrying out conversion processing on the training features by using an auxiliary training module to obtain conversion features; determining a loss value based on the conversion feature; and determining whether to continue to train the model based on loss value.

...read moreread less

Abstract: The embodiment of the invention discloses a deep learning model training method and device, training equipment and a storage medium. The deep learning model training method comprises the following steps: training a deep learning model by using a training image to obtain training characteristics output by the deep learning model; carrying out conversion processing on the training features by usingan auxiliary training module to obtain conversion features; Determining a loss value based on the conversion feature; And determining whether to continue to train the deep learning model based on theloss value.

...read moreread less

Posted Content•

Turbo Learning Framework for Human-Object Interactions Recognition and Human Pose Estimation

[...]

Wei Feng, Wentao Liu, Li Tong, Jing Peng, Chen Qian, Xiaolin Hu¹ - Show less +2 more•Institutions (1)

Tsinghua University¹

15 Mar 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: Zhang et al. as mentioned in this paper proposed a turbo learning framework to perform HOI recognition and pose estimation simultaneously, which achieves state-of-the-art performance on two public benchmarks including Verbs in COCO (V-COCO) and HICO-DET datasets.

...read moreread less

Abstract: Human-object interactions (HOI) recognition and pose estimation are two closely related tasks. Human pose is an essential cue for recognizing actions and localizing the interacted objects. Meanwhile, human action and their interacted objects' localizations provide guidance for pose estimation. In this paper, we propose a turbo learning framework to perform HOI recognition and pose estimation simultaneously. First, two modules are designed to enforce message passing between the tasks, i.e. pose aware HOI recognition module and HOI guided pose estimation module. Then, these two modules form a closed loop to utilize the complementary information iteratively, which can be trained in an end-to-end manner. The proposed method achieves the state-of-the-art performance on two public benchmarks including Verbs in COCO (V-COCO) and HICO-DET datasets.

...read moreread less

Posted Content•

TRB: A Novel Triplet Representation for Understanding 2D Human Body.

[...]

Haodong Duan¹, Kwan-Yee Lin¹, Sheng Jin², Wentao Liu, Chen Qian, Wanli Ouyang³ - Show less +2 more•Institutions (3)

Peking University¹, Tsinghua University², University of Sydney³

25 Oct 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: Triplet Representation for Body (TRB) as mentioned in this paper is a compact 2D human body representation, with skeleton keypoints capturing human pose information and contour keypoints containing human shape information.

...read moreread less

Abstract: Human pose and shape are two important components of 2D human body. However, how to efficiently represent both of them in images is still an open question. In this paper, we propose the Triplet Representation for Body (TRB) -- a compact 2D human body representation, with skeleton keypoints capturing human pose information and contour keypoints containing human shape information. TRB not only preserves the flexibility of skeleton keypoint representation, but also contains rich pose and human shape information. Therefore, it promises broader application areas, such as human shape editing and conditional image generation. We further introduce the challenging problem of TRB estimation, where joint learning of human pose and shape is required. We construct several large-scale TRB estimation datasets, based on popular 2D pose datasets: LSP, MPII, COCO. To effectively solve TRB estimation, we propose a two-branch network (TRB-net) with three novel techniques, namely X-structure (Xs), Directional Convolution (DC) and Pairwise Mapping (PM), to enforce multi-level message passing for joint feature learning. We evaluate our proposed TRB-net and several leading approaches on our proposed TRB datasets, and demonstrate the superiority of our method through extensive evaluations.

...read moreread less

Patent•

Method and device for creating face model and electronic device

[...]

Xu Shengwei, Quan Wang, Piao Jingtan, Chen Qian

09 Aug 2019

TL;DR: In this article, the authors proposed a method and device for creating a face model and an electronic device, wherein the method comprises the steps of carrying out the key point detection on a current face image, and obtaining the key points features of the current face images; obtaining a target skeleton parameter matched with the face image from a preset reference model database according to the key feature; and creating a virtual 3D face model corresponding to the current image according to target skeleton parameters and a preset standard three-dimensional face model.

...read moreread less

Abstract: The invention provides a method and device for creating a face model and an electronic device, wherein the method comprises the steps of carrying out the key point detection on a current face image, and obtaining the key point features of the current face image; obtaining a target skeleton parameter matched with the current face image from a preset reference model database according to the key point feature; and creating a virtual three-dimensional face model corresponding to the current face image according to the target skeleton parameters and a preset standard three-dimensional face model.By adopting the method for creating the face model provided by the invention, the virtual three-dimensional face model matched with the current face image can be efficiently and accurately created.

...read moreread less

Posted Content•

Disentangling Content and Style via Unsupervised Geometry Distillation

[...]

Wayne Wu¹, Kaidi Cao², Cheng Li³, Chen Qian⁴, Chen Change Loy⁵ - Show less +1 more•Institutions (5)

SenseTime¹, Stanford University², University of Illinois at Urbana–Champaign³, Amazon.com⁴, Nanyang Technological University⁵

11 May 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, a two-branch autoencoder is proposed to disentangle an object into two orthogonal spaces of content and style in a completely unsupervised manner.

...read moreread less

Abstract: It is challenging to disentangle an object into two orthogonal spaces of content and style since each can influence the visual observation differently and unpredictably. It is rare for one to have access to a large number of data to help separate the influences. In this paper, we present a novel framework to learn this disentangled representation in a completely unsupervised manner. We address this problem in a two-branch Autoencoder framework. For the structural content branch, we project the latent factor into a soft structured point tensor and constrain it with losses derived from prior knowledge. This constraint encourages the branch to distill geometry information. Another branch learns the complementary style information. The two branches form an effective framework that can disentangle object's content-style representation without any human annotation. We evaluate our approach on four image datasets, on which we demonstrate the superior disentanglement and visual analogy quality both in synthesized and real-world data. We are able to generate photo-realistic images with 256*256 resolution that are clearly disentangled in content and style.

...read moreread less

Patent•

Pose detection method and device, electronic device and storage medium

[...]

Min Wang¹, Wentao Liu, Chen Qian•Institutions (1)

SenseTime¹

18 Jan 2019

TL;DR: In this paper, an orientation detection method and device, electronic device and storage medium, the method comprising: determining first position information of at least one first feature part of a target object in a target image (S100), determining three-dimensional position information (S200), and determining spatial orientation of the target object on the basis of the first position and device parameters of a camera device (S300).

...read moreread less

Abstract: An orientation detection method and device, electronic device and storage medium, the method comprising: determining first position information of at least one first feature part of a target object in a target image (S100); determining three-dimensional position information of a second feature part of the target object on the basis of the first position information and device parameters of a camera device (S200); and determining spatial orientation of the target object on the basis of the first position information of the at least one first feature part comprised in the second feature part and the three-dimensional position information of the second feature part (S300). The described method may increase the accuracy of orientation detection.

...read moreread less

Showing papers by "Chen Qian published in 2019"