
Showing papers by "Hong Liu" published in 2021


Proceedings Article
18 May 2021
TL;DR: This paper proposes a multi-scale spatial temporal graph convolutional network (MST-GCN), which stacks multiple blocks to learn effective motion representations for skeleton-based action recognition.
Abstract: Graph convolutional networks have been widely used for skeleton-based action recognition due to their excellent ability to model non-Euclidean data. As the graph convolution is a local operation, it can only utilize short-range joint dependencies and short-term trajectories, but fails to directly model the distant joint relations and long-range temporal information that are vital to distinguishing various actions. To solve this problem, we present a multi-scale spatial graph convolution (MS-GC) module and a multi-scale temporal graph convolution (MT-GC) module to enrich the receptive field of the model in the spatial and temporal dimensions. Concretely, the MS-GC and MT-GC modules decompose the corresponding local graph convolution into a set of sub-graph convolutions, forming a hierarchical residual architecture. Without introducing additional parameters, the features are processed by a series of sub-graph convolutions, and each node can complete multiple spatial and temporal aggregations with its neighborhoods. The final equivalent receptive field is accordingly enlarged, which is capable of capturing both short- and long-range dependencies in the spatial and temporal domains. By coupling these two modules as a basic block, we further propose a multi-scale spatial temporal graph convolutional network (MST-GCN), which stacks multiple blocks to learn effective motion representations for action recognition. The proposed MST-GCN achieves remarkable performance on three challenging benchmark datasets, NTU RGB+D, NTU-120 RGB+D and Kinetics-Skeleton, for skeleton-based action recognition.
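
To make the hierarchical residual decomposition concrete, below is a minimal PyTorch sketch of a multi-scale spatial graph convolution in the spirit of the MS-GC module; the class name, channel splitting, and per-group 1x1 convolutions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MSGraphConv(nn.Module):
    """Sketch of a multi-scale spatial graph convolution (MS-GC-like):
    channels are split into groups, each group is convolved over the graph and
    added to the next group's input, so later groups see a larger receptive
    field without extra parameters beyond the per-group convolutions."""

    def __init__(self, channels, adjacency, num_scales=4):
        super().__init__()
        assert channels % num_scales == 0
        self.num_scales = num_scales
        self.register_buffer("A", adjacency)          # (V, V) normalized adjacency
        width = channels // num_scales
        # one 1x1 transform per sub-group (the first group is passed through)
        self.convs = nn.ModuleList(
            nn.Conv2d(width, width, kernel_size=1) for _ in range(num_scales - 1)
        )

    def forward(self, x):                              # x: (N, C, T, V)
        splits = torch.chunk(x, self.num_scales, dim=1)
        out, prev = [splits[0]], splits[0]
        for conv, s in zip(self.convs, splits[1:]):
            # hierarchical residual: each group reuses the previous group's output
            y = conv(s + prev)
            y = torch.einsum("nctv,vw->nctw", y, self.A)   # aggregate over joints
            out.append(y)
            prev = y
        return torch.cat(out, dim=1)

# toy usage: 2 samples, 64 channels, 16 frames, 25 joints
A = torch.eye(25)
layer = MSGraphConv(64, A, num_scales=4)
print(layer(torch.randn(2, 64, 16, 25)).shape)         # torch.Size([2, 64, 16, 25])
```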

55 citations


Journal ArticleDOI
TL;DR: This paper proposes Attention-Guided Generative Adversarial Networks (AttentionGAN) with an attention-guided discriminator, identifying the most discriminative foreground objects and minimizing changes to the background in unpaired image-to-image translation.
Abstract: State-of-the-art methods in image-to-image translation are capable of learning a mapping from a source domain to a target domain with unpaired image data. Though existing methods have achieved promising results, they still produce visual artifacts, being able to translate low-level information but not the high-level semantics of input images. One possible reason is that generators do not have the ability to perceive the most discriminative parts between the source and target domains, thus making the generated images low quality. In this article, we propose new Attention-Guided Generative Adversarial Networks (AttentionGAN) for the unpaired image-to-image translation task. AttentionGAN can identify the most discriminative foreground objects and minimize the change of the background. The attention-guided generators in AttentionGAN are able to produce attention masks, and then fuse the generation output with the attention masks to obtain high-quality target images. Accordingly, we also design a novel attention-guided discriminator which only considers attended regions. Extensive experiments are conducted on several generative tasks with eight public datasets, demonstrating that the proposed method generates sharper and more realistic images compared with existing competitive models. The code is available at https://github.com/Ha0Tang/AttentionGAN.
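
The fusion of the generator output with the attention masks can be illustrated with a short sketch; the simple foreground/background blend below is an assumed simplified form of that step, not the paper's exact fusion scheme.

```python
import torch

def attention_guided_fusion(x, content, fg_mask):
    """Sketch of attention-guided output fusion (assumed simplified form):
    the generated content replaces only the attended foreground, while the
    background is copied from the input image.
      x:        (N, 3, H, W) input image
      content:  (N, 3, H, W) raw generator output
      fg_mask:  (N, 1, H, W) foreground attention mask in [0, 1]
    """
    bg_mask = 1.0 - fg_mask
    return fg_mask * content + bg_mask * x

# toy usage
x = torch.rand(1, 3, 64, 64)
content = torch.rand(1, 3, 64, 64)
mask = torch.sigmoid(torch.randn(1, 1, 64, 64))
print(attention_guided_fusion(x, content, mask).shape)  # torch.Size([1, 3, 64, 64])
```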

32 citations


Journal ArticleDOI
TL;DR: This paper proposes a novel ranking loss function, named the Bi-directional Exponential Angular Triplet Loss, to help learn an angularly separable common feature space by explicitly constraining the included angles between embedding vectors.
Abstract: RGB-Infrared person re-identification (RGB-IR Re-ID) is a cross-modality matching problem in which the modality discrepancy is a big challenge. Most existing works use Euclidean-metric-based constraints to resolve the discrepancy between features of images from different modalities. However, these methods are incapable of learning angularly discriminative feature embeddings because Euclidean distance cannot effectively measure the included angle between embedding vectors. As an angularly discriminative feature space is important for classifying human images based on their embedding vectors, in this paper we propose a novel ranking loss function, named the Bi-directional Exponential Angular Triplet Loss, to help learn an angularly separable common feature space by explicitly constraining the included angles between embedding vectors. Moreover, to help stabilize and learn the magnitudes of embedding vectors, we adopt a common-space batch normalization layer. Quantitative and qualitative experiments on the SYSU-MM01 and RegDB datasets support our analysis. On the SYSU-MM01 dataset, the performance is improved from 7.40% / 11.46% to 38.57% / 38.61% for rank-1 accuracy / mAP compared with the baseline. The proposed method can also be generalized to single-modality Re-ID, improving the rank-1 accuracy / mAP from 92.0% / 81.7% to 94.7% / 86.6% on the Market-1501 dataset and from 82.6% / 70.6% to 87.6% / 77.1% on the DukeMTMC-reID dataset.
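
The core idea of constraining included angles rather than Euclidean distances can be sketched as an angle-based triplet loss; the version below is a simplified stand-in, not the paper's bi-directional exponential formulation.

```python
import torch
import torch.nn.functional as F

def angular_triplet_loss(anchor, positive, negative, margin=0.3):
    """Sketch of an angle-based triplet loss: instead of Euclidean distances,
    it constrains the included angles (via cosine similarity) so that the
    anchor-positive angle is smaller than the anchor-negative angle by a
    margin. anchor, positive, negative: (N, D) embedding vectors."""
    cos_ap = F.cosine_similarity(anchor, positive, dim=1)
    cos_an = F.cosine_similarity(anchor, negative, dim=1)
    # angles in radians; clamping avoids NaNs from acos at +/-1
    ang_ap = torch.acos(cos_ap.clamp(-1 + 1e-7, 1 - 1e-7))
    ang_an = torch.acos(cos_an.clamp(-1 + 1e-7, 1 - 1e-7))
    return F.relu(ang_ap - ang_an + margin).mean()

# toy usage
a, p, n = (torch.randn(8, 256) for _ in range(3))
print(angular_triplet_loss(a, p, n))
```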

29 citations


Proceedings ArticleDOI
10 Jan 2021
TL;DR: This paper proposes a multi-scale part-aware cascading framework (MSPAC) for RGB-IR Re-ID, which aggregates multi-scale fine-grained features from part to global in a cascading manner, resulting in a unified representation containing rich and enhanced semantic features.
Abstract: RGB-Infrared person re-identification (RGB-IR Re-ID) aims to match persons across heterogeneous images captured by visible and thermal cameras, which is of great significance for surveillance systems under poor lighting conditions. Facing great challenges from complex variances, including conventional single-modality discrepancies and the additional inter-modality discrepancy, most existing RGB-IR Re-ID methods impose constraints at the image level, the feature level, or a hybrid of both. Despite the better performance of hybrid constraints, they are usually implemented with heavy network architectures. In fact, previous efforts contribute more as pioneering works in the new cross-modal Re-ID area while leaving large room for improvement. This can be mainly attributed to: (a) the lack of abundant person image pairs from different modalities for training, and (b) the scarcity of salient modality-invariant features, especially in coarse representations, for effective matching. To address these issues, a novel Multi-Scale Part-Aware Cascading framework (MSPAC) is formulated by aggregating multi-scale fine-grained features from part to global in a cascading manner, which results in a unified representation containing rich and enhanced semantic features. Furthermore, a marginal exponential center (MeCen) loss is introduced to jointly eliminate mixed variances from intra- and inter-modal examples. Cross-modality correlations can thus be efficiently explored on salient features for distinctive modality-invariant feature learning. Extensive experiments demonstrate that the proposed method outperforms the state-of-the-art methods by a large margin.

8 citations


Proceedings ArticleDOI
10 Jan 2021
TL;DR: In this paper, an audio-visual early feature fusion (AV-EFF) stream is added to the baseline model to learn the fusion information of different stages, preserving the original features as much as possible and ensuring the independence of different features.
Abstract: Lip-reading methods and the fusion strategy are crucial for audio-visual speech recognition. In recent years, most approaches involve two separate audio and visual streams with early or late fusion strategies. Such a single-stage fusion method may fail to guarantee the integrity and representativeness of the fused information simultaneously. This paper extends a traditional single-stage fusion network to a two-step feature fusion network by adding an audio-visual early feature fusion (AV-EFF) stream to the baseline model. This method can learn the fusion information of different stages, preserving the original features as much as possible and ensuring the independence of the different features. In addition, to capture long-range dependencies in the video information, a non-local block is added to the feature extraction part of the visual stream (NL-Visual) to obtain long-term spatio-temporal features. Experimental results on the two largest public datasets in English (LRW) and Mandarin (LRW-1000) demonstrate that our method is superior to other state-of-the-art methods.
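
For reference, a standard embedded-Gaussian non-local block of the kind mentioned above can be sketched as follows; the channel sizes and exact placement inside the visual front-end are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Sketch of a standard embedded-Gaussian non-local block, the kind of
    module added to a visual stream to capture long-range spatio-temporal
    dependencies via self-attention over all positions."""

    def __init__(self, channels, reduction=2):
        super().__init__()
        inner = channels // reduction
        self.theta = nn.Conv3d(channels, inner, kernel_size=1)
        self.phi = nn.Conv3d(channels, inner, kernel_size=1)
        self.g = nn.Conv3d(channels, inner, kernel_size=1)
        self.out = nn.Conv3d(inner, channels, kernel_size=1)

    def forward(self, x):                        # x: (N, C, T, H, W)
        n, c, t, h, w = x.shape
        q = self.theta(x).flatten(2)             # (N, C', THW)
        k = self.phi(x).flatten(2)
        v = self.g(x).flatten(2)
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)   # (N, THW, THW)
        y = (v @ attn.transpose(1, 2)).view(n, -1, t, h, w)
        return x + self.out(y)                   # residual connection

# toy usage on a small spatio-temporal feature map
blk = NonLocalBlock(32)
print(blk(torch.randn(1, 32, 4, 7, 7)).shape)    # torch.Size([1, 32, 4, 7, 7])
```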

7 citations


Journal ArticleDOI
10 Apr 2021-Sensors
TL;DR: In this article, an optimization-based online initialization and spatial-temporal calibration method for monocular visual-inertial odometry (VIO) has been proposed, which does not need any prior knowledge about spatial and temporal configurations.
Abstract: Online system state initialization and simultaneous spatial-temporal calibration are critical for monocular Visual-Inertial Odometry (VIO) since these parameters are either not well provided or even unknown. Although impressive performance has been achieved, most existing methods are designed for filter-based VIOs. For optimization-based VIOs, there are few online spatial-temporal calibration methods in the literature. In this paper, we propose an optimization-based online initialization and spatial-temporal calibration method for VIO. The method does not need any prior knowledge about the spatial and temporal configurations. It estimates the initial states of metric scale, velocity, gravity, and Inertial Measurement Unit (IMU) biases, and calibrates the coordinate transformation and time offset between the camera and IMU sensors. The method works as follows. First, it uses a time offset model and two short-term motion interpolation algorithms to align and interpolate the camera and IMU measurement data. Then, the aligned and interpolated results are sent to an incremental estimator to estimate the initial states and the spatial-temporal parameters. After that, a bundle adjustment is additionally included to improve the accuracy of the estimated results. Experiments using both synthetic and public datasets are performed to examine the performance of the proposed method. The results show that both the initial states and the spatial-temporal parameters can be well estimated. The method outperforms other contemporary methods used for comparison.
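
The alignment step driven by the time offset model can be illustrated with a short sketch; the linear interpolation below is a simplified stand-in for the paper's short-term motion interpolation, and the function name and offset convention are assumptions.

```python
import numpy as np

def interpolate_imu_at_camera_times(imu_t, imu_meas, cam_t, time_offset):
    """Sketch of the alignment step: given a current estimate of the
    camera-IMU time offset t_d, shift the camera timestamps into the IMU
    clock and linearly interpolate the IMU measurements there.
      imu_t:       (M,)   IMU timestamps [s]
      imu_meas:    (M, 6) stacked gyro (3) + accel (3) samples
      cam_t:       (K,)   camera timestamps [s]
      time_offset: scalar t_d such that t_imu = t_cam + t_d (assumed convention)
    """
    query_t = np.asarray(cam_t) + time_offset
    aligned = np.stack(
        [np.interp(query_t, imu_t, imu_meas[:, i]) for i in range(imu_meas.shape[1])],
        axis=1,
    )
    return aligned                                # (K, 6) interpolated IMU data

# toy usage: 200 Hz IMU, 30 Hz camera, 5 ms offset
imu_t = np.arange(0.0, 1.0, 1.0 / 200.0)
imu_meas = np.random.randn(imu_t.size, 6)
cam_t = np.arange(0.0, 0.9, 1.0 / 30.0)
print(interpolate_imu_at_camera_times(imu_t, imu_meas, cam_t, 0.005).shape)  # (27, 6)
```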

6 citations


Proceedings ArticleDOI
10 Jan 2021
TL;DR: In this article, a hybrid fusion based audio-visual speech recognition (AVSR) method with residual networks and Bidirectional Gated Recurrent Unit (BGRU) is proposed.
Abstract: The fusion of audio and visual modalities is an important stage of audio-visual speech recognition (AVSR), which is generally approached through feature fusion or decision fusion. Feature fusion can effectively exploit the covariations between features from different modalities, whereas decision fusion is robust in capturing an optimal combination of the modalities. In this work, to take full advantage of the complementarity of the two fusion strategies and address the challenge of inherent ambiguity in noisy environments, we propose a novel hybrid-fusion-based AVSR method with residual networks and Bidirectional Gated Recurrent Units (BGRUs), which is able to distinguish homophones in both clean and noisy conditions. Specifically, a simple yet effective audio-visual encoder is used to map audio and visual features into a shared latent space to capture more discriminative multi-modal features and find the internal correlation between the spatial-temporal information of the different modalities. Furthermore, a decision fusion module is designed to obtain the final predictions in order to robustly utilize the reliability measures of the audio-visual information. Finally, we introduce a combined loss, which shows its noise robustness in learning the joint representation across the modalities. Experimental results on the largest publicly available dataset (LRW) demonstrate the robustness of the proposed method under various noisy conditions.
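
How a feature-fusion branch and a decision-fusion branch might be combined can be sketched as below; the module name, reliability weighting, and averaging of the two branches are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class HybridFusionHead(nn.Module):
    """Sketch of hybrid fusion (assumed simplified form): a feature-fusion
    branch classifies the concatenated audio-visual features, and a
    decision-fusion branch combines per-modality predictions with a learned
    reliability weight; the two branches are then averaged."""

    def __init__(self, audio_dim, visual_dim, num_classes):
        super().__init__()
        self.feat_cls = nn.Linear(audio_dim + visual_dim, num_classes)
        self.audio_cls = nn.Linear(audio_dim, num_classes)
        self.visual_cls = nn.Linear(visual_dim, num_classes)
        # scalar reliability weight for the audio stream (sigmoid-bounded)
        self.audio_weight = nn.Parameter(torch.zeros(1))

    def forward(self, a, v):                       # a: (N, Da), v: (N, Dv)
        feat_logits = self.feat_cls(torch.cat([a, v], dim=1))
        w = torch.sigmoid(self.audio_weight)
        dec_logits = w * self.audio_cls(a) + (1 - w) * self.visual_cls(v)
        return 0.5 * (feat_logits + dec_logits)

# toy usage
head = HybridFusionHead(audio_dim=256, visual_dim=512, num_classes=500)
print(head(torch.randn(4, 256), torch.randn(4, 512)).shape)  # torch.Size([4, 500])
```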

4 citations


Proceedings ArticleDOI
10 Jan 2021
TL;DR: In this article, a mutual feature alignment method for audio-visual speech recognition (AVSR) is proposed, which makes full use of cross-modality information to address the asynchrony issue by introducing a Mutual Iterative Attention (MIA) mechanism.
Abstract: The asynchrony between different modalities is one of the major problems in audio-visual speech recognition (AVSR) research. However, most AVSR systems merely rely on up-sampling of video or down-sampling of audio to align the audio and visual features, assuming that the feature sequences are aligned frame-by-frame. These pre-processing steps oversimplify the asynchrony relation between the acoustic signal and lip motion, lacking flexibility and impairing the performance of the system. Although there are systems that model the asynchrony between the modalities, they sometimes fail to align speech and video precisely under some, or even all, noisy conditions. In this paper, we propose a mutual feature alignment method for AVSR which can make full use of cross-modality information to address the asynchrony issue by introducing a Mutual Iterative Attention (MIA) mechanism. Our method can automatically learn an alignment in a mutual way by performing mutual attention iteratively between the audio and visual features, relying on the modified encoder structure of the Transformer. Experimental results show that our proposed method obtains absolute improvements of up to 20.42% over the audio modality alone, depending on the signal-to-noise-ratio (SNR) level. Better recognition performance is also achieved compared with the traditional feature concatenation method under both clean and noisy conditions. We expect that the proposed mutual feature alignment method can be easily generalized to other multimodal tasks with semantically correlated information.
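
The mutual iterative attention idea can be sketched as alternating cross-attention between the two streams; the layer sizes, iteration count, and residual/normalization pattern below are assumptions, not the paper's exact encoder.

```python
import torch
import torch.nn as nn

class MutualIterativeAttention(nn.Module):
    """Sketch of mutual iterative attention between audio and visual feature
    sequences: in each iteration, audio attends to video and video attends to
    audio, so the two streams are gradually aligned without any explicit
    frame-rate resampling."""

    def __init__(self, dim=256, num_heads=4, num_iters=2):
        super().__init__()
        self.num_iters = num_iters
        self.a2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio, video):               # (N, Ta, D), (N, Tv, D)
        for _ in range(self.num_iters):
            # audio queries attend over video keys/values, and vice versa
            a_upd, _ = self.a2v(audio, video, video)
            v_upd, _ = self.v2a(video, audio, audio)
            audio = self.norm_a(audio + a_upd)     # residual + norm
            video = self.norm_v(video + v_upd)
        return audio, video

# toy usage: 100 audio frames vs. 25 video frames
mia = MutualIterativeAttention()
a, v = mia(torch.randn(2, 100, 256), torch.randn(2, 25, 256))
print(a.shape, v.shape)   # torch.Size([2, 100, 256]) torch.Size([2, 25, 256])
```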

4 citations


Proceedings ArticleDOI
10 Jan 2021
TL;DR: In this article, a new base-derivative framework is proposed, where base refers to the original visible and infrared modalities, and derivative refers to two auxiliary modalities that are derived from base.
Abstract: Cross-modality RGB-infrared (RGB-IR) person re-identification (Re-ID) is a challenging research topic due to the heterogeneity of RGB and infrared images. In this paper, we aim to find auxiliary modalities, which are homologous with the visible or infrared modalities, to help reduce the modality discrepancy caused by heterogeneous images. Accordingly, a new base-derivative framework is proposed, where base refers to the original visible and infrared modalities, and derivative refers to the two auxiliary modalities that are derived from the base ones. In the proposed framework, the double-modality cross-modal learning problem is reformulated as a four-modality one. After that, the images of all the base and derivative modalities are fed into the feature learning network. With the doubled input images, the learned person features become more discriminative. Furthermore, the proposed framework is optimized by enhanced intra- and cross-modality constraints with the assistance of the two derivative modalities. Experimental results on two publicly available datasets, SYSU-MM01 and RegDB, show that the proposed method outperforms the other state-of-the-art methods. For instance, we achieve a gain of over 13% in terms of both Rank-1 and mAP on the RegDB dataset.

4 citations


Posted Content
26 Mar 2021
TL;DR: This paper proposes a strided Transformer encoder (STE) that lifts a sequence of 2D joint locations to a 3D pose; it not only significantly reduces the computation cost but also effectively aggregates information into a single-vector representation in a global and local fashion.
Abstract: Despite great progress in video-based 3D human pose estimation, it is still challenging to learn a discriminative single-pose representation from redundant sequences. To this end, we propose a novel Transformer-based architecture, called the Lifting Transformer, for 3D human pose estimation, which lifts a sequence of 2D joint locations to a 3D pose. Specifically, a vanilla Transformer encoder (VTE) is adopted to model long-range dependencies of 2D pose sequences. To reduce the redundancy of the sequence and aggregate information from the local context, the fully-connected layers in the feed-forward network of the VTE are replaced with strided convolutions to progressively reduce the sequence length. The modified VTE is termed the strided Transformer encoder (STE), and it is built upon the outputs of the VTE. The STE not only significantly reduces the computation cost but also effectively aggregates information into a single-vector representation in a global and local fashion. Moreover, a full-to-single supervision scheme is employed at both the full-sequence scale and the single-target-frame scale, applied to the outputs of the VTE and the STE, respectively. This scheme imposes extra temporal smoothness constraints in conjunction with the single-target-frame supervision. The proposed architecture is evaluated on two challenging benchmark datasets, namely Human3.6M and HumanEva-I, and achieves state-of-the-art results with far fewer parameters.
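
The idea of replacing the feed-forward layers with strided convolutions can be sketched as a single encoder layer that halves the sequence length; the dimensions, stride, and pooled residual path are assumptions, not the paper's exact layer.

```python
import torch
import torch.nn as nn

class StridedEncoderLayer(nn.Module):
    """Sketch of one strided Transformer encoder layer: standard multi-head
    self-attention followed by a feed-forward block in which the usual
    fully-connected layers are replaced by 1D convolutions, one of them
    strided, so the sequence length is halved."""

    def __init__(self, dim=256, num_heads=8, stride=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(                   # operates on (N, D, T)
            nn.Conv1d(dim, dim * 2, kernel_size=3, stride=stride, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(dim * 2, dim, kernel_size=1),
        )
        self.pool = nn.MaxPool1d(kernel_size=1, stride=stride)  # downsampled skip path

    def forward(self, x):                           # x: (N, T, D)
        x = self.norm1(x + self.attn(x, x, x)[0])
        y = self.ffn(x.transpose(1, 2)).transpose(1, 2)          # (N, ~T/stride, D)
        res = self.pool(x.transpose(1, 2)).transpose(1, 2)       # matching skip length
        return self.norm2(res + y)

# toy usage: an 81-frame 2D-pose sequence embedded to 256-D, halved per layer
layer = StridedEncoderLayer()
print(layer(torch.randn(2, 81, 256)).shape)         # torch.Size([2, 41, 256])
```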

4 citations


Proceedings ArticleDOI
06 Jun 2021
TL;DR: In this paper, a supervised direct-path relative transfer function (DP-RTF) learning method with deep neural networks is proposed for robust binaural sound source localization in the presence of noise and reverberation.
Abstract: The direct-path relative transfer function (DP-RTF) refers to the ratio between the direct-path acoustic transfer functions of two channels. Though the DP-RTF fully encodes the sound directional cues and serves as a reliable localization feature, it is often erroneously estimated in the presence of noise and reverberation. This paper proposes a supervised DP-RTF learning method with deep neural networks for robust binaural sound source localization. To exploit the complementarity of the single-channel spectrogram and the dual-channel difference information, we first recover the direct-path magnitude spectrogram from the contaminated one using a monaural enhancement network, and then predict the DP-RTF from the dual-channel (enhanced) intensity and phase cues using a binaural enhancement network. In addition, a weighted-matching softmax training loss is designed to promote the predicted DP-RTFs to be concentrated for the same direction and separated for different directions. Finally, the direction of arrival (DOA) of the source is estimated by matching the predicted DP-RTF with the ground truths of the candidate directions. Experimental results show the superiority of our method for DOA estimation in environments with various levels of noise and reverberation.
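
The final matching step can be illustrated with a short sketch that compares the predicted DP-RTF feature against per-direction templates; the cosine-similarity score used here is an assumption, not necessarily the paper's matching criterion.

```python
import numpy as np

def estimate_doa(predicted_dprtf, candidate_dprtfs, candidate_angles):
    """Sketch of the matching step: the DOA is taken as the candidate
    direction whose ground-truth DP-RTF template is closest to the predicted
    DP-RTF feature.
      predicted_dprtf:  (D,)   predicted DP-RTF feature vector
      candidate_dprtfs: (K, D) templates, one per candidate direction
      candidate_angles: (K,)   candidate DOAs in degrees
    """
    p = predicted_dprtf / (np.linalg.norm(predicted_dprtf) + 1e-8)
    c = candidate_dprtfs / (np.linalg.norm(candidate_dprtfs, axis=1, keepdims=True) + 1e-8)
    scores = c @ p                              # cosine similarity per candidate
    return candidate_angles[int(np.argmax(scores))]

# toy usage: 37 candidate directions from -90 to 90 degrees
angles = np.linspace(-90, 90, 37)
templates = np.random.randn(37, 128)
pred = templates[20] + 0.1 * np.random.randn(128)   # noisy version of candidate 20
print(estimate_doa(pred, templates, angles))         # ~ angles[20] = 10.0
```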

Journal ArticleDOI
TL;DR: This paper proposes a Position Constraint Loss (PCLoss) that constrains erroneous landmark locations by utilizing the positional relationships among landmarks; it can be easily applied to both regression-based and heatmap-based methods without extra computation during inference.

Proceedings ArticleDOI
10 Jan 2021
TL;DR: This paper proposes a novel particle filter based method for 3D audio-visual speaker tracking, in which the particle likelihood is calculated by fusing both visual distance and audio-visual direction information.
Abstract: 3D speaker tracking using co-located audio-visual sensors has received much attention recently. Though various methods have been applied to this field, it is still challenging to obtain a reliable 3D tracking result since the positions of the co-located sensors are restricted to a small area. In this paper, a novel particle filter (PF) based method is proposed for 3D audio-visual speaker tracking. Compared with traditional PF-based audio-visual speaker tracking methods, our 3D audio-visual tracker has two main characteristics. In the prediction stage, we use the audio-visual information at the current frame to further adjust the direction of the particles after the particle state transition process, which concentrates the particles more tightly around the speaker direction. In the update stage, the particle likelihood is calculated by fusing both visual distance and audio-visual direction information. Specifically, the distance likelihood is obtained according to the camera projection model and the adaptively estimated size of the speaker's face or head, and the direction likelihood is determined by the audio-visual particle fitness. In this way, the particle likelihood can better represent the speaker presence probability in 3D space. Experimental results show that the proposed tracker outperforms other methods and provides favorable speaker tracking performance both in 3D space and on the image plane.

Posted Content
TL;DR: In this article, a strided transformer encoder (STE) is proposed for 3D human pose estimation in videos to lift a sequence of 2D joint locations to a 3D pose.
Abstract: Despite great progress in 3D human pose estimation from videos, it is still an open problem to take full advantage of redundant 2D pose sequences to learn a representative representation for generating a single 3D pose. To this end, we propose an improved Transformer-based architecture, called the Strided Transformer, for 3D human pose estimation in videos, which lifts a sequence of 2D joint locations to a 3D pose. Specifically, a vanilla Transformer encoder (VTE) is adopted to model long-range dependencies of 2D pose sequences. To reduce the redundancy of the sequence and aggregate information from the local context, strided convolutions are incorporated into the VTE to progressively reduce the sequence length. The modified VTE is termed the strided Transformer encoder (STE), which is built upon the outputs of the VTE. The STE not only effectively aggregates long-range information into a single-vector representation in a hierarchical global and local fashion but also significantly reduces the computation cost. Furthermore, a full-to-single supervision scheme is designed at both the full-sequence scale and the single-target-frame scale, applied to the outputs of the VTE and the STE, respectively. This scheme imposes extra temporal smoothness constraints in conjunction with the single-target-frame supervision and improves the representation ability of the features for the target frame. The proposed architecture is evaluated on two challenging benchmark datasets, Human3.6M and HumanEva-I, and achieves state-of-the-art results with far fewer parameters.

Posted Content
Guoliang Hua, Wenhao Li, Qian Zhang, Runwei Ding, Hong Liu
TL;DR: In this article, a weakly-supervised cross-view 3D human pose estimation method is proposed, which uses a cross-view U-shaped graph convolutional network (CV-UGCN) to refine the coarse 3D poses.
Abstract: Although monocular 3D human pose estimation methods have made significant progress, the problem is far from solved due to the inherent depth ambiguity. Instead, exploiting multi-view information is a practical way to achieve absolute 3D human pose estimation. In this paper, we propose a simple yet effective pipeline for weakly-supervised cross-view 3D human pose estimation. Using only two camera views, our method can achieve state-of-the-art performance in a weakly-supervised manner, requiring no 3D ground truth but only 2D annotations. Specifically, our method contains two steps: triangulation and refinement. First, given the 2D keypoints that can be obtained through any classic 2D detection method, triangulation is performed across the two views to lift the 2D keypoints into coarse 3D poses. Then, a novel cross-view U-shaped graph convolutional network (CV-UGCN), which can explore spatial configurations and cross-view correlations, is designed to refine the coarse 3D poses. In particular, the refinement is achieved through weakly-supervised learning, in which geometric and structure-aware consistency checks are performed. We evaluate our method on the standard benchmark dataset, Human3.6M. The Mean Per Joint Position Error on the benchmark dataset is 27.4 mm, which outperforms the state of the art remarkably (27.4 mm vs. 30.2 mm).
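
The triangulation step that lifts matched 2D keypoints from two calibrated views into a coarse 3D point can be sketched with the standard direct linear transform (DLT); the CV-UGCN refinement stage is not reproduced here, and the toy cameras are illustrative.

```python
import numpy as np

def triangulate_point(P1, P2, uv1, uv2):
    """Sketch of the triangulation step: lift one matched 2D keypoint from two
    calibrated views to a coarse 3D point with the direct linear transform.
      P1, P2:   (3, 4) camera projection matrices
      uv1, uv2: (2,)   pixel coordinates of the keypoint in each view
    """
    A = np.stack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]                     # inhomogeneous 3D point

# toy usage: two cameras observing a known 3D point
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])               # camera at the origin
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])  # shifted along x
X_true = np.array([0.5, 0.2, 4.0, 1.0])
uv1 = (P1 @ X_true)[:2] / (P1 @ X_true)[2]
uv2 = (P2 @ X_true)[:2] / (P2 @ X_true)[2]
print(triangulate_point(P1, P2, uv1, uv2))                   # ~ [0.5, 0.2, 4.0]
```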

Journal ArticleDOI
Muyao Ye, Chang Wang, Ling Li, Qiulan Zhao, Youming Peng, Hong Liu
TL;DR: In this paper, the authors explored the association between α-hemolytic streptococcus (α-HS) infection and complement activation in human tonsillar mononuclear cells (TMCs) in IgAN.
Abstract: While β-hemolytic streptococcus (β-HS) infections are known to predispose patients to acute poststreptococcal glomerulonephritis, there is evidence that implicates α-hemolytic streptococcus (α-HS) in IgA nephropathy (IgAN). The alternative pathway of the complement system has also been implicated in IgAN. We aimed to explore the association between α-HS and complement activation in human tonsillar mononuclear cells (TMCs) in IgAN. In our study, α-HS induced higher IgA levels than IgG levels, while β-HS induced higher IgG levels than IgA levels with more activation-induced cytidine deaminase, in TMCs in the IgAN group. Aberrant IgA1 O-glycosylation levels were higher in IgAN patients with α-HS. C3 and C3b expression was decreased in IgAN patients, but in chronic tonsillitis control patients, the expression decreased only after stimulation with β-HS. Complement factor B and complement factor H (CFH) mRNA levels increased, but the CFH concentration in culture supernatants decreased, with α-HS. The percentage of CD19+CD35+ cells (complement receptor 1, CR1) decreased with α-HS more than with β-HS, while CD19+CD21+ cells (complement receptor 2, CR2) increased more with β-HS than with α-HS. The nephritis-associated plasmin receptor (NAPlr) component of α-HS was not detected in tonsillar or kidney tissues from IgAN patients but was positive on cultured TMCs and mesangial cells. We conclude that α-HS induced the secretion of aberrantly O-glycosylated IgA while decreasing the levels of the inhibitory factor CFH in culture supernatants and of CR1+ B cells. These findings provide testable mechanisms that relate α-HS infection to abnormal mucosal responses involving the alternative complement pathway in IgAN.

Proceedings ArticleDOI
Hong Liu, Lisi Guan
10 Jan 2021
TL;DR: This paper proposes a novel Dilation Pyramid Module (DPM) that enlarges the receptive field multiplicatively to extract high-level semantic information, as subsampling does, without reducing spatial resolution.
Abstract: Ensuring that features contain both high-resolution and high-level semantic information is important for human pose estimation, yet most existing methods suffer from spatial information loss or semantic information mismatch when extracting high-resolution, high-level semantic features. To efficiently address these issues, we propose a novel Dilation Pyramid Module (DPM), which can enlarge the receptive field multiplicatively to extract high-level semantic information, as subsampling does, without reducing spatial resolution. The DPM is composed of several consecutive dilated convolution layers whose dilation rates are specially designed to enlarge the receptive field multiplicatively and avoid the gridding issue of dilated convolution. Based on the DPM, the Dilation Pyramid Net (DPN) is proposed to efficiently extract high-resolution, high-level semantic features. We experimentally demonstrate the effectiveness and efficiency of the proposed DPN, with performance competitive with the state-of-the-art methods on two challenging benchmark datasets: the COCO keypoint detection dataset and the MPII Human Pose dataset.
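
The idea of stacking dilated convolutions so the receptive field grows multiplicatively while resolution stays fixed can be sketched as follows; the doubling dilation schedule and layer count are assumptions, not necessarily the paper's exact DPM design.

```python
import torch
import torch.nn as nn

class DilationPyramidModule(nn.Module):
    """Sketch of a dilation-pyramid-style module: consecutive 3x3 dilated
    convolutions whose dilation rate doubles at every layer, so the receptive
    field grows multiplicatively while the spatial resolution is unchanged.
    Because each layer's samples already cover a dense neighborhood, the
    stacked coverage stays hole-free (no gridding) for this schedule."""

    def __init__(self, channels, num_layers=4):
        super().__init__()
        layers = []
        for i in range(num_layers):
            d = 2 ** i                           # dilation 1, 2, 4, 8, ...
            layers += [
                nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            ]
        self.body = nn.Sequential(*layers)

    def forward(self, x):                        # x: (N, C, H, W)
        return x + self.body(x)                  # residual, same resolution

# toy usage: resolution is preserved while the receptive field grows
dpm = DilationPyramidModule(64)
print(dpm(torch.randn(1, 64, 64, 48)).shape)     # torch.Size([1, 64, 64, 48])
```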

Posted Content
TL;DR: In this article, a data-augmentation-based domain generalization method for underwater object detection is proposed: a style transfer model transforms images from one source domain to another, enriching the domain diversity of the training data, and by interpolating different domains at the feature level, new domains can be sampled on the domain manifold.
Abstract: The performance of existing underwater object detection methods degrades seriously when facing the domain shift problem caused by complicated underwater environments. Due to the limited number of domains in the dataset, deep detectors easily memorize just a few seen domains, which leads to low generalization ability. It can thus be inferred that a detector trained on as many domains as possible would be domain-invariant. Based on this viewpoint, we propose a domain generalization method from the perspective of data augmentation. First, a style transfer model transforms images from one source domain to another, enriching the domain diversity of the training data. Second, by interpolating different domains at the feature level, new domains can be sampled on the domain manifold. With our method, detectors become robust to domain shift. Comprehensive experiments on the S-UODAC2020 dataset demonstrate that the proposed method is able to learn domain-invariant representations and outperforms other domain generalization methods. The source code is available at this https URL.
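
One way to realize feature-level domain interpolation is to mix the channel-wise statistics of feature maps from two source domains; the sketch below is an assumed variant of this idea and is not claimed to be the paper's exact interpolation scheme.

```python
import torch

def interpolate_domain_features(feat_a, feat_b, alpha=0.5, eps=1e-6):
    """Sketch of feature-level domain interpolation (assumed variant): the
    channel-wise mean/std of a feature map are treated as its domain style,
    and a new 'domain' is sampled by mixing the styles of two source domains
    while keeping the normalized content of the first one.
      feat_a, feat_b: (N, C, H, W) features from two source domains
      alpha:          interpolation coefficient in [0, 1]
    """
    mu_a = feat_a.mean(dim=(2, 3), keepdim=True)
    sig_a = feat_a.std(dim=(2, 3), keepdim=True)
    mu_b = feat_b.mean(dim=(2, 3), keepdim=True)
    sig_b = feat_b.std(dim=(2, 3), keepdim=True)
    mu_mix = alpha * mu_a + (1 - alpha) * mu_b       # interpolated domain statistics
    sig_mix = alpha * sig_a + (1 - alpha) * sig_b
    content = (feat_a - mu_a) / (sig_a + eps)        # style-normalized content
    return content * sig_mix + mu_mix

# toy usage: mix the styles of two domains halfway
fa = torch.randn(4, 256, 32, 32)
fb = torch.randn(4, 256, 32, 32) * 2 + 1
print(interpolate_domain_features(fa, fb, alpha=0.5).shape)  # torch.Size([4, 256, 32, 32])
```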

Posted Content
Wenhao Li, Hong Liu, Hao Tang, Pichao Wang, Luc Van Gool
TL;DR: In this paper, a Multi-Hypothesis Transformer (MHFormer) is proposed that learns spatio-temporal representations of multiple plausible pose hypotheses; the task is decomposed into three stages: generating multiple initial hypothesis representations, merging them into a single converged representation and then partitioning it into several diverged hypotheses, and finally learning cross-hypothesis communication to aggregate the multi-hypothesis features into the final 3D pose.
Abstract: Estimating 3D human poses from monocular videos is a challenging task due to depth ambiguity and self-occlusion. Most existing works attempt to solve both issues by exploiting spatial and temporal relationships. However, those works ignore the fact that it is an inverse problem where multiple feasible solutions (i.e., hypotheses) exist. To relieve this limitation, we propose a Multi-Hypothesis Transformer (MHFormer) that learns spatio-temporal representations of multiple plausible pose hypotheses. In order to effectively model multi-hypothesis dependencies and build strong relationships across hypothesis features, the task is decomposed into three stages: (i) Generate multiple initial hypothesis representations; (ii) Model self-hypothesis communication, merge multiple hypotheses into a single converged representation and then partition it into several diverged hypotheses; (iii) Learn cross-hypothesis communication and aggregate the multi-hypothesis features to synthesize the final 3D pose. Through the above processes, the final representation is enhanced and the synthesized pose is much more accurate. Extensive experiments show that MHFormer achieves state-of-the-art results on two challenging datasets: Human3.6M and MPI-INF-3DHP. Without bells and whistles, its performance surpasses the previous best result by a large margin of 3% on Human3.6M. Code and models are available at this https URL