
Showing papers by "Mengyuan Liu" published in 2022


Journal ArticleDOI
TL;DR: Zhang et al. propose an identity-relevance aware neural network (IRANet) for cloth-changing person re-identification, in which a human head detection module localizes the human head part with the help of human parsing estimation.

5 citations
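The IRANet entry above mentions a head detection module guided by human parsing estimation. As a minimal sketch of that idea only, the snippet below crops a head region by taking the bounding box of head-related labels in a parsing map; the label IDs, the margin, and the fallback behaviour are assumptions, since the paper's actual module is not described here.

import numpy as np

# Hypothetical parsing label IDs for head-related classes (e.g., hair, face).
HEAD_LABELS = [1, 2]

def crop_head(image, parsing, margin=8):
    """Crop the image patch covering head-labeled pixels in the parsing map.

    image:   (H, W, 3) array, the person image
    parsing: (H, W) array of per-pixel part labels
    """
    mask = np.isin(parsing, HEAD_LABELS)
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return image  # no head pixels found: fall back to the full image
    h, w = parsing.shape
    y0, y1 = max(0, ys.min() - margin), min(h, ys.max() + 1 + margin)
    x0, x1 = max(0, xs.min() - margin), min(w, xs.max() + 1 + margin)
    return image[y0:y1, x0:x1]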


Journal ArticleDOI
TL;DR: This work presents a contrastive visual semantic embedding framework, named ConVSE, which achieves intra-modal semantic alignment by contrastive learning on augmented image-image (or text-text) pairs and achieves inter-modal semantic alignment by applying a hardest-negative-enhanced triplet loss on image-text pairs.
Abstract: Learning visual semantic embedding for image-text matching has achieved great success by using a triplet loss to pull together positive image-text pairs that share similar semantic meaning and to push apart negative image-text pairs with different semantic meanings. Without modeling constraints from image-image or text-text pairs, the generated visual semantic embedding inevitably faces the problem of semantic misalignment among similar images or among similar texts. To solve this problem, we present a contrastive visual semantic embedding framework, named ConVSE, which achieves intra-modal semantic alignment by contrastive learning from augmented image-image (or text-text) pairs and achieves inter-modal semantic alignment by applying a hardest-negative-enhanced triplet loss on image-text pairs. To the best of our knowledge, we are the first to find that contrastive learning benefits visual semantic embedding. Extensive experiments on the large-scale MSCOCO and Flickr30K datasets verify the effectiveness of the proposed ConVSE, which outperforms visual semantic embedding-based methods and achieves new state-of-the-art results.

5 citations
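The ConVSE abstract above combines an inter-modal hardest-negative triplet loss with intra-modal contrastive learning on augmented pairs. The sketch below shows one plausible PyTorch formulation of those two terms; the margin, temperature, and the symmetric cross-entropy form of the contrastive loss are choices of this sketch rather than details from the paper, and embeddings are assumed to be L2-normalized.

import torch
import torch.nn.functional as F

def hardest_negative_triplet(img, txt, margin=0.2):
    """Inter-modal loss on a batch of matched (image, text) embedding pairs.

    img, txt: (B, D) L2-normalized embeddings; row i of each forms a positive pair.
    """
    sim = img @ txt.t()                                   # (B, B) cosine similarities
    pos = sim.diag()                                      # matched-pair similarities
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(eye, -1.0)                      # exclude positives from the max
    hardest_txt = neg.max(dim=1).values                   # hardest caption per image
    hardest_img = neg.max(dim=0).values                   # hardest image per caption
    return (F.relu(margin + hardest_txt - pos) +
            F.relu(margin + hardest_img - pos)).mean()

def intra_modal_contrastive(z1, z2, tau=0.07):
    """Intra-modal loss on two augmented views of the same images (or texts)."""
    logits = (z1 @ z2.t()) / tau                          # (B, B) similarity logits
    targets = torch.arange(z1.size(0), device=z1.device)  # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

In training, the two intra-modal terms (one for images, one for texts) would be added to the triplet term with some weighting; the abstract does not state the weights, so they are left open here.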


Journal ArticleDOI
TL;DR: A deep image-to-video (I2V) ReID pipeline based on three-dimensional semantic appearance alignment (3D-SAA) and cross-modal interactive learning (CMIL), in which the CMIL module enables communication between the global image and video streams by interactively propagating temporal information from videos to the channels of image feature maps.

4 citations
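The TL;DR above describes CMIL only at a high level, so the snippet below is an illustrative guess at the general shape of the idea rather than the paper's actual module: temporal information pooled from frame-level video features produces a per-channel gate that modulates the image feature map. The gating design and reduction ratio are assumptions.

import torch
import torch.nn as nn

class VideoToImageChannelGate(nn.Module):
    """Propagate pooled video temporal information into image feature channels."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, img_feat, vid_feat):
        # img_feat: (B, C, H, W) image feature map
        # vid_feat: (B, T, C)    frame-level video features
        temporal = vid_feat.mean(dim=1)            # pool temporal information
        g = self.gate(temporal)                    # per-channel gate in (0, 1)
        return img_feat * g.unsqueeze(-1).unsqueeze(-1)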


Journal ArticleDOI
TL;DR: Li et al. present a pose decoupled flow network (PDF-E) that learns from motion direction and motion norm in a multi-task learning framework, where one encoder is used to generate the representation and two decoders are used to generate direction and norm, respectively.
Abstract: Human action representation is derived from the description of human shape and motion. The traditional unsupervised 3-dimensional (3D) human action representation learning method uses a recurrent neural network (RNN)-based autoencoder to reconstruct the input pose sequence and then takes the mid-level feature of the autoencoder as the representation. Although an RNN can implicitly learn a certain amount of motion information, the extracted representation mainly describes the human shape and is insufficient to describe motion information. Therefore, we first present a handcrafted motion feature called pose flow to guide the reconstruction of the autoencoder, whose mid-level feature is expected to describe motion information. The performance is limited, as we observe that actions can be distinctive in either motion direction or motion norm. For example, we can distinguish "sitting down" from "standing up" by motion direction, yet distinguish "running" from "jogging" by motion norm. In such cases, it is difficult to learn distinctive features from pose flow, where direction and norm are mixed. To this end, we present an explicit pose decoupled flow network (PDF-E) to learn from direction and norm in a multi-task learning framework, where one encoder is used to generate the representation and two decoders are used to generate direction and norm, respectively. Further, we use reconstructing the input pose sequence as an additional constraint and present a generalized PDF network (PDF-G) to learn both motion and shape information, which achieves state-of-the-art performance on large-scale and challenging 3D action recognition datasets, including the NTU RGB+D 60 and NTU RGB+D 120 datasets.

4 citations
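The PDF-E abstract above decouples pose flow into motion direction and motion norm and trains one encoder with two decoders. Below is a minimal sketch of that decoupling and of a one-encoder/two-decoder GRU model in its spirit; the GRU layers, hidden sizes, and output heads are assumptions, and in a real loss the decoder outputs would be aligned with the T-1 flow frames (e.g., by dropping the last time step).

import torch
import torch.nn as nn

def pose_flow(poses, eps=1e-6):
    """poses: (B, T, J, 3) joint coordinates. Returns (direction, norm)."""
    flow = poses[:, 1:] - poses[:, :-1]            # (B, T-1, J, 3) frame-to-frame displacement
    norm = flow.norm(dim=-1, keepdim=True)         # (B, T-1, J, 1) motion norm
    direction = flow / (norm + eps)                # (B, T-1, J, 3) motion direction (unit vectors)
    return direction, norm

class PDFE(nn.Module):
    """One encoder and two decoders (direction and norm), in the spirit of PDF-E."""
    def __init__(self, joints=25, hidden=256):
        super().__init__()
        in_dim = joints * 3
        self.encoder = nn.GRU(in_dim, hidden, batch_first=True)
        self.dir_decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.norm_decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.dir_head = nn.Linear(hidden, joints * 3)
        self.norm_head = nn.Linear(hidden, joints)

    def forward(self, poses):
        B, T, J, _ = poses.shape
        x = poses.reshape(B, T, J * 3)
        h, _ = self.encoder(x)                     # mid-level representation
        d, _ = self.dir_decoder(h)
        n, _ = self.norm_decoder(h)
        pred_dir = self.dir_head(d).reshape(B, T, J, 3)
        pred_norm = self.norm_head(n).reshape(B, T, J, 1)
        return h, pred_dir, pred_norm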


Journal ArticleDOI
TL;DR: This work designs a novel spatial-temporal asynchronous normalization (STAN) method, which normalizes the original skeleton sequence in two steps and generates a normalized motion sequence that suffers less from the effect of different human body shapes.
Abstract: Unsupervised 3D action representation learning from skeleton sequences has attracted increasing attention in recent years. Existing methods have successfully applied autoencoder networks to learn 3D action representations by reconstructing the original skeleton sequence. However, these methods ignore motion cues and thus struggle to distinguish actions with similar shape information but slightly different motion information. Instead of reconstructing the original skeleton sequence, we learn a distinctive 3D action representation with an autoencoder network by reconstructing a normalized motion sequence extracted from the original input. To obtain the normalized motion sequence, we design a novel spatial-temporal asynchronous normalization (STAN) method, which normalizes the original skeleton sequence in two steps. First, STAN reduces redundant temporal information and extracts the motion sequence by subtracting the mean value along the temporal dimension. Second, STAN further normalizes the motion sequence along the spatial dimension and generates a normalized motion sequence that suffers less from the effect of different human body shapes. Extensive experiments on the large-scale NTU RGB+D 60 and NTU RGB+D 120 datasets verify the effectiveness of the proposed STAN method, which achieves results comparable to state-of-the-art methods and also outperforms alternative normalization methods.

2 citations
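The STAN abstract above describes a two-step normalization: subtract the mean along the temporal dimension to obtain a motion sequence, then normalize that motion sequence along the spatial dimension. The sketch below follows those two steps literally; since the exact form of the spatial normalization is not given here, per-frame standardization over joints and coordinates is used as an assumption.

import torch

def stan_normalize(skeleton, eps=1e-6):
    """skeleton: (T, J, 3) joint coordinates of one sequence."""
    # Step 1: temporal normalization -> subtract the mean pose to obtain a motion sequence
    motion = skeleton - skeleton.mean(dim=0, keepdim=True)
    # Step 2: spatial normalization -> reduce body-shape effects (assumed form:
    # per-frame standard deviation over joints and coordinates)
    std = motion.std(dim=(1, 2), keepdim=True)
    return motion / (std + eps)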