
Showing papers by "Yue Ming published in 2023"


Journal Article
TL;DR: This paper proposes a lightweight multiscale fusion network (LMFNet) with a hierarchical structure based on single-mode data for low-quality 3-D face recognition.
Abstract: Three-dimensional (3-D) face recognition (FR) can improve the usability and user-friendliness of human–machine interaction. In general, 3-D FR can be divided into high-quality and low-quality 3-D FR according to the interaction scenario. Low-quality data are easy to acquire, so their application prospects are broader; the challenge is how to balance the trade-off between data accuracy and real-time performance. To address this problem, we propose a lightweight multiscale fusion network (LMFNet) with a hierarchical structure based on single-mode data for low-quality 3-D FR. First, we design a backbone network with only five feature extraction blocks to reduce computational complexity and improve inference speed. Second, we devise a mid-low adjacent-layer multiscale feature fusion (ML-MSFF) module to extract facial texture and contour information, and a mid-high adjacent-layer multiscale feature fusion (MH-MSFF) module to obtain discriminative information from high-level features. These two modules are then combined into a hierarchical multiscale feature fusion (HMSFF) module that acquires local information at different scales. Finally, we enhance the feature representation by integrating HMSFF with a global convolutional neural network to improve recognition accuracy. Experiments on the Lock3DFace, KinectFaceDB, and IIIT-D datasets demonstrate that the proposed LMFNet achieves superior performance on low-quality data. Furthermore, experiments on a cross-quality database built from Bosphorus, and on low-quality datasets with different noise intensities built from UMB-DB and Bosphorus, show that our network is robust and generalizes well. It also satisfies the real-time requirement, laying the foundation for a smooth and user-friendly interactive experience.
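A minimal PyTorch sketch of the hierarchical multiscale fusion idea described above is given below. The block layout, channel counts, fusion by concatenation plus a 1x1 convolution, and the names MSFF and LMFNetSketch are illustrative assumptions, not the authors' exact LMFNet design.

```python
# Sketch only: a five-block backbone with mid-low and mid-high adjacent-layer
# fusion plus a global branch, in the spirit of the LMFNet abstract above.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(c_in, c_out):
    # One lightweight feature-extraction block (conv -> BN -> ReLU -> downsample).
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class MSFF(nn.Module):
    """Fuse two adjacent-layer feature maps at a common spatial scale."""
    def __init__(self, c_a, c_b, c_out):
        super().__init__()
        self.proj = nn.Conv2d(c_a + c_b, c_out, 1)

    def forward(self, a, b):
        b = F.interpolate(b, size=a.shape[-2:], mode="bilinear", align_corners=False)
        return self.proj(torch.cat([a, b], dim=1))

class LMFNetSketch(nn.Module):
    def __init__(self, num_ids=500):
        super().__init__()
        chs = [1, 16, 32, 64, 128, 256]          # five blocks, single-mode depth input
        self.blocks = nn.ModuleList(conv_block(chs[i], chs[i + 1]) for i in range(5))
        self.ml_msff = MSFF(32, 64, 64)           # mid-low adjacent layers
        self.mh_msff = MSFF(128, 256, 128)        # mid-high adjacent layers
        self.fc = nn.Linear(64 + 128 + 256, num_ids)

    def forward(self, x):
        feats = []
        for blk in self.blocks:
            x = blk(x)
            feats.append(x)
        local_low = self.ml_msff(feats[1], feats[2])   # texture/contour cues
        local_high = self.mh_msff(feats[3], feats[4])  # discriminative high-level cues
        pooled = [F.adaptive_avg_pool2d(t, 1).flatten(1)
                  for t in (local_low, local_high, feats[4])]  # HMSFF + global branch
        return self.fc(torch.cat(pooled, dim=1))

logits = LMFNetSketch()(torch.randn(2, 1, 128, 128))
print(logits.shape)  # torch.Size([2, 500])
```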

1 citation


Journal Article
Fan Feng, Yue Ming, Nannan Hu, Hui Yu, Yuanan Liu 
TL;DR: The authors propose a bidirectional guided co-attention (BGCA) block, containing two distinct attention paths, from audio to vision and from vision to audio, to focus on sound-related visual regions and event-related sound segments.
Abstract: Audio-visual event (AVE) localization aims to localize the temporal boundaries of events that contain both visual and audio content and to identify event categories in unconstrained videos. Existing work usually uses successive video segments for temporal modeling. However, ambient sounds or irrelevant visual targets in some segments often cause audio-visual semantic inconsistency, resulting in inaccurate global event modeling. To tackle this issue, we present a consistent segment selection network (CSS-Net). First, we propose a novel bidirectional guided co-attention (BGCA) block, containing two distinct attention paths, from audio to vision and from vision to audio, to focus on sound-related visual regions and event-related sound segments. Then, we propose a novel context-aware similarity measure (CASM) module to select semantically consistent visual and audio segments. A cross-correlation matrix is constructed from the correlation coefficients between visual and audio feature pairs at all time steps; by retaining highly correlated segments and discarding weakly correlated ones, the visual and audio features can learn the global event semantics of a video. Finally, we propose a novel audio-visual contrastive loss that learns similar semantic representations for the global visual and audio features under cosine and L2 similarity constraints. Extensive experiments on the public AVE dataset demonstrate the effectiveness of the proposed CSS-Net, which achieves the best localization accuracies of 80.5% and 76.8% in the fully and weakly supervised settings, respectively, compared with other state-of-the-art methods.
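The segment-selection and contrastive-loss ideas above can be sketched in a few lines of PyTorch. The keep ratio, the use of the diagonal of the cross-correlation matrix as the consistency score, and the loss weighting are assumptions for illustration rather than the exact CASM module or loss formulation of CSS-Net.

```python
# Sketch only: keep the time steps whose visual and audio features agree most,
# then pull the resulting global features together with cosine and L2 terms.
import torch
import torch.nn.functional as F

def select_consistent_segments(vis, aud, keep_ratio=0.75):
    """vis, aud: (B, T, D) per-segment features. Returns masked features."""
    v = F.normalize(vis, dim=-1)
    a = F.normalize(aud, dim=-1)
    corr = torch.einsum("btd,bsd->bts", v, a)          # (B, T, T) cross-correlation matrix
    diag = corr.diagonal(dim1=1, dim2=2)               # per-segment audio-visual agreement
    k = max(1, int(keep_ratio * vis.size(1)))
    idx = diag.topk(k, dim=1).indices                  # indices of consistent segments
    mask = torch.zeros_like(diag).scatter_(1, idx, 1.0).unsqueeze(-1)
    return vis * mask, aud * mask                      # discard low-correlation segments

def audio_visual_contrastive_loss(v_glob, a_glob, alpha=0.5):
    """Pull global visual/audio features together under cosine and L2 constraints."""
    cos_term = 1.0 - F.cosine_similarity(v_glob, a_glob, dim=-1).mean()
    l2_term = F.mse_loss(v_glob, a_glob)
    return cos_term + alpha * l2_term

vis = torch.randn(4, 10, 256)   # e.g. 10 one-second segments, 256-dim visual features
aud = torch.randn(4, 10, 256)
vis_sel, aud_sel = select_consistent_segments(vis, aud)
loss = audio_visual_contrastive_loss(vis_sel.mean(1), aud_sel.mean(1))
print(loss.item())
```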

Journal Article
TL;DR: The authors propose an Enhancing Hybrid Architecture with Fast Attention and Capsule Network (En-HACN), which models the positional relationships between different acoustic unit features to improve the discriminability of speech features while providing text local information during inference.
Abstract: Automatic speech recognition (ASR) is a fundamental technology in the field of artificial intelligence. End-to-end (E2E) ASR is favored for its state-of-the-art performance. However, E2E speech recognition still suffers from the loss of speech spatial information and text local information, which increases deletion and substitution errors during inference. To overcome this challenge, we propose a novel Enhancing Hybrid Architecture with Fast Attention and Capsule Network (En-HACN), which models the positional relationships between different acoustic unit features to improve the discriminability of speech features while providing text local information during inference. First, a new CNN-Capsule Network (CNN-Caps) module is proposed to capture the spatial information in the spectrogram through capsule outputs and a dynamic routing mechanism. Then, we design a novel LocalGRU Augmented Decoder (LA-decoder) that generates text hidden representations to capture the local information of the target sequences. Finally, we introduce fast attention in place of self-attention in En-HACN, which improves the generalization ability and efficiency of the model on long utterances. Experiments on the Aishell-1 and Librispeech corpora demonstrate that En-HACN achieves state-of-the-art performance compared with existing works. In addition, experiments on the Aishell-1-long long-utterance dataset show that our model maintains high generalization ability and efficiency.
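The CNN-Caps module builds on capsules with dynamic routing to keep spatial relationships between acoustic-unit features. Below is a minimal PyTorch sketch of a standard routing-by-agreement capsule layer; capsule counts, dimensions, and the number of routing iterations are illustrative assumptions, not the paper's configuration.

```python
# Sketch only: a generic capsule layer with dynamic routing (routing by agreement),
# the mechanism the CNN-Caps module relies on to preserve spatial relationships.
import torch
import torch.nn as nn
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    # Non-linear squashing keeps the vector's orientation and bounds its length in [0, 1).
    norm2 = (s * s).sum(dim=dim, keepdim=True)
    return (norm2 / (1.0 + norm2)) * s / torch.sqrt(norm2 + eps)

class RoutingCaps(nn.Module):
    def __init__(self, n_in=32, d_in=8, n_out=10, d_out=16, iters=3):
        super().__init__()
        self.iters = iters
        self.W = nn.Parameter(0.01 * torch.randn(n_in, n_out, d_out, d_in))

    def forward(self, u):                          # u: (B, n_in, d_in) lower-level capsules
        u_hat = torch.einsum("iojk,bik->bioj", self.W, u)   # prediction vectors (B, n_in, n_out, d_out)
        b = torch.zeros(u.size(0), self.W.size(0), self.W.size(1), device=u.device)
        for _ in range(self.iters):                # routing by agreement
            c = F.softmax(b, dim=2)                # coupling coefficients over output capsules
            v = squash((c.unsqueeze(-1) * u_hat).sum(dim=1))   # (B, n_out, d_out)
            b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)
        return v

caps = RoutingCaps()
out = caps(torch.randn(2, 32, 8))
print(out.shape)   # torch.Size([2, 10, 16])
```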