Analyzing lower half facial gestures for lip reading applications: Survey on vision techniques

The rise of the metaverse concept has drawn widespread attention to wearable gesture recognition devices. Data gloves based on flexible strain sensors are favoured by researchers for their low cost, light weight, and direct, continuous monitoring of finger movements. In this review, we first compare the advantages and disadvantages of four approaches to designing data gloves, based on vision sensors, myoelectric sensors, inertial and magnetic sensors, and flexible strain sensors, and demonstrate the superiority of the flexible strain sensor-based data glove for metaverse applications. Next, we present examples of recent commercial data gloves and describe the function modules of data gloves based on flexible strain sensors. We then summarize the potential applications of gesture recognition in the metaverse across diverse fields. Finally, we discuss the existing problems and development prospects of current data gloves based on flexible strain sensors. We are optimistic that novel flexible strain sensor-based data gloves will have a transformational impact by enabling accurate, low-latency, and immersive gesture interaction in the metaverse.

Flexible Strain Sensor-Based Data Glove for Gesture Interaction in the Metaverse: A Review

The purpose of Audio-Visual Speech Recognition is to identify the content of a spoken sentence by extracting lip movement features and acoustic features from an input video of a person speaking. Although current audio-visual fusion models solve the problem of mismatched sequence lengths across modalities to a certain extent, fusing the modalities may cause acoustic boundary ambiguity. To address this problem, we propose a model named Cross-Modal Continuous Integrate-and-Fire (CM-CIF). The model integrates cross-modal information into the accumulated weight so that the acoustic boundary can be located more accurately. We use the Transformer-seq2seq model as the baseline and test CM-CIF on the public datasets LRS2 and LRS3. Experimental results show that CM-CIF achieves competitive performance.

CM-CIF: Cross-Modal for Unaligned Modality Fusion with Continuous Integrate-and-Fire
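
The integrate-and-fire idea named in the CM-CIF abstract above can be illustrated with a minimal Python sketch. It assumes the plain CIF mechanism: per-frame weights are accumulated, and each time the running sum reaches a threshold a boundary is declared ("fire") and the weighted sum of the frame states collected so far is emitted. The function name `cif`, the threshold of 1.0, and the toy inputs are illustrative assumptions. How CM-CIF derives its weights from cross-modal information is not described in the abstract and is not reproduced here.

```python
from typing import List, Sequence, Tuple


def cif(
    states: Sequence[Sequence[float]],
    weights: Sequence[float],
    threshold: float = 1.0,
) -> Tuple[List[List[float]], List[int]]:
    """Plain continuous integrate-and-fire over a sequence of frame states.

    Assumes every weight lies in [0, threshold), so a frame fires at most once.
    Returns the integrated outputs and the frame indices where a fire occurred.
    """
    dim = len(states[0])
    acc_weight = 0.0            # weight accumulated since the last fire
    acc_state = [0.0] * dim     # weighted sum of states since the last fire
    outputs: List[List[float]] = []
    boundaries: List[int] = []

    for t, (h, a) in enumerate(zip(states, weights)):
        if acc_weight + a < threshold:
            # Keep integrating: no boundary reached at this frame.
            acc_weight += a
            acc_state = [s + a * x for s, x in zip(acc_state, h)]
        else:
            # Fire: spend only the part of the weight that completes the unit.
            used = threshold - acc_weight
            outputs.append([s + used * x for s, x in zip(acc_state, h)])
            boundaries.append(t)
            # The remaining weight starts the next unit.
            rest = a - used
            acc_weight = rest
            acc_state = [rest * x for x in h]

    return outputs, boundaries


# Toy input (hypothetical): four 1-dimensional frames with equal weights.
outputs, boundaries = cif([[1.0], [2.0], [3.0], [4.0]], [0.5, 0.5, 0.5, 0.5])
print(boundaries)  # frame indices at which a boundary was located
print(outputs)     # one integrated vector per located boundary
```

In this sketch the weights alone decide where boundaries fall, which is why the abstract's claim centres on improving the accumulated weight: a better weight estimate moves the fire points closer to the true acoustic boundaries.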

The robustness of automatic speech recognition (ASR) systems degrades due to factors such as environmental noise, speaker variability, and channel distortion, among others. Approaches such as speech signal processing, model adaptation, hybrid techniques, and integration of multiple sources are used in ASR system development. This paper focuses on building a robust ASR system by combining the complementary evidence present in the multiple modalities through which speech is expressed. Speech sounds are produced with lip radiation and are accompanied by lip movements; recognizing speech from these movements is called Visual Speech Recognition (VSR). A VSR system converts lip movements into spoken words. It consists of lip region detection, visual speech feature extraction, and modeling techniques. Robust feature extraction from visual lip movement is a challenging task in a VSR system. Hence, this paper reviews the feature extraction methods and existing databases used for VSR systems. The fusion of visual lip movements with an ASR system at different levels is also presented.

A. Nayeemulla Khan

Papers

A Survey on Visual Speech Recognition Approaches