The goal of this work is to train models that can identify a spoken language just by interpreting the speaker's lip movements; it is demonstrated that the model solves the problem by finding temporal patterns in mouth movements rather than by exploiting spurious correlations.
Abstract:
The goal of this work is to train models that can identify a spoken language just by interpreting the speaker’s lip movements. Our contributions are the following: (i) we show that models can learn to discriminate among 14 different languages using only visual speech information; (ii) we compare different designs in sequence modelling and utterance-level aggregation in order to determine the best architecture for this task; (iii) we investigate the factors that contribute discriminative cues and show that our model indeed solves the problem by finding temporal patterns in mouth movements and not by exploiting spurious correlations. We demonstrate this further by evaluating our models on challenging examples from bilingual speakers.
TL;DR: This article proposes an attention-based pooling mechanism to aggregate visual speech representations and is the first to use sub-word units for lip reading, showing that this allows the ambiguities of the task to be modelled better.
TL;DR: A comprehensive review that broadly classifies lip reading applications into five distinct categories: Lip Reading Biometrics (LRB), Audio-Visual Speech Recognition (AVSR), Silent Speech Recognition (SSR), Voice from Lips, and Lip HCI (human-computer interaction).
TL;DR: This paper examined the acoustic signatures of bilingual voices using a conversational corpus of speech from early Cantonese-English bilinguals and found that every talker shows strong self-similarity across the two languages, suggesting that an individual's voice remains relatively constant across languages.
TL;DR: In this article, the authors use linguistic information as a soft biometric trait to enhance a visual (auditory-free) identification system based on lip movement, and report a significant improvement in identification performance when these data are integrated using a score-based fusion strategy.
TL;DR: In this article, the authors propose a residual learning framework that eases the training of networks substantially deeper than those used previously; the resulting model won 1st place in the ILSVRC 2015 classification task.
TL;DR: A novel, efficient, gradient-based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
TL;DR: It is shown how the proposed bidirectional structure can be easily modified to allow efficient estimation of the conditional posterior probability of complete symbol sequences without making any explicit assumption about the shape of the distribution.
TL;DR: Xception is a novel deep convolutional neural network architecture inspired by Inception, in which Inception modules are replaced with depthwise separable convolutions; a depthwise separable convolution can be interpreted as an Inception module with a maximally large number of towers.
TL;DR: VGGFace2 is a large-scale face dataset containing 3.31 million images of 9,131 subjects, with an average of 362.6 images per subject.
Q1. What are the future works mentioned in the paper "Now you’re speaking my language: visual language identification" ?
In future work the authors plan to investigate which lip movements provide the most discriminative cues, and to explore the visual similarities and differences between languages, e.g. determining whether certain viseme combinations are more prominent for some groups of languages than for others.
Q2. What contributions have the authors mentioned in the paper "Now you’re speaking my language: visual language identification" ?
The goal of this work is to train models that can identify a spoken language just by interpreting the speaker's lip movements. Their contributions are the following: (i) the authors show that models can learn to discriminate among 14 different languages using only visual speech information; (ii) they compare different designs in sequence modelling and utterance-level aggregation in order to determine the best architecture for this task; (iii) they investigate the factors that contribute discriminative cues and show that their model indeed solves the problem by finding temporal patterns in mouth movements and not by exploiting spurious correlations. The authors demonstrate this further by evaluating their models on challenging examples from bilingual speakers.
Q3. What is the main reason for the progress in the recent years?
There has been significant progress in recent years, mainly due to advances in deep learning and the creation of large-scale datasets.
Q4. How do the authors accelerate the training of their models?
To accelerate training, for all models the authors use a curriculum: T is first set to 64 frames and then increased to 128 and 256 frames (2.5 s, 5 s and 10 s).
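To make the curriculum concrete, the sketch below stages training over increasing clip lengths. It is a minimal illustration assuming a PyTorch-style setup; make_dataset, the optimiser settings and the batch size are hypothetical placeholders, and only the frame counts and durations come from the paper.

# Minimal sketch of the clip-length curriculum (hypothetical dataset/model;
# only the 64/128/256-frame stages and 2.5 s / 5 s / 10 s durations are from the paper).
import torch
from torch.utils.data import DataLoader

def train_with_curriculum(model, make_dataset, epochs_per_stage=5, device="cuda"):
    optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = torch.nn.CrossEntropyLoss()
    model.to(device)
    for num_frames in (64, 128, 256):                  # 2.5 s, 5 s, 10 s clips
        loader = DataLoader(make_dataset(num_frames), batch_size=32, shuffle=True)
        for _ in range(epochs_per_stage):
            for clips, labels in loader:               # clips: (B, T, C, H, W)
                clips, labels = clips.to(device), labels.to(device)
                loss = criterion(model(clips), labels)
                optimiser.zero_grad()
                loss.backward()
                optimiser.step()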
Q5. What is the way to train the face recognition model?
The authors take a ResNet50 convolutional network [38] pretrained for face recognition on the VGGFace2 dataset [44] and fine-tune it on the VLID task.
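As an illustration of this step, the sketch below swaps the classification head of a ResNet50 for a 14-way language classifier. torchvision's ImageNet-pretrained weights stand in for the VGGFace2 face-recognition weights used in the paper; this is an assumption, not the authors' exact setup.

import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

# ImageNet weights as a stand-in for the VGGFace2-pretrained model (assumption).
backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
backbone.fc = nn.Linear(backbone.fc.in_features, 14)  # 14 languages in the VLID task
# The network is then fine-tuned end-to-end on the cropped mouth/face frames.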
Q6. What are the main features of the LID task?
Cai et al. [15] explore the encoder and loss function for LID and propose some efficient temporal aggregation strategies, while Chen et al. [16] use NetVLAD [17] for temporal aggregation.
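For reference, a rough sketch of NetVLAD-style temporal aggregation over per-frame features is given below; the cluster count and feature dimension are illustrative and not the configuration used in [16] or [17].

import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    """Aggregates a sequence of frame features into a fixed-size descriptor."""
    def __init__(self, dim=512, num_clusters=8):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))
        self.assign = nn.Linear(dim, num_clusters)

    def forward(self, x):                                       # x: (B, T, dim)
        soft_assign = F.softmax(self.assign(x), dim=-1)         # (B, T, K) soft cluster assignments
        residuals = x.unsqueeze(2) - self.centroids             # (B, T, K, dim) residuals to centroids
        vlad = (soft_assign.unsqueeze(-1) * residuals).sum(1)   # (B, K, dim) weighted sum over time
        vlad = F.normalize(vlad, dim=-1)                        # intra-normalisation per cluster
        return F.normalize(vlad.flatten(1), dim=-1)             # (B, K*dim) L2-normalised descriptor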
Q7. What is the way to use a 2D CNN?
In more recent work, [18] uses a 2D CNN as a feature extractor with a BLSTM backend for temporal modelling and a self-attentive pooling layer for utterance-level aggregation.
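A minimal sketch of that pipeline (CNN features, then BLSTM, then self-attentive pooling, then a classifier) is shown below; the hidden sizes are illustrative, and the 2D CNN front-end is assumed to have already produced one feature vector per frame.

import torch
import torch.nn as nn

class SelfAttentivePooling(nn.Module):
    """Learns per-frame attention weights and returns their weighted average."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                                # x: (B, T, dim)
        weights = torch.softmax(self.score(x), dim=1)    # (B, T, 1) attention over time
        return (weights * x).sum(dim=1)                  # (B, dim) utterance-level embedding

class VisualLID(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, num_languages=14):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.pool = SelfAttentivePooling(2 * hidden)
        self.classifier = nn.Linear(2 * hidden, num_languages)

    def forward(self, frame_feats):                      # (B, T, feat_dim) from the 2D CNN
        seq, _ = self.blstm(frame_feats)                 # (B, T, 2*hidden)
        return self.classifier(self.pool(seq))           # (B, num_languages) logits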
Q8. How do the authors ensure that the models are distinguishing between languages?
To ensure that the models are indeed distinguishing between languages by finding patterns in mouth movements, and not instead relying on other factors (e.g. inferring ethnicity from appearance cues) or spurious correlations, the authors compare against a face recognition baseline and also evaluate the models on a dataset from a different domain, VoxCeleb2 [13].
Q9. Can spoken language be identified from visual information alone?
Newman and Cox [11, 12] have shown that, under controlled visual conditions, visual language identification can also be automated.
Q10. What might be the reason for the poor performance of the VLID models?
The authors conjecture that this might be due to background landmarks or camera artefacts correlated with the filming locations of the TEDx events.