Q2. How can the authors classify the words in a video?
The authors have shown that CNN and LSTM architectures can be used to classify temporal lip motion sequences of words with excellent results.
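As a rough illustration only (this is not the authors' exact architecture), a per-frame CNN encoder followed by an LSTM word classifier can be sketched in PyTorch; all layer sizes below are assumptions:

```python
# Minimal sketch: a small CNN applied to every mouth frame, with an LSTM
# aggregating the per-frame features into a word-level prediction.
import torch
import torch.nn as nn

class LipWordClassifier(nn.Module):
    def __init__(self, num_words=500, feat_dim=256):
        super().__init__()
        # Per-frame CNN encoder (illustrative layer sizes, not the paper's).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        # LSTM aggregates the frame features over time.
        self.lstm = nn.LSTM(feat_dim, 256, batch_first=True)
        self.fc = nn.Linear(256, num_words)

    def forward(self, clips):               # clips: (B, T, 1, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
        _, (h, _) = self.lstm(feats)         # final hidden state: (1, B, 256)
        return self.fc(h[-1])                # word logits

logits = LipWordClassifier()(torch.randn(2, 25, 1, 111, 111))
```

The CNN is applied to each mouth frame independently, and the LSTM's final hidden state is used for the word-level classification.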
Q3. How does the augmentation improve the top-1 validation error?
To further augment the training data, the authors make random shifts in time by up to 0.2 seconds, which improves the top-1 validation error by 3.5% compared to the standard ImageNet augmentation methods.
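A minimal sketch of such a temporal shift, assuming 25 fps video and edge-frame padding (the padding choice is an assumption, not stated above):

```python
import numpy as np

def random_time_shift(frames, fps=25, max_shift_s=0.2, rng=np.random):
    """Shift a (T, H, W) frame sequence by up to +/-0.2 s, padding with edge frames."""
    max_shift = int(round(max_shift_s * fps))          # +/-5 frames at 25 fps
    shift = rng.randint(-max_shift, max_shift + 1)
    if shift > 0:    # delay: repeat the first frame at the start
        return np.concatenate([np.repeat(frames[:1], shift, axis=0), frames[:-shift]])
    if shift < 0:    # advance: repeat the last frame at the end
        return np.concatenate([frames[-shift:], np.repeat(frames[-1:], -shift, axis=0)])
    return frames
```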
Q4. Why do the authors consider using CNNs for sequence modelling?
Their reason for considering CNNs, rather than the Recurrent Neural Networks more usually used for sequence modelling, is their ability to learn to classify images based on their content given only class-level supervision, i.e. without requiring stronger supervisory information such as bounding boxes or pixel-wise segmentation.
Q5. How many subjects have been used in previous works?
The dataset consists of 52 subjects uttering 10 phrases (e.g. ‘thank you’, ‘hello’, etc.), and has been widely used in previous works.
Q6. Why is the size of the cropped mouth images smaller than in VGG-M?
The reason is that the cropped mouth images are rarely larger than 111×111 pixels, so smaller filters can be used at conv1 than in VGG-M without sacrificing receptive field, while avoiding learning unnecessary parameters.
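For a rough sense of the parameter saving, the comparison below assumes VGG-M's 7×7, stride-2 conv1 and a hypothetical 3×3 replacement; the paper's actual filter size and input channels are not reproduced here:

```python
import torch.nn as nn

# Illustrative comparison only; filter sizes and channel counts are assumptions.
vggm_conv1  = nn.Conv2d(3, 96, kernel_size=7, stride=2)   # VGG-M, 224x224 inputs
small_conv1 = nn.Conv2d(3, 96, kernel_size=3, stride=1)   # hypothetical, 111x111 crops

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(vggm_conv1), count(small_conv1))   # 14208 vs 2688 parameters (incl. bias)
```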
Q7. How many frames do the authors repeat to fill the clip?
For all models apart from LSTM-5, the authors simply repeat the first and the last frames to fill the 1-second clip if the phrase is shorter than 25 frames.
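A minimal sketch of this padding, assuming 25-frame clips; how the padding is split between the start and end is an assumption:

```python
import numpy as np

def pad_to_clip(frames, clip_len=25):
    """Pad a (T, H, W) phrase shorter than clip_len by repeating the first
    and last frames, splitting the padding between the two ends."""
    short = clip_len - len(frames)
    if short <= 0:
        return frames[:clip_len]
    front, back = short // 2, short - short // 2
    return np.concatenate([np.repeat(frames[:1], front, axis=0),
                           frames,
                           np.repeat(frames[-1:], back, axis=0)])
```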
Q8. What is the way to train a CNN?
Koller et al. [9] train an image classifier CNN to discriminate visemes (mouth shapes, visual equivalent of phonemes) on a sign language dataset where the signers mouth words.
Q9. What is the disadvantage of building on the VGG-M model?
In particular, the authors build on the VGG-M model [39] since this has a good classification performance, but is much faster to train and experiment on than deeper models, such as VGG-16 [41].
Q10. How many different speakers does the extracted spoken text cover?
Using this pipeline the authors have been able to extract 1000s of hours of spoken text covering an extensive vocabulary of 1000s of different words, with over 1M word instances, and over 1000 different speakers.
Q11. What is the disadvantage of increasing the size of the averaging window?
The disadvantage of increasing the size of the averaging window is that the method cannot detect examples in which the person speaks for a very short period; though this is not a problem for this dataset.
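A sketch of the idea with an assumed per-frame speaking score, window length and threshold (none of these values come from the paper):

```python
import numpy as np

def speaking_segments(scores, window=100, thresh=0.5):
    """Average a per-frame speaking score over a sliding window and threshold it."""
    kernel = np.ones(window) / window
    smoothed = np.convolve(scores, kernel, mode='same')
    return smoothed > thresh   # boolean mask of frames judged as speaking
```

With a large window, a burst of speech much shorter than the window is averaged down below the threshold and therefore missed, which is the disadvantage described above.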
Q12. What is the main limitation of lip-reading?
Apart from this limitation, lip-reading is a challenging problem in any case due to intra-class variations (such as accents, speed of speaking, mumbling), and adversarial imaging conditions (such as poor lighting, strong shadows, motion, resolution, foreshortening, etc.).
Q13. What is the way to train a classifier?
Similarly, [13] uses DBF to encode the image for every frame and trains an LSTM classifier to generate a word-level classification.
Q14. How can the method be made tolerant to motion jitter?
One approach would be to tightly register the mouth region (including the lips, teeth and tongue, which all contribute to word recognition), but another is to develop networks that are tolerant to some degree of motion jitter.
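For the second option, one common way to encourage jitter tolerance is to randomly perturb the crop position during training; a sketch with assumed crop size and jitter range:

```python
import numpy as np

def random_jitter_crop(frame, crop=111, max_jitter=4, rng=np.random):
    """Crop the mouth region with a small random spatial offset, so the network
    sees (and learns to tolerate) slight mis-registration of the mouth."""
    h, w = frame.shape[:2]
    dy = rng.randint(-max_jitter, max_jitter + 1)
    dx = rng.randint(-max_jitter, max_jitter + 1)
    y0 = np.clip((h - crop) // 2 + dy, 0, h - crop)
    x0 = np.clip((w - crop) // 2 + dx, 0, w - crop)
    return frame[y0:y0 + crop, x0:x0 + crop]
```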