Q2. How can the authors classify the words in a video?
The authors have shown that CNN and LSTM architectures can be used to classify temporal lip motion sequences of words with excellent results.
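As a rough illustration only (this is not the authors' exact architecture), a per-frame CNN encoder followed by an LSTM word classifier can be sketched in PyTorch; all layer sizes below are assumptions:

```python
# Minimal sketch: a small CNN applied to every mouth frame, with an LSTM
# aggregating the per-frame features into a word-level prediction.
import torch
import torch.nn as nn

class LipWordClassifier(nn.Module):
    def __init__(self, num_words=500, feat_dim=256):
        super().__init__()
        # Per-frame CNN encoder (illustrative layer sizes, not the paper's).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        # LSTM aggregates the frame features over time.
        self.lstm = nn.LSTM(feat_dim, 256, batch_first=True)
        self.fc = nn.Linear(256, num_words)

    def forward(self, clips):               # clips: (B, T, 1, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
        _, (h, _) = self.lstm(feats)         # final hidden state: (1, B, 256)
        return self.fc(h[-1])                # word logits

logits = LipWordClassifier()(torch.randn(2, 25, 1, 111, 111))
```

The CNN is applied to each mouth frame independently, and the LSTM's final hidden state is used for the word-level classification.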
Q3. How does the augmentation improve the top-1 validation error?
To further augment the training data, the authors make random shifts in time by up to 0.2 seconds, which improves the top-1 validation error by 3.5% compared to the standard ImageNet augmentation methods.
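A minimal sketch of such a temporal shift, assuming 25 fps video and edge-frame padding (the padding choice is an assumption, not stated above):

```python
import numpy as np

def random_time_shift(frames, fps=25, max_shift_s=0.2, rng=np.random):
    """Shift a (T, H, W) frame sequence by up to +/-0.2 s, padding with edge frames."""
    max_shift = int(round(max_shift_s * fps))          # +/-5 frames at 25 fps
    shift = rng.randint(-max_shift, max_shift + 1)
    if shift > 0:    # delay: repeat the first frame at the start
        return np.concatenate([np.repeat(frames[:1], shift, axis=0), frames[:-shift]])
    if shift < 0:    # advance: repeat the last frame at the end
        return np.concatenate([frames[-shift:], np.repeat(frames[-1:], -shift, axis=0)])
    return frames
```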
Q4. Why do the authors consider using CNNs for sequence modelling?
Their reason for considering CNNs, rather than the Recurrent Neural Networks more usually used for sequence modelling, is their ability to learn to classify images based on their content given only class-level supervision, i.e. without requiring stronger supervisory information such as bounding boxes or pixel-wise segmentation.
Q5. How many subjects have been used in previous works?
The dataset consists of 52 subjects uttering 10 phrases (e.g. ‘thank you’, ‘hello’, etc.), and has been widely used in previous works.
Q6. Why is the size of the cropped mouth images smaller than in VGG-M?
The reason is that the cropped mouth images are rarely larger than 111×111 pixels, so smaller filters can be used at conv1 than in VGG-M without sacrificing receptive field, while avoiding learning unnecessary parameters.
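For a rough sense of the parameter saving, the comparison below assumes VGG-M's 7×7, stride-2 conv1 and a hypothetical 3×3 replacement; the paper's actual filter size and input channels are not reproduced here:

```python
import torch.nn as nn

# Illustrative comparison only; filter sizes and channel counts are assumptions.
vggm_conv1  = nn.Conv2d(3, 96, kernel_size=7, stride=2)   # VGG-M, 224x224 inputs
small_conv1 = nn.Conv2d(3, 96, kernel_size=3, stride=1)   # hypothetical, 111x111 crops

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(vggm_conv1), count(small_conv1))   # 14208 vs 2688 parameters (incl. bias)
```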
Q7. How many frames do the authors repeat to fill the clip?
For all models apart from LSTM-5, the authors simply repeat the first and the last frames to fill the 1-second clip if the phrase is shorter than 25 frames.
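A minimal sketch of this padding, assuming 25-frame clips; how the padding is split between the start and end is an assumption:

```python
import numpy as np

def pad_to_clip(frames, clip_len=25):
    """Pad a (T, H, W) phrase shorter than clip_len by repeating the first
    and last frames, splitting the padding between the two ends."""
    short = clip_len - len(frames)
    if short <= 0:
        return frames[:clip_len]
    front, back = short // 2, short - short // 2
    return np.concatenate([np.repeat(frames[:1], front, axis=0),
                           frames,
                           np.repeat(frames[-1:], back, axis=0)])
```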
Q8. What is the way to train a CNN?
Koller et al. [9] train an image classifier CNN to discriminate visemes (mouth shapes, visual equivalent of phonemes) on a sign language dataset where the signers mouth words.
Q9. What is the disadvantage of building on the VGG-M model?
In particular, the authors build on the VGG-M model [39] since this has a good classification performance, but is much faster to train and experiment on than deeper models, such as VGG-16 [41].
Q10. How many different speakers does the extracted spoken text cover?
Using this pipeline the authors have been able to extract 1000s of hours of spoken text covering an extensive vocabulary of 1000s of different words, with over 1M word instances, and over 1000 different speakers.
Q11. What is the disadvantage of increasing the size of the averaging window?
The disadvantage of increasing the size of the averaging window is that the method cannot detect examples in which the person speaks for a very short period; though this is not a problem for this dataset.
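A sketch of the idea with an assumed per-frame speaking score, window length and threshold (none of these values come from the paper):

```python
import numpy as np

def speaking_segments(scores, window=100, thresh=0.5):
    """Average a per-frame speaking score over a sliding window and threshold it."""
    kernel = np.ones(window) / window
    smoothed = np.convolve(scores, kernel, mode='same')
    return smoothed > thresh   # boolean mask of frames judged as speaking
```

With a large window, a burst of speech much shorter than the window is averaged down below the threshold and therefore missed, which is the disadvantage described above.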
Q12. What is the main limitation of lip-reading?
Apart from this limitation, lip-reading is a challenging problem in any case due to intra-class variations (such as accents, speed of speaking, mumbling), and adversarial imaging conditions (such as poor lighting, strong shadows, motion, resolution, foreshortening, etc.).
Q13. What is the way to train a classifier?
Similarly, [13] uses DBF to encode the image for every frame and trains an LSTM classifier to generate a word-level classification.
Q14. How can the method be made tolerant to motion jitter?
One approach would be to tightly register the mouth region (including the lips, teeth and tongue, which all contribute to word recognition), but another is to develop networks that are tolerant to some degree of motion jitter.
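For the second option, one common way to encourage jitter tolerance is to randomly perturb the crop position during training; a sketch with assumed crop size and jitter range:

```python
import numpy as np

def random_jitter_crop(frame, crop=111, max_jitter=4, rng=np.random):
    """Crop the mouth region with a small random spatial offset, so the network
    sees (and learns to tolerate) slight mis-registration of the mouth."""
    h, w = frame.shape[:2]
    dy = rng.randint(-max_jitter, max_jitter + 1)
    dx = rng.randint(-max_jitter, max_jitter + 1)
    y0 = np.clip((h - crop) // 2 + dy, 0, h - crop)
    x0 = np.clip((w - crop) // 2 + dx, 0, w - crop)
    return frame[y0:y0 + crop, x0:x0 + crop]
```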