Open Access Journal Article (DOI)

Learning to lip read words by watching videos

TLDR
A pipeline for fully automated data collection from TV broadcasts is developed, along with a two-stream convolutional neural network that learns a joint embedding between the sound and the mouth motions from unlabelled data, and hundreds of words are recognized from the resulting large-scale dataset.
About
This article was published in Computer Vision and Image Understanding on 2018-08-01 and is currently open access. It has received 67 citations to date. The article focuses on the topic: Convolutional neural network.


Citations
Proceedings Article (DOI)

VoxCeleb2: Deep Speaker Recognition.

TL;DR: In this article, a large-scale audio-visual speaker recognition dataset, VoxCeleb2, is presented, which contains over a million utterances from over 6,000 speakers.
Journal Article (DOI)

VoxCeleb: Large-scale speaker verification in the wild

TL;DR: A very large-scale audio-visual dataset collected from open-source media using a fully automated pipeline is introduced, and different CNN architectures with various aggregation methods and training loss functions are developed and compared, showing that identities can be effectively recognised from voice under various conditions.
Proceedings Article (DOI)

LRW-1000: A Naturally-Distributed Large-Scale Benchmark for Lip Reading in the Wild

TL;DR: This paper presents a naturally-distributed large-scale benchmark for lip reading in the wild, named LRW-1000, which contains 1,000 classes with 718,018 samples from more than 2,000 individual speakers; it is currently the largest word-level lip-reading dataset and the only public large-scale Mandarin lip-reading dataset.
Proceedings Article (DOI)

Perfect Match: Improved Cross-modal Embeddings for Audio-visual Synchronisation

TL;DR: In this article, audio-to-video synchronisation is posed as a cross-modal retrieval task, in which the objective is to find the most relevant audio segment for a given short video clip.
Posted Content

Deep Lip Reading: a comparison of models and an online application

TL;DR: The best performing model improves the state-of-the-art word error rate on the challenging BBC-Oxford Lip Reading Sentences 2 (LRS2) benchmark dataset by over 20 percent.
References
Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

TL;DR: A deep convolutional neural network, consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax, achieves state-of-the-art performance on the ImageNet classification benchmark, as discussed by the authors.
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Journal Article (DOI)

ImageNet Large Scale Visual Recognition Challenge

TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) as mentioned in this paper is a benchmark in object category classification and detection on hundreds of object categories and millions of images, which has been run annually from 2010 to present, attracting participation from more than fifty institutions.
Posted Content

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

TL;DR: Batch Normalization as mentioned in this paper normalizes layer inputs for each training mini-batch to reduce the internal covariate shift in deep neural networks, and achieves state-of-the-art performance on ImageNet.
Proceedings Article (DOI)

Learning Spatiotemporal Features with 3D Convolutional Networks

TL;DR: The learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks.
Frequently Asked Questions (14)
Q1. What contributions have the authors mentioned in the paper "Learning to lip read words by watching videos"?

With this the authors have generated a dataset with over a million word instances, spoken by over a thousand different people; second, they develop a two-stream convolutional neural network that learns a joint embedding between the sound and the mouth motions from unlabelled data. The authors apply this network to the tasks of audio-to-video synchronisation and active speaker detection; third, they train convolutional and recurrent networks that are able to effectively learn and recognize hundreds of words from this large-scale dataset. In lip reading and in speaker detection, the authors demonstrate results that exceed the current state-of-the-art on public benchmark datasets.
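A minimal PyTorch-style sketch of the two-stream idea described above, under stated assumptions: the tower layouts, embedding size and contrastive margin are illustrative placeholders, not the authors' exact architecture. Two small convolutional towers embed an MFCC segment and a stack of mouth frames into a shared space, and a contrastive loss pulls synchronised audio/video pairs together while pushing temporally shifted pairs apart.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """Tiny convolutional tower mapping an input 'image' to a unit-norm embedding.
    Layer sizes are illustrative, not the paper's configuration."""
    def __init__(self, in_channels, embed_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, embed_dim)

    def forward(self, x):
        h = self.features(x).flatten(1)
        return F.normalize(self.fc(h), dim=1)

class TwoStreamSync(nn.Module):
    """The audio tower takes an MFCC segment (as a 1-channel image); the video
    tower takes a short stack of grayscale mouth frames (one channel per frame)."""
    def __init__(self, n_frames=5):
        super().__init__()
        self.audio = Tower(in_channels=1)
        self.video = Tower(in_channels=n_frames)

    def forward(self, mfcc, frames):
        return self.audio(mfcc), self.video(frames)

def contrastive_loss(a, v, is_synced, margin=1.0):
    """is_synced = 1 for genuine audio/video pairs, 0 for temporally shifted ones."""
    d = F.pairwise_distance(a, v)
    return (is_synced * d.pow(2) + (1 - is_synced) * F.relu(margin - d).pow(2)).mean()
```

At test time, the audio-video embedding distance can be evaluated over a range of temporal offsets: the offset with the smallest distance gives the synchronisation, and the confidence of that decision can drive active speaker detection.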

The authors have shown that CNN and LSTM architectures can be used to classify temporal lip motion sequences of words with excellent results. 

To further augment the training data, the authors make random shifts in time by up to 0.2 seconds, which improves the top-1 validation error by 3.5% compared to the standard ImageNet augmentation methods. 
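As an illustration of that augmentation, the sketch below shifts a 25-frame (1-second) clip by up to ±0.2 s, i.e. ±5 frames at an assumed 25 fps, padding by repeating the edge frames; the padding choice is an assumption, not necessarily the authors' exact recipe.

```python
import random

FPS = 25             # frame rate of the 1-second, 25-frame clips
MAX_SHIFT_S = 0.2    # maximum random time shift

def random_time_shift(frames):
    """Randomly shift a list of frames by up to +/-0.2 s, keeping the clip length
    fixed by repeating the first or last frame at the exposed end."""
    max_shift = int(MAX_SHIFT_S * FPS)              # +/-5 frames at 25 fps
    shift = random.randint(-max_shift, max_shift)
    if shift > 0:                                   # delay: pad at the front
        frames = [frames[0]] * shift + frames[:-shift]
    elif shift < 0:                                 # advance: pad at the back
        frames = frames[-shift:] + [frames[-1]] * (-shift)
    return frames
```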

Their reason for considering CNNs, rather than the Recurrent Neural Networks more usually employed for sequence modelling, is the ability of CNNs to learn to classify images by their content given only image supervision at the class level, i.e. without having to provide stronger supervisory information such as bounding boxes or pixel-wise segmentation.

The dataset consists of 52 subjects uttering 10 phrases (e.g. ‘thank you’, ‘hello’, etc.), and has been widely used in previous works. 

The reason is that the cropped mouth images are rarely larger than 111×111 pixels, so smaller filters can be used at conv1 than those used in VGG-M without sacrificing receptive field, while avoiding unnecessary parameters being learnt.
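A rough PyTorch illustration of that trade-off; the alternative filter size and stride below are assumptions for illustration, not the paper's exact conv1 configuration.

```python
import torch
import torch.nn as nn

# VGG-M's conv1: 96 filters of size 7x7 with stride 2, designed for 224x224 inputs.
vggm_conv1 = nn.Conv2d(3, 96, kernel_size=7, stride=2)

# For mouth crops rarely larger than 111x111, a smaller first-layer filter covers a
# similar fraction of the input while needing far fewer weights per filter.
mouth_conv1 = nn.Conv2d(3, 96, kernel_size=3, stride=1, padding=1)

x = torch.randn(1, 3, 111, 111)                     # one cropped mouth image
print(vggm_conv1(x).shape, mouth_conv1(x).shape)    # compare resulting feature maps
```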

For all models apart from LSTM-5, the authors simply repeat the first and the last frames to fill the 1-second clip if the phrase is shorter than 25 frames. 
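A small sketch of that padding strategy; how the repeats are divided between the start and the end of the clip is an assumption here.

```python
def pad_clip(frames, target_len=25):
    """Pad a short word/phrase to a fixed 25-frame (1-second) clip by repeating
    the first and last frames; clips that are already long enough are truncated."""
    deficit = target_len - len(frames)
    if deficit <= 0:
        return frames[:target_len]
    front = deficit // 2
    back = deficit - front
    return [frames[0]] * front + frames + [frames[-1]] * back
```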

Koller et al. [9] train an image classifier CNN to discriminate visemes (mouth shapes, visual equivalent of phonemes) on a sign language dataset where the signers mouth words. 

In particular, the authors build on the VGG-M model [39] since this has a good classification performance, but is much faster to train and experiment on than deeper models, such as VGG-16 [41]. 

Using this pipeline the authors have been able to extract 1000s of hours of spoken text covering an extensive vocabulary of 1000s of different words, with over 1M word instances, and over 1000 different speakers. 

The disadvantage of increasing the size of the averaging window is that the method cannot detect examples in which the person speaks for a very short period; though this is not a problem for this dataset. 
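A minimal NumPy sketch of that windowed decision: per-frame synchronisation confidences are smoothed with a sliding average and thresholded; the window length and threshold are illustrative values, not the paper's.

```python
import numpy as np

def active_speaker(sync_confidences, window=10, threshold=0.5):
    """Return a per-frame speaking/not-speaking decision from sync confidences.
    A larger window is more robust but, as noted above, misses very short utterances."""
    conf = np.asarray(sync_confidences, dtype=float)
    kernel = np.ones(window) / window
    smoothed = np.convolve(conf, kernel, mode="same")
    return smoothed > threshold
```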

Apart from this limitation, lip-reading is a challenging problem in any case due to intra-class variations (such as accents, speed of speaking, mumbling), and adversarial imaging conditions (such as poor lighting, strong shadows, motion, resolution, foreshortening, etc.). 

Similarly, [13] uses DBF to encode the image for every frame, and trains an LSTM classifier to generate a word-level classification.
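A generic PyTorch sketch of that per-frame-encoder-plus-LSTM recipe; the linear frame encoder, feature dimension and vocabulary size are placeholders and do not reproduce the DBF-based model of [13].

```python
import torch
import torch.nn as nn

class FrameLSTMClassifier(nn.Module):
    """Encode each mouth frame independently, then classify the whole sequence
    of frame features with an LSTM into one of n_words word classes."""
    def __init__(self, img_size=112, feat_dim=256, hidden=256, n_words=500):
        super().__init__()
        self.encoder = nn.Sequential(                 # stand-in per-frame encoder
            nn.Flatten(), nn.Linear(img_size * img_size, feat_dim), nn.ReLU()
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_words)

    def forward(self, clip):                          # clip: (B, T, H, W) grayscale
        b, t = clip.shape[:2]
        feats = self.encoder(clip.flatten(0, 1)).view(b, t, -1)
        _, (h, _) = self.lstm(feats)
        return self.head(h[-1])                       # word logits per clip
```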

One approach would be to tightly register the mouth region (including lips, teeth and tongue, that all contribute to word recognition), but another is to develop networks that are tolerant to some degree of motion jitter.
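One common way to build in that tolerance is to expose the network to small random spatial offsets of the mouth crop at training time; the sketch below illustrates the idea, with the crop size and jitter range as assumed values.

```python
import random

def random_crop_jitter(frame, out_size=111, max_jitter=4):
    """Crop an out_size x out_size mouth region from a slightly larger frame
    (a NumPy array of shape HxW or HxWxC) with a small random offset, so the
    network sees, and learns to tolerate, registration jitter."""
    h, w = frame.shape[:2]
    dy = random.randint(-max_jitter, max_jitter)
    dx = random.randint(-max_jitter, max_jitter)
    y0 = max(0, min(h - out_size, (h - out_size) // 2 + dy))
    x0 = max(0, min(w - out_size, (w - out_size) // 2 + dx))
    return frame[y0:y0 + out_size, x0:x0 + out_size]
```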