Top 58 papers published by Andrew Zisserman from University of Oxford in 2019

Proceedings Article•DOI•

The VIA Annotation Software for Images, Audio and Video

[...]

Abhishek Dutta¹, Andrew Zisserman¹•Institutions (1)

15 Oct 2019

TL;DR: A light weight, standalone and offline software package that does not require any installation or setup and runs solely in a web browser, the VIA software allows human annotators to define and describe spatial regions in images or video frames, and temporal segments in audio or video.

...read moreread less

Abstract: In this paper, we introduce a simple and standalone manual annotation tool for images, audio and video: the VGG Image Annotator (VIA). This is a light weight, standalone and offline software package that does not require any installation or setup and runs solely in a web browser. The VIA software allows human annotators to define and describe spatial regions in images or video frames, and temporal segments in audio or video. These manual annotations can be exported to plain text data formats such as JSON and CSV and therefore are amenable to further processing by other software tools. VIA also supports collaborative annotation of a large dataset by a group of human annotators. The BSD open source license of this software allows it to be used in any academic project or commercial application.

...read moreread less

518 citations

Proceedings Article•DOI•

Video Action Transformer Network

[...]

Rohit Girdhar¹, Joao Carreira, Carl Doersch, Andrew Zisserman²•Institutions (2)

Carnegie Mellon University¹, University of Oxford²

15 Jun 2019

TL;DR: Action Transformer as mentioned in this paper uses a Transformer-style architecture to aggregate features from the spatio-temporal context around the person whose actions we are trying to classify, and shows that by using high-resolution, person-specific, class-agnostic queries, the model spontaneously learns to track individual people and to pick up on semantic context from the actions of others.

...read moreread less

Abstract: We introduce the Action Transformer model for recognizing and localizing human actions in video clips. We repurpose a Transformer-style architecture to aggregate features from the spatiotemporal context around the person whose actions we are trying to classify. We show that by using high-resolution, person-specific, class-agnostic queries, the model spontaneously learns to track individual people and to pick up on semantic context from the actions of others. Additionally its attention mechanism learns to emphasize hands and faces, which are often crucial to discriminate an action – all without explicit supervision other than boxes and class labels. We train and test our Action Transformer network on the Atomic Visual Actions (AVA) dataset, outperforming the state-of-the-art by a significant margin using only raw RGB frames as input.

...read moreread less

486 citations

Journal Article•DOI•

Deep Audio-visual Speech Recognition

[...]

Triantafyllos Afouras¹, Joon Son Chung¹, Andrew W. Senior², Oriol Vinyals², Andrew Zisserman¹ - Show less +1 more•Institutions (2)

University of Oxford¹, Google²

01 Jan 2019-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: This work compares two models for lip reading, one using a CTC loss, and the other using a sequence-to-sequence loss, built on top of the transformer self-attention architecture.

...read moreread less

Abstract: The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that have focussed on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem -- unconstrained natural language sentences, and in the wild videos. Our key contributions are: (1) we compare two models for lip reading, one using a CTC loss, and the other using a sequence-to-sequence loss. Both models are built on top of the transformer self-attention architecture; (2) we investigate to what extent lip reading is complementary to audio speech recognition, especially when the audio signal is noisy; (3) we introduce and publicly release two new datasets for audio-visual speech recognition: LRS2-BBC, consisting of thousands of natural sentences from British television; and LRS3-TED, consisting of hundreds of hours of TED and TEDx talks obtained from YouTube. The models that we train surpass the performance of all previous work on lip reading benchmark datasets by a significant margin.

...read moreread less

454 citations

Proceedings Article•DOI•

Video Representation Learning by Dense Predictive Coding

[...]

Tengda Han¹, Weidi Xie¹, Andrew Zisserman¹•Institutions (1)

University of Oxford¹

10 Sep 2019

TL;DR: With single stream (RGB only), DPC pretrained representations achieve state-of-the-art self-supervised performance on both UCF101 and HMDB51, outperforming all previous learning methods by a significant margin, and approaching the performance of a baseline pre-trained on ImageNet.

...read moreread less

Abstract: The objective of this paper is self-supervised learning of spatio-temporal embeddings from video, suitable for human action recognition. We make three contributions: First, we introduce the Dense Predictive Coding (DPC) framework for self-supervised representation learning on videos. This learns a dense encoding of spatio-temporal blocks by recurrently predicting future representations; Second, we propose a curriculum training scheme to predict further into the future with progressively less temporal context. This encourages the model to only encode slowly varying spatial-temporal signals, therefore leading to semantic representations; Third, we evaluate the approach by first training the DPC model on the Kinetics-400 dataset with self-supervised learning, and then finetuning the representation on a downstream task, i.e. action recognition. With single stream (RGB only), DPC pretrained representations achieve state-of-the-art self-supervised performance on both UCF101(75.7% top1 acc) and HMDB51(35.7% top1 acc), outperforming all previous learning methods by a significant margin, and approaching the performance of a baseline pre-trained on ImageNet.

...read moreread less

370 citations

Proceedings Article•DOI•

Utterance-level Aggregation for Speaker Recognition in the Wild

[...]

Weidi Xie¹, Arsha Nagrani¹, Joon Son Chung¹, Andrew Zisserman¹•Institutions (1)

University of Oxford¹

12 May 2019

TL;DR: This paper proposes a powerful speaker recognition deep network, using a ‘thin-ResNet’ trunk architecture, and a dictionary-based NetVLAD or GhostVLAD layer to aggregate features across time, that can be trained end-to-end.

...read moreread less

Abstract: The objective of this paper is speaker recognition ‘in the wild’ – where utterances may be of variable length and also contain irrelevant signals. Crucial elements in the design of deep networks for this task are the type of trunk (frame level) network, and the method of temporal aggregation. We propose a powerful speaker recognition deep network, using a ‘thin-ResNet’ trunk architecture, and a dictionary-based NetVLAD or GhostVLAD layer to aggregate features across time, that can be trained end-to-end. We show that our network achieves state of the art performance by a significant margin on the VoxCeleb1 test set for speaker recognition, whilst requiring fewer parameters than previous methods. We also investigate the effect of utterance length on performance, and conclude that for ‘in the wild’ data, a longer length is beneficial.

...read moreread less

308 citations

Posted Content•

A Short Note on the Kinetics-700 Human Action Dataset

[...]

Joao Carreira¹, Eric Noland¹, Chloe Hillier¹, Andrew Zisserman¹•Institutions (1)

Google¹

15 Jul 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: An extension of the DeepMind Kinetics human action dataset from 600 classes to 700 classes, where for each class there are at least 600 video clips from different YouTube videos, and includes a comprehensive set of statistics.

...read moreread less

Abstract: We describe the 2020 edition of the DeepMind Kinetics human action dataset, which replenishes and extends the Kinetics-700 dataset. In this new version, there are at least 700 video clips from different YouTube videos for each of the 700 classes. This paper details the changes introduced for this new release of the dataset and includes a comprehensive set of statistics as well as baseline results using the I3D network.

...read moreread less

279 citations

Posted Content•

The VGG Image Annotator (VIA).

[...]

Abhishek Dutta, Andrew Zisserman

24 Apr 2019

TL;DR: This paper introduces a simple and standalone manual image annotation tool: the VGG Image Annotator, a light weight, standalone and offline software package that does not require any installation or setup and runs solely in a web browser.

...read moreread less

Abstract: Manual image annotation, such as defining and labelling regions of interest, is a fundamental processing stage of many research projects and industrial applications. In this paper, we introduce a simple and standalone manual image annotation tool: the VGG Image Annotator (VIA). This is a light weight, standalone and offline software package that does not require any installation or setup and runs solely in a web browser. Due to its lightness and flexibility, the VIA software has quickly become an essential and invaluable research support tool in many academic disciplines. Furthermore, it has also been immensely popular in several industrial sectors which have invested in adapting this open source software to their requirements. Since its public release in 2017, the VIA software has been used more than 500, 000 times and has nurtured a large and thriving open source community.

...read moreread less

259 citations

Proceedings Article•

Use What You Have: Video retrieval using representations from collaborative experts.

[...]

Yang Liu¹, Samuel Albanie², Arsha Nagrani², Andrew Zisserman²•Institutions (2)

University of Cambridge¹, University of Oxford²

01 Jan 2019

TL;DR: In this article, a collaborative experts model is proposed to aggregate information from different pre-trained experts and assess their approach empirically on five retrieval benchmarks: MSR-VTT, LSMDC, MSVD, DiDeMo, and ActivityNet.

...read moreread less

Abstract: The rapid growth of video on the internet has made searching for video content using natural language queries a significant challenge. Human-generated queries for video datasets `in the wild' vary a lot in terms of degree of specificity, with some queries describing specific details such as the names of famous identities, content from speech, or text available on the screen. Our goal is to condense the multi-modal, extremely high dimensional information from videos into a single, compact video representation for the task of video retrieval using free-form text queries, where the degree of specificity is open-ended. For this we exploit existing knowledge in the form of pre-trained semantic embeddings which include 'general' features such as motion, appearance, and scene features from visual content. We also explore the use of more 'specific' cues from ASR and OCR which are intermittently available for videos and find that these signals remain challenging to use effectively for retrieval. We propose a collaborative experts model to aggregate information from these different pre-trained experts and assess our approach empirically on five retrieval benchmarks: MSR-VTT, LSMDC, MSVD, DiDeMo, and ActivityNet. Code and data can be found at this http URL. This paper contains a correction to results reported in the previous version.

...read moreread less

198 citations

Proceedings Article•DOI•

The VIA Annotation Software for Images, Audio and Video

[...]

Abhishek Dutta¹, Andrew Zisserman¹•Institutions (1)

University of Oxford¹

24 Apr 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: The VGG Image Annotator (VIA) as discussed by the authors is a simple and standalone manual annotation tool for images, audio and video that allows human annotators to define and describe spatial regions in images or video frames and temporal segments in audio or video.

...read moreread less

Abstract: In this paper, we introduce a simple and standalone manual annotation tool for images, audio and video: the VGG Image Annotator (VIA). This is a light weight, standalone and offline software package that does not require any installation or setup and runs solely in a web browser. The VIA software allows human annotators to define and describe spatial regions in images or video frames, and temporal segments in audio or video. These manual annotations can be exported to plain text data formats such as JSON and CSV and therefore are amenable to further processing by other software tools. VIA also supports collaborative annotation of a large dataset by a group of human annotators. The BSD open source license of this software allows it to be used in any academic project or commercial application.

...read moreread less

188 citations

Proceedings Article•DOI•

Exploiting Temporal Context for 3D Human Pose Estimation in the Wild

[...]

Anurag Arnab¹, Carl Doersch, Andrew Zisserman¹•Institutions (1)

University of Oxford¹

15 Jun 2019

TL;DR: A bundle-adjustment-based algorithm for recovering accurate 3D human pose and meshes from monocular videos and shows that retraining a single-frame 3D pose estimator on this data improves accuracy on both real-world and mocap data by evaluating on the 3DPW and HumanEVA datasets.

...read moreread less

Abstract: We present a bundle-adjustment-based algorithm for recovering accurate 3D human pose and meshes from monocular videos. Unlike previous algorithms which operate on single frames, we show that reconstructing a person over an entire sequence gives extra constraints that can resolve ambiguities. This is because videos often give multiple views of a person, yet the overall body shape does not change and 3D positions vary slowly. Our method improves not only on standard mocap-based datasets like Human 3.6M -- where we show quantitative improvements -- but also on challenging in-the-wild datasets such as Kinetics. Building upon our algorithm, we present a new dataset of more than 3 million frames of YouTube videos from Kinetics with automatically generated 3D poses and meshes. We show that retraining a single-frame 3D pose estimator on this data improves accuracy on both real-world and mocap data by evaluating on the 3DPW and HumanEVA datasets.

...read moreread less

180 citations

Proceedings Article•DOI•

EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition

[...]

Evangelos Kazakos¹, Arsha Nagrani², Andrew Zisserman², Dima Damen¹•Institutions (2)

University of Bristol¹, University of Oxford²

02 Nov 2019

TL;DR: This work proposes a novel architecture for multi-modal temporal-binding, i.e. the combination of modalities within a range of temporal offsets, and demonstrates the importance of audio in egocentric vision, on per-class basis, for identifying actions as well as interacting objects.

...read moreread less

Abstract: We focus on multi-modal fusion for egocentric action recognition, and propose a novel architecture for multi-modal temporal-binding, i.e. the combination of modalities within a range of temporal offsets. We train the architecture with three modalities -- RGB, Flow and Audio -- and combine them with mid-level fusion alongside sparse temporal sampling of fused representations. In contrast with previous works, modalities are fused before temporal aggregation, with shared modality fusion weights over time. Our proposed architecture is trained end-to-end, outperforming individual modalities as well as late-fusion of modalities. We demonstrate the importance of audio in egocentric vision, on per-class basis, for identifying actions as well as interacting objects. Our method achieves state of the art results on both the seen and unseen test sets of the largest egocentric dataset: EPIC-Kitchens, on all metrics using the public leaderboard.

...read moreread less

Proceedings Article•DOI•

Learning to Discover Novel Visual Categories via Deep Transfer Clustering

[...]

Kai Han¹, Andrea Vedaldi¹, Andrew Zisserman¹•Institutions (1)

University of Oxford¹

01 Oct 2019

TL;DR: In this article, the authors extend Deep Embedded Clustering to a transfer learning setting and propose a method to estimate the number of classes in the unlabeled data, using knowledge from the known classes.

...read moreread less

Abstract: We consider the problem of discovering novel object categories in an image collection. While these images are unlabelled, we also assume prior knowledge of related but different image classes. We use such prior knowledge to reduce the ambiguity of clustering, and improve the quality of the newly discovered classes. Our contributions are twofold. The first contribution is to extend Deep Embedded Clustering to a transfer learning setting; we also improve the algorithm by introducing a representation bottleneck, temporal ensembling, and consistency. The second contribution is a method to estimate the number of classes in the unlabelled data. This also transfers knowledge from the known classes, using them as probes to diagnose different choices for the number of classes in the unlabelled subset. We thoroughly evaluate our method, substantially outperforming state-of-the-art techniques in a large number of benchmarks, including ImageNet, OmniGlot, CIFAR-100, CIFAR-10, and SVHN.

...read moreread less

Posted Content•

EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition

[...]

Evangelos Kazakos¹, Arsha Nagrani², Andrew Zisserman², Dima Damen¹•Institutions (2)

University of Bristol¹, University of Oxford²

22 Aug 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, the authors focus on multi-modal fusion for egocentric action recognition, i.e. the combination of modalities within a range of temporal offsets, and train the architecture with RGB, Flow and Audio, and combine them with mid-level fusion alongside sparse temporal sampling of fused representations.

...read moreread less

Abstract: We focus on multi-modal fusion for egocentric action recognition, and propose a novel architecture for multi-modal temporal-binding, i.e. the combination of modalities within a range of temporal offsets. We train the architecture with three modalities -- RGB, Flow and Audio -- and combine them with mid-level fusion alongside sparse temporal sampling of fused representations. In contrast with previous works, modalities are fused before temporal aggregation, with shared modality and fusion weights over time. Our proposed architecture is trained end-to-end, outperforming individual modalities as well as late-fusion of modalities. We demonstrate the importance of audio in egocentric vision, on per-class basis, for identifying actions as well as interacting objects. Our method achieves state of the art results on both the seen and unseen test sets of the largest egocentric dataset: EPIC-Kitchens, on all metrics using the public leaderboard.

...read moreread less

Proceedings Article•DOI•

Temporal Cycle-Consistency Learning

[...]

Debidatta Dwibedi¹, Yusuf Aytar, Jonathan Tompson¹, Pierre Sermanet¹, Andrew Zisserman² - Show less +1 more•Institutions (2)

Google¹, University of Oxford²

15 Jun 2019

TL;DR: It is shown that the learned embeddings enable few-shot classification of these action phases, significantly reducing the supervised training requirements; and TCC is complementary to other methods of self-supervised learning in videos, such as Shuffle and Learn and Time-Contrastive Networks.

...read moreread less

Abstract: We introduce a self-supervised representation learning method based on the task of temporal alignment between videos. The method trains a network using temporal cycle-consistency (TCC), a differentiable cycle-consistency loss that can be used to find correspondences across time in multiple videos. The resulting per-frame embeddings can be used to align videos by simply matching frames using nearest-neighbors in the learned embedding space. To evaluate the power of the embeddings, we densely label the Pouring and Penn Action video datasets for action phases. We show that (i) the learned embeddings enable few-shot classification of these action phases, significantly reducing the supervised training requirements; and (ii) TCC is complementary to other methods of self-supervised learning in videos, such as Shuffle and Learn and Time-Contrastive Networks. The embeddings are also used for a number of applications based on alignment (dense temporal correspondence) between video pairs, including transfer of metadata of synchronized modalities between videos (sounds, temporal semantic labels), synchronized playback of multiple videos, and anomaly detection. Project webpage: https://sites.google.com/view/temporal-cycle-consistency .

...read moreread less

Journal Article•DOI•

You Said That?: Synthesising Talking Faces from Audio

[...]

Amir Jamaludin¹, Joon Son Chung¹, Andrew Zisserman¹•Institutions (1)

University of Oxford¹

01 Dec 2019-International Journal of Computer Vision

TL;DR: An encoder–decoder convolutional neural network model is developed that uses a joint embedding of the face and audio to generate synthesised talking face video frames and proposed methods to re-dub videos by visually blending the generated face into the source video frame using a multi-stream CNN model.

...read moreread less

Abstract: We describe a method for generating a video of a talking face. The method takes still images of the target face and an audio speech segment as inputs, and generates a video of the target face lip synched with the audio. The method runs in real time and is applicable to faces and audio not seen at training time. To achieve this we develop an encoder–decoder convolutional neural network (CNN) model that uses a joint embedding of the face and audio to generate synthesised talking face video frames. The model is trained on unlabelled videos using cross-modal self-supervision. We also propose methods to re-dub videos by visually blending the generated face into the source video frame using a multi-stream CNN model.

...read moreread less

Journal Article•DOI•

Chimpanzee face recognition from videos in the wild using deep learning

[...]

Daniel Schofield¹, Arsha Nagrani¹, Andrew Zisserman¹, Misato Hayashi², Tetsuro Matsuzawa², Dora Biro¹, Susana Carvalho - Show less +3 more•Institutions (2)

University of Oxford¹, Primate Research Institute²

01 Sep 2019-Science Advances

TL;DR: A deep convolutional neural network approach is presented that provides a fully automated pipeline for face detection, tracking, and recognition of wild chimpanzees from long-term video records, and generates co-occurrence matrices to trace changes in the social network structure of an aging population.

...read moreread less

Abstract: Video recording is now ubiquitous in the study of animal behavior, but its analysis on a large scale is prohibited by the time and resources needed to manually process large volumes of data. We present a deep convolutional neural network (CNN) approach that provides a fully automated pipeline for face detection, tracking, and recognition of wild chimpanzees from long-term video records. In a 14-year dataset yielding 10 million face images from 23 individuals over 50 hours of footage, we obtained an overall accuracy of 92.5% for identity recognition and 96.2% for sex recognition. Using the identified faces, we generated co-occurrence matrices to trace changes in the social network structure of an aging population. The tools we developed enable easy processing and annotation of video datasets, including those from other species. Such automated analysis unveils the future potential of large-scale longitudinal video archives to address fundamental questions in behavior and conservation.

...read moreread less

Proceedings Article•

Sim2real transfer learning for 3D human pose estimation: motion to the rescue

[...]

Carl Doersch¹, Andrew Zisserman²•Institutions (2)

Google¹, University of Oxford²

04 Jul 2019

TL;DR: This paper shows that standard neural-network approaches, which perform poorly when trained on synthetic RGB images, can perform well when the data is pre-processed to extract cues about the person’s motion, notably as optical flow and the motion of 2D keypoints.

...read moreread less

Abstract: Synthetic visual data can provide practicically infinite diversity and rich labels, while avoiding ethical issues with privacy and bias. However, for many tasks, current models trained on synthetic data generalize poorly to real data. The task of 3D human pose estimation is a particularly interesting example of this sim2real problem, because learning-based approaches perform reasonably well given real training data, yet labeled 3D poses are extremely difficult to obtain in the wild, limiting scalability. In this paper, we show that standard neural-network approaches, which perform poorly when trained on synthetic RGB images, can perform well when the data is pre-processed to extract cues about the person’s motion, notably as optical flow and the motion of 2D keypoints. Therefore, our results suggest that motion can be a simple way to bridge a sim2real gap when video is available. We evaluate on the 3D Poses in the Wild dataset, the most challenging modern benchmark for 3D pose estimation, where we show full 3D mesh recovery that is on par with state-of-the-art methods trained on real 3D sequences, despite training only on synthetic humans from the SURREAL dataset.

...read moreread less

Posted Content•

Exploiting temporal context for 3D human pose estimation in the wild

[...]

Anurag Arnab¹, Carl Doersch², Andrew Zisserman²•Institutions (2)

University of Oxford¹, Google²

10 May 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, a bundle-adjustment-based algorithm for recovering accurate 3D human pose and meshes from monocular videos is presented, where reconstructing a person over an entire sequence gives extra constraints that can resolve ambiguities.

...read moreread less

Abstract: We present a bundle-adjustment-based algorithm for recovering accurate 3D human pose and meshes from monocular videos. Unlike previous algorithms which operate on single frames, we show that reconstructing a person over an entire sequence gives extra constraints that can resolve ambiguities. This is because videos often give multiple views of a person, yet the overall body shape does not change and 3D positions vary slowly. Our method improves not only on standard mocap-based datasets like Human 3.6M -- where we show quantitative improvements -- but also on challenging in-the-wild datasets such as Kinetics. Building upon our algorithm, we present a new dataset of more than 3 million frames of YouTube videos from Kinetics with automatically generated 3D poses and meshes. We show that retraining a single-frame 3D pose estimator on this data improves accuracy on both real-world and mocap data by evaluating on the 3DPW and HumanEVA datasets.

...read moreread less

Journal Article•DOI•

Non-contact physiological monitoring of preterm infants in the Neonatal Intensive Care Unit.

[...]

Mauricio Villarroel¹, Sitthichok Chaichulee¹, João Jorge¹, Sara Davis², Gabrielle Green², Carlos Arteta¹, Andrew Zisserman¹, Kenny McCormick², Peter J. Watkinson¹, Lionel Tarassenko¹ - Show less +6 more•Institutions (2)

University of Oxford¹, John Radcliffe Hospital²

12 Dec 2019

TL;DR: A clinical study to evaluate the accuracy and the proportion of time that heart rate and respiratory rate can be estimated from preterm infants using only a video camera in a clinical environment, without interfering with regular patient care and proposes signal quality assessment algorithms to discriminate between clinically acceptable and noisy signals.

...read moreread less

Abstract: The implementation of video-based non-contact technologies to monitor the vital signs of preterm infants in the hospital presents several challenges, such as the detection of the presence or the absence of a patient in the video frame, robustness to changes in lighting conditions, automated identification of suitable time periods and regions of interest from which vital signs can be estimated. We carried out a clinical study to evaluate the accuracy and the proportion of time that heart rate and respiratory rate can be estimated from preterm infants using only a video camera in a clinical environment, without interfering with regular patient care. A total of 426.6 h of video and reference vital signs were recorded for 90 sessions from 30 preterm infants in the Neonatal Intensive Care Unit (NICU) of the John Radcliffe Hospital in Oxford. Each preterm infant was recorded under regular ambient light during daytime for up to four consecutive days. We developed multi-task deep learning algorithms to automatically segment skin areas and to estimate vital signs only when the infant was present in the field of view of the video camera and no clinical interventions were undertaken. We propose signal quality assessment algorithms for both heart rate and respiratory rate to discriminate between clinically acceptable and noisy signals. The mean absolute error between the reference and camera-derived heart rates was 2.3 beats/min for over 76% of the time for which the reference and camera data were valid. The mean absolute error between the reference and camera-derived respiratory rate was 3.5 breaths/min for over 82% of the time. Accurate estimates of heart rate and respiratory rate could be derived for at least 90% of the time, if gaps of up to 30 seconds with no estimates were allowed.

...read moreread less

Proceedings Article•

Unsupervised Learning of Object Keypoints for Perception and Control

[...]

Tejas Kulkarni¹, Ankush Gupta¹, Catalin Ionescu¹, Sebastian Borgeaud¹, Malcolm Reynolds¹, Andrew Zisserman¹, Volodymyr Mnih¹ - Show less +3 more•Institutions (1)

Google¹

19 Jun 2019

TL;DR: Transporter as discussed by the authors is a neural network architecture for discovering geometric object representations in terms of keypoints or image-space coordinates, which can track objects and object parts across long time-horizons.

...read moreread less

Abstract: The study of object representations in computer vision has primarily focused on developing representations that are useful for image classification, object detection, or semantic segmentation as downstream tasks. In this work we aim to learn object representations that are useful for control and reinforcement learning (RL). To this end, we introduce Transporter, a neural network architecture for discovering concise geometric object representations in terms of keypoints or image-space coordinates. Our method learns from raw video frames in a fully unsupervised manner, by transporting learnt image features between video frames using a keypoint bottleneck. The discovered keypoints track objects and object parts across long time-horizons more accurately than recent similar methods. Furthermore, consistent long-term tracking enables two notable results in control domains -- (1) using the keypoint co-ordinates and corresponding image features as inputs enables highly sample-efficient reinforcement learning; (2) learning to explore by controlling keypoint locations drastically reduces the search space, enabling deep exploration (leading to states unreachable through random action exploration) without any extrinsic rewards.

...read moreread less

Posted Content•

Use What You Have: Video Retrieval Using Representations From Collaborative Experts

[...]

Yang Liu¹, Samuel Albanie¹, Arsha Nagrani¹, Andrew Zisserman¹•Institutions (1)

University of Oxford¹

31 Jul 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper proposes a collaborative experts model to aggregate information from these different pre-trained experts and assess the approach empirically on five retrieval benchmarks: MSR-VTT, LSMDC, MSVD, DiDeMo, and ActivityNet.

...read moreread less

Abstract: The rapid growth of video on the internet has made searching for video content using natural language queries a significant challenge. Human-generated queries for video datasets `in the wild' vary a lot in terms of degree of specificity, with some queries describing specific details such as the names of famous identities, content from speech, or text available on the screen. Our goal is to condense the multi-modal, extremely high dimensional information from videos into a single, compact video representation for the task of video retrieval using free-form text queries, where the degree of specificity is open-ended. For this we exploit existing knowledge in the form of pre-trained semantic embeddings which include 'general' features such as motion, appearance, and scene features from visual content. We also explore the use of more 'specific' cues from ASR and OCR which are intermittently available for videos and find that these signals remain challenging to use effectively for retrieval. We propose a collaborative experts model to aggregate information from these different pre-trained experts and assess our approach empirically on five retrieval benchmarks: MSR-VTT, LSMDC, MSVD, DiDeMo, and ActivityNet. Code and data can be found at this http URL. This paper contains a correction to results reported in the previous version.

...read moreread less

Posted Content•

Unsupervised Learning of Object Keypoints for Perception and Control

[...]

Tejas Kulkarni¹, Ankush Gupta¹, Catalin Ionescu¹, Sebastian Borgeaud¹, Malcolm Reynolds¹, Andrew Zisserman¹, Volodymyr Mnih¹ - Show less +3 more•Institutions (1)

Google¹

19 Jun 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: Transporter is introduced, a neural network architecture for discovering concise geometric object representations in terms of keypoints or image-space coordinates that helps track objects and object parts across long time-horizons more accurately than recent similar methods.

...read moreread less

Abstract: The study of object representations in computer vision has primarily focused on developing representations that are useful for image classification, object detection, or semantic segmentation as downstream tasks. In this work we aim to learn object representations that are useful for control and reinforcement learning (RL). To this end, we introduce Transporter, a neural network architecture for discovering concise geometric object representations in terms of keypoints or image-space coordinates. Our method learns from raw video frames in a fully unsupervised manner, by transporting learnt image features between video frames using a keypoint bottleneck. The discovered keypoints track objects and object parts across long time-horizons more accurately than recent similar methods. Furthermore, consistent long-term tracking enables two notable results in control domains -- (1) using the keypoint co-ordinates and corresponding image features as inputs enables highly sample-efficient reinforcement learning; (2) learning to explore by controlling keypoint locations drastically reduces the search space, enabling deep exploration (leading to states unreachable through random action exploration) without any extrinsic rewards.

...read moreread less

Posted Content•

End-to-End Learning of Visual Representations from Uncurated Instructional Videos

[...]

Antoine Miech¹, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev¹, Josef Sivic¹, Andrew Zisserman - Show less +2 more•Institutions (1)

French Institute for Research in Computer Science and Automation¹

13 Dec 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, a self-supervised learning approach, MIL-NCE, is proposed to address misalignments inherent in narrated videos without the need for any manual annotation.

...read moreread less

Abstract: Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video models still rely on manually annotated data. With the recent introduction of the HowTo100M dataset, narrated videos now offer the possibility of learning video representations without manual supervision. In this work we propose a new learning approach, MIL-NCE, capable of addressing misalignments inherent to narrated videos. With this approach we are able to learn strong video representations from scratch, without the need for any manual annotation. We evaluate our representations on a wide range of four downstream tasks over eight datasets: action recognition (HMDB-51, UCF-101, Kinetics-700), text-to-video retrieval (YouCook2, MSR-VTT), action localization (YouTube-8M Segments, CrossTask) and action segmentation (COIN). Our method outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.

...read moreread less

Posted Content•

A Hierarchical Probabilistic U-Net for Modeling Multi-Scale Ambiguities.

[...]

Simon A. A. Kohl¹, Bernardino Romera-Paredes¹, Klaus H. Maier-Hein¹, Danilo Jimenez Rezende¹, S. M. Ali Eslami¹, Pushmeet Kohli¹, Andrew Zisserman¹, Olaf Ronneberger² - Show less +4 more•Institutions (2)

Google¹, German Cancer Research Center²

30 May 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: The Hierarchical Probabilistic U-Net is proposed, a segmentation network with a conditional variational auto-encoder (cVAE) that uses a hierarchical latent space decomposition that automatically separates independent factors across scales, an inductive bias that is deemed beneficial in structured output prediction tasks beyond segmentation.

...read moreread less

Abstract: Medical imaging only indirectly measures the molecular identity of the tissue within each voxel, which often produces only ambiguous image evidence for target measures of interest, like semantic segmentation. This diversity and the variations of plausible interpretations are often specific to given image regions and may thus manifest on various scales, spanning all the way from the pixel to the image level. In order to learn a flexible distribution that can account for multiple scales of variations, we propose the Hierarchical Probabilistic U-Net, a segmentation network with a conditional variational auto-encoder (cVAE) that uses a hierarchical latent space decomposition. We show that this model formulation enables sampling and reconstruction of segmenations with high fidelity, i.e. with finely resolved detail, while providing the flexibility to learn complex structured distributions across scales. We demonstrate these abilities on the task of segmenting ambiguous medical scans as well as on instance segmentation of neurobiological and natural images. Our model automatically separates independent factors across scales, an inductive bias that we deem beneficial in structured output prediction tasks beyond segmentation.

...read moreread less

Posted Content•

ASR is all you need: cross-modal distillation for lip reading

[...]

Triantafyllos Afouras¹, Joon Son Chung¹, Andrew Zisserman¹•Institutions (1)

University of Oxford¹

28 Nov 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: It is shown that ground truth transcriptions are not necessary to train a lip reading system and how arbitrary amounts of unlabelled video data can be leveraged to improve performance.

...read moreread less

Abstract: The goal of this work is to train strong models for visual speech recognition without requiring human annotated ground truth data. We achieve this by distilling from an Automatic Speech Recognition (ASR) model that has been trained on a large-scale audio-only corpus. We use a cross-modal distillation method that combines Connectionist Temporal Classification (CTC) with a frame-wise cross-entropy loss. Our contributions are fourfold: (i) we show that ground truth transcriptions are not necessary to train a lip reading system; (ii) we show how arbitrary amounts of unlabelled video data can be leveraged to improve performance; (iii) we demonstrate that distillation significantly speeds up training; and, (iv) we obtain state-of-the-art results on the challenging LRS2 and LRS3 datasets for training only on publicly available data.

...read moreread less

Proceedings Article•DOI•

My lips are concealed: Audio-visual speech enhancement through obstructions

[...]

Triantafyllos Afouras¹, Joon Son Chung¹, Andrew Zisserman¹•Institutions (1)

University of Oxford¹

11 Jul 2019

TL;DR: In this article, a deep audio-visual speech enhancement network is proposed to separate a speaker's voice by conditioning on both the speaker's lip movements and/or a representation of their voice.

...read moreread less

Abstract: Our objective is an audio-visual model for separating a single speaker from a mixture of sounds such as other speakers and background noise. Moreover, we wish to hear the speaker even when the visual cues are temporarily absent due to occlusion. To this end we introduce a deep audio-visual speech enhancement network that is able to separate a speaker's voice by conditioning on both the speaker's lip movements and/or a representation of their voice. The voice representation can be obtained by either (i) enrollment, or (ii) by self-enrollment -- learning the representation on-the-fly given sufficient unobstructed visual input. The model is trained by blending audios, and by introducing artificial occlusions around the mouth region that prevent the visual modality from dominating. The method is speaker-independent, and we demonstrate it on real examples of speakers unheard (and unseen) during training. The method also improves over previous models in particular for cases of occlusion in the visual modality.

...read moreread less

Posted Content•

Temporal Cycle-Consistency Learning

[...]

Debidatta Dwibedi¹, Yusuf Aytar, Jonathan Tompson¹, Pierre Sermanet¹, Andrew Zisserman² - Show less +1 more•Institutions (2)

Google¹, University of Oxford²

16 Apr 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, a self-supervised representation learning method based on the task of temporal alignment between videos is introduced, which can be used to align videos by simply matching frames using the nearest neighbors in the learned embedding space.

...read moreread less

Abstract: We introduce a self-supervised representation learning method based on the task of temporal alignment between videos. The method trains a network using temporal cycle consistency (TCC), a differentiable cycle-consistency loss that can be used to find correspondences across time in multiple videos. The resulting per-frame embeddings can be used to align videos by simply matching frames using the nearest-neighbors in the learned embedding space. To evaluate the power of the embeddings, we densely label the Pouring and Penn Action video datasets for action phases. We show that (i) the learned embeddings enable few-shot classification of these action phases, significantly reducing the supervised training requirements; and (ii) TCC is complementary to other methods of self-supervised learning in videos, such as Shuffle and Learn and Time-Contrastive Networks. The embeddings are also used for a number of applications based on alignment (dense temporal correspondence) between video pairs, including transfer of metadata of synchronized modalities between videos (sounds, temporal semantic labels), synchronized playback of multiple videos, and anomaly detection. Project webpage: this https URL .

...read moreread less

Posted Content•

My lips are concealed: Audio-visual speech enhancement through obstructions

[...]

Triantafyllos Afouras¹, Joon Son Chung¹, Andrew Zisserman¹•Institutions (1)

University of Oxford¹

11 Jul 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: A deep audio-visual speech enhancement network that is able to separate a speaker's voice by conditioning on both the speaker's lip movements and/or a representation of their voice by learning the representation on-the-fly given sufficient unobstructed visual input.

...read moreread less

Abstract: Our objective is an audio-visual model for separating a single speaker from a mixture of sounds such as other speakers and background noise. Moreover, we wish to hear the speaker even when the visual cues are temporarily absent due to occlusion. To this end we introduce a deep audio-visual speech enhancement network that is able to separate a speaker's voice by conditioning on both the speaker's lip movements and/or a representation of their voice. The voice representation can be obtained by either (i) enrollment, or (ii) by self-enrollment -- learning the representation on-the-fly given sufficient unobstructed visual input. The model is trained by blending audios, and by introducing artificial occlusions around the mouth region that prevent the visual modality from dominating. The method is speaker-independent, and we demonstrate it on real examples of speakers unheard (and unseen) during training. The method also improves over previous models in particular for cases of occlusion in the visual modality.

...read moreread less

Journal Article•DOI•

Synthetic Humans for Action Recognition from Unseen Viewpoints

[...]

Gül Varol¹, Ivan Laptev², Cordelia Schmid², Andrew Zisserman³•Institutions (3)

École des ponts ParisTech¹, French Institute for Research in Computer Science and Automation², University of Oxford³

09 Dec 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work makes use of the recent advances in monocular 3D human body reconstruction from real action sequences to automatically render synthetic training videos for the action labels, and introduces a new data generation methodology that allows training of spatio-temporal CNNs for action classification.

...read moreread less

Abstract: Although synthetic training data has been shown to be beneficial for tasks such as human pose estimation, its use for RGB human action recognition is relatively unexplored. Our goal in this work is to answer the question whether synthetic humans can improve the performance of human action recognition, with a particular focus on generalization to unseen viewpoints. We make use of the recent advances in monocular 3D human body reconstruction from real action sequences to automatically render synthetic training videos for the action labels. We make the following contributions: (i) we investigate the extent of variations and augmentations that are beneficial to improving performance at new viewpoints. We consider changes in body shape and clothing for individuals, as well as more action relevant augmentations such as non-uniform frame sampling, and interpolating between the motion of individuals performing the same action; (ii) We introduce a new data generation methodology, SURREACT, that allows training of spatio-temporal CNNs for action classification; (iii) We substantially improve the state-of-the-art action recognition performance on the NTU RGB+D and UESTC standard human action multi-view benchmarks; Finally, (iv) we extend the augmentation approach to in-the-wild videos from a subset of the Kinetics dataset to investigate the case when only one-shot training data is available, and demonstrate improvements in this case as well.

...read moreread less

Proceedings Article•DOI•

LAEO-Net: Revisiting People Looking at Each Other in Videos

[...]

Manuel J. Marín-Jiménez¹, Vicky Kalogeiton², Pablo Medina-Suarez¹, Andrew Zisserman²•Institutions (2)

University of Córdoba (Spain)¹, University of Oxford²

01 Jun 2019

TL;DR: Li et al. as mentioned in this paper proposed a new deep CNN for determining people Looking At Each Other (LAEO) in videos, which takes spatio-temporal tracks as input and reasons about the whole track.

...read moreread less

Abstract: Capturing the ‘mutual gaze’ of people is essential for understanding and interpreting the social interactions between them. To this end, this paper addresses the problem of detecting people Looking At Each Other (LAEO) in video sequences. For this purpose, we propose LAEO-Net, a new deep CNN for determining LAEO in videos. In contrast to previous works, LAEO-Net takes spatio-temporal tracks as input and reasons about the whole track. It consists of three branches, one for each character’s tracked head and one for their relative position. Moreover, we introduce two new LAEO datasets: UCO-LAEO and AVA-LAEO. A thorough experimental evaluation demonstrates the ability of LAEO-Net to successfully determine if two people are LAEO and the temporal window where it happens. Our model achieves state-of-the-art results on the existing TVHID-LAEO video dataset, significantly outperforming previous approaches.

...read moreread less

Showing papers by "Andrew Zisserman published in 2019"