
Showing papers by Andrew Zisserman published in 2021


Proceedings Article
01 Jan 2021
TL;DR: In this article, the authors propose an end-to-end trainable model that is designed to take advantage of both large-scale image and video captioning datasets, and also provide a new video-text pretraining dataset WebVid-2M.
Abstract: Our objective in this work is video-text retrieval - in particular a joint embedding that enables efficient text-to-video retrieval. The challenges in this area include the design of the visual architecture and the nature of the training data, in that the available large scale video-text training datasets, such as HowTo100M, are noisy and hence competitive performance is achieved only at scale through large amounts of compute. We address both these challenges in this paper. We propose an end-to-end trainable model that is designed to take advantage of both large-scale image and video captioning datasets. Our model is an adaptation and extension of the recent ViT and Timesformer architectures, and consists of attention in both space and time. The model is flexible and can be trained on both image and video text datasets, either independently or in conjunction. It is trained with a curriculum learning schedule that begins by treating images as 'frozen' snapshots of video, and then gradually learns to attend to increasing temporal context when trained on video datasets. We also provide a new video-text pretraining dataset WebVid-2M, comprised of over two million videos with weak captions scraped from the internet. Despite training on datasets that are an order of magnitude smaller, we show that this approach yields state-of-the-art results on standard downstream video-retrieval benchmarks including MSR-VTT, MSVD, DiDeMo and LSMDC.

99 citations
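
The joint text-video embedding described in the abstract above is, at its core, a dual encoder trained with a symmetric contrastive loss, where an image can be fed to the space-time encoder as a single-frame 'frozen' video during the first stage of the curriculum. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch; the encoders, temperature and staging are placeholders, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(video_emb, text_emb, temperature=0.05):
    """Symmetric InfoNCE over a batch of matched video/text embedding pairs."""
    v = F.normalize(video_emb, dim=-1)            # (B, D)
    t = F.normalize(text_emb, dim=-1)             # (B, D)
    logits = v @ t.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +      # video -> text
                  F.cross_entropy(logits.t(), targets))   # text -> video

# Curriculum idea: early in training each "video" is a single frame (an image
# treated as a 'frozen' snapshot); later stages feed more frames of the clip to
# the space-time encoder, e.g. clip = frames[:, :num_frames_this_stage].
# Toy usage with random tensors standing in for the two encoders' outputs:
video_emb, text_emb = torch.randn(8, 256), torch.randn(8, 256)
loss = symmetric_infonce(video_emb, text_emb)
```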


Proceedings ArticleDOI
06 Apr 2021
TL;DR: In this article, the authors propose a method to mine hard samples and add them to a contrastive learning formulation automatically, achieving state-of-the-art performance on the VGG-Sound Source (VGG-SS) dataset.
Abstract: The objective of this work is to localize sound sources that are visible in a video without using manual annotations. Our key technical contribution is to show that, by training the network to explicitly discriminate challenging image fragments, even for images that do contain the object emitting the sound, we can significantly boost the localization performance. We do so elegantly by introducing a mechanism to mine hard samples and add them to a contrastive learning formulation automatically. We show that our algorithm achieves state-of-the-art performance on the popular Flickr SoundNet dataset. Furthermore, we introduce the VGG-Sound Source (VGG-SS) benchmark, a new set of annotations for the recently-introduced VGG-Sound dataset, where the sound sources visible in each video clip are explicitly marked with bounding box annotations. This dataset is 20 times larger than analogous existing ones, contains 5K videos spanning over 200 categories, and, differently from Flickr SoundNet, is video-based. On VGG-SS, we also show that our algorithm achieves state-of-the-art performance against several baselines. Code and datasets can be found at http://www.robots.ox.ac.uk/~vgg/research/lvs/.

98 citations
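
The key idea above is mining hard negatives from within the same image: spatial regions that clearly do not correspond to the sounding object are pushed away from the audio embedding, alongside the usual negatives taken from other samples in the batch. The snippet below is a rough, hypothetical sketch of such a contrastive formulation; the thresholds, pooling and loss shape are illustrative rather than the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def localisation_loss(vis_feats, aud_emb, pos_thresh=0.65, neg_thresh=0.4, tau=0.07):
    """
    vis_feats: (B, C, H, W) visual feature map; aud_emb: (B, C) audio embedding.
    Positives : high-similarity regions of the paired image.
    Hard negs : low-similarity regions of the *same* image.
    Easy negs : responses on the other images in the batch.
    """
    B = vis_feats.size(0)
    v = F.normalize(vis_feats, dim=1).flatten(2)          # (B, C, HW)
    a = F.normalize(aud_emb, dim=1)                       # (B, C)

    # similarity of every audio embedding to every location of every image
    sims = torch.einsum('bc,kcn->bkn', a, v)              # (B, B, HW)
    paired = sims[torch.arange(B), torch.arange(B)]       # (B, HW) own image

    pos_mask = (paired > pos_thresh).float()              # likely sounding region
    hard_mask = (paired < neg_thresh).float()             # same-image background

    eps = 1e-6
    pos = (paired * pos_mask).sum(-1) / (pos_mask.sum(-1) + eps)
    hard = (paired * hard_mask).sum(-1) / (hard_mask.sum(-1) + eps)
    off_diag = ~torch.eye(B, dtype=torch.bool, device=sims.device)
    easy = sims.mean(-1)[off_diag].view(B, B - 1)          # (B, B-1) other images

    logits = torch.cat([pos[:, None], hard[:, None], easy], dim=1) / tau
    targets = torch.zeros(B, dtype=torch.long, device=logits.device)  # positive is index 0
    return F.cross_entropy(logits, targets)

# toy usage
loss = localisation_loss(torch.randn(4, 128, 14, 14), torch.randn(4, 128))
```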


Proceedings ArticleDOI
01 Jun 2021
TL;DR: In this paper, the authors combine a fast dual-encoder model with a slow but accurate cross-attention model via distillation and reranking, equipping transformer-based models with a new fine-grained cross-attention architecture that improves retrieval accuracy while preserving scalability; plain cross-attention alone is often inapplicable in practice for large-scale retrieval given its cost for each sample at test time.
Abstract: Our objective is language-based search of large-scale image and video datasets. For this task, the approach that consists of independently mapping text and vision to a joint embedding space, a.k.a. dual encoders, is attractive as retrieval scales and is efficient for billions of images using approximate nearest neighbour search. An alternative approach of using vision-text transformers with cross-attention gives considerable improvements in accuracy over the joint embeddings, but is often inapplicable in practice for large-scale retrieval given the cost of the cross-attention mechanisms required for each sample at test time. This work combines the best of both worlds. We make the following three contributions. First, we equip transformer-based models with a new fine-grained cross-attention architecture, providing significant improvements in retrieval accuracy whilst preserving scalability. Second, we introduce a generic approach for combining a Fast dual encoder model with our Slow but accurate transformer-based model via distillation and reranking. Finally, we validate our approach on the Flickr30K image dataset where we show an increase in inference speed by several orders of magnitude while having results competitive to the state of the art. We also extend our method to the video domain, improving the state of the art on the VATEX dataset.

92 citations
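
Procedurally, the 'best of both worlds' recipe above amounts to: embed the whole gallery once with the fast dual encoder, shortlist the top-k items by dot product, then re-score only that shortlist with the slow cross-attention model (the distillation from slow to fast is not shown). A hypothetical sketch, with the two models and k as placeholders:

```python
import torch

@torch.no_grad()
def retrieve_fast_then_slow(query_text, gallery, fast_text_enc, fast_vis_enc,
                            slow_scorer, k=50):
    """Dual-encoder shortlist (scales to millions of items) + cross-attention rerank."""
    q = fast_text_enc([query_text])                    # (1, D)
    g = fast_vis_enc(gallery)                          # (N, D), can be precomputed/indexed
    coarse = (q @ g.t()).squeeze(0)                    # (N,) dot-product scores
    topk = coarse.topk(k).indices                      # candidate set

    # Slow path: the joint text-vision transformer scores only the k candidates.
    fine = torch.stack([slow_scorer(query_text, gallery[i]) for i in topk])
    order = fine.argsort(descending=True)
    return topk[order]                                 # gallery indices, best first

# toy usage with stand-in encoders
N, D = 100, 32
gallery = torch.randn(N, 16)                           # pretend raw items
fast_vis = lambda x: torch.randn(x.shape[0], D)
fast_txt = lambda q: torch.randn(len(q), D)
slow = lambda q, item: torch.randn(())                 # scalar relevance score
ranked = retrieve_fast_then_slow("a dog catching a frisbee", gallery,
                                 fast_txt, fast_vis, slow, k=10)
```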


Posted Content
Debidatta Dwibedi1, Yusuf Aytar1, Jonathan Tompson1, Pierre Sermanet1, Andrew Zisserman1 
TL;DR: Nearest-Neighbor Contrastive Learning of visual representations (NNCLR) as mentioned in this paper samples the nearest neighbors from the dataset in the latent space, and treats them as positives, which provides more semantic variations than pre-defined transformations.
Abstract: Self-supervised learning algorithms based on instance discrimination train encoders to be invariant to pre-defined transformations of the same instance. While most methods treat different views of the same image as positives for a contrastive loss, we are interested in using positives from other instances in the dataset. Our method, Nearest-Neighbor Contrastive Learning of visual Representations (NNCLR), samples the nearest neighbors from the dataset in the latent space, and treats them as positives. This provides more semantic variations than pre-defined transformations. We find that using the nearest-neighbor as positive in contrastive losses improves performance significantly on ImageNet classification, from 71.7% to 75.6%, outperforming previous state-of-the-art methods. On semi-supervised learning benchmarks we improve performance significantly when only 1% ImageNet labels are available, from 53.8% to 56.5%. On transfer learning benchmarks our method outperforms state-of-the-art methods (including supervised learning with ImageNet) on 8 out of 12 downstream datasets. Furthermore, we demonstrate empirically that our method is less reliant on complex data augmentations. We see a relative reduction of only 2.1% ImageNet Top-1 accuracy when we train using only random crops.

75 citations
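
The change NNCLR makes to a standard contrastive objective is small: instead of using the other augmented view of the same image as the positive, it uses that view's nearest neighbour in a support set of embeddings from previous batches. A minimal, hypothetical sketch of that substitution (queue size and temperature are illustrative):

```python
import torch
import torch.nn.functional as F

def nnclr_loss(z1, z2, support_set, temperature=0.1):
    """
    z1, z2      : (B, D) embeddings of two augmented views of the same images.
    support_set : (Q, D) queue of embeddings from previous batches.
    The positive for each sample is the nearest neighbour of z1 in the support
    set (rather than z1 itself), paired against z2.
    """
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    support = F.normalize(support_set, dim=-1)

    nn_idx = (z1 @ support.t()).argmax(dim=-1)       # nearest neighbour per sample
    positives = support[nn_idx]                      # (B, D)

    logits = positives @ z2.t() / temperature        # (B, B)
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

# toy usage; in practice the support set is a FIFO queue updated every step
loss = nnclr_loss(torch.randn(32, 128), torch.randn(32, 128), torch.randn(4096, 128))
```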


Proceedings ArticleDOI
19 Apr 2021
TL;DR: Temporal Query Network (TQN) as discussed by the authors casts fine-grained action classification as a query-response task: it attends to the relevant video segments for each query with a temporal attention mechanism, and can be trained using only the labels for each query.
Abstract: Our objective in this work is fine-grained classification of actions in untrimmed videos, where the actions may be temporally extended or may span only a few frames of the video. We cast this into a query-response mechanism, where each query addresses a particular question, and has its own response label set. We make the following four contributions: (i) We propose a new model—a Temporal Query Network—which enables the query-response functionality, and a structural understanding of fine-grained actions. It attends to relevant segments for each query with a temporal attention mechanism, and can be trained using only the labels for each query. (ii) We propose a new way—stochastic feature bank update—to train a network on videos of various lengths with the dense sampling required to respond to fine-grained queries. (iii) We compare the TQN to other architectures and text supervision methods, and analyze their pros and cons. Finally, (iv) we evaluate the method extensively on the FineGym and Diving48 benchmarks for fine-grained action classification and surpass the state-of-the-art using only RGB features. Project page: https://www.robots.ox.ac.uk/~vgg/research/tqn/.

64 citations
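
The query-response mechanism maps naturally onto a transformer decoder: one learnable query vector per question cross-attends over densely sampled clip features, and each query has its own classifier over its response label set. The sketch below is a hypothetical simplification of that structure, not the released TQN code; all sizes are made up.

```python
import torch
import torch.nn as nn

class TinyQueryNetwork(nn.Module):
    """One learnable query per fine-grained question, decoded over clip features."""
    def __init__(self, dim=256, num_queries=4, label_set_sizes=(10, 8, 5, 3)):
        super().__init__()
        assert len(label_set_sizes) == num_queries
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        # each query gets its own response head / label set
        self.heads = nn.ModuleList(nn.Linear(dim, n) for n in label_set_sizes)

    def forward(self, clip_feats):                     # (B, T, dim) dense clip features
        q = self.queries.unsqueeze(0).expand(clip_feats.size(0), -1, -1)
        responses = self.decoder(q, clip_feats)        # queries attend to segments
        return [head(responses[:, i]) for i, head in enumerate(self.heads)]

# toy usage: 4 queries answered from 64 temporal segments
logits_per_query = TinyQueryNetwork()(torch.randn(2, 64, 256))
```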


Journal ArticleDOI
TL;DR: In this paper, an approach called AutoNovel is proposed to address the problem of discovering novel classes in an image collection given labelled examples of other classes, combining self-supervised representation learning, rank statistics, and a joint objective over the labelled and unlabelled data.
Abstract: We tackle the problem of discovering novel classes in an image collection given labelled examples of other classes. We present a new approach called AutoNovel to address this problem by combining three ideas: (1) we suggest that the common approach of bootstrapping an image representation using the labeled data only introduces an unwanted bias, and that this can be avoided by using self-supervised learning to train the representation from scratch on the union of labelled and unlabelled data; (2) we use rank statistics to transfer the model's knowledge of the labelled classes to the problem of clustering the unlabelled images; and, (3) we train the data representation by optimizing a joint objective function on the labelled and unlabelled subsets of the data, improving both the supervised classification of the labelled data, and the clustering of the unlabelled data. Moreover, we propose a method to estimate the number of classes for the case where the number of new categories is not known a priori. We evaluate AutoNovel on standard classification benchmarks and substantially outperform current methods for novel category discovery. In addition, we also show that AutoNovel can be used for fully unsupervised image clustering, achieving promising results.

51 citations
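
The rank-statistics idea in point (2) can be made concrete as follows: two unlabelled images are treated as belonging to the same (unknown) class if the index sets of their top-k most activated feature dimensions agree, and this binary pseudo-label supervises a pairwise clustering loss. A hypothetical sketch, assuming a simple equality test on the top-k sets:

```python
import torch
import torch.nn.functional as F

def rank_stats_pairwise_labels(feats, k=5):
    """feats: (B, D). Pair (i, j) is a pseudo-positive if the index sets of their
    top-k feature dimensions are identical (a simple rank statistic)."""
    topk = feats.topk(k, dim=-1).indices                        # (B, k)
    sets = torch.zeros(feats.size(0), feats.size(1), device=feats.device)
    sets.scatter_(1, topk, 1.0)                                 # indicator of top-k dims
    overlap = sets @ sets.t()                                   # number of shared dims
    return (overlap == k).float()                               # (B, B) 0/1 targets

def pairwise_clustering_loss(cluster_logits, pair_targets):
    """BCE between the probability two samples share a cluster and the pseudo-label."""
    p = F.softmax(cluster_logits, dim=-1)                       # (B, K) soft assignments
    same_cluster_prob = (p @ p.t()).clamp(1e-6, 1 - 1e-6)
    return F.binary_cross_entropy(same_cluster_prob, pair_targets)

# toy usage on random "self-supervised" features of unlabelled images
feats = torch.randn(16, 512)
targets = rank_stats_pairwise_labels(feats)
loss = pairwise_clustering_loss(torch.randn(16, 10), targets)
```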


Proceedings ArticleDOI
06 Jun 2021
TL;DR: In this article, a two-stream convolutional network for audio recognition is proposed, which operates on time-frequency spectrogram inputs and achieves state-of-the-art results on both VGG-Sound and EPIC-KITCHENS-100 datasets.
Abstract: We propose a two-stream convolutional network for audio recognition, that operates on time-frequency spectrogram inputs. Following similar success in visual recognition, we learn Slow-Fast auditory streams with separable convolutions and multi-level lateral connections. The Slow pathway has high channel capacity while the Fast pathway operates at a fine-grained temporal resolution. We showcase the importance of our two-stream proposal on two diverse datasets: VGG-Sound and EPIC-KITCHENS-100, and achieve state-of-the-art results on both.

48 citations
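
To make the two-stream design concrete, the toy module below builds miniature Slow and Fast auditory pathways on a log-mel spectrogram: the Slow stream has many channels but a coarse temporal stride, the Fast stream keeps full temporal resolution with few channels, and a strided lateral connection fuses Fast into Slow. All layer sizes and strides are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ToyAuditorySlowFast(nn.Module):
    """Spectrogram input (B, 1, freq, time); Slow = high capacity / coarse time,
    Fast = low capacity / fine time, with a lateral Fast->Slow connection."""
    def __init__(self, slow_ch=64, fast_ch=8, num_classes=50, alpha=4):
        super().__init__()
        self.slow = nn.Conv2d(1, slow_ch, kernel_size=3, stride=(1, alpha), padding=1)
        self.fast = nn.Conv2d(1, fast_ch, kernel_size=3, stride=(1, 1), padding=1)
        # lateral connection: bring Fast features down to Slow's temporal rate
        self.lateral = nn.Conv2d(fast_ch, fast_ch, kernel_size=(1, 5),
                                 stride=(1, alpha), padding=(0, 2))
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(slow_ch + fast_ch + fast_ch, num_classes)

    def forward(self, spec):
        s = torch.relu(self.slow(spec))              # (B, slow_ch, F, T/alpha)
        f = torch.relu(self.fast(spec))              # (B, fast_ch, F, T)
        s = torch.cat([s, self.lateral(f)], dim=1)   # fuse fine timing into Slow
        feats = torch.cat([self.pool(s), self.pool(f)], dim=1).flatten(1)
        return self.fc(feats)

# toy usage: 128 mel bins, 400 spectrogram frames
logits = ToyAuditorySlowFast()(torch.randn(2, 1, 128, 400))
```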


Posted Content
TL;DR: The Perceiver as mentioned in this paper is a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets.
Abstract: Biological systems perceive the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, etc. The perception models used in deep learning on the other hand are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models. These priors introduce helpful inductive biases, but also lock models to individual modalities. In this paper we introduce the Perceiver - a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets. The model leverages an asymmetric attention mechanism to iteratively distill inputs into a tight latent bottleneck, allowing it to scale to handle very large inputs. We show that this architecture is competitive with or outperforms strong, specialized models on classification tasks across various modalities: images, point clouds, audio, video, and video+audio. The Perceiver obtains performance comparable to ResNet-50 and ViT on ImageNet without 2D convolutions by directly attending to 50,000 pixels. It is also competitive in all modalities in AudioSet.

47 citations
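
The scaling trick is the asymmetric attention: a small, learned latent array queries the (possibly enormous) input byte array with cross-attention, so compute grows linearly with input size, and the expensive self-attention happens only among the latents. A heavily simplified, hypothetical sketch (Fourier positional features and repeated cross-attends are omitted):

```python
import torch
import torch.nn as nn

class TinyPerceiver(nn.Module):
    """Cross-attend a small latent array to a large flat input, then self-attend."""
    def __init__(self, input_dim=3, dim=256, num_latents=128, depth=4, num_classes=1000):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.input_proj = nn.Linear(input_dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.self_blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(depth))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, byte_array):                     # (B, M, input_dim), M can be huge
        # NOTE: positional (Fourier) features would normally be concatenated here.
        x = self.input_proj(byte_array)
        z = self.latents.unsqueeze(0).expand(x.size(0), -1, -1)
        z, _ = self.cross_attn(query=z, key=x, value=x)   # O(M * num_latents), not O(M^2)
        for block in self.self_blocks:                    # cheap: latents only
            z = block(z)
        return self.head(z.mean(dim=1))

# 50,176 RGB "pixels" (a flattened 224x224 image), no 2D convolutions involved
logits = TinyPerceiver()(torch.randn(1, 224 * 224, 3))
```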


Proceedings ArticleDOI
01 Jun 2021
TL;DR: In this paper, a spatial attention mechanism (a co-attention module, CoAM) is proposed to determine correspondences between image pairs in the wild under large changes in illumination, viewpoint, context, and material.
Abstract: We propose a new approach to determine correspondences between image pairs in the wild under large changes in illumination, viewpoint, context, and material. While other approaches find correspondences between pairs of images by treating the images independently, we instead condition on both images to implicitly take account of the differences between them. To achieve this, we introduce (i) a spatial attention mechanism (a co-attention module, CoAM) for conditioning the learned features on both images, and (ii) a distinctiveness score used to choose the best matches at test time. CoAM can be added to standard architectures and trained using self-supervision or supervised data, and achieves a significant performance improvement under hard conditions, e.g. large viewpoint changes. We demonstrate that models using CoAM achieve state of the art or competitive results on a wide range of tasks: local matching, camera localization, 3D reconstruction, and image stylization.

36 citations
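
The co-attention module conditions each image's features on the other image: features from image A attend over the spatial grid of image B (and vice versa), and the attended result is fused back with the original features. The module below is a hypothetical, drop-in approximation of that idea for standard CNN feature maps, not the paper's exact CoAM.

```python
import torch
import torch.nn as nn

class CoAttentionModule(nn.Module):
    """Condition feature map A on feature map B via spatial cross-attention."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)

    def forward(self, feat_a, feat_b):                 # both (B, C, H, W)
        B, C, H, W = feat_a.shape
        a = feat_a.flatten(2).transpose(1, 2)          # (B, HW, C) queries
        b = feat_b.flatten(2).transpose(1, 2)          # (B, HW, C) keys/values
        attended, _ = self.attn(a, b, b)               # what in B matters for each A location
        attended = attended.transpose(1, 2).reshape(B, C, H, W)
        return self.fuse(torch.cat([feat_a, attended], dim=1))

# toy usage; the symmetric call coam(fb, fa) conditions B's features on A
coam = CoAttentionModule()
fa, fb = torch.randn(1, 256, 32, 32), torch.randn(1, 256, 32, 32)
conditioned_a = coam(fa, fb)
```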


Journal ArticleDOI
TL;DR: In this article, the authors make use of the recent advances in monocular 3D human body reconstruction from real action sequences to automatically render synthetic training videos for the action labels, and investigate the extent of variations and augmentations that are beneficial to improving performance at new viewpoints.
Abstract: Although synthetic training data has been shown to be beneficial for tasks such as human pose estimation, its use for RGB human action recognition is relatively unexplored. Our goal in this work is to answer the question whether synthetic humans can improve the performance of human action recognition, with a particular focus on generalization to unseen viewpoints. We make use of the recent advances in monocular 3D human body reconstruction from real action sequences to automatically render synthetic training videos for the action labels. We make the following contributions: (1) we investigate the extent of variations and augmentations that are beneficial to improving performance at new viewpoints. We consider changes in body shape and clothing for individuals, as well as more action relevant augmentations such as non-uniform frame sampling, and interpolating between the motion of individuals performing the same action; (2) We introduce a new data generation methodology, SURREACT, that allows training of spatio-temporal CNNs for action classification; (3) We substantially improve the state-of-the-art action recognition performance on the NTU RGB+D and UESTC standard human action multi-view benchmarks; Finally, (4) we extend the augmentation approach to in-the-wild videos from a subset of the Kinetics dataset to investigate the case when only one-shot training data is available, and demonstrate improvements in this case as well.

35 citations


Posted Content
TL;DR: TeachText as discussed by the authors leverages complementary cues from multiple text encoders to provide an enhanced supervisory signal to the retrieval model; the method also extends to video-side modalities, effectively reducing the number of modalities used at test time without compromising performance.
Abstract: In recent years, considerable progress on the task of text-video retrieval has been achieved by leveraging large-scale pretraining on visual and audio datasets to construct powerful video encoders. By contrast, despite the natural symmetry, the design of effective algorithms for exploiting large-scale language pretraining remains under-explored. In this work, we are the first to investigate the design of such algorithms and propose a novel generalized distillation method, TeachText, which leverages complementary cues from multiple text encoders to provide an enhanced supervisory signal to the retrieval model. Moreover, we extend our method to video side modalities and show that we can effectively reduce the number of used modalities at test time without compromising performance. Our approach advances the state of the art on several video retrieval benchmarks by a significant margin and adds no computational overhead at test time. Last but not least, we show an effective application of our method for eliminating noise from retrieval datasets. Code and data can be found at this https URL.
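
In outline, the generalized distillation works as follows: several teacher retrieval models, each built on a different pretrained text encoder, produce text-video similarity matrices, and the student is trained with its usual retrieval loss plus a term that pulls its own similarity matrix towards an aggregate of the teachers'. A hypothetical sketch of that extra term (the aggregation and weighting are illustrative):

```python
import torch
import torch.nn.functional as F

def teachtext_style_distillation(student_sims, teacher_sims_list):
    """
    student_sims      : (B, B) text-video similarity matrix from the student.
    teacher_sims_list : list of (B, B) matrices, one per teacher text encoder.
    The student is encouraged to reproduce the teachers' averaged similarities,
    which carry complementary cues from multiple language models.
    """
    target = torch.stack(teacher_sims_list).mean(dim=0)
    return F.mse_loss(student_sims, target)

# toy usage: total loss = retrieval loss + lambda * distillation term
student = torch.randn(16, 16)
teachers = [torch.randn(16, 16) for _ in range(3)]
loss = teachtext_style_distillation(student, teachers)
```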

Proceedings ArticleDOI
08 Jun 2021
TL;DR: This work proposes PASS (Pictures without humAns for Self-Supervision), an unlabelled, human-free dataset, showing that model pretraining is often possible while using safer data and providing the basis for a more robust evaluation of pretraining methods.
Abstract: Computer vision has long relied on ImageNet and other large datasets of images sampled from the Internet for pretraining models. However, these datasets have ethical and technical shortcomings, such as containing personal information taken without consent, unclear license usage, biases, and, in some cases, even problematic image content. On the other hand, state-of-the-art pretraining is nowadays obtained with unsupervised methods, meaning that labelled datasets such as ImageNet may not be necessary, or perhaps not even optimal, for model pretraining. We thus propose an unlabelled dataset PASS: Pictures without humAns for Self-Supervision. PASS only contains images with CC-BY license and complete attribution metadata, addressing the copyright issue. Most importantly, it contains no images of people at all, and also avoids other types of images that are problematic for data protection or ethics. We show that PASS can be used for pretraining with methods such as MoCo-v2, SwAV and DINO. In the transfer learning setting, it yields similar downstream performances to ImageNet pretraining even on tasks that involve humans, such as human pose estimation. PASS does not make existing datasets obsolete, as for instance it is insufficient for benchmarking. However, it shows that model pretraining is often possible while using safer data, and it also provides the basis for a more robust evaluation of pretraining methods.

Proceedings ArticleDOI
14 May 2021
TL;DR: In this article, a self-supervised approach is proposed to estimate an alpha matte and color image for each subject in a video, including the subject along with all its related time-varying scene elements.
Abstract: Computer vision is increasingly effective at segmenting objects in images and videos; however, scene effects related to the objects—shadows, reflections, generated smoke, etc.—are typically overlooked. Identifying such scene effects and associating them with the objects producing them is important for improving our fundamental understanding of visual scenes, and can also assist a variety of applications such as removing, duplicating, or enhancing objects in video. In this work, we take a step towards solving this novel problem of automatically associating objects with their effects in video. Given an ordinary video and a rough segmentation mask over time of one or more subjects of interest, we estimate an omnimatte for each subject—an alpha matte and color image that includes the subject along with all its related time-varying scene elements. Our model is trained only on the input video in a self-supervised manner, without any manual labels, and is generic—it produces omnimattes automatically for arbitrary objects and a variety of effects. We show results on real-world videos containing interactions between different types of subjects (cars, animals, people) and complex effects, ranging from semitransparent elements such as smoke and reflections, to fully opaque effects such as objects attached to the subject.
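
The self-supervision here comes from compositing: each subject's predicted colour-plus-alpha layer is composited back-to-front over a background layer, and the result must reconstruct the original frame. The sketch below shows a hypothetical version of that reconstruction objective only; the networks that predict the layers, and the flow and mask inputs they consume, are omitted.

```python
import torch
import torch.nn.functional as F

def composite_layers(background, layers):
    """background: (B, 3, H, W); layers: list of (rgb (B,3,H,W), alpha (B,1,H,W)),
    ordered back to front. Standard 'over' compositing."""
    out = background
    for rgb, alpha in layers:
        out = alpha * rgb + (1.0 - alpha) * out
    return out

def omnimatte_style_loss(frame, background, layers, alpha_weight=0.01):
    recon = composite_layers(background, layers)
    rec_loss = F.l1_loss(recon, frame)                     # reconstruct the input video
    # encourage each matte to stay sparse / close to the rough input mask
    sparsity = sum(alpha.abs().mean() for _, alpha in layers)
    return rec_loss + alpha_weight * sparsity

# toy usage: one subject layer composited over a background layer
frame = torch.rand(1, 3, 64, 64)
bg = torch.rand(1, 3, 64, 64)
layers = [(torch.rand(1, 3, 64, 64), torch.rand(1, 1, 64, 64))]
loss = omnimatte_style_loss(frame, bg, layers)
```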

Proceedings Article
Charig Yang1, Hala Lamdouar1, Erika Lu1, Andrew Zisserman1, Weidi Xie1 
15 Apr 2021
TL;DR: In this article, a simple variant of the Transformer is introduced to segment optical flow frames into primary objects and the background, which achieves superior or comparable results to previous state-of-the-art self-supervised methods, while being an order of magnitude faster.
Abstract: Animals have evolved highly functional visual systems to understand motion, assisting perception even under complex environments. In this paper, we work towards developing a computer vision system able to segment objects by exploiting motion cues, i.e. motion segmentation. We make the following contributions: First, we introduce a simple variant of the Transformer to segment optical flow frames into primary objects and the background. Second, we train the architecture in a self-supervised manner, i.e. without using any manual annotations. Third, we analyze several critical components of our method and conduct thorough ablation studies to validate their necessity. Fourth, we evaluate the proposed architecture on public benchmarks (DAVIS2016, SegTrackv2, and FBMS59). Despite using only optical flow as input, our approach achieves superior or comparable results to previous state-of-the-art self-supervised methods, while being an order of magnitude faster. We additionally evaluate on a challenging camouflage dataset (MoCA), significantly outperforming the other self-supervised approaches, and comparing favourably to the top supervised approach, highlighting the importance of motion cues, and the potential bias towards visual appearance in existing video segmentation models.

Proceedings ArticleDOI
01 Jun 2021
TL;DR: In this article, a Transformer model is trained to ingest a continuous signing stream and output a sequence of written tokens on a large-scale collection of signing footage with weakly-aligned subtitles.
Abstract: The objective of this work is to annotate sign instances across a broad vocabulary in continuous sign language. We train a Transformer model to ingest a continuous signing stream and output a sequence of written tokens on a large-scale collection of signing footage with weakly-aligned subtitles. We show that through this training it acquires the ability to attend to a large vocabulary of sign instances in the input sequence, enabling their localisation. Our contributions are as follows: (1) we demonstrate the ability to leverage large quantities of continuous signing videos with weakly-aligned subtitles to localise signs in continuous sign language; (2) we employ the learned attention to automatically generate hundreds of thousands of annotations for a large sign vocabulary; (3) we collect a set of 37K manually verified sign instances across a vocabulary of 950 sign classes to support our study of sign language recognition; (4) by training on the newly annotated data from our method, we outperform the prior state of the art on the BSL-1K sign language recognition benchmark.

Journal ArticleDOI
TL;DR: In this paper, the authors argue that large video datasets of wild animal behavior are crucial to producing longitudinal research and accelerating conservation efforts; however, large-scale behavior analyses continue to be severely constrained.
Abstract: Large video datasets of wild animal behavior are crucial to produce longitudinal research and accelerate conservation efforts; however, large-scale behavior analyses continue to be severely constra...

Proceedings ArticleDOI
20 May 2021
TL;DR: In this paper, a Multi-Modal High-Precision Clustering algorithm for person-clustering in videos using cues from several modalities (face, body, and voice) is proposed.
Abstract: The objective of this work is person-clustering in videos: grouping characters according to their identity. Previous methods focus on the narrower task of face-clustering, and for the most part ignore other cues such as the person's voice, their overall appearance (hair, clothes, posture), and the editing structure of the videos. Similarly, most current datasets evaluate only the task of face-clustering, rather than person-clustering. This limits their applicability to downstream applications such as story understanding which require person-level, rather than only face-level, reasoning. In this paper we make contributions to address both these deficiencies: first, we introduce a Multi-Modal High-Precision Clustering algorithm for person-clustering in videos using cues from several modalities (face, body, and voice). Second, we introduce a Video Person-Clustering dataset, for evaluating multi-modal person-clustering. It contains body-tracks for each annotated character, facetracks when visible, and voice-tracks when speaking, with their associated features. The dataset is by far the largest of its kind, and covers films and TV-shows representing a wide range of demographics. Finally, we show the effectiveness of using multiple modalities for person-clustering, explore the use of this new broad task for story understanding through character co-occurrences, and achieve a new state of the art on all available datasets for face and person-clustering.

Proceedings ArticleDOI
06 Jun 2021
TL;DR: The QuerYD dataset as mentioned in this paper is a new large-scale dataset for retrieval and event localisation in video; it contains highly detailed, temporally aligned audio and text annotations, and can be used to train and benchmark strong models for retrieval and event localisation.
Abstract: We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video. A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description of the visual content. The dataset is based on YouDescribe [1], a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos. This ever-growing collection of videos contains highly detailed, temporally aligned audio and text annotations. The content descriptions are more relevant than dialogue, and more detailed than previous description attempts, which can be observed to contain many superficial or uninformative descriptions. To demonstrate the utility of the QuerYD dataset, we show that it can be used to train and benchmark strong models for retrieval and event localisation. Data, code and models are made publicly available, and we hope that QuerYD inspires further research on video understanding with written and spoken natural language.

Posted Content
TL;DR: Perceiver IO as mentioned in this paper proposes to learn to flexibly query the model's latent space to produce outputs of arbitrary size and semantics, and achieves state-of-the-art results on tasks with highly structured output spaces.
Abstract: The recently-proposed Perceiver model obtains good results on several domains (images, audio, multimodal, point clouds) while scaling linearly in compute and memory with the input size. While the Perceiver supports many kinds of inputs, it can only produce very simple outputs such as class scores. Perceiver IO overcomes this limitation without sacrificing the original's appealing properties by learning to flexibly query the model's latent space to produce outputs of arbitrary size and semantics. Perceiver IO still decouples model depth from data size and still scales linearly with data size, but now with respect to both input and output sizes. The full Perceiver IO model achieves strong results on tasks with highly structured output spaces, such as natural language and visual understanding, StarCraft II, and multi-task and multi-modal domains. As highlights, Perceiver IO matches a Transformer-based BERT baseline on the GLUE language benchmark without the need for input tokenization and achieves state-of-the-art performance on Sintel optical flow estimation.
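
Perceiver IO's addition is on the output side: once the encoder has distilled the inputs into a latent array, a set of output query vectors (one per desired output element, e.g. per pixel or per token) cross-attends to the latents to decode outputs of arbitrary size. A hypothetical sketch of that decode step, which would sit on top of a Perceiver-style encoder like the one sketched earlier:

```python
import torch
import torch.nn as nn

class PerceiverIOStyleDecoder(nn.Module):
    """Decode arbitrary-sized outputs by cross-attending output queries to latents."""
    def __init__(self, latent_dim=256, query_dim=256, out_dim=10):
        super().__init__()
        self.attn = nn.MultiheadAttention(query_dim, num_heads=1,
                                          kdim=latent_dim, vdim=latent_dim,
                                          batch_first=True)
        self.out = nn.Linear(query_dim, out_dim)

    def forward(self, output_queries, latents):
        # output_queries: (B, O, query_dim) -- O is whatever the task requires
        # latents:        (B, N, latent_dim) from the latent-bottleneck encoder
        decoded, _ = self.attn(output_queries, latents, latents)
        return self.out(decoded)                        # (B, O, out_dim)

# toy usage: decode a dense 2-channel output (e.g. flow-like) for a 32x32 map
decoder = PerceiverIOStyleDecoder(out_dim=2)
queries = torch.randn(1, 32 * 32, 256)                  # learned/positional output queries
dense_out = decoder(queries, torch.randn(1, 128, 256))
```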

Book ChapterDOI
27 Sep 2021
TL;DR: In this paper, a multi-modal image-matching contrastive framework is proposed to learn to match different-modality scans of the same subject with high accuracy, and the correspondences learned during this contrastive training step can be used to perform automatic cross-modal scan registration in a completely unsupervised manner.
Abstract: This paper explores the use of self-supervised deep learning in medical imaging in cases where two scan modalities are available for the same subject. Specifically, we use a large publicly-available dataset of over 20,000 subjects from the UK Biobank with both whole body Dixon technique magnetic resonance (MR) scans and also dual-energy x-ray absorptiometry (DXA) scans. We make three contributions: (i) We introduce a multi-modal image-matching contrastive framework, that is able to learn to match different-modality scans of the same subject with high accuracy. (ii) Without any adaption, we show that the correspondences learnt during this contrastive training step can be used to perform automatic cross-modal scan registration in a completely unsupervised manner. (iii) Finally, we use these registrations to transfer segmentation maps from the DXA scans to the MR scans where they are used to train a network to segment anatomical regions without requiring ground-truth MR examples. To aid further research, our code is publicly available (https://github.com/rwindsor1/biobank-self-supervised-alignment).

Journal ArticleDOI
TL;DR: This paper proposes LAEO-Net++, a new deep CNN for determining LAEO in videos, which achieves state-of-the-art results on the existing TVHID-LAEO video dataset, significantly outperforming previous approaches.
Abstract: Capturing the ‘mutual gaze’ of people is essential for understanding and interpreting the social interactions between them. To this end, this paper addresses the problem of detecting people Looking At Each Other (LAEO) in video sequences. For this purpose, we propose LAEO-Net++, a new deep CNN for determining LAEO in videos. In contrast to previous works, LAEO-Net++ takes spatio-temporal tracks as input and reasons about the whole track. It consists of three branches, one for each character's tracked head and one for their relative position. Moreover, we introduce two new LAEO datasets: UCO-LAEO and AVA-LAEO. A thorough experimental evaluation demonstrates the ability of LAEO-Net++ to successfully determine if two people are LAEO and the temporal window where it happens. Our model achieves state-of-the-art results on the existing TVHID-LAEO video dataset, significantly outperforming previous approaches. Finally, we apply LAEO-Net++ to a social network, where we automatically infer the social relationship between pairs of people based on the frequency and duration that they LAEO, and show that LAEO can be a useful tool for guided search of human interactions in videos.

Posted Content
TL;DR: In this article, a self-supervised learning framework for video is proposed, where one view has access to a narrow temporal window of the video while the other view has a broad access to the video content.
Abstract: Most successful self-supervised learning methods are trained to align the representations of two independent views from the data. State-of-the-art methods in video are inspired by image techniques, where these two views are similarly extracted by cropping and augmenting the resulting crop. However, these methods miss a crucial element in the video domain: time. We introduce BraVe, a self-supervised learning framework for video. In BraVe, one of the views has access to a narrow temporal window of the video while the other view has a broad access to the video content. Our models learn to generalise from the narrow view to the general content of the video. Furthermore, BraVe processes the views with different backbones, enabling the use of alternative augmentations or modalities into the broad view such as optical flow, randomly convolved RGB frames, audio or their combinations. We demonstrate that BraVe achieves state-of-the-art results in self-supervised representation learning on standard video and audio classification benchmarks including UCF101, HMDB51, Kinetics, ESC-50 and AudioSet.
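
A loose approximation of the narrow-to-broad learning signal is a BYOL-style regression: the narrow view (a short clip) is processed by one backbone and must predict, through a predictor head, the representation that a separate backbone produces for the broad view (a long clip, or another modality such as flow or audio), and vice versa. The sketch below is a hypothetical simplification, not the BraVe training recipe; the backbones are stand-in MLPs over pooled features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def regression_loss(pred, target):
    """Normalised L2 regression; the target branch receives no gradient."""
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(target.detach(), dim=-1)
    return (2 - 2 * (pred * target).sum(dim=-1)).mean()

class ToyNarrowBroad(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        # different backbones for the two views (here: toy linear maps)
        self.narrow_backbone = nn.Linear(512, dim)
        self.broad_backbone = nn.Linear(512, dim)    # could ingest flow/audio instead
        self.predict_broad = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.predict_narrow = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, narrow_view, broad_view):
        zn = self.narrow_backbone(narrow_view)       # short temporal window
        zb = self.broad_backbone(broad_view)         # long window / other modality
        # predict each view's representation from the other one
        return (regression_loss(self.predict_broad(zn), zb) +
                regression_loss(self.predict_narrow(zb), zn))

# toy usage with pooled features standing in for the two views
loss = ToyNarrowBroad()(torch.randn(8, 512), torch.randn(8, 512))
```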

Proceedings ArticleDOI
06 Jun 2021
TL;DR: The authors investigate the performance of popular speaker recognition models on speech segments from movies, where often actors intentionally disguise their voice to play a character, and demonstrate that both speaker verification and identification performance drops steeply on this new data, showing the challenge in transferring models across domains.
Abstract: The goal of this work is to investigate the performance of popular speaker recognition models on speech segments from movies, where often actors intentionally disguise their voice to play a character. We make the following three contributions: (i) We collect a novel, challenging speaker recognition dataset called VoxMovies, with speech for 856 identities from almost 4000 movie clips. VoxMovies contains utterances with varying emotion, accents and background noise, and therefore comprises an entirely different domain to the interview-style, emotionally calm utterances in current speaker recognition datasets such as VoxCeleb; (ii) We provide a number of domain adaptation evaluation sets, and benchmark the performance of state-of-the-art speaker recognition models on these evaluation pairs. We demonstrate that both speaker verification and identification performance drops steeply on this new data, showing the challenge in transferring models across domains; and finally (iii) We show that simple domain adaptation paradigms improve performance, but there is still large room for improvement.

Proceedings ArticleDOI
10 Feb 2021
TL;DR: In this article, a method for automatically labelling all faces in video archives, such as TV broadcasts, by combining multiple evidence sources and multiple modalities (visual and audio) is presented.
Abstract: We present a method for automatically labelling all faces in video archives, such as TV broadcasts, by combining multiple evidence sources and multiple modalities (visual and audio). We target the problem of ever-growing online video archives, where an effective, scalable indexing solution cannot require a user to provide manual annotation or supervision. To this end, we make three key contributions: (1) We provide a novel, simple, method for determining if a person is famous or not using image-search engines. In turn this enables a face-identity model to be built reliably and robustly, and used for high precision automatic labelling; (2) We show that even for less-famous people, image-search engines can then be used for corroborative evidence to accurately label faces that are named in the scene or the speech; (3) Finally, we quantitatively demonstrate the benefits of our approach on different video domains and test settings, such as TV shows and news broadcasts. Our method works across three disparate datasets without any explicit domain adaptation, and sets new state-of-the-art results on all the public benchmarks.


Proceedings ArticleDOI
01 Oct 2021
TL;DR: LSD-C as mentioned in this paper uses pairwise connections in the feature space between the samples of the minibatch based on a similarity metric and regroups in clusters the connected samples and enforces a linear separation between clusters.
Abstract: We present LSD-C, a novel method to identify clusters in an unlabeled dataset. Our algorithm first establishes pairwise connections in the feature space between the samples of the minibatch based on a similarity metric. Then it regroups in clusters the connected samples and enforces a linear separation between clusters. This is achieved by using the pairwise connections as targets together with a binary cross-entropy loss on the predictions that the associated pairs of samples belong to the same cluster. This way, the feature representation of the network will evolve such that similar samples in this feature space will belong to the same linearly separated cluster. Our method draws inspiration from recent semi-supervised learning practice and proposes to combine our clustering algorithm with self-supervised pretraining and strong data augmentation. We show that our approach significantly outperforms competitors on popular public image benchmarks including CIFAR 10/100, STL 10 and MNIST, as well as the document classification dataset Reuters 10K.
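
The LSD-C loss is compact enough to write down directly: connect minibatch samples whose features are similar (here via a simple cosine-similarity threshold), then apply a binary cross-entropy between those pairwise connections and the probability, under the cluster head, that the two samples fall into the same cluster. A hypothetical sketch (the paper also considers other connection rules, e.g. kNN and ranking statistics):

```python
import torch
import torch.nn.functional as F

def lsdc_style_loss(features, cluster_logits, sim_threshold=0.8):
    """
    features       : (B, D) representations used to establish pairwise connections.
    cluster_logits : (B, K) per-sample cluster predictions from a linear head.
    """
    f = F.normalize(features, dim=-1)
    connections = (f @ f.t() > sim_threshold).float()         # (B, B) pairwise targets

    p = F.softmax(cluster_logits, dim=-1)                     # soft cluster assignments
    same_cluster = (p @ p.t()).clamp(1e-6, 1 - 1e-6)          # P(i and j share a cluster)
    return F.binary_cross_entropy(same_cluster, connections)

# toy usage: 32 samples, 10 candidate clusters
loss = lsdc_style_loss(torch.randn(32, 256), torch.randn(32, 10))
```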

Posted Content
TL;DR: NeRF-ID as mentioned in this paper introduces a differentiable module that learns to propose samples and their importance for the fine network, considers and compares multiple alternatives for its neural architecture, and puts forward an effective pre-training strategy.
Abstract: Neural radiance fields (NeRF) methods have demonstrated impressive novel view synthesis performance. The core approach is to render individual rays by querying a neural network at points sampled along the ray to obtain the density and colour of the sampled points, and integrating this information using the rendering equation. Since dense sampling is computationally prohibitive, a common solution is to perform coarse-to-fine sampling. In this work we address a clear limitation of the vanilla coarse-to-fine approach -- that it is based on a heuristic and not trained end-to-end for the task at hand. We introduce a differentiable module that learns to propose samples and their importance for the fine network, and consider and compare multiple alternatives for its neural architecture. Training the proposal module from scratch can be unstable due to lack of supervision, so an effective pre-training strategy is also put forward. The approach, named `NeRF in detail' (NeRF-ID), achieves superior view synthesis quality over NeRF and the state-of-the-art on the synthetic Blender benchmark and on par or better performance on the real LLFF-NeRF scenes. Furthermore, by leveraging the predicted sample importance, a 25% saving in computation can be achieved without significantly sacrificing the rendering quality.
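
The heuristic being replaced is NeRF's inverse-transform sampling of the fine points from the coarse weights; instead, a small network proposes the fine sample locations and their importance directly from the coarse pass and is trained end-to-end through the rendering loss. The module below is a strongly simplified, hypothetical proposer that maps coarse depths and densities along a ray to ordered fine sample depths in [near, far]; it is not the paper's architecture.

```python
import torch
import torch.nn as nn

class ToySampleProposer(nn.Module):
    """Propose fine sample depths per ray from the coarse pass (differentiable)."""
    def __init__(self, num_coarse=64, num_fine=128, hidden=256):
        super().__init__()
        # input: coarse depths and densities along the ray, concatenated
        self.mlp = nn.Sequential(
            nn.Linear(2 * num_coarse, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * num_fine))        # per fine sample: position + importance

    def forward(self, coarse_t, coarse_density, near, far):
        # coarse_t, coarse_density: (R, num_coarse) for R rays
        x = torch.cat([coarse_t, coarse_density], dim=-1)
        pos_logits, importance = self.mlp(x).chunk(2, dim=-1)
        fine_t = near + (far - near) * torch.sigmoid(pos_logits)   # (R, num_fine)
        fine_t, order = fine_t.sort(dim=-1)          # rendering expects ordered samples
        importance = torch.gather(importance, -1, order).softmax(dim=-1)
        return fine_t, importance                    # feed fine_t to the fine NeRF

# toy usage: 1024 rays, 64 coarse samples each, scene bounds [2, 6]
proposer = ToySampleProposer()
fine_t, weights = proposer(torch.rand(1024, 64), torch.rand(1024, 64), near=2.0, far=6.0)
```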

Proceedings ArticleDOI
06 Jun 2021
TL;DR: The SeeHear dataset as discussed by the authors contains 90 hours of British Sign Language (BSL) content featuring more than 1000 signers, including interviews, monologues and debates, annotated with 35k active signing tracks, with corresponding signer identities and subtitles, and 40k automatically localised sign labels.
Abstract: In this work, we propose a framework to collect a large-scale, diverse sign language dataset that can be used to train automatic sign language recognition models.The first contribution of this work is SDTrack, a generic method for signer tracking and diarisation in the wild. Our second contribution is SeeHear, a dataset of 90 hours of British Sign Language (BSL) content featuring more than 1000 signers, and including interviews, monologues and debates. Using SDTrack, the SeeHear dataset is annotated with 35K active signing tracks, with corresponding signer identities and subtitles, and 40K automatically localised sign labels. As a third contribution, we provide benchmarks for signer diarisation and sign recognition on SeeHear.