
Showing papers by "Jakob Uszkoreit" published in 2020


Posted Content
TL;DR: Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
Abstract: While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

12,690 citations
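The architectural change described above is concentrated in the input pipeline: an image becomes a sequence of flattened patches that a standard Transformer encoder can consume. Below is a minimal numpy sketch of that step; the 16x16 patch size and 768-dimensional embedding match the ViT-Base configuration, but the projection, class token, and position embeddings are untrained random placeholders rather than the paper's learned parameters.

```python
import numpy as np

def image_to_patch_embeddings(image, patch_size=16, dim=768, seed=0):
    """Split an (H, W, C) image into non-overlapping patches, flatten each
    patch, and linearly project it into the Transformer's model dimension."""
    rng = np.random.default_rng(seed)
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    n = (h // patch_size) * (w // patch_size)              # number of patches
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(n, -1)
    proj = rng.standard_normal((patch_size * patch_size * c, dim)) * 0.02
    tokens = patches @ proj                                # (n, dim) patch tokens
    cls_token = rng.standard_normal((1, dim)) * 0.02       # learnable [class] token
    pos = rng.standard_normal((n + 1, dim)) * 0.02         # learned position embeddings
    return np.concatenate([cls_token, tokens], axis=0) + pos

seq = image_to_patch_embeddings(np.zeros((224, 224, 3)))
print(seq.shape)  # (197, 768): 14 x 14 patches plus the class token
```

Everything past this point is a standard Transformer encoder; classification is read off the class token's final representation.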


Proceedings Article
26 Jun 2020
TL;DR: An architectural component that interfaces with perceptual representations such as the output of a convolutional neural network and produces a set of task-dependent abstract representations which are exchangeable and can bind to any object in the input by specializing through a competitive procedure over multiple rounds of attention is presented.
Abstract: Learning object-centric representations of complex scenes is a promising step towards enabling efficient abstract reasoning from low-level perceptual features. Yet, most deep learning approaches learn distributed representations that do not capture the compositional properties of natural scenes. In this paper, we present the Slot Attention module, an architectural component that interfaces with perceptual representations such as the output of a convolutional neural network and produces a set of task-dependent abstract representations which we call slots. These slots are exchangeable and can bind to any object in the input by specializing through a competitive procedure over multiple rounds of attention. We empirically demonstrate that Slot Attention can extract object-centric representations that enable generalization to unseen compositions when trained on unsupervised object discovery and supervised property prediction tasks.

372 citations
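The competitive procedure mentioned in the abstract is compact enough to sketch. The numpy toy below keeps the algorithm's defining step, normalizing attention over the slot axis so that slots compete for input features, but omits the paper's learned query/key/value projections, layer normalization, and GRU update (a plain additive update stands in for the latter).

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(inputs, num_slots=4, num_iters=3, seed=0):
    """inputs: (n, d) perceptual features, e.g. a flattened CNN feature map."""
    n, d = inputs.shape
    rng = np.random.default_rng(seed)
    slots = rng.standard_normal((num_slots, d))        # randomly initialized slots
    for _ in range(num_iters):
        logits = slots @ inputs.T / np.sqrt(d)         # (num_slots, n)
        attn = softmax(logits, axis=0)                 # softmax over SLOTS: competition
        attn = attn / attn.sum(axis=1, keepdims=True)  # weighted mean over inputs
        slots = slots + attn @ inputs                  # additive stand-in for the GRU
    return slots

slots = slot_attention(np.random.default_rng(1).standard_normal((64, 32)))
print(slots.shape)  # (4, 32): one exchangeable representation per slot
```

Because the slots are initialized randomly and updated symmetrically, no slot is tied to a particular object class, which is what makes them exchangeable.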


Journal ArticleDOI
TL;DR: A deep-learning system, CUBBITT, approaches the quality of human translation and even surpasses it in adequacy in certain circumstances, suggesting that deep learning may have the potential to replace humans in applications where conservation of meaning is the primary aim.
Abstract: The quality of human translation was long thought to be unattainable for computer translation systems. In this study, we present a deep-learning system, CUBBITT, which challenges this view. In a context-aware blind evaluation by human judges, CUBBITT significantly outperformed professional-agency English-to-Czech news translation in preserving text meaning (translation adequacy). While human translation is still rated as more fluent, CUBBITT is shown to be substantially more fluent than previous state-of-the-art systems. Moreover, most participants of a Translation Turing test struggle to distinguish CUBBITT translations from human translations. This work approaches the quality of human translation and even surpasses it in adequacy in certain circumstances. This suggests that deep learning may have the potential to replace humans in applications where conservation of meaning is the primary aim. The quality of human language translation has been thought to be unattainable by computer translation systems. Here the authors present CUBBITT, a deep learning system that outperforms professional human translators in retaining text meaning in English-to-Czech news translation, and validate the system on English-French and English-Polish language pairs.

156 citations


Proceedings Article
30 Apr 2020
TL;DR: It is shown that conceptually simple autoregressive video generation models based on a three-dimensional self-attention mechanism achieve competitive results across multiple metrics on popular benchmark datasets, for which they produce continuations of high fidelity and realism.
Abstract: Due to the statistical complexity of video, the high degree of inherent stochasticity, and the sheer amount of data, generating natural video remains a challenging task. State-of-the-art video generation models attempt to address these issues by combining sometimes complex, often video-specific neural network architectures, latent variable models, adversarial training and a range of other methods. Despite their often high complexity, these approaches still fall short of generating high-quality video continuations outside of narrow domains and often struggle with fidelity. In contrast, we show that conceptually simple, autoregressive video generation models based on a three-dimensional self-attention mechanism achieve highly competitive results across multiple metrics on popular benchmark datasets for which they produce continuations of high fidelity and realism. Furthermore, we find that our models are capable of producing diverse and surprisingly realistic continuations on a subset of videos from Kinetics, a large-scale action recognition dataset comprising YouTube videos that exhibit phenomena such as camera movement, complex object interactions and diverse human movement. To our knowledge, this is the first promising application of video-generation models to videos of this complexity.

128 citations
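The phrase "autoregressive ... three-dimensional self-attention" cashes out as a chain-rule factorization over pixels ordered in space-time. The toy below builds such an ordering and the causal mask it implies; the paper's block-local attention and subscale generation scheme are deliberately omitted, so this is only an illustration of the factorization.

```python
import numpy as np

T, H, W = 2, 4, 4                                  # a tiny video: frames x height x width
positions = [(t, h, w) for t in range(T) for h in range(H) for w in range(W)]
n = len(positions)

# Causal mask for 3-D self-attention: position i may only attend to the
# positions generated before it in the chosen (raster-scan) ordering.
mask = np.tril(np.ones((n, n), dtype=bool))

# During sampling, the model is applied once per pixel, each step conditioning
# on the prefix and emitting a distribution over the next pixel's value.
for i, (t, h, w) in enumerate(positions[:3]):
    print(f"pixel {(t, h, w)} attends to {i} earlier positions")
```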


Posted Content
TL;DR: The Slot Attention module as mentioned in this paper is an architectural component that interfaces with perceptual representations such as the output of a convolutional neural network and produces a set of task-dependent abstract representations which can bind to any object in the input by specializing through a competitive procedure over multiple rounds of attention.
Abstract: Learning object-centric representations of complex scenes is a promising step towards enabling efficient abstract reasoning from low-level perceptual features. Yet, most deep learning approaches learn distributed representations that do not capture the compositional properties of natural scenes. In this paper, we present the Slot Attention module, an architectural component that interfaces with perceptual representations such as the output of a convolutional neural network and produces a set of task-dependent abstract representations which we call slots. These slots are exchangeable and can bind to any object in the input by specializing through a competitive procedure over multiple rounds of attention. We empirically demonstrate that Slot Attention can extract object-centric representations that enable generalization to unseen compositions when trained on unsupervised object discovery and supervised property prediction tasks.

41 citations


Proceedings ArticleDOI
01 Nov 2020
TL;DR: An empirical study of generation order for machine translation finds that traditional left-to-right generation is not strictly necessary to achieve high performance, and results on the WMT'18 English→Chinese task tend to vary more widely, suggesting that translation for less well-aligned language pairs may be more sensitive to generation order.
Abstract: In this work, we present an empirical study of generation order for machine translation. Building on recent advances in insertion-based modeling, we first introduce a soft order-reward framework that enables us to train models to follow arbitrary oracle generation policies. We then make use of this framework to explore a large variety of generation orders, including uninformed orders, location-based orders, frequency-based orders, content-based orders, and model-based orders. Curiously, we find that for the WMT'14 English→German and WMT'18 English→Chinese translation tasks, order does not have a substantial impact on output quality. Moreover, for English→German, we even discover that unintuitive orderings such as alphabetical and shortest-first can match the performance of a standard Transformer, suggesting that traditional left-to-right generation may not be necessary to achieve high performance.

10 citations
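Several of the explored oracle orders are simple enough to state as code. The sketch below is an illustration, not the paper's training setup: it implements three policies that pick the next target token, while in the actual insertion-based models each step also chooses an insertion position, which is omitted here.

```python
def oracle_order(tokens, policy):
    """Return the order in which an oracle policy would emit target tokens.
    Insertion positions (the other half of the oracle) are omitted."""
    remaining = list(tokens)
    order = []
    while remaining:
        if policy == "left_to_right":         # the standard baseline order
            tok = remaining[0]
        elif policy == "alphabetical":        # an 'unintuitive' content-based order
            tok = min(remaining)
        elif policy == "shortest_first":      # another unintuitive order
            tok = min(remaining, key=len)
        else:
            raise ValueError(policy)
        order.append(tok)
        remaining.remove(tok)
    return order

print(oracle_order("the cat sat on the mat".split(), "alphabetical"))
# ['cat', 'mat', 'on', 'sat', 'the', 'the']
```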


Posted Content
TL;DR: An end-to-end neural model is proposed for this task inspired by recent approaches to neural machine translation, and promising initial results are demonstrated based purely on pixel-level supervision.
Abstract: In this paper, we offer a preliminary investigation into the task of in-image machine translation: transforming an image containing text in one language into an image containing the same text in another language. We propose an end-to-end neural model for this task inspired by recent approaches to neural machine translation, and demonstrate promising initial results based purely on pixel-level supervision. We then offer a quantitative and qualitative evaluation of our system outputs and discuss some common failure modes. Finally, we conclude with directions for future work.

7 citations
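Because supervision is purely at the pixel level, the training signal reduces to an image-to-image reconstruction loss. The sketch below is a schematic under assumed names: render() and translate_image() are hypothetical placeholders, not functions from the paper, and mean squared error is only one plausible choice of per-pixel loss.

```python
import numpy as np

def pixel_loss(predicted, target):
    """Per-pixel reconstruction loss; the paper's exact loss may differ."""
    return float(np.mean((predicted - target) ** 2))

# Schematic training step for in-image translation (placeholder functions):
#   src_img = render("Hello world", lang="en")   # image containing source text
#   tgt_img = render("Hallo Welt", lang="de")    # image containing target text
#   loss    = pixel_loss(translate_image(src_img), tgt_img)
# The model never sees discrete tokens; it is trained end to end on pixels.
print(pixel_loss(np.zeros((32, 128)), np.ones((32, 128))))  # 1.0
```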


Proceedings ArticleDOI
01 Nov 2020
TL;DR: In this article, an end-to-end neural model for in-image machine translation is proposed, which can transform an image containing text in one language into an image with the same text in another language.
Abstract: In this paper, we offer a preliminary investigation into the task of in-image machine translation: transforming an image containing text in one language into an image containing the same text in another language. We propose an end-to-end neural model for this task inspired by recent approaches to neural machine translation, and demonstrate promising initial results based purely on pixel-level supervision. We then offer a quantitative and qualitative evaluation of our system outputs and discuss some common failure modes. Finally, we conclude with directions for future work.

1 citation


Patent
26 Nov 2020
TL;DR: A method for generating a video of multiple frames, each with multiple channels, is described: the initial output video is partitioned into a set of channel slices indexed according to a particular slice order, each channel slice being a downsampling of a channel stack, and the slices are then generated one at a time.
Abstract: A method for generating a video is described. The method includes: generating an initial output video including multiple frames, each of the frames having multiple channels; identifying a partitioning of the initial output video into a set of channel slices that are indexed according to a particular slice order, each channel slice being a downsampling of a channel stack from a set of channel stacks; initializing, for each channel stack in the set of channel stacks, a set of fully-generated channel slices; repeatedly processing, using an encoder and a decoder, a current output video to generate a next fully-generated channel slice to be added to the current set of fully-generated channel slices; generating, for each channel index, a respective fully-generated channel stack using the respective fully-generated channel slices; and generating a fully-generated output video using the fully-generated channel stacks.
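A toy rendering of the claimed loop, under clearly labeled simplifications: here a "slice" subsamples frames along time, whereas the claim's channel slices downsample channel stacks more generally, and toy_model stands in for the claimed encoder and decoder.

```python
import numpy as np

def generate_video(model, shape=(8, 16, 16, 3), stride=4):
    """Fill an initial (zeroed) output video slice by slice, where each slice
    is the set of frames at a fixed offset modulo `stride` (an assumed,
    simplified slice partition)."""
    video = np.zeros(shape)                        # initial output video
    for offset in range(stride):                   # iterate slices in slice order
        idx = np.arange(offset, shape[0], stride)  # this slice's frame indices
        video[idx] = model(video, idx)             # next fully-generated slice
    return video

# Placeholder for the encoder/decoder pair: returns values for one slice.
toy_model = lambda video, idx: np.random.default_rng(0).random((len(idx),) + video.shape[1:])
out = generate_video(toy_model)
print(out.shape)  # (8, 16, 16, 3)
```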