
Showing papers by "Jakob Uszkoreit" published in 2020


Posted Content
TL;DR: Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
Abstract: While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

12,690 citations
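The architectural change described above is concentrated in the input pipeline: an image becomes a sequence of flattened patches that a standard Transformer encoder can consume. Below is a minimal numpy sketch of that step; the 16x16 patch size and 768-dimensional embedding match the ViT-Base configuration, but the projection, class token, and position embeddings are untrained random placeholders rather than the paper's learned parameters.

```python
import numpy as np

def image_to_patch_embeddings(image, patch_size=16, dim=768, seed=0):
    """Split an (H, W, C) image into non-overlapping patches, flatten each
    patch, and linearly project it into the Transformer's model dimension."""
    rng = np.random.default_rng(seed)
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    n = (h // patch_size) * (w // patch_size)              # number of patches
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(n, -1)
    proj = rng.standard_normal((patch_size * patch_size * c, dim)) * 0.02
    tokens = patches @ proj                                # (n, dim) patch tokens
    cls_token = rng.standard_normal((1, dim)) * 0.02       # learnable [class] token
    pos = rng.standard_normal((n + 1, dim)) * 0.02         # learned position embeddings
    return np.concatenate([cls_token, tokens], axis=0) + pos

seq = image_to_patch_embeddings(np.zeros((224, 224, 3)))
print(seq.shape)  # (197, 768): 14 x 14 patches plus the class token
```

Everything past this point is a standard Transformer encoder; classification is read off the class token's final representation.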


Proceedings Article
26 Jun 2020
TL;DR: An architectural component that interfaces with perceptual representations such as the output of a convolutional neural network and produces a set of task-dependent abstract representations which are exchangeable and can bind to any object in the input by specializing through a competitive procedure over multiple rounds of attention is presented.
Abstract: Learning object-centric representations of complex scenes is a promising step towards enabling efficient abstract reasoning from low-level perceptual features. Yet, most deep learning approaches learn distributed representations that do not capture the compositional properties of natural scenes. In this paper, we present the Slot Attention module, an architectural component that interfaces with perceptual representations such as the output of a convolutional neural network and produces a set of task-dependent abstract representations which we call slots. These slots are exchangeable and can bind to any object in the input by specializing through a competitive procedure over multiple rounds of attention. We empirically demonstrate that Slot Attention can extract object-centric representations that enable generalization to unseen compositions when trained on unsupervised object discovery and supervised property prediction tasks.

372 citations
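The competitive procedure mentioned in the abstract is compact enough to sketch. The numpy toy below keeps the algorithm's defining step, normalizing attention over the slot axis so that slots compete for input features, but omits the paper's learned query/key/value projections, layer normalization, and GRU update (a plain additive update stands in for the latter).

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(inputs, num_slots=4, num_iters=3, seed=0):
    """inputs: (n, d) perceptual features, e.g. a flattened CNN feature map."""
    n, d = inputs.shape
    rng = np.random.default_rng(seed)
    slots = rng.standard_normal((num_slots, d))        # randomly initialized slots
    for _ in range(num_iters):
        logits = slots @ inputs.T / np.sqrt(d)         # (num_slots, n)
        attn = softmax(logits, axis=0)                 # softmax over SLOTS: competition
        attn = attn / attn.sum(axis=1, keepdims=True)  # weighted mean over inputs
        slots = slots + attn @ inputs                  # additive stand-in for the GRU
    return slots

slots = slot_attention(np.random.default_rng(1).standard_normal((64, 32)))
print(slots.shape)  # (4, 32): one exchangeable representation per slot
```

Because the slots are initialized randomly and updated symmetrically, no slot is tied to a particular object class, which is what makes them exchangeable.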


Journal ArticleDOI
TL;DR: A deep-learning system, CUBBITT, approaches the quality of human translation and even surpasses it in adequacy in certain circumstances, suggesting that deep learning may have the potential to replace humans in applications where conservation of meaning is the primary aim.
Abstract: The quality of human translation was long thought to be unattainable for computer translation systems. In this study, we present a deep-learning system, CUBBITT, which challenges this view. In a context-aware blind evaluation by human judges, CUBBITT significantly outperformed professional-agency English-to-Czech news translation in preserving text meaning (translation adequacy). While human translation is still rated as more fluent, CUBBITT is shown to be substantially more fluent than previous state-of-the-art systems. Moreover, most participants of a Translation Turing test struggle to distinguish CUBBITT translations from human translations. This work approaches the quality of human translation and even surpasses it in adequacy in certain circumstances. This suggests that deep learning may have the potential to replace humans in applications where conservation of meaning is the primary aim. The quality of human language translation has been thought to be unattainable by computer translation systems. Here the authors present CUBBITT, a deep learning system that outperforms professional human translators in retaining text meaning in English-to-Czech news translation, and validate the system on English-French and English-Polish language pairs.

156 citations


Proceedings Article
30 Apr 2020
TL;DR: It is shown that conceptually simple autoregressive video generation models based on a three-dimensional self-attention mechanism achieve competitive results across multiple metrics on popular benchmark datasets, for which they produce continuations of high fidelity and realism.
Abstract: Due to the statistical complexity of video, the high degree of inherent stochasticity, and the sheer amount of data, generating natural video remains a challenging task. State-of-the-art video generation models attempt to address these issues by combining sometimes complex, often video-specific neural network architectures, latent variable models, adversarial training and a range of other methods. Despite their often high complexity, these approaches still fall short of generating high-quality video continuations outside of narrow domains and often struggle with fidelity. In contrast, we show that conceptually simple, autoregressive video generation models based on a three-dimensional self-attention mechanism achieve highly competitive results across multiple metrics on popular benchmark datasets for which they produce continuations of high fidelity and realism. Furthermore, we find that our models are capable of producing diverse and surprisingly realistic continuations on a subset of videos from Kinetics, a large-scale action recognition dataset comprising YouTube videos that exhibit phenomena such as camera movement, complex object interactions and diverse human movement. To our knowledge, this is the first promising application of video-generation models to videos of this complexity.

128 citations
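The phrase "autoregressive ... three-dimensional self-attention" cashes out as a chain-rule factorization over pixels ordered in space-time. The toy below builds such an ordering and the causal mask it implies; the paper's block-local attention and subscale generation scheme are deliberately omitted, so this is only an illustration of the factorization.

```python
import numpy as np

T, H, W = 2, 4, 4                                  # a tiny video: frames x height x width
positions = [(t, h, w) for t in range(T) for h in range(H) for w in range(W)]
n = len(positions)

# Causal mask for 3-D self-attention: position i may only attend to the
# positions generated before it in the chosen (raster-scan) ordering.
mask = np.tril(np.ones((n, n), dtype=bool))

# During sampling, the model is applied once per pixel, each step conditioning
# on the prefix and emitting a distribution over the next pixel's value.
for i, (t, h, w) in enumerate(positions[:3]):
    print(f"pixel {(t, h, w)} attends to {i} earlier positions")
```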


Posted Content
TL;DR: The Slot Attention module as mentioned in this paper is an architectural component that interfaces with perceptual representations such as the output of a convolutional neural network and produces a set of task-dependent abstract representations which can bind to any object in the input by specializing through a competitive procedure over multiple rounds of attention.
Abstract: Learning object-centric representations of complex scenes is a promising step towards enabling efficient abstract reasoning from low-level perceptual features. Yet, most deep learning approaches learn distributed representations that do not capture the compositional properties of natural scenes. In this paper, we present the Slot Attention module, an architectural component that interfaces with perceptual representations such as the output of a convolutional neural network and produces a set of task-dependent abstract representations which we call slots. These slots are exchangeable and can bind to any object in the input by specializing through a competitive procedure over multiple rounds of attention. We empirically demonstrate that Slot Attention can extract object-centric representations that enable generalization to unseen compositions when trained on unsupervised object discovery and supervised property prediction tasks.

41 citations


Proceedings ArticleDOI
01 Nov 2020
TL;DR: An empirical study of generation order for machine translation finds that traditional left-to-right generation is not strictly necessary to achieve high performance, and results on the WMT'18 English→Chinese task tend to vary more widely, suggesting that translation for less well-aligned language pairs may be more sensitive to generation order.
Abstract: In this work, we present an empirical study of generation order for machine translation. Building on recent advances in insertion-based modeling, we first introduce a soft order-reward framework that enables us to train models to follow arbitrary oracle generation policies. We then make use of this framework to explore a large variety of generation orders, including uninformed orders, location-based orders, frequency-based orders, content-based orders, and model-based orders. Curiously, we find that for the WMT'14 English→German and WMT'18 English→Chinese translation tasks, order does not have a substantial impact on output quality. Moreover, for English→German, we even discover that unintuitive orderings such as alphabetical and shortest-first can match the performance of a standard Transformer, suggesting that traditional left-to-right generation may not be necessary to achieve high performance.

10 citations
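Several of the explored oracle orders are simple enough to state as code. The sketch below is an illustration, not the paper's training setup: it implements three policies that pick the next target token, while in the actual insertion-based models each step also chooses an insertion position, which is omitted here.

```python
def oracle_order(tokens, policy):
    """Return the order in which an oracle policy would emit target tokens.
    Insertion positions (the other half of the oracle) are omitted."""
    remaining = list(tokens)
    order = []
    while remaining:
        if policy == "left_to_right":         # the standard baseline order
            tok = remaining[0]
        elif policy == "alphabetical":        # an 'unintuitive' content-based order
            tok = min(remaining)
        elif policy == "shortest_first":      # another unintuitive order
            tok = min(remaining, key=len)
        else:
            raise ValueError(policy)
        order.append(tok)
        remaining.remove(tok)
    return order

print(oracle_order("the cat sat on the mat".split(), "alphabetical"))
# ['cat', 'mat', 'on', 'sat', 'the', 'the']
```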


Posted Content
TL;DR: An end-to-end neural model is proposed for this task inspired by recent approaches to neural machine translation, and promising initial results are demonstrated based purely on pixel-level supervision.
Abstract: In this paper, we offer a preliminary investigation into the task of in-image machine translation: transforming an image containing text in one language into an image containing the same text in another language. We propose an end-to-end neural model for this task inspired by recent approaches to neural machine translation, and demonstrate promising initial results based purely on pixel-level supervision. We then offer a quantitative and qualitative evaluation of our system outputs and discuss some common failure modes. Finally, we conclude with directions for future work.

7 citations
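Because supervision is purely at the pixel level, the training signal reduces to an image-to-image reconstruction loss. The sketch below is a schematic under assumed names: render() and translate_image() are hypothetical placeholders, not functions from the paper, and mean squared error is only one plausible choice of per-pixel loss.

```python
import numpy as np

def pixel_loss(predicted, target):
    """Per-pixel reconstruction loss; the paper's exact loss may differ."""
    return float(np.mean((predicted - target) ** 2))

# Schematic training step for in-image translation (placeholder functions):
#   src_img = render("Hello world", lang="en")   # image containing source text
#   tgt_img = render("Hallo Welt", lang="de")    # image containing target text
#   loss    = pixel_loss(translate_image(src_img), tgt_img)
# The model never sees discrete tokens; it is trained end to end on pixels.
print(pixel_loss(np.zeros((32, 128)), np.ones((32, 128))))  # 1.0
```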


Proceedings ArticleDOI
01 Nov 2020
TL;DR: In this article, an end-to-end neural model for in-image machine translation is proposed, which can transform an image containing text in one language into an image with the same text in another language.
Abstract: In this paper, we offer a preliminary investigation into the task of in-image machine translation: transforming an image containing text in one language into an image containing the same text in another language. We propose an end-to-end neural model for this task inspired by recent approaches to neural machine translation, and demonstrate promising initial results based purely on pixel-level supervision. We then offer a quantitative and qualitative evaluation of our system outputs and discuss some common failure modes. Finally, we conclude with directions for future work.

1 citation


Patent
26 Nov 2020
TL;DR: A method for generating a video of multiple frames, each with multiple channels, is described: the initial output video is partitioned into a set of channel slices indexed according to a particular slice order, each channel slice being a downsampling of a channel stack, and the slices are then generated one at a time.
Abstract: A method for generating a video is described. The method includes: generating an initial output video including multiple frames, each of the frames having multiple channels; identifying a partitioning of the initial output video into a set of channel slices that are indexed according to a particular slice order, each channel slice being a downsampling of a channel stack from a set of channel stacks; initializing, for each channel stack in the set of channel stacks, a set of fully-generated channel slices; repeatedly processing, using an encoder and a decoder, a current output video to generate a next fully-generated channel slice to be added to the current set of fully-generated channel slices; generating, for each channel index, a respective fully-generated channel stack using the respective fully-generated channel slices; and generating a fully-generated output video using the fully-generated channel stacks.
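A toy rendering of the claimed loop, under clearly labeled simplifications: here a "slice" subsamples frames along time, whereas the claim's channel slices downsample channel stacks more generally, and toy_model stands in for the claimed encoder and decoder.

```python
import numpy as np

def generate_video(model, shape=(8, 16, 16, 3), stride=4):
    """Fill an initial (zeroed) output video slice by slice, where each slice
    is the set of frames at a fixed offset modulo `stride` (an assumed,
    simplified slice partition)."""
    video = np.zeros(shape)                        # initial output video
    for offset in range(stride):                   # iterate slices in slice order
        idx = np.arange(offset, shape[0], stride)  # this slice's frame indices
        video[idx] = model(video, idx)             # next fully-generated slice
    return video

# Placeholder for the encoder/decoder pair: returns values for one slice.
toy_model = lambda video, idx: np.random.default_rng(0).random((len(idx),) + video.shape[1:])
out = generate_video(toy_model)
print(out.shape)  # (8, 16, 16, 3)
```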