Open Access · Posted Content

ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis

TL;DR
In this article, a coarse-to-fine hierarchy of context is incorporated by combining the autoregressive formulation with a multinomial diffusion process; the resulting model can solve free-form image inpainting and local, text-guided image modification without requiring mask-specific training.
Abstract
Autoregressive models and their sequential factorization of the data likelihood have recently demonstrated great potential for image representation and synthesis. Nevertheless, they incorporate image context in a linear 1D order by attending only to previously synthesized image patches above or to the left. This unidirectional, sequential bias of attention is unnatural for images: not only does it disregard large parts of a scene until synthesis is almost complete, it also processes the entire image on a single scale, ignoring more global contextual information up to the gist of the entire scene. As a remedy, we incorporate a coarse-to-fine hierarchy of context by combining the autoregressive formulation with a multinomial diffusion process: whereas a multistage diffusion process successively removes information to coarsen an image, we train a (short) Markov chain to invert this process. In each stage, the resulting autoregressive ImageBART model progressively incorporates context from previous stages in a coarse-to-fine manner. Experiments show greatly improved image modification capabilities over autoregressive models while also providing high-fidelity image generation, both enabled through efficient training in a compressed latent space. Specifically, our approach can take unrestricted, user-provided masks into account to perform local image editing. Thus, in contrast to pure autoregressive models, it can solve free-form image inpainting and, in the case of conditional models, local, text-guided image modification without requiring mask-specific training.
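To make the combination of autoregressive modeling and multinomial diffusion concrete, below is a minimal sketch of the forward (coarsening) chain over discrete VQ codebook indices. Everything here is an illustrative assumption rather than the paper's released implementation: the codebook size `K`, the number of stages `T`, the per-stage corruption rates, and the uniform-resampling noise model are all placeholders.

```python
import torch

# Minimal sketch of a forward multinomial-diffusion (coarsening) chain over
# discrete VQ token indices. All names and constants are illustrative
# assumptions, not the paper's released code.

K = 1024                              # assumed codebook size
T = 4                                 # assumed number of ("short") stages
betas = torch.linspace(0.1, 0.4, T)   # assumed per-stage corruption rates

def q_sample(x_prev: torch.Tensor, beta: float) -> torch.Tensor:
    """One forward step: each token is independently resampled uniformly
    from the codebook with probability beta, otherwise kept unchanged."""
    noise = torch.randint(0, K, x_prev.shape)
    resample = torch.rand(x_prev.shape) < beta
    return torch.where(resample, noise, x_prev)

def forward_chain(x0: torch.Tensor) -> list[torch.Tensor]:
    """Run all T coarsening steps, keeping every intermediate stage.
    The learned reverse chain would fit, per stage t, an autoregressive
    model p(x_{t-1} | x_t) that attends causally to the tokens of x_{t-1}
    generated so far and bidirectionally to all of x_t, which supplies
    the coarse global context."""
    xs = [x0]
    for t in range(T):
        xs.append(q_sample(xs[-1], betas[t].item()))
    return xs

if __name__ == "__main__":
    x0 = torch.randint(0, K, (1, 16 * 16))   # a 16x16 grid of VQ indices
    for t, x in enumerate(forward_chain(x0)):
        changed = (x != x0).float().mean().item()
        print(f"stage {t}: {changed:.1%} of tokens differ from the data")
```

Sampling would then run the chain in reverse, from the most heavily corrupted stage back to clean tokens, before a VQ decoder maps the indices back to pixels; because each reverse stage conditions on the full coarse image from the previous stage, every generated token has access to bidirectional, global context.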


Citations
Posted Content

L-Verse: Bidirectional Generation Between Image and Text

TL;DR: L-Verse proposes a novel architecture consisting of a feature-augmented variational autoencoder and a bidirectional auto-regressive transformer (BiART) for text-to-image and image-to-text generation.
Posted Content

EdiBERT, a generative model for image editing

TL;DR: EdiBERT is a bidirectional transformer for image editing, trained in the discrete latent space built by a vector-quantized auto-encoder.
Posted Content

Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes

TL;DR: In this article, a discrete diffusion probabilistic model prior is proposed for parallel prediction of vector-quantized tokens, using an unconstrained Transformer architecture as the backbone.
Posted Content

Vector Quantized Diffusion Model for Text-to-Image Synthesis

TL;DR: In this article, a vector quantized diffusion (VQ-Diffusion) model is proposed for text-to-image generation, where the latent space is modeled by a conditional variant of the recently developed Denoising Diffusion Probabilistic Model.
References
Proceedings Article

Attention is All you Need

TL;DR: This paper proposes the Transformer, a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
Proceedings ArticleDOI

ImageNet: A large-scale hierarchical image database

TL;DR: A new database called "ImageNet" is introduced: a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity, and much more accurate, than previous image datasets.
Proceedings ArticleDOI

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Proceedings Article

Auto-Encoding Variational Bayes

TL;DR: Introduces a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case.
Dissertation

Learning Multiple Layers of Features from Tiny Images

TL;DR: The authors describe how to train a multi-layer generative model of natural images using a dataset of millions of tiny colour images.