Open Access · Posted Content

ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis

TL;DR
In this article, a coarse-to-fine hierarchy of context is incorporated by combining the autoregressive formulation with a multinomial diffusion process; the resulting model can solve free-form image inpainting and local, text-guided image modification without requiring mask-specific training.
Abstract
Autoregressive models and their sequential factorization of the data likelihood have recently demonstrated great potential for image representation and synthesis. Nevertheless, they incorporate image context in a linear 1D order by attending only to previously synthesized image patches above or to the left. This unidirectional, sequential bias of attention is unnatural for images: not only does it disregard large parts of a scene until synthesis is almost complete, it also processes the entire image on a single scale, ignoring more global contextual information up to the gist of the entire scene. As a remedy, we incorporate a coarse-to-fine hierarchy of context by combining the autoregressive formulation with a multinomial diffusion process: whereas a multistage diffusion process successively removes information to coarsen an image, we train a (short) Markov chain to invert this process. In each stage, the resulting autoregressive ImageBART model progressively incorporates context from previous stages in a coarse-to-fine manner. Experiments show greatly improved image modification capabilities over autoregressive models while also providing high-fidelity image generation, both enabled through efficient training in a compressed latent space. Specifically, our approach can take unrestricted, user-provided masks into account to perform local image editing. Thus, in contrast to pure autoregressive models, it can solve free-form image inpainting and, in the case of conditional models, local, text-guided image modification without requiring mask-specific training.
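To make the combination of autoregressive modeling and multinomial diffusion concrete, below is a minimal sketch of the forward (coarsening) chain over discrete VQ codebook indices. Everything here is an illustrative assumption rather than the paper's released implementation: the codebook size `K`, the number of stages `T`, the per-stage corruption rates, and the uniform-resampling noise model are all placeholders.

```python
import torch

# Minimal sketch of a forward multinomial-diffusion (coarsening) chain over
# discrete VQ token indices. All names and constants are illustrative
# assumptions, not the paper's released code.

K = 1024                              # assumed codebook size
T = 4                                 # assumed number of ("short") stages
betas = torch.linspace(0.1, 0.4, T)   # assumed per-stage corruption rates

def q_sample(x_prev: torch.Tensor, beta: float) -> torch.Tensor:
    """One forward step: each token is independently resampled uniformly
    from the codebook with probability beta, otherwise kept unchanged."""
    noise = torch.randint(0, K, x_prev.shape)
    resample = torch.rand(x_prev.shape) < beta
    return torch.where(resample, noise, x_prev)

def forward_chain(x0: torch.Tensor) -> list[torch.Tensor]:
    """Run all T coarsening steps, keeping every intermediate stage.
    The learned reverse chain would fit, per stage t, an autoregressive
    model p(x_{t-1} | x_t) that attends causally to the tokens of x_{t-1}
    generated so far and bidirectionally to all of x_t, which supplies
    the coarse global context."""
    xs = [x0]
    for t in range(T):
        xs.append(q_sample(xs[-1], betas[t].item()))
    return xs

if __name__ == "__main__":
    x0 = torch.randint(0, K, (1, 16 * 16))   # a 16x16 grid of VQ indices
    for t, x in enumerate(forward_chain(x0)):
        changed = (x != x0).float().mean().item()
        print(f"stage {t}: {changed:.1%} of tokens differ from the data")
```

Sampling would then run the chain in reverse, from the most heavily corrupted stage back to clean tokens, before a VQ decoder maps the indices back to pixels; because each reverse stage conditions on the full coarse image from the previous stage, every generated token has access to bidirectional, global context.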


Citations
Posted Content

L-Verse: Bidirectional Generation Between Image and Text

TL;DR: L-Verse proposes a novel architecture consisting of a feature-augmented variational autoencoder and a bidirectional auto-regressive transformer (BiART) for text-to-image and image-to-text generation.
Posted Content

EdiBERT, a generative model for image editing

TL;DR: EdiBERT is a bidirectional transformer for image editing, trained in the discrete latent space built by a vector-quantized auto-encoder.
Posted Content

Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes

TL;DR: In this article, a discrete diffusion probabilistic model prior is proposed for parallel prediction of vector-quantized tokens, using an unconstrained Transformer architecture as the backbone.
Posted Content

Vector Quantized Diffusion Model for Text-to-Image Synthesis

TL;DR: In this article, a vector quantized diffusion (VQ-Diffusion) model is proposed for text-to-image generation, where the latent space is modeled by a conditional variant of the recently developed Denoising Diffusion Probabilistic Model.
References
Proceedings Article

Attention is All you Need

TL;DR: This paper proposes the Transformer, a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
Proceedings ArticleDOI

ImageNet: A large-scale hierarchical image database

TL;DR: A new database called "ImageNet" is introduced: a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity, and much more accurate, than previous image datasets.
Proceedings ArticleDOI

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Proceedings Article

Auto-Encoding Variational Bayes

TL;DR: Introduces a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case.
Dissertation

Learning Multiple Layers of Features from Tiny Images

TL;DR: The authors describe how to train a multi-layer generative model of natural images using a dataset of millions of tiny colour images.