Author

Björn Ommer

Bio: Björn Ommer is an academic researcher at Heidelberg University. He has contributed to research in topics including Object detection and Object (computer science). He has an h-index of 31 and has co-authored 138 publications that have received 2,933 citations. His previous affiliations include the University of Bonn and ETH Zurich.


Papers
Posted Content
TL;DR: It is demonstrated how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images.
Abstract: Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images. We demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images. We show how to (i) use CNNs to learn a context-rich vocabulary of image constituents, and in turn (ii) utilize transformers to efficiently model their composition within high-resolution images. Our approach is readily applied to conditional synthesis tasks, where both non-spatial information, such as object classes, and spatial information, such as segmentations, can control the generated image. In particular, we present the first results on semantically-guided synthesis of megapixel images with transformers and obtain the state of the art among autoregressive models on class-conditional ImageNet. Code and pretrained models can be found at this https URL .
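The first stage described in the abstract — using a CNN to learn a context-rich, discrete vocabulary of image constituents — can be illustrated with a minimal vector-quantization sketch. The codebook size, feature dimension, and grid shape below are illustrative choices, not the paper's actual hyperparameters; in the real model the features come from a learned CNN encoder, and a transformer then models the resulting index sequence autoregressively.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry.

    features: (N, D) array of local features (e.g. CNN encoder outputs).
    codebook: (K, D) array of learned code vectors.
    Returns an (N,) array of integer indices in [0, K).
    """
    # Squared Euclidean distance between every feature and every code.
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))       # K=16 codes, D=8 dimensions
features = rng.normal(size=(4 * 4, 8))    # a 4x4 grid of local features
indices = quantize(features, codebook)    # short discrete sequence for a transformer
print(indices.shape)                      # (16,)
```

The payoff of this stage is that a high-resolution image becomes a short sequence of discrete tokens, which keeps autoregressive transformer modeling computationally feasible.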

744 citations

Journal ArticleDOI
13 Jun 2014-Science
TL;DR: Nearly full recovery of skilled forelimb functions in rats with large strokes is shown when a growth-promoting immunotherapy against a neurite growth-inhibitory protein is applied to boost the sprouting of new fibers before the newly formed circuits are stabilized by intensive training.
Abstract: The brain exhibits limited capacity for spontaneous restoration of lost motor functions after stroke. Rehabilitation is the prevailing clinical approach to augment functional recovery, but the scientific basis is poorly understood. Here, we show nearly full recovery of skilled forelimb functions in rats with large strokes when a growth-promoting immunotherapy against a neurite growth-inhibitory protein was applied to boost the sprouting of new fibers, before stabilizing the newly formed circuits by intensive training. In contrast, early high-intensity training during the growth phase destroyed the effect and led to aberrant fiber patterns. Pharmacogenetic experiments identified a subset of corticospinal fibers originating in the intact half of the forebrain, side-switching in the spinal cord to newly innervate the impaired limb and restore skilled motor function.

284 citations

Proceedings ArticleDOI
01 Jun 2021
TL;DR: In this paper, the authors demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images.
Abstract: Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images. We demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images. We show how to (i) use CNNs to learn a context-rich vocabulary of image constituents, and in turn (ii) utilize transformers to efficiently model their composition within high-resolution images. Our approach is readily applied to conditional synthesis tasks, where both non-spatial information, such as object classes, and spatial information, such as segmentations, can control the generated image. In particular, we present the first results on semantically-guided synthesis of megapixel images with transformers. Project page at https://git.io/JLlvY.

273 citations

Posted Content
TL;DR: A conditional U-Net is presented for shape-guided image generation, conditioned on the output of a variational autoencoder for appearance, trained end-to-end on images, without requiring samples of the same object with varying pose or appearance.
Abstract: Deep generative models have demonstrated great performance in image synthesis. However, results deteriorate in case of spatial deformations, since they generate images of objects directly, rather than modeling the intricate interplay of their inherent shape and appearance. We present a conditional U-Net for shape-guided image generation, conditioned on the output of a variational autoencoder for appearance. The approach is trained end-to-end on images, without requiring samples of the same object with varying pose or appearance. Experiments show that the model enables conditional image generation and transfer. Therefore, either shape or appearance can be retained from a query image, while freely altering the other. Moreover, appearance can be sampled due to its stochastic latent representation, while preserving shape. In quantitative and qualitative experiments on COCO, DeepFashion, shoes, Market-1501 and handbags, the approach demonstrates significant improvements over the state-of-the-art.
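The shape/appearance factorization described above can be sketched in a few lines: an image is generated from a shape estimate plus a stochastic appearance latent, so either factor can be held fixed while the other varies. The linear "decoder" and all array shapes below are illustrative stand-ins for the paper's conditional U-Net and VAE, which this toy does not implement.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_appearance(mu, logvar, rng):
    """VAE-style reparameterized sample: z = mu + sigma * eps."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def generate(shape_map, z, W):
    """Stand-in decoder: combines shape conditioning with appearance z."""
    return shape_map + z @ W   # broadcast the appearance code over the shape map

shape_map = rng.normal(size=(8, 8))    # shape conditioning (e.g. an edge/pose map)
mu, logvar = np.zeros(8), np.zeros(8)  # appearance posterior parameters
W = rng.normal(size=(8, 8))

z1 = sample_appearance(mu, logvar, rng)
z2 = sample_appearance(mu, logvar, rng)
img1 = generate(shape_map, z1, W)      # same shape, sampled appearance 1
img2 = generate(shape_map, z2, W)      # same shape, sampled appearance 2
```

Because the appearance latent is stochastic, resampling it yields different outputs for the same shape conditioning, which mirrors the transfer behavior the abstract describes.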

196 citations

Proceedings ArticleDOI
06 Nov 2011
TL;DR: A probabilistic model is presented that localizes abnormalities using statistical inference; it outperforms the state of the art, achieving a frame-based abnormality classification performance of 91% and improving localization performance by 32%, to 76%.
Abstract: Detecting abnormalities in video is a challenging problem since the class of all irregular objects and behaviors is infinite and thus no (or far too few) abnormal training samples are available. Consequently, a standard setting is to find abnormalities without actually knowing what they are, because no abnormal examples are shown during training. However, although the training data does not define what an abnormality looks like, the main paradigm in this field is to directly search for individual abnormal local patches or image regions independently of one another. To address this problem we parse video frames by establishing a set of hypotheses that jointly explain all the foreground while, at the same time, trying to find normal training samples that explain the hypotheses. Consequently, we can avoid a direct detection of abnormalities. They are discovered indirectly as those hypotheses which are needed for covering the foreground but for which no explanation by normal samples can be found. We present a probabilistic model that localizes abnormalities using statistical inference. On the challenging dataset of [15] it outperforms the state-of-the-art by 7% to achieve a frame-based abnormality classification performance of 91%, and the localization performance improves by 32% to 76%.
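The core idea — marking a foreground region abnormal only when no normal training sample explains it — can be reduced to a minimal nearest-neighbor sketch. The distance threshold and feature dimensions below are illustrative; the paper's actual model performs joint probabilistic inference over hypotheses rather than per-patch thresholding.

```python
import numpy as np

def unexplained(patches, normal_bank, thresh):
    """Boolean mask: True where no normal sample explains the patch,
    i.e. no training patch lies within squared distance `thresh`."""
    d = ((patches[:, None, :] - normal_bank[None, :, :]) ** 2).sum(-1)
    return d.min(axis=1) > thresh

rng = np.random.default_rng(2)
normal_bank = rng.normal(size=(50, 4))         # normal training patches
patches = np.vstack([normal_bank[:3],          # 3 patches seen during training
                     normal_bank[3] + 10.0])   # 1 far-off (abnormal) patch
mask = unexplained(patches, normal_bank, thresh=1e-6)
print(mask)   # [False False False  True]
```

Only the shifted patch fails to find a normal explanation, so it is the one flagged as abnormal.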

163 citations


Cited by
01 Jan 2004
TL;DR: Comprehensive and up-to-date, this book covers essential topics that either reflect practical significance or are of theoretical importance, and describes numerous important application areas such as image-based rendering and digital libraries.
Abstract: From the Publisher: The accessible presentation of this book gives both a general view of the entire computer vision enterprise and also offers sufficient detail to be able to build useful applications. Users learn techniques that have proven to be useful by first-hand experience and a wide range of mathematical methods. A CD-ROM with every copy of the text contains source code for programming practice, color images, and illustrative movies. Comprehensive and up-to-date, this book includes essential topics that either reflect practical significance or are of theoretical importance. Topics are discussed in substantial and increasing depth. Application surveys describe numerous important application areas such as image based rendering and digital libraries. Many important algorithms broken down and illustrated in pseudo code. Appropriate for use by engineers as a comprehensive reference to the computer vision enterprise.

3,627 citations

01 Jan 2006

3,012 citations

Proceedings ArticleDOI
07 Dec 2015
TL;DR: In this paper, the spatial context is used as a source of free and plentiful supervisory signal for training a rich visual representation, and the feature representation learned using this within-image context captures visual similarity across images.
Abstract: This work explores the use of spatial context as a source of free and plentiful supervisory signal for training a rich visual representation. Given only a large, unlabeled image collection, we extract random pairs of patches from each image and train a convolutional neural net to predict the position of the second patch relative to the first. We argue that doing well on this task requires the model to learn to recognize objects and their parts. We demonstrate that the feature representation learned using this within-image context indeed captures visual similarity across images. For example, this representation allows us to perform unsupervised visual discovery of objects like cats, people, and even birds from the Pascal VOC 2011 detection dataset. Furthermore, we show that the learned ConvNet can be used in the R-CNN framework [19] and provides a significant boost over a randomly-initialized ConvNet, resulting in state-of-the-art performance among algorithms which use only Pascal-provided training set annotations.
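The pretext task described above can be sketched as a data-sampling routine: draw a patch and one of its eight neighbors, and use the neighbor's relative position as a free supervisory label. The patch size, image size, and offset indexing below are illustrative; in the real setup both patches are fed to a ConvNet that predicts the label.

```python
import numpy as np

# The 8 neighbor offsets, indexed 0..7 (this index order is arbitrary).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
           ( 0, -1),          ( 0, 1),
           ( 1, -1), ( 1, 0), ( 1, 1)]

def sample_pair(image, p, rng):
    """Return (center_patch, neighbor_patch, label) from one unlabeled image."""
    h, w = image.shape
    # Choose a center position so that all eight p-sized neighbors fit inside.
    r = rng.integers(p, h - 2 * p)
    c = rng.integers(p, w - 2 * p)
    label = int(rng.integers(len(OFFSETS)))   # free supervision: which neighbor
    dr, dc = OFFSETS[label]
    center = image[r:r + p, c:c + p]
    neighbor = image[r + dr * p:r + dr * p + p, c + dc * p:c + dc * p + p]
    return center, neighbor, label

rng = np.random.default_rng(3)
image = rng.normal(size=(32, 32))             # any unlabeled image
center, neighbor, label = sample_pair(image, p=8, rng=rng)
```

No annotation is ever needed: the label is generated by the sampling process itself, which is what makes the supervisory signal "free and plentiful".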

2,154 citations