Showing papers by "Shai Avidan published in 2010"


Book Chapter
05 Sep 2010
TL;DR: This work determines the solution of this segmentation problem as the MAP labeling of a higher-order random field, and develops a framework that allows MHVS to achieve spatial and temporal long-range label consistency while operating in an on-line manner.
Abstract: Multiple Hypothesis Video Segmentation (MHVS) is a method for the unsupervised photometric segmentation of video sequences. MHVS segments arbitrarily long video streams by considering only a few frames at a time, and handles the automatic creation, continuation and termination of labels with no user initialization or supervision. The process begins by generating several pre-segmentations per frame and enumerating multiple possible trajectories of pixel regions within a short time window. After assigning each trajectory a score, we let the trajectories compete with each other to segment the sequence. We determine the solution of this segmentation problem as the MAP labeling of a higher-order random field. This framework allows MHVS to achieve spatial and temporal long-range label consistency while operating in an on-line manner. We test MHVS on several videos of natural scenes with arbitrary camera and object motion.
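
The paper's inference is MAP labeling of a higher-order random field, which is more than a snippet can carry. As a loose illustration of trajectories competing to label regions, here is a minimal Python sketch that runs iterated conditional modes on a pairwise field instead; the region scores, chain neighborhood, and smoothness weight are all invented for the example and are not from the paper.

```python
import numpy as np

# Toy stand-in for trajectory competition: each region picks a trajectory
# label, trading off its per-region score against agreement with neighbors.
# This is ICM on a pairwise field, NOT the paper's higher-order model.
n_regions, n_trajectories = 6, 3
rng = np.random.default_rng(0)
unary = rng.random((n_regions, n_trajectories))     # invented trajectory scores
edges = [(i, i + 1) for i in range(n_regions - 1)]  # invented chain neighborhood
smooth = 0.5                                        # label-disagreement penalty

labels = unary.argmax(axis=1)                       # start from the best unary
for _ in range(10):                                 # ICM sweeps
    for r in range(n_regions):
        costs = -unary[r].copy()
        for a, b in edges:
            other = b if a == r else a if b == r else None
            if other is not None:
                costs += smooth * (np.arange(n_trajectories) != labels[other])
        labels[r] = costs.argmin()
print("labeling:", labels)
```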

189 citations


Proceedings Article
13 Jun 2010
TL;DR: In this article, the problem of reconstructing an image from a bag of square, non-overlapping image patches, the jigsaw puzzle problem, is considered and a graphical model is developed to solve it.
Abstract: We explore the problem of reconstructing an image from a bag of square, non-overlapping image patches, the jigsaw puzzle problem. Completing jigsaw puzzles is challenging and requires expertise even for humans, and is known to be NP-complete. We depart from previous methods that treat the problem as a constraint satisfaction problem and develop a graphical model to solve it. Each patch location is a node and each patch is a label at nodes in the graph. A graphical model requires a pairwise compatibility term, which measures an affinity between two neighboring patches, and a local evidence term, which we lack. This paper discusses ways to obtain these terms for the jigsaw puzzle problem. We evaluate several patch compatibility metrics, including the natural image statistics measure, and experimentally show that the dissimilarity-based compatibility (the sum of squared color differences along the abutting boundary) gives the best results. We compare two forms of local evidence for the graphical model: sparse-and-accurate evidence and dense-and-noisy evidence. We show that the sparse-and-accurate evidence, fixing as few as 4–6 patches at their correct locations, is enough to reconstruct images consisting of over 400 patches. To the best of our knowledge, this is the largest puzzle solved in the literature. We also show that one can coarsely estimate the low-resolution image from a bag of patches, suggesting that a bag of image patches encodes some geometric information about the original image.
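
The winning metric is concrete enough to sketch directly. Below is a minimal Python rendition, on random stand-in patches, of the dissimilarity-based compatibility for a horizontally adjacent pair: the sum of squared color differences along the abutting boundary, where lower means a better fit.

```python
import numpy as np

def right_left_dissimilarity(patch_a: np.ndarray, patch_b: np.ndarray) -> float:
    """SSD between the right edge of patch_a and the left edge of patch_b."""
    diff = patch_a[:, -1, :].astype(float) - patch_b[:, 0, :].astype(float)
    return float((diff ** 2).sum())

rng = np.random.default_rng(1)
a, b = rng.integers(0, 256, size=(2, 28, 28, 3))  # two random 28x28 RGB "patches"
print(right_left_dissimilarity(a, b))             # lower = more compatible
```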

181 citations


Journal Article
TL;DR: In this paper, a patch assignment problem on a Markov random field (MRF) is posed, where each patch should be used only once and neighboring patches should fit to form a plausible image.
Abstract: The patch transform represents an image as a bag of overlapping patches sampled on a regular grid. This representation allows users to manipulate images in the patch domain, which then seeds the inverse patch transform to synthesize modified images. Possible modifications include the spatial locations of patches, the size of the output image, or the pool of patches from which an image is reconstructed. When no modifications are made, the inverse patch transform reduces to solving a jigsaw puzzle. The inverse patch transform is posed as a patch assignment problem on a Markov random field (MRF), where each patch should be used only once and neighboring patches should fit to form a plausible image. We find an approximate solution to the MRF using loopy belief propagation, introducing an approximation that encourages the solution to use each patch only once. The image reconstruction algorithm scales well with the total number of patches through label pruning. In addition, structural misalignment artifacts are suppressed through a patch jittering scheme that spatially jitters the assigned patches. We demonstrate the patch transform and its effectiveness on natural images.
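
The paper's solver is loopy belief propagation with an approximation that discourages reusing patches, which is too long to reproduce here. As a much simpler stand-in for the exclusivity constraint alone, the sketch below solves a one-to-one patch-to-location assignment with the Hungarian algorithm over invented placement costs; it ignores the MRF's neighbor-compatibility term entirely.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Exclusivity in isolation: assign each patch to exactly one location by
# minimizing total placement cost. The costs here are random placeholders.
rng = np.random.default_rng(2)
cost = rng.random((9, 9))               # cost[i, j]: place patch j at location i
locations, patches = linear_sum_assignment(cost)
print(dict(zip(locations.tolist(), patches.tolist())))
```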

70 citations


Proceedings Article
13 Jun 2010
TL;DR: This work proposes an effective solution to this problem that is based on replacing the eyes of the user, and shows that this replacement, when done accurately, is enough to achieve a natural looking video.
Abstract: The camera in video conference systems is typically positioned above or below the screen, causing the gaze of the users to appear misplaced. We propose an effective solution to this problem based on replacing the eyes of the user. This replacement, when done accurately, is enough to achieve a natural-looking video. At an initialization stage, the user is asked to look straight at the camera. We store these frames, then track the eyes accurately in the video sequence and replace them, taking care of illumination and ghosting artifacts. We have tested the system on a large number of videos, demonstrating the effectiveness of the proposed solution.
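
Eye tracking is beyond a snippet, but the compositing step can be illustrated. The sketch below is a hypothetical version of the paste: a crude global gain roughly matches illumination, and a feathered alpha mask suppresses seams. The paper does not specify its blending at this level of detail, so every choice here is an assumption.

```python
import numpy as np

def paste_eye(frame, eye_patch, top, left, feather=4):
    """Feathered paste of eye_patch into frame at (top, left), with gain match."""
    h, w = eye_patch.shape[:2]
    region = frame[top:top + h, left:left + w].astype(float)
    patch = eye_patch.astype(float)
    patch *= (region.mean() + 1e-6) / (patch.mean() + 1e-6)  # crude illumination match
    ys = np.minimum(np.arange(h), np.arange(h)[::-1])  # distance to top/bottom edge
    xs = np.minimum(np.arange(w), np.arange(w)[::-1])  # distance to left/right edge
    alpha = np.minimum(np.minimum.outer(ys, xs) / feather, 1.0)[..., None]
    blended = alpha * patch + (1 - alpha) * region
    frame[top:top + h, left:left + w] = blended.astype(frame.dtype)
    return frame

frame = np.full((120, 160, 3), 100, np.uint8)  # flat stand-in video frame
eye = np.full((12, 20, 3), 180, np.uint8)      # stand-in stored eye patch
paste_eye(frame, eye, top=40, left=60)
```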

45 citations


Journal Article
27 May 2010
TL;DR: A system for exploring large collections of photos in a virtual 3D space that organizes the photos into themes, such as city streets or skylines, and lets users navigate within each theme using intuitive 3D controls, including move left/right, zoom, and rotate.
Abstract: We present a system for generating “infinite” images from large collections of photos by means of transformed image retrieval. Given a query image, we first transform it to simulate how it would look if the camera moved sideways and then perform image retrieval based on the transformed image. We then blend the query and retrieved images to create a larger panorama. Repeating this process will produce an “infinite” image. The transformed image retrieval model is not limited to simple 2-D left/right image translation, however, and we show how to approximate other camera motions like rotation and forward motion/zoom-in using simple 2-D image transforms. We represent images in the database as a graph where each node is an image and different types of edges correspond to different types of geometric transformations simulating different camera motions. Generating infinite images is thus reduced to following paths in the image graph. Given this data structure, we can also generate a panorama that connects two query images, simply by finding the shortest path between the two in the image graph. We call this option the “image taxi.” Our approach does not assume photographs are of a single real 3-D location, nor that they were taken at the same time. Instead, we organize the photos into themes, such as city streets or skylines, and synthesize new virtual scenes by combining images from distinct but visually similar locations. There are a number of potential applications of this technology. It can be used to generate long panoramas as well as content-aware transitions between reference images or video shots. Finally, the image graph allows users to interactively explore large photo collections for ideation, games, social interaction, and artistic purposes.
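
The image graph and the “image taxi” reduce to standard graph machinery. Here is a toy sketch with invented image names and motion-typed edges, using networkx, in which the taxi between two query images is simply the shortest path.

```python
import networkx as nx

# Toy "image graph": nodes are images, edges are typed by the camera
# motion they simulate. All names and matches below are made up.
G = nx.Graph()
G.add_edge("img_a", "img_b", motion="translate_right")
G.add_edge("img_b", "img_c", motion="zoom_in")
G.add_edge("img_c", "img_d", motion="rotate")
G.add_edge("img_a", "img_e", motion="translate_left")

taxi = nx.shortest_path(G, "img_a", "img_d")       # the "image taxi" route
print(taxi)                                        # ['img_a', 'img_b', 'img_c', 'img_d']
print([G.edges[u, v]["motion"] for u, v in zip(taxi, taxi[1:])])
```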

42 citations


01 Sep 2010
TL;DR: This approach can evaluate the efficacy of different feature sets and parameter settings for the matching paradigm with other image categories, using Amazon Mechanical Turk workers to rank the matches and predictions of different algorithm conditions by comparing each one to the selection of a random image.
Abstract: The paradigm of matching images to a very large dataset has been used for numerous vision tasks and is a powerful one. If the image dataset is large enough, one can expect to find good matches of almost any image to the database, allowing label transfer [3, 15] and image editing or enhancement [6, 11]. Users of this approach will want to know how many images are required, and what features to use for finding semantically relevant matches. Furthermore, for navigation tasks or to exploit context, users will want to know the predictive quality of the dataset: can we predict the image that would be seen under changes in camera position? We address these questions in detail for one category of images: street-level views. We have a dataset of images taken from an enumeration of positions and viewpoints within Pittsburgh. We evaluate how well we can match those images, using images from non-Pittsburgh cities, and how well we can predict the images that would be seen under changes in camera position. We compare performance on these tasks for eight different feature sets, finding that one feature set (HOG) outperforms the others. A combination of all the features performs better in the prediction task than any individual feature. We used Amazon Mechanical Turk workers to rank the matches and predictions of different algorithm conditions by comparing each one to the selection of a random image. This approach can evaluate the efficacy of different feature sets and parameter settings for the matching paradigm with other image categories.
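
HOG-plus-nearest-neighbor retrieval of the kind evaluated here is easy to prototype. Below is a minimal sketch that substitutes random arrays for the Pittsburgh street-view images and uses scikit-image's HOG descriptor; the image size and cell size are arbitrary choices for the example.

```python
import numpy as np
from skimage.feature import hog

# Describe each "image" with a HOG vector, then match a query to its
# nearest neighbor. Random arrays stand in for a real street-view dataset.
rng = np.random.default_rng(3)
database = rng.random((50, 64, 64))                    # 50 grayscale stand-ins
feats = np.stack([hog(im, pixels_per_cell=(16, 16)) for im in database])

query = database[7] + 0.01 * rng.standard_normal((64, 64))  # noisy copy of image 7
q = hog(query, pixels_per_cell=(16, 16))
best = np.linalg.norm(feats - q, axis=1).argmin()
print("best match:", best)                             # expect 7
```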

9 citations