
Showing papers by Andrew Zisserman published in 2003



Proceedings ArticleDOI
18 Jun 2003
TL;DR: The flexible nature of the model is demonstrated by excellent results over a range of datasets including geometrically constrained classes (e.g. faces, cars) and flexible objects (such as animals).
Abstract: We present a method to learn and recognize object class models from unlabeled and unsegmented cluttered scenes in a scale invariant manner. Objects are modeled as flexible constellations of parts. A probabilistic representation is used for all aspects of the object: shape, appearance, occlusion and relative scale. An entropy-based feature detector is used to select regions and their scale within the image. In learning, the parameters of the scale-invariant object model are estimated using expectation-maximization in a maximum-likelihood setting. In recognition, this model is used in a Bayesian manner to classify images. The flexible nature of the model is demonstrated by excellent results over a range of datasets including geometrically constrained classes (e.g. faces, cars) and flexible objects (such as animals).
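The decision rule behind "used in a Bayesian manner to classify images" can be sketched as a likelihood ratio summed over part-assignment hypotheses. The notation below (X for part locations, S for scales, A for appearances, h for an assignment of detected features to model parts, BG for background) is our hedged reading of the abstract, not a verbatim statement of the paper's equations:

```latex
R = \frac{p(\mathrm{Object}\mid X,S,A)}{p(\mathrm{BG}\mid X,S,A)}
  \;\propto\; \frac{p(X,S,A\mid\Theta)}{p(X,S,A\mid\Theta_{\mathrm{bg}})},
\qquad
p(X,S,A\mid\Theta) = \sum_{h} p(A\mid h,\Theta)\,p(X\mid h,\Theta)\,p(S\mid h,\Theta)\,p(h\mid\Theta)
```

An image is labeled as containing the object when R exceeds a threshold; EM in the learning stage maximizes the same likelihood over the model parameters Θ.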

2,411 citations


Proceedings ArticleDOI
18 Jun 2003
TL;DR: A novel texton-based representation is developed, which is suited to modeling this joint neighborhood distribution for MRFs, and it is demonstrated that textures can be classified using the joint distribution of intensity values over extremely compact neighborhoods.
Abstract: We question the role that large scale filter banks have traditionally played in texture classification. It is demonstrated that textures can be classified using the joint distribution of intensity values over extremely compact neighborhoods (starting from as small as 3 × 3 pixels square), and that this outperforms classification using filter banks with large support. We develop a novel texton-based representation, which is suited to modeling this joint neighborhood distribution for MRFs. The representation is learnt from training images, and then used to classify novel images (with unknown viewpoint and lighting) into texture classes. The power of the method is demonstrated by classifying over 2800 images of all 61 textures present in the Columbia-Utrecht database. The classification performance surpasses that of recent state-of-the-art filter bank based classifiers such as Leung & Malik, Cula & Dana, and Varma & Zisserman.
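As a concrete reading of the pipeline (compact raw-intensity neighbourhoods, textons learnt from training images, classification of novel images), here is a minimal sketch; every name and parameter below (patch size, number of textons, the chi-squared nearest-neighbour rule) is our illustrative choice, not the paper's exact configuration:

```python
# Illustrative patch-based texton classifier in the spirit of the abstract:
# textures are modeled by the joint distribution of raw intensities over
# compact (here 3x3) neighbourhoods. All names and parameters are ours.
import numpy as np
from sklearn.cluster import KMeans

def extract_patches(img, size=3):
    """All size x size neighbourhoods of a grayscale image, flattened."""
    H, W = img.shape
    return np.array([img[y:y+size, x:x+size].ravel()
                     for y in range(H - size + 1)
                     for x in range(W - size + 1)], dtype=np.float64)

def learn_textons(train_imgs, k=64):
    """Cluster raw patches from the training set into k textons."""
    data = np.vstack([extract_patches(im) for im in train_imgs])
    return KMeans(n_clusters=k, n_init=4, random_state=0).fit(data)

def texton_histogram(img, textons):
    """Normalized histogram of texton labels over an image."""
    labels = textons.predict(extract_patches(img))
    h = np.bincount(labels, minlength=textons.n_clusters).astype(float)
    return h / h.sum()

def classify(img, textons, model_hists, model_classes):
    """Nearest model histogram under the chi-squared distance."""
    h = texton_histogram(img, textons)
    d = [0.5 * np.sum((h - m) ** 2 / (h + m + 1e-10)) for m in model_hists]
    return model_classes[int(np.argmin(d))]
```

Model histograms would be built from labeled training images per texture class; classification is then a nearest-neighbour lookup over those histograms.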

504 citations


Journal ArticleDOI
TL;DR: The two key components necessary for successful super-resolution (SR) restoration are described: the accurate alignment or registration of the low-resolution (LR) images and the formulation of an SR estimator that uses a generative image model together with a prior model of the super-resolved image itself.
Abstract: Super-resolution (SR) restoration aims to solve the following problem: given a set of observed images, estimate an image at a higher resolution than is present in any of the individual images. Where the application of this technique in computer vision differs from other fields is in the variety and severity of the registration transformations between the images. In particular, this transformation is generally unknown, and a significant component of solving the SR problem in computer vision is the estimation of the transformation. The transformation may have a simple parametric form, or it may be scene dependent and have to be estimated for every point. In either case the transformation is estimated directly and automatically from the images. We describe the two key components that are necessary for successful SR restoration: the accurate alignment or registration of the low-resolution (LR) images and the formulation of an SR estimator that uses a generative image model together with a prior model of the super-resolved image itself. As with many other problems in computer vision, these different aspects are tackled in a robust, statistical framework.
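A standard way to write the two components named above (our notation, hedged; the paper's exact operators may differ) is a generative model in which each observed LR image is a geometrically warped, blurred, and downsampled copy of the unknown image, combined with a prior in a MAP estimator:

```latex
y_k = D\,B\,W_k\,x + \epsilon_k, \quad \epsilon_k \sim \mathcal{N}(0,\sigma^2 I),
\qquad
\hat{x} = \arg\min_{x}\; \sum_k \frac{\lVert y_k - D\,B\,W_k\,x \rVert^2}{2\sigma^2} \;-\; \log p(x)
```

Here W_k is the registration transformation estimated from the images themselves, B the blur, D the downsampling, and p(x) the prior over the super-resolved image x.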

296 citations


Proceedings ArticleDOI
08 Sep 2003
TL;DR: An approach to recognizing poorly textured objects that may contain holes and tubular parts in cluttered scenes under arbitrary viewing conditions is described, and a new edge-based local feature detector that is invariant to similarity transformations is introduced.
Abstract: In this paper we describe an approach to recognizing poorly textured objects that may contain holes and tubular parts in cluttered scenes under arbitrary viewing conditions. To this end we develop a number of novel components. First, we introduce a new edge-based local feature detector that is invariant to similarity transformations. The features are localized on edges and a neighbourhood is estimated in a scale invariant manner. Second, the neighbourhood descriptor computed for foreground features is not affected by background clutter, even if the feature is on an object boundary. Third, the descriptor generalizes Lowe's SIFT method to edges. An object model is learnt from a single training image. The object is then recognized in new images in a series of steps that apply progressively tighter geometric restrictions. A final contribution of this work is to allow sufficient flexibility in the geometric representation that objects in the same visual class can be recognized. Results are demonstrated for various object classes including bikes and rackets.

234 citations


Journal ArticleDOI
13 Oct 2003
TL;DR: The paper's second contribution is to constrain the generated views to lie in the space of images whose texture statistics are those of the input images, which amounts to an image-based prior on the reconstruction that regularizes the solution, yielding realistic synthetic views.
Abstract: Given a set of images acquired from known viewpoints, we describe a method for synthesizing the image which would be seen from a new viewpoint. In contrast to existing techniques, which explicitly reconstruct the 3D geometry of the scene, we transform the problem to the reconstruction of colour rather than depth. This retains the benefits of geometric constraints, but projects out the ambiguities in depth estimation which occur in textureless regions. On the other hand, regularization is still needed in order to generate high-quality images. The paper's second contribution is to constrain the generated views to lie in the space of images whose texture statistics are those of the input images. This amounts to an image-based prior on the reconstruction which regularizes the solution, yielding realistic synthetic views. Examples are given of new view generation for cameras interpolated between the acquisition viewpoints, which enables synthetic steadicam stabilization of a sequence with a high level of realism.
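Read as an estimation problem (our hedged formalization; the symbols V for the synthesized view and I_1, ..., I_n for the inputs are ours), the contribution is to pair photoconsistency with a texture-statistics prior:

```latex
V^{*} = \arg\max_{V}\;
  \underbrace{p(I_1,\dots,I_n \mid V)}_{\text{photoconsistency with the inputs}}\;
  \underbrace{p_{\mathrm{tex}}(V)}_{\text{image-based texture prior}}
```

The prior term is what regularizes the textureless regions, replacing the generic smoothness assumptions used when depth is reconstructed explicitly.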

167 citations


Proceedings ArticleDOI
18 Jun 2003
TL;DR: A joint manifold distance (JMD) is developed which measures the distance between two subspaces, where each subspace is invariant to a desired group of transformations, for example affine warping of the image plane.
Abstract: We wish to match sets of images to sets of images where both sets are undergoing various distortions such as viewpoint and lighting changes. To this end we have developed a joint manifold distance (JMD) which measures the distance between two subspaces, where each subspace is invariant to a desired group of transformations, for example affine warping of the image plane. The JMD may be seen as generalizing invariant distance metrics such as tangent distance in two important ways. First, formally representing priors on the image distribution avoids certain difficulties, which in previous work have required ad hoc correction. The second contribution is the observation that previous distances have been computed using what amounted to "home-grown" nonlinear optimizers, and that more reliable results can be obtained by using generic optimizers developed in the numerical analysis community, which automatically set the parameters that home-grown methods must set by art. The JMD is used in this work to cluster faces in video. Sets of faces detected in contiguous frames define the subspaces, and the distance between the subspaces is computed using the JMD. In this way the principal cast of a movie can be 'discovered' as the principal clusters. We demonstrate the method on a feature-length movie.
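For orientation, here is the plain (non-invariant) version of a subspace-to-subspace distance, computed from principal angles; the paper's JMD generalizes this by building in invariance to a transformation group (e.g. affine warps) and priors on the image distribution, neither of which this sketch has:

```python
# A plain subspace-to-subspace distance via principal angles, for context
# only. Names and the subspace dimension are our illustrative choices.
import numpy as np
from scipy.linalg import subspace_angles

def face_subspace(faces, dim=5):
    """Orthonormal basis spanning a set of vectorized face detections."""
    X = np.stack([f.ravel() for f in faces], axis=1).astype(np.float64)
    X -= X.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :dim]

def subspace_distance(A, B):
    """Largest principal angle between the subspaces spanned by A and B."""
    return float(np.max(subspace_angles(A, B)))
```

Clustering the per-shot face subspaces under such a distance is the shape of the pipeline; the JMD replaces this distance with one optimized over the allowed transformations of each subspace.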

118 citations


Proceedings Article
09 Dec 2003
TL;DR: This work presents a domain-specific image prior in the form of a p.d.f. based upon sampled images, and shows that for certain types of super-resolution problems, this sample-based prior gives a significant improvement over other common multiple-image super-resolution techniques.
Abstract: Super-resolution aims to produce a high-resolution image from a set of one or more low-resolution images by recovering or inventing plausible high-frequency image content. Typical approaches try to reconstruct a high-resolution image using the sub-pixel displacements of several low-resolution images, usually regularized by a generic smoothness prior over the high-resolution image space. Other methods use training data to learn low-to-high-resolution matches, and have been highly successful even in the single-input-image case. Here we present a domain-specific image prior in the form of a p.d.f. based upon sampled images, and show that for certain types of super-resolution problems, this sample-based prior gives a significant improvement over other common multiple-image super-resolution techniques.
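One natural concrete form for "a p.d.f. based upon sampled images" (this reading is ours, not necessarily the paper's exact construction) is a kernel density over patches {z_j} drawn from the domain-specific samples:

```latex
p(x) \;\propto\; \prod_{i} \sum_{j} \exp\!\left( -\frac{\lVert P_i x - z_j \rVert^2}{2\sigma^2} \right)
```

where P_i extracts the i-th patch of the candidate high-resolution image x; plugged into a MAP estimator, such a prior favors reconstructions whose patches resemble the sampled exemplars.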

90 citations


Journal ArticleDOI
TL;DR: Progress is described in matching shots that are images of the same 3D location in a film, together with analogues of local spatial consistency, cross-correlation, and epipolar geometry for these tracks.

86 citations


Journal ArticleDOI
TL;DR: Three new methods are proposed, based on fitting a conic locus to corresponding image points over multiple views; these are simpler and more robust than determining a fundamental matrix from two views or a trifocal tensor from three views.
Abstract: Previous algorithms for recovering 3D geometry from an uncalibrated image sequence of a single axis motion of unknown rotation angles are mainly based on the computation of two-view fundamental matrices and three-view trifocal tensors. We propose three new methods that are based on fitting a conic locus to corresponding image points over multiple views. The main advantage is that determining only five parameters of a conic from one corresponding point over at least five views is simpler and more robust than determining a fundamental matrix from two views or a trifocal tensor from three views. It is shown that the geometry of single axis motion can be recovered either by computing one conic locus and one fundamental matrix or by computing at least two conic loci. A maximum likelihood solution based on this parametrization of the single axis motion is also described for optimal estimation using three or more loci. The experiments on real image sequences demonstrate the simplicity, accuracy, and robustness of the new methods.
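The core primitive, fitting the five parameters of a conic to one point tracked over at least five views, reduces to a linear null-space problem. A minimal sketch (function names ours):

```python
# Hedged sketch of conic fitting: each image position (x, y) of one point
# tracked across the views contributes one row of the design matrix for
# the conic a x^2 + b xy + c y^2 + d x + e y + f = 0.
import numpy as np

def fit_conic(points):
    """Least-squares conic through the given (x, y) points (at least 5)."""
    A = np.array([[x*x, x*y, y*y, x, y, 1.0] for x, y in points])
    # The coefficients are the right singular vector with the smallest
    # singular value, i.e. the approximate null vector of A.
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1]  # (a, b, c, d, e, f), defined up to scale
```

With five or more views the smallest right singular vector gives the conic up to scale; robustness to mistracked points can be added by running RANSAC over the rows.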

47 citations


Proceedings ArticleDOI
18 Jun 2003
TL;DR: The method described here allows epipolar curves to be learnt from multiple image pairs acquired by stereo cameras with fixed configuration; for standard stereo configurations the results are comparable to those obtained from a state-of-the-art parametric-model method, despite the significantly weaker constraints on the non-parametric model.
Abstract: We wish to determine the epipolar geometry of a stereo camera pair from image measurements alone. This paper describes a solution to this problem, which does not require a parametric model of the camera system, and consequently applies equally well to a wide class of stereo configurations. Examples in the paper range from a standard pinhole stereo configuration to more exotic systems combining curved mirrors and wide-angle lenses. The method described here allows epipolar curves to be learnt from multiple image pairs acquired by stereo cameras with fixed configuration. By aggregating information over the multiple image pairs, a dense map of the epipolar curves can be determined on the images. The algorithm requires a large number of images, but has the distinct benefit that the correspondence problem does not have to be explicitly solved. We show that for standard stereo configurations the results are comparable to those obtained from a state-of-the-art parametric-model method, despite the significantly weaker constraints on the non-parametric model. The new algorithm is simple to implement, so it may easily be employed on a new and possibly complex camera system.

Proceedings ArticleDOI
18 Jun 2003
TL;DR: This work shows that when there is some control over the motion of the camera, a fast linear solution is available without these restrictions, and shows the algorithm to be simple, fast, and accurate.
Abstract: Planar scenes would appear to be ideally suited for self-calibration because, by eliminating the problems of occlusion and parallax, high accuracy two-view relationships can be calculated without restricting motion to pure rotation. Unfortunately, the only monocular solutions so far devised involve costly nonlinear minimizations, which must be initialized with educated guesses for the calibration parameters. So far, this problem has been circumvented by using stereo or a known calibration object. In this work we show that when there is some control over the motion of the camera, a fast linear solution is available without these restrictions. For a camera undergoing a motion about a plane-normal rotation axis (typified for instance by a motion in the plane of the scene), the complex eigenvectors of a plane-induced homography are coincident with the circular points of the motion. Three such homographies provide sufficient information to solve for the image of the absolute conic (IAC), and therefore the calibration parameters. The required situation arises most commonly when the camera is viewing the ground plane, and either moving along it, or rotating about some vertical axis. We demonstrate a number of useful applications, and show the algorithm to be simple, fast, and accurate.
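A hedged sketch of the linear recipe in the abstract: the complex eigenvectors of each plane-induced homography are the imaged circular points of the motion, and each such point x contributes two linear constraints (the real and imaginary parts of x^T w x = 0) on the image of the absolute conic w; three homographies then determine w, and the calibration matrix K follows by Cholesky factorization. Conditioning, eigenvector selection, and degeneracy checks are omitted here:

```python
# Linear self-calibration sketch from plane-induced homographies; all
# names and tolerances are our illustrative choices.
import numpy as np

def iac_constraint(x):
    """Linear constraint x^T w x = 0 on the symmetric IAC, with w
    parametrized as (w11, w12, w22, w13, w23, w33)."""
    x1, x2, x3 = x
    return np.array([x1*x1, 2*x1*x2, x2*x2, 2*x1*x3, 2*x2*x3, x3*x3])

def calibrate(Hs):
    """Calibration matrix K from >= 3 plane-induced homographies."""
    rows = []
    for H in Hs:
        vals, vecs = np.linalg.eig(H)
        for i in range(3):
            if vals[i].imag > 1e-8:             # one of the conjugate pair
                c = iac_constraint(vecs[:, i])  # imaged circular point
                rows.extend([c.real, c.imag])   # both parts must vanish
    _, _, Vt = np.linalg.svd(np.array(rows))
    w = Vt[-1]
    W = np.array([[w[0], w[1], w[3]],
                  [w[1], w[2], w[4]],
                  [w[3], w[4], w[5]]])
    if W[0, 0] < 0:                             # fix the overall sign
        W = -W
    # w = (K K^T)^{-1}; with w = L L^T (Cholesky, L lower triangular),
    # K = L^{-T} is upper triangular and equals K up to scale.
    L = np.linalg.cholesky(W)
    K = np.linalg.inv(L).T
    return K / K[2, 2]
```

Each homography supplies two constraints, so three homographies give the six rows needed to determine the five degrees of freedom of the IAC, matching the count in the abstract.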

01 Apr 2003
TL;DR: This manuscript outlines the current demonstration system for translating visual Sign to written text based around a broad description of scene activity that naturally generalizes, reducing training requirements and allowing the knowledge base to be explicitly stated.
Abstract: This manuscript outlines our current demonstration system for translating visual Sign to written text. The system is based around a broad description of scene activity that naturally generalizes, reducing training requirements and allowing the knowledge base to be explicitly stated. This allows the same system to be used for different sign languages requiring only a change of the knowledge base.

Book ChapterDOI
01 Jan 2003
TL;DR: This work states that the goal of automatic recovery of camera motion and scene structure from video sequences has been a staple of computer vision research for over a decade and now represents one of the success stories of computer vision.
Abstract: The goal of automatic recovery of camera motion and scene structure from video sequences has been a staple of computer vision research for over a decade. As an area of endeavour, it has seen both steady and explosive progress over time, and now represents one of the success stories of computer vision. This task, automatic camera tracking or “matchmoving”, is the sine qua non of modern special effects, allowing the seamless insertion of computer generated objects onto live-action backgrounds (figure 2.1 shows an example). It has moved from a research problem for a small number of uncalibrated images to commercial software which can automatically track cameras through thousands of frames [1]. In addition, camera tracking is an important preprocess for many computer vision algorithms such as multiple-view shape reconstruction, novel view synthesis and autonomous vehicle navigation.

Book ChapterDOI
10 Sep 2003
TL;DR: In this paper, the problem of recovering the generating curve of a surface of revolution from a single uncalibrated perspective view, based solely on the object's outline and two (partly) visible cross-sections, is addressed.
Abstract: This paper addresses the problem of recovering the generating curve of a surface of revolution from a single uncalibrated perspective view, based solely on the object’s outline and two (partly) visible cross-sections. Without calibration of the camera’s internal parameters such recovery is only possible up to a particular transformation of the true shape. This is however sufficient for 3D reconstruction up to a 2 DOF transformation, for recognition of objects, and for transfer between views. We will describe the basic algorithm and show some examples.

Proceedings ArticleDOI
10 Nov 2003
TL;DR: A method is described for automatically generating accurate piecewise planar models of indoor scenes using a combination of a 2D laser scanner and a camera on a mobile platform, exploiting the complementarity of the sensors.
Abstract: We describe a method for automatically generating accurate piecewise planar models for indoor scenes using a combination of a 2D laser scanner and a camera on a mobile platform. The method exploits the complementarity of the sensors. Mapping techniques applied to 2D laser scans simultaneously compute a map and the location of the sensor in the unknown environment. This provides an initial estimate for the vision algorithms by compensating the rotation, foreshortening and the scale change between images. The vision algorithms are then able to compute a very accurate registration (via a plane to plane homography), which is used to segment the model into planar facets, and to improve the estimate of the model and sensor position. Results are demonstrated on a man made scene using a 2D laser scanner and a calibrated camera mounted on a trolley.
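The registration step named above (a plane-to-plane homography between views of the same facet) can be sketched with standard tools; in this illustrative version the matching is seeded by generic keypoints rather than by the laser-derived pose the paper uses to compensate rotation, foreshortening, and scale change:

```python
# Hedged sketch of plane-to-plane homography registration between two
# views; detector choice and thresholds are our illustrative assumptions.
import cv2
import numpy as np

def register_plane(img1, img2):
    orb = cv2.ORB_create(2000)
    k1, d1 = orb.detectAndCompute(img1, None)
    k2, d2 = orb.detectAndCompute(img2, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    # RANSAC-fitted homography; the inlier mask delimits the planar facet.
    H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H, mask
```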


01 Jan 2003
TL;DR: The book first illustrates the breadth of application of reconstruction processes in vision with results that the authors' theory and program yield for a variety of problems, and the mathematics of weak continuity and the graduated nonconvexity algorithm are developed carefully and progressively.
Abstract: Visual Reconstruction presents a unified and highly original approach to the treatment of continuity in vision. It introduces, analyzes, and illustrates two new concepts. The first, the weak continuity constraint, is a concise, computational formalization of piecewise continuity. It is a mechanism for expressing the expectation that visual quantities such as intensity, surface color, and surface depth vary continuously almost everywhere, but with occasional abrupt changes. The second concept, the graduated nonconvexity algorithm, arises naturally from the first. It is an efficient, deterministic (nonrandom) algorithm for fitting piecewise continuous functions to visual data. The book first illustrates the breadth of application of reconstruction processes in vision with results that the authors' theory and program yield for a variety of problems. The mathematics of weak continuity and the graduated nonconvexity (GNC) algorithm are then developed carefully and progressively. Contents: Modeling Piecewise Continuity. Applications of Piecewise Continuous Reconstruction. Introducing Weak Continuity Constraints. Properties of the Weak String and Membrane. Properties of Weak Rod and Plate. The Discrete Problem. The Graduated Nonconvexity (GNC) Algorithm. Appendixes: Energy Calculations for the String and Membrane. Noise Performance of the Weak Elastic String. Energy Calculations for the Rod and Plate. Establishing Convexity. Analysis of the GNC Algorithm. Visual Reconstruction is included in the Artificial Intelligence series, edited by Michael Brady and Patrick Winston.
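The two concepts can be stated compactly for the 1D "weak string" case (notation ours, in the book's spirit): fit a function u to data d while paying either a smoothness cost or a bounded "break" penalty α at each site,

```latex
E(u) \;=\; \sum_i \bigl(u_i - d_i\bigr)^2 \;+\; \sum_i \min\!\bigl(\lambda^2 (u_{i+1}-u_i)^2,\; \alpha\bigr)
```

GNC minimizes E by embedding the non-convex truncated quadratic min(λ²t², α) in a one-parameter family of penalty functions that is convex at one end and is gradually deformed back to the true penalty, tracking the minimizer throughout.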