Journal ArticleDOI

Matching Widely Separated Views Based on Affine Invariant Regions

01 Aug 2004-International Journal of Computer Vision (Kluwer Academic Publishers)-Vol. 59, Iss: 1, pp 61-85
TL;DR: To increase the robustness of the system, two semi-local constraints on combinations of region correspondences are derived (one geometric, the other photometric); these allow the consistency of correspondences to be tested and hence falsely matched regions to be rejected.
Abstract: ‘Invariant regions’ are self-adaptive image patches that automatically deform with changing viewpoint so as to keep covering identical physical parts of a scene. Such regions can be extracted directly from a single image. They are then described by a set of invariant features, which makes it relatively easy to match them between views, even under wide baseline conditions. In this contribution, two methods to extract invariant regions are presented. The first one starts from corners and uses the nearby edges, while the second one is purely intensity-based. As a matter of fact, the goal is to build an opportunistic system that exploits several types of invariant regions as it sees fit. This yields more correspondences and a system that can deal with a wider range of images. To increase the robustness of the system, two semi-local constraints on combinations of region correspondences are derived (one geometric, the other photometric). They make it possible to test the consistency of correspondences and hence to reject falsely matched regions. Experiments on images of real-world scenes taken from substantially different viewpoints demonstrate the feasibility of the approach.
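The matching stage the abstract refers to can be sketched in a few lines. This is not the authors' actual scheme (the paper compares moment invariants under a Mahalanobis distance); it is a generic nearest-neighbour matcher with a ratio test over whatever invariant descriptor vectors the regions yield, with all names hypothetical:

```python
import math

def match_regions(desc_a, desc_b, ratio=0.8):
    """Greedy nearest-neighbour matching of invariant region descriptors.

    desc_a, desc_b: lists of equal-length feature vectors, one per region.
    A pair (i, j) is kept only when the nearest neighbour in desc_b is
    clearly closer than the second nearest (a ratio test), which discards
    ambiguous matches before any semi-local or epipolar filtering.
    """
    def dist(u, v):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

    matches = []
    for i, da in enumerate(desc_a):
        ranked = sorted(range(len(desc_b)), key=lambda j: dist(da, desc_b[j]))
        if len(ranked) >= 2:
            d1, d2 = dist(da, desc_b[ranked[0]]), dist(da, desc_b[ranked[1]])
            if d2 > 0 and d1 / d2 < ratio:
                matches.append((i, ranked[0]))
    return matches
```

The semi-local geometric and photometric constraints described in the abstract would normally be applied on top of such an initial candidate list.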


Citations
Journal ArticleDOI
TL;DR: It is observed that the ranking of the descriptors is mostly independent of the interest region detector, that the SIFT-based descriptors perform best, and that moments and steerable filters show the best performance among the low-dimensional descriptors.
Abstract: In this paper, we compare the performance of descriptors computed for local interest regions, as, for example, extracted by the Harris-Affine detector [Mikolajczyk, K and Schmid, C, 2004]. Many different descriptors have been proposed in the literature. It is unclear which descriptors are more appropriate and how their performance depends on the interest region detector. The descriptors should be distinctive and at the same time robust to changes in viewing conditions as well as to errors of the detector. Our evaluation uses as criterion recall with respect to precision and is carried out for different image transformations. We compare shape context [Belongie, S, et al., April 2002], steerable filters [Freeman, W and Adelson, E, Sept. 1991], PCA-SIFT [Ke, Y and Sukthankar, R, 2004], differential invariants [Koenderink, J and van Doorn, A, 1987], spin images [Lazebnik, S, et al., 2003], SIFT [Lowe, D. G., 1999], complex filters [Schaffalitzky, F and Zisserman, A, 2002], moment invariants [Van Gool, L, et al., 1996], and cross-correlation for different types of interest regions. We also propose an extension of the SIFT descriptor and show that it outperforms the original method. Furthermore, we observe that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best. Moments and steerable filters show the best performance among the low dimensional descriptors.
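The evaluation criterion mentioned above (recall with respect to precision) reduces to set arithmetic over candidate matches and ground-truth correspondences. The following helper is an illustrative sketch, not the paper's evaluation code:

```python
def recall_precision(matches, ground_truth):
    """Recall and precision of descriptor matches against ground truth.

    matches: set of (i, j) candidate correspondences produced by matching.
    ground_truth: set of (i, j) correspondences known to be correct
    (in the paper, derived from ground-truth image transformations).
    """
    correct = matches & ground_truth
    recall = len(correct) / len(ground_truth) if ground_truth else 0.0
    precision = len(correct) / len(matches) if matches else 0.0
    return recall, precision
```

Sweeping the matching threshold and plotting recall against 1 - precision gives the curves such evaluations report.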

7,057 citations


Cites background or methods from "Matching Widely Separated Views Bas..."

  • ...by Baumberg [2] as well as Schaffalitzky and Zisserman [37]. Tuytelaars and Van Gool [42]...


  • ...in applications such as wide baseline matching [37, 42], object recognition [10, 25], texture...


Proceedings ArticleDOI
17 Jun 2006
TL;DR: A recognition scheme that scales efficiently to a large number of objects and allows a larger and more discriminatory vocabulary to be used efficiently is presented, which it is shown experimentally leads to a dramatic improvement in retrieval quality.
Abstract: A recognition scheme that scales efficiently to a large number of objects is presented. The efficiency and quality is exhibited in a live demonstration that recognizes CD-covers from a database of 40000 images of popular music CD’s. The scheme builds upon popular techniques of indexing descriptors extracted from local regions, and is robust to background clutter and occlusion. The local region descriptors are hierarchically quantized in a vocabulary tree. The vocabulary tree allows a larger and more discriminatory vocabulary to be used efficiently, which we show experimentally leads to a dramatic improvement in retrieval quality. The most significant property of the scheme is that the tree directly defines the quantization. The quantization and the indexing are therefore fully integrated, essentially being one and the same. The recognition quality is evaluated through retrieval on a database with ground truth, showing the power of the vocabulary tree approach, going as high as 1 million images.
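The core of the vocabulary-tree idea, descending from the root and picking the nearest cluster centre at each level until a leaf visual word is reached, can be sketched as follows. This is a toy structure, not the paper's implementation; the actual tree is built by hierarchical k-means over large sets of local descriptors:

```python
def quantize(node, desc):
    """Descend a vocabulary tree: at each level choose the child whose
    centre is nearest to the descriptor, until a leaf word id is reached.

    node: either a leaf word id (int) or a pair (centres, children),
    where centres[k] is the cluster centre guarding subtree children[k].
    desc: the descriptor vector to quantize.
    Returns (word_id, root_to_leaf_path).
    """
    path = []
    while not isinstance(node, int):
        centres, children = node
        k = min(range(len(centres)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(centres[c], desc)))
        path.append(k)
        node = children[k]
    return node, path
```

Because the tree defines the quantization, indexing a descriptor costs only one comparison per level, which is what lets a very large vocabulary be used efficiently.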

4,024 citations


Cites background from "Matching Widely Separated Views Bas..."

  • ...The recognition quality is evaluated through retrieval on a database with ground truth, showing the power of the vocabulary tree approach, going as high as 1 million images....


Proceedings ArticleDOI
18 Jun 2003
TL;DR: It is observed that the ranking of the descriptors is mostly independent of the interest region detector, that the SIFT-based descriptors perform best, and that moments and steerable filters show the best performance among the low-dimensional descriptors.
Abstract: In this paper we compare the performance of interest point descriptors. Many different descriptors have been proposed in the literature. However, it is unclear which descriptors are more appropriate and how their performance depends on the interest point detector. The descriptors should be distinctive and at the same time robust to changes in viewing conditions as well as to errors of the point detector. Our evaluation uses as criterion detection rate with respect to false positive rate and is carried out for different image transformations. We compare SIFT descriptors (Lowe, 1999), steerable filters (Freeman and Adelson, 1991), differential invariants (Koenderink and van Doorn, 1987), complex filters (Schaffalitzky and Zisserman, 2002), moment invariants (Van Gool et al., 1996) and cross-correlation for different types of interest points. In this evaluation, we observe that the ranking of the descriptors does not depend on the point detector and that SIFT descriptors perform best. Steerable filters come second; they can be considered a good choice given the low dimensionality.

3,362 citations


Cites background from "Matching Widely Separated Views Bas..."

  • ...Local photometric descriptors computed for interest regions have proved to be very successful in applications such as wide baseline matching [37, 42], object recognition [10, 25], texture...


  • ...Tuytelaars and Van Gool [42] construct two types of affine-invariant regions, one based on a combination of interest points and edges and the other one based on image intensities....


Journal ArticleDOI
TL;DR: A snapshot of the state of the art in affine covariant region detectors, and compares their performance on a set of test images under varying imaging conditions to establish a reference test set of images and performance software so that future detectors can be evaluated in the same framework.
Abstract: The paper gives a snapshot of the state of the art in affine covariant region detectors, and compares their performance on a set of test images under varying imaging conditions. Six types of detectors are included: detectors based on affine normalization around Harris (Mikolajczyk and Schmid, 2002; Schaffalitzky and Zisserman, 2002) and Hessian points (Mikolajczyk and Schmid, 2002), a detector of `maximally stable extremal regions', proposed by Matas et al. (2002); an edge-based region detector (Tuytelaars and Van Gool, 1999) and a detector based on intensity extrema (Tuytelaars and Van Gool, 2000), and a detector of `salient regions', proposed by Kadir, Zisserman and Brady (2004). The performance is measured against changes in viewpoint, scale, illumination, defocus and image compression. The objective of this paper is also to establish a reference test set of images and performance software, so that future detectors can be evaluated in the same framework.
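The repeatability measure such detector evaluations rely on counts a region in one view as repeated when some region detected in the other view overlaps it sufficiently after mapping through the ground-truth homography. A simplified sketch, using axis-aligned boxes instead of the elliptical regions of the actual protocol:

```python
def repeatability(regions_a, regions_b_projected, overlap_thresh=0.6):
    """Fraction of regions in image A matched by a sufficiently
    overlapping region detected in image B (already projected into A's
    frame here). Regions are boxes (x0, y0, x1, y1) for simplicity.
    """
    def overlap(r, s):
        # Intersection-over-union of two axis-aligned boxes.
        ix = max(0, min(r[2], s[2]) - max(r[0], s[0]))
        iy = max(0, min(r[3], s[3]) - max(r[1], s[1]))
        inter = ix * iy
        area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
        union = area(r) + area(s) - inter
        return inter / union if union else 0.0

    repeated = sum(1 for r in regions_a
                   if any(overlap(r, s) >= overlap_thresh
                          for s in regions_b_projected))
    return repeated / len(regions_a) if regions_a else 0.0
```

Plotting this score against viewpoint angle, scale change, or compression level yields the comparison curves such surveys publish.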

3,359 citations


Cites methods from "Matching Widely Separated Views Bas..."

  • ...More details about this method can be found in [47, 48]....


  • ...A more detailed explanation of this method can be found in [45, 48]....


  • ...The detectors are: (i) the ‘Harris-Affine’ detector [24, 27, 34]; (ii) the ‘Hessian-Affine’ detector [24, 27]; (iii) the ‘maximally stable extremal region’ detector (or MSER, for short) [21, 22]; (iv) an edge-based region detector [45, 48] (referred to as EBR); (v) an intensity extrema-based region detector [47, 48] (referred to as IBR); and (vi) an entropy-based region detector [12] (referred to as salient regions)....


Journal ArticleDOI
TL;DR: This paper presents structure-from-motion and image-based rendering algorithms that operate on hundreds of images downloaded as a result of keyword-based image search queries like “Notre Dame” or “Trevi Fountain,” and presents these algorithms and results as a first step towards 3D modeled sites, cities, and landscapes from Internet imagery.
Abstract: There are billions of photographs on the Internet, comprising the largest and most diverse photo collection ever assembled. How can computer vision researchers exploit this imagery? This paper explores this question from the standpoint of 3D scene modeling and visualization. We present structure-from-motion and image-based rendering algorithms that operate on hundreds of images downloaded as a result of keyword-based image search queries like "Notre Dame" or "Trevi Fountain." This approach, which we call Photo Tourism, has enabled reconstructions of numerous well-known world sites. This paper presents these algorithms and results as a first step towards 3D modeling of the world's well-photographed sites, cities, and landscapes from Internet imagery, and discusses key open problems and challenges for the research community.

2,207 citations


Cites background from "Matching Widely Separated Views Bas..."

  • ...Furthermore, Internet imagery provides an ideal test bed for developing robust and general computer vision algorithms that can work effectively “in the wild.”...


References
Journal ArticleDOI
TL;DR: There is a natural uncertainty principle between detection and localization performance, which are the two main goals, and with this principle a single operator shape is derived which is optimal at any scale.
Abstract: This paper describes a computational approach to edge detection. The success of the approach depends on the definition of a comprehensive set of goals for the computation of edge points. These goals must be precise enough to delimit the desired behavior of the detector while making minimal assumptions about the form of the solution. We define detection and localization criteria for a class of edges, and present mathematical forms for these criteria as functionals on the operator impulse response. A third criterion is then added to ensure that the detector has only one response to a single edge. We use the criteria in numerical optimization to derive detectors for several common image features, including step edges. On specializing the analysis to step edges, we find that there is a natural uncertainty principle between detection and localization performance, which are the two main goals. With this principle we derive a single operator shape which is optimal at any scale. The optimal detector has a simple approximate implementation in which edges are marked at maxima in gradient magnitude of a Gaussian-smoothed image. We extend this simple detector using operators of several widths to cope with different signal-to-noise ratios in the image. We present a general method, called feature synthesis, for the fine-to-coarse integration of information from operators at different scales. Finally we show that step edge detector performance improves considerably as the operator point spread function is extended along the edge.
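The "simple approximate implementation" mentioned at the end of the abstract, marking edges at maxima of the gradient magnitude of a smoothed signal, can be sketched in one dimension. This is illustrative only; the real detector operates on 2-D images with hysteresis thresholding:

```python
def canny_1d(signal, kernel=(0.25, 0.5, 0.25), threshold=1.0):
    """1-D sketch of Canny's approximate implementation: smooth the
    signal, differentiate it, then keep local maxima of the gradient
    magnitude that exceed a threshold (non-maximum suppression)."""
    n = len(signal)
    # Smooth with a small Gaussian-like kernel (borders are clamped).
    smoothed = [sum(kernel[j] * signal[min(max(i + j - 1, 0), n - 1)]
                    for j in range(3)) for i in range(n)]
    # Finite-difference gradient of the smoothed signal.
    grad = [smoothed[i + 1] - smoothed[i] for i in range(n - 1)]
    # Non-maximum suppression: keep local maxima of |grad| above threshold.
    return [i for i in range(1, len(grad) - 1)
            if abs(grad[i]) >= threshold
            and abs(grad[i]) >= abs(grad[i - 1])
            and abs(grad[i]) >= abs(grad[i + 1])]
```

On a single step edge the suppression step reports exactly one position, matching the "only one response to a single edge" criterion above.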

28,073 citations


"Matching Widely Separated Views Bas..." refers to methods in this paper

  • ...The first method for affine invariant region extraction starts from Harris corner points (Harris and Stephens, 1988) and the edges that can often be found close to such a point (extracted using the Canny edge detector (Canny, 1986))....


Journal ArticleDOI
TL;DR: New results are derived on the minimum number of landmarks needed to obtain a solution, and algorithms are presented for computing these minimum-landmark solutions in closed form that provide the basis for an automatic system that can solve the Location Determination Problem under difficult viewing.
Abstract: A new paradigm, Random Sample Consensus (RANSAC), for fitting a model to experimental data is introduced. RANSAC is capable of interpreting/smoothing data containing a significant percentage of gross errors, and is thus ideally suited for applications in automated image analysis where interpretation is based on the data provided by error-prone feature detectors. A major portion of this paper describes the application of RANSAC to the Location Determination Problem (LDP): Given an image depicting a set of landmarks with known locations, determine that point in space from which the image was obtained. In response to a RANSAC requirement, new results are derived on the minimum number of landmarks needed to obtain a solution, and algorithms are presented for computing these minimum-landmark solutions in closed form. These results provide the basis for an automatic system that can solve the LDP under difficult viewing
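The fitting paradigm itself is a short loop: hypothesize a model from a minimal random sample, count the data points that agree with it, and keep the hypothesis with the largest consensus set. A minimal sketch for 2-D line fitting (the structure is identical when the model is instead an epipolar geometry estimated from region correspondences, as in the citing paper):

```python
import random

def ransac_line(points, iters=200, tol=0.5, seed=0):
    """Minimal RANSAC sketch: fit y = a*x + b to 2-D points containing
    gross outliers by repeatedly sampling minimal subsets (2 points)
    and keeping the hypothesis with the most inliers."""
    rng = random.Random(seed)
    best_inliers, best_model = [], None
    for _ in range(iters):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # degenerate sample: vertical line not modelled here
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        inliers = [p for p in points if abs(p[1] - (a * p[0] + b)) <= tol]
        if len(inliers) > len(best_inliers):
            best_inliers, best_model = inliers, (a, b)
    return best_model, best_inliers
```

A final least-squares refit over the consensus set usually follows; the minimal-sample stage only has to find an uncontaminated hypothesis once.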

23,396 citations


"Matching Widely Separated Views Bas..." refers to methods in this paper

  • ...…rejected most false matches among the region correspondences using the geometric and photometric constraints described above, we apply RANSAC (Fischler and Bolles, 1981) (a robust method based on random sampling) to find a consistent epipolar geometry and to reject the remaining false…...


  • ...After having rejected most false matches among the region correspondences using the geometric and photometric constraints described above, we apply RANSAC (Fischler and Bolles, 1981) (a robust method based on random sampling) to find a consistent epipolar geometry and to reject the remaining false correspondences....


  • ...The best known constraint is checking for a consistent epipolar geometry in a robust way, e.g. using RANSAC (Fischler and Bolles, 1981), and rejecting all correspondences not conform with the epipolar geometry found....


  • ...using RANSAC (Fischler and Bolles, 1981), and rejecting all correspondences not conform with the epipolar geometry found....


  • ...second image (Gruen, 1985; Super and Klarquist, 1997). However, the search that is involved reduces the practicality of this approach. In contrast, our method is based on the extraction and matching of invariant regions, and hence works on the two images separately, without searching over the entire image or applying combinatorics. This is akin to the approach of Pritchett and Zisserman (1998) who start their wide baseline stereo algorithm by extracting quadrangles present in the image and match these based on normalized crosscorrelation to find local homographies, which are then exploited in a search for additional correspondences....


Proceedings ArticleDOI
20 Sep 1999
TL;DR: Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.
Abstract: An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low residual least squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.
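The staged filtering approach, finding stable points as extrema of difference-of-Gaussians layers across both position and scale, can be sketched in one dimension. The blur levels below are hand-made test inputs, not an actual Gaussian pyramid:

```python
def dog_extrema(blurred):
    """Sketch of scale-space keypoint selection: given a list of
    progressively blurred copies of a 1-D signal, form
    difference-of-Gaussians (DoG) layers and return (scale, position)
    pairs that are extrema against all neighbours in position and scale."""
    dogs = [[b2 - b1 for b1, b2 in zip(blurred[s], blurred[s + 1])]
            for s in range(len(blurred) - 1)]
    keys = []
    for s in range(1, len(dogs) - 1):
        for i in range(1, len(dogs[s]) - 1):
            neigh = [dogs[ds][di]
                     for ds in (s - 1, s, s + 1)
                     for di in (i - 1, i, i + 1)
                     if not (ds == s and di == i)]
            v = dogs[s][i]
            if v > max(neigh) or v < min(neigh):
                keys.append((s, i))
    return keys
```

In the full method each surviving extremum is then assigned an orientation and described by local gradient histograms before indexing.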

16,989 citations


"Matching Widely Separated Views Bas..." refers to background or methods in this paper

  • ...In summary, our system differs from other wide baseline stereo methods in that we do not apply a search between images but process each image and each local feature individually (Gruen, 1985; Super and Klarquist, 1997; Schaffalitzky and Zisserman, 2001), in that we fully take into account the affine deformations caused by the change in viewpoint (Lowe, 1999; Montesinos et al., 2000; Schmid and Mohr, 1997; Dufournaud et al., 2000) and in that we can deal with general 3D objects without assuming specific structures to be present in the image (Pritchett and Zisserman, 1998; Tell and Carlsson, 2000)....


  • ...For instance, Lowe (1999) uses extrema of a difference of Gaussians filter....


  • ...The consistency of the matches found is tested using semi-local constraints, followed by a test on the epipolar geometry using RANSAC. As shown in the experimental results, the feasibility of affine invariance even on a local scale has been demonstrated. Robust matching is quite a generic problem in vision and several other applications can be considered. Object recognition is one, where images of an object can be matched against a small set of reference images of the same object. The sample set can be kept small because of the invariance. Moreover, as the features are local, recognition against variable backgrounds and under occlusion is supported by this method. Another application is grouping, where symmetries can be found as repeated structures. Image database retrieval can also benefit from these regions, where other pictures of the same scene or object can be found. Here, the viewpoint and illumination invariance gives the system the capacity to generalize to a great extent from a single query image. Finally, being able to match a current view against learned views can allow robots to roam extended spaces, without the need for a 3D model. Initial results for such applications can be found in Tuytelaars and Van Gool (1999), Tuytelaars et al....


  • ...…Klarquist, 1997; Schaffalitzky and Zisserman, 2001), in that we fully take into account the affine deformations caused by the change in viewpoint (Lowe, 1999; Montesinos et al., 2000; Schmid and Mohr, 1997; Dufournaud et al., 2000) and in that we can deal with general 3D objects without assuming…...


  • ...Lowe (1999) has extended these ideas to real scale-invariance, using circular regions that maximize the output of a difference of gaussian filters in scale space, while Hall et al. (1999) not only applied automatic scale selection (based on Lindeberg (1998)), but also retrieved the orientation of…...


Proceedings ArticleDOI
01 Jan 1988
TL;DR: The problem the authors are addressing in Alvey Project MMI149 is that of using computer vision to understand the unconstrained 3D world, in which the viewed scenes will in general contain too wide a diversity of objects for topdown recognition techniques to work.
Abstract: The problem we are addressing in Alvey Project MMI149 is that of using computer vision to understand the unconstrained 3D world, in which the viewed scenes will in general contain too wide a diversity of objects for topdown recognition techniques to work. For example, we desire to obtain an understanding of natural scenes, containing roads, buildings, trees, bushes, etc., as typified by the two frames from a sequence illustrated in Figure 1. The solution to this problem that we are pursuing is to use a computer vision system based upon motion analysis of a monocular image sequence from a mobile camera. By extraction and tracking of image features, representations of the 3D analogues of these features can be constructed.
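The corner detector proposed in this paper scores each pixel with R = det(M) - k*trace(M)^2, where M accumulates products of image gradients over a local window: corners give large positive R, edges negative R, and flat areas values near zero. A pure-Python sketch on a toy list-of-lists image:

```python
def harris_response(img, k=0.04):
    """Harris corner response for each interior pixel of a grayscale
    image (list of rows). M sums gradient outer products over a 3x3
    window; R = det(M) - k * trace(M)**2."""
    h, w = len(img), len(img[0])
    # Central-difference gradients with clamped borders.
    ix = [[(img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]) / 2.0
           for x in range(w)] for y in range(h)]
    iy = [[(img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]) / 2.0
           for x in range(w)] for y in range(h)]
    R = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            sxx = syy = sxy = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    gx, gy = ix[y + dy][x + dx], iy[y + dy][x + dx]
                    sxx += gx * gx
                    syy += gy * gy
                    sxy += gx * gy
            R[y][x] = sxx * syy - sxy * sxy - k * (sxx + syy) ** 2
    return R
```

In practice the gradient products are additionally weighted by a Gaussian window, and corners are taken as local maxima of R above a threshold.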

13,993 citations


"Matching Widely Separated Views Bas..." refers to methods in this paper

  • ...Harris corner points (Harris and Stephens, 1988) are good candidates....


  • ...Harris corner points (Harris and Stephens, 1988) are good candidates. Apart from the necessary properties of good anchor points mentioned above, they typically contain a large amount of information (Schmid and Mohr, 1998), resulting in a high distinctive power, and they are well localized, i.e. the position of the corner point is accurately defined (even up to sub-pixel accuracy) (Shi and Tomasi, 1994). Instead of using corners, local extrema of image intensity can serve as anchor points as well. To this end, we first apply some smoothing to the image to reduce the effect of noise, causing too many unstable local extrema. Then, the local extrema are extracted with a non-maximum suppression algorithm. These points cannot be localized as accurately as corner points, since the local extrema in intensity are often rather smooth. However, they can withstand any monotonic intensity transformation and they are less likely to lie close to the border of an object resulting in a non-planar region. This last property is a major drawback when working with corner points. Of course, which kind of anchor points perform best also depends on the method used for the region extraction, and how good this method deals with the shortcomings of the anchor points. For instance, for the corner points, the high chance of a non-planar region can be alleviated by constructing a region that is not centered around the corner point. Similarly, regions starting from local intensity extrema should not depend too much on the exact position of the extremum, to overcome the inaccurate localization of these points. Other types of anchor points could be used as well. For instance, Lowe (1999) uses extrema of a difference of Gaussians filter....


  • ...The first method for affine invariant region extraction starts from Harris corner points (Harris and Stephens, 1988) and the edges that can often be found close to such a point (extracted using the Canny edge detector (Canny, 1986))....


Proceedings ArticleDOI
21 Jun 1994
TL;DR: A feature selection criterion that is optimal by construction because it is based on how the tracker works, and a feature monitoring method that can detect occlusions, disocclusions, and features that do not correspond to points in the world are proposed.
Abstract: No feature-based vision system can work unless good features can be identified and tracked from frame to frame. Although tracking itself is by and large a solved problem, selecting features that can be tracked well and correspond to physical points in the world is still hard. We propose a feature selection criterion that is optimal by construction because it is based on how the tracker works, and a feature monitoring method that can detect occlusions, disocclusions, and features that do not correspond to points in the world. These methods are based on a new tracking algorithm that extends previous Newton-Raphson style search methods to work under affine image transformations. We test performance with several simulations and experiments. >
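The "optimal by construction" selection criterion reduces to thresholding the smaller eigenvalue of the same 2x2 gradient matrix M that the Harris detector scores: a window is trackable only when both eigenvalues are large, i.e. it has strong gradients in two independent directions. A closed-form sketch:

```python
import math

def min_eigenvalue(sxx, syy, sxy):
    """Shi-Tomasi trackability score: the smaller eigenvalue of the
    2x2 gradient matrix M = [[sxx, sxy], [sxy, syy]], where sxx, syy,
    sxy are window sums of Ix*Ix, Iy*Iy and Ix*Iy respectively.
    A feature is accepted when this value exceeds a threshold."""
    t = sxx + syy                 # trace(M)
    d = sxx * syy - sxy * sxy     # det(M)
    disc = math.sqrt(max(t * t / 4.0 - d, 0.0))
    return t / 2.0 - disc
```

An edge-like window has one large and one near-zero eigenvalue, so it is rejected by this score even though its Harris trace is large.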

8,432 citations


"Matching Widely Separated Views Bas..." refers to background in this paper

  • ...the position of the corner point is accurately defined (even up to sub-pixel accuracy) (Shi and Tomasi, 1994)....


  • ...…good anchor points mentioned above, they typically contain a large amount of information (Schmid and Mohr, 1998), resulting in a high distinctive power, and they are well localized, i.e. the position of the corner point is accurately defined (even up to sub-pixel accuracy) (Shi and Tomasi, 1994)....
