Journal ArticleDOI

A Comparison of Affine Region Detectors

TL;DR: The paper gives a snapshot of the state of the art in affine covariant region detectors, compares their performance on a set of test images under varying imaging conditions, and establishes a reference test set of images and performance software so that future detectors can be evaluated in the same framework.
Abstract: The paper gives a snapshot of the state of the art in affine covariant region detectors, and compares their performance on a set of test images under varying imaging conditions. Six types of detectors are included: detectors based on affine normalization around Harris (Mikolajczyk and Schmid, 2002; Schaffalitzky and Zisserman, 2002) and Hessian points (Mikolajczyk and Schmid, 2002), a detector of `maximally stable extremal regions', proposed by Matas et al. (2002); an edge-based region detector (Tuytelaars and Van Gool, 1999) and a detector based on intensity extrema (Tuytelaars and Van Gool, 2000), and a detector of `salient regions', proposed by Kadir, Zisserman and Brady (2004). The performance is measured against changes in viewpoint, scale, illumination, defocus and image compression. The objective of this paper is also to establish a reference test set of images and performance software, so that future detectors can be evaluated in the same framework.

Summary (5 min read)

1 Introduction

  • Detecting regions covariant with a class of transformations has now reached some maturity in the computer vision literature.
  • In particular, consider images from two viewpoints and the geometric transformation between the images induced by the viewpoint change.
  • The confusion probably arises from the fact that, even though the regions themselves are covariant, the normalized image pattern they cover and the feature descriptors derived from them are typically invariant.

2 Affine covariant detectors

  • In this section the authors give a brief description of the six region detectors used in the comparison.
  • Sections 2.2 and 2.3 describe methods for detecting edge-based regions and intensity extrema-based regions.
  • The idea is to select the characteristic scale of a local structure, at which a given function attains an extremum over scales (a sketch follows this list).
  • The eigenvalues of the second moment matrix are used to measure the affine shape of the point neighbourhood.
  • The authors can therefore use this technique to estimate the shape of initial regions provided by the Harris- and Hessian-based detectors.
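
As a rough illustration of characteristic-scale selection, the sketch below evaluates the scale-normalized Laplacian at one pixel over a range of scales and keeps the strongest interior extremum. The function name, the scale range, and the single-pixel evaluation are illustrative choices, not the authors' implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def characteristic_scale(img, x, y, sigmas=np.geomspace(1.0, 16.0, 15)):
    """Pick the scale at which the normalized Laplacian response at
    (x, y) attains its strongest local extremum over scales (sketch)."""
    responses = []
    for s in sigmas:
        # sigma^2 * LoG gives the scale-normalized Laplacian response
        log = (s ** 2) * gaussian_laplace(img.astype(float), sigma=s)
        responses.append(abs(log[y, x]))
    responses = np.asarray(responses)
    # interior local maxima of |response| along the scale axis
    interior = (responses[1:-1] > responses[:-2]) & (responses[1:-1] > responses[2:])
    candidates = np.where(interior)[0] + 1
    if candidates.size == 0:
        return None                      # no characteristic scale found
    return sigmas[candidates[np.argmax(responses[candidates])]]
```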

2.2 An edge-based region detector

  • The rationale behind this is that edges are typically rather stable features that can be detected over a range of viewpoints, scales and/or illumination changes.
  • Since intersections of two straight edges occur quite often, the authors cannot simply neglect this case.
  • To circumvent this problem, the two photometric quantities given in Equation 4 are combined and locations where both functions reach a minimum value are taken to fix the parameters s1 and s2 along the straight edges.
  • Moreover, instead of relying on the correct detection of the Harris corner point, the authors can simply use the straight lines intersection point instead.
  • For easy comparison in the context of this paper, the parallelograms representing the invariant regions are replaced by the enclosed ellipses, as shown in figure 4(b).

2.3 Intensity extrema-based region detector

  • Here the authors describe a method to detect affine covariant regions that starts from intensity extrema (detected at multiple scales), and explores the image around them in a radial way, delineating regions of arbitrary shape, which are then replaced by ellipses.
  • The point for which this function reaches an extremum is invariant under affine geometric and linear photometric transformations (given the ray).
  • The function fI(t) is in itself already invariant.
  • This ellipse-fitting is again an affine covariant construction (a sketch follows this list).
  • Examples of detected regions are displayed in figure 4(a).
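
The replacement of an arbitrarily shaped region by the ellipse with the same first and second moments (also used for the EBR and MSER output, as described in section 2 of the full text) admits a compact sketch; `mask` is assumed to be a binary image of the detected region.

```python
import numpy as np

def ellipse_from_region(mask):
    """Ellipse with the same first and second moments as a binary
    region mask: returns the centre and the 2x2 covariance whose
    inverse defines the ellipse (an affine covariant construction)."""
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)
    centre = pts.mean(axis=0)            # first moments
    cov = np.cov((pts - centre).T)       # second central moments
    # boundary points x satisfy (x - centre)^T inv(cov) (x - centre) = c
    return centre, cov
```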

2.4 Maximally Stable Extremal region detector

  • The word ‘extremal’ refers to the property that all pixels inside the MSER have either higher (bright extremal regions) or lower (dark extremal regions) intensity than all the pixels on its outer boundary.
  • The ‘maximally stable’ in MSER describes the property optimized in the threshold selection process.
  • This ensures that common photometric changes modelled locally as linear or affine leave E unaffected, even if the camera is non-linear (gamma-corrected).
  • After sorting, pixels are marked in the image (either in decreasing or increasing order) and the list of growing and merging connected components and their areas is maintained using the union-find algorithm [38].
  • Among the extremal regions, the ‘maximally stable’ ones are those corresponding to thresholds where the relative area change, as a function of the relative change of threshold, is at a local minimum (a component-tracking sketch follows this list).
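
A minimal sketch of the threshold sweep with union-find component tracking is given below. It only records how components grow and merge; the stability test (searching for local minima of the relative area change across thresholds) that the real detector performs on top of this bookkeeping is omitted, and all names are illustrative.

```python
import numpy as np

class UnionFind:
    """Union-find with component areas, for tracking extremal regions
    while sweeping the intensity threshold."""
    def __init__(self, n):
        self.parent = np.arange(n)
        self.area = np.ones(n, dtype=int)
    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i
    def union(self, i, j):
        ri, rj = self.find(i), self.find(j)
        if ri != rj:
            self.parent[rj] = ri
            self.area[ri] += self.area[rj]

def sweep_extremal_regions(img):
    """Switch pixels on from dark to bright and merge 4-connected
    components; a real MSER detector would additionally record each
    component's area at every threshold to test stability (sketch)."""
    h, w = img.shape
    uf = UnionFind(h * w)
    active = np.zeros(h * w, dtype=bool)
    order = np.argsort(img, axis=None)   # pixels by increasing intensity
    for p in order:
        active[p] = True
        y, x = divmod(int(p), w)
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and active[ny * w + nx]:
                uf.union(int(p), ny * w + nx)
    return uf
```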

2.5 Salient region detector

  • This detector is based on the pdf of intensity values computed over an elliptical region.
  • Detection proceeds in two steps: first, at each pixel the entropy of the pdf is evaluated over the three-parameter family of ellipses centred on that pixel (an entropy-over-scale sketch follows this list).
  • The set of entropy extrema over scale and the corresponding ellipse parameters are recorded.
  • Second, the candidate salient regions over the entire image are ranked using the magnitude of the derivative of the pdf with respect to scale.
  • More details about this method can be found in [12].
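
The entropy computation can be sketched as follows for the isotropic special case (circles instead of the full three-parameter ellipse family), assuming an 8-bit grayscale image; the ranking by the scale derivative of the pdf is omitted.

```python
import numpy as np

def intensity_entropy(img, cx, cy, radius, bins=32):
    """Entropy of the intensity pdf inside a circular window. The
    actual salient-region detector searches a three-parameter family
    of ellipses; a circle is the isotropic special case (sketch)."""
    ys, xs = np.ogrid[:img.shape[0], :img.shape[1]]
    inside = (xs - cx) ** 2 + (ys - cy) ** 2 <= radius ** 2
    hist, _ = np.histogram(img[inside], bins=bins, range=(0, 256))
    p = hist[hist > 0] / hist.sum()
    return -(p * np.log2(p)).sum()

def entropy_extrema_over_scale(img, cx, cy, radii=range(5, 40, 2)):
    """Radii at which the entropy attains a local maximum over scale."""
    radii = list(radii)
    e = np.array([intensity_entropy(img, cx, cy, r) for r in radii])
    idx = [i for i in range(1, len(e) - 1) if e[i] > e[i - 1] and e[i] > e[i + 1]]
    return [(radii[i], e[i]) for i in idx]
```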

3 The image data set

  • Figure 9 shows examples from the image sets used to evaluate the detectors.
  • In the cases of viewpoint change, scale change and blur, the same change in imaging conditions is applied to two different scene types.
  • This means that the effect of changing the image conditions can be separated from the effect of changing the scene type.
  • The JPEG sequence is generated using a standard xv image browser with the image quality parameter varying from 40% to 2%.
  • The composition of these two homographies (approximate and residual) gives an accurate homography between the reference image and the other image (see the sketch below).
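
Since homographies act linearly on homogeneous coordinates, composing the approximate and residual homographies is a single 3 × 3 matrix product. The numeric values below are hypothetical placeholders.

```python
import numpy as np

# H_approx: the approximate homography from manually warping the image.
# H_res: the residual homography estimated between the warped image and
# the other image. Both matrices here carry hypothetical values.
H_approx = np.array([[1.02, 0.01,  5.0],
                     [0.00, 0.98, -3.0],
                     [1e-5, 0.00,  1.0]])
H_res = np.eye(3)

# Composition of the two transformations is a matrix product.
H = H_res @ H_approx

def warp_point(H, x, y):
    """Map an image point through a homography."""
    p = H @ np.array([x, y, 1.0])
    return p[:2] / p[2]      # back to inhomogeneous coordinates
```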

3.1 Discussion

  • Before the authors compare the performance of the different detectors in more detail in the next section, a few more general observations can already be made, simply by examining the output of the different detectors for the images shown in figures 3 and 4.
  • For the intensity extrema-based region detector, the algorithm finding intensity extrema is O(n), where n is again the number of pixels.
  • The computation times mentioned in this table have all been measured on a Pentium 4 2GHz Linux PC, for the leftmost image shown in figure 9(a), which is 800 × 640 pixels.
  • Also, as will be shown in the next section, large regions automatically have better chances of overlapping other regions.
  • Here, the authors focus on the original distinguished regions (except for the ellipse fitting for edge-based and MSER regions, to obtain the same shape for all detectors), as they determine the intrinsic quality of a detector.

4 Overlap comparison using homographies

  • Two important parameters characterize the performance of a region detector: 1. the repeatability, i.e., the average number of corresponding regions detected in images under different geometric and photometric transformations, both in absolute and relative terms (i.e., percentage-wise), and 2. the accuracy of localization and region estimation.
  • Clearly, as the scaling goes to zero there is no intersection of the cones, and as the scaling goes to infinity the relative amount of overlap, defined as the ratio of the intersection to the union of the ellipses, approaches unity.
  • Moreover, it is not straightforward for all detectors to come up with a single parameter that can be varied to obtain the desired number of regions in a meaningful way, i.e., representing some kind of ‘quality measure’ for the regions.
  • To give an idea of the number of regions, both absolute and relative repeatability scores are given.

4.1 Repeatability measure

  • Two regions are deemed to correspond if the overlap error, defined as the error in the image area covered by the regions, is sufficiently small: $1 - \frac{R_{\mu_a} \cap R_{(H^T \mu_b H)}}{R_{\mu_a} \cup R_{(H^T \mu_b H)}} < \epsilon_O$, where $R_\mu$ represents the elliptic region defined by $x^T \mu x = 1$ and $H$ is the homography relating the two images. $R_{\mu_a} \cup R_{(H^T \mu_b H)}$ is the union of the regions, and $R_{\mu_a} \cap R_{(H^T \mu_b H)}$ is their intersection (a numerical sketch follows this list).
  • Then, the authors apply this scale factor to both the region in the reference image and the region detected in the other image which has been mapped onto the reference image, before computing the actual overlap error as described above.
  • The precise procedure is given in the Matlab code on http://www.robots.ox.ac.uk/~vgg/research/affine.
  • Note that an overlap error of 20% is very small, as it corresponds to only a 10% difference in the regions’ radii.
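
The overlap error can be approximated numerically by rasterizing both ellipses on a common grid, as in the sketch below. It assumes both regions are centred at a common point and that the second region has already been projected onto the reference image (i.e., its matrix is $H^T \mu_b H$, here using only the linear part of the homography); the authors' Matlab code at the URL cited above is the definitive implementation.

```python
import numpy as np

def overlap_error(mu_a, mu_b_mapped, samples=400, extent=3.0):
    """Overlap error 1 - |A ∩ B| / |A ∪ B| between elliptic regions
    x^T mu x <= 1, estimated by rasterization (sketch). Both ellipses
    are assumed centred at the origin and scaled to fit in `extent`."""
    xs = np.linspace(-extent, extent, samples)
    X, Y = np.meshgrid(xs, xs)
    P = np.stack([X.ravel(), Y.ravel()], axis=1)      # N x 2 grid points
    in_a = np.einsum('ni,ij,nj->n', P, mu_a, P) <= 1.0
    in_b = np.einsum('ni,ij,nj->n', P, mu_b_mapped, P) <= 1.0
    inter = np.logical_and(in_a, in_b).sum()
    union = np.logical_or(in_a, in_b).sum()
    return 1.0 - inter / union
```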

4.2 Repeatability under various transformations

  • In a first set of experiments, the authors fix the overlap error threshold to 40% and the normalized region size to a radius of 30 pixels, and check the repeatability of the different region detectors for gradually increasing transformations, according to the image sets shown in figure 9.
  • This can be understood by the fact that in most cases larger transformations result in lower quality images and/or smaller commonly visible parts between the reference image and the other image, and hence a smaller number of regions are detected.
  • Figure 13(a) shows the repeatability score and figure 13(b) the absolute number of correspondences.
  • The Hessian-Affine detector performs best, followed by MSER and Harris-Affine detectors.
  • The number of corresponding regions detected on the structured scene is much lower than for the textured scene, and it changes by a different factor for different detectors.

4.3 More detailed tests

  • To further validate their experimental setup and to obtain deeper insight into what is actually going on, a more detailed analysis is performed on one image pair with a viewpoint change of 40 degrees, namely the first and third columns of the graffiti sequence shown in figure 9(a).
  • Choosing a lower threshold results in more accurate regions.
  • Figure 21(b) shows how the repeatability scores vary as a function of the normalized region size, with the overlap error threshold fixed to 40%.
  • This results in a plot showing the repeatability scores for different detectors as a function of region size.
  • The results for Hessian-Affine, Harris-Affine and IBR are similar.

5 Matching experiments

  • In the previous section, the performance of the different region detectors is evaluated from a rather theoretical point of view, focusing on the overlap error and repeatability.
  • Here the detectors are evaluated in a practical matching setting: the authors compute a descriptor for the regions, and then check to what extent matching with the descriptor gives the correct region match.
  • This descriptor gave the best matching results in an evaluation of different descriptors computed on scale and affine invariant regions [25, 28].
  • To this end, each elliptical region is first mapped to a circular region of 30 × 30 pixels, and rotated based on the dominant gradient orientation, to compensate for the affine geometric deformations, as shown in figure 2(e) (a normalization sketch follows this list).
  • Note that unlike in section 4, this mapping concerns descriptors; the region size is coincidentally the same (30 pixels).
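
Geometric normalization of an elliptical region to a circular patch can be sketched with a matrix square root: if the ellipse is $x^T \mu x = 1$, then $\mu^{-1/2}$ maps the unit disc onto it, up to an arbitrary rotation. The bilinear sampler is an assumed helper, not part of any particular library; radius 15 yields the 30 × 30 patch mentioned above.

```python
import numpy as np
from scipy.linalg import sqrtm

def normalize_region(sample, mu, centre, radius=15):
    """Map the elliptic region x^T mu x <= 1 around `centre` onto a
    (2*radius) x (2*radius) circular patch, up to rotation.
    `sample(x, y)` is an assumed bilinear-interpolation helper that
    returns the image intensity at a subpixel location."""
    A = np.linalg.inv(np.real(sqrtm(mu)))    # unit disc -> ellipse
    size = 2 * radius
    patch = np.zeros((size, size))
    for j in range(size):
        for i in range(size):
            # coordinates on the unit disc
            u = np.array([(i - radius) / radius, (j - radius) / radius])
            x, y = centre + A @ u
            patch[j, i] = sample(x, y)
    return patch
```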

5.1 Matching score

  • Again the measure is computed between a reference image and the other images in a set.
  • The matching score is computed in two steps (a simplified sketch follows this list).
  • Only a single match is allowed for each region.
  • If the matching results for a particular feature type do not follow those of the repeatability test, this means that the distinctiveness of these features differs from that of the features found by other detectors.
  • Indeed, rather than taking the original distinguished region, one might also rescale the region first, which typically leads to more discriminative power – certainly for the small regions.
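
A simplified variant of the matching score can be sketched as follows: descriptors are matched by mutual nearest neighbour in Euclidean distance, which enforces a single match per region, and scored against the ground-truth correspondences from the overlap test. The paper's exact protocol, including its normalization, is given in its section 5.1.

```python
import numpy as np

def matching_score(desc_ref, desc_other, correspondences):
    """Fraction of ground-truth correspondences (i, j) whose
    descriptors are mutual nearest neighbours (simplified sketch)."""
    # pairwise Euclidean distances between all descriptor pairs
    d = np.linalg.norm(desc_ref[:, None, :] - desc_other[None, :, :], axis=2)
    nn_ab = d.argmin(axis=1)     # best match in the other image
    nn_ba = d.argmin(axis=0)     # best match in the reference image
    correct = sum(1 for i, j in correspondences
                  if nn_ab[i] == j and nn_ba[j] == i)  # one match per region
    return correct / max(len(correspondences), 1)
```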

5.2 Matching under various transformations

  • Figures 13 - 20 (c) and (d) give the results of the matching experiment for the different types of transformations.
  • These are basically the same plots as given in figures 13 - 20 (a) and (b) but now focusing on regions that have actually been matched, rather than just corresponding regions.
  • These detectors find several slightly different regions containing the same local structure all of which have a small overlap error.
  • The same change in ranking for Harris-Affine and Hessian-Affine can be observed on the results for other transformations.

6 Conclusions

  • In this paper the authors have presented the state of the art on affine covariant region detectors and have compared their performance.
  • This also holds for IBR since both methods are designed for similar region types.
  • Hessian-Affine and Harris-Affine provide more regions than the other detectors, which is useful in matching scenes with occlusion and clutter.
  • Several detectors should be used simultaneously to obtain the best performance.
  • Naturally, regions are also detected at depth and surface orientation discontinuities of 3D scenes.


A Comparison of Affine Region Detectors

K. Mikolajczyk¹, T. Tuytelaars², C. Schmid⁴, A. Zisserman¹, J. Matas³, F. Schaffalitzky¹, T. Kadir¹, L. Van Gool²

¹ University of Oxford, OX1 3PJ Oxford, United Kingdom
² University of Leuven, Kasteelpark Arenberg 10, 3001 Leuven, Belgium
³ Czech Technical University, Karlovo Namesti 13, 121 35, Prague, Czech Republic
⁴ INRIA, GRAVIR-CNRS, 655, av. de l’Europe, 38330 Montbonnot, France

km@robots.ox.ac.uk, tuytelaa@esat.kuleuven.ac.be, schmid@inrialpes.fr, az@robots.ox.ac.uk, matas@cmp.felk.cvut.cz, fsm@robots.ox.ac.uk, tk@robots.ox.ac.uk, vangool@esat.kuleuven.ac.be
Abstract
The paper gives a snapshot of the state of the art in affine covariant region detectors, and compares their performance on a set of test images under varying imaging conditions. Six types of detectors are included: detectors based on affine normalization around Harris [24, 34] and Hessian points [24], as proposed by Mikolajczyk and Schmid and by Schaffalitzky and Zisserman; a detector of ‘maximally stable extremal regions’, proposed by Matas et al. [21]; an edge-based region detector [45] and a detector based on intensity extrema [47], proposed by Tuytelaars and Van Gool; and a detector of ‘salient regions’, proposed by Kadir, Zisserman and Brady [12]. The performance is measured against changes in viewpoint, scale, illumination, defocus and image compression.
The objective of this paper is also to establish a reference test set of images and performance software, so that future detectors can be evaluated in the same framework.
1 Introduction
Detecting regions covariant with a class of transformations has now reached some maturity in the
computer vision literature. These regions have been used in quite varied applications including:
wide baseline matching for stereo pairs [1, 21, 31, 47], reconstructing cameras for sets of disparate
views [34], image retrieval from large databases [36, 45], model based recognition [7, 18, 29, 32],
object retrieval in video [39, 40], visual data mining [41], texture recognition [13, 14], shot location [35], robot localization [37] and servoing [46], building panoramas [2], symmetry detection [44], and object categorization [4, 5, 6, 30].
The requirement for these regions is that they should correspond to the same pre-image for
different viewpoints, i.e., their shape is not fixed but automatically adapts, based on the un-
derlying image intensities, so that they are the projection of the same 3D surface patch. In
particular, consider images from two viewpoints and the geometric transformation between the
images induced by the viewpoint change. Regions detected after the viewpoint change should
be the same, modulo noise, as the transformed versions of the regions detected in the original
image: image transformation and region detection commute. As such, even though they have
often been called invariant regions in the literature (e.g., [5, 13, 41, 45]), in principle they should
be termed covariant regions since they change covariantly with the transformation. The confu-
sion probably arises from the fact that, even though the regions themselves are covariant, the
Figure 1: Class of transformations needed to cope with viewpoint changes. (a) First
viewpoint; (b,c) second viewpoint. Fixed size circular patches (a,b) clearly do not suffice to deal
with general viewpoint changes. What is needed is an anisotropic rescaling, i.e., an affinity (c).
Bottom row shows close-ups of the images with corresponding surface patches.
normalized image pattern they cover and the feature descriptors derived from them are typically
invariant.
Note that our use of the term ‘region’ simply refers to a set of pixels, i.e., any subset of the image. This differs from classical segmentation since the region boundaries do not have to correspond to changes in image appearance such as colour or texture. All the detectors presented here produce simply connected regions, but in general this need not be the case.
For viewpoint changes, the transformation of most interest is an affinity. This is illustrated in
figure 1. Clearly, a region with fixed shape (a circular example is shown in figure 1(a) and (b))
cannot cope with the geometric deformations caused by the change in viewpoint. We can observe
that the circle does not cover the same image content, i.e., the same physical surface. Instead, the
shape of the region has to be adaptive, or covariant with respect to affinities (figure 1(c); close-ups shown in figure 1(d)–(f)). Indeed, an affinity is sufficient to locally model image distortions arising from viewpoint changes, provided that (1) the scene surface can be locally approximated by a plane, or the camera purely rotates, and (2) perspective effects are ignored, which are typically small on a local scale anyway. Aside from the geometric deformations, photometric deformations also need to be taken into account. These can be modeled by a linear transformation of the intensities.
To further illustrate these issues, and how affine covariant regions can be exploited to cope
with the geometric and photometric deformation between wide baseline images, consider the
example shown in figure 2. Unlike the example of figure 1 (where a circular region was chosen for
one viewpoint) the elliptical image regions here are detected independently in each viewpoint. As
is evident, the pre-images of these affine covariant regions correspond to the same surface region.
Given such an affine covariant region, it is then possible to normalize against the geometric and
photometric deformations (shown in figure 2(d),(e)) and to obtain a viewpoint and illumination
Figure 2: Affine covariant regions offer a solution to viewpoint and illumination
changes. First row: one viewpoint; second row: other viewpoint. (a) Original images, (b)
detected affine covariant regions, (c) close-up of the detected regions. (d) Geometric normal-
ization to circles. The regions are the same up to rotation. (e) Photometric and geometric
normalization. The slight residual difference in rotation is due to an estimation error.
invariant description of the intensity pattern within the region.
In a typical matching application, the regions are used as follows. First, a set of covariant
regions is detected in an image. Often a large number, perhaps hundreds or thousands, of
possibly overlapping regions are obtained. A vector descriptor is then associated with each
region, computed from the intensity pattern within the region. This descriptor is chosen to be
invariant to viewpoint changes and, to some extent, illumination changes, and to discriminate
between the regions. Correspondences may then be established with another image of the same
scene, by first detecting and representing regions (independently) in the new image; and then
matching the regions based on their descriptors. By design the regions commute with viewpoint
change, so by design, corresponding regions in the two images will have similar (ideally identical)
vector descriptors. The benefits are that correspondences can then be easily established and,
since there are multiple regions, the method is robust to partial occlusions.
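
For readers who want to reproduce this detect, describe and match pipeline today, the sketch below uses OpenCV's SIFT (whose detector is scale, not affine, covariant, but whose descriptor is the one used in this paper's evaluation) with a ratio test; the file names are placeholders and OpenCV is assumed to be installed.

```python
import cv2

# Detect regions, describe them, and match across two views.
img1 = cv2.imread("view1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view2.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Two nearest neighbours per descriptor, then Lowe's ratio test
# keeps only distinctive matches.
matcher = cv2.BFMatcher(cv2.NORM_L2)
good = []
for pair in matcher.knnMatch(des1, des2, k=2):
    if len(pair) == 2 and pair[0].distance < 0.8 * pair[1].distance:
        good.append(pair[0])
```
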
This paper gives a snapshot of the state of the art in affine covariant region detection. We will
describe and compare six methods of detecting these regions on images. These detectors have been
designed and implemented by a number of researchers and the comparison is carried out using
binaries supplied by the authors. The detectors are: (i) the ‘Harris-Affine’ detector [24, 27, 34];
(ii) the ‘Hessian-Affine’ detector [24, 27]; (iii) the ‘maximally stable extremal region’ detector (or
MSER, for short) [21, 22]; (iv) an edge-based region detector [45, 48] (referred to as EBR); (v) an
intensity extrema-based region detector [47, 48] (referred to as IBR); and (vi) an entropy-based
region detector [12] (referred to as salient regions).
To limit the scope of the paper we have not included methods for detecting regions which
are covariant only to similarity transformations (i.e., in particular scale), such as [18, 19, 23,
26], or other methods of computing affine invariant descriptors, such as image lines connecting
interest points [20, 42, 43] or invariant vertical line segments [9]. Also the detectors proposed
by Lindeberg [16] and Baumberg [1] have not been included, as they come very close to the
Harris-Affine and Hessian-Affine detectors.
The six detectors are described in section 2. They are compared on the data set shown
in figure 9. This data set includes structured and textured scenes as well as different types
of transformations: viewpoint changes, scale changes, illumination changes, blur and JPEG
compression. It is described in more detail in section 3. Two types of comparisons are carried
out. First, in section 4, the repeatability of the detector is measured: how well does the detector
determine corresponding scene regions? This is measured by comparing the overlap between
the ground truth and detected regions, in a manner similar to the evaluation test used in [24],
but with special attention paid to the effect of the different scales (region sizes) of the various
detectors’ output. Here, we also measure the accuracy of the regions’ shape, scale and localization.
Second, the distinctiveness of the detected regions is assessed: how distinguishable are the regions
detected? Following [25, 28], we use the SIFT descriptor developed by Lowe [18], which is a 128-
dimensional vector, to describe the intensity pattern within the image regions. This descriptor
has been demonstrated to be superior to others used in literature on a number of measures [25].
Our intention is that the images and tests described here will be a benchmark against which
future affine covariant region detectors can be assessed. The images, Matlab code to carry out
the performance tests, and binaries of the detectors are available from http://www.robots.ox.ac.uk/~vgg/research/affine.
2 Affine covariant detectors
In this section we give a brief description of the six region detectors used in the comparison.
Section 2.1 describes the related methods Harris-Affine and Hessian-Affine. Sections 2.2 and 2.3
describe methods for detecting edge-based regions and intensity extrema-based regions. Finally,
sections 2.4 and 2.5 describe MSER and salient regions.
For the purpose of the comparisons the output regions of all detector types are represented by
a common shape, which is an ellipse. Figures 3 and 4 show the ellipses for all detectors on one
pair of images. In order not to overload the images, only some of the corresponding regions that
were actually detected in both images have been shown. This selection is obtained by increasing
the threshold.
In fact, for most of the detectors the output shape is an ellipse. However, for two of the
detectors (edge-based regions and MSER) it is not, and information is lost by this representation,
as ellipses can only be matched up to a rotational degree of freedom. Examples of the original
regions detected by these two methods are given in figure 5. These are parallelogram-shaped
regions for the edge-based region detector, and arbitrarily shaped regions for the MSER detector.
In the following the representing ellipse is chosen to have the same first and second moments as
the originally detected region, which is an affine covariant construction method.
2.1 Detectors based on affine normalization: Harris-Affine & Hessian-Affine
We describe here two related methods which detect interest points in scale-space, and then
determine an elliptical region for each point. Interest points are either detected with the Harris
detector or with a detector based on the Hessian matrix. In both cases scale-selection is based
on the Laplacian, and the shape of the elliptical region is determined with the second moment
matrix of the intensity gradient [1, 16].
The second moment matrix, also called the auto-correlation matrix, is often used for feature
detection or for describing local image structures. Here it is used both in the Harris detector
and the elliptical shape estimation. This matrix describes the gradient distribution in a local
neighbourhood of a point:
$$M = \mu(\mathbf{x}, \sigma_I, \sigma_D) = \begin{bmatrix} \mu_{11} & \mu_{12} \\ \mu_{21} & \mu_{22} \end{bmatrix} = \sigma_D^2 \, g(\sigma_I) * \begin{bmatrix} I_x^2(\mathbf{x}, \sigma_D) & I_x I_y(\mathbf{x}, \sigma_D) \\ I_x I_y(\mathbf{x}, \sigma_D) & I_y^2(\mathbf{x}, \sigma_D) \end{bmatrix} \qquad (1)$$

The local image derivatives are computed with Gaussian kernels of scale $\sigma_D$ (differentiation scale). The derivatives are then averaged in the neighbourhood of the point by smoothing with a Gaussian window of scale $\sigma_I$ (integration scale). The eigenvalues of this matrix represent two principal signal changes in the neighbourhood of the point.
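
Equation (1) translates almost directly into code. The sketch below computes the scale-normalized second moment matrix at one pixel with SciPy's Gaussian derivative filters; the parameter defaults are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def second_moment_matrix(img, x, y, sigma_d=1.0, sigma_i=2.0):
    """Second moment matrix of equation (1): Gaussian derivatives at
    differentiation scale sigma_d, averaged with an integration-scale
    Gaussian window sigma_i, scale-normalized by sigma_d^2 (sketch)."""
    img = img.astype(float)
    Ix = gaussian_filter(img, sigma_d, order=(0, 1))   # d/dx
    Iy = gaussian_filter(img, sigma_d, order=(1, 0))   # d/dy
    m11 = gaussian_filter(Ix * Ix, sigma_i)
    m12 = gaussian_filter(Ix * Iy, sigma_i)
    m22 = gaussian_filter(Iy * Iy, sigma_i)
    return (sigma_d ** 2) * np.array([[m11[y, x], m12[y, x]],
                                      [m12[y, x], m22[y, x]]])
```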

Figure 3: Regions generated by different detectors on corresponding sub-parts of the first and third graffiti images of figure 9(a): (a) Harris-Affine; (b) Hessian-Affine; (c) MSER. The ellipses show the original detection size.

Citations
Book ChapterDOI
07 May 2006
TL;DR: A novel scale- and rotation-invariant interest point detector and descriptor, coined SURF (Speeded Up Robust Features), which approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster.
Abstract: In this paper, we present a novel scale- and rotation-invariant interest point detector and descriptor, coined SURF (Speeded Up Robust Features). It approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster. This is achieved by relying on integral images for image convolutions; by building on the strengths of the leading existing detectors and descriptors (in casu, using a Hessian matrix-based measure for the detector, and a distribution-based descriptor); and by simplifying these methods to the essential. This leads to a combination of novel detection, description, and matching steps. The paper presents experimental results on a standard evaluation set, as well as on imagery obtained in the context of a real-life object recognition application. Both show SURF's strong performance.

13,011 citations


Cites background or methods from "A Comparison of Affine Region Detec..."

  • ...For the detectors, we use the repeatability score, as described in [9]....

  • ...Also, detailed comparisons and evaluations on benchmarking datasets have been performed [7, 8, 9]....

Journal ArticleDOI
TL;DR: A novel scale- and rotation-invariant detector and descriptor, coined SURF (Speeded-Up Robust Features), which approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster.

12,449 citations


Cites background from "A Comparison of Affine Region Detec..."

  • ...[31]), although this will have an impact on the computation time....

  • ...Also, detailed comparisons and evaluations on benchmarking datasets have been performed [28,30,31]....

Journal ArticleDOI
TL;DR: It is observed that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best and Moments and steerable filters show the best performance among the low dimensional descriptors.
Abstract: In this paper, we compare the performance of descriptors computed for local interest regions, as, for example, extracted by the Harris-Affine detector [Mikolajczyk, K and Schmid, C, 2004]. Many different descriptors have been proposed in the literature. It is unclear which descriptors are more appropriate and how their performance depends on the interest region detector. The descriptors should be distinctive and at the same time robust to changes in viewing conditions as well as to errors of the detector. Our evaluation uses as criterion recall with respect to precision and is carried out for different image transformations. We compare shape context [Belongie, S, et al., April 2002], steerable filters [Freeman, W and Adelson, E, Setp. 1991], PCA-SIFT [Ke, Y and Sukthankar, R, 2004], differential invariants [Koenderink, J and van Doorn, A, 1987], spin images [Lazebnik, S, et al., 2003], SIFT [Lowe, D. G., 1999], complex filters [Schaffalitzky, F and Zisserman, A, 2002], moment invariants [Van Gool, L, et al., 1996], and cross-correlation for different types of interest regions. We also propose an extension of the SIFT descriptor and show that it outperforms the original method. Furthermore, we observe that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best. Moments and steerable filters show the best performance among the low dimensional descriptors.

7,057 citations

Book
30 Sep 2010
TL;DR: Computer Vision: Algorithms and Applications explores the variety of techniques commonly used to analyze and interpret images and takes a scientific approach to basic vision problems, formulating physical models of the imaging process before inverting them to produce descriptions of a scene.
Abstract: Humans perceive the three-dimensional structure of the world with apparent ease. However, despite all of the recent advances in computer vision research, the dream of having a computer interpret an image at the same level as a two-year old remains elusive. Why is computer vision such a challenging problem and what is the current state of the art? Computer Vision: Algorithms and Applications explores the variety of techniques commonly used to analyze and interpret images. It also describes challenging real-world applications where vision is being successfully used, both for specialized applications such as medical imaging, and for fun, consumer-level tasks such as image editing and stitching, which students can apply to their own personal photos and videos. More than just a source of recipes, this exceptionally authoritative and comprehensive textbook/reference also takes a scientific approach to basic vision problems, formulating physical models of the imaging process before inverting them to produce descriptions of a scene. These problems are also analyzed using statistical models and solved using rigorous engineering techniques Topics and features: structured to support active curricula and project-oriented courses, with tips in the Introduction for using the book in a variety of customized courses; presents exercises at the end of each chapter with a heavy emphasis on testing algorithms and containing numerous suggestions for small mid-term projects; provides additional material and more detailed mathematical topics in the Appendices, which cover linear algebra, numerical techniques, and Bayesian estimation theory; suggests additional reading at the end of each chapter, including the latest research in each sub-field, in addition to a full Bibliography at the end of the book; supplies supplementary course material for students at the associated website, http://szeliski.org/Book/. Suitable for an upper-level undergraduate or graduate-level course in computer science or engineering, this textbook focuses on basic techniques that work under real-world conditions and encourages students to push their creative boundaries. Its design and exposition also make it eminently suitable as a unique reference to the fundamental techniques and current research literature in computer vision.

4,146 citations


Cites methods from "A Comparison of Affine Region Detec..."

  • ...14: Affine normalization using the second moment matrices, as described in (Mikolajczyk et al. 2005)....

  • ...In the area of feature detectors (Mikolajczyk et al. 2005), in addition to such classic approaches as Förstner-Harris (Förstner 1986, Harris and Stephens 1988) and difference of Gaussians (Lindeberg 1993, Lindeberg 1998b, Lowe 2004), maximally stable extremal regions (MSERs) are widely used for applications that require affine invariance (Matas et al....

Proceedings ArticleDOI
17 Jun 2006
TL;DR: A recognition scheme that scales efficiently to a large number of objects and allows a larger and more discriminatory vocabulary to be used efficiently is presented, which it is shown experimentally leads to a dramatic improvement in retrieval quality.
Abstract: A recognition scheme that scales efficiently to a large number of objects is presented. The efficiency and quality is exhibited in a live demonstration that recognizes CD-covers from a database of 40000 images of popular music CD’s. The scheme builds upon popular techniques of indexing descriptors extracted from local regions, and is robust to background clutter and occlusion. The local region descriptors are hierarchically quantized in a vocabulary tree. The vocabulary tree allows a larger and more discriminatory vocabulary to be used efficiently, which we show experimentally leads to a dramatic improvement in retrieval quality. The most significant property of the scheme is that the tree directly defines the quantization. The quantization and the indexing are therefore fully integrated, essentially being one and the same. The recognition quality is evaluated through retrieval on a database with ground truth, showing the power of the vocabulary tree approach, going as high as 1 million images.

4,024 citations


Cites methods from "A Comparison of Affine Region Detec..."

  • ...In the current implementation of the proposed scheme, feature extraction on a 640 × 480 video frame takes around 0.2 seconds and the database query takes 25ms on a database with 50000 images....

References
Journal ArticleDOI
TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.

46,906 citations

Journal ArticleDOI
TL;DR: There is a natural uncertainty principle between detection and localization performance, which are the two main goals, and with this principle a single operator shape is derived which is optimal at any scale.
Abstract: This paper describes a computational approach to edge detection. The success of the approach depends on the definition of a comprehensive set of goals for the computation of edge points. These goals must be precise enough to delimit the desired behavior of the detector while making minimal assumptions about the form of the solution. We define detection and localization criteria for a class of edges, and present mathematical forms for these criteria as functionals on the operator impulse response. A third criterion is then added to ensure that the detector has only one response to a single edge. We use the criteria in numerical optimization to derive detectors for several common image features, including step edges. On specializing the analysis to step edges, we find that there is a natural uncertainty principle between detection and localization performance, which are the two main goals. With this principle we derive a single operator shape which is optimal at any scale. The optimal detector has a simple approximate implementation in which edges are marked at maxima in gradient magnitude of a Gaussian-smoothed image. We extend this simple detector using operators of several widths to cope with different signal-to-noise ratios in the image. We present a general method, called feature synthesis, for the fine-to-coarse integration of information from operators at different scales. Finally we show that step edge detector performance improves considerably as the operator point spread function is extended along the edge.

28,073 citations


"A Comparison of Affine Region Detec..." refers methods in this paper

  • ...In the following the representing ellipse is chosen to have the same first and second moments as the originally detected region, which is an affine covariant construction method....

  • ...In practice, we start from a Harris corner point p (Harris and Stephens, 1988) and a nearby edge, extracted with the Canny edge detector (Canny, 1986)....

Proceedings ArticleDOI
20 Sep 1999
TL;DR: Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.
Abstract: An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low residual least squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.

16,989 citations


"A Comparison of Affine Region Detec..." refers background or methods in this paper

  • ...The regions are similar to those detected by a Laplacian operator (trace) (Lindeberg, 1998; Lowe, 1999) but a function based on the determinant of the Hessian matrix penalizes very long structures for which the second derivative in one particular orientation is very small....

  • ...…2002), image retrieval from large databases (Schmid and Mohr, 1997; Tuytelaars and Van Gool, 1999), model based recognition (Ferrari et al., 2004; Lowe, 1999; Obdržálek and Matas, 2002; Rothganger et al., 2003), object retrieval in video (Sivic and Zisserman, 2003; Sivic et al., 2004), visual…...

  • ...Here we use the SIFT descriptor of Lowe (1999)....

  • ...Following (Mikolajczyk and Schmid, 2003, 2005), we use the SIFT descriptor developed by Lowe (1999), which is an 128-dimensional vector, to describe the intensity pattern within the image regions....

  • ...…we have not included methods for detecting regions which are covariant only to similarity transformations (i.e., in particular scale), such as (Lowe, 1999, 2004; Mikolajczyk and Schmid, 2001; Mikolajczyk et al., 2003), or other methods of computing affine invariant descriptors, such as image…...

Book
01 Jan 2000
TL;DR: In this article, the authors provide comprehensive background material and explain how to apply the methods and implement the algorithms directly in a unified framework, including geometric principles and how to represent objects algebraically so they can be computed and applied.
Abstract: From the Publisher: A basic problem in computer vision is to understand the structure of a real world scene given several images of it. Recent major developments in the theory and practice of scene reconstruction are described in detail in a unified framework. The book covers the geometric principles and how to represent objects algebraically so they can be computed and applied. The authors provide comprehensive background material and explain how to apply the methods and implement the algorithms directly.

15,558 citations

01 Jan 2001
Multiple View Geometry in Computer Vision (Hartley and Zisserman).

14,282 citations


"A Comparison of Affine Region Detec..." refers methods in this paper

  • ...Second, a standard small-baseline robust homography estimation algorithm is used to compute an accurate residual homography between the reference and warped image (using hundreds of automatically detected and matched interest points) (Hartley and Zisserman, 2004)....

  • ...Second, a standard small-baseline robust homography estimation algorithm is used to compute an accurate residual homography between the reference and warped image (using hundreds of automatically detected and matched interest points) [11]....

Frequently Asked Questions (11)
Q1. What is the point for which the intensity function reaches an extremum?

The point for which this function reaches an extremum is invariant under affine geometric and linear photometric transformations (given the ray). 

If only a very small number of matches is needed (e.g. for computing epipolar geometry), the MSER or IBR detector is the best choice for this type of scene. 

An affinity is sufficient to locally model image distortions arising from viewpoint changes, provided that (1) the scene surface can be locally approximated by a plane, or the camera purely rotates, and (2) perspective effects are ignored, which are typically small on a local scale anyway.

Note that rotation preserves the eigenvalue ratio for an image patch; therefore, the affine deformation can be determined only up to a rotation factor.

The reasons for this lack of 100% performance are sometimes specific to detectors and scene types (discussed below), and sometimes general – the transformation is outside the range for which the detector is designed, e.g. discretization errors, noise, non-linear illumination changes, projective deformations etc. 

Also the region density, i.e., the number of detected regions per fixed amount of pixel area, may have an effect on the repeatability score of a detector. 

The number of regions also strongly depends on the scene type, e.g. for the MSER detector there are about 2600 regions for the textured blur scene (figure 9(f)) and only 230 for the light change scene (figure 9(h)). 

The complexity of the automatic scale selection and shape adaptation algorithm is O((m + k)p), where p is the number of initial points, m is the number of investigated scales in the automatic scale selection, and k is the number of iterations in the shape adaptation algorithm.

The following function is evaluated along each ray: $f_I(t) = \dfrac{|I(t) - I_0|}{\max\left(\frac{\int_0^t |I(t) - I_0| \, dt}{t},\; d\right)}$, with $t$ an arbitrary parameter along the ray, $I(t)$ the intensity at position $t$, $I_0$ the intensity value at the extremum, and $d$ a small number added to prevent a division by zero (a sketch of this evaluation follows).
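
Sampled on intensities along one ray starting at the extremum, the denominator is a running mean, so the function can be evaluated with a cumulative sum; the sketch below uses illustrative names and assumes `intensities[0]` is the value at the extremum.

```python
import numpy as np

def f_I(intensities, d=1e-6):
    """Evaluate f_I(t) = |I(t) - I0| / max(mean_{0..t} |I - I0|, d)
    along one ray of sampled intensities; I0 is the intensity at the
    extremum (index 0) and d guards against division by zero (sketch)."""
    I0 = intensities[0]
    absdiff = np.abs(np.asarray(intensities, dtype=float) - I0)
    t = np.arange(1, len(absdiff) + 1)
    running_mean = np.cumsum(absdiff) / t   # (1/t) * integral of |I - I0|
    return absdiff / np.maximum(running_mean, d)
```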

As the threshold, and therefore the number of matches, increases (figure 22(a)), the numbers of both correct and false matches increase, but the false matches increase faster, hence the percentage of correct matches drops.

The basic measure of accuracy and repeatability the authors use is the relative amount of overlap between the detected region in the reference image and the region detected in the other image, projected onto the reference image using the homography relating the images.