Journal ArticleDOI

A Comparison of Affine Region Detectors

TL;DR: The paper gives a snapshot of the state of the art in affine covariant region detectors, compares their performance on a set of test images under varying imaging conditions, and establishes a reference test set of images and performance software so that future detectors can be evaluated in the same framework.
Abstract: The paper gives a snapshot of the state of the art in affine covariant region detectors, and compares their performance on a set of test images under varying imaging conditions. Six types of detectors are included: detectors based on affine normalization around Harris (Mikolajczyk and Schmid, 2002; Schaffalitzky and Zisserman, 2002) and Hessian points (Mikolajczyk and Schmid, 2002), a detector of `maximally stable extremal regions', proposed by Matas et al. (2002); an edge-based region detector (Tuytelaars and Van Gool, 1999) and a detector based on intensity extrema (Tuytelaars and Van Gool, 2000), and a detector of `salient regions', proposed by Kadir, Zisserman and Brady (2004). The performance is measured against changes in viewpoint, scale, illumination, defocus and image compression. The objective of this paper is also to establish a reference test set of images and performance software, so that future detectors can be evaluated in the same framework.

Summary (5 min read)

1 Introduction

  • Detecting regions covariant with a class of transformations has now reached some maturity in the computer vision literature.
  • In particular, consider images from two viewpoints and the geometric transformation between the images induced by the viewpoint change.
  • The confusion probably arises from the fact that, even though the regions themselves are covariant, the normalized image pattern they cover and the feature descriptors derived from them are typically invariant.

2 Affine covariant detectors

  • In this section the authors give a brief description of the six region detectors used in the comparison.
  • Sections 2.2 and 2.3 describe methods for detecting edge-based regions and intensity extrema-based regions.
  • The idea is to select the characteristic scale of a local structure, at which a given function attains an extremum over scales (a sketch follows this list).
  • The eigenvalues of the second moment matrix are used to measure the affine shape of the point neighbourhood.
  • The authors can therefore use this technique to estimate the shape of initial regions provided by the Harris- and Hessian-based detectors.
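
As a rough illustration of characteristic-scale selection, the sketch below evaluates the scale-normalized Laplacian at one pixel over a range of scales and keeps the strongest interior extremum. The function name, the scale range, and the single-pixel evaluation are illustrative choices, not the authors' implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def characteristic_scale(img, x, y, sigmas=np.geomspace(1.0, 16.0, 15)):
    """Pick the scale at which the normalized Laplacian response at
    (x, y) attains its strongest local extremum over scales (sketch)."""
    responses = []
    for s in sigmas:
        # sigma^2 * LoG gives the scale-normalized Laplacian response
        log = (s ** 2) * gaussian_laplace(img.astype(float), sigma=s)
        responses.append(abs(log[y, x]))
    responses = np.asarray(responses)
    # interior local maxima of |response| along the scale axis
    interior = (responses[1:-1] > responses[:-2]) & (responses[1:-1] > responses[2:])
    candidates = np.where(interior)[0] + 1
    if candidates.size == 0:
        return None                      # no characteristic scale found
    return sigmas[candidates[np.argmax(responses[candidates])]]
```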

2.2 An edge-based region detector

  • The rationale behind this is that edges are typically rather stable features that can be detected over a range of viewpoints, scales and/or illumination changes.
  • Since intersections of two straight edges occur quite often, the authors cannot simply neglect this case.
  • To circumvent this problem, the two photometric quantities given in Equation 4 are combined and locations where both functions reach a minimum value are taken to fix the parameters s1 and s2 along the straight edges.
  • Moreover, instead of relying on the correct detection of the Harris corner point, the authors can simply use the straight lines intersection point instead.
  • For easy comparison in the context of this paper, the parallelograms representing the invariant regions are replaced by the enclosed ellipses, as shown in figure 4(b).

2.3 Intensity extrema-based region detector

  • Here the authors describe a method to detect affine covariant regions that starts from intensity extrema (detected at multiple scales), and explores the image around them in a radial way, delineating regions of arbitrary shape, which are then replaced by ellipses.
  • The point for which this function reaches an extremum is invariant under affine geometric and linear photometric transformations (given the ray).
  • The function fI(t) is in itself already invariant.
  • This ellipse-fitting is again an affine covariant construction (a sketch follows this list).
  • Examples of detected regions are displayed in figure 4(a).
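
The replacement of an arbitrarily shaped region by the ellipse with the same first and second moments (also used for the EBR and MSER output, as described in section 2 of the full text) admits a compact sketch; `mask` is assumed to be a binary image of the detected region.

```python
import numpy as np

def ellipse_from_region(mask):
    """Ellipse with the same first and second moments as a binary
    region mask: returns the centre and the 2x2 covariance whose
    inverse defines the ellipse (an affine covariant construction)."""
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)
    centre = pts.mean(axis=0)            # first moments
    cov = np.cov((pts - centre).T)       # second central moments
    # boundary points x satisfy (x - centre)^T inv(cov) (x - centre) = c
    return centre, cov
```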

2.4 Maximally Stable Extremal region detector

  • The word ‘extremal’ refers to the property that all pixels inside the MSER have either higher (bright extremal regions) or lower (dark extremal regions) intensity than all the pixels on its outer boundary.
  • The ‘maximally stable’ in MSER describes the property optimized in the threshold selection process.
  • This ensures that common photometric changes modelled locally as linear or affine leave E unaffected, even if the camera is non-linear (gamma-corrected).
  • After sorting, pixels are marked in the image (either in decreasing or increasing order) and the list of growing and merging connected components and their areas is maintained using the union-find algorithm [38].
  • Among the extremal regions, the ‘maximally stable’ ones are those corresponding to thresholds where the relative area change, as a function of the relative change of threshold, is at a local minimum (a component-tracking sketch follows this list).
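
A minimal sketch of the threshold sweep with union-find component tracking is given below. It only records how components grow and merge; the stability test (searching for local minima of the relative area change across thresholds) that the real detector performs on top of this bookkeeping is omitted, and all names are illustrative.

```python
import numpy as np

class UnionFind:
    """Union-find with component areas, for tracking extremal regions
    while sweeping the intensity threshold."""
    def __init__(self, n):
        self.parent = np.arange(n)
        self.area = np.ones(n, dtype=int)
    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i
    def union(self, i, j):
        ri, rj = self.find(i), self.find(j)
        if ri != rj:
            self.parent[rj] = ri
            self.area[ri] += self.area[rj]

def sweep_extremal_regions(img):
    """Switch pixels on from dark to bright and merge 4-connected
    components; a real MSER detector would additionally record each
    component's area at every threshold to test stability (sketch)."""
    h, w = img.shape
    uf = UnionFind(h * w)
    active = np.zeros(h * w, dtype=bool)
    order = np.argsort(img, axis=None)   # pixels by increasing intensity
    for p in order:
        active[p] = True
        y, x = divmod(int(p), w)
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and active[ny * w + nx]:
                uf.union(int(p), ny * w + nx)
    return uf
```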

2.5 Salient region detector

  • This detector is based on the pdf of intensity values computed over an elliptical region.
  • Detection proceeds in two steps: first, at each pixel the entropy of the pdf is evaluated over the three-parameter family of ellipses centred on that pixel (an entropy-over-scale sketch follows this list).
  • The set of entropy extrema over scale and the corresponding ellipse parameters are recorded.
  • Second, the candidate salient regions over the entire image are ranked using the magnitude of the derivative of the pdf with respect to scale.
  • More details about this method can be found in [12].
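
The entropy computation can be sketched as follows for the isotropic special case (circles instead of the full three-parameter ellipse family), assuming an 8-bit grayscale image; the ranking by the scale derivative of the pdf is omitted.

```python
import numpy as np

def intensity_entropy(img, cx, cy, radius, bins=32):
    """Entropy of the intensity pdf inside a circular window. The
    actual salient-region detector searches a three-parameter family
    of ellipses; a circle is the isotropic special case (sketch)."""
    ys, xs = np.ogrid[:img.shape[0], :img.shape[1]]
    inside = (xs - cx) ** 2 + (ys - cy) ** 2 <= radius ** 2
    hist, _ = np.histogram(img[inside], bins=bins, range=(0, 256))
    p = hist[hist > 0] / hist.sum()
    return -(p * np.log2(p)).sum()

def entropy_extrema_over_scale(img, cx, cy, radii=range(5, 40, 2)):
    """Radii at which the entropy attains a local maximum over scale."""
    radii = list(radii)
    e = np.array([intensity_entropy(img, cx, cy, r) for r in radii])
    idx = [i for i in range(1, len(e) - 1) if e[i] > e[i - 1] and e[i] > e[i + 1]]
    return [(radii[i], e[i]) for i in idx]
```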

3 The image data set

  • Figure 9 shows examples from the image sets used to evaluate the detectors.
  • In the cases of viewpoint change, scale change and blur, the same change in imaging conditions is applied to two different scene types.
  • This means that the effect of changing the image conditions can be separated from the effect of changing the scene type.
  • The JPEG sequence is generated using a standard xv image browser with the image quality parameter varying from 40% to 2%.
  • The composition of these two homographies (approximate and residual) gives an accurate homography between the reference image and the other image (see the sketch below).
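
Since homographies act linearly on homogeneous coordinates, composing the approximate and residual homographies is a single 3 × 3 matrix product. The numeric values below are hypothetical placeholders.

```python
import numpy as np

# H_approx: the approximate homography from manually warping the image.
# H_res: the residual homography estimated between the warped image and
# the other image. Both matrices here carry hypothetical values.
H_approx = np.array([[1.02, 0.01,  5.0],
                     [0.00, 0.98, -3.0],
                     [1e-5, 0.00,  1.0]])
H_res = np.eye(3)

# Composition of the two transformations is a matrix product.
H = H_res @ H_approx

def warp_point(H, x, y):
    """Map an image point through a homography."""
    p = H @ np.array([x, y, 1.0])
    return p[:2] / p[2]      # back to inhomogeneous coordinates
```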

3.1 Discussion

  • Before the authors compare the performance of the different detectors in more detail in the next section, a few more general observations can already be made, simply by examining the output of the different detectors for the images shown in figures 3 and 4.
  • For the intensity extrema-based region detector, the algorithm finding intensity extrema is O(n), where n is again the number of pixels.
  • The computation times mentioned in this table have all been measured on a Pentium 4 2GHz Linux PC, for the leftmost image shown in figure 9(a), which is 800 × 640 pixels.
  • Also, as will be shown in the next section, large regions automatically have better chances of overlapping other regions.
  • Here, the authors focus on the original distinguished regions (except for the ellipse fitting for edge-based and MSER regions, to obtain the same shape for all detectors), as they determine the intrinsic quality of a detector.

4 Overlap comparison using homographies

  • Two important parameters characterize the performance of a region detector: 1. the repeatability, i.e., the average number of corresponding regions detected in images under different geometric and photometric transformations, both in absolute and relative terms (i.e., percentage-wise), and 2. the accuracy of localization and region estimation.
  • Clearly, as the scaling goes to zero there is no intersection of the cones, and as the scaling goes to infinity the relative amount of overlap, defined as the ratio of the intersection to the union of the ellipses, approaches unity.
  • Moreover, it is not straightforward for all detectors to come up with a single parameter that can be varied to obtain the desired number of regions in a meaningful way, i.e., representing some kind of ‘quality measure’ for the regions.
  • To give an idea of the number of regions, both absolute and relative repeatability scores are given.

4.1 Repeatability measure

  • Two regions are deemed to correspond if the overlap error, defined as the error in the image area covered by the regions, is sufficiently small: $1 - \frac{R_{\mu_a} \cap R_{(H^T \mu_b H)}}{R_{\mu_a} \cup R_{(H^T \mu_b H)}} < \epsilon_O$, where $R_\mu$ represents the elliptic region defined by $x^T \mu x = 1$ and $H$ is the homography relating the two images. $R_{\mu_a} \cup R_{(H^T \mu_b H)}$ is the union of the regions, and $R_{\mu_a} \cap R_{(H^T \mu_b H)}$ is their intersection (a numerical sketch follows this list).
  • Then, the authors apply this scale factor to both the region in the reference image and the region detected in the other image which has been mapped onto the reference image, before computing the actual overlap error as described above.
  • The precise procedure is given in the Matlab code on http://www.robots.ox.ac.uk/~vgg/research/affine.
  • Note that an overlap error of 20% is very small, as it corresponds to only a 10% difference in the regions’ radii.
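
The overlap error can be approximated numerically by rasterizing both ellipses on a common grid, as in the sketch below. It assumes both regions are centred at a common point and that the second region has already been projected onto the reference image (i.e., its matrix is $H^T \mu_b H$, here using only the linear part of the homography); the authors' Matlab code at the URL cited above is the definitive implementation.

```python
import numpy as np

def overlap_error(mu_a, mu_b_mapped, samples=400, extent=3.0):
    """Overlap error 1 - |A ∩ B| / |A ∪ B| between elliptic regions
    x^T mu x <= 1, estimated by rasterization (sketch). Both ellipses
    are assumed centred at the origin and scaled to fit in `extent`."""
    xs = np.linspace(-extent, extent, samples)
    X, Y = np.meshgrid(xs, xs)
    P = np.stack([X.ravel(), Y.ravel()], axis=1)      # N x 2 grid points
    in_a = np.einsum('ni,ij,nj->n', P, mu_a, P) <= 1.0
    in_b = np.einsum('ni,ij,nj->n', P, mu_b_mapped, P) <= 1.0
    inter = np.logical_and(in_a, in_b).sum()
    union = np.logical_or(in_a, in_b).sum()
    return 1.0 - inter / union
```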

4.2 Repeatability under various transformations

  • In a first set of experiments, the authors fix the overlap error threshold to 40% and the normalized region size to a radius of 30 pixels, and check the repeatability of the different region detectors for gradually increasing transformations, according to the image sets shown in figure 9.
  • This can be understood by the fact that in most cases larger transformations result in lower quality images and/or smaller commonly visible parts between the reference image and the other image, and hence a smaller number of regions are detected.
  • Figure 13(a) shows the repeatability score and figure 13(b) the absolute number of correspondences.
  • The Hessian-Affine detector performs best, followed by MSER and Harris-Affine detectors.
  • The number of corresponding regions detected on the structured scene is much lower than for the textured scene, and it changes by a different factor for different detectors.

4.3 More detailed tests

  • To further validate their experimental setup and to obtain deeper insight into what is actually going on, a more detailed analysis is performed on one image pair with a viewpoint change of 40 degrees, namely the first and third columns of the graffiti sequence shown in figure 9(a).
  • Choosing a lower threshold results in more accurate regions.
  • Figure 21(b) shows how the repeatability scores vary as a function of the normalized region size, with the overlap error threshold fixed to 40%.
  • This results in a plot showing the repeatability scores for different detectors as a function of region size.
  • The results for Hessian-Affine, Harris-Affine and IBR are similar.

5 Matching experiments

  • In the previous section, the performance of the different region detectors is evaluated from a rather theoretical point of view, focusing on the overlap error and repeatability.
  • Here the detectors are evaluated in a practical matching setting: the authors compute a descriptor for the regions, and then check to what extent matching with the descriptor gives the correct region match.
  • This descriptor gave the best matching results in an evaluation of different descriptors computed on scale and affine invariant regions [25, 28].
  • To this end, each elliptical region is first mapped to a circular region of 30 × 30 pixels, and rotated based on the dominant gradient orientation, to compensate for the affine geometric deformations, as shown in figure 2(e) (a normalization sketch follows this list).
  • Note that unlike in section 4, this mapping concerns descriptors; the region size is coincidentally the same (30 pixels).
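
Geometric normalization of an elliptical region to a circular patch can be sketched with a matrix square root: if the ellipse is $x^T \mu x = 1$, then $\mu^{-1/2}$ maps the unit disc onto it, up to an arbitrary rotation. The bilinear sampler is an assumed helper, not part of any particular library; radius 15 yields the 30 × 30 patch mentioned above.

```python
import numpy as np
from scipy.linalg import sqrtm

def normalize_region(sample, mu, centre, radius=15):
    """Map the elliptic region x^T mu x <= 1 around `centre` onto a
    (2*radius) x (2*radius) circular patch, up to rotation.
    `sample(x, y)` is an assumed bilinear-interpolation helper that
    returns the image intensity at a subpixel location."""
    A = np.linalg.inv(np.real(sqrtm(mu)))    # unit disc -> ellipse
    size = 2 * radius
    patch = np.zeros((size, size))
    for j in range(size):
        for i in range(size):
            # coordinates on the unit disc
            u = np.array([(i - radius) / radius, (j - radius) / radius])
            x, y = centre + A @ u
            patch[j, i] = sample(x, y)
    return patch
```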

5.1 Matching score

  • Again the measure is computed between a reference image and the other images in a set.
  • The matching score is computed in two steps (a simplified sketch follows this list).
  • Only a single match is allowed for each region.
  • If the matching results for a particular feature type do not follow those of the repeatability test, this means that the distinctiveness of these features differs from that of the features found by other detectors.
  • Indeed, rather than taking the original distinguished region, one might also rescale the region first, which typically leads to more discriminative power – certainly for the small regions.
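
A simplified variant of the matching score can be sketched as follows: descriptors are matched by mutual nearest neighbour in Euclidean distance, which enforces a single match per region, and scored against the ground-truth correspondences from the overlap test. The paper's exact protocol, including its normalization, is given in its section 5.1.

```python
import numpy as np

def matching_score(desc_ref, desc_other, correspondences):
    """Fraction of ground-truth correspondences (i, j) whose
    descriptors are mutual nearest neighbours (simplified sketch)."""
    # pairwise Euclidean distances between all descriptor pairs
    d = np.linalg.norm(desc_ref[:, None, :] - desc_other[None, :, :], axis=2)
    nn_ab = d.argmin(axis=1)     # best match in the other image
    nn_ba = d.argmin(axis=0)     # best match in the reference image
    correct = sum(1 for i, j in correspondences
                  if nn_ab[i] == j and nn_ba[j] == i)  # one match per region
    return correct / max(len(correspondences), 1)
```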

5.2 Matching under various transformations

  • Figures 13 - 20 (c) and (d) give the results of the matching experiment for the different types of transformations.
  • These are basically the same plots as given in figures 13 - 20 (a) and (b) but now focusing on regions that have actually been matched, rather than just corresponding regions.
  • These detectors find several slightly different regions containing the same local structure all of which have a small overlap error.
  • The same change in ranking for Harris-Affine and Hessian-Affine can be observed on the results for other transformations.

6 Conclusions

  • In this paper the authors have presented the state of the art on affine covariant region detectors and have compared their performance.
  • This also holds for IBR since both methods are designed for similar region types.
  • Hessian-Affine and Harris-Affine provide more regions than the other detectors, which is useful in matching scenes with occlusion and clutter.
  • Several detectors should be used simultaneously to obtain the best performance.
  • Naturally, regions are also detected at depth and surface orientation discontinuities of 3D scenes.


A Comparison of Affine Region Detectors

K. Mikolajczyk¹, T. Tuytelaars², C. Schmid⁴, A. Zisserman¹, J. Matas³, F. Schaffalitzky¹, T. Kadir¹, L. Van Gool²

¹ University of Oxford, OX1 3PJ Oxford, United Kingdom
² University of Leuven, Kasteelpark Arenberg 10, 3001 Leuven, Belgium
³ Czech Technical University, Karlovo Namesti 13, 121 35, Prague, Czech Republic
⁴ INRIA, GRAVIR-CNRS, 655, av. de l’Europe, 38330 Montbonnot, France

km@robots.ox.ac.uk, tuytelaa@esat.kuleuven.ac.be, schmid@inrialpes.fr, az@robots.ox.ac.uk, matas@cmp.felk.cvut.cz, fsm@robots.ox.ac.uk, tk@robots.ox.ac.uk, vangool@esat.kuleuven.ac.be
Abstract
The paper gives a snapshot of the state of the art in affine covariant region detectors, and compares their performance on a set of test images under varying imaging conditions. Six types of detectors are included: detectors based on affine normalization around Harris [24, 34] and Hessian points [24], as proposed by Mikolajczyk and Schmid and by Schaffalitzky and Zisserman; a detector of ‘maximally stable extremal regions’, proposed by Matas et al. [21]; an edge-based region detector [45] and a detector based on intensity extrema [47], proposed by Tuytelaars and Van Gool; and a detector of ‘salient regions’, proposed by Kadir, Zisserman and Brady [12]. The performance is measured against changes in viewpoint, scale, illumination, defocus and image compression.
The objective of this paper is also to establish a reference test set of images and performance software, so that future detectors can be evaluated in the same framework.
1 Introduction
Detecting regions covariant with a class of transformations has now reached some maturity in the
computer vision literature. These regions have been used in quite varied applications including:
wide baseline matching for stereo pairs [1, 21, 31, 47], reconstructing cameras for sets of disparate
views [34], image retrieval from large databases [36, 45], model based recognition [7, 18, 29, 32],
object retrieval in video [39, 40], visual data mining [41], texture recognition [13, 14], shot location [35], robot localization [37] and servoing [46], building panoramas [2], symmetry detection [44], and object categorization [4, 5, 6, 30].
The requirement for these regions is that they should correspond to the same pre-image for
different viewpoints, i.e., their shape is not fixed but automatically adapts, based on the un-
derlying image intensities, so that they are the projection of the same 3D surface patch. In
particular, consider images from two viewpoints and the geometric transformation between the
images induced by the viewpoint change. Regions detected after the viewpoint change should
be the same, modulo noise, as the transformed versions of the regions detected in the original
image: image transformation and region detection commute. As such, even though they have
often been called invariant regions in the literature (e.g., [5, 13, 41, 45]), in principle they should
be termed covariant regions since they change covariantly with the transformation. The confu-
sion probably arises from the fact that, even though the regions themselves are covariant, the
Figure 1: Class of transformations needed to cope with viewpoint changes. (a) First
viewpoint; (b,c) second viewpoint. Fixed size circular patches (a,b) clearly do not suffice to deal
with general viewpoint changes. What is needed is an anisotropic rescaling, i.e., an affinity (c).
Bottom row shows close-ups of the images with corresponding surface patches.
normalized image pattern they cover and the feature descriptors derived from them are typically
invariant.
Note that our use of the term ‘region’ simply refers to a set of pixels, i.e., any subset of the image. This differs from classical segmentation since the region boundaries do not have to correspond to changes in image appearance such as colour or texture. All the detectors presented here produce simply connected regions, but in general this need not be the case.
For viewpoint changes, the transformation of most interest is an affinity. This is illustrated in
figure 1. Clearly, a region with fixed shape (a circular example is shown in figure 1(a) and (b))
cannot cope with the geometric deformations caused by the change in viewpoint. We can observe
that the circle does not cover the same image content, i.e., the same physical surface. Instead, the
shape of the region has to be adaptive, or covariant with respect to affinities (figure 1(c); close-ups shown in figure 1(d)–(f)). Indeed, an affinity is sufficient to locally model image distortions arising from viewpoint changes, provided that (1) the scene surface can be locally approximated by a plane, or the camera purely rotates, and (2) perspective effects are ignored, which are typically small on a local scale anyway. Aside from the geometric deformations, photometric deformations also need to be taken into account. These can be modeled by a linear transformation of the intensities.
To further illustrate these issues, and how affine covariant regions can be exploited to cope
with the geometric and photometric deformation between wide baseline images, consider the
example shown in figure 2. Unlike the example of figure 1 (where a circular region was chosen for
one viewpoint) the elliptical image regions here are detected independently in each viewpoint. As
is evident, the pre-images of these affine covariant regions correspond to the same surface region.
Given such an affine covariant region, it is then possible to normalize against the geometric and
photometric deformations (shown in figure 2(d),(e)) and to obtain a viewpoint and illumination
Figure 2: Affine covariant regions offer a solution to viewpoint and illumination
changes. First row: one viewpoint; second row: other viewpoint. (a) Original images, (b)
detected affine covariant regions, (c) close-up of the detected regions. (d) Geometric normal-
ization to circles. The regions are the same up to rotation. (e) Photometric and geometric
normalization. The slight residual difference in rotation is due to an estimation error.
invariant description of the intensity pattern within the region.
In a typical matching application, the regions are used as follows. First, a set of covariant
regions is detected in an image. Often a large number, perhaps hundreds or thousands, of
possibly overlapping regions are obtained. A vector descriptor is then associated with each
region, computed from the intensity pattern within the region. This descriptor is chosen to be
invariant to viewpoint changes and, to some extent, illumination changes, and to discriminate
between the regions. Correspondences may then be established with another image of the same
scene, by first detecting and representing regions (independently) in the new image; and then
matching the regions based on their descriptors. By design the regions commute with viewpoint
change, so by design, corresponding regions in the two images will have similar (ideally identical)
vector descriptors. The benefits are that correspondences can then be easily established and,
since there are multiple regions, the method is robust to partial occlusions.
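
For readers who want to reproduce this detect, describe and match pipeline today, the sketch below uses OpenCV's SIFT (whose detector is scale, not affine, covariant, but whose descriptor is the one used in this paper's evaluation) with a ratio test; the file names are placeholders and OpenCV is assumed to be installed.

```python
import cv2

# Detect regions, describe them, and match across two views.
img1 = cv2.imread("view1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view2.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Two nearest neighbours per descriptor, then Lowe's ratio test
# keeps only distinctive matches.
matcher = cv2.BFMatcher(cv2.NORM_L2)
good = []
for pair in matcher.knnMatch(des1, des2, k=2):
    if len(pair) == 2 and pair[0].distance < 0.8 * pair[1].distance:
        good.append(pair[0])
```
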
This paper gives a snapshot of the state of the art in affine covariant region detection. We will
describe and compare six methods of detecting these regions on images. These detectors have been
designed and implemented by a number of researchers and the comparison is carried out using
binaries supplied by the authors. The detectors are: (i) the ‘Harris-Affine’ detector [24, 27, 34];
(ii) the ‘Hessian-Affine’ detector [24, 27]; (iii) the ‘maximally stable extremal region’ detector (or
MSER, for short) [21, 22]; (iv) an edge-based region detector [45, 48] (referred to as EBR); (v) an
intensity extrema-based region detector [47, 48] (referred to as IBR); and (vi) an entropy-based
region detector [12] (referred to as salient regions).
To limit the scope of the paper we have not included methods for detecting regions which
are covariant only to similarity transformations (i.e., in particular scale), such as [18, 19, 23,
26], or other methods of computing affine invariant descriptors, such as image lines connecting
interest points [20, 42, 43] or invariant vertical line segments [9]. Also the detectors proposed
by Lindeberg [16] and Baumberg [1] have not been included, as they come very close to the
Harris-Affine and Hessian-Affine detectors.
The six detectors are described in section 2. They are compared on the data set shown
in figure 9. This data set includes structured and textured scenes as well as different types
of transformations: viewpoint changes, scale changes, illumination changes, blur and JPEG
compression. It is described in more detail in section 3. Two types of comparisons are carried
out. First, in section 4, the repeatability of the detector is measured: how well does the detector
determine corresponding scene regions? This is measured by comparing the overlap between
the ground truth and detected regions, in a manner similar to the evaluation test used in [24],
but with special attention paid to the effect of the different scales (region sizes) of the various
detectors’ output. Here, we also measure the accuracy of the regions’ shape, scale and localization.
Second, the distinctiveness of the detected regions is assessed: how distinguishable are the regions
detected? Following [25, 28], we use the SIFT descriptor developed by Lowe [18], which is a 128-
dimensional vector, to describe the intensity pattern within the image regions. This descriptor
has been demonstrated to be superior to others used in literature on a number of measures [25].
Our intention is that the images and tests described here will be a benchmark against which
future affine covariant region detectors can be assessed. The images, Matlab code to carry out
the performance tests, and binaries of the detectors are available from http://www.robots.ox.ac.uk/~vgg/research/affine.
2 Affine covariant detectors
In this section we give a brief description of the six region detectors used in the comparison.
Section 2.1 describes the related methods Harris-Affine and Hessian-Affine. Sections 2.2 and 2.3
describe methods for detecting edge-based regions and intensity extrema-based regions. Finally,
sections 2.4 and 2.5 describe MSER and salient regions.
For the purpose of the comparisons the output regions of all detector types are represented by
a common shape, which is an ellipse. Figures 3 and 4 show the ellipses for all detectors on one
pair of images. In order not to overload the images, only some of the corresponding regions that
were actually detected in both images have been shown. This selection is obtained by increasing
the threshold.
In fact, for most of the detectors the output shape is an ellipse. However, for two of the
detectors (edge-based regions and MSER) it is not, and information is lost by this representation,
as ellipses can only be matched up to a rotational degree of freedom. Examples of the original
regions detected by these two methods are given in figure 5. These are parallelogram-shaped
regions for the edge-based region detector, and arbitrarily shaped regions for the MSER detector.
In the following the representing ellipse is chosen to have the same first and second moments as
the originally detected region, which is an affine covariant construction method.
2.1 Detectors based on affine normalization: Harris-Affine & Hessian-Affine
We describe here two related methods which detect interest points in scale-space, and then
determine an elliptical region for each point. Interest points are either detected with the Harris
detector or with a detector based on the Hessian matrix. In both cases scale-selection is based
on the Laplacian, and the shape of the elliptical region is determined with the second moment
matrix of the intensity gradient [1, 16].
The second moment matrix, also called the auto-correlation matrix, is often used for feature
detection or for describing local image structures. Here it is used both in the Harris detector
and the elliptical shape estimation. This matrix describes the gradient distribution in a local
neighbourhood of a point:
$$M = \mu(\mathbf{x}, \sigma_I, \sigma_D) = \begin{bmatrix} \mu_{11} & \mu_{12} \\ \mu_{21} & \mu_{22} \end{bmatrix} = \sigma_D^2 \, g(\sigma_I) * \begin{bmatrix} I_x^2(\mathbf{x}, \sigma_D) & I_x I_y(\mathbf{x}, \sigma_D) \\ I_x I_y(\mathbf{x}, \sigma_D) & I_y^2(\mathbf{x}, \sigma_D) \end{bmatrix} \qquad (1)$$

The local image derivatives are computed with Gaussian kernels of scale $\sigma_D$ (differentiation scale). The derivatives are then averaged in the neighbourhood of the point by smoothing with a Gaussian window of scale $\sigma_I$ (integration scale). The eigenvalues of this matrix represent two principal signal changes in the neighbourhood of the point.
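
Equation (1) translates almost directly into code. The sketch below computes the scale-normalized second moment matrix at one pixel with SciPy's Gaussian derivative filters; the parameter defaults are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def second_moment_matrix(img, x, y, sigma_d=1.0, sigma_i=2.0):
    """Second moment matrix of equation (1): Gaussian derivatives at
    differentiation scale sigma_d, averaged with an integration-scale
    Gaussian window sigma_i, scale-normalized by sigma_d^2 (sketch)."""
    img = img.astype(float)
    Ix = gaussian_filter(img, sigma_d, order=(0, 1))   # d/dx
    Iy = gaussian_filter(img, sigma_d, order=(1, 0))   # d/dy
    m11 = gaussian_filter(Ix * Ix, sigma_i)
    m12 = gaussian_filter(Ix * Iy, sigma_i)
    m22 = gaussian_filter(Iy * Iy, sigma_i)
    return (sigma_d ** 2) * np.array([[m11[y, x], m12[y, x]],
                                      [m12[y, x], m22[y, x]]])
```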

Figure 3: Regions generated by different detectors on corresponding sub-parts of the first and third graffiti images of figure 9(a): (a) Harris-Affine; (b) Hessian-Affine; (c) MSER. The ellipses show the original detection size.

Citations
Book ChapterDOI
07 May 2006
TL;DR: A novel scale- and rotation-invariant interest point detector and descriptor, coined SURF (Speeded Up Robust Features), which approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster.
Abstract: In this paper, we present a novel scale- and rotation-invariant interest point detector and descriptor, coined SURF (Speeded Up Robust Features). It approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster. This is achieved by relying on integral images for image convolutions; by building on the strengths of the leading existing detectors and descriptors (in casu, using a Hessian matrix-based measure for the detector, and a distribution-based descriptor); and by simplifying these methods to the essential. This leads to a combination of novel detection, description, and matching steps. The paper presents experimental results on a standard evaluation set, as well as on imagery obtained in the context of a real-life object recognition application. Both show SURF's strong performance.

13,011 citations


Cites background or methods from "A Comparison of Affine Region Detec..."

  • ...For the detectors, we use the repeatability score, as described in [9]....

  • ...Also, detailed comparisons and evaluations on benchmarking datasets have been performed [7, 8, 9]....

Journal ArticleDOI
TL;DR: A novel scale- and rotation-invariant detector and descriptor, coined SURF (Speeded-Up Robust Features), which approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster.

12,449 citations


Cites background from "A Comparison of Affine Region Detec..."

  • ...[31]), although this will have an impact on the computation time....

  • ...Also, detailed comparisons and evaluations on benchmarking datasets have been performed [28,30,31]....

Journal ArticleDOI
TL;DR: It is observed that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best and Moments and steerable filters show the best performance among the low dimensional descriptors.
Abstract: In this paper, we compare the performance of descriptors computed for local interest regions, as, for example, extracted by the Harris-Affine detector [Mikolajczyk, K and Schmid, C, 2004]. Many different descriptors have been proposed in the literature. It is unclear which descriptors are more appropriate and how their performance depends on the interest region detector. The descriptors should be distinctive and at the same time robust to changes in viewing conditions as well as to errors of the detector. Our evaluation uses as criterion recall with respect to precision and is carried out for different image transformations. We compare shape context [Belongie, S, et al., April 2002], steerable filters [Freeman, W and Adelson, E, Setp. 1991], PCA-SIFT [Ke, Y and Sukthankar, R, 2004], differential invariants [Koenderink, J and van Doorn, A, 1987], spin images [Lazebnik, S, et al., 2003], SIFT [Lowe, D. G., 1999], complex filters [Schaffalitzky, F and Zisserman, A, 2002], moment invariants [Van Gool, L, et al., 1996], and cross-correlation for different types of interest regions. We also propose an extension of the SIFT descriptor and show that it outperforms the original method. Furthermore, we observe that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best. Moments and steerable filters show the best performance among the low dimensional descriptors.

7,057 citations

Book
30 Sep 2010
TL;DR: Computer Vision: Algorithms and Applications explores the variety of techniques commonly used to analyze and interpret images and takes a scientific approach to basic vision problems, formulating physical models of the imaging process before inverting them to produce descriptions of a scene.
Abstract: Humans perceive the three-dimensional structure of the world with apparent ease. However, despite all of the recent advances in computer vision research, the dream of having a computer interpret an image at the same level as a two-year old remains elusive. Why is computer vision such a challenging problem and what is the current state of the art? Computer Vision: Algorithms and Applications explores the variety of techniques commonly used to analyze and interpret images. It also describes challenging real-world applications where vision is being successfully used, both for specialized applications such as medical imaging, and for fun, consumer-level tasks such as image editing and stitching, which students can apply to their own personal photos and videos. More than just a source of recipes, this exceptionally authoritative and comprehensive textbook/reference also takes a scientific approach to basic vision problems, formulating physical models of the imaging process before inverting them to produce descriptions of a scene. These problems are also analyzed using statistical models and solved using rigorous engineering techniques Topics and features: structured to support active curricula and project-oriented courses, with tips in the Introduction for using the book in a variety of customized courses; presents exercises at the end of each chapter with a heavy emphasis on testing algorithms and containing numerous suggestions for small mid-term projects; provides additional material and more detailed mathematical topics in the Appendices, which cover linear algebra, numerical techniques, and Bayesian estimation theory; suggests additional reading at the end of each chapter, including the latest research in each sub-field, in addition to a full Bibliography at the end of the book; supplies supplementary course material for students at the associated website, http://szeliski.org/Book/. Suitable for an upper-level undergraduate or graduate-level course in computer science or engineering, this textbook focuses on basic techniques that work under real-world conditions and encourages students to push their creative boundaries. Its design and exposition also make it eminently suitable as a unique reference to the fundamental techniques and current research literature in computer vision.

4,146 citations


Cites methods from "A Comparison of Affine Region Detec..."

  • ...14: Affine normalization using the second moment matrices, as described in (Mikolajczyk et al. 2005)....

  • ...In the area of feature detectors (Mikolajczyk et al. 2005), in addition to such classic approaches as Förstner-Harris (Förstner 1986, Harris and Stephens 1988) and difference of Gaussians (Lindeberg 1993, Lindeberg 1998b, Lowe 2004), maximally stable extremal regions (MSERs) are widely used for applications that require affine invariance (Matas et al....

Proceedings ArticleDOI
17 Jun 2006
TL;DR: A recognition scheme that scales efficiently to a large number of objects and allows a larger and more discriminatory vocabulary to be used efficiently is presented, which it is shown experimentally leads to a dramatic improvement in retrieval quality.
Abstract: A recognition scheme that scales efficiently to a large number of objects is presented. The efficiency and quality is exhibited in a live demonstration that recognizes CD-covers from a database of 40000 images of popular music CD’s. The scheme builds upon popular techniques of indexing descriptors extracted from local regions, and is robust to background clutter and occlusion. The local region descriptors are hierarchically quantized in a vocabulary tree. The vocabulary tree allows a larger and more discriminatory vocabulary to be used efficiently, which we show experimentally leads to a dramatic improvement in retrieval quality. The most significant property of the scheme is that the tree directly defines the quantization. The quantization and the indexing are therefore fully integrated, essentially being one and the same. The recognition quality is evaluated through retrieval on a database with ground truth, showing the power of the vocabulary tree approach, going as high as 1 million images.

4,024 citations


Cites methods from "A Comparison of Affine Region Detec..."

  • ...In the current implementation of the proposed scheme, feature extraction on a 640 × 480 video frame takes around 0.2 seconds and the database query takes 25ms on a database with 50000 images....

References
Journal ArticleDOI
TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.

46,906 citations

Journal ArticleDOI
TL;DR: There is a natural uncertainty principle between detection and localization performance, which are the two main goals, and with this principle a single operator shape is derived which is optimal at any scale.
Abstract: This paper describes a computational approach to edge detection. The success of the approach depends on the definition of a comprehensive set of goals for the computation of edge points. These goals must be precise enough to delimit the desired behavior of the detector while making minimal assumptions about the form of the solution. We define detection and localization criteria for a class of edges, and present mathematical forms for these criteria as functionals on the operator impulse response. A third criterion is then added to ensure that the detector has only one response to a single edge. We use the criteria in numerical optimization to derive detectors for several common image features, including step edges. On specializing the analysis to step edges, we find that there is a natural uncertainty principle between detection and localization performance, which are the two main goals. With this principle we derive a single operator shape which is optimal at any scale. The optimal detector has a simple approximate implementation in which edges are marked at maxima in gradient magnitude of a Gaussian-smoothed image. We extend this simple detector using operators of several widths to cope with different signal-to-noise ratios in the image. We present a general method, called feature synthesis, for the fine-to-coarse integration of information from operators at different scales. Finally we show that step edge detector performance improves considerably as the operator point spread function is extended along the edge.

28,073 citations


"A Comparison of Affine Region Detec..." refers methods in this paper

  • ...In the following the representing ellipse is chosen to have the same first and second moments as the originally detected region, which is an affine covariant construction method....

  • ...In practice, we start from a Harris corner point p (Harris and Stephens, 1988) and a nearby edge, extracted with the Canny edge detector (Canny, 1986)....

Proceedings ArticleDOI
20 Sep 1999
TL;DR: Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.
Abstract: An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low residual least squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.

16,989 citations


"A Comparison of Affine Region Detec..." refers background or methods in this paper

  • ...The regions are similar to those detected by a Laplacian operator (trace) (Lindeberg, 1998; Lowe, 1999) but a function based on the determinant of the Hessian matrix penalizes very long structures for which the second derivative in one particular orientation is very small....

  • ...…2002), image retrieval from large databases (Schmid and Mohr, 1997; Tuytelaars and Van Gool, 1999), model based recognition (Ferrari et al., 2004; Lowe, 1999; Obdržálek and Matas, 2002; Rothganger et al., 2003), object retrieval in video (Sivic and Zisserman, 2003; Sivic et al., 2004), visual…...

  • ...Here we use the SIFT descriptor of Lowe (1999)....

  • ...Following (Mikolajczyk and Schmid, 2003, 2005), we use the SIFT descriptor developed by Lowe (1999), which is an 128-dimensional vector, to describe the intensity pattern within the image regions....

  • ...…we have not included methods for detecting regions which are covariant only to similarity transformations (i.e., in particular scale), such as (Lowe, 1999, 2004; Mikolajczyk and Schmid, 2001; Mikolajczyk et al., 2003), or other methods of computing affine invariant descriptors, such as image…...

Book
01 Jan 2000
TL;DR: In this article, the authors provide comprehensive background material and explain how to apply the methods and implement the algorithms directly in a unified framework, including geometric principles and how to represent objects algebraically so they can be computed and applied.
Abstract: From the Publisher: A basic problem in computer vision is to understand the structure of a real world scene given several images of it. Recent major developments in the theory and practice of scene reconstruction are described in detail in a unified framework. The book covers the geometric principles and how to represent objects algebraically so they can be computed and applied. The authors provide comprehensive background material and explain how to apply the methods and implement the algorithms directly.

15,558 citations

01 Jan 2001
Multiple View Geometry in Computer Vision (Hartley and Zisserman).

14,282 citations


"A Comparison of Affine Region Detec..." refers methods in this paper

  • ...Second, a standard small-baseline robust homography estimation algorithm is used to compute an accurate residual homography between the reference and warped image (using hundreds of automatically detected and matched interest points) (Hartley and Zisserman, 2004)....

  • ...Second, a standard small-baseline robust homography estimation algorithm is used to compute an accurate residual homography between the reference and warped image (using hundreds of automatically detected and matched interest points) [11]....

Frequently Asked Questions (11)
Q1. What is the point for which the intensity function reaches an extremum?

The point for which this function reaches an extremum is invariant under affine geometric and linear photometric transformations (given the ray). 

If only a very small number of matches is needed (e.g. for computing epipolar geometry), the MSER or IBR detector is the best choice for this type of scene. 

An affinity is sufficient to locally model image distortions arising from viewpoint changes, provided that (1) the scene surface can be locally approximated by a plane, or the camera purely rotates, and (2) perspective effects are ignored, which are typically small on a local scale anyway.

Note that rotation preserves the eigenvalue ratio for an image patch; therefore, the affine deformation can be determined only up to a rotation factor.

The reasons for this lack of 100% performance are sometimes specific to detectors and scene types (discussed below), and sometimes general – the transformation is outside the range for which the detector is designed, e.g. discretization errors, noise, non-linear illumination changes, projective deformations etc. 

Also the region density, i.e., the number of detected regions per fixed amount of pixel area, may have an effect on the repeatability score of a detector. 

The number of regions also strongly depends on the scene type, e.g. for the MSER detector there are about 2600 regions for the textured blur scene (figure 9(f)) and only 230 for the light change scene (figure 9(h)). 

The complexity of the automatic scale selection and shape adaptation algorithm is O((m + k)p), where p is the number of initial points, m is the number of investigated scales in the automatic scale selection, and k is the number of iterations in the shape adaptation algorithm.

The following function is evaluated along each ray: $f_I(t) = \dfrac{|I(t) - I_0|}{\max\left(\frac{\int_0^t |I(t) - I_0| \, dt}{t},\; d\right)}$, with $t$ an arbitrary parameter along the ray, $I(t)$ the intensity at position $t$, $I_0$ the intensity value at the extremum, and $d$ a small number added to prevent a division by zero (a sketch of this evaluation follows).
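
Sampled on intensities along one ray starting at the extremum, the denominator is a running mean, so the function can be evaluated with a cumulative sum; the sketch below uses illustrative names and assumes `intensities[0]` is the value at the extremum.

```python
import numpy as np

def f_I(intensities, d=1e-6):
    """Evaluate f_I(t) = |I(t) - I0| / max(mean_{0..t} |I - I0|, d)
    along one ray of sampled intensities; I0 is the intensity at the
    extremum (index 0) and d guards against division by zero (sketch)."""
    I0 = intensities[0]
    absdiff = np.abs(np.asarray(intensities, dtype=float) - I0)
    t = np.arange(1, len(absdiff) + 1)
    running_mean = np.cumsum(absdiff) / t   # (1/t) * integral of |I - I0|
    return absdiff / np.maximum(running_mean, d)
```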

As the threshold, and therefore the number of matches, increases (figure 22(a)), the numbers of both correct and false matches increase, but the false matches increase faster, hence the percentage of correct matches drops.

The basic measure of accuracy and repeatability the authors use is the relative amount of overlap between the detected region in the reference image and the region detected in the other image, projected onto the reference image using the homography relating the images.