Book Chapter DOI

SURF: speeded up robust features

07 May 2006, Vol. 1, pp. 404–417
TL;DR: A novel scale- and rotation-invariant interest point detector and descriptor, coined SURF (Speeded Up Robust Features), which approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster.
Abstract: In this paper, we present a novel scale- and rotation-invariant interest point detector and descriptor, coined SURF (Speeded Up Robust Features). It approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster. This is achieved by relying on integral images for image convolutions; by building on the strengths of the leading existing detectors and descriptors (in casu, using a Hessian matrix-based measure for the detector, and a distribution-based descriptor); and by simplifying these methods to the essential. This leads to a combination of novel detection, description, and matching steps. The paper presents experimental results on a standard evaluation set, as well as on imagery obtained in the context of a real-life object recognition application. Both show SURF's strong performance.

Summary (2 min read)

1 Introduction

  • The task of finding correspondences between two images of the same scene or object is part of many computer vision applications.
  • It has been their goal to develop both a detector and descriptor, which in comparison to the state-of-the-art are faster to compute, while not sacrificing performance.
  • Concerning the photometric deformations, the authors assume a simple linear model with a scale factor and offset.
  • Section 2 describes related work, on which their results are founded.

3 Fast-Hessian Detector

  • The authors base their detector on the Hessian matrix because of its good performance in computation time and accuracy.
  • Therefore, the scale space is analysed by up-scaling the filter size rather than iteratively reducing the image size.
  • At larger scales, the step between consecutive filter sizes should also scale accordingly.
  • As the ratios of their filter layout remain constant after scaling, the approximated Gaussian derivatives scale accordingly.
  • Fig. 2 (left) shows an example of the detected interest points using their ’Fast-Hessian’ detector.

4 SURF Descriptor

  • The good performance of SIFT compared to other descriptors [8] is remarkable.
  • Its mixing of crudely localised information and the distribution of gradient related features seems to yield good distinctive power while fending off the effects of localisation errors in terms of scale or space.
  • The proposed SURF descriptor is based on similar properties, with a complexity stripped down even further.
  • The first step consists of fixing a reproducible orientation based on information from a circular region around the interest point.
  • These two steps are now explained in turn.

4.1 Orientation Assignment

  • For that purpose, the authors first calculate the Haar-wavelet responses in x and y direction, shown in Fig. 2, and this in a circular neighbourhood of radius 6s around the interest point, with s the scale at which the interest point was detected.
  • Therefore, the authors use again integral images for fast filtering.
  • The horizontal and vertical responses within the window are summed.
  • The longest such vector lends its orientation to the interest point.
  • Small window sizes fire on single dominating wavelet responses; large sizes yield maxima in vector length that are not outspoken.
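The orientation step sketched in these bullets can be illustrated as follows (a simplified sketch only: the π/3 sliding window follows the original paper, while the input layout, the lack of Gaussian weighting, and the sampling are assumptions of this illustration):

```python
import math

def dominant_orientation(responses):
    # responses: (dx, dy) Haar-wavelet response pairs collected from the
    # circular neighbourhood around the interest point.
    angles = [math.atan2(dy, dx) for dx, dy in responses]
    best_len, best_angle = -1.0, 0.0
    for centre in angles:
        # Sum all responses whose angle lies inside a sliding orientation
        # window of size pi/3 centred on this response.
        sx = sy = 0.0
        for (dx, dy), a in zip(responses, angles):
            d = (a - centre + math.pi) % (2 * math.pi) - math.pi
            if abs(d) <= math.pi / 6:
                sx += dx
                sy += dy
        length = math.hypot(sx, sy)
        # The longest summed vector lends its orientation to the point.
        if length > best_len:
            best_len, best_angle = length, math.atan2(sy, sx)
    return best_angle
```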

4.2 Descriptor Components

  • For the extraction of the descriptor, the first step consists of constructing a square region centered around the interest point, and oriented along the orientation selected in the previous section.
  • The wavelet responses are invariant to a bias in illumination.
  • The extended descriptor for 4 × 4 subregions (SURF-128) turns out to perform best.
  • Hence, this minimal information allows for faster matching and gives a slight increase in performance.
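The 64-dimensional descriptor layout can be sketched as follows (assumptions of this illustration: a 20 × 20 grid of precomputed Haar responses in the oriented region, 5 × 5 samples per subregion, and normalisation to unit length for contrast invariance):

```python
import numpy as np

def surf_descriptor(dx, dy):
    # dx, dy: 20x20 arrays of Haar-wavelet responses, already rotated into
    # the orientation selected for the interest point.
    v = []
    for i in range(4):          # 4x4 subregions
        for j in range(4):
            sx = dx[5 * i:5 * i + 5, 5 * j:5 * j + 5]
            sy = dy[5 * i:5 * i + 5, 5 * j:5 * j + 5]
            # Each subregion contributes (sum dx, sum |dx|, sum dy, sum |dy|).
            v += [sx.sum(), np.abs(sx).sum(), sy.sum(), np.abs(sy).sum()]
    v = np.asarray(v)
    return v / (np.linalg.norm(v) + 1e-12)  # unit vector: contrast invariance
```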

5 Experimental Results

  • First, the authors present results on a standard evaluation set, for both the detector and the descriptor.
  • For the detector comparison, the authors selected the two viewpoint changes (Graffiti and Wall), one zoom and rotation (Boat) and lighting changes (see Fig. 6, discussed below).
  • The SURF descriptor outperforms the other descriptors in a systematic and significant way, with sometimes more than 10% improvement in recall for the same level of precision.
  • The timings were evaluated on a standard Linux PC (Pentium IV, 3GHz).
  • The object shown on the reference image with the highest number of matches with respect to the test image is chosen as the recognised object.

6 Conclusion

  • The authors have presented a fast and performant interest point detection-description scheme which outperforms the current state-of-the-art, both in speed and accuracy.
  • The descriptor is easily extendable for the description of affine invariant regions.
  • The authors gratefully acknowledge the support from Swiss SNF NCCR project IM2, Toyota-TME and the Flemish Fund for Scientific Research.


Figures (9)


SURF: Speeded Up Robust Features
Herbert Bay¹, Tinne Tuytelaars², and Luc Van Gool¹,²
¹ ETH Zurich, {bay, vangool}@vision.ee.ethz.ch
² Katholieke Universiteit Leuven, {Tinne.Tuytelaars, Luc.Vangool}@esat.kuleuven.be
Abstract. In this paper, we present a novel scale- and rotation-invariant
interest point detector and descriptor, coined SURF (Speeded Up Ro-
bust Features). It approximates or even outperforms previously proposed
schemes with respect to repeatability, distinctiveness, and robustness, yet
can be computed and compared much faster.
This is achieved by relying on integral images for image convolutions;
by building on the strengths of the leading existing detectors and descrip-
tors (in casu, using a Hessian matrix-based measure for the detector, and
a distribution-based descriptor); and by simplifying these methods to the
essential. This leads to a combination of novel detection, description, and
matching steps. The paper presents experimental results on a standard
evaluation set, as well as on imagery obtained in the context of a real-life
object recognition application. Both show SURF’s strong performance.
1 Introduction
The task of finding correspondences between two images of the same scene or
object is part of many computer vision applications. Camera calibration, 3D
reconstruction, image registration, and object recognition are just a few. The
search for discrete image correspondences, the goal of this work, can be divided into three main steps. First, ‘interest points’ are selected at distinctive
locations in the image, such as corners, blobs, and T-junctions. The most valu-
able property of an interest point detector is its repeatability, i.e. whether it
reliably finds the same interest points under different viewing conditions. Next,
the neighbourhood of every interest point is represented by a feature vector. This
descriptor has to be distinctive and, at the same time, robust to noise, detec-
tion errors, and geometric and photometric deformations. Finally, the descriptor
vectors are matched between different images. The matching is often based on a
distance between the vectors, e.g. the Mahalanobis or Euclidean distance. The
dimension of the descriptor has a direct impact on the time this takes, and a
lower number of dimensions is therefore desirable.
It has been our goal to develop both a detector and descriptor, which in
comparison to the state-of-the-art are faster to compute, while not sacrificing
performance. In order to succeed, one has to strike a balance between the above
A. Leonardis, H. Bischof, and A. Pinz (Eds.): ECCV 2006, Part I, LNCS 3951, pp. 404–417, 2006.
© Springer-Verlag Berlin Heidelberg 2006

requirements, like reducing the descriptor’s dimension and complexity, while
keeping it sufficiently distinctive.
A wide variety of detectors and descriptors have already been proposed in
the literature (e.g. [1, 2, 3, 4, 5, 6]). Also, detailed comparisons and evaluations on
benchmarking datasets have been performed [7, 8, 9]. While constructing our fast
detector and descriptor, we built on the insights gained from this previous work
in order to get a feel for which aspects contribute to performance. In
our experiments on benchmark image sets as well as on a real object recognition
application, the resulting detector and descriptor are not only faster, but also
more distinctive and equally repeatable.
When working with local features, a first issue that needs to be settled is
the required level of invariance. Clearly, this depends on the expected geomet-
ric and photometric deformations, which in turn are determined by the possible
changes in viewing conditions. Here, we focus on scale and image rotation invari-
ant detectors and descriptors. These seem to offer a good compromise between
feature complexity and robustness to commonly occurring deformations. Skew,
anisotropic scaling, and perspective effects are assumed to be second-order ef-
fects, that are covered to some degree by the overall robustness of the descriptor.
As also claimed by Lowe [2], the additional complexity of full affine-invariant fea-
tures often has a negative impact on their robustness and does not pay off, unless
really large viewpoint changes are to be expected. In some cases, even rotation
invariance can be left out, resulting in a scale-invariant only version of our de-
scriptor, which we refer to as ’upright SURF’ (U-SURF). Indeed, in quite a few
applications, like mobile robot navigation or visual tourist guiding, the camera
often only rotates about the vertical axis. The benefit of avoiding the overkill of
rotation invariance in such cases is not only increased speed, but also increased
discriminative power. Concerning the photometric deformations, we assume a
simple linear model with a scale factor and offset. Notice that our detector and
descriptor don’t use colour.
The paper is organised as follows. Section 2 describes related work, on which
our results are founded. Section 3 describes the interest point detection scheme.
In section 4, the new descriptor is presented. Finally, section 5 shows the exper-
imental results and section 6 concludes the paper.
2 Related Work
Interest Point Detectors. The most widely used detector probably is the Har-
ris corner detector [10], proposed back in 1988, based on the eigenvalues of the
second-moment matrix. However, Harris corners are not scale-invariant. Lin-
deberg introduced the concept of automatic scale selection [1]. This allows to
detect interest points in an image, each with their own characteristic scale.
He experimented with both the determinant of the Hessian matrix as well as
the Laplacian (which corresponds to the trace of the Hessian matrix) to detect
blob-like structures. Mikolajczyk and Schmid refined this method, creating ro-
bust and scale-invariant feature detectors with high repeatability, which they

coined Harris-Laplace and Hessian-Laplace [11]. They used a (scale-adapted)
Harris measure or the determinant of the Hessian matrix to select the location,
and the Laplacian to select the scale. Focusing on speed, Lowe [12] approxi-
mated the Laplacian of Gaussian (LoG) by a Difference of Gaussians (DoG)
filter.
Several other scale-invariant interest point detectors have been proposed. Ex-
amples are the salient region detector proposed by Kadir and Brady [13], which
maximises the entropy within the region, and the edge-based region detector pro-
posed by Jurie et al. [14]. They seem less amenable to acceleration though. Also,
several affine-invariant feature detectors have been proposed that can cope with
larger viewpoint changes. However, these fall outside the scope of this paper.
By studying the existing detectors and from published comparisons [15, 8],
we can conclude that (1) Hessian-based detectors are more stable and repeat-
able than their Harris-based counterparts. Using the determinant of the Hessian
matrix rather than its trace (the Laplacian) seems advantageous, as it fires less
on elongated, ill-localised structures. Also, (2) approximations like the DoG can
bring speed at a low cost in terms of lost accuracy.
Feature Descriptors. An even larger variety of feature descriptors has been
proposed, like Gaussian derivatives [16], moment invariants [17], complex fea-
tures [18, 19], steerable filters [20], phase-based local features [21], and descrip-
tors representing the distribution of smaller-scale features within the interest
point neighbourhood. The latter, introduced by Lowe [2], have been shown to
outperform the others [7]. This can be explained by the fact that they capture
a substantial amount of information about the spatial intensity patterns, while
at the same time being robust to small deformations or localisation errors. The
descriptor in [2], called SIFT for short, computes a histogram of local oriented
gradients around the interest point and stores the bins in a 128-dimensional
vector (8 orientation bins for each of the 4 × 4 location bins).
Various refinements on this basic scheme have been proposed. Ke and Suk-
thankar [4] applied PCA on the gradient image. This PCA-SIFT yields a 36-
dimensional descriptor which is fast for matching, but proved to be less distinctive than SIFT in a second comparative study by Mikolajczyk et al. [8]; moreover, its slower feature computation offsets the benefit of faster matching. In the same paper [8],
the authors have proposed a variant of SIFT, called GLOH, which proved to be
even more distinctive with the same number of dimensions. However, GLOH is
computationally more expensive.
The SIFT descriptor still seems to be the most appealing descriptor for prac-
tical uses, and hence also the most widely used nowadays. It is distinctive and
relatively fast, which is crucial for on-line applications. Recently, Se et al. [22]
implemented SIFT on a Field Programmable Gate Array (FPGA) and improved
its speed by an order of magnitude. However, the high dimensionality of the de-
scriptor is a drawback of SIFT at the matching step. For on-line applications
on a regular PC, each one of the three steps (detection, description, matching)
should be faster still. Lowe proposed a best-bin-first alternative [2] in order to
speed up the matching step, but this results in lower accuracy.

Our approach. In this paper, we propose a novel detector-descriptor scheme,
coined SURF (Speeded-Up Robust Features). The detector is based on the Hes-
sian matrix [11, 1], but uses a very basic approximation, just as DoG [2] is a
very basic Laplacian-based detector. It relies on integral images to reduce the
computation time and we therefore call it the ‘Fast-Hessian’ detector. The descriptor, on the other hand, describes a distribution of Haar-wavelet responses
within the interest point neighbourhood. Again, we exploit integral images for
speed. Moreover, only 64 dimensions are used, reducing the time for feature com-
putation and matching, and increasing simultaneously the robustness. We also
present a new indexing step based on the sign of the Laplacian, which increases
not only the matching speed, but also the robustness of the descriptor.
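The indexing idea can be illustrated with a toy matcher (purely a sketch: the dict keys `sign` and `desc` are hypothetical, and brute-force nearest neighbour stands in for whatever matching strategy is used in practice):

```python
def match_by_laplacian_sign(feats1, feats2):
    # The Laplacian's sign (trace of the Hessian, available at no extra cost
    # from detection) separates dark blobs on light backgrounds from the
    # reverse case; only features with the same sign need to be compared.
    pairs = []
    for a in feats1:
        same_sign = [b for b in feats2 if b["sign"] == a["sign"]]
        if not same_sign:
            continue
        best = min(same_sign,
                   key=lambda b: sum((x - y) ** 2
                                     for x, y in zip(a["desc"], b["desc"])))
        pairs.append((a, best))
    return pairs
```

In the best case, roughly half the candidate comparisons are skipped without touching the descriptor vectors at all.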
In order to make the paper more self-contained, we succinctly discuss the con-
cept of integral images, as defined by [23]. They allow for the fast implementation
of box-type convolution filters. The entry of an integral image $I_\Sigma(\mathbf{x})$ at a location $\mathbf{x} = (x, y)$ represents the sum of all pixels in the input image $I$ within the rectangular region formed by the point $\mathbf{x}$ and the origin:
$$I_\Sigma(\mathbf{x}) = \sum_{i=0}^{i \le x} \sum_{j=0}^{j \le y} I(i, j).$$
With $I_\Sigma$ calculated, it only takes four additions to calculate the sum of the intensities over any upright rectangular area, independent of its size.
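The construction above can be sketched in a few lines (an illustrative sketch, not the paper's implementation; `numpy` cumulative sums stand in for an incremental scan):

```python
import numpy as np

def integral_image(img):
    # I_Sigma[y, x] = sum of all pixels in the rectangle spanned by the
    # origin and (x, y), inclusive.
    return img.cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, x0, y0, x1, y1):
    # Sum of intensities over the inclusive rectangle [x0..x1] x [y0..y1],
    # using at most four lookups/additions regardless of the box size.
    s = ii[y1, x1]
    if x0 > 0:
        s -= ii[y1, x0 - 1]
    if y0 > 0:
        s -= ii[y0 - 1, x1]
    if x0 > 0 and y0 > 0:
        s += ii[y0 - 1, x0 - 1]
    return s
```

This constant-time box sum is what makes arbitrarily large box filters as cheap to evaluate as small ones.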
3 Fast-Hessian Detector
We base our detector on the Hessian matrix because of its good performance in
computation time and accuracy. However, rather than using a different measure
for selecting the location and the scale (as was done in the Hessian-Laplace
detector [11]), we rely on the determinant of the Hessian for both. Given a point
$\mathbf{x} = (x, y)$ in an image $I$, the Hessian matrix $\mathcal{H}(\mathbf{x})$ in $\mathbf{x}$ at scale $\sigma$ is defined as follows:
$$\mathcal{H}(\mathbf{x}) = \begin{bmatrix} L_{xx}(\mathbf{x}) & L_{xy}(\mathbf{x}) \\ L_{xy}(\mathbf{x}) & L_{yy}(\mathbf{x}) \end{bmatrix}, \qquad (1)$$
where $L_{xx}(\mathbf{x})$ is the convolution of the Gaussian second order derivative $\frac{\partial^2}{\partial x^2} g(\sigma)$ with the image $I$ in point $\mathbf{x}$, and similarly for $L_{xy}(\mathbf{x})$ and $L_{yy}(\mathbf{x})$.
Gaussians are optimal for scale-space analysis, as shown in [24]. In practice,
however, the Gaussian needs to be discretised and cropped (Fig. 1 left half), and
even with Gaussian filters aliasing still occurs as soon as the resulting images are
sub-sampled. Also, the property that no new structures can appear while going to
lower resolutions may have been proven in the 1D case, but is known to not apply
in the relevant 2D case [25]. Hence, the importance of the Gaussian seems to have
been somewhat overrated in this regard, and here we test a simpler alternative.
As Gaussian filters are non-ideal in any case, and given Lowe’s success with LoG
approximations, we push the approximation even further with box filters (Fig. 1
right half). These approximate second order Gaussian derivatives, and can be
evaluated very fast using integral images, independently of size. As shown in the
results section, the performance is comparable to the one using the discretised
and cropped Gaussians.

Fig. 1. Left to right: The (discretised and cropped) Gaussian second order partial
derivatives in y-direction and xy-direction, and our approximations thereof using box
filters. The grey regions are equal to zero.
The 9 × 9 box filters in Fig. 1 are approximations for Gaussian second order
derivatives with $\sigma = 1.2$ and represent our lowest scale (i.e. highest spatial resolution). We denote our approximations by $D_{xx}$, $D_{yy}$, and $D_{xy}$. The weights applied to the rectangular regions are kept simple for computational efficiency, but we need to further balance the relative weights in the expression for the Hessian's determinant with
$$\frac{|L_{xy}(1.2)|_F \, |D_{xx}(9)|_F}{|L_{xx}(1.2)|_F \, |D_{xy}(9)|_F} = 0.912\ldots \simeq 0.9,$$
where $|x|_F$ is the Frobenius norm. This yields
$$\det(\mathcal{H}_{\mathrm{approx}}) = D_{xx} D_{yy} - (0.9\, D_{xy})^2. \qquad (2)$$
Furthermore, the filter responses are normalised with respect to the mask size.
This guarantees a constant Frobenius norm for any filter size.
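Given response maps $D_{xx}$, $D_{yy}$, $D_{xy}$ for a filter of a given size, Eq. (2) together with the mask-size normalisation reduces to a one-liner (a sketch; the elementwise `numpy` formulation and the argument layout are assumptions of this illustration):

```python
import numpy as np

def hessian_det_approx(Dxx, Dyy, Dxy, filter_size):
    # Normalise the responses by the mask area so the Frobenius norm stays
    # constant across filter sizes, then apply Eq. (2) with the 0.9 weight.
    inv_area = 1.0 / (filter_size * filter_size)
    Dxx, Dyy, Dxy = Dxx * inv_area, Dyy * inv_area, Dxy * inv_area
    return Dxx * Dyy - (0.9 * Dxy) ** 2
```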
Scale spaces are usually implemented as image pyramids. The images are
repeatedly smoothed with a Gaussian and subsequently sub-sampled in order to
achieve a higher level of the pyramid. Due to the use of box filters and integral
images, we do not have to iteratively apply the same filter to the output of a
previously filtered layer, but instead can apply such filters of any size at exactly
the same speed directly on the original image, and even in parallel (although the
latter is not exploited here). Therefore, the scale space is analysed by up-scaling
the filter size rather than iteratively reducing the image size. The output of the
above 9 × 9 filter is considered as the initial scale layer, to which we will refer as scale $s = 1.2$ (corresponding to Gaussian derivatives with $\sigma = 1.2$). The following
layers are obtained by filtering the image with gradually bigger masks, taking
into account the discrete nature of integral images and the specific structure of
our filters. Specifically, this results in filters of size 9 × 9, 15 × 15, 21 × 21, 27 × 27,
etc. At larger scales, the step between consecutive filter sizes should also scale
accordingly. Hence, for each new octave, the filter size increase is doubled (going
from 6 to 12 to 24). Simultaneously, the sampling intervals for the extraction of
the interest points can be doubled as well.
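The resulting filter-size pyramid can be generated programmatically (a sketch; the convention that each octave starts at the second filter size of the previous one is an assumption of this illustration, as is the function name):

```python
def surf_filter_sizes(n_octaves=3, per_octave=4, base=9, step=6):
    # Filter sizes grow by `step` within an octave; the step doubles with
    # each new octave (6 -> 12 -> 24), as described in the text.
    octaves = []
    start = base
    for _ in range(n_octaves):
        octaves.append([start + i * step for i in range(per_octave)])
        start = octaves[-1][1]  # assumed overlap with the previous octave
        step *= 2
    return octaves
```

With the defaults this yields 9, 15, 21, 27 for the first octave; the scale of a filter of size $N$ is then $s = 1.2 \times N / 9$.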
As the ratios of our filter layout remain constant after scaling, the approximated Gaussian derivatives scale accordingly. Thus, for example, our 27 × 27 filter corresponds to $\sigma = 3 \times 1.2 = 3.6 = s$. Furthermore, as the Frobenius norm
remains constant for our filters, they are already scale normalised [26].
In order to localise interest points in the image and over scales, a non-
maximum suppression in a 3 × 3 × 3 neighbourhood is applied. The maxima
of the determinant of the Hessian matrix are then interpolated in scale and

Citations
Journal ArticleDOI
TL;DR: A novel scale- and rotation-invariant detector and descriptor, coined SURF (Speeded-Up Robust Features), which approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster.

12,449 citations

Proceedings ArticleDOI
06 Nov 2011
TL;DR: This paper proposes a very fast binary descriptor based on BRIEF, called ORB, which is rotation invariant and resistant to noise, and demonstrates through experiments how ORB is at two orders of magnitude faster than SIFT, while performing as well in many situations.
Abstract: Feature matching is at the base of many computer vision problems, such as object recognition or structure from motion. Current methods rely on costly descriptors for detection and matching. In this paper, we propose a very fast binary descriptor based on BRIEF, called ORB, which is rotation invariant and resistant to noise. We demonstrate through experiments how ORB is at two orders of magnitude faster than SIFT, while performing as well in many situations. The efficiency is tested on several real-world applications, including object detection and patch-tracking on a smart phone.

8,702 citations


Cites background or methods from "SURF: speeded up robust features"

  • ...This has led to an intensive search for replacements with lower computation cost; arguably the best of these is SURF [2]....


  • ...There are various ways to describe the orientation of a keypoint; many of these involve histograms of gradient computations, for example in SIFT [17] and the approximation by block patterns in SURF [2]....


Journal ArticleDOI
TL;DR: ORB-SLAM as discussed by the authors is a feature-based monocular SLAM system that operates in real time, in small and large indoor and outdoor environments, with a survival of the fittest strategy that selects the points and keyframes of the reconstruction.
Abstract: This paper presents ORB-SLAM, a feature-based monocular simultaneous localization and mapping (SLAM) system that operates in real time, in small and large indoor and outdoor environments. The system is robust to severe motion clutter, allows wide baseline loop closing and relocalization, and includes full automatic initialization. Building on excellent algorithms of recent years, we designed from scratch a novel system that uses the same features for all SLAM tasks: tracking, mapping, relocalization, and loop closing. A survival of the fittest strategy that selects the points and keyframes of the reconstruction leads to excellent robustness and generates a compact and trackable map that only grows if the scene content changes, allowing lifelong operation. We present an exhaustive evaluation in 27 sequences from the most popular datasets. ORB-SLAM achieves unprecedented performance with respect to other state-of-the-art monocular SLAM approaches. For the benefit of the community, we make the source code public.

4,522 citations


Proceedings Article
21 Jun 2014
TL;DR: DeCAF as discussed by the authors is an open-source implementation of these deep convolutional activation features, along with all associated network parameters, to enable vision researchers to conduct experimentation with deep representations across a range of visual concept learning paradigms.
Abstract: We evaluate whether features extracted from the activation of a deep convolutional network trained in a fully supervised fashion on a large, fixed set of object recognition tasks can be repurposed to novel generic tasks. Our generic tasks may differ significantly from the originally trained tasks and there may be insufficient labeled or unlabeled data to conventionally train or adapt a deep architecture to the new tasks. We investigate and visualize the semantic clustering of deep convolutional features with respect to a variety of such tasks, including scene recognition, domain adaptation, and fine-grained recognition challenges. We compare the efficacy of relying on various network levels to define a fixed feature, and report novel results that significantly outperform the state-of-the-art on several important vision challenges. We are releasing DeCAF, an open-source implementation of these deep convolutional activation features, along with all associated network parameters to enable vision researchers to be able to conduct experimentation with deep representations across a range of visual concept learning paradigms.

3,760 citations

References
Journal ArticleDOI
TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.

46,906 citations

Proceedings ArticleDOI
01 Dec 2001
TL;DR: A machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates and the introduction of a new image representation called the "integral image" which allows the features used by the detector to be computed very quickly.
Abstract: This paper describes a machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates. This work is distinguished by three key contributions. The first is the introduction of a new image representation called the "integral image" which allows the features used by our detector to be computed very quickly. The second is a learning algorithm, based on AdaBoost, which selects a small number of critical visual features from a larger set and yields extremely efficient classifiers. The third contribution is a method for combining increasingly more complex classifiers in a "cascade" which allows background regions of the image to be quickly discarded while spending more computation on promising object-like regions. The cascade can be viewed as an object specific focus-of-attention mechanism which unlike previous approaches provides statistical guarantees that discarded regions are unlikely to contain the object of interest. In the domain of face detection the system yields detection rates comparable to the best previous systems. Used in real-time applications, the detector runs at 15 frames per second without resorting to image differencing or skin color detection.

18,620 citations


"SURF: speeded up robust features" refers background in this paper

  • ...In order to make the paper more self-contained, we succinctly discuss the concept of integral images, as defined by [23]....


Proceedings ArticleDOI
20 Sep 1999
TL;DR: Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.
Abstract: An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low residual least squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.

16,989 citations


"SURF: speeded up robust features" refers methods in this paper

  • ...Focusing on speed, Lowe [12] approximated the Laplacian of Gaussian (LoG) by a Difference of Gaussians (DoG) filter....


Proceedings ArticleDOI
01 Jan 1988
TL;DR: The problem the authors are addressing in Alvey Project MMI149 is that of using computer vision to understand the unconstrained 3D world, in which the viewed scenes will in general contain too wide a diversity of objects for topdown recognition techniques to work.
Abstract: The problem we are addressing in Alvey Project MMI149 is that of using computer vision to understand the unconstrained 3D world, in which the viewed scenes will in general contain too wide a diversity of objects for topdown recognition techniques to work. For example, we desire to obtain an understanding of natural scenes, containing roads, buildings, trees, bushes, etc., as typified by the two frames from a sequence illustrated in Figure 1. The solution to this problem that we are pursuing is to use a computer vision system based upon motion analysis of a monocular image sequence from a mobile camera. By extraction and tracking of image features, representations of the 3D analogues of these features can be constructed.

13,993 citations

Journal ArticleDOI
TL;DR: It is observed that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best and Moments and steerable filters show the best performance among the low dimensional descriptors.
Abstract: In this paper, we compare the performance of descriptors computed for local interest regions, as, for example, extracted by the Harris-Affine detector [Mikolajczyk, K and Schmid, C, 2004]. Many different descriptors have been proposed in the literature. It is unclear which descriptors are more appropriate and how their performance depends on the interest region detector. The descriptors should be distinctive and at the same time robust to changes in viewing conditions as well as to errors of the detector. Our evaluation uses as criterion recall with respect to precision and is carried out for different image transformations. We compare shape context [Belongie, S, et al., April 2002], steerable filters [Freeman, W and Adelson, E, Sept. 1991], PCA-SIFT [Ke, Y and Sukthankar, R, 2004], differential invariants [Koenderink, J and van Doorn, A, 1987], spin images [Lazebnik, S, et al., 2003], SIFT [Lowe, D. G., 1999], complex filters [Schaffalitzky, F and Zisserman, A, 2002], moment invariants [Van Gool, L, et al., 1996], and cross-correlation for different types of interest regions. We also propose an extension of the SIFT descriptor and show that it outperforms the original method. Furthermore, we observe that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best. Moments and steerable filters show the best performance among the low dimensional descriptors.

7,057 citations

Frequently Asked Questions (18)
Q1. What contributions have the authors mentioned in the paper "Surf: speeded up robust features" ?

In this paper, the authors present a novel scale- and rotation-invariant interest point detector and descriptor, coined SURF (Speeded Up Robust Features). The paper presents experimental results on a standard evaluation set, as well as on imagery obtained in the context of a real-life object recognition application. 

Future work will aim at optimising the code for additional speed up. 

The benefit of avoiding the overkill of rotation invariance in such cases is not only increased speed, but also increased discriminative power. 

Only 64 dimensions are used, reducing the time for feature computation and matching, and simultaneously increasing the robustness. 

Using the determinant of the Hessian matrix rather than its trace (the Laplacian) seems advantageous, as it fires less on elongated, ill-localised structures. 
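The determinant-versus-trace point above can be illustrated with a minimal sketch. This is not the paper's box-filter implementation; it uses plain central differences via numpy, keeping the 0.9 weight the paper applies to the mixed term to balance the box-filter approximation.

```python
import numpy as np

def hessian_response(img):
    """Determinant-of-Hessian blob response, approximated with central
    differences (SURF itself uses box filters on an integral image).
    The determinant fires on blob-like structures while penalising
    elongated, ill-localised ones, unlike the trace (Laplacian)."""
    dy, dx = np.gradient(img.astype(float))
    dyy, dyx = np.gradient(dy)
    dxy, dxx = np.gradient(dx)
    # 0.9 is the relative weight SURF uses for the mixed derivative
    return dxx * dyy - (0.9 * dxy) ** 2
```

On a synthetic Gaussian blob, the response peaks at the blob centre, which is what the detector then localises with non-maximum suppression across scales.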

In order to arrive at these SURF descriptors, the authors experimented with fewer and more wavelet features, using dx² and dy², higher-order wavelets, PCA, median values, average values, etc. 

The most valuable property of an interest point detector is its repeatability, i.e. whether it reliably finds the same interest points under different viewing conditions. 

For the extraction of the descriptor, the first step consists of constructing a square region centered around the interest point, and oriented along the orientation selected in the previous section. 

The most widely used detector probably is the Harris corner detector [10], proposed back in 1988, based on the eigenvalues of the second-moment matrix. 

Each sub-region has a four-dimensional descriptor vector v = (∑dx, ∑dy, ∑|dx|, ∑|dy|) for its underlying intensity structure. 
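The sub-region vector above can be sketched directly. A minimal version, assuming the Haar wavelet responses dx and dy over the oriented square region are already available as arrays (SURF additionally Gaussian-weights them and uses a 4×4 grid of sub-regions, giving 16 × 4 = 64 dimensions):

```python
import numpy as np

def subregion_vector(dx, dy):
    """Four-dimensional sub-region descriptor (sum dx, sum dy, sum |dx|, sum |dy|)
    over the Haar wavelet responses of one sub-region."""
    return np.array([dx.sum(), dy.sum(), np.abs(dx).sum(), np.abs(dy).sum()])

def surf_descriptor(dx_grid, dy_grid, n_sub=4):
    """Concatenate the sub-region vectors over an n_sub x n_sub grid and
    L2-normalise, yielding the 64-dimensional descriptor for n_sub = 4."""
    h, w = dx_grid.shape
    sh, sw = h // n_sub, w // n_sub
    vecs = []
    for i in range(n_sub):
        for j in range(n_sub):
            sl = (slice(i * sh, (i + 1) * sh), slice(j * sw, (j + 1) * sw))
            vecs.append(subregion_vector(dx_grid[sl], dy_grid[sl]))
    v = np.concatenate(vecs)
    return v / np.linalg.norm(v)
```

The sums of signed responses capture the polarity of the intensity changes, while the sums of absolute responses capture their strength, which is what gives the descriptor its distinctiveness at such a low dimension.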

Examples are the salient region detector proposed by Kadir and Brady [13], which maximises the entropy within the region, and the edge-based region detector proposed by Jurie et al. [14]. 

With IΣ calculated, it only takes four additions to calculate the sum of the intensities over any upright, rectangular area, independent of its size. 
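The four-additions property mentioned above is easy to demonstrate. A minimal sketch of the integral image IΣ and the constant-time box sum (numpy-based; indices here are inclusive row/column bounds, an illustrative convention):

```python
import numpy as np

def integral_image(img):
    """I_sigma(y, x) = sum of all intensities in img[:y+1, :x+1]."""
    return img.astype(float).cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, r0, c0, r1, c1):
    """Sum of intensities over the upright rectangle with inclusive
    corners (r0, c0) and (r1, c1): at most four table lookups and
    three additions/subtractions, independent of the rectangle's size."""
    total = ii[r1, c1]
    if r0 > 0:
        total -= ii[r0 - 1, c1]
    if c0 > 0:
        total -= ii[r1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total
```

This constant cost per box is what makes SURF's box-filter approximations of the Gaussian second derivatives independent of filter size, so larger scales are no more expensive than smaller ones.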

Anisotropic scaling and perspective effects are assumed to be second-order effects that are covered to some degree by the overall robustness of the descriptor. 

For example, their 27 × 27 filter corresponds to σ = 3 × 1.2 = 3.6 = s. Furthermore, as the Frobenius norm remains constant for their filters, they are already scale normalised [26]. 
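The filter-size-to-scale mapping in the excerpt above is linear: the 9 × 9 base filter approximates Gaussian second derivatives at σ = 1.2, and larger filters scale accordingly. A one-line helper (hypothetical name, but the constants are the paper's):

```python
def filter_scale(filter_size, base_size=9, base_sigma=1.2):
    """Map a SURF box-filter side length to its approximated Gaussian scale:
    the 9x9 base filter corresponds to sigma = 1.2, and scale grows
    linearly with filter size (e.g. 27 -> 3 * 1.2 = 3.6)."""
    return base_sigma * filter_size / base_size
```

Because the filters can be enlarged directly, SURF samples scale space by growing the filter rather than by repeatedly smoothing and subsampling the image.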

This PCA-SIFT yields a 36-dimensional descriptor which is fast for matching, but it proved to be less distinctive than SIFT in a second comparative study by Mikolajczyk et al. [8], and its slower feature computation reduces the effect of the fast matching. 

Due to space limitations, only results on similarity-threshold-based matching are shown in Fig. 7, as this technique is better suited to represent the distribution of the descriptor in its feature space [8] and is in more general use. 

The authors also propose an upright version of their descriptor (U-SURF) that is not invariant to image rotation and is therefore faster to compute and better suited for applications where the camera remains more or less horizontal. 

The SIFT descriptor still seems to be the most appealing descriptor for practical uses, and hence also the most widely used nowadays.