Book Chapter DOI

SURF: speeded up robust features

07 May 2006, Vol. 1, pp. 404–417
TL;DR: A novel scale- and rotation-invariant interest point detector and descriptor, coined SURF (Speeded Up Robust Features), which approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster.
Abstract: In this paper, we present a novel scale- and rotation-invariant interest point detector and descriptor, coined SURF (Speeded Up Robust Features). It approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster. This is achieved by relying on integral images for image convolutions; by building on the strengths of the leading existing detectors and descriptors (in casu, using a Hessian matrix-based measure for the detector, and a distribution-based descriptor); and by simplifying these methods to the essential. This leads to a combination of novel detection, description, and matching steps. The paper presents experimental results on a standard evaluation set, as well as on imagery obtained in the context of a real-life object recognition application. Both show SURF's strong performance.

Summary (2 min read)

1 Introduction

  • The task of finding correspondences between two images of the same scene or object is part of many computer vision applications.
  • It has been their goal to develop both a detector and descriptor, which in comparison to the state-of-the-art are faster to compute, while not sacrificing performance.
  • Concerning the photometric deformations, the authors assume a simple linear model with a scale factor and offset.
  • Section 2 describes related work, on which their results are founded.

3 Fast-Hessian Detector

  • The authors base their detector on the Hessian matrix because of its good performance in computation time and accuracy.
  • Therefore, the scale space is analysed by up-scaling the filter size rather than iteratively reducing the image size.
  • At larger scales, the step between consecutive filter sizes should also scale accordingly.
  • As the ratios of their filter layout remain constant after scaling, the approximated Gaussian derivatives scale accordingly.
  • Fig. 2 (left) shows an example of the detected interest points using their ’Fast-Hessian’ detector.

4 SURF Descriptor

  • The good performance of SIFT compared to other descriptors [8] is remarkable.
  • Its mixing of crudely localised information and the distribution of gradient related features seems to yield good distinctive power while fending off the effects of localisation errors in terms of scale or space.
  • The proposed SURF descriptor is based on similar properties, with a complexity stripped down even further.
  • The first step consists of fixing a reproducible orientation based on information from a circular region around the interest point.
  • These two steps are now explained in turn.

4.1 Orientation Assignment

  • For that purpose, the authors first calculate the Haar-wavelet responses in x and y direction, shown in Fig. 2, and this in a circular neighbourhood of radius 6s around the interest point, with s the scale at which the interest point was detected.
  • Therefore, the authors use again integral images for fast filtering.
  • The horizontal and vertical responses within the window are summed.
  • The longest such vector lends its orientation to the interest point.
  • Small window sizes fire on single dominating wavelet responses; large sizes yield maxima in vector length that are not outspoken.
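The orientation step sketched in these bullets can be illustrated as follows (a simplified sketch only: the π/3 sliding window follows the original paper, while the input layout, the lack of Gaussian weighting, and the sampling are assumptions of this illustration):

```python
import math

def dominant_orientation(responses):
    # responses: (dx, dy) Haar-wavelet response pairs collected from the
    # circular neighbourhood around the interest point.
    angles = [math.atan2(dy, dx) for dx, dy in responses]
    best_len, best_angle = -1.0, 0.0
    for centre in angles:
        # Sum all responses whose angle lies inside a sliding orientation
        # window of size pi/3 centred on this response.
        sx = sy = 0.0
        for (dx, dy), a in zip(responses, angles):
            d = (a - centre + math.pi) % (2 * math.pi) - math.pi
            if abs(d) <= math.pi / 6:
                sx += dx
                sy += dy
        length = math.hypot(sx, sy)
        # The longest summed vector lends its orientation to the point.
        if length > best_len:
            best_len, best_angle = length, math.atan2(sy, sx)
    return best_angle
```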

4.2 Descriptor Components

  • For the extraction of the descriptor, the first step consists of constructing a square region centered around the interest point, and oriented along the orientation selected in the previous section.
  • The wavelet responses are invariant to a bias in illumination.
  • The extended descriptor for 4 × 4 subregions (SURF-128) turns out to perform best.
  • Hence, this minimal information allows for faster matching and gives a slight increase in performance.
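The 64-dimensional descriptor layout can be sketched as follows (assumptions of this illustration: a 20 × 20 grid of precomputed Haar responses in the oriented region, 5 × 5 samples per subregion, and normalisation to unit length for contrast invariance):

```python
import numpy as np

def surf_descriptor(dx, dy):
    # dx, dy: 20x20 arrays of Haar-wavelet responses, already rotated into
    # the orientation selected for the interest point.
    v = []
    for i in range(4):          # 4x4 subregions
        for j in range(4):
            sx = dx[5 * i:5 * i + 5, 5 * j:5 * j + 5]
            sy = dy[5 * i:5 * i + 5, 5 * j:5 * j + 5]
            # Each subregion contributes (sum dx, sum |dx|, sum dy, sum |dy|).
            v += [sx.sum(), np.abs(sx).sum(), sy.sum(), np.abs(sy).sum()]
    v = np.asarray(v)
    return v / (np.linalg.norm(v) + 1e-12)  # unit vector: contrast invariance
```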

5 Experimental Results

  • First, the authors present results on a standard evaluation set, for both the detector and the descriptor.
  • For the detector comparison, the authors selected the two viewpoint changes (Graffiti and Wall), one zoom and rotation (Boat) and lighting changes (see Fig. 6, discussed below).
  • The SURF descriptor outperforms the other descriptors in a systematic and significant way, with sometimes more than 10% improvement in recall for the same level of precision.
  • The timings were evaluated on a standard Linux PC (Pentium IV, 3GHz).
  • The object shown on the reference image with the highest number of matches with respect to the test image is chosen as the recognised object.

6 Conclusion

  • The authors have presented a fast and performant interest point detection-description scheme which outperforms the current state-of-the-art, both in speed and accuracy.
  • The descriptor is easily extendable for the description of affine invariant regions.
  • The authors gratefully acknowledge the support from Swiss SNF NCCR project IM2, Toyota-TME and the Flemish Fund for Scientific Research.


Figures (9)


SURF: Speeded Up Robust Features
Herbert Bay¹, Tinne Tuytelaars², and Luc Van Gool¹,²
¹ ETH Zurich, {bay, vangool}@vision.ee.ethz.ch
² Katholieke Universiteit Leuven, {Tinne.Tuytelaars, Luc.Vangool}@esat.kuleuven.be
Abstract. In this paper, we present a novel scale- and rotation-invariant
interest point detector and descriptor, coined SURF (Speeded Up Ro-
bust Features). It approximates or even outperforms previously proposed
schemes with respect to repeatability, distinctiveness, and robustness, yet
can be computed and compared much faster.
This is achieved by relying on integral images for image convolutions;
by building on the strengths of the leading existing detectors and descrip-
tors (in casu, using a Hessian matrix-based measure for the detector, and
a distribution-based descriptor); and by simplifying these methods to the
essential. This leads to a combination of novel detection, description, and
matching steps. The paper presents experimental results on a standard
evaluation set, as well as on imagery obtained in the context of a real-life
object recognition application. Both show SURF’s strong performance.
1 Introduction
The task of finding correspondences between two images of the same scene or
object is part of many computer vision applications. Camera calibration, 3D
reconstruction, image registration, and object recognition are just a few. The
search for discrete image correspondences, the goal of this work, can be divided into three main steps. First, ‘interest points’ are selected at distinctive
locations in the image, such as corners, blobs, and T-junctions. The most valu-
able property of an interest point detector is its repeatability, i.e. whether it
reliably finds the same interest points under different viewing conditions. Next,
the neighbourhood of every interest point is represented by a feature vector. This
descriptor has to be distinctive and, at the same time, robust to noise, detec-
tion errors, and geometric and photometric deformations. Finally, the descriptor
vectors are matched between different images. The matching is often based on a
distance between the vectors, e.g. the Mahalanobis or Euclidean distance. The
dimension of the descriptor has a direct impact on the time this takes, and a
lower number of dimensions is therefore desirable.
It has been our goal to develop both a detector and descriptor, which in
comparison to the state-of-the-art are faster to compute, while not sacrificing
performance. In order to succeed, one has to strike a balance between the above
A. Leonardis, H. Bischof, and A. Pinz (Eds.): ECCV 2006, Part I, LNCS 3951, pp. 404–417, 2006.
© Springer-Verlag Berlin Heidelberg 2006

requirements, like reducing the descriptor’s dimension and complexity, while
keeping it sufficiently distinctive.
A wide variety of detectors and descriptors have already been proposed in
the literature (e.g. [1, 2, 3, 4, 5, 6]). Also, detailed comparisons and evaluations on
benchmarking datasets have been performed [7, 8, 9]. While constructing our fast
detector and descriptor, we built on the insights gained from this previous work
in order to get a feel for which aspects contribute to performance. In
our experiments on benchmark image sets as well as on a real object recognition
application, the resulting detector and descriptor are not only faster, but also
more distinctive and equally repeatable.
When working with local features, a first issue that needs to be settled is
the required level of invariance. Clearly, this depends on the expected geomet-
ric and photometric deformations, which in turn are determined by the possible
changes in viewing conditions. Here, we focus on scale and image rotation invari-
ant detectors and descriptors. These seem to offer a good compromise between
feature complexity and robustness to commonly occurring deformations. Skew,
anisotropic scaling, and perspective effects are assumed to be second-order ef-
fects, that are covered to some degree by the overall robustness of the descriptor.
As also claimed by Lowe [2], the additional complexity of full affine-invariant fea-
tures often has a negative impact on their robustness and does not pay off, unless
really large viewpoint changes are to be expected. In some cases, even rotation
invariance can be left out, resulting in a scale-invariant only version of our de-
scriptor, which we refer to as ’upright SURF’ (U-SURF). Indeed, in quite a few
applications, like mobile robot navigation or visual tourist guiding, the camera
often only rotates about the vertical axis. The benefit of avoiding the overkill of
rotation invariance in such cases is not only increased speed, but also increased
discriminative power. Concerning the photometric deformations, we assume a
simple linear model with a scale factor and offset. Notice that our detector and
descriptor don’t use colour.
The paper is organised as follows. Section 2 describes related work, on which
our results are founded. Section 3 describes the interest point detection scheme.
In section 4, the new descriptor is presented. Finally, section 5 shows the exper-
imental results and section 6 concludes the paper.
2 Related Work
Interest Point Detectors. The most widely used detector probably is the Har-
ris corner detector [10], proposed back in 1988, based on the eigenvalues of the
second-moment matrix. However, Harris corners are not scale-invariant. Lin-
deberg introduced the concept of automatic scale selection [1]. This allows to
detect interest points in an image, each with their own characteristic scale.
He experimented with both the determinant of the Hessian matrix as well as
the Laplacian (which corresponds to the trace of the Hessian matrix) to detect
blob-like structures. Mikolajczyk and Schmid refined this method, creating ro-
bust and scale-invariant feature detectors with high repeatability, which they

coined Harris-Laplace and Hessian-Laplace [11]. They used a (scale-adapted)
Harris measure or the determinant of the Hessian matrix to select the location,
and the Laplacian to select the scale. Focusing on speed, Lowe [12] approxi-
mated the Laplacian of Gaussian (LoG) by a Difference of Gaussians (DoG)
filter.
Several other scale-invariant interest point detectors have been proposed. Ex-
amples are the salient region detector proposed by Kadir and Brady [13], which
maximises the entropy within the region, and the edge-based region detector pro-
posed by Jurie et al. [14]. They seem less amenable to acceleration though. Also,
several affine-invariant feature detectors have been proposed that can cope with
larger viewpoint changes. However, these fall outside the scope of this paper.
By studying the existing detectors and from published comparisons [15, 8],
we can conclude that (1) Hessian-based detectors are more stable and repeat-
able than their Harris-based counterparts. Using the determinant of the Hessian
matrix rather than its trace (the Laplacian) seems advantageous, as it fires less
on elongated, ill-localised structures. Also, (2) approximations like the DoG can
bring speed at a low cost in terms of lost accuracy.
Feature Descriptors. An even larger variety of feature descriptors has been
proposed, like Gaussian derivatives [16], moment invariants [17], complex fea-
tures [18, 19], steerable filters [20], phase-based local features [21], and descrip-
tors representing the distribution of smaller-scale features within the interest
point neighbourhood. The latter, introduced by Lowe [2], have been shown to
outperform the others [7]. This can be explained by the fact that they capture
a substantial amount of information about the spatial intensity patterns, while
at the same time being robust to small deformations or localisation errors. The
descriptor in [2], called SIFT for short, computes a histogram of local oriented
gradients around the interest point and stores the bins in a 128-dimensional
vector (8 orientation bins for each of the 4 × 4 location bins).
Various refinements on this basic scheme have been proposed. Ke and Suk-
thankar [4] applied PCA on the gradient image. This PCA-SIFT yields a 36-
dimensional descriptor which is fast for matching, but proved to be less distinctive than SIFT in a second comparative study by Mikolajczyk et al. [8]; moreover, its slower feature computation offsets the benefit of faster matching. In the same paper [8],
the authors have proposed a variant of SIFT, called GLOH, which proved to be
even more distinctive with the same number of dimensions. However, GLOH is
computationally more expensive.
The SIFT descriptor still seems to be the most appealing descriptor for prac-
tical uses, and hence also the most widely used nowadays. It is distinctive and
relatively fast, which is crucial for on-line applications. Recently, Se et al. [22]
implemented SIFT on a Field Programmable Gate Array (FPGA) and improved
its speed by an order of magnitude. However, the high dimensionality of the de-
scriptor is a drawback of SIFT at the matching step. For on-line applications
on a regular PC, each one of the three steps (detection, description, matching)
should be faster still. Lowe proposed a best-bin-first alternative [2] in order to
speed up the matching step, but this results in lower accuracy.

Our approach. In this paper, we propose a novel detector-descriptor scheme,
coined SURF (Speeded-Up Robust Features). The detector is based on the Hes-
sian matrix [11, 1], but uses a very basic approximation, just as DoG [2] is a
very basic Laplacian-based detector. It relies on integral images to reduce the
computation time and we therefore call it the ‘Fast-Hessian’ detector. The descriptor, on the other hand, describes a distribution of Haar-wavelet responses
within the interest point neighbourhood. Again, we exploit integral images for
speed. Moreover, only 64 dimensions are used, reducing the time for feature com-
putation and matching, and increasing simultaneously the robustness. We also
present a new indexing step based on the sign of the Laplacian, which increases
not only the matching speed, but also the robustness of the descriptor.
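The indexing idea can be illustrated with a toy matcher (purely a sketch: the dict keys `sign` and `desc` are hypothetical, and brute-force nearest neighbour stands in for whatever matching strategy is used in practice):

```python
def match_by_laplacian_sign(feats1, feats2):
    # The Laplacian's sign (trace of the Hessian, available at no extra cost
    # from detection) separates dark blobs on light backgrounds from the
    # reverse case; only features with the same sign need to be compared.
    pairs = []
    for a in feats1:
        same_sign = [b for b in feats2 if b["sign"] == a["sign"]]
        if not same_sign:
            continue
        best = min(same_sign,
                   key=lambda b: sum((x - y) ** 2
                                     for x, y in zip(a["desc"], b["desc"])))
        pairs.append((a, best))
    return pairs
```

In the best case, roughly half the candidate comparisons are skipped without touching the descriptor vectors at all.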
In order to make the paper more self-contained, we succinctly discuss the con-
cept of integral images, as defined by [23]. They allow for the fast implementation
of box-type convolution filters. The entry of an integral image $I_\Sigma(\mathbf{x})$ at a location $\mathbf{x} = (x, y)$ represents the sum of all pixels in the input image $I$ within the rectangular region formed by the point $\mathbf{x}$ and the origin:
$$I_\Sigma(\mathbf{x}) = \sum_{i=0}^{i \le x} \sum_{j=0}^{j \le y} I(i, j).$$
With $I_\Sigma$ calculated, it only takes four additions to calculate the sum of the intensities over any upright rectangular area, independent of its size.
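The construction above can be sketched in a few lines (an illustrative sketch, not the paper's implementation; `numpy` cumulative sums stand in for an incremental scan):

```python
import numpy as np

def integral_image(img):
    # I_Sigma[y, x] = sum of all pixels in the rectangle spanned by the
    # origin and (x, y), inclusive.
    return img.cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, x0, y0, x1, y1):
    # Sum of intensities over the inclusive rectangle [x0..x1] x [y0..y1],
    # using at most four lookups/additions regardless of the box size.
    s = ii[y1, x1]
    if x0 > 0:
        s -= ii[y1, x0 - 1]
    if y0 > 0:
        s -= ii[y0 - 1, x1]
    if x0 > 0 and y0 > 0:
        s += ii[y0 - 1, x0 - 1]
    return s
```

This constant-time box sum is what makes arbitrarily large box filters as cheap to evaluate as small ones.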
3 Fast-Hessian Detector
We base our detector on the Hessian matrix because of its good performance in
computation time and accuracy. However, rather than using a different measure
for selecting the location and the scale (as was done in the Hessian-Laplace
detector [11]), we rely on the determinant of the Hessian for both. Given a point
$\mathbf{x} = (x, y)$ in an image $I$, the Hessian matrix $\mathcal{H}(\mathbf{x})$ in $\mathbf{x}$ at scale $\sigma$ is defined as follows:
$$\mathcal{H}(\mathbf{x}) = \begin{bmatrix} L_{xx}(\mathbf{x}) & L_{xy}(\mathbf{x}) \\ L_{xy}(\mathbf{x}) & L_{yy}(\mathbf{x}) \end{bmatrix}, \qquad (1)$$
where $L_{xx}(\mathbf{x})$ is the convolution of the Gaussian second order derivative $\frac{\partial^2}{\partial x^2} g(\sigma)$ with the image $I$ in point $\mathbf{x}$, and similarly for $L_{xy}(\mathbf{x})$ and $L_{yy}(\mathbf{x})$.
Gaussians are optimal for scale-space analysis, as shown in [24]. In practice,
however, the Gaussian needs to be discretised and cropped (Fig. 1 left half), and
even with Gaussian filters aliasing still occurs as soon as the resulting images are
sub-sampled. Also, the property that no new structures can appear while going to
lower resolutions may have been proven in the 1D case, but is known to not apply
in the relevant 2D case [25]. Hence, the importance of the Gaussian seems to have
been somewhat overrated in this regard, and here we test a simpler alternative.
As Gaussian filters are non-ideal in any case, and given Lowe’s success with LoG
approximations, we push the approximation even further with box filters (Fig. 1
right half). These approximate second order Gaussian derivatives, and can be
evaluated very fast using integral images, independently of size. As shown in the
results section, the performance is comparable to the one using the discretised
and cropped Gaussians.

Fig. 1. Left to right: The (discretised and cropped) Gaussian second order partial
derivatives in y-direction and xy-direction, and our approximations thereof using box
filters. The grey regions are equal to zero.
The 9 × 9 box filters in Fig. 1 are approximations for Gaussian second order
derivatives with $\sigma = 1.2$ and represent our lowest scale (i.e. highest spatial resolution). We denote our approximations by $D_{xx}$, $D_{yy}$, and $D_{xy}$. The weights applied to the rectangular regions are kept simple for computational efficiency, but we need to further balance the relative weights in the expression for the Hessian's determinant with
$$\frac{|L_{xy}(1.2)|_F \, |D_{xx}(9)|_F}{|L_{xx}(1.2)|_F \, |D_{xy}(9)|_F} = 0.912\ldots \simeq 0.9,$$
where $|x|_F$ is the Frobenius norm. This yields
$$\det(\mathcal{H}_{\mathrm{approx}}) = D_{xx} D_{yy} - (0.9\, D_{xy})^2. \qquad (2)$$
Furthermore, the filter responses are normalised with respect to the mask size.
This guarantees a constant Frobenius norm for any filter size.
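Given response maps $D_{xx}$, $D_{yy}$, $D_{xy}$ for a filter of a given size, Eq. (2) together with the mask-size normalisation reduces to a one-liner (a sketch; the elementwise `numpy` formulation and the argument layout are assumptions of this illustration):

```python
import numpy as np

def hessian_det_approx(Dxx, Dyy, Dxy, filter_size):
    # Normalise the responses by the mask area so the Frobenius norm stays
    # constant across filter sizes, then apply Eq. (2) with the 0.9 weight.
    inv_area = 1.0 / (filter_size * filter_size)
    Dxx, Dyy, Dxy = Dxx * inv_area, Dyy * inv_area, Dxy * inv_area
    return Dxx * Dyy - (0.9 * Dxy) ** 2
```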
Scale spaces are usually implemented as image pyramids. The images are
repeatedly smoothed with a Gaussian and subsequently sub-sampled in order to
achieve a higher level of the pyramid. Due to the use of box filters and integral
images, we do not have to iteratively apply the same filter to the output of a
previously filtered layer, but instead can apply such filters of any size at exactly
the same speed directly on the original image, and even in parallel (although the
latter is not exploited here). Therefore, the scale space is analysed by up-scaling
the filter size rather than iteratively reducing the image size. The output of the
above 9 × 9 filter is considered as the initial scale layer, to which we will refer as scale $s = 1.2$ (corresponding to Gaussian derivatives with $\sigma = 1.2$). The following
layers are obtained by filtering the image with gradually bigger masks, taking
into account the discrete nature of integral images and the specific structure of
our filters. Specifically, this results in filters of size 9 × 9, 15 × 15, 21 × 21, 27 × 27,
etc. At larger scales, the step between consecutive filter sizes should also scale
accordingly. Hence, for each new octave, the filter size increase is doubled (going
from 6 to 12 to 24). Simultaneously, the sampling intervals for the extraction of
the interest points can be doubled as well.
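The resulting filter-size pyramid can be generated programmatically (a sketch; the convention that each octave starts at the second filter size of the previous one is an assumption of this illustration, as is the function name):

```python
def surf_filter_sizes(n_octaves=3, per_octave=4, base=9, step=6):
    # Filter sizes grow by `step` within an octave; the step doubles with
    # each new octave (6 -> 12 -> 24), as described in the text.
    octaves = []
    start = base
    for _ in range(n_octaves):
        octaves.append([start + i * step for i in range(per_octave)])
        start = octaves[-1][1]  # assumed overlap with the previous octave
        step *= 2
    return octaves
```

With the defaults this yields 9, 15, 21, 27 for the first octave; the scale of a filter of size $N$ is then $s = 1.2 \times N / 9$.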
As the ratios of our filter layout remain constant after scaling, the approximated Gaussian derivatives scale accordingly. Thus, for example, our 27 × 27 filter corresponds to $\sigma = 3 \times 1.2 = 3.6 = s$. Furthermore, as the Frobenius norm
remains constant for our filters, they are already scale normalised [26].
In order to localise interest points in the image and over scales, a non-
maximum suppression in a 3 × 3 × 3 neighbourhood is applied. The maxima
of the determinant of the Hessian matrix are then interpolated in scale and

Citations
Journal ArticleDOI
TL;DR: A novel scale- and rotation-invariant detector and descriptor, coined SURF (Speeded-Up Robust Features), which approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster.

12,449 citations

Proceedings ArticleDOI
06 Nov 2011
TL;DR: This paper proposes a very fast binary descriptor based on BRIEF, called ORB, which is rotation invariant and resistant to noise, and demonstrates through experiments how ORB is at two orders of magnitude faster than SIFT, while performing as well in many situations.
Abstract: Feature matching is at the base of many computer vision problems, such as object recognition or structure from motion. Current methods rely on costly descriptors for detection and matching. In this paper, we propose a very fast binary descriptor based on BRIEF, called ORB, which is rotation invariant and resistant to noise. We demonstrate through experiments how ORB is at two orders of magnitude faster than SIFT, while performing as well in many situations. The efficiency is tested on several real-world applications, including object detection and patch-tracking on a smart phone.

8,702 citations


Cites background or methods from "SURF: speeded up robust features"

  • ...This has led to an intensive search for replacements with lower computation cost; arguably the best of these is SURF [2]....


  • ...There are various ways to describe the orientation of a keypoint; many of these involve histograms of gradient computations, for example in SIFT [17] and the approximation by block patterns in SURF [2]....


Journal ArticleDOI
TL;DR: ORB-SLAM as discussed by the authors is a feature-based monocular SLAM system that operates in real time, in small and large indoor and outdoor environments, with a survival of the fittest strategy that selects the points and keyframes of the reconstruction.
Abstract: This paper presents ORB-SLAM, a feature-based monocular simultaneous localization and mapping (SLAM) system that operates in real time, in small and large indoor and outdoor environments. The system is robust to severe motion clutter, allows wide baseline loop closing and relocalization, and includes full automatic initialization. Building on excellent algorithms of recent years, we designed from scratch a novel system that uses the same features for all SLAM tasks: tracking, mapping, relocalization, and loop closing. A survival of the fittest strategy that selects the points and keyframes of the reconstruction leads to excellent robustness and generates a compact and trackable map that only grows if the scene content changes, allowing lifelong operation. We present an exhaustive evaluation in 27 sequences from the most popular datasets. ORB-SLAM achieves unprecedented performance with respect to other state-of-the-art monocular SLAM approaches. For the benefit of the community, we make the source code public.

4,522 citations


Proceedings Article
21 Jun 2014
TL;DR: DeCAF as discussed by the authors is an open-source implementation of these deep convolutional activation features, along with all associated network parameters, to enable vision researchers to conduct experimentation with deep representations across a range of visual concept learning paradigms.
Abstract: We evaluate whether features extracted from the activation of a deep convolutional network trained in a fully supervised fashion on a large, fixed set of object recognition tasks can be repurposed to novel generic tasks. Our generic tasks may differ significantly from the originally trained tasks and there may be insufficient labeled or unlabeled data to conventionally train or adapt a deep architecture to the new tasks. We investigate and visualize the semantic clustering of deep convolutional features with respect to a variety of such tasks, including scene recognition, domain adaptation, and fine-grained recognition challenges. We compare the efficacy of relying on various network levels to define a fixed feature, and report novel results that significantly outperform the state-of-the-art on several important vision challenges. We are releasing DeCAF, an open-source implementation of these deep convolutional activation features, along with all associated network parameters to enable vision researchers to be able to conduct experimentation with deep representations across a range of visual concept learning paradigms.

3,760 citations

References
Journal ArticleDOI
TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.

46,906 citations

Proceedings ArticleDOI
01 Dec 2001
TL;DR: A machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates and the introduction of a new image representation called the "integral image" which allows the features used by the detector to be computed very quickly.
Abstract: This paper describes a machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates. This work is distinguished by three key contributions. The first is the introduction of a new image representation called the "integral image" which allows the features used by our detector to be computed very quickly. The second is a learning algorithm, based on AdaBoost, which selects a small number of critical visual features from a larger set and yields extremely efficient classifiers. The third contribution is a method for combining increasingly more complex classifiers in a "cascade" which allows background regions of the image to be quickly discarded while spending more computation on promising object-like regions. The cascade can be viewed as an object specific focus-of-attention mechanism which unlike previous approaches provides statistical guarantees that discarded regions are unlikely to contain the object of interest. In the domain of face detection the system yields detection rates comparable to the best previous systems. Used in real-time applications, the detector runs at 15 frames per second without resorting to image differencing or skin color detection.

18,620 citations


"SURF: speeded up robust features" refers background in this paper

  • ...In order to make the paper more self-contained, we succinctly discuss the concept of integral images, as defined by [23]....


Proceedings ArticleDOI
20 Sep 1999
TL;DR: Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.
Abstract: An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low residual least squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.

16,989 citations


"SURF: speeded up robust features" refers methods in this paper

  • ...Focusing on speed, Lowe [12] approximated the Laplacian of Gaussian (LoG) by a Difference of Gaussians (DoG) filter....


Proceedings ArticleDOI
01 Jan 1988
TL;DR: The problem the authors are addressing in Alvey Project MMI149 is that of using computer vision to understand the unconstrained 3D world, in which the viewed scenes will in general contain too wide a diversity of objects for topdown recognition techniques to work.
Abstract: The problem we are addressing in Alvey Project MMI149 is that of using computer vision to understand the unconstrained 3D world, in which the viewed scenes will in general contain too wide a diversity of objects for topdown recognition techniques to work. For example, we desire to obtain an understanding of natural scenes, containing roads, buildings, trees, bushes, etc., as typified by the two frames from a sequence illustrated in Figure 1. The solution to this problem that we are pursuing is to use a computer vision system based upon motion analysis of a monocular image sequence from a mobile camera. By extraction and tracking of image features, representations of the 3D analogues of these features can be constructed.

13,993 citations

Journal ArticleDOI
TL;DR: It is observed that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best and Moments and steerable filters show the best performance among the low dimensional descriptors.
Abstract: In this paper, we compare the performance of descriptors computed for local interest regions, as, for example, extracted by the Harris-Affine detector [Mikolajczyk, K and Schmid, C, 2004]. Many different descriptors have been proposed in the literature. It is unclear which descriptors are more appropriate and how their performance depends on the interest region detector. The descriptors should be distinctive and at the same time robust to changes in viewing conditions as well as to errors of the detector. Our evaluation uses as criterion recall with respect to precision and is carried out for different image transformations. We compare shape context [Belongie, S, et al., April 2002], steerable filters [Freeman, W and Adelson, E, Sept. 1991], PCA-SIFT [Ke, Y and Sukthankar, R, 2004], differential invariants [Koenderink, J and van Doorn, A, 1987], spin images [Lazebnik, S, et al., 2003], SIFT [Lowe, D. G., 1999], complex filters [Schaffalitzky, F and Zisserman, A, 2002], moment invariants [Van Gool, L, et al., 1996], and cross-correlation for different types of interest regions. We also propose an extension of the SIFT descriptor and show that it outperforms the original method. Furthermore, we observe that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best. Moments and steerable filters show the best performance among the low dimensional descriptors.

7,057 citations

Frequently Asked Questions (18)
Q1. What contributions have the authors mentioned in the paper "Surf: speeded up robust features" ?

In this paper, the authors present a novel scale- and rotation-invariant interest point detector and descriptor, coined SURF (Speeded Up Robust Features). The paper presents experimental results on a standard evaluation set, as well as on imagery obtained in the context of a real-life object recognition application. 

Future work will aim at optimising the code for additional speed up. 

The benefit of avoiding the overkill of rotation invariance in such cases is not only increased speed, but also increased discriminative power. 

Only 64 dimensions are used, reducing the time for feature computation and matching, and simultaneously increasing the robustness. 

Using the determinant of the Hessian matrix rather than its trace (the Laplacian) seems advantageous, as it fires less on elongated, ill-localised structures. 
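The determinant-versus-trace point above can be illustrated with a minimal sketch. This is not the paper's box-filter implementation; it uses plain central differences via numpy, keeping the 0.9 weight the paper applies to the mixed term to balance the box-filter approximation.

```python
import numpy as np

def hessian_response(img):
    """Determinant-of-Hessian blob response, approximated with central
    differences (SURF itself uses box filters on an integral image).
    The determinant fires on blob-like structures while penalising
    elongated, ill-localised ones, unlike the trace (Laplacian)."""
    dy, dx = np.gradient(img.astype(float))
    dyy, dyx = np.gradient(dy)
    dxy, dxx = np.gradient(dx)
    # 0.9 is the relative weight SURF uses for the mixed derivative
    return dxx * dyy - (0.9 * dxy) ** 2
```

On a synthetic Gaussian blob, the response peaks at the blob centre, which is what the detector then localises with non-maximum suppression across scales.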

In order to arrive at these SURF descriptors, the authors experimented with fewer and more wavelet features, using dx² and dy², higher-order wavelets, PCA, median values, average values, etc. 

The most valuable property of an interest point detector is its repeatability, i.e. whether it reliably finds the same interest points under different viewing conditions. 

For the extraction of the descriptor, the first step consists of constructing a square region centered around the interest point, and oriented along the orientation selected in the previous section. 

The most widely used detector probably is the Harris corner detector [10], proposed back in 1988, based on the eigenvalues of the second-moment matrix. 

Each sub-region has a four-dimensional descriptor vector v = (∑dx, ∑dy, ∑|dx|, ∑|dy|) for its underlying intensity structure. 
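The sub-region vector above can be sketched directly. A minimal version, assuming the Haar wavelet responses dx and dy over the oriented square region are already available as arrays (SURF additionally Gaussian-weights them and uses a 4×4 grid of sub-regions, giving 16 × 4 = 64 dimensions):

```python
import numpy as np

def subregion_vector(dx, dy):
    """Four-dimensional sub-region descriptor (sum dx, sum dy, sum |dx|, sum |dy|)
    over the Haar wavelet responses of one sub-region."""
    return np.array([dx.sum(), dy.sum(), np.abs(dx).sum(), np.abs(dy).sum()])

def surf_descriptor(dx_grid, dy_grid, n_sub=4):
    """Concatenate the sub-region vectors over an n_sub x n_sub grid and
    L2-normalise, yielding the 64-dimensional descriptor for n_sub = 4."""
    h, w = dx_grid.shape
    sh, sw = h // n_sub, w // n_sub
    vecs = []
    for i in range(n_sub):
        for j in range(n_sub):
            sl = (slice(i * sh, (i + 1) * sh), slice(j * sw, (j + 1) * sw))
            vecs.append(subregion_vector(dx_grid[sl], dy_grid[sl]))
    v = np.concatenate(vecs)
    return v / np.linalg.norm(v)
```

The sums of signed responses capture the polarity of the intensity changes, while the sums of absolute responses capture their strength, which is what gives the descriptor its distinctiveness at such a low dimension.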

Examples are the salient region detector proposed by Kadir and Brady [13], which maximises the entropy within the region, and the edge-based region detector proposed by Jurie et al. [14]. 

With IΣ calculated, it only takes four additions to calculate the sum of the intensities over any upright, rectangular area, independent of its size. 
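The four-additions property mentioned above is easy to demonstrate. A minimal sketch of the integral image IΣ and the constant-time box sum (numpy-based; indices here are inclusive row/column bounds, an illustrative convention):

```python
import numpy as np

def integral_image(img):
    """I_sigma(y, x) = sum of all intensities in img[:y+1, :x+1]."""
    return img.astype(float).cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, r0, c0, r1, c1):
    """Sum of intensities over the upright rectangle with inclusive
    corners (r0, c0) and (r1, c1): at most four table lookups and
    three additions/subtractions, independent of the rectangle's size."""
    total = ii[r1, c1]
    if r0 > 0:
        total -= ii[r0 - 1, c1]
    if c0 > 0:
        total -= ii[r1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total
```

This constant cost per box is what makes SURF's box-filter approximations of the Gaussian second derivatives independent of filter size, so larger scales are no more expensive than smaller ones.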

Anisotropic scaling and perspective effects are assumed to be second-order effects that are covered to some degree by the overall robustness of the descriptor. 

For example, their 27 × 27 filter corresponds to σ = 3 × 1.2 = 3.6 = s. Furthermore, as the Frobenius norm remains constant for their filters, they are already scale normalised [26]. 
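The filter-size-to-scale mapping in the excerpt above is linear: the 9 × 9 base filter approximates Gaussian second derivatives at σ = 1.2, and larger filters scale accordingly. A one-line helper (hypothetical name, but the constants are the paper's):

```python
def filter_scale(filter_size, base_size=9, base_sigma=1.2):
    """Map a SURF box-filter side length to its approximated Gaussian scale:
    the 9x9 base filter corresponds to sigma = 1.2, and scale grows
    linearly with filter size (e.g. 27 -> 3 * 1.2 = 3.6)."""
    return base_sigma * filter_size / base_size
```

Because the filters can be enlarged directly, SURF samples scale space by growing the filter rather than by repeatedly smoothing and subsampling the image.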

This PCA-SIFT yields a 36-dimensional descriptor which is fast for matching, but it proved to be less distinctive than SIFT in a second comparative study by Mikolajczyk et al. [8], and its slower feature computation reduces the effect of the fast matching. 

Due to space limitations, only results on similarity-threshold-based matching are shown in Fig. 7, as this technique is better suited to represent the distribution of the descriptor in its feature space [8] and is in more general use. 

The authors also propose an upright version of their descriptor (U-SURF) that is not invariant to image rotation and is therefore faster to compute and better suited for applications where the camera remains more or less horizontal. 

The SIFT descriptor still seems to be the most appealing descriptor for practical uses, and hence also the most widely used nowadays.