Journal ArticleDOI

Speeded-Up Robust Features (SURF)

TL;DR: A novel scale- and rotation-invariant detector and descriptor, coined SURF (Speeded-Up Robust Features), which approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster.
About: This article is published in Computer Vision and Image Understanding. The article was published on 2008-06-01 and is currently open access. It has received 12449 citations to date. The article focuses on the topics: GLOH & Principal curvature-based region detector.

Summary (5 min read)

1. Introduction

  • The task of finding point correspondences between two images of the same scene or object is part of many computer vision applications.
  • It has been their goal to develop both a detector and descriptor that, in comparison to the state-of-the-art, are fast to compute while not sacrificing performance.
  • Skew, anisotropic scaling, and perspective effects are assumed to be second-order effects that are covered to some degree by the overall robustness of the descriptor.
  • In section 3, the authors describe the strategy applied for fast and robust interest point detection.
  • Both applications highlight SURF's benefits in terms of speed and robustness as opposed to other strategies.

2.1. Interest Point Detection

  • The most widely used detector is probably the Harris corner detector [15] , proposed back in 1988.
  • Mikolajczyk and Schmid [26] refined this method, creating robust and scale-invariant feature detectors with high repeatability, which they coined Harris-Laplace and Hessian-Laplace.
  • They used a (scale-adapted) Harris measure or the determinant of the Hessian matrix to select the location, and the Laplacian to select the scale.
  • They seem less amenable to acceleration though.
  • These fall outside the scope of this article.

2.2. Interest Point Description

  • An even larger variety of feature descriptors has been proposed, like Gaussian derivatives [11] , moment invariants [32] , complex features [1, 36] , steerable filters [12] , phase-based local features [6] , and descriptors representing the distribution of smaller-scale features within the interest point neighbourhood.
  • The latter, introduced by Lowe [24] , have been shown to outperform the others [28] .
  • In the same paper [30] , the authors proposed a variant of SIFT, called GLOH, which proved to be even more distinctive with the same number of dimensions.
  • It is distinctive and relatively fast, which is crucial for on-line applications.
  • Thanks to the descriptor's low dimensionality, any matching algorithm is bound to perform faster.

3. Interest Point Detection

  • The authors' approach for interest point detection uses a very basic Hessian-matrix approximation.
  • This lends itself to the use of integral images as made popular by Viola and Jones [41] , which reduces the computation time drastically.

3.1. Integral Images

  • In order to make the article more self-contained, the authors briefly discuss the concept of integral images.
  • They allow for fast computation of box type convolution filters.
  • Once the integral image has been computed, it takes three additions to calculate the sum of the intensities over any upright, rectangular area.
  • Hence, the calculation time is independent of its size.
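The three-addition lookup can be sketched as follows; `integral_image` and `box_sum` are illustrative names (not from the paper), and the sketch assumes a NumPy-style 2D intensity array:

```python
import numpy as np

def integral_image(img):
    """Integral image: ii[y, x] holds the sum of img[0:y+1, 0:x+1]."""
    return img.cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, r0, c0, r1, c1):
    """Sum of intensities over the upright rectangle with corners
    (r0, c0) and (r1, c1), inclusive. Only three additions/subtractions
    are needed, regardless of the rectangle's size."""
    total = ii[r1, c1]
    if r0 > 0:
        total -= ii[r0 - 1, c1]
    if c0 > 0:
        total -= ii[r1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total
```

A box filter of any side length therefore costs the same few lookups, which is what makes the up-scaled filters of section 3.3 cheap.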

3.2. Hessian Matrix Based Interest Points

  • The authors base their detector on the Hessian matrix because of its good performance in accuracy.
  • As shown in the results section and figure 3, the performance is comparable to or better than with the discretised and cropped Gaussians.
  • The relative weight w of the filter responses is used to balance the expression for the Hessian's determinant.
  • Notice that for theoretical correctness, the weighting changes depending on the scale.
  • The approximated determinant of the Hessian represents the blob response in the image at location x.
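The blob response reduces to a one-line formula; the weight w = 0.9 is the value the paper reports for the 9×9 filters, and the function name is illustrative:

```python
def hessian_response(Dxx, Dyy, Dxy, w=0.9):
    """Blob response: the approximated Hessian determinant, with the
    relative weight w balancing the box-filter responses against the
    Gaussian second derivatives they approximate (w = 0.9 for the
    9x9 filters in the paper)."""
    return Dxx * Dyy - (w * Dxy) ** 2
```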

3.3. Scale Space Representation

  • Interest points need to be found at different scales, not least because the search of correspondences often requires their comparison in images where they are seen at different scales.
  • Scale spaces are usually implemented as image pyramids: the images are repeatedly smoothed with a Gaussian and then sub-sampled in order to achieve a higher level of the pyramid.
  • Due to the use of box filters and integral images, SURF does not need to do this; instead, the scale space is analysed by up-scaling the filter size rather than iteratively reducing the image size (figure 4).
  • In total, an octave encompasses a scaling factor of 2 (which implies that one needs to more than double the filter size, see below).
  • At the same time, the sampling intervals for the extraction of the interest points can be doubled as well for every new octave.
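The filter-size progression can be sketched under the paper's reported scheme (9, 15, 21, 27 in the first octave, with the inter-layer step doubling per octave and octaves overlapping); the function name and defaults are illustrative:

```python
def surf_filter_sizes(n_octaves=3, layers=4, base=9, base_step=6):
    """Box-filter side lengths per octave. The step between layers
    doubles with each new octave, and octaves overlap: each new
    octave starts at the second layer of the previous one."""
    sizes, size, step = [], base, base_step
    for _ in range(n_octaves):
        octave = [size + i * step for i in range(layers)]
        sizes.append(octave)
        size, step = octave[1], step * 2
    return sizes
```

Per the paper, a filter of side length L corresponds to scale s = 1.2 × L / 9, since the base 9×9 filter approximates a Gaussian with σ = 1.2.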

3.4. Interest Point Localisation

  • Specifically, the authors use a fast variant introduced by Neubeck and Van Gool [33] .
  • The maxima of the determinant of the Hessian matrix are then interpolated in scale and image space with the method proposed by Brown et al. [5] .
  • Scale space interpolation is especially important in their case, as the difference in scale between the first layers of every octave is relatively large.
  • Figure 8 shows an example of the detected interest points using their 'Fast-Hessian' detector.
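A brute-force 3×3×3 non-maximum suppression over the response maps conveys the idea; Neubeck and Van Gool's variant reaches the same maxima with far fewer comparisons, so this sketch is only illustrative:

```python
import numpy as np

def local_maxima_3x3x3(responses, threshold):
    """Brute-force non-maximum suppression: keep (scale, row, col)
    positions whose response exceeds the threshold and is maximal in
    the 3x3x3 scale-space neighbourhood (plateau ties are kept)."""
    R = responses  # shape: (n_scales, n_rows, n_cols)
    keypoints = []
    for s in range(1, R.shape[0] - 1):
        for r in range(1, R.shape[1] - 1):
            for c in range(1, R.shape[2] - 1):
                v = R[s, r, c]
                if v > threshold and v == R[s-1:s+2, r-1:r+2, c-1:c+2].max():
                    keypoints.append((s, r, c))
    return keypoints
```

Each surviving maximum would then be interpolated in scale and image space, as described above.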

4. Interest Point Description and Matching

  • The authors' descriptor describes the distribution of the intensity content within the interest point neighbourhood, similar to the gradient information extracted by SIFT [24] and its variants.
  • The authors build on the distribution of first order Haar wavelet responses in x and y direction rather than the gradient, exploit integral images for speed, and use only 64 dimensions.
  • The authors refer to their detector-descriptor scheme as SURF -Speeded-Up Robust Features.
  • The first step consists of fixing a reproducible orientation based on information from a circular region around the interest point.
  • These three steps are explained in the following.

4.1. Orientation Assignment

  • For that purpose, the authors first calculate the Haar wavelet responses in x and y direction within a circular neighbourhood of radius 6s around the interest point, with s the scale at which the interest point was detected.
  • In keeping with the rest, the size of the wavelets is also scale-dependent, set to a side length of 4s.
  • Therefore, the authors can again use integral images for fast filtering.
  • The two summed responses then yield a local orientation vector.
  • Small window sizes fire on single dominating gradients; large sizes tend to yield maxima in vector length that are less pronounced.
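The sliding-window search can be sketched as follows, assuming the Haar responses `dx`, `dy` of the sample points have already been computed and Gaussian-weighted; the π/3 window width follows the paper, while the angular sampling step is an illustrative choice:

```python
import numpy as np

def dominant_orientation(dx, dy, window=np.pi / 3, n_steps=72):
    """Slide an orientation window of angular width pi/3 over the
    wavelet responses; the window whose summed response vector is
    longest yields the dominant orientation."""
    angles = np.arctan2(dy, dx)
    best_len, best_angle = -1.0, 0.0
    for start in np.linspace(-np.pi, np.pi, n_steps, endpoint=False):
        # responses whose angle falls inside [start, start + window)
        in_window = (angles - start) % (2 * np.pi) < window
        vx, vy = dx[in_window].sum(), dy[in_window].sum()
        length = vx * vx + vy * vy
        if length > best_len:
            best_len, best_angle = length, np.arctan2(vy, vx)
    return best_angle
```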

4.2. Descriptor based on Sum of Haar Wavelet Responses

  • For the extraction of the descriptor, the first step consists of constructing a square region centred around the interest point and oriented along the orientation selected in the previous section.
  • "Horizontal" and "vertical" here are defined in relation to the selected interest point orientation.
  • The authors then varied the number of sample points and sub-regions.
  • The 4 × 4 sub-region division solution provided the best results, see also section 5.
  • On the other hand, the short descriptor with 3 × 3 subregions (SURF-36) performs slightly worse, but allows for very fast matching and is still acceptable in comparison to other descriptors in the literature.
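Under the common SURF parameterisation (an oriented region sampled as a 20×20 grid of responses, split into 4×4 sub-regions of 5×5 samples), the 64-dimensional descriptor can be sketched as:

```python
import numpy as np

def surf_descriptor(dx, dy):
    """64-D descriptor from Haar responses on a 20x20 oriented sample
    grid: 4x4 sub-regions of 5x5 samples, each contributing
    (sum dx, sum |dx|, sum dy, sum |dy|); unit-normalised at the end
    for invariance to contrast changes."""
    assert dx.shape == dy.shape == (20, 20)
    feats = []
    for i in range(0, 20, 5):
        for j in range(0, 20, 5):
            bx, by = dx[i:i+5, j:j+5], dy[i:i+5, j:j+5]
            feats += [bx.sum(), np.abs(bx).sum(), by.sum(), np.abs(by).sum()]
    v = np.asarray(feats)
    return v / np.linalg.norm(v)
```

The SURF-36 variant mentioned above would use a 3×3 sub-region split instead, yielding 36 dimensions.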

4.3. Fast Indexing for Matching

  • For fast indexing during the matching stage, the sign of the Laplacian (i.e. the trace of the Hessian matrix) for the underlying interest point is included.
  • Typically, the interest points are found at blob-type structures.
  • The sign of the Laplacian distinguishes bright blobs on dark backgrounds from the reverse situation.
  • This feature is available at no extra computational cost as it was already computed during the detection phase.
  • Note that this is also of advantage for more advanced indexing methods.
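A sketch of how the sign test slots into nearest-neighbour matching; the function name and the ratio-test threshold of 0.7 are illustrative choices, not values from this summary:

```python
import numpy as np

def match_with_sign(desc_a, sign_a, desc_b, sign_b, ratio=0.7):
    """Nearest-neighbour matching with the Laplacian-sign pre-filter:
    descriptor distances are only computed between points of the same
    blob polarity. Assumes at least two candidates in desc_b."""
    matches = []
    for i, (da, sa) in enumerate(zip(desc_a, sign_a)):
        # opposite-polarity candidates never match; skip their distances
        dists = np.array([np.linalg.norm(da - db) if sa == sb else np.inf
                          for db, sb in zip(desc_b, sign_b)])
        nearest, second = np.argsort(dists)[:2]
        # ratio test: accept only clearly unambiguous nearest neighbours
        if np.isfinite(dists[nearest]) and dists[nearest] < ratio * dists[second]:
            matches.append((i, int(nearest)))
    return matches
```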

5. Results

  • The following presents both simulated and real-world results.
  • First, the authors evaluate the effect of some parameter settings and show the overall performance of their detector and descriptor based on a standard evaluation set.
  • Then, the authors describe two possible applications. (Evaluation image: first image of the Graffiti scene, 800 × 640.)
  • Taking this application to image registration a bit further, the authors focus in this article on the more difficult problem of camera calibration and 3D reconstruction, also in wide-baseline cases.
  • SURF manages to calibrate the cameras reliably and accurately, even in challenging cases.

5.1. Experimental Evaluation and Parameter Settings

  • The authors tested their detector using the image sequences and testing software provided by Mikolajczyk.
  • The evaluation criterion is the repeatability score.
  • The test sequences comprise images of real textured and structured scenes.
  • There are different types of geometric and photometric transformations, like changing viewpoints, zoom and rotation, image blur, lighting changes and JPEG compression.
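As commonly defined in Mikolajczyk's evaluation framework, the repeatability score reduces to a simple ratio (a sketch; the correspondence counting itself involves region-overlap thresholds not shown here):

```python
def repeatability(n_correspondences, n_points_a, n_points_b):
    """Repeatability: detected correspondences relative to the smaller
    number of interest points found in the scene part that both
    images share."""
    return n_correspondences / min(n_points_a, n_points_b)
```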

5.1.1. SURF Detector

  • The authors tested two versions of their Fast-Hessian detector, depending on the initial Gaussian derivative filter size.
  • The thresholds were adapted according to the number of interest points found with the DoG detector.
  • The FH-15 detector is more than three times faster than DoG and more than four times faster than Hessian-Laplace (see also table 1).
  • The repeatability scores for the Graffiti sequence are comparable for all detectors.
  • Hence, these deformations have to be accounted for by the overall robustness of the features.

5.1.2. SURF Descriptor

  • Here, the authors focus on two options offered by the SURF descriptor and their effect on recall/precision.
  • Firstly, the number of divisions of the square grid in figure 12, and hence the descriptor size, has a major impact on the matching speed.
  • Secondly, the authors consider the extended descriptor as described above.
  • Overall, the effect of the extended version is minimal.
  • Here, the authors only show a comparison with two other prominent description schemes (SIFT [24] and GLOH [30]), again averaged over the test sequences. SURF-64 turns out to perform best.

5.2. Application to 3D

  • The authors evaluate the accuracy of their Fast-Hessian detector for the application of camera selfcalibration and 3D reconstruction.
  • The first evaluation compares different state-of-the-art interest point detectors for the two-view case.
  • The second evaluation considers the N -view case for camera self-calibration and dense 3D reconstruction from multiple images, some taken under wide-baseline conditions.

5.2.1. 2-view Case

  • In order to evaluate the performance of different interest point detection schemes for camera calibration and 3D reconstruction, the authors created a controlled environment.
  • A good scene for such an evaluation is two highly textured planes forming a right angle (measured 88.6° in their case); see figure 20.
  • Principal point and aspect ratio are known.
  • As the number of correct matches is an important factor for the accuracy, the authors adjusted the interest point detectors' parameters so that after matching, they are left with 800 correct matches (matches not belonging to the angle are filtered).
  • Table 2 shows these quantitative results for their two versions of the Fast-Hessian detector (FH-9 and FH-15), the DoG features of SIFT [24], and the Hessian- and Harris-Laplace detectors proposed by Mikolajczyk and Schmid [29].

5.2.2. N-view Case

  • The SURF detection and description algorithms have been integrated with the Epoch 3D Webservice of the VISICS research group at the K.U. Leuven.
  • There, the calibration of the cameras and dense depth maps are computed automatically using these images only [40] .
  • The previous procedure using Harris corners and normalised cross correlation of image windows has problems matching such wide-baseline images.
  • Furthermore, the DoG detector combined with the SIFT descriptor failed on some image sequences where SURF succeeded in calibrating all the cameras accurately.
  • The vase is easily recognisable even in the sparse 3D model.

5.3. Application to Object Recognition

  • Bay et al. [3] already demonstrated the usefulness of SURF in a simple object detection task.
  • The basis for this was a publicly available implementation of two bag-of-words classifiers [10].
  • While this is a rather simple test set for object recognition in general, it definitely serves the purpose of comparing the performance of the actual descriptors.
  • As can be seen, the upright counterparts for both SURF-128 and SURF-64 perform best.
  • These positive results indicate that SURF should be very well suited for tasks in object detection, object recognition or image retrieval.

6. Conclusion and Outlook

  • The authors presented a fast and performant scale- and rotation-invariant interest point detector and descriptor.
  • The important speed gain is due to the use of integral images, which drastically reduce the number of operations for simple box convolutions, independent of the chosen scale.
  • The high repeatability is advantageous for camera self-calibration, where an accurate interest point detection has a direct impact on the accuracy of the camera self-calibration and therefore on the quality of the resulting 3D model.
  • The simplicity and again the use of integral images make their descriptor competitive in terms of speed.
  • The latest version of SURF is available for public download.

Citations
Book ChapterDOI
05 Sep 2010
TL;DR: This work proposes to use binary strings as an efficient feature point descriptor, which is called BRIEF, and shows that it is highly discriminative even when using relatively few bits and can be computed using simple intensity difference tests.
Abstract: We propose to use binary strings as an efficient feature point descriptor, which we call BRIEF. We show that it is highly discriminative even when using relatively few bits and can be computed using simple intensity difference tests. Furthermore, the descriptor similarity can be evaluated using the Hamming distance, which is very efficient to compute, instead of the L2 norm as is usually done. As a result, BRIEF is very fast both to build and to match. We compare it against SURF and U-SURF on standard benchmarks and show that it yields a similar or better recognition performance, while running in a fraction of the time required by either.
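The Hamming-distance comparison the abstract refers to reduces to an XOR and a popcount, e.g.:

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two binary descriptors packed as
    integers: XOR the bit strings, then count the set bits."""
    return bin(a ^ b).count("1")
```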

3,558 citations

Proceedings ArticleDOI
06 Nov 2011
TL;DR: A comprehensive evaluation on benchmark datasets reveals BRISK's adaptive, high quality performance as in state-of-the-art algorithms, albeit at a dramatically lower computational cost (an order of magnitude faster than SURF in cases).
Abstract: Effective and efficient generation of keypoints from an image is a well-studied problem in the literature and forms the basis of numerous Computer Vision applications. Established leaders in the field are the SIFT and SURF algorithms which exhibit great performance under a variety of image transformations, with SURF in particular considered as the most computationally efficient amongst the high-performance methods to date. In this paper we propose BRISK1, a novel method for keypoint detection, description and matching. A comprehensive evaluation on benchmark datasets reveals BRISK's adaptive, high quality performance as in state-of-the-art algorithms, albeit at a dramatically lower computational cost (an order of magnitude faster than SURF in cases). The key to speed lies in the application of a novel scale-space FAST-based detector in combination with the assembly of a bit-string descriptor from intensity comparisons retrieved by dedicated sampling of each keypoint neighborhood.

3,292 citations

Journal ArticleDOI
TL;DR: This survey paper formally defines transfer learning, presents information on current solutions, and reviews applications applied toTransfer learning, which can be applied to big data environments.
Abstract: Machine learning and data mining techniques have been used in numerous real-world applications. An assumption of traditional machine learning methodologies is the training data and testing data are taken from the same domain, such that the input feature space and data distribution characteristics are the same. However, in some real-world machine learning scenarios, this assumption does not hold. There are cases where training data is expensive or difficult to collect. Therefore, there is a need to create high-performance learners trained with more easily obtained data from different domains. This methodology is referred to as transfer learning. This survey paper formally defines transfer learning, presents information on current solutions, and reviews applications applied to transfer learning. Lastly, there is information listed on software downloads for various transfer learning solutions and a discussion of possible future research work. The transfer learning solutions surveyed are independent of data size and can be applied to big data environments.

2,900 citations

Journal ArticleDOI
TL;DR: A detailed overview of current advances in vision-based human action recognition is provided, including a discussion of limitations of the state of the art and outline promising directions of research.

2,282 citations

Proceedings ArticleDOI
20 Jun 2009
TL;DR: The experiments show that by using an attribute layer it is indeed possible to build a learning object detection system that does not require any training images of the target classes, and assembled a new large-scale dataset, “Animals with Attributes”, of over 30,000 animal images that match the 50 classes in Osherson's classic table of how strongly humans associate 85 semantic attributes with animal classes.
Abstract: We study the problem of object classification when training and test classes are disjoint, i.e. no training examples of the target classes are available. This setup has hardly been studied in computer vision research, but it is the rule rather than the exception, because the world contains tens of thousands of different object classes and for only a very few of them image, collections have been formed and annotated with suitable class labels. In this paper, we tackle the problem by introducing attribute-based classification. It performs object detection based on a human-specified high-level description of the target objects instead of training images. The description consists of arbitrary semantic attributes, like shape, color or even geographic information. Because such properties transcend the specific learning task at hand, they can be pre-learned, e.g. from image datasets unrelated to the current task. Afterwards, new classes can be detected based on their attribute representation, without the need for a new training phase. In order to evaluate our method and to facilitate research in this area, we have assembled a new large-scale dataset, “Animals with Attributes”, of over 30,000 animal images that match the 50 classes in Osherson's classic table of how strongly humans associate 85 semantic attributes with animal classes. Our experiments show that by using an attribute layer it is indeed possible to build a learning object detection system that does not require any training images of the target classes.

2,228 citations


Cites methods from "Speeded-Up Robust Features (SURF)"

  • ...We have selected six different feature types: RGB color histograms, SIFT [21], rgSIFT [35], PHOG [4], SURF [2] and local self-similarity histograms [30]....

References
Journal ArticleDOI
TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.

46,906 citations


"Speeded-Up Robust Features (SURF)" refers background or methods in this paper

  • ...Lowe [24] subtracts these pyramid layers in order to get the DoG (Difference of Gaussians) images where edges and blobs can be found....

  • ...Here, we only show a comparison with two other prominent description schemes (SIFT [24] and GLOH [30]), again averaged over the test sequences (Fig....

  • ...bers over multiple images (we chose one pair from each set of test images), the ratio-matching scheme [24] is used....

  • ...the gradient information extracted by SIFT [24] and its variants....

  • ...[21,24,27,39,25])....

Proceedings ArticleDOI
01 Dec 2001
TL;DR: A machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates and the introduction of a new image representation called the "integral image" which allows the features used by the detector to be computed very quickly.
Abstract: This paper describes a machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates. This work is distinguished by three key contributions. The first is the introduction of a new image representation called the "integral image" which allows the features used by our detector to be computed very quickly. The second is a learning algorithm, based on AdaBoost, which selects a small number of critical visual features from a larger set and yields extremely efficient classifiers. The third contribution is a method for combining increasingly more complex classifiers in a "cascade" which allows background regions of the image to be quickly discarded while spending more computation on promising object-like regions. The cascade can be viewed as an object specific focus-of-attention mechanism which unlike previous approaches provides statistical guarantees that discarded regions are unlikely to contain the object of interest. In the domain of face detection the system yields detection rates comparable to the best previous systems. Used in real-time applications, the detector runs at 15 frames per second without resorting to image differencing or skin color detection.

18,620 citations


"Speeded-Up Robust Features (SURF)" refers methods in this paper

  • ...This lends itself to the use of integral images as made popular by Viola and Jones [41], which reduces the computation time drastically....

Proceedings ArticleDOI
20 Sep 1999
TL;DR: Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.
Abstract: An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low residual least squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.

16,989 citations


"Speeded-Up Robust Features (SURF)" refers methods in this paper

  • ...The DoG detector was kindly provided by David Lowe....

  • ...Lowe [24] subtracts these pyramid layers in order to get the DoG (Difference of Gaussians) images where edges and blobs can be found....

  • ...The latter, introduced by Lowe [24], have been shown to outperform the others [28]....

  • ...Methods include the best-binfirst proposed by Lowe [24], balltrees [35], vocabulary trees [34], locality sensitive hashing [9], or redundant bit vectors [13]....

  • ...Focusing on speed, Lowe [23] proposed to approximate the Laplacian of Gaussians (LoG) by a Difference of Gaussians (DoG) filter....

Proceedings ArticleDOI
01 Jan 1988
TL;DR: The problem the authors are addressing in Alvey Project MMI149 is that of using computer vision to understand the unconstrained 3D world, in which the viewed scenes will in general contain too wide a diversity of objects for topdown recognition techniques to work.
Abstract: The problem we are addressing in Alvey Project MMI149 is that of using computer vision to understand the unconstrained 3D world, in which the viewed scenes will in general contain too wide a diversity of objects for topdown recognition techniques to work. For example, we desire to obtain an understanding of natural scenes, containing roads, buildings, trees, bushes, etc., as typified by the two frames from a sequence illustrated in Figure 1. The solution to this problem that we are pursuing is to use a computer vision system based upon motion analysis of a monocular image sequence from a mobile camera. By extraction and tracking of image features, representations of the 3D analogues of these features can be constructed.

13,993 citations


"Speeded-Up Robust Features (SURF)" refers methods in this paper

  • ...The most widely used detector is probably the Harris corner detector [15], proposed back in 1988....

Book ChapterDOI
07 May 2006
TL;DR: A novel scale- and rotation-invariant interest point detector and descriptor, coined SURF (Speeded Up Robust Features), which approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster.
Abstract: In this paper, we present a novel scale- and rotation-invariant interest point detector and descriptor, coined SURF (Speeded Up Robust Features). It approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster. This is achieved by relying on integral images for image convolutions; by building on the strengths of the leading existing detectors and descriptors (in casu, using a Hessian matrix-based measure for the detector, and a distribution-based descriptor); and by simplifying these methods to the essential. This leads to a combination of novel detection, description, and matching steps. The paper presents experimental results on a standard evaluation set, as well as on imagery obtained in the context of a real-life object recognition application. Both show SURF's strong performance.

13,011 citations