
BRIEF: binary robust independent elementary features

TL;DR: This work proposes to use binary strings as an efficient feature point descriptor, which is called BRIEF, and shows that it is highly discriminative even when using relatively few bits and can be computed using simple intensity difference tests.
Abstract: We propose to use binary strings as an efficient feature point descriptor, which we call BRIEF. We show that it is highly discriminative even when using relatively few bits and can be computed using simple intensity difference tests. Furthermore, the descriptor similarity can be evaluated using the Hamming distance, which is very efficient to compute, instead of the L2 norm as is usually done. As a result, BRIEF is very fast both to build and to match. We compare it against SURF and U-SURF on standard benchmarks and show that it yields a similar or better recognition performance, while running in a fraction of the time required by either.

Summary (3 min read)

1 Introduction

  • Feature point descriptors are now at the core of many Computer Vision technologies, such as object recognition, 3D reconstruction, image retrieval, and camera localization.
  • Hash functions can reduce descriptors such as SIFT to binary strings whose similarity can be measured by the Hamming distance.
  • Furthermore, strings can be compared by computing the Hamming distance, which is extremely fast on modern CPUs that often provide a specific instruction to perform an XOR or bit-count operation, as is the case in the latest SSE [10] instruction set (see the sketch after this list).
  • This means that BRIEF easily outperforms other fast descriptors such as SURF and U-SURF in terms of speed, as will be shown in the Results section.
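To make the XOR-and-bit-count idea concrete, here is a minimal Python sketch of Hamming-distance computation between packed binary descriptors. It is illustrative only, not the authors' code; a native implementation would use a dedicated instruction such as POPCNT.

```python
import numpy as np

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Hamming distance between two packed binary descriptors (uint8 arrays).

    XOR sets exactly the bits where the descriptors differ; summing the
    unpacked bits is the bit count that a hardware popcount instruction
    performs in a single operation.
    """
    return int(np.unpackbits(a ^ b).sum())

# Two random 256-bit descriptors, stored as 32 bytes each (BRIEF-32).
rng = np.random.default_rng(0)
d1 = rng.integers(0, 256, size=32, dtype=np.uint8)
d2 = rng.integers(0, 256, size=32, dtype=np.uint8)
print(hamming(d1, d2))  # an integer between 0 and 256
```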

3 Method

  • The authors’ approach is inspired by earlier work [9, 15] that showed that image patches could be effectively classified on the basis of a relatively small number of pairwise intensity comparisons.
  • The results of these tests were used to train either randomized classification trees [15] or a Naive Bayesian classifier [9] to recognize patches seen from different viewpoints.
  • When creating such descriptors, the only choices that have to be made are those of the kernels used to smooth the patches before intensity differencing and the spatial arrangement of the (x,y)-pairs.
  • In short, for both images of a pair and for a given number of corresponding keypoints between them, it quantifies how often the correct match can be established using BRIEF for description and the Hamming distance as the metric for matching.
  • This rate can be computed reliably because the scene is planar and the homography between images is known.
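A hedged sketch of this recognition-rate metric, assuming keypoints as (N, 2) pixel arrays, descriptors as packed uint8 rows, and a known 3×3 homography H; the function name and the 2-pixel tolerance are our illustrative choices, not values from the paper:

```python
import numpy as np

def recognition_rate(kps1, desc1, kps2, desc2, H, tol=2.0):
    """Fraction of image-1 keypoints whose Hamming nearest neighbour in
    image 2 lands within `tol` pixels of the homography-projected location."""
    # Project image-1 keypoints into image 2 with the known homography.
    pts = np.hstack([kps1, np.ones((len(kps1), 1))]) @ H.T
    pts = pts[:, :2] / pts[:, 2:3]
    correct = 0
    for p, d in zip(pts, desc1):
        # Hamming distances from descriptor d to every image-2 descriptor.
        dists = np.unpackbits(desc2 ^ d, axis=1).sum(axis=1)
        if np.linalg.norm(kps2[np.argmin(dists)] - p) <= tol:
            correct += 1
    return correct / len(kps1)
```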

3.1 Smoothing Kernels

  • By construction, the tests of Eq. 1 take only the information at single pixels into account and are therefore very noise-sensitive.
  • It is for the same reason that images need to be smoothed before they can be meaningfully differentiated when looking for edges.
  • This analogy applies because their intensity difference tests can be thought of as evaluating the sign of the derivatives within a patch.
  • The more difficult the matching, the more important smoothing becomes to achieving good performance.
  • For the corresponding discrete kernel window the authors found a size of 9×9 pixels to be necessary and sufficient.

3.2 Spatial Arrangement of the Binary Tests

  • The authors experimented with the five sampling geometries depicted by Fig. 2.
  • The (xi,yi) locations are evenly distributed over the patch and tests can lie close to the patch border.
  • The first location xi is sampled from a Gaussian centered around the origin while the second location is sampled from another Gaussian centered on xi.
  • Test locations outside the patch are clamped to the edge of the patch.
  • The isotropic Gaussian sampling (geometry II) performs best; for this reason, in all further experiments presented in this paper, it is the one the authors use.

3.3 Distance Distributions

  • The authors take a closer look at the distribution of Hamming distances between their descriptors.
  • To this end the authors extract about 4000 matching points from the five image pairs of the Wall sequence.
  • Since the maximum possible Hamming distance is 32 · 8 = 256 bits, the distribution of distances for non-matching points is, unsurprisingly, roughly Gaussian and centered around 128 (the short simulation after this list illustrates why).
  • Since establishing a match can be understood as classifying pairs of points as being a match or not, a classifier that relies on these Hamming distances will work best when their distributions are most separated.
  • As the authors show in section 4, this is indeed what happens: recognition rates are higher in the first pairs of the Wall sequence than in the subsequent ones.
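The "centered around 128" figure follows from independence: two unrelated bits agree half the time, so random 256-bit strings differ in about half of their 32 · 8 = 256 bits. A short self-contained simulation (ours, not data from the paper) illustrates this:

```python
import numpy as np

rng = np.random.default_rng(0)
# 4000 pairs of unrelated 256-bit (32-byte) descriptors.
a = rng.integers(0, 256, size=(4000, 32), dtype=np.uint8)
b = rng.integers(0, 256, size=(4000, 32), dtype=np.uint8)
d = np.unpackbits(a ^ b, axis=1).sum(axis=1)
print(d.mean(), d.std())  # mean near 128; an approximately Gaussian spread
```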

4 Results

  • The authors compare their method against several competing approaches.
  • For evaluation purposes, the authors rely on two straightforward metrics, elapsed CPU time and recognition rate.
  • Since the authors apply the same procedure to all descriptors, not only their own, the relative rankings they obtain are still valid and speak in BRIEF’s favor.
  • This explains in part why both BRIEF and U-SURF outperform SURF.
  • O-BRIEF-32 is not meant to represent a practical approach but to demonstrate that the response to in-plane rotations is more a function of the quality of the orientation estimator than of the descriptor itself, as evidenced by the fact that O-BRIEF-32 and SURF are almost perfectly superposed.

5 Conclusion

  • Construction and matching for this descriptor are not only much faster than for other state-of-the-art ones; the descriptor also tends to yield higher recognition rates, as long as invariance to large in-plane rotations is not a requirement.
  • The BRIEF code being very simple, the authors will be happy to make it publicly available.
  • It is also important from a more theoretical viewpoint because it confirms the validity of the recent trend [18, 12] that involves moving from the Euclidean to the Hamming distance for matching purposes.
  • Given fast orientation estimators, there is no theoretical reason why orientation invariance could not be added without a significant speed penalty.


BRIEF: Binary Robust Independent
Elementary Features
Michael Calonder, Vincent Lepetit, Christoph Strecha, and Pascal Fua
CVLab, EPFL, Lausanne, Switzerland
e-mail: firstname.lastname@epfl.ch
Abstract. We propose to use binary strings as an efficient feature point descriptor, which we call BRIEF. We show that it is highly discriminative even when using relatively few bits and can be computed using simple intensity difference tests. Furthermore, the descriptor similarity can be evaluated using the Hamming distance, which is very efficient to compute, instead of the L2 norm as is usually done. As a result, BRIEF is very fast both to build and to match. We compare it against SURF and U-SURF on standard benchmarks and show that it yields a similar or better recognition performance, while running in a fraction of the time required by either.
1 Introduction
Feature point descriptors are now at the core of many Computer Vision technologies, such as object recognition, 3D reconstruction, image retrieval, and camera localization. Since applications of these technologies have to handle ever more data or to run on mobile devices with limited computational resources, there is a growing need for local descriptors that are fast to compute, fast to match, and memory efficient.
One way to speed up matching and reduce memory consumption is to work with short descriptors. They can be obtained by applying dimensionality reduction, such as PCA [1] or LDA [2], to an original descriptor such as SIFT [3] or SURF [4]. For example, it was shown in [5–7] that floating point values of the descriptor vector could be quantized using very few bits per value without loss of recognition performance. An even more drastic dimensionality reduction can be achieved by using hash functions that reduce SIFT descriptors to binary strings, as done in [8]. These strings represent binary descriptors whose similarity can be measured by the Hamming distance.
While effective, these approaches to dimensionality reduction require first computing the full descriptor before further processing can take place. In this paper, we show that this whole computation can be shortcut by directly computing binary strings from image patches. The individual bits are obtained by comparing the intensities of pairs of points along the same lines as in [9] but without requiring a training phase. We refer to the resulting descriptor as BRIEF.
⋆ This work has been supported in part by the Swiss National Science Foundation.

Our experiments show that only 256 bits, or even 128 bits, often suffice
to obtain very good matching results. BRIEF is therefore very efficient both to
compute and to store in memory. Furthermore, comparing strings can be done by
computing the Hamming distance, which can be done extremely fast on modern
CPUs that often provide a specific instruction to perform an XOR or bit count
operation, as is the case in the latest SSE [10] instruction set.
This means that BRIEF easily outperforms other fast descriptors such as
SURF and U-SURF in terms of speed, as will be shown in the Results section.
Furthermore, it also outperforms them in terms of recognition rate in many
cases, as we will demonstrate using benchmark datasets.
2 Related Work
The SIFT descriptor [3] is highly discriminant but, being a 128-vector, is relatively slow to compute and match. This can be a drawback for real-time applications such as SLAM that keep track of many points as well as for algorithms that require storing very large numbers of descriptors, for example for large-scale 3D reconstruction.
There are many approaches to solving this problem by developing faster to
compute and match descriptors, while preserving the discriminative power of
SIFT. The SURF descriptor [4] represents one of the best known ones. Like
SIFT, it relies on local gradient histograms but uses integral images to speed up
the computation. Different parameter settings are possible but, since using only
64 dimensions already yields good recognition performances, that version has
become very popular and a de facto standard. This is why we compare ourselves
to it in the Results section.
SURF addresses the issue of speed but, since the descriptor is a 64-vector of floating-point values, representing it still requires 256 bytes. This becomes significant when millions of descriptors must be stored. There are three main classes of approaches to reducing this number.

The first involves dimensionality reduction techniques such as Principal Component Analysis (PCA) or Linear Discriminant Embedding (LDE). PCA is very easy to perform and can reduce descriptor size at no loss in recognition performance [1]. By contrast, LDE requires labeled training data, in the form of descriptors that should be matched together, which is more difficult to obtain. It can improve performance [2] but can also overfit and degrade performance.
A second way to shorten a descriptor is to quantize its floating-point coordinates into integers coded on fewer bits. In [5], it is shown that the SIFT descriptor can be quantized using only 4 bits per coordinate. Quantization is used for the same purpose in [6, 7]. It is a simple operation that results not only in memory gain but also in faster matching as computing the distance between short vectors can then be done very efficiently on modern CPUs. In [6], it is shown that for some parameter settings of the DAISY descriptor, PCA and quantization can be combined to reduce its size to 60 bits. However, in this approach the Hamming distance cannot be used for matching because the bits are, in contrast to BRIEF, arranged in blocks of four and hence cannot be processed independently.
A third and more radical way to shorten a descriptor is to binarize it. For example, [8] drew its inspiration from Locality Sensitive Hashing (LSH) [11] to turn floating-point vectors into binary strings. This is done by thresholding the vectors after multiplication with an appropriate matrix. Similarity between descriptors is then measured by the Hamming distance between the corresponding binary strings. This is very fast because the Hamming distance can be computed very efficiently with a bitwise XOR operation followed by a bit count. The same algorithm was applied to the GIST descriptor to obtain a binary description of an entire image [12]. Another way to binarize the GIST descriptor is to use nonlinear Neighborhood Component Analysis [12, 13], which seems more powerful but probably slower at run-time.
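As an illustration of the binarization scheme just described, the following is a minimal random-hyperplane sketch in the spirit of LSH; the Gaussian projection matrix and the bit count are illustrative assumptions, not the exact construction of [8]:

```python
import numpy as np

def lsh_binarize(descriptors: np.ndarray, n_bits: int = 128,
                 seed: int = 0) -> np.ndarray:
    """Turn (N, D) float descriptors (e.g. SIFT vectors) into packed bits.

    Multiply by a random projection matrix, threshold at zero, and pack:
    similar vectors tend to fall on the same side of most hyperplanes, so
    Hamming distance on the output approximates similarity on the input.
    """
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((descriptors.shape[1], n_bits))
    bits = (descriptors @ P > 0).astype(np.uint8)
    return np.packbits(bits, axis=1)  # (N, n_bits/8) Hamming-ready strings
```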
While all three classes of shortening techniques provide satisfactory results,
relying on them remains inefficient in the sense that first computing a long
descriptor then shortening it involves a substantial amount of time-consuming
computation. By contrast, the approach we advocate in this paper directly builds
short descriptors by comparing the intensities of pairs of points without ever
creating a long one. Such intensity comparisons were used in [9] for classification
purposes and were shown to be very powerful in spite of their extreme simplicity.
Nevertheless, the present approach is very different from [9] and [14] because it
does not involve any form of online or offline training.
3 Method
Our approach is inspired by earlier work [9, 15] that showed that image patches
could be effectively classified on the basis of a relatively small number of pair-
wise intensity comparisons. The results of these tests were used to train either
randomized classification trees [15] or a Naive Bayesian classifier [9] to recognize
patches seen fr om different viewpoints. Here, we do away with both the classifier
and the trees, and simply create a bit vector out of the test responses, which we
compute after having smoothed the image patch.
More specifically, we define test τ on patch p of size S × S as

\[
\tau(\mathbf{p};\, \mathbf{x}, \mathbf{y}) :=
\begin{cases}
1 & \text{if } \mathbf{p}(\mathbf{x}) < \mathbf{p}(\mathbf{y}), \\
0 & \text{otherwise,}
\end{cases}
\tag{1}
\]

where p(x) is the pixel intensity in a smoothed version of p at x = (u, v)ᵀ.
Choosing a set of n_d (x, y)-location pairs uniquely defines a set of binary tests. We take our BRIEF descriptor to be the n_d-dimensional bitstring

\[
f_{n_d}(\mathbf{p}) := \sum_{1 \le i \le n_d} 2^{\,i-1}\, \tau(\mathbf{p};\, \mathbf{x}_i, \mathbf{y}_i). \tag{2}
\]

In this paper we consider n_d = 128, 256, and 512 and will show in the Results section that these yield good compromises between speed, storage efficiency, and recognition rate. In the remainder of the paper, we will refer to BRIEF descriptors as BRIEF-k, where k = n_d/8 represents the number of bytes required to store the descriptor.
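Eqs. 1 and 2 translate almost directly into code. Below is a minimal, unoptimized Python sketch; the use of SciPy's gaussian_filter for pre-smoothing and all names are our assumptions, not the authors' implementation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def brief(image: np.ndarray, keypoint: tuple, pairs: np.ndarray,
          sigma: float = 2.0) -> np.ndarray:
    """Minimal BRIEF sketch following Eqs. 1-2.

    `pairs` is an (n_d, 4) integer array of (u1, v1, u2, v2) test-location
    offsets from the keypoint, drawn once and reused for all keypoints.
    Returns n_d/8 packed bytes, e.g. 32 bytes (BRIEF-32) for n_d = 256.
    """
    p = gaussian_filter(image.astype(np.float32), sigma)  # smoothed intensities
    cx, cy = keypoint
    bits = np.empty(len(pairs), dtype=np.uint8)
    for i, (u1, v1, u2, v2) in enumerate(pairs):
        # tau(p; x, y) = 1 if p(x) < p(y), 0 otherwise   (Eq. 1)
        bits[i] = p[cy + v1, cx + u1] < p[cy + v2, cx + u2]
    return np.packbits(bits)  # the bitstring of Eq. 2, one bit per test
```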
When creating such descriptors, the only choices that have to be made are those of the kernels used to smooth the patches before intensity differencing and the spatial arrangement of the (x, y)-pairs. We discuss these in the remainder of this section.
To this end, we use the Wall dataset that we will describe in more detail in section 4. It contains five image pairs, with the first image being the same in all pairs and the second image shot from a monotonically growing baseline, which makes matching increasingly difficult. To compare the pertinence of the various potential choices, we use as a quality measure the recognition rate in image pairs that will be precisely defined at the beginning of section 4. In short, for both images of a pair and for a given number of corresponding keypoints between them, it quantifies how often the correct match can be established using BRIEF for description and the Hamming distance as the metric for matching. This rate can be computed reliably because the scene is planar and the homography between images is known. It can therefore be used to check whether points truly correspond to each other or not.
3.1 Smoothing Kernels
By construction, the tests of Eq. 1 take only the information at single pixels into
account and are therefore very noise-sensitive. By pre-smoothing the patch, this
sensitivity can be reduced, thus increasing the stability and repeatability of the
descriptors. It is for the same reason that images need to be smoothed before
they can be meaningfully differentiated when looking for edges. This analogy
applies because our intensity difference tests can be thought of as evaluating the
sign of the derivatives within a patch.
Fig. 1 illustrates the effects of increasing amounts of Gaussian smoothing on the recognition rates for variances of the Gaussian kernel ranging from 0 to 3. The more difficult the matching, the more important smoothing becomes to achieving good performance. Furthermore, the recognition rates remain relatively constant in the 1 to 3 range and, in practice, we use a value of 2. For the corresponding discrete kernel window we found a size of 9×9 pixels to be necessary and sufficient.
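In code, this pre-smoothing is a single call. A small sketch with scipy.ndimage (our choice of library); the truncate value is set so that σ = 2 yields the 9×9 discrete window reported above:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(48, 48)).astype(np.float32)  # stand-in patch

# sigma = 2 with truncate = 2.0 gives a kernel radius of
# int(2.0 * 2.0 + 0.5) = 4, i.e. a 9x9 discrete window.
smoothed = gaussian_filter(patch, sigma=2.0, truncate=2.0)
```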
3.2 Spatial Arrangement of the Binary Tests
Generating a length-n_d bit vector leaves many options for selecting the n_d test locations (x_i, y_i) of Eq. 1 in a patch of size S × S. We experimented with the five sampling geometries depicted by Fig. 2. Assuming the origin of the patch coordinate system to be located at the patch center, they can be described as follows.

I) (X, Y) i.i.d. Uniform(−S/2, S/2): The (x_i, y_i) locations are evenly distributed over the patch and tests can lie close to the patch border.

[Figure 1: grouped bar chart of recognition rates for the Wall stereo pairs 1|2 through 1|6, at smoothing levels from no smoothing up to σ = 3.05.]

Fig. 1. Each group of 10 bars represents the recognition rates in one specific stereo pair for increasing levels of Gaussian smoothing. Especially for the hard-to-match pairs, which are those on the right side of the plot, smoothing is essential in slowing down the rate at which the recognition rate decreases.
Fig. 2. Different approaches to choosing the test locations. All except the rightmost one are selected by random sampling. 128 tests are shown in every image.
II) (X, Y) i.i.d. Gaussian(0, S²/25): The tests are sampled from an isotropic Gaussian distribution. Experimentally we found S/2 = (5/2)σ, i.e. σ² = S²/25, to give best results in terms of recognition rate.

III) X i.i.d. Gaussian(0, S²/25), Y i.i.d. Gaussian(x_i, S²/100): The sampling involves two steps. The first location x_i is sampled from a Gaussian centered around the origin while the second location is sampled from another Gaussian centered on x_i. This forces the tests to be more local. Test locations outside the patch are clamped to the edge of the patch. Again, experimentally we found S/4 = (5/2)σ, i.e. σ² = S²/100, for the second Gaussian to perform best.

IV) The (x_i, y_i) are randomly sampled from discrete locations of a coarse polar grid introducing a spatial quantization.
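Geometries I–III are easy to reproduce. The sketch below (our naming and structure) draws n_d test-location pairs for an S × S patch centered at the origin and applies the edge clamping described for geometry III:

```python
import numpy as np

def sample_pairs(n_d: int, S: float, geometry: str = "II", seed: int = 0):
    """Draw n_d (x_i, y_i) test-location pairs inside an S x S patch."""
    rng = np.random.default_rng(seed)
    if geometry == "I":      # (X, Y) i.i.d. Uniform(-S/2, S/2)
        x = rng.uniform(-S / 2, S / 2, size=(n_d, 2))
        y = rng.uniform(-S / 2, S / 2, size=(n_d, 2))
    elif geometry == "II":   # (X, Y) i.i.d. Gaussian(0, S^2/25)
        x = rng.normal(0.0, S / 5, size=(n_d, 2))
        y = rng.normal(0.0, S / 5, size=(n_d, 2))
    elif geometry == "III":  # X ~ Gaussian(0, S^2/25), Y ~ Gaussian(x_i, S^2/100)
        x = rng.normal(0.0, S / 5, size=(n_d, 2))
        y = rng.normal(x, S / 10)  # second location centered on the first
    else:
        raise ValueError("geometry must be 'I', 'II', or 'III'")
    # Clamp test locations that fall outside the patch to its edge.
    return np.clip(x, -S / 2, S / 2), np.clip(y, -S / 2, S / 2)
```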

Citations
Proceedings ArticleDOI
06 Nov 2011
TL;DR: This paper proposes a very fast binary descriptor based on BRIEF, called ORB, which is rotation invariant and resistant to noise, and demonstrates through experiments how ORB is at two orders of magnitude faster than SIFT, while performing as well in many situations.
Abstract: Feature matching is at the base of many computer vision problems, such as object recognition or structure from motion. Current methods rely on costly descriptors for detection and matching. In this paper, we propose a very fast binary descriptor based on BRIEF, called ORB, which is rotation invariant and resistant to noise. We demonstrate through experiments how ORB is at two orders of magnitude faster than SIFT, while performing as well in many situations. The efficiency is tested on several real-world applications, including object detection and patch-tracking on a smart phone.

8,702 citations


Cites background or methods from "BRIEF: binary robust independent el..."

  • ...Many different types of distributions of tests were considered in [6]; here we use one of the best performers, a Gaussian distribution around the center of the patch....

    [...]

  • ...Descriptors BRIEF [6] is a recent feature descriptor that uses simple binary tests between pixels in a smoothed image patch....

    [...]

Journal ArticleDOI
TL;DR: A survival of the fittest strategy that selects the points and keyframes of the reconstruction leads to excellent robustness and generates a compact and trackable map that only grows if the scene content changes, allowing lifelong operation.
Abstract: This paper presents ORB-SLAM, a feature-based monocular SLAM system that operates in real time, in small and large, indoor and outdoor environments. The system is robust to severe motion clutter, allows wide baseline loop closing and relocalization, and includes full automatic initialization. Building on excellent algorithms of recent years, we designed from scratch a novel system that uses the same features for all SLAM tasks: tracking, mapping, relocalization, and loop closing. A survival of the fittest strategy that selects the points and keyframes of the reconstruction leads to excellent robustness and generates a compact and trackable map that only grows if the scene content changes, allowing lifelong operation. We present an exhaustive evaluation in 27 sequences from the most popular datasets. ORB-SLAM achieves unprecedented performance with respect to other state-of-the-art monocular SLAM approaches. For the benefit of the community, we make the source code public.

3,807 citations


Cites background or methods from "BRIEF: binary robust independent el..."

  • ...We then discuss map initialization approaches for Monocular SLAM and end with a review of Monocular SLAM systems....

    [...]

  • ...Strasdat et al. [28] demonstrated that keyframe-based techniques are more accurate than filtering for the same computational cost....

    [...]

Proceedings ArticleDOI
06 Nov 2011
TL;DR: A comprehensive evaluation on benchmark datasets reveals BRISK's adaptive, high quality performance as in state-of-the-art algorithms, albeit at a dramatically lower computational cost (an order of magnitude faster than SURF in cases).
Abstract: Effective and efficient generation of keypoints from an image is a well-studied problem in the literature and forms the basis of numerous Computer Vision applications. Established leaders in the field are the SIFT and SURF algorithms which exhibit great performance under a variety of image transformations, with SURF in particular considered as the most computationally efficient amongst the high-performance methods to date. In this paper we propose BRISK1, a novel method for keypoint detection, description and matching. A comprehensive evaluation on benchmark datasets reveals BRISK's adaptive, high quality performance as in state-of-the-art algorithms, albeit at a dramatically lower computational cost (an order of magnitude faster than SURF in cases). The key to speed lies in the application of a novel scale-space FAST-based detector in combination with the assembly of a bit-string descriptor from intensity comparisons retrieved by dedicated sampling of each keypoint neighborhood.

3,292 citations


Cites background or methods from "BRIEF: binary robust independent el..."

  • ...However, despite the clear advantage in speed, the latter approach suffers in terms of reliability and robustness as it has minimal tolerance to image distortions and transformations, in particular to in-plane rotation and scale change....

    [...]

  • ...Their AGAST is essentially an extension for accelerated performance of the now popular FAST, proven to be a very efficient basis for feature extraction....

    [...]

  • ...For the formation of the rotation- and scale-normalized descriptor, BRISK applies the sampling pattern rotated by α = arctan2 (gy, gx) around the keypoint k....

    [...]

  • ...The key to speed lies in the application of a novel scale-space FAST-based detector in combination with the assembly of a bit-string descriptor from intensity comparisons retrieved by dedicated sampling of each keypoint neighborhood....

    [...]

Journal ArticleDOI
TL;DR: A novel tracking framework (TLD) that explicitly decomposes the long-term tracking task into tracking, learning, and detection, and develops a novel learning method (P-N learning) which estimates the errors by a pair of “experts”: P-expert estimates missed detections, and N-expert estimates false alarms.
Abstract: This paper investigates long-term tracking of unknown objects in a video stream. The object is defined by its location and extent in a single frame. In every frame that follows, the task is to determine the object's location and extent or indicate that the object is not present. We propose a novel tracking framework (TLD) that explicitly decomposes the long-term tracking task into tracking, learning, and detection. The tracker follows the object from frame to frame. The detector localizes all appearances that have been observed so far and corrects the tracker if necessary. The learning estimates the detector's errors and updates it to avoid these errors in the future. We study how to identify the detector's errors and learn from them. We develop a novel learning method (P-N learning) which estimates the errors by a pair of “experts”: (1) P-expert estimates missed detections, and (2) N-expert estimates false alarms. The learning process is modeled as a discrete dynamical system and the conditions under which the learning guarantees improvement are found. We describe our real-time implementation of the TLD framework and the P-N learning. We carry out an extensive quantitative evaluation which shows a significant improvement over state-of-the-art approaches.

3,137 citations


Cites background or result from "BRIEF: binary robust independent el..."

  • ...Similarly as in [60], [61], and [62], the pixel comparisons are generated offline at random and stay fixed in runtime....

    [...]

  • ...This is in contrast to standard approaches [60], [61], [62], where every pixel comparison is generated independent of other pixel comparisons....

    [...]

References
Journal ArticleDOI
TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.

46,906 citations

Book ChapterDOI
07 May 2006
TL;DR: A novel scale- and rotation-invariant interest point detector and descriptor, coined SURF (Speeded Up Robust Features), which approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster.
Abstract: In this paper, we present a novel scale- and rotation-invariant interest point detector and descriptor, coined SURF (Speeded Up Robust Features). It approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster. This is achieved by relying on integral images for image convolutions; by building on the strengths of the leading existing detectors and descriptors (in casu, using a Hessian matrix-based measure for the detector, and a distribution-based descriptor); and by simplifying these methods to the essential. This leads to a combination of novel detection, description, and matching steps. The paper presents experimental results on a standard evaluation set, as well as on imagery obtained in the context of a real-life object recognition application. Both show SURF's strong performance.

13,011 citations

Journal ArticleDOI
TL;DR: A novel scale- and rotation-invariant detector and descriptor, coined SURF (Speeded-Up Robust Features), which approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster.

12,449 citations

Journal ArticleDOI
TL;DR: It is observed that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best and Moments and steerable filters show the best performance among the low dimensional descriptors.
Abstract: In this paper, we compare the performance of descriptors computed for local interest regions, as, for example, extracted by the Harris-Affine detector [Mikolajczyk, K and Schmid, C, 2004]. Many different descriptors have been proposed in the literature. It is unclear which descriptors are more appropriate and how their performance depends on the interest region detector. The descriptors should be distinctive and at the same time robust to changes in viewing conditions as well as to errors of the detector. Our evaluation uses as criterion recall with respect to precision and is carried out for different image transformations. We compare shape context [Belongie, S, et al., April 2002], steerable filters [Freeman, W and Adelson, E, Setp. 1991], PCA-SIFT [Ke, Y and Sukthankar, R, 2004], differential invariants [Koenderink, J and van Doorn, A, 1987], spin images [Lazebnik, S, et al., 2003], SIFT [Lowe, D. G., 1999], complex filters [Schaffalitzky, F and Zisserman, A, 2002], moment invariants [Van Gool, L, et al., 1996], and cross-correlation for different types of interest regions. We also propose an extension of the SIFT descriptor and show that it outperforms the original method. Furthermore, we observe that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best. Moments and steerable filters show the best performance among the low dimensional descriptors.

7,057 citations

Book ChapterDOI
07 May 2006
TL;DR: It is shown that machine learning can be used to derive a feature detector which can fully process live PAL video using less than 7% of the available processing time.
Abstract: Where feature points are used in real-time frame-rate applications, a high-speed feature detector is necessary. Feature detectors such as SIFT (DoG), Harris and SUSAN are good methods which yield high quality features, however they are too computationally intensive for use in real-time applications of any complexity. Here we show that machine learning can be used to derive a feature detector which can fully process live PAL video using less than 7% of the available processing time. By comparison neither the Harris detector (120%) nor the detection stage of SIFT (300%) can operate at full frame rate. Clearly a high-speed detector is of limited use if the features produced are unsuitable for downstream processing. In particular, the same scene viewed from two different positions should yield features which correspond to the same real-world 3D locations [1]. Hence the second contribution of this paper is a comparison of corner detectors based on this criterion applied to 3D scenes. This comparison supports a number of claims made elsewhere concerning existing corner detectors. Further, contrary to our initial expectations, we show that despite being principally constructed for speed, our detector significantly outperforms existing feature detectors according to this criterion.

3,828 citations

Frequently Asked Questions (14)
Q1. What are the contributions mentioned in the paper "BRIEF: binary robust independent elementary features"?

The authors propose to use binary strings as an efficient feature point descriptor, which they call BRIEF. The authors show that it is highly discriminative even when using relatively few bits and can be computed using simple intensity difference tests. The authors compare it against SURF and U-SURF on standard benchmarks and show that it yields a similar or better recognition performance, while running in a fraction of the time required by either. Furthermore, the descriptor similarity can be evaluated using the Hamming distance, which is very efficient to compute, instead of the L2 norm as is usually done. 

In future work, the authors will incorporate orientation and scale invariance into BRIEF so that it can compete with SURF and SIFT in a wider set of situations. 

SURF addresses the issue of speed but, since the descriptor is a 64-vector of floating-point values, representing it still requires 256 bytes.

Since establishing a match can be understood as classifying pairs of points as being a match or not, a classifier that relies on these Hamming distances will work best when their distributions are most separated. 

The maximum possible Hamming distance being 32 · 8 = 256 bits, unsurprisingly, the distribution of distances for non-matching points is roughly Gaussian and centered around 128. 

In other words, on data sets such as those that involve only modest amounts of in-plane rotation, there is a cost not only in terms of speed but also of recognition rate to achieving orientation invariance, as already pointed out in [4]. 

Another way to binarize the GIST descriptor is to use nonlinear Neighborhood Component Analysis [12, 13], which seems more powerful but probably slower at run-time. 

When creating such descriptors, the only choices that have to be made are those of the kernels used to smooth the patches before intensity differencing and the spatial arrangement of the (x,y)-pairs. 

Generating a length nd bit vector leaves many options for selecting the nd test locations (xi,yi) of Eq. 1 in a patch of size S × S. 

Feature point descriptors are now at the core of many Computer Vision technologies, such as object recognition, 3D reconstruction, image retrieval, and camera localization. 

Chief among them is the latest OpenCV implementation of the SURF descriptor [4], which has become a de facto standard for fast-to-compute descriptors. 

To confirm this, the authors detected SURF points in both images of each test pair and computed their (SURF- or BRIEF-) descriptors, matched these descriptors to their nearest neighbor, and applied a standard left-right consistency check.

The individual bits are obtained by comparing the intensities of pairs of points along the same lines as in [9] but without requiring a training phase. 

In [6], it is shown that for some parameter settings of the DAISY descriptor, PCA and quantization can be combined to reduce its size to 60 bits.