Distinctive Image Features from Scale-Invariant Keypoints

01 Jan 2011
TL;DR: The Scale-Invariant Feature Transform (or SIFT) algorithm is a highly robust method to extract and subsequently match distinctive invariant features from images that can then be used to reliably match objects in differing images.
Abstract: The Scale-Invariant Feature Transform (or SIFT) algorithm is a highly robust method to extract and subsequently match distinctive invariant features from images. These features can then be used to reliably match objects in differing images. The algorithm was first proposed by Lowe [12] and further developed to increase performance, resulting in the classic paper [13] that served as the foundation for SIFT, which has played an important role in robotic and machine vision in the past decade.
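
As a concrete illustration, here is a minimal sketch of the extraction step using OpenCV's SIFT implementation (cv2.SIFT_create, available in opencv-python 4.4+); the image path is a placeholder:

```python
# A minimal sketch, assuming opencv-python is installed and
# "scene.png" (a placeholder) exists on disk.
import cv2

img = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()

# Each keypoint carries position, scale and orientation; each
# descriptor is a 128-dimensional gradient-orientation histogram.
keypoints, descriptors = sift.detectAndCompute(img, None)
print(len(keypoints), descriptors.shape)  # e.g. 1342 (1342, 128)
```
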
Citations
Book ChapterDOI
06 Sep 2014
TL;DR: In this paper, the geometry of a 3D mesh model obtained from multi-view reconstruction is exploited to predict the best view before the actual labeling, which leads to a further reduction of computation time and a gain in accuracy.
Abstract: There is an increasing interest in semantically annotated 3D models, e.g. of cities. The typical approaches start with the semantic labelling of all the images used for the 3D model. Such labelling tends to be very time consuming though. The inherent redundancy among the overlapping images calls for more efficient solutions. This paper proposes an alternative approach that exploits the geometry of a 3D mesh model obtained from multi-view reconstruction. Instead of clustering similar views, we predict the best view before the actual labelling. For this we find the single image part that best supports the correct semantic labelling of each face of the underlying 3D mesh. Moreover, our single-image approach may come as a surprise: it tends to increase the accuracy of the model labelling when compared to approaches that fuse the labels from multiple images. As a matter of fact, we even go a step further, and only explicitly label a subset of faces (e.g. 10%), to subsequently fill in the labels of the remaining faces. This leads to a further reduction of computation time, again combined with a gain in accuracy. Compared to a process that starts from the semantic labelling of the images, our method to semantically label 3D models yields accelerations of about 2 orders of magnitude. We tested our multi-view semantic labelling on a variety of street scenes.
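
As a rough illustration of the single-best-view idea, the toy sketch below picks one view per mesh face from a score matrix; the scoring (e.g. projected face area) and all data are synthetic stand-ins, not the paper's actual criterion:

```python
# Toy stand-ins: scores[f, v] rates how well view v sees face f
# (the paper uses a geometry-based criterion); view_labels[v, f]
# is the class that view v's 2D segmentation gives face f.
import numpy as np

rng = np.random.default_rng(0)
n_faces, n_views, n_classes = 1000, 8, 5
scores = rng.random((n_faces, n_views))
view_labels = rng.integers(0, n_classes, size=(n_views, n_faces))

# One best view per face, instead of fusing labels from all views.
best_view = scores.argmax(axis=1)
face_labels = view_labels[best_view, np.arange(n_faces)]
```
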

100 citations

Journal ArticleDOI
TL;DR: A novel cross-modal retrieval approach based on discriminative dictionary learning that is augmented with common label alignment that outperforms several state-of-the-art methods in terms of retrieval accuracy.
Abstract: Cross-modal retrieval has attracted much attention in recent years due to its widespread applications. In this area, how to capture and correlate heterogeneous features originating from different modalities remains a challenge. However, most existing methods dealing with cross-modal learning only focus on learning relevant features shared by two distinct feature spaces, thereby overlooking their discriminative feature information. To remedy this issue and explicitly capture discriminative feature information, we propose a novel cross-modal retrieval approach based on discriminative dictionary learning that is augmented with common label alignment. Concretely, a discriminative dictionary is first learned to account for each modality, which boosts not only the discriminating capability of intra-modality data from different classes but also the relevance of inter-modality data in the same class. Subsequently, all the resulting sparse codes are simultaneously mapped to a common label space, where the cross-modal data samples are characterized and associated. Also in the label space, the discriminativeness and relevance of the considered cross-modal data can be further strengthened by enforcing a common label alignment. Finally, cross-modal retrieval is performed over the common label space. Experiments conducted on two public cross-modal datasets show that the proposed approach outperforms several state-of-the-art methods in terms of retrieval accuracy.
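
To make the shape of this pipeline concrete, the sketch below substitutes scikit-learn's plain (non-discriminative) dictionary learning for the paper's discriminative formulation, maps the sparse codes into a common label space with ridge regression, and ranks by cosine similarity; all features are synthetic:

```python
# Sketch under stated assumptions: plain dictionary learning stands
# in for the paper's discriminative variant; data is synthetic.
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.linear_model import Ridge
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(0)
n = 200
X_img = rng.random((n, 64))                  # image features (synthetic)
X_txt = rng.random((n, 32))                  # text features (synthetic)
y = rng.integers(0, 4, size=n)
Y = label_binarize(y, classes=[0, 1, 2, 3])  # common label space

# One dictionary per modality; sparse codes via lasso.
C_img = DictionaryLearning(n_components=20,
                           transform_algorithm="lasso_lars").fit_transform(X_img)
C_txt = DictionaryLearning(n_components=20,
                           transform_algorithm="lasso_lars").fit_transform(X_txt)

# Map both code spaces into the shared label space.
L_img = Ridge().fit(C_img, Y).predict(C_img)
L_txt = Ridge().fit(C_txt, Y).predict(C_txt)

# Image-to-text retrieval: rank text items by label-space similarity.
ranking = np.argsort(-cosine_similarity(L_img, L_txt), axis=1)
```
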

100 citations


Cites methods from "Distinctive Image Features from Sca..."

  • ...We extract SIFT descriptors [31] for images and quantize them into Bag-of-Visual-Words (BoVW) [32] by K-means clustering....

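The quoted BoVW pipeline can be sketched roughly as follows, with OpenCV SIFT and scikit-learn k-means; image paths and the vocabulary size are placeholders:

```python
# A minimal sketch: pooled SIFT descriptors -> k-means vocabulary ->
# per-image normalized word histograms. Paths and k are placeholders.
import cv2
import numpy as np
from sklearn.cluster import KMeans

paths = ["img0.png", "img1.png"]   # placeholder image paths
sift = cv2.SIFT_create()

per_image = []
for p in paths:
    img = cv2.imread(p, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(img, None)
    per_image.append(desc)

# Learn the visual vocabulary on all descriptors pooled together.
k = 100
kmeans = KMeans(n_clusters=k).fit(np.vstack(per_image))

# Each image becomes a normalized histogram over the k visual words.
bovw = []
for desc in per_image:
    hist = np.bincount(kmeans.predict(desc), minlength=k).astype(float)
    bovw.append(hist / hist.sum())
```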

Journal ArticleDOI
TL;DR: A new appearance model based on Mean Riemannian Covariance (MRC) patches extracted from tracks of a particular individual is presented, and it is demonstrated that the proposed approach outperforms state-of-the-art methods.
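
For orientation, the sketch below computes a plain region-covariance descriptor, the building block behind MRC patches; the paper's Riemannian averaging of such matrices over a track is omitted, and the patch data is synthetic:

```python
# Sketch: summarize a patch by the covariance of per-pixel feature
# vectors (position, colour, gradient magnitudes). The Riemannian
# mean over a track, which gives MRC its name, is not shown.
import numpy as np

rng = np.random.default_rng(0)
patch = rng.random((16, 16, 3))           # synthetic RGB patch

ys, xs = np.mgrid[0:16, 0:16]
gy, gx = np.gradient(patch.mean(axis=2))  # grayscale gradients

# Per-pixel features: (x, y, R, G, B, |gx|, |gy|).
feats = np.stack([xs.ravel(), ys.ravel(),
                  patch[..., 0].ravel(), patch[..., 1].ravel(),
                  patch[..., 2].ravel(),
                  np.abs(gx).ravel(), np.abs(gy).ravel()], axis=1)

C = np.cov(feats, rowvar=False)           # 7x7 covariance descriptor
```
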

100 citations


Cites methods from "Distinctive Image Features from Sca..."

  • ...The histograms were composed of color features, autocorrelograms and a bag of features based on SIFT [11] descriptor....

  • ...Bag of features based on SIFT [11] descriptor together with online learning were proposed in [24] to improve matching accuracy....

Journal ArticleDOI
TL;DR: A new consistency potential is proposed for image labeling problems that can encode any possible combination of labels, penalizing only unlikely combinations of classes, and an effective sampling strategy is proposed over this expanded label set that renders tractable the underlying optimization problem.
Abstract: The Hierarchical Conditional Random Field (HCRF) model has been successfully applied to a number of image labeling problems, including image segmentation. However, existing HCRF models of image segmentation do not allow multiple classes to be assigned to a single region, which limits their ability to incorporate contextual information across multiple scales. At higher scales in the image, this representation yields an oversimplified model since multiple classes can be reasonably expected to appear within large regions. This simplified model particularly limits the impact of information at higher scales. Since class-label information at these scales is usually more reliable than at lower, noisier scales, neglecting this information is undesirable. To address these issues, we propose a new consistency potential for image labeling problems, which we call the harmony potential. It can encode any possible combination of labels, penalizing only unlikely combinations of classes. We also propose an effective sampling strategy over this expanded label set that renders tractable the underlying optimization problem. Our approach obtains state-of-the-art results on two challenging, standard benchmark datasets for semantic image segmentation: PASCAL VOC 2010 and MSRC-21.
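
A toy sketch of a region-level potential in the spirit of the harmony potential is given below; it enumerates the exponential set of label subsets outright (the paper instead samples this set to stay tractable), and all costs are illustrative:

```python
# Toy sketch: the region variable ranges over *subsets* of classes,
# pixels are penalized only when their label falls outside the
# region's subset, and each subset carries a prior cost. Costs here
# are illustrative, not the paper's learned values.
from itertools import combinations

CLASSES = ["road", "car", "sky"]

def subset_cost(subset):
    # Hypothetical prior: charge for each extra class in the subset.
    return 0.5 * (len(subset) - 1)

def harmony_potential(pixel_labels, inconsistency_penalty=1.0):
    best = float("inf")
    for r in range(1, len(CLASSES) + 1):
        for subset in combinations(CLASSES, r):
            mismatch = sum(lab not in subset for lab in pixel_labels)
            cost = subset_cost(subset) + inconsistency_penalty * mismatch
            best = min(best, cost)
    return best

# A mixed region is explained by the subset {road, car, sky} rather
# than being forced into a single class.
print(harmony_potential(["road", "road", "car", "sky"]))
```
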

100 citations


Cites background or methods from "Distinctive Image Features from Sca..."

  • ...We use a bag-of-words representation (Zhang et al. 2007), based on shape SIFT, color SIFT (van de Sande et al. 2010), together with spatial pyramids (Lazebnik et al. 2006) and color attention (Shahbaz et al. 2009) based on the Color Name feature (van de Weijer et al. 2009)....

  • ...Typically, shape features such as SIFT (Lowe 2004), color features like local color histograms, and texture features like LBPs (Ojala et al. 2002) are used as local descriptors....

  • ...In the case of MSRC21, we use a simpler bag-of-words representation based on SIFT, RGB histograms, SSIM and spatial pyramids (Lazebnik et al. 2006) with max-pooling (Yang et al. 2009)....

  • ...These patches are described by shape (SIFT), color (RGB histogram) and the SSIM self-similarity descriptor (Shechtman and Irani 2007)....

  • ...Advances in object recognition (Schmid and Mohr 1997; Lowe 2004; Sivic and Zisserman 2003) allowed for the recognition of semantic classes in images to aid image segmentation....

Proceedings ArticleDOI
16 May 2016
TL;DR: This paper trains place-specific linear SVM classifiers, mined without supervision from a single mapping dataset, to recognise distinctive elements in the environment for localisation in challenging outdoor environments.
Abstract: This paper is about camera-only localisation in challenging outdoor environments, where changes in lighting, weather and season cause traditional localisation systems to fail. Conventional approaches to the localisation problem rely on point-features such as SIFT, SURF or BRIEF to associate landmark observations in the live image with landmarks stored in the map; however, these features are brittle to the severe appearance change routinely encountered in outdoor environments. In this paper, we propose an alternative to traditional point-features: we train place-specific linear SVM classifiers to recognise distinctive elements in the environment. The core contribution of this paper is an unsupervised mining algorithm which operates on a single mapping dataset to extract distinct elements from the environment for localisation. We evaluate our system on 205km of data collected from central Oxford over a period of six months in bright sun, night, rain, snow and at all times of the day. Our experiment consists of a comprehensive N-vs-N analysis on 22 laps of the approximately 10km route in central Oxford. With our proposed system, the portion of the route where localisation fails is reduced by a factor of 6, from 33.3% to 5.5%.
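
A minimal sketch of the place-specific classifier idea follows, with synthetic descriptors standing in for the paper's mined elements and HOG features:

```python
# Sketch with synthetic data: positives are descriptors of one
# distinctive element across traversals, negatives are background
# patches; dim approximates a HOG descriptor length.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
dim = 1764
pos = rng.normal(1.0, 0.5, (40, dim))    # element patches (synthetic)
neg = rng.normal(0.0, 0.5, (400, dim))   # background patches (synthetic)

X = np.vstack([pos, neg])
y = np.r_[np.ones(len(pos)), np.zeros(len(neg))]

# One such classifier is trained per mapped place.
clf = LinearSVC(C=0.01).fit(X, y)

# At runtime, candidate patches scoring above a threshold are taken
# as landmark observations for localisation.
scores = clf.decision_function(rng.normal(0.0, 0.5, (10, dim)))
```
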

100 citations


Cites methods from "Distinctive Image Features from Sca..."

  • ...Landmarks are described by a feature descriptor (e.g. SIFT, SURF, BRIEF)....

  • ...The landmarks extracted in Section IV-B are used at runtime for localisation instead of traditional point-features such as SIFT, SURF and BRIEF....

  • ...Valgren examined the effect of seasonal change on SIFT and SURF features for topological localisation, but did not examine metric localisation [20]....

  • ...Traditional approaches rely on point-features (such as SIFT, SURF and BRIEF) for metric localisation, however these point-features are not robust to severe appearance change....

  • ...These are then described with a local feature descriptor such as SIFT [10], SURF [11] or one of the binary descriptors [12][13][14][15]....

References
Journal ArticleDOI
TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
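
The matching step can be sketched with OpenCV as follows, using brute-force nearest-neighbour search in place of the paper's approximate Best-Bin-First search; the 0.8 distance-ratio threshold is the one recommended in the paper, and the image paths are placeholders:

```python
# Minimal sketch with placeholder image paths; OpenCV's brute-force
# matcher replaces the paper's approximate Best-Bin-First search.
import cv2

sift = cv2.SIFT_create()
img1 = cv2.imread("view1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view2.png", cv2.IMREAD_GRAYSCALE)
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

matcher = cv2.BFMatcher(cv2.NORM_L2)
knn = matcher.knnMatch(des1, des2, k=2)

# Lowe's ratio test: accept a match only if it is clearly better
# than the second-best candidate (0.8 is the paper's threshold).
good = [m for m, n in knn if m.distance < 0.8 * n.distance]
```
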

46,906 citations

Proceedings ArticleDOI
20 Sep 1999
TL;DR: Experimental results show that robust object recognition can be achieved in cluttered, partially occluded images with a computation time of under 2 seconds.
Abstract: An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low residual least squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered, partially occluded images with a computation time of under 2 seconds.
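
The staged filtering idea can be sketched as a difference-of-Gaussian stack whose 3D local maxima are kept as candidate stable points; SciPy stands in for the paper's pyramid implementation, minima are omitted for brevity, and the image is synthetic:

```python
# Sketch on a synthetic image: difference-of-Gaussian stack, keep
# points that are local maxima over space and scale (minima omitted)
# and exceed a small contrast threshold.
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

rng = np.random.default_rng(0)
img = gaussian_filter(rng.random((128, 128)), 1.0)

sigmas = [1.0, 1.4, 2.0, 2.8, 4.0]
blurred = np.stack([gaussian_filter(img, s) for s in sigmas])
dog = blurred[1:] - blurred[:-1]       # adjacent-scale differences

# Stable points: maxima of the 3x3x3 neighbourhood in (scale, y, x).
is_max = dog == maximum_filter(dog, size=3)
keypoints = np.argwhere(is_max & (dog > 0.01))
```
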

16,989 citations

Proceedings ArticleDOI
01 Jan 1988
TL;DR: The problem the authors are addressing in Alvey Project MMI149 is that of using computer vision to understand the unconstrained 3D world, in which the viewed scenes will in general contain too wide a diversity of objects for top-down recognition techniques to work.
Abstract: The problem we are addressing in Alvey Project MMI149 is that of using computer vision to understand the unconstrained 3D world, in which the viewed scenes will in general contain too wide a diversity of objects for topdown recognition techniques to work. For example, we desire to obtain an understanding of natural scenes, containing roads, buildings, trees, bushes, etc., as typified by the two frames from a sequence illustrated in Figure 1. The solution to this problem that we are pursuing is to use a computer vision system based upon motion analysis of a monocular image sequence from a mobile camera. By extraction and tracking of image features, representations of the 3D analogues of these features can be constructed.
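
This reference is the Harris corner detector, whose response reduces to a compact computation: smooth products of image gradients into the structure tensor M, then score R = det(M) - k * trace(M)^2. A sketch on a synthetic image, with the conventional choice k = 0.04:

```python
# Sketch on a synthetic image; 0.04 is the conventional k value.
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

rng = np.random.default_rng(0)
img = gaussian_filter(rng.random((128, 128)), 1.0)

Ix = sobel(img, axis=1)                # horizontal gradient
Iy = sobel(img, axis=0)                # vertical gradient

# Structure tensor entries, smoothed over a local window.
Sxx = gaussian_filter(Ix * Ix, 1.5)
Syy = gaussian_filter(Iy * Iy, 1.5)
Sxy = gaussian_filter(Ix * Iy, 1.5)

# Harris response: large where both eigenvalues of M are large.
k = 0.04
R = (Sxx * Syy - Sxy ** 2) - k * (Sxx + Syy) ** 2
corners = np.argwhere(R > 0.01 * R.max())
```
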

13,993 citations

Journal ArticleDOI
TL;DR: It is observed that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best and Moments and steerable filters show the best performance among the low dimensional descriptors.
Abstract: In this paper, we compare the performance of descriptors computed for local interest regions, as, for example, extracted by the Harris-Affine detector [Mikolajczyk, K and Schmid, C, 2004]. Many different descriptors have been proposed in the literature. It is unclear which descriptors are more appropriate and how their performance depends on the interest region detector. The descriptors should be distinctive and at the same time robust to changes in viewing conditions as well as to errors of the detector. Our evaluation uses as criterion recall with respect to precision and is carried out for different image transformations. We compare shape context [Belongie, S, et al., April 2002], steerable filters [Freeman, W and Adelson, E, Sept. 1991], PCA-SIFT [Ke, Y and Sukthankar, R, 2004], differential invariants [Koenderink, J and van Doorn, A, 1987], spin images [Lazebnik, S, et al., 2003], SIFT [Lowe, D. G., 1999], complex filters [Schaffalitzky, F and Zisserman, A, 2002], moment invariants [Van Gool, L, et al., 1996], and cross-correlation for different types of interest regions. We also propose an extension of the SIFT descriptor and show that it outperforms the original method. Furthermore, we observe that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best. Moments and steerable filters show the best performance among the low dimensional descriptors.
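
The paper's evaluation criterion, recall plotted against 1-precision as a match-acceptance threshold varies, can be sketched as follows; the distances and ground truth here are synthetic:

```python
# Sketch with synthetic match distances and ground truth: sweep an
# acceptance threshold and trace recall against 1-precision.
import numpy as np

rng = np.random.default_rng(0)
distances = rng.random(1000)             # candidate match distances
is_correct = rng.random(1000) < 0.3      # synthetic ground truth
n_correspondences = is_correct.sum()

recall, one_minus_precision = [], []
for t in np.linspace(0.0, 1.0, 50):
    accepted = distances < t
    n_acc = accepted.sum()
    if n_acc == 0:
        continue
    n_good = (accepted & is_correct).sum()
    recall.append(n_good / n_correspondences)
    one_minus_precision.append((n_acc - n_good) / n_acc)
```
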

7,057 citations

Journal ArticleDOI
TL;DR: The high utility of MSERs, multiple measurement regions and the robust metric is demonstrated in wide-baseline experiments on image pairs from both indoor and outdoor scenes.

3,422 citations
