Distinctive Image Features from Scale-Invariant Keypoints

Home
/
Papers
/
Distinctive Image Features from Scale-Invariant Keypoints

Distinctive Image Features from Scale-Invariant Keypoints

01 Jan 2011-

TL;DR: The Scale-Invariant Feature Transform (or SIFT) algorithm is a highly robust method to extract and consequently match distinctive invariant features from images that can then be used to reliably match objects in diering images.

read less

Abstract: The Scale-Invariant Feature Transform (or SIFT) algorithm is a highly robust method to extract and consequently match distinctive invariant features from images. These features can then be used to reliably match objects in diering images. The algorithm was rst proposed by Lowe [12] and further developed to increase performance resulting in the classic paper [13] that served as foundation for SIFT which has played an important role in robotic and machine vision in the past decade.

...read moreread less

Citations

PDF

Open Access

More filters

Journal Article•DOI•

Automatic registration of terrestrial laser scanner point clouds using natural planar surfaces

[...]

P. W. Theiler¹, Konrad Schindler¹•Institutions (1)

ETH Zurich¹

20 Jul 2012-ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences

TL;DR: Experiments show that the proposed method is able to successfully coarse register TLS point clouds without the need for artificial targets, and without placing any target objects in the scene.

...read moreread less

Abstract: . Terrestrial laser scanners have become a standard piece of surveying equipment, used in diverse fields like geomatics, manufacturing and medicine. However, the processing of today's large point clouds is time-consuming, cumbersome and not automated enough. A basic step of post-processing is the registration of scans from different viewpoints. At present this is still done using artificial targets or tie points, mostly by manual clicking. The aim of this registration step is a coarse alignment, which can then be improved with the existing algorithm for fine registration. The focus of this paper is to provide such a coarse registration in a fully automatic fashion, and without placing any target objects in the scene. The basic idea is to use virtual tie points generated by intersecting planar surfaces in the scene. Such planes are detected in the data with RANSAC and optimally fitted using least squares estimation. Due to the huge amount of recorded points, planes can be determined very accurately, resulting in well-defined tie points. Given two sets of potential tie points recovered in two different scans, registration is performed by searching for the assignment which preserves the geometric configuration of the largest possible subset of all tie points. Since exhaustive search over all possible assignments is intractable even for moderate numbers of points, the search is guided by matching individual pairs of tie points with the help of a novel descriptor based on the properties of a point's parent planes. Experiments show that the proposed method is able to successfully coarse register TLS point clouds without the need for artificial targets.

...read moreread less

101 citations

Cites methods from "Distinctive Image Features from Sca..."

...They apply the SIFT operator (Lowe, 2003) on the intensity image to find adequate tie points....
[...]

Dissertation•DOI•

Discriminative vision-based recovery and recognition of human motion

[...]

Ronald Poppe

02 Apr 2009

TL;DR: An example-based pose recovery approach is introduced where histograms of oriented gradients (HOG) are used as the image descriptor and simple functions are used to discriminate between two classes after applying a common spatial patterns (CSP) transform on sequences of HOG descriptors.

...read moreread less

Abstract: The automatic analysis of human motion from images opens up the way for applications in the domains of security and surveillance, human-computer interaction, animation, retrieval and sports motion analysis. In this dissertation, the focus is on robust and fast human pose recovery and action recognition. The former is a regression task where the aim is to determine the locations of key joints in the human body, given an image of a human figure. The latter is the process of labeling image sequences with action labels, a classification task. An example-based pose recovery approach is introduced where histograms of oriented gradients (HOG) are used as the image descriptor. From a database containing thousands of HOG-pose pairs, the visually closest examples are selected. Weighted interpolation of the corresponding poses is used to obtain the pose estimate. This approach is fast due to the use of a low-cost distance function. To cope with partial occlusions of the human figure, the normalization and matching of the HOG descriptors was changed from global to the cell level. When occlusion areas in the image are predicted, only part of the descriptor can be used for recovery, thus avoiding adaptation of the database to the occlusion setting. For the recognition of human actions, simple functions are used to discriminate between two classes after applying a common spatial patterns (CSP) transform on sequences of HOG descriptors. In the transform, the difference in variance between two classes is maximized. Each of the discriminative functions softly votes into the two classes. After evaluation of all pairwise functions, the action class that receives most of the voting mass is the estimated class. By combining the two approaches, actions could be recognized by considering sequences of recovered, rotation-normalized poses. Thanks to this normalization, actions could be recognized from arbitrary viewpoints. By handling occlusions in the pose recovery step, actions could be recognized from image observations where occlusion was simulated.

...read moreread less

101 citations

Cites background from "Distinctive Image Features from Sca..."

...[308], who extend the SIFT descriptor [203] to 3D and construct histograms over codewords....
[...]
...Currently, a popular local descriptor is the scale invariant feature transform (SIFT, [203]) and extensions (SIFT-PCA, [167] and GLOH, [217])....
[...]

Proceedings Article•DOI•

YouTubeCat: Learning to categorize wild web videos

[...]

Zheshen Wang¹, Ming Zhao², Yang Song², Sanjiv Kumar², Baoxin Li¹ - Show less +1 more•Institutions (2)

Arizona State University¹, Google²

13 Jun 2010

TL;DR: A fusion framework in which each data source is first combined with the manually-labeled set independently, and a Conditional Random Field (CRF) based fusion strategy is designed, based on the final fused classifier, category labels are predicted for the new videos.

...read moreread less

Abstract: Automatic categorization of videos in a Web-scale unconstrained collection such as YouTube is a challenging task. A key issue is how to build an effective training set in the presence of missing, sparse or noisy labels. We propose to achieve this by first manually creating a small labeled set and then extending it using additional sources such as related videos, searched videos, and text-based webpages. The data from such disparate sources has different properties and labeling quality, and thus fusing them in a coherent fashion is another practical challenge. We propose a fusion framework in which each data source is first combined with the manually-labeled set independently. Then, using the hierarchical taxonomy of the categories, a Conditional Random Field (CRF) based fusion strategy is designed. Based on the final fused classifier, category labels are predicted for the new videos. Extensive experiments on about 80K videos from 29 most frequent categories in YouTube show the effectiveness of the proposed method for categorizing large-scale wild Web videos1.

...read moreread less

101 citations

Book Chapter•DOI•

Discriminative spatial attention for robust tracking

[...]

Jialue Fan¹, Ying Wu¹, Shengyang Dai•Institutions (1)

Northwestern University¹

05 Sep 2010

TL;DR: This paper presents an efficient two-stage method that divides the discriminative domain into a local and a semi-local one, and shows that the set of semi- local ARs are identified through an efficient branch-and-bound procedure that guarantees the optimality.

...read moreread less

Abstract: A major reason leading to tracking failure is the spatial distractions that exhibit similar visual appearances as the target, because they also generate good matches to the target and thus distract the tracker. It is in general very difficult to handle this situation. In a selective attention tracking paradigm, this paper advocates a new approach of discriminative spatial attention that identifies some special regions on the target, called attentional regions (ARs). The ARs show strong discriminative power in their discriminative domains where they do not observe similar things. This paper presents an efficient two-stage method that divides the discriminative domain into a local and a semi-local one. In the local domain, the visual appearance of an attentional region is locally linearized and its discriminative power is closely related to the property of the associated linear manifold, so that a gradient-based search is designed to locate the set of local ARs. Based on that, the set of semi-local ARs are identified through an efficient branch-and-bound procedure that guarantees the optimality. Extensive experiments show that such discriminative spatial attention leads to superior performances in many challenging target tracking tasks.

...read moreread less

101 citations

Cites background from "Distinctive Image Features from Sca..."

...There is a vast literature on salient region selection [10,15,4,11,12,6,1,2]....
[...]

Journal Article•DOI•

Dense SIFT for ghost-free multi-exposure fusion

[...]

Yu Liu¹, Zengfu Wang²•Institutions (2)

University of Science and Technology of China¹, Chinese Academy of Sciences²

01 Aug 2015-Journal of Visual Communication and Image Representation

TL;DR: A new ghost-free multi-exposure fusion method based on dense SIFT is proposed, and two popular weight distribution strategies for local contrast extraction, namely, "weighted-average" and "winner-take-all" are studied.

...read moreread less

101 citations

Cites background or methods from "Distinctive Image Features from Sca..."

...where Normalization (·) denotes the SIFT descriptor normalization operator [27]....
[...]
...In the field of pixel-level image fusion, the activity level measurement must be assigned to each pixel or at least each local block, so the SIFT descriptor [27] cannot be directly employed....
[...]
...The well-known SIFT descriptor proposed by Lowe [27] achieves great success in various computer vision applications....
[...]
...In dense SIFT, a feature descriptor can be extracted for each pixel in an image and the process for detecting interest points in [27] is not required....
[...]
...Please refer to [25] and [27] for more details about the calculation of dense SIFT....
[...]

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
…
136
137
138
139
140
141
142
…
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse

References

PDF

Open Access

More filters

Journal Article•DOI•

Distinctive Image Features from Scale-Invariant Keypoints

[...]

David G. Lowe¹•Institutions (1)

University of British Columbia¹

01 Nov 2004-International Journal of Computer Vision

TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.

...read moreread less

Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.

...read moreread less

46,906 citations

Proceedings Article•DOI•

Object recognition from local scale-invariant features

[...]

David G. Lowe¹•Institutions (1)

University of British Columbia¹

20 Sep 1999

TL;DR: Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.

...read moreread less

Abstract: An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low residual least squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.

...read moreread less

16,989 citations

Proceedings Article•DOI•

A Combined Corner and Edge Detector

[...]

Chris Harris, Mike Stephens

01 Jan 1988

TL;DR: The problem the authors are addressing in Alvey Project MMI149 is that of using computer vision to understand the unconstrained 3D world, in which the viewed scenes will in general contain too wide a diversity of objects for topdown recognition techniques to work.

...read moreread less

Abstract: The problem we are addressing in Alvey Project MMI149 is that of using computer vision to understand the unconstrained 3D world, in which the viewed scenes will in general contain too wide a diversity of objects for topdown recognition techniques to work. For example, we desire to obtain an understanding of natural scenes, containing roads, buildings, trees, bushes, etc., as typified by the two frames from a sequence illustrated in Figure 1. The solution to this problem that we are pursuing is to use a computer vision system based upon motion analysis of a monocular image sequence from a mobile camera. By extraction and tracking of image features, representations of the 3D analogues of these features can be constructed.

...read moreread less

13,993 citations

Journal Article•DOI•

A performance evaluation of local descriptors

[...]

Krystian Mikolajczyk¹, Cordelia Schmid²•Institutions (2)

University of Oxford¹, French Institute for Research in Computer Science and Automation²

01 Oct 2005-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: It is observed that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best and Moments and steerable filters show the best performance among the low dimensional descriptors.

...read moreread less

Abstract: In this paper, we compare the performance of descriptors computed for local interest regions, as, for example, extracted by the Harris-Affine detector [Mikolajczyk, K and Schmid, C, 2004]. Many different descriptors have been proposed in the literature. It is unclear which descriptors are more appropriate and how their performance depends on the interest region detector. The descriptors should be distinctive and at the same time robust to changes in viewing conditions as well as to errors of the detector. Our evaluation uses as criterion recall with respect to precision and is carried out for different image transformations. We compare shape context [Belongie, S, et al., April 2002], steerable filters [Freeman, W and Adelson, E, Setp. 1991], PCA-SIFT [Ke, Y and Sukthankar, R, 2004], differential invariants [Koenderink, J and van Doorn, A, 1987], spin images [Lazebnik, S, et al., 2003], SIFT [Lowe, D. G., 1999], complex filters [Schaffalitzky, F and Zisserman, A, 2002], moment invariants [Van Gool, L, et al., 1996], and cross-correlation for different types of interest regions. We also propose an extension of the SIFT descriptor and show that it outperforms the original method. Furthermore, we observe that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best. Moments and steerable filters show the best performance among the low dimensional descriptors.

...read moreread less

7,057 citations

Journal Article•DOI•

Robust wide-baseline stereo from maximally stable extremal regions

[...]

Jiri Matas¹, Ondrej Chum, Martin Urban, Tomas Pajdla•Institutions (1)

University of Surrey¹

01 Sep 2004-Image and Vision Computing

TL;DR: The high utility of MSERs, multiple measurement regions and the robust metric is demonstrated in wide-baseline experiments on image pairs from both indoor and outdoor scenes.

...read moreread less

3,422 citations