Distinctive Image Features from Scale-Invariant Keypoints

Home
/
Papers
/
Distinctive Image Features from Scale-Invariant Keypoints

Distinctive Image Features from Scale-Invariant Keypoints

01 Jan 2011-

TL;DR: The Scale-Invariant Feature Transform (or SIFT) algorithm is a highly robust method to extract and consequently match distinctive invariant features from images that can then be used to reliably match objects in diering images.

read less

Abstract: The Scale-Invariant Feature Transform (or SIFT) algorithm is a highly robust method to extract and consequently match distinctive invariant features from images. These features can then be used to reliably match objects in diering images. The algorithm was rst proposed by Lowe [12] and further developed to increase performance resulting in the classic paper [13] that served as foundation for SIFT which has played an important role in robotic and machine vision in the past decade.

...read moreread less

Citations

PDF

Open Access

More filters

Journal Article•DOI•

Semantic Pooling for Complex Event Analysis in Untrimmed Videos

[...]

Xiaojun Chang¹, Yaoliang Yu², Yi Yang¹, Eric P. Xing²•Institutions (2)

University of Technology, Sydney¹, Carnegie Mellon University²

01 Aug 2017-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: This work defines a novel notion of semantic saliency that assesses the relevance of each shot with the event of interest and proposes a new isotonic regularizer that is able to exploit the constructed semantic ordering information.

...read moreread less

Abstract: Pooling plays an important role in generating a discriminative video representation. In this paper, we propose a new semantic pooling approach for challenging event analysis tasks (e.g., event detection, recognition, and recounting) in long untrimmed Internet videos, especially when only a few shots/segments are relevant to the event of interest while many other shots are irrelevant or even misleading. The commonly adopted pooling strategies aggregate the shots indifferently in one way or another, resulting in a great loss of information. Instead, in this work we first define a novel notion of semantic saliency that assesses the relevance of each shot with the event of interest. We then prioritize the shots according to their saliency scores since shots that are semantically more salient are expected to contribute more to the final event analysis. Next, we propose a new isotonic regularizer that is able to exploit the constructed semantic ordering information. The resulting nearly-isotonic support vector machine classifier exhibits higher discriminative power in event analysis tasks. Computationally, we develop an efficient implementation using the proximal gradient algorithm, and we prove new and closed-form proximal steps. We conduct extensive experiments on three real-world video datasets and achieve promising improvements.

...read moreread less

321 citations

Cites methods from "Distinctive Image Features from Sca..."

...Various low-level features, e.g., SIFT [20], Space-Time Interest Points [21] and improved dense trajectories [22] have been used....
[...]
...SIFT [20], Space-Time Interest Points [21] and improved dense trajectories [22] have been used....
[...]

Proceedings Article•

Learning to Align from Scratch

[...]

Gary B. Huang¹, Marwan Mattar¹, Honglak Lee², Erik Learned-Miller¹•Institutions (2)

University of Massachusetts Amherst¹, University of Michigan²

03 Dec 2012

TL;DR: This paper incorporates deep learning into the congealing alignment framework, and modify the learning algorithm for the restricted Boltzmann machine by incorporating a group sparsity penalty, leading to a topographic organization of the learned filters and improving subsequent alignment results.

...read moreread less

Abstract: Unsupervised joint alignment of images has been demonstrated to improve performance on recognition tasks such as face verification. Such alignment reduces undesired variability due to factors such as pose, while only requiring weak supervision in the form of poorly aligned examples. However, prior work on unsupervised alignment of complex, real-world images has required the careful selection of feature representation based on hand-crafted image descriptors, in order to achieve an appropriate, smooth optimization landscape. In this paper, we instead propose a novel combination of unsupervised joint alignment with unsupervised feature learning. Specifically, we incorporate deep learning into the congealing alignment framework. Through deep learning, we obtain features that can represent the image at differing resolutions based on network depth, and that are tuned to the statistics of the specific data being aligned. In addition, we modify the learning algorithm for the restricted Boltzmann machine by incorporating a group sparsity penalty, leading to a topographic organization of the learned filters and improving subsequent alignment results. We apply our method to the Labeled Faces in the Wild database (LFW). Using the aligned images produced by our proposed unsupervised algorithm, we achieve higher accuracy in face verification compared to prior work in both unsupervised and supervised alignment. We also match the accuracy for the best available commercial method.

...read moreread less

320 citations

Cites background from "Distinctive Image Features from Sca..."

...However, this required a careful selection of hand-crafted feature representation (SIFT [11]) and soft clustering, and does not achieve as large of an improvement in verification accuracy as supervised alignment (LFW-a)....
[...]

Journal Article•DOI•

Sketch-based shape retrieval

[...]

Mathias Eitz¹, Ronald Richter¹, Tamy Boubekeur², Kristian Hildebrand¹, Marc Alexa¹ - Show less +1 more•Institutions (2)

Technical University of Berlin¹, Télécom ParisTech²

01 Jul 2012

TL;DR: A targeted feature transform based on Gabor filters for this system for 3D object retrieval based on sketched feature lines as input is developed and it is shown objectively that this transform is better suited than other approaches from the literature developed for similar tasks.

...read moreread less

Abstract: We develop a system for 3D object retrieval based on sketched feature lines as input. For objective evaluation, we collect a large number of query sketches from human users that are related to an existing data base of objects. The sketches turn out to be generally quite abstract with large local and global deviations from the original shape. Based on this observation, we decide to use a bag-of-features approach over computer generated line drawings of the objects. We develop a targeted feature transform based on Gabor filters for this system. We can show objectively that this transform is better suited than other approaches from the literature developed for similar tasks. Moreover, we demonstrate how to optimize the parameters of our, as well as other approaches, based on the gathered sketches. In the resulting comparison, our approach is significantly better than any other system described so far.

...read moreread less

319 citations

Cites background or methods from "Distinctive Image Features from Sca..."

...Additionally, we evaluate performance of the popular image descriptor SIFT [Lowe 2004] on sketches....
[...]
...In that sense the optimal angular resolution for our descriptor turns out to be identical to the 8 directions used in the SIFT descriptor....
[...]
...Successful representations often capitalize on the distribution of the value of interest, such as the SIFT and SURF descriptor [Lowe 2004; Bay et al. 2006]....
[...]
...We use two variants: a) the complete scalespace feature detection and extraction pipeline (SIFT) and a singlescale grid sampled approach (SIFT Grid), using exactly the same sampling parameters and feature size as for the GALIF descriptor....
[...]
...Additionally, we evaluate performance of the popular image descriptor SIFT [Lowe 2004] on sketches....
[...]

Proceedings Article•DOI•

Robust visual domain adaptation with low-rank reconstruction

[...]

I-Hong Jhuo¹, Dong Liu², Der-Tsai Lee¹, Shih-Fu Chang²•Institutions (2)

National Taiwan University¹, Columbia University²

16 Jun 2012

TL;DR: This paper transforms the visual samples in the source domain into an intermediate representation such that each transformed source sample can be linearly reconstructed by the samples of the target domain, making it more robust than previous methods.

...read moreread less

Abstract: Visual domain adaptation addresses the problem of adapting the sample distribution of the source domain to the target domain, where the recognition task is intended but the data distributions are different. In this paper, we present a low-rank reconstruction method to reduce the domain distribution disparity. Specifically, we transform the visual samples in the source domain into an intermediate representation such that each transformed source sample can be linearly reconstructed by the samples of the target domain. Unlike the existing work, our method captures the intrinsic relatedness of the source samples during the adaptation process while uncovering the noises and outliers in the source domain that cannot be adapted, making it more robust than previous methods. We formulate our problem as a constrained nuclear norm and l 2, 1 norm minimization objective and then adopt the Augmented Lagrange Multiplier (ALM) method for the optimization. Extensive experiments on various visual adaptation tasks show that the proposed method consistently and significantly beats the state-of-the-art domain adaptation methods.

...read moreread less

318 citations

Book Chapter•DOI•

[...]

Lior Wolf¹, Tal Hassner², Yaniv Taigman¹•Institutions (2)

Tel Aviv University¹, Open University of Israel²

23 Sep 2009

TL;DR: “background samples”, that is, examples which do not belong to any of the classes being learned, may provide a significant performance boost to such face recognition systems, and is defined and evaluated as an extension to the recently proposed “One-Shot Similarity” (OSS) measure.

...read moreread less

Abstract: Evaluating the similarity of images and their descriptors by employing discriminative learners has proven itself to be an effective face recognition paradigm. In this paper we show how “background samples”, that is, examples which do not belong to any of the classes being learned, may provide a significant performance boost to such face recognition systems. In particular, we make the following contributions. First, we define and evaluate the “Two-Shot Similarity” (TSS) score as an extension to the recently proposed “One-Shot Similarity” (OSS) measure. Both these measures utilize background samples to facilitate better recognition rates. Second, we examine the ranking of images most similar to a query image and employ these as a descriptor for that image. Finally, we provide results underscoring the importance of proper face alignment in automatic face recognition systems. These contributions in concert allow us to obtain a success rate of 86.83% on the Labeled Faces in the Wild (LFW) benchmark, outperforming current state-of-the-art results.

...read moreread less

314 citations

Cites methods from "Distinctive Image Features from Sca..."

...We use the same descriptors of [10] with the addition of a SIFT descriptor: the LBP descriptor [30], two variants called Three-patch and Four-patch LBP (TPLBP and FPLBP) [10], the C1 image descriptor [31], and SIFT [32]....
[...]

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
…
26
27
28
29
30
31
32
…
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse

References

PDF

Open Access

More filters

Journal Article•DOI•

Distinctive Image Features from Scale-Invariant Keypoints

[...]

David G. Lowe¹•Institutions (1)

University of British Columbia¹

01 Nov 2004-International Journal of Computer Vision

TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.

...read moreread less

Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.

...read moreread less

46,906 citations

Proceedings Article•DOI•

Object recognition from local scale-invariant features

[...]

David G. Lowe¹•Institutions (1)

University of British Columbia¹

20 Sep 1999

TL;DR: Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.

...read moreread less

Abstract: An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low residual least squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.

...read moreread less

16,989 citations

Proceedings Article•DOI•

A Combined Corner and Edge Detector

[...]

Chris Harris, Mike Stephens

01 Jan 1988

TL;DR: The problem the authors are addressing in Alvey Project MMI149 is that of using computer vision to understand the unconstrained 3D world, in which the viewed scenes will in general contain too wide a diversity of objects for topdown recognition techniques to work.

...read moreread less

Abstract: The problem we are addressing in Alvey Project MMI149 is that of using computer vision to understand the unconstrained 3D world, in which the viewed scenes will in general contain too wide a diversity of objects for topdown recognition techniques to work. For example, we desire to obtain an understanding of natural scenes, containing roads, buildings, trees, bushes, etc., as typified by the two frames from a sequence illustrated in Figure 1. The solution to this problem that we are pursuing is to use a computer vision system based upon motion analysis of a monocular image sequence from a mobile camera. By extraction and tracking of image features, representations of the 3D analogues of these features can be constructed.

...read moreread less

13,993 citations

Journal Article•DOI•

A performance evaluation of local descriptors

[...]

Krystian Mikolajczyk¹, Cordelia Schmid²•Institutions (2)

University of Oxford¹, French Institute for Research in Computer Science and Automation²

01 Oct 2005-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: It is observed that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best and Moments and steerable filters show the best performance among the low dimensional descriptors.

...read moreread less

Abstract: In this paper, we compare the performance of descriptors computed for local interest regions, as, for example, extracted by the Harris-Affine detector [Mikolajczyk, K and Schmid, C, 2004]. Many different descriptors have been proposed in the literature. It is unclear which descriptors are more appropriate and how their performance depends on the interest region detector. The descriptors should be distinctive and at the same time robust to changes in viewing conditions as well as to errors of the detector. Our evaluation uses as criterion recall with respect to precision and is carried out for different image transformations. We compare shape context [Belongie, S, et al., April 2002], steerable filters [Freeman, W and Adelson, E, Setp. 1991], PCA-SIFT [Ke, Y and Sukthankar, R, 2004], differential invariants [Koenderink, J and van Doorn, A, 1987], spin images [Lazebnik, S, et al., 2003], SIFT [Lowe, D. G., 1999], complex filters [Schaffalitzky, F and Zisserman, A, 2002], moment invariants [Van Gool, L, et al., 1996], and cross-correlation for different types of interest regions. We also propose an extension of the SIFT descriptor and show that it outperforms the original method. Furthermore, we observe that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best. Moments and steerable filters show the best performance among the low dimensional descriptors.

...read moreread less

7,057 citations

Journal Article•DOI•

Robust wide-baseline stereo from maximally stable extremal regions

[...]

Jiri Matas¹, Ondrej Chum, Martin Urban, Tomas Pajdla•Institutions (1)

University of Surrey¹

01 Sep 2004-Image and Vision Computing

TL;DR: The high utility of MSERs, multiple measurement regions and the robust metric is demonstrated in wide-baseline experiments on image pairs from both indoor and outdoor scenes.

...read moreread less

3,422 citations