
Distinctive Image Features from Scale-Invariant Keypoints

01 Jan 2011-
TL;DR: The Scale-Invariant Feature Transform (SIFT) algorithm is a highly robust method for extracting and matching distinctive invariant features from images, which can then be used to reliably match objects across differing images.
Abstract: The Scale-Invariant Feature Transform (or SIFT) algorithm is a highly robust method to extract and subsequently match distinctive invariant features from images. These features can then be used to reliably match objects in differing images. The algorithm was first proposed by Lowe [12] and further developed to increase performance, resulting in the classic paper [13] that served as the foundation for SIFT, which has played an important role in robotic and machine vision in the past decade.
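The matching step the abstract refers to can be made concrete: Lowe's scheme keeps a candidate match only when the nearest descriptor is clearly closer than the second-nearest one. A minimal sketch in Python (the function name, threshold value, and synthetic data are illustrative, not taken from the paper):

```python
import numpy as np

def match_ratio_test(desc_a, desc_b, ratio=0.8):
    """Match each descriptor in desc_a to its nearest neighbor in desc_b,
    keeping a match only when the nearest distance is clearly smaller than
    the second-nearest distance (the distance-ratio test)."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)   # Euclidean distance to every candidate
        order = np.argsort(dists)
        nearest, second = dists[order[0]], dists[order[1]]
        if nearest < ratio * second:                 # ambiguous matches are discarded
            matches.append((i, int(order[0])))
    return matches
```

On synthetic 128-dimensional descriptors, slightly perturbed copies of database rows match back to their originals, while queries whose two nearest neighbors are similar in distance are rejected.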
Citations
Proceedings ArticleDOI
16 Apr 2013
TL;DR: This paper shows how the Map- Reduce paradigm can be applied to indexing algorithms and demonstrates that great scalability can be achieved using Hadoop, a popular Map-Reduce-based framework.
Abstract: Most researchers working on high-dimensional indexing agree on the following three trends: (i) the size of the multimedia collections to index is now reaching millions if not billions of items, (ii) the computers we use every day now come with multiple cores, and (iii) hardware becomes more available, thanks to easier access to Grids and/or Clouds. This paper shows how the Map-Reduce paradigm can be applied to indexing algorithms and demonstrates that great scalability can be achieved using Hadoop, a popular Map-Reduce-based framework. Dramatic performance improvements are not, however, guaranteed a priori: such frameworks are rigid, they severely constrain the possible access patterns to data, and the scarce resource of RAM has to be shared. Furthermore, algorithms require major redesign and may have to settle for sub-optimal behavior. The benefits, however, are many: simplicity for programmers, automatic distribution, fault tolerance, failure detection and automatic re-runs, and, last but not least, scalability. We share our experience of adapting a clustering-based high-dimensional indexing algorithm to the Map-Reduce model, and of testing it at large scale with Hadoop as we index 30 billion SIFT descriptors. We foresee that lessons drawn from our work could minimize the time, effort, and energy invested by other researchers and practitioners working in similar directions.
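The paper's actual Hadoop code is not reproduced on this page. As a toy, single-process illustration of how clustering-based indexing decomposes into the Map-Reduce pattern the abstract describes, the map step could route each descriptor to its nearest cluster and the reduce step could aggregate each cluster's descriptors into one index partition (all names here are ours):

```python
from collections import defaultdict
import numpy as np

def mapper(descriptor, centroids):
    # Map step: emit (cluster id, descriptor), routing the descriptor to the
    # index partition of its nearest cluster centroid.
    cluster_id = int(np.argmin(np.linalg.norm(centroids - descriptor, axis=1)))
    return cluster_id, descriptor

def reducer(pairs):
    # Reduce step: group descriptors by cluster id, yielding one index
    # partition per cluster.
    index = defaultdict(list)
    for cluster_id, descriptor in pairs:
        index[cluster_id].append(descriptor)
    return index
```

In a real deployment the framework shuffles the mapper output across machines before the reduce phase; this sketch only shows the data flow.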

80 citations


Cites methods from "Distinctive Image Features from Sca..."

  • ...When the raw descriptor collection is on the order of terabytes, as is the case when indexing tens of millions of real-world images using SIFT [18], then indexing may take days or even weeks....

  • ...Many query images are visually such that only a very small number of SIFT descriptors can be extracted from their contents, e.g., 1% of the images have less than 8 descriptors....

  • ...We evaluate index creation and search using an image collection containing roughly 100 million images; this is about 30 billion SIFT descriptors or about 4 terabytes of data....

  • ...SIFT descriptors were then extracted from these images, resulting in about 30 billion descriptors, i.e., 300 SIFT descriptors per image on average....

  • ...Getting 100% accuracy is impossible as some image variants have zero SIFT descriptors (too dark, e.g.)....

Proceedings ArticleDOI
01 Dec 2013
TL;DR: A supervised method that explores the structure learning techniques to design efficient hash functions and exploits the common local visual patterns occurring in video frames that are associated with the same semantic class, and simultaneously preserves the temporal consistency over successive frames from the same video.
Abstract: Recently, learning based hashing methods have become popular for indexing large-scale media data. Hashing methods map high-dimensional features to compact binary codes that are efficient to match and robust in preserving original similarity. However, most of the existing hashing methods treat videos as a simple aggregation of independent frames and index each video through combining the indexes of frames. The structure information of videos, e.g., discriminative local visual commonality and temporal consistency, is often neglected in the design of hash functions. In this paper, we propose a supervised method that explores the structure learning techniques to design efficient hash functions. The proposed video hashing method formulates a minimization problem over a structure-regularized empirical loss. In particular, the structure regularization exploits the common local visual patterns occurring in video frames that are associated with the same semantic class, and simultaneously preserves the temporal consistency over successive frames from the same video. We show that the minimization objective can be efficiently solved by an Accelerated Proximal Gradient (APG) method. Extensive experiments on two large video benchmark datasets (up to around 150K video clips with over 12 million frames) show that the proposed method significantly outperforms the state-of-the-art hashing methods.
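The paper's structure-regularized hashing objective is not reproduced here; the sketch below instead shows the general Accelerated Proximal Gradient (FISTA-style) scheme the abstract names, applied to a simple L1-regularized least-squares problem for illustration:

```python
import numpy as np

def apg_lasso(A, b, lam, steps=200):
    """FISTA-style Accelerated Proximal Gradient for
    min_x 0.5 * ||A x - b||^2 + lam * ||x||_1 (an illustrative stand-in
    for the paper's structure-regularized hashing loss)."""
    L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of the smooth gradient
    x = z = np.zeros(A.shape[1])
    t = 1.0
    for _ in range(steps):
        grad = A.T @ (A @ z - b)               # gradient of the smooth term at the lookahead point
        w = z - grad / L
        x_new = np.sign(w) * np.maximum(np.abs(w) - lam / L, 0.0)  # soft-threshold (L1 prox)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = x_new + ((t - 1.0) / t_new) * (x_new - x)              # momentum extrapolation
        x, t = x_new, t_new
    return x
```

APG keeps the per-iteration cost of plain proximal gradient while improving the convergence rate, which is why it suits large-scale objectives like the one in the paper.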

80 citations


Cites methods from "Distinctive Image Features from Sca..."

  • ...Since we apply the widely used Bag-of-Words (BoW) model with local SIFT [15] features for video representation in our formulation, such selected feature dimensions, i.e., visual words, correspond to discriminative local visual patterns....

  • ...For each key frame, we extract 128-dimensional SIFT features [15] over key points and perform BoW quantization to derive the image representations [16]....

Journal ArticleDOI
TL;DR: This paper proposes a secure framework for outsourced privacy-preserving storage and retrieval in large shared image repositories based on IES-CBIR, a novel Image Encryption Scheme that exhibits Content-Based Image Retrieval properties.
Abstract: Storage requirements for visual data have been increasing in recent years, following the emergence of many highly interactive multimedia services and applications for mobile devices in both personal and corporate scenarios. This has been a key driving factor for the adoption of cloud-based data outsourcing solutions. However, outsourcing data storage to the Cloud also leads to new security challenges that must be carefully addressed, especially regarding privacy. In this paper we propose a secure framework for outsourced privacy-preserving storage and retrieval in large shared image repositories. Our proposal is based on IES-CBIR, a novel Image Encryption Scheme that exhibits Content-Based Image Retrieval properties. The framework enables both encrypted storage and searching using Content-Based Image Retrieval queries while preserving privacy against honest-but-curious cloud administrators. We have built a prototype of the proposed framework, formally analyzed and proven its security properties, and experimentally evaluated its performance and retrieval precision. Our results show that IES-CBIR is provably secure, allows more efficient operations than existing proposals, both in terms of time and space complexity, and paves the way for new practical application scenarios.

80 citations


Cites background or methods from "Distinctive Image Features from Sca..."

  • ...the CBIR algorithms used in each work: local color histograms [17], SIFT [33], and global color histograms [34]....

  • ...In this experiment, PKHE achieved the best result, as expected due to the use of the SIFT retrieval algorithm [33]....

  • ...SIFT features were originally designed for object recognition, and we believe that their use to search by example in image repositories (such as the ones used in our experiments and in the literature) does not leverage their full potential....

  • ...Retrieval precision results for the PKHE system (in both experiments) were not substantially different from the other systems, even though it uses strong texture-based image features (in particular, SIFT)....

Journal ArticleDOI
TL;DR: A ceiling vision-based simultaneous localization and mapping (SLAM) methodology for solving the global localization problems in multirobot formations is proposed and an efficient data-association method is developed to achieve an optimistic feature match hypothesis quickly and accurately.
Abstract: Localization is a key issue in multirobot formations, but it has not yet been sufficiently studied. In this paper, we propose a ceiling vision-based simultaneous localization and mapping (SLAM) methodology for solving the global localization problems in multirobot formations. First, an efficient data-association method is developed to achieve an optimistic feature match hypothesis quickly and accurately. Then, the relative poses among the robots are calculated utilizing a match-based approach, for local localization. To achieve the goal of global localization, three strategies are proposed. The first strategy is to globally localize one robot only (i.e., leader) and then localize the others based on relative poses among the robots. The second strategy is that each robot globally localizes itself by implementing SLAM individually. The third strategy is to utilize a common SLAM server, which may be installed on one of the robots, to globally localize all the robots simultaneously, based on a shared global map. Experiments are finally performed on a group of mobile robots to demonstrate the effectiveness of the proposed approaches.
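The abstract does not spell out the match-based relative-pose computation; a generic least-squares rigid alignment (the Kabsch procedure) between matched 2D feature sets illustrates the kind of calculation involved. This is our sketch, not the authors' implementation:

```python
import numpy as np

def relative_pose_2d(p, q):
    """Least-squares rigid transform (R, t) with q_i ~ R p_i + t, estimated
    from matched 2D feature coordinates p, q of shape (n, 2)."""
    pc, qc = p.mean(axis=0), q.mean(axis=0)          # centroids
    H = (p - pc).T @ (q - qc)                        # 2x2 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))           # guard against reflections
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = qc - R @ pc
    return R, t
```

Given feature matches between two robots' ceiling views, the recovered (R, t) is exactly the relative pose needed for local localization.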

80 citations


Cites methods from "Distinctive Image Features from Sca..."

  • ...Se et al. [30] used robust scale-invariant feature transform (SIFT) descriptors to associate features, and Davison et al. [31] employed a patch-matching algorithm and a particle-searching strategy for data association....

  • ...Compared to SIFT [33], Harris corners are more accurate and efficient in textureless environments such as ceilings and walls....

Journal ArticleDOI
TL;DR: The combination of these components within the pictorial structures framework results in a generic model that yields state-of-the-art performance for several datasets on a variety of tasks: people detection, upper body pose estimation, and full body Pose estimation.
Abstract: In this paper we consider people detection and articulated pose estimation, two closely related and challenging problems in computer vision. Conceptually, both of these problems can be addressed within the pictorial structures framework (Felzenszwalb and Huttenlocher in Int. J. Comput. Vis. 61(1):55---79, 2005; Fischler and Elschlager in IEEE Trans. Comput. C-22(1):67---92, 1973), even though previous approaches have not shown such generality. A principal difficulty for such a general approach is to model the appearance of body parts. The model has to be discriminative enough to enable reliable detection in cluttered scenes and general enough to capture highly variable appearance. Therefore, as the first important component of our approach, we propose a discriminative appearance model based on densely sampled local descriptors and AdaBoost classifiers. Secondly, we interpret the normalized margin of each classifier as likelihood in a generative model and compute marginal posteriors for each part using belief propagation. Thirdly, non-Gaussian relationships between parts are represented as Gaussians in the coordinate system of the joint between the parts. Additionally, in order to cope with shortcomings of tree-based pictorial structures models, we augment our model with additional repulsive factors in order to discourage overcounting of image evidence. We demonstrate that the combination of these components within the pictorial structures framework results in a generic model that yields state-of-the-art performance for several datasets on a variety of tasks: people detection, upper body pose estimation, and full body pose estimation.

80 citations


Cites background or methods from "Distinctive Image Features from Sca..."

  • ...(6) Boosted Part Detectors. In our model, we represent the image evidence E by a densely computed grid of local image descriptors (e.g., shape context (Belongie et al. 2001) or SIFT (Lowe 2004); see Sect....

  • ...In particular, we compute dense appearance representations based on local image descriptors [4,33,35], and use AdaBoost [19] to train discriminative part classifiers....

  • ...The first interesting outcome of this experiment is that the original SIFT descriptor did not perform well compared to the results obtained with shape context....

  • ...On the other hand, it shows that SIFT- and HOG-based detectors fail to benefit from a richer image description, which is perhaps due to the fact that properties such as texture do not generalize well across object instances....

  • ...We compare the performance of shape context descriptors as previously used in [2] with SIFT descriptors [33], and edge templates obtained using the code from [38] and integrated into our pose estimation framework....

References
Journal ArticleDOI
TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
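The Hough-transform clustering stage mentioned in the abstract can be illustrated with a coarse voting scheme: each tentative match votes for a bin in pose space (translation, scale, orientation), and bins collecting several consistent votes become candidate object hypotheses. A toy sketch (the bin sizes and match tuple format are illustrative, not the paper's exact parameters):

```python
import math
from collections import Counter

def hough_pose_votes(matches, loc_bin=32.0, scale_bin=2.0, ori_bin=30.0):
    """Coarse Hough voting over pose bins. Each match is a tuple
    (dx, dy, scale_ratio, dtheta_deg) relating a model feature to an image
    feature; matches that agree on an object pose fall into the same bin."""
    votes = Counter()
    for dx, dy, scale, theta in matches:
        key = (round(dx / loc_bin), round(dy / loc_bin),
               round(math.log(scale, scale_bin)),     # scale binned in octaves
               round(theta / ori_bin))
        votes[key] += 1
    return votes                                      # bins with several votes are pose candidates
```

In the full pipeline, each high-vote bin is then verified with a least-squares fit of the pose parameters, discarding outlier matches.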

46,906 citations

Proceedings ArticleDOI
20 Sep 1999
TL;DR: Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.
Abstract: An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low residual least squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.

16,989 citations

Proceedings ArticleDOI
01 Jan 1988
TL;DR: The problem the authors are addressing in Alvey Project MMI149 is that of using computer vision to understand the unconstrained 3D world, in which the viewed scenes will in general contain too wide a diversity of objects for topdown recognition techniques to work.
Abstract: The problem we are addressing in Alvey Project MMI149 is that of using computer vision to understand the unconstrained 3D world, in which the viewed scenes will in general contain too wide a diversity of objects for topdown recognition techniques to work. For example, we desire to obtain an understanding of natural scenes, containing roads, buildings, trees, bushes, etc., as typified by the two frames from a sequence illustrated in Figure 1. The solution to this problem that we are pursuing is to use a computer vision system based upon motion analysis of a monocular image sequence from a mobile camera. By extraction and tracking of image features, representations of the 3D analogues of these features can be constructed.
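This reference is best known for the corner measure it introduced, commonly written R = det(M) - k * trace(M)^2, where M is the windowed second-moment matrix of image gradients. A minimal sketch (the gradient operator, window, and constant k = 0.04 are common illustrative choices, not necessarily the original implementation):

```python
import numpy as np

def harris_response(img, k=0.04, win=3):
    """Harris corner response R = det(M) - k * trace(M)^2, with M the
    windowed second-moment matrix of image gradients."""
    Iy, Ix = np.gradient(img.astype(float))           # finite-difference gradients
    Ixx, Iyy, Ixy = Ix * Ix, Iy * Iy, Ix * Iy

    def box(a):                                       # brute-force win x win box sum
        r = win // 2
        out = np.zeros_like(a)
        for i in range(a.shape[0]):
            for j in range(a.shape[1]):
                out[i, j] = a[max(0, i - r):i + r + 1, max(0, j - r):j + r + 1].sum()
        return out

    Sxx, Syy, Sxy = box(Ixx), box(Iyy), box(Ixy)
    det = Sxx * Syy - Sxy * Sxy
    trace = Sxx + Syy
    return det - k * trace * trace                    # large positive at corners, negative on edges
```

On a synthetic image containing a bright square, the response peaks at the square's corners, while edge pixels score negatively.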

13,993 citations

Journal ArticleDOI
TL;DR: It is observed that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best and Moments and steerable filters show the best performance among the low dimensional descriptors.
Abstract: In this paper, we compare the performance of descriptors computed for local interest regions, as, for example, extracted by the Harris-Affine detector [Mikolajczyk, K and Schmid, C, 2004]. Many different descriptors have been proposed in the literature. It is unclear which descriptors are more appropriate and how their performance depends on the interest region detector. The descriptors should be distinctive and at the same time robust to changes in viewing conditions as well as to errors of the detector. Our evaluation uses as criterion recall with respect to precision and is carried out for different image transformations. We compare shape context [Belongie, S, et al., April 2002], steerable filters [Freeman, W and Adelson, E, Sept. 1991], PCA-SIFT [Ke, Y and Sukthankar, R, 2004], differential invariants [Koenderink, J and van Doorn, A, 1987], spin images [Lazebnik, S, et al., 2003], SIFT [Lowe, D. G., 1999], complex filters [Schaffalitzky, F and Zisserman, A, 2002], moment invariants [Van Gool, L, et al., 1996], and cross-correlation for different types of interest regions. We also propose an extension of the SIFT descriptor and show that it outperforms the original method. Furthermore, we observe that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best. Moments and steerable filters show the best performance among the low dimensional descriptors.
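The evaluation criterion, recall with respect to 1-precision as the matching threshold varies, can be computed directly from a list of candidate matches with ground-truth correctness flags. A small sketch (function and variable names are ours):

```python
def recall_precision_curve(distances, labels):
    """Recall vs. 1-precision as the match-distance threshold grows.
    distances: distance of each candidate match; labels: True when the
    match is correct according to ground truth."""
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    total_correct = sum(labels)
    curve, true_pos = [], 0
    for n, i in enumerate(order, start=1):           # admit matches one by one
        true_pos += labels[i]
        recall = true_pos / total_correct
        curve.append((1 - true_pos / n, recall))     # (1-precision, recall)
    return curve
```

A descriptor is better when its curve reaches high recall at low 1-precision, i.e., it recovers most correct matches before admitting false ones.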

7,057 citations

Journal ArticleDOI
TL;DR: The high utility of MSERs, multiple measurement regions and the robust metric is demonstrated in wide-baseline experiments on image pairs from both indoor and outdoor scenes.

3,422 citations
