Distinctive Image Features from Scale-Invariant Keypoints

01 Jan 2011
TL;DR: The Scale-Invariant Feature Transform (SIFT) algorithm is a highly robust method for extracting and subsequently matching distinctive invariant features from images, which can then be used to reliably match objects across differing images.
Abstract: The Scale-Invariant Feature Transform (or SIFT) algorithm is a highly robust method to extract and subsequently match distinctive invariant features from images. These features can then be used to reliably match objects in differing images. The algorithm was first proposed by Lowe [12] and further developed to increase performance, resulting in the classic paper [13] that serves as the foundation for SIFT, which has played an important role in robotic and machine vision over the past decade.
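
For readers who want to try the pipeline, here is a minimal sketch using OpenCV's bundled SIFT implementation (cv2.SIFT_create, available in opencv-python 4.4+); the image paths are placeholders, and the cross-check matcher is one simple matching choice among several:

    import cv2

    # Placeholder image paths; load both images in grayscale.
    img1 = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)

    sift = cv2.SIFT_create()

    # Detect scale-invariant keypoints and compute 128-D descriptors.
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # Brute-force matching with L2 distance; cross-checking keeps only
    # mutual nearest neighbors, a simple way to reject weak matches.
    matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
    print(f"{len(matches)} putative matches")
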
Citations
Posted Content
TL;DR: In this article, the authors proposed shape Dynamic Time Warping (shapeDTW), which enhances DTW by taking point-wise local structural information into consideration, and applied shapeDTW to align audio signal pairs having ground-truth alignments, as well as artificially simulated pairs of aligned sequences.
Abstract: Dynamic Time Warping (DTW) is an algorithm to align temporal sequences with possible local non-linear distortions, and has been widely applied to audio, video and graphics data alignments. DTW is essentially a point-to-point matching method under some boundary and temporal consistency constraints. Although DTW obtains a global optimal solution, it does not necessarily achieve locally sensible matchings. Concretely, two temporal points with entirely dissimilar local structures may be matched by DTW. To address this problem, we propose an improved alignment algorithm, named shape Dynamic Time Warping (shapeDTW), which enhances DTW by taking point-wise local structural information into consideration. shapeDTW is inherently a DTW algorithm, but additionally attempts to pair locally similar structures and to avoid matching points with distinct neighborhood structures. We apply shapeDTW to align audio signal pairs having ground-truth alignments, as well as artificially simulated pairs of aligned sequences, and obtain quantitatively much lower alignment errors than DTW and its two variants. When shapeDTW is used as a distance measure in a nearest neighbor classifier (NN-shapeDTW) to classify time series, it beats DTW on 64 out of 84 UCR time series datasets, with significantly improved classification accuracies. By using a properly designed local structure descriptor, shapeDTW improves accuracies by more than 10% on 18 datasets. To the best of our knowledge, shapeDTW is the first distance measure under the nearest neighbor classifier scheme to significantly outperform DTW, which had been widely recognized as the best distance measure to date. Our code is publicly accessible at: this https URL.
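
Since shapeDTW builds directly on the DTW recurrence, a plain DTW reference implementation helps fix ideas. The sketch below is the straightforward O(nm) dynamic program; shapeDTW would replace the point-wise distance with a distance between local shape descriptors (the descriptor construction is the paper's contribution and is not reproduced here):

    import numpy as np

    def dtw(x, y, dist=lambda a, b: abs(a - b)):
        """Classic DTW cost between 1-D sequences x and y.

        Dynamic program over an (n+1) x (m+1) cost matrix with the
        usual boundary and monotonic step constraints.
        """
        n, m = len(x), len(y)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = dist(x[i - 1], y[j - 1])
                # Allowed predecessors: match, insertion, deletion.
                D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
        return D[n, m]

    print(dtw([1, 2, 3, 4], [1, 1, 2, 3, 4]))  # 0.0: a perfect alignment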

104 citations


Cites methods from "Distinctive Image Features from Sca..."

  • ...In early days, raw image patches were used as point descriptors [1], and now more powerful descriptors like SIFT [27] are widely adopted since they capture local image structures very well and are invariant to image scale and rotation....

  • ...They introduce a SIFT-like feature point detector and descriptor to detect and match salient feature points from two sequences first, and then use matched point pairs to regularize the search scope of the warping path....

Journal ArticleDOI
TL;DR: New, state-of-the-art results are obtained, implying that CNNs based on the proposed transfer learning methods and data augmentation techniques can identify modalities of medical images more efficiently.
Abstract: Medical images are valuable for clinical diagnosis and decision making. Identifying the image modality is an important first step, as it helps clinicians access the required medical images in retrieval systems. Traditional methods of modality classification depend on the choice of hand-crafted features and demand a clear awareness of prior domain knowledge. Feature learning approaches may efficiently detect the visual characteristics of different modalities, but they are limited by the amount of labeled training data. To overcome the absence of labeled data, on the one hand, we take deep convolutional neural networks (VGGNet, ResNet) of different depths pre-trained on ImageNet, fix most of the earlier layers to preserve the generic features of natural images, and train only their higher-level portion on ImageCLEF to learn domain-specific features of medical figures. We then train from scratch deep CNNs with only six weight layers to capture more domain-specific features. On the other hand, we employ two data augmentation methods to help the CNNs realize their full potential in characterizing image modality features. The final prediction is given by our voting system based on the outputs of three CNNs. After evaluating our proposed model on the subfigure classification task in ImageCLEF2015 and ImageCLEF2016, we obtain new, state-of-the-art results—76.87% in ImageCLEF2015 and 87.37% in ImageCLEF2016—which imply that CNNs, based on our proposed transfer learning methods and data augmentation techniques, can identify modalities of medical images more efficiently.
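
As a rough illustration of the freeze-early-layers strategy the abstract describes, here is a minimal PyTorch sketch. The VGG16 backbone matches one of the paper's choices, while num_modalities and retraining only the final classifier layer are simplifying assumptions:

    import torch.nn as nn
    from torchvision import models

    num_modalities = 30  # hypothetical number of modality classes

    # VGG16 pre-trained on ImageNet (torchvision 0.13+ weights API).
    model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

    # Freeze the convolutional stack to preserve the generic features
    # learned from natural images.
    for p in model.features.parameters():
        p.requires_grad = False

    # Swap in a new head and train only this domain-specific part
    # on the medical-figure dataset.
    model.classifier[6] = nn.Linear(4096, num_modalities)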

104 citations


Cites methods from "Distinctive Image Features from Sca..."

  • ...De Herrera et al. [9] combine SIFT (Scale Invariant Feature Transform) [29] with BoC (Bag-of-Colors) [30] features to represent medical images....

Book ChapterDOI
17 May 2010
TL;DR: A wearable sensor device equipped with a camera, a microphone, and an accelerometer and attached to a user's wrist is used to recognize activities of daily living (ADLs).
Abstract: This paper describes how we recognize activities of daily living (ADLs) with our designed sensor device, which is equipped with heterogeneous sensors such as a camera, a microphone, and an accelerometer and is attached to a user's wrist. Specifically, capturing the space around the user's hand with the camera on the wrist-mounted device enables us to recognize ADLs that involve the manual use of objects, such as making tea or coffee and watering plants. Existing wearable sensor devices equipped only with a microphone and an accelerometer cannot recognize these ADLs without object-embedded sensors. We also propose an ADL recognition method that takes privacy issues into account, because the camera and microphone can capture aspects of a user's private life. We confirmed experimentally that the incorporation of a camera could significantly improve the accuracy of ADL recognition.

104 citations


Cites background or methods from "Distinctive Image Features from Sca..."

  • ...To cope with such scalability problems, we should extract more detailed features such as SIFT features [18] from ‘good’ images, e....

  • ...Many studies try to detect objects from images while taking occlusion, rotation, scale, and blur into account [27, 18]....

Proceedings ArticleDOI
23 Jun 2014
TL;DR: This work proposes a principled algorithm -- Image Transformation Pursuit (ITP) -- for the automatic selection of a compact set of transformations, by selecting at each iteration the one that yields the highest accuracy gain.
Abstract: A simple approach to learning invariances in image classification consists in augmenting the training set with transformed versions of the original images. However, given a large set of possible transformations, selecting a compact subset is challenging. Indeed, not all transformations are equally informative, and adding uninformative transformations increases training time with no gain in accuracy. We propose a principled algorithm--Image Transformation Pursuit (ITP)--for the automatic selection of a compact set of transformations. ITP works in a greedy fashion, by selecting at each iteration the one that yields the highest accuracy gain. ITP also allows one to efficiently explore complex transformations that combine basic transformations. We report results on two public benchmarks: the CUB dataset of bird images and the ImageNet 2010 challenge. Using Fisher Vector representations, we achieve an improvement from 28.2% to 45.2% in top-1 accuracy on CUB, and an improvement from 70.1% to 74.9% in top-5 accuracy on ImageNet. We also show significant improvements for deep convnet features: from 47.3% to 55.4% on CUB and from 77.9% to 81.4% on ImageNet.
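
The greedy loop at the heart of ITP can be sketched in a few lines. In the sketch below, the evaluate callback (train a classifier with the given transformation set and return validation accuracy) and the stopping rule are illustrative assumptions; the paper's cheaper variants avoid retraining from scratch at every step:

    def image_transformation_pursuit(candidates, evaluate, budget):
        """Greedy selection in the spirit of ITP.

        candidates: transformation names, e.g. ["flip", "crop5", ...].
        evaluate:   callable taking a list of transformations and
                    returning validation accuracy (assumed to train a
                    classifier under the hood).
        budget:     maximum number of transformations to select.
        """
        selected = []
        best_acc = evaluate(selected)
        for _ in range(budget):
            gains = {t: evaluate(selected + [t])
                     for t in candidates if t not in selected}
            if not gains:
                break
            t_best, acc = max(gains.items(), key=lambda kv: kv[1])
            if acc <= best_acc:  # no remaining transformation helps
                break
            selected.append(t_best)
            best_acc = acc
        return selected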

103 citations


Cites methods from "Distinctive Image Features from Sca..."

  • ...This is an interesting find[ing]... [Figure 8: Test accuracy as a function of the number of SGD iterations on CUB (left) and ILSVRC-30 (right), with SIFT.]

  • ...ITP itself performs at least on par with the TR variant, and sometimes significantly better (see SIFT FVs on CUB or color FVs on ILSVRC-30)....

  • ...[Figure 7: First five selected transformations by ITP (left); overlaying the different crops selected by 2-ITP-S for the SIFT channel (right). The selections, in order: CUB/SIFT: crop5, flip, crop1, crop6, crop0; CUB/Color: crop1, crop5, crop6, crop8, scale0; ILSVRC-30/SIFT: flip, crop0, homo2, crop6, crop1; ILSVRC-30/Color: crop2, color1, flip, crop1, color0.] ...For each train[ing] and test image, we extract local descriptors from all its transformed versions and aggregate them in a single FV....

  • ...[Figure 6: Evolution of the test accuracy on CUB (left) and ILSVRC-30 (right) as a function of the number of transformations selected by ITP, or its cheaper TR variant.]

  • ...We extract SIFT [17] and color [8] descriptors on a dense grid at multiple scales....

Proceedings ArticleDOI
01 Dec 2013
TL;DR: The Deformable Mixture Parsing Model (DMPM) directly solves the problem of human parsing by searching for the best graph configuration from a pool of Parselet hypotheses, without intermediate tasks.
Abstract: In this work, we address the problem of human parsing, namely partitioning the human body into semantic regions, by using the novel Parselet representation. Previous works often consider solving the problem of human pose estimation as the prerequisite of human parsing. We argue that these approaches cannot obtain optimal pixel-level parsing due to the inconsistent targets between these tasks. In this paper, we propose to use Parselets as the building blocks of our parsing model. Parselets are a group of parsable segments which can generally be obtained by low-level over-segmentation algorithms and bear strong semantic meaning. We then build a Deformable Mixture Parsing Model (DMPM) for human parsing to simultaneously handle the deformation and multi-modalities of Parselets. The proposed model has two unique characteristics: (1) the possible numerous modalities of Parselet ensembles are exhibited as the "And-Or" structure of sub-trees; (2) to further solve the practical problem of Parselet occlusion or absence, we directly model the visibility property at some leaf nodes. The DMPM thus directly solves the problem of human parsing by searching for the best graph configuration from a pool of Parselet hypotheses without intermediate tasks. Comprehensive evaluations demonstrate the encouraging performance of the proposed approach.

103 citations


Cites background or methods from "Distinctive Image Features from Sca..."

  • ...Sappearance(a, b) is defined as the χ2 distance of the color and SIFT [23] histogram of segments a and b [29]....

  • ...Implementation Details: We extract dense SIFT [23], HOG [9] and color moment as low-level features for Parselets....

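The appearance term Sappearance(a, b) quoted above is the standard χ² distance between histograms (here, color and bag-of-SIFT-words histograms of the two segments). A minimal numpy sketch follows; the 0.5 factor and the L1 normalization are common conventions rather than details given in the snippet:

    import numpy as np

    def chi2_distance(h1, h2, eps=1e-10):
        """Chi-squared distance between two histograms."""
        h1 = h1 / (h1.sum() + eps)  # L1-normalize (a common convention)
        h2 = h2 / (h2.sum() + eps)
        return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))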

References
Journal ArticleDOI
TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
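
Two pieces of this pipeline are easy to demonstrate: Lowe's distance-ratio test for rejecting ambiguous matches, and a geometric verification stage. In the sketch below the 0.8 ratio threshold follows the paper, but RANSAC homography fitting is substituted for the paper's Hough-transform clustering and least-squares pose solution, and the inlier cutoff is an arbitrary illustrative choice:

    import cv2
    import numpy as np

    def find_object(kp_obj, des_obj, kp_scene, des_scene):
        # Nearest-neighbor matching with Lowe's ratio test: accept a
        # match only if it is clearly better than the runner-up.
        matcher = cv2.BFMatcher(cv2.NORM_L2)
        good = [m for m, n in matcher.knnMatch(des_obj, des_scene, k=2)
                if m.distance < 0.8 * n.distance]
        if len(good) < 4:
            return None  # not enough matches to fit a transform
        src = np.float32([kp_obj[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
        dst = np.float32([kp_scene[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
        # RANSAC homography as a stand-in for Hough clustering followed
        # by least-squares pose verification.
        H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
        if H is None or mask.sum() < 10:  # illustrative inlier cutoff
            return None
        return H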

46,906 citations

Proceedings ArticleDOI
20 Sep 1999
TL;DR: Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.
Abstract: An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low residual least squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.
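
The "staged filtering approach that identifies stable points in scale space" starts from a difference-of-Gaussians (DoG) stack, whose space-scale extrema are the candidate keypoints. A minimal scipy sketch, where the σ schedule and scale count are illustrative defaults rather than the paper's exact settings:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def dog_stack(img, sigma0=1.6, num_scales=5, k=2 ** 0.5):
        """Difference-of-Gaussians: subtract adjacent Gaussian blurs."""
        sigmas = [sigma0 * k ** i for i in range(num_scales)]
        blurred = [gaussian_filter(img.astype(float), s) for s in sigmas]
        return np.stack([b2 - b1 for b1, b2 in zip(blurred, blurred[1:])])

    # Candidate keypoints: pixels larger (or smaller) than all 26
    # neighbors in the 3x3x3 space-scale neighborhood of the stack.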

16,989 citations

Proceedings ArticleDOI
01 Jan 1988
TL;DR: The problem the authors are addressing in Alvey Project MMI149 is that of using computer vision to understand the unconstrained 3D world, in which the viewed scenes will in general contain too wide a diversity of objects for top-down recognition techniques to work.
Abstract: The problem we are addressing in Alvey Project MMI149 is that of using computer vision to understand the unconstrained 3D world, in which the viewed scenes will in general contain too wide a diversity of objects for top-down recognition techniques to work. For example, we desire to obtain an understanding of natural scenes, containing roads, buildings, trees, bushes, etc., as typified by the two frames from a sequence illustrated in Figure 1. The solution to this problem that we are pursuing is to use a computer vision system based upon motion analysis of a monocular image sequence from a mobile camera. By extraction and tracking of image features, representations of the 3D analogues of these features can be constructed.

13,993 citations

Journal ArticleDOI
TL;DR: It is observed that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best and Moments and steerable filters show the best performance among the low dimensional descriptors.
Abstract: In this paper, we compare the performance of descriptors computed for local interest regions, as, for example, extracted by the Harris-Affine detector [Mikolajczyk, K and Schmid, C, 2004]. Many different descriptors have been proposed in the literature. It is unclear which descriptors are more appropriate and how their performance depends on the interest region detector. The descriptors should be distinctive and at the same time robust to changes in viewing conditions as well as to errors of the detector. Our evaluation uses as criterion recall with respect to precision and is carried out for different image transformations. We compare shape context [Belongie, S, et al., April 2002], steerable filters [Freeman, W and Adelson, E, Sept. 1991], PCA-SIFT [Ke, Y and Sukthankar, R, 2004], differential invariants [Koenderink, J and van Doorn, A, 1987], spin images [Lazebnik, S, et al., 2003], SIFT [Lowe, D. G., 1999], complex filters [Schaffalitzky, F and Zisserman, A, 2002], moment invariants [Van Gool, L, et al., 1996], and cross-correlation for different types of interest regions. We also propose an extension of the SIFT descriptor and show that it outperforms the original method. Furthermore, we observe that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best. Moments and steerable filters show the best performance among the low dimensional descriptors.
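
The evaluation criterion ("recall with respect to precision") amounts to sweeping a descriptor-distance threshold and, at each value, counting correct and false matches against ground truth. A sketch of that bookkeeping; the array names are assumptions, and the full protocol (e.g. the region-overlap criterion that defines a correct match) is in the paper:

    import numpy as np

    def recall_vs_precision(dists, is_correct, thresholds):
        """For each distance threshold, return (recall, 1 - precision).

        dists:      distance of each putative descriptor match.
        is_correct: boolean array; True where ground truth says the
                    match is correct.
        """
        dists = np.asarray(dists)
        is_correct = np.asarray(is_correct, dtype=bool)
        total_correct = is_correct.sum()
        recall, one_minus_precision = [], []
        for t in thresholds:
            accepted = dists <= t
            tp = (accepted & is_correct).sum()
            recall.append(tp / max(total_correct, 1))
            one_minus_precision.append(
                (accepted & ~is_correct).sum() / max(accepted.sum(), 1))
        return np.array(recall), np.array(one_minus_precision)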

7,057 citations

Journal ArticleDOI
TL;DR: The high utility of MSERs, multiple measurement regions and the robust metric is demonstrated in wide-baseline experiments on image pairs from both indoor and outdoor scenes.

3,422 citations
