Distinctive Image Features from Scale-Invariant Keypoints

01 Jan 2011
TL;DR: The Scale-Invariant Feature Transform (SIFT) algorithm is a highly robust method for extracting and subsequently matching distinctive invariant features from images, which can then be used to reliably match objects across differing images.
Abstract: The Scale-Invariant Feature Transform (or SIFT) algorithm is a highly robust method to extract and subsequently match distinctive invariant features from images. These features can then be used to reliably match objects in differing images. The algorithm was first proposed by Lowe [12] and further developed to increase performance, resulting in the classic paper [13] that serves as the foundation for SIFT, which has played an important role in robotic and machine vision over the past decade.
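
For readers who want to try the pipeline, here is a minimal sketch using OpenCV's bundled SIFT implementation (cv2.SIFT_create, available in opencv-python 4.4+); the image paths are placeholders, and the cross-check matcher is one simple matching choice among several:

    import cv2

    # Placeholder image paths; load both images in grayscale.
    img1 = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)

    sift = cv2.SIFT_create()

    # Detect scale-invariant keypoints and compute 128-D descriptors.
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # Brute-force matching with L2 distance; cross-checking keeps only
    # mutual nearest neighbors, a simple way to reject weak matches.
    matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
    print(f"{len(matches)} putative matches")
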
Citations
Posted Content
TL;DR: In this article, the authors proposed shape Dynamic Time Warping (shapeDTW), which enhances DTW by taking point-wise local structural information into consideration, and applied shapeDTW to align audio signal pairs having ground-truth alignments, as well as artificially simulated pairs of aligned sequences.
Abstract: Dynamic Time Warping (DTW) is an algorithm to align temporal sequences with possible local non-linear distortions, and has been widely applied to audio, video and graphics data alignments. DTW is essentially a point-to-point matching method under some boundary and temporal consistency constraints. Although DTW obtains a global optimal solution, it does not necessarily achieve locally sensible matchings. Concretely, two temporal points with entirely dissimilar local structures may be matched by DTW. To address this problem, we propose an improved alignment algorithm, named shape Dynamic Time Warping (shapeDTW), which enhances DTW by taking point-wise local structural information into consideration. shapeDTW is inherently a DTW algorithm, but additionally attempts to pair locally similar structures and to avoid matching points with distinct neighborhood structures. We apply shapeDTW to align audio signal pairs having ground-truth alignments, as well as artificially simulated pairs of aligned sequences, and obtain quantitatively much lower alignment errors than DTW and its two variants. When shapeDTW is used as a distance measure in a nearest neighbor classifier (NN-shapeDTW) to classify time series, it beats DTW on 64 out of 84 UCR time series datasets, with significantly improved classification accuracies. By using a properly designed local structure descriptor, shapeDTW improves accuracies by more than 10% on 18 datasets. To the best of our knowledge, shapeDTW is the first distance measure under the nearest neighbor classifier scheme to significantly outperform DTW, which had been widely recognized as the best distance measure to date. Our code is publicly accessible at: this https URL.
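
Since shapeDTW builds directly on the DTW recurrence, a plain DTW reference implementation helps fix ideas. The sketch below is the straightforward O(nm) dynamic program; shapeDTW would replace the point-wise distance with a distance between local shape descriptors (the descriptor construction is the paper's contribution and is not reproduced here):

    import numpy as np

    def dtw(x, y, dist=lambda a, b: abs(a - b)):
        """Classic DTW cost between 1-D sequences x and y.

        Dynamic program over an (n+1) x (m+1) cost matrix with the
        usual boundary and monotonic step constraints.
        """
        n, m = len(x), len(y)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = dist(x[i - 1], y[j - 1])
                # Allowed predecessors: match, insertion, deletion.
                D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
        return D[n, m]

    print(dtw([1, 2, 3, 4], [1, 1, 2, 3, 4]))  # 0.0: a perfect alignment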

104 citations


Cites methods from "Distinctive Image Features from Sca..."

  • ...In early days, raw image patches were used as point descriptors [1], and now more powerful descriptors like SIFT [27] are widely adopted since they capture local image structures very well and are invariant to image scale and rotation....

  • ...They introduce a SIFT-like feature point detector and descriptor to detect and match salient feature points from two sequences first, and then use matched point pairs to regularize the search scope of the warping path....

Journal ArticleDOI
TL;DR: New, state-of-the-art results are obtained, implying that CNNs based on the proposed transfer learning methods and data augmentation techniques can identify modalities of medical images more efficiently.
Abstract: Medical images are valuable for clinical diagnosis and decision making. Identifying the image modality is an important first step, as it helps clinicians access the required medical images in retrieval systems. Traditional methods of modality classification depend on the choice of hand-crafted features and demand a clear awareness of prior domain knowledge. Feature learning approaches may efficiently detect the visual characteristics of different modalities, but they are limited by the amount of labeled training data. To overcome the absence of labeled data, on the one hand, we take deep convolutional neural networks (VGGNet, ResNet) of different depths pre-trained on ImageNet, fix most of the earlier layers to preserve the generic features of natural images, and train only their higher-level portion on ImageCLEF to learn domain-specific features of medical figures. We then train from scratch deep CNNs with only six weight layers to capture more domain-specific features. On the other hand, we employ two data augmentation methods to help the CNNs realize their full potential in characterizing image modality features. The final prediction is given by our voting system based on the outputs of three CNNs. After evaluating our proposed model on the subfigure classification task in ImageCLEF2015 and ImageCLEF2016, we obtain new, state-of-the-art results—76.87% in ImageCLEF2015 and 87.37% in ImageCLEF2016—which imply that CNNs, based on our proposed transfer learning methods and data augmentation techniques, can identify modalities of medical images more efficiently.
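
As a rough illustration of the freeze-early-layers strategy the abstract describes, here is a minimal PyTorch sketch. The VGG16 backbone matches one of the paper's choices, while num_modalities and retraining only the final classifier layer are simplifying assumptions:

    import torch.nn as nn
    from torchvision import models

    num_modalities = 30  # hypothetical number of modality classes

    # VGG16 pre-trained on ImageNet (torchvision 0.13+ weights API).
    model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

    # Freeze the convolutional stack to preserve the generic features
    # learned from natural images.
    for p in model.features.parameters():
        p.requires_grad = False

    # Swap in a new head and train only this domain-specific part
    # on the medical-figure dataset.
    model.classifier[6] = nn.Linear(4096, num_modalities)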

104 citations


Cites methods from "Distinctive Image Features from Sca..."

  • ...De Herrera et al. [9] combine SIFT (Scale Invariant Feature Transform) [29] with BoC (Bag-of-Colors) [30] features to represent medical images....

Book ChapterDOI
17 May 2010
TL;DR: A wearable sensor device equipped with a camera, a microphone, and an accelerometer and attached to a user's wrist is used to recognize activities of daily living (ADLs).
Abstract: This paper describes how we recognize activities of daily living (ADLs) with our designed sensor device, which is equipped with heterogeneous sensors such as a camera, a microphone, and an accelerometer and is attached to a user's wrist. Specifically, capturing the space around the user's hand with the camera on the wrist-mounted device enables us to recognize ADLs that involve the manual use of objects, such as making tea or coffee and watering plants. Existing wearable sensor devices equipped only with a microphone and an accelerometer cannot recognize these ADLs without object-embedded sensors. We also propose an ADL recognition method that takes privacy issues into account, because the camera and microphone can capture aspects of a user's private life. We confirmed experimentally that the incorporation of a camera could significantly improve the accuracy of ADL recognition.

104 citations


Cites background or methods from "Distinctive Image Features from Sca..."

  • ...To cope with such scalability problems, we should extract more detailed features such as SIFT features [18] from ‘good’ images, e....

  • ...Many studies try to detect objects from images while taking occlusion, rotation, scale, and blur into account [27, 18]....

Proceedings ArticleDOI
23 Jun 2014
TL;DR: This work proposes a principled algorithm -- Image Transformation Pursuit (ITP) -- for the automatic selection of a compact set of transformations, by selecting at each iteration the one that yields the highest accuracy gain.
Abstract: A simple approach to learning invariances in image classification consists in augmenting the training set with transformed versions of the original images. However, given a large set of possible transformations, selecting a compact subset is challenging. Indeed, not all transformations are equally informative, and adding uninformative transformations increases training time with no gain in accuracy. We propose a principled algorithm--Image Transformation Pursuit (ITP)--for the automatic selection of a compact set of transformations. ITP works in a greedy fashion, by selecting at each iteration the one that yields the highest accuracy gain. ITP also allows one to efficiently explore complex transformations that combine basic transformations. We report results on two public benchmarks: the CUB dataset of bird images and the ImageNet 2010 challenge. Using Fisher Vector representations, we achieve an improvement from 28.2% to 45.2% in top-1 accuracy on CUB, and an improvement from 70.1% to 74.9% in top-5 accuracy on ImageNet. We also show significant improvements for deep convnet features: from 47.3% to 55.4% on CUB and from 77.9% to 81.4% on ImageNet.
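
The greedy loop at the heart of ITP can be sketched in a few lines. In the sketch below, the evaluate callback (train a classifier with the given transformation set and return validation accuracy) and the stopping rule are illustrative assumptions; the paper's cheaper variants avoid retraining from scratch at every step:

    def image_transformation_pursuit(candidates, evaluate, budget):
        """Greedy selection in the spirit of ITP.

        candidates: transformation names, e.g. ["flip", "crop5", ...].
        evaluate:   callable taking a list of transformations and
                    returning validation accuracy (assumed to train a
                    classifier under the hood).
        budget:     maximum number of transformations to select.
        """
        selected = []
        best_acc = evaluate(selected)
        for _ in range(budget):
            gains = {t: evaluate(selected + [t])
                     for t in candidates if t not in selected}
            if not gains:
                break
            t_best, acc = max(gains.items(), key=lambda kv: kv[1])
            if acc <= best_acc:  # no remaining transformation helps
                break
            selected.append(t_best)
            best_acc = acc
        return selected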

103 citations


Cites methods from "Distinctive Image Features from Sca..."

  • ...This is an interesting find[ing]... [Figure 8: Test accuracy as a function of the number of SGD iterations on CUB (left) and ILSVRC-30 (right), with SIFT.]

  • ...ITP itself performs at least on par with the TR variant, and sometimes significantly better (see SIFT FVs on CUB or color FVs on ILSVRC-30)....

  • ...[Figure 7: First five selected transformations by ITP (left); overlaying the different crops selected by 2-ITP-S for the SIFT channel (right). The selections, in order: CUB/SIFT: crop5, flip, crop1, crop6, crop0; CUB/Color: crop1, crop5, crop6, crop8, scale0; ILSVRC-30/SIFT: flip, crop0, homo2, crop6, crop1; ILSVRC-30/Color: crop2, color1, flip, crop1, color0.] ...For each train[ing] and test image, we extract local descriptors from all its transformed versions and aggregate them in a single FV....

  • ...[Figure 6: Evolution of the test accuracy on CUB (left) and ILSVRC-30 (right) as a function of the number of transformations selected by ITP, or its cheaper TR variant.]

  • ...We extract SIFT [17] and color [8] descriptors on a dense grid at multiple scales....

Proceedings ArticleDOI
01 Dec 2013
TL;DR: The Deformable Mixture Parsing Model (DMPM) directly solves the problem of human parsing by searching for the best graph configuration from a pool of Parselet hypotheses, without intermediate tasks.
Abstract: In this work, we address the problem of human parsing, namely partitioning the human body into semantic regions, by using the novel Parselet representation. Previous works often consider solving the problem of human pose estimation as the prerequisite of human parsing. We argue that these approaches cannot obtain optimal pixel-level parsing due to the inconsistent targets between these tasks. In this paper, we propose to use Parselets as the building blocks of our parsing model. Parselets are a group of parsable segments which can generally be obtained by low-level over-segmentation algorithms and bear strong semantic meaning. We then build a Deformable Mixture Parsing Model (DMPM) for human parsing to simultaneously handle the deformation and multi-modalities of Parselets. The proposed model has two unique characteristics: (1) the possible numerous modalities of Parselet ensembles are exhibited as the "And-Or" structure of sub-trees; (2) to further solve the practical problem of Parselet occlusion or absence, we directly model the visibility property at some leaf nodes. The DMPM thus directly solves the problem of human parsing by searching for the best graph configuration from a pool of Parselet hypotheses without intermediate tasks. Comprehensive evaluations demonstrate the encouraging performance of the proposed approach.

103 citations


Cites background or methods from "Distinctive Image Features from Sca..."

  • ...Sappearance(a, b) is defined as the χ2 distance of the color and SIFT [23] histogram of segments a and b [29]....

  • ...Implementation Details: We extract dense SIFT [23], HOG [9] and color moment as low-level features for Parselets....

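The appearance term Sappearance(a, b) quoted above is the standard χ² distance between histograms (here, color and bag-of-SIFT-words histograms of the two segments). A minimal numpy sketch follows; the 0.5 factor and the L1 normalization are common conventions rather than details given in the snippet:

    import numpy as np

    def chi2_distance(h1, h2, eps=1e-10):
        """Chi-squared distance between two histograms."""
        h1 = h1 / (h1.sum() + eps)  # L1-normalize (a common convention)
        h2 = h2 / (h2.sum() + eps)
        return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))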

References
Journal ArticleDOI
TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
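
Two pieces of this pipeline are easy to demonstrate: Lowe's distance-ratio test for rejecting ambiguous matches, and a geometric verification stage. In the sketch below the 0.8 ratio threshold follows the paper, but RANSAC homography fitting is substituted for the paper's Hough-transform clustering and least-squares pose solution, and the inlier cutoff is an arbitrary illustrative choice:

    import cv2
    import numpy as np

    def find_object(kp_obj, des_obj, kp_scene, des_scene):
        # Nearest-neighbor matching with Lowe's ratio test: accept a
        # match only if it is clearly better than the runner-up.
        matcher = cv2.BFMatcher(cv2.NORM_L2)
        good = [m for m, n in matcher.knnMatch(des_obj, des_scene, k=2)
                if m.distance < 0.8 * n.distance]
        if len(good) < 4:
            return None  # not enough matches to fit a transform
        src = np.float32([kp_obj[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
        dst = np.float32([kp_scene[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
        # RANSAC homography as a stand-in for Hough clustering followed
        # by least-squares pose verification.
        H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
        if H is None or mask.sum() < 10:  # illustrative inlier cutoff
            return None
        return H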

46,906 citations

Proceedings ArticleDOI
20 Sep 1999
TL;DR: Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.
Abstract: An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low residual least squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.
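
The "staged filtering approach that identifies stable points in scale space" starts from a difference-of-Gaussians (DoG) stack, whose space-scale extrema are the candidate keypoints. A minimal scipy sketch, where the σ schedule and scale count are illustrative defaults rather than the paper's exact settings:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def dog_stack(img, sigma0=1.6, num_scales=5, k=2 ** 0.5):
        """Difference-of-Gaussians: subtract adjacent Gaussian blurs."""
        sigmas = [sigma0 * k ** i for i in range(num_scales)]
        blurred = [gaussian_filter(img.astype(float), s) for s in sigmas]
        return np.stack([b2 - b1 for b1, b2 in zip(blurred, blurred[1:])])

    # Candidate keypoints: pixels larger (or smaller) than all 26
    # neighbors in the 3x3x3 space-scale neighborhood of the stack.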

16,989 citations

Proceedings ArticleDOI
01 Jan 1988
TL;DR: The problem the authors are addressing in Alvey Project MMI149 is that of using computer vision to understand the unconstrained 3D world, in which the viewed scenes will in general contain too wide a diversity of objects for top-down recognition techniques to work.
Abstract: The problem we are addressing in Alvey Project MMI149 is that of using computer vision to understand the unconstrained 3D world, in which the viewed scenes will in general contain too wide a diversity of objects for top-down recognition techniques to work. For example, we desire to obtain an understanding of natural scenes, containing roads, buildings, trees, bushes, etc., as typified by the two frames from a sequence illustrated in Figure 1. The solution to this problem that we are pursuing is to use a computer vision system based upon motion analysis of a monocular image sequence from a mobile camera. By extraction and tracking of image features, representations of the 3D analogues of these features can be constructed.

13,993 citations

Journal ArticleDOI
TL;DR: It is observed that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best and Moments and steerable filters show the best performance among the low dimensional descriptors.
Abstract: In this paper, we compare the performance of descriptors computed for local interest regions, as, for example, extracted by the Harris-Affine detector [Mikolajczyk, K and Schmid, C, 2004]. Many different descriptors have been proposed in the literature. It is unclear which descriptors are more appropriate and how their performance depends on the interest region detector. The descriptors should be distinctive and at the same time robust to changes in viewing conditions as well as to errors of the detector. Our evaluation uses as criterion recall with respect to precision and is carried out for different image transformations. We compare shape context [Belongie, S, et al., April 2002], steerable filters [Freeman, W and Adelson, E, Sept. 1991], PCA-SIFT [Ke, Y and Sukthankar, R, 2004], differential invariants [Koenderink, J and van Doorn, A, 1987], spin images [Lazebnik, S, et al., 2003], SIFT [Lowe, D. G., 1999], complex filters [Schaffalitzky, F and Zisserman, A, 2002], moment invariants [Van Gool, L, et al., 1996], and cross-correlation for different types of interest regions. We also propose an extension of the SIFT descriptor and show that it outperforms the original method. Furthermore, we observe that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best. Moments and steerable filters show the best performance among the low dimensional descriptors.
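
The evaluation criterion ("recall with respect to precision") amounts to sweeping a descriptor-distance threshold and, at each value, counting correct and false matches against ground truth. A sketch of that bookkeeping; the array names are assumptions, and the full protocol (e.g. the region-overlap criterion that defines a correct match) is in the paper:

    import numpy as np

    def recall_vs_precision(dists, is_correct, thresholds):
        """For each distance threshold, return (recall, 1 - precision).

        dists:      distance of each putative descriptor match.
        is_correct: boolean array; True where ground truth says the
                    match is correct.
        """
        dists = np.asarray(dists)
        is_correct = np.asarray(is_correct, dtype=bool)
        total_correct = is_correct.sum()
        recall, one_minus_precision = [], []
        for t in thresholds:
            accepted = dists <= t
            tp = (accepted & is_correct).sum()
            recall.append(tp / max(total_correct, 1))
            one_minus_precision.append(
                (accepted & ~is_correct).sum() / max(accepted.sum(), 1))
        return np.array(recall), np.array(one_minus_precision)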

7,057 citations

Journal ArticleDOI
TL;DR: The high utility of MSERs, multiple measurement regions and the robust metric is demonstrated in wide-baseline experiments on image pairs from both indoor and outdoor scenes.

3,422 citations
