
Distinctive Image Features from Scale-Invariant Keypoints

01 Jan 2011
TL;DR: The Scale-Invariant Feature Transform (SIFT) algorithm is a highly robust method to extract and subsequently match distinctive invariant features from images, which can then be used to reliably match objects across differing images.
Abstract: The Scale-Invariant Feature Transform (or SIFT) algorithm is a highly robust method to extract and subsequently match distinctive invariant features from images. These features can then be used to reliably match objects in differing images. The algorithm was first proposed by Lowe [12] and further developed to increase performance, resulting in the classic paper [13] that served as the foundation for SIFT, which has played an important role in robotic and machine vision in the past decade.
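
As a concrete illustration of the extract-and-match pipeline summarized above, here is a minimal sketch using OpenCV's SIFT implementation (assumes OpenCV 4.4+, where SIFT ships in the main module; the image file names are placeholders, not from the paper):

```python
# Minimal SIFT extract-and-match sketch (OpenCV 4.4+; paths are placeholders).
import cv2

img1 = cv2.imread("object.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)   # keypoints + 128-D descriptors
kp2, des2 = sift.detectAndCompute(img2, None)

# k-NN matching with Lowe's ratio test to discard ambiguous matches.
matcher = cv2.BFMatcher(cv2.NORM_L2)
good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
        if m.distance < 0.75 * n.distance]
print(f"{len(good)} putative matches")
```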
Citations
Proceedings Article
03 Dec 2012
TL;DR: Extensive experiments on real data validate that, given the same length of binary code, SBLSH achieves significant mean squared error reduction in estimating pairwise angular similarity, and SBLSH outperforms SRP-LSH in approximate nearest neighbor (ANN) retrieval experiments.
Abstract: Sign-random-projection locality-sensitive hashing (SRP-LSH) is a probabilistic dimension reduction method which provides an unbiased estimate of angular similarity, yet suffers from the large variance of its estimation. In this work, we propose Super-Bit locality-sensitive hashing (SBLSH). It is easy to implement: it orthogonalizes the random projection vectors in batches, and it is theoretically guaranteed that SBLSH also provides an unbiased estimate of angular similarity, with a smaller variance when the angle to estimate is within (0, π/2]. Extensive experiments on real data validate that, given the same length of binary code, SBLSH achieves significant mean squared error reduction in estimating pairwise angular similarity. Moreover, SBLSH outperforms SRP-LSH in approximate nearest neighbor (ANN) retrieval experiments.
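
To make the contrast between the two schemes concrete, here is a hedged NumPy sketch (illustrative parameters, not the authors' implementation): plain SRP-LSH draws K independent Gaussian directions, while SBLSH orthogonalizes them in batches of the Super-Bit depth N, which requires N to divide K and to be no larger than the data dimension d.

```python
# Hedged sketch of SRP-LSH vs. Super-Bit LSH (illustrative, not the
# authors' code). K = code length, N = Super-Bit depth, d = dimension.
import numpy as np

def srp_projections(d, K, rng):
    # Plain SRP-LSH: K independent Gaussian directions.
    return rng.standard_normal((K, d))

def superbit_projections(d, K, N, rng):
    # SBLSH: orthogonalize the Gaussian directions in batches of N.
    assert K % N == 0 and N <= d
    batches = []
    for _ in range(K // N):
        G = rng.standard_normal((d, N))
        Q, _ = np.linalg.qr(G)          # batch-wise Gram-Schmidt
        batches.append(Q.T)
    return np.vstack(batches)           # shape (K, d)

def code(W, x):
    return (W @ x) >= 0                 # K-bit binary code

def estimate_angle(c1, c2):
    # A single SRP bit differs with probability theta/pi, so pi times the
    # normalized Hamming distance is an unbiased angle estimate; SBLSH
    # keeps the estimator unbiased while shrinking its variance.
    return np.pi * np.mean(c1 != c2)

rng = np.random.default_rng(0)
x, y = rng.standard_normal(128), rng.standard_normal(128)
W_srp = srp_projections(d=128, K=120, rng=rng)
W_sb = superbit_projections(d=128, K=120, N=30, rng=rng)
print(estimate_angle(code(W_srp, x), code(W_srp, y)))
print(estimate_angle(code(W_sb, x), code(W_sb, y)))
```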

112 citations


Cites background from "Distinctive Image Features from Scale-Invariant Keypoints"

  • ...To test the effect of Super-Bit depth N , we fix K = 120 for Photo Tourism SIFT and K = 3000 for MIR-Flickr respectively, and to test the effect of code length K, we fix N = 120 for Photo Tourism SIFT and N = 3000 for MIR-Flickr....


  • ..., bag-of-words representation of documents and images, and histogram-based representations like SIFT local descriptor [20]....


  • ...We conduct the experiment on the following datasets: 1) Photo Tourism patch dataset [26], Notre Dame, which contains 104,106 patches, each of which is represented by a 128D SIFT descriptor (Photo Tourism SIFT); and 2) MIR-Flickr, which contains 25,000 images, each of which is represented by a 3125D bag-of-SIFT-feature histogram. For each dataset, we further conduct a simple preprocessing step as in [12], i.e. mean-centering each data sample, so as to obtain additional mean-centered versions of the above datasets, Photo Tourism SIFT (mean) and MIR-Flickr (mean)....


  • ..., the normalized bag-of-words representation for documents, images, and videos, and the normalized histogram-based local features like SIFT [20]....


  • ...In real applications, many vector representations are limited in non-negative orthant with all vector entries being non-negative, e.g., bag-of-words representation of documents and images, and histogram-based representations like SIFT local descriptor [20]....


Proceedings ArticleDOI
01 May 2017
TL;DR: This paper proposes a novel approach for learning a discriminative holistic image representation which exploits the image content to create a dense and salient scene description and shows that the learnt image representation outperforms off-the-shelf features from the deep networks and hand-crafted features.
Abstract: Visual place recognition under difficult perceptual conditions remains a challenging problem due to changing weather conditions, illumination and seasons. Long-term visual navigation approaches for robot localization should be robust to these dynamics of the environment. Existing methods typically leverage feature descriptions of whole images or image regions from Deep Convolutional Neural Networks. Some approaches also exploit sequential information to alleviate the problem of spatially inconsistent and non-perfect image matches. In this paper, we propose a novel approach for learning a discriminative holistic image representation which exploits the image content to create a dense and salient scene description. These salient descriptions are learnt over a variety of datasets under large perceptual changes. Such an approach enables us to precisely segment the regions of an image which are geometrically stable over large time lags. We combine features from these salient regions and an off-the-shelf holistic representation to form a more robust scene descriptor. We also introduce a semantically labeled dataset which captures extreme perceptual and structural scene dynamics over the course of 3 years. We evaluated our approach with extensive experiments on data collected over several kilometers in Freiburg and show that our learnt image representation outperforms off-the-shelf features from the deep networks and hand-crafted features.
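
The fusion step at the end of this pipeline can be pictured with a small sketch. Everything below is an assumption for illustration (mean pooling plus concatenation and L2 normalization); the paper's exact fusion rule and the feature extractors themselves are elided:

```python
# Illustrative fusion sketch, not the paper's method.
# salient_feats: (n_regions, d1) features from stable regions;
# holistic_feat: (d2,) off-the-shelf whole-image descriptor.
import numpy as np

def scene_descriptor(salient_feats, holistic_feat):
    pooled = salient_feats.mean(axis=0)            # pool salient-region features
    d = np.concatenate([pooled, holistic_feat])    # fuse with holistic descriptor
    return d / (np.linalg.norm(d) + 1e-12)         # L2-normalize for matching
```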

112 citations


Cites methods from "Distinctive Image Features from Scale-Invariant Keypoints"

  • ...Traditionally, approaches relied on point descriptors such as SIFT [15] for place recognition....


Journal ArticleDOI
TL;DR: The main conclusion of the paper is that with such a frugal approach it is possible to obtain results which are competitive with standard bottom-up approaches, thus establishing label embedding as an interesting and simple-to-compute baseline for text recognition.
Abstract: The standard approach to recognizing text in images consists in first classifying local image regions into candidate characters and then combining them with high-level word models such as conditional random fields. This paper explores a new paradigm that departs from this bottom-up view. We propose to embed word labels and word images into a common Euclidean space. Given a word image to be recognized, the text recognition problem is cast as one of retrieval: find the closest word label in this space. This common space is learned using the Structured SVM framework by enforcing matching label-image pairs to be closer than non-matching pairs. This method presents several advantages: it does not require ad-hoc or costly pre-/post-processing operations, it can build on top of any state-of-the-art image descriptor (Fisher vectors in our case), it allows for the recognition of never-seen-before words (zero-shot recognition) and the recognition process is simple and efficient, as it amounts to a nearest neighbor search. Experiments are performed on challenging datasets of license plates and scene text. The main conclusion of the paper is that with such a frugal approach it is possible to obtain results which are competitive with standard bottom-up approaches, thus establishing label embedding as an interesting and simple to compute baseline for text recognition.
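
A hedged sketch of the recognition-as-retrieval step described above, with training of the common space elided: assume projection matrices U (for image descriptors) and V (for label embeddings) were already learned, e.g. with a structured hinge loss, and that a lexicon of candidate words and their label embeddings is available. All names and shapes below are illustrative.

```python
# Recognition as nearest-neighbor search in a learned common space.
# U, V, and label_embeddings are assumed pre-trained (illustrative names).
import numpy as np

def recognize(x, U, V, label_embeddings, lexicon):
    img = U @ x                                # image descriptor -> common space
    lab = label_embeddings @ V.T               # all lexicon words -> common space
    img = img / np.linalg.norm(img)
    lab = lab / np.linalg.norm(lab, axis=1, keepdims=True)
    return lexicon[int(np.argmax(lab @ img))]  # closest word label wins
```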

111 citations


Cites methods from "Distinctive Image Features from Scale-Invariant Keypoints"

  • ...In our implementation, the low-level features are SIFT descriptors (Lowe 2004) whose dimensionality is reduced from 128 down to 32 dimensions....


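The excerpt above reduces 128-D SIFT descriptors to 32-D before further encoding. PCA is the customary choice for this step in Fisher-vector pipelines, so the sketch below uses it; whether it matches the paper's exact projection is an assumption, and the file name is a placeholder.

```python
# Reducing 128-D SIFT descriptors to 32-D with PCA (assumed technique).
import numpy as np
from sklearn.decomposition import PCA

descriptors = np.load("sift_descriptors.npy")   # placeholder: (n, 128) array
pca = PCA(n_components=32).fit(descriptors)
reduced = pca.transform(descriptors)            # (n, 32)
```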
Journal ArticleDOI
TL;DR: Experimental results demonstrate that the proposed new super-resolution (SR) scheme achieves significant improvement compared with four state-of-the-art schemes in terms of both subjective and objective qualities.
Abstract: This paper proposes a new super-resolution (SR) scheme for landmark images by retrieving correlated web images. Using correlated web images significantly improves the exemplar-based SR. Given a low-resolution (LR) image, we extract local descriptors from its up-sampled version and bundle the descriptors according to their spatial relationship to retrieve correlated high-resolution (HR) images from the web. Though similar in content, the retrieved images are usually taken with different illumination, focal lengths, and shot perspectives, resulting in uncertainty for the HR detail approximation. To solve this problem, we first propose aligning these images to the up-sampled LR image through a global registration, which identifies the corresponding regions in these images and reduces the mismatching. Second, we propose a structure-aware matching criterion and adaptive block sizes to improve the mapping accuracy between LR and HR patches. Finally, these matched HR patches are blended together by solving an energy minimization problem to recover the desired HR image. Experimental results demonstrate that our SR scheme achieves significant improvement compared with four state-of-the-art schemes in terms of both subjective and objective qualities.
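
The core exemplar-based step can be sketched in a few lines: for every patch of the up-sampled LR image, retrieve the most similar patch from a registered HR web image. This toy version uses brute-force L2 matching with a fixed block size, whereas the paper's scheme adds global registration, a structure-aware matching criterion, and adaptive block sizes.

```python
# Toy exemplar-matching sketch, not the paper's full pipeline.
import numpy as np
from sklearn.feature_extraction.image import extract_patches_2d
from sklearn.neighbors import NearestNeighbors

def match_patches(lr_up, hr, size=8):
    lr_p = extract_patches_2d(lr_up, (size, size))   # patches of up-sampled LR
    hr_p = extract_patches_2d(hr, (size, size))      # candidate HR patches
    nn = NearestNeighbors(n_neighbors=1).fit(hr_p.reshape(len(hr_p), -1))
    _, idx = nn.kneighbors(lr_p.reshape(len(lr_p), -1))
    return hr_p[idx.ravel()]                         # one HR candidate per LR patch
```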

111 citations


Cites background from "Distinctive Image Features from Scale-Invariant Keypoints"

  • ...Please refer to [38] for further details on SIFT extraction....


  • ...SIFT descriptors [38] characterize image regions invariant to image scale, rotation, and 3D camera viewpoint....


  • ...5, similar to the criterion proposed in [38]....


Posted Content
TL;DR: The fit models provide a new platform for exploring the functional principles of human vision, and they show that modern methods of computer vision and machine learning provide important tools for characterizing brain function.
Abstract: The human brain is adept at solving difficult high-level visual processing problems such as image interpretation and object recognition in natural scenes. Over the past few years neuroscientists have made remarkable progress in understanding how the human brain represents categories of objects and actions in natural scenes. However, all current models of high-level human vision operate on hand annotated images in which the objects and actions have been assigned semantic tags by a human operator. No current models can account for high-level visual function directly in terms of low-level visual input (i.e., pixels). To overcome this fundamental limitation we sought to develop a new class of models that can predict human brain activity directly from low-level visual input (i.e., pixels). We explored two classes of models drawn from computer vision and machine learning. The first class of models was based on Fisher Vectors (FV) and the second was based on Convolutional Neural Networks (ConvNets). We find that both classes of models accurately predict brain activity in high-level visual areas, directly from pixels and without the need for any semantic tags or hand annotation of images. This is the first time that such a mapping has been obtained. The fit models provide a new platform for exploring the functional principles of human vision, and they show that modern methods of computer vision and machine learning provide important tools for characterizing brain function.
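
The modeling setup lends itself to a short sketch: extract image features (FV or ConvNet activations, elided here) and fit a regularized linear mapping to measured voxel responses. Ridge regression and the file names below are assumptions for illustration, not the paper's exact estimator.

```python
# Hedged sketch: linear mapping from image features to voxel responses.
import numpy as np
from sklearn.linear_model import Ridge

X_train = np.load("features_train.npy")   # placeholder: (n_images, n_features)
Y_train = np.load("bold_train.npy")       # placeholder: (n_images, n_voxels)
X_test = np.load("features_test.npy")

model = Ridge(alpha=1.0).fit(X_train, Y_train)   # one linear model per voxel
Y_pred = model.predict(X_test)                   # predicted activity from pixels-derived features
```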

111 citations


Cites background from "Distinctive Image Features from Scale-Invariant Keypoints"

  • ...SIFT features capture the distribution of edge orientation energy in each patch [20]....


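The excerpt above characterizes SIFT as a distribution of edge-orientation energy. A minimal sketch of that building block, a magnitude-weighted gradient-orientation histogram over a patch, follows; it is illustrative and omits Lowe's full 4x4 spatial grid of 8-bin histograms.

```python
# Magnitude-weighted gradient-orientation histogram for one patch.
import numpy as np

def orientation_histogram(patch, n_bins=8):
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)                           # gradient magnitude
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)      # orientation in [0, 2*pi)
    hist, _ = np.histogram(ang, bins=n_bins, range=(0, 2 * np.pi),
                           weights=mag)              # weight bins by magnitude
    return hist / (hist.sum() + 1e-12)               # normalized distribution
```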
References
Journal ArticleDOI
TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
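
The final verification stage mentioned above solves a least-squares problem for consistent pose parameters. A minimal sketch for the affine case, with function and variable names that are ours rather than the paper's:

```python
# Least-squares affine pose from matched keypoint coordinates.
import numpy as np

def fit_affine(src, dst):
    # src, dst: (n, 2) arrays of matched model/image keypoint locations.
    n = len(src)
    A = np.zeros((2 * n, 6))
    A[0::2, 0:2], A[0::2, 4] = src, 1.0   # rows for dst_x = m11*x + m12*y + tx
    A[1::2, 2:4], A[1::2, 5] = src, 1.0   # rows for dst_y = m21*x + m22*y + ty
    b = dst.reshape(-1)
    params, *_ = np.linalg.lstsq(A, b, rcond=None)
    return params                          # [m11, m12, m21, m22, tx, ty]
```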

46,906 citations

Proceedings ArticleDOI
20 Sep 1999
TL;DR: Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.
Abstract: An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low residual least squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.
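
The "staged filtering approach that identifies stable points in scale space" can be approximated compactly with difference-of-Gaussian extrema. The sketch below assumes SciPy, and the sigmas and threshold are illustrative choices; it skips the paper's orientation planes and indexing stages.

```python
# Stable points as local extrema of a difference-of-Gaussian scale space.
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_keypoints(img, sigmas=(1.0, 1.6, 2.6, 4.0), thresh=0.03):
    img = img.astype(float) / 255.0
    blurred = [gaussian_filter(img, s) for s in sigmas]
    dog = np.stack([b2 - b1 for b1, b2 in zip(blurred, blurred[1:])])
    # Keep points that are extrema over a 3x3x3 scale-space neighborhood
    # and whose response is strong enough.
    maxima = (dog == maximum_filter(dog, size=3)) & (dog > thresh)
    minima = (dog == minimum_filter(dog, size=3)) & (dog < -thresh)
    return np.argwhere(maxima | minima)   # (scale, row, col) triples
```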

16,989 citations

Proceedings ArticleDOI
01 Jan 1988
TL;DR: The problem the authors are addressing in Alvey Project MMI149 is that of using computer vision to understand the unconstrained 3D world, in which the viewed scenes will in general contain too wide a diversity of objects for top-down recognition techniques to work.
Abstract: The problem we are addressing in Alvey Project MMI149 is that of using computer vision to understand the unconstrained 3D world, in which the viewed scenes will in general contain too wide a diversity of objects for top-down recognition techniques to work. For example, we desire to obtain an understanding of natural scenes, containing roads, buildings, trees, bushes, etc., as typified by the two frames from a sequence illustrated in Figure 1. The solution to this problem that we are pursuing is to use a computer vision system based upon motion analysis of a monocular image sequence from a mobile camera. By extraction and tracking of image features, representations of the 3D analogues of these features can be constructed.
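
The feature extraction underlying this abstract is the Harris corner detector. A minimal sketch of its corner response, assuming SciPy and the customary constant k = 0.04:

```python
# Harris corner response from the smoothed structure tensor.
import numpy as np
from scipy.ndimage import gaussian_filter

def harris_response(img, sigma=1.0, k=0.04):
    img = img.astype(float)
    gy, gx = np.gradient(img)
    Sxx = gaussian_filter(gx * gx, sigma)   # smoothed structure tensor terms
    Syy = gaussian_filter(gy * gy, sigma)
    Sxy = gaussian_filter(gx * gy, sigma)
    det = Sxx * Syy - Sxy ** 2
    trace = Sxx + Syy
    return det - k * trace ** 2             # large positive values mark corners
```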

13,993 citations

Journal ArticleDOI
TL;DR: It is observed that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best and Moments and steerable filters show the best performance among the low dimensional descriptors.
Abstract: In this paper, we compare the performance of descriptors computed for local interest regions, as, for example, extracted by the Harris-Affine detector [Mikolajczyk, K. and Schmid, C., 2004]. Many different descriptors have been proposed in the literature. It is unclear which descriptors are more appropriate and how their performance depends on the interest region detector. The descriptors should be distinctive and at the same time robust to changes in viewing conditions as well as to errors of the detector. Our evaluation uses as criterion recall with respect to precision and is carried out for different image transformations. We compare shape context [Belongie, S., et al., April 2002], steerable filters [Freeman, W. and Adelson, E., Sept. 1991], PCA-SIFT [Ke, Y. and Sukthankar, R., 2004], differential invariants [Koenderink, J. and van Doorn, A., 1987], spin images [Lazebnik, S., et al., 2003], SIFT [Lowe, D. G., 1999], complex filters [Schaffalitzky, F. and Zisserman, A., 2002], moment invariants [Van Gool, L., et al., 1996], and cross-correlation for different types of interest regions. We also propose an extension of the SIFT descriptor and show that it outperforms the original method. Furthermore, we observe that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best. Moments and steerable filters show the best performance among the low dimensional descriptors.
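
The evaluation criterion, recall with respect to precision, reduces to a short computation once ground-truth correspondences are known. A hedged sketch with array names of our choosing:

```python
# Recall vs. 1-precision curve as a descriptor-distance threshold varies.
import numpy as np

def recall_precision_curve(dists, is_correct, n_correspondences):
    # dists: distance of each putative match; is_correct: ground-truth flags.
    order = np.argsort(dists)
    correct = np.cumsum(is_correct[order])        # correct matches so far
    total = np.arange(1, len(dists) + 1)          # all matches so far
    recall = correct / n_correspondences
    one_minus_precision = (total - correct) / total
    return recall, one_minus_precision
```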

7,057 citations

Journal ArticleDOI
TL;DR: The high utility of MSERs, multiple measurement regions and the robust metric is demonstrated in wide-baseline experiments on image pairs from both indoor and outdoor scenes.

3,422 citations
