
Distinctive Image Features from Scale-Invariant Keypoints

01 Jan 2011
TL;DR: The Scale-Invariant Feature Transform (or SIFT) algorithm is a highly robust method to extract and consequently match distinctive invariant features from images that can then be used to reliably match objects in differing images.
Abstract: The Scale-Invariant Feature Transform (or SIFT) algorithm is a highly robust method to extract and consequently match distinctive invariant features from images. These features can then be used to reliably match objects in differing images. The algorithm was first proposed by Lowe [12] and further developed to increase performance, resulting in the classic paper [13] that served as the foundation for SIFT, which has played an important role in robotic and machine vision in the past decade.
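As a concrete illustration of the first stage of SIFT, the sketch below detects scale-space extrema of the Difference-of-Gaussians, the step used to localize candidate keypoints. This is a deliberately simplified illustration (a fixed sigma ladder, an assumed contrast threshold of 0.03, no subpixel refinement, no edge rejection, and no orientation or descriptor computation), not Lowe's full implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_extrema(image, sigmas=(1.0, 1.6, 2.56, 4.1)):
    """Detect scale-space extrema of the Difference-of-Gaussians (DoG),
    the candidate-keypoint stage of SIFT, heavily simplified."""
    blurred = [gaussian_filter(image.astype(float), s) for s in sigmas]
    # DoG layers: difference of adjacent Gaussian-blurred images.
    dogs = np.stack([b2 - b1 for b1, b2 in zip(blurred, blurred[1:])])
    keypoints = []
    # A pixel is a candidate keypoint if it is an extremum among its
    # 26 neighbours in the 3x3x3 scale-space cube and its DoG response
    # exceeds an (assumed) contrast threshold.
    for k in range(1, dogs.shape[0] - 1):
        for i in range(1, dogs.shape[1] - 1):
            for j in range(1, dogs.shape[2] - 1):
                cube = dogs[k - 1:k + 2, i - 1:i + 2, j - 1:j + 2]
                v = dogs[k, i, j]
                if abs(v) > 0.03 and (v == cube.max() or v == cube.min()):
                    keypoints.append((i, j, sigmas[k]))
    return keypoints
```

Running the detector on a synthetic image containing a single Gaussian blob yields a keypoint at the blob's centre, at the scale that best matches the blob's size.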
Citations
Proceedings Article
01 Jan 2011
TL;DR: The eighth edition of the ImageCLEF medical retrieval task was organized in 2011 and a subset of the open access collection of PubMed Central was used as the database, which contains 231,000 images and is substantially larger than previously used collections.
Abstract: The eighth edition of the ImageCLEF medical retrieval task was organized in 2011. A subset of the open access collection of PubMed Central was used as the database in 2011. This database contains 231,000 images and is substantially larger than previously used collections. Additionally, there was a larger fraction of non-clinical images such as graphs and charts. As in 2010, we had three subtasks: modality classification, image-based and case-based retrieval. A new, simple hierarchy for article figures was created. Our belief is that the use of the detected modality should help filter out non-relevant images, thereby improving precision. The goal of the image-based retrieval task was to retrieve an ordered set of images from the collection that best meet the information need specified as a textual statement and a set of sample images, while the goal of the case-based retrieval task was to return an ordered set of articles (rather than images) that best meet the information need provided as a description of a "case". The number of registrations to the medical task increased to 55 research groups. However, groups submitting runs have remained stable at 17, with the number of submitted runs increasing to 207. Of these, 130 were image-based retrieval runs, 43 were case-based runs while the remaining 34 were modality classification runs. Combining textual and visual cues most often led to best results, but results fusion needs to be used with care.

88 citations


Cites methods from "Distinctive Image Features from Sca..."


  • ...Techniques Used for Visual Classification The Xerox team, which obtained the best results, uses a Fisher Vector representation of the images built on low level features such as Scale Invariant Feature Transform (SIFT), Local Orientation Histograms (ORH) and local RGB statistics [11]....

    [...]

  • ...Tools such as the GIFT (GNU Image Finding Tool, http://www.gnu.org/software/gift/) [12] were employed, as well as techniques such as the Color Layout Descriptor (CLD) of MPEG-7, the Color and Edge Directivity Descriptor (CEDD), and the Fuzzy Color and Texture Histogram (FCTH) using the Lucene image retrieval (LIRE) library (http://freshmeat.net/projects/lirecbir/) [13], the SIFT [14], as well as various combinations of these....

    [...]

Journal ArticleDOI
Shiyang Lu, Zhiyong Wang, Tao Mei, Genliang Guan, David Dagan Feng
TL;DR: The proposed Bag-of-Importance (BoI) model for static video summarization is able to exploit both the inter-frame and intra-frame properties of feature representations and identify keyframes capturing both the dominant content and discriminative details within a video.
Abstract: Video summarization helps users obtain quick comprehension of video content. Recently, some studies have utilized local features to represent each video frame and formulate video summarization as a coverage problem of local features. However, the importance of individual local features has not been exploited. In this paper, we propose a novel Bag-of-Importance (BoI) model for static video summarization by identifying the frames with important local features as keyframes, which is one of the first studies formulating video summarization at local feature level, instead of at global feature level. That is, by representing each frame with local features, a video is characterized with a bag of local features weighted with individual importance scores and the frames with more important local features are more representative, where the representativeness of each frame is the aggregation of the weighted importance of the local features contained in the frame. In addition, we propose to learn a transformation from a raw local feature to a more powerful sparse nonlinear representation for deriving the importance score of each local feature, rather than directly utilize the hand-crafted visual features like most of the existing approaches. Specifically, we first employ locality-constrained linear coding (LCC) to project each local feature into a sparse transformed space. LCC is able to take advantage of the manifold geometric structure of the high dimensional feature space and form the manifold of the low dimensional transformed space with the coordinates of a set of anchor points. Then we calculate the l2 norm of each anchor point as the importance score of each local feature which is projected to the anchor point. Finally, the distribution of the importance scores of all the local features in a video is obtained as the BoI representation of the video. 
We further differentiate the importance of local features with a spatial weighting template by taking the perceptual difference among spatial regions of a frame into account. As a result, our proposed video summarization approach is able to exploit both the inter-frame and intra-frame properties of feature representations and identify keyframes capturing both the dominant content and discriminative details within a video. Experimental results on three video datasets across various genres demonstrate that the proposed approach clearly outperforms several state-of-the-art methods.
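The importance-scoring step described above can be sketched as follows. This is a simplified, hypothetical rendering (plain k-nearest-anchor locality-constrained coding with a small ridge term, and a frame score that just sums the l2 norms of the codes), not the authors' implementation; the function names and the parameters `k` and `lam` are illustrative assumptions.

```python
import numpy as np

def lcc_codes(features, anchors, k=3, lam=1e-4):
    """Locality-constrained linear coding (simplified): encode each
    local feature over its k nearest anchors by solving a small
    ridge-regularised least-squares problem; all other coefficients
    stay zero, and the code is normalised to sum to one."""
    codes = np.zeros((len(features), len(anchors)))
    for n, f in enumerate(features):
        d = np.linalg.norm(anchors - f, axis=1)
        idx = np.argsort(d)[:k]          # k nearest anchor points
        B = anchors[idx] - f             # shift-invariance constraint
        G = B @ B.T + lam * np.eye(k)    # regularised local Gram matrix
        w = np.linalg.solve(G, np.ones(k))
        codes[n, idx] = w / w.sum()      # sum-to-one constraint
    return codes

def frame_scores(codes, frame_index, n_frames):
    """Frame representativeness: aggregate the l2 norms (importance
    scores) of the local-feature codes belonging to each frame."""
    imp = np.linalg.norm(codes, axis=1)
    return np.array([imp[frame_index == t].sum() for t in range(n_frames)])
```

Frames with larger aggregated scores would then be selected as keyframes; the spatial weighting template in the paper would additionally reweight `imp` per feature location before aggregation.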

88 citations


Cites background from "Distinctive Image Features from Sca..."

  • ...1) Weighting Local Features in the Transformed Space: Given a local descriptor set representing the collection of all the local features extracted from each frame within a video, where each element is a local feature (a Scale-Invariant Feature Transform (SIFT) descriptor [17] in this paper) of a fixed descriptor dimension, and the set covers all the local features extracted from the given video....

    [...]

  • ...In recent years, local visual features, such as the scale-invariant feature transform (SIFT) descriptor [17], have been playing a more significant role in many applications of video content analysis due to their distinctive representation capacity [15], [22], [35], [36]....

    [...]

Journal ArticleDOI
TL;DR: This work introduces a smoothing technique to improve the estimates of the small eigenvalues of the KISS covariance matrices and proposes minimum classification error KISS (MCE-KISS), building on the fact that discriminative learning based on the minimum classification error is more reliable than classical ML estimation as the number of training samples increases.
Abstract: In recent years, person reidentification has received growing attention with the increasing popularity of intelligent video surveillance. This is because person reidentification is critical for human tracking with multiple cameras. Recently, keep it simple and straightforward (KISS) metric learning has been regarded as a top level algorithm for person reidentification. The covariance matrices of KISS are estimated by maximum likelihood (ML) estimation. It is known that discriminative learning based on the minimum classification error (MCE) is more reliable than classical ML estimation as the number of training samples increases. When considering a small sample size problem, direct MCE KISS does not work well because of the estimation error of small eigenvalues. Therefore, we further introduce a smoothing technique to improve the estimates of the small eigenvalues of a covariance matrix. Our new scheme is termed minimum classification error KISS (MCE-KISS). We conduct thorough validation experiments on the VIPeR and ETHZ datasets, which demonstrate the robustness and effectiveness of MCE-KISS for person reidentification.
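The KISS estimator and the smoothing step can be sketched as follows. This is a schematic reading, not the paper's implementation: KISS learns a Mahalanobis matrix M = Σ_S⁻¹ − Σ_D⁻¹ from difference vectors of similar and dissimilar pairs, and the smoothing here simply floors small eigenvalues before inversion; the `floor_ratio` parameter is an assumed stand-in for the paper's smoothing scheme.

```python
import numpy as np

def smoothed_inv(cov, floor_ratio=1e-3):
    """Invert a covariance matrix after flooring its small eigenvalues,
    stabilising the estimate in the small-sample-size regime."""
    w, V = np.linalg.eigh(cov)
    w = np.maximum(w, floor_ratio * w.max())  # floor small eigenvalues
    return (V / w) @ V.T                      # V diag(1/w) V^T

def kiss_metric(x_sim, x_dis):
    """KISS metric: M = inv(Cov_similar) - inv(Cov_dissimilar),
    estimated from difference vectors of similar / dissimilar pairs."""
    cov_s = x_sim.T @ x_sim / len(x_sim)
    cov_d = x_dis.T @ x_dis / len(x_dis)
    return smoothed_inv(cov_s) - smoothed_inv(cov_d)
```

Under this metric, the squared distance xᵀMx of a difference vector is small for genuine (similar-pair) differences and large for impostor (dissimilar-pair) differences, which is what drives the reidentification ranking.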

88 citations


Cites methods from "Distinctive Image Features from Sca..."

  • ...Scale invariant feature transform (SIFT) [20] and speeded up robust features (SURF) have also been used to extract texture features [21], [22]....

    [...]

Proceedings ArticleDOI
01 Nov 2013
TL;DR: This work examines the benefits of dense feature extraction and multimodal features for improving the accuracy and robustness of an instance recognition system and obtains significant improvements over previously published results on two RGB-D datasets.
Abstract: Despite the rich information provided by sensors such as the Microsoft Kinect in the robotic perception setting, the problem of detecting object instances remains unsolved, even in the tabletop setting, where segmentation is greatly simplified. Existing object detection systems often focus on textured objects, for which local feature descriptors can be used to reliably obtain correspondences between different views of the same object. We examine the benefits of dense feature extraction and multimodal features for improving the accuracy and robustness of an instance recognition system. By combining multiple modalities and blending their scores through an ensemble-based method in order to generate our final object hypotheses, we obtain significant improvements over previously published results on two RGB-D datasets. On the Challenge dataset, our method results in only one missed detection (achieving 100% precision and 99.77% recall). On the Willow dataset, we also make significant gains on the prior state of the art (achieving 98.28% precision and 87.78% recall), resulting in an increase in F-score from 0.8092 to 0.9273.

88 citations


Cites background from "Distinctive Image Features from Sca..."

  • ...[3], which first constructs a sparse descriptor database by extracting SIFT features [4] at training time....

    [...]

Journal ArticleDOI
TL;DR: This work presents a novel approach to building 3-D models of skin wounds from color images using a low-cost and user-friendly image acquisition device suitable for widespread application in health care centers; the method entails the development of a robust image processing chain.
Abstract: In this paper, after an overview of the literature concerning the imaging technologies applied to skin wounds assessment, we present an original approach to build 3-D models of skin wounds from color images. The method can deal with uncalibrated images acquired with a handheld digital camera with free zooming. Compared with the cumbersome imaging systems already proposed, this novel solution uses a low-cost and user-friendly image acquisition device suitable for widespread application in health care centers. However, this method entails the development of a robust image processing chain. An original iterative matching scheme is used to generate a dense estimation of the surface geometry from two widely separated views. The best configuration for taking photographs lies between 15deg and 30deg for the vergency angle. The metric reconstruction of the skin wound is fully automated through self-calibration. From the 3-D model of the skin wound, accurate volumetric measurements are achieved. The accuracy of the inferred 3-D surface is validated by registration to a ground truth and repetitive tests on volume. The global precision around 3% is in accordance with the clinical requirement of 5% for assessing the healing process.

88 citations


Cites methods from "Distinctive Image Features from Sca..."

  • ...The adopted matching strategy begins with finding a small number of robust correspondences between the two images using a ‘winner takes all’ strategy and the SIFT descriptor [38]....

    [...]

References
Journal ArticleDOI
TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
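The nearest-neighbor matching stage described in this abstract is commonly implemented with Lowe's ratio test; a minimal sketch, assuming a ratio threshold of 0.8 and brute-force Euclidean search in place of the fast approximate nearest-neighbor index:

```python
import numpy as np

def match_descriptors(query, database, ratio=0.8):
    """Match each query descriptor to its nearest neighbour in the
    database, keeping only matches that pass Lowe's ratio test:
    the best distance must be below `ratio` times the second-best
    distance, which rejects ambiguous correspondences."""
    matches = []
    for qi, q in enumerate(query):
        d = np.linalg.norm(database - q, axis=1)
        i1, i2 = np.argsort(d)[:2]       # best and second-best neighbours
        if d[i1] < ratio * d[i2]:
            matches.append((qi, i1))
    return matches
```

Discarding matches whose best distance is not clearly below the second-best distance removes ambiguous correspondences before the Hough-transform clustering and least-squares pose verification that the abstract describes.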

46,906 citations

Proceedings ArticleDOI
20 Sep 1999
TL;DR: Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.
Abstract: An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low residual least squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.

16,989 citations

Proceedings ArticleDOI
01 Jan 1988
TL;DR: The problem the authors are addressing in Alvey Project MMI149 is that of using computer vision to understand the unconstrained 3D world, in which the viewed scenes will in general contain too wide a diversity of objects for topdown recognition techniques to work.
Abstract: The problem we are addressing in Alvey Project MMI149 is that of using computer vision to understand the unconstrained 3D world, in which the viewed scenes will in general contain too wide a diversity of objects for top-down recognition techniques to work. For example, we desire to obtain an understanding of natural scenes, containing roads, buildings, trees, bushes, etc., as typified by the two frames from a sequence illustrated in Figure 1. The solution to this problem that we are pursuing is to use a computer vision system based upon motion analysis of a monocular image sequence from a mobile camera. By extraction and tracking of image features, representations of the 3D analogues of these features can be constructed.

13,993 citations

Journal ArticleDOI
TL;DR: It is observed that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best and Moments and steerable filters show the best performance among the low dimensional descriptors.
Abstract: In this paper, we compare the performance of descriptors computed for local interest regions, as, for example, extracted by the Harris-Affine detector [Mikolajczyk, K and Schmid, C, 2004]. Many different descriptors have been proposed in the literature. It is unclear which descriptors are more appropriate and how their performance depends on the interest region detector. The descriptors should be distinctive and at the same time robust to changes in viewing conditions as well as to errors of the detector. Our evaluation uses as criterion recall with respect to precision and is carried out for different image transformations. We compare shape context [Belongie, S, et al., April 2002], steerable filters [Freeman, W and Adelson, E, Setp. 1991], PCA-SIFT [Ke, Y and Sukthankar, R, 2004], differential invariants [Koenderink, J and van Doorn, A, 1987], spin images [Lazebnik, S, et al., 2003], SIFT [Lowe, D. G., 1999], complex filters [Schaffalitzky, F and Zisserman, A, 2002], moment invariants [Van Gool, L, et al., 1996], and cross-correlation for different types of interest regions. We also propose an extension of the SIFT descriptor and show that it outperforms the original method. Furthermore, we observe that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best. Moments and steerable filters show the best performance among the low dimensional descriptors.

7,057 citations

Journal ArticleDOI
TL;DR: The high utility of MSERs, multiple measurement regions and the robust metric is demonstrated in wide-baseline experiments on image pairs from both indoor and outdoor scenes.

3,422 citations
