
Distinctive Image Features from Scale-Invariant Keypoints

01 Jan 2011
TL;DR: The Scale-Invariant Feature Transform (or SIFT) algorithm is a highly robust method to extract and subsequently match distinctive invariant features from images that can then be used to reliably match objects in differing images.
Abstract: The Scale-Invariant Feature Transform (or SIFT) algorithm is a highly robust method to extract and subsequently match distinctive invariant features from images. These features can then be used to reliably match objects in differing images. The algorithm was first proposed by Lowe [12] and further developed to increase performance, resulting in the classic paper [13] that served as the foundation for SIFT, which has played an important role in robotic and machine vision in the past decade.
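
As a rough, hedged illustration of what extracting these features looks like in practice, the following minimal sketch uses OpenCV's SIFT implementation (available in the main module since OpenCV 4.4); the file name scene.jpg is a placeholder, not part of the paper:

    import cv2

    # Load a grayscale image; "scene.jpg" is a placeholder path.
    img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

    # Detect keypoints and compute 128-dimensional descriptors.
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(img, None)

    # Each keypoint carries a location, scale and orientation, which is
    # what makes the descriptors scale- and rotation-invariant.
    print(len(keypoints), descriptors.shape)
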
Citations
Journal ArticleDOI
Lin Zhang, Lida Li, Anqi Yang, Ying Shen, Meng Yang
TL;DR: A novel approach, CR_CompCode, is proposed that achieves high recognition accuracy with extremely low computational complexity, making it highly effective and efficient for contactless palmprint identification.

156 citations

Journal ArticleDOI
TL;DR: The results show that SDA can detect loops at a satisfactory precision and can therefore serve as an alternative for visual SLAM systems; the workflow of training the network, utilizing the features, and computing the similarity score is also presented.
Abstract: This paper is concerned with the loop closure detection problem for visual simultaneous localization and mapping systems. We propose a novel approach based on the stacked denoising auto-encoder (SDA), a multi-layer neural network that autonomously learns a compressed representation from the raw input data in an unsupervised way. Unlike traditional bag-of-words based methods, the deep network can learn the complex inner structures of image data and no longer requires manually designed visual features. Our approach employs the characteristics of the SDA to solve the loop detection problem. The workflow of training the network, utilizing the features and computing the similarity score is presented. The performance of SDA is evaluated in a comparison study against Fab-map 2.0 using data from open datasets and physical robots. The results show that SDA is feasible for detecting loops at a satisfactory precision and can therefore provide an alternative for visual SLAM systems.
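
To make the SDA idea concrete, here is a minimal sketch of a single denoising auto-encoder layer of the kind the paper stacks, written in PyTorch; the layer sizes, the Gaussian corruption, and the noise level are illustrative assumptions, not the paper's settings:

    import torch
    import torch.nn as nn

    class DenoisingAE(nn.Module):
        """One layer of a stacked denoising auto-encoder (illustrative sizes)."""
        def __init__(self, n_in=4096, n_hidden=512):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid())
            self.decoder = nn.Sequential(nn.Linear(n_hidden, n_in), nn.Sigmoid())

        def forward(self, x, noise_std=0.1):
            x_noisy = x + noise_std * torch.randn_like(x)  # corrupt the input
            code = self.encoder(x_noisy)  # compressed representation
            return self.decoder(code), code

    model = DenoisingAE()
    x = torch.rand(8, 4096)                  # a batch of flattened image patches
    recon, code = model(x)
    loss = nn.functional.mse_loss(recon, x)  # reconstruct the *clean* input

For loop closure, a similarity score between two images can then be taken, for example, as the cosine similarity of their learned codes.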

155 citations


Cites methods from "Distinctive Image Features from Scale-Invariant Keypoints"

  • ...The patches are computed by sparse key-point detection algorithms like SIFT (Lowe 2004), FAST (Agrawal et al. 2008) or ORB (Rublee et al. 2011) and are then filtered to spread over the whole image....

    [...]

  • ...When a new observation comes, the visual features, such as SIFT (Lowe 2004) or SURF (Bay et al. 2006), are extracted and the descriptors are approximated by the entries in the dictionary....

    [...]

Journal ArticleDOI
TL;DR: A multiscale deep feature learning method for high-resolution satellite image scene classification that warps the original satellite image into multiple different scales and develops a multiple kernel learning method to automatically learn the optimal combination of the resulting features.
Abstract: In this paper, we propose a multiscale deep feature learning method for high-resolution satellite image scene classification. Specifically, we first warp the original satellite image into multiple different scales. The images at each scale are employed to train a deep convolutional neural network (DCNN). However, simultaneously training multiple DCNNs is time-consuming. To address this issue, we explore DCNN with spatial pyramid pooling (SPP-net). Since the different SPP-nets have the same number of parameters and share identical initial values, only the parameters in the fully connected layers need fine-tuning, which ensures the effectiveness of each network while greatly accelerating the training process. Then, the multiscale satellite images are fed into their corresponding SPP-nets to extract multiscale deep features. Finally, a multiple kernel learning method is developed to automatically learn the optimal combination of such features. Experiments on two difficult data sets show that the proposed method achieves favorable performance compared with other state-of-the-art methods.
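
A minimal sketch of the spatial pyramid pooling step that lets inputs of any spatial size yield a fixed-length feature vector; the pyramid levels (1, 2, 4) and the use of PyTorch are illustrative assumptions, not the paper's exact configuration:

    import torch
    import torch.nn.functional as F

    def spp(feature_map, levels=(1, 2, 4)):
        # feature_map: (batch, channels, H, W) from the last conv layer.
        pooled = []
        for n in levels:
            p = F.adaptive_max_pool2d(feature_map, output_size=(n, n))
            pooled.append(p.flatten(start_dim=1))  # (batch, channels * n * n)
        return torch.cat(pooled, dim=1)

    # Feature maps from two different image scales give equal-length vectors.
    a = spp(torch.rand(2, 256, 13, 13))
    b = spp(torch.rand(2, 256, 20, 20))
    assert a.shape == b.shape
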

154 citations


Cites background from "Distinctive Image Features from Scale-Invariant Keypoints"

  • ...One is the bag of visual words (BOVWs) model [12]–[14], which generally includes three steps: 1) extracting man-made visual features, such as scale invariant feature transform (SIFT) [15] and histogram of oriented gradient [16] descriptors; 2) clustering features to form visual words (clustering centers) by using k-means or other clustering methods; and 3) mapping visual features to the closest word and generating a mid-level feature representation by word histograms....

    [...]

  • ...Sheng et al. [10] proposed to use SC to generate three mid-level representations based on SIFT, local ternary pattern histogram Fourier, and color histogram features, respectively....

    [...]
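
The first excerpt above enumerates the three BoVW steps; here is a minimal, hedged sketch of that pipeline using OpenCV and scikit-learn, where the vocabulary size and file names are illustrative placeholders:

    import cv2
    import numpy as np
    from sklearn.cluster import KMeans

    sift = cv2.SIFT_create()

    def descriptors(path):
        # Step 1: extract local SIFT descriptors from one image.
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, des = sift.detectAndCompute(img, None)
        return des

    paths = ["img0.jpg", "img1.jpg"]  # placeholder image list
    all_des = np.vstack([descriptors(p) for p in paths])

    # Step 2: cluster descriptors into k visual words.
    k = 100  # vocabulary size (assumed)
    vocab = KMeans(n_clusters=k, n_init=10).fit(all_des)

    def bovw_histogram(path):
        # Step 3: map each descriptor to its closest word and histogram them.
        words = vocab.predict(descriptors(path))
        hist, _ = np.histogram(words, bins=np.arange(k + 1))
        return hist / hist.sum()
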

Journal ArticleDOI
TL;DR: This paper proposes a mobile food recognition system, FoodCam, whose purposes are estimating the calories and nutrition of foods and recording a user's eating habits; in a user study, it received more positive evaluations than a food recording system without object recognition.
Abstract: We propose a mobile food recognition system, FoodCam, the purposes of which are estimating the calories and nutrition of foods and recording a user's eating habits. In this paper, we propose image recognition methods which are suitable for mobile devices. The proposed method enables real-time food image recognition on a consumer smartphone. This characteristic is completely different from the existing systems, which require sending images to an image recognition server. To recognize food items, a user first draws bounding boxes by touching the screen, and then the system starts food item recognition within the indicated bounding boxes. To recognize them more accurately, we segment each food item region by GrabCut, extract image features and finally classify it into one of one hundred food categories with a linear SVM. As image features, we adopt two kinds of features: one is the combination of the standard bag-of-features and color histograms with χ2 kernel feature maps, and the other is a HOG patch descriptor and a color patch descriptor with the state-of-the-art Fisher Vector representation. In addition, the system estimates the direction of food regions where a higher SVM output score is expected, and shows the estimated direction as an arrow on the screen in order to ask the user to move the smartphone camera. This recognition process is performed repeatedly and continuously. We implemented this system as a standalone mobile application for Android smartphones so as to use multiple CPU cores effectively for real-time recognition. In the experiments, we achieved a 79.2% classification rate for the top 5 category candidates on a 100-category food dataset with ground-truth bounding boxes, using HOG and color patches with Fisher Vector coding as image features. In addition, the system received positive evaluations in a user study compared to a food recording system without object recognition.
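
A hedged sketch of the per-box recognition flow described above, pairing GrabCut segmentation with a linear SVM; the simple color histogram stands in for the paper's bag-of-features and Fisher Vector descriptors, and the file name, box coordinates, and training data are placeholders:

    import cv2
    import numpy as np
    from sklearn.svm import LinearSVC

    img = cv2.imread("meal.jpg")
    rect = (50, 40, 200, 180)  # user-drawn bounding box (x, y, w, h)

    # GrabCut inside the box: separate the food region from the background.
    mask = np.zeros(img.shape[:2], np.uint8)
    bgd = np.zeros((1, 65), np.float64)
    fgd = np.zeros((1, 65), np.float64)
    cv2.grabCut(img, mask, rect, bgd, fgd, 5, cv2.GC_INIT_WITH_RECT)
    fg = (mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)

    # A simple color histogram over the segmented region as the feature.
    hist = cv2.calcHist([img], [0, 1, 2], fg.astype(np.uint8),
                        [8, 8, 8], [0, 256, 0, 256, 0, 256]).flatten()
    hist /= hist.sum()

    # With labeled training features, classification would be:
    # clf = LinearSVC().fit(train_features, train_labels)  # 100 food classes
    # label = clf.predict([hist])
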

154 citations


Cites methods from "Distinctive Image Features from Scale-Invariant Keypoints"

  • ...HOG is not invariant to scale and rotation, which differs from standard local descriptors such as SIFT [Lowe(2004)] and SURF [Bay et al.(2008)]....

    [...]

Proceedings ArticleDOI
13 Oct 2012
TL;DR: The Difference of Normals (DoN) provides a computationally efficient, multi-scale approach to processing large unorganized 3D point clouds and is shown to segment large 3D point clouds into scale-salient clusters, towards applications in semi-automatic annotation and as a pre-processing step in automatic object recognition.
Abstract: A novel multi-scale operator for unorganized 3D point clouds is introduced. The Difference of Normals (DoN) provides a computationally efficient, multi-scale approach to processing large unorganized 3D point clouds. The application of DoN in the multi-scale filtering of two different real-world outdoor urban LIDAR scene datasets is quantitatively and qualitatively demonstrated. In both datasets the DoN operator is shown to segment large 3D point clouds into scale-salient clusters, such as cars, people, and lamp posts towards applications in semi-automatic annotation, and as a pre-processing step in automatic object recognition. The application of the operator to segmentation is evaluated on a large public dataset of outdoor LIDAR scenes with ground truth annotations.
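
A hedged sketch of the DoN operator using Open3D: estimate unit normals at a small and a large support radius and compare them per point. The radii, the saliency threshold, and the file name are illustrative assumptions:

    import numpy as np
    import open3d as o3d

    def normals_at(pcd, radius):
        # Copy the cloud so each support radius gets its own normal estimate.
        p = o3d.geometry.PointCloud(pcd)
        p.estimate_normals(o3d.geometry.KDTreeSearchParamRadius(radius))
        return np.asarray(p.normals)

    pcd = o3d.io.read_point_cloud("urban_scan.pcd")  # placeholder file

    # DoN: half the difference of the normals at a small and a large scale.
    # (Normal signs are ambiguous; the paper disambiguates orientations,
    # which this sketch skips.)
    don = 0.5 * (normals_at(pcd, 0.2) - normals_at(pcd, 1.0))
    magnitude = np.linalg.norm(don, axis=1)  # per-point scale saliency

    # Threshold the magnitude to keep scale-salient points for clustering.
    salient = np.asarray(pcd.points)[magnitude > 0.25]
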

153 citations


Cites background or methods from "Distinctive Image Features from Scale-Invariant Keypoints"

  • ...In 2D images, the concept of scale space is often described with a family of gradually smoothed images created through convolution with a Gaussian kernel; such a Gaussian scale space has a wide range of applications in image processing and 2D computer vision, such as in edge sharpening and interest point selection [12]....

    [...]

  • ...The DoG is an approximation to the Laplacian of the Gaussian (LoG) operator, and is widely used in applications such as image enhancement, blob detection, edge detection, finding points of salience, presegmenting images [13], and perhaps most notably in the form of a DoG pyramid for obtaining scale invariance in 2D object recognition [12]....

    [...]
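
The second excerpt above describes the DoG approximation to the LoG; a minimal sketch follows, where the sigma values are illustrative (SIFT uses a fixed ratio per octave) and scene.jpg is a placeholder:

    import cv2

    img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE).astype("float32")
    sigma = 1.6
    k = 2 ** 0.5

    # Subtracting two Gaussian blurs approximates the Laplacian of Gaussian.
    g1 = cv2.GaussianBlur(img, (0, 0), sigma)
    g2 = cv2.GaussianBlur(img, (0, 0), k * sigma)
    dog = g2 - g1  # band-pass response; extrema over space and scale -> keypoints
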

References
Journal ArticleDOI
TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
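
A hedged sketch of the matching stage described above: the brute-force search below stands in for the paper's fast approximate nearest-neighbor (best-bin-first) algorithm, and the 0.75 ratio threshold is a common choice rather than a fixed constant of the method:

    import cv2

    sift = cv2.SIFT_create()
    img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder files
    img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # Two nearest neighbors per descriptor, then Lowe's ratio test: keep a
    # match only if it is clearly better than the second-best candidate.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    pairs = matcher.knnMatch(des1, des2, k=2)
    good = [m for m, n in pairs if m.distance < 0.75 * n.distance]
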

46,906 citations

Proceedings ArticleDOI
20 Sep 1999
TL;DR: Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.
Abstract: An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low residual least squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.

16,989 citations

Proceedings ArticleDOI
01 Jan 1988
TL;DR: The problem the authors are addressing in Alvey Project MMI149 is that of using computer vision to understand the unconstrained 3D world, in which the viewed scenes will in general contain too wide a diversity of objects for top-down recognition techniques to work.
Abstract: The problem we are addressing in Alvey Project MMI149 is that of using computer vision to understand the unconstrained 3D world, in which the viewed scenes will in general contain too wide a diversity of objects for topdown recognition techniques to work. For example, we desire to obtain an understanding of natural scenes, containing roads, buildings, trees, bushes, etc., as typified by the two frames from a sequence illustrated in Figure 1. The solution to this problem that we are pursuing is to use a computer vision system based upon motion analysis of a monocular image sequence from a mobile camera. By extraction and tracking of image features, representations of the 3D analogues of these features can be constructed.
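
For reference, a minimal sketch of Harris corner extraction as exposed by OpenCV; the block size, Sobel aperture, k, and the response threshold are typical defaults, not values from the Alvey paper:

    import cv2
    import numpy as np

    img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # placeholder file
    response = cv2.cornerHarris(np.float32(img), blockSize=2, ksize=3, k=0.04)

    # Keep locations whose corner response is within 1% of the maximum.
    corners = np.argwhere(response > 0.01 * response.max())
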

13,993 citations

Journal ArticleDOI
TL;DR: It is observed that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best and Moments and steerable filters show the best performance among the low dimensional descriptors.
Abstract: In this paper, we compare the performance of descriptors computed for local interest regions, as, for example, extracted by the Harris-Affine detector [Mikolajczyk, K and Schmid, C, 2004]. Many different descriptors have been proposed in the literature. It is unclear which descriptors are more appropriate and how their performance depends on the interest region detector. The descriptors should be distinctive and at the same time robust to changes in viewing conditions as well as to errors of the detector. Our evaluation uses as criterion recall with respect to precision and is carried out for different image transformations. We compare shape context [Belongie, S, et al., April 2002], steerable filters [Freeman, W and Adelson, E, Sept. 1991], PCA-SIFT [Ke, Y and Sukthankar, R, 2004], differential invariants [Koenderink, J and van Doorn, A, 1987], spin images [Lazebnik, S, et al., 2003], SIFT [Lowe, D. G., 1999], complex filters [Schaffalitzky, F and Zisserman, A, 2002], moment invariants [Van Gool, L, et al., 1996], and cross-correlation for different types of interest regions. We also propose an extension of the SIFT descriptor and show that it outperforms the original method. Furthermore, we observe that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best. Moments and steerable filters show the best performance among the low dimensional descriptors.
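
A small sketch of the evaluation criterion used in this paper (recall with respect to precision), assuming a list of candidate matches sorted by descriptor distance with ground-truth labels marking which are correct; the function name and input format are illustrative assumptions:

    import numpy as np

    def recall_vs_one_minus_precision(labels):
        # labels: True where the i-th ranked match is a correct correspondence.
        labels = np.asarray(labels, dtype=bool)
        tp = np.cumsum(labels)   # correct matches accepted so far
        fp = np.cumsum(~labels)  # false matches accepted so far
        recall = tp / labels.sum()
        one_minus_precision = fp / (tp + fp)
        return recall, one_minus_precision
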

7,057 citations

Journal ArticleDOI
TL;DR: The high utility of MSERs, multiple measurement regions and the robust metric is demonstrated in wide-baseline experiments on image pairs from both indoor and outdoor scenes.

3,422 citations
