Journal Article • DOI

Distinctive Image Features from Scale-Invariant Keypoints

01 Nov 2004 • International Journal of Computer Vision (Kluwer Academic Publishers) • Vol. 60, Iss. 2, pp. 91-110
TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
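As a concrete illustration of the matching stage described in the abstract, the sketch below uses OpenCV's SIFT implementation together with Lowe's distance-ratio test; the file names and the 0.75 threshold are illustrative assumptions, not values taken from this page.

```python
# Minimal sketch: SIFT keypoint matching with Lowe's distance-ratio test.
# Assumes OpenCV >= 4.4 (SIFT is in the main package) and two local images.
import cv2

img1 = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical file names
img2 = cv2.imread("object.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# For each descriptor, find its two nearest neighbors in the other image.
matcher = cv2.BFMatcher(cv2.NORM_L2)
knn = matcher.knnMatch(des1, des2, k=2)

# Keep a match only if it is clearly better than the runner-up
# (the 0.75 ratio is a commonly used value, not taken from this page).
good = [m for m, n in knn if m.distance < 0.75 * n.distance]
print(f"{len(good)} putative matches survive the ratio test")
```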


Citations
Journal Article • DOI
26 Jul 2010
TL;DR: The most important perceptual aspects of stereo vision are discussed, and their implications for stereoscopic content creation are formalized into a set of basic disparity mapping operators that enable the depth of a stereoscopic scene to be controlled and retargeted in a nonlinear and locally adaptive fashion.
Abstract: This paper addresses the problem of remapping the disparity range of stereoscopic images and video. Such operations are highly important for a variety of issues arising from the production, live broadcast, and consumption of 3D content. Our work is motivated by the observation that the displayed depth and the resulting 3D viewing experience are dictated by a complex combination of perceptual, technological, and artistic constraints. We first discuss the most important perceptual aspects of stereo vision and their implications for stereoscopic content creation. We then formalize these insights into a set of basic disparity mapping operators. These operators enable us to control and retarget the depth of a stereoscopic scene in a nonlinear and locally adaptive fashion. To implement our operators, we propose a new strategy based on stereoscopic warping of the input video streams. From a sparse set of stereo correspondences, our algorithm computes disparity and image-based saliency estimates, and uses them to compute a deformation of the input views so as to meet the target disparities. Our approach represents a practical solution for actual stereo production and display that does not require camera calibration, accurate dense depth maps, occlusion handling, or inpainting. We demonstrate the performance and versatility of our method using examples from live action post-production, 3D display size adaptation, and live broadcast. An additional user study and ground truth comparison further provide evidence for the quality and practical relevance of the presented work.
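To make the notion of a disparity mapping operator concrete, here is a minimal sketch of one plausible global nonlinear operator, a logarithmic compression of the disparity range into a display-dependent target range; it is an illustrative assumption, not one of the paper's actual operators.

```python
import numpy as np

def remap_disparity(d, d_min, d_max, t_min, t_max):
    """Nonlinearly compress disparities in [d_min, d_max] into [t_min, t_max].

    Illustrative global operator: large disparities are compressed more
    strongly than small ones (logarithmic mapping).
    """
    # Normalize to [0, 1], apply a log curve through (0, 0) and (1, 1),
    # then rescale to the target range.
    x = (d - d_min) / (d_max - d_min)
    x = np.log1p(9.0 * x) / np.log(10.0)
    return t_min + x * (t_max - t_min)

# Example: squeeze a [-80, 120] pixel disparity range into [-20, 30].
d = np.array([-80.0, 0.0, 120.0])
print(remap_disparity(d, -80, 120, -20, 30))
```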

418 citations

Proceedings Article • DOI
20 Jun 2009
TL;DR: This paper models spatio-temporal context information hierarchically, exploiting three levels of context in ascending order of abstraction, and employs the Multiple Kernel Learning (MKL) technique to prune kernels and speed up algorithm evaluation.
Abstract: The problem of recognizing actions in realistic videos is challenging yet absorbing owing to its great potentials in many practical applications. Most previous research is limited due to the use of simplified action databases under controlled environments or focus on excessively localized features without sufficiently encapsulating the spatio-temporal context. In this paper, we propose to model the spatio-temporal context information in a hierarchical way, where three levels of context are exploited in ascending order of abstraction: 1) point-level context (SIFT average descriptor), 2) intra-trajectory context (trajectory transition descriptor), and 3) inter-trajectory context (trajectory proximity descriptor). To obtain efficient and compact representations for the latter two levels, we encode the spatiotemporal context information into the transition matrix of a Markov process, and then extract its stationary distribution as the final context descriptor. Building on the multichannel nonlinear SVMs, we validate this proposed hierarchical framework on the realistic action (HOHA) and event (LSCOM) recognition databases, and achieve 27% and 66% relative performance improvements over the state-of-the-art results, respectively. We further propose to employ the Multiple Kernel Learning (MKL) technique to prune the kernels towards speedup in algorithm evaluation.
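The abstract's device of summarizing spatio-temporal context as the stationary distribution of a Markov transition matrix can be sketched in a few lines of numpy; the 3-state matrix below is a made-up toy, not data from the paper.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution pi of a row-stochastic transition matrix P,
    i.e. the left eigenvector satisfying pi @ P = pi, normalized to sum to 1."""
    vals, vecs = np.linalg.eig(P.T)                   # left eigenvectors of P
    pi = np.real(vecs[:, np.argmax(np.real(vals))])   # eigenvalue closest to 1
    pi = np.abs(pi)
    return pi / pi.sum()

# Toy 3-state transition matrix (rows sum to 1); values are illustrative only.
P = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])
print(stationary_distribution(P))   # fixed-length descriptor of the process
```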

418 citations


Cites methods from "Distinctive Image Features from Sca..."

  • ...In this work, we model the spatio-temporal context information encoded in unconstrained videos based on the SIFT-based [10] trajectory, in a hierarchy of three abstraction levels (see Fig....

    [...]

  • ...For spatially salient point detection and representation, we adopt the well established SIFT [10] technique, the effectiveness of which has been validated in numerous visual tasks, such as object recognition [9] and robot navigation [15]....

    [...]

Proceedings Articleā€¢DOIā€¢
Lyndon Kennedy, Mor Naaman, Shane Ahern, Rahul Nair, Tye Rattenbury
29 Sep 2007
TL;DR: A location-tag-vision-based approach to retrieving images of geography-related landmarks and features from the Flickr dataset is demonstrated, suggesting that community-contributed media and annotation can enhance and improve access to multimedia resources - and the understanding of the world.
Abstract: The advent of media-sharing sites like Flickr and YouTube has drastically increased the volume of community-contributed multimedia resources available on the web. These collections have a previously unimagined depth and breadth, and have generated new opportunities - and new challenges - to multimedia research. How do we analyze, understand and extract patterns from these new collections? How can we use these unstructured, unrestricted community contributions of media (and annotation) to generate "knowledge"? As a test case, we study Flickr - a popular photo sharing website. Flickr supports photo, time and location metadata, as well as a light-weight annotation model. We extract information from this dataset using two different approaches. First, we employ a location-driven approach to generate aggregate knowledge in the form of "representative tags" for arbitrary areas in the world. Second, we use a tag-driven approach to automatically extract place and event semantics for Flickr tags, based on each tag's metadata patterns. With the patterns we extract from tags and metadata, vision algorithms can be employed with greater precision. In particular, we demonstrate a location-tag-vision-based approach to retrieving images of geography-related landmarks and features from the Flickr dataset. The results suggest that community-contributed media and annotation can enhance and improve our access to multimedia resources - and our understanding of the world.
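A minimal sketch of the "representative tags for an area" idea: score each tag by its frequency inside an area, discounted by its global frequency, a tf-idf-style heuristic. The scoring function and the toy data are assumptions, not the paper's exact method.

```python
import math
from collections import Counter

def representative_tags(area_tags, all_tags, top_k=3):
    """Rank tags that are frequent in an area but rare globally (tf-idf style)."""
    area_counts = Counter(area_tags)
    global_counts = Counter(all_tags)
    total = sum(global_counts.values())
    scores = {t: c * math.log(total / global_counts[t])
              for t, c in area_counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy data: photo tags inside one map cell vs. the whole collection.
area = ["goldengatebridge", "bridge", "sanfrancisco", "goldengatebridge"]
world = area + ["bridge", "party", "party", "sanfrancisco", "dog"] * 100
print(representative_tags(area, world))
```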

417 citations

Proceedings Article • DOI
16 Jun 2012
TL;DR: It is shown that a recognition system using only representations obtained from deep learning can achieve accuracy comparable to a system using a combination of hand-crafted image descriptors, and it is empirically shown that learning weights is not only necessary for obtaining good multilayer representations but also provides robustness to the choice of network architecture parameters.
Abstract: Most modern face recognition systems rely on a feature representation given by a hand-crafted image descriptor, such as Local Binary Patterns (LBP), and achieve improved performance by combining several such representations. In this paper, we propose deep learning as a natural source for obtaining additional, complementary representations. To learn features in high-resolution images, we make use of convolutional deep belief networks. Moreover, to take advantage of global structure in an object class, we develop local convolutional restricted Boltzmann machines, a novel convolutional learning model that exploits the global structure by not assuming stationarity of features across the image, while maintaining scalability and robustness to small misalignments. We also present a novel application of deep learning to descriptors other than pixel intensity values, such as LBP. In addition, we compare performance of networks trained using unsupervised learning against networks with random filters, and empirically show that learning weights not only is necessary for obtaining good multilayer representations, but also provides robustness to the choice of the network architecture parameters. Finally, we show that a recognition system using only representations obtained from deep learning can achieve comparable accuracy with a system using a combination of hand-crafted image descriptors. Moreover, by combining these representations, we achieve state-of-the-art results on a real-world face verification database.
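The architectural idea highlighted in the abstract, dropping the stationarity assumption, amounts to giving each image region its own filters rather than sharing one filter bank everywhere. The toy numpy sketch below contrasts the two; it illustrates the weight-untying idea only and is not the paper's local convolutional restricted Boltzmann machine.

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))

# Standard convolution: one 5x5 filter shared across the whole image
# (this assumes feature statistics are stationary everywhere).
shared_filter = rng.standard_normal((5, 5))
shared_out = convolve2d(img, shared_filter, mode="valid")

# Locally untied variant: split the image into 2x2 blocks and give each
# block its own filter, so different regions (e.g. eyes vs. mouth in a
# face image) can specialize.
block_outs = []
for rows in (slice(0, 16), slice(16, 32)):
    for cols in (slice(0, 16), slice(16, 32)):
        local_filter = rng.standard_normal((5, 5))  # private to this block
        block_outs.append(convolve2d(img[rows, cols], local_filter, mode="valid"))

print(shared_out.shape, [b.shape for b in block_outs])
```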

417 citations

Journal Article • DOI
TL;DR: The authors' method can remove mismatches from thousands of putative correspondences in only a few milliseconds, achieving better or favorably competitive accuracy compared with state-of-the-art alternatives while cutting time cost by more than two orders of magnitude.
Abstract: Seeking reliable correspondences between two feature sets is a fundamental and important task in computer vision. This paper attempts to remove mismatches from given putative image feature correspondences. To achieve the goal, an efficient approach, termed as locality preserving matching (LPM), is designed, the principle of which is to maintain the local neighborhood structures of those potential true matches. We formulate the problem into a mathematical model, and derive a closed-form solution with linearithmic time and linear space complexities. Our method can accomplish the mismatch removal from thousands of putative correspondences in only a few milliseconds. To demonstrate the generality of our strategy for handling image matching problems, extensive experiments on various real image pairs for general feature matching, as well as for point set registration, visual homing and near-duplicate image retrieval are conducted. Compared with other state-of-the-art alternatives, our LPM achieves better or favorably competitive performance in accuracy while intensively cutting time cost by more than two orders of magnitude.
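The locality preserving principle in the abstract, that true matches keep their local neighborhood structure while mismatches do not, can be sketched as a simple filter: keep a putative match only if enough of its k nearest neighbors in one image are matched to neighbors in the other. This is a simplified sketch of the idea, not the paper's closed-form linearithmic solution; k and the threshold are arbitrary choices.

```python
import numpy as np

def lpm_filter(pts1, pts2, k=6, min_common=3):
    """Keep putative matches (pts1[i] <-> pts2[i]) whose k-nearest-neighbor
    sets are largely preserved across the two images (simplified LPM idea)."""
    n = len(pts1)

    def knn(pts):
        d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        return np.argsort(d, axis=1)[:, :k]   # indices of k nearest neighbors

    nn1, nn2 = knn(pts1), knn(pts2)
    keep = [len(set(nn1[i]) & set(nn2[i])) >= min_common for i in range(n)]
    return np.array(keep)

# Toy data: 50 consistent matches plus 10 scrambled outliers.
rng = np.random.default_rng(1)
a = rng.uniform(0, 100, (60, 2))
b = a + rng.normal(0, 0.5, a.shape)       # roughly the same layout
b[50:] = rng.uniform(0, 100, (10, 2))     # mismatched tail
print(lpm_filter(a, b).sum(), "of 60 matches kept")
```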

416 citations


Cites background or methods from "Distinctive Image Features from Sca..."

  • ...Guo and Cao (2012) proposed a triangle constraint, which can produce better putative correspondences in terms of quantity and accuracy compared with the distance ratio in Lowe (2004)....

    [...]

  • ...In this paper, the similarity function s is defined as follows: we first establish SIFT putative feature correspondences and subsequently use our LPM to remove false matches; the similarity s(Ii, Ij) is then assigned by the number of preserved matches on the two given images Ii and Ij....

    [...]

  • ...Fortunately, there are several well-designed feature descriptors (e.g., SIFT; Lowe 2004) that can efficiently establish putative correspondences between feature sets; therefore, we consider this component an easy mission....

    [...]

  • ...In this paper, we treat the target task as a matching problem between two sets of discrete points, where each point is an image feature extracted by a feature detector and has a local image descriptor, e.g. the scale invariant feature transform (SIFT) (Lowe 2004)....

    [...]

  • ...Therefore, they are not associated with local image descriptors such as SIFT....

    [...]

References
Proceedings Article • DOI
20 Sep 1999
TL;DR: Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.
Abstract: An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low residual least squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.
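The "staged filtering approach that identifies stable points in scale space" can be roughly approximated in a few lines: build a stack of Gaussian-blurred images, take differences of adjacent scales, and keep pixels that are extrema of their 26 scale-space neighbors. The constants below are conventional choices, not values from this page.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_extrema(img, num_scales=5, sigma0=1.6, k=2 ** 0.5):
    """Return a boolean map of difference-of-Gaussian scale-space extrema."""
    blurred = [gaussian_filter(img, sigma0 * k ** i) for i in range(num_scales)]
    dog = np.stack([blurred[i + 1] - blurred[i]
                    for i in range(num_scales - 1)])   # (scales, H, W)
    # A point is a keypoint candidate if it is the max or min of its
    # 3x3x3 neighborhood across space and scale.
    is_max = dog == maximum_filter(dog, size=3)
    is_min = dog == minimum_filter(dog, size=3)
    return (is_max | is_min) & (np.abs(dog) > 0.01)    # cull weak responses

img = np.random.default_rng(2).random((64, 64))
print(dog_extrema(img).sum(), "candidate extrema")
```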

16,989 citations


"Distinctive Image Features from Sca..." refers background or methods in this paper

  • ...The initial implementation of this approach (Lowe, 1999) simply located keypoints at the location and scale of the central sample point....

    [...]

  • ...Earlier work by the author (Lowe, 1999) extended the local feature approach to achieve scale invariance....

    [...]

  • ...More details on applications of these features to recognition are available in other papers (Lowe, 1999; Lowe, 2001; Se, Lowe and Little, 2002)....

    [...]

  • ...To efficiently detect stable keypoint locations in scale space, we have proposed (Lowe, 1999) using scale-space extrema in the difference-of-Gaussian function convolved with the image, D(x, y, σ), which can be computed from the difference of two nearby scales separated by a constant multiplicative....

    [...]

  • ...More details on applications of these features to recognition are available in other papers (Lowe, 1999, 2001; Se et al., 2002)....

    [...]
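For reference, the difference-of-Gaussian function D(x, y, σ) mentioned in the excerpts above is defined in the paper as the difference of two Gaussian-smoothed images L at nearby scales separated by a constant multiplicative factor k:

```latex
D(x, y, \sigma) = \bigl(G(x, y, k\sigma) - G(x, y, \sigma)\bigr) * I(x, y)
                = L(x, y, k\sigma) - L(x, y, \sigma)
```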

Book
01 Jan 2000
TL;DR: In this book, the authors provide comprehensive background material and explain how to apply the methods and implement the algorithms directly in a unified framework, including geometric principles and how to represent objects algebraically so they can be computed and applied.
Abstract: From the Publisher: A basic problem in computer vision is to understand the structure of a real world scene given several images of it. Recent major developments in the theory and practice of scene reconstruction are described in detail in a unified framework. The book covers the geometric principles and how to represent objects algebraically so they can be computed and applied. The authors provide comprehensive background material and explain how to apply the methods and implement the algorithms directly.

15,558 citations

01 Jan 2001
This entry corresponds to Multiple View Geometry in Computer Vision by Hartley and Zisserman.

14,282 citations


"Distinctive Image Features from Sca..." refers background in this paper

  • ...A more general solution would be to solve for the fundamental matrix (Luong and Faugeras, 1996; Hartley and Zisserman, 2000)....

    [...]
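The fundamental matrix referred to in this excerpt encodes the epipolar constraint between corresponding homogeneous points x and x′ in two views; in the standard notation of Hartley and Zisserman:

```latex
\mathbf{x}'^{\top} F \, \mathbf{x} = 0,
\qquad F \in \mathbb{R}^{3 \times 3},\ \operatorname{rank}(F) = 2
```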

Proceedings Article • DOI
01 Jan 1988
TL;DR: The problem the authors are addressing in Alvey Project MMI149 is that of using computer vision to understand the unconstrained 3D world, in which the viewed scenes will in general contain too wide a diversity of objects for top-down recognition techniques to work.
Abstract: The problem we are addressing in Alvey Project MMI149 is that of using computer vision to understand the unconstrained 3D world, in which the viewed scenes will in general contain too wide a diversity of objects for topdown recognition techniques to work. For example, we desire to obtain an understanding of natural scenes, containing roads, buildings, trees, bushes, etc., as typified by the two frames from a sequence illustrated in Figure 1. The solution to this problem that we are pursuing is to use a computer vision system based upon motion analysis of a monocular image sequence from a mobile camera. By extraction and tracking of image features, representations of the 3D analogues of these features can be constructed.
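This reference appears to be Harris and Stephens' combined corner and edge detector from the 1988 Alvey Vision Conference; that identification, and the usual choice k ≈ 0.04-0.06, are background knowledge rather than statements from this page. The corner response the paper introduced scores each pixel from the local gradient autocorrelation matrix:

```latex
M = \sum_{(x,y) \in W} w(x,y)
\begin{pmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{pmatrix},
\qquad R = \det(M) - k \,\operatorname{tr}(M)^2
```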

13,993 citations

Journal Article • DOI
TL;DR: The high utility of MSERs, multiple measurement regions and the robust metric is demonstrated in wide-baseline experiments on image pairs from both indoor and outdoor scenes.
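A minimal sketch of detecting MSERs with OpenCV's built-in detector; the input file name is an assumption, and OpenCV's implementation differs in detail from the paper's original.

```python
import cv2

img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical input

# Maximally Stable Extremal Regions: connected components whose area stays
# nearly constant over a range of intensity thresholds.
mser = cv2.MSER_create()
regions, bboxes = mser.detectRegions(img)
print(len(regions), "MSER regions detected")
```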

3,422 citations
