
Showing papers on "Scale-invariant feature transform published in 2007"


Proceedings ArticleDOI
29 Sep 2007
TL;DR: This paper uses a bag of words approach to represent videos, and presents a method to discover relationships between spatio-temporal words in order to better describe the video data.
Abstract: In this paper we introduce a 3-dimensional (3D) SIFT descriptor for video or 3D imagery such as MRI data. We also show how this new descriptor is able to better represent the 3D nature of video data in the application of action recognition. This paper will show how 3D SIFT is able to outperform previously used description methods in an elegant and efficient manner. We use a bag of words approach to represent videos, and present a method to discover relationships between spatio-temporal words in order to better describe the video data.

1,757 citations
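The bag-of-words representation mentioned above quantizes each local descriptor to its nearest entry in a codebook and histograms the assignments. A minimal sketch with a hypothetical toy codebook (real 3D SIFT descriptors would be high-dimensional and the codebook learned by clustering):

```python
# Illustrative sketch (not the paper's implementation): quantizing local
# descriptors into a bag-of-words histogram with a fixed toy codebook.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def bag_of_words(descriptors, codebook):
    """Assign each descriptor to its nearest codeword and count occurrences."""
    hist = [0] * len(codebook)
    for d in descriptors:
        nearest = min(range(len(codebook)), key=lambda i: euclidean(d, codebook[i]))
        hist[nearest] += 1
    return hist

codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]          # toy "visual words"
descriptors = [(0.1, 0.1), (0.9, 1.0), (0.2, 0.0), (0.1, 0.9)]
histogram = bag_of_words(descriptors, codebook)           # [2, 1, 1]
```

Two videos can then be compared by comparing their histograms, independently of where in the video each spatio-temporal word occurs.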


Journal ArticleDOI
TL;DR: A fully automatic face recognition algorithm that is multimodal (2D and 3D) and performs hybrid (feature based and holistic) matching in order to achieve efficiency and robustness to facial expressions is presented.
Abstract: We present a fully automatic face recognition algorithm and demonstrate its performance on the FRGC v2.0 data. Our algorithm is multimodal (2D and 3D) and performs hybrid (feature based and holistic) matching in order to achieve efficiency and robustness to facial expressions. The pose of a 3D face along with its texture is automatically corrected using a novel approach based on a single automatically detected point and the Hotelling transform. A novel 3D spherical face representation (SFR) is used in conjunction with the scale-invariant feature transform (SIFT) descriptor to form a rejection classifier, which quickly eliminates a large number of candidate faces at an early stage for efficient recognition in case of large galleries. The remaining faces are then verified using a novel region-based matching approach, which is robust to facial expressions. This approach automatically segments the eyes-forehead and the nose regions, which are relatively less sensitive to expressions, and matches them separately using a modified iterative closest point (ICP) algorithm. The results of all the matching engines are fused at the metric level to achieve higher accuracy. We use the FRGC benchmark to compare our results to other algorithms that used the same database. Our multimodal hybrid algorithm performed better than others by achieving 99.74 percent and 98.31 percent verification rates at a 0.001 false acceptance rate (FAR) and identification rates of 99.02 percent and 95.37 percent for probes with a neutral and a nonneutral expression, respectively.

495 citations



Proceedings ArticleDOI
Simon Winder1, Matthew Brown1
17 Jun 2007
TL;DR: The best descriptors were those with log polar histogramming regions and feature vectors constructed from rectified outputs of steerable quadrature filters, which gave one third of the incorrect matches produced by SIFT.
Abstract: In this paper we study interest point descriptors for image matching and 3D reconstruction. We examine the building blocks of descriptor algorithms and evaluate numerous combinations of components. Various published descriptors such as SIFT, GLOH, and Spin images can be cast into our framework. For each candidate algorithm we learn good choices for parameters using a training set consisting of patches from a multi-image 3D reconstruction where accurate ground-truth matches are known. The best descriptors were those with log polar histogramming regions and feature vectors constructed from rectified outputs of steerable quadrature filters. At a 95% detection rate these gave one third of the incorrect matches produced by SIFT.

433 citations


Proceedings ArticleDOI
26 Dec 2007
TL;DR: An affine invariant shape descriptor for maximally stable extremal regions (MSER) is introduced that uses only the shape of the detected MSER itself and can achieve the best performance under a range of imaging conditions by matching both the texture and shape descriptors.
Abstract: This paper introduces an affine invariant shape descriptor for maximally stable extremal regions (MSER). Affine invariant feature descriptors are normally computed by sampling the original grey-scale image in an invariant frame defined from each detected feature, but we instead use only the shape of the detected MSER itself. This has the advantage that features can be reliably matched regardless of the appearance of the surroundings of the actual region. The descriptor is computed using the scale invariant feature transform (SIFT), with the resampled MSER binary mask as input. We also show that the original MSER detector can be modified to achieve better scale invariance by detecting MSERs in a scale pyramid. We make extensive comparisons of the proposed feature against a SIFT descriptor computed on grey-scale patches, and also explore the possibility of grouping the shape descriptors into pairs to incorporate more context. While the descriptor does not perform as well on planar scenes, we demonstrate various categories of full 3D scenes where it outperforms the SIFT descriptor computed on grey-scale patches. The shape descriptor is also shown to be more robust to changes in illumination. We show that a system can achieve the best performance under a range of imaging conditions by matching both the texture and shape descriptors.

245 citations


Proceedings ArticleDOI
10 Apr 2007
TL;DR: The use of a recently developed feature, SURF, is proposed to improve the performance of appearance-based localization methods that perform image retrieval in large data sets, showing the use of SURF as the best compromise between efficiency and accuracy in the results.
Abstract: Many robotic applications work with visual reference maps, which usually consist of sets of more or less organized images. In these applications, there is a compromise between the density of reference data stored and the capacity to later localize the robot when it is not exactly at the same position as one of the reference views. Here we propose the use of a recently developed feature, SURF, to improve the performance of appearance-based localization methods that perform image retrieval in large data sets. This feature is integrated with a vision-based algorithm that allows both topological and metric localization using omnidirectional images in a hierarchical approach. It uses pyramidal kernels for the topological localization and three-view geometric constraints for the metric one. Experiments with several omnidirectional image sets are shown, including comparisons with other typically used features (radial lines and SIFT). The advantages of this approach are demonstrated, showing SURF to be the best compromise between efficiency and accuracy in the results.

243 citations


Proceedings ArticleDOI
09 Jul 2007
TL;DR: Two novel schemes for near duplicate image and video-shot detection based on global hierarchical colour histograms, using Locality Sensitive Hashing for fast retrieval and local feature descriptors, are proposed and compared.
Abstract: This paper proposes and compares two novel schemes for near duplicate image and video-shot detection. The first approach is based on global hierarchical colour histograms, using Locality Sensitive Hashing for fast retrieval. The second approach uses local feature descriptors (SIFT) and for retrieval exploits techniques used in the information retrieval community to compute approximate set intersections between documents using a min-Hash algorithm. The requirements for near-duplicate images vary according to the application, and we address two types of near-duplicate definition: (i) being perceptually identical (e.g. up to noise, discretization effects, small photometric distortions etc.); and (ii) being images of the same 3D scene (so allowing for viewpoint changes and partial occlusion). We define two shots to be near-duplicates if they share a large percentage of near-duplicate frames. We focus primarily on scalability to very large image and video databases, where fast query processing is necessary. Both methods are designed so that only a small amount of data need be stored for each image. In the case of near-duplicate shot detection it is shown that a weak approximation to histogram matching, consuming substantially less storage, is sufficient for good results. We demonstrate our methods on the TRECVID 2006 data set which contains approximately 165 hours of video (about 17.8M frames with 146K key frames), and also on feature films and pop videos.

237 citations
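The min-Hash trick used here rests on a simple fact: under a random hash of the element universe, two sets attain the same minimum with probability equal to their Jaccard similarity, so agreement across many independent hashes estimates set overlap without computing the intersection. A minimal sketch with hypothetical visual-word id sets:

```python
# Illustrative min-Hash sketch (toy data, universal hashing): the fraction
# of hash functions on which two sets agree in their minimum estimates
# their Jaccard similarity.
import random

random.seed(0)
P = 4294967311                      # prime larger than 2^32
N_HASHES = 200
coeffs = [(random.randrange(1, P), random.randrange(0, P)) for _ in range(N_HASHES)]

def minhash_signature(items):
    return [min((a * x + b) % P for x in items) for a, b in coeffs]

def estimated_jaccard(sig_a, sig_b):
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

img_a = set(range(0, 100))          # visual-word ids of one frame
img_b = set(range(50, 150))         # shares 50 of 150 distinct words
est = estimated_jaccard(minhash_signature(img_a), minhash_signature(img_b))
true_jaccard = len(img_a & img_b) / len(img_a | img_b)   # 1/3
```

The signature (200 integers here) is all that needs to be stored per image, which is what makes the approach scale to very large collections.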


Proceedings ArticleDOI
15 Apr 2007
TL;DR: Person-specific SIFT features and a simple non-statistical matching strategy, combining local and global similarity on key-point clusters, are used to solve face recognition problems; the experimental results demonstrate the robustness of SIFT features to expression, accessory and pose variations.
Abstract: Scale invariant feature transform (SIFT) proposed by Lowe has been widely and successfully applied to object detection and recognition. However, the representation ability of SIFT features in face recognition has rarely been investigated systematically. In this paper, we propose to use person-specific SIFT features and a simple non-statistical matching strategy combined with local and global similarity on key-point clusters to solve face recognition problems. Large scale experiments on the FERET and CAS-PEAL face databases using only one training sample per person have been carried out to compare it with other non person-specific features such as the Gabor wavelet feature and the local binary pattern feature. The experimental results demonstrate the robustness of SIFT features to expression, accessory and pose variations.

225 citations


Proceedings ArticleDOI
26 Dec 2007
TL;DR: This paper formulates descriptor design as a non-parametric dimensionality reduction problem, and adopts a discriminative approach that can exceed the performance of current state-of-the-art techniques such as SIFT with far fewer dimensions, and with virtually no parameters to be tuned by hand.
Abstract: Invariant feature descriptors such as SIFT and GLOH have been demonstrated to be very robust for image matching and visual recognition. However, such descriptors are generally parameterised in very high dimensional spaces, e.g. 128 dimensions in the case of SIFT. This limits the performance of feature matching techniques in terms of speed and scalability. Furthermore, these descriptors have traditionally been carefully hand crafted by manually tuning many parameters. In this paper, we tackle both of these problems by formulating descriptor design as a non-parametric dimensionality reduction problem. In contrast to previous approaches that use only the global statistics of the inputs, we adopt a discriminative approach. Starting from a large training set of labelled match/non-match pairs, we pursue lower dimensional embeddings that are optimised for their discriminative power. Extensive comparative experiments demonstrate that we can exceed the performance of current state-of-the-art techniques such as SIFT with far fewer dimensions, and with virtually no parameters to be tuned by hand.

211 citations


Proceedings Article
01 Jan 2007
TL;DR: This paper addresses the issues of outdoor appearance-based topological localization for a mobile robot over time and shows that two variants of SURF, called U-SURF and SURF-128, outperform the other algorithms in terms of accuracy and speed.
Abstract: Local feature matching has become a commonly used method to compare images. For mobile robots, a reliable method for comparing images can constitute a key component for localization and loop closing tasks. In this paper, we address the issues of outdoor appearance-based topological localization for a mobile robot over time. Our data sets, each consisting of a large number of panoramic images, have been acquired over a period of nine months with large seasonal changes (snow-covered ground, bare trees, autumn leaves, dense foliage, etc.). Two different types of image feature algorithms, SIFT and the more recent SURF, have been used to compare the images. We show that two variants of SURF, called U-SURF and SURF-128, outperform the other algorithms in terms of accuracy and speed.

175 citations


Proceedings ArticleDOI
12 Apr 2007
TL;DR: This method extends the concepts used in the computer vision SIFT technique for extracting and matching distinctive scale invariant features in 2D scalar images to scalar images of arbitrary dimensionality by using hyperspherical coordinates for gradients and multidimensional histograms to create the feature vectors.
Abstract: We present a fully automated multimodal medical image matching technique. Our method extends the concepts used in the computer vision SIFT technique for extracting and matching distinctive scale invariant features in 2D scalar images to scalar images of arbitrary dimensionality. This extension involves using hyperspherical coordinates for gradients and multidimensional histograms to create the feature vectors. These features were successfully applied to determine accurate feature point correspondence between pairs of medical images (3D) and dynamic volumetric data (3D+time).

Journal ArticleDOI
TL;DR: In this paper, the authors compare and evaluate how well different implementations of SIFT and SURF perform in terms of invariancy and runtime efficiency for object detection and object recognition.

Proceedings ArticleDOI
03 Dec 2007
TL;DR: A 3D interest point detector that is based on SURF and a 3D descriptor that extends SIFT are proposed that are applied to the problem of detecting repeated structure in range images, and promising results are reported.
Abstract: This paper presents a method for describing and recognising local structure in 3D images. The method extends proven techniques for 2D object recognition in images. In particular, we propose a 3D interest point detector that is based on SURF, and a 3D descriptor that extends SIFT. The method is applied to the problem of detecting repeated structure in range images, and promising results are reported.

Proceedings ArticleDOI
26 Dec 2007
TL;DR: It is shown experimentally that the transformation allows a significant dimensionality reduction and improves matching performance of a state-of-the-art SIFT descriptor, with consistent improvement in precision-recall and in the speed of fast matching in tree structures, at the expense of little overhead for projecting the descriptors into the transformed space.
Abstract: In this paper we propose to transform an image descriptor so that nearest neighbor (NN) search for correspondences becomes the optimal matching strategy under the assumption that inter-image deviations of corresponding descriptors have Gaussian distribution. The Euclidean NN in the transformed domain corresponds to the NN according to a truncated Mahalanobis metric in the original descriptor space. We provide theoretical justification for the proposed approach and show experimentally that the transformation allows a significant dimensionality reduction and improves matching performance of a state-of-the-art SIFT descriptor. We observe consistent improvement in precision-recall and speed of fast matching in tree structures at the expense of little overhead for projecting the descriptors into the transformed space. In the context of SIFT vs. transformed M-SIFT comparison, tree search structures are evaluated according to different criteria and query types. All search tree experiments confirm that transformed M-SIFT performs better than the original SIFT.
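The whitening idea can be illustrated in a toy setting: scaling each descriptor dimension by the inverse of its deviation (the diagonal-covariance special case of such a transform) makes plain Euclidean distance coincide with the Mahalanobis distance. A sketch under that simplifying assumption, with hypothetical toy vectors:

```python
# Illustrative sketch (toy 2D data, diagonal covariance for simplicity):
# dividing each dimension by its standard deviation makes Euclidean
# distance in the transformed space equal the Mahalanobis distance in
# the original space.

def transform(v, stds):
    return [x / s for x, s in zip(v, stds)]

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def mahalanobis_diag(a, b, stds):
    return sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, stds)) ** 0.5

stds = [2.0, 0.5]                   # per-dimension deviation of matching pairs
a, b = [1.0, 1.0], [3.0, 1.5]
d_maha = mahalanobis_diag(a, b, stds)
d_eucl = euclidean(transform(a, stds), transform(b, stds))   # identical
```

The practical payoff is exactly what the abstract states: after a one-off projection, ordinary Euclidean tree-search structures perform the statistically better matching for free.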

Proceedings ArticleDOI
28 May 2007
TL;DR: This paper reports on the implementation of a GPU-based, real-time eye blink detector on very low contrast images acquired under near-infrared illumination that is part of a multi-sensor data acquisition and analysis system for driver performance assessment and training.
Abstract: This paper reports on the implementation of a GPU-based, real-time eye blink detector on very low contrast images acquired under near-infrared illumination. This detector is part of a multi-sensor data acquisition and analysis system for driver performance assessment and training. Eye blinks are detected inside regions of interest that are aligned with the subject's eyes at initialization. Alignment is maintained through time by tracking SIFT feature points that are used to estimate the affine transformation between the initial face pose and the pose in subsequent frames. The GPU implementation of the SIFT feature point extraction algorithm ensures real-time processing. An eye blink detection rate of 97% is obtained on a video dataset of 33,000 frames showing 237 blinks from 22 subjects.
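The affine alignment step described above is determined by six parameters, so three tracked point pairs fix it exactly (in practice more pairs and least squares would be used). A minimal sketch with hypothetical points, solved per output row with Cramer's rule:

```python
# Illustrative sketch (toy points, not the paper's GPU pipeline): an
# affine transform x' = A x + t has six parameters, determined exactly
# by three point correspondences.

def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def solve3(m, rhs):
    """Solve a 3x3 linear system by Cramer's rule."""
    d = det3(m)
    sol = []
    for c in range(3):
        mc = [row[:] for row in m]
        for r in range(3):
            mc[r][c] = rhs[r]
        sol.append(det3(mc) / d)
    return sol

def affine_from_pairs(src, dst):
    m = [[x, y, 1.0] for x, y in src]
    row_x = solve3(m, [p[0] for p in dst])   # a, b, tx
    row_y = solve3(m, [p[1] for p in dst])   # c, d, ty
    return row_x, row_y

def apply_affine(rows, p):
    (a, b, tx), (c, d, ty) = rows
    return (a * p[0] + b * p[1] + tx, c * p[0] + d * p[1] + ty)

src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
dst = [(2.0, 3.0), (4.0, 3.0), (2.0, 5.0)]   # scale 2 plus translation (2, 3)
rows = affine_from_pairs(src, dst)
mapped = apply_affine(rows, (1.0, 1.0))
```

Once the affine transform is known, the initial regions of interest around the eyes can be warped to the current frame, which is how alignment is maintained over time.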

Book ChapterDOI
01 Jan 2007
TL;DR: A hand posture recognition system using the discrete Adaboost learning algorithm with Lowe’s scale invariant feature transform (SIFT) features is proposed to tackle the degraded performance due to background noise in training images and the in-plane rotation variant detection.
Abstract: Hand posture understanding is essential to human robot interaction. The existing hand detection approaches using a Viola-Jones detector have two fundamental issues, the degraded performance due to background noise in training images and the in-plane rotation variant detection. In this paper, a hand posture recognition system using the discrete Adaboost learning algorithm with Lowe’s scale invariant feature transform (SIFT) features is proposed to tackle these issues simultaneously. In addition, we apply a sharing feature concept to increase the accuracy of multi-class hand posture recognition. The experimental results demonstrate that the proposed approach successfully recognizes three hand posture classes and can deal with the background noise issues. Our detector is in-plane rotation invariant, and achieves satisfactory multi-view hand detection.

Proceedings ArticleDOI
12 Nov 2007
TL;DR: A novel ellipse detection algorithm which retains the original advantages of the Hough Transform while minimizing the storage and computation complexity and uses an accumulator that is only one dimensional.
Abstract: The main advantage of using the Hough Transform to detect ellipses is its robustness against missing data points. However, the storage and computational requirements of the Hough Transform preclude practical applications. Although there are many modifications to the Hough Transform, these modifications still demand significant storage requirement. In this paper, we present a novel ellipse detection algorithm which retains the original advantages of the Hough Transform while minimizing the storage and computation complexity. More specifically, we use an accumulator that is only one dimensional. As such, our algorithm is more effective in terms of storage requirement. In addition, our algorithm can be easily parallelized to achieve good execution time. Experimental results on both synthetic and real images demonstrate the robustness and effectiveness of our algorithm in which both complete and incomplete ellipses can be extracted.

Proceedings Article
30 Mar 2007
TL;DR: This work shows that for this application domain, the SIFT interest points can be dramatically pruned to effect large reductions in both memory requirements and query run-time, with almost negligible loss in effectiveness.
Abstract: The detection of image versions from large image collections is a formidable task, as two images are rarely identical. Geometric variations such as cropping and rotation, and slight photometric alterations, make content-based retrieval techniques unsuitable, whereas digital watermarking techniques have limited application for practical retrieval. Recently, the application of Scale Invariant Feature Transform (SIFT) interest points to this domain has shown high effectiveness, but scalability remains a problem due to the large number of features generated for each image. In this work, we show that for this application domain, the SIFT interest points can be dramatically pruned to effect large reductions in both memory requirements and query run-time, with almost negligible loss in effectiveness. We demonstrate that, unlike the original SIFT features, the pruned features scale better for collections containing hundreds of thousands of images.
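Pruning interest points can be as simple as ranking them by detector response and keeping only the strongest per image; the paper's actual pruning criteria are more refined, so the following is only an illustrative sketch with hypothetical keypoint records:

```python
# Illustrative sketch (hypothetical keypoints, not the paper's pruning
# rules): keep the top-k interest points per image by detector response,
# shrinking both the index and query time.

def prune_keypoints(keypoints, keep=2):
    """keypoints: list of (response, descriptor) pairs."""
    return sorted(keypoints, key=lambda kp: kp[0], reverse=True)[:keep]

keypoints = [(0.9, "d1"), (0.1, "d2"), (0.5, "d3"), (0.05, "d4")]
pruned = prune_keypoints(keypoints, keep=2)
```

Because low-response points are also the least repeatable, discarding them costs little effectiveness while cutting storage roughly in proportion to the pruning ratio.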

Journal ArticleDOI
TL;DR: A hierarchical approach for building recognition using a method for selecting discriminative SIFT features and a simple probabilistic model for integration of the evidence from individual matches based on the match quality is proposed.

Proceedings ArticleDOI
25 Jun 2007
TL;DR: An approach for classifying images of charts based on the shape and spatial relationships of their primitives and two novel features to represent the structural information based on region segmentation and curve saliency are introduced.
Abstract: We present an approach for classifying images of charts based on the shape and spatial relationships of their primitives. Five categories are considered: bar-charts, curve-plots, pie-charts, scatter-plots and surface-plots. We introduce two novel features to represent the structural information based on (a) region segmentation and (b) curve saliency. The local shape is characterized using the Histograms of Oriented Gradients (HOG) and the Scale Invariant Feature Transform (SIFT) descriptors. Each image is represented by sets of feature vectors of each modality. The similarity between two images is measured by the overlap in the distribution of the features, measured using the Pyramid Match algorithm. A test image is classified based on its similarity with training images from the categories. The approach is tested with a database of images collected from the Internet.

Proceedings ArticleDOI
28 May 2007
TL;DR: A performance evaluation framework for visual feature extraction and matching in the visual simultaneous localization and mapping (SLAM) context is presented and shows that all methods can be made to perform well, although it is possible to distinguish between the three.
Abstract: We present a performance evaluation framework for visual feature extraction and matching in the visual simultaneous localization and mapping (SLAM) context. Although feature extraction is a crucial component, no qualitative study comparing different techniques from the visual SLAM perspective exists. We extend previous image pair evaluation methods to handle non-planar scenes and the multiple image sequence requirements of our application, and compare three popular feature extractors used in visual SLAM: the Harris corner detector, the Kanade-Lucas-Tomasi tracker (KLT), and the scale-invariant feature transform (SIFT). We present results from a typical indoor environment in the form of recall/precision curves, and also investigate the effect of increasing distance between image viewpoints on extractor performance. Our results show that all methods can be made to perform well, although it is possible to distinguish between the three. We conclude by presenting guidelines for selecting a feature extractor for visual SLAM based on our experiments.

Patent
01 Aug 2007
TL;DR: In this paper, the authors proposed a method for object recognition based on nearest neighbor search of local descriptors such as SIFT, which is based on the observation that the level of accuracy of nearest neighbour search for correct recognition depends on images to be recognized.
Abstract: For object recognition based on nearest neighbor search of local descriptors such as SIFT, it is important to keep the nearest neighbor search efficient to deal with a huge number of descriptors. The present invention provides methods of efficient recognition. In one embodiment, the method is based on the observation that the level of accuracy of nearest neighbor search required for correct recognition depends on the images to be recognized. The method is characterized by a mechanism in which multiple recognizers with approximate nearest neighbor search are cascaded in order of the level of approximation, so as to improve efficiency by adaptively controlling the level applied depending on the images. In another embodiment, the method is characterized by excluding local descriptors with low discriminability when a large number of local descriptors are present in the vicinity and many distance calculations are required.
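The cascade idea can be sketched abstractly: a cheap approximate matcher answers a query only when its top two candidates are well separated, and ambiguous queries fall through to exact (expensive) search. The distance functions and threshold below are hypothetical stand-ins, not the patent's recognizers:

```python
# Illustrative cascade sketch (hypothetical stages): answer cheaply when
# the approximate stage is confident, otherwise fall back to exact search.

def cheap_match(query, db):
    # coarse stage: compare only the first descriptor component
    scored = sorted(db, key=lambda v: abs(v[0] - query[0]))
    best, second = scored[0], scored[1]
    confident = abs(second[0] - query[0]) - abs(best[0] - query[0]) > 1.0
    return best, confident

def exact_match(query, db):
    return min(db, key=lambda v: sum((a - b) ** 2 for a, b in zip(v, query)))

def cascade(query, db):
    best, confident = cheap_match(query, db)
    return best if confident else exact_match(query, db)

db = [(0.0, 0.0), (5.0, 5.0), (5.2, 0.0)]
easy = cascade((0.1, 0.2), db)       # coarse stage is decisive
hard = cascade((5.1, 4.0), db)       # ambiguous, falls through to exact search
```

The adaptive control described in the abstract amounts to choosing, per query, how far down such a cascade to descend.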

Proceedings ArticleDOI
12 Dec 2007
TL;DR: A new model-based approach is proposed that capitalizes on explicit structure and is robust to noise and occlusion; it achieves an encouraging recognition rate on an image database selected from the XM2VTS database.
Abstract: Ears are a new biometric with the major advantage that they appear to maintain their structure with increasing age. Most current approaches are holistic and describe the ear by its general properties. We propose a new model-based approach, capitalizing on explicit structure and with the advantage of being robust to noise and occlusion. Our model is a constellation of generalized ear parts, which is learned off-line using an unsupervised learning algorithm over an enrolled training set of 63 ear images. The Scale Invariant Feature Transform (SIFT) is used to detect the features within the ear images. In recognition, given a profile image of the human head, the ear is enrolled and recognised from the parts selected via the model. We achieve an encouraging recognition rate on an image database selected from the XM2VTS database. A head-to-head comparison with PCA is also presented to show the advantage derived by the use of the model in successful occlusion handling.

Proceedings ArticleDOI
29 Sep 2007
TL;DR: A novel tracking method to handle the problem of large motion by using Scale Invariant Feature Transform (SIFT) based registration algorithm that shows an accurate pose recovery when the head has large motion, even with movement along the Z axis.
Abstract: Although there exist dozens of vision based 3D head tracking methods, none of them considers the problem of large motion, especially movement along the Z axis. In this paper we propose a novel tracking method to handle this problem by using a Scale Invariant Feature Transform (SIFT) based registration algorithm. Salient SIFT features are first detected and tracked between two images, and then the 3D points corresponding to these features are obtained from a stereo camera. With these 3D points, a registration algorithm in a RANSAC framework is employed to detect the outliers and estimate the head pose. Performance evaluation shows an accurate pose recovery (3° RMS) when the head has large motion, even when movement along the Z axis is as large as about 150 cm.
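RANSAC, as used here for registration, repeatedly fits a model to a random minimal sample, counts the correspondences the model explains, and keeps the best one, which makes it robust to outlier matches. A minimal sketch on a toy 2D translation problem (the paper estimates a full 3D pose, for which the same loop applies with a rigid-transform fit):

```python
# Illustrative RANSAC sketch on toy data: estimate a 2D translation from
# point correspondences contaminated by an outlier.
import random

random.seed(1)

def ransac_translation(src, dst, iters=50, tol=0.5):
    best_t, best_inliers = None, -1
    for _ in range(iters):
        i = random.randrange(len(src))                  # minimal sample: 1 pair
        t = (dst[i][0] - src[i][0], dst[i][1] - src[i][1])
        inliers = sum(
            1 for (sx, sy), (dx, dy) in zip(src, dst)
            if abs(sx + t[0] - dx) < tol and abs(sy + t[1] - dy) < tol
        )
        if inliers > best_inliers:
            best_t, best_inliers = t, inliers
    return best_t, best_inliers

src = [(0, 0), (1, 0), (0, 1), (2, 2), (5, 5)]
dst = [(3, 4), (4, 4), (3, 5), (5, 6), (9, 9)]   # last pair is a bad match
t, n_inliers = ransac_translation(src, dst)       # recovers (3, 4)
```

Any sample drawn from a correct correspondence yields the true translation, so the loop converges quickly; the mismatched pair is simply never counted as an inlier.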

Proceedings ArticleDOI
26 Dec 2007
TL;DR: This work extends the successful 2D robust feature concept into the third dimension in that it produces a descriptor for a reconstructed 3D surface region that is perspectively invariant if the region can locally be approximated well by a plane.
Abstract: We extend the successful 2D robust feature concept into the third dimension in that we produce a descriptor for a reconstructed 3D surface region. The descriptor is perspectively invariant if the region can locally be approximated well by a plane. We exploit depth and texture information, which is nowadays available in real-time from video of moving cameras, from stereo systems or PMD cameras (photonic mixer devices). By computing a normal view onto the surface we keep the descriptiveness of similarity invariant features like SIFT while achieving invariance against perspective distortions, whereas descriptiveness typically suffers when using affine invariant features. Our approach can be exploited for structure-from-motion, for stereo or PMD cameras, alignment of large scale reconstructions or improved video registration.

Proceedings ArticleDOI
01 Oct 2007
TL;DR: A projective transformation for evaluating visual references is obtained using a version of the RANSAC algorithm, in which matched key-point pairs that fulfill the transformation equations are selected and corrupted data are rejected; the results presented are promising for use as a reference generator for the control system.
Abstract: This paper explores the possibility of using robust object tracking algorithms based on visual model features as generators of visual references for UAV control. A scale invariant feature transform (SIFT) algorithm is used for detecting the salient points in every processed image; then a projective transformation for evaluating the visual references is obtained using a version of the RANSAC algorithm, in which matched key-point pairs that fulfill the transformation equations are selected, while corrupted data are rejected. The system has been tested using diverse image sequences, showing its capability to track objects significantly changed in scale, position and rotation, while generating velocity references to the UAV flight controller. The robustness of our approach has also been validated using images taken from real flights showing noise and lighting distortions. The results presented are promising for use as a reference generator for the control system.

Proceedings ArticleDOI
22 Oct 2007
TL;DR: This paper proposes multiple-vehicle detection by quad-tree segmentation and a tracking method using the scale invariant feature transform to improve tracking performance for extracting traffic parameters such as vehicle count, speed and class.
Abstract: To monitor the road situation, CCTV footage is more useful than data from GPS or loop detectors because it can give the whole picture of the two-dimensional traffic situation. This paper proposes multiple-vehicle detection by quad-tree segmentation and a tracking method using the scale invariant feature transform to improve tracking performance for extracting traffic parameters such as vehicle count, speed, class, and so on. The experimental results show that the proposed method is effective and robust for vehicle detection and tracking, especially in cases where a vehicle changes lanes, vehicles occlude one another, or the affine shape of a vehicle changes due to its movement.

Journal ArticleDOI
TL;DR: A new simultaneous localization and mapping (SLAM) algorithm for building dense three‐dimensional maps using information acquired from a range imager and a conventional camera, for robotic search and rescue in unstructured indoor environments.
Abstract: The main contribution of this paper is a new simultaneous localization and mapping (SLAM) algorithm for building dense three-dimensional maps using information acquired from a range imager and a conventional camera, for robotic search and rescue in unstructured indoor environments. A key challenge in this scenario is that the robot moves in 6D and no odometry information is available. An extended information filter (EIF) is used to estimate the state vector containing the sequence of camera poses and some selected 3D point features in the environment. Data association is performed using a combination of scale invariant feature transformation (SIFT) feature detection and matching, random sampling consensus (RANSAC), and least square 3D point sets fitting. Experimental results are provided to demonstrate the effectiveness of the techniques developed. © 2007 Wiley Periodicals, Inc.

Book ChapterDOI
18 Nov 2007
TL;DR: The method of Canonical Correlation Analysis is combined with discriminant functions and the Scale-Invariant Feature Transform (SIFT) to obtain discriminative spatiotemporal features for robust gesture recognition.
Abstract: This paper addresses gesture recognition under small sample size, where direct use of traditional classifiers is difficult due to the high dimensionality of the input space. We propose a pairwise feature extraction method on video volumes for classification. The method of Canonical Correlation Analysis is combined with discriminant functions and the Scale-Invariant Feature Transform (SIFT) to obtain discriminative spatiotemporal features for robust gesture recognition. The proposed method is practically favorable as it works well with a small amount of training samples, involves few parameters, and is computationally efficient. In experiments using 900 videos of 9 hand gesture classes, the proposed method notably outperformed classifiers such as the Support Vector Machine and Relevance Vector Machine, achieving 85% accuracy.