
Showing papers on "Scale-invariant feature transform published in 2010"


Journal ArticleDOI
TL;DR: An EM-based algorithm to compute dense depth and occlusion maps from wide-baseline image pairs using a local image descriptor, DAISY, which is very efficient to compute densely and robust against many photometric and geometric transformations.
Abstract: In this paper, we introduce a local image descriptor, DAISY, which is very efficient to compute densely. We also present an EM-based algorithm to compute dense depth and occlusion maps from wide-baseline image pairs using this descriptor. This yields much better results in wide-baseline situations than the pixel- and correlation-based algorithms that are commonly used in narrow-baseline stereo. Also, using a descriptor makes our algorithm robust against many photometric and geometric transformations. Our descriptor is inspired by earlier ones such as SIFT and GLOH but can be computed much faster for our purposes. Unlike SURF, which can also be computed efficiently at every pixel, it does not introduce artifacts that degrade the matching performance when used densely. It is important to note that our approach is the first algorithm that attempts to estimate dense depth maps from wide-baseline image pairs, and we show that it performs well through many experiments on depth estimation accuracy and occlusion detection, and through comparisons against other descriptors on laser-scanned ground-truth scenes. We also tested our approach on a variety of indoor and outdoor scenes with different photometric and geometric transformations, and our experiments support our claim of robustness against these.

1,484 citations
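
Dense per-pixel extraction is the descriptor's selling point. Below is a minimal sketch of dense DAISY computation using OpenCV's contrib implementation (assuming opencv-contrib-python is installed; the input filename and grid step are illustrative, not from the paper):

```python
import cv2

img = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input image

# Default DAISY layout: radius 15, 3 rings x 8 segments, 8-bin histograms -> 200-D.
daisy = cv2.xfeatures2d.DAISY_create(radius=15, q_radius=3, q_theta=8, q_hist=8)

# DAISY is cheap to compute densely: place a keypoint at every grid position.
step = 4
grid = [cv2.KeyPoint(float(x), float(y), float(step))
        for y in range(0, img.shape[0], step)
        for x in range(0, img.shape[1], step)]
grid, desc = daisy.compute(img, grid)
print(desc.shape)  # one 200-D descriptor per grid point
```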


Book ChapterDOI
05 Sep 2010
TL;DR: This work devises an adaptive, prioritized algorithm for matching a representative set of SIFT features covering a large scene to a query image for efficient localization, based on considering features in the scene database and matching them to query image features, as opposed to more conventional methods that match image features to visual words or database features.
Abstract: We present a fast, simple location recognition and image localization method that leverages feature correspondence and geometry estimated from large Internet photo collections. Such recovered structure contains a significant amount of useful information about images and image features that is not available when considering images in isolation. For instance, we can predict which views will be the most common, which feature points in a scene are most reliable, and which features in the scene tend to co-occur in the same image. Based on this information, we devise an adaptive, prioritized algorithm for matching a representative set of SIFT features covering a large scene to a query image for efficient localization. Our approach is based on considering features in the scene database, and matching them to query image features, as opposed to more conventional methods that match image features to visual words or database features. We find this approach results in improved performance, due to the richer knowledge of characteristics of the database features compared to query image features. We present experiments on two large city-scale photo collections, showing that our algorithm compares favorably to image retrieval-style approaches to location recognition.

523 citations
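
A hedged sketch of the paper's central inversion, matching scene-database features to the query rather than the reverse, with a priority order and early termination. The descriptors, priorities, and thresholds below are placeholders, not the authors' pipeline:

```python
import numpy as np
from scipy.spatial import cKDTree

def prioritized_match(db_desc, db_priority, query_desc, needed=100, ratio=0.8):
    tree = cKDTree(query_desc)                # index the (small) query side once
    matches = []
    for i in np.argsort(-db_priority):        # most promising scene features first
        d, idx = tree.query(db_desc[i], k=2)  # two nearest query descriptors
        if d[0] < ratio * d[1]:               # Lowe-style ratio test
            matches.append((i, idx[0]))
            if len(matches) >= needed:        # stop early once enough matches found
                break
    return matches
```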


Proceedings ArticleDOI
13 Jun 2010
TL;DR: This work proposes a pose-adaptive matching method that uses pose-specific classifiers to deal with different pose combinations of the matching face pair, and finds that a simple normalization mechanism after PCA can further improve the discriminative ability of the descriptor.
Abstract: We present a novel approach to address the representation issue and the matching issue in face recognition (verification). Firstly, our approach encodes the micro-structures of the face by a new learning-based encoding method. Unlike many previous manually designed encoding methods (e.g., LBP or SIFT), we use unsupervised learning techniques to learn an encoder from the training examples, which can automatically achieve a very good tradeoff between discriminative power and invariance. Then we apply PCA to get a compact face descriptor. We find that a simple normalization mechanism after PCA can further improve the discriminative ability of the descriptor. The resulting face representation, the learning-based (LE) descriptor, is compact, highly discriminative, and easy to extract. To handle the large pose variation in real-life scenarios, we propose a pose-adaptive matching method that uses pose-specific classifiers to deal with different pose combinations (e.g., frontal vs. frontal, frontal vs. left) of the matching face pair. Our approach is comparable with the state-of-the-art methods on the Labeled Faces in the Wild (LFW) benchmark (we achieved an 84.45% recognition rate), while maintaining excellent compactness, simplicity, and generalization ability across different datasets.

470 citations
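
The compact-descriptor step is easy to picture. A minimal sketch of "PCA then normalize", with random stand-ins for the learned-encoding histograms (dimensions are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

raw = np.random.rand(1000, 4096)          # stand-in for pooled LE code histograms
compact = PCA(n_components=400).fit_transform(raw)

# The simple post-PCA normalization the paper reports as helpful: unit L2 norm,
# which makes Euclidean distance between descriptors behave like cosine similarity.
compact /= np.linalg.norm(compact, axis=1, keepdims=True) + 1e-12
```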


Book ChapterDOI
05 Sep 2010
TL;DR: A localization method in which the SIFT descriptors of the detected SIFT interest points in the reference images are indexed using a tree, which is queried to localize a query image, while a novel GPS-tag-based pruning method removes the less reliable descriptors.
Abstract: Finding an image's exact GPS location is a challenging computer vision problem that has many real-world applications. In this paper, we address the problem of finding the GPS location of images with an accuracy comparable to that of hand-held GPS devices. We leverage a structured data set of about 100,000 images built from Google Maps Street View as the reference images. We propose a localization method in which the SIFT descriptors of the detected SIFT interest points in the reference images are indexed using a tree. In order to localize a query image, the tree is queried using the detected SIFT descriptors in the query image. A novel GPS-tag-based pruning method removes the less reliable descriptors. Then, a smoothing step with an associated voting scheme is utilized; this allows each query descriptor to vote for the location its nearest neighbor belongs to, in order to accurately localize the query image. A parameter called Confidence of Localization, based on the kurtosis of the distribution of votes, is defined to determine how reliable the localization of a particular image is. In addition, we propose a novel approach to localize groups of images accurately in a hierarchical manner. First, each image is localized individually; then, the rest of the images in the group are matched against images in the neighboring area of the first found match. The final location is determined based on the Confidence of Localization parameter. The proposed image group localization method can deal with very unclear queries that cannot be geolocated individually.

401 citations
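
A rough sketch of the vote-and-confidence scheme: each query SIFT descriptor votes for the GPS cell of its nearest reference descriptor, and the kurtosis (peakedness) of the vote histogram serves as the Confidence of Localization. All arrays are synthetic placeholders, not the Street View data:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.stats import kurtosis

ref_desc = np.random.rand(10000, 128).astype(np.float32)  # reference SIFT descriptors
ref_cell = np.random.randint(0, 500, 10000)               # GPS cell of each descriptor
qry_desc = np.random.rand(300, 128).astype(np.float32)    # query image descriptors

tree = cKDTree(ref_desc)              # the paper indexes reference SIFTs in a tree
_, nn = tree.query(qry_desc, k=1)
votes = np.bincount(ref_cell[nn], minlength=500)

best_cell = int(votes.argmax())
confidence = kurtosis(votes)   # a peaked vote distribution => reliable localization
```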


Journal ArticleDOI
TL;DR: The proposed method starts by estimating the transform between matched scale invariant feature transform (SIFT) keypoints, which are insensitive to geometrical and illumination distortions, and then finds all pixels within the duplicated regions after discounting the estimated transforms.
Abstract: Region duplication is a simple and effective operation to create digital image forgeries, where a continuous portion of pixels in an image, after possible geometrical and illumination adjustments, are copied and pasted to a different location in the same image. Most existing region duplication detection methods are based on directly matching blocks of image pixels or transform coefficients, and are not effective when the duplicated regions have geometrical or illumination distortions. In this work, we describe a new region duplication detection method that is robust to distortions of the duplicated regions. Our method starts by estimating the transform between matched scale invariant feature transform (SIFT) keypoints, which are insensitive to geometrical and illumination distortions, and then finds all pixels within the duplicated regions after discounting the estimated transforms. The proposed method shows effective detection on an automatically synthesized forgery image database with duplicated and distorted regions. We further demonstrate its practical performance with several challenging forgery images created with state-of-the-art tools.

392 citations
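
A simplified sketch of the keypoint stage described above: match SIFT descriptors within the same image (skipping each keypoint's trivial self-match) and fit an affine transform to the surviving pairs with RANSAC. Thresholds and the filename are illustrative, not the paper's:

```python
import cv2
import numpy as np

img = cv2.imread("suspect.jpg", cv2.IMREAD_GRAYSCALE)    # hypothetical input
kp, desc = cv2.SIFT_create().detectAndCompute(img, None)

# k=3 so each list holds [self-match, 1st real neighbour, 2nd real neighbour].
pairs = cv2.BFMatcher(cv2.NORM_L2).knnMatch(desc, desc, k=3)

src, dst = [], []
for m in pairs:
    if len(m) == 3 and m[1].distance < 0.6 * m[2].distance:
        p, q = kp[m[1].queryIdx].pt, kp[m[1].trainIdx].pt
        if np.hypot(p[0] - q[0], p[1] - q[1]) > 20:      # ignore near-coincident points
            src.append(p)
            dst.append(q)

if len(src) >= 3:
    # RANSAC estimate of the transform between the copied and pasted regions.
    A, inliers = cv2.estimateAffine2D(np.float32(src), np.float32(dst),
                                      method=cv2.RANSAC)
    print("estimated duplication transform:\n", A)
```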


Proceedings Article
06 Dec 2010
TL;DR: This work highlights the kernel view of orientation histograms, and shows that they are equivalent to a certain type of match kernels over image patches, and designs a family of kernel descriptors which provide a unified and principled framework to turn pixel attributes into compact patch-level features.
Abstract: The design of low-level image features is critical for computer vision algorithms. Orientation histograms, such as those in SIFT [16] and HOG [3], are the most successful and popular features for visual object and scene recognition. We highlight the kernel view of orientation histograms, and show that they are equivalent to a certain type of match kernels over image patches. This novel view allows us to design a family of kernel descriptors which provide a unified and principled framework to turn pixel attributes (gradient, color, local binary pattern, etc.) into compact patch-level features. In particular, we introduce three types of match kernels to measure similarities between image patches, and construct compact low-dimensional kernel descriptors from these match kernels using kernel principal component analysis (KPCA) [23]. Kernel descriptors are easy to design and can turn any type of pixel attribute into patch-level features. They outperform carefully tuned and sophisticated features including SIFT and deep belief networks. We report superior performance on standard image classification benchmarks: Scene-15, Caltech-101, CIFAR10 and CIFAR10-ImageNet.

369 citations
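
An illustrative (not the authors') rendering of a gradient match kernel between two patches, followed by KPCA over a sampled patch basis to obtain a compact kernel descriptor; bandwidths and sizes are arbitrary choices:

```python
import numpy as np
from sklearn.decomposition import KernelPCA

def grad_attrs(patch):
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy).ravel()
    mag /= np.linalg.norm(mag) + 1e-12            # normalized gradient magnitudes
    theta = np.arctan2(gy, gx).ravel()
    ys, xs = np.mgrid[0:patch.shape[0], 0:patch.shape[1]]
    pos = np.stack([ys.ravel(), xs.ravel()], 1) / max(patch.shape)
    return mag, theta, pos

def match_kernel(p, q, g_o=5.0, g_p=3.0):
    m1, t1, x1 = grad_attrs(p)
    m2, t2, x2 = grad_attrs(q)
    # Orientation kernel on normalized gradient vectors [cos t, sin t].
    k_o = np.exp(-g_o * ((np.cos(t1)[:, None] - np.cos(t2)[None, :]) ** 2
                         + (np.sin(t1)[:, None] - np.sin(t2)[None, :]) ** 2))
    k_p = np.exp(-g_p * ((x1[:, None, :] - x2[None, :, :]) ** 2).sum(-1))
    return (np.outer(m1, m2) * k_o * k_p).sum()   # sum over all pixel pairs

patches = [np.random.rand(16, 16) for _ in range(30)]   # stand-in basis patches
K = np.array([[match_kernel(p, q) for q in patches] for p in patches])
kpca = KernelPCA(n_components=10, kernel="precomputed").fit(K)
descriptor = kpca.transform(K[:1])   # compact kernel descriptor for patch 0
```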


Proceedings ArticleDOI
25 Oct 2010
TL;DR: This paper proposes a novel scheme, spatial coding, to encode the spatial relationships among local features in an image, and achieves a 53% improvement in mean average precision and 46% reduction in time cost over the baseline bag-of-words approach.
Abstract: The state-of-the-art image retrieval approaches represent images with a high dimensional vector of visual words by quantizing local features, such as SIFT, in the descriptor space. The geometric clues among visual words in an image are usually either ignored or exploited for full geometric verification, which is computationally expensive. In this paper, we focus on partial-duplicate web image retrieval, and propose a novel scheme, spatial coding, to encode the spatial relationships among local features in an image. Our spatial coding is both efficient and effective at discovering false matches of local features between images, and can greatly improve retrieval performance. Experiments in partial-duplicate web image search, using a database of one million images, reveal that our approach achieves a 53% improvement in mean average precision and a 46% reduction in time cost over the baseline bag-of-words approach.

248 citations
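
A toy rendering of the spatial-coding idea: encode, for every pair of matched features, their relative left/right and above/below ordering in each image as binary maps, and flag matches whose orderings flip between the images (the paper uses a finer quantized variant; thresholds here are illustrative):

```python
import numpy as np

def spatial_maps(pts):
    x, y = pts[:, 0], pts[:, 1]
    return x[:, None] < x[None, :], y[:, None] < y[None, :]

def violations(pts_a, pts_b):
    xa, ya = spatial_maps(pts_a)          # orderings in image A
    xb, yb = spatial_maps(pts_b)          # orderings in image B
    bad = (xa != xb) | (ya != yb)         # pairs whose relative layout flips
    return bad.sum(1)                     # violation count per match

pts_a = np.random.rand(30, 2) * 500           # matched keypoints in image A
pts_b = pts_a + 5 * np.random.rand(30, 2)     # roughly consistent partner points
keep = violations(pts_a, pts_b) < 5           # discard geometrically false matches
```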


Journal ArticleDOI
TL;DR: It is shown that the enhanced matching approach proposed in this paper boosts the recognition accuracy compared with the standard SIFT-based feature-matching method.
Abstract: In this paper, a new algorithm for vehicle logo recognition on the basis of an enhanced scale-invariant feature transform (SIFT)-based feature-matching scheme is proposed. This algorithm is assessed on a set of 1200 logo images that belong to ten distinctive vehicle manufacturers. A series of experiments are conducted, splitting the 1200 images into a training set and a testing set. It is shown that the enhanced matching approach proposed in this paper boosts the recognition accuracy compared with the standard SIFT-based feature-matching method. The reported results indicate a high recognition rate for vehicle logos and a fast processing time, making the method suitable for real-time applications.

201 citations
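
For reference, the "standard SIFT-based feature matching" baseline the paper improves on can be sketched in a few OpenCV lines (filenames and the ratio threshold are illustrative):

```python
import cv2

sift = cv2.SIFT_create()
_, d1 = sift.detectAndCompute(cv2.imread("logo_train.png", 0), None)
_, d2 = sift.detectAndCompute(cv2.imread("logo_test.png", 0), None)

# Ratio-test matching; the match count acts as the similarity score, and the
# manufacturer whose training logo scores highest wins.
good = [m for m, n in cv2.BFMatcher().knnMatch(d1, d2, k=2)
        if m.distance < 0.75 * n.distance]
score = len(good)
```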


Proceedings ArticleDOI
13 Jun 2010
TL;DR: This work develops a bottom-up motion-based approach to robustly segment out foreground objects in egocentric video and shows that it greatly improves object recognition accuracy.
Abstract: Identifying handled objects, i.e. objects being manipulated by a user, is essential for recognizing the person's activities. An egocentric camera as worn on the body enjoys many advantages such as having a natural first-person view and not needing to instrument the environment. It is also a challenging setting, where background clutter is known to be a major source of problems and is difficult to handle with the camera constantly and arbitrarily moving. In this work we develop a bottom-up motion-based approach to robustly segment out foreground objects in egocentric video and show that it greatly improves object recognition accuracy. Our key insight is that egocentric video of object manipulation is a special domain and many domain-specific cues can readily help. We compute dense optical flow and fit it into multiple affine layers. We then use a max-margin classifier to combine motion with empirical knowledge of object location and background movement as well as temporal cues of support region and color appearance. We evaluate our segmentation algorithm on the large Intel Egocentric Object Recognition dataset with 42 objects and 100K frames. We show that, when combined with temporal integration, figure-ground segmentation improves the accuracy of a SIFT-based recognition system from 33% to 60%, and that of a latent-HOG system from 64% to 86%.

191 citations
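
A sketch of the "fit dense flow to affine layers" step, reduced to a single layer: estimate Farneback flow, fit one global affine motion model by least squares, and treat high-residual pixels as foreground candidates. The paper fits multiple layers and adds learned cues on top; filenames and thresholds here are arbitrary:

```python
import cv2
import numpy as np

prev = cv2.imread("frame0.png", 0)    # hypothetical consecutive frames
curr = cv2.imread("frame1.png", 0)
flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)

h, w = prev.shape
ys, xs = np.mgrid[0:h, 0:w]
A = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)], 1)   # affine basis [x, y, 1]
coef, *_ = np.linalg.lstsq(A, flow.reshape(-1, 2), rcond=None)

# Pixels that disobey the dominant (background) motion are foreground candidates.
residual = np.linalg.norm(flow.reshape(-1, 2) - A @ coef, axis=1).reshape(h, w)
foreground = residual > 2.0
```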


Book ChapterDOI
05 Sep 2010
TL;DR: This paper learns a non-linear transformation model by minimizing a novel margin-based cost function, which aims to separate matching descriptors from two classes of non-matching descriptors, and demonstrates impressive gains in performance on a ground truth dataset.
Abstract: Many visual search and matching systems represent images using sparse sets of "visual words": descriptors that have been quantized by assignment to the best-matching symbol in a discrete vocabulary. Errors in this quantization procedure propagate throughout the rest of the system, either harming performance or requiring correction using additional storage or processing. This paper aims to reduce these quantization errors at source, by learning a projection from descriptor space to a new Euclidean space in which standard clustering techniques are more likely to assign matching descriptors to the same cluster, and nonmatching descriptors to different clusters. To achieve this, we learn a non-linear transformation model by minimizing a novel margin-based cost function, which aims to separate matching descriptors from two classes of non-matching descriptors. Training data is generated automatically by leveraging geometric consistency. Scalable, stochastic gradient methods are used for the optimization. For the case of particular object retrieval, we demonstrate impressive gains in performance on a ground truth dataset: our learnt 32-D descriptor without spatial re-ranking outperforms a baseline method using 128-D SIFT descriptors with spatial re-ranking.

186 citations
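
A stripped-down linear stand-in for the learned projection (the paper's model is non-linear): SGD on a margin-based triplet loss that pulls matching descriptors together and pushes non-matching ones apart. Data, dimensions, and step sizes are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, (32, 128))    # project 128-D SIFT down to 32-D

def sgd_step(W, a, p, n, margin=1.0, lr=1e-3):
    dp = W @ (a - p)                   # projected gap to the matching descriptor
    dn = W @ (a - n)                   # projected gap to a non-matching descriptor
    if dp @ dp - dn @ dn + margin > 0: # hinge: update only on a margin violation
        W = W - lr * 2 * (np.outer(dp, a - p) - np.outer(dn, a - n))
    return W

for _ in range(1000):   # anchor/positive mimic a matching pair; negative is random
    a = rng.normal(size=128)
    p = a + 0.1 * rng.normal(size=128)
    n = rng.normal(size=128)
    W = sgd_step(W, a, p, n)
```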


Journal ArticleDOI
TL;DR: This paper addresses the problem of outdoor, appearance-based topological localization, particularly over long periods of time where seasonal changes alter the appearance of the environment, with a straightforward method that relies on local image features to compare single-image pairs.

Journal ArticleDOI
TL;DR: This paper reviews techniques to accelerate concept classification, where the trade-off between computational efficiency and accuracy is shown and the results lead to a 7-fold speed increase without accuracy loss, and a 70-fold speed increase with 3% accuracy loss.
Abstract: As datasets grow increasingly large in content-based image and video retrieval, the computational efficiency of concept classification is important. This paper reviews techniques to accelerate concept classification, where we show the trade-off between computational efficiency and accuracy. As a basis, we use the Bag-of-Words algorithm that in the 2008 benchmarks of TRECVID and PASCAL led to the best performance scores. We divide the evaluation into three steps: 1) Descriptor Extraction, where we evaluate SIFT, SURF, DAISY, and Semantic Textons. 2) Visual Word Assignment, where we compare a k-means visual vocabulary with a Random Forest and evaluate subsampling, dimension reduction with PCA, and division strategies of the Spatial Pyramid. 3) Classification, where we evaluate the χ2, RBF, and Fast Histogram Intersection kernel for the SVM. Apart from the evaluation, we accelerate the calculation of densely sampled SIFT and SURF, accelerate nearest neighbor assignment, and improve the accuracy of the Histogram Intersection kernel. We conclude by discussing whether further acceleration of the Bag-of-Words pipeline is possible. Our results lead to a 7-fold speed increase without accuracy loss, and a 70-fold speed increase with 3% accuracy loss. The latter system does classification in real time, which opens up new applications for automatic concept classification. For example, this system permits five standard desktop PCs to automatically tag all images currently being uploaded to Flickr for 20 classes.
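
The classification stage is easy to sketch. Here is a minimal precomputed histogram intersection kernel (HIK) SVM in scikit-learn, with random bag-of-words histograms standing in for real data:

```python
import numpy as np
from sklearn.svm import SVC

def hik(A, B):
    # K[i, j] = sum_k min(A[i, k], B[j, k])
    return np.minimum(A[:, None, :], B[None, :, :]).sum(-1)

train = np.random.rand(100, 500)           # bag-of-words histograms (placeholders)
labels = np.random.randint(0, 2, 100)
test = np.random.rand(20, 500)

clf = SVC(kernel="precomputed").fit(hik(train, train), labels)
pred = clf.predict(hik(test, train))
```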

Journal ArticleDOI
TL;DR: The experimental results, involving more than 15 million region pairs, indicate the proposed ZM phase descriptor has, generally speaking, the best performance under the common photometric and geometric transformations.
Abstract: A local image descriptor robust to the common photometric transformations (blur, illumination, noise, and JPEG compression) and geometric transformations (rotation, scaling, translation, and viewpoint) is crucial to many image understanding and computer vision applications. In this paper, the representation and matching power of region descriptors are evaluated. A common set of elliptical interest regions is used to evaluate the performance. The elliptical regions are further normalized to be circular with a fixed size. The normalized circular regions become affine invariant up to a rotational ambiguity. Here, a new distinctive image descriptor to represent the normalized region is proposed, which primarily comprises the Zernike moment (ZM) phase information. An accurate and robust estimation of the rotation angle between a pair of normalized regions is then described and used to measure the similarity between two matching regions. The discriminative power of the new ZM phase descriptor is compared with five major existing region descriptors (SIFT, GLOH, PCA-SIFT, complex moments, and steerable filters) based on the precision-recall criterion. The experimental results, involving more than 15 million region pairs, indicate that the proposed ZM phase descriptor has, generally speaking, the best performance under the common photometric and geometric transformations. Both quantitative and qualitative analyses of the descriptor performances are given to account for the performance discrepancy. First, the key factor behind its striking performance is that the ZM phase allows an accurate estimate of the rotation angle between two matching regions. Second, the feature dimensionality and feature orthogonality also affect the descriptor performance. Third, the ZM phase is more robust under nonuniform image intensity fluctuation. Finally, a time complexity analysis is provided.
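
A small numerical illustration of why the ZM phase supports rotation estimation: rotating a region multiplies the order-m Zernike moment by exp(-i·m·α), so the phase difference between two matching regions recovers α. Only the (n=1, m=1) moment is computed here; the paper's descriptor uses many orders, and this is a sketch rather than its implementation:

```python
import numpy as np
from scipy.ndimage import rotate

def zernike_11(img):
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    x = (xs - w / 2) / (w / 2)
    y = (ys - h / 2) / (h / 2)
    rho, theta = np.hypot(x, y), np.arctan2(y, x)
    disk = rho <= 1.0                  # unit-disk support; R_{1,1}(rho) = rho
    return (img[disk] * rho[disk] * np.exp(-1j * theta[disk])).sum()

patch = np.zeros((64, 64))
patch[:, 32:] = 1.0                    # a simple edge pattern
turned = rotate(patch, 30, reshape=False, order=1)

a, b = zernike_11(patch), zernike_11(turned)
est = np.degrees(np.angle(a * np.conj(b)))   # phase(a) - phase(b) = m * alpha
print(est)   # about ±30, sign depending on the rotation convention
```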

Proceedings ArticleDOI
23 Aug 2010
TL;DR: An original approach is proposed that computes SIFT descriptors on a set of facial landmarks of depth images, and then selects the subset of most relevant features, which achieves an average recognition rate of 77.5% on the BU-3DFE database.
Abstract: In this paper, the problem of person-independent facial expression recognition is addressed on 3D shapes. To this end, an original approach is proposed that computes SIFT descriptors on a set of facial landmarks of depth images, and then selects the subset of most relevant features. Using SVM classification of the selected features, an average recognition rate of 77.5% on the BU-3DFE database has been obtained. Comparative evaluation on a common experimental setup shows that our solution is able to obtain state-of-the-art results.

Journal ArticleDOI
TL;DR: A novel recognition framework for human actions using hybrid features extracted via the motion-selectivity attribute of the 3D dual-tree complex wavelet transform and the affine-SIFT local image detector, which offers enhanced capabilities to preserve structure and correlation amongst neighbourhood pixels of a video frame.

Journal ArticleDOI
TL;DR: In this paper, the impact of image filtering and of skipping features detected at the highest scales on the performance of the SIFT operator for SAR image registration is analyzed, based on multisensor, multitemporal and different-viewpoint SAR images.
Abstract: The SIFT operator's success in computer vision applications makes it an attractive candidate for the intricate feature-based SAR image registration problem. The SIFT operator's processing chain is capable of detecting and matching scale- and affine-invariant features. For SAR images, the operator is expected to detect stable features at lower scales, where speckle influence diminishes. To adapt the operator's performance to SAR images, we analyse the impact of image filtering and of skipping features detected at the highest scales. We present our analysis based on multisensor, multitemporal and different-viewpoint SAR images. The operator shows potential to become a robust alternative for point-feature-based registration of SAR images, as subpixel registration consistency was achieved for most of the tested datasets. Our findings indicate that operator performance in terms of repeatability and matching capability is affected by an increase in acquisition differences within the imagery. We also show that the proposed adaptations result in a significant speed-up compared to the original SIFT operator.
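
A hedged sketch of the two adaptations analysed above: pre-filter the speckled image, then discard features detected at the coarsest scales (approximated here via the keypoint size attribute). The median filter is a crude stand-in for a dedicated speckle filter, and the size threshold is illustrative:

```python
import cv2

sar = cv2.imread("sar_scene.png", 0)     # hypothetical SAR amplitude image
smoothed = cv2.medianBlur(sar, 5)        # stand-in for a proper speckle filter

kps, desc = cv2.SIFT_create().detectAndCompute(smoothed, None)

# Skip features from the highest scales, where SAR features are least stable.
keep = [i for i, k in enumerate(kps) if k.size < 20]
kps = [kps[i] for i in keep]
desc = desc[keep]
```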

Journal ArticleDOI
TL;DR: A new model of attention guidance for efficient and scalable first-stage search and recognition with many objects is described; it performs on par with or better than SIFT and HMAX while being, respectively, 1500 and 279 times faster.

Proceedings ArticleDOI
11 Nov 2010
TL;DR: A SIFT algorithm adapted for 3D surfaces (called meshSIFT) and its applications to 3D face pose normalisation and recognition are presented, outperforming most other algorithms found in the literature.
Abstract: This paper presents a SIFT algorithm adapted for 3D surfaces (called meshSIFT) and its applications to 3D face pose normalisation and recognition. The algorithm allows reliable detection of scale-space extrema as local feature locations. The scale space contains the mean curvature in each vertex on differently smoothed versions of the input mesh. The meshSIFT algorithm then describes the neighbourhood of every scale-space extremum in a feature vector consisting of concatenated histograms of shape indices and slant angles. The feature vectors are reliably matched by comparing the angle in feature space. Using RANSAC, the best rigid transformation can be estimated based on the matched features, leading to 84% correct pose normalisation of 3D faces from the Bosphorus database. Matches are mostly found between two face surfaces of the same person, allowing the algorithm to be used for 3D face recognition. Simply counting the number of matches allows 93.7% correct identification for face surfaces in the Bosphorus database and 97.7% when only frontal images are considered. In the verification scenario, we obtain equal error rates between 5.1% and 15.0%, depending on the investigated face surfaces. These results outperform most other algorithms found in the literature.
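
The matching rule (compare the angle between feature vectors, with a ratio criterion) is compact enough to state directly; a numpy sketch with an illustrative threshold:

```python
import numpy as np

def angle_matches(D1, D2, ratio=0.9):
    D1 = D1 / np.linalg.norm(D1, axis=1, keepdims=True)
    D2 = D2 / np.linalg.norm(D2, axis=1, keepdims=True)
    ang = np.arccos(np.clip(D1 @ D2.T, -1.0, 1.0))   # pairwise feature-space angles
    order = np.argsort(ang, axis=1)
    idx = np.arange(len(D1))
    best, second = ang[idx, order[:, 0]], ang[idx, order[:, 1]]
    # Accept a match when the best angle is clearly smaller than the runner-up.
    return [(i, order[i, 0]) for i in idx if best[i] < ratio * second[i]]
```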

Proceedings ArticleDOI
TL;DR: This paper presents a local feature-based method for matching facial sketch images to face photographs, which is the first known feature- based method for performing such matching.
Abstract: This paper presents a local feature-based method for matching facial sketch images to face photographs, which is the first known feature-based method for performing such matching. Starting with a training set of sketch to photo correspondences (i.e. a set of sketch and photo images of the same subjects), we demonstrate the ability to match sketches to photos: (1) directly using SIFT feature descriptors, (2) in a "common representation" that measures the similarity between a sketch and photo by their distance from the training set of sketch/photo pairs, and (3) by fusing the previous two methods. For both matching methods, the first step is to sample SIFT feature descriptors uniformly across all the sketch and photo images. In direct matching, we simply measure the distance of the SIFT descriptors between sketches and photos. In common representation matching, the distance between the descriptor vectors of the probe sketches and gallery photos at each local sample point is measured. This results in a vector of distances across the sketch or photo image to each member of the training basis. Further recognition improvements are shown by score-level fusion of the two sketch matchers. Compared with published sketch to photo matching algorithms, experimental results demonstrate improved matching performance using the presented feature-based methods.
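
A sketch of the "common representation" step: a descriptor is re-expressed as its vector of distances to the corresponding sample point across the training basis, which makes sketches and photos directly comparable. Arrays are synthetic placeholders:

```python
import numpy as np

train_sketch = np.random.rand(100, 128)   # training-basis descriptors, sketch side
train_photo = np.random.rand(100, 128)    # same subjects, photo side

def common_rep(desc, basis):
    return np.linalg.norm(basis - desc, axis=1)   # distance to each training subject

probe = common_rep(np.random.rand(128), train_sketch)    # probe sketch descriptor
gallery = common_rep(np.random.rand(128), train_photo)   # gallery photo descriptor
similarity = -np.linalg.norm(probe - gallery)            # compare in the common space
```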

Proceedings ArticleDOI
13 Jun 2010
TL;DR: A novel face representation in terms of dense Scale Invariant Feature Transform (d-SIFT) descriptors and shape contexts of the face image, with AdaBoost adopted to select features and form a strong classifier for gender recognition.
Abstract: In this paper, we propose a novel face representation in which a face is represented in terms of the dense Scale Invariant Feature Transform (d-SIFT) and shape contexts of the face image. The application of the representation to gender recognition has been investigated. There are four problems when applying SIFT to facial gender recognition: (1) only a few keypoints may be found in a face image due to missing texture or poorly illuminated faces; (2) the SIFT descriptors at the keypoints (which we call sparse SIFT) are distinctive, whereas alternative descriptors at non-keypoints (e.g., on a grid) could negatively impact accuracy; (3) a relatively large image size is required to obtain sufficient keypoints to support the matching; and (4) the matching assumes that the faces are properly registered. This paper addresses these difficulties using a combination of SIFT descriptors and shape contexts of face images. Instead of extracting descriptors around interest points only, local feature descriptors are extracted at regular image grid points, which allows for a dense description of the face images. In addition, the global shape contexts of the face images are fused with the dense SIFT to improve accuracy. AdaBoost is adopted to select features and form a strong classifier. The proposed approach is then applied to the problem of gender recognition. The experimental results on a large set of faces showed that the proposed method can achieve high accuracy even for faces that are not aligned.
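
A condensed sketch of the pipeline (shape contexts omitted for brevity): SIFT descriptors on a regular grid concatenated into one face vector, with AdaBoost performing feature selection and classification. Images and labels below are random stand-ins:

```python
import cv2
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

sift = cv2.SIFT_create()

def dense_sift(gray, step=8):
    grid = [cv2.KeyPoint(float(x), float(y), float(step))
            for y in range(step, gray.shape[0] - step, step)
            for x in range(step, gray.shape[1] - step, step)]
    _, desc = sift.compute(gray, grid)    # descriptors at grid points, no detector
    return desc.ravel()                   # fixed-length vector per aligned face

X = np.vstack([dense_sift(np.random.randint(0, 255, (64, 64), np.uint8))
               for _ in range(40)])       # stand-in face crops
y = np.random.randint(0, 2, 40)           # stand-in gender labels
clf = AdaBoostClassifier(n_estimators=100).fit(X, y)
```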

Book ChapterDOI
05 Sep 2010
TL;DR: The problem of large-scale place-of-interest recognition in cell phone images of urban scenarios is addressed by exploiting the nowadays often available 3D building data (e.g., from extruded floor plans) and massive street-view-like image data for database creation.
Abstract: We address the problem of large scale place-of-interest recognition in cell phone images of urban scenarios. Here, we go beyond what has been shown in earlier approaches by exploiting the nowadays often available 3D building information (e.g. from extruded floor plans) and massive street-view-like image data for database creation. Exploiting vanishing points in query images, and thus fully removing 3D rotation from the recognition problem, then allows us to simplify the required feature invariance to a purely homothetic problem, which we show leaves more discriminative power in feature descriptors than classical SIFT. We rerank visual-word-based document queries using a fast stratified homothetic verification that is tailored for repetitive patterns like window grids on facades and in most cases boosts the correct document to the top positions if it was in the short list. Since we exploit 3D building information, the approach finally outputs the camera pose in real-world coordinates, ready for augmenting the cell phone image with virtual 3D information. The whole system is demonstrated to outperform traditional approaches in city-scale experiments for different sources of street-view-like image data and a challenging set of cell phone images.

Proceedings ArticleDOI
07 Jul 2010
TL;DR: A panorama image stitching system that combines an image matching algorithm (modified SURF) with an image blending algorithm (multi-band blending); it can make the stitching seam invisible, produce a good panorama for large image data, and is faster than the previous method.
Abstract: SURF (Speeded Up Robust Features) is one of the famous feature-detection algorithms. This paper proposes a panorama image stitching system which combines an image matching algorithm, modified SURF, with an image blending algorithm, multi-band blending. The process is divided into the following steps: first, get the feature descriptors of the images using modified SURF; secondly, find matching pairs, check the neighbors by K-NN (K-nearest neighbors), and remove the mismatched couples by RANSAC (Random Sample Consensus); then, adjust the images by bundle adjustment and estimate the accurate homography matrix; lastly, blend the images by multi-band blending. A comparison of SIFT (Scale Invariant Feature Transform) and modified SURF is also shown as the basis for selecting the image matching algorithm. According to the experiments, the present system can make the stitching seam invisible, produce a good panorama for large image data, and is faster than the previous method.
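
A compressed OpenCV rendering of the matching-and-warping chain described above, with SIFT standing in for the modified SURF and a naive overwrite standing in for multi-band blending (bundle adjustment omitted; filenames are hypothetical):

```python
import cv2
import numpy as np

img1 = cv2.imread("left.jpg")     # hypothetical overlapping pair
img2 = cv2.imread("right.jpg")

sift = cv2.SIFT_create()
k1, d1 = sift.detectAndCompute(img1, None)
k2, d2 = sift.detectAndCompute(img2, None)

good = [m for m, n in cv2.BFMatcher().knnMatch(d2, d1, k=2)
        if m.distance < 0.75 * n.distance]            # K-NN check with ratio test
src = np.float32([k2[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([k1[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)  # RANSAC removes mismatches

pano = cv2.warpPerspective(img2, H, (img1.shape[1] + img2.shape[1], img1.shape[0]))
pano[:img1.shape[0], :img1.shape[1]] = img1           # naive blend at the seam
```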

Journal ArticleDOI
TL;DR: The proposed approach outperforms existing works such as the scale-invariant feature transform (SIFT) and speeded-up robust features (SURF), and is robust to some changes in illumination, viewpoint, color distribution, image quality, and object deformation.

Proceedings ArticleDOI
04 Nov 2010
TL;DR: An improved Hough transform for line detection that employs the “many-to-one” mapping and sliding window neighborhood technique to alleviate the computational and storage load is proposed.
Abstract: The Hough transform is a popular, robust method for detecting lines in an image. However, computational complexity and storage requirements are the main bottlenecks of the standard Hough transform (SHT) when applied to real-time detection. Therefore, many variations on Hough's original transform have been proposed to alleviate the computational and storage burden. In this paper, an improved Hough transform for line detection is proposed, which shares characteristics of the modified Hough transform (MHT) and the windowed random Hough transform (RHT). The proposed method employs the “many-to-one” mapping and a sliding-window neighborhood technique to alleviate the computational and storage load. Extensive experiments indicate that the proposed method achieves much better performance than previous variations of the Hough transform.
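
For context, the standard Hough transform (SHT) whose cost the paper attacks: every edge pixel votes for all (ρ, θ) lines through it, which is exactly the one-to-many mapping that the "many-to-one" and sliding-window variants avoid. A minimal accumulator sketch:

```python
import numpy as np

def hough_lines(edges, n_theta=180):
    h, w = edges.shape
    diag = int(np.ceil(np.hypot(h, w)))
    thetas = np.deg2rad(np.arange(n_theta))
    acc = np.zeros((2 * diag + 1, n_theta), dtype=np.int32)
    ys, xs = np.nonzero(edges)
    for x, y in zip(xs, ys):           # one-to-many: each pixel sweeps all angles
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[rhos + diag, np.arange(n_theta)] += 1
    return acc                         # peaks correspond to detected lines
```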

Journal ArticleDOI
TL;DR: This article analyses the feature detection, identification and matching steps of the original SIFT processing chain, and proposes assisting the standard SIFT matching scheme so as to utilise the SIFT operator's capability for effective results in challenging SAR image matching scenarios.
Abstract: With the increasing availability and rapidly improving spatial resolution of synthetic aperture radar (SAR) images from the latest and future satellites like TerraSAR-X and TanDEM-X, their applicability in remote sensing applications is set to be paramount. Considering challenges in the field of point-feature-based multisensor/multimodal SAR image matching/registration and advancements in the field of computer vision, we extend the applicability of the scale invariant feature transform (SIFT) operator to SAR images. In this article, we have analysed the feature detection, identification and matching steps of the original SIFT processing chain. We implement steps to counter the speckle influence, which deteriorates the SIFT operator's performance for SAR images. In feature identification, we evaluate different local gradient estimation techniques and highlight the fact that giving up SIFT's rotation invariance characteristic increases the potential number of matches when the multiple SAR images from...
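
The "give up rotation invariance" variant can be emulated in OpenCV by zeroing every keypoint's orientation before computing descriptors, so all descriptors share one reference orientation; a sketch under that assumption, not the article's implementation:

```python
import cv2

sift = cv2.SIFT_create()
img = cv2.imread("sar_image.png", 0)   # hypothetical SAR image

kps = sift.detect(img, None)
for k in kps:
    k.angle = 0.0                      # a fixed orientation for every keypoint
kps, desc = sift.compute(img, kps)     # upright descriptors, no rotation invariance
```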

Book ChapterDOI
21 Jun 2010
TL;DR: A novel face recognition technique that computes the SIFT descriptors at predefined (fixed) locations learned during the training stage is presented, which renders the approach more robust to illumination changes than related approaches from the literature.
Abstract: The Scale Invariant Feature Transform (SIFT) is an algorithm used to detect and describe scale-, translation- and rotation-invariant local features in images. The original SIFT algorithm has been successfully applied in general object detection and recognition tasks, panorama stitching and others. One of its more recent uses also includes face recognition, where it was shown to deliver encouraging results. SIFT-based face recognition techniques found in the literature rely heavily on the so-called keypoint detector, which locates interest points in the given image that are ultimately used to compute the SIFT descriptors. While these descriptors are known to be, among others, (partially) invariant to illumination changes, the keypoint detector is not. Since varying illumination is one of the main issues affecting the performance of face recognition systems, the keypoint detector represents the main source of errors in face recognition systems relying on SIFT features. To overcome the presented shortcoming of SIFT-based methods, we present in this paper a novel face recognition technique that computes the SIFT descriptors at predefined (fixed) locations learned during the training stage. By doing so, it eliminates the need for keypoint detection on the test images and renders our approach more robust to illumination changes than related approaches from the literature. Experiments, performed on the Extended Yale B face database, show that the proposed technique compares favorably with several popular techniques from the literature in terms of performance.
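
The core trick, sketched with OpenCV: skip keypoint detection entirely and compute SIFT descriptors at fixed, pre-learned locations, so the illumination-sensitive detector never runs on test images. The locations, scale, and filename below are placeholders:

```python
import cv2

sift = cv2.SIFT_create()
fixed_locations = [(20, 30), (44, 30), (32, 48)]   # e.g., learned during training
kps = [cv2.KeyPoint(float(x), float(y), 16.0) for x, y in fixed_locations]

face = cv2.imread("probe_face.png", 0)   # aligned face crop (hypothetical file)
_, desc = sift.compute(face, kps)        # descriptors only; no keypoint detector
```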

Proceedings ArticleDOI
05 Jul 2010
TL;DR: The combination of Dollar's detection method and the improved LBP-TOP descriptor is shown to be computationally efficient and to reach the best recognition accuracy on the KTH database.
Abstract: In this paper, we evaluate and compare different feature detection and feature description methods for part-based approaches in human action recognition. Different methods have been proposed in the literature for both feature detection of space-time interest points and description of local video patches. It is however unclear which method performs better in the field of human action recognition. We compare, in the feature detection section, Dollar's method [18], Laptev's method [22], a bank of 3D-Gabor filters [6] and a method based on Space-Time Differences of Gaussians. We also compare and evaluate different descriptors such as Gradient [18], HOG-HOF [22], 3D SIFT [24] and an enhanced version of LBP-TOP [15]. We show the combination of Dollar's detection method and the improved LBP-TOP descriptor to be computationally efficient and to reach the best recognition accuracy on the KTH database.

Proceedings ArticleDOI
01 Jan 2010
TL;DR: A novel framework for recognising realistic human actions in unconstrained environments is presented, based on computing a rich set of descriptors from key-point trajectories, and an adaptive feature fusion method is developed to combine different local motion descriptors and improve model robustness against feature noise and background clutter.
Abstract: Problem: This paper addresses the problem of recognising realistic human actions captured in unconstrained environments (Fig. 1). Existing approaches to action recognition have focused on improving visual feature representation using either spatio-temporal interest points or key-point trajectories. However, these methods are insufficient when action videos are recorded in unconstrained environments because: (1) reliable visual features are hard to extract due to occlusions, illumination change, scale variation and background clutter; (2) the effectiveness of visual features depends strongly on the unpredictable characteristics of camera movements; (3) complicated visual actions result in unequal discriminativeness of visual features.

Our Solutions: In this paper, we present a novel framework for recognising realistic human actions in unconstrained environments. The novelties of our work lie in three aspects. First, we propose a new action representation based on computing a rich set of descriptors from key-point trajectories. Second, in order to cope with drastic changes in motion characteristics with and without camera movements, we develop an adaptive feature fusion method to combine different local motion descriptors for improving model robustness against feature noise and background clutter. Finally, we propose a novel Multi-Class Delta Latent Dirichlet Allocation (MC-ΔLDA) model for feature selection, in which the most informative features in a high-dimensional feature space are selected collaboratively rather than independently.

Motion Descriptors: We first compute trajectories of key points using a KLT tracker and SIFT matching. After trajectory pruning by identifying the region of interest (ROI), we compute three types of motion descriptors from the surviving trajectories. First, the orientation-magnitude descriptor is extracted by quantising the orientation and magnitude of motion between two consecutive points in the same trajectory. Second, the trajectory shape descriptor is extracted by computing Fourier coefficients of a single trajectory. Finally, the appearance descriptor is extracted by computing SIFT features at all points of a trajectory.

Interest Point Features: We also detect spatio-temporal interest points, as they contain information complementary to trajectory features. At each interest point, a surrounding 3D cuboid is extracted. We use gradient vectors to describe these cuboids and PCA to reduce the descriptor's dimensionality.

Adaptive Feature Fusion: We wish to fuse trajectory-based descriptors with 3D interest-point-based descriptors adaptively, according to the presence of camera movement. A moving camera is detected by computing the global optical flow over all frames in a clip; if the majority of the frames contain global motion, we regard the clip as recorded by a moving camera. For clips without camera movement, both interest-point and trajectory-based descriptors can be computed reliably, and thus both types are used for recognition. In contrast, when camera motion is detected, interest-point descriptors are less meaningful, so only trajectory descriptors are employed.

Collaborative Feature Selection: We propose the MC-ΔLDA model (Fig. 2) for collaboratively selecting dominant features for classification. We consider each video clip x_j to be a mixture of N_t topics Φ = {φ_t}_{t=1}^{N_t} (to be discovered), where each φ_t is a multinomial distribution over N_w words (visual features). The MC-ΔLDA model constrains the topic proportions non-uniformly and on a per-clip basis. Each video clip belonging to action category A_c is modelled as a mixture of (1) N_t^s topics shared by all N_c action categories, and (2) N_{t,c} topics uniquely associated with action category A_c. In MC-ΔLDA, the non-uniform topic mixture proportion for a single clip x_j is enforced by its action class label c_j and the hyperparameter α_c for the corresponding action class c. Given the total number of topics N_t = N_t^s + Σ_{c=1}^{N_c} N_{t,c}, the structure of the MC-ΔLDA model, and the observable variables (clips x_j and action labels c_j), we can learn the N_t^s shared topics as well as all Σ_{c=1}^{N_c} N_{t,c} unique topics for the N_c classes of actions. We use the N_t^s topics shared by all actions for selecting discriminative features; these shared topics are represented as an N_w × N_t^s matrix Φ^s. The feature selection can be summarised in two steps: (1) for each feature v_k, k = …

Figure 1 (caption): Actions captured in unconstrained environments, YouTube dataset. From left to right: cycling, diving, soccer juggling, and walking with a dog.
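
A bare-bones version of the trajectory extraction step (KLT tracking) together with the orientation-magnitude quantization used for the first descriptor; parameters and the input filename are illustrative, not the authors' settings:

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("clip.avi")                   # hypothetical input clip
ok, frame = cap.read()
prev = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
pts = cv2.goodFeaturesToTrack(prev, 200, 0.01, 8)    # key points to track
tracks = [[p.ravel()] for p in pts]

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev, gray, pts, None)
    for tr, p, s in zip(tracks, nxt, status):
        if s[0]:                                     # extend successfully tracked points
            tr.append(p.ravel())
    prev, pts = gray, nxt

def orientation_magnitude(track, n_bins=8):
    d = np.diff(np.asarray(track), axis=0)           # motion between consecutive points
    bins = ((np.arctan2(d[:, 1], d[:, 0]) + np.pi) / (2 * np.pi) * n_bins).astype(int)
    hist, _ = np.histogram(bins, bins=n_bins, range=(0, n_bins),
                           weights=np.hypot(d[:, 0], d[:, 1]))   # magnitude-weighted
    return hist / (hist.sum() + 1e-12)
```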

Journal ArticleDOI
TL;DR: This paper proposes an efficient indexing scheme for searching large iris biometric databases that achieves invariance to similarity transformations, illumination and occlusion, and shows a substantial improvement over an exhaustive search technique in terms of time and accuracy.

Proceedings ArticleDOI
Rob Hess
25 Oct 2010
TL;DR: An open-source SIFT library is presented, implemented in C and freely available at http://eecs.oregonstate.edu/~hess/sift, and its performance is compared with that of the original SIFT executable released by David Lowe.
Abstract: Recent years have seen an explosion in the use of invariant keypoint methods across nearly every area of computer vision research. Since its introduction, the scale-invariant feature transform (SIFT) has been one of the most effective and widely-used of these methods and has served as a major catalyst in their popularization. In this paper, I present an open-source SIFT library, implemented in C and freely available at http://eecs.oregonstate.edu/~hess/sift.html, and I briefly compare its performance with that of the original SIFT executable released by David Lowe.