
Showing papers on "3D single-object recognition published in 2008"


Book ChapterDOI
20 Oct 2008
TL;DR: It is shown how both an object class specific representation and a discriminative recognition model can be learned using the AdaBoost algorithm, which allows many different kinds of simple features to be combined into a single similarity function.
Abstract: Viewpoint invariant pedestrian recognition is an important yet under-addressed problem in computer vision. This is likely due to the difficulty in matching two objects with unknown viewpoint and pose. This paper presents a method of performing viewpoint invariant pedestrian recognition using an efficiently and intelligently designed object representation, the ensemble of localized features (ELF). Instead of designing a specific feature by hand to solve the problem, we define a feature space using our intuition about the problem and let a machine learning algorithm find the best representation. We show how both an object class specific representation and a discriminative recognition model can be learned using the AdaBoost algorithm. This approach allows many different kinds of simple features to be combined into a single similarity function. The method is evaluated using a viewpoint invariant pedestrian recognition dataset and the results are shown to be superior to all previous benchmarks for both recognition and reacquisition of pedestrians.

1,554 citations
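
The ELF idea of boosting many simple localized features into one similarity function can be illustrated compactly. Below is a hypothetical sketch (synthetic data, a stand-in feature extractor, and scikit-learn's AdaBoost; none of the names come from the paper): decision stumps over per-channel feature differences are boosted into a single same/different score.

```python
# Hypothetical sketch of an ELF-style learned similarity function.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def extract_features(image_vectors):
    # Stand-in for a real bank of localized color/texture features.
    return image_vectors

# Training pairs: absolute feature differences, labeled same(1)/different(0).
X_a, X_b = rng.random((500, 64)), rng.random((500, 64))
y = rng.integers(0, 2, 500)
pair_diffs = np.abs(extract_features(X_a) - extract_features(X_b))

# Depth-1 trees (stumps) let AdaBoost pick one simple feature channel per
# round, combining many weak cues into a single similarity score.
model = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                           n_estimators=100)
model.fit(pair_diffs, y)

def similarity(feat_a, feat_b):
    """Higher score = more likely the same pedestrian."""
    d = np.abs(feat_a - feat_b).reshape(1, -1)
    return model.decision_function(d)[0]
```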


Journal ArticleDOI
TL;DR: The explosion of studies on crowding—in grating discrimination, letter and face recognition, visual search, selective attention, and reading—is reviewed, and a universal principle, the Bouma law, is found: the critical spacing required to prevent crowding is equal for all objects, although the effect is weaker between dissimilar objects.
Abstract: It is now emerging that vision is usually limited by object spacing rather than size. The visual system recognizes an object by detecting and then combining its features. ‘Crowding’ occurs when objects are too close together and features from several objects are combined into a jumbled percept. Here, we review the explosion of studies on crowding—in grating discrimination, letter and face recognition, visual search, selective attention, and reading—and find a universal principle, the Bouma law. The critical spacing required to prevent crowding is equal for all objects, although the effect is weaker between dissimilar objects. Furthermore, critical spacing at the cortex is independent of object position, and critical spacing at the visual field is proportional to object distance from fixation. The region where object spacing exceeds critical spacing is the ‘uncrowded window’. Observers cannot recognize objects outside of this window, and its size limits the speed of reading and search. Object recognition means calling a chair a chair, despite variations in style, viewpoint, rendering and surrounding clutter. Crowding is a breakdown of object recognition. Let us begin by sketching a popular two-step model of object recognition: feature detection and combination. Features are components of images that are detected independently [1–4]. They are typically simple and nonoverlapping. The first step in object recognition is feature detection [4]. Each neuron in the primary visual cortex responds when a feature matches its receptive field. Only the features that drive neurons hard enough are detected [5]. In the second step, the brain combines some of the detected features to recognize the object. This combining step (including ‘integration’, ‘binding’, ‘segmentation’, ‘pooling’, ‘grouping’, ‘contour integration’ and ‘selective attention’) is still mysterious [3,4,6–11].

602 citations
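
The Bouma law summarized above lends itself to a one-line formula. A toy sketch, assuming the commonly cited proportionality constant of roughly 0.5 (the exact value varies across studies):

```python
# Minimal sketch of the Bouma law: critical spacing grows in proportion
# to eccentricity. The 0.5 factor is the commonly cited value; it is not
# taken from this review's data.
def critical_spacing_deg(eccentricity_deg, bouma_factor=0.5):
    """Approximate center-to-center spacing needed to avoid crowding."""
    return bouma_factor * eccentricity_deg

def is_crowded(object_spacing_deg, eccentricity_deg):
    return object_spacing_deg < critical_spacing_deg(eccentricity_deg)

# e.g., a letter 10 degrees from fixation needs ~5 degrees of clearance:
print(critical_spacing_deg(10.0))  # 5.0
```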


Proceedings ArticleDOI
23 Jun 2008
TL;DR: This work proposes a new procedure for recognition of low-resolution faces, when there is a high-resolution training set available, and shows that recognition of faces of as low as 6 × 6 pixel size is considerably improved compared to matching using a super-resolution reconstruction followed by classification, and to matching with a low-resolution training set.
Abstract: Face recognition degrades when faces are of very low resolution since many details about the difference between one person and another can only be captured in images of sufficient resolution. In this work, we propose a new procedure for recognition of low-resolution faces, when there is a high-resolution training set available. Most previous super-resolution approaches are aimed at reconstruction, with recognition only as an after-thought. In contrast, in the proposed method, face features, as they would be extracted for a face recognition algorithm (e.g., eigenfaces, Fisher-faces, etc.), are included in a super-resolution method as prior information. This approach simultaneously provides measures of fit of the super-resolution result, from both reconstruction and recognition perspectives. This is different from the conventional paradigms of matching in a low-resolution domain, or, alternatively, applying a super-resolution algorithm to a low-resolution face and then classifying the super-resolution result. We show, for example, that recognition of faces of as low as 6 × 6 pixel size is considerably improved compared to matching using a super-resolution reconstruction followed by classification, and to matching with a low-resolution training set.

272 citations
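
The paper's key move, folding recognition features into the super-resolution objective as prior information, can be sketched as a joint cost. The following is an illustrative simplification (linear operators, plain gradient descent), not the authors' exact formulation:

```python
# Illustrative sketch: super-resolution with a recognition-feature prior.
# Minimizes ||D x - y||^2 + lam * ||W.T x - f_ref||^2 by gradient descent.
import numpy as np

def superresolve(y, D, W, f_ref, lam=0.1, lr=0.01, iters=500):
    """
    y     : observed low-resolution face (flattened)
    D     : assumed downsampling/blur matrix (hi-res -> lo-res)
    W     : eigenface basis (hi-res pixels x n_components)
    f_ref : recognition features the reconstruction should match
    """
    x = D.T @ y  # crude initialization in hi-res space
    for _ in range(iters):
        # First term pulls toward the observation, second toward the
        # recognition features -- the "simultaneous fit" idea above.
        grad = 2 * D.T @ (D @ x - y) + 2 * lam * W @ (W.T @ x - f_ref)
        x -= lr * grad
    return x
```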


Proceedings ArticleDOI
01 Sep 2008
TL;DR: An expression-invariant method for face recognition by fitting an identity/expression separated 3D Morphable Model to shape data that greatly improves recognition and retrieval rates in the uncooperative setting, while achieving recognition rates on par with the best recognition algorithms in the Face Recognition Vendor Test.
Abstract: We describe an expression-invariant method for face recognition by fitting an identity/expression separated 3D Morphable Model to shape data. The expression model greatly improves recognition and retrieval rates in the uncooperative setting, while achieving recognition rates on par with the best recognition algorithms in the Face Recognition Vendor Test. The fitting is performed with a robust nonrigid ICP algorithm. It is able to perform face recognition in a fully automated scenario and on noisy data. The system was evaluated on two datasets, one with a high noise level and strong expressions, and the standard UND range scan database, showing that while expression invariance increases recognition and retrieval performance for the expression dataset, it does not decrease performance on the neutral dataset. The high recognition rates are achieved even with a purely shape based method, without taking image data into account.

221 citations


Book ChapterDOI
Amr Ahmed, Kai Yu, Wei Xu, Yihong Gong, Eric P. Xing
12 Oct 2008
TL;DR: This paper presents a framework for training hierarchical feed-forward models for visual recognition, using transfer learning from pseudo tasks, and shows that these pseudo tasks induce an informative inverse-Wishart prior on the functional behavior of the network, offering an effective way to incorporate useful prior knowledge into the network training.
Abstract: Building visual recognition models that adapt across different domains is a challenging task for computer vision. While feature-learning machines in the form of hierarchical feed-forward models (e.g., convolutional neural networks) showed promise in this direction, they are still difficult to train especially when few training examples are available. In this paper, we present a framework for training hierarchical feed-forward models for visual recognition, using transfer learning from pseudo tasks. These pseudo tasks are automatically constructed from data without supervision and comprise a set of simple pattern-matching operations. We show that these pseudo tasks induce an informative inverse-Wishart prior on the functional behavior of the network, offering an effective way to incorporate useful prior knowledge into the network training. In addition to being extremely simple to implement, and adaptable across different domains with little or no extra tuning, our approach achieves promising results on challenging visual recognition tasks, including object recognition, gender recognition, and ethnicity recognition.

163 citations


Book ChapterDOI
12 Oct 2008
TL;DR: By integrating the image partition hypotheses in an intuitive combined top-down and bottom-up recognition approach, this work improves object and feature support and explores possible extensions of the method and whether they provide improved performance.
Abstract: The joint tasks of object recognition and object segmentation from a single image are complex in their requirement of not only correct classification, but also deciding exactly which pixels belong to the object. Exploring all possible pixel subsets is prohibitively expensive, leading to recent approaches which use unsupervised image segmentation to reduce the size of the configuration space. Image segmentation, however, is known to be unstable, strongly affected by small image perturbations, feature choices, or different segmentation algorithms. This instability has led to advocacy for using multiple segmentations of an image. In this paper, we explore the question of how to best integrate the information from multiple bottom-up segmentations of an image to improve object recognition robustness. By integrating the image partition hypotheses in an intuitive combined top-down and bottom-up recognition approach, we improve object and feature support. We further explore possible extensions of our method and whether they provide improved performance. Results are presented on the MSRC 21-class data set and the Pascal VOC2007 object segmentation challenge.

143 citations


Proceedings ArticleDOI
01 Sep 2008
TL;DR: This work builds on the method of to create a prototype access control system, capable of handling variations in illumination and expression, as well as significant occlusion or disguise, and gaining a better understanding of the strengths and limitations of sparse representation as a tool for robust recognition.
Abstract: This work builds on the method of to create a prototype access control system, capable of handling variations in illumination and expression, as well as significant occlusion or disguise. Our demonstration will allow participants to interact with the algorithm, gaining a better understanding of the strengths and limitations of sparse representation as a tool for robust recognition.

140 citations


Patent
20 Feb 2008
TL;DR: In this paper, a view-based approach is presented that does not show the drawbacks of previous methods because it is robust to image noise, object occlusions, clutter, and contrast changes.
Abstract: The present invention provides a system and method for recognizing a 3D object in a single camera image and for determining the 3D pose of the object with respect to the camera coordinate system. In one typical application, the 3D pose is used to make a robot pick up the object. A view-based approach is presented that does not show the drawbacks of previous methods because it is robust to image noise, object occlusions, clutter, and contrast changes. Furthermore, the 3D pose is determined with a high accuracy. Finally, the presented method allows the recognition of the 3D object as well as the determination of its 3D pose in a very short computation time, making it also suitable for real-time applications. These improvements are achieved by the methods disclosed herein.

117 citations


Patent
03 Dec 2008
TL;DR: In this article, an image of an object is extracted from a particular frame in the video file, a further image is extracted from a subsequent frame, and a similarity value is calculated between the two extracted images.
Abstract: The present disclosure relates to systems and methods for modeling, recognizing, and tracking object images in video files. In one embodiment, a video file, which includes a plurality of frames, is received. An image of an object is extracted from a particular frame in the video file, and a subsequent image is also extracted from a subsequent frame. A similarity value is then calculated between the extracted images from the particular frame and subsequent frame. If the calculated similarity value exceeds a predetermined similarity threshold, the extracted object images are assigned to an object group. The object group is used to generate an object model associated with images in the group, wherein the model is comprised of image features extracted from optimal object images in the object group. Optimal images from the group are also used for comparison to other object models for purposes of identifying images.

100 citations
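
The grouping step described in the claim can be sketched directly: compare an image feature of each extracted object crop against the current group, and start a new group when similarity drops below the threshold. The color-histogram measure and threshold below are illustrative stand-ins, not the patent's specified features.

```python
# Hedged sketch of similarity-threshold grouping of object crops.
import cv2

SIM_THRESHOLD = 0.8  # assumed threshold

def histogram(image_bgr):
    h = cv2.calcHist([image_bgr], [0, 1, 2], None, [8, 8, 8],
                     [0, 256, 0, 256, 0, 256])
    return cv2.normalize(h, h).flatten()

def group_detections(crops):
    """Assign consecutive object crops to groups by similarity."""
    groups = []
    for crop in crops:
        h = histogram(crop)
        if groups and cv2.compareHist(groups[-1][-1][1], h,
                                      cv2.HISTCMP_CORREL) > SIM_THRESHOLD:
            groups[-1].append((crop, h))   # similar enough: same object group
        else:
            groups.append([(crop, h)])     # new object group
    return groups
```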


Patent
11 Feb 2008
TL;DR: In this article, a multimodal approach to object de-duplication is devised that analyzes an object to be stored and chooses a de-duplication technique that is likely to be effective for storing the object.
Abstract: Various object de-duplication techniques may be applied to object systems (such as to files in a file store) to identify similar or identical objects or portions thereof, so that duplicate objects or object portions may be associated with one copy, and the duplicate copies may be removed. However, an object de-duplication technique that is suitable for de-duplicating one type of object may be inefficient for de-duplicating another type of object; e.g., a de-duplication method that significantly condenses sets of small objects may achieve very little condensation among sets of large objects, and vice versa. A multimodal approach to object de-duplication may be devised that analyzes an object to be stored and chooses a de-duplication technique that is likely to be effective for storing the object. The object index may be configured to support several de-duplication schemes for indexing and storing many types of objects in a space-economizing manner.

99 citations
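
A minimal sketch of the multimodal dispatch described above, with assumed size cutoffs and two stand-in strategies (whole-object hashing for small objects, fixed-size chunking for large ones; real systems would likely use content-defined chunking):

```python
# Illustrative dispatcher for multimodal de-duplication.
import hashlib

SMALL_OBJECT_LIMIT = 64 * 1024   # assumed cutoff
CHUNK_SIZE = 8 * 1024            # fixed-size chunking for simplicity

store = {}  # digest -> bytes (the single shared copy)

def dedup_store(data: bytes):
    """Return a list of digests referencing stored content."""
    if len(data) <= SMALL_OBJECT_LIMIT:
        chunks = [data]                      # whole-object de-duplication
    else:
        chunks = [data[i:i + CHUNK_SIZE]     # chunk-level de-duplication
                  for i in range(0, len(data), CHUNK_SIZE)]
    refs = []
    for c in chunks:
        digest = hashlib.sha256(c).hexdigest()
        store.setdefault(digest, c)          # duplicates collapse here
        refs.append(digest)
    return refs
```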


Proceedings ArticleDOI
01 Jan 2008
TL;DR: An approach to human action recognition via local feature tracking and robust estimation of background motion, built on a robust feature extraction algorithm based on the KLT tracker and SIFT as well as a method for estimating dominant planes in the scene.
Abstract: This paper discusses an approach to human action recognition via local feature tracking and robust estimation of background motion. The main contribution is a robust feature extraction algorithm based on the KLT tracker and SIFT, as well as a method for estimating dominant planes in the scene. Multiple interest point detectors are used to provide a large number of features for every frame. The motion vectors for the features are estimated using optical flow and SIFT-based matching. The features are combined with image segmentation to estimate dominant homographies, and then separated into static and moving ones regardless of the camera motion. The action recognition approach can handle camera motion, zoom, human appearance variations, background clutter and occlusion. The motion compensation shows very good accuracy on a number of test sequences. The recognition system is extensively compared to state-of-the-art action recognition methods and shows improved results.
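
The dominant-motion separation can be approximated with standard tools: track features, fit a RANSAC homography, and treat inliers as background. A hedged OpenCV sketch (parameters are illustrative; the paper additionally uses SIFT matching and image segmentation):

```python
# Sketch of separating static from moving features via a dominant homography.
import cv2
import numpy as np

def split_static_moving(prev_gray, next_gray):
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=1000,
                                  qualityLevel=0.01, minDistance=7)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    ok = status.ravel() == 1
    p0, p1 = pts[ok], nxt[ok]

    # RANSAC homography: inliers follow the dominant (background) motion.
    H, inlier_mask = cv2.findHomography(p0, p1, cv2.RANSAC, 3.0)
    inliers = inlier_mask.ravel() == 1
    static_pts = p1[inliers]      # consistent with camera/background motion
    moving_pts = p1[~inliers]     # candidate object (actor) motion
    return static_pts, moving_pts, H
```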

Proceedings ArticleDOI
23 Jun 2008
TL;DR: The role of context for dense scene labeling in small, low-resolution images with impoverished appearance information is explored, and the algorithm achieves state-of-the-art performance on the MSRC and Corel datasets.
Abstract: Traditionally, object recognition is performed based solely on the appearance of the object. However, relevant information also exists in the scene surrounding the object. As supported by our human studies, this contextual information is necessary for accurate recognition in low resolution images. This scenario with impoverished appearance information, as opposed to using images of higher resolution, provides an appropriate venue for studying the role of context in recognition. In this paper, we explore the role of context for dense scene labeling in small images. Given a segmentation of an image, our algorithm assigns each segment to an object category based on the segment's appearance and contextual information. We explicitly model context between object categories through the use of relative location and relative scale, in addition to co-occurrence. We perform recognition tests on low and high resolution images, which vary significantly in the amount of appearance information present, using just the object appearance information, the combination of appearance and context, as well as just context without object appearance information (blind recognition). We also perform these tests in human studies and analyze our findings to reveal interesting patterns. With the use of our context model, our algorithm achieves state-of-the-art performance on the MSRC and Corel datasets.
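
A toy sketch of how such a context model can score a segment's label, combining appearance with learned co-occurrence and relative-location tables (all names and table shapes here are assumptions, not the paper's exact model):

```python
# Toy scoring function: appearance plus contextual agreement with neighbors.
import numpy as np

def label_score(c, appearance_logp, neighbors, co_occurrence, rel_location):
    """
    c               : candidate category index for this segment
    appearance_logp : log p(category | segment appearance), shape (C,)
    neighbors       : list of (neighbor_category, relative_position_bin)
    co_occurrence   : learned log co-occurrence table, shape (C, C)
    rel_location    : learned log relative-location table, shape (C, C, n_bins)
    """
    score = appearance_logp[c]
    for n_cat, pos_bin in neighbors:
        # Context rewards labels that co-occur with, and sit plausibly
        # relative to, the neighboring segments' labels.
        score += co_occurrence[c, n_cat] + rel_location[c, n_cat, pos_bin]
    return score
```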

Proceedings ArticleDOI
22 Sep 2008
TL;DR: An improvement of the original SIFT algorithm providing more reliable feature matching for the purpose of object recognition is proposed, and the main idea is to divide the features extracted from both the test and the model object image into several sub-collections before they are matched.
Abstract: The SIFT algorithm (Scale Invariant Feature Transform) proposed by Lowe [1] is an approach for extracting distinctive invariant features from images. It has been successfully applied to a variety of computer vision problems based on feature matching including object recognition, pose estimation, image retrieval and many others. However, in real-world applications there is still a need for improvement of the algorithm's robustness with respect to the correct matching of SIFT features. In this paper, an improvement of the original SIFT algorithm providing more reliable feature matching for the purpose of object recognition is proposed. The main idea is to divide the features extracted from both the test and the model object image into several sub-collections before they are matched. The features are divided into several sub-collections considering the features arising from different octaves, that is from different frequency domains. To evaluate the performance of the proposed approach, it was applied to real images acquired with the stereo camera system of the rehabilitation robotic system FRIEND II. The experimental results show an increase in the number of correct features matched and, at the same time, a decrease in the number of outliers in comparison with the original SIFT algorithm. Compared with the original SIFT algorithm, a 40% reduction in processing time was achieved for the matching of the stereo images.
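
The proposed sub-collection matching can be sketched with OpenCV: split keypoints by the octave they were detected in, then run the ratio test within each octave only. The octave unpacking below follows OpenCV's packed convention and is an implementation assumption, not code from the paper.

```python
# Sketch of octave-wise SIFT matching.
import cv2

def unpack_octave(kp):
    # OpenCV packs the octave into the low byte of kp.octave.
    octave = kp.octave & 255
    return octave - 256 if octave >= 128 else octave

def match_by_octave(img_test, img_model, ratio=0.75):
    sift = cv2.SIFT_create()
    kp_t, des_t = sift.detectAndCompute(img_test, None)
    kp_m, des_m = sift.detectAndCompute(img_model, None)

    # Split both keypoint sets into per-octave sub-collections.
    by_octave = {}
    for i, kp in enumerate(kp_t):
        by_octave.setdefault(unpack_octave(kp), ([], []))[0].append(i)
    for i, kp in enumerate(kp_m):
        by_octave.setdefault(unpack_octave(kp), ([], []))[1].append(i)

    bf = cv2.BFMatcher()
    good = []
    for t_idx, m_idx in by_octave.values():
        if len(t_idx) < 2 or len(m_idx) < 2:
            continue
        # Ratio test restricted to features from the same frequency domain.
        for m, n in bf.knnMatch(des_t[t_idx], des_m[m_idx], k=2):
            if m.distance < ratio * n.distance:
                good.append((t_idx[m.queryIdx], m_idx[m.trainIdx]))
    return good
```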

Proceedings Article
08 Dec 2008
TL;DR: This work presents a discriminative part-based approach for human action recognition from video sequences using motion features based on the recently proposed hidden conditional random field (hCRF) for object recognition.
Abstract: We present a discriminative part-based approach for human action recognition from video sequences using motion features. Our model is based on the recently proposed hidden conditional random field (hCRF) for object recognition. Similar to hCRF for object recognition, we model a human action by a flexible constellation of parts conditioned on image observations. Different from object recognition, our model combines both large-scale global features and local patch features to distinguish various actions. Our experimental results show that our model is comparable to other state-of-the-art approaches in action recognition. In particular, our experimental results demonstrate that combining large-scale global features and local patch features performs significantly better than directly applying hCRF on local patches alone.

Proceedings ArticleDOI
19 May 2008
TL;DR: Experimental results demonstrate that the system described in this paper is a highly competent object recognition system that is capable of locating numerous challenging objects amongst distractors.
Abstract: This paper studies the sequential object recognition problem faced by a mobile robot searching for specific objects within a cluttered environment. In contrast to current state-of-the-art object recognition solutions which are evaluated on databases of static images, the system described in this paper employs an active strategy based on identifying potential objects using an attention mechanism and planning to obtain images of these objects from numerous viewpoints. We demonstrate the use of a bag-of-features technique for ranking potential objects, and show that this measure outperforms geometric matching for invariance across viewpoints. Our system implements informed visual search by prioritising map locations and re-examining promising locations first. Experimental results demonstrate that our system is a highly competent object recognition system that is capable of locating numerous challenging objects amongst distractors.
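
The bag-of-features ranking measure can be sketched as histogram intersection over quantized local descriptors. A minimal illustration (the vocabulary, attention mechanism, and viewpoint planning are outside this snippet):

```python
# Minimal bag-of-features ranking sketch.
import numpy as np

def bow_histogram(descriptors, vocabulary):
    """Quantize local descriptors against a visual vocabulary."""
    dists = np.linalg.norm(descriptors[:, None] - vocabulary[None], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / max(hist.sum(), 1.0)

def rank_candidates(query_hist, candidate_hists):
    """Histogram intersection: higher = more promising to examine first."""
    scores = [np.minimum(query_hist, h).sum() for h in candidate_hists]
    return np.argsort(scores)[::-1]
```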

Book ChapterDOI
12 Oct 2008
TL;DR: This work proposes a novel representation to model 3D object classes that allows the model to synthesize novel views of an object class at recognition time and incorporates it in a novel two-step algorithm that is able to classify objects under arbitrary and/or unseen poses.
Abstract: An important task in object recognition is to enable algorithms to categorize objects under arbitrary poses in a cluttered 3D world. A recent paper by Savarese & Fei-Fei [1] has proposed a novel representation to model 3D object classes. In this representation stable parts of objects from one class are linked together to capture both the appearance and shape properties of the object class. We propose to extend this framework and improve the ability of the model to recognize poses that have not been seen in training. Inspired by works in single object view synthesis (e.g., Seitz & Dyer [2]), our new representation allows the model to synthesize novel views of an object class at recognition time. This mechanism is incorporated in a novel two-step algorithm that is able to classify objects under arbitrary and/or unseen poses. We compare our results on pose categorization with the model and dataset presented in [1]. In a second experiment, we collect a new, more challenging dataset of 8 object classes from crawling the web. In both experiments, our model shows competitive performances compared to [1] for classifying objects in unseen poses.

Book ChapterDOI
03 Sep 2008
TL;DR: A model is proposed that is able to extract object identity, position, and rotation angles, where each code is independent of all others; its behavior is demonstrated on complex three-dimensional objects under translation and in-depth rotation on homogeneous backgrounds.
Abstract: Primates are very good at recognizing objects independently of viewing angle or retinal position and outperform existing computer vision systems by far. But invariant object recognition is only one prerequisite for successful interaction with the environment. An animal also needs to assess an object's position and relative rotational angle. We propose here a model that is able to extract object identity, position, and rotation angles, where each code is independent of all others. We demonstrate the model behavior on complex three-dimensional objects under translation and in-depth rotation on homogeneous backgrounds. A similar model has previously been shown to extract hippocampal spatial codes from quasi-natural videos. The rigorous mathematical analysis of this earlier application carries over to the scenario of invariant object recognition.

Proceedings ArticleDOI
01 Jan 2008
TL;DR: This paper presents an approach for human action recognition by finding the discriminative key frames from a video sequence and representing them with the distribution of local motion features and their spatiotemporal arrangements.
Abstract: This paper presents an approach for human action recognition by finding the discriminative key frames from a video sequence and representing them with the distribution of local motion features and their spatiotemporal arrangements. In this approach, the key frames of the video sequence are selected by their discriminative power and represented by the local motion features detected in them and integrated from their temporal neighbors. In the key frame's representation, the spatial arrangements of the motion features are captured in a hierarchical spatial pyramid structure. Using frame-by-frame voting for recognition, our experiments demonstrate improved performance over most other known methods on popular benchmark data sets. Recognizing human action from image sequences is an appealing yet challenging problem in computer vision with many applications including motion capture, human-computer interaction, environment control, and security surveillance. In this paper, we focus on recognizing the activities of a person in an image sequence from local motion features and their spatiotemporal arrangements. Our approach is motivated by the recent success of the "bag-of-words" model for general object recognition in computer vision [21, 14]. This representation, which is adapted from the text retrieval literature, models the object by the distribution of words from a fixed visual code book, which is usually obtained by vector quantization of local image visual features. However, this method discards the spatial and the temporal relations among these visual features, which could be helpful in object recognition. Addressing this problem, our approach uses a hierarchical representation for the key frames of a given video sequence to integrate information from both the spatial and the temporal domains. We first apply a spatiotemporal feature detector to the video sequence and obtain the local motion features. Then we generate a visual word code book by quantization of the local motion features and assign a word label to each of them. Next we select key frames of the video sequence by their discriminative power. Then, for each key frame, we integrate the visual words from its nearby frames, divide the key frame spatially into finer subdivisions and compute in each cell the histograms of the visual words detected in this key frame and its temporal neighbors. Finally, we concatenate the histograms from all cells and use the concatenated histogram as the representation of the key frame.
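
The key-frame representation described above reduces to a spatial pyramid of visual-word histograms. A compact sketch under assumed inputs (feature locations plus word labels already pooled from the key frame and its temporal neighbors):

```python
# Sketch of a 2-level spatial pyramid over visual words in a key frame.
import numpy as np

def spatial_pyramid_hist(points, words, width, height, vocab_size, levels=2):
    """points: (N,2) feature locations; words: (N,) visual-word labels."""
    feats = []
    for level in range(levels):
        cells = 2 ** level  # 1x1 grid, then 2x2, ...
        for cy in range(cells):
            for cx in range(cells):
                in_cell = ((points[:, 0] // (width / cells) == cx) &
                           (points[:, 1] // (height / cells) == cy))
                feats.append(np.bincount(words[in_cell],
                                         minlength=vocab_size))
    # Concatenated histograms from all cells form the key-frame descriptor.
    return np.concatenate(feats)
```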

Patent
14 Mar 2008
TL;DR: In this article, a feature information collecting apparatus includes: a vehicle position information acquiring device that acquires the vehicle's position information that represents the current position of the vehicle; an image information acquisition device that collects image information for the vicinity of a vehicle; a recognition result storing device that stores the recognition results for the recognition target object obtained by the image recognition device, in association with information of the recognition position of a target object based on the vehicle position.
Abstract: A feature information collecting apparatus includes: a vehicle position information acquiring device that acquires the vehicle position information that represents the current position of the vehicle; an image information acquiring device that acquires image information for the vicinity of the vehicle; an image recognition device that carries out image recognition processing for the recognition target object included in the image information; a recognition result storing device that stores the recognition information that represents the recognition results for the recognition target object obtained by the image recognition device, in association with information of the recognition position of the recognition target object based on the vehicle position information; and a learned feature extraction device that extracts recognition target objects that can be repeatedly recognized by image recognition as learned features based on a plurality of sets of recognition information that is related to the same place stored in a recognition result storing device due to the image information for the same place being recognized a plurality of times by image recognition, and outputs this along with the position information for the learned features.

Book ChapterDOI
26 Jun 2008
TL;DR: An innovative, mobile museum guide system is presented, which enables camera phones to recognize paintings in art galleries; a k-means based clustering approach was found to significantly reduce the computational time of feature matching.
Abstract: This article explores the feasibility of a market-ready, mobile pattern recognition system based on the latest findings in the field of object recognition and currently available hardware and network technology. More precisely, an innovative, mobile museum guide system is presented, which enables camera phones to recognize paintings in art galleries. After careful examination, the algorithms Scale-Invariant Feature Transform (SIFT) and Speeded Up Robust Features (SURF) were found to be the most promising for this goal. Consequently, both have been integrated into a fully implemented prototype system and their performance has been thoroughly evaluated under realistic conditions. In order to speed up the matching process for finding the corresponding sample in the feature database, an approximation to Nearest Neighbor Search was investigated. The k-means based clustering approach was found to significantly reduce the computational time.
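
The k-means speed-up mentioned above can be sketched as cluster-pruned nearest-neighbor search: quantize the database descriptors once, then scan only the query's bucket. Cluster count and library choice here are illustrative:

```python
# Sketch of k-means-accelerated approximate nearest-neighbor matching.
import numpy as np
from sklearn.cluster import KMeans

def build_index(db_descriptors, n_clusters=64):
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(db_descriptors)
    buckets = {c: np.where(km.labels_ == c)[0] for c in range(n_clusters)}
    return km, buckets

def approx_nearest(query, km, buckets, db_descriptors):
    c = km.predict(query.reshape(1, -1))[0]   # route query to its cluster
    idx = buckets[c]
    d = np.linalg.norm(db_descriptors[idx] - query, axis=1)
    return idx[d.argmin()]  # approximate NN: only one bucket is scanned
```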

Journal ArticleDOI
TL;DR: This paper focuses on making use of the robot's manipulation abilities to learn complete object representations suitable for 3D object recognition, and shows that the acquired data is of sufficient quality to train a classifier that can recognize 3D objects independently of the viewpoint.
Abstract: The exploration and learning of new objects is an essential capability of a cognitive robot. In this paper we focus on making use of the robot's manipulation abilities to learn complete object representations suitable for 3D object recognition. Taking control of the object allows the robot to focus on relevant parts of the images, thus bypassing potential pitfalls of purely bottom-up attention and segmentation. The main contribution of the paper consists in integrated visuomotor processes that allow the robot to learn object representations by manipulation without having any prior knowledge about the objects. Our experimental results show that the acquired data is of sufficient quality to train a classifier that can recognize 3D objects independently of the viewpoint.

Patent
04 Sep 2008
TL;DR: In this article, a method of tracking objects on a plane within video images of the objects captured by a video camera is proposed, which includes processing the captured video images so as to extract one or more image features from each object and generating object identification data for each object, from the comparing, which identifies the respective object on the plane.
Abstract: A method of tracking objects on a plane within video images of the objects captured by a video camera. The method includes processing the captured video images so as to extract one or more image features from each object, detecting each of the objects from a relative position of the objects on the plane as viewed from the captured video images by comparing the one or more extracted image features associated with each object with sample image features from a predetermined set of possible example objects which the captured video images may contain; and generating object identification data for each object, from the comparing, which identifies the respective object on the plane. The method further includes generating a three dimensional model of the plane and logging, for each detected object, the object identification data for each object which identifies the respective object on the plane together with object path data. The object path provides a position of the object on the three dimensional model of the plane from the video images with respect to time and relates to the path that each object has taken within the video images. The logging includes detecting an occlusion event in dependence upon whether a first image feature associated with a first of the objects obscures a whole or part of at least a second image feature associated with at least a second of the objects; and, if an occlusion event is detected, associating the object identification data for the first object and the object identification data for the second object with the object path data for both the first object and the second object respectively and logging the associations. The logging further includes identifying at least one of the objects involved in the occlusion event in dependence upon a comparison between the one or more image features associated with that object and the sample image features from the predetermined set of possible example objects, and updating the logged path data after the identification of at least one of the objects so that the respective path data is associated with the respective identified object.

Book ChapterDOI
26 Mar 2008
TL;DR: A system which allows users to request information on physical objects by taking a picture of them using a mobile phone with an integrated camera, and which identifies an object from a query image through multiple recognition stages, including local visual features, global geometry, and optionally also metadata such as GPS location.
Abstract: We present a system which allows users to request information on physical objects by taking a picture of them. This way, using a mobile phone with an integrated camera, users can interact with objects or "things" in a very simple manner. A further advantage is that the objects themselves don't have to be tagged with any kind of markers. At the core of our system lies an object recognition method, which identifies an object from a query image through multiple recognition stages, including local visual features, global geometry, and optionally also metadata such as GPS location. We present two applications for our system, namely a slide tagging application for presentation screens in smart meeting rooms and a cityguide on a mobile phone. Both systems are fully functional, including an application on the mobile phone, which allows simplest point-and-shoot interaction with objects. Experiments evaluate the performance of our approach in both application scenarios and show good recognition results under challenging conditions.

Proceedings ArticleDOI
07 Jun 2008
TL;DR: The experimental results demonstrate the effectiveness of the SIFT (scale-invariant feature transform) approach and show that the algorithm is robust and meets real-time performance requirements.
Abstract: SIFT (scale-invariant feature transform) is used to solve the visual tracking problem, where the appearances of the tracked object and scene background change during tracking. The implementation of this algorithm has five major stages: scale-space extrema detection; keypoint localization; orientation assignment; keypoint descriptor computation; and keypoint matching. In the first frame, the object is selected as the template and its SIFT features are computed. In each following frame, SIFT features are computed, and the Euclidean distance between the object's SIFT features and the frame's SIFT features is used to compute the accurate position of the matched object. The experimental results on real video sequences demonstrate the effectiveness of this approach and show that the algorithm is robust and achieves real-time performance. It can solve the matching problem with translation, rotation and affine distortion between images. It plays an important role in video object tracking and video object retrieval.
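
The tracking loop described above maps onto a few OpenCV calls: SIFT on the template once, per-frame SIFT plus a ratio-test match, and a position estimate from the matched keypoints. A hedged sketch (the centroid estimate is a simplification of the paper's position computation):

```python
# Sketch of SIFT-template tracking with ratio-test matching.
import cv2
import numpy as np

sift = cv2.SIFT_create()
bf = cv2.BFMatcher()

def make_template(first_frame_gray, roi):
    x, y, w, h = roi
    patch = first_frame_gray[y:y + h, x:x + w]
    return sift.detectAndCompute(patch, None)  # (keypoints, descriptors)

def locate(frame_gray, template, ratio=0.75):
    kp_t, des_t = template
    kp_f, des_f = sift.detectAndCompute(frame_gray, None)
    if des_f is None or len(kp_f) < 2:
        return None
    good = [m for m, n in bf.knnMatch(des_t, des_f, k=2)
            if m.distance < ratio * n.distance]
    if not good:
        return None
    pts = np.float32([kp_f[m.trainIdx].pt for m in good])
    return pts.mean(axis=0)  # object position: centroid of matched points
```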

Proceedings ArticleDOI
23 Jun 2008
TL;DR: Experiments conducted on object recognition show that when plugging the kernel into SVMs, the authors clearly outperform SVMs with "context-free" kernels; the paper shows that the fixed point of this energy is a new type of kernel ("CDK") which also satisfies the Mercer condition.
Abstract: The success of kernel methods including support vector networks (SVMs) strongly depends on the design of appropriate kernels. While initially kernels were designed in order to handle fixed-length data, their extension to unordered, variable-length data became more than necessary for real pattern recognition problems such as object recognition and bioinformatics. We focus in this paper on object recognition using a new type of kernel referred to as "context-dependent". Objects, seen as constellations of local features (interest points, regions, etc.), are matched by minimizing an energy function mixing (1) a fidelity term which measures the quality of feature matching, (2) a neighborhood criterion which captures the object geometry and (3) a regularization term. We will show that the fixed point of this energy is a "context-dependent" kernel ("CDK") which also satisfies the Mercer condition. Experiments conducted on object recognition show that when plugging our kernel into SVMs, we clearly outperform SVMs with "context-free" kernels.
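
A heavily simplified sketch of the fixed-point construction: start from a context-free similarity matrix and iterate an update that mixes the fidelity term with neighborhood agreement. The exact energy and normalization in the paper differ; this is only illustrative.

```python
# Illustrative (assumed, simplified) context-dependent kernel iteration.
import numpy as np

def context_dependent_kernel(S, P, alpha=0.1, beta=1.0, iters=10):
    """S: (n,n) feature-similarity matrix; P: (n,n) neighborhood matrix."""
    K = np.exp(S / beta)
    K /= K.sum()
    for _ in range(iters):
        # Fidelity term plus diffusion of similarity through neighborhoods:
        # spatially consistent matches reinforce each other.
        K = np.exp(S / beta + (alpha / beta) * (P @ K @ P.T))
        K /= K.sum()  # normalization keeps the iteration stable
    return K
```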

Proceedings ArticleDOI
01 Dec 2008
TL;DR: This paper proposes a method of object recognition and segmentation using the scale-invariant feature transform (SIFT) and graph cuts; thanks to this combination, both recognition and segmentation are performed automatically under cluttered backgrounds including occlusion.
Abstract: In this paper, we propose a method of object recognition and segmentation using the scale-invariant feature transform (SIFT) and graph cuts. The SIFT feature is invariant to rotations, scale changes, and illumination changes, and it is often used for object recognition. However, in previous object recognition work using SIFT, the object region was simply presumed by an affine transformation and the accurate object region was not segmented. On the other hand, graph cuts has been proposed as a method for segmenting a detailed object region, but it was necessary to give seeds manually. By combining SIFT and graph cuts, in our method, the existence of objects is first recognized by a voting process over SIFT keypoints. After that, the object region is cut out by graph cuts using SIFT keypoints as seeds. Thanks to this combination, both recognition and segmentation are performed automatically under cluttered backgrounds including occlusion.
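
The recognize-then-segment pipeline can be sketched by seeding a graph-cut segmenter with matched SIFT keypoints; here cv2.grabCut (itself graph-cut based) serves as a stand-in for the paper's graph-cut formulation:

```python
# Sketch: matched SIFT keypoints seed a graph-cut-based segmentation.
import cv2
import numpy as np

def segment_with_keypoint_seeds(image_bgr, matched_pts, seed_radius=5):
    # Everything starts as "probably background"; keypoint disks become
    # definite foreground seeds.
    mask = np.full(image_bgr.shape[:2], cv2.GC_PR_BGD, np.uint8)
    for (x, y) in matched_pts:
        cv2.circle(mask, (int(x), int(y)), seed_radius, cv2.GC_FGD, -1)

    bgd = np.zeros((1, 65), np.float64)
    fgd = np.zeros((1, 65), np.float64)
    cv2.grabCut(image_bgr, mask, None, bgd, fgd, 5, cv2.GC_INIT_WITH_MASK)

    fg = (mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)
    return fg.astype(np.uint8) * 255  # binary object region
```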

Proceedings ArticleDOI
26 Oct 2008
TL;DR: This paper presents a luminance field manifold trajectory analysis based solution for human activity recognition that requires no explicit object-level information extraction and understanding, is computationally efficient, and can operate in real time.
Abstract: The explosive growth of video content in recent years, fueled by the technological leaps in computing and communication, has created new challenges for video content analysis that can serve applications in video surveillance, video searching and mining. Human action detection and recognition is one of the important tasks in this effort. In this paper, we present a luminance field manifold trajectory analysis based solution for human activity recognition, without explicit object-level information extraction and understanding. This approach is computationally efficient and can operate in real time. The recognition performance is also comparable with the state of the art in comparable setups.

Patent
15 Apr 2008
TL;DR: In this paper, a system and a method for generating effects in a webcam application is presented, which includes identifying a first object and a second object in a video image and adding a first user-created object to the video image to create an altered video image.
Abstract: A system and a method for generating effects in a webcam application are provided. The method includes identifying a first object and a second object in a video image. The method also includes adding a first user-created object to the video image to create an altered video image and adding a second user-created object to the altered video image to further alter the altered video image. Other steps included are associating the second user-created object with the second object; identifying a movement of the second object; and moving the second user-created object in the altered video image in accordance with the association of the second user-created object with the second object. The first object is a static object, and the first user-created object is manually movable. The movement of the second user-created object in association with the second object is independent of a movement of the first user-created object.

Proceedings ArticleDOI
22 Apr 2008
TL;DR: The design and implementation of a dual-camera sensor network that can be used as a memory assistant tool for assisted living performs energy-efficient object detection and recognition of commonly misplaced objects and can seamlessly integrate feedback from the user to improve the robustness of object recognition.
Abstract: This paper presents the design and implementation of a dual-camera sensor network that can be used as a memory assistant tool for assisted living. Our system performs energy-efficient object detection and recognition of commonly misplaced objects. The novelty in our approach is the ability to trade off recognition accuracy against computational efficiency by employing a combination of low-complexity but less precise color histogram-based image recognition together with more complex image recognition using SIFT descriptors. In addition, our system can seamlessly integrate feedback from the user to improve the robustness of object recognition. Experimental results reveal that our system is computationally efficient and adaptive to slow changes in environmental conditions.
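
The accuracy/efficiency trade-off can be sketched as a two-stage matcher: a cheap color-histogram comparison first, escalating to SIFT only when the cheap stage is not confident. The thresholds and database layout below are assumptions, not the paper's design:

```python
# Sketch of two-stage recognition: cheap histograms first, SIFT as fallback.
import cv2

def recognize(query_bgr, database, hist_confident=0.9, ratio=0.75):
    """database: name -> {'hist': color histogram, 'sift': descriptors}."""
    qh = cv2.calcHist([query_bgr], [0, 1, 2], None, [8, 8, 8],
                      [0, 256, 0, 256, 0, 256])
    cv2.normalize(qh, qh)

    best_name, best_score = None, -1.0
    for name, obj in database.items():
        s = cv2.compareHist(qh, obj['hist'], cv2.HISTCMP_CORREL)
        if s > best_score:
            best_name, best_score = name, s
    if best_score >= hist_confident:
        return best_name                      # cheap stage is confident

    # Escalate: SIFT matching against each stored object's descriptors.
    sift, bf = cv2.SIFT_create(), cv2.BFMatcher()
    gray = cv2.cvtColor(query_bgr, cv2.COLOR_BGR2GRAY)
    _, q_des = sift.detectAndCompute(gray, None)

    def good_matches(des):
        return sum(m.distance < ratio * n.distance
                   for m, n in bf.knnMatch(q_des, des, k=2))

    return max(database, key=lambda name: good_matches(database[name]['sift']))
```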

Journal ArticleDOI
TL;DR: This paper proposes to also use co-location and co-activation, together with weak top-down constraints, such as alignment, as guiding principles for learning the appearance of local object parts.