
Showing papers on "3D single-object recognition published in 2007"


Journal ArticleDOI
TL;DR: A hierarchical system that closely follows the organization of visual cortex and builds an increasingly complex and invariant feature representation by alternating between a template matching and a maximum pooling operation is described.
Abstract: We introduce a new general framework for the recognition of complex visual scenes, which is motivated by biology: We describe a hierarchical system that closely follows the organization of visual cortex and builds an increasingly complex and invariant feature representation by alternating between a template matching and a maximum pooling operation. We demonstrate the strength of the approach on a range of recognition tasks: From invariant single object recognition in clutter to multiclass categorization problems and complex scene understanding tasks that rely on the recognition of both shape-based as well as texture-based objects. Given the biological constraints that the system had to satisfy, the approach performs surprisingly well: It has the capability of learning from only a few training examples and competes with state-of-the-art systems. We also discuss the existence of a universal, redundant dictionary of features that could handle the recognition of most object categories. In addition to its relevance for computer vision, the success of this approach suggests a plausibility proof for a class of feedforward models of object recognition in cortex.

1,779 citations
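
The alternation between template matching and max pooling that the abstract describes can be made concrete with a minimal sketch. This is not the authors' HMAX implementation; the normalized-correlation S-layer response, the random templates, and the 8×8 pooling grid are illustrative assumptions.

```python
import numpy as np

def s_layer(image, templates):
    """Template matching: response of each template at every position
    (a stand-in for tuned 'simple' units; real HMAX uses Gabor filters
    at the first stage and learned patches higher up)."""
    h, w = image.shape
    th, tw = templates[0].shape
    out = np.zeros((len(templates), h - th + 1, w - tw + 1))
    for k, t in enumerate(templates):
        t = (t - t.mean()) / (t.std() + 1e-8)
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                patch = image[i:i + th, j:j + tw]
                patch = (patch - patch.mean()) / (patch.std() + 1e-8)
                out[k, i, j] = (patch * t).mean()
    return out

def c_layer(responses, pool=8):
    """Max pooling over local neighborhoods: position invariance comes
    from taking the maximum response inside each pooling cell."""
    k, h, w = responses.shape
    cropped = responses[:, :h - h % pool, :w - w % pool]
    return cropped.reshape(k, cropped.shape[1] // pool, pool,
                           cropped.shape[2] // pool, pool).max(axis=(2, 4))

# toy usage: one image, two random 'templates'
img = np.random.rand(64, 64)
templates = [np.random.rand(9, 9) for _ in range(2)]
features = c_layer(s_layer(img, templates))  # invariant C-layer maps
```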


Proceedings ArticleDOI
29 Sep 2007
TL;DR: This paper uses a bag of words approach to represent videos, and presents a method to discover relationships between spatio-temporal words in order to better describe the video data.
Abstract: In this paper we introduce a 3-dimensional (3D) SIFT descriptor for video or 3D imagery such as MRI data. We also show how this new descriptor is able to better represent the 3D nature of video data in the application of action recognition. This paper will show how 3D SIFT is able to outperform previously used description methods in an elegant and efficient manner. We use a bag of words approach to represent videos, and present a method to discover relationships between spatio-temporal words in order to better describe the video data.

1,757 citations
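
The bag-of-words video representation mentioned above can be sketched generically: cluster local spatio-temporal descriptors into a vocabulary, then histogram each video's words. This is a standard pipeline, not the paper's code; the random descriptor arrays stand in for real 3D SIFT output, and the vocabulary size is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_descriptors, k=200):
    """Cluster local descriptors into k visual words."""
    return KMeans(n_clusters=k, n_init=10).fit(all_descriptors)

def bag_of_words(video_descriptors, vocab):
    """Represent one video as a normalized histogram of word counts."""
    words = vocab.predict(video_descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-8)

# toy usage: descriptors would come from a 3D SIFT extractor
rng = np.random.default_rng(0)
descs_per_video = [rng.normal(size=(500, 128)) for _ in range(3)]
vocab = build_vocabulary(np.vstack(descs_per_video), k=50)
histograms = [bag_of_words(d, vocab) for d in descs_per_video]
```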


Proceedings ArticleDOI
26 Dec 2007
TL;DR: This paper describes face data as resulting from a generative model which incorporates both within-individual and between-individual variation, and calculates the likelihood that the differences between face images are entirely due to within-individual variability.
Abstract: Many current face recognition algorithms perform badly when the lighting or pose of the probe and gallery images differ. In this paper we present a novel algorithm designed for these conditions. We describe face data as resulting from a generative model which incorporates both within-individual and between-individual variation. In recognition we calculate the likelihood that the differences between face images are entirely due to within-individual variability. We extend this to the non-linear case, where an arbitrary face manifold can be described and noise is position-dependent. We also develop a "tied" version of the algorithm that allows explicit comparison across quite different viewing conditions. We demonstrate that our model produces state-of-the-art results for (i) frontal face recognition and (ii) face recognition under varying pose.

1,099 citations
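
The core idea, testing whether the difference between two face images is explained by within-individual variability alone, can be sketched in the linear-Gaussian case. The covariances B and W below are illustrative stand-ins for learned between- and within-individual variation; the paper's non-linear manifold and "tied" extensions are not captured.

```python
import numpy as np
from scipy.stats import multivariate_normal

def pair_log_likelihood_ratio(x1, x2, B, W, mu):
    """Log-likelihood ratio that two face vectors share one identity,
    under a Gaussian generative model x = mu + h + w with
    between-individual h ~ N(0, B) and within-individual w ~ N(0, W)."""
    d = len(mu)
    x = np.concatenate([x1 - mu, x2 - mu])
    same = np.block([[B + W, B], [B, B + W]])        # identity h shared
    diff = np.block([[B + W, np.zeros((d, d))],
                     [np.zeros((d, d)), B + W]])     # h independent
    return (multivariate_normal(np.zeros(2 * d), same).logpdf(x)
            - multivariate_normal(np.zeros(2 * d), diff).logpdf(x))

# toy usage with random symmetric positive-definite covariances
rng = np.random.default_rng(1)
d = 4
A1, A2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
B, W = A1 @ A1.T + np.eye(d), A2 @ A2.T + np.eye(d)
print(pair_log_likelihood_ratio(rng.normal(size=d), rng.normal(size=d),
                                B, W, np.zeros(d)))
```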


Proceedings ArticleDOI
26 Dec 2007
TL;DR: This work proposes to incorporate semantic object context as a post-processing step into any off-the-shelf object categorization model using a conditional random field (CRF) framework, which maximizes object label agreement according to contextual relevance.
Abstract: In the task of visual object categorization, semantic context can play the very important role of reducing ambiguity in objects' visual appearance. In this work we propose to incorporate semantic object context as a post-processing step into any off-the-shelf object categorization model. Using a conditional random field (CRF) framework, our approach maximizes object label agreement according to contextual relevance. We compare two sources of context: one learned from training data and another queried from Google Sets. The overall performance of the proposed framework is evaluated on the PASCAL and MSRC datasets. We conclude that incorporating context into object categorization greatly improves categorization accuracy.

740 citations
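
A hedged sketch of context post-processing in the spirit of this paper: per-object scores from any base categorizer are re-ranked against pairwise label co-occurrences. Iterated conditional modes is used here as a simple stand-in for full CRF inference, and the score matrices are invented for illustration.

```python
import numpy as np

def contextual_relabel(unary, cooccur, n_iters=10, alpha=0.5):
    """Post-process per-object class scores with pairwise context.
    unary: (n_objects, n_classes) log-scores from a base categorizer.
    cooccur: (n_classes, n_classes) log co-occurrence affinities,
    learned from training data or an external source (the paper also
    tries Google Sets). Greedy coordinate updates (ICM) approximate
    CRF inference."""
    labels = unary.argmax(axis=1)
    for _ in range(n_iters):
        for i in range(len(labels)):
            context = cooccur[:, labels].sum(axis=1) - cooccur[:, labels[i]]
            labels[i] = (unary[i] + alpha * context).argmax()
    return labels

# toy usage: three detections, two classes; context breaks the tie
unary = np.log(np.array([[0.6, 0.4], [0.45, 0.55], [0.5, 0.5]]))
cooccur = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
print(contextual_relabel(unary, cooccur))
```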


Proceedings ArticleDOI
26 Dec 2007
TL;DR: The alignment method improves performance on a face recognition task, both over unaligned images and over images aligned with a face alignment algorithm specifically developed for and trained on hand-labeled face images.
Abstract: Many recognition algorithms depend on careful positioning of an object into a canonical pose, so the position of features relative to a fixed coordinate system can be examined. Currently, this positioning is done either manually or by training a class-specialized learning algorithm with samples of the class that have been hand-labeled with parts or poses. In this paper, we describe a novel method to achieve this positioning using poorly aligned examples of a class with no additional labeling. Given a set of unaligned examplars of a class, such as faces, we automatically build an alignment mechanism, without any additional labeling of parts or poses in the data set. Using this alignment mechanism, new members of the class, such as faces resulting from a face detector, can be precisely aligned for the recognition process. Our alignment method improves performance on a face recognition task, both over unaligned images and over images aligned with a face alignment algorithm specifically developed for and trained on hand-labeled face images. We also demonstrate its use on an entirely different class of objects (cars), again without providing any information about parts or pose to the learning algorithm.

375 citations


Proceedings ArticleDOI
26 Dec 2007
TL;DR: It is demonstrated that it is possible to automatically learn object models from video of household activities and employ these models for activity recognition, without requiring any explicit human labeling.
Abstract: We propose an approach to activity recognition based on detecting and analyzing the sequence of objects that are being manipulated by the user. In domains such as cooking, where many activities involve similar actions, object-use information can be a valuable cue. In order for this approach to scale to many activities and objects, however, it is necessary to minimize the amount of human-labeled data that is required for modeling. We describe a method for automatically acquiring object models from video without any explicit human supervision. Our approach leverages sparse and noisy readings from RFID tagged objects, along with common-sense knowledge about which objects are likely to be used during a given activity, to bootstrap the learning process. We present a dynamic Bayesian network model which combines RFID and video data to jointly infer the most likely activity and object labels. We demonstrate that our approach can achieve activity recognition rates of more than 80% on a real-world dataset consisting of 16 household activities involving 33 objects with significant background clutter. We show that the combination of visual object recognition with RFID data is significantly more effective than the RFID sensor alone. Our work demonstrates that it is possible to automatically learn object models from video of household activities and employ these models for activity recognition, without requiring any explicit human labeling.

359 citations
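
A single-frame simplification of the joint inference the paper describes: vision and RFID likelihoods over object labels are fused and marginalized against common-sense activity-object priors. The dynamic (temporal) part of the Bayesian network is deliberately omitted, and all arrays below are illustrative.

```python
import numpy as np

def infer_activity(obj_vision_ll, obj_rfid_ll, p_obj_given_act, p_act):
    """Score each activity by fusing vision and RFID log-likelihoods
    over object labels, then marginalizing over objects with
    activity-conditioned object-use priors."""
    p_obj = np.exp(obj_vision_ll + obj_rfid_ll)      # fused object evidence
    scores = p_act * (p_obj_given_act @ p_obj)       # marginalize objects
    return scores / scores.sum()

# toy usage: 2 activities ('make tea', 'grind coffee'), 3 objects
p_obj_given_act = np.array([[0.6, 0.3, 0.1],         # kettle, mug, grinder
                            [0.2, 0.3, 0.5]])
p_act = np.array([0.5, 0.5])
print(infer_activity(np.log([0.7, 0.2, 0.1]),
                     np.log([0.8, 0.1, 0.1]),
                     p_obj_given_act, p_act))
```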


Proceedings ArticleDOI
29 Jul 2007
TL;DR: A system for inserting new objects into existing photographs by querying a vast image-based object library, pre-computed using a publicly available Internet object database, to shield the user from all of the arduous tasks typically involved in image compositing.
Abstract: We present a system for inserting new objects into existing photographs by querying a vast image-based object library, pre-computed using a publicly available Internet object database. The central goal is to shield the user from all of the arduous tasks typically involved in image compositing. The user is only asked to do two simple things: 1) pick a 3D location in the scene to place a new object; 2) select an object to insert using a hierarchical menu. We pose the problem of object insertion as a data-driven, 3D-based, context-sensitive object retrieval task. Instead of trying to manipulate the object to change its orientation, color distribution, etc. to fit the new image, we simply retrieve an object of a specified class that has all the required properties (camera pose, lighting, resolution, etc) from our large object library. We present new automatic algorithms for improving object segmentation and blending, estimating true 3D object size and orientation, and estimating scene lighting conditions. We also present an intuitive user interface that makes object insertion fast and simple even for the artistically challenged.

287 citations


Journal ArticleDOI
TL;DR: An approach to category learning and recognition that is based on recent computational advances is described, in which objects are represented by a hierarchy of fragments that are extracted during learning from observed examples.

251 citations


Proceedings ArticleDOI
15 Apr 2007
TL;DR: The experimental results demonstrate the robustness of SIFT features to expression, accessory and pose variations; a simple non-statistical matching strategy combined with local and global similarity on key-point clusters is used to solve face recognition problems.
Abstract: Scale invariant feature transform (SIFT) proposed by Lowe has been widely and successfully applied to object detection and recognition. However, the representation ability of SIFT features in face recognition has rarely been investigated systematically. In this paper, we propose to use person-specific SIFT features and a simple non-statistical matching strategy combined with local and global similarity on key-point clusters to solve face recognition problems. Large-scale experiments on the FERET and CAS-PEAL face databases using only one training sample per person have been carried out to compare it with other non-person-specific features such as the Gabor wavelet feature and the local binary pattern feature. The experimental results demonstrate the robustness of SIFT features to expression, accessory and pose variations.

225 citations
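
The SIFT matching step can be sketched with OpenCV (SIFT ships with OpenCV ≥ 4.4). The ratio test below is Lowe's standard criterion; the paper's person-specific key-point clusters and local/global similarity combination are not reproduced.

```python
import cv2

def sift_match_score(img_a, img_b, ratio=0.75):
    """Count ratio-test-surviving SIFT matches between two grayscale
    face images; more surviving matches suggests the same person."""
    sift = cv2.SIFT_create()
    _, des_a = sift.detectAndCompute(img_a, None)
    _, des_b = sift.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return 0
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(des_a, des_b, k=2)
    good = [p for p in matches
            if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    return len(good)
```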


Proceedings ArticleDOI
17 Jun 2007
TL;DR: This work presents a discriminative shape-based algorithm for object category localization and recognition that learns object models in a weakly-supervised fashion, without requiring the specification of object locations or pixel masks in the training data.
Abstract: We present a discriminative shape-based algorithm for object category localization and recognition. Our method learns object models in a weakly-supervised fashion, without requiring the specification of object locations or pixel masks in the training data. We represent object models as cliques of fully-interconnected parts, exploiting only the pairwise geometric relationships between them. The use of pairwise relationships enables our algorithm to successfully overcome several problems that are common to previously published methods. Even though our algorithm can easily incorporate local appearance information from richer features, we purposefully do not use them, in order to demonstrate that simple geometric relationships can match (or exceed) the performance of state-of-the-art object recognition algorithms.

181 citations


Proceedings ArticleDOI
17 Jun 2007
TL;DR: This work presents the results of applying three commonly used object recognition/detection algorithms (color histogram matching, SIFT matching, and boosted Haar-like features) to the dataset, and analyzes the successes and failures of these algorithms against product type and imaging conditions, both in terms of recognition rate and localization accuracy.
Abstract: The problem of using pictures of objects captured under ideal imaging conditions (here referred to as in vitro) to recognize objects in natural environments (in situ) is an emerging area of interest in computer vision and pattern recognition. Examples of tasks in this vein include assistive vision systems for the blind and object recognition for mobile robots; the proliferation of image databases on the web is bound to lead to more examples in the near future. Despite its importance, there is still a need for a freely available database to facilitate study of this kind of training/testing dichotomy. In this work one of our contributions is a new multimedia database of 120 grocery products, GroZi-120. For every product, two different recordings are available: in vitro images extracted from the web, and in situ images extracted from camcorder video collected inside a grocery store. As an additional contribution, we present the results of applying three commonly used object recognition/detection algorithms (color histogram matching, SIFT matching, and boosted Haar-like features) to the dataset. Finally, we analyze the successes and failures of these algorithms against product type and imaging conditions, both in terms of recognition rate and localization accuracy, in order to suggest ways forward for further research in this domain.
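
Of the three baselines applied to GroZi-120, color histogram matching is the easiest to sketch. The bin counts and the intersection measure below are illustrative choices, not necessarily those used in the paper; an in vitro web image and an in situ video crop would each pass through color_hist and be compared with hist_intersection.

```python
import cv2
import numpy as np

def color_hist(img_bgr, bins=(8, 8, 8)):
    """3D BGR color histogram, L1-normalized."""
    hist = cv2.calcHist([img_bgr], [0, 1, 2], None, list(bins),
                        [0, 256, 0, 256, 0, 256])
    return hist.flatten() / (hist.sum() + 1e-8)

def hist_intersection(h1, h2):
    """Histogram intersection: 1.0 for identical distributions."""
    return float(np.minimum(h1, h2).sum())
```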

Proceedings ArticleDOI
26 Dec 2007
TL;DR: An interesting computational property of the object hierarchy is observed: comparing recognition rates when using models of objects at different levels, the higher, more inclusive levels exhibit higher recall but lower precision than the class-specific level.
Abstract: We investigated the computational properties of natural object hierarchies in the context of constellation object class models, and their utility for object class recognition. We first observed an interesting computational property of the object hierarchy: comparing recognition rates when using models of objects at different levels, the higher, more inclusive levels (e.g., closed-frame vehicles or vehicles) exhibit higher recall but lower precision than the class-specific level (e.g., bus). These inherent differences suggest that combining object classifiers from different hierarchical levels into a single classifier may improve classification, as these models appear to capture different aspects of the object. We describe a method to combine these classifiers, and analyze the conditions under which improvement can be guaranteed. When given a small sample of a new object class, we describe a method to transfer knowledge across the tree hierarchy, between related objects. Finally, we describe extensive experiments using object hierarchies obtained from publicly available datasets, and show that the combined classifiers significantly improve recognition results.
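
A minimal sketch of the combination idea: detection scores from classifiers at several hierarchy levels are merged into a single score. The fixed weights are an illustrative assumption; the paper analyzes when such combinations are guaranteed to help, which this sketch does not attempt.

```python
import numpy as np

def combined_score(scores_by_level, weights):
    """Combine detection scores from classifiers at different hierarchy
    levels (e.g., 'vehicle' -> 'closed-frame vehicle' -> 'bus').
    Inclusive levels contribute recall, the specific level precision;
    in practice the weights would be validated per class."""
    return float(sum(w * s for w, s in zip(weights, scores_by_level)))

# toy usage: per-window scores from three levels of the hierarchy
vehicle, closed_frame, bus = 0.9, 0.7, 0.4
print(combined_score([vehicle, closed_frame, bus], weights=[0.2, 0.3, 0.5]))
```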

Proceedings ArticleDOI
17 Jun 2007
TL;DR: This paper proposes an approach for object class localization which goes beyond bounding boxes, as it also determines the outline of the object, and directly generates, evaluates and clusters shape masks.
Abstract: This paper proposes an approach for object class localization which goes beyond bounding boxes, as it also determines the outline of the object. Unlike most current localization methods, our approach does not require any hypothesis parameter space to be defined. Instead, it directly generates, evaluates and clusters shape masks. Thus, the presented framework produces more informative results for object class localization. For example, it easily learns and detects possible object viewpoints and articulations, which are often well characterized by the object outline. We evaluate the proposed approach on the challenging natural-scene Graz-02 object classes dataset. The results demonstrate the extended localization capabilities of our method.

Proceedings Article
03 Dec 2007
TL;DR: In this article, a probabilistic model is proposed to transfer the labels from the retrieval set to the input image, in an appropriate representation, to obtain hypotheses for object identities and locations.
Abstract: Current object recognition systems can only recognize a limited number of object categories; scaling up to many categories is the next challenge. We seek to build a system to recognize and localize many different object categories in complex scenes. We achieve this through a simple approach: by matching the input image, in an appropriate representation, to images in a large training set of labeled images. Due to regularities in object identities across similar scenes, the retrieved matches provide hypotheses for object identities and locations. We build a probabilistic model to transfer the labels from the retrieval set to the input image. We demonstrate the effectiveness of this approach and study algorithm component contributions using held-out test sets from the LabelMe database.
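
A nearest-neighbor sketch of the retrieval-and-transfer idea, assuming some global scene descriptor (e.g., GIST) has already been computed per image; the paper's probabilistic transfer model is replaced here by simple label voting.

```python
import numpy as np

def transfer_labels(query_feat, train_feats, train_labels, k=10):
    """Retrieve the k training scenes closest to the query in a global
    scene representation, then vote their object labels to form
    hypotheses for what the query contains."""
    dists = np.linalg.norm(train_feats - query_feat, axis=1)
    neighbors = np.argsort(dists)[:k]
    votes = {}
    for idx in neighbors:
        for label in train_labels[idx]:
            votes[label] = votes.get(label, 0) + 1
    return sorted(votes, key=votes.get, reverse=True)

# toy usage: 512-D scene features, per-image object label lists
rng = np.random.default_rng(2)
feats = rng.normal(size=(100, 512))
labels = [["car", "road"] if i % 2 else ["tree", "sky"] for i in range(100)]
print(transfer_labels(feats[0] + 0.01, feats, labels, k=5))
```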

Proceedings ArticleDOI
17 Jun 2007
TL;DR: This work proposes a novel framework for visual object recognition where object classes are represented by assemblies of partial surface models obeying loose local geometric constraints, and it outperforms the state-of-the-art algorithms for object detection and localization.
Abstract: Today's category-level object recognition systems largely focus on fronto-parallel views of objects with characteristic texture patterns. To overcome these limitations, we propose a novel framework for visual object recognition where object classes are represented by assemblies of partial surface models (PSMs) obeying loose local geometric constraints. The PSMs themselves are formed of dense, locally rigid assemblies of image features. Since our model only enforces local geometric consistency, both at the level of model parts and at the level of individual features within the parts, it is robust to viewpoint changes and intra-class variability. The proposed approach has been implemented, and it outperforms the state-of-the-art algorithms for object detection and localization recently compared in [14] on the Pascal 2005 VOC Challenge Cars Test 1 data.

Proceedings Article
06 Jan 2007
TL;DR: This paper presents a novel method for identifying and tracking objects in multiresolution digital video of partially cluttered environments and uses a learned "attentive" interest map on a low resolution data stream to direct a high resolution "fovea".
Abstract: Human object recognition in a physical 3-d environment is still far superior to that of any robotic vision system. We believe that one reason (out of many) for this--one that has not heretofore been significantly exploited in the artificial vision literature--is that humans use a fovea to fixate on, or near, an object, thus obtaining a very high resolution image of the object and rendering it easy to recognize. In this paper, we present a novel method for identifying and tracking objects in multiresolution digital video of partially cluttered environments. Our method is motivated by biological vision systems and uses a learned "attentive" interest map on a low resolution data stream to direct a high resolution "fovea." Objects that are recognized in the fovea can then be tracked using peripheral vision. Because object recognition is run only on a small foveal image, our system achieves performance in real-time object recognition and tracking that is well beyond simpler systems.
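
The low-resolution interest map directing a high-resolution fovea can be sketched as follows; the peak-picking and fixed fovea size are simplifications, and the learned attention model itself is not shown.

```python
import numpy as np

def foveate(low_res_interest, high_res_frame, fovea=128):
    """Pick the peak of an interest map computed on the low-resolution
    stream, then crop a high-resolution 'fovea' around the
    corresponding location for recognition."""
    y, x = np.unravel_index(np.argmax(low_res_interest),
                            low_res_interest.shape)
    sy = high_res_frame.shape[0] // low_res_interest.shape[0]
    sx = high_res_frame.shape[1] // low_res_interest.shape[1]
    cy, cx = y * sy, x * sx                 # map peak to full resolution
    half = fovea // 2
    y0, x0 = max(cy - half, 0), max(cx - half, 0)
    return high_res_frame[y0:y0 + fovea, x0:x0 + fovea]
```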

Proceedings ArticleDOI
26 Dec 2007
TL;DR: A new method to recognize 3D range images by matching local surface descriptors; new images are recognized by comparison with a training set, and the approach is evaluated on both synthetic and real 3D data with complex shapes.
Abstract: Recognition of 3D objects from different viewpoints is a difficult problem. In this paper, we propose a new method to recognize 3D range images by matching local surface descriptors. The input 3D surfaces are first converted into a set of local shape descriptors computed on surface patches defined by detected salient features. We compute the similarities between input 3D images by matching their descriptors with a pyramid kernel function. The similarity matrix of the images is used to train for classification using SVM, and new images can be recognized by comparing with the training set. The approach is evaluated on both synthetic and real 3D data with complex shapes.
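
A hedged sketch of the classification route described above: a pyramid-style match between descriptor sets fills a kernel matrix that an SVM consumes as precomputed. For brevity the descriptors are reduced to scalars, so this shows the flavor of a pyramid match kernel rather than the paper's multi-dimensional surface descriptors.

```python
import numpy as np
from sklearn.svm import SVC

def pyramid_match(a, b, levels=3, lo=-3.0, hi=3.0):
    """Simplified 1-D pyramid match between two descriptor sets:
    intersect histograms at finer-to-coarser bins, weighting matches
    found at finer levels more (new matches only at each level)."""
    score, prev = 0.0, 0.0
    for l in range(levels, -1, -1):          # finest level first
        ha, _ = np.histogram(a, bins=2 ** l, range=(lo, hi))
        hb, _ = np.histogram(b, bins=2 ** l, range=(lo, hi))
        inter = np.minimum(ha, hb).sum()
        score += (inter - prev) / 2 ** (levels - l)
        prev = inter
    return score

# toy usage: 6 descriptor sets, 2 classes, precomputed-kernel SVM
sets = [np.random.default_rng(i).normal(size=50) for i in range(6)]
y = [0, 0, 0, 1, 1, 1]
K = np.array([[pyramid_match(a, b) for b in sets] for a in sets])
clf = SVC(kernel="precomputed").fit(K, y)
```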

Patent
15 May 2007
TL;DR: In this paper, an object recognition device detects a position of a vehicle based on a running path obtained by GPS, vehicle speed, steering angle, etc., and also detects the position of the vehicle using a result of recognition of an object obtained using a captured image of a camera.
Abstract: An object recognition device detects the position of a vehicle based on a running path obtained from GPS, vehicle speed, steering angle, etc., and also detects the position of the vehicle based on the result of recognizing an object in a captured camera image. The device computes the positioning accuracy of the detected vehicle position, which generally deteriorates as the distance traveled by the vehicle increases. Positional data of the objects on the road to be recognized are stored in a map database beforehand. A recognition range along the road for the object to be recognized is set based on the detected position of the vehicle, the position of the object stored in the map database, and the computed positioning accuracy. The object is then recognized within the set recognition range by processing the captured camera image.

Proceedings ArticleDOI
17 Jun 2007
TL;DR: A new framework is delineated that integrates object recognition, motion estimation, and semantic-level recognition for the reliable recognition of hierarchical human-object interactions and the performance of the final activity recognition is superior to that of previous approaches.
Abstract: The paper presents a system that recognizes humans interacting with objects. We delineate a new framework that integrates object recognition, motion estimation, and semantic-level recognition for the reliable recognition of hierarchical human-object interactions. The framework is designed to integrate recognition decisions made by each component, and to probabilistically compensate for the failure of the components with the use of the decisions made by the other components. As a result, human-object interactions in an airport-like environment, such as 'a person carrying a baggage', 'a person leaving his/her baggage', or 'a person snatching another's baggage', are recognized. The experimental results show that not only the performance of the final activity recognition is superior to that of previous approaches, but also the accuracy of the object recognition and the motion estimation increases using feedback from the semantic layer. Several real examples illustrate the superior performance in recognition and semantic description of occurring events.

Journal ArticleDOI
12 Sep 2007-PLOS ONE
TL;DR: The viewpoint-independence of cross-modal object identification points to its mediation by a high-level abstract representation, and the correlation between spatial imagery scores and cross-modal performance suggests that construction of this high-level representation is linked to the ability to perform spatial transformations.
Abstract: Background: Previous research suggests that visual and haptic object recognition are viewpoint-dependent both within- and cross-modally. However, this conclusion may not be generally valid, as it was reached using objects oriented along their extended y-axis, resulting in differential surface processing in vision and touch. In the present study, we removed this differential by presenting objects along the z-axis, thus making all object surfaces more equally available to vision and touch. Methodology/Principal Findings: Participants studied previously unfamiliar objects, in groups of four, using either vision or touch. Subsequently, they performed a four-alternative forced-choice object identification task with the studied objects presented in both unrotated and rotated (180° about the x-, y-, and z-axes) orientations. Rotation impaired within-modal recognition accuracy in both vision and touch, but not cross-modal recognition accuracy. Within-modally, visual recognition accuracy was reduced by rotation about the x- and y-axes more than the z-axis, whilst haptic recognition was equally affected by rotation about all three axes. Cross-modal (but not within-modal) accuracy correlated with spatial (but not object) imagery scores. Conclusions/Significance: The viewpoint-independence of cross-modal object identification points to its mediation by a high-level abstract representation. The correlation between spatial imagery scores and cross-modal performance suggests that construction of this high-level representation is linked to the ability to perform spatial transformations. Within-modal viewpoint-dependence appears to have a different basis in vision than in touch, possibly due to surface occlusion being important in vision but not touch.

Proceedings ArticleDOI
05 Nov 2007
TL;DR: A face recognition system based on a recent approach that addresses both representation and recognition using artificial neural networks is presented, and it produces promising results for face verification and face recognition.
Abstract: Advances in face recognition have come from considering various aspects of this specialized perception problem. Earlier methods treated face recognition as a standard pattern recognition problem; later methods focused more on the representation aspect, after realizing its uniqueness using domain knowledge; more recent methods have been concerned with both representation and recognition, so a robust system with good generalization capability can be built by adopting state-of-the-art techniques from learning, computer vision, and pattern recognition. A face recognition system based on such a recent approach, addressing both representation and recognition using artificial neural networks, is presented. This paper initially provides an overview of the proposed face recognition system and explains the methodology used. It then evaluates the performance of the system by applying two photometric normalization techniques, histogram equalization and homomorphic filtering, and by comparing Euclidean distance and normalized correlation classifiers. The system produces promising results for face verification and face recognition.
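
The two comparison classifiers named in the abstract are one-liners; below is a sketch, with histogram equalization (one of the two photometric normalizations evaluated) as preprocessing. OpenCV's equalizeHist expects an 8-bit grayscale image.

```python
import cv2
import numpy as np

def preprocess(gray_u8):
    """Histogram equalization, then flatten to a feature vector."""
    return cv2.equalizeHist(gray_u8).astype(np.float32).ravel()

def euclidean_distance(a, b):
    """Euclidean distance classifier score (smaller = more similar)."""
    return float(np.linalg.norm(a - b))

def normalized_correlation(a, b):
    """Normalized correlation classifier score (larger = more similar)."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
```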

Book ChapterDOI
27 Aug 2007
TL;DR: This work proposes to overcome the pose problem by automatically reconstructing a 3D face model from multiple non-frontal frames in a video, generating a frontal view from the derived 3D model, and using a commercial 2D face recognition engine to recognize the synthesized frontal view.
Abstract: Face recognition in video has gained wide attention due to its role in designing surveillance systems. One of the main advantages of video over still frames is that evidence accumulation over multiple frames can provide better face recognition performance. However, surveillance videos are generally of low resolution, containing faces mostly in non-frontal poses. Consequently, face recognition in video poses serious challenges to state-of-the-art face recognition systems. Use of 3D face models has been suggested as a way to compensate for low resolution, poor contrast and non-frontal pose. We propose to overcome the pose problem by automatically (i) reconstructing a 3D face model from multiple non-frontal frames in a video, (ii) generating a frontal view from the derived 3D model, and (iii) using a commercial 2D face recognition engine to recognize the synthesized frontal view. A factorization-based structure-from-motion algorithm is used for 3D face reconstruction. The proposed scheme has been tested on CMU's Face In Action (FIA) video database with 221 subjects. Experimental results show a 40% improvement in matching performance as a result of using the 3D models.

Patent
Kim Jung Bae, Haitao Wang
28 Nov 2007
TL;DR: In this paper, a template matching process is performed in a predetermined region having the object feature point candidates as the center of the template matching, which reduces the processing time needed for template matching by using a difference between pixel values of neighboring frames.
Abstract: A method and apparatus for tracking an object, and a method and apparatus for calculating object pose information, are provided. The method of tracking the object obtains object feature point candidates by using the difference between pixel values of neighboring frames. A template matching process is performed in a predetermined region centered on the object feature point candidates. Accordingly, it is possible to reduce the processing time needed for the template matching process. The method of tracking the object is robust to sudden changes in lighting and to partial occlusion, and the object can be tracked in real time. In addition, since the pose of the object, the pattern of the object, and the occlusion of the object are determined, detailed information on the action patterns of the object can be obtained in real time.
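
The speed-up the patent describes, restricting template matching to a window around frame-difference candidates, can be sketched with OpenCV. The single averaged candidate center, the thresholds, and the window size are illustrative simplifications.

```python
import cv2
import numpy as np

def track_point(prev_frame, cur_frame, template,
                search_radius=16, diff_thresh=25):
    """Find where neighboring grayscale (uint8) frames differ, then run
    template matching only in a window around the candidate region,
    instead of over the whole frame."""
    diff = cv2.absdiff(prev_frame, cur_frame)
    ys, xs = np.where(diff > diff_thresh)
    if len(ys) == 0:
        return None                      # no motion candidates
    cy, cx = int(ys.mean()), int(xs.mean())   # one candidate center
    th, tw = template.shape
    y0, x0 = max(cy - search_radius, 0), max(cx - search_radius, 0)
    window = cur_frame[y0:y0 + 2 * search_radius + th,
                       x0:x0 + 2 * search_radius + tw]
    if window.shape[0] < th or window.shape[1] < tw:
        return None                      # window too small to match
    res = cv2.matchTemplate(window, template, cv2.TM_CCOEFF_NORMED)
    _, _, _, max_loc = cv2.minMaxLoc(res)
    return (x0 + max_loc[0], y0 + max_loc[1])  # best match, top-left
```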

Proceedings ArticleDOI
10 Dec 2007
TL;DR: A design and implementation of a knowledge-based visual 3D object recognition system with multi-cue integration using a particle filter; the system can generate vision-guided humanoid behaviors without the behavior designer having to consider visual processing functions.
Abstract: A vision-based object recognition subsystem on a knowledge-based humanoid robot system is presented. A humanoid robot system for real-world service applications must integrate an object recognition subsystem and a motion planning subsystem in both mobility and manipulation tasks. These requirements involve a vision system capable of self-localization for navigation tasks and object recognition for manipulation tasks, while communicating with the motion planning subsystem. In this paper, we describe a design and implementation of a knowledge-based visual 3D object recognition system with multi-cue integration using a particle filter technique. The particle filter provides very robust object recognition performance, and the knowledge-based approach enables the robot to perform both object localization and self-localization with movable/fixed information. Since this object recognition subsystem shares knowledge with the motion planning subsystem, we are able to generate vision-guided humanoid behaviors without considering visual processing functions. Finally, in order to demonstrate the generality of the system, we performed several vision-based humanoid behavior experiments in a daily-life environment.
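
One predict-weight-resample cycle of a particle filter with multi-cue integration can be sketched as below. Multiplying independent cue likelihoods is one common integration choice and an assumption here, as is the Gaussian motion model; the robot's actual cues and knowledge base are not modeled.

```python
import numpy as np

def particle_filter_step(particles, weights, measure_cues, motion_std=2.0):
    """One cycle for object pose particles of shape (n, d).
    measure_cues: list of functions, each scoring a particle against
    one visual cue (e.g., color, edges, depth); cue likelihoods are
    combined by a plain product."""
    n = len(particles)
    particles = particles + np.random.normal(0, motion_std, particles.shape)
    for cue in measure_cues:
        weights = weights * np.array([cue(p) for p in particles])
    weights = weights / weights.sum()
    idx = np.random.choice(n, size=n, p=weights)   # resample by weight
    return particles[idx], np.full(n, 1.0 / n)

# toy usage: 100 particles over (x, y), one Gaussian 'cue' around (5, 5)
particles = np.zeros((100, 2))
weights = np.full(100, 0.01)
cue = lambda p: np.exp(-np.sum((p - 5.0) ** 2) / 8.0)
particles, weights = particle_filter_step(particles, weights, [cue])
```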

Proceedings ArticleDOI
17 Jun 2007
TL;DR: The semantic hierarchy algorithm starts by constructing a minimal feature hierarchy and proceeds by adding semantically equivalent representatives to each node, using the entire hierarchy as a context for determining the identity and locations of added features.
Abstract: This paper describes the construction and use of a novel representation for the recognition of objects and their parts, the semantic hierarchy. Its advantages include improved classification performance, accurate detection and localization of object parts and sub-parts, and explicitly identifying the different appearances of each object part. The semantic hierarchy algorithm starts by constructing a minimal feature hierarchy and proceeds by adding semantically equivalent representatives to each node, using the entire hierarchy as a context for determining the identity and locations of added features. Part detection is obtained by a bottom-up top-down cycle. Unlike previous approaches, the semantic hierarchy learns to represent the set of possible appearances of object parts at all levels, and their statistical dependencies. The algorithm is fully automatic and is shown experimentally to substantially improve the recognition of objects and their parts.

Patent
04 Oct 2007
TL;DR: In this article, an attention module, an object recognition module, and an online labeling module are configured to detect unknown objects from an image and alert a user if the extracted object is an unknown object so that it can be labeled.
Abstract: Described is a bio-inspired vision system for object recognition. The system comprises an attention module, an object recognition module, and an online labeling module. The attention module is configured to receive an image representing a scene and find and extract an object from the image. The attention module is also configured to generate feature vectors corresponding to color, intensity, and orientation information within the extracted object. The object recognition module is configured to receive the extracted object and the feature vectors and associate a label with the extracted object. Finally, the online labeling module is configured to alert a user if the extracted object is an unknown object so that it can be labeled.

Proceedings ArticleDOI
26 Dec 2007
TL;DR: This paper compares, on a novel data collection of 10 geometric object classes, various shape-based features with appearance-based descriptors such as SIFT; the analysis includes a direct comparison of feature statistics as well as results within standard recognition frameworks, which are partly intuitive but sometimes surprising.
Abstract: Recent work in object categorization often uses local image descriptors such as SIFT to learn and detect object categories. Such descriptors explicitly code local appearance and have shown impressive results on objects with sufficient local appearance statistics. However, many important object classes such as tools, cups and other man-made artifacts seem to require features that capture the respective shape and geometric layout of those object classes. Therefore this paper compares, on a novel data collection of 10 geometric object classes, various shape-based features with appearance-based descriptors such as SIFT. The analysis includes a direct comparison of feature statistics as well as results within standard recognition frameworks, which are partly intuitive, but sometimes surprising.

Proceedings ArticleDOI
17 Jun 2007
TL;DR: This work shows how to construct virtual training examples for multi-view recognition using a simple model of objects (nearly planar facades centered at fixed 3D positions) and shows how the models can be learned from a few labeled images for each class.
Abstract: Our goal is to circumvent one of the roadblocks to using existing approaches for single-view recognition for achieving multi-view recognition, namely, the need for sufficient training data for many viewpoints. We show how to construct virtual training examples for multi-view recognition using a simple model of objects (nearly planar facades centered at fixed 3D positions). We also show how the models can be learned from a few labeled images for each class.
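
Under the paper's nearly-planar-facade model, virtual training examples amount to warping a labeled view with a homography. Below is a sketch for a pure camera rotation about the y-axis; the focal length and the rotation-only model are illustrative assumptions, not the paper's exact construction.

```python
import cv2
import numpy as np

def virtual_view(img, yaw_deg, f=500.0):
    """Synthesize a new viewpoint of a roughly planar object facade via
    the homography H = K R K^-1 induced by rotating the camera by
    yaw_deg about the y-axis."""
    h, w = img.shape[:2]
    K = np.array([[f, 0, w / 2], [0, f, h / 2], [0, 0, 1]])
    t = np.deg2rad(yaw_deg)
    R = np.array([[np.cos(t), 0, np.sin(t)],
                  [0, 1, 0],
                  [-np.sin(t), 0, np.cos(t)]])
    H = K @ R @ np.linalg.inv(K)
    return cv2.warpPerspective(img, H, (w, h))

# toy usage: generate virtual training views at several yaw angles
img = np.zeros((240, 320, 3), dtype=np.uint8)
views = [virtual_view(img, a) for a in (-30, -15, 15, 30)]
```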

Proceedings ArticleDOI
Donghyun Kim, Kwanho Kim, Joo-Young Kim, Seungjin Lee, Hoi-Jun Yoo
16 Sep 2007
TL;DR: An 81.6 GOPS object recognition processor is developed using a NoC and a visual image processing (VIP) memory; it achieves 15.9 fps SIFT feature extraction at 200 MHz.
Abstract: An 81.6 GOPS object recognition processor is developed by using NoC and visual image processing (VIP) memory. SIFT (scale invariant feature transform) object recognition requires huge computing power and data transactions among tasks. The chip integrates 10 SIMD PEs for data/task level parallelism while the NoC facilitates inter-PE communications. The VIP memory searches for the local maximum pixel inside a 3×3 window in a single cycle, providing 65.6 GOPS. The proposed processor achieves 15.9 fps SIFT feature extraction at 200 MHz.
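
What the VIP memory computes in a single hardware cycle is, in software terms, a 3×3 local-maximum test over a response map (the operation SIFT uses to find scale-space extrema). A direct, unoptimized sketch:

```python
import numpy as np

def local_maxima_3x3(resp):
    """Mark pixels that are the strict maximum of their 3x3
    neighborhood; a software equivalent of the VIP memory's
    single-cycle local-maximum search."""
    h, w = resp.shape
    out = np.zeros_like(resp, dtype=bool)
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            win = resp[i - 1:i + 2, j - 1:j + 2]
            center = resp[i, j]
            # strict maximum: equal to the window max and unique in it
            out[i, j] = center == win.max() and (win == center).sum() == 1
    return out

# toy usage on a random response map
print(local_maxima_3x3(np.random.rand(16, 16)).sum(), "keypoint candidates")
```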