
Showing papers by "Matthew Turk published in 2011"


Journal ArticleDOI
TL;DR: This work presents a carefully designed dataset of video sequences of planar textures with ground truth, which includes various geometric changes, lighting conditions, and levels of motion blur, and presents a comprehensive quantitative evaluation of detector-descriptor-based visual camera tracking based on this testbed.
Abstract: Applications for real-time visual tracking can be found in many areas, including visual odometry and augmented reality. Interest point detection and feature description form the basis of feature-based tracking, and a variety of algorithms for these tasks have been proposed. In this work, we present (1) a carefully designed dataset of video sequences of planar textures with ground truth, which includes various geometric changes, lighting conditions, and levels of motion blur, and which may serve as a testbed for a variety of tracking-related problems, and (2) a comprehensive quantitative evaluation of detector-descriptor-based visual camera tracking based on this testbed. We evaluate the impact of individual algorithm parameters, compare algorithms for both detection and description in isolation, as well as all detector-descriptor combinations as a tracking solution. In contrast to existing evaluations, which aim at different tasks such as object recognition and have limited validity for visual tracking, our evaluation is geared towards this application in all relevant factors (performance measures, testbed, candidate algorithms). To our knowledge, this is the first work that comprehensively compares these algorithms in this context, and in particular, on video streams.
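Since the evaluation pairs every interest point detector with every descriptor and scores each pair as a tracking solution, the core scoring step can be sketched roughly as below. This is a minimal illustration using OpenCV, not the paper's evaluation code; the frame pair, the ground-truth homography H_gt, and the reprojection threshold are assumptions of the sketch.

```python
# Sketch: score one detector-descriptor combination on one frame pair by the
# fraction of matches that agree with a known ground-truth homography H_gt.
import cv2
import numpy as np

def score_combination(img_a, img_b, H_gt, detector, extractor, reproj_thresh=3.0):
    kp_a = detector.detect(img_a, None)
    kp_b = detector.detect(img_b, None)
    kp_a, des_a = extractor.compute(img_a, kp_a)
    kp_b, des_b = extractor.compute(img_b, kp_b)
    if des_a is None or des_b is None:
        return 0.0
    norm = cv2.NORM_HAMMING if des_a.dtype == np.uint8 else cv2.NORM_L2
    matches = cv2.BFMatcher(norm, crossCheck=True).match(des_a, des_b)
    if not matches:
        return 0.0
    pts_a = np.float32([kp_a[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in matches])
    proj = cv2.perspectiveTransform(pts_a, np.float64(H_gt)).reshape(-1, 2)
    errors = np.linalg.norm(proj - pts_b, axis=1)
    return float(np.mean(errors < reproj_thresh))  # fraction of geometrically consistent matches
```

For example, score_combination(f0, f1, H_gt, cv2.ORB_create(), cv2.ORB_create()) scores ORB as both detector and descriptor; looping over all detector and extractor objects yields the full combination matrix.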

441 citations


Proceedings ArticleDOI
05 Jan 2011
TL;DR: A mobile augmented reality (AR) translation system that requires the user to simply tap on the word of interest once in order to produce a translation, presented as an AR overlay, and offers a particularly easy-to-use and simple method for translation.
Abstract: We present a mobile augmented reality (AR) translation system, using a smartphone's camera and touchscreen, that requires the user to simply tap on the word of interest once in order to produce a translation, presented as an AR overlay. The translation seamlessly replaces the original text in the live camera stream, matching background and foreground colors estimated from the source images. For this purpose, we developed an efficient algorithm for accurately detecting the location and orientation of the text in a live camera stream that is robust to perspective distortion, and we combine it with OCR and a text-to-text translation engine. Our experimental results, using the ICDAR 2003 dataset and our own set of video sequences, quantify the accuracy of our detection and analyze the sources of failure among the system's components. With the OCR and translation running in a background thread, the system runs at 26 fps on a current generation smartphone (Nokia N900) and offers a particularly easy-to-use and simple method for translation, especially in situations in which typing or correct pronunciation (for systems with speech input) is cumbersome or impossible.
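The tap-to-translate data flow (detect the tapped word, run OCR, translate, then repaint the region with matched colors) could be sketched as follows. This is only an illustration: detect_word_box and translate are hypothetical stand-ins for the paper's text detector and translation engine, pytesseract replaces the OCR engine used on the phone, and dark text on a light background is assumed for the color estimate.

```python
# Sketch of the tap-to-translate pipeline on a single BGR frame.
import cv2
import numpy as np
import pytesseract

def tap_to_translate(frame_bgr, tap_xy, detect_word_box, translate):
    x, y, w, h = detect_word_box(frame_bgr, tap_xy)        # word region around the tap (hypothetical)
    roi = frame_bgr[y:y + h, x:x + w]
    word = pytesseract.image_to_string(roi, config="--psm 8").strip()  # single-word OCR
    replacement = translate(word)                           # hypothetical translation back end
    # Estimate background/foreground colors (assumes dark text on a light background).
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    bg = roi[mask == 255].mean(axis=0) if (mask == 255).any() else roi.mean(axis=(0, 1))
    fg = roi[mask == 0].mean(axis=0) if (mask == 0).any() else (0, 0, 0)
    # Repaint the word region and draw the translation in the estimated foreground color.
    out = frame_bgr.copy()
    out[y:y + h, x:x + w] = bg
    cv2.putText(out, replacement, (x, y + h - 2), cv2.FONT_HERSHEY_SIMPLEX,
                h / 30.0, tuple(int(c) for c in fg), 2, cv2.LINE_AA)
    return out
```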

104 citations


Journal ArticleDOI
TL;DR: The motivations for organizing this special section were to better address the challenges of face recognition in real-world scenarios, to promote systematic research and evaluation of promising methods and systems, to provide a snapshot of where the authors are in this domain, and to stimulate discussion about future directions.
Abstract: The motivations for organizing this special section were to better address the challenges of face recognition in real-world scenarios, to promote systematic research and evaluation of promising methods and systems, to provide a snapshot of where we are in this domain, and to stimulate discussion about future directions. We solicited original contributions of research on all aspects of real-world face recognition, including: the design of robust face similarity features and metrics; robust face clustering and sorting algorithms; novel user interaction models and face recognition algorithms for face tagging; novel applications of web face recognition; novel computational paradigms for face recognition; challenges in large scale face recognition tasks, e.g., on the Internet; face recognition with contextual information; face recognition benchmarks and evaluation methodology for moderately controlled or uncontrolled environments; and video face recognition. We received 42 original submissions, four of which were rejected without review; the other 38 papers entered the normal review process. Each paper was reviewed by three reviewers who are experts in their respective topics. More than 100 expert reviewers have been involved in the review process. The papers were equally distributed among the guest editors. A final decision for each paper was made by at least two guest editors assigned to it. To avoid conflict of interest, no guest editor submitted any papers to this special section.

69 citations


Proceedings ArticleDOI
05 Jan 2011
TL;DR: This work demonstrates a recognition application, based upon the SURF feature descriptor algorithm, which fuses bag-of-words and structural verification techniques and achieves accurate (> 90%) and real-time performance when searching databases containing thousands of images.
Abstract: Recent advances in computer vision have significantly reduced the difficulty of object classification and recognition. Robust feature detector and descriptor algorithms are particularly useful, forming the basis for many recognition and classification applications. These algorithms have been used in divergent bag-of-words and structural matching approaches. This work demonstrates a recognition application, based upon the SURF feature descriptor algorithm, which fuses bag-of-words and structural verification techniques. The resulting system is applied to the domain of car recognition and achieves accurate (> 90%) and real-time performance when searching databases containing thousands of images.
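The fusion described above, bag-of-words retrieval followed by structural (geometric) verification, can be sketched as below. This is not the paper's system: ORB features stand in for SURF (which requires the opencv-contrib build), and the visual vocabulary is assumed to have been built offline, e.g. by clustering descriptors.

```python
# Sketch: visual-word histogram for retrieval, plus RANSAC homography inlier
# count for structural verification of the top-ranked candidates.
import cv2
import numpy as np

orb = cv2.ORB_create(1000)
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def bow_histogram(descriptors, vocabulary):
    # Assign each descriptor to its nearest visual word (vocabulary: k x 32 uint8).
    dists = np.array([[cv2.norm(d, w, cv2.NORM_HAMMING) for w in vocabulary]
                      for d in descriptors])
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(np.float32)
    return hist / (np.linalg.norm(hist) + 1e-9)

def verify(kp_q, des_q, kp_d, des_d):
    # Structural verification: count RANSAC homography inliers between query and candidate.
    matches = bf.match(des_q, des_d)
    if len(matches) < 8:
        return 0
    src = np.float32([kp_q[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_d[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    _, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return int(inliers.sum()) if inliers is not None else 0
```

A query would first rank database images by similarity of their normalized histograms (e.g. dot product) and then re-rank the top few candidates by the inlier count from verify.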

62 citations


Proceedings ArticleDOI
01 Nov 2011
TL;DR: A fast automatic text detection algorithm, devised for a mobile augmented reality (AR) translation system on a mobile phone, is presented, along with a method that exploits the redundancy of the information contained in the video stream to remove false alarms.
Abstract: We present a fast automatic text detection algorithm devised for a mobile augmented reality (AR) translation system on a mobile phone. In this application, scene text must be detected, recognized, and translated into a desired language, and then the translation is displayed overlaid properly on the real-world scene. In order to offer a fast automatic text detector, we focus our initial search on finding a single letter. Detecting one letter provides useful information that is processed with efficient rules to quickly find the remainder of a word. This approach allows for detecting all the contiguous text regions in an image quickly. We also present a method that exploits the redundancy of the information contained in the video stream to remove false alarms. Our experimental results quantify the accuracy and efficiency of the algorithm and show the strengths and weaknesses of the method as well as its speed (about 160 ms on a recent-generation smartphone, not optimized). The algorithm is well suited for real-time, real-world applications.
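Two of the ingredients described above, growing a word region outward from a single detected letter and filtering detections that do not persist across frames, might look roughly like this. The letter-candidate boxes are assumed to come from a connected-component analysis of a binarized frame; the merging rule and overlap threshold are illustrative, not the paper's.

```python
# Sketch: extend a single seed letter into a word box, and keep only detections
# that reappear in the previous frame (temporal-redundancy false-alarm filter).
def grow_word(seed, boxes):
    # Merge candidate letter boxes of similar height lying on roughly the same baseline.
    sx, sy, sw, sh = seed
    x1, y1, x2, y2 = sx, sy, sx + sw, sy + sh
    for bx, by, bw, bh in boxes:
        same_baseline = abs((by + bh) - (sy + sh)) < 0.4 * sh
        similar_height = 0.5 * sh < bh < 1.5 * sh
        if same_baseline and similar_height:
            x1, y1 = min(x1, bx), min(y1, by)
            x2, y2 = max(x2, bx + bw), max(y2, by + bh)
    return x1, y1, x2 - x1, y2 - y1

def temporally_stable(box, previous_boxes, min_iou=0.5):
    # A detection is kept only if a well-overlapping box was also found in the previous frame.
    x, y, w, h = box
    for px, py, pw, ph in previous_boxes:
        ix = max(0, min(x + w, px + pw) - max(x, px))
        iy = max(0, min(y + h, py + ph) - max(y, py))
        inter, union = ix * iy, w * h + pw * ph - ix * iy
        if union and inter / union >= min_iou:
            return True
    return False
```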

39 citations


Proceedings ArticleDOI
05 Jan 2011
TL;DR: This work proposes a method to automatically select a minimal set of images, focused at different depths, such that all objects in a given scene are in focus in at least one image, and aims to minimize both the amount of time spent metering the scene and capturing the images, and the total amount of high-resolution data that is captured.
Abstract: All-in-focus imaging is a computational photography technique that produces images free of defocus blur by capturing a stack of images focused at different distances and merging them into a single sharp result. Current approaches assume that images have been captured offline, and that a reasonably powerful computer is available to process them. In contrast, we focus on the problem of how to capture such input stacks in an efficient and scene-adaptive fashion. Inspired by passive autofocus techniques, which select a single best plane of focus in the scene, we propose a method to automatically select a minimal set of images, focused at different depths, such that all objects in a given scene are in focus in at least one image. We aim to minimize both the amount of time spent metering the scene and capturing the images, and the total amount of high-resolution data that is captured. The algorithm first analyzes a set of low-resolution sharpness measurements of the scene while continuously varying the focus distance of the lens. From these measurements, we estimate the final lens positions required to capture all objects in the scene in acceptable focus. We demonstrate the use of our technique in a mobile computational photography scenario, where it is essential to minimize image capture time (as the camera is typically handheld) and processing time (as the computation and energy resources are limited).
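Selecting a minimal set of lens positions from the low-resolution sharpness measurements is essentially a set-cover problem; a greedy sketch is shown below. The sharpness-matrix layout and the "acceptably sharp" criterion (a fixed fraction of each region's best sharpness over the sweep) are assumptions of this illustration, not the paper's exact formulation.

```python
# Sketch: sharpness[i, j] is a low-resolution sharpness measure for scene region i
# at lens position j, gathered during a coarse focus sweep. Greedy set cover picks
# the fewest lens positions so every region is acceptably sharp in at least one image.
import numpy as np

def select_focus_positions(sharpness, accept_ratio=0.8):
    # A region is "covered" by a lens position if its sharpness there is close
    # to that region's best sharpness over the whole sweep.
    acceptable = sharpness >= accept_ratio * sharpness.max(axis=1, keepdims=True)
    uncovered = np.ones(sharpness.shape[0], dtype=bool)
    chosen = []
    while uncovered.any():
        gains = (acceptable & uncovered[:, None]).sum(axis=0)
        j = int(gains.argmax())
        if gains[j] == 0:
            break  # remaining regions have no acceptable focus position
        chosen.append(j)
        uncovered &= ~acceptable[:, j]
    return chosen

# Toy example: 4 scene regions measured at 6 lens positions.
sharp = np.array([[1, 5, 9, 4, 2, 1],
                  [1, 2, 3, 8, 9, 3],
                  [9, 7, 3, 2, 1, 1],
                  [1, 3, 8, 9, 4, 2]], dtype=float)
print(select_focus_positions(sharp))  # -> [2, 0, 3]: three captures cover all regions
```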

33 citations


Proceedings ArticleDOI
29 Dec 2011
TL;DR: An algorithm dubbed Suppression via Disk Covering (SDC) is described to efficiently select a set of strong, spatially distributed keypoints, and it is shown that selecting keypoints in this way significantly improves visual tracking.
Abstract: We describe an algorithm dubbed Suppression via Disk Covering (SDC) to efficiently select a set of strong, spatially distributed keypoints, and we show that selecting keypoints in this way significantly improves visual tracking. We also describe two efficient implementation schemes for the popular Adaptive Non-Maximal Suppression algorithm, and show empirically that SDC is significantly faster while providing the same improvements with respect to tracking robustness. In our particular application, using SDC to filter the output of an inexpensive (but, by itself, less reliable) keypoint detector (FAST) results in higher tracking robustness at significantly lower total cost than using a computationally more expensive detector.
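A sketch in the spirit of the disk-covering idea is shown below: keypoints are visited in decreasing order of detector response and accepted only if no already-accepted keypoint lies within a radius r, with a coarse grid keeping each test cheap. This illustrates the general strategy, not the paper's exact SDC algorithm or its ANMS implementation schemes.

```python
# Sketch: greedy selection of strong, spatially distributed keypoints using a
# hash grid so each suppression test only touches nearby cells.
import numpy as np

def spatially_distributed(points, responses, r):
    order = np.argsort(-np.asarray(responses))   # strongest keypoints first
    cell = r / np.sqrt(2.0)                      # grid cell size tied to the suppression radius
    grid, kept = {}, []
    for i in order:
        x, y = points[i]
        cx, cy = int(x // cell), int(y // cell)
        ok = True
        for gx in range(cx - 2, cx + 3):
            for gy in range(cy - 2, cy + 3):
                for (px, py) in grid.get((gx, gy), []):
                    if (px - x) ** 2 + (py - y) ** 2 < r * r:
                        ok = False
                        break
        if ok:
            kept.append(i)
            grid.setdefault((cx, cy), []).append((x, y))
    return kept  # indices of selected keypoints, strongest first
```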

26 citations


Proceedings ArticleDOI
01 Jan 2011
TL;DR: This paper reviews several existing algorithms for orientation assignment and proposes two novel, efficient methods; the second supports multiple orientations and performs comparably to SIFT's orientation assignment while being significantly cheaper.
Abstract: Detection and description of local image features has proven to be a powerful paradigm for a variety of applications in computer vision. Often, this process includes an orientation assignment step to render the overall process invariant to in-plane rotation. In this paper, we review several different existing algorithms and propose two novel, efficient methods for orientation assignment. The first method exhibits a very good speed-performance trade-off; the second is capable of multiple orientations and performs comparably to SIFT's orientation assignment while being significantly cheaper. Additionally, we improve one of the existing orientation assignment methods by generalizing it. All algorithms are evaluated empirically under a variety of conditions and in combination with six keypoint detectors.
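As a point of reference for what orientation assignment computes, below is a standard SIFT-style gradient-orientation histogram with multiple peaks; the methods proposed in the paper are cheaper alternatives to this kind of computation, and the smoothing and peak-ratio details here are generic choices rather than the paper's.

```python
# Sketch: assign one or more dominant orientations to a keypoint's image patch
# via a magnitude-weighted histogram of gradient orientations.
import numpy as np

def assign_orientations(patch, num_bins=36, peak_ratio=0.8):
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)
    hist, _ = np.histogram(ang, bins=num_bins, range=(0, 2 * np.pi), weights=mag)
    # Circular smoothing of the histogram.
    hist = np.convolve(np.r_[hist[-1], hist, hist[0]], [0.25, 0.5, 0.25], "valid")
    # Keep every local maximum within peak_ratio of the global maximum (multiple orientations).
    peaks = [b for b in range(num_bins)
             if hist[b] >= peak_ratio * hist.max()
             and hist[b] > hist[b - 1] and hist[b] > hist[(b + 1) % num_bins]]
    return [(b + 0.5) * 2 * np.pi / num_bins for b in peaks]  # orientations in radians
```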

22 citations


Proceedings ArticleDOI
05 Jan 2011
TL;DR: The results show that by incorporating the inertial sensors the authors can considerably speed up the process of detecting and matching keypoints between two images, which is the most time-consuming step of the pose estimation.
Abstract: We present a multisensory method for estimating the transformation of a mobile phone between two images taken from its camera. Pose estimation is a necessary step for applications such as 3D reconstruction and panorama construction, but detecting and matching robust features can be computationally expensive. In this paper we propose a method for combining the inertial sensors (accelerometers and gyroscopes) of a mobile phone with its camera to provide fast and accurate pose estimation. We use the inertial-based pose to warp the two images into the same perspective frame. We then employ an adaptive FAST feature detector and image patches, normalized with respect to illumination, as feature descriptors. After the warping, the images are approximately aligned with each other, so the search for matching keypoints becomes faster and, in certain cases, more reliable. Our results show that by incorporating the inertial sensors we can considerably speed up the process of detecting and matching keypoints between two images, which is the most time-consuming step of the pose estimation.
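The inertial pre-warp step can be sketched as follows: for rotation-dominant motion, the gyroscope-integrated rotation R induces the homography K R K^-1 (with K the camera intrinsics), which warps the second image into the first image's perspective before feature matching. The exact sign convention of R and the patch-descriptor details below are assumptions of the sketch, not the paper's implementation.

```python
# Sketch: inertial pre-warp plus an illumination-normalized patch descriptor.
import cv2
import numpy as np

def inertial_prewarp(img_b, K, R):
    # R: rotation (from integrated gyroscope readings) taking camera-b coordinates
    # to camera-a coordinates; for pure rotation the induced homography is K R K^-1.
    H = K @ R @ np.linalg.inv(K)
    return cv2.warpPerspective(img_b, H, (img_b.shape[1], img_b.shape[0]))

def normalized_patch(gray, kp, size=8):
    # kp: cv2.KeyPoint (e.g. from cv2.FastFeatureDetector_create().detect(gray)).
    x, y = int(round(kp.pt[0])), int(round(kp.pt[1]))
    p = gray[y - size:y + size, x - size:x + size].astype(np.float32)
    if p.shape != (2 * size, 2 * size):
        return None                              # keypoint too close to the border
    return (p - p.mean()) / (p.std() + 1e-6)     # illumination-normalized descriptor
```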

15 citations


Proceedings ArticleDOI
06 Nov 2011
TL;DR: This work proposes a method to separate multiple illuminants from a single image using a distinct sinusoidal pattern, strategically selected given the relative position of each light with respect to the camera, such that the observed sinusoids become independent of the scene geometry.
Abstract: A class of techniques in computer vision and graphics is based on capturing multiple images of a scene under different illumination conditions. These techniques explore variations in illumination from image to image to extract interesting information about the scene. However, their applicability to dynamic environments is limited due to the need for robust motion compensation algorithms. To overcome this issue, we propose a method to separate multiple illuminants from a single image. Given an image of a scene simultaneously illuminated by multiple light sources, our method generates individual images as if they had been illuminated by each of the light sources separately. To facilitate the illumination separation process, we encode each light source with a distinct sinusoidal pattern, strategically selected given the relative position of each light with respect to the camera, such that the observed sinusoids become independent of the scene geometry. The individual illuminants are then demultiplexed by analyzing local frequencies. We show applications of our approach in image-based relighting, photometric stereo, and multiflash imaging.
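Setting aside the paper's geometry-aware selection of the patterns, the demultiplexing-by-local-frequency idea can be illustrated with a simple Fourier-domain sketch: each light source is assumed to project a sinusoid at a known, distinct spatial frequency, and each illuminant's image is recovered as the envelope of a single-sideband band-pass filtered copy of the one input image. The carrier frequencies and bandwidth below are assumptions of the sketch.

```python
# Sketch: separate illuminants encoded as distinct spatial sinusoids by
# band-pass filtering around each carrier frequency and taking the envelope.
import numpy as np

def demultiplex(image, carrier_freqs, bandwidth=0.02):
    # image: 2D float (grayscale); carrier_freqs: list of (fy, fx) in cycles/pixel.
    h, w = image.shape
    F = np.fft.fftshift(np.fft.fft2(image))
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    layers = []
    for cy, cx in carrier_freqs:
        # Keep only the positive sideband so the inverse transform is an analytic signal.
        mask = (fy - cy) ** 2 + (fx - cx) ** 2 < bandwidth ** 2
        band = np.fft.ifft2(np.fft.ifftshift(F * mask))
        layers.append(2.0 * np.abs(band))  # envelope ~ that illuminant's contribution
    return layers
```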

5 citations


Proceedings ArticleDOI
07 Aug 2011
TL;DR: This study attempts to characterize the actions performed by users while framing photos using a point-and-shoot camera, in preparation for taking a photograph, including adjusting the camera's orientation and point of view and triggering zoom and autofocus controls.
Abstract: With the recent popularization of digital cameras and cameraphones, everyone is now a photographer, and the devices provide new opportunities for improving the process and final results. While there has been research on what kinds of subjects users prefer to photograph, and what they do with the images once they are captured [Van House et al. 2005], no formal studies on the process of framing an image using a camera have been performed. To fill this gap, our study attempts to characterize the actions performed by users while framing photos using a point-and-shoot camera, in preparation for taking a photograph. This includes adjusting the camera's orientation and point of view and triggering zoom and autofocus controls.