Author

Matthew Turk

Bio: Matthew Turk is an academic researcher at the Toyota Technological Institute at Chicago. He has contributed to research on augmented reality and facial recognition systems, has an h-index of 55, and has co-authored 198 publications receiving 30,972 citations. His previous affiliations include the Massachusetts Institute of Technology and the University of California.


Papers
01 Jan 2004
TL;DR: This dissertation's main contribution is “HandVu,” a computer vision system that recognizes hand gestures in real time; the work demonstrates the feasibility of computer vision as the sole input modality to a wearable computer, providing “deviceless” interaction capabilities.
Abstract: Current user interfaces are unsuited to harness the full power of computers. Mobile devices like cell phones and technologies such as virtual reality demand a richer set of interaction modalities to overcome situational constraints and to fully leverage human expressiveness. Hand gesture recognition lets humans use their most versatile instrument—their hands—in more natural and effective ways than currently possible. While most gesture recognition gear is cumbersome and expensive, gesture recognition with computer vision is non-invasive and more flexible. Yet, it faces difficulties due to the hand's complexity, lighting conditions, background artifacts, and user differences. The contributions of this dissertation have helped to make computer vision a viable technology to implement hand gesture recognition for user interface purposes. To begin with, we investigated arm postures in front of the human body in order to avoid anthropometrically unfavorable gestures and to establish a “comfort zone” in which humans prefer to operate their hands. The dissertation's main contribution is “HandVu,” a computer vision system that recognizes hand gestures in real time. To achieve this, it was necessary to advance the reliability of hand detection to allow for robust system initialization in most environments and lighting conditions. After initialization, a “Flock of Features” exploits optical flow and color information to track the hand's location despite rapid movements and concurrent finger articulations. Lastly, robust appearance-based recognition of key hand configurations completes HandVu and facilitates input of discrete commands to applications. We demonstrate the feasibility of computer vision as the sole input modality to a wearable computer, providing “deviceless” interaction capabilities. We also present new and improved interaction techniques in the context of a multimodal interface to a mobile augmented reality system. HandVu allows us to exploit hand gesture capabilities that have previously been untapped, for example, in areas where data gloves are not a viable option. This dissertation's goal is to contribute to the mosaic of available interface modalities and to widen the human-computer interface channel. Leveraging more of our expressiveness and our physical abilities offers new and advantageous ways to communicate with machines.
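The “Flock of Features” tracker described above combines pyramidal optical flow with a color cue and two flocking constraints: each feature must stay near the flock's median while keeping a minimum distance from every other feature. A minimal Python/OpenCV sketch of that idea follows; the distance thresholds and the skin-mask relocation step are illustrative assumptions, not the HandVu implementation itself.

import numpy as np
import cv2

MIN_DIST = 8.0    # flocking rule: features must not crowd each other (assumed value)
MAX_DIST = 60.0   # flocking rule: features must stay near the median (assumed value)

def track_flock(prev_gray, gray, points, skin_mask):
    # points: float32 array of shape (N, 1, 2). Pyramidal Lucas-Kanade
    # optical flow moves each feature independently.
    new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, points, None)
    new_pts = new_pts.reshape(-1, 2)
    median = np.median(new_pts, axis=0)
    for i, p in enumerate(new_pts):
        too_far = np.linalg.norm(p - median) > MAX_DIST
        too_close = any(np.linalg.norm(p - q) < MIN_DIST
                        for j, q in enumerate(new_pts) if j != i)
        if too_far or too_close or not status[i]:
            # Relocate a misbehaving feature onto a skin-colored pixel, so the
            # flock stays on the hand despite rapid movement and articulation.
            ys, xs = np.nonzero(skin_mask)
            if len(xs):
                k = np.random.randint(len(xs))
                new_pts[i] = (xs[k], ys[k])
    return new_pts.reshape(-1, 1, 2).astype(np.float32), median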

42 citations

Patent
03 Nov 2014
TL;DR: In this paper, the authors present a method for environment mapping with automatic motion model selection: a video frame captured by a camera device is received into memory, the type of motion from a previously received frame to the new frame is estimated, and frames are grouped into keyframe groups by motion type.
Abstract: Various embodiments each include at least one of systems, methods, devices, and software for environment mapping with automatic motion model selection. One embodiment, in the form of a method, includes receiving a video frame captured by a camera device into memory and estimating the type of motion from a previously received video frame held in memory to the received video frame. When the type of motion is the same as the motion type of the current keyframe group held in memory, the method adds the received video frame to the current keyframe group. Conversely, when the type of motion is not the same as the motion type of the current keyframe group, the method creates a new keyframe group in memory and adds the received video frame to it.
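The grouping logic claimed above reduces to a simple per-frame comparison. A minimal sketch, where the motion-type labels and the estimate_motion_type callback are assumed names for illustration:

from dataclasses import dataclass, field

@dataclass
class KeyframeGroup:
    motion_type: str   # e.g. "rotation" vs. "translation" (assumed labels)
    frames: list = field(default_factory=list)

groups: list[KeyframeGroup] = []

def on_new_frame(frame, prev_frame, estimate_motion_type):
    # estimate_motion_type(prev, cur) -> motion-type label (assumed API).
    motion = estimate_motion_type(prev_frame, frame)
    if groups and groups[-1].motion_type == motion:
        groups[-1].frames.append(frame)                 # same motion: extend group
    else:
        groups.append(KeyframeGroup(motion, [frame]))   # motion changed: new group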

42 citations

Proceedings ArticleDOI
01 Jan 2003
TL;DR: A new method for face alignment called active wavelet networks (AWN) is proposed; it replaces the AAM texture model with a wavelet network representation and, because the wavelets are spatially localized, shows more robustness against partial occlusions and some illumination changes.
Abstract: The active appearance model (AAM) algorithm has proved to be a successful method for face alignment and synthesis. By elegantly combining both shape and texture models, AAM allows fast and robust deformable image matching. However, the method is sensitive to partial occlusions and illumination changes. In such cases, the PCA-based texture model causes the reconstruction error to be globally spread over the image. In this paper, we propose a new method for face alignment called active wavelet networks (AWN), which replaces the AAM texture model by a wavelet network representation. Since we consider spatially localized wavelets for modeling texture, our method shows more robustness against partial occlusions and some illumination changes.
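To see why spatially localized wavelets keep reconstruction error local, consider modeling texture as a least-squares weighted sum of Gabor-like functions, each confined to a small image region. The sketch below is illustrative only; AWN's wavelet family, placement, and fitting procedure are more elaborate.

import numpy as np

def gabor(size, cx, cy, theta, freq, sigma):
    # One spatially localized wavelet: a Gaussian-windowed cosine grating.
    y, x = np.mgrid[0:size, 0:size].astype(float)
    xr = (x - cx) * np.cos(theta) + (y - cy) * np.sin(theta)
    env = np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
    return env * np.cos(2 * np.pi * freq * xr)

def fit_wavelet_net(image, wavelets):
    # Least-squares weights for a fixed set of localized basis functions.
    # An occluded region only perturbs the wavelets overlapping it, unlike
    # PCA eigenvectors, whose support spans the whole image.
    A = np.stack([w.ravel() for w in wavelets], axis=1)
    weights, *_ = np.linalg.lstsq(A, image.ravel(), rcond=None)
    return weights, (A @ weights).reshape(image.shape)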

41 citations

Book ChapterDOI
01 Jan 2005
TL;DR: An overview of a family of research projects at Carnegie Mellon and Karlsruhe University that aim to overcome human-computer communication barriers by exploiting the complementary nature of alternate input modalities in interpreting user intent in a user interface.
Abstract: While human-to-human communication takes advantage of an abundance of information and cues, human-computer interaction is limited to only a few input modalities (usually only keyboard and mouse) and provides little flexibility as to choice of communication modality. In this paper, we present an overview of a family of research projects we are undertaking at Carnegie Mellon and Karlsruhe University to overcome some of these human-computer communication barriers. Multimodal interfaces are to include not only typing, but speech, lip-reading, eye-tracking, face recognition and tracking, and gesture and handwriting recognition. Initial experiments aimed at exploiting the complementary nature of these alternate modalities in interpreting user intent in a user interface are discussed.

40 citations

Proceedings ArticleDOI
01 Nov 2011
TL;DR: A fast automatic text detection algorithm devised for a mobile augmented reality (AR) translation system on a mobile phone is presented, along with a method that exploits the redundancy of the information contained in the video stream to remove false alarms.
Abstract: We present a fast automatic text detection algorithm devised for a mobile augmented reality (AR) translation system on a mobile phone. In this application, scene text must be detected, recognized, and translated into a desired language, and the translation is then displayed, properly overlaid on the real-world scene. In order to offer a fast automatic text detector, we focused our initial search on finding a single letter. Detecting one letter provides useful information that is processed with efficient rules to quickly find the remainder of a word. This approach allows all the contiguous text regions in an image to be detected quickly. We also present a method that exploits the redundancy of the information contained in the video stream to remove false alarms. Our experimental results quantify the accuracy and efficiency of the algorithm and show the strengths and weaknesses of the method as well as its speed (about 160 ms on a recent-generation smartphone, not optimized). The algorithm is well suited for real-time, real-world applications.
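The seed-then-expand strategy and the temporal filtering can be sketched as follows. MSER stands in for the paper's single-letter detector, and the grouping thresholds and IoU test are assumed values for illustration, not the published algorithm.

import cv2
import numpy as np

def detect_words(gray):
    # Candidate letter regions via MSER (a stand-in for the paper's letter detector).
    mser = cv2.MSER_create()
    regions, _ = mser.detectRegions(gray)
    boxes = [cv2.boundingRect(r) for r in regions]
    words = []
    for x, y, w, h in boxes:
        # Expansion rule: gather boxes of similar height on roughly the same
        # baseline and within a horizontal range (thresholds assumed).
        word = [b for b in boxes
                if abs(b[3] - h) < 0.4 * h and abs(b[1] - y) < 0.5 * h
                and abs(b[0] - x) < 10 * w]
        words.append(word)
    return words

def overlaps(a, b, thresh=0.5):
    # Simple IoU test between two (x, y, w, h) boxes.
    ix = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return union > 0 and inter / union >= thresh

def temporally_stable(history, box, min_frames=3):
    # False-alarm removal: keep a detection only if it recurs in recent frames.
    return sum(any(overlaps(box, b) for b in frame) for frame in history) >= min_frames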

39 citations


Cited by
Journal ArticleDOI
22 Dec 2000-Science
TL;DR: An approach to solving dimensionality reduction problems that uses easily measured local metric information to learn the underlying global geometry of a data set; it efficiently computes a globally optimal solution and is guaranteed to converge asymptotically to the true structure.
Abstract: Scientists working with large volumes of high-dimensional data, such as global climate patterns, stellar spectra, or human gene distributions, regularly confront the problem of dimensionality reduction: finding meaningful low-dimensional structures hidden in their high-dimensional observations. The human brain confronts the same problem in everyday perception, extracting from its high-dimensional sensory inputs (30,000 auditory nerve fibers or 10^6 optic nerve fibers) a manageably small number of perceptually relevant features. Here we describe an approach to solving dimensionality reduction problems that uses easily measured local metric information to learn the underlying global geometry of a data set. Unlike classical techniques such as principal component analysis (PCA) and multidimensional scaling (MDS), our approach is capable of discovering the nonlinear degrees of freedom that underlie complex natural observations, such as human handwriting or images of a face under different viewing conditions. In contrast to previous algorithms for nonlinear dimensionality reduction, ours efficiently computes a globally optimal solution, and, for an important class of data manifolds, is guaranteed to converge asymptotically to the true structure.
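The method described is Isomap, and its three steps (neighborhood graph, geodesic distances via shortest paths, classical MDS) can be written compactly. A minimal sketch, with the neighborhood size as an assumed parameter:

import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def isomap(X, n_neighbors=8, n_components=2):
    # 1. Local metric information: distances to each point's nearest neighbors.
    graph = kneighbors_graph(X, n_neighbors, mode="distance")
    # 2. Global geometry: geodesic distances as shortest paths through the graph.
    D = shortest_path(graph, directed=False)
    # 3. Classical MDS on the geodesic distance matrix.
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered squared distances
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:n_components]  # largest eigenvalues first
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))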

13,652 citations

Journal ArticleDOI
TL;DR: A face recognition algorithm which is insensitive to large variation in lighting direction and facial expression is developed, based on Fisher's linear discriminant and produces well separated classes in a low-dimensional subspace, even under severe variations in lighting and facial expressions.
Abstract: We develop a face recognition algorithm which is insensitive to large variation in lighting direction and facial expression. Taking a pattern classification approach, we consider each pixel in an image as a coordinate in a high-dimensional space. We take advantage of the observation that the images of a particular face, under varying illumination but fixed pose, lie in a 3D linear subspace of the high-dimensional image space, if the face is a Lambertian surface without shadowing. However, since faces are not truly Lambertian surfaces and do indeed produce self-shadowing, images will deviate from this linear subspace. Rather than explicitly modeling this deviation, we linearly project the image into a subspace in a manner which discounts those regions of the face with large deviation. Our projection method is based on Fisher's linear discriminant and produces well separated classes in a low-dimensional subspace, even under severe variation in lighting and facial expressions. The eigenface technique, another method based on linearly projecting the image space to a low-dimensional subspace, has similar computational requirements. Yet, extensive experimental results demonstrate that the proposed "Fisherface" method has error rates that are lower than those of the eigenface technique for tests on the Harvard and Yale face databases.
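The Fisherface recipe, PCA to make the within-class scatter matrix nonsingular followed by Fisher's linear discriminant, maps directly onto standard library components. A sketch using scikit-learn; the nearest-neighbor classifier at the end is an assumed choice, not mandated by the paper:

from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

def fisherface_model(n_classes, n_samples):
    # PCA keeps at most n_samples - n_classes components so that LDA's
    # within-class scatter matrix is nonsingular; LDA then finds the
    # (at most n_classes - 1) most discriminative projection directions.
    return make_pipeline(
        PCA(n_components=n_samples - n_classes),
        LinearDiscriminantAnalysis(n_components=n_classes - 1),
        KNeighborsClassifier(n_neighbors=1),
    )

# Usage (X: one flattened face image per row, y: subject labels):
# model = fisherface_model(n_classes=len(set(y)), n_samples=len(X))
# model.fit(X, y); predictions = model.predict(X_test)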

11,674 citations

Journal ArticleDOI
21 Oct 1999-Nature
TL;DR: An algorithm for non-negative matrix factorization is demonstrated that is able to learn parts of faces and semantic features of text and is in contrast to other methods that learn holistic, not parts-based, representations.
Abstract: Is perception of the whole based on perception of its parts? There is psychological and physiological evidence for parts-based representations in the brain, and certain computational theories of object recognition rely on such representations. But little is known about how brains or computers might learn the parts of objects. Here we demonstrate an algorithm for non-negative matrix factorization that is able to learn parts of faces and semantic features of text. This is in contrast to other methods, such as principal components analysis and vector quantization, that learn holistic, not parts-based, representations. Non-negative matrix factorization is distinguished from the other methods by its use of non-negativity constraints. These constraints lead to a parts-based representation because they allow only additive, not subtractive, combinations. When non-negative matrix factorization is implemented as a neural network, parts-based representations emerge by virtue of two properties: the firing rates of neurons are never negative and synaptic strengths do not change sign.
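The non-negativity constraint can be enforced with Lee and Seung's multiplicative update rules, which rescale weights and therefore never flip their sign. A minimal sketch for the Frobenius-norm objective; the iteration count and epsilon are assumed values:

import numpy as np

def nmf(V, rank, iters=200, eps=1e-9):
    # Factor a non-negative matrix V (n x m) as W @ H with W, H >= 0.
    n, m = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((n, rank))
    H = rng.random((rank, m))
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # multiplicative update keeps H >= 0
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # multiplicative update keeps W >= 0
    return W, H   # columns of W are "parts"; rows of H give additive combinations

Applied to a matrix whose columns are face images, the columns of W tend toward localized parts (eyes, noses, mouths) precisely because only additive combinations are allowed.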

11,500 citations

Journal ArticleDOI
TL;DR: This work considers the problem of automatically recognizing human faces from frontal views with varying expression and illumination, as well as occlusion and disguise, and proposes a general classification algorithm for (image-based) object recognition based on a sparse representation computed by ℓ1-minimization.
Abstract: We consider the problem of automatically recognizing human faces from frontal views with varying expression and illumination, as well as occlusion and disguise. We cast the recognition problem as one of classifying among multiple linear regression models and argue that new theory from sparse signal representation offers the key to addressing this problem. Based on a sparse representation computed by ℓ1-minimization, we propose a general classification algorithm for (image-based) object recognition. This new framework provides new insights into two crucial issues in face recognition: feature extraction and robustness to occlusion. For feature extraction, we show that if sparsity in the recognition problem is properly harnessed, the choice of features is no longer critical. What is critical, however, is whether the number of features is sufficiently large and whether the sparse representation is correctly computed. Unconventional features such as downsampled images and random projections perform just as well as conventional features such as eigenfaces and Laplacianfaces, as long as the dimension of the feature space surpasses a certain threshold predicted by the theory of sparse representation. This framework can handle errors due to occlusion and corruption uniformly by exploiting the fact that these errors are often sparse with respect to the standard (pixel) basis. The theory of sparse representation helps predict how much occlusion the recognition algorithm can handle and how to choose the training images to maximize robustness to occlusion. We conduct extensive experiments on publicly available databases to verify the efficacy of the proposed algorithm and corroborate the above claims.
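The classification rule reduces to: code the test image sparsely over the matrix of all training images, then assign the class whose coefficients best reconstruct it. A sketch using Lasso as a practical stand-in for the paper's ℓ1-minimization; the penalty weight alpha is an assumed value:

import numpy as np
from sklearn.linear_model import Lasso

def src_classify(A, labels, y, alpha=0.01):
    # A: (d, n) matrix whose columns are training images; y: test image of length d.
    coder = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
    coder.fit(A, y)                     # sparse code x with y ~ A @ x
    x = coder.coef_
    labels = np.asarray(labels)
    residuals = {}
    for c in set(labels):
        xc = np.where(labels == c, x, 0.0)        # keep only class-c coefficients
        residuals[c] = np.linalg.norm(y - A @ xc)
    return min(residuals, key=residuals.get)      # smallest class residual wins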

9,658 citations

01 Jan 1999
TL;DR: In this article, non-negative matrix factorization is used to learn parts of faces and semantic features of text, which is in contrast to principal components analysis and vector quantization that learn holistic, not parts-based, representations.
Abstract: Is perception of the whole based on perception of its parts? There is psychological and physiological evidence for parts-based representations in the brain, and certain computational theories of object recognition rely on such representations. But little is known about how brains or computers might learn the parts of objects. Here we demonstrate an algorithm for non-negative matrix factorization that is able to learn parts of faces and semantic features of text. This is in contrast to other methods, such as principal components analysis and vector quantization, that learn holistic, not parts-based, representations. Non-negative matrix factorization is distinguished from the other methods by its use of non-negativity constraints. These constraints lead to a parts-based representation because they allow only additive, not subtractive, combinations. When non-negative matrix factorization is implemented as a neural network, parts-based representations emerge by virtue of two properties: the firing rates of neurons are never negative and synaptic strengths do not change sign.

9,604 citations