
Showing papers by "Rajeev Sharma" published in 2001


Proceedings ArticleDOI
15 Nov 2001
TL;DR: An experiment is conducted with 23 subjects that evaluates selection strategies for interaction with large screen displays, and results for different target sizes and positions are reported in terms of accuracy, selection times and user preference.
Abstract: Progress in computer vision and speech recognition technologies has recently enabled multimodal interfaces that use speech and gestures. These technologies offer promising alternatives to existing interfaces because they emulate the natural way in which humans communicate. However, no systematic work has been reported that formally evaluates the new speech/gesture interfaces. This paper is concerned with formal experimental evaluation of new human-computer interactions enabled by speech and hand gestures. The paper describes an experiment conducted with 23 subjects that evaluates selection strategies for interaction with large screen displays. The multimodal interface designed for this experiment does not require the user to be in physical contact with any device. Video cameras and long range microphones are used as input for the system. Three selection strategies are evaluated, and results for different target sizes and positions are reported in terms of accuracy, selection times and user preference. Design implications for vision/speech based interfaces are inferred from these results. This study also raises new questions and topics for future research.

39 citations


Journal Article
TL;DR: In this paper, a structured approach for studying patterns of multimodal language in the context of 2D-display control is proposed, in which the analysis of gestures, from observable kinematical primitives to their semantics, is considered pertinent to a linguistic structure.
Abstract: In recent years, because of advances in computer vision research, free hand gestures have been explored as a means of human-computer interaction (HCI). Together with improved speech processing technology, this is an important step toward natural multimodal HCI. However, the inclusion of non-predefined continuous gestures into a multimodal framework is a challenging problem. In this paper, we propose a structured approach for studying patterns of multimodal language in the context of 2D-display control. We consider the systematic analysis of gestures, from observable kinematical primitives to their semantics, as pertinent to a linguistic structure. The proposed semantic classification of co-verbal gestures distinguishes six categories based on their spatio-temporal deixis. We discuss the evolution of a computational framework for gesture and speech integration which was used to develop an interactive testbed (iMAP). The testbed enabled elicitation of adequate, non-sequential, multimodal patterns in a narrative mode of HCI. The user studies conducted illustrate the significance of accounting for the temporal alignment of gesture and speech parts in semantic mapping. Furthermore, co-occurrence analysis of gesture/speech production suggests syntactic organization of gestures at the lexical level.
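As a side note, the kind of co-occurrence analysis mentioned in this abstract can be made concrete with a toy sketch: count how often gesture categories temporally overlap with word categories. The interval format, the category labels, and the overlap threshold below are illustrative assumptions, not the iMAP implementation.

```python
# Toy co-occurrence analysis of gesture and speech intervals (illustrative only).
from collections import Counter

def overlap(a, b):
    """Length of temporal overlap between intervals a = (start, end) and b = (start, end)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def cooccurrence(gestures, words, min_overlap=0.1):
    """Count gesture-category / word-category pairs whose time intervals overlap."""
    counts = Counter()
    for g_cat, g_span in gestures:
        for w_cat, w_span in words:
            if overlap(g_span, w_span) >= min_overlap:
                counts[(g_cat, w_cat)] += 1
    return counts

# Hypothetical annotated intervals: (category, (start_s, end_s))
gestures = [("pointing", (0.2, 0.9)), ("contour", (1.4, 2.3))]
words = [("deictic", (0.3, 0.7)), ("spatial", (1.5, 2.0)), ("other", (2.5, 2.8))]
print(cooccurrence(gestures, words))
# Counter({('pointing', 'deictic'): 1, ('contour', 'spatial'): 1})
```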

38 citations


Book ChapterDOI
11 May 2001
TL;DR: A structured approach for studying patterns of multimodal language in the context of 2D-display control is proposed, and co-occurrence analysis of gesture/speech production suggests syntactic organization of gestures at the lexical level.
Abstract: In recent years, because of advances in computer vision research, free hand gestures have been explored as a means of human-computer interaction (HCI). Together with improved speech processing technology, this is an important step toward natural multimodal HCI. However, the inclusion of non-predefined continuous gestures into a multimodal framework is a challenging problem. In this paper, we propose a structured approach for studying patterns of multimodal language in the context of 2D-display control. We consider the systematic analysis of gestures, from observable kinematical primitives to their semantics, as pertinent to a linguistic structure. The proposed semantic classification of co-verbal gestures distinguishes six categories based on their spatio-temporal deixis. We discuss the evolution of a computational framework for gesture and speech integration which was used to develop an interactive testbed (iMAP). The testbed enabled elicitation of adequate, non-sequential, multimodal patterns in a narrative mode of HCI. The user studies conducted illustrate the significance of accounting for the temporal alignment of gesture and speech parts in semantic mapping. Furthermore, co-occurrence analysis of gesture/speech production suggests syntactic organization of gestures at the lexical level.

33 citations


Proceedings ArticleDOI
08 Jul 2001
TL;DR: The framework features an appearance based approach to represent the spatial information and hidden Markov models (HMM) to encode the temporal dynamics of the time varying visual patterns, providing a unified spatio-temporal approach to common detection, tracking and classification problems.
Abstract: We propose a framework for detecting, tracking and analyzing non-rigid motion based on learned motion patterns. The framework features an appearance based approach to represent the spatial information and hidden Markov models (HMM) to encode the temporal dynamics of the time varying visual patterns. The low level spatial feature extraction is fused with the temporal analysis, providing a unified spatio-temporal approach to common detection, tracking and classification problems. This is a promising approach for many classes of human motion patterns. Visual tracking is achieved by extracting the most probable sequence of target locations from a video stream using a combination of random sampling and the forward procedure from HMM theory. The method allows us to perform a set of important tasks such as activity recognition, gait-analysis and keyframe extraction. The efficacy of the method is shown on both natural and synthetic test sequences.
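To make the combination of random sampling and the HMM forward procedure more concrete, below is a minimal sketch of a single tracking step; it is not the authors' system. Candidate target locations are sampled around the previous estimates, and the forward recursion weights them by a motion transition term and an appearance likelihood. The Gaussian motion model, the dummy appearance score, and all parameter values are assumptions.

```python
# Minimal sketch: one forward-recursion step over randomly sampled candidate locations.
import numpy as np

def appearance_likelihood(frame, loc):
    """Stand-in for an appearance model p(observation | location): here just the mean
    brightness of a small patch around loc. A real system would use a learned model."""
    x, y = int(loc[0]), int(loc[1])
    h, w = frame.shape[:2]
    if not (0 <= x < w and 0 <= y < h):
        return 1e-6
    patch = frame[max(0, y - 2):y + 3, max(0, x - 2):x + 3]
    return float(patch.mean()) + 1e-6

def forward_tracking_step(frame, prev_locs, prev_alpha, n_samples=200, sigma=5.0):
    """prev_locs: (m, 2) candidate locations from the previous frame.
    prev_alpha: (m,) normalized forward probabilities for those locations."""
    rng = np.random.default_rng()
    # Sample new candidates around previous ones, favoring high-probability locations.
    idx = rng.choice(len(prev_locs), size=n_samples, p=prev_alpha / prev_alpha.sum())
    cand = prev_locs[idx] + rng.normal(0.0, sigma, size=(n_samples, 2))
    # Transition term a_ij: Gaussian motion model between old and new locations.
    d2 = ((cand[:, None, :] - prev_locs[None, :, :]) ** 2).sum(-1)
    trans = np.exp(-d2 / (2 * sigma ** 2))
    # Forward recursion: alpha_t(j) = p(o_t | j) * sum_i a_ij * alpha_{t-1}(i)
    obs = np.array([appearance_likelihood(frame, c) for c in cand])
    alpha = obs * (trans @ prev_alpha)
    return cand, alpha / (alpha.sum() + 1e-12)
```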

29 citations


01 Jan 2001
TL;DR: This work shows how kinematic structure can be inferred from monocular views without making any a priori assumptions about the scene except that it consists of piecewise rigid segments constrained by jointed motion.
Abstract: We extract and initialize kinematic models from monocular visual data from the ground up without any manual initialization, adaptation or prior model knowledge. Visual analysis, classification and tracking of articulated motion is challenging due to the difficulties involved in separating noise and spurious variability caused by appearance, size and view point fluctuations from the task-relevant variations. By incorporating powerful domain knowledge, model based approaches are able to overcome this problem to a great extent and are actively explored by many researchers. However, model acquisition, initialization and adaptation are still relatively under-investigated problems. In this work we show how kinematic structure can be inferred from monocular views without making any a priori assumptions about the scene except that it consists of piecewise rigid segments constrained by jointed motion. The efficacy of the method is demonstrated on synthetic as well as natural image sequences.
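One way to picture the piecewise-rigidity assumption is that points tracked on the same rigid segment keep roughly constant mutual distances over time. The sketch below is a deliberately simplified illustration of that idea, not the paper's method: it groups 2D feature trajectories by thresholding the variance of pairwise distances, which only holds under approximately planar motion; the threshold and the transitive grouping rule are illustrative choices.

```python
# Simplified sketch: group feature trajectories into rigid segments by distance variance.
import numpy as np

def rigid_groups(tracks, var_thresh=2.0):
    """tracks: array of shape (num_points, num_frames, 2). Returns a list of index groups."""
    n = tracks.shape[0]
    # Pairwise distances over time, shape (n, n, num_frames).
    d = np.linalg.norm(tracks[:, None, :, :] - tracks[None, :, :, :], axis=-1)
    # Low distance variance over time => likely the same rigid segment (planar-motion assumption).
    rigid = d.var(axis=-1) < var_thresh
    groups, unassigned = [], set(range(n))
    while unassigned:
        seed = unassigned.pop()
        group = {seed}
        changed = True
        while changed:  # simple transitive closure over the rigidity relation
            changed = False
            for j in list(unassigned):
                if rigid[j, list(group)].all():
                    group.add(j)
                    unassigned.discard(j)
                    changed = True
        groups.append(sorted(group))
    return groups
```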

18 citations


Journal ArticleDOI
TL;DR: A system of marker coding is presented that, together with an efficient image processing technique, provides a practical method for tracking the marked objects in real time; the utility of the marker-based tracking technique is demonstrated in an augmented reality application.
Abstract: Augmented reality requires understanding of the scene to know when, where and what to display in response to changes in the surrounding world. This understanding often involves tracking and recognition of multiple objects and locations in real time. Technologies frequently used for multiple object tracking, such as electromagnetic trackers, are very limited in range as well as constraining. The use of computer vision to identify and track multiple objects is very promising. However, the requirements for traditional object recognition using appearance-based or model-based vision are very complex, and their performance is far from real-time. An alternative is to use a set of markers or fiducials for object tracking and recognition. In this paper we present a system of marker coding that, together with an efficient image processing technique, provides a practical method for tracking the marked objects in real time. The technique is based on clustering of candidate regions in space using a minimum spanning tree. The markers in the codes also allow the estimation of the three-dimensional pose of the objects. We demonstrate the utility of the marker-based tracking technique in an augmented reality application. The application involves superimposing graphics over real industrial parts that are tracked using fiducials and manipulated by a human in order to complete an assembly. The system aids in the evaluation of the different assembly sequence possibilities.
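As an illustration of the clustering step described above (not the paper's exact algorithm), candidate marker regions can be grouped by building a minimum spanning tree over their centroids and cutting edges longer than a threshold, so that each remaining connected component becomes one marker cluster. The distance threshold and the SciPy-based implementation are assumptions.

```python
# Illustrative clustering of candidate marker regions with a minimum spanning tree.
import numpy as np
from scipy.spatial import distance_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def cluster_candidates(centroids, max_edge=30.0):
    """centroids: (n, 2) array of candidate region centers in pixels.
    Returns a cluster label per candidate region."""
    dist = distance_matrix(centroids, centroids)
    mst = minimum_spanning_tree(dist).toarray()
    mst[mst > max_edge] = 0.0  # cut long edges between distant regions
    n_clusters, labels = connected_components(mst, directed=False)
    return labels

centroids = np.array([[10, 10], [14, 12], [200, 50], [204, 48]])
print(cluster_candidates(centroids))  # e.g. [0 0 1 1]
```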

17 citations


Proceedings ArticleDOI
TL;DR: A probabilistic model is used to fuse the color and motion information to localize the body parts and employ a multiple hypothesis tracking (MHT) algorithm to track these features simultaneously, which is capable of tracking multiple objects with limited occlusions and is suitable for resolving any data association uncertainty.
Abstract: Tracking body parts of multiple people in a video sequence is very useful for face/gesture recognition systems as well as human-computer interaction (HCI) interfaces. This paper describes a framework for tracking multiple objects (e.g., hands and faces of multiple people) in a video stream. We use a probabilistic model to fuse the color and motion information to localize the body parts and employ a multiple hypothesis tracking (MHT) algorithm to track these features simultaneously. The MHT algorithm is capable of tracking multiple objects with limited occlusions and is suitable for resolving any data association uncertainty. We incorporate a path coherence function along with MHT to reduce the negative effects of spurious measurements that produce unconvincing tracks and needless computations. The performance of the framework has been validated by experiments on synthetic and real image sequences.
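A minimal sketch of what fusing color and motion cues might look like is given below; it follows the spirit of the framework but is not its actual probabilistic model. The Gaussian skin-color parameters, the frame-difference motion cue, and the conditional-independence assumption are all illustrative.

```python
# Illustrative fusion of color and motion cues into a body-part candidate mask.
import numpy as np

def skin_color_likelihood(frame_rgb, mean=(180, 120, 100), cov_diag=(900, 600, 600)):
    """Rough Gaussian skin-color score per pixel (parameters are placeholders)."""
    diff = frame_rgb.astype(float) - np.array(mean)
    return np.exp(-0.5 * (diff ** 2 / np.array(cov_diag)).sum(axis=-1))

def motion_likelihood(frame_gray, prev_gray, sigma=10.0):
    """Frame-difference motion score per pixel, in [0, 1)."""
    diff2 = (frame_gray.astype(float) - prev_gray.astype(float)) ** 2
    return 1.0 - np.exp(-diff2 / (2 * sigma ** 2))

def body_part_candidates(frame_rgb, frame_gray, prev_gray, thresh=0.4):
    """Fuse the two cues (assuming conditional independence) and return a binary mask
    that could feed the measurement step of a multiple hypothesis tracker."""
    fused = skin_color_likelihood(frame_rgb) * motion_likelihood(frame_gray, prev_gray)
    return fused > thresh
```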

14 citations


Posted Content
TL;DR: In this article, a structured approach for studying patterns of multimodal language in the context of 2D-display control is proposed, in which the analysis of gestures, from observable kinematical primitives to their semantics, is considered pertinent to a linguistic structure.
Abstract: In recent years, because of advances in computer vision research, free hand gestures have been explored as a means of human-computer interaction (HCI). Together with improved speech processing technology, this is an important step toward natural multimodal HCI. However, the inclusion of non-predefined continuous gestures into a multimodal framework is a challenging problem. In this paper, we propose a structured approach for studying patterns of multimodal language in the context of 2D-display control. We consider the systematic analysis of gestures, from observable kinematical primitives to their semantics, as pertinent to a linguistic structure. The proposed semantic classification of co-verbal gestures distinguishes six categories based on their spatio-temporal deixis. We discuss the evolution of a computational framework for gesture and speech integration which was used to develop an interactive testbed (iMAP). The testbed enabled elicitation of adequate, non-sequential, multimodal patterns in a narrative mode of HCI. The user studies conducted illustrate the significance of accounting for the temporal alignment of gesture and speech parts in semantic mapping. Furthermore, co-occurrence analysis of gesture/speech production suggests syntactic organization of gestures at the lexical level.

7 citations