
Showing papers on "Viseme" published in 1997


BookDOI
01 Jan 1997
TL;DR: This volume collects chapters on speech models and speech synthesis, ranging from signal processing and source modelling through text analysis to articulatory and visual speech synthesis, with the aim of bridging the gap between speech science and speech applications.
Abstract: Contents:
1. Section Introduction: Signal Processing and Source Modelling
2. Synthesizing Allophonic Glottalization
3. Text-to-Speech Synthesis with Dynamic Control of Speech
4. Modification of the Aperiodic Component of Speech Signals for Synthesis
5. On the Use of a Sinusoidal Model for Speech Synthesis in Text-to-Speech
6. Section Introduction: The Analysis of Text in Text-to-Speech Synthesis
7. Language-Independent Data-Oriented Grapheme-to-Phoneme Conversion
8. All-Prosodic Speech Synthesis
9. A Model of Timing for Non-Segmental Phonological Structure
10. A Complete Linguistic Analysis for an Italian Text-to-Speech System
11. Discourse Structural Constraints on Accent in Narrative
12. Homograph Disambiguation in Text-to-Speech Synthesis
13. Section Introduction: Talking Heads in Speech Synthesis
14. Section Introduction: Articulatory Synthesis and Visual Speech: Bridging the Gap Between Speech Science and Speech Applications
15. Speech Models and Speech Synthesis
16. A 3D Model of the Lips and of the Jaw for Visual Speech Synthesis
17. A Framework for Synthesis of Segments Based on Pseudo-articulatory Parameters
18. Biomechanical and Physiologically Based Speech

243 citations



Proceedings Article
01 Jan 1997
TL;DR: An original technique based on the use of a speech synthesizer for aligning a text with its corresponding speech signal; it seems to be a powerful tool for the automatic construction of large phonetically and prosodically labeled speech databases.
Abstract: This paper presents an original technique for solving the phonetic segmentation problem. It is based on the use of a speech synthesizer for the alignment of a text with its corresponding speech signal. A high-quality digital speech synthesizer is used to create a synthetic reference speech pattern used in the alignment process. This approach has the great advantage over other approaches that no training stage (and hence no labeled database) is needed. The system has been mainly evaluated on French read utterances. Other evaluations have been made on other languages such as English, German, Romanian and Spanish. Following these experiments, the system seems to be a powerful tool for the automatic construction of large phonetically and prosodically labeled speech databases. The availability of such corpora will be a key point for the development of improved speech synthesis and recognition systems.

70 citations
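The alignment step in this approach amounts to dynamic time warping (DTW) between the synthetic reference and the natural utterance. The sketch below illustrates that core step only; it is not the authors' system, and the frame-level features (e.g. MFCCs) are assumed to be computed elsewhere.

```python
# Sketch of the core alignment idea (not the authors' code): synthesize the
# text, extract frame-level features from both signals, and align them with
# dynamic time warping. Any short-time spectral representation would do.
import numpy as np

def dtw_path(ref, test):
    """Align two feature sequences (frames x dims); return frame index pairs."""
    n, m = len(ref), len(test)
    dist = np.linalg.norm(ref[:, None, :] - test[None, :, :], axis=2)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from the end to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

Since the synthesizer knows the phoneme boundaries of its own output, the warping path transfers those boundaries onto the natural recording, which is why no trained acoustic model or pre-labeled database is required.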


Patent
28 Feb 1997
TL;DR: The content of a speech sample is recognized using a computer system by evaluating the speech sample against a nonparametric set of training observations, for example, utterances from one or more human speakers.
Abstract: The content of a speech sample is recognized using a computer system by evaluating the speech sample against a nonparametric set of training observations, for example, utterances from one or more human speakers. The content of the speech sample is recognized based on the evaluation results. The speech recognition process also may rely on a comparison between the speech sample and a parametric model of the training observations.

23 citations
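Read literally, the claim describes recognition by direct comparison of the sample with stored training utterances (a nonparametric evaluation), possibly combined with a parametric model score. A toy sketch of the nonparametric part, assuming utterances are reduced to fixed-length feature vectors; all names are hypothetical.

```python
# Toy illustration of nonparametric recognition as the patent abstract
# describes it: score a speech sample directly against stored training
# observations rather than against a fitted parametric model alone.
import numpy as np

def recognize(sample, exemplars, labels, k=3):
    """k-NN vote over stored training observations.
    exemplars: (N, D) array of training utterance features;
    labels: the corresponding content labels."""
    dists = np.linalg.norm(exemplars - sample, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = [labels[i] for i in nearest]
    return max(set(votes), key=votes.count)
```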


Proceedings ArticleDOI
23 Jun 1997
TL;DR: The perceptual constraints that affect speech-reading performance in multimedia systems are investigated, and conclusions are drawn on the relationship between viseme groupings, accuracy of viseme recognition, and presentation rate.
Abstract: In the future, multimedia technology will be able to provide video frame rates equal to or better than 30 frames per second (FPS). Until that time, the hearing-impaired community will be using band-limited communication systems over unshielded twisted-pair copper wiring. As a result, multimedia communication systems will use a coder/decoder (CODEC) to compress the video and audio signals for transmission. For these systems to be usable by the hearing-impaired community, the algorithms within the CODEC have to be designed to account for the perceptual boundaries of the hearing impaired. We investigate the perceptual boundaries of speech reading under multimedia technology, that is, the constraints that affect speech-reading performance. We analyze and draw conclusions on the relationship between viseme groupings, accuracy of viseme recognition, and presentation rate. These results are critical in the design of multimedia systems for the hearing impaired.

22 citations



01 Jan 1997
TL;DR: Using optical flow methods borrowed from the computer vision literature, a method is presented for the construction of a videorealistic text-to-audiovisual speech synthesizer that is able to synchronize the visual speech stream with the audio speech stream, and hence give the impression of a videorealistic talking face.
Abstract: We present a method for the construction of a videorealistic text-to-audiovisual speech synthesizer. A visual corpus of a subject enunciating a set of key words is initially recorded. The key words are chosen so that they collectively contain most of the American English viseme images, which are subsequently identified and extracted from the data by hand. Next, using optical flow methods borrowed from the computer vision literature, we compute realistic transitions from every viseme to every other viseme. The images along these transition paths are generated using a morphing method. Finally, we exploit phoneme and timing information extracted from a text-to-speech synthesizer to determine which viseme transitions to use, and the rate at which the morphing process should occur. In this manner, we are able to synchronize the visual speech stream with the audio speech stream, and hence give the impression of a videorealistic talking face.

20 citations
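A rough sketch of the transition step, using OpenCV's Farneback optical flow as a stand-in for the flow method in the paper: estimate dense flow between two viseme images, then generate in-between frames by warping both endpoints and cross-dissolving. This illustrates the idea, not the authors' implementation.

```python
# Morph between two grayscale uint8 viseme images via optical flow.
import cv2
import numpy as np

def morph_frames(viseme_a, viseme_b, n_frames=8):
    """Yield in-between frames from viseme_a to viseme_b."""
    flow = cv2.calcOpticalFlowFarneback(viseme_a, viseme_b, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = viseme_a.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    for t in np.linspace(0.0, 1.0, n_frames):
        # Approximate backward maps: pull A forward by t, B backward by 1-t.
        map_ax = xs - t * flow[..., 0]
        map_ay = ys - t * flow[..., 1]
        map_bx = xs + (1 - t) * flow[..., 0]
        map_by = ys + (1 - t) * flow[..., 1]
        warped_a = cv2.remap(viseme_a, map_ax, map_ay, cv2.INTER_LINEAR)
        warped_b = cv2.remap(viseme_b, map_bx, map_by, cv2.INTER_LINEAR)
        yield ((1 - t) * warped_a + t * warped_b).astype(np.uint8)
```

A text-to-speech front end would then choose which viseme pair to morph and how many frames to emit, keeping the mouth motion locked to the audio timing.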


01 Jan 1997
TL;DR: The spatio-temporal characteristics of the closure/opening movements for the realisation of these consonantal targets were studied relative to the lip height (LH) parameter together with the temporal relationships between the characteristics of this articulatory movement and the co-produced acoustic signal.
Abstract: In order to identify the Italian consonantal visemes, to verify the results of perception tests, and to derive rules for bimodal synthesis and recognition, the 3D (lip height, lip width, lower lip protrusion) lip target shapes for all 21 Italian consonants were determined. Moreover, the spatio-temporal characteristics of the closure/opening movements for the realisation of these consonantal targets were studied relative to the lip height (LH) parameter, together with the temporal relationships between the characteristics of this articulatory movement and the co-produced acoustic signal.

18 citations


Book
11 Apr 1997
TL;DR: A collection of papers spanning face recognition, speaker verification and multimodal person authentication, including the integration of face recognition into security systems and expert conciliation for multimodal person authentication by Bayesian statistics.
Abstract: Contents:
- Robust eye centre extraction using the Hough Transform
- Localising facial features with matched filters
- Shape normalisation for face recognition
- Generalized likelihood ratio-based face detection and extraction of mouth features
- Tracking facial feature points with Gabor wavelets and shape models
- Face detection by direct convexity estimation
- Analysis and encoding of lip movements
- Lip-shape dependent face verification
- Statistical chromaticity models for lip tracking with B-splines
- A fully automatic approach to facial feature detection and tracking
- Automatic Video-based Person Authentication using the RBF network
- Using gait as a biometric, via phase-weighted magnitude spectra
- Identity authentication using fingerprints
- An algorithm for recognising walkers
- Metrological remote identification of a human body by stereoscopic camera techniques
- Discriminant analysis for recognition of human face images
- Image representations for visual learning
- Automatic profile identification
- Recognition of facial images with low resolution using a Hopfield memory model
- Exclusion of photos and new segmentation algorithms for the automatic face recognition
- Face authentication using morphological dynamic link architecture
- Non-intrusive person authentication for access control by visual tracking and face recognition
- Profile authentication using a chamfer matching algorithm
- Subband approach for automatic speaker recognition: Optimal division of the frequency domain
- Optimizing feature set for speaker verification
- VQ score normalisation for text-dependent and text-independent speaker recognition
- A two stage procedure for phone based speaker verification
- Speech/speaker recognition using a HMM/GMM hybrid model
- Recent advances in speaker recognition
- Speaker identification using harmonic structure of LP-residual spectrum
- A speaker identification agent
- Text-independent speaker identification based on spectral weighting functions
- Parameter discrimination analysis in speaker identification using self organizing map
- Text independent speaker verification using multiple-state predictive neural networks
- "Watch these lips" - Adding to acoustic signals to improve speaker recognition
- Expert conciliation for multi modal person authentication systems by Bayesian statistics
- SESAM: A biometric person identification system using sensor fusion
- Person authentication by fusing face and speech information
- Acoustic-labial speaker verification
- Combining evidence in multimodal personal identity recognition systems
- A viseme-based approach to labiometrics for automatic lipreading
- Development of an audio-visual database system for human identification
- Video compression and person authentication
- Lock-control system using face identification
- A system for automatic face recognition
- Time Encoded Signal Processing and Recognition for reduced data, high performance Speaker Verification architectures
- The CAVE speaker verification project - Experiments on the YOHO and SESP corpora
- The FERET September 1996 database and evaluation procedure
- The M2VTS multimodal face database (Release 1.00)
- One-shot 3D-shape and texture acquisition of facial data
- A multiple-baseline stereo for precise human face acquisition
- User perspectives on the security of access data, operator handover procedures and 'insult rate' for speaker verification in automated telephone services
- Integrating face recognition into security systems

14 citations


Proceedings Article
01 Sep 1997
TL;DR: This paper presents a method for extracting articulatory parameters by direct processing of raw images of the lips; the projection coefficients and predicted parameters are evaluated with an HMM-based visual speech recogniser, and the recognition scores obtained are compared to reference scores.
Abstract: This paper presents a method for the extraction of articulatory parameters from direct processing of raw images of the lips. The system architecture is made of three independent parts. First, a new greyscale mouth image is centred and downsampled. Second, the image is aligned and projected onto a basis of artificial images. These images are the eigenvectors computed from a PCA applied to a set of 23 reference lip shapes. Then, a multilinear interpolation predicts articulatory parameters from the image projection coefficients onto the eigenvectors. In addition, the projection coefficients and the predicted parameters were evaluated by an HMM-based visual speech recogniser. Recognition scores obtained with our method are compared to reference scores and discussed.

13 citations
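The pipeline lends itself to a compact sketch: project a normalized mouth image onto an eigenvector basis learned from the 23 references, then map the projection coefficients to articulatory parameters. In the sketch below, plain least squares stands in for the paper's multilinear interpolation, and the array shapes are assumptions.

```python
# Eigenvector ("eigenlips") projection plus a linear parameter predictor.
import numpy as np

def build_model(ref_images, ref_params, n_components=10):
    """ref_images: (23, H*W) flattened grayscale; ref_params: (23, P)."""
    mean = ref_images.mean(axis=0)
    centered = ref_images - mean
    # Eigenvectors of the image set via SVD.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]                      # (C, H*W)
    coeffs = centered @ basis.T                    # (23, C)
    # Linear map from projection coefficients to articulatory parameters
    # (least squares here; the paper uses multilinear interpolation).
    w, *_ = np.linalg.lstsq(coeffs, ref_params, rcond=None)
    return mean, basis, w

def predict_params(image, mean, basis, w):
    """image: flattened, centred/downsampled mouth image."""
    return ((image - mean) @ basis.T) @ w
```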




Book ChapterDOI
01 Jan 1997
TL;DR: Studies of articulatory kinematics suggest that closer attention to the spectral effects of articulator movement will be an important element in improving how well synthesis systems capture the salient temporal correlates of stress and phrasing.
Abstract: Basic research in speech science over the last half century has benefited greatly from our endeavors to synthesize speech by machine. For example, developing programs for simulating the time course of fundamental frequency variation over sentences and longer utterances has been an indispensable research tool in our basic understanding of intonation. Synthesis systems, in turn, have directly benefited from being able to incorporate the models of linguistic control of F0 originally built to test one or another theory of intonation. Models of temporal control are another area that can see important cross-fertilization of results and ideas between basic and applied research in synthesis. Current synthesis systems treat timing control by computing context-sensitive durations for phonetic segments, a method that integrates the use of statistical tools and large speech databases with the insights of several decades of smaller controlled laboratory experiments. Studies of articulatory kinematics suggest that closer attention to the spectral effects of articulator movement will be an important element in improving how well our synthesis systems capture the salient temporal correlates of stress and phrasing.
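The "context-sensitive durations" the chapter refers to are typically computed by rules or regression over contextual factors. A toy illustration of the multiplicative-rule style; the factor values below are invented for the example, not taken from any real system.

```python
# Klatt-style context-sensitive segment durations: an intrinsic duration
# per phone, scaled by multiplicative context factors, with a floor.
BASE_MS = {"AA": 120, "T": 70, "S": 100}   # intrinsic segment durations (ms)
FACTORS = {
    "phrase_final": 1.4,      # pre-pausal lengthening
    "unstressed": 0.8,        # reduction in unstressed syllables
    "cluster": 0.85,          # shortening in consonant clusters
}

def segment_duration(phone, context, min_ms=40):
    dur = BASE_MS[phone]
    for feature in context:
        dur *= FACTORS.get(feature, 1.0)
    return max(min_ms, dur)

print(segment_duration("AA", ["phrase_final"]))          # 168.0
print(segment_duration("T", ["unstressed", "cluster"]))  # 47.6
```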

Book ChapterDOI
12 Mar 1997
TL;DR: The articulatory approach to the preprocessing of mouth images in automatic lipreading gives some description of a mouth shape in phonetic terms and gives a reliable evaluation of labiometric parameters that could not be automatically measured on natural lips without prior make-up.
Abstract: There are two main approaches to the preprocessing of mouth images in automatic lipreading. The stochastic approach makes wide use of learning techniques, providing image features that are poorly interpretable. The articulatory approach aims at measuring, as accurately as possible, anatomical and/or geometrical parameters which can be interpreted in phonetic terms. We call this approach "labiometrics". Although the method proposed here involves image processing techniques generally used in the stochastic approach, it is indeed articulatory-oriented: not only does it give some description of a mouth shape in phonetic terms (i.e., visemes), but it also gives a reliable evaluation of labiometric parameters that could not be automatically measured on natural lips without prior make-up. Moreover, the stochastic component of our approach is based on a limited set of training images, so its computational cost remains quite low.

Proceedings ArticleDOI
23 Jun 1997
TL;DR: An approach to extracting visemes from both the image and acoustic domains is presented; the mouth shapes, represented by feature points on inner lip contours, are extracted through face tracking and mouth image analysis.
Abstract: Unlike other image templates, visemes have identities in two different media. In the audio domain, they are often related to basic linguistic units such as phonemes. In the image domain, they are defined by the images of human articulators, such as mouth shapes, chin movements, etc. In this paper, an approach to extracting visemes from both the image and acoustic domains is presented. In the image domain, the mouth shapes, represented by feature points on inner lip contours, are extracted through face tracking and mouth image analysis. In the acoustic domain, viseme segments are obtained automatically by aligning phoneme strings to audio signals through a Viterbi alignment process.
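The acoustic side of this, aligning a known phoneme string to the audio, is a forced Viterbi alignment. A minimal sketch under the assumption that a per-frame log-likelihood matrix for each phoneme in the string is already available (e.g. from HMMs):

```python
# Forced alignment: each frame either stays in the current phoneme or
# advances to the next one; dynamic programming finds the best path.
import numpy as np

def force_align(loglik):
    """loglik: (T, N) log-likelihood of frame t under phoneme i.
    Returns a phoneme index for every frame."""
    T, N = loglik.shape
    score = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)
    score[0, 0] = loglik[0, 0]
    for t in range(1, T):
        for i in range(N):
            stay = score[t - 1, i]
            advance = score[t - 1, i - 1] if i > 0 else -np.inf
            if advance > stay:
                score[t, i], back[t, i] = advance + loglik[t, i], i - 1
            else:
                score[t, i], back[t, i] = stay + loglik[t, i], i
    # Backtrack from the final phoneme at the last frame.
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]
```

The frames at which the returned index increments give the phoneme, and hence viseme, segment boundaries.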

Proceedings Article
Steven J. Phillips1, Anne Rogers1
01 Jan 1997
TL;DR: In this paper, the authors show how to harness shared memory multiprocessors, which are becoming increasingly common, to increase the speed significantly, and therefore the accuracy or vocabulary size, of a speech recognizer.
Abstract: Computer speech recognition has been very successful in limited domains and for isolated word recognition. However, widespread use of large-vocabulary continuous-speech recognizers is limited by the speed of current recognizers, which cannot reach acceptable error rates while running in real time. This paper shows how to harness shared-memory multiprocessors, which are becoming increasingly common, to increase the speed significantly, and therefore the accuracy or vocabulary size, of a speech recognizer. To cover the necessary background, we begin with a tutorial on speech recognition. We then describe the parallelization of an existing high-quality speech recognizer, achieving speedups of factors of 3, 5, and 6 on 4, 8, and 12 processors respectively for the benchmark North American business news (NAB) recognition task.
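The paper parallelizes the search of an existing recognizer; as a much simpler illustration of exploiting shared-memory parallelism in a recognizer, the sketch below splits per-frame Gaussian scoring across worker threads. This illustrates the general idea only, not the authors' parallelization.

```python
# Shared-memory parallel acoustic scoring. NumPy's array operations release
# the GIL in their C kernels, so threads can give real speedups here.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def gaussian_loglik(frames, means, inv_vars):
    """frames: (T, D); means/inv_vars: (S, D). Returns (T, S) scores."""
    diff = frames[:, None, :] - means[None, :, :]
    return -0.5 * np.einsum("tsd,sd,tsd->ts", diff, inv_vars, diff)

def parallel_loglik(frames, means, inv_vars, n_workers=4):
    chunks = np.array_split(frames, n_workers)
    with ThreadPoolExecutor(n_workers) as pool:
        parts = pool.map(lambda c: gaussian_loglik(c, means, inv_vars), chunks)
    return np.vstack(list(parts))
```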


Proceedings Article
01 Jan 1997
TL;DR: The acoustic signal is filtered into several spectral bands, and independent recognition is achieved in each band, and the system recombines the results given by each recognizer and delivers a unique solution.
Abstract: The problem addressed by this paper is to enhance the robustness of continuous speech recognizers to noise. For this purpose, the acoustic signal is filtered into several spectral bands, and independent recognition is performed in each band. The system then recombines the results given by each recognizer and delivers a unique solution. The main advantage of this method is that it considers the signal only in the bands which are relevant, and ignores spectral bands which are corrupted by noise. We are developing a speaker-independent continuous speech recognizer based on this principle.
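A minimal sketch of the multi-band scheme: band-pass the signal into a few spectral bands, score each band independently per hypothesis, and recombine the band scores, down-weighting bands believed to be noisy. The band edges and the scorer are placeholders, not the authors' configuration.

```python
# Split a signal into spectral bands and recombine per-band scores.
import numpy as np
from scipy.signal import butter, sosfilt

BANDS_HZ = [(100, 1000), (1000, 2500), (2500, 4000)]  # example split

def split_bands(signal, fs):
    """Yield one band-limited version of the signal per band."""
    for lo, hi in BANDS_HZ:
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        yield sosfilt(sos, signal)

def recombine(band_scores, weights):
    """band_scores: (n_bands, n_hypotheses) log-scores from each band's
    recognizer; weights: (n_bands,) array. Down-weighting a noisy band
    lets the clean bands dominate the decision."""
    return np.argmax(np.asarray(weights) @ band_scores)
```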


Proceedings Article
01 Jan 1997
TL;DR: A re-entry model for phonemes lost during the search process, using a multiple-pronunciation dictionary in which pronunciations are added on the basis of HMM-state confusion characteristics to improve word recognition rates.
Abstract: In our previous work, we proposed a re-entry model for missing phonemes which are lost during the search process. In that re-entry modeling, the recognition results are postprocessed and the originally recognized phoneme sequences are converted to new phoneme sequences using HMM-state confusion characteristics spanning several phonemes. We confirmed that HMM-state confusions are effective for re-entry modeling. In this paper, we propose re-entry modeling during recognition using a multiple pronunciation dictionary where pronunciations are added using HMM-state confusion characteristics. The pronunciations are added considering the part-of-speech (POS) dependency of the confusion characteristics. As a result of continuous recognition experiments, we confirmed that the following two points are effective in improving word recognition rates: (1) confusions are expressed by HMM-state sequences, and (2) pronunciations are added considering the part-of-speech dependency of the confusion characteristics.
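A toy sketch of the dictionary-expansion step: given confusion entries conditioned on part of speech, add pronunciation variants to the lexicon wherever a confusable phone subsequence occurs. The confusion table below is an invented placeholder; a real system would derive it from measured HMM-state confusion statistics.

```python
# Expand a pronunciation lexicon with POS-conditioned confusion variants.
CONFUSIONS = {
    # (POS, canonical subsequence) -> sequence the recognizer tends to need
    ("NOUN", ("t", "s")): ("t", "u", "s"),
    ("VERB", ("k",)): ("k", "u"),
}

def add_variants(lexicon, pos_tags):
    """lexicon: word -> list of phone tuples; pos_tags: word -> POS.
    Returns an expanded copy of the lexicon."""
    expanded = {w: list(prons) for w, prons in lexicon.items()}
    for word, prons in lexicon.items():
        pos = pos_tags.get(word)
        for pron in prons:
            for (cpos, src), dst in CONFUSIONS.items():
                if cpos != pos:
                    continue
                n = len(src)
                # Replace each occurrence of the confusable subsequence.
                for i in range(len(pron) - n + 1):
                    if pron[i:i + n] == src:
                        variant = pron[:i] + dst + pron[i + n:]
                        if variant not in expanded[word]:
                            expanded[word].append(variant)
    return expanded
```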


Proceedings Article
01 Jan 1997
TL;DR: This paper quantifies the geometry of speech turbulence, as reflected in the fragmentation of the time signal, by using fractal models, and describes an algorithm for estimating the short-time fractal dimension of speech signals based on multiscale morphological filtering, discussing its potential for phonetic classification.
Abstract: The dynamics of airflow during speech production may often result in some small or large degree of turbulence. In this paper, we quantify the geometry of speech turbulence, as reflected in the fragmentation of the time signal, by using fractal models. We describe an efficient algorithm for estimating the short-time fractal dimension of speech signals based on multiscale morphological filtering and discuss its potential for phonetic classification. We also report experimental results on using the short-time fractal dimension of speech signals at multiple scales as additional features in an automatic speech recognition system using hidden Markov models, which provides a modest improvement in speech recognition performance.
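The morphological covering method behind this kind of estimator measures the area of a band around the signal graph at several scales and fits the power law A(s) ~ s^(2-D). A simplified sketch, with flat min/max filters standing in for the multiscale morphological filtering and illustrative parameter choices:

```python
# Estimate the fractal dimension of a signal frame from the slope of the
# log-log cover-area curve: log A(s) = (2 - D) log s + c, so D = 2 - slope.
import numpy as np
from scipy.ndimage import maximum_filter1d, minimum_filter1d

def fractal_dimension(frame, max_scale=8):
    scales = np.arange(1, max_scale + 1)
    areas = []
    for s in scales:
        width = 2 * s + 1
        # Flat dilation minus erosion approximates the cover at scale s.
        cover = maximum_filter1d(frame, width) - minimum_filter1d(frame, width)
        areas.append(cover.sum())
    slope = np.polyfit(np.log(scales), np.log(areas), 1)[0]
    return 2.0 - slope
```

For short-time use, D would be computed per analysis frame (e.g. every 10-20 ms), possibly at several maximum scales, and appended to the recognizer's feature vector.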