Topic
Viseme
About: Viseme is a research topic. Over its lifetime, 865 publications on this topic have been published, receiving 17,889 citations.
Papers published on a yearly basis
Papers
TL;DR: In this article, various aspects of the mechanisms of speech are studied, including the perception of speech (speech sounds and word recognition) and articulation (the pronunciation of words and speech sounds).
Abstract: Various aspects of the mechanisms of speech are studied. One series of studies has concentrated on the perception of speech, the sounds of speech, and word recognition. Various models for speech recognition have been created. Another set of studies has focused on articulation, the pronunciation of words and the sounds of speech. This area has also been explored in considerable detail.
10 Jun 2021
TL;DR: In this paper, a method for generating a head model animation from a voice signal using an artificial intelligence model, and an electronic device implementing it, are presented. The method comprises the steps of: acquiring characteristic information from the voice signal; using the artificial intelligence model to acquire, from that information, a phoneme stream corresponding to the voice signal and a viseme stream corresponding to the phoneme stream; and generating a head model animation by applying an animation curve to the visemes of the merged phoneme and viseme streams.
Abstract: Disclosed are: a method for generating a head model animation from a voice signal using an artificial intelligence model; and an electronic device for implementing same. The disclosed method for generating a head model animation from a voice signal, carried out by the electronic device, comprises the steps of: acquiring characteristics information of a voice signal from the voice signal; by using the artificial intelligence model, acquiring, from the characteristics information, a phoneme stream corresponding to the voice signal, and a viseme stream corresponding to the phoneme stream; by using the artificial intelligence model, acquiring an animation curve of visemes included in the viseme stream; merging the phoneme stream with the viseme stream; and generating a head model animation by applying the animation curve to the visemes of the merged phoneme and viseme stream.
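The pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the phoneme-to-viseme table, the timing format, and the rise-and-fall curve shape are all invented for the example.

```python
# Sketch: phoneme stream -> viseme stream -> per-viseme animation curves.
# The mapping table and curve shape below are illustrative assumptions.

# A common many-to-one mapping: several phonemes share one mouth shape.
PHONEME_TO_VISEME = {
    "p": "BMP", "b": "BMP", "m": "BMP",
    "f": "FV", "v": "FV",
    "aa": "AA", "ae": "AA",
    "iy": "EE",
}

def phonemes_to_visemes(phoneme_stream):
    """Map each (phoneme, start, end) entry to a (viseme, start, end) entry."""
    return [(PHONEME_TO_VISEME.get(p, "REST"), s, e)
            for p, s, e in phoneme_stream]

def animation_curve(start, end, steps=5):
    """Simple rise-and-fall blend-weight curve over the viseme's duration."""
    dur = end - start
    return [(start + dur * i / (steps - 1),
             1.0 - abs(2.0 * i / (steps - 1) - 1.0))
            for i in range(steps)]

# Toy phoneme stream with (label, start_sec, end_sec) timings.
phonemes = [("m", 0.0, 0.1), ("aa", 0.1, 0.3), ("p", 0.3, 0.4)]
visemes = phonemes_to_visemes(phonemes)
curves = [(v, animation_curve(s, e)) for v, s, e in visemes]
```

Driving a head model then amounts to sampling each viseme's curve and blending the corresponding mouth shape by the curve's weight at each frame.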
TL;DR: This paper proposes a new speech-synthesis strategy that uses intermediate-sized units corresponding to half syllables, called 'demisyllables', to produce computer-generated speech.
Abstract: Synthesis of English speech by computer can be accomplished in several different ways, depending on the size of the speech units that are used to produce voice output. The most widely used units for speech synthesis are phonemes (i.e., small speech units corresponding to individual phonetic items). An alternate method of producing computer-generated speech is to concatenate entire words of English, a method called 'word-concatenation' synthesis. A third strategy, the one described in this paper, is to use intermediate-sized units corresponding to half syllables, called 'demisyllables'.
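The demisyllable idea can be sketched as splitting each syllable at its vowel into an initial and a final half, then chaining the halves. A minimal illustration, assuming the syllable and vowel segmentation is given; in a real synthesizer each unit would be a recorded waveform, not a string:

```python
# Illustrative sketch of demisyllable concatenation. The segmentation
# and unit names are invented; real systems store acoustic units.

def split_syllable(syllable, vowel):
    """Split a syllable into (initial, final) demisyllables at its vowel."""
    i = syllable.index(vowel)
    # Initial demisyllable: onset + first half of the vowel;
    # final demisyllable: second half of the vowel + coda.
    return syllable[:i + 1], syllable[i:]

def synthesize(syllables, vowels):
    """Concatenate the demisyllable units for a sequence of syllables."""
    units = []
    for syl, v in zip(syllables, vowels):
        units.extend(split_syllable(syl, v))
    return units

# "viseme" -> syllables "vi" + "seme" (illustrative segmentation).
units = synthesize(["vi", "seme"], ["i", "e"])
# units: ["vi", "i", "se", "eme"]
```

The appeal of demisyllables is that the cut falls inside the vowel, where the signal is relatively steady, so concatenation joins are smoother than joins between phonemes.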
01 Jan 2020
TL;DR: This paper presents a system that recognizes lip movements for a lip-reading system, using the Viola–Jones algorithm to detect the mouth region and DCT to extract mouth features.
Abstract: This paper presents a system that recognizes lip movements for lip reading. Four lip gestures are recognized: rounded open, wide open, small open, and closed. These gestures are used to describe speech visually. First, the mouth region is detected in each frame using the Viola–Jones algorithm. Then, DCT is used to extract mouth features. Recognition is performed by an HMM, which achieves a high recognition rate of 84.99%.
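The feature-extraction step can be sketched as follows. This assumes the mouth region has already been located (the paper uses Viola–Jones detection; here the region is simply given as a small grayscale patch), and implements a naive 2-D DCT by hand purely for illustration. The 4x4 "image" is invented.

```python
import math

# Sketch of the DCT feature step for lip reading: apply a 2-D DCT to
# the detected mouth region and keep the low-frequency coefficients
# as the feature vector fed to the HMM.

def dct2(block):
    """Naive 2-D DCT-II of a square block (O(N^4); fine for a sketch)."""
    n = len(block)
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = sum(block[x][y]
                    * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                    * math.cos((2 * y + 1) * v * math.pi / (2 * n))
                    for x in range(n) for y in range(n))
            cu = math.sqrt(1 / n) if u == 0 else math.sqrt(2 / n)
            cv = math.sqrt(1 / n) if v == 0 else math.sqrt(2 / n)
            out[u][v] = cu * cv * s
    return out

def mouth_features(region, k=2):
    """Keep the k x k low-frequency DCT coefficients as the feature vector."""
    coeffs = dct2(region)
    return [coeffs[u][v] for u in range(k) for v in range(k)]

# Toy 4x4 grayscale mouth patch (bright interior on a dark background).
region = [[10, 10, 10, 10],
          [10, 50, 50, 10],
          [10, 50, 50, 10],
          [10, 10, 10, 10]]
features = mouth_features(region)
```

Keeping only the low-frequency coefficients compresses the mouth patch into a short vector that captures its overall shape while discarding pixel-level noise, which is what makes DCT features a common front end for HMM classifiers.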
01 Jan 2004
TL;DR: This work describes a maximum a posteriori decoding strategy for feature-based recognizers and derives two normalization criteria useful for a segment-based Viterbi or A* search.
Abstract: Most speech recognizers use an observation space which is based on a temporal sequence of spectral “frames.” There is another class of recognizer which further processes these frames to produce a segment-based network, and represents each segment by a fixed-dimensional “feature.” In such feature-based recognizers the observation space takes the form of a temporal graph of feature vectors, so that any single segmentation of an utterance will use a subset of all possible feature vectors. In this work we describe a maximum a posteriori decoding strategy for feature-based recognizers and derive two normalization criteria useful for a segment-based Viterbi or A* search. We show how a segment-based recognizer is able to obtain good results on the tasks of phonetic and word recognition.
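The segment-graph search described above can be sketched as a dynamic program over segment end times. This is an illustrative toy, not the paper's method: the segments, labels, and log-scores are invented, and the paper's MAP criterion additionally normalizes for the fact that different segmentations use different subsets of the feature vectors.

```python
# Viterbi-style dynamic program over a segment graph. Each candidate
# segment is (start_frame, end_frame, label, log_score); the best
# segmentation is the highest-scoring chain covering all frames.

def best_segmentation(n_frames, segments):
    """Return (total_log_score, label_path) for the best covering chain."""
    best = {0: (0.0, [])}  # end frame -> (score, label path so far)
    for t in range(1, n_frames + 1):
        for s, e, label, score in segments:
            if e == t and s in best:
                cand = (best[s][0] + score, best[s][1] + [label])
                if t not in best or cand[0] > best[t][0]:
                    best[t] = cand
    return best.get(n_frames)

# Toy segment graph over a 5-frame utterance: two competing
# segmentations of the same span.
segments = [
    (0, 3, "f", -1.0), (0, 2, "v", -2.5),
    (3, 5, "iy", -0.5), (2, 5, "iy", -1.8),
]
score, labels = best_segmentation(5, segments)
# labels: ["f", "iy"] with score -1.5
```

The key point the abstract makes is that competing paths through such a graph score different feature subsets, so raw path scores are not directly comparable; the paper's normalization criteria are what make the comparison valid in a Viterbi or A* search.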