http://www5.informatik.uni-erlangen.de/Forschung/Publikationen/2011/Schuller11-RRE.pdf

Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge

Text-to-Speech Synthesis provides a complete, end-to-end account of the process of generating speech by computer. Giving an in-depth explanation of all aspects of current speech synthesis technology, it assumes no specialized prior knowledge. Introductory chapters on linguistics, phonetics, signal processing and speech signals lay the foundation, with subsequent material explaining how this knowledge is put to use in building practical systems that generate speech. The book covers the very latest techniques, such as unit selection, hidden Markov model synthesis, and statistical text analysis, and also explains more traditional techniques such as formant synthesis and synthesis by rule. Weaving together the various strands of this multidisciplinary field, the book is designed for graduate students in electrical engineering, computer science, and linguistics. It is also an ideal reference for practitioners in the fields of human communication interaction and telephony.

/pdf/text-to-speech-synthesis-2y7hiwoykk.pdf

Text-to-Speech Synthesis

People naturally move their heads when they speak, and our study shows that this rhythmic head motion conveys linguistic information. Three-dimensional head and face motion and the acoustics of a talker producing Japanese sentences were recorded and analyzed. The head movement correlated strongly with the pitch (fundamental frequency) and amplitude of the talker's voice. In a perception study, Japanese subjects viewed realistic talking-head animations based on these movement recordings in a speech-in-noise task. The animations allowed the head motion to be manipulated without changing other characteristics of the visual or acoustic speech. Subjects correctly identified more syllables when natural head motion was present in the animation than when it was eliminated or distorted. These results suggest that nonverbal gestures such as head movements play a more direct role in the perception of speech than previously known.

Visual Prosody and Speech Intelligibility: Head Movement Improves Auditory Speech Perception
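
The abstract above reports that head movement correlated strongly with fundamental frequency and amplitude. As a minimal sketch of how such a relationship could be quantified, the following computes a Pearson correlation coefficient between two per-frame tracks. The function name `pearson`, the variable names, and the sample values are illustrative assumptions; the abstract does not state which correlation measure or data layout the authors used.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of co-deviations and the two root sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("a constant sequence has no defined correlation")
    return cov / (dev_x * dev_y)


# Made-up per-frame values, for illustration only (not data from the study).
f0_hz = [110.0, 118.0, 131.0, 125.0, 112.0, 105.0]
head_pitch_deg = [0.5, 1.1, 2.3, 1.8, 0.7, 0.2]

r = pearson(f0_hz, head_pitch_deg)
print(f"correlation between F0 and head pitch: {r:.3f}")
```

A value of `r` near 1 means the two tracks rise and fall together; a real analysis would also need time alignment of the motion and audio streams, which this sketch leaves out.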

This paper describes a novel parameter generation algorithm for an HMM-based speech synthesis technique. The conventional algorithm generates a parameter trajectory of static features that maximizes the likelihood of a given HMM for the parameter sequence consisting of the static and dynamic features under an explicit constraint between those two features. The generated trajectory is often excessively smoothed due to the statistical processing. Using the over-smoothed speech parameters usually causes muffled sounds. In order to alleviate the over-smoothing effect, we propose a generation algorithm considering not only the HMM likelihood maximized in the conventional algorithm but also a likelihood for a global variance (GV) of the generated trajectory. The latter likelihood works as a penalty for the over-smoothing, i.e., a reduction of the GV of the generated trajectory. The result of a perceptual evaluation demonstrates that the proposed algorithm causes considerably large improvements in the naturalness of synthetic speech.

A Speech Parameter Generation Algorithm Considering Global Variance for HMM-Based Speech Synthesis
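
The abstract above describes adding a likelihood on the global variance (GV) of the generated trajectory as a penalty against over-smoothing. The sketch below illustrates only that penalty term on a one-dimensional trajectory: it shows that smoothing lowers the GV and therefore the GV log-likelihood. The single-Gaussian GV model, the moving-average smoother, all function names, and all numbers are assumptions for illustration; this is not the paper's full generation algorithm, which also maximizes the HMM likelihood under the static/dynamic feature constraint.

```python
import math


def global_variance(traj):
    """Population variance over time of a 1-D parameter trajectory."""
    mean = sum(traj) / len(traj)
    return sum((c - mean) ** 2 for c in traj) / len(traj)


def gv_log_likelihood(traj, gv_mean, gv_var):
    """Log-density of the trajectory's GV under a 1-D Gaussian GV model."""
    v = global_variance(traj)
    return -0.5 * (math.log(2 * math.pi * gv_var) + (v - gv_mean) ** 2 / gv_var)


def moving_average(traj, width=3):
    """Crude stand-in for statistical over-smoothing (not the HMM algorithm)."""
    half = width // 2
    out = []
    for t in range(len(traj)):
        window = traj[max(0, t - half): t + half + 1]
        out.append(sum(window) / len(window))
    return out


# Made-up trajectory; the GV model is pretended to be trained on it.
natural = [0.0, 2.0, -1.5, 1.8, -2.2, 1.0, -0.5, 2.5]
smoothed = moving_average(natural)
gv_mean = global_variance(natural)  # hypothetical GV model mean
gv_var = 0.25                       # hypothetical GV model variance

print("GV natural :", round(global_variance(natural), 3))
print("GV smoothed:", round(global_variance(smoothed), 3))
print("GV log-lik natural :", round(gv_log_likelihood(natural, gv_mean, gv_var), 3))
print("GV log-lik smoothed:", round(gv_log_likelihood(smoothed, gv_mean, gv_var), 3))
```

Because the smoothed trajectory has a GV far below the model mean, its GV log-likelihood is lower, so an objective that includes this term favours trajectories with more natural variation.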

This paper gives a general overview of hidden Markov model (HMM)-based speech synthesis, which has recently been demonstrated to be very effective in synthesizing speech. The main advantage of this approach is its flexibility in changing speaker identities, emotions, and speaking styles. This paper also discusses the relation between the HMM-based approach and the more conventional unit-selection approach that has dominated over the last decades. Finally, advanced techniques for future developments are described.

/pdf/speech-synthesis-based-on-hidden-markov-models-45y6v1rk5u.pdf

Speech Synthesis Based on Hidden Markov Models

As we articulate speech, we usually move the head and exhibit various facial expressions. This visual aspect of speech aids understanding and helps communicating additional information, such as the speaker's mood. We analyze quantitatively head and facial movements that accompany speech and investigate how they relate to the text's prosodic structure. We recorded several hours of speech and measured the locations of the speakers' main facial features as well as their head poses. The text was evaluated with a prosody prediction tool, identifying phrase boundaries and pitch accents. Characteristic for most speakers are simple motion patterns that are repeatedly applied in synchrony with the main prosodic events. Direction and strength of head movements vary widely from one speaker to another, yet their timing is typically well synchronized with the spoken text. Understanding quantitatively the correlations between head movements and spoken text is important for synthesizing photo-realistic talking heads. Talking heads appear much more engaging when they exhibit realistic motion patterns.

/pdf/visual-prosody-facial-movements-accompanying-speech-54umgkliun.pdf

Visual prosody: facial movements accompanying speech

In this paper detectors for accents, phrase boundaries, and sentence modality are described which derive prosodic features only from the speech signal and its fundamental frequency to support other modules of a speech understanding system in an early analysis stage, or in cases where no word hypotheses are available. A new method for interpolating and decomposing the fundamental frequency is suggested. The detectors' underlying Gaussian distribution classifiers were trained and tested with approximately 50 minutes of spontaneous speech, yielding recognition rates of 78 percent for accents, 80 percent for phrase boundaries, and 85 percent for sentence modality.

/pdf/detection-of-accents-phrase-boundaries-and-sentence-modality-39b7qekigx.pdf

Detection of accents, phrase boundaries, and sentence modality in German
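
The abstract above mentions Gaussian distribution classifiers trained on prosodic features. As a rough sketch of that classifier family, the code below fits one univariate Gaussian per class and picks the class with the highest log-posterior. The single feature (an F0 excursion in semitones), the class labels, and the training values are hypothetical; the paper's actual feature set and classifier configuration are not given in the abstract.

```python
import math


def fit_gaussian(samples):
    """Maximum-likelihood mean and variance, with a small variance floor."""
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / len(samples)
    return mean, max(var, 1e-6)


def log_density(x, mean, var):
    """Log of the univariate Gaussian density at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def train(data):
    """data maps a class label to a list of feature values."""
    total = sum(len(values) for values in data.values())
    model = {}
    for label, values in data.items():
        mean, var = fit_gaussian(values)
        model[label] = (mean, var, math.log(len(values) / total))
    return model


def classify(x, model):
    """Return the label with the highest log-likelihood plus log-prior."""
    return max(
        model,
        key=lambda label: log_density(x, model[label][0], model[label][1])
        + model[label][2],
    )


# Made-up F0 excursions in semitones, for illustration only.
training = {
    "accent": [3.8, 4.5, 5.1, 4.0],
    "no_accent": [0.5, 1.0, 0.2, 0.9, 0.6],
}
model = train(training)
print(classify(4.2, model), classify(0.7, model))
```

With several features, the same idea extends to multivariate Gaussians with a mean vector and covariance matrix per class.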

https://hal.archives-ouvertes.fr/hal-00592584/document

Modeling and interpolation of Austrian German and Viennese dialect in HMM-based speech synthesis

We describe the results of large-scale perception experiments showing improvements in synthesising two distinct kinds of prominence: standard pitch-accent and strong emphatic accents. Previously prominence assignment has been mainly evaluated by computing accuracy on a prominence-labelled test set. By contrast we integrated an automatic pitch-accent classifier into the unit selection target cost and showed that listeners preferred these synthesised sentences. We also describe an improved recording script for collecting emphatic accents, and show that generating emphatic accents leads to further improvements in the fiction genre over incorporating pitch accent only. Finally, we show differences in the effects of prominence between child-directed speech and news and fiction genres.

/pdf/modelling-prominence-and-emphasis-improves-unit-selection-1w0wbyhspz.pdf

Modelling Prominence and Emphasis Improves Unit-Selection Synthesis

The AT&T text-to-speech (TTS) synthesis system has been used as a framework for experimenting with a perceptually guided data-driven approach to speech synthesis, with primary focus on data-driven elements in the "back end". Statistical training techniques applied to a large corpus are used to make decisions about predicted speech events and selected speech inventory units. Our recent advances in automatic phonetic and prosodic labeling and a new faster harmonic plus noise model (HNM) and unit preselection implementations have significantly improved TTS quality and speeded up both development time and runtime.

Volker Strom

Papers

Visual prosody: facial movements accompanying speech

Detection of accents, phrase boundaries, and sentence modality in German

Modeling and interpolation of Austrian German and Viennese dialect in HMM-based speech synthesis

Modelling Prominence and Emphasis Improves Unit-Selection Synthesis

Corpus-based techniques in the AT&T NEXTGEN synthesis system