Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is experienced, and a research problem is characterized as multimodal when it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential. Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy. We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research.
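The early and late fusion categorization mentioned in this abstract can be illustrated with a minimal sketch. It is not taken from the survey: the function names, the linear scorers, and the mixing weight `alpha` are all hypothetical, chosen only to show where the modalities are combined in each scheme.

```python
import numpy as np

def early_fusion_score(audio, visual, w_joint):
    """Early fusion: concatenate the modality features, then apply one model."""
    joint = np.concatenate([audio, visual])
    return float(w_joint @ joint)

def late_fusion_score(audio, visual, w_audio, w_visual, alpha=0.5):
    """Late fusion: score each modality separately, then combine the decisions."""
    audio_score = w_audio @ audio
    visual_score = w_visual @ visual
    return float(alpha * audio_score + (1.0 - alpha) * visual_score)

# Hypothetical toy features and weights (illustrative values only).
audio = np.array([1.0, 2.0])
visual = np.array([0.5])
w_audio = np.array([0.2, 0.1])
w_visual = np.array([0.4])

late = late_fusion_score(audio, visual, w_audio, w_visual, alpha=0.5)

# For purely linear scorers, an early-fusion model with suitably scaled
# weights reproduces the late-fusion score.
w_joint = np.concatenate([0.5 * w_audio, 0.5 * w_visual])
early = early_fusion_score(audio, visual, w_joint)
```

With linear scorers the two schemes coincide, as the last lines show. They differ in practice once the per-modality or joint models are nonlinear.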

Multimodal Machine Learning: A Survey and Taxonomy

This paper gives a general overview of techniques in statistical parametric speech synthesis. One instance of these techniques, called HMM-based generation synthesis (or simply HMM-based synthesis), has recently been shown to be very effective in synthesizing acceptable speech. This paper also contrasts these techniques with the more conventional unit selection technology that has dominated speech synthesis over the last ten years. The advantages and disadvantages of statistical parametric synthesis are highlighted, and we identify where we expect the key developments to appear in the immediate future.

https://www.sp.nitech.ac.jp/~bonanza/Paper/EMIME/zen_specom.pdf

Statistical Parametric Speech Synthesis

A statistical parametric speech synthesis system based on hidden Markov models (HMMs) has grown in popularity over the last few years. This system simultaneously models spectrum, excitation, and duration of speech using context-dependent HMMs and generates speech waveforms from the HMMs themselves. Since December 2002, we have publicly released an open-source software toolkit named HMM-based speech synthesis system (HTS) to provide a research and development platform for the speech synthesis community. In December 2006, HTS version 2.0 was released. This version includes a number of new features which are useful for both speech synthesis researchers and developers. This paper describes HTS version 2.0 in detail, as well as future release plans.

https://www.sp.nitech.ac.jp/~zen/english/index.php?plugin=attach&refer=Publications%2FInternational%20conferences&openfile=zen-ssw6.pdf

The HMM-based speech synthesis system (HTS) version 2.0.

This paper gives a general overview of hidden Markov model (HMM)-based speech synthesis, which has recently been demonstrated to be very effective in synthesizing speech. The main advantage of this approach is its flexibility in changing speaker identities, emotions, and speaking styles. This paper also discusses the relation between the HMM-based approach and the more conventional unit-selection approach that has dominated over the last decades. Finally, advanced techniques for future developments are described.

/pdf/speech-synthesis-based-on-hidden-markov-models-45y6v1rk5u.pdf

Speech Synthesis Based on Hidden Markov Models

People convey their emotional state in their face and voice. We present an audio-visual dataset uniquely suited for the study of multi-modal emotion expression and perception. The dataset consists of facial and vocal emotional expressions in sentences spoken in a range of basic emotional states (happy, sad, anger, fear, disgust, and neutral). 7,442 clips of 91 actors with diverse ethnic backgrounds were rated by multiple raters in three modalities: audio, visual, and audio-visual. Categorical emotion labels and real-value intensity values for the perceived emotion were collected using crowd-sourcing from 2,443 raters. The human recognition rates of intended emotion for the audio-only, visual-only, and audio-visual data are 40.9, 58.2, and 63.6 percent, respectively. Recognition rates are highest for neutral, followed by happy, anger, disgust, fear, and sad. Average intensity levels of emotion are rated highest for visual-only perception. The accurate recognition of disgust and fear requires simultaneous audio-visual cues, while anger and happiness can be well recognized based on evidence from a single modality. The large dataset we introduce can be used to probe other questions concerning the audio-visual perception of emotion.

/pdf/crema-d-crowd-sourced-emotional-multimodal-actors-dataset-5egzc5gowo.pdf

CREMA-D: Crowd-Sourced Emotional Multimodal Actors Dataset

An increasingly common scenario in building speech synthesis and recognition systems is training on inhomogeneous data. This paper proposes a new framework for estimating hidden Markov models on data containing both multiple speakers and multiple languages. The proposed framework, speaker and language factorization, attempts to factorize speaker-/language-specific characteristics in the data and then model them using separate transforms. Language-specific factors in the data are represented by transforms based on cluster mean interpolation with cluster-dependent decision trees. Acoustic variations caused by speaker characteristics are handled by transforms based on constrained maximum-likelihood linear regression. Experimental results on statistical parametric speech synthesis show that the proposed framework enables data from multiple speakers in different languages to be used to: train a synthesis system; synthesize speech in a language using speaker characteristics estimated in a different language; and adapt to a new language.
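As a rough illustration of the speaker transform named in this abstract, constrained maximum-likelihood linear regression can be viewed as an affine transform applied to the features, with a log-determinant term added to the likelihood. The sketch below assumes a single diagonal-covariance Gaussian. The function name and all values are hypothetical, and it does not reproduce the paper's cluster-based language transforms.

```python
import numpy as np

def cmllr_log_likelihood(x, A, b, mean, var):
    """Log-likelihood of feature x under a diagonal Gaussian after a
    constrained (feature-space) affine transform y = A x + b.

    The log|det A| term accounts for the change of variables.
    """
    y = A @ x + b
    _, log_abs_det = np.linalg.slogdet(A)
    gauss = -0.5 * np.sum(np.log(2.0 * np.pi * var) + (y - mean) ** 2 / var)
    return float(gauss + log_abs_det)

# Hypothetical toy values: a 2-dimensional Gaussian with unit variance.
mean = np.array([1.0, -1.0])
var = np.array([1.0, 1.0])

# The identity transform leaves the ordinary Gaussian log-likelihood unchanged.
ll_identity = cmllr_log_likelihood(mean, np.eye(2), np.zeros(2), mean, var)

# A scaling transform maps x = mean / 2 onto the mean and adds log|det A|.
ll_scaled = cmllr_log_likelihood(mean / 2.0, 2.0 * np.eye(2), np.zeros(2), mean, var)
```

In a full system one such transform would be estimated per speaker by maximizing this likelihood over that speaker's adaptation data. The estimation step is omitted here.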

/pdf/statistical-parametric-speech-synthesis-based-on-speaker-and-4s9qvs51ql.pdf

Statistical Parametric Speech Synthesis Based on Speaker and Language Factorization

We describe an approach to crowdsource the evaluation of TTS systems by preference tests and report on lessons learnt from running 127 real-life crowdsourced tests. We show that at least one type of cheating becomes more prevalent over time if left unchecked and develop metrics to exclude cheaters. We demonstrate that their exclusion improves test outcomes. Index Terms: TTS, speech synthesis, listening test, preference test, crowdsourcing, cheating

/pdf/crowdsourcing-preference-tests-and-how-to-detect-cheating-2vteozxbp8.pdf

Crowdsourcing Preference Tests, and How to Detect Cheating.

This paper explores the potential universality of neural vocoders. We train a WaveRNN-based vocoder on 74 speakers coming from 17 languages. This vocoder is shown to be capable of generating speech of consistently good quality (98% relative mean MUSHRA when compared to natural speech) regardless of whether the input spectrogram comes from a speaker or style seen during training or from an out-of-domain scenario when the recording conditions are studio-quality. When the recordings show significant changes in quality, or when moving towards non-speech vocalizations or singing, the vocoder still significantly outperforms speaker-dependent vocoders, but operates at a lower average relative MUSHRA of 75%. These results are shown to be consistent across languages, regardless of whether they were seen during training (e.g. English or Japanese) or unseen (e.g. Wolof, Swahili, Amharic).

/pdf/towards-achieving-robust-universal-neural-vocoding-4k1k6la0js.pdf

Towards achieving robust universal neural vocoding

New approach to the polyglot speech generation by means of an HMM-based speaker adaptable synthesizer

This paper proposes a new F0 model for speech synthesis based on the parameterization of the logF0 contour of the syllables. This parameterization consists of the N-order discrete cosine transform (DCT) plus some additional parameters such as the gradient of the syllable average pitch. A statistical model of the syllable pitch contour is then created by clustering the parameterized vectors with a decision tree. Similar statistical models are also created for linguistic levels other than the syllable. For synthesis, the statistical model of each level is used to define a log-likelihood function for the input text. These functions are then weighted and added into a global log-likelihood function which is then maximized with respect to the DCT coefficients of the syllable model. The final logF0 contour is obtained from the inverse transformation of the syllable DCT coefficients. A subjective test showed a clear preference for the proposed model against our previous HMM-based baseline. Index Terms: speech synthesis, HMM-based synthesis, prosody, discrete cosine transform
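The DCT parameterization described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses an orthonormal DCT-II, the function names and the toy contour are hypothetical, and the additional parameters (such as the gradient of the syllable average pitch), the decision-tree clustering, and the likelihood maximization are all omitted.

```python
import numpy as np

def dct_basis(num_frames, order):
    """Orthonormal DCT-II basis with shape (order, num_frames).

    Assumes order <= num_frames.
    """
    n = np.arange(num_frames)
    k = np.arange(order)[:, None]
    basis = np.cos(np.pi * (n + 0.5) * k / num_frames)
    basis *= np.sqrt(2.0 / num_frames)
    basis[0] /= np.sqrt(2.0)
    return basis

def parameterize(logf0, order):
    """Represent a syllable's logF0 contour by its first `order` DCT coefficients."""
    return dct_basis(len(logf0), order) @ logf0

def reconstruct(coeffs, num_frames):
    """Inverse transform: rebuild a logF0 contour of num_frames frames."""
    return dct_basis(num_frames, len(coeffs)).T @ coeffs

# Hypothetical syllable: 20 frames of F0 rising from 120 Hz to 180 Hz.
logf0 = np.log(np.linspace(120.0, 180.0, 20))

coeffs = parameterize(logf0, order=5)        # compact 5-coefficient description
smooth = reconstruct(coeffs, len(logf0))     # smoothed contour from 5 coefficients
exact = reconstruct(parameterize(logf0, order=20), 20)  # full order is lossless
```

Truncating to a few coefficients keeps the overall shape of the contour and discards fine detail. That gives each syllable a fixed-length vector regardless of its duration, which is convenient for clustering.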

Javier Latorre

Papers

Statistical Parametric Speech Synthesis Based on Speaker and Language Factorization

Crowdsourcing Preference Tests, and How to Detect Cheating.

Towards achieving robust universal neural vocoding

New approach to the polyglot speech generation by means of an HMM-based speaker adaptable synthesizer

Multilevel parametric-base F0 model for speech synthesis