SciSpace (formerly Typeset)

Roberto Barra-Chicote

Researcher at Amazon.com

Publications: 66
Citations: 984

Roberto Barra-Chicote is a researcher at Amazon.com. His work focuses on speech synthesis and computer science. He has an h-index of 17 and has co-authored 58 publications receiving 707 citations. His previous affiliations include the Technical University of Madrid.

Papers
Proceedings ArticleDOI

Improvements to Prosodic Alignment for Automatic Dubbing

TL;DR: This paper improves the prosodic alignment component of an automatic-dubbing architecture. Compared to previous work, the enhanced prosodic alignment significantly improves prosodic accuracy and yields segmentation that is perceptibly better than, or on par with, manually annotated reference segmentation.
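Prosodic alignment in this context means segmenting the translated script so that its phrases line up with the pauses and segment durations of the original speech. A minimal sketch of that idea, using a brute-force squared-error match of relative segment lengths (all names and the scoring function are illustrative, not the paper's actual method):

```python
from itertools import combinations

def segment_translation(words, target_ratios):
    """Split `words` into len(target_ratios) contiguous segments whose
    relative character lengths best match `target_ratios`, i.e. the
    relative durations of the source-speech segments between pauses.

    Brute-force over break points for clarity; the paper's approach
    uses richer features (speaking rate, pause placement) and search.
    """
    n, k = len(words), len(target_ratios)
    lengths = [len(w) for w in words]
    total = sum(lengths)
    best, best_cost = None, float("inf")
    # Try every way to place k-1 break points between the n words.
    for breaks in combinations(range(1, n), k - 1):
        bounds = (0,) + breaks + (n,)
        cost = 0.0
        for i in range(k):
            seg_len = sum(lengths[bounds[i]:bounds[i + 1]]) / total
            cost += (seg_len - target_ratios[i]) ** 2
        if cost < best_cost:
            best, best_cost = bounds, cost
    return [words[best[i]:best[i + 1]] for i in range(k)]

# Example: split a 5-word translation into two segments whose relative
# lengths approximate source segments covering 40% and 60% of the audio.
result = segment_translation(
    ["hello", "there", "my", "good", "friend"], [0.4, 0.6]
)
```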
Proceedings ArticleDOI

From Speech-to-Speech Translation to Automatic Dubbing

TL;DR: In this paper, the authors present enhancements to a speech-to-speech translation pipeline that enable automatic dubbing of TED Talks from English into Italian. They measure the perceived naturalness of the automatic dubbing and the relative importance of each proposed enhancement.
Posted Content

Using VAEs and Normalizing Flows for One-shot Text-To-Speech Synthesis of Expressive Speech.

TL;DR: The proposed Text-to-Speech method creates an unseen expressive style from a single utterance of expressive speech around one second long. It achieves a 22% KL-divergence reduction while jointly improving perceptual metrics over the state of the art.
Book ChapterDOI

Towards Cross-Lingual Emotion Transplantation

TL;DR: The aim is to learn the nuances of emotional speech in a source language with enough data to adapt an acceptable-quality emotional model by means of CSMAPLR adaptation, and then convert the adaptation function so it can be applied to a different target speaker in a target language, maintaining the speaker's identity while adding emotional information.
Proceedings ArticleDOI

Using VAEs and Normalizing Flows for One-Shot Text-To-Speech Synthesis of Expressive Speech

TL;DR: This article proposes a Text-to-Speech method to create an unseen expressive style from a single utterance of expressive speech around one second long. It enhances the disentanglement capabilities of a state-of-the-art sequence-to-sequence system with a VAE and a Householder Flow.
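The Householder Flow mentioned here is a volume-preserving normalizing flow: a chain of Householder reflections applied to the VAE's latent sample to enrich the posterior beyond a diagonal Gaussian. A minimal sketch of that building block follows; the function name and setup are illustrative, not the paper's implementation, and in practice the reflection vectors are produced by the encoder network rather than fixed.

```python
import numpy as np

def householder_flow(z, vs):
    """Apply a chain of Householder reflections to a latent sample z.

    Each reflection H = I - 2 v v^T / ||v||^2 is orthogonal and
    volume-preserving (|det H| = 1), so the log-determinant term in the
    flow's change-of-variables formula is zero -- which is what makes
    Householder flows cheap to use on top of a VAE posterior.
    """
    for v in vs:
        v = v / np.linalg.norm(v)      # unit reflection vector
        z = z - 2.0 * np.dot(v, z) * v  # reflect z across the hyperplane
    return z

# Example: two reflections applied to a 3-dimensional latent sample.
z = np.array([1.0, 2.0, 3.0])
vs = [np.array([1.0, 0.0, 0.0]), np.array([1.0, 1.0, 0.0])]
out = householder_flow(z, vs)  # norm of z is preserved
```

Because each step is a reflection, the transformed sample keeps the norm of the input, a property that makes the flow easy to sanity-check.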