Open AccessJournal ArticleDOI

Modeling Prosodic Phrasing With Multi-Task Learning in Tacotron-Based TTS

TLDR
This letter proposes a multi-task learning scheme for Tacotron training that optimizes the system to predict both the Mel spectrum and phrase breaks, and shows that the proposed training scheme consistently improves voice quality for both Chinese and Mongolian systems.
Abstract
Tacotron-based end-to-end speech synthesis has shown remarkable voice quality. However, the rendering of prosody in the synthesized speech remains to be improved, especially for long sentences, where prosodic phrasing errors can occur frequently. In this letter, we extend the Tacotron-based speech synthesis framework to explicitly model prosodic phrase breaks. We propose a multi-task learning scheme for Tacotron training that optimizes the system to predict both the Mel spectrum and phrase breaks. To the best of our knowledge, this is the first implementation of multi-task learning for Tacotron-based TTS with a prosodic phrasing model. Experiments show that our proposed training scheme consistently improves voice quality for both Chinese and Mongolian systems.
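The multi-task scheme described in the abstract amounts to jointly minimizing a spectrum reconstruction loss and a phrase-break prediction loss. A minimal sketch of such a combined objective, assuming an L2 Mel loss, a binary cross-entropy break loss, and a weighting factor `lam` (all illustrative choices, not taken from the paper):

```python
import numpy as np

def multitask_loss(mel_pred, mel_true, break_pred, break_true, lam=0.5):
    """Combined training objective: Mel-spectrum reconstruction plus
    phrase-break classification. The loss forms and the weight `lam`
    are assumptions for illustration, not the paper's exact choices."""
    # L2 reconstruction loss over the predicted Mel spectrum.
    mel_loss = np.mean((mel_pred - mel_true) ** 2)
    # Binary cross-entropy over per-token break probabilities,
    # clipped for numerical stability.
    eps = 1e-7
    p = np.clip(break_pred, eps, 1 - eps)
    break_loss = -np.mean(break_true * np.log(p)
                          + (1 - break_true) * np.log(1 - p))
    return mel_loss + lam * break_loss
```

In practice the two terms would be computed from the Tacotron decoder output and a break-prediction head; the relative weight controls how strongly phrase-break supervision shapes the shared encoder representation.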


Citations
Journal ArticleDOI

An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning

TL;DR: This article provides a comprehensive overview of state-of-the-art voice conversion techniques and their performance evaluation methods, from statistical approaches to deep learning, and discusses their promise and limitations.
Journal ArticleDOI

Exploiting Morphological and Phonological Features to Improve Prosodic Phrasing for Mongolian Speech Synthesis

TL;DR: Both objective and subjective evaluations validate the effectiveness of the proposed phrase break prediction framework, which consistently improves voice quality in a Mongolian text-to-speech synthesis system.
Posted Content

GraphSpeech: Syntax-Aware Graph Attention Network For Neural Speech Synthesis

TL;DR: A novel neural TTS model, denoted GraphSpeech, is proposed and formulated under a graph neural network framework; it consistently outperforms the Transformer TTS baseline in the spectrum and prosody rendering of utterances.
Posted Content

A Survey on Neural Speech Synthesis.

TL;DR: This paper presents a comprehensive survey of neural text-to-speech (TTS), focusing on the key components of neural TTS: text analysis, acoustic models, and vocoders.
References
Journal ArticleDOI

Multitask Learning

Rich Caruana
TL;DR: Multitask learning is an approach to inductive transfer that improves generalization by using the domain information contained in the training signals of related tasks as an inductive bias; tasks are learned in parallel with a shared representation, so what is learned for each task can help the other tasks be learned better.
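The shared-representation idea in Caruana's formulation can be illustrated with a toy network in which two task heads read from one jointly trained hidden layer. A minimal numpy sketch (layer sizes and task names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# One hidden layer shared by both tasks: gradients from either task
# head would update W_shared, so each task acts as an inductive bias
# for the other.
W_shared = rng.normal(size=(4, 8))
W_task_a = rng.normal(size=(8, 2))   # head for the main task
W_task_b = rng.normal(size=(8, 1))   # head for an auxiliary task

def forward(x):
    h = np.tanh(x @ W_shared)        # shared representation
    return h @ W_task_a, h @ W_task_b

out_a, out_b = forward(np.ones((3, 4)))
```

Training would backpropagate both task losses through the shared weights in parallel, which is exactly the mechanism the Tacotron paper above exploits for phrase-break prediction.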
Proceedings ArticleDOI

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

TL;DR: This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text, composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms.
Posted Content

A Survey on Multi-Task Learning

TL;DR: Multi-task learning (MTL) is a machine-learning paradigm that aims to leverage useful information contained in multiple related tasks to improve the generalization performance of all of them.
Proceedings ArticleDOI

Statistical parametric speech synthesis using deep neural networks

TL;DR: This paper examines an alternative scheme based on a deep neural network (DNN), in which the relationship between input texts and their acoustic realizations is modeled by the DNN; experimental results show that the DNN-based systems outperformed the HMM-based systems with similar numbers of parameters.