Open AccessJournal ArticleDOI

Modeling Prosodic Phrasing With Multi-Task Learning in Tacotron-Based TTS

TLDR
This letter proposes a multi-task learning scheme for Tacotron training that optimizes the system to predict both the Mel spectrum and phrase breaks, and shows that the proposed training scheme consistently improves voice quality for both Chinese and Mongolian systems.
Abstract
Tacotron-based end-to-end speech synthesis has shown remarkable voice quality. However, the rendering of prosody in the synthesized speech remains to be improved, especially for long sentences, where prosodic phrasing errors can occur frequently. In this letter, we extend the Tacotron-based speech synthesis framework to explicitly model prosodic phrase breaks. We propose a multi-task learning scheme for Tacotron training that optimizes the system to predict both the Mel spectrum and phrase breaks. To the best of our knowledge, this is the first implementation of multi-task learning for Tacotron-based TTS with a prosodic phrasing model. Experiments show that our proposed training scheme consistently improves voice quality for both Chinese and Mongolian systems.
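The multi-task scheme described in the abstract amounts to jointly minimizing a spectrum reconstruction loss and a phrase-break prediction loss. A minimal sketch of such a combined objective, assuming an L2 Mel loss, a binary cross-entropy break loss, and a weighting factor `lam` (all illustrative choices, not taken from the paper):

```python
import numpy as np

def multitask_loss(mel_pred, mel_true, break_pred, break_true, lam=0.5):
    """Combined training objective: Mel-spectrum reconstruction plus
    phrase-break classification. The loss forms and the weight `lam`
    are assumptions for illustration, not the paper's exact choices."""
    # L2 reconstruction loss over the predicted Mel spectrum.
    mel_loss = np.mean((mel_pred - mel_true) ** 2)
    # Binary cross-entropy over per-token break probabilities,
    # clipped for numerical stability.
    eps = 1e-7
    p = np.clip(break_pred, eps, 1 - eps)
    break_loss = -np.mean(break_true * np.log(p)
                          + (1 - break_true) * np.log(1 - p))
    return mel_loss + lam * break_loss
```

In practice the two terms would be computed from the Tacotron decoder output and a break-prediction head; the relative weight controls how strongly phrase-break supervision shapes the shared encoder representation.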


Citations
Journal ArticleDOI

An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning

TL;DR: This article provides a comprehensive overview of state-of-the-art voice conversion techniques and their performance evaluation methods, from statistical approaches to deep learning, and discusses their promise and limitations.
Journal ArticleDOI

Exploiting Morphological and Phonological Features to Improve Prosodic Phrasing for Mongolian Speech Synthesis

TL;DR: Both objective and subjective evaluations validate the effectiveness of the proposed phrase break prediction framework, which consistently improves voice quality in a Mongolian text-to-speech synthesis system.
Posted Content

GraphSpeech: Syntax-Aware Graph Attention Network For Neural Speech Synthesis

TL;DR: A novel neural TTS model, denoted GraphSpeech, is proposed and formulated under a graph neural network framework; it consistently outperforms the Transformer TTS baseline in the spectrum and prosody rendering of utterances.
Posted Content

A Survey on Neural Speech Synthesis.

TL;DR: This paper presents a comprehensive survey of neural text-to-speech (TTS), focusing on the key components of neural TTS: text analysis, acoustic models, and vocoders.
References
Journal ArticleDOI

Multitask Learning

Rich Caruana
TL;DR: Multitask learning is an approach to inductive transfer that improves generalization by using the domain information contained in the training signals of related tasks as an inductive bias; tasks are learned in parallel with a shared representation, so what is learned for each task can help the other tasks be learned better.
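The shared-representation idea in Caruana's formulation can be illustrated with a toy network in which two task heads read from one jointly trained hidden layer. A minimal numpy sketch (layer sizes and task names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# One hidden layer shared by both tasks: gradients from either task
# head would update W_shared, so each task acts as an inductive bias
# for the other.
W_shared = rng.normal(size=(4, 8))
W_task_a = rng.normal(size=(8, 2))   # head for the main task
W_task_b = rng.normal(size=(8, 1))   # head for an auxiliary task

def forward(x):
    h = np.tanh(x @ W_shared)        # shared representation
    return h @ W_task_a, h @ W_task_b

out_a, out_b = forward(np.ones((3, 4)))
```

Training would backpropagate both task losses through the shared weights in parallel, which is exactly the mechanism the Tacotron paper above exploits for phrase-break prediction.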
Proceedings ArticleDOI

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

TL;DR: This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text, composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms.
Posted Content

A Survey on Multi-Task Learning

TL;DR: Multi-task learning (MTL) is a machine-learning paradigm that aims to leverage useful information contained in multiple related tasks to improve the generalization performance of all of them.
Proceedings ArticleDOI

Statistical parametric speech synthesis using deep neural networks

TL;DR: This paper examines an alternative scheme based on a deep neural network (DNN), in which the relationship between input texts and their acoustic realizations is modeled by the DNN; experimental results show that the DNN-based systems outperformed the HMM-based systems with similar numbers of parameters.