Open Access Posted Content

Fine-grained style control in Transformer-based Text-to-speech Synthesis.

TLDR
In this paper, fine-grained style control in a Transformer-based text-to-speech synthesis (TransformerTTS) system is proposed by extracting a time sequence of local style tokens (LST) from the reference speech.
Abstract
In this paper, we present a novel architecture to realize fine-grained style control on the transformer-based text-to-speech synthesis (TransformerTTS). Specifically, we model the speaking style by extracting a time sequence of local style tokens (LST) from the reference speech. The existing content encoder in TransformerTTS is then replaced by our designed cross-attention blocks for fusion and alignment between content and style. As the fusion is performed along with the skip connection, our cross-attention block provides a good inductive bias to gradually infuse the phoneme representation with a given style. Additionally, we prevent the style embedding from encoding linguistic content by randomly truncating LST during training and using wav2vec 2.0 features. Experiments show that with fine-grained style control, our system performs better in terms of naturalness, intelligibility, and style transferability. Our code and samples are publicly available.
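To make the fusion mechanism described in the abstract more concrete, the following is a minimal sketch, assuming a PyTorch-style implementation; the class name CrossAttentionFusionBlock, the dimensions, and the layer choices are illustrative assumptions, not the authors' released code. It shows how a cross-attention block with skip connections could gradually infuse a phoneme (content) representation with a time sequence of local style tokens (LST).

```python
# Minimal sketch of a cross-attention fusion block with skip connections.
# All names and dimensions are assumptions, not the paper's released code.
import torch
import torch.nn as nn

class CrossAttentionFusionBlock(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, phoneme: torch.Tensor, lst: torch.Tensor) -> torch.Tensor:
        # phoneme: (batch, n_phonemes, d_model) content representation
        # lst:     (batch, n_frames,   d_model) local style tokens from the reference speech
        x = self.norm1(phoneme + self.self_attn(phoneme, phoneme, phoneme)[0])
        # Cross-attention aligns content with the style sequence; adding the result
        # through the skip connection infuses style gradually rather than replacing content.
        x = self.norm2(x + self.cross_attn(x, lst, lst)[0])
        return self.norm3(x + self.ffn(x))
```

In training, the style sequence could be randomly truncated before being passed to such a block, which is one way to realize the abstract's measure for preventing the style embedding from encoding linguistic content.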


Citations
Journal Article

An Emotion Speech Synthesis Method Based on VITS

Wei Zhao et al., 09 Feb 2023
TL;DR: In this article, a new system named Emo-VITS is proposed, which builds on the highly expressive speech synthesis module VITS to realize emotion control in text-to-speech synthesis.
Proceedings Article

Acoustic or Pattern? Speech Spoofing Countermeasure based on Image Pre-training Models

TL;DR: This work counterintuitively applies an image pre-trained CNN model to detect spoofed utterances, and concatenates Jitter and Shimmer features to the output embedding to achieve top-level performance on the ASVspoof 2019 dataset.
Journal Article

Deep Learning Attention Mechanism in Medical Image Analysis: Basics and Beyonds

TL;DR: In this paper, a comprehensive literature survey is conducted to analyze the keywords and literature; the development and technical characteristics of the attention mechanism are then introduced, and the remaining challenges, potential solutions, and future research directions are discussed.
Proceedings Article

HILvoice: Human-in-the-Loop Style Selection for Elder-Facing Speech Synthesis

TL;DR: In this article, a holistic framework is proposed to select a speaking style preferred by older adults rather than the default neutral setting; the preferred style has a slower speaking rate, which coincides with previous studies on the auditory perception of older adults.
References
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Journal Article

Long short-term memory

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Proceedings Article

Attention is All you Need

TL;DR: This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
Proceedings Article

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Proceedings Article

Librispeech: An ASR corpus based on public domain audio books

TL;DR: It is shown that acoustic models trained on LibriSpeech give lower error rates on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.