Open Access Posted Content

Fine-grained style control in Transformer-based Text-to-speech Synthesis.

TLDR
In this paper, fine-grained style control in a Transformer-based text-to-speech synthesis (TransformerTTS) system is proposed by extracting a time sequence of local style tokens (LST) from the reference speech.
Abstract
In this paper, we present a novel architecture to realize fine-grained style control on the transformer-based text-to-speech synthesis (TransformerTTS). Specifically, we model the speaking style by extracting a time sequence of local style tokens (LST) from the reference speech. The existing content encoder in TransformerTTS is then replaced by our designed cross-attention blocks for fusion and alignment between content and style. As the fusion is performed along with the skip connection, our cross-attention block provides a good inductive bias to gradually infuse the phoneme representation with a given style. Additionally, we prevent the style embedding from encoding linguistic content by randomly truncating LST during training and using wav2vec 2.0 features. Experiments show that with fine-grained style control, our system performs better in terms of naturalness, intelligibility, and style transferability. Our code and samples are publicly available.
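To make the fusion mechanism described in the abstract more concrete, the following is a minimal sketch, assuming a PyTorch-style implementation; the class name CrossAttentionFusionBlock, the dimensions, and the layer choices are illustrative assumptions, not the authors' released code. It shows how a cross-attention block with skip connections could gradually infuse a phoneme (content) representation with a time sequence of local style tokens (LST).

```python
# Minimal sketch of a cross-attention fusion block with skip connections.
# All names and dimensions are assumptions, not the paper's released code.
import torch
import torch.nn as nn

class CrossAttentionFusionBlock(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, phoneme: torch.Tensor, lst: torch.Tensor) -> torch.Tensor:
        # phoneme: (batch, n_phonemes, d_model) content representation
        # lst:     (batch, n_frames,   d_model) local style tokens from the reference speech
        x = self.norm1(phoneme + self.self_attn(phoneme, phoneme, phoneme)[0])
        # Cross-attention aligns content with the style sequence; adding the result
        # through the skip connection infuses style gradually rather than replacing content.
        x = self.norm2(x + self.cross_attn(x, lst, lst)[0])
        return self.norm3(x + self.ffn(x))
```

In training, the style sequence could be randomly truncated before being passed to such a block, which is one way to realize the abstract's measure for preventing the style embedding from encoding linguistic content.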


Citations
Journal Article

An Emotion Speech Synthesis Method Based on VITS

Wei Zhao et al., 09 Feb 2023
TL;DR: In this article, a new system named Emo-VITS is proposed, which builds on the highly expressive speech synthesis module VITS to realize emotion control in text-to-speech synthesis.
Proceedings Article

Acoustic or Pattern? Speech Spoofing Countermeasure based on Image Pre-training Models

TL;DR: This work counterintuitively applies an image pre-trained CNN model to detect spoofed utterances, and concatenates Jitter and Shimmer features to the output embedding to achieve top-level performance on the ASVspoof 2019 dataset.
Journal Article

Deep Learning Attention Mechanism in Medical Image Analysis: Basics and Beyonds

TL;DR: In this paper, a comprehensive literature survey is conducted to analyze the keywords and literature; the development and technical characteristics of the attention mechanism are then introduced, and the remaining challenges, potential solutions, and future research directions are discussed.
Proceedings Article

HILvoice: Human-in-the-Loop Style Selection for Elder-Facing Speech Synthesis

TL;DR: In this article, a holistic framework is proposed to select a speaking style preferred by older adults rather than the default neutral setting; the preferred style has a slower speaking rate, which coincides with previous studies on the auditory perception of older adults.
References
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Journal Article

Long short-term memory

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Proceedings Article

Attention is All you Need

TL;DR: This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
Proceedings Article

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Proceedings Article

Librispeech: An ASR corpus based on public domain audio books

TL;DR: It is shown that acoustic models trained on LibriSpeech give lower error rates on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.