Natural Text-to-Speech Synthesis by Conditioning Spectrogram Predictions from Transformer Network on WaveGlow Vocoder
14 Nov 2020
TL;DR: This work examines issues and limitations in the current Tacotron-2 architecture and attempts to improve its performance through architectural modifications.
Abstract: Text to Speech (TTS) is a form of speech synthesis where text is converted into a spoken, human-like voice output. The state-of-the-art methods for TTS employ a neural network based approach. This work aims to look at some of the issues and limitations present in current works, specifically Tacotron-2, and attempts to further improve its performance by modifying its architecture. The modified model uses a Transformer network as the Spectrogram Prediction Network (SPN) and WaveGlow as the Audio Generation Network (AGN). For the modified model, performance improvements are seen in the quality of the speech generated for the corresponding text and in the inference time taken for audio generation, and a Mean Opinion Score (MOS) of 4.10 (out of 5) is obtained.
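The two-stage pipeline the abstract describes can be sketched as below; `spn` and `agn` are hypothetical stand-ins for the trained Transformer spectrogram predictor and the WaveGlow vocoder, not the paper's actual interfaces.

```python
import numpy as np

def synthesize(text, spn, agn):
    """Two-stage neural TTS: the Spectrogram Prediction Network (SPN)
    maps text to a mel spectrogram, and the Audio Generation Network
    (AGN, a neural vocoder) inverts that spectrogram into a waveform."""
    mel = spn(text)    # (n_frames, n_mels) mel spectrogram
    audio = agn(mel)   # (n_samples,) waveform
    return audio
```

Keeping the two stages behind separate interfaces is what allows the paper to swap Tacotron-2's components for a Transformer SPN and a WaveGlow AGN independently.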
Citations
21 Mar 2022
TL;DR: This paper addresses enhancing the spoofing robustness of an automatic speaker verification (ASV) system without a separate countermeasure module, by employing three unsupervised domain adaptation techniques to optimize the back-end using audio data from the training partition of the ASVspoof 2019 dataset.
Abstract: In this paper, we address the problem of enhancing the spoofing robustness of the automatic speaker verification (ASV) system without the presence of a separate countermeasure module. We start from the standard ASV framework of the ASVspoof 2019 baseline and approach the problem from the back-end classifier based on probabilistic linear discriminant analysis. We employ three unsupervised domain adaptation techniques to optimize the back-end using the audio data in the training partition of the ASVspoof 2019 dataset. We demonstrate notable improvements on both logical and physical access scenarios, especially on the latter, where the system is attacked by replayed audio, with a maximum of 36.1% and 5.3% relative improvement on bonafide and spoofed cases, respectively. We perform additional studies such as per-attack breakdown analysis, data composition, and integration with a countermeasure system at score level with a Gaussian back-end.
3 citations
01 Dec 2022
TL;DR: In this article, a Bi-Projection Fusion (BPF) mechanism is proposed to merge information between the time and STFT domains, preserving the fine-grained view characteristic of time-domain methods.
Abstract: In recent years, deep neural network (DNN)-based time-domain methods for monaural speech separation have improved substantially under anechoic conditions. However, the performance of these methods degrades under harsher conditions, such as noise or reverberation. Although adopting the Short-Time Fourier Transform (STFT) for feature extraction helps stabilize the performance of these neural methods in non-anechoic situations, it inherently loses the fine-grained view that is one of the distinguishing strengths of time-domain methods. Therefore, this study explores incorporating time-domain and STFT-domain features to retain their respective beneficial characteristics. Furthermore, we leverage a Bi-Projection Fusion (BPF) mechanism to merge the information between the two domains. To evaluate the effectiveness of the proposed method, we conduct experiments in an anechoic setting on the WSJ0-2mix dataset and in noisy/reverberant settings on the WHAM!/WHAMR! datasets. The experiments show that, at the cost of negligible degradation on the anechoic dataset, the proposed method improves the performance of existing neural models in more complicated environments.
1 citation
References
07 May 1996
TL;DR: In this paper, a state transition network is proposed to select and concatenate phonemes from a large speech database to produce a natural realisation of a target phoneme sequence predicted from text annotated with prosodic and phonetic context information.
Abstract: One approach to the generation of natural-sounding synthesized speech waveforms is to select and concatenate units from a large speech database. Units (in the current work, phonemes) are selected to produce a natural realisation of a target phoneme sequence predicted from text which is annotated with prosodic and phonetic context information. We propose that the units in a synthesis database can be considered as a state transition network in which the state occupancy cost is the distance between a database unit and a target, and the transition cost is an estimate of the quality of concatenation of two consecutive units. This framework has many similarities to HMM-based speech recognition. A pruned Viterbi search is used to select the best units for synthesis from the database. This approach to waveform synthesis permits training from natural speech: two methods for training from speech are presented which provide weights which produce more natural speech than can be obtained by hand-tuning.
1,207 citations
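The unit-selection search described in this abstract can be sketched as a beam-pruned Viterbi pass over candidate units. This is an illustrative sketch only: `target_cost` and `join_cost` are hypothetical placeholders for the trained state-occupancy and concatenation-cost measures the paper learns from natural speech.

```python
def select_units(targets, candidates, target_cost, join_cost, beam=5):
    """Beam-pruned Viterbi unit selection.

    The state-occupancy cost is target_cost(target, unit); the
    transition cost is join_cost(prev_unit, unit), estimating the
    quality of concatenating two consecutive database units.
    candidates[i] lists the database units available for targets[i].
    Returns the minimum-cost unit sequence found.
    """
    # Each path is (total_cost, [chosen units so far]); keep `beam` best.
    paths = sorted(
        ((target_cost(targets[0], u), [u]) for u in candidates[0])
    )[:beam]
    for tgt, cands in zip(targets[1:], candidates[1:]):
        new_paths = []
        for u in cands:
            # Extend the surviving path that is cheapest to join to u.
            best = min(paths, key=lambda p: p[0] + join_cost(p[1][-1], u))
            cost = best[0] + join_cost(best[1][-1], u) + target_cost(tgt, u)
            new_paths.append((cost, best[1] + [u]))
        paths = sorted(new_paths)[:beam]
    return min(paths)[1]
```

With learned weights on the two cost terms, this is the same dynamic-programming structure as HMM decoding, which is the analogy the abstract draws.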
20 Aug 2017
TL;DR: Tacotron, as presented in this paper, is an end-to-end generative text-to-speech model that synthesizes speech directly from characters; given <text, audio> pairs, the model can be trained completely from scratch with random initialization.
Abstract: A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may involve brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given <text, audio> pairs, the model can be trained completely from scratch with random initialization. We present several key techniques to make the sequence-to-sequence framework perform well for this challenging task. Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. In addition, since Tacotron generates speech at the frame level, it is substantially faster than sample-level autoregressive methods.
1,144 citations
TL;DR: This work extends the space of probabilistic models using real-valued non-volume preserving (real NVP) transformations, a set of powerful invertible and learnable transformations, resulting in an unsupervised learning algorithm with exact log-likelihood computation, exact sampling, exact inference of latent variables, and an interpretable latent space.
Abstract: Unsupervised learning of probabilistic models is a central yet challenging problem in machine learning. Specifically, designing models with tractable learning, sampling, inference and evaluation is crucial in solving this task. We extend the space of such models using real-valued non-volume preserving (real NVP) transformations, a set of powerful invertible and learnable transformations, resulting in an unsupervised learning algorithm with exact log-likelihood computation, exact sampling, exact inference of latent variables, and an interpretable latent space. We demonstrate its ability to model natural images on four datasets through sampling, log-likelihood evaluation and latent variable manipulations.
908 citations
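The invertibility and exact log-determinant that make real NVP tractable come from its affine coupling layers. The sketch below, with a hypothetical `shift_scale_net` standing in for the learned conditioner, shows why: half the input passes through unchanged, so the Jacobian is triangular and its log-determinant is just the sum of the predicted log-scales.

```python
import numpy as np

def coupling_forward(x, shift_scale_net):
    # Split the features: x1 passes through unchanged, x2 gets an
    # affine transform whose parameters depend only on x1.
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    log_s, t = shift_scale_net(x1)
    y2 = x2 * np.exp(log_s) + t
    # The Jacobian is triangular, so log|det J| = sum of log-scales.
    log_det = log_s.sum(axis=-1)
    return np.concatenate([x1, y2], axis=-1), log_det

def coupling_inverse(y, shift_scale_net):
    # Exact inverse: y1 == x1, so the same net reproduces log_s and t.
    d = y.shape[-1] // 2
    y1, y2 = y[..., :d], y[..., d:]
    log_s, t = shift_scale_net(y1)
    x2 = (y2 - t) * np.exp(-log_s)
    return np.concatenate([y1, x2], axis=-1)
```

Stacking such layers while alternating which half passes through yields a deep, fully invertible model with exact likelihoods; WaveGlow builds its vocoder from the same coupling-layer idea.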
26 May 2013
TL;DR: This paper examines an alternative scheme based on a deep neural network (DNN), in which the relationship between input texts and their acoustic realizations is modeled by a DNN; experimental results show that the DNN-based systems outperformed the HMM-based systems with similar numbers of parameters.
Abstract: Conventional approaches to statistical parametric speech synthesis typically use decision tree-clustered context-dependent hidden Markov models (HMMs) to represent probability densities of speech parameters given texts. Speech parameters are generated from the probability densities to maximize their output probabilities, then a speech waveform is reconstructed from the generated parameters. This approach is reasonably effective but has a couple of limitations, e.g. decision trees are inefficient to model complex context dependencies. This paper examines an alternative scheme that is based on a deep neural network (DNN). The relationship between input texts and their acoustic realizations is modeled by a DNN. The use of the DNN can address some limitations of the conventional approach. Experimental results show that the DNN-based systems outperformed the HMM-based systems with similar numbers of parameters.
880 citations
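The DNN acoustic model this abstract describes is, at its core, a regression network from linguistic features to continuous speech parameters. A minimal forward-pass sketch, with hypothetical layer sizes and a tanh nonlinearity standing in for whatever the trained system uses:

```python
import numpy as np

def acoustic_mlp(x, weights):
    """DNN acoustic model: a linguistic feature vector in, a vector of
    speech parameters (e.g. spectral and excitation features) out.
    `weights` is a list of (W, b) pairs, one per layer.

    Hidden layers use tanh; the output layer is linear, since the
    targets are continuous regression values, not class labels."""
    h = x
    for W, b in weights[:-1]:
        h = np.tanh(h @ W + b)
    W_out, b_out = weights[-1]
    return h @ W_out + b_out
```

Replacing the decision-tree-clustered HMM densities with one such network is what lets the model share parameters across all contexts instead of fragmenting the data across tree leaves.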
"Natural Text-to-Speech Synthesis by..." refers methods in this paper
...In the past couple of decades, the dominantly used methods for TTS were Concatenative Synthesis [1], [2] and Statistical Parametric Speech Synthesis [3]–[5]....
TL;DR: The goal of this survey article is to give a coherent and comprehensive review of the literature around the construction and use of Normalizing Flows for distribution learning to provide context and explanation of the models.
Abstract: Normalizing Flows are generative models which produce tractable distributions where both sampling and density evaluation can be efficient and exact. The goal of this survey article is to give a coherent and comprehensive review of the literature around the construction and use of Normalizing Flows for distribution learning. We aim to provide context and explanation of the models, review current state-of-the-art literature, and identify open questions and promising future directions.
683 citations
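The exact sampling and density evaluation this survey covers both rest on the change-of-variables formula. For an invertible, differentiable map $f$ taking data $x$ to a base variable $z$ with known density $p_Z$:

```latex
\log p_X(x) = \log p_Z\big(f(x)\big)
            + \log \left| \det \frac{\partial f(x)}{\partial x} \right|
```

Flow architectures such as real NVP are designed precisely so that both $f^{-1}$ (for sampling) and the Jacobian log-determinant (for likelihood) are cheap to compute.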