Proceedings ArticleDOI

Natural Text-to-Speech Synthesis by Conditioning Spectrogram Predictions from Transformer Network on WaveGlow Vocoder

TL;DR: This work examines some of the issues and limitations present in current TTS systems, specifically Tacotron-2, and attempts to further improve performance by modifying its architecture.
Abstract: Text to Speech (TTS) is a form of speech synthesis in which text is converted into a spoken, human-like voice output. State-of-the-art TTS methods employ neural-network-based approaches. This work examines some of the issues and limitations present in current work, specifically Tacotron-2, and attempts to further improve its performance by modifying its architecture. The modified model uses a Transformer network as the Spectrogram Prediction Network (SPN) and WaveGlow as the Audio Generation Network (AGN). For the modified model, improvements are observed in the quality of the speech generated for the corresponding text and in the inference time for audio generation, and a Mean Opinion Score (MOS) of 4.10 (out of 5) is obtained.
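The abstract describes a two-stage pipeline: a Transformer-based spectrogram prediction network followed by a WaveGlow vocoder. The sketch below is a minimal illustration of that structure in PyTorch; it is not the authors' implementation, and the module sizes, the pre-net, and the stubbed vocoder call are assumptions.

```python
# A minimal sketch (not the authors' code) of the two-stage pipeline: a
# Transformer spectrogram-prediction network (SPN) maps character embeddings
# to mel frames, and a separately trained WaveGlow vocoder (stubbed here)
# would invert the mel spectrogram into a waveform.
import torch
import torch.nn as nn

class TransformerSPN(nn.Module):
    def __init__(self, vocab_size=100, d_model=256, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.prenet = nn.Linear(n_mels, d_model)          # projects previous mel frames
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=3, num_decoder_layers=3,
            batch_first=True)
        self.mel_head = nn.Linear(d_model, n_mels)        # per-frame mel prediction
        self.stop_head = nn.Linear(d_model, 1)            # end-of-utterance logit

    def forward(self, text_ids, mel_inputs):
        src = self.embed(text_ids)                        # (B, T_text, d_model)
        tgt = self.prenet(mel_inputs)                     # (B, T_mel, d_model)
        dec = self.transformer(src, tgt)
        return self.mel_head(dec), self.stop_head(dec)

# Vocoder stage: a trained WaveGlow model would then turn mels into audio,
# e.g. audio = waveglow.infer(mel)  -- placeholder for the real vocoder call.
spn = TransformerSPN()
text = torch.randint(0, 100, (1, 40))                     # dummy character ids
mels = torch.zeros(1, 120, 80)                            # teacher-forced mel frames
pred_mel, stop_logits = spn(text, mels)
print(pred_mel.shape)                                     # torch.Size([1, 120, 80])
```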
Citations
Proceedings ArticleDOI
21 Mar 2022
TL;DR: This paper addresses enhancing the spoofing robustness of an automatic speaker verification (ASV) system without a separate countermeasure module, employing three unsupervised domain adaptation techniques to optimize the back-end using audio data from the training partition of the ASVspoof 2019 dataset.
Abstract: In this paper, we initiate the concern of enhancing the spoofing robustness of the automatic speaker verification (ASV) system, without the primary presence of a separate countermeasure module. We start from the standard ASV framework of the ASVspoof 2019 baseline and approach the problem from the back-end classifier based on probabilistic linear discriminant analysis. We employ three unsupervised domain adaptation techniques to optimize the back-end using the audio data in the training partition of the ASVspoof 2019 dataset. We demonstrate notable improvements on both logical and physical access scenarios, especially on the latter where the system is attacked by replayed audios, with a maximum of 36.1% and 5.3% relative improvement on bonafide and spoofed cases, respectively. We perform additional studies such as per-attack breakdown analysis, data composition, and integration with a countermeasure system at score-level with Gaussian back-end.

3 citations
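The work above optimizes a PLDA back-end with unsupervised domain adaptation on unlabeled ASVspoof audio. As one illustration of this kind of adaptation, the sketch below applies CORAL-style covariance alignment to speaker embeddings before back-end scoring; the choice of CORAL, the embedding dimensions, and the variable names are assumptions, not necessarily the paper's exact techniques.

```python
# Illustrative CORAL-style second-order alignment of speaker embeddings for an
# unsupervised-adapted Gaussian/PLDA back-end. This is a sketch of one possible
# domain-adaptation technique, not the authors' exact recipe.
import numpy as np

def _matrix_power(mat, p):
    """Symmetric matrix power via eigendecomposition (mat must be PSD)."""
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * np.power(vals, p)) @ vecs.T

def coral_adapt(out_domain, in_domain, eps=1e-6):
    """Whiten out-of-domain embeddings, then re-color them with the covariance
    of unlabeled in-domain embeddings, aligning second-order statistics."""
    d = out_domain.shape[1]
    cov_out = np.cov(out_domain, rowvar=False) + eps * np.eye(d)
    cov_in = np.cov(in_domain, rowvar=False) + eps * np.eye(d)
    return out_domain @ _matrix_power(cov_out, -0.5) @ _matrix_power(cov_in, 0.5)

rng = np.random.default_rng(0)
src = rng.normal(size=(500, 32))              # out-of-domain training embeddings
tgt = rng.normal(scale=2.0, size=(300, 32))   # unlabeled in-domain embeddings
adapted = coral_adapt(src, tgt)               # adapted vectors for back-end training
print(adapted.shape)                          # (500, 32)
```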

Proceedings ArticleDOI
01 Dec 2022
TL;DR: In this article, a Bi-Projection Fusion (BPF) mechanism is proposed to merge information from the time and STFT domains, retaining the fine-grained resolution of time-domain methods.
Abstract: In recent years, deep neural network (DNN)-based time-domain methods for monaural speech separation have improved substantially under anechoic conditions. However, the performance of these methods degrades under harsher conditions, such as noise or reverberation. Although adopting the Short-Time Fourier Transform (STFT) for feature extraction helps stabilize performance in non-anechoic situations, it inherently loses the fine-grained resolution that is one of the strengths of time-domain methods. Therefore, this study explores combining time-domain and STFT-domain features to retain their beneficial characteristics, and leverages a Bi-Projection Fusion (BPF) mechanism to merge the information from the two domains. To evaluate the effectiveness of the proposed method, experiments are conducted in an anechoic setting on the WSJ0-2mix dataset and in noisy/reverberant settings on the WHAM!/WHAMR! datasets. The results show that, at the cost of negligible degradation on the anechoic dataset, the proposed method improves the performance of existing neural models in more complicated environments.

1 citation
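The Bi-Projection Fusion mechanism mentioned above merges time-domain and STFT-domain features. A minimal sketch of such a gated fusion is given below; the layer sizes and the exact gating form are assumptions rather than the paper's implementation.

```python
# Illustrative gated bi-projection fusion of time-domain and STFT-domain
# features: each domain is projected to a shared space and mixed with a
# learned, per-dimension gate.
import torch
import torch.nn as nn

class BiProjectionFusion(nn.Module):
    def __init__(self, dim_time, dim_stft, dim_out):
        super().__init__()
        self.proj_time = nn.Linear(dim_time, dim_out)   # project time-domain features
        self.proj_stft = nn.Linear(dim_stft, dim_out)   # project STFT-domain features
        self.gate = nn.Linear(2 * dim_out, dim_out)     # learns how to mix the domains

    def forward(self, feat_time, feat_stft):
        a = self.proj_time(feat_time)
        b = self.proj_stft(feat_stft)
        g = torch.sigmoid(self.gate(torch.cat([a, b], dim=-1)))
        return g * a + (1.0 - g) * b                    # convex, per-dimension mix

fusion = BiProjectionFusion(dim_time=128, dim_stft=257, dim_out=128)
x_time = torch.randn(2, 200, 128)    # (batch, frames, time-domain feature dim)
x_stft = torch.randn(2, 200, 257)    # (batch, frames, STFT bins)
print(fusion(x_time, x_stft).shape)  # torch.Size([2, 200, 128])
```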

References
Posted Content
Colin Raffel, Minh-Thang Luong, Peter J. Liu, Ron Weiss, Douglas Eck
TL;DR: This work proposes an end-to-end differentiable method for learning monotonic alignments which, at test time, enables computing attention online and in linear time, and validates the approach on sentence summarization, machine translation, and online speech recognition problems.
Abstract: Recurrent neural network models with an attention mechanism have proven to be extremely effective on a wide variety of sequence-to-sequence problems. However, the fact that soft attention mechanisms perform a pass over the entire input sequence when producing each element in the output sequence precludes their use in online settings and results in a quadratic time complexity. Based on the insight that the alignment between input and output sequence elements is monotonic in many problems of interest, we propose an end-to-end differentiable method for learning monotonic alignments which, at test time, enables computing attention online and in linear time. We validate our approach on sentence summarization, machine translation, and online speech recognition problems and achieve results competitive with existing sequence-to-sequence models.

131 citations
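The reference above learns monotonic alignments that allow online, linear-time attention at test time. The sketch below illustrates that test-time behaviour of hard monotonic attention in a simplified form; the dot-product energy and the 0.5 threshold are assumptions for illustration, not the paper's exact formulation.

```python
# Simplified test-time hard monotonic attention: the decoder scans encoder
# states left-to-right from its previous position and attends to the first
# state whose "choose" probability exceeds 0.5, so attention is computed
# online and in overall linear time.
import torch

def monotonic_attend(query, enc_states, prev_index):
    """Return (context, index) for one decoder step.
    query: (d,), enc_states: (T, d), prev_index: int"""
    t = prev_index
    while t < enc_states.shape[0]:
        p_choose = torch.sigmoid(enc_states[t] @ query)   # selection probability
        if p_choose > 0.5:
            return enc_states[t], t                       # attend here; stays monotonic
        t += 1
    return enc_states[-1], enc_states.shape[0] - 1        # fall back to the last state

enc = torch.randn(50, 64)          # encoder memory
q = torch.randn(64)                # current decoder query
ctx, idx = monotonic_attend(q, enc, prev_index=0)
print(idx, ctx.shape)
```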

Posted Content
TL;DR: In this paper, a flow-based generative model for real-time audio synthesis is proposed, which requires only a single-stage training procedure and a single maximum likelihood loss, without any additional auxiliary terms.
Abstract: Most modern text-to-speech architectures use a WaveNet vocoder for synthesizing high-fidelity waveform audio, but there have been limitations, such as high inference time, in its practical application due to its ancestral sampling scheme. The recently suggested Parallel WaveNet and ClariNet have achieved real-time audio synthesis capability by incorporating inverse autoregressive flow for parallel sampling. However, these approaches require a two-stage training pipeline with a well-trained teacher network and can only produce natural sound by using probability distillation along with auxiliary loss terms. We propose FloWaveNet, a flow-based generative model for raw audio synthesis. FloWaveNet requires only a single-stage training procedure and a single maximum likelihood loss, without any additional auxiliary terms, and it is inherently parallel due to the characteristics of generative flow. The model can efficiently sample raw audio in real-time, with clarity comparable to previous two-stage parallel models. The code and samples for all models, including our FloWaveNet, are publicly available.

32 citations
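FloWaveNet, described above, is trained with a single maximum-likelihood loss made tractable by invertible flow layers. The sketch below shows the core idea with one affine coupling layer and its log-determinant term; the tiny fully connected network here is an assumption, whereas the real model uses dilated convolutions conditioned on mel spectrograms.

```python
# Illustrative affine coupling layer of a normalizing flow: half the channels
# parameterize an invertible affine transform of the other half, and the
# Jacobian log-determinant makes a single maximum-likelihood loss possible.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim):
        super().__init__()
        half = dim // 2
        self.net = nn.Sequential(nn.Linear(half, 64), nn.ReLU(),
                                 nn.Linear(64, dim))     # predicts log-scale and shift

    def forward(self, x):
        xa, xb = x.chunk(2, dim=-1)                      # split channels
        log_s, t = self.net(xa).chunk(2, dim=-1)
        yb = xb * torch.exp(log_s) + t                   # invertible affine transform
        logdet = log_s.sum(dim=-1)                       # Jacobian log-determinant
        return torch.cat([xa, yb], dim=-1), logdet

# Maximum-likelihood training: push z toward a standard normal, plus the logdet.
layer = AffineCoupling(dim=16)
x = torch.randn(8, 16)                                   # stand-in for audio frames
z, logdet = layer(x)
nll = 0.5 * (z ** 2).sum(dim=-1) - logdet                # per-sample negative log-likelihood
print(nll.mean())
```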

Trending Questions (1)
What is the state of the art for speech to text?

The provided paper is about improving the performance of Text-to-Speech (TTS) synthesis using a modified model. It does not mention the state of the art for speech to text.