Proceedings ArticleDOI

Natural Text-to-Speech Synthesis by Conditioning Spectrogram Predictions from Transformer Network on WaveGlow Vocoder

TL;DR: This work examines some of the issues and limitations present in current TTS systems, specifically Tacotron-2, and attempts to further improve performance by modifying its architecture.
Abstract: Text to Speech (TTS) is a form of speech synthesis in which text is converted into a spoken, human-like voice output. State-of-the-art TTS methods employ neural-network-based approaches. This work examines some of the issues and limitations present in current work, specifically Tacotron-2, and attempts to further improve its performance by modifying its architecture. The modified model uses a Transformer network as the Spectrogram Prediction Network (SPN) and WaveGlow as the Audio Generation Network (AGN). For the modified model, improvements are observed in the quality of the speech generated for the corresponding text and in the inference time for audio generation, and a Mean Opinion Score (MOS) of 4.10 (out of 5) is obtained.
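The abstract describes a two-stage pipeline: a Transformer-based spectrogram prediction network followed by a WaveGlow vocoder. The sketch below is a minimal illustration of that structure in PyTorch; it is not the authors' implementation, and the module sizes, the pre-net, and the stubbed vocoder call are assumptions.

```python
# A minimal sketch (not the authors' code) of the two-stage pipeline: a
# Transformer spectrogram-prediction network (SPN) maps character embeddings
# to mel frames, and a separately trained WaveGlow vocoder (stubbed here)
# would invert the mel spectrogram into a waveform.
import torch
import torch.nn as nn

class TransformerSPN(nn.Module):
    def __init__(self, vocab_size=100, d_model=256, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.prenet = nn.Linear(n_mels, d_model)          # projects previous mel frames
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=3, num_decoder_layers=3,
            batch_first=True)
        self.mel_head = nn.Linear(d_model, n_mels)        # per-frame mel prediction
        self.stop_head = nn.Linear(d_model, 1)            # end-of-utterance logit

    def forward(self, text_ids, mel_inputs):
        src = self.embed(text_ids)                        # (B, T_text, d_model)
        tgt = self.prenet(mel_inputs)                     # (B, T_mel, d_model)
        dec = self.transformer(src, tgt)
        return self.mel_head(dec), self.stop_head(dec)

# Vocoder stage: a trained WaveGlow model would then turn mels into audio,
# e.g. audio = waveglow.infer(mel)  -- placeholder for the real vocoder call.
spn = TransformerSPN()
text = torch.randint(0, 100, (1, 40))                     # dummy character ids
mels = torch.zeros(1, 120, 80)                            # teacher-forced mel frames
pred_mel, stop_logits = spn(text, mels)
print(pred_mel.shape)                                     # torch.Size([1, 120, 80])
```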
Citations
Proceedings ArticleDOI
21 Mar 2022
TL;DR: This paper addresses enhancing the spoofing robustness of an automatic speaker verification (ASV) system without a separate countermeasure module, employing three unsupervised domain adaptation techniques to optimize the back-end using audio data from the training partition of the ASVspoof 2019 dataset.
Abstract: In this paper, we initiate the concern of enhancing the spoofing robustness of the automatic speaker verification (ASV) system, without the primary presence of a separate countermeasure module. We start from the standard ASV framework of the ASVspoof 2019 baseline and approach the problem from the back-end classifier based on probabilistic linear discriminant analysis. We employ three unsupervised domain adaptation techniques to optimize the back-end using the audio data in the training partition of the ASVspoof 2019 dataset. We demonstrate notable improvements on both logical and physical access scenarios, especially on the latter where the system is attacked by replayed audios, with a maximum of 36.1% and 5.3% relative improvement on bonafide and spoofed cases, respectively. We perform additional studies such as per-attack breakdown analysis, data composition, and integration with a countermeasure system at score-level with Gaussian back-end.

3 citations
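The work above optimizes a PLDA back-end with unsupervised domain adaptation on unlabeled ASVspoof audio. As one illustration of this kind of adaptation, the sketch below applies CORAL-style covariance alignment to speaker embeddings before back-end scoring; the choice of CORAL, the embedding dimensions, and the variable names are assumptions, not necessarily the paper's exact techniques.

```python
# Illustrative CORAL-style second-order alignment of speaker embeddings for an
# unsupervised-adapted Gaussian/PLDA back-end. This is a sketch of one possible
# domain-adaptation technique, not the authors' exact recipe.
import numpy as np

def _matrix_power(mat, p):
    """Symmetric matrix power via eigendecomposition (mat must be PSD)."""
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * np.power(vals, p)) @ vecs.T

def coral_adapt(out_domain, in_domain, eps=1e-6):
    """Whiten out-of-domain embeddings, then re-color them with the covariance
    of unlabeled in-domain embeddings, aligning second-order statistics."""
    d = out_domain.shape[1]
    cov_out = np.cov(out_domain, rowvar=False) + eps * np.eye(d)
    cov_in = np.cov(in_domain, rowvar=False) + eps * np.eye(d)
    return out_domain @ _matrix_power(cov_out, -0.5) @ _matrix_power(cov_in, 0.5)

rng = np.random.default_rng(0)
src = rng.normal(size=(500, 32))              # out-of-domain training embeddings
tgt = rng.normal(scale=2.0, size=(300, 32))   # unlabeled in-domain embeddings
adapted = coral_adapt(src, tgt)               # adapted vectors for back-end training
print(adapted.shape)                          # (500, 32)
```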

Proceedings ArticleDOI
01 Dec 2022
TL;DR: In this article, a Bi-Projection Fusion (BPF) mechanism is proposed to merge information from the time and STFT domains, retaining the fine-grained resolution of time-domain methods.
Abstract: In recent years, deep neural network (DNN)-based time-domain methods for monaural speech separation have improved substantially under anechoic conditions. However, the performance of these methods degrades under harsher conditions, such as noise or reverberation. Although adopting the Short-Time Fourier Transform (STFT) for feature extraction helps stabilize performance in non-anechoic situations, it inherently loses the fine-grained resolution that is one of the strengths of time-domain methods. Therefore, this study explores combining time-domain and STFT-domain features to retain their beneficial characteristics, and leverages a Bi-Projection Fusion (BPF) mechanism to merge the information from the two domains. To evaluate the effectiveness of the proposed method, experiments are conducted in an anechoic setting on the WSJ0-2mix dataset and in noisy/reverberant settings on the WHAM!/WHAMR! datasets. The results show that, at the cost of negligible degradation on the anechoic dataset, the proposed method improves the performance of existing neural models in more complicated environments.

1 citation
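The Bi-Projection Fusion mechanism mentioned above merges time-domain and STFT-domain features. A minimal sketch of such a gated fusion is given below; the layer sizes and the exact gating form are assumptions rather than the paper's implementation.

```python
# Illustrative gated bi-projection fusion of time-domain and STFT-domain
# features: each domain is projected to a shared space and mixed with a
# learned, per-dimension gate.
import torch
import torch.nn as nn

class BiProjectionFusion(nn.Module):
    def __init__(self, dim_time, dim_stft, dim_out):
        super().__init__()
        self.proj_time = nn.Linear(dim_time, dim_out)   # project time-domain features
        self.proj_stft = nn.Linear(dim_stft, dim_out)   # project STFT-domain features
        self.gate = nn.Linear(2 * dim_out, dim_out)     # learns how to mix the domains

    def forward(self, feat_time, feat_stft):
        a = self.proj_time(feat_time)
        b = self.proj_stft(feat_stft)
        g = torch.sigmoid(self.gate(torch.cat([a, b], dim=-1)))
        return g * a + (1.0 - g) * b                    # convex, per-dimension mix

fusion = BiProjectionFusion(dim_time=128, dim_stft=257, dim_out=128)
x_time = torch.randn(2, 200, 128)    # (batch, frames, time-domain feature dim)
x_stft = torch.randn(2, 200, 257)    # (batch, frames, STFT bins)
print(fusion(x_time, x_stft).shape)  # torch.Size([2, 200, 128])
```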

References
Posted Content
Colin Raffel, Minh-Thang Luong, Peter J. Liu, Ron Weiss, Douglas Eck
TL;DR: This work proposes an end-to-end differentiable method for learning monotonic alignments which, at test time, enables computing attention online and in linear time, and validates the approach on sentence summarization, machine translation, and online speech recognition problems.
Abstract: Recurrent neural network models with an attention mechanism have proven to be extremely effective on a wide variety of sequence-to-sequence problems. However, the fact that soft attention mechanisms perform a pass over the entire input sequence when producing each element in the output sequence precludes their use in online settings and results in a quadratic time complexity. Based on the insight that the alignment between input and output sequence elements is monotonic in many problems of interest, we propose an end-to-end differentiable method for learning monotonic alignments which, at test time, enables computing attention online and in linear time. We validate our approach on sentence summarization, machine translation, and online speech recognition problems and achieve results competitive with existing sequence-to-sequence models.

131 citations
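The reference above learns monotonic alignments that allow online, linear-time attention at test time. The sketch below illustrates that test-time behaviour of hard monotonic attention in a simplified form; the dot-product energy and the 0.5 threshold are assumptions for illustration, not the paper's exact formulation.

```python
# Simplified test-time hard monotonic attention: the decoder scans encoder
# states left-to-right from its previous position and attends to the first
# state whose "choose" probability exceeds 0.5, so attention is computed
# online and in overall linear time.
import torch

def monotonic_attend(query, enc_states, prev_index):
    """Return (context, index) for one decoder step.
    query: (d,), enc_states: (T, d), prev_index: int"""
    t = prev_index
    while t < enc_states.shape[0]:
        p_choose = torch.sigmoid(enc_states[t] @ query)   # selection probability
        if p_choose > 0.5:
            return enc_states[t], t                       # attend here; stays monotonic
        t += 1
    return enc_states[-1], enc_states.shape[0] - 1        # fall back to the last state

enc = torch.randn(50, 64)          # encoder memory
q = torch.randn(64)                # current decoder query
ctx, idx = monotonic_attend(q, enc, prev_index=0)
print(idx, ctx.shape)
```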

Posted Content
TL;DR: In this paper, a flow-based generative model for real-time audio synthesis is proposed, which requires only a single-stage training procedure and a single maximum likelihood loss, without any additional auxiliary terms.
Abstract: Most modern text-to-speech architectures use a WaveNet vocoder for synthesizing high-fidelity waveform audio, but there have been limitations, such as high inference time, in its practical application due to its ancestral sampling scheme. The recently suggested Parallel WaveNet and ClariNet have achieved real-time audio synthesis capability by incorporating inverse autoregressive flow for parallel sampling. However, these approaches require a two-stage training pipeline with a well-trained teacher network and can only produce natural sound by using probability distillation along with auxiliary loss terms. We propose FloWaveNet, a flow-based generative model for raw audio synthesis. FloWaveNet requires only a single-stage training procedure and a single maximum likelihood loss, without any additional auxiliary terms, and it is inherently parallel due to the characteristics of generative flow. The model can efficiently sample raw audio in real-time, with clarity comparable to previous two-stage parallel models. The code and samples for all models, including our FloWaveNet, are publicly available.

32 citations
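FloWaveNet, described above, is trained with a single maximum-likelihood loss made tractable by invertible flow layers. The sketch below shows the core idea with one affine coupling layer and its log-determinant term; the tiny fully connected network here is an assumption, whereas the real model uses dilated convolutions conditioned on mel spectrograms.

```python
# Illustrative affine coupling layer of a normalizing flow: half the channels
# parameterize an invertible affine transform of the other half, and the
# Jacobian log-determinant makes a single maximum-likelihood loss possible.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim):
        super().__init__()
        half = dim // 2
        self.net = nn.Sequential(nn.Linear(half, 64), nn.ReLU(),
                                 nn.Linear(64, dim))     # predicts log-scale and shift

    def forward(self, x):
        xa, xb = x.chunk(2, dim=-1)                      # split channels
        log_s, t = self.net(xa).chunk(2, dim=-1)
        yb = xb * torch.exp(log_s) + t                   # invertible affine transform
        logdet = log_s.sum(dim=-1)                       # Jacobian log-determinant
        return torch.cat([xa, yb], dim=-1), logdet

# Maximum-likelihood training: push z toward a standard normal, plus the logdet.
layer = AffineCoupling(dim=16)
x = torch.randn(8, 16)                                   # stand-in for audio frames
z, logdet = layer(x)
nll = 0.5 * (z ** 2).sum(dim=-1) - logdet                # per-sample negative log-likelihood
print(nll.mean())
```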

Trending Questions (1)
What is the state of the art for speech to text?

The provided paper is about improving the performance of Text-to-Speech (TTS) synthesis using a modified model. It does not mention the state of the art for speech to text.