Home
/
Authors
/
Hieu-Thi Luong

Author

Hieu-Thi Luong

Other affiliations: Ho Chi Minh City University of Science, Graduate University for Advanced Studies

Bio: Hieu-Thi Luong is an academic researcher from National Institute of Informatics. The author has contributed to research in topics: Speech synthesis & Acoustic model. The author has an hindex of 10, co-authored 22 publications receiving 279 citations. Previous affiliations of Hieu-Thi Luong include Ho Chi Minh City University of Science & Graduate University for Advanced Studies.

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

Adapting and controlling DNN-based speech synthesis using input codes

[...]

Hieu-Thi Luong¹, Shinji Takaki², Gustav Eje Henter², Junichi Yamagishi²•Institutions (2)

Ho Chi Minh City University of Science¹, National Institute of Informatics²

19 Jun 2017

TL;DR: Experimental results show that high-performance multi-speaker models can be constructed using the proposed code vectors with a variety of encoding schemes, and that adaptation and manipulation can be performed effectively using the codes.

...read moreread less

Abstract: Methods for adapting and controlling the characteristics of output speech are important topics in speech synthesis. In this work, we investigated the performance of DNN-based text-to-speech systems that in parallel to conventional text input also take speaker, gender, and age codes as inputs, in order to 1) perform multi-speaker synthesis, 2) perform speaker adaptation using small amounts of target-speaker adaptation data, and 3) modify synthetic speech characteristics based on the input codes. Using a large-scale, studio-quality speech corpus with 135 speakers of both genders and ages between tens and eighties, we performed three experiments: 1) First, we used a subset of speakers to construct a DNN-based, multi-speaker acoustic model with speaker codes. 2) Next, we performed speaker adaptation by estimating code vectors for new speakers via backpropagation from a small amount of adaptation material. 3) Finally, we experimented with manually manipulating input code vectors to alter the gender and/or age characteristics of the synthesised speech. Experimental results show that high-performance multi-speaker models can be constructed using the proposed code vectors with a variety of encoding schemes, and that adaptation and manipulation can be performed effectively using the codes.

...read moreread less

71 citations

Journal Article•DOI•

Wasserstein GAN and Waveform Loss-Based Acoustic Model Training for Multi-Speaker Text-to-Speech Synthesis Systems Using a WaveNet Vocoder

[...]

Yi Zhao¹, Shinji Takaki², Hieu-Thi Luong², Junichi Yamagishi², Daisuke Saito¹, Nobuaki Minematsu¹ - Show less +2 more•Institutions (2)

University of Tokyo¹, National Institute of Informatics²

24 Sep 2018-IEEE Access

TL;DR: In this article, a conditional generative adversarial network (GAN) or its variant, Wasserstein GAN with gradient penalty (WGAN-GP), was used to reduce the mismatched characteristics between natural and generated acoustic features.

...read moreread less

Abstract: WaveNet, which learns directly from speech waveform samples, has been used as an alternative to vocoders and achieved very high-quality synthetic speech in terms of both naturalness and speaker similarity even in multi-speaker text-to-speech synthesis systems. However, the WaveNet vocoder uses acoustic features as local condition parameters, and these parameters need to be accurately predicted by another acoustic model. So far, it is not yet clear how to train this acoustic model, which is problematic because the final quality of synthetic speech is significantly affected by the performance of the acoustic model. Significant degradation occurs, especially when predicted acoustic features have mismatched characteristics compared to natural ones. In order to reduce the mismatched characteristics between natural and generated acoustic features, we propose new frameworks that incorporate either a conditional generative adversarial network (GAN) or its variant, Wasserstein GAN with gradient penalty (WGAN-GP), into multi-speaker speech synthesis that uses the WaveNet vocoder. The GAN generator performs as an acoustic model and its outputs are used as the local condition parameters of the WaveNet. We also extend the GAN frameworks and use the discretized-mixture-of-logistics (DML) loss of a well-trained WaveNet in addition to mean squared error and adversarial losses as parts of objective functions. Experimental results show that acoustic models trained using the WGAN-GP framework using back-propagated DML loss achieves the highest subjective evaluation scores in terms of both quality and speaker similarity.

...read moreread less

53 citations

Journal Article•DOI•

NAUTILUS: A Versatile Voice Cloning System

[...]

Hieu-Thi Luong¹, Junichi Yamagishi¹•Institutions (1)

Graduate University for Advanced Studies¹

30 Oct 2020-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: In this paper, a novel speech synthesis system, called NAUTILUS, is proposed that can generate speech with a target voice either from a text input or a reference utterance of an arbitrary source speaker.

...read moreread less

Abstract: We introduce a novel speech synthesis system, called NAUTILUS, that can generate speech with a target voice either from a text input or a reference utterance of an arbitrary source speaker. By using a multi-speaker speech corpus to train all requisite encoders and decoders in the initial training stage, our system can clone unseen voices using untranscribed speech of target speakers on the basis of the backpropagation algorithm. Moreover, depending on the data circumstance of the target speaker, the cloning strategy can be adjusted to take advantage of additional data and modify the behaviors of text-to-speech (TTS) and/or voice conversion (VC) systems to accommodate the situation. We test the performance of the proposed framework by using deep convolution layers to model the encoders, decoders and WaveNet vocoder. Evaluations show that it achieves comparable quality with state-of-the-art TTS and VC systems when cloning with just five minutes of untranscribed speech. Moreover, it is demonstrated that the proposed framework has the ability to switch between TTS and VC with high speaker consistency, which will be useful for many applications.

...read moreread less

36 citations

Posted Content•

Wasserstein GAN and Waveform Loss-based Acoustic Model Training for Multi-speaker Text-to-Speech Synthesis Systems Using a WaveNet Vocoder

[...]

Yi Zhao¹, Shinji Takaki², Hieu-Thi Luong², Junichi Yamagishi², Daisuke Saito¹, Nobuaki Minematsu¹ - Show less +2 more•Institutions (2)

University of Tokyo¹, National Institute of Informatics²

31 Jul 2018-arXiv: Audio and Speech Processing

TL;DR: Experimental results show that acoustic models trained using the WGAN-GP framework using back-propagated DML loss achieves the highest subjective evaluation scores in terms of both quality and speaker similarity.

...read moreread less

Abstract: Recent neural networks such as WaveNet and sampleRNN that learn directly from speech waveform samples have achieved very high-quality synthetic speech in terms of both naturalness and speaker similarity even in multi-speaker text-to-speech synthesis systems. Such neural networks are being used as an alternative to vocoders and hence they are often called neural vocoders. The neural vocoder uses acoustic features as local condition parameters, and these parameters need to be accurately predicted by another acoustic model. However, it is not yet clear how to train this acoustic model, which is problematic because the final quality of synthetic speech is significantly affected by the performance of the acoustic model. Significant degradation happens, especially when predicted acoustic features have mismatched characteristics compared to natural ones. In order to reduce the mismatched characteristics between natural and generated acoustic features, we propose frameworks that incorporate either a conditional generative adversarial network (GAN) or its variant, Wasserstein GAN with gradient penalty (WGAN-GP), into multi-speaker speech synthesis that uses the WaveNet vocoder. We also extend the GAN frameworks and use the discretized mixture logistic loss of a well-trained WaveNet in addition to mean squared error and adversarial losses as parts of objective functions. Experimental results show that acoustic models trained using the WGAN-GP framework using back-propagated discretized-mixture-of-logistics (DML) loss achieves the highest subjective evaluation scores in terms of both quality and speaker similarity.

...read moreread less

33 citations

Proceedings Article•DOI•

Investigating Accuracy of Pitch-accent Annotations in Neural Network-based Speech Synthesis and Denoising Effects.

[...]

Hieu-Thi Luong¹, Xin Wang¹, Junichi Yamagishi¹, Nobuyuki Nishizawa•Institutions (1)

National Institute of Informatics¹

02 Sep 2018

TL;DR: While an utterance-level Turing test showed that listeners had a difficult time differentiating synthetic speech from natural speech, it further indicated that adding noise to the linguistic features in the training set can partially reduce the effect of the mismatch, regularize the model, and help the system perform better when linguistic features of the test set are noisy.

...read moreread less

Abstract: We investigated the impact of noisy linguistic features on the performance of a Japanese speech synthesis system based on neural network that uses WaveNet vocoder. We compared an ideal system that uses manually corrected linguistic features including phoneme and prosodic information in training and test sets against a few other systems that use corrupted linguistic features. Both subjective and objective results demonstrate that corrupted linguistic features, especially those in the test set, affected the ideal system's performance significantly in a statistical sense due to a mismatched condition between the training and test sets. Interestingly, while an utterance-level Turing test showed that listeners had a difficult time differentiating synthetic speech from natural speech, it further indicated that adding noise to the linguistic features in the training set can partially reduce the effect of the mismatch, regularize the model, and help the system perform better when linguistic features of the test set are noisy.

...read moreread less

18 citations

1
2
3
4
…
5

Cited by

PDF

Open Access

More filters

Journal Article•DOI•

Handbook of the International Phonetic Association: A Guide to the Use of the International Phonetic Alphabet

[...]

Edward J. Vajda

01 Dec 2000-Language

789 citations

Posted Content•

Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

[...]

Yuxuan Wang¹, Daisy Stanton¹, Yu Zhang¹, RJ Skerry-Ryan¹, Eric Battenberg², Joel Shor¹, Ying Xiao¹, Fei Ren, Ye Jia¹, Rif A. Saurous¹ - Show less +6 more•Institutions (2)

Google¹, Baidu²

23 Mar 2018-arXiv: Computation and Language

TL;DR: "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system, learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.

...read moreread less

Abstract: In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained with no explicit labels, yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of significant results. The soft interpretable "labels" they generate can be used to control synthesis in novel ways, such as varying speed and speaking style - independently of the text content. They can also be used for style transfer, replicating the speaking style of a single audio clip across an entire long-form text corpus. When trained on noisy, unlabeled found data, GSTs learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.

...read moreread less

421 citations

Proceedings Article•

Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

[...]

Yuxuan Wang¹, Daisy Stanton¹, Yu Zhang¹, RJ Skerry-Ryan¹, Eric Battenberg², Joel Shor¹, Ying Xiao¹, Fei Ren, Ye Jia¹, Rif A. Saurous¹ - Show less +6 more•Institutions (2)

Google¹, Baidu²

03 Jul 2018

TL;DR: In this article, a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system, is proposed.

...read moreread less

300 citations

Journal Article•DOI•

An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning

[...]

Berrak Sisman¹, Junichi Yamagishi², Simon King³, Haizhou Li⁴•Institutions (4)

Singapore University of Technology and Design¹, National Institute of Informatics², University of Edinburgh³, National University of Singapore⁴

01 Jan 2021-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: This article provides a comprehensive overview of the state-of-the-art of voice conversion techniques and their performance evaluation methods from the statistical approaches to deep learning, and discusses their promise and limitations.

...read moreread less

Abstract: Speaker identity is one of the important characteristics of human speech. In voice conversion, we change the speaker identity from one to another, while keeping the linguistic content unchanged. Voice conversion involves multiple speech processing techniques, such as speech analysis, spectral conversion, prosody conversion, speaker characterization, and vocoding. With the recent advances in theory and practice, we are now able to produce human-like voice quality with high speaker similarity. In this article, we provide a comprehensive overview of the state-of-the-art of voice conversion techniques and their performance evaluation methods from the statistical approaches to deep learning, and discuss their promise and limitations. We will also report the recent Voice Conversion Challenges (VCC), the performance of the current state of technology, and provide a summary of the available resources for voice conversion research.

...read moreread less

187 citations

Proceedings Article•DOI•

An investigation of multi-speaker training for wavenet vocoder

[...]

Tomoki Hayashi¹, Akira Tamamori¹, Kazuhiro Kobayashi¹, Kazuya Takeda¹, Tomoki Toda¹ - Show less +1 more•Institutions (1)

Nagoya University¹

01 Dec 2017

TL;DR: The experimental results demonstrate that 1) the multispeaker WaveNet vocoder still outperforms STRAIGHT in generating known speakers' voices but it is comparable to STRAight in generating unknown speaker's voices, and 2) the multi-speaker training is effective for developing the Wave net vocoder capable of speech modification.

...read moreread less

Abstract: In this paper, we investigate the effectiveness of multi-speaker training for WaveNet vocoder In our previous work, we have demonstrated that our proposed speaker-dependent (SD) WaveNet vocoder, which is trained with a single speaker's speech data, is capable of modeling temporal waveform structure, such as phase information, and makes it possible to generate more naturally sounding synthetic voices compared to conventional high-quality vocoder, STRAIGHT However, it is still difficult to generate synthetic voices of various speakers using the SD-WaveNet due to its speaker-dependent property Towards the development of speaker-independent WaveNet vocoder, we apply multi-speaker training techniques to the WaveNet vocoder and investigate its effectiveness The experimental results demonstrate that 1) the multispeaker WaveNet vocoder still outperforms STRAIGHT in generating known speakers' voices but it is comparable to STRAIGHT in generating unknown speakers' voices, and 2) the multi-speaker training is effective for developing the WaveNet vocoder capable of speech modification

...read moreread less

130 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50

Collapse