Author

Nobuyuki Nishizawa

Bio: Nobuyuki Nishizawa is an academic researcher from the University of Tokyo. The author has contributed to research in the topics of speech synthesis and filter banks. The author has an h-index of 5 and has co-authored 21 publications receiving 84 citations.

Papers
Proceedings ArticleDOI
02 Sep 2018
TL;DR: While an utterance-level Turing test showed that listeners had a difficult time differentiating synthetic speech from natural speech, it further indicated that adding noise to the linguistic features in the training set can partially reduce the effect of the mismatch, regularize the model, and help the system perform better when linguistic features of the test set are noisy.
Abstract: We investigated the impact of noisy linguistic features on the performance of a neural-network-based Japanese speech synthesis system that uses a WaveNet vocoder. We compared an ideal system that uses manually corrected linguistic features, including phoneme and prosodic information, in both the training and test sets against several other systems that use corrupted linguistic features. Both subjective and objective results demonstrate that corrupted linguistic features, especially those in the test set, significantly degraded the ideal system's performance in a statistical sense because of the mismatched condition between the training and test sets. Interestingly, while an utterance-level Turing test showed that listeners had a difficult time differentiating synthetic speech from natural speech, it further indicated that adding noise to the linguistic features in the training set can partially reduce the effect of the mismatch, regularize the model, and help the system perform better when the linguistic features of the test set are noisy.
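The abstract does not spell out how the training-time noise is injected; below is a minimal sketch of the general idea of corrupting linguistic features during training, assuming a frame-level feature matrix whose leading columns hold a one-hot phoneme identity and whose remaining columns hold continuous prosodic values. The function name, corruption rates, and feature layout are illustrative assumptions, not the authors' setup.

```python
# Sketch only: training-time corruption of linguistic features as a regularizer.
import numpy as np

def corrupt_linguistic_features(feats, n_phoneme_classes, label_flip_rate=0.05,
                                prosody_noise_std=0.1, rng=None):
    """feats: (T, D) array; columns [0, n_phoneme_classes) are assumed to be a
    one-hot phoneme identity, the remaining columns continuous prosodic features."""
    rng = rng or np.random.default_rng()
    noisy = np.array(feats, dtype=float)  # work on a float copy

    # Randomly replace a small fraction of phoneme labels with another class,
    # simulating front-end labelling errors in the training set.
    flip = rng.random(len(noisy)) < label_flip_rate
    random_ids = rng.integers(0, n_phoneme_classes, size=flip.sum())
    noisy[flip, :n_phoneme_classes] = np.eye(n_phoneme_classes)[random_ids]

    # Add Gaussian noise to the continuous prosodic columns.
    noisy[:, n_phoneme_classes:] += rng.normal(
        0.0, prosody_noise_std, size=noisy[:, n_phoneme_classes:].shape)
    return noisy
```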

18 citations

Proceedings ArticleDOI
15 Sep 2019
TL;DR: Using an ensemble multi-speaker model, in which each subsystem is trained on a subset of available data, can further improve the quality of the synthetic speech especially for underrepresented speakers whose training data is limited.
Abstract: When the available data of a target speaker is insufficient to train a high-quality speaker-dependent neural text-to-speech (TTS) system, we can combine data from multiple speakers and train a multi-speaker TTS model instead. Many studies have shown that a neural multi-speaker TTS model trained on small amounts of data combined from multiple speakers can generate synthetic speech with better quality and stability than a speaker-dependent one. However, when the amount of data from each speaker is highly unbalanced, the best approach to making use of the excess data remains unknown. Our experiments showed that simply combining all available data from every speaker to train a multi-speaker model produces performance better than, or at least similar to, that of its speaker-dependent counterpart. Moreover, by using an ensemble multi-speaker model, in which each subsystem is trained on a subset of the available data, we can further improve the quality of the synthetic speech, especially for underrepresented speakers whose training data is limited.
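As a rough illustration of the ensemble idea, the sketch below trains several multi-speaker subsystems, each on a different subset of the pooled corpus, and averages their acoustic feature predictions at synthesis time. The TTSModel class, the random subsetting, and the simple averaging are stand-ins and assumptions, not the paper's architecture or partitioning scheme.

```python
# Sketch only: an ensemble of multi-speaker acoustic models trained on data subsets.
import numpy as np

class TTSModel:
    """Placeholder acoustic model: maps linguistic features (T, D_in) to
    acoustic features (T, D_out). A real system would be a neural network."""
    def __init__(self, d_in, d_out, seed):
        self.w = np.random.default_rng(seed).normal(scale=0.01, size=(d_in, d_out))

    def train(self, utterances):
        # Stand-in for gradient-based training on (linguistic, acoustic) pairs.
        pass

    def predict(self, linguistic):
        return linguistic @ self.w

def train_ensemble(corpus, n_subsystems, d_in, d_out):
    """corpus: list of (linguistic, acoustic) pairs pooled over all speakers.
    Each subsystem is trained on a different (here: random) subset of the pool."""
    rng = np.random.default_rng(0)
    models = []
    for k in range(n_subsystems):
        idx = rng.choice(len(corpus), size=max(1, len(corpus) // 2), replace=False)
        model = TTSModel(d_in, d_out, seed=k)
        model.train([corpus[i] for i in idx])
        models.append(model)
    return models

def synthesize(models, linguistic):
    # Average the subsystems' acoustic feature trajectories before vocoding.
    return np.mean([m.predict(linguistic) for m in models], axis=0)
```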

16 citations

Posted Content
TL;DR: In this article, the authors investigated the impact of noisy linguistic features on the performance of a neural-network-based Japanese speech synthesis system that uses a WaveNet vocoder, and found that adding noise to the linguistic features in the training set can partially reduce the effect of the mismatch, regularize the model, and help the system perform better when the linguistic features of the test set are noisy.
Abstract: We investigated the impact of noisy linguistic features on the performance of a neural-network-based Japanese speech synthesis system that uses a WaveNet vocoder. We compared an ideal system that uses manually corrected linguistic features, including phoneme and prosodic information, in both the training and test sets against several other systems that use corrupted linguistic features. Both subjective and objective results demonstrate that corrupted linguistic features, especially those in the test set, significantly degraded the ideal system's performance in a statistical sense because of the mismatched condition between the training and test sets. Interestingly, while an utterance-level Turing test showed that listeners had a difficult time differentiating synthetic speech from natural speech, it further indicated that adding noise to the linguistic features in the training set can partially reduce the effect of the mismatch, regularize the model, and help the system perform better when the linguistic features of the test set are noisy.

16 citations

Posted Content
TL;DR: In this article, a multi-speaker text-to-speech (TTS) model was proposed to generate synthetic speech with better quality and stability than a speaker-dependent one when the available data of a target speaker is insufficient to train a high-quality speaker-dependent neural TTS system.
Abstract: When the available data of a target speaker is insufficient to train a high-quality speaker-dependent neural text-to-speech (TTS) system, we can combine data from multiple speakers and train a multi-speaker TTS model instead. Many studies have shown that a neural multi-speaker TTS model trained on small amounts of data combined from multiple speakers can generate synthetic speech with better quality and stability than a speaker-dependent one. However, when the amount of data from each speaker is highly unbalanced, the best approach to making use of the excess data remains unknown. Our experiments showed that simply combining all available data from every speaker to train a multi-speaker model produces performance better than, or at least similar to, that of its speaker-dependent counterpart. Moreover, by using an ensemble multi-speaker model, in which each subsystem is trained on a subset of the available data, we can further improve the quality of the synthetic speech, especially for underrepresented speakers whose training data is limited.

10 citations

Proceedings Article
01 Jan 2011
TL;DR: The results indicated that proportional shrinking had significant advantages for the fast rate, whereas HMMs trained from slow speech had a slight advantage for the slow rate.
Abstract: Three speech rate control methods for HMM-based speech synthesis were compared in large-scale subjective evaluations. The methods are 1) synthesizing speech based on HMMs trained from corpora at the target speech rate, 2) stretching or shrinking utterance durations proportionally in waveform generation, and 3) determining state durations based on the ML criterion under a constraint on the utterance duration. The results indicated that proportional shrinking had significant advantages for the fast rate, whereas HMMs trained from slow speech had a slight advantage for the slow rate. We also found an advantage of proportionally shrunk speech from a synthesizer trained on slow speech corpora.
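Two of the three rate-control strategies can be sketched directly. The snippet below shows proportional duration scaling (method 2) and the conventional ML state-duration solution under a total-duration constraint (method 3), assuming per-state Gaussian duration models with means m_i and variances v_i; the closed-form rho follows the standard HSMM duration-control formula, and anything beyond the abstract is an assumption here.

```python
# Sketch only: two duration-based speech rate control strategies.
import numpy as np

def proportional_scaling(durations, rate):
    """Method 2: stretch/shrink all durations by a constant factor.
    rate > 1 means faster speech (shorter durations)."""
    return np.asarray(durations, dtype=float) / rate

def ml_durations_under_total(mean, var, total_frames):
    """Method 3: choose state durations d_i = m_i + rho * v_i, with rho chosen so
    that the utterance lasts exactly total_frames (the standard ML solution under
    a total-duration constraint for Gaussian state duration models)."""
    mean = np.asarray(mean, dtype=float)
    var = np.asarray(var, dtype=float)
    rho = (total_frames - mean.sum()) / var.sum()
    return mean + rho * var

# Example: slow an utterance down to 1.25x its mean duration.
m = np.array([10.0, 6.0, 8.0])   # per-state duration means (frames)
v = np.array([4.0, 1.0, 2.0])    # per-state duration variances
print(ml_durations_under_total(m, v, total_frames=1.25 * m.sum()))
```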

6 citations


Cited by
Proceedings ArticleDOI
12 May 2019
TL;DR: This study proposes a non-AR neural source-filter waveform model that can be trained directly using spectrum-based training criteria and the stochastic gradient descent method; the quality of its synthetic speech is close to that of speech generated by the AR WaveNet.
Abstract: Neural waveform models such as WaveNet are used in many recent text-to-speech systems, but the original WaveNet is quite slow in waveform generation because of its autoregressive (AR) structure. Although faster non-AR models were recently reported, they may be prohibitively complicated due to the use of a distilling training method and a blend of other disparate training criteria. This study proposes a non-AR neural source-filter waveform model that can be trained directly using spectrum-based training criteria and the stochastic gradient descent method. Given the input acoustic features, the proposed model first uses a source module to generate a sine-based excitation signal and then uses a filter module to transform the excitation signal into the output speech waveform. Our experiments demonstrated that the proposed model generated waveforms at least 100 times faster than the AR WaveNet and that the quality of its synthetic speech was close to that of speech generated by the AR WaveNet. Ablation test results showed that both the sine-wave excitation signal and the spectrum-based training criteria were essential to the performance of the proposed model.
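The source module can be illustrated with a short sketch: given a frame-level F0 contour, it produces a sine-based excitation in voiced regions and low-level noise in unvoiced ones. The filter module (a stack of dilated convolutions in the paper) is omitted, and the amplitude, noise level, and nearest-neighbour upsampling below are illustrative assumptions rather than the paper's exact design.

```python
# Sketch only: sine-based excitation generation from a frame-level F0 contour.
import numpy as np

def sine_excitation(f0_frames, frame_shift_s, sr=16000, noise_std=0.003):
    """f0_frames: per-frame F0 in Hz, with 0 marking unvoiced frames.
    Returns a waveform-rate excitation signal."""
    # Upsample F0 to the waveform rate by simple repetition (nearest-neighbour).
    samples_per_frame = int(frame_shift_s * sr)
    f0 = np.repeat(np.asarray(f0_frames, dtype=float), samples_per_frame)

    # Integrate instantaneous frequency to obtain phase, then take a sine.
    phase = 2.0 * np.pi * np.cumsum(f0) / sr
    excitation = 0.1 * np.sin(phase)

    # Unvoiced regions: replace the (constant) sine with low-level Gaussian noise.
    unvoiced = f0 <= 0.0
    excitation[unvoiced] = np.random.normal(0.0, noise_std, size=int(unvoiced.sum()))
    # Small noise floor everywhere, as in sine-plus-noise excitation schemes.
    excitation += np.random.normal(0.0, noise_std, size=excitation.shape)
    return excitation
```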

107 citations

Journal ArticleDOI
TL;DR: It was demonstrated that the NSF models generated waveforms at least 100 times faster than the authors' WaveNet-vocoder, and the quality of the synthetic speech from the best NSF model was comparable to that from WaveNet on a large single-speaker Japanese speech corpus.
Abstract: Neural waveform models have demonstrated better performance than conventional vocoders for statistical parametric speech synthesis. One of the best models, called WaveNet, uses an autoregressive (AR) approach to model the distribution of waveform sampling points, but it has to generate a waveform in a time-consuming sequential manner. Some new models that use inverse-autoregressive flow (IAF) can generate a whole waveform in a one-shot manner but require either a larger amount of training time or a complicated model architecture plus a blend of training criteria. As an alternative to AR and IAF-based frameworks, we propose a neural source-filter (NSF) waveform modeling framework that is straightforward to train and fast to generate waveforms. This framework requires three components to generate waveforms: a source module that generates a sine-based signal as excitation, a non-AR dilated-convolution-based filter module that transforms the excitation into a waveform, and a conditional module that pre-processes the input acoustic features for the source and filter modules. This framework minimizes spectral-amplitude distances for model training, which can be efficiently implemented using short-time Fourier transform routines. As an initial NSF study, we designed three NSF models under the proposed framework and compared them with WaveNet using our deep learning toolkit. It was demonstrated that the NSF models generated waveforms at least 100 times faster than our WaveNet-vocoder, and the quality of the synthetic speech from the best NSF model was comparable to that from WaveNet on a large single-speaker Japanese speech corpus.
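The spectral-amplitude training criterion can be sketched as a log-STFT-amplitude distance between generated and natural waveforms, evaluated at a few analysis resolutions. The window sizes, hop lengths, and the plain NumPy STFT below are illustrative assumptions (and the waveforms are assumed to be at least as long as the largest FFT size), not the toolkit implementation used in the paper.

```python
# Sketch only: a multi-resolution log spectral-amplitude distance between waveforms.
import numpy as np

def _stft_mag(x, n_fft, hop):
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1))

def log_spectral_amplitude_distance(x, y, fft_sizes=(512, 1024, 2048), eps=1e-7):
    """Mean squared log-amplitude distance between waveforms x and y,
    averaged over several STFT resolutions."""
    total = 0.0
    for n_fft in fft_sizes:
        hop = n_fft // 4
        sx, sy = _stft_mag(x, n_fft, hop), _stft_mag(y, n_fft, hop)
        total += np.mean((np.log(sx + eps) - np.log(sy + eps)) ** 2)
    return total / len(fft_sizes)
```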

104 citations

Journal ArticleDOI
30 Jun 2014
TL;DR: The Blizzard Challenge offers a unique insight into progress in text-to-speech synthesis over the last decade by using a very large listening test to compare the performance of a wide range of systems that have been constructed using a common corpus of speech recordings.
Abstract: The Blizzard Challenge offers a unique insight into progress in text-to-speech synthesis over the last decade. By using a very large listening test to compare the performance of a wide range of systems that have been constructed using a common corpus of speech recordings, it is possible to make some direct comparisons between competing techniques. By reviewing over a hundred papers describing all entries to the Challenge since 2005, we can make a useful summary of the most successful techniques adopted by participating teams, as well as drawing some conclusions about where the Blizzard Challenge has succeeded, and where there are still open problems in cross-system comparisons of text-to-speech synthesisers.

97 citations

Proceedings ArticleDOI
12 May 2019
TL;DR: The results reveal that although the proposed systems still do not match the quality of a top-line pipeline system for Japanese, they show important stepping stones towards end-to-end Japanese speech synthesis.
Abstract: End-to-end speech synthesis is a promising approach that directly converts raw text to speech. Although it was shown that Tacotron2 outperforms classical pipeline systems with regard to naturalness in English, its applicability to other languages is still unknown. Japanese could be one of the most difficult languages for which to achieve end-to-end speech synthesis, largely due to its character diversity and pitch accents. Therefore, state-of-the-art systems are still based on a traditional pipeline framework that requires a separate text analyzer and duration model. Towards end-to-end Japanese speech synthesis, we extend Tacotron to systems with self-attention to capture long-term dependencies related to pitch accents and compare their audio quality with that of classical pipeline systems under various conditions to show their pros and cons. In a large-scale listening test, we investigated the impact of the presence of accentual-type labels, the use of forced or predicted alignments, and the acoustic features used as local conditioning parameters of the WaveNet vocoder. Our results reveal that although the proposed systems still do not match the quality of a top-line pipeline system for Japanese, they represent important stepping stones towards end-to-end Japanese speech synthesis.
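The self-attention extension can be illustrated in isolation: a scaled dot-product self-attention layer placed over the encoder states lets every position attend to every other, which is how long-range pitch-accent context can be captured. The randomly initialised projections and the projection size d_k below stand in for learned parameters and are assumptions, not the paper's configuration.

```python
# Sketch only: scaled dot-product self-attention over encoder states.
import numpy as np

def self_attention(encoder_states, d_k=64, rng=None):
    """encoder_states: (T, D) encoder outputs for one utterance.
    Returns (T, d_k) context vectors mixing information across all positions."""
    rng = rng or np.random.default_rng(0)
    T, D = encoder_states.shape
    # Randomly initialised projections stand in for learned parameters.
    Wq, Wk, Wv = (rng.normal(scale=D ** -0.5, size=(D, d_k)) for _ in range(3))
    Q, K, V = encoder_states @ Wq, encoder_states @ Wk, encoder_states @ Wv

    scores = Q @ K.T / np.sqrt(d_k)                  # (T, T) pairwise affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over source positions
    return weights @ V                               # attended features
```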

90 citations

Posted Content
TL;DR: This is the first study to incorporate utterance-level perceptual quality as a loss function into Tacotron training for improved expressiveness, and it marks a departure from the style-token paradigm.
Abstract: We propose a novel training strategy for a Tacotron-based text-to-speech (TTS) system to improve the expressiveness of speech. One of the key challenges in prosody modeling is the lack of a reference, which makes explicit modeling difficult. The proposed technique does not require prosody annotations in the training data. It does not attempt to model prosody explicitly either, but rather encodes the association between input text and its prosody styles using a Tacotron-based TTS framework. Our proposed idea marks a departure from the style-token paradigm, in which prosody is explicitly modeled by a bank of prosody embeddings. The proposed training strategy adopts a combination of two objective functions: 1) a frame-level reconstruction loss, calculated between the synthesized and target spectral features, and 2) an utterance-level style reconstruction loss, calculated between the deep style features of the synthesized and target speech. The proposed style reconstruction loss is formulated as a perceptual loss to ensure that the utterance-level speech style is taken into consideration during training. Experiments show that the proposed training strategy achieves remarkable performance and outperforms a state-of-the-art baseline in both naturalness and expressiveness. To the best of our knowledge, this is the first study to incorporate utterance-level perceptual quality as a loss function into Tacotron training for improved expressiveness.
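A minimal sketch of the two-part objective follows: a frame-level reconstruction loss between predicted and target spectra plus an utterance-level style loss computed on "deep style features" of the two signals. The style extractor here is a deliberately simple stand-in (utterance-level spectral statistics) and the loss weighting is an assumption; the paper derives style features from a deep network instead.

```python
# Sketch only: frame-level reconstruction loss plus utterance-level style loss.
import numpy as np

def frame_reconstruction_loss(pred_spec, target_spec):
    """Mean absolute distance between predicted and target spectral frames (T, D)."""
    return np.mean(np.abs(pred_spec - target_spec))

def style_features(spec):
    # Stand-in for a deep style encoder: utterance-level mean and std of each
    # spectral bin, capturing coarse prosodic/energy statistics.
    return np.concatenate([spec.mean(axis=0), spec.std(axis=0)])

def style_reconstruction_loss(pred_spec, target_spec):
    return np.mean((style_features(pred_spec) - style_features(target_spec)) ** 2)

def total_loss(pred_spec, target_spec, style_weight=1.0):
    # Weighted sum of the two objectives; the weighting is an assumption.
    return (frame_reconstruction_loss(pred_spec, target_spec)
            + style_weight * style_reconstruction_loss(pred_spec, target_spec))
```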

45 citations