Author

Tomoki Toda

Bio: Tomoki Toda is an academic researcher from Nagoya University. The author has contributed to research on speech synthesis and speech processing, has an h-index of 48, and has co-authored 461 publications receiving 10650 citations. Previous affiliations of Tomoki Toda include Nippon Telegraph and Telephone and the National Institute of Information and Communications Technology.


Papers
Journal ArticleDOI
TL;DR: In this article, a Gaussian mixture model (GMM) of the joint probability density of source and target features is employed for performing spectral conversion between speakers, and a conversion method based on the maximum-likelihood estimation of a spectral parameter trajectory is proposed.
Abstract: In this paper, we describe a novel spectral conversion method for voice conversion (VC). A Gaussian mixture model (GMM) of the joint probability density of source and target features is employed for performing spectral conversion between speakers. The conventional method converts spectral parameters frame by frame based on the minimum mean square error. Although it is reasonably effective, the deterioration of speech quality is caused by some problems: 1) appropriate spectral movements are not always caused by the frame-based conversion process, and 2) the converted spectra are excessively smoothed by statistical modeling. In order to address those problems, we propose a conversion method based on the maximum-likelihood estimation of a spectral parameter trajectory. Not only static but also dynamic feature statistics are used for realizing the appropriate converted spectrum sequence. Moreover, the oversmoothing effect is alleviated by considering a global variance feature of the converted spectra. Experimental results indicate that the performance of VC can be dramatically improved by the proposed method in view of both speech quality and conversion accuracy for speaker individuality.

914 citations
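
The conventional frame-by-frame conversion described in the abstract above can be sketched as a minimum mean square error mapping under a joint GMM. This is a minimal NumPy illustration, not the authors' implementation: the function name `gmm_mmse_convert`, the argument layout, and the toy parameters are assumptions made for the example, and the trajectory-based and global-variance extensions proposed in the paper are not covered.

```python
import numpy as np

def gmm_mmse_convert(x, weights, mu_x, mu_y, cov_xx, cov_yx):
    """Frame-wise MMSE mapping E[y | x] under a joint GMM of [x; y].

    x:       (D,) source feature vector for one frame
    weights: (M,) mixture weights
    mu_x:    (M, D) source means
    mu_y:    (M, D) target means
    cov_xx:  (M, D, D) source covariances
    cov_yx:  (M, D, D) target-source cross-covariances
    """
    num_mix = len(weights)
    log_post = np.empty(num_mix)
    cond_means = np.empty(mu_y.shape, dtype=float)
    for m in range(num_mix):
        diff = x - mu_x[m]
        solved = np.linalg.solve(cov_xx[m], diff)  # cov_xx^{-1} (x - mu_x)
        _, logdet = np.linalg.slogdet(cov_xx[m])
        # log(weight * N(x; mu_x, cov_xx)) up to a constant shared by all components
        log_post[m] = np.log(weights[m]) - 0.5 * (logdet + diff @ solved)
        # conditional mean of y given x for component m
        cond_means[m] = mu_y[m] + cov_yx[m] @ solved
    # normalize to posterior probabilities P(m | x) in a numerically stable way
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    return post @ cond_means

# Toy single-component model with hypothetical numbers (1-D features)
converted = gmm_mmse_convert(
    np.array([2.0]),
    weights=np.array([1.0]),
    mu_x=np.array([[0.0]]),
    mu_y=np.array([[1.0]]),
    cov_xx=np.array([[[2.0]]]),
    cov_yx=np.array([[[1.0]]]),
)
print(converted)
```

Because each frame is mapped independently, nothing in this sketch ties consecutive frames together, which is the first problem the abstract attributes to the conventional method.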

Proceedings ArticleDOI
22 Sep 2008
TL;DR: Entry for INTERSPEECH 2008, the 9th Annual Conference of the International Speech Communication Association, held September 22-26, 2008, in Brisbane, Australia.
Abstract: INTERSPEECH2008: 9th Annual Conference of the International Speech Communication Association, September 22-26, 2008, Brisbane, Australia.

796 citations

Journal ArticleDOI
TL;DR: In this article, the authors propose a parameter generation algorithm for HMM-based speech synthesis that considers not only the HMM likelihood but also a likelihood for the global variance (GV) of the generated trajectory, alleviating the over-smoothing that usually causes muffled sounds.
Abstract: This paper describes a novel parameter generation algorithm for an HMM-based speech synthesis technique. The conventional algorithm generates a parameter trajectory of static features that maximizes the likelihood of a given HMM for the parameter sequence consisting of the static and dynamic features under an explicit constraint between those two features. The generated trajectory is often excessively smoothed due to the statistical processing. Using the over-smoothed speech parameters usually causes muffled sounds. In order to alleviate the over-smoothing effect, we propose a generation algorithm considering not only the HMM likelihood maximized in the conventional algorithm but also a likelihood for a global variance (GV) of the generated trajectory. The latter likelihood works as a penalty for the over-smoothing, i.e., a reduction of the GV of the generated trajectory. The result of a perceptual evaluation demonstrates that the proposed algorithm causes considerably large improvements in the naturalness of synthetic speech.

469 citations
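
The conventional generation step described in the abstract above, which finds the static trajectory that maximizes the likelihood under an explicit static-dynamic constraint, has a closed-form solution when the statistics are Gaussian with diagonal covariance. The sketch below is a minimal one-dimensional NumPy illustration under stated assumptions: the delta window 0.5 * (c[t+1] - c[t-1]) with zero padding at the edges, and the names `build_window_matrix`, `generate_trajectory`, and `global_variance`, are choices made for this example rather than details taken from the paper. The GV likelihood term that the paper adds is not implemented; only the GV statistic itself is computed.

```python
import numpy as np

def build_window_matrix(num_frames):
    """Build W so that W @ c = [c_0, dc_0, c_1, dc_1, ...] (static, delta interleaved)."""
    W = np.zeros((2 * num_frames, num_frames))
    for t in range(num_frames):
        W[2 * t, t] = 1.0                   # static feature row
        if t > 0:
            W[2 * t + 1, t - 1] = -0.5      # delta row: 0.5 * (c[t+1] - c[t-1])
        if t < num_frames - 1:
            W[2 * t + 1, t + 1] = 0.5
    return W

def generate_trajectory(means, variances):
    """Maximize N(W c; means, diag(variances)) over the static trajectory c.

    means, variances: length-2T arrays of interleaved static and delta statistics.
    Solves the normal equations (W^T P W) c = W^T P means with P = diag(1 / variances).
    """
    means = np.asarray(means, dtype=float)
    precisions = 1.0 / np.asarray(variances, dtype=float)
    W = build_window_matrix(len(means) // 2)
    A = W.T @ (precisions[:, None] * W)
    b = W.T @ (precisions * means)
    return np.linalg.solve(A, b)

def global_variance(trajectory):
    """Variance of a one-dimensional parameter trajectory over the whole utterance."""
    return np.var(trajectory)

# Statistics that are exactly consistent with a known trajectory are recovered as-is.
true_trajectory = np.array([0.0, 1.0, 2.0, 3.0])
stats_mean = build_window_matrix(4) @ true_trajectory
generated = generate_trajectory(stats_mean, np.ones(8))
print(generated, global_variance(generated))
```

The GV likelihood proposed in the paper would enter as an additional objective term that penalizes a reduction of `global_variance(generated)`; solving that combined objective needs an iterative optimizer and is left out of this sketch.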

Journal ArticleDOI
09 Apr 2013
TL;DR: This paper gives a general overview of hidden Markov model (HMM)-based speech synthesis, which has recently been demonstrated to be very effective in synthesizing speech.
Abstract: This paper gives a general overview of hidden Markov model (HMM)-based speech synthesis, which has recently been demonstrated to be very effective in synthesizing speech. The main advantage of this approach is its flexibility in changing speaker identities, emotions, and speaking styles. This paper also discusses the relation between the HMM-based approach and the more conventional unit-selection approach that has dominated over the last decades. Finally, advanced techniques for future developments are described.

424 citations

Proceedings ArticleDOI
20 Aug 2017
TL;DR: A speaker-dependent WaveNet vocoder is proposed, a method of synthesizing speech waveforms with WaveNet, by utilizing acoustic features from existing vocoder as auxiliary features of WaveNet.
Abstract: In this study, we propose a speaker-dependent WaveNet vocoder, a method of synthesizing speech waveforms with WaveNet, by utilizing acoustic features from existing vocoder as auxiliary features of WaveNet. It is expected that WaveNet can learn a sample-by-sample correspondence between speech waveform and acoustic features. The advantage of the proposed method is that it does not require (1) explicit modeling of excitation signals and (2) various assumptions, which are based on prior knowledge specific to speech. We conducted both subjective and objective evaluation experiments on the CMU ARCTIC database. From the results of the objective evaluation, it was demonstrated that the proposed method could generate high-quality speech with phase information recovered, which was lost by a mel-cepstrum vocoder. From the results of the subjective evaluation, it was demonstrated that the sound quality of the proposed method was significantly improved from mel-cepstrum vocoder, and the proposed method could capture source excitation information more accurately.

308 citations
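
The sample-by-sample correspondence described in the abstract above is commonly written as an autoregressive factorization conditioned on the auxiliary features. The notation here is an assumption made for illustration, not quoted from the paper: x_t is the waveform sample at time t, h denotes the acoustic features extracted by the existing vocoder, and T is the number of samples.

```latex
p(x_1, \dots, x_T \mid \mathbf{h}) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1}, \mathbf{h})
```

Under this reading, the network models each factor directly from waveform history and acoustic features, which is why no explicit excitation model is needed.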


Cited by
01 Jan 2009

7,241 citations

Posted Content
TL;DR: This paper proposes WaveNet, a deep neural network for generating raw audio waveforms that is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones.
Abstract: This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.

4,002 citations
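
The autoregressive generation described in the abstract above can be illustrated with a generic sampling loop. This is only a sketch under stated assumptions: samples are taken to be quantized to a small set of integer levels, and `toy_predict` is a hypothetical stand-in for the neural network that simply puts all probability mass on the next level. Neither function name comes from the paper.

```python
import numpy as np

def sample_autoregressive(predict, num_samples, num_levels, seed=0):
    """Draw samples one at a time, each conditioned on all previous ones.

    predict: hypothetical callable mapping the history (1-D int array) to a
             probability vector of length num_levels; in WaveNet this role is
             played by the trained network.
    """
    rng = np.random.default_rng(seed)
    history = np.empty(0, dtype=int)
    for _ in range(num_samples):
        probs = predict(history)                        # p(x_t | x_1, ..., x_{t-1})
        next_sample = rng.choice(num_levels, p=probs)   # draw x_t from that distribution
        history = np.append(history, next_sample)
    return history

def toy_predict(history, num_levels=4):
    """Stand-in model: all mass on (last sample + 1) mod num_levels."""
    probs = np.zeros(num_levels)
    last = history[-1] if len(history) else -1
    probs[(last + 1) % num_levels] = 1.0
    return probs

samples = sample_autoregressive(toy_predict, num_samples=6, num_levels=4)
print(samples)
```

Generation is inherently sequential in this form because each step needs the previously drawn samples, whereas training can score all positions of a recorded waveform at once since the full history is already known.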

12 Sep 2016
TL;DR: WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
Abstract: This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.

3,248 citations